MariaDB/troubleshooting

[Image: Replication problems.jpg] DBAs are working on it

This guide is a work in progress. See also MariaDB/monitoring and MariaDB/Backups#Recovering_a_Snapshot

Depooling a replica

All dbctl write commands must be run with sudo.

From one of the cluster management hosts (cumin1001.eqiad.wmnet, cumin2002.codfw.wmnet):

dbctl instance dbXXXX depool
dbctl config commit -m "Depool dbXXXX"

More example commands at: https://wikitech.wikimedia.org/wiki/Dbctl#Usage

Create a task with the DBA tag so DBAs can follow up, check what happened, apply a proper fix, etc.

  • Monitor that mysql connections to that host slowly drop as the queries finish. To do that, connect to the host, run SHOW PROCESSLIST; in mysql, and check that there are no wikiuser or wikiadmin connections. You can also monitor connections with regular linux tools like netstat/ss on port 3306 (or the right mysql port). Monitoring tools regularly check the host, but they use separate users.

Example:

MariaDB PRODUCTION x1 localhost (none) > SHOW PROCESSLIST;
+---------+-----------------+-------------------+--------------------+---------+-
| Id      | User            | Host              | db                 | Command | 
+---------+-----------------+-------------------+--------------------+---------+-
# internal process, ignore
|       2 | event_scheduler | localhost         | NULL               | Daemon  | 
# replication users, ignore
| 3192579 | system user     |                   | NULL               | Connect | 
| 3192580 | system user     |                   | NULL               | Connect | 
# monitoring users, ignore
| 6284249 | watchdog        | 10.XX.XX.XX:34525 | information_schema | Sleep   | 
| 6284250 | watchdog        | 10.XX.XX.XX:34716 | information_schema | Sleep   | 
| 6284253 | watchdog        | 10.XX.XX.XX:34890 | mysql              | Sleep   | 
# this is your own connection
| 6311084 | root            | localhost         | NULL               | Query   | 
+---------+-----------------+-------------------+--------------------+---------+-

(no wikiuser or wikiadmin processes, ok to do maintenance, kill the machine, etc.)
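
As a complement to SHOW PROCESSLIST, you can also count established client connections at the TCP level. A minimal sketch, assuming the instance listens on the default port 3306 (multi-instance hosts use other ports):

 # count established connections to the mysql port; this should drop towards
 # just the monitoring connections as traffic drains
 sudo ss -tn state established '( sport = :3306 )' | wc -l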

  • Except for the dump replica (while the dumps are being created) and some specific maintenance or long-running tasks, connections should go away in seconds or a very few minutes. If there is an emergency, killing the process (KILL <#ID>) is the way to go. SELECTs are ok to kill; writes and ALTERs can create worse issues due to the rollback process kicking in, so be sure of what you kill. Sadly, there are times when idle connections stay connected for a long time.

x2 special topology

If the failed replica is part of x2, there is no need to depool it as they aren't used.

See MariaDB#x2 for more details.

Master comes back in read only

A master coming back in READ ONLY mode is expected after a crash and it is done to prevent accidental corruption (or even more corruption on a crash).

Unless you know what you are doing, do not set it back to writable: if unsure, call a DBA

Impact

Reads will remain unaffected, but no writes will be able to go through; the wikis on that master will be in read-only mode.

To find out which wikis those are, ssh to the host and run:

 mysql -e "show databases"
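
To confirm whether the host is actually rejecting writes, a quick check (sketch; run it on the affected master):

 # returns 1 if the server is in read-only mode
 sudo mysql -e "SELECT @@GLOBAL.read_only"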

What to do

If unsure, call a DBA

If you know what you are doing:

  • Check the reason for the crash:
journalctl -xe -umariadb -f
dmesg
/var/log/messages
  • Check the state of the data:
    • Check errors above
    • Check if tables are marked as corrupted or you see InnoDB crashes
    • Select from the main tables (revision, text, users, watchlist, actor, comment) and see if you can get results
    • If storage is the cause, you most likely want to fail over to a different host: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Emergency_failover
    • If a memory DIMM is the cause, you most likely want to (see the sketch after this list):
      • Disable puppet
      • Reduce innodb buffer pool size on my.cnf
      • Restart mysql
      • Check data from a few tables
      • Check that all replicas are in sync
      • set global read_only=OFF;
      • Create a task to follow up
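
A minimal sketch of that sequence with illustrative values (config path, service unit name and buffer pool size differ per host; on multi-instance hosts the unit is per section):

 sudo puppet agent --disable "DIMM failure - TXXXXXX"
 # reduce innodb_buffer_pool_size in the instance's my.cnf so it fits in the
 # remaining memory (e.g. halve it), then restart the service
 sudo systemctl restart mariadb
 # sanity-check data and replication before re-enabling writes
 sudo mysql -e "SELECT * FROM enwiki.revision LIMIT 1"       # example table, any db in the section works
 sudo mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind   # on the replicas
 sudo mysql -e "SET GLOBAL read_only=OFF"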

If this happens during planned maintenance and pages, it could be due to an expired downtime or because the host was never downtimed. If that is the case, contact whoever is doing the maintenance; if it is you, remember to run:

set global read_only=OFF;

Depooling a master (a.k.a. promoting a new replica to master)

See: Switch master (most of it still relevant).

Planned switchover

See an example task and checklist at: https://phabricator.wikimedia.org/T313383

From the operations/software/ repo, run the following script to get a proper checklist of all the steps needed:

python3 dbtools/switchover-tmpl.py --dc eqiad --old $OLDMASTERHOSTNAME --ticket TXXXXX $SECTION $NEWMASTERHOSTNAME
  • $SECTION and $NEWMASTERHOSTNAME are required; --dc and --old are looked up automatically if omitted.

Follow those steps

The key part is to first change the topology so that all the hosts replicate from the new master. Assuming the old master is db1181 and the new master is db1136:

sudo db-switchover --timeout=25 --only-slave-move db1181 db1136

Once that is finished, we are ready to get the section into read-only and then change the master:

To interact with mediawiki dependencies we use dbctl:

sudo dbctl --scope eqiad section sX ro "Maintenance until 06:15 UTC - TXXXX"
sudo dbctl config commit -m "Set sX eqiad as read-only for maintenance - TXXXX"

Now we can perform the switch:

sudo db-switchover --skip-slave-move db1181 db1136

Assuming the above went fine, we can set the section back to read-write and promote the new master:

sudo dbctl --scope eqiad section sX set-master db1136
sudo dbctl --scope eqiad section sX rw
sudo dbctl config commit -m "Promote db1136 to sX primary and set section read-write TXXXX"

The rest of the steps provided by switchover-tmpl.py after the switchover need to be followed in order to ensure that the post-switchover tasks are completed successfully.

Please also see the main dbctl article

Emergency failover

If the master is not available, or replication is broken, this is a more complex case. The reason is that replicas will have executed a different number of transactions and will be in a close, but different, state. E.g. replica1 has executed transactions A, while replica2 has executed transactions A, B and C. In addition to that, if we do not have access to the master's binary log (or it has not been properly synchronized to disk after a crash), we will have to recover from a replica. In theory, with semi-sync replication, no transaction will be lost, and at least one replica will have the change, but all other replicas will be on different coordinates (and binary log positions are only local to the master).

Scenario 1 - the master is recoverable: just wait until the master restarts; this avoids headaches and is faster and less disruptive than trying to fail it over.

Scenario 2 - the master is not recoverable, but its binary log is (and all replicas have less than or equal data):

  1. For each replica: apply the master's binary log starting from that replica's last Exec_Master_Log_Pos, so all replicas reach the same starting state
  2. Follow regular failover steps as mentioned in the scheduled maintenance

Scenario 3 - neither the master nor its binary logs are recoverable (or a master binary log is behind a replica's binary log): we need to put all servers in the same state, using the most up-to-date replica, then perform the regular failover process. This is the most complicated part without using GTIDs:

  1. Identify the most up-to-date replica by comparing Exec_Master_Log_Pos
  2. By comparing binary log contents, find the coordinates in the most up-to-date replica's binlog that correspond to each other replica's position. This is the tricky part. pt-heartbeat should be able to help find this.
  3. Execute the pending transactions on each replica
  4. Follow the regular steps for regular scheduled maintenance

Again, these steps can be automated.
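
A minimal sketch of the comparison and replay steps for scenarios 2 and 3 (host names, binlog file names and positions are illustrative):

 # 1. On each replica, note how far it got in the master's binlog
 sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Relay_Master_Log_File|Exec_Master_Log_Pos'

 # 2. Replay the missing range on a lagging replica, taking it from the master's
 #    binlog (scenario 2) or from the most up-to-date replica's binlog (scenario 3)
 mysqlbinlog --start-position=123456 --stop-position=234567 db1136-bin.002345 | sudo mysql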

Replication lag

See also MySQL#Replication lag for additional tips.

Caused by hardware

[Image: Hwraid.png] This is what a half-failing disk looks like in monitoring (small lag until it becomes critical).

One common cause of lag that is easy to check and repair is hardware issues.

Disks about to fail

Databases have a lot (and I mean a lot) of IO pressure, and while it is not insane, it means that 3-year old drives are very prone to fail.

As an operator, you are already familiar with the way drives fail (not very reliably, to be honest). All important databases have a hardware RAID, which means 1 disk can fail at a time, usually with very little impact. When that happens, the icinga alert "1 failed LD(s) (Degraded)" should tell you it is time to replace at least one disk. Usually there are spares onsite or the servers are under warranty, which means you can create a ticket for ops-eqiad or ops-codfw and let Chris or Papaul know so they can take it out and insert a new one; the hardware RAID should automatically rebuild itself.

To check the RAID status, execute:

 megacli -AdpAllInfo -aALL

And check the section "Devices present"

To identify the particular disk

 megacli -PDList -aALL

Check in particular for the Firmware State (on or off), the S.M.A.R.T alerts, and the number of medium errors (a few, like a dozen, should not affect much performance, but when there are hundreds of errors in a short timespan, that is an issue).

Sadly, disks fail in very creative ways, and while our RAID controllers are reliable enough to 1) continue despite medium errors and 2) disable the disk when it fails completely, a disk in an "almost failing" state can cause lag issues. If that is the case, executing:

 megacli -PDOffline -PhysDrv \[#:#\] -aALL

where #:# is enclosure:slot, will take the particular physical drive offline so that it can be replaced later.

Bad or defective BBU

If all the disks look good, it can be that the RAID controller went into WriteThrough mode because of a failed BBU or because it is in a learning cycle (which shouldn't happen, because learning cycles are disabled in our environment). If the Cache Policy is set to WriteThrough, it will dramatically affect performance. To check the Current Policy (that is, the active one):

megacli -LDInfo -LAll -aAll | grep "Cache Policy:"

If it is not in WriteBack mode, it means (most likely) that the BBU has failed for some reason and the default is to switch back to WriteThrough as it is safer. You can check the BBU status with:

megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity' -e 'Charging'

If you are in an emergency, you can always force WriteBack, but this can lead to data loss if there is a power failure, so use it carefully

megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll

Overload due to long running queries

Main article: Query killer
Main article: db-kill

If a host or several of them are complaining due to the max_connection limit, as seen in the processlist and in max_connection errors on the mysql graphs (https://grafana.wikimedia.org/d/000000273/mysql, https://grafana.wikimedia.org/d/000000278/mysql-aggregated) or on application server logs (sometimes also causing "Not enough idle PHP-FPM workers for Mediawiki" as a cascade error), it is possible that one or several queries are taking over all mysql resources, snowballing into blocking others from executing. While the mysql server will likely stay up, as it uses a queue to avoid executing many of them at the same time, the queueing adds extra latency to execution, affecting application server resources too.

This can be caused by a deliberate abuse of expensive functionality or API call, by a bug on application code sending poorly-optimized queries, or a change on the query planner that makes old fast queries now slower.

The best way to fix the issue is to disable, on mediawiki, the functionality that is sending the long-running queries.

MariaDB mediawiki instances have a query killer enabled that kicks in when it detects idle queries or queries executing for longer than 60 seconds (agreed as the maximum query time allowed by web requests). If a connection overload is detected (by monitoring current connections), an overload killer also kicks in.

For a faster recovery, or to gain time until a longer term fix is deployed, roots can use the db-kill script from cumin hosts for the affected databases, to kill long running queries. Please note that this can alleviate the symptoms, but won't fix the root cause.
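
If you need to inspect the offenders manually before (or instead of) running db-kill, here is a sketch of listing long-running application queries on the affected host (the 60-second threshold and the wiki% user pattern follow the conventions above):

 sudo mysql -e "SELECT id, user, time, LEFT(info, 120) AS query
                FROM information_schema.processlist
                WHERE command = 'Query' AND time > 60 AND user LIKE 'wiki%'
                ORDER BY time DESC"
 # then, for a specific offender (prefer killing SELECTs):
 sudo mysql -e "KILL 1234567"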

Memory issues

There is an alert, "MariaDB memory", that checks that there is no memory over-commitment due to a misconfiguration, heavy activity or a leak. For now the alert is just a check that there is enough free memory (not counting filesystem cache) for client connections, utilities and background tasks. If a host starts alerting (or starts OOMing), reduce the buffer pool or the load to get back to 80-90% memory usage.

While MariaDB can reduce the buffer pool in a hot way, just by typing SET GLOBAL innodb_buffer_pool_size = <new size in bytes>;, I don't recommend doing that on a pooled database because it will create stalls/replication lag while the buffer pool is being resized. It is not dangerous, but it is a heavy operation to try to avoid while pooled.
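
A quick way to compare the configured buffer pool with the host's memory before deciding how much to reduce it (sketch; on multi-instance hosts check every instance):

 sudo mysql -e "SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb"
 free -g   # compare against total and available memory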

Data loss

Normal reprovisioning

The best (and fastest) way to repair a replica (or a master) is to use the regular provisioning workflow: copy the latest snapshot from the dbprovXXXX hosts. Depending on the section, it should take from minutes to 1 hour.

If only a partial recovery is needed (single dropped table), logical backups (on the same dbprov* hosts) may be faster and more flexible. A full logical recovery can take from 12 hours to 1 day.

More info at MariaDB/Backups.

Long term backups

If for some reason the short term backups/provisioning files are not enough, you can get those from bacula. Recover from bacula to dbprovXXXX then use the same method as above.

More info at Bacula.

Cloning

If for some reason no backups are available, but replicas are, we can clone a running mariadb server with xtrabackup or the files of a stopped one into another host (see transfer.py utility for both file and xtrabackup transfers).

Binlogs

Binlogs are not a backup method, but they are files containing the transactions of the last month, kept on every master and replica. They are helpful for point-in-time recovery if replication is not working, allowing a backup to be rolled forward to an arbitrary point in time (e.g. just before a DROP was sent).

Point in time recovery at the moment is fully manual but its automation is a work in progress.
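
A minimal sketch of rolling a restored copy forward with the binlogs, stopping just before the offending statement (binlog file name, positions and timestamps are illustrative):

 # find the offending event (e.g. the DROP) in the binlog
 mysqlbinlog -vv --base64-output=DECODE-ROWS db1181-bin.002345 | grep -n -i -m1 'DROP TABLE'

 # replay everything from the backup's coordinates up to just before that event
 mysqlbinlog --start-position=4 --stop-datetime='2022-08-26 11:55:00' db1181-bin.002345 | sudo mysql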

Data inconsistency between nodes "drift" / replication broken

The compare.py utility allows manually checking the differences between 2 hosts. Right now it is run manually, but it is scheduled to be run constantly, comparing hosts for inconsistencies.

Aside from a manual check, the other most common way to find inconsistencies is for replication to break as a change is applied to a master that cannot be applied to the replica.

Steps to perform:

  • If an inconsistency happens on one replica, it is likely that the host's data got corrupted: depool it and research why it happened. If the issue was caused by the replica alone, wipe its data and recover from the provisioning hosts.
  • If it happens on all replicas, check whether there is master corruption or some other operational error. If it is an operational error (e.g. a table exists on the master and not on the replicas), it could be corrected manually (e.g. create the table by hand). Otherwise, failover the master so it can be depooled, and continue with a different host as the new master.
  • In all cases, it should be clear which host or set of hosts have the right data; the "bad host(s)" should be put out of production, their data deleted and reprovisioned from backups.

NEVER use sql_slave_skip_counter! Not only will you skip full transactions (even if only a single row was problematic, creating more drift issues), you will also make hosts have a different GTID counter. If you have to manually change something on only 1 host, apply the DML with set sql_log_bin=0 so it doesn't go to the binlog/GTID counter/replication.
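
A sketch of such a single-host fix (database, table and values are illustrative):

 sudo mysql enwiki -e "
   SET SESSION sql_log_bin = 0;  -- keep this fix out of the binlog / GTID stream
   INSERT INTO some_table (id, value) VALUES (42, 'missing row');  -- illustrative DML
 "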

Split brain

There is an additional case, which is if a "split brain" has happened and correct data was written to both master and replicas.

This is rare, given that all replicas are set to read only to prevent this issue. It is also difficult to handle; ideally, data should be merged into a single unified version.

If the affected data is derived (non-canonical), e.g. the *links tables, you could temporarily choose a single set of servers, go back to read/write, and then try to merge the differences in the background.

If canonical data is affected (page, revision, user), consider setting the application/section to read only until the data is reconciled, so new versions are not added that could make merging the two data versions more complicated.

Depooling a Labs dbproxy

The first thing is to depool it from the Wiki Replicas DNS. Once that is done, and if you are depooling dbproxy1010, all the traffic will go to dbproxy1011, which only has one server active. The other one is a backup host, as can be seen in the hiera file that lives in our puppet repo:

cat hieradata/hosts/dbproxy1011.yaml
profile::mariadb::proxy::master::primary_name: 'labsdb1009'
profile::mariadb::proxy::master::primary_addr: '10.64.4.14:3306'
profile::mariadb::proxy::master::secondary_name: 'labsdb1010'
profile::mariadb::proxy::master::secondary_addr: '10.64.37.23:3306'

That means when dbproxy1010 is depooled, all its traffic will go to labsdb1009. So it is advised to temporarily change the haproxy configuration to make labsdb1010 also active (round-robin DNS). To do so:

ssh dbproxy1011
puppet agent --disable "Changing haproxy temporarily"
vim /etc/haproxy/conf.d/db-master.cfg

Replace the line:

server labsdb1010 10.64.37.23:3306 check backup

With:

server labsdb1010 10.64.37.23:3306 check inter 3s fall 20 rise 99999999

Reload HAProxy

systemctl reload haproxy 
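
To verify that both backends are active after the reload, you can query the HAProxy stats socket; a sketch, assuming the socket path (check the haproxy configuration for the actual one):

 # prints proxy name, server name and status (UP/DOWN) for each backend
 echo "show stat" | sudo socat stdio /run/haproxy/haproxy.sock | cut -d, -f1,2,18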

Using mysqlbinlog

Note: Jaime recommends using this alias: alias mysqlbinlog='mysqlbinlog -vv --base64-output=DECODE-ROWS --skip-ssl'

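With that alias (or those flags), you can inspect the decoded row events around a given time window; a sketch with illustrative binlog file name and timestamps:

 mysqlbinlog -vv --base64-output=DECODE-ROWS --skip-ssl \
     --start-datetime='2022-08-26 11:50:00' --stop-datetime='2022-08-26 12:00:00' \
     db1181-bin.002345 | less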
This page is part of the SRE Data Persistence technical documentation.