MariaDB/troubleshooting
This guide is a work in progress.
Depooling a slave
Comment out all the instances of its server name in db-eqiad.php or db-codfw.php (it will likely appear in more than one load group).
- Hopefully, you will have the operations/mediawiki-config repository already configured
- Edit the file wmf-config/db-eqiad.php (or wmf-config/db-codfw.php, if it is a server in Dallas) and comment out the line with its normal weight in the 'sectionLoads' array, like this:
's8' => array(
    'db1045' => 0, # 1.4TB 64GB # depooled because reasons
    'db2094' => 100,
    'db2095' => 100,
    'db2096' => 100,
    'db2097' => 100,
),
- Do not just set the weight to 0; that will not fully depool the server
- If the server has a particular role (normally commented on the same line), like vslow, dump, recentchanges, etc., you must take away those roles and assign them to another node in the 'groupLoadsBySection' array:
's8' => array(
    'watchlist' => array(
        # 'db2094' => 1,
        'db2055' => 1,
    ),
    'recentchanges' => array(
        # 'db2094' => 1,
        'db2055' => 1,
    ),
    'recentchangeslinked' => array(
        # 'db2094' => 1,
        'db2055' => 1,
    ),
- Most roles can be taken by any slave (assuming they have enough resources and are not overloaded), although in some cases there are only a couple of "special slaves" optimized for a particular role.
- Commit the change and deploy it on tin:
/srv/mediawiki-staging $ sync-file wmf-config/db-eqiad.php "Depooled because reasons"
- Monitor that mysql connections to that host slowly drop as the queries finish. For that, connect to that host, run mysql's SHOW PROCESSLIST; and check that there are no wikiuser or wikiadmin connections. You can also monitor connections with regular linux tools like netstat/ss on port 3306 (or the right mysql port). Monitoring tools regularly check the host, but they use separate users.
Example:
MariaDB PRODUCTION x1 localhost (none) > SHOW PROCESSLIST;
+---------+-----------------+-------------------+--------------------+---------+-
| Id      | User            | Host              | db                 | Command |
+---------+-----------------+-------------------+--------------------+---------+-
# internal process, ignore
|       2 | event_scheduler | localhost         | NULL               | Daemon  |
# replication users, ignore
| 3192579 | system user     |                   | NULL               | Connect |
| 3192580 | system user     |                   | NULL               | Connect |
# monitoring users, ignore
| 6284249 | watchdog        | 10.XX.XX.XX:34525 | information_schema | Sleep   |
| 6284250 | watchdog        | 10.XX.XX.XX:34716 | information_schema | Sleep   |
| 6284253 | watchdog        | 10.XX.XX.XX:34890 | mysql              | Sleep   |
# this is your own connection
| 6311084 | root            | localhost         | NULL               | Query   |
+---------+-----------------+-------------------+--------------------+---------+-
(no wikiuser or wikiadmin processes, ok to do maintenance, kill the machine, etc.)
- Except for the dump slave (while it is creating the dumps) or some specific maintenance or long-running tasks, connections should go away in seconds or very few minutes. If there is an emergency, killing the process (KILL <#ID>) is the way to go. Selects are ok to kill; writes and alters can create worse issues due to the rollback process kicking in, so be sure of what you kill. Sadly, there are times when idle connections stay connected for a long time. A quick way to check for leftover connections is sketched below.
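A minimal check sketch, run on the depooled host itself; it assumes the default 3306 port and local root access over the socket (adjust the port on multi-instance hosts):

# count remaining MediaWiki connections by user
mysql -e "SELECT user, COUNT(*) AS connections FROM information_schema.processlist WHERE user IN ('wikiuser', 'wikiadmin') GROUP BY user;"
# count established TCP connections on the mysql port
ss -tn state established '( sport = :3306 )' | tail -n +2 | wc -l

If both return nothing (or zero), the host can be taken down for maintenance.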
Depooling a master (a.k.a. promoting a new slave to master)
See: Switch master (most of it still relevant).
Planned switchover
There is a script, switchover.py (https://phabricator.wikimedia.org/diffusion/OSMD/browse/master/wmfmariadbpy/switchover.py), to be run from a Wikimedia mysql root client (neodymium or sarin at the moment), which will automate the most complex steps. However, due to mediawiki dependencies, at the moment we still need to perform some extra steps:
- Set mediawiki in read only for that master (if possible) or migrate the service away. Normally that means uncommenting a line in db-eqiad.php or db-codfw.php:
'readOnlyBySection' => [
    's1' => 'English Wikipedia in read only because reasons.',
... pointing the parsercache to another host:
$wmgParserCacheDBs = [
    '10.64.0.12'  => '10.64.32.72', # pc1004, A3 2.4TB 256GB, temporarily failed over to pc1005
    '10.64.32.72' => '10.64.32.72', # pc1005, C7 2.4TB 256GB
... or depooling it:
$wgDefaultExternalStore = [
    'DB://cluster24',
    # 'DB://cluster25',
];
Once that is deployed, execute switchover.py, with the original master and the target one as parameters:
./switchover.py db1052 db1067
This is an example of a successful output:
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----
6313 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es3 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /tmp/mysql.sock --daemonize --pid /var/run/pt-heartbeat.pid
================
PASS: |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 4.23hosts/s]
FAIL: |                                                                                |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Stopping heartbeat pid 6313 at es1014.eqiad.wmnet:3306/(none)
----- OUTPUT of '/bin/kill 6313' -----
================
PASS: |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 4.46hosts/s]
FAIL: |                                                                                |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/kill 6313'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Setting up original master as read-only
Slave caught up to the master after waiting 0.010378122329711914 seconds
Servers sync at master: es1014-bin.002508:184384418 slave: es1017-bin.002491:41215873
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section es3 at es1017.eqiad.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----
================
PASS: |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 3.29hosts/s]
FAIL: |                                                                                |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----
12107 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es3 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================
PASS: |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 3.22hosts/s]
FAIL: |                                                                                |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected heartbeat at es1017.eqiad.wmnet running with PID 12107
Verifying everything went as expected...
SUCCESS: Master switch completed successfully
This will move the replicas below the other host and perform the replication changes to migrate the service, while maintaining data consistency. Note that at the time of writing, the replica migration has issues, so you should do it manually with repl.pl (which stops and restarts replication and only works for direct, working replicas). Thus, you should execute switchover skipping that step:
./switchover.py --skip-slave-move db1052 db1067
Finally, set the service back to read-write / update the master configuration by deploying mediawiki.
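As a last sanity check, you can verify the read_only flags on both hosts. A minimal sketch, assuming a root mysql client can reach both servers (hostnames are the ones from the example above):

for h in db1052 db1067; do
    echo -n "$h: "
    mysql -h "$h.eqiad.wmnet" -BN -e "SELECT @@hostname, @@read_only;"
done

The old master should now report read_only=1 and the new one read_only=0, matching the script output.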
A full list of manual steps can be found at: MariaDB#Production_section_failover_checklist
Emergency failover
If the master is not available, or replication is broken, this is a more complex case. The reason is that the slaves will have executed a different number of transactions and will be in a close, but different, state. E.g. slave1 has executed transactions A, while slave2 has executed transactions A, B and C. In addition to that, if we do not have access to the master's binary log (or it has not properly been synchronized to disk after a crash), we will have to recover from a slave. In theory, with semi-sync replication, no transaction will be lost, and at least one slave will have the change, but all other slaves will be on different coordinates (and binary log positions are only local to the master).
Scenario 1 - master is recoverable: just wait until the master restarts; that will avoid headaches and be faster and less disruptive than trying to fail it over.
Scenario 2 - master is not recoverable, but its binary log is (and all slaves have an equal or smaller amount of data):
- For each slave, apply the master's binary log starting from its last Exec_Master_Log_Pos, so all slaves end up in the same state
- Follow the regular failover steps, as mentioned in the scheduled maintenance section
Scenario 3 - neither the master nor its binary logs are recoverable (or a master binary log is behind a slave's binary log): we need to put all servers in the same state, using the most up-to-date slave, then perform the regular failover process. This is the most complicated part without using GTIDs:
- Identify the most up-to-date slave by comparing Exec_Master_Log_Pos (see the sketch after this list)
- By comparing binary log contents, try to find the coordinate in the most up-to-date slave's binlog that corresponds to each of the other slaves' positions. This is the tricky part. pt-heartbeat should be able to help find this.
- Execute the pending transactions on each slave
- Follow the regular steps for a scheduled maintenance failover
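A minimal sketch of the first step, assuming the replicas can be reached with a root mysql client (hostnames are illustrative):

# print each replica's last executed master coordinates; within the same binlog
# file, the highest position is the most up-to-date replica
for h in db1095 db1096 db1097; do
    echo -n "$h: "
    mysql -h "$h.eqiad.wmnet" -e "SHOW SLAVE STATUS\G" | awk '/Relay_Master_Log_File|Exec_Master_Log_Pos/ {printf "%s ", $2}'
    echo
done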
Again, these steps can be automated.
Replication lag
See also MySQL#Replication lag for additional tips.
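For a quick first check of how far behind a replica is, a minimal sketch run on the replica itself (it assumes the default heartbeat database and table written by the pt-heartbeat process shown elsewhere on this page):

# approximate lag and SQL thread status as reported by the replication thread
mysql -e "SHOW SLAVE STATUS\G" | grep -e Seconds_Behind_Master -e Slave_SQL_Running
# last heartbeat timestamp received from the master; compare it with the current time
mysql -e "SELECT ts FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1;"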
Caused by hardware
One common cause of lag that is easy to check and repair is a hardware issue.
Disks about to fail
Databases have a lot (and I mean a lot) of IO pressure, and while it is not insane, it means that 3-year old drives are very prone to fail.
As an operator, you are already familiar with the way drives fail (not very reliably, to be honest). All important databases have a hardware RAID, which means 1 disk can fail at a time, usually with very little impact. When that happens, the icinga alert "1 failed LD(s) (Degraded)" should tell you it is time to replace at least one disk. Usually there are spares onsite or the servers are under warranty, which means you can create a ticket to ops-eqiad or ops-codfw and let Chris or Papaul know that they should take the failed disk out and insert a new one; the hw RAID should automatically rebuild itself.
To check the RAID status, execute:
megacli -AdpAllInfo -aALL
And check the section "Devices present"
To identify the particular disk:
megacli -PDList -aALL
Check in particular the Firmware State (on or off), the S.M.A.R.T. alerts, and the number of medium errors (a few, like a dozen, should not affect performance much, but hundreds of errors in a short timespan are an issue).
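A hedged one-liner to summarize just those fields from the megacli output (field names as usually printed by MegaCli; adjust the patterns if your controller's output differs):

megacli -PDList -aALL | grep -e 'Enclosure Device ID' -e 'Slot Number' -e 'Firmware state' -e 'Media Error Count' -e 'Predictive Failure Count'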
Sadly, disks fail in very creative ways, and while our RAID controllers are reliable enough to 1) continue despite medium errors and 2) disable the disk when it fails completely, in a state of "almost failing" there could be lag issues. If that is the case, executing:
megacli -PDOffline -PhysDrv \[#:#\] -aALL
where #:# is enclosure:slot, will take the particular physical drive offline so that it can be replaced later.
Bad or defective BBU
If all the disks are looking good, it can be that the RAID controller went into WriteThrough mode because of a failed BBU or because it is in a learning cycle (which shouldn't happen, because it is disabled in our environment). If the Cache Policy is set to WriteThrough it will dramatically affect performance. In order to check the Current Policy (that is, the active one):
megacli -LDInfo -LAll -aAll | grep "Cache Policy:"
If it is not in WriteBack mode, it means (most likely) that the BBU has failed for some reason and the default is to switch back to WriteThrough as it is safer. You can check the BBU status with:
megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity' -e 'Charging'
If you are in an emergency, you can always force WriteBack, but this can lead to data loss if there is a power failure, so use it carefully:
megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll
Data loss
Cloning a slave
The best (and fastest) way to repair a slave (or a master) is to clone from another slave. Either stop a slave (after depooling it) and just send the files away or use xtrabackup to clone it. The rest of the techniques assume that the data loss is fleet-wide (a logical/application/DBA error, rather than a physical problem).
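A minimal sketch of the two options, assuming root ssh access between the hosts; hostnames and paths are illustrative and any real transfer should follow the current provisioning procedure:

# option 1: cold copy - stop mariadb on the depooled source replica, then copy the datadir
systemctl stop mariadb
rsync -a /srv/sqldata/ dbXXXX.eqiad.wmnet:/srv/sqldata/
# option 2: hot copy - stream an xtrabackup to the target host, then prepare it there
xtrabackup --backup --stream=xbstream --target-dir=/tmp | ssh dbXXXX.eqiad.wmnet 'xbstream -x -C /srv/sqldata.new'
ssh dbXXXX.eqiad.wmnet 'xtrabackup --prepare --target-dir=/srv/sqldata.new'

In both cases, check file ownership, start mariadb on the target and reconfigure replication before repooling.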
Delayed slave
There are 2 delayed slaves with the main production data, 1 day behind: dbstore1001 and dbstore2001. If a data loss is detected within 24 hours, it is important to stop those slaves so they do not execute the offending queries. It is not easy to stop them, as they are programmed to be always 1 day behind. To stop replication on those, disable event scheduling:
SET GLOBAL event_scheduler = 0;
Then STOP ALL SLAVES; or STOP SLAVE 's1'; to stop only connection s1 (enwiki replication), for example.
Then you can use `mysqldump` to export and import certain rows or tables:
mysqldump -h dbstore1001.eqiad.wmnet --single-transaction testwiki revision --where="rev_id BETWEEN 1 AND 1000" | mysql -h db1002.eqiad.wmnet testwiki
Then you can roll forward changes, if necessary, starting from the current slave position, using mysqlbinlog, filtering by db or table:
mysqlbinlog --start-position=<relay log position> --stop-position/--stop-datetime=<stopping relay coordinate or date time> <relay logs> > events.sql && (check the events are ok) && mysql -h <host> <db> < events.sql
or just use START SLAVE UNTIL with the desired coordinates.
Please note that tables on the dbstores are most of the time in TokuDB format (they are being converted to InnoDB, so a binary copy is also possible). Make sure to use InnoDB on production if a table has to be recreated.
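For example, a hedged sketch of converting a table copied from a dbstore host back to the production engine (table and host names are illustrative):

mysql -h db1002.eqiad.wmnet testwiki -e "ALTER TABLE revision ENGINE=InnoDB;"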
Backups
There are weekly logical backups in /srv/backups/latest on es2001 (for 2 weeks) and in bacula (for 30 days, aiming for 90 with a future expansion). They are very slow to recover (it can take 7-10 hours to recover a full shard), so think twice before recovering fully from them; use a binary backup instead, or use them for partial recoveries. See MariaDB/Backups for more details on how they are created and how to recover them.