MariaDB/troubleshooting
This guide is a work in progress.
Depooling a slave
Quick depooling, for emergencies only
Connect to the offending server. Log in to mysql (sudo mysql --defaults-file=/root/.my.cnf), and execute the following command:
STOP SLAVE;
If you stop replication (assuming replication is working normally for the other hosts), MediaWiki will depool the slave automatically after X seconds. This is not the preferred method, for several reasons:
- The depool is not immediate: during that time, the server keeps receiving queries, which may be undesirable in many cases
- The depool is not total: the server may still receive queries if all servers are lagged
The only advantage is that it does not require a deploy to depool the server. Use the regular depooling method in most cases.
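For reference, a minimal command-line sketch of this emergency procedure (assuming root access to the host and the standard /root/.my.cnf credentials) could look like this:

# Stop replication on the offending slave (emergency depool)
sudo mysql --defaults-file=/root/.my.cnf -e "STOP SLAVE;"

# Confirm both replication threads are stopped (both should report "No")
sudo mysql --defaults-file=/root/.my.cnf -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running:'

# Once the issue is over, resume replication so the host can be repooled
sudo mysql --defaults-file=/root/.my.cnf -e "START SLAVE;"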
Normal depooling
Comment out its server name from db-eqiad.php or db-codfw.php.
- Hopefully, you will have the operations/mediawiki-config repository already configured
- Edit the file wmf-config/db-eqiad.php (or wmf-config/db-codfw.php, if it is a server in Dallas) and comment out the line with its normal weight in the 'sectionLoads' array, like this:
's8' => array(
    'db1045' => 0, # 1.4TB 64GB # depooled because reasons
    'db2094' => 100,
    'db2095' => 100,
    'db2096' => 100,
    'db2097' => 100,
),
- Do not just set the weight to 0; that will not fully depool the server
- If the server has a particular role (normally commented on the same line), like vslow, dump, recentchanges, etc., you must take away the roles and assign them to another node in the 'groupLoadsBySection' array:
's8' => array(
    'watchlist' => array(
        # 'db2094' => 1,
        'db2055' => 1,
    ),
    'recentchanges' => array(
        # 'db2094' => 1,
        'db2055' => 1,
    ),
    'recentchangeslinked' => array(
        # 'db2094' => 1,
        'db2055' => 1,
    ),
- Most roles can be taken by any slave (assuming they have enough resources and are not overloaded), although there are some cases where there are only a couple of "special slaves" optimized for a particular role.
- Commit the change and deploy on tin:
/srv/mediawiki-staging $ sync-file wmf-config/db-eqiad.php "Depooled because reasons"
- Monitor that mysql connections to that host slowly drop as the queries finish. For that, connect to the host, use mysql's SHOW PROCESSLIST; and check that there are no wikiuser or wikiadmin connections. You can also monitor connections with regular Linux tools like netstat/ss on port 3306 (see the sketch after this list). Monitoring tools regularly check the host, but they use separate users.
Example:
MariaDB PRODUCTION x1 localhost (none) > SHOW PROCESSLIST;
+---------+-----------------+-------------------+--------------------+---------+-
| Id      | User            | Host              | db                 | Command |
+---------+-----------------+-------------------+--------------------+---------+-
# internal process, ignore
| 2       | event_scheduler | localhost         | NULL               | Daemon  |
# replication users, ignore
| 3192579 | system user     |                   | NULL               | Connect |
| 3192580 | system user     |                   | NULL               | Connect |
# monitoring users, ignore
| 6284249 | watchdog        | 10.XX.XX.XX:34525 | information_schema | Sleep   |
| 6284250 | watchdog        | 10.XX.XX.XX:34716 | information_schema | Sleep   |
| 6284253 | watchdog        | 10.XX.XX.XX:34890 | mysql              | Sleep   |
# this is your own connection
| 6311084 | root            | localhost         | NULL               | Query   |
+---------+-----------------+-------------------+--------------------+---------+-
(no wikiuser or wikiadmin processes, ok to do maintenance, kill the machine, etc.)
- Except in the case of a problem or the dump slave while it is creating the dumps, connections should go away in seconds or a very few minutes. If there is an emergency, killing the process (KILL <#ID>) is the way to go. Selects are ok to kill; writes and alters can create worse issues due to the rollback process kicking in, so be sure of what you kill.
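As a rough sketch of the monitoring step above (assuming root access on the depooled host, and the wikiuser/wikiadmin application users shown in the example), the remaining connections can be checked like this:

# Count established client connections on the MySQL port (should trend towards 0)
ss -tn state established '( sport = :3306 )' | wc -l

# List any remaining MediaWiki connections via the processlist
sudo mysql --defaults-file=/root/.my.cnf -e "
    SELECT ID, USER, HOST, DB, COMMAND, TIME
    FROM information_schema.PROCESSLIST
    WHERE USER IN ('wikiuser', 'wikiadmin');"

# Only if an emergency requires it, kill a remaining query by its Id
# (selects only; see the warning about writes and alters above)
# sudo mysql --defaults-file=/root/.my.cnf -e "KILL <#ID>;"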
Depooling a master (a.k.a. promoting a new slave to master)
Replication lag
Caused by hardware
One common cause of lag that is easy to check and repair is a hardware issue. Databases have a lot (and I mean a lot) of IO pressure, and while it is not insane, it means that 3-year-old drives are very prone to fail.
As an operator, you are already familiar with the way drives fail (not very reliably, to be honest). All important databases have a hardware RAID, which means 1 disk can fail at a time, usually with very little impact. When that happens, the icinga alert "1 failed LD(s) (Degraded)" should tell you it is time to replace at least one disk. Usually there are spares onsite or the servers are under warranty, which means you can create a ticket for ops-eqiad or ops-codfw and let Chris or Papaul know they should take the failed disk out and insert a new one; the hardware RAID should automatically reconstruct itself.
To check the RAID status, execute:
megacli -AdpAllInfo -aALL
And check the section "Devices present".
To identify the particular disk, execute:
megacli -PDList -aALL
Check in particular the Firmware State (on or off), the S.M.A.R.T. alerts, and the number of medium errors (a few, like a dozen, should not affect performance much, but hundreds of errors in a short timespan are an issue).
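To scan all the drives at once, a simple grep over the -PDList output is usually enough; this is just a sketch, and the exact field names can vary slightly between MegaCLI versions:

# Per-drive summary: location, firmware state and error counters
megacli -PDList -aALL | grep -E 'Enclosure Device ID|Slot Number|Firmware state|Media Error Count|Predictive Failure Count'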
Sadly, disks fail in very creative ways, and while our RAID controllers are reliable enough to 1) continue despite medium errors and 2) disable the disk when it fails completely, in a state of "almost failing" there could be lag issues. If that is the case, executing:
megacli -PDOffline -PhysDrv \[#:#\] -aALL
where #:# is enclosure:slot, will take the particular physical drive offline so that it can be replaced later.
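For example, if the -PDList output reported Enclosure Device ID 32 and Slot Number 3 for the failing drive (hypothetical values), the call would be:

# Take the failing drive offline (enclosure 32, slot 3 in this hypothetical example)
megacli -PDOffline -PhysDrv \[32:3\] -aALL

# Verify the drive is no longer reported as Online
megacli -PDList -aALL | grep -E 'Slot Number|Firmware state'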