You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

MariaDB/troubleshooting: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>Jcrespo
mNo edit summary
imported>Jcrespo
Line 75: Line 75:


== Replication lag ==
== Replication lag ==
=== Caused by hardware ===
[[File:Hwraid.png|thumb|600px|This is what a half-failing diks looks like in monitoring (small lag until it becomes critical).]]
One common cause of lag that is easy to check and repair is hardware issues. Databases have a lot (and I mean a lot) of IO pressure, and while it is not insane, it means that 3-year old drives are very prone to fail.
As an operator, you are already familiar with the way drives fail (not very reliably, to be honest). All important databases have a hardware RAID, which means 1 disk can fail at a time, usually with very little impact. When that happens, the icinga alert "<code>1 failed LD(s) (Degraded)</code>" should tell you it is time to replace at least one disk. Usually there are spares onsite or the servers are under guarantee, which means you can create a ticket to ops-eqiad or ops-codfw and let Chris or Papaul know that should take it off and insert a new one, the hw RAID should automatically reconstruct itself.
To check the RAID status, execute:
  megacli -AdpAllInfo -aALL
And check the section "Devices present"
To identify the particular disk
  megacli -PDList -aALL
Check in particular for the Firmware State (on or off), the S.M.A.R.T alerts, and the number of medium errors (a few, like a dozen, should not affect much performance, but when there are hundreds of errors in a short timespan, that is an issue).
Sadly, disks fail in a very creative way, and while our RAIDs controllers are reliable enough to 1) continue despite medium errors and 2) disable the disk when it fails completelly; in a state of "almost failing", there could be lag issues.  If that is the case, executing:
  megacli -PDOffline -PhysDrv \[#:#\] -aALL
where #:# is enclosure:slot, will take the particular physical drive offline so that it can be replaced later


== Data inconsistency between nodes ==
== Data inconsistency between nodes ==

Revision as of 09:42, 5 December 2015

This guide is a work in progress.

Depooling a slave

Quick depooling, for emergencies only

Connect to the offending server. Log in to mysql (sudo mysql --defaults-file=/root/.my.cnf), and execute the following command:

STOP SLAVE;

If you stop replication (assuming replication is working normally for other hosts), mediawiki will depool the slave automatically after X seconds. This is not the preferred method for several reasons:

  • The depool is not immediate, during that time, the server keeps receiving queries, which may be undesirable in many cases
  • The depool is not total, the server may receive queries still if all servers are lagged

The only advantage is that it does not require a deploy to depool the server. Use the regular depooling method in most cases.

Normal depooling

Comment its server name from db-eqiad.php or db-codfw.php.

  's8' => array(
    'db1045' => 0,   # 1.4TB  64GB
# depooled because reasons 'db2094' => 100,
    'db2095' => 100,
    'db2096' => 100,
    'db2097' => 100,
 ),
  • Do not just set the weight to 0, that will not fully depool the server
  • If the server has a particular role (normally commented on the same line), like vslow, dump. recentchanges, etc., you must take away the roles and assign them to another node in the 'groupLoadsBySection' array:
 's8' => array(
   'watchlist' => array(
#      'db2094' => 1,
     'db2055' => 1,
   ),
   'recentchanges' => array(
#      'db2094' => 1,
     'db2055' => 1,
   ),
   'recentchangeslinked' => array(
#      'db2094' => 1,
     'db2055' => 1,
),
  • Most roles can be taken by any slave (assuming the have enough resources and not overloaded), although there are some cases where there are only a couple of "special slaves" optimized for a particular role.
  • commit the change and deploy on tin:
/srg/mediawiki-staging $ sync-file wmf-config/db-eqiad.php "Depooled because reasons"
  • Monitor that mysql connections to that host slowly drop as soon as the queries finish. For that, connect to that host and use mysql's SHOW PROCESSLIST; and check there are no wikiuser or wikiadmin connctions. You can also monitor connections with regular linux tools like netstat/ss on port 3306. Monitoring tools regularly check the host, but they use separate users.

Example:

MariaDB PRODUCTION x1 localhost (none) > SHOW PROCESSLIST;
+---------+-----------------+-------------------+--------------------+---------+-
| Id      | User            | Host              | db                 | Command | 
+---------+-----------------+-------------------+--------------------+---------+-
# internal process, ignore
|       2 | event_scheduler | localhost         | NULL               | Daemon  | 
# replication users, ignore
| 3192579 | system user     |                   | NULL               | Connect | 
| 3192580 | system user     |                   | NULL               | Connect | 
# monitoring users, ignore
| 6284249 | watchdog        | 10.XX.XX.XX:34525 | information_schema | Sleep   | 
| 6284250 | watchdog        | 10.XX.XX.XX:34716 | information_schema | Sleep   | 
| 6284253 | watchdog        | 10.XX.XX.XX:34890 | mysql              | Sleep   | 
# this is your own connection
| 6311084 | root            | localhost         | NULL               | Query   | 
+---------+-----------------+-------------------+--------------------+---------+-

(no wikiuser or wikiadmin processes, ok to do maintenance, kill the machine, etc.)

  • Except in the case of a problem or the dump slave, while creating the dumps, connections should go away in seconds/very few minutes. If there is an emergency- killing the process (KILL <#ID>) is the way to go. Selects are ok to kill, writes and alters can create worse issues due to rollback process kicking in- be sure what you kill.

Depooling a master (a.k.a. promoting a new slave to master)

Replication lag

Caused by hardware

This is what a half-failing diks looks like in monitoring (small lag until it becomes critical).

One common cause of lag that is easy to check and repair is hardware issues. Databases have a lot (and I mean a lot) of IO pressure, and while it is not insane, it means that 3-year old drives are very prone to fail.

As an operator, you are already familiar with the way drives fail (not very reliably, to be honest). All important databases have a hardware RAID, which means 1 disk can fail at a time, usually with very little impact. When that happens, the icinga alert "1 failed LD(s) (Degraded)" should tell you it is time to replace at least one disk. Usually there are spares onsite or the servers are under guarantee, which means you can create a ticket to ops-eqiad or ops-codfw and let Chris or Papaul know that should take it off and insert a new one, the hw RAID should automatically reconstruct itself.

To check the RAID status, execute:

 megacli -AdpAllInfo -aALL

And check the section "Devices present"

To identify the particular disk

 megacli -PDList -aALL

Check in particular for the Firmware State (on or off), the S.M.A.R.T alerts, and the number of medium errors (a few, like a dozen, should not affect much performance, but when there are hundreds of errors in a short timespan, that is an issue).

Sadly, disks fail in a very creative way, and while our RAIDs controllers are reliable enough to 1) continue despite medium errors and 2) disable the disk when it fails completelly; in a state of "almost failing", there could be lag issues. If that is the case, executing:

 megacli -PDOffline -PhysDrv \[#:#\] -aALL

where #:# is enclosure:slot, will take the particular physical drive offline so that it can be replaced later

Data inconsistency between nodes