
MariaDB/troubleshooting


This guide is a work in progress.

Depooling a slave

Quick depooling, for emergencies only

Connect to the offending server. Log in to mysql (sudo mysql --defaults-file=/root/.my.cnf), and execute the following command:

STOP SLAVE;
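To confirm that replication has actually stopped, and to resume it once the issue is over, the standard MariaDB statements can be used (a minimal sketch; after STOP SLAVE both Slave_IO_Running and Slave_SQL_Running should report "No"):

-- check the state of the replication threads
SHOW SLAVE STATUS\G
-- once the emergency is over, resume replication so the host catches up again
START SLAVE;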

If you stop replication (assuming replication is working normally for the other hosts), MediaWiki will depool the slave automatically after X seconds. This is not the preferred method, for several reasons:

  • The depool is not immediate: during that time the server keeps receiving queries, which may be undesirable in many cases
  • The depool is not total: the server may still receive queries if all servers are lagged

The only advantage is that it does not require a deploy to depool the server. Use the regular depooling method in most cases.

Normal depooling

Comment out its server name in db-eqiad.php or db-codfw.php.

  's8' => array(
    'db1045' => 0,   # 1.4TB  64GB
    # depooled because reasons 'db2094' => 100,
    'db2095' => 100,
    'db2096' => 100,
    'db2097' => 100,
  ),
  • Do not just set the weight to 0; that will not fully depool the server
  • If the server has a particular role (normally commented on the same line), like vslow, dump, recentchanges, etc., you must take away those roles and assign them to another node in the 'groupLoadsBySection' array:
  's8' => array(
    'watchlist' => array(
      # 'db2094' => 1,
      'db2055' => 1,
    ),
    'recentchanges' => array(
      # 'db2094' => 1,
      'db2055' => 1,
    ),
    'recentchangeslinked' => array(
      # 'db2094' => 1,
      'db2055' => 1,
    ),
  • Most roles can be taken by any slave (assuming they have enough resources and are not overloaded), although in some cases there are only a couple of "special slaves" optimized for a particular role.
  • Commit the change and deploy it on tin:
/srv/mediawiki-staging $ sync-file wmf-config/db-eqiad.php "Depooled because reasons"
  • Monitor that mysql connections to that host slowly drop as the running queries finish. For that, connect to the host, run SHOW PROCESSLIST; and check that there are no wikiuser or wikiadmin connections. You can also monitor connections with regular Linux tools like netstat/ss on port 3306. Monitoring tools regularly check the host, but they use separate users.

Example:

MariaDB PRODUCTION x1 localhost (none) > SHOW PROCESSLIST;
+---------+-----------------+-------------------+--------------------+---------+-
| Id      | User            | Host              | db                 | Command | 
+---------+-----------------+-------------------+--------------------+---------+-
# internal process, ignore
|       2 | event_scheduler | localhost         | NULL               | Daemon  | 
# replication users, ignore
| 3192579 | system user     |                   | NULL               | Connect | 
| 3192580 | system user     |                   | NULL               | Connect | 
# monitoring users, ignore
| 6284249 | watchdog        | 10.XX.XX.XX:34525 | information_schema | Sleep   | 
| 6284250 | watchdog        | 10.XX.XX.XX:34716 | information_schema | Sleep   | 
| 6284253 | watchdog        | 10.XX.XX.XX:34890 | mysql              | Sleep   | 
# this is your own connection
| 6311084 | root            | localhost         | NULL               | Query   | 
+---------+-----------------+-------------------+--------------------+---------+-

(no wikiuser or wikiadmin processes, ok to do maintenance, kill the machine, etc.)
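If scanning the full process list is inconvenient, a quick count of the remaining application connections can also be pulled from information_schema (a sketch, using the wikiuser/wikiadmin account names mentioned above; an empty result means the host is drained):

SELECT user, COUNT(*) AS connections
FROM information_schema.processlist
WHERE user IN ('wikiuser', 'wikiadmin')
GROUP BY user;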

  • Except in the case of a problem, or the dump slave while it is generating the dumps, connections should go away within seconds or a very few minutes. If there is an emergency, killing the process (KILL <#ID>) is the way to go (see the example below). SELECTs are ok to kill; writes and ALTERs can create worse issues because the rollback process kicks in, so be sure of what you kill.
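A minimal sketch of killing a stuck query; the id 6311099 below is purely illustrative, take the real one from the Id column of SHOW PROCESSLIST:

-- identify the connection to terminate
SHOW PROCESSLIST;
-- 6311099 is an illustrative Id only; only kill statements you have identified as safe (e.g. SELECTs)
KILL 6311099;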

Depooling a master (a.k.a. promoting a new slave to master)

Replication lag

Data inconsistency between nodes