You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
MariaDB/troubleshooting: Difference between revisions
< MariaDB
Jump to navigation
Jump to search
imported>Jcrespo (Created page with "''This guide is a work in progress.'' == Depooling a slave == === Quick depooling, for emergencies only === STOP SLAVE; === Normal depooling === Comment its server name fr...") |
imported>Jcrespo mNo edit summary |
||
Line 4: | Line 4: | ||
=== Quick depooling, for emergencies only === | === Quick depooling, for emergencies only === | ||
Connect to the offending server. Log in to mysql (<code>sudo mysql --defaults-file=/root/.my.cnf</code>), and execute the following command: | |||
STOP SLAVE; | STOP SLAVE; | ||
If you stop replication (assuming replication is working normally for other hosts), mediawiki will depool the slave automatically [https://phabricator.wikimedia.org/diffusion/OMWC/browse/master/wmf-config/db-eqiad.php;e443d3a6f3c11eb83af2bba753fd789700e0d123$161 after X seconds]. This is not the preferred method for several reasons: | |||
* The depool is not immediate, during that time, the server keeps receiving queries, which may be undesirable in many cases | |||
* The depool is not total, the server may receive queries still if all servers are lagged | |||
The only advantage is that it does not require a deploy to depool the server. Use the regular depooling method in most cases. | |||
=== Normal depooling === | === Normal depooling === | ||
Comment its server name from <code>db-eqiad.php</code> or <code>db-codfw.php</code>. | Comment its server name from <code>db-eqiad.php</code> or <code>db-codfw.php</code>. | ||
* Hopefully, you will have the [https://phabricator.wikimedia.org/diffusion/OMWC/repository/master/ operations/mediawiki-config] repository already configured | |||
* Edit the file [https://phabricator.wikimedia.org/diffusion/OMWC/browse/master/wmf-config/db-eqiad.php wmf-config/db-eqiad.php] (or [https://phabricator.wikimedia.org/diffusion/OMWC/browse/master/wmf-config/db-codfw.php wmf-config/db-codfw.php], if it is a server in dallas) and comment the line of the normal weight, in the 'sectionLoads' array, like this: | |||
's8' => array( | |||
'db1045' => 0, # 1.4TB 64GB | |||
# depooled because reasons 'db2094' => 100, | |||
'db2095' => 100, | |||
'db2096' => 100, | |||
'db2097' => 100, | |||
), | |||
:*Do not just set the weight to 0, that will not fully depool the server | |||
* If the server has a particular role (normally commented on the same line), like vslow, dump. recentchanges, etc., you must take away the roles and assign them to another node in the 'groupLoadsBySection' array: | |||
's8' => array( | |||
'watchlist' => array( | |||
# 'db2094' => 1, | |||
'db2055' => 1, | |||
), | |||
'recentchanges' => array( | |||
# 'db2094' => 1, | |||
'db2055' => 1, | |||
), | |||
'recentchangeslinked' => array( | |||
# 'db2094' => 1, | |||
'db2055' => 1, | |||
), | |||
:* Most roles can be taken by any slave (assuming the have enough resources and not overloaded), although there are some cases where there are only a couple of [[MariaDB#Special_Slaves|"special slaves"]] optimized for a particular role. | |||
* commit the change and deploy on tin: | |||
/srg/mediawiki-staging $ sync-file wmf-config/db-eqiad.php "Depooled because reasons" | |||
* Monitor that mysql connections to that host slowly drop as soon as the queries finish. For that, connect to that host and use mysql's SHOW PROCESSLIST; and check there are no wikiuser or wikiadmin connctions. You can also monitor connections with regular linux tools like netstat/ss on port 3306. Monitoring tools regularly check the host, but they use separate users. | |||
Example: | |||
MariaDB PRODUCTION x1 localhost (none) > SHOW PROCESSLIST; | |||
+---------+-----------------+-------------------+--------------------+---------+- | |||
| Id | User | Host | db | Command | | |||
+---------+-----------------+-------------------+--------------------+---------+- | |||
# internal process, ignore | |||
| 2 | event_scheduler | localhost | NULL | Daemon | | |||
# replication users, ignore | |||
| 3192579 | system user | | NULL | Connect | | |||
| 3192580 | system user | | NULL | Connect | | |||
# monitoring users, ignore | |||
| 6284249 | watchdog | 10.XX.XX.XX:34525 | information_schema | Sleep | | |||
| 6284250 | watchdog | 10.XX.XX.XX:34716 | information_schema | Sleep | | |||
| 6284253 | watchdog | 10.XX.XX.XX:34890 | mysql | Sleep | | |||
# this is your own connection | |||
| 6311084 | root | localhost | NULL | Query | | |||
+---------+-----------------+-------------------+--------------------+---------+- | |||
(no wikiuser or wikiadmin processes, ok to do maintenance, kill the machine, etc.) | |||
:* Except in the case of a problem or the dump slave, while creating the dumps, connections should go away in seconds/very few minutes. If there is an emergency- killing the process (<code>KILL <#ID></code>) is the way to go. Selects are ok to kill, writes and alters can create worse issues due to rollback process kicking in- be sure what you kill. | |||
== Depooling a master (a.k.a. promoting a new slave to master) == | == Depooling a master (a.k.a. promoting a new slave to master) == |
Revision as of 18:14, 28 October 2015
This guide is a work in progress.
Depooling a slave
Quick depooling, for emergencies only
Connect to the offending server. Log in to mysql (sudo mysql --defaults-file=/root/.my.cnf
), and execute the following command:
STOP SLAVE;
If you stop replication (assuming replication is working normally for other hosts), mediawiki will depool the slave automatically after X seconds. This is not the preferred method for several reasons:
- The depool is not immediate, during that time, the server keeps receiving queries, which may be undesirable in many cases
- The depool is not total, the server may receive queries still if all servers are lagged
The only advantage is that it does not require a deploy to depool the server. Use the regular depooling method in most cases.
Normal depooling
Comment its server name from db-eqiad.php
or db-codfw.php
.
- Hopefully, you will have the operations/mediawiki-config repository already configured
- Edit the file wmf-config/db-eqiad.php (or wmf-config/db-codfw.php, if it is a server in dallas) and comment the line of the normal weight, in the 'sectionLoads' array, like this:
's8' => array( 'db1045' => 0, # 1.4TB 64GB # depooled because reasons 'db2094' => 100, 'db2095' => 100, 'db2096' => 100, 'db2097' => 100, ),
- Do not just set the weight to 0, that will not fully depool the server
- If the server has a particular role (normally commented on the same line), like vslow, dump. recentchanges, etc., you must take away the roles and assign them to another node in the 'groupLoadsBySection' array:
's8' => array( 'watchlist' => array( # 'db2094' => 1, 'db2055' => 1, ), 'recentchanges' => array( # 'db2094' => 1, 'db2055' => 1, ), 'recentchangeslinked' => array( # 'db2094' => 1, 'db2055' => 1, ),
- Most roles can be taken by any slave (assuming the have enough resources and not overloaded), although there are some cases where there are only a couple of "special slaves" optimized for a particular role.
- commit the change and deploy on tin:
/srg/mediawiki-staging $ sync-file wmf-config/db-eqiad.php "Depooled because reasons"
- Monitor that mysql connections to that host slowly drop as soon as the queries finish. For that, connect to that host and use mysql's SHOW PROCESSLIST; and check there are no wikiuser or wikiadmin connctions. You can also monitor connections with regular linux tools like netstat/ss on port 3306. Monitoring tools regularly check the host, but they use separate users.
Example:
MariaDB PRODUCTION x1 localhost (none) > SHOW PROCESSLIST; +---------+-----------------+-------------------+--------------------+---------+- | Id | User | Host | db | Command | +---------+-----------------+-------------------+--------------------+---------+- # internal process, ignore | 2 | event_scheduler | localhost | NULL | Daemon | # replication users, ignore | 3192579 | system user | | NULL | Connect | | 3192580 | system user | | NULL | Connect | # monitoring users, ignore | 6284249 | watchdog | 10.XX.XX.XX:34525 | information_schema | Sleep | | 6284250 | watchdog | 10.XX.XX.XX:34716 | information_schema | Sleep | | 6284253 | watchdog | 10.XX.XX.XX:34890 | mysql | Sleep | # this is your own connection | 6311084 | root | localhost | NULL | Query | +---------+-----------------+-------------------+--------------------+---------+-
(no wikiuser or wikiadmin processes, ok to do maintenance, kill the machine, etc.)
- Except in the case of a problem or the dump slave, while creating the dumps, connections should go away in seconds/very few minutes. If there is an emergency- killing the process (
KILL <#ID>
) is the way to go. Selects are ok to kill, writes and alters can create worse issues due to rollback process kicking in- be sure what you kill.
- Except in the case of a problem or the dump slave, while creating the dumps, connections should go away in seconds/very few minutes. If there is an emergency- killing the process (