You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

MariaDB/troubleshooting: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>Jcrespo
(Created page with "''This guide is a work in progress.'' == Depooling a slave == === Quick depooling, for emergencies only === STOP SLAVE; === Normal depooling === Comment its server name fr...")
 
imported>Jcrespo
mNo edit summary
Line 4: Line 4:


=== Quick depooling, for emergencies only ===
=== Quick depooling, for emergencies only ===
Connect to the offending server. Log in to mysql (<code>sudo mysql --defaults-file=/root/.my.cnf</code>), and execute the following command:
  STOP SLAVE;
  STOP SLAVE;
If you stop replication (assuming replication is working normally for other hosts), mediawiki will depool the slave automatically [https://phabricator.wikimedia.org/diffusion/OMWC/browse/master/wmf-config/db-eqiad.php;e443d3a6f3c11eb83af2bba753fd789700e0d123$161 after X seconds]. This is not the preferred method for several reasons:
* The depool is not immediate, during that time, the server keeps receiving queries, which may be undesirable in many cases
* The depool is not total, the server may receive queries still if all servers are lagged
The only advantage is that it does not require a deploy to depool the server. Use the regular depooling method in most cases.


=== Normal depooling ===
=== Normal depooling ===
Comment its server name from <code>db-eqiad.php</code> or <code>db-codfw.php</code>.
Comment its server name from <code>db-eqiad.php</code> or <code>db-codfw.php</code>.
* Hopefully, you will have the [https://phabricator.wikimedia.org/diffusion/OMWC/repository/master/ operations/mediawiki-config] repository already configured
* Edit the file [https://phabricator.wikimedia.org/diffusion/OMWC/browse/master/wmf-config/db-eqiad.php wmf-config/db-eqiad.php] (or [https://phabricator.wikimedia.org/diffusion/OMWC/browse/master/wmf-config/db-codfw.php wmf-config/db-codfw.php], if it is a server in dallas) and comment the line of the normal weight, in the ​'sectionLoads' array, like this:
  's8' => array(
    'db1045' => 0,  # 1.4TB  64GB
# depooled because reasons 'db2094' => 100,
    'db2095' => 100,
    'db2096' => 100,
    'db2097' => 100,
  ),
:*Do not just set the weight to 0, that will not fully depool the server
* If the server has a particular role (normally commented on the same line), like vslow, dump. recentchanges, etc., you must take away the roles and assign them to another node in the 'groupLoadsBySection' array:
  's8' => array(
    'watchlist' => array(
#      'db2094' => 1,
      'db2055' => 1,
    ),
    'recentchanges' => array(
#      'db2094' => 1,
      'db2055' => 1,
    ),
    'recentchangeslinked' => array(
#      'db2094' => 1,
      'db2055' => 1,
),
:* Most roles can be taken by any slave (assuming the have enough resources and not overloaded), although there are some cases where there are only a couple of [[MariaDB#Special_Slaves|"special slaves"]] optimized for a particular role.
* commit the change and deploy on tin:
/srg/mediawiki-staging $ sync-file wmf-config/db-eqiad.php "Depooled because reasons"
* Monitor that mysql connections to that host slowly drop as soon as the queries finish. For that, connect to that host and use mysql's SHOW PROCESSLIST; and check there are no wikiuser or wikiadmin connctions. You can also monitor connections with regular linux tools like netstat/ss on port 3306. Monitoring tools regularly check the host, but they use separate users.
Example:
MariaDB PRODUCTION x1 localhost (none) > SHOW PROCESSLIST;
+---------+-----------------+-------------------+--------------------+---------+-
| Id      | User            | Host              | db                | Command |
+---------+-----------------+-------------------+--------------------+---------+-
# internal process, ignore
|      2 | event_scheduler | localhost        | NULL              | Daemon  |
# replication users, ignore
| 3192579 | system user    |                  | NULL              | Connect |
| 3192580 | system user    |                  | NULL              | Connect |
# monitoring users, ignore
| 6284249 | watchdog        | 10.XX.XX.XX:34525 | information_schema | Sleep  |
| 6284250 | watchdog        | 10.XX.XX.XX:34716 | information_schema | Sleep  |
| 6284253 | watchdog        | 10.XX.XX.XX:34890 | mysql              | Sleep  |
# this is your own connection
| 6311084 | root            | localhost        | NULL              | Query  |
+---------+-----------------+-------------------+--------------------+---------+-
(no wikiuser or wikiadmin processes, ok to do maintenance, kill the machine, etc.)
:* Except in the case of a problem or the dump slave, while creating the dumps, connections should go away in seconds/very few minutes. If there is an emergency- killing the process (<code>KILL <#ID></code>) is the way to go. Selects are ok to kill, writes and alters can create worse issues due to rollback process kicking in- be sure what you kill.


== Depooling a master (a.k.a. promoting a new slave to master) ==
== Depooling a master (a.k.a. promoting a new slave to master) ==

Revision as of 18:14, 28 October 2015

This guide is a work in progress.

Depooling a slave

Quick depooling, for emergencies only

Connect to the offending server. Log in to mysql (sudo mysql --defaults-file=/root/.my.cnf), and execute the following command:

STOP SLAVE;

If you stop replication (assuming replication is working normally for other hosts), mediawiki will depool the slave automatically after X seconds. This is not the preferred method for several reasons:

  • The depool is not immediate, during that time, the server keeps receiving queries, which may be undesirable in many cases
  • The depool is not total, the server may receive queries still if all servers are lagged

The only advantage is that it does not require a deploy to depool the server. Use the regular depooling method in most cases.

Normal depooling

Comment its server name from db-eqiad.php or db-codfw.php.

  's8' => array(
    'db1045' => 0,   # 1.4TB  64GB
# depooled because reasons 'db2094' => 100,
    'db2095' => 100,
    'db2096' => 100,
    'db2097' => 100,
 ),
  • Do not just set the weight to 0, that will not fully depool the server
  • If the server has a particular role (normally commented on the same line), like vslow, dump. recentchanges, etc., you must take away the roles and assign them to another node in the 'groupLoadsBySection' array:
 's8' => array(
   'watchlist' => array(
#      'db2094' => 1,
     'db2055' => 1,
   ),
   'recentchanges' => array(
#      'db2094' => 1,
     'db2055' => 1,
   ),
   'recentchangeslinked' => array(
#      'db2094' => 1,
     'db2055' => 1,
),
  • Most roles can be taken by any slave (assuming the have enough resources and not overloaded), although there are some cases where there are only a couple of "special slaves" optimized for a particular role.
  • commit the change and deploy on tin:
/srg/mediawiki-staging $ sync-file wmf-config/db-eqiad.php "Depooled because reasons"
  • Monitor that mysql connections to that host slowly drop as soon as the queries finish. For that, connect to that host and use mysql's SHOW PROCESSLIST; and check there are no wikiuser or wikiadmin connctions. You can also monitor connections with regular linux tools like netstat/ss on port 3306. Monitoring tools regularly check the host, but they use separate users.

Example:

MariaDB PRODUCTION x1 localhost (none) > SHOW PROCESSLIST;
+---------+-----------------+-------------------+--------------------+---------+-
| Id      | User            | Host              | db                 | Command | 
+---------+-----------------+-------------------+--------------------+---------+-
# internal process, ignore
|       2 | event_scheduler | localhost         | NULL               | Daemon  | 
# replication users, ignore
| 3192579 | system user     |                   | NULL               | Connect | 
| 3192580 | system user     |                   | NULL               | Connect | 
# monitoring users, ignore
| 6284249 | watchdog        | 10.XX.XX.XX:34525 | information_schema | Sleep   | 
| 6284250 | watchdog        | 10.XX.XX.XX:34716 | information_schema | Sleep   | 
| 6284253 | watchdog        | 10.XX.XX.XX:34890 | mysql              | Sleep   | 
# this is your own connection
| 6311084 | root            | localhost         | NULL               | Query   | 
+---------+-----------------+-------------------+--------------------+---------+-

(no wikiuser or wikiadmin processes, ok to do maintenance, kill the machine, etc.)

  • Except in the case of a problem or the dump slave, while creating the dumps, connections should go away in seconds/very few minutes. If there is an emergency- killing the process (KILL <#ID>) is the way to go. Selects are ok to kill, writes and alters can create worse issues due to rollback process kicking in- be sure what you kill.

Depooling a master (a.k.a. promoting a new slave to master)

Replication lag

Data inconsistency between nodes