You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org
For emergencies, check the subpage about MariaDB troubleshooting.
Previously we mainly used MySQL. Now it is only used on the core masters and some auxiliary servers that have not yet been updated.
For a general overview, check the MySQL@Wikipedia slides.
For a full (?) list, look at db.php.
- s1: English Wikipedia
- s2: Large wikis
- s3: Most small wikis (~800)
- s4: Commons
- s5: Wikidata and German Wikipedia
- s6: Large wikis
- s7: Centralauth, metawiki and some large wikipedias
- es1: Read-only Clusters
- es2-es3: Read/write cluster
- x1: Notifications, Flow
- m1-m5: Internal services databases (Puppet, Phabricator, Openstack, Wordpress, ...) Documented at /misc
- m3: Phabricator
- pc: Parsercache
- Analytics cluster
Start / Stop
In most cases, mariadb will not start automatically on server start. This is by design- a server is not ready to be put into production just after start for many reasons, plus it could be in a bad state (e.g. after crash, requiring upgrade or maintenance).
If a server has been shutdown normally, it can be started with:
(current MariaDB packages do not ship a systemd unit, although technically, jessie's backwards compatibility would make them work with systemctl. Avoid service, as it does not allow to pass extra parameters.
Right now, starting mysql does start the slave automatically- however, this again may not be desired if mysql_upgrade has to be run or the server otherwise checked. Prefer:
/etc/init.d/mysql start --skip-slave-start
this will require later to run at the mysql prompt:
mysql> START SLAVE;
mysql> START ALL SLAVES;
if it is using multi-source replication.
To shutdown, make sure mysql is not running first:
If you just do "shutdown -[hr] now", there is a high chance that mysql will timeout and the operating system kills it uncleanly. This is usually due to one of these 3 reasons:
- It takes a lot of time to empty the buffer pool (can take several minutes!)
- The replication thread is killing/committing the ongoing transaction, and it takes a lot of time. This would seem that the slave is "blocked"
- TokuDB has crashed and replication is "stuck" (in this last case, you actually have to kill the server)
If either action takes a long while:
tail -f /a/sqldata/<hostname>.err
Do not assume that things will happen quickly, particularly if a box has been under heavy write load. Failing to wait for the server to stop will, with high probability, corrupt its data if non-transactional tables and GTID are being used- forcing to reload all data (a multi-hour or multi-day task!).
Buffer pool dump
In order to speed up the warming of the buffer pool (more frequent data should be in memory), automatic load/dump has been activated. See buffer pool dump for details.
InnoDB Shutdown Speed
InnoDB can take a while (hours) to write dirty pages out to disk on shutdown. If the instance must have a predictable downtime then make MariaDB begin flushing dirty pages well in advance of the shutdown process:
set global innodb_max_dirty_pages_pct=0;
show global status like '%dirty%';
When the counters approach zero, a shutdown and/or restart will be fast. Of course, since MariaDB is still handling traffic in the meantime, if write load is high the counters may never drop :-) Have to depool the box in that case.
Stopping a slave: avoiding replication corruption
Until GTID with transactional replication control is deployed, and probably forever on places like labs (where non-transactional tables are the norm), it is advised to stop replication first. In 99% percent of the cases, mysql already does that on shutdown, and shutdown does cleanly stop mysql -in theory. In reality, if shutdown takes too much time, systemd/rc/etc. kills mysql, which despite "server transactionality" (server is consistent on crash), that means that replication state is not itself transitional. This happens, for example, if long running updates (vslow, imports running on labs) prevent replication from stopping. In order to do that, just execute:
$ mysql --skip-ssl -e "STOP SLAVE"
This is specially true on crash: in most cases, reimaging a slave and reimporting it is the fastest way to assure data integrity (again, until production gets transactional replication state for InnoDB).
Asher started us using stock upstream packages from mariadb.org with the debs "de-debianized" each time to have fewer hooks, allowing puppet to handle stuff. Simply:
apt-get install mariadb-server-5.5
We also have 5.5.36 and 10.0.10 packages built from source, with a few patches, that install to /opt. So far these are used only for non-frontline boxes and/or advance testing new versions. Currently they probably aren't relevant outside production, but nonetheless can be found by:
apt-get install wmf-mariadb wmf-mariadb10
Need to run /opt/wmf-mariadb/install to install the rc.d script and symlinks in /usr/local/bin if cli tools are not otherwise available. Unlike the upstream versions our packages are fine to install alongside one another.
See mariadb roles in puppet.
Still relevant to MariaDB: Setting up a MySQL replica.
First decide if the schema change can be done online. This only works if:
- The table has a PRIMARY KEY or UNIQUE KEY on NOT NULL field(s)
- Mediawiki can ignore the change behind a feature flag
- Table metadata locking has been considered
Offline Schema Changes
- Schedule downtime
- See operations/mediawiki-config/db-eqiad.php
- May need a master rotation?
Online Schema Changes
Use pt-online-schema-change or the ops wrapper script operations/software/dbtools/osc_host.sh.
$ osc_host.sh --host=dbstore1001 --dblist=/a/common/s2.dblist --table=page_props \ --method=percona "add pp_sortkey float default null"
- --method=percona uses pt-online-schema-change and automatically checks slave lag on the host itself and its slaves.
- --method=ddl uses normal ALTER TABLE which will probably cause replication lag.
The wrapper script prompts for confirmation of settings first and also does a dry run for each wiki. It generally requires little supervision but should never run unmonitored.
If modifying indexes, or adding new fields that are nullable and/or have a default value, the change can be applied to each slave first and their masters last, using --no-replicate; ie, alter the replication tree bottom-to-top so that no slave ever tries to replicate a non-existant field from a master. This:
- Avoids risking the entire shard all at once; one can only break a single slave at a time :-)
- Allows the slave to be depooled if something goes wrong or load is too high.
$ osc_host.sh --host=dbstore1001 --dblist=/a/common/s2.dblist --table=page_props \ --no-replicate --method=percona "add pp_sortkey float default null"
If the change simply must go to masters first and propagate to slaves through replication then the wrapper script may still be appropriate to use but such cases deserve careful consideration from a DBA. This especially includes any changes to primary keys!
Sanitarium and Labsdbs
The sanitarium hosts have triggers defined on some tables which will clash with pt-online-schema-change. For small tables which should be fast to alter simply use --method=ddl. For large tables where DDL would cause unacceptable lag just schedule downtime.
Table Metadata Locking
Both ALTER TABLE and pt-online-schema-change need to hold the table metadata lock, even just briefly for the latter. They must wait for open transactions to close the table and also block new transactions opening the table. On a busy server and a hot table like page or revision this can easily result in many seconds of delay which is more than enough time for connections to pile up and hit max_connections.
Consider reducing load or depooling the box.
Replication Lag Trick
Issuing STOP SLAVE SQL_THREAD on a slave before starting pt_online_schema_change will cause immediate replication lag, which will in turn make Mediawiki load balancing code reduce traffic on the slave until it catches up. Just a few seconds of leeway is enough to allow the tool to create triggers, leading to far less interruption than waiting out the metadata locks traffic jam.
Replication lag can be checked on https://noc.wikimedia.org/dbtree/.
Manipulating the Replication Tree
Tendril displays the full replication tree.
A common task is manipulating slaves within the replication hierarchy. A script exists to help out: operations/software/dbtools/repl.pl. It allows controlling two slaves in a parent/child or sibling/sibling relationship and will do sanity checks plus confirmation prompts. It has nothing to do with masters so don't try to use it for those :-)
Child to Sibling
Move a slave one layer upward in the hierarchy:
./repl.pl --switch-child-to-sibling --parent=db1007.eqiad.wmnet --child=db1035.eqiad.wmnet
Child must be replicating directly from parent.
Sibling to Child
Move a slave one layer downward in the hierarchy:
./repl.pl --switch-sibling-to-child --parent=db1007.eqiad.wmnet --child=db1035.eqiad.wmnet
Both slaves must be replicating from the same master.
Stop Siblings in Sync
Stop two slaves on the same layer in the hierarchy at a common binlog position:
./repl.pl --stop-siblings-in-sync --host1=db1007.eqiad.wmnet --host2=db1035.eqiad.wmnet
Both slaves must be replicating from the same master.
A MariaDB 10 slave can have multiple masters. Have to set the default_master_connection session variable to indicate which stream is to be maniulated by subsequent commands:
./repl.pl --switch-child-to-sibling --parent=db1007.eqiad.wmnet --child=dbstore1001.eqiad.wmnet --child-set=default_master_connection=s7
The *links tables tend to have data skewed toward certain namespaces depending on the wiki. In most cases this doesn't matter and the MariaDB optimizer always chooses a fast execution plan. However some edge cases on heavily used namespaces can cause massive filesorts. Historically mediawiki used STRAIGHT_JOIN however that blunt instrument only introduced a different set of edge cases.
The *links tables respond well to range partitioning on namespace with ranges chosen appropriately on a case-by-case basis. Eg, commonswiki:
CREATE TABLE templatelinks ( tl_from int(8) unsigned NOT NULL DEFAULT '0', tl_namespace int(11) NOT NULL DEFAULT '0', tl_title varbinary(255) NOT NULL DEFAULT , UNIQUE KEY tl_from (tl_from,tl_namespace,tl_title), KEY tl_namespace (tl_namespace,tl_title,tl_from) ) ENGINE=InnoDB DEFAULT CHARSET=binary PARTITION BY RANGE (tl_namespace) ( PARTITION p_9 VALUES LESS THAN (10), PARTITION p_10 VALUES LESS THAN (11), PARTITION p_827 VALUES LESS THAN (828), PARTITION p_828 VALUES LESS THAN (829), PARTITION p_max VALUES LESS THAN MAXVALUE)
Shards S1-S7 have one slave each with table partitioning in place for revision and logging based on user id. These boxes handle special mediawiki query groups like recentchangeslinked, contributions, and logpager.
Eg, from eswiki:
CREATE TABLE logging ( log_id int(10) unsigned NOT NULL AUTO_INCREMENT, log_type varbinary(32) NOT NULL DEFAULT , log_action varbinary(32) NOT NULL DEFAULT , log_timestamp varbinary(14) NOT NULL DEFAULT '19700101000000', log_user int(10) unsigned NOT NULL DEFAULT '0', log_namespace int(11) NOT NULL DEFAULT '0', log_title varbinary(255) NOT NULL DEFAULT , log_comment varbinary(255) NOT NULL DEFAULT , log_params blob NOT NULL, log_deleted tinyint(3) unsigned NOT NULL DEFAULT '0', log_user_text varbinary(255) NOT NULL DEFAULT , log_page int(10) unsigned DEFAULT NULL, PRIMARY KEY (log_id,log_user), KEY type_time (log_type,log_timestamp), KEY type_action (log_type,log_action,log_timestamp), KEY user_time (log_user,log_timestamp), KEY page_time (log_namespace,log_title,log_timestamp), KEY times (log_timestamp) ) ENGINE=InnoDB DEFAULT CHARSET=binary PARTITION BY RANGE (log_user) (PARTITION p00 VALUES LESS THAN (100000), PARTITION p01 VALUES LESS THAN (200000), PARTITION p02 VALUES LESS THAN (300000), PARTITION p03 VALUES LESS THAN (400000), PARTITION p04 VALUES LESS THAN (500000), PARTITION p05 VALUES LESS THAN (1000000), PARTITION p06 VALUES LESS THAN (1500000), PARTITION p07 VALUES LESS THAN (2000000), PARTITION p08 VALUES LESS THAN (2500000), PARTITION p09 VALUES LESS THAN MAXVALUE);
CREATE TABLE revision ( rev_id int(8) unsigned NOT NULL AUTO_INCREMENT, rev_page int(8) unsigned NOT NULL DEFAULT '0', rev_text_id int(8) unsigned NOT NULL DEFAULT '0', rev_comment varbinary(255) NOT NULL, rev_user int(5) unsigned NOT NULL DEFAULT '0', rev_user_text varbinary(255) NOT NULL DEFAULT , rev_timestamp varbinary(14) NOT NULL DEFAULT , rev_minor_edit tinyint(1) unsigned NOT NULL DEFAULT '0', rev_deleted tinyint(1) unsigned NOT NULL DEFAULT '0', rev_len int(8) unsigned DEFAULT NULL, rev_parent_id int(8) unsigned DEFAULT NULL, rev_sha1 varbinary(32) NOT NULL DEFAULT , PRIMARY KEY (rev_id,rev_user), KEY rev_timestamp (rev_timestamp), KEY page_timestamp (rev_page,rev_timestamp), KEY user_timestamp (rev_user,rev_timestamp), KEY usertext_timestamp (rev_user_text,rev_timestamp) ) ENGINE=InnoDB AUTO_INCREMENT=74016150 DEFAULT CHARSET=binary PARTITION BY RANGE (rev_user) (PARTITION p00 VALUES LESS THAN (10000), PARTITION p01 VALUES LESS THAN (20000), PARTITION p02 VALUES LESS THAN (30000), PARTITION p03 VALUES LESS THAN (40000), PARTITION p04 VALUES LESS THAN (50000), PARTITION p05 VALUES LESS THAN (60000), PARTITION p06 VALUES LESS THAN (70000), PARTITION p07 VALUES LESS THAN (80000), PARTITION p08 VALUES LESS THAN (90000), PARTITION p09 VALUES LESS THAN (100000), PARTITION p10 VALUES LESS THAN (200000), PARTITION p11 VALUES LESS THAN (300000), PARTITION p12 VALUES LESS THAN (400000), PARTITION p13 VALUES LESS THAN (500000), PARTITION p14 VALUES LESS THAN (600000), PARTITION p15 VALUES LESS THAN (700000), PARTITION p16 VALUES LESS THAN (800000), PARTITION p17 VALUES LESS THAN (900000), PARTITION p18 VALUES LESS THAN (1000000), PARTITION p19 VALUES LESS THAN (2000000);
Queries that filter on log_user by equality or range run faster. Same for rev_user. The partition sizes are chosen on a per-wiki basis because one size does not fit all. Tendril has a report for choosing sizes based on row distribution.
Note the modified PRIMARY KEY definition that includes log_user. This is relatively safe for a slave but not appropriate for a master, so a partitioned slave should never be eligible for promotion. See coredb::$topology in puppet, or mediawiki-config/db-eqiad.php to identify them.
Extra notes: Having only 1 special node is a "Single Point of Slowdown". I am currently making sure that we have at least 2 nodes load-balancing this kind of traffic, which can be significant for uncached traffic, and allows to properly perform maintenance. Be careful, running the ALTER TABLE, as is, takes
Query OK, 629385781 rows affected (5 days 54 min 6.51 sec) for the English wikipedia, and as much amount of space as the original table, so probably pt-online-schema-change or other methods can be considered to avoid failure and painful rollbacks.
DB Store Slaves
Two boxes replicate all shards using MariaDB 10 multi-source replication: dbstore1001 and dbstore1002.
dbstore1001 & dbstore2001
- MariaDB 10
- Replication delayed by 24h
- InnoDB and TokuDB (Originally it was going to be eventually all TokuDB. Due to some issues with tokudb, I am leaving some newly imported tables as InnoDB. We need to explore different compression strategies. Having a different engine provents us from doing a binary copy.)
- dbstore1001 (dbstore2001 is in "standby"): Source for weekly logical backups via mysqldump and bacula
dbstore1002 & dbstore2002
- MariaDB 10
- dbstore1002 has a CNAME analytics-store.eqiad.wmnet
- Entirely TokuDB for replicated shards; Aria for Analytics' tables. (See the comment above)
- Replicates m4 with "custom" import methods for efficient purging
- This hosts should be the only ones used for operation production checks such as table sizes, checking the existance of records, etc.
Originally Analytics had access to one slave per shard. This is changing to a data warehouse type setup utilizing db1047 and dbstore1002 with both boxes having access to all shards.
Important! dbstore1002 and db1047 currently are not "real" slaves. Sean implemented a very clever custom replication, that is run from terbium.eqiad.wmnet, by which only new rows are added as a client process. This makes things faster BUT things will break if we run an ALTER TABLE on the m4-master tables, or the terbium process dies. This requires better monitoring, but at least you are warned!
dbproxy100x boxes run HAProxy. Besides making for easier Ops server rotations simply by having a proxy IP in the mix, there are two modes for other stuff: load balancing for slaves and failover for masters. So far, only m2 shard master uses it via dbproxy1002.
HAProxy is configured to know about a primary (DB master) a secondary (DB replication slave) but only one node is active at any timeː
listen mariadb 0.0.0.0:3306 mode tcp balance roundrobin option tcpka option mysql-check user haproxy server <%= @primary_name %> <%= @primary_addr %> check inter 3s fall 3 rise 99999999 server <%= @secondary_name %> <%= @secondary_addr %> check backup
If the primary fails health checks the backup is brought online. The rise 99999999 trick (about 10 years) means that the primary does not come back without human intervention, even if it starts passing HAProxy health checks again.
Now, this all sounds good, but there are still some catches:
- At present m2-slave is still running read_only=1 so read traffic will fail over but writes will start to be blocked until a human verifies that the old master is properly dead and runs SET GLOBAL read_only=0;. Applications on m2 like gerrit, ieg, otrs, exim and scholarships will complain but remain semi-useful.
- Persistent connections like those from the eventlogging consumer did not failover nicely during trials, instead hitting a TCP timeout and causing just about as much annoyance (and backfilling) as having no HAProxy at all. This needs more research.
So for the present, if dbproxy1002 complains:
- Check that m2-master is really down. If not, restart haproxy on dbproxy1002 and figure out why health checks failed.
- If the master is fubar ensure its mysqld is stopped before setting read_only=0 on the slave.
- If the slave is fubar most m2 apps probably don't care, so do nothing.
The main module for DBAs on the operations/puppet tree is "mariadb", which is in its own repo operations/puppet/mariadb. Remember to update the subrepo when committing changes to the mariadb module, otherwise it will not be caught by palladium, strontium, CI, etc.
There used to be a class coredb_mysql, used from 5.5 nodes. Right now it is obsolete, but it is still in use by nodes that have not been updated to mariadb10, in particular, the masters. When all nodes are updated to 5.6/10, we will discontinue it, but right now it is essential for everything to work correctly.
Despite all the previous, there are mariadb-related files on the main repo- shared among the clases. Those are the puppet/templates/mariadb/ files, where the main configuration and grants lie.
There is a forth place where you will find mariadb-related files, and that is the private repository, but I suppose that requires no explanation (passwords, etc.).
Other DBA related repos are:
- operations/software/dbtools: for scripts used for maintenance
- operations/software/redactatron: labs filtering
- operations/software/dbtree: DBtree
- operations/software/tendril: DB monitoring
To quickly address multiple core databases, additional custom grains for Salt have been added, this is the current list:
mysql_shard: contains the shard name (
x1, etc.) and address all the hosts in that shard in any location (masters, slaves, eqiad, codfw)
mysql_role: contains the name of the mysql role, valid values are:
To combine those additional grains with already existing ones use the compound matching, for example to get all the
slaves of shard
eqiad execute from the salt master:
sudo salt -C 'G@site:eqiad and G@mysql_shard:s7 and G@mysql_role:slave' test.ping
Long running queries
There is some event logic running on the servers trying to kill potential query exhaustion. Needs research.
I am investigating running:
pt-kill --print --kill --victims all --interval 10 --match-command Query --match-user wikiuser --group-by fingerprint --any-busy-time 50 --query-count 10 F=/root/.my.cnf
on a screen session to try to see if it is effective enough/does not create false positives. It requires more work, and finally, puppetization -then deciding which of the two options to follow.
Most of our tables use TokuDB.