You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org
MariaDB: Difference between revisions
|Line 27:||Line 27:|
=== Extension storage ===
=== Extension storage ===
* x1: Notifications, Flow,
* x1: Notifications, Flow, , , ... (in general, databases that have a very different write pattern or need to be shared between wikis)
=== Miscellaneous ===
=== Miscellaneous ===
Revision as of 08:41, 5 July 2019
For emergencies, check the subpage about MariaDB troubleshooting.
For a general overview, check the MySQL@Wikipedia (2015) slides.
Sections and shards
For a full (?) list, look at db.php.
Core MediaWiki databases
- s1: English Wikipedia
- s2: Some large wikis (Chinese, Italian, Dutch, Polish, Portuguese, …)
- s3: Most small wikis (~800)
- s4: Commons
- s5: German Wikipedia (scheduled to move the largest s3 wikis here due to low load)
- s6: Some large wikis (French, Japanese, Russian)
- s7: CentralAuth, meta, and some large wikis (Arabic, Spanish, Farsi, Hebrew, …)
- s8: Wikidata
The external storage servers host the full text of all page revisions in a compressed format.
- es1: Read-only Clusters
- es2-es3: Read/write cluster
- pc1-pc3: parser caches
- x1: Notifications, Flow, ContentTranslation, Cognate, ReadingLists, UrlShortener, ... (in general, databases that have a very different write pattern or need to be shared between wikis)
The miscellaneous servers host databases for various auxiliary services.
- m1: Basic ops utilities: bacula, etherpad, librenms, puppet (only stats), racktables, others
- m2: OTRS, Gerrit, debmonitor and others
- m3: Phabricator
- m4: EventLogging system
- m5: testreduce, openstack and other labs-related dbs
- tendril: tendril (including dbtree backend) and zarcillo
- labsdb1009/10/11: Wikireplicas
- labsdb1004/5: ToolsDB
- Other misc mariadb an postgres hosts
The master database is the primary server for a given shard. If the master database server fails, then a replica can be promoted to be a new master. See Master switch for more information.
A database replica (or slave database) is another MariaDB or MySQL instance that replicates data from the master database. Most often, we add replicas in production for the purpose of load-balancing read queries, and for backup/failover purposes.
Cloud Services Wiki Replicas
Copies of the core databases with private data removed are part of the Data Services hosted by Wikimedia Cloud Services for bot, tool, and analysis use. The redaction is done during an intermediate stop on Sanitarium.
Full copies of the core and extension storage databases are available for internal, NDA-only use on dbstore1003,4 and 5, although they are usually accessed using different dns alias.
Aside from replicas of production databases, there is both a master and a replica (accessible for NDA users) for the eventlogging service.
A bit of history:
- Originally Analytics had access to one replica per shard.
- This has changed to a data warehouse type setup utilizing db1047 and dbstore1002 with both boxes having access to all shards.
- With T156844 dbstore1002 stopped replicating the log database (eventlogging) and db1047 was decommissioned. db1108 is currently the only eventlogging log database replica (using a very clever custom replication script running locally).
- dbstore1002 is finally decommissioned on March 2019, replaced by 3 multi-instance hosts
- Main article: MariaDB/Backups
Replication protects against (in general) against hardware issues, as well as physical corruption. However, it is not enough to prevent application or operation errors. Also many people (e.g. developers and service maintainers) sometimes require observing past state of the database to solve data application issues.
In the past, 2 boxes replicate all shards using MariaDB 10 multi-source replication:
dbstore1002 (as well as
dbstore2002 on codfw). dbstore1001 and dbstore2001 also used to be delayed 24 hours. Multisource was abandoned for several reasons (task T159423): TokuDB (used for compression) bugs, difficult to recover and maintain, as well as difficulty for scalability as data grew more and more, MariaDB bugs and GTID limitations, among others.
In addition to the delayed slave, a bacula-orchestrated dumpdump was done weekly from dbstore1001 on misc and metadata core databases, and stored syncronously on bacula.
In 2017, multiple hardware and scalability issues lead to a focus and complete redesign of the database backup architecture, leading to the current one explained at MariaDB/Backups, with focus on full coverage, validation and testing and full automation recovery.
Start / Stop
In most cases, mariadb will not start automatically on server start. This is by design- a server is not ready to be put into production just after start for many reasons, plus it could be in a bad state (e.g. after crash, requiring upgrade or maintenance).
If a server has been shutdown normally, it can be started with:
systemctl start mariadb
(current MariaDB packages do not ship a systemd unit, although technically, jessie's backwards compatibility would make them work with systemctl. Avoid service, as it does not allow to pass extra parameters.
Right now, starting mysql does start the slave automatically- however, this again may not be desired if mysql_upgrade has to be run or the server otherwise checked. Prefer:
systemctl set-environment MYSQLD_OPTS="--skip-slave-start" systemctl start mariadb
this will require later to run at the mysql prompt:
mysql> START SLAVE;
mysql> START ALL SLAVES;
if it is using multi-source replication.
To shutdown, make sure mysql replication is not running first:
mysql> STOP SLAVE; mysql> SHOW [ALL] SLAVE[S] STATUS; systemctl stop mariadb
If you just do "shutdown -[hr] now", there is a high chance that mysql will timeout and the operating system kills it uncleanly. This is usually due to one of these 3 reasons:
- It takes a lot of time to empty the buffer pool (can take several minutes!)
- The replication thread is killing/committing the ongoing transaction, and it takes a lot of time. This would seem that the slave is "blocked"
- TokuDB has crashed and replication is "stuck" (in this last case, you actually have to kill the server)
If either action takes a long while:
journalctl -u mariadb
Do not assume that things will happen quickly, particularly if a box has been under heavy write load. Failing to wait for the server to stop will, with high probability, corrupt its data if non-transactional tables and GTID are being used- forcing to reload all data (a multi-hour or multi-day task!).
Consider also unmounting manually the /srv partition. Apparently, systemd does not respect the umounting time in some cases and it can lead to corruption:
puppet agent --disable "server shutdown" # prevents puppet messing up with the inexistent data umount /srv
Buffer pool dump
In order to speed up the warming of the buffer pool (more frequent data should be in memory), automatic load/dump has been activated. See buffer pool dump for details.
InnoDB Shutdown Speed
InnoDB can take a while (hours) to write dirty pages out to disk on shutdown. If the instance must have a predictable downtime then make MariaDB begin flushing dirty pages well in advance of the shutdown process:
set global innodb_max_dirty_pages_pct=0;
show global status like '%dirty%';
When the counters approach zero, a shutdown and/or restart will be fast. Of course, since MariaDB is still handling traffic in the meantime, if write load is high the counters may never drop :-) Have to depool the box in that case.
Stopping a slave: avoiding replication corruption
Until GTID with transactional replication control is deployed, and probably forever on places like labs (where non-transactional tables are the norm), it is advised to stop replication first. In 99% percent of the cases, mysql already does that on shutdown, and shutdown does cleanly stop mysql -in theory. In reality, if shutdown takes too much time, systemd/rc/etc. kills mysql, which despite "server transactionality" (server is consistent on crash), that means that replication state is not itself transitional. This happens, for example, if long running updates (vslow, imports running on labs) prevent replication from stopping. In order to do that, just execute:
$ mysql --skip-ssl -e "STOP SLAVE"
This is specially true on crash: in most cases, reimaging a slave and reimporting it is the fastest way to assure data integrity (again, until production gets transactional replication state for InnoDB).
Asher started us using stock upstream packages from mariadb.org with the debs "de-debianized" each time to have fewer hooks, allowing puppet to handle stuff. Simply:
apt install wmf-mariadb101
We currently have wmf-mariadb10, wmf-mariadb101 (yes, I know, the names are horrible) and wmf-mysql80 built from source, with no patches except forcing the usage of openssl instead of yassl, that install to /opt.
Puppet controls manually the rc.d script and the package uses update-alternatives to update the symlinks in
/usr/local/bin. Unlike the upstream versions our packages are fine to install alongside one another.
See mariadb roles in puppet.
Loading Data / provisioning
Setting up a new replica (or repairing a current one) is done through the recovery/provisioning system: MariaDB/Backups
However, one can do still manual hot or cold copies of running or stopped servers through transfer.py script on cumin hosts.
First decide if the schema change can be done online. This only works if:
- The table has a PRIMARY KEY or UNIQUE KEY on NOT NULL field(s)
- MediaWiki can ignore the change behind a feature flag
- Table metadata locking has been considered
Offline Schema Changes
- Schedule downtime
- See operations/mediawiki-config/db-eqiad.php
- May need a master rotation?
Online Schema Changes
Use pt-online-schema-change or the ops wrapper script operations/software/dbtools/osc_host.sh.
$ osc_host.sh --host=dbstore1001 --dblist=/a/common/s2.dblist --table=page_props \ --method=percona "add pp_sortkey float default null"
- --method=percona uses pt-online-schema-change and automatically checks slave lag on the host itself and its slaves.
- --method=ddl uses normal ALTER TABLE which will probably cause replication lag.
The wrapper script prompts for confirmation of settings first and also does a dry run for each wiki. It generally requires little supervision but should never run unmonitored.
If modifying indexes, or adding new fields that are nullable and/or have a default value, the change can be applied to each slave first and their masters last, using --no-replicate; ie, alter the replication tree bottom-to-top so that no slave ever tries to replicate a non-existant field from a master. This:
- Avoids risking the entire shard all at once; one can only break a single slave at a time :-)
- Allows the slave to be depooled if something goes wrong or load is too high.
$ osc_host.sh --host=dbstore1001 --dblist=/a/common/s2.dblist --table=page_props \ --no-replicate --method=percona "add pp_sortkey float default null"
If the change simply must go to masters first and propagate to slaves through replication then the wrapper script may still be appropriate to use but such cases deserve careful consideration from a DBA. This especially includes any changes to primary keys!
Sanitarium and Labsdbs
See more: Sanitarium and labsdb
The sanitarium hosts have triggers defined on some tables which will clash with pt-online-schema-change. For small tables which should be fast to alter simply use --method=ddl. For large tables where DDL would cause unacceptable lag just schedule downtime.
Table Metadata Locking
Both ALTER TABLE and pt-online-schema-change need to hold the table metadata lock, even just briefly for the latter. They must wait for open transactions to close the table and also block new transactions opening the table. On a busy server and a hot table like page or revision this can easily result in many seconds of delay which is more than enough time for connections to pile up and hit max_connections.
Consider reducing load or depooling the box.
Replication Lag Trick
Issuing STOP SLAVE SQL_THREAD on a slave before starting pt_online_schema_change will cause immediate replication lag, which will in turn make MediaWiki load balancing code reduce traffic on the slave until it catches up. Just a few seconds of leeway is enough to allow the tool to create triggers, leading to far less interruption than waiting out the metadata locks traffic jam.
Replication lag can be checked on https://noc.wikimedia.org/dbtree/.
Manipulating the Replication Tree
Tendril displays the full replication tree.
The following is interesting info, but thanks to gtid replication, implemented almost everywhere in production, except multi-source replicated slaves, you can move slaves just by executing:
STOP SLAVE; CHANGE MASTER TO MASTER_HOST='<new master fqdn>'; START SLAVE;
A common task is manipulating slaves within the replication hierarchy. A script exists to help out: operations/software/dbtools/repl.pl. It allows controlling two slaves in a parent/child or sibling/sibling relationship and will do sanity checks plus confirmation prompts. It has nothing to do with masters so don't try to use it for those :-)
Child to Sibling
Move a slave one layer upward in the hierarchy:
./repl.pl --switch-child-to-sibling --parent=db1007.eqiad.wmnet --child=db1035.eqiad.wmnet
Child must be replicating directly from parent.
Sibling to Child
Move a slave one layer downward in the hierarchy:
./repl.pl --switch-sibling-to-child --parent=db1007.eqiad.wmnet --child=db1035.eqiad.wmnet
Both slaves must be replicating from the same master.
Stop Siblings in Sync
Stop two slaves on the same layer in the hierarchy at a common binlog position:
./repl.pl --stop-siblings-in-sync --host1=db1007.eqiad.wmnet --host2=db1035.eqiad.wmnet
Both slaves must be replicating from the same master.
A MariaDB 10 slave can have multiple masters. Have to set the default_master_connection session variable to indicate which stream is to be maniulated by subsequent commands:
./repl.pl --switch-child-to-sibling --parent=db1007.eqiad.wmnet --child=dbstore1001.eqiad.wmnet --child-set=default_master_connection=s7
The *links tables tend to have data skewed toward certain namespaces depending on the wiki. In most cases this doesn't matter and the MariaDB optimizer always chooses a fast execution plan. However some edge cases on heavily used namespaces can cause massive filesorts. Historically MediaWiki used STRAIGHT_JOIN however that blunt instrument only introduced a different set of edge cases.
The *links tables respond well to range partitioning on namespace with ranges chosen appropriately on a case-by-case basis. Eg, commonswiki:
CREATE TABLE templatelinks ( tl_from int(8) unsigned NOT NULL DEFAULT '0', tl_namespace int(11) NOT NULL DEFAULT '0', tl_title varbinary(255) NOT NULL DEFAULT , UNIQUE KEY tl_from (tl_from,tl_namespace,tl_title), KEY tl_namespace (tl_namespace,tl_title,tl_from) ) ENGINE=InnoDB DEFAULT CHARSET=binary PARTITION BY RANGE (tl_namespace) ( PARTITION p_9 VALUES LESS THAN (10), PARTITION p_10 VALUES LESS THAN (11), PARTITION p_827 VALUES LESS THAN (828), PARTITION p_828 VALUES LESS THAN (829), PARTITION p_max VALUES LESS THAN MAXVALUE)
Shards S1-S7 have one slave each with table partitioning in place for revision and logging based on user id. These boxes handle special MediaWiki query groups like recentchangeslinked, contributions, and logpager.
Eg, from eswiki:
CREATE TABLE logging ( log_id int(10) unsigned NOT NULL AUTO_INCREMENT, log_type varbinary(32) NOT NULL DEFAULT , log_action varbinary(32) NOT NULL DEFAULT , log_timestamp varbinary(14) NOT NULL DEFAULT '19700101000000', log_user int(10) unsigned NOT NULL DEFAULT '0', log_namespace int(11) NOT NULL DEFAULT '0', log_title varbinary(255) NOT NULL DEFAULT , log_comment varbinary(255) NOT NULL DEFAULT , log_params blob NOT NULL, log_deleted tinyint(3) unsigned NOT NULL DEFAULT '0', log_user_text varbinary(255) NOT NULL DEFAULT , log_page int(10) unsigned DEFAULT NULL, PRIMARY KEY (log_id,log_user), KEY type_time (log_type,log_timestamp), KEY type_action (log_type,log_action,log_timestamp), KEY user_time (log_user,log_timestamp), KEY page_time (log_namespace,log_title,log_timestamp), KEY times (log_timestamp) ) ENGINE=InnoDB DEFAULT CHARSET=binary PARTITION BY RANGE (log_user) (PARTITION p00 VALUES LESS THAN (100000), PARTITION p01 VALUES LESS THAN (200000), PARTITION p02 VALUES LESS THAN (300000), PARTITION p03 VALUES LESS THAN (400000), PARTITION p04 VALUES LESS THAN (500000), PARTITION p05 VALUES LESS THAN (1000000), PARTITION p06 VALUES LESS THAN (1500000), PARTITION p07 VALUES LESS THAN (2000000), PARTITION p08 VALUES LESS THAN (2500000), PARTITION p09 VALUES LESS THAN MAXVALUE);
CREATE TABLE revision ( rev_id int(8) unsigned NOT NULL AUTO_INCREMENT, rev_page int(8) unsigned NOT NULL DEFAULT '0', rev_text_id int(8) unsigned NOT NULL DEFAULT '0', rev_comment varbinary(255) NOT NULL, rev_user int(5) unsigned NOT NULL DEFAULT '0', rev_user_text varbinary(255) NOT NULL DEFAULT , rev_timestamp varbinary(14) NOT NULL DEFAULT , rev_minor_edit tinyint(1) unsigned NOT NULL DEFAULT '0', rev_deleted tinyint(1) unsigned NOT NULL DEFAULT '0', rev_len int(8) unsigned DEFAULT NULL, rev_parent_id int(8) unsigned DEFAULT NULL, rev_sha1 varbinary(32) NOT NULL DEFAULT , PRIMARY KEY (rev_id,rev_user), KEY rev_timestamp (rev_timestamp), KEY page_timestamp (rev_page,rev_timestamp), KEY user_timestamp (rev_user,rev_timestamp), KEY usertext_timestamp (rev_user_text,rev_timestamp) ) ENGINE=InnoDB AUTO_INCREMENT=74016150 DEFAULT CHARSET=binary PARTITION BY RANGE (rev_user) (PARTITION p00 VALUES LESS THAN (10000), PARTITION p01 VALUES LESS THAN (20000), PARTITION p02 VALUES LESS THAN (30000), PARTITION p03 VALUES LESS THAN (40000), PARTITION p04 VALUES LESS THAN (50000), PARTITION p05 VALUES LESS THAN (60000), PARTITION p06 VALUES LESS THAN (70000), PARTITION p07 VALUES LESS THAN (80000), PARTITION p08 VALUES LESS THAN (90000), PARTITION p09 VALUES LESS THAN (100000), PARTITION p10 VALUES LESS THAN (200000), PARTITION p11 VALUES LESS THAN (300000), PARTITION p12 VALUES LESS THAN (400000), PARTITION p13 VALUES LESS THAN (500000), PARTITION p14 VALUES LESS THAN (600000), PARTITION p15 VALUES LESS THAN (700000), PARTITION p16 VALUES LESS THAN (800000), PARTITION p17 VALUES LESS THAN (900000), PARTITION p18 VALUES LESS THAN (1000000), PARTITION p19 VALUES LESS THAN (2000000);
Queries that filter on log_user by equality or range run faster. Same for rev_user. The partition sizes are chosen on a per-wiki basis because one size does not fit all. Tendril has a report for choosing sizes based on row distribution.
Note the modified PRIMARY KEY definition that includes log_user. This is relatively safe for a slave but not appropriate for a master, so a partitioned slave should never be eligible for promotion. See coredb::$topology in puppet, or mediawiki-config/db-eqiad.php to identify them.
Extra notes: Having only 1 special node is a "Single Point of Slowdown". I am currently making sure that we have at least 2 nodes load-balancing this kind of traffic, which can be significant for uncached traffic, and allows to properly perform maintenance. Be careful, running the ALTER TABLE, as is, takes
Query OK, 629385781 rows affected (5 days 54 min 6.51 sec) for the English wikipedia, and as much amount of space as the original table, so probably pt-online-schema-change or other methods can be considered to avoid failure and painful rollbacks.
Main page containing : HAProxy
dbproxy1XXX boxes run HAProxy. Besides making for easier Ops server rotations simply by having a proxy IP in the mix, there are two modes for other stuff: load balancing for slaves and failover for masters. So far, only misc shards masters and the latest labsdbs replicas uses it via dbproxy1XXX.
The main module for DBAs on the operations/puppet tree is "mariadb", which is in its own repo operations/puppet/mariadb. Remember to update the subrepo when committing changes to the mariadb module, otherwise it will not be caught by palladium, strontium, CI, etc.
There used to be a class coredb_mysql, used from 5.5 nodes. Right now it is obsolete, but it is still in use by nodes that have not been updated to mariadb10, in particular, the masters. When all nodes are updated to 5.6/10, we will discontinue it, but right now it is essential for everything to work correctly.
Despite all the previous, there are mariadb-related files on the main repo- shared among the clases. Those are the puppet/templates/mariadb/ files, where the main configuration and grants lie.
There is a forth place where you will find mariadb-related files, and that is the private repository, but I suppose that requires no explanation (passwords, etc.).
Other DBA related repos are:
- operations/software/dbtools: for scripts used for maintenance
- operations/software/redactatron: labs filtering
- operations/software/dbtree: DBtree
- operations/software/tendril: DB monitoring
TODO: Document cumin interaction instead
This page contains historical information. It is probably no longer true.
To quickly address multiple core databases, additional custom grains for Salt have been added, this is the current list:
mysql_shard: contains the shard name (
x1, etc.) and address all the hosts in that shard in any location (masters, slaves, eqiad, codfw)
mysql_role: contains the name of the mysql role, valid values are:
To combine those additional grains with already existing ones use the compound matching, for example to get all the
slaves of shard
eqiad execute from the salt master:
sudo salt -C 'G@site:eqiad and G@mysql_shard:s7 and G@mysql_role:slave' test.ping
Long running queries
There is some event logic running on the servers trying to kill potential query exhaustion. Needs research.
I am investigating running:
pt-kill --print --kill --victims all --interval 10 --match-command Query --match-user wikiuser --group-by fingerprint --any-busy-time 50 --query-count 10 F=/root/.my.cnf
on a screen session to try to see if it is effective enough/does not create false positives. It requires more work, and finally, puppetization -then deciding which of the two options to follow.
Importing table spaces from other hosts with multi source replication
This feature provides a fast way of copying data from one host to another over the network using compression and
nc for instance.
To learn more about to copy
.ibd files from one host to another, please check out MariaDB/ImportTableSpace.
root@localhost account no longer uses passwords for authentication (task T150446), but the UNIX socket authentication plugin. This allows stop using passwords (that can be copied, exposed and compromised), and allows the root system user to login from localhost by running:
sudo mysql --skip-ssl
Because mariadb uses the uid of the linux user, there is no need to write passwords to the filesystem anymore, and in the event of a host compromise, only localhost is affected, and not any other host sharing the same password. This could later be extended to other system accounts that only connect from localhost, such as icinga or prometheus. Note that if a superuser is compromised on a host, not having a password is not a further barrier, as root has the ability to manage, read and modify at will MySQL files.
How unix_socket authentication works
All mysql servers should have been configured to load the auth_socket plugin (https://gerrit.wikimedia.org/r/320822):
plugin-load = unix_socket=auth_socket.so
It can also be enabled at runtime with:
INSTALL PLUGIN unix_socket SONAME 'auth_socket';
That only enables to plugin, to use it, we have to alter the user we want to authenticate:
GRANT USAGE ON *.* TO root@localhost IDENTIFIED VIA unix_socket;
This will delete the user password, but will indicate it does not uses the mysql native password authentication, but unix socket.
MariaDB [(none)]> SELECT user,host,password,plugin FROM mysql.user WHERE user='root'; +------+--------------+-------------------------------------------+-------------+ | user | host | password | plugin | +------+--------------+-------------------------------------------+-------------+ | root | localhost | | unix_socket |
By the way, note that the authentication name is unix_socket, but the plugin loaded is auth_socket.so. Do not use auth_socket on the GRANT/CREATE USER statements.
Also, if for some reason, you revert this change, make sure you put a password back:
GRANT USAGE ON *.* TO root@localhost IDENTIFIED BY '<super_secure_password>';
Dumping tables with mydumper
What is mydumper?
From the documentation:
* Parallelism (hence, speed) and performance (avoids expensive character set conversion routines, efficient code overall) * Easier to manage output (separate files for tables, dump metadata, etc, easy to view/parse data) * Consistency - maintains snapshot across all threads, provides accurate master and slave log positions, etc * Manageability - supports PCRE for specifying database and tables inclusions and exclusions
Why are we starting to consider it?
Its speed and parallelism makes it perfect to save time and load data a lot faster.
It is easier to recover single tables or rows.
It compress nicely
From the tests we have seen
- Taking a full dump of s3 takes 1:15h - 88GB
- Taking a full dump of s1 takes 53 minutes - 93GB
- Taking a full dump of s5 takes 1:20h - 93G
Right now and in order to start it - as it doesn't accept any flag as no to read the default file, the following section needs to be commented out on the host my.cnf
# ssl-ca=/etc/ssl/certs/Puppet_Internal_CA.pem # ssl-cert=/etc/mysql/ssl/cert.pem # ssl-key=/etc/mysql/ssl/server.key # ssl-verify-server-cert
Once that is done, a typical way to dump a database:
mydumper -c -h localhost -t 8 -u root -r 100000000 -B wikidatawiki -S /tmp/mysql.sock -o output_directory_name
-c To compress the data files
-h The host
-t How many parallel threads you want dumping at the same time
-u The user
-r Try to split tables into chunks of this many rows
-B Database you want to dump
-S Specify the socket to connect to
"-o" Specify the output directory where you want to save the files to. If not specified, it will generate a directory under your current path called: export-2017xxxxx-xxxxx with all the files.
Please check the dump data section to make sure you are able to connect (comment out the SSL options).
By default it disables the logging into the binlog
Once that is done, typically you want to run:
myloader -h localhost -t 8 -u root -S /tmp/mysql.sock -d export-xxxxx
-h The host
-t How many parallel threads you want loading the data at the same time
-u The user
-S Specify the socket to connect to
-d The directory you want to import the files from
Production section failover checklist
- It has been suggested that this page should be merged with MariaDB/troubleshooting#Depooling_a_master_(a.k.a._promoting_a_new_slave_to_master)
Note there is a script.
switchover.py that automates most of these steps.
NEW master: OLD master:
- Change expire_log_days on the NEW master and OLD master
- Silence alerts on all hosts
- Disable GTID on NEW
- Topology changes, connect everything to NEW
repl.pl --switch-sibling-to-child --parent=NEW.FQDN:PORT --child=SLAVE.FQDN:PORT
- Disable puppet OLD, puppet NEW
puppet agent --disable "switchover to NEW"
- Enable semisync on new master and replicas
mysql.py -hNEW -e "SET GLOBAL rpl_semi_sync_master_enabled = 1;" mysql.py -h(for each slave) -e "STOP SLAVE; START SLAVE;" --slaves should already be enabled, but not active until the master is, too mysql.py -hNEW -e "SHOW GLOBAL STATUS like 'Rpl_semi_sync_master_clients';" -- should be >1 mysql.py -hOLD -e "SET GLOBAL rpl_semi_sync_master_enabled=0;"
- Enable GTID on all replicas
STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE;
- Merge gerrit puppet patch to promote NEW to master (example): https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/439530/
- Start the failover
!log Starting sX failover from OLD to NEW - TASK_NUMBER
- Deploy mediawiki-config read-only example: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/439531/
scap sync-file --force wmf-config/db-eqiad.php "Set s3 on read-only T219115 "
- run switchover script from cumin1001:
root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# ./switchover.py --skip-slave-move $OLD $NEW
- Promote NEW to master example: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/439532/
scap sync-file --force wmf-config/db-eqiad.php "Switchover s3 master eqiad from db1078 to db1075 T219115"
- Remove read-only from mediawiki:
scap sync-file --force wmf-config/db-eqiad.php "Remove s3 ready only T219115"
Equivalent manual steps
- Set old master in read only:
./mysql.py -h OLD -e "set global read_only=ON;"
- Disable heartbeat OLD
- Confirm new master has catched up
host=OLD; echo "== $host =="; ./mysql.py -h $host -e "show master status\G show slave status\G" | grep ' File:\| Position:\|Relay_Master_Log_File:\|Exec\_Master\_Log\_Pos:'; host=NEW; echo "== $host =="; ./mysql.py -h $host -e "select @@hostname; show master status\G show slave status\G" | grep ' File:\| Position:\|Relay_Master_Log_File:\|Exec\_Master\_Log\_Pos:'
- Stop slave on new master
mysql -h NEW.eqiad.wmnet -e "STOP SLAVE;"
- Restart puppet on old and new masters (for heartbeat): NEW and OLD
puppet agent --enable && puppet agent -tv
- Confirm pt-hearbeat has been started on NEW
- Remove read-only from NEW: ./mysql.py -h NEW -e "set global read_only=OFF;"
- Remove read-only from MediaWiki and promote NEW to master example: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/439532/
- RESET SLAVE ALL on new master
./mysql.py -h NEW.eqiad.wmnet -e "reset slave all;"
- Change old master to replicate from new master
./mysql.py --h OLD: change master to master_host='NEW.eqiad.wmnet', master_user='repl', master_password=, master_port=3306, master_log_file=, master_log_pos=, master_ssl=1;
if all looks good: start slave;
End of equivalent manual steps
- Update tendril and zarcillo master server id for sX
example: UPDATE shards set master_id = 1144 where name='s2' LIMIT 1; UPDATE masters SET instance = 'db1075' WHERE dc = 'eqiad' and section = 's3' LIMIT 1
- Update dns example :https://gerrit.wikimedia.org/r/#/c/operations/dns/+/439533/
- Patch prometheus, example: https://gerrit.wikimedia.org/r/#/c/operations/software/+/439534/
- Double check GTID on all the slaves
- Create decommissioning ticket for OLD
- Ensure all replicas and masters have the right events on the ops database (
- Update/resolve phabricator ticket about failover
Special section: x1 master switchover
x1 is a "special" section which cannot be put on read-only on mediawiki, so it needs to relay on the switchover.py script which puts MySQL on read-only. When failing over, please tag the following Teams and people on the phabricator task so they can have a heads up as they are x1 stakeholders and need to know that x1 will have writes blocked for around one minute.
- Subscribers: Tgr, JoeWalsh, Dbrant, Ladsgroup, Addshore, Legoktm, Mholloway
- Tags: Cognate, Growth-Team, Language-Team, User-notice, UrlShortener, StructuredDiscussions, MediaWiki-extensions-BounceHandler, ContentTranslation, Reading List Service, WikimediaEditorTasks.
Example task: https://phabricator.wikimedia.org/T226358
Decommissioning a mariadb host server checklist
- Create a decommission ticket with the following template example: https://phabricator.wikimedia.org/T197063
- If there is hardware problems, please specify so for the DCOps to label it so we do not re-use broken pieces.
- Depool it and remove it from both db-eqiad.php and db-codfw.php at mediawiki-config for core hosts
- Remove it from active configuration from haproxy for misc hosts
- Remove references on hiera
- Remove from monitoring (right now, prometheus lists)
- Remove it from operations-software dbtools
- Remove it from tendril/dbtree
- Remove from site.pp (or move it to spare role)
- Downtime host and all services
- Shutdown mysql so it is no longer running and exposing data outside
- Run puppet so firewall gets shutdown
- Run puppet on icinga active host to remove mysql-related checks
Depool a broken or lagged replica
Checkout the mediawiki-config repo if you haven't already:
git clone ssh://email@example.com:29418/operations/mediawiki-config cd mediawiki-config vim wmf-config/db-eqiad.php
Assuming eqiad is the active DC
- A replica can run just one MySQL instance
- If a replica runs more than one MySQL instance, it will be specified in its definition with the following format
This means db1090 has an instance running on 3317 (and probably others in different ports). The port is formed in the following way
3306 -> MySQL standard port 331X -> X meaning the section where the server belongs to. 1 -> s1 2 -> s2 .. 8 -> s8 3320 -> x1 3321 -> m1 3322 -> m2 3323 -> m3 3324 -> m4 3325 -> m5
So db1090:3317 is a replica that runs on s7
- A replica can handle more than one type of traffic.
These are the types of traffic we have
main recentchanges api vslow
They are normally specified on the definition files, let's use db-eqiad.php as it is the active datacenter Let's examine s7 where db1090 belongs to: This is the main traffic section where the servers and weights are specified. Servers with weight 1 normally means they are still checked for lag by the MediaWiki LB, but the have almost no main traffic, because they are special replicas and they have more weight for special traffic (check below)
's7' => [ 'db1062' => 0, # D4 2.8TB 128GB, master 'db1079' => 300, # A2 3.6TB 512GB, api # master for sanitarium db1125 'db1086' => 400, # B3 3.6TB 512GB, api # candidate master 'db1090:3317' => 1, # C3 3.6TB 512GB, vslow, dump: s2 and s7, old master 'db1094' => 500, # D2 3.6TB 512GB 'db1098:3317' => 1, # B5 3.6TB 512GB # rc, log: s6 and s7 'db1101:3317' => 1, # C2 3.6TB 512GB # rc, log: s7 and s8
And there is also a special slaves section below, look for it: db1090:3317 is a single vslow slave db1079 and db1086 share API traffic, db1079 has more weight for API because db1086 has more weight for main traffic (see above) db1098:3317 db1101:3317 are recentchanges replicas and they share the same amount of load for all those special services.
's7' => [ 'vslow' => [ 'db1090:3317' => 1, ], 'dump' => [ 'db1090:3317' => 1, ], 'api' => [ 'db1079' => 3, 'db1086' => 1, ], 'watchlist' => [ 'db1098:3317' => 1, 'db1101:3317' => 1, ], 'recentchanges' => [ 'db1098:3317' => 1, 'db1101:3317' => 1, ], 'recentchangeslinked' => [ 'db1098:3317' => 1, 'db1101:3317' => 1, ], 'contributions' => [ 'db1098:3317' => 1, 'db1101:3317' => 1, ], 'logpager' => [ 'db1098:3317' => 1, 'db1101:3317' => 1, ], ],
Locate the broken slave - keep in mind it can be in more than one section. The easiest way is probably to look for it on the file. For this example let's assume db1090:3317 is broken with replication broken, but its other instance (3312) is working fine, so no need to depool it.
This host is a vslow host, so we need to replace it with another host from the same section. Let's chose db1079 to replace it.
If the host is a special host and servers some special traffic (vslow,api, recentchanges, it needs to be replaced if it is the only one, we cannot leave a section without a slave)
We need to look for it and comment it out from every single line where db1090:3317 appears. However DO NOT COMMENT the line with its definition which is already really explicit:
'db1090:3317' => '10.64.32.116:3317', # do not remove or comment out
So this should be the diff once we are ready to commit: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/447984/1/wmf-config/db-eqiad.php
git add wmf-config/db-eqiad.php git commit git review
Once we have +2 on gerrit and the change has been merged automatically:
ssh deployment.eqiad.wmnet cd /srv/mediawiki-staging/ git fetch ; git rebase git show (this should show our change) scap sync-file wmf-config/db-eqiad.php "Depool db1090:3317 as it is broken"
Once the change is deployed, we should be able to see our change on: https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php
If a host is completely down and it has multiple instances it needs to be depooled from all the instances Let's say db1090 had a hardware crash and it is powered off, that means that both instances 3312 and 3317 are down. We need to also depool the instance db1090:3312 with the same method as described above.
Create a task with the DBA tag so DBAs can follow up and checkout what happened, a proper fix etc
As of today (20th June 2019) there are two testing hosts on a shard called test-s4 that were bought for MCR testing
DB master: db1111 writable
DB slave: db1077 read only
Whilst those hosts are normally up, it is not warranted they will always be available or up.
Those hosts have static data, and are not updated with production data. They have two wikis and a few tables (more tables can be requested if needed by filing a ticket to DBAs). It is not warranted that they'll be included, but they can be studied and some expectations about ETA and if it is possible to have that data will be given on the ticket.
Access to those hosts is granted individually to those who already have an NDA and cluster access if _really_ needed. Please file a ticket and tag DBA to let us know why you need access so we can start a discussion about it.
People with access