{{Navigation Wikimedia infrastructure|expand=db}}
''For emergencies, check the subpage about [[/troubleshooting|MariaDB troubleshooting]].''


[[w:en:MariaDB|MariaDB]] is the main [[w:en:database management system|database management system]] used to run the Wikimedia sites.


For a general overview, check the [[mw:File:MySQL_at_Wikipedia.pdf|MySQL@Wikipedia (2015)]] slides (MariaDB is a drop-in replacement for [[w:en:MySQL|MySQL]], which we used to use).  


== Sections and shards ==
For a full (?) list, look at [https://noc.wikimedia.org/db.php db.php].


=== {{anchor|s1|s2|s3|s4|s5|s6|s7|s8}} Core MediaWiki databases ===
*[https://noc.wikimedia.org/conf/highlight.php?file=dblists/s1.dblist s1]: English Wikipedia
*[https://noc.wikimedia.org/conf/highlight.php?file=dblists/s2.dblist s2]: 17 large wikis, including Chinese, Italian, Dutch, Polish, and Portuguese Wikipedias and English Wiktionary
*[https://noc.wikimedia.org/conf/highlight.php?file=dblists/s3.dblist s3]: Most small wikis (~800)
*[https://noc.wikimedia.org/conf/highlight.php?file=dblists/s4.dblist s4]: Commons
*[https://noc.wikimedia.org/conf/highlight.php?file=dblists/s5.dblist s5]: German Wikipedia, some other large wikis, and most new wikis since mid-2020 ([[phab:T259438|T259438]])
*[https://noc.wikimedia.org/conf/highlight.php?file=dblists/s6.dblist s6]: French, Japanese, and Russian Wikipedias, and Wikitech
*[https://noc.wikimedia.org/conf/highlight.php?file=dblists/s7.dblist s7]: CentralAuth, Meta, 10 large Wikipedias, and French Wiktionary
*[https://noc.wikimedia.org/conf/highlight.php?file=dblists/s8.dblist s8]: Wikidata


=== External storage ===
The [[external storage]] servers host the full text of all page revisions in a compressed format.
*'''es1''': read-only cluster
*'''es2''': read-only cluster
*'''es3''': read-only cluster
*'''es4''' and '''es5''': read/write cluster


=== Parsercaches ===
* '''pc1'''-'''pc3''': [[parser cache]]s


=== Extension storage ===
{{See also|Debugging in production#Debugging databases}}<!-- [[x1]] and [[extension1]] redirect here -->


=== x1 ===
The x1 cluster is used by [[MediaWiki at WMF]] for databases that are "global" or "cross-wiki" in nature, and are typically associated with a MediaWiki extension. Apart from being used across wikis, it is otherwise configured, replicated, and queried in the same way as "core" databases.
{| class="wikitable"
|+
!Feature
!Database
!Phabricator project
!Steward
|-
|BounceHandler
|<code>wikishared.bounce_*</code>
|{{Phabricator inline tag|MediaWiki-extensions-BounceHandler}}
|
|-
|Cognate
|<code>cognate_wiktionary.*</code>
|{{Phabricator inline tag|Cognate}}
|
|-
|ContentTranslation
|<code>wikishared.cx_*</code>
|{{Phabricator inline tag|ContentTranslation}}
|Language Team
|-
|Echo
|<code>wikishared.echo_*</code>, and
<code>[wiki].echo_*</code>
|{{Phabricator inline tag|Notifications}}
|Growth Team
|-
|Flow
|<code>flowdb.*</code>
|{{Phabricator inline tag|StructuredDiscussions}}
|Growth Team
|-
|GrowthExperiments
|<code>[wiki].growthexperiments_*</code>
|{{Phabricator inline tag|MediaWiki-extensions-GrowthExperiments}}
|Growth Team
|-
|ReadingLists
|<code>wikishared.reading_list_*</code>
|{{Phabricator inline tag|Reading List Service}}
|
|-
|UrlShortener
|<code>wikishared.urlshortcodes</code>
|{{Phabricator inline tag|MediaWiki-extensions-UrlShortener}}
|
|-
|WikimediaEditorTasks
|<code>wikishared.wikimedia_editor_tasks_*</code>
|{{Phabricator inline tag|Product-Infrastructure-Team-Backlog}}
|
|}


=== x2 ===
This cluster is reserved for use by the [[mw:MainStash|MainStash]]. It is read-write in all data centers and replicated in both directions.


As of June 2021, the x2 cluster is not yet in use and MainStash is still backed by [[Redis]].


[[phab:T212129|Launch task: T212129]].


=== Miscellaneous ===
The [[MariaDB/misc|miscellaneous servers]] host databases for various auxiliary services.
* '''m1''': Basic ops utilities: [[Bacula]], [[Etherpad]], [[LibreNMS]], [[Racktables]], others
* '''m2''': [[OTRS]], [[Gerrit]], [[debmonitor|DebMonitor]], [[XHGui]], and others
* '''m3''': [[Phabricator]]
* <strike>'''m4''': [[Analytics/Systems/EventLogging|EventLogging]]</strike> No longer a MariaDB database.
* {{Anchor|m5}}'''m5''': testreduce, WMCS-related dbs, [[Toolsadmin.wikimedia.org]], [[Toolhub.wikimedia.org]]
* '''tendril''': [[Tendril]] (including dbtree backend) and [[Zarcillo]]


==== Database creation template ====
If you need to request a new database, please create a task with the DBA tag and fill out the following fields if you know them (or can estimate them):
* QPS:
* Size: (An estimation is fine)
* DB Name:
* User:
* Accessed from server (s):
* Backup Policy: Needed? Frequency?
* Grants needed:


=== Cloud Services (WMCS) Wiki replicas ===
* clouddb1013-1020: [[Wiki replicas]]


=== Analytics ===
* clouddb1021: Analytics dedicated replica. Used the first days of the month.
* db1108: Eventlogging hosts
* dbstore1003-1005: Multi-instance analytics hosts
 
== Replicas ==
The '''master database''' is the primary server for a given shard. If the master database server fails, then a replica can be promoted to be a new master. See [[Master switch]] for more information.
 
A '''database replica''' (or '''slave database''') is another MariaDB or MySQL instance that replicates data from the master database. Most often, we add replicas in production for the purpose of load-balancing read queries, and for backup/failover purposes.
 
=== Cloud Services Wiki Replicas ===
[[Help:Toolforge/Database|Copies of the core databases]] with private data removed are part of the [[Portal:Data Services|Data Services]] hosted by [[Help:Cloud Services introduction|Wikimedia Cloud Services]] for bot, tool, and analysis use. The redaction is done during [[MariaDB/Sanitarium and Labsdbs|an intermediate stop on Sanitarium]].
As of Feb 2021, wikireplicas run on multiple hosts, with multiple processes; see the [[Portal:Data_Services/Admin/Wiki_Replicas#Multi-instance_replicas|new wiki replicas setup]] for more details.
 
To see how to operate (pool/depool) the new multi-instance setup, check the [[Portal:Data_Services/Admin/Wiki_Replicas#Multi-instance_replicas|new wiki replicas setup]].
 
==== Recloning a Wiki replica ====
 
'''This is an old procedure for the multi-source hosts, which are deprecated. This needs updating.'''
 
In case a Wiki Replica needs to be recloned from another one, there are several steps that need to be done, as these hosts are running multi-source replication (see the consolidated sketch after this list).
 
* Stop all slaves on the source and gather its replication coordinates
* Stop MySQL on the source
* Stop MySQL on the target host
* Remove /srv/sqldata on the target host
* Transfer the data from the source to the target:
transfer.py --no-encrypt --no-checksum SOURCE.eqiad.wmnet:/srv/sqldata TARGET.eqiad.wmnet:/srv
* Make sure replication won't start automatically on either of the hosts:
systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
* Once everything is transferred: on the target host remove the following files
relay-log-sX.info
multi-master.info
master-sX.info
 
* Start MySQL on both hosts
* If necessary run
mysql_upgrade
* On the target issue:
reset slave all;
* On the target configure replication using the coordinates from the source that were gathered previously and start replication
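
Strung together, the procedure above looks roughly like the sketch below. This is only a sketch: SOURCE/TARGET are placeholders, paths and service names follow the conventions used elsewhere on this page, and the final CHANGE MASTER statements still have to be built by hand from the coordinates gathered on the source.

 # On the SOURCE host: stop replication, note down coordinates, stop MariaDB
 mysql -e "STOP ALL SLAVES; SHOW ALL SLAVES STATUS\G" > /root/replication_coordinates.txt
 systemctl stop mariadb
 # On the TARGET host: stop MariaDB and remove the old datadir
 systemctl stop mariadb
 rm -rf /srv/sqldata
 # From a cluster management host: copy the datadir over
 transfer.py --no-encrypt --no-checksum SOURCE.eqiad.wmnet:/srv/sqldata TARGET.eqiad.wmnet:/srv
 # On BOTH hosts: make sure replication does not start on its own
 systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
 # On the TARGET host only: drop the copied multi-source replication state
 rm /srv/sqldata/relay-log-s*.info /srv/sqldata/multi-master.info /srv/sqldata/master-s*.info
 # On BOTH hosts: start MariaDB (and run mysql_upgrade if needed)
 systemctl start mariadb
 mysql_upgrade
 # On the TARGET host: clear replication state, then reconfigure it from the saved coordinates
 mysql -e "RESET SLAVE ALL;"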
 
=== Analytics MariaDB cluster ===
The [[Analytics/Systems/MariaDB|Analytics MariaDB cluster]] contains full replicas of the core and extension storage databases.
 
== Database backups ==
 
{{main|MariaDB/Backups}}
 
Replication protects (in general) against hardware issues, as well as physical corruption. However, it is not enough to prevent application or operational errors. Also, many people (e.g. developers and service maintainers) sometimes need to observe a past state of the database to solve data application issues.
 
In the past, two boxes replicated all shards using MariaDB 10 multi-source replication: <code>dbstore1001</code> and <code>dbstore1002</code> (as well as <code>dbstore2001</code> and <code>dbstore2002</code> on codfw). dbstore1001 and dbstore2001 also used to be delayed 24 hours. Multi-source was abandoned for several reasons ({{phabricator|T159423}}): TokuDB (used for compression) bugs, being difficult to recover and maintain, difficulty scaling as data kept growing, and MariaDB bugs and GTID limitations, among others.
 
In addition to the delayed slave, a bacula-orchestrated dump was done weekly from dbstore1001 of the misc and metadata core databases, and stored synchronously on bacula.
 
In 2017, multiple hardware and scalability issues led to a [[phabricator:T169517|complete redesign]] of the database backup architecture, resulting in '''the current one explained at [[MariaDB/Backups]]''', with a focus on full coverage, validation and testing, and fully automated recovery.
 
== Start / stop ==
See [[MariaDB/Start_and_stop]]


== Packages ==
Asher started us using stock upstream packages from mariadb.org with the debs "de-debianized" each time to have fewer hooks, allowing puppet to handle stuff. Simply:


  apt install wmf-mariadb101


We currently have wmf-mariadb10, wmf-mariadb101 (yes, I know, the names are horrible) and wmf-mysql80 built from source, with no patches except forcing the use of openssl instead of yassl; they install to /opt.


Puppet manually controls the rc.d script, and the package uses update-alternatives to update the symlinks in <code>/usr/local/bin</code>. Unlike the upstream versions, our packages are fine to install alongside one another.
 


== Installation ==
See '''mariadb''' roles in puppet.


=== Hardware installation checklist ===
 
Once the physical racking and OS installation have been completed, normally by DC Ops, the following items must be checked before considering the host ready to receive data.
 
* Amount of memory showing up on the OS matches the expected
 
* RAID setup
** RAID level (typically 10)
** RAID stripe size (256K)
** BBU in place and policy set to WriteBack
 
* Amount of disk space available for ''srv'' matches the expected
 
* Number of CPUs matches the expected value
 
=== Setting up a fresh server ===


* To initialize the database: <code>/opt/wmf-mariadb104/scripts/mysql_install_db --basedir=/opt/wmf-mariadb104/</code>
** If this is a multi-instance db server, you'll need to provide <code>--datadir=/srv/sqldata.<SECTION></code> as well.
* Start the service: <code>systemctl start mariadb</code>
** For multi-instance, the service is <code>mariadb@<SECTION></code>
* Set up defaults: <code>/opt/wmf-mariadb104/bin/mysql_secure_installation --basedir=/opt/wmf-mariadb104/ -S /run/mysqld/mysqld.sock</code>
** For multi-instance, change it to <code>-S /run/mysqld/mysqld.<SECTION>.sock</code>
** Don't set root password, accept defaults for everything else.
 
== Loading Data / provisioning ==
 
Setting up a new replica (or repairing a current one) is done through the recovery/provisioning system: [[MariaDB/Backups]]
 
However, one can still do manual hot or cold copies of running or stopped servers through the '''[[transfer.py]]''' script on the cluster management hosts ({{CuminHosts}}).
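
For example, a cold copy of a stopped replica's datadir between two hosts uses the same kind of invocation shown in the recloning section above (hostnames are placeholders):

 transfer.py --no-encrypt --no-checksum SOURCE.eqiad.wmnet:/srv/sqldata TARGET.eqiad.wmnet:/srv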


== Schema Changes ==
See [[Auto schema]]


First decide if the schema change can be done online. This only works if:


* The table has a PRIMARY KEY or UNIQUE KEY on NOT NULL field(s)
* MediaWiki can ignore the change behind a feature flag
* [[#Table_Metadata_Locking|Table metadata locking]] has been considered




=== Sanitarium and Labsdbs ===
See more: [[/Sanitarium_and_Labsdbs|Sanitarium and labsdb]]


The sanitarium hosts have triggers defined on some tables which will clash with ''pt-online-schema-change''. For small tables which should be fast to alter simply use ''--method=ddl''. For large tables where DDL would cause unacceptable lag just schedule downtime.
==== Replication Lag Trick ====


Issuing STOP SLAVE SQL_THREAD on a slave before starting ''pt_online_schema_change'' will cause immediate replication lag, which will in turn make MediaWiki load balancing code reduce traffic on the slave until it catches up. Just a few seconds of leeway is enough to allow the tool to create triggers, leading to far less interruption than waiting out the metadata locks traffic jam.
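
A minimal sketch of the sequence, assuming the change is run directly against replica dbXXXX (host, database, table and index names below are placeholders, not a real change):

 # In one terminal: let the replica lag so MediaWiki sheds traffic from it
 mysql.py -h dbXXXX -e "STOP SLAVE SQL_THREAD;"
 # In another terminal: start the online schema change; the triggers are created
 # while the host is receiving little traffic
 pt-online-schema-change --alter "ADD KEY example_idx (example_col)" h=dbXXXX,D=enwiki,t=templatelinks --execute
 # A few seconds later, once the triggers exist, resume replication;
 # MediaWiki repools the replica when it catches up
 mysql.py -h dbXXXX -e "START SLAVE SQL_THREAD;"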
 
== Monitoring ==
 
See '''[[MariaDB/monitoring]]'''.
 
=== Replication lag ===
{{See also|Map of database maintenance|Tool:Replag}}
 
Replication lag can be checked on https://noc.wikimedia.org/dbtree/.
 
See [[MariaDB/troubleshooting#Replication lag]] and [[MySQL#Replication lag]] on how to handle replication lag.
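
To check the lag of a single replica directly from a cluster management host (dbXXXX is a placeholder; the second query assumes pt-heartbeat is writing to heartbeat.heartbeat, as described in the failover checklists below):

 # The replication thread's own view of the lag
 mysql.py -h dbXXXX -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master
 # pt-heartbeat's view: compare the latest heartbeat timestamp with the current time
 mysql.py -h dbXXXX -e "SELECT * FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1\G"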


== Manipulating the Replication Tree ==


[https://tendril.wikimedia.org/tree Tendril] displays the full replication tree.
The following is interesting info, but thanks to GTID replication, implemented almost everywhere in production except multi-source replicated slaves, you can move slaves just by executing:
STOP SLAVE; CHANGE MASTER TO MASTER_HOST='<new master fqdn>'; START SLAVE;


A common task is manipulating slaves within the replication hierarchy. A script exists to help out: '''operations/software/dbtools/repl.pl'''. It allows controlling two slaves in a parent/child or sibling/sibling relationship and will do sanity checks plus confirmation prompts. It '''has nothing to do with masters''' so don't try to use it for those :-)
=== Links Tables ===


The *links tables tend to have data skewed toward certain namespaces depending on the wiki. In most cases this doesn't matter and the MariaDB optimizer always chooses a fast execution plan. However some edge cases on heavily used namespaces can cause massive filesorts. Historically MediaWiki used STRAIGHT_JOIN however that blunt instrument only introduced a different set of edge cases.


The *links tables respond well to [http://dev.mysql.com/doc/refman/5.5/en/partitioning-range.html range partitioning] on namespace, with ranges chosen appropriately on a case-by-case basis (e.g. on commonswiki).
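
Purely as an illustration (the exact boundaries are chosen per wiki and the ones below are made up), such a partitioning looks roughly like:

 -- Illustrative only: partition templatelinks by namespace.
 -- The partitioning column has to be part of every unique key of the table.
 ALTER TABLE templatelinks
   PARTITION BY RANGE (tl_namespace) (
     PARTITION p_content  VALUES LESS THAN (1),
     PARTITION p_talk     VALUES LESS THAN (10),
     PARTITION p_template VALUES LESS THAN (12),
     PARTITION p_rest     VALUES LESS THAN MAXVALUE
   );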
=== Special Slaves ===


Shards S1-S7 have one slave each with table partitioning in place for '''revision''' and '''logging''' based on user id. These boxes handle special MediaWiki query groups like ''recentchangeslinked'', ''contributions'', and ''logpager''.


Eg, from '''eswiki''':
'''Extra notes:''' Having only 1 special node is a "Single Point of Slowdown". I am currently making sure that we have at least 2 nodes load-balancing this kind of traffic, which can be significant for uncached traffic, and allows us to properly perform maintenance. Be careful running the ALTER TABLE as is: it takes <code>Query OK, 629385781 rows affected (5 days 54 min 6.51 sec)</code> for the English Wikipedia, and as much space as the original table, so pt-online-schema-change or other methods should be considered to avoid failure and painful rollbacks.


== HAProxy ==
''Main page: [[HAProxy]]''
 
dbproxy1XXX boxes run HAProxy. Besides making for easier Ops server rotations simply by having a proxy IP in the mix, there are two modes for other stuff: load balancing for slaves and failover for masters. So far, only misc shard masters and the latest labsdb replicas use it via dbproxy1XXX.
 
=== Failover ===
 
See role::mariadb::proxy::master
 
=== Load Balancing ===
 
See role::mariadb::proxy::slaves
 
== Puppet ==
The main module for DBAs on the [https://phabricator.wikimedia.org/diffusion/OPUP/ operations/puppet] tree is "mariadb", which is in its own repo [https://phabricator.wikimedia.org/diffusion/OPMD/browse/master/;692ee3e2393456d729baafe2d12571da21c2fbad operations/puppet/mariadb]. Remember to update the subrepo when committing changes to the mariadb module, otherwise it will not be caught by palladium, strontium, CI, etc.
 
There used to be a class coredb_mysql, used from 5.5 nodes. Right now it is obsolete, but it is still in use by nodes that have not been updated to mariadb10, in particular, the masters. When all nodes are updated to 5.6/10, we will discontinue it, but right now it is essential for everything to work correctly.
 
Despite all of the above, there are mariadb-related files in the main repo, shared among the classes. Those are the [https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/templates/mariadb/ puppet/templates/mariadb/] files, where the main configuration and grants live.
 
There is a fourth place where you will find mariadb-related files, and that is the private repository, but I suppose that requires no explanation (passwords, etc.).
 
Other DBA related repos are:
* operations/software/dbtools: for scripts used for maintenance
* [https://phabricator.wikimedia.org/diffusion/OSRE/ operations/software/redactatron]: labs filtering
* [https://phabricator.wikimedia.org/diffusion/OSDB/ operations/software/dbtree]: DBtree
* operations/software/tendril: DB monitoring
 
== Long running queries ==
{{main|/troubleshooting#Overload_due_to_long_running_queries}}
 
You can identify long running queries on [https://logstash.wikimedia.org/goto/c30c91260fbd8eaf07ca99444032d95f logstash]. The [[query killer]] should stop read queries running longer than 60 seconds, but may not function properly under high load. [[db-kill]] can be used in emergencies to kill any query running for longer than 10 seconds.
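
To see what is running right now on a given replica from the command line (a sketch; the 60-second threshold mirrors the query killer mentioned above, and dbXXXX is a placeholder):

 mysql.py -h dbXXXX -e "SELECT id, user, db, time, LEFT(info, 100) AS query FROM information_schema.processlist WHERE command = 'Query' AND time > 60 ORDER BY time DESC"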
 
== Storage engines ==
 
=== TokuDB ===
In 2015, [https://www.percona.com/doc/percona-tokudb/index.html TokuDB] was trialled but was found to be [[phab:T109069|buggy]]. We're now back to using InnoDB.
 
== Importing table spaces from other hosts with multi source replication ==
 
=== Transportable tablespaces ===
Since MySQL 5.6 it is possible to take advantage of [http://dev.mysql.com/doc/refman/5.6/en/tablespace-copying.html transportable table spaces] to move <code>.ibd</code> files from one server to another (if [http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_file_per_table file per table] is enabled).
 
This feature provides a fast way of [[/ImportTableSpace|copying data from one host to another]] over the network using compression and <code>nc</code> for instance.
To learn more about how to copy <code>.ibd</code> files from one host to another, please check out [[MariaDB/ImportTableSpace]].
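
The overall shape is the standard MySQL/MariaDB procedure (a simplified sketch with a placeholder table; the linked page documents the exact steps and the nc-based copy used here):

 -- On the source host: quiesce the table and copy its files while the lock is held
 FLUSH TABLES my_table FOR EXPORT;
 -- copy my_table.ibd (and my_table.cfg) from the source datadir to the target host,
 -- e.g. compressed over nc, then:
 UNLOCK TABLES;
 -- On the target host: create the identical table definition, then
 ALTER TABLE my_table DISCARD TABLESPACE;
 -- put the copied .ibd file in place in the target datadir, fix ownership, then
 ALTER TABLE my_table IMPORT TABLESPACE;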
 
== Account handling ==
 
root@localhost account no longer uses passwords for authentication ({{phabricator|T150446}}), but the [https://mariadb.com/kb/en/mariadb/unix_socket-authentication-plugin/ UNIX socket authentication plugin]. This allows us to stop using passwords (that can be copied, exposed and compromised), and allows the root system user to login from localhost by running:
sudo mysql
or
sudo mysql --skip-ssl
 
Because mariadb uses the uid of the linux user, there is no need to write passwords to the filesystem anymore, and in the event of a host compromise, only localhost is affected, and not any other host sharing the same password. This could later be extended to other system accounts that only connect from localhost, such as icinga or prometheus. Note that if a superuser is compromised on a host, not having a password is not a further barrier, as root has the ability to manage, read and modify MySQL files at will.
 
=== How unix_socket authentication works ===
 
All mysql servers should have been configured to load the auth_socket plugin (https://gerrit.wikimedia.org/r/320822):
 
plugin-load = unix_socket=auth_socket.so
 
It can also be enabled at runtime with:
INSTALL PLUGIN unix_socket SONAME 'auth_socket';
 
That only enables the plugin; to use it, we have to alter the user we want to authenticate:
 
GRANT USAGE ON *.* TO root@localhost IDENTIFIED VIA unix_socket;
 
This will delete the user's password, and will indicate that it does not use the MySQL native password authentication, but the unix socket instead.
MariaDB [(none)]> SELECT user,host,password,plugin FROM mysql.user WHERE user='root';
+------+--------------+-------------------------------------------+-------------+
| user | host        | password                                  | plugin      |
+------+--------------+-------------------------------------------+-------------+
| root | localhost    |                                          | unix_socket |
 
By the way, note that the authentication name is unix_socket, but the plugin loaded is auth_socket.so. Do not use auth_socket on the GRANT/CREATE USER statements.
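
To double check which plugin is loaded and which plugin the account ends up using (a quick verification based on the statements above):

 SELECT plugin_name, plugin_status FROM information_schema.plugins WHERE plugin_name = 'unix_socket';
 SELECT user, host, plugin FROM mysql.user WHERE user = 'root';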
 
Also, if for some reason, you revert this change, make sure you put a password back:
GRANT USAGE ON *.* TO root@localhost IDENTIFIED BY '<super_secure_password>';
 
== Dumping tables with mydumper ==
 
=== What is [https://github.com/maxbube/mydumper mydumper]? ===

From the documentation:
* Parallelism (hence, speed) and performance (avoids expensive character set conversion routines, efficient code overall)
* Easier to manage output (separate files for tables, dump metadata, etc, easy to view/parse data)
* Consistency - maintains snapshot across all threads, provides accurate master and slave log positions, etc
* Manageability - supports PCRE for specifying database and tables inclusions and exclusions
 
=== Why are we starting to consider it? ===
Its speed and parallelism make it perfect to save time and load data a lot faster.
 
It is easier to recover single tables or rows.
 
It compresses nicely.
 
From the tests we have run:
* Taking a full dump of s3 takes 1:15h - 88GB
* Taking a full dump of s1 takes 53 minutes - 93GB
* Taking a full dump of s5 takes 1:20h - 93G
 
=== Quick cheatsheet ===
 
==== Dump data ====
Right now, in order to start it - as it doesn't accept any flag to skip reading the defaults file - the following section needs to be commented out in the host's my.cnf:
# ssl-ca=/etc/ssl/certs/Puppet_Internal_CA.pem
# ssl-cert=/etc/mysql/ssl/cert.pem
# ssl-key=/etc/mysql/ssl/server.key
# ssl-verify-server-cert
Once that is done, a typical way to dump a database:
mydumper -c -h localhost -t 8 -u root -r 100000000 -B wikidatawiki -S /tmp/mysql.sock -o output_directory_name
''-c'' To compress the data files
 
''-h'' The host
 
''-t'' How many parallel threads you want dumping at the same time
 
''-u'' The user
 
''-r'' Try to split tables into chunks of this many rows
 
''-B'' Database you want to dump
 
''-S'' Specify the socket to connect to
 
"-o" Specify the output directory where you want to save the files to. If not specified, it will generate a directory under your current path called: ''export-2017xxxxx-xxxxx'' with all the files.
 
==== Load data ====
Please check the dump data section to make sure you are able to connect (comment out the SSL options).
 
''By default it disables the logging into the binlog''
 
Once that is done, typically you want to run:
myloader -h localhost -t 8 -u root -S /tmp/mysql.sock -d export-xxxxx
''-h'' The host
 
''-t'' How many parallel threads you want loading the data at the same time
 
''-u'' The user
 
''-S'' Specify the socket to connect to
 
''-d'' The directory you want to import the files from
 
== Production section failover checklist ==
{{merge|MariaDB/troubleshooting#Depooling_a_master_(a.k.a._promoting_a_new_slave_to_master)}}
'''Note: there is a script, <code>db-switchover</code>, that automates most of these steps.'''
 
In order to generate the task you can use the script ''switchover-tmpl.py'' that lives on ''operations/software/dbtools/'' and will generate the task content. As an example, generate a task for the ''s1'' switchover where ''db1163'' will be the new master.
python3 switchover-tmpl.py s1 db1163
 
If you want to generate the task manually, use the following checklist:
 
NEW master:
OLD master:
# Check configuration differences between new and old master
pt-config-diff h=NEW.eqiad.wmnet,F=/root/.my.cnf h=OLD.eqiad.wmnet,F=/root/.my.cnf
 
 
# Silence alerts on all hosts
# Set NEW master with weight 0 and depool it from its section
dbctl instance NEW edit
dbctl config commit -m "Set NEW with weight 0 TXXXXXX"
# Topology changes, connect everything to NEW
db-switchover --timeout=15 --only-slave-move OLD.eqiad.wmnet NEW.eqiad.wmnet
# Disable puppet @NEW and @OLD
  puppet agent --disable "switchover to NEW"
# Merge gerrit puppet change to promote NEW to master (Example): https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/538747/
# Start the failover
!log Starting sX failover from OLD to NEW - TXXXXXX
dbctl --scope eqiad section sX ro "Maintenance till 05:30AM UTC TXXXXXX" && dbctl config commit -m "Set sX as read-only for maintenance TXXXXXX"
# Check that sX is indeed on read-only
# run switchover script from one of the cluster management hosts ({{CuminHosts}}):
root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move OLD NEW ; echo OLD; mysql.py -hOLD -e "show slave status\G" ; echo NEW ; mysql.py -hNEW -e "show slave status\G"
# Promote NEW to master and remove read-only, leave OLD (old master) with weight 0 for now
dbctl --scope eqiad section sX set-master NEW && dbctl --scope eqiad section sX rw && dbctl config commit -m "Promote NEW to sX master and remove read-only from sX TXXXXXX"
# Restart puppet on old and new masters (for heartbeat): OLD and NEW
  run-puppet-agent -e "switchover to NEW"
 
<small>
''(Equivalent manual steps)
# Set old master in read only: mysql.py -hOLD -e "set global read_only=ON;"
# Disable heartbeat @OLD
  killall perl
# Confirm new master has caught up
  host=OLD; echo "== $host =="; mysql.py -h $host -e "show master status\G show slave status\G" | grep ' File:\| Position:\|Relay_Master_Log_File:\|Exec\_Master\_Log\_Pos:'; host=NEW; echo "== $host =="; mysql.py -h $host -e "select @@hostname; show master status\G show slave status\G" | grep ' File:\| Position:\|Relay_Master_Log_File:\|Exec\_Master\_Log\_Pos:'
# Stop slave on new master
  mysql.py -h NEW -e "STOP SLAVE;"
# Confirm pt-heartbeat has been started on NEW
  pgrep perl  /  mysql.py -h NEW -e "SELECT * FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1\G"
# Remove read-only from NEW:  mysql.py -hNEW -e "set global read_only=OFF;"
# RESET SLAVE ALL on new master
  mysql.py -h NEW.eqiad.wmnet -e "reset slave all;"
# Change old master to replicate from new master
  mysql.py -hOLD: change master to master_host='NEW.eqiad.wmnet', master_user='repl', master_password='', master_port=3306, master_log_file='', master_log_pos=, master_ssl=1;
if all looks good: start slave;
Enable GTID on old master (OLD)


''(End of equivalent manual steps)''
</small>


Clean up tasks:
# change events for query killer:
    events_coredb_master.sql on the new master NEW
    events_coredb_slave.sql on the new slave OLD
# Update DNS (example): https://gerrit.wikimedia.org/r/#/c/operations/dns/+/538748/
# Clean up lag in orchestrator. Typically you need to go to the new master's heartbeat database and clean up the old master server_id from that table (with replication enabled):
select * from heartbeat; -- Get the server_id from the old master
delete from heartbeat where server_id=171970778; -- example id for db1183
# Give weight to OLD if needed
# If the old master will be decommissioned or will go under maintenance: depool it from dbctl: dbctl instance HOSTNAME depool
# Change candidate master note (generally remove it from NEW and add it to OLD)
dbctl instance OLD set-candidate-master --section sX true
dbctl instance NEW set-candidate-master --section sX false
# Update/resolve phabricator ticket about failover


=== External store section failover checklist ===
External store failovers differ a bit from current sX ones, as we need to disable writes on the given section first.
RO external stores (as of today, es1, es2 and es3) do not need this, as those hosts are standalone and read-only.


As of today, es4 and es5 are RW, so they have a normal replication topology.


Disabling writes on a section can be done safely, so the failover needs no rush. Even though it can be done hours ahead, disabling writes should be done just a few minutes before the actual failover time, to avoid unbalancing both sections.


When one section has writes disabled, reads are still happening without any issues.


'''If the idea is to depool the old master, make sure to leave the new master with a bit of weight, to avoid having just one slave serving traffic. Usually leaving the master with weight 50 and the slave with weight 100 is enough.'''


Checklist for failing over an es section, with es1023 and es1024 as examples:


NEW master: es1024
OLD master: es1023
# Check configuration differences between new and old master
$ pt-config-diff h=es1023.eqiad.wmnet,F=/root/.my.cnf h=es1024.eqiad.wmnet,F=/root/.my.cnf 
 
 
# Silence alerts on all hosts
# Set NEW master with weight 50
dbctl instance es1024 edit
dbctl config commit -m "Set es1024 with weight 50 T255755"
# Topology changes, connect everything to es1024
db-switchover --timeout=15 --only-slave-move es1023.eqiad.wmnet es1024.eqiad.wmnet
# Disable puppet @es1023 and @es1024
  puppet agent --disable "switchover to es1024"
# Merge gerrit puppet change to promote es1024
es1024: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/607236/
# Start the failover 
!log "Starting es failover from es1023 to es1024 - https://phabricator.wikimedia.org/T255755"
Disable es5 writes https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/606663/
 
# Check that es5 is indeed on read-only
# run switchover script from one of the cluster management hosts ({{CuminHosts}}):
root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move es1023 es1024 ; echo es1023; mysql.py -hes1023 -e "show slave status\G" ; echo es1024 ; mysql.py -hes1024 -e "show slave status\G"
# Promote es1024 to master and remove read-only, leave es1023 (old master) with weight 0 
dbctl --scope eqiad section es5 set-master es1024 && dbctl config commit -m "Promote es1024 to es5 master T255755"
# Restart puppet on old and new masters (for heartbeat): es1023 and es1024
  run-puppet-agent -e "switchover to es1024"
# Enable es5 on MW (REVERT PATCH https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/606663/)
Clean up tasks:
# change events for query killer:
    events_coredb_master.sql on the new master es1024
    events_coredb_slave.sql on the new slave es1023
# Update DNS: https://gerrit.wikimedia.org/r/609899
# Clean up lag in orchestrator. Typically you need to go to the new master's heartbeat database and clean up the old master server_id from that table (with replication enabled):
select * from heartbeat; -- Get the server_id from the old master
delete from heartbeat where server_id=171970778; -- example id for db1183
# Update/resolve phabricator ticket about failover
 
== Misc section failover checklist (example with m2) ==
 
OLD MASTER: db1065
 
NEW MASTER: db1132
 
# Check configuration differences between new and old master
$ pt-config-diff h=db1068.eqiad.wmnet,F=/root/.my.cnf h=db1081.eqiad.wmnet,F=/root/.my.cnf
 
# Silence alerts on all hosts
# Topology changes: move everything under db1132
db-switchover --timeout=1 --only-slave-move db1065.eqiad.wmnet db1132.eqiad.wmnet
# Disable puppet @db1065, puppet @db1132
  puppet agent --disable "switchover to db1132"
# Merge gerrit: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519975/
# Run puppet on dbproxy1002 and dbproxy1007 and check the config
puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg
# Start the failover
!log Failover m2 from db1065 to db1132 - T226952
  root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1065 db1132
# Reload haproxies
dbproxy1002:  systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1007:  systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
# kill connections on the old master (db1065)
  pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysql.sock
 
#
# START Equivalent manual steps
# Disable GTID on db1132
# Disable heartbeat @db1065
  killall perl
# Set old m2 master in read only
  ./mysql.py -h db1065 -e "SET GLOBAL read_only=1"
# Confirm new master has caught up DONE
  host=db1065; echo "== $host =="; ./mysql.py -h $host -e "show master status\G show slave status\G" | grep ' File:\| Position:\|Relay_Master_Log_File:\|Exec\_Master\_Log\_Pos:'; host=db1132; echo "== $host =="; ./mysql.py -h $host -e "select @@hostname; show master status\G show slave status\G" | grep ' File:\| Position:\|Relay_Master_Log_File:\|Exec\_Master\_Log\_Pos:'
  # Reload haproxies
  dbproxy1002:  systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
  dbproxy1007:  systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
  # Restart puppet on old and new masters (for heartbeat): db1065 and db1132
  puppet agent --enable && puppet agent -tv
# Set new master as read-write and stop slave
  mysql -h db1132.eqiad.wmnet -e "STOP SLAVE; SET GLOBAL read_only=0;"
# RESET SLAVE ALL on new master
  ./mysql.py -h db1132.eqiad.wmnet -e "reset slave all;"
# Change old master to replicate from new master DONE
  ./mysql.py -hdb1065: change master to master_host='db1132.eqiad.wmnet', master_user='repl', master_password='', master_port=3306, master_log_file='xx', master_log_pos=xx, master_ssl=1;
if all looks good: start slave;
# Update tendril and zarcillo master server id for m2
  UPDATE shards set master_id = 1628 where name='m2' LIMIT 1;
  mysql.py -h db1115 zarcillo -e "UPDATE masters SET instance = 'db1132' WHERE dc = 'eqiad' and section = 'm2' LIMIT 1"
# Enable GTID on the old master db1065
# STOP equivalent manual steps
 
 
# Restart puppet on old and new masters (for heartbeat):db1065 and db1132
  puppet agent --enable && puppet agent -tv
# Check services affected (otrs,debmonitor) DEBMONITOR and OTRS looking good
# Clean up lag in orchestrator. Typically you need to go to the new master's heartbeat database and clean up the old master server_id from that table (with replication enabled):
select * from heartbeat; -- Get the server_id from the old master
delete from heartbeat where server_id=171970778; -- example id for db1183
# change events for query killer:
    events_coredb_master.sql on the new master db1132
    events_coredb_slave.sql on the new slave db1065
# Update/resolve phabricator ticket about failover https://phabricator.wikimedia.org/T226952
# Create decommissioning ticket for db1065
 
 
=== Special section: x1 master switchover ===
x1 is a "special" section which cannot be put on read-only on mediawiki, so it needs to rely on the db-switchover script which puts MySQL on read-only.
When failing over, please tag the following Teams and people on the phabricator task so they can have a heads up as they are x1 stakeholders and need to know that x1 will have writes blocked for around one minute.
* Subscribers: Tgr, JoeWalsh, Dbrant, Ladsgroup, Addshore, Legoktm
* Tags: Cognate, Growth-Team, Language-Team, User-notice, UrlShortener, StructuredDiscussions, MediaWiki-extensions-BounceHandler, ContentTranslation, Reading List Service, WikimediaEditorTasks.
 
Example task: https://phabricator.wikimedia.org/T226358
 
== Decommissioning a mariadb host server checklist ==
See [https://wikitech.wikimedia.org/wiki/MariaDB/Decommissioning_a_DB_Host MariaDB/Decommissioning_a_DB_Host]
 
== Depool a broken or lagged replica ==
{{Note|content=As of July 31st 2019 we're in a transition period, please follow both procedures described below for now.}}
From one of the cluster management hosts ({{CuminHosts}}):
dbctl instance dbXXXX depool
dbctl config commit
 
'''Create a task with the DBA tag''' so DBAs can follow up and check out what happened, apply a proper fix, etc.
 
 
THE SECTION BELOW IS DEPRECATED:
<small>
Checkout the mediawiki-config repo if you haven't already:
git clone ssh://yourusername@gerrit.wikimedia.org:29418/operations/mediawiki-config
cd mediawiki-config
vim wmf-config/db-eqiad.php
 
Assuming eqiad is the active DC
 
* A replica can run just one MySQL instance
* If a replica runs more than one MySQL instance, it will be specified in its definition with the following format
hostname:port
 
Example
db1090:3317
 
This means db1090 has an instance running on 3317 (and probably others in different ports).
The port is formed in the following way
3306 -> MySQL standard port
331X -> X meaning the section where the server belongs to.
1 -> s1
2 -> s2
..
8 -> s8
3320 -> x1
3321 -> m1
3322 -> m2
3323 -> m3
3324 -> m4
3325 -> m5
 
So db1090:3317 is a replica that runs on s7
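
For example, connecting to that s7 instance with the plain client just means pointing it at the instance's port (a quick sketch):

 mysql -h db1090.eqiad.wmnet -P 3317 -e "SELECT @@hostname, @@port"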
 
* A replica can handle more than one type of traffic.
These are the types of traffic we have
main
recentchanges
api
vslow
 
They are normally specified in the definition files; let's use ''db-eqiad.php'' as it is the active datacenter.
Let's examine s7, where db1090 belongs:
This is the main traffic section where the servers and weights are specified.
A weight of 1 normally means the server is still checked for lag by the MediaWiki LB, but it has almost no main traffic, because these are special replicas that have more weight for special traffic (check below).
 
's7' => [
'db1062' => 0,      # D4 2.8TB 128GB, master
'db1079' => 300,    # A2 3.6TB 512GB, api # master for sanitarium db1125
'db1086' => 400,    # B3 3.6TB 512GB, api # candidate master
'db1090:3317' => 1, # C3 3.6TB 512GB, vslow, dump: s2 and s7, old master
'db1094' => 500,    # D2 3.6TB 512GB
'db1098:3317' => 1, # B5 3.6TB 512GB # rc, log: s6 and s7
'db1101:3317' => 1, # C2 3.6TB 512GB # rc, log: s7 and s8
 
And there is also a special slaves section below; look for it:
db1090:3317 is a single vslow slave.
db1079 and db1086 share API traffic; db1079 has more weight for API because db1086 has more weight for main traffic (see above).
db1098:3317 and db1101:3317 are recentchanges replicas and they share the same amount of load for all those special services.
 
's7' => [
  'vslow' => [
  'db1090:3317' => 1,
  ],
  'dump' => [
  'db1090:3317' => 1,
],
  'api' => [
  'db1079' => 3,
  'db1086' => 1,
],
  'watchlist' => [
  'db1098:3317' => 1,
  'db1101:3317' => 1,
],
  'recentchanges' => [
  'db1098:3317' => 1,
  'db1101:3317' => 1,
],
  'recentchangeslinked' => [
  'db1098:3317' => 1,
  'db1101:3317' => 1,
],
  'contributions' => [
  'db1098:3317' => 1,
  'db1101:3317' => 1,
],
  'logpager' => [
  'db1098:3317' => 1,
  'db1101:3317' => 1,
],
],
 
 
Locate the broken slave - keep in mind it can be in more than one section. The easiest way is probably to look for it in the file.
For this example let's assume db1090:3317 has broken replication, but its other instance (3312) is working fine, so there is no need to depool it.
 
This host is a vslow host, so we need to replace it with another host from the same section. Let's choose db1079 to replace it.
 
If the host is a special host and serves some special traffic (vslow, api, recentchanges), it needs to be replaced if it is the only one - we cannot leave a section without a slave.
 
We need to look for it and comment it out from every single line where db1090:3317 appears.
However '''DO NOT COMMENT''' the line with its definition which is already really explicit:
    'db1090:3317' => '10.64.32.116:3317', # do not remove or comment out
 
So this should be the diff once we are ready to commit: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/447984/1/wmf-config/db-eqiad.php
 
git add wmf-config/db-eqiad.php
git commit
git review
 
Once we have +2 on gerrit and the change has been merged automatically:
ssh deployment.eqiad.wmnet
cd /srv/mediawiki-staging/
git fetch ; git rebase
git show (this should show our change)
scap sync-file wmf-config/db-eqiad.php "Depool db1090:3317 as it is broken"
 
Once the change is deployed, we should be able to see our change on: https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php
 
If a host is completely down and it has multiple instances it needs to be depooled from '''all the instances'''
Let's say db1090 had a hardware crash and it is powered off, that means that both instances 3312 and 3317 are down. We need to also depool the instance db1090:3312 with the same method as described above.
 
'''Create a task with the DBA tag''' so DBAs can follow up and check out what happened, apply a proper fix, etc.
</small>
 
== Testing servers ==
As of today (June 2021) there are two testing hosts on a shard called test-s4:
 
* DB master (writable):  db1124.eqiad.wmnet  256GB RAM + RAID10
* DB slave (read-only): db1125.eqiad.wmnet 256GB RAM + RAID10
 
Whilst these hosts are normally up, it is not guaranteed that they will always be available.
 
These hosts have static data, and are not updated with production data. More tables or wikis can be requested if needed by filing a ticket with the DBA tag.
It is not guaranteed that they'll be included, but the request can be studied and some expectations about the ETA, and whether it is possible to have that data, will be given on the ticket.
 
Access to these hosts is granted individually to those who already have an NDA and cluster access, if ''really'' needed. Please file a ticket and tag DBA to let us know why you need access so we can start a discussion about it.
 
=== People with access ===
* Data Persistence Members
 
== Stretch + 10.1 -> Buster + 10.4 known issues ==
* '''Fixed''': <del>Optimizer flag ''rowid_filter'' needs to be disabled until it is fixed upstream (and checked) as it causes a regression: [[phabricator:T245489|T245489]] Upstream: https://jira.mariadb.org/browse/MDEV-21794 </del>
* Possible bug with the optimizer choosing the wrong query plan on certain queries, pending upstream pick up: [[phabricator:T246069|T246069]] Upstream: https://jira.mariadb.org/browse/MDEV-21813
* Some metrics have been deprecated in 10.4 and need to be removed from grafana dashboards: [[phabricator:T244696|T244696]]
* prometheus-mysqld-exporter:
** Regression on configuration disallows proper use of mysql auth_socket authentication. Upstream report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=953040
*** Worked around with: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/576398/ and
** Some flags have changed their configuration flags from ''-'' to ''--'' on Buster: Fixed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/576368/
** Seems that when the exporter runs before mysql is started and mysql_upgrade run to update internal tables from 10.1 to 10.4 structure, it gets "stuck" and keeps reporting some scrape errors, even though metrics are sent normally: [[phabricator:T247290|T247290]]
*** The workaround is to restart prometheus-exporter once ''mysql_upgrade'' has been run.
** Sometimes the base prometheus-mysqld-exporter.service is started on multi-instance hosts. This needs to be manually stopped and disabled.
* ''wmf-pt-kill'' seems to be working fine, but needs more checking on wiki replicas.
* Tendril doesn't like this in-place upgrade, so it requires a disable + drop + add + enable after upgrade, otherwise the <code>Act. (last contact)</code> field doesn't get updated.
* '''Fixed''': <del>Events on 10.4 starts disabled: [[phabricator:T247728|T247728]] https://jira.mariadb.org/browse/MDEV-21758 - to be fixed on 10.4.13</del>
* On mysql_upgrade, the backup user gets an additional grant, "DELETE HISTORY" (as it has SUPER, needed to handle replication). This makes mydumper fail with "broken table detected" on sys and other system tables (my guess is because it can list them, but cannot read from them - this is on purpose, we only want backups from certain databases). Revoking the grant fixes the mydumper run:
REVOKE DELETE HISTORY ON *.* from 'dump'@'X.X.X.X';
 
== How to enable a new external storage (es) section ==
 
The process of enabling a new external storage section can be complex and requires several steps.
 
=== Pre steps ===
 
* Check all the involved servers have replication working and enabled (with SSL). Check replication lag is 0
* Check all the involved servers have weight (if not done, check the section below)
* Check pt-heartbeat is running
* Check all the involved servers have notifications enabled (and Icinga looks clean)
* Check the future active master has ''read_only'' set to OFF
* Check the future servers have all the wikis and ''blobs_clusterXX'' tables created. This is an example ticket [[phabricator:T245720|T245720]]
* Check ''wikiadmin'' and ''wikiuser'' have grants (see the sketch after this list).
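
For the read-only and grants checks, something like this on each new host can help (a sketch; the actual grants themselves are managed via puppet):

 -- The future active master must be writable
 SELECT @@read_only;
 -- The application accounts should be present with their grants
 SELECT user, host FROM mysql.user WHERE user IN ('wikiadmin', 'wikiuser');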
 
=== Setting up the servers with dbctl ===
 
External storage sections normally have the following weights:
master: 0
slave1: 100
slave2: 100
 
In order to be able to do so, use ''dbctl'' to generate those configs.
dbctl instance esXXXX edit
 
And leave it like this for the master
 
# Editing object eqiad/esXXXX
host_ip: 10.XX.XX.XX
note: ''
port: 3306
sections:
  esX: {percentage: 100, pooled: true, weight: 0}
 
And like this for a slave
 
# Editing object eqiad/esXXXX
host_ip: 10.XX.XX.XX
note: ''
port: 3306
sections:
  esX: {percentage: 100, pooled: true, weight: 100}
 
Commit the change
dbctl config commit -m "Set weights for the new es hosts - TXXXXXX"
 
=== MediaWiki patches and deployment ===
In order to be able to set up the new external storage there are two steps needed from the MediaWiki point of view. They could be just one, but in order to catch errors, it is better to split them into two.
 
* Enable the new section in MW, this should be a NOOP, but would allow testing before going fully live.
** Add the section as described here: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/577185/
* Make sure to double check that the ''blobs_clusterXX'' entry matches the number of the future new tables.
 
Once the change is merged and deployed to the passive DC, use Wikimedia Debug extension to browse the site via mwdebug2001.codfw.wmnet and monitor errors.
If everything looks fine, deploy on the active DC.
 
Use ''shell.php'' from a ''mwmaint'' host to fetch stuff from those new tables, to make sure grants and everything is fine. Getting a ''false'' is a good thing. Everything else might require further research.
 
root@mwmaint1002:~# mwscript shell.php --wiki=enwiki
Psy Shell (PHP 7.2.26 — cli) by Justin Hileman
>>> ExternalStore::fetchFromURL( 'DB://cluster26/1' )
Wikimedia/Rdbms/DBConnectionError with message 'Cannot access the database: No working replica DB server: Unknown error'
>>> ExternalStore::fetchFromURL( 'DB://cluster25/1' )
=> b"Ý][û~¸»¿vÇ(i╚
 
 
If everything looks fine, the next patch should be enabling that section to be writable.
https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/577189/
 
'''Before fully deploying that change, test it on mwdebug'''. Generate some writes by writing on your talk page and checking the new ''blobs_clusterXX'' table. It might take a few writes for them to show up there, as writes are balanced with the other existing external storage masters.
 
Try a few times, and a few projects to make sure the row shows up there and on all the hosts (to double check replication is working as expected).
 
If the writes are showing up, and nothing else is erroring, you are good to go and fully deploy to all DCs.
 
Once deployed, monitor for lag, errors and periodically select from the new tables to make sure it is getting new things.
 
* Update spicerack to get these new sections up to date: https://gerrit.wikimedia.org/r/#/c/operations/software/spicerack/+/576297/
 
== How to set a external storage (es) section to read only ==
 
Whenever an existing external storage section needs to go read-only, these are the steps that need to be taken.
* Create a MW patch that set it to read-only and remove it from the ''DefaultExternalStore'' : https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/578766/
As always, first push to the passive DC, monitor for logs and if all is ok, push to the active DC:
* Once the change is live, monitor the master binlog (skip heartbeat entries) and make sure the INSERTS stopped a few seconds after the change is live everywhere.
* Disable alerts for 10-15 minutes on the involved servers, to avoid any alerts.
* Once writes have stopped, prepare and push a puppet patch to change those hosts to standalone: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/578816/
* Set ''read_only=1'' on the master
* Once puppet has run on the master, check that pt-heartbeat has stopped there. Once done, run ''show master status;'' to make sure nothing is being written.
* Run the following across all the hosts: ''show master status; show slave status\G'' and note those positions somewhere.
* Reset replication across all the hosts: ''stop slave; reset slave all''
* Update tendril to hide that replication tree:
update shards set display=0 where name='esX'
* Update zarcillo to reflect those hosts as standalone
  update sections set standalone=1 where name='esX';
 
* Slowly give some weight to the old, now standalone, masters - probably best done in a few commits, to slowly warm up their buffer pool:
dbctl instance esXXXX edit
dbctl config commit -m "Slowly give weight to old and now standalone es masters TXXXXXX"
 
''dbctl'' at the moment doesn't treat standalone hosts any differently from a normal replicating section, so they will show up normally in both db-eqiad.json and db-codfw.json even though there is no real master anymore, just standalone hosts.
 
* Update Spicerack to reflect that these new sections are RO https://gerrit.wikimedia.org/r/#/c/operations/software/spicerack/+/576297/
 
== IPv6 and MariaDB ==
 
MariaDB (and MySQL) should have good support for IPv6. However, because account handling is based on IP addresses, not DNS, a lot of issues can arise, especially with authentication and grants:
 
* Hosts using DNS will try to contact mysqld using IPv6 first. That will cause timeouts (of several seconds) until it falls back to IPv4
* Privileges may fail completely (access denied):
** When setting up and performing mysql backups
** When accessing host from remote mysql root clients (e.g. cumin)
** When using replication
** When using watchdog user for tendril
 
db1108, as of July 2020, is the only host configured using IPv6 while holding a mysql service.
 
== Tables drift between code and production ==
There are known differences between table definitions in MW and the live version in production. We have historically worked hard to fix the most important ones, but there are still lots of minor ones that need to be solved.
 
[https://phabricator.wikimedia.org/T104459#7090138 Amir created a dashboard] where those can be checked/observed: https://drift-tracker.toolforge.org/report/core/
 
== Warming up production databases ==
This is mostly done before a DC switchover so the databases are not totally cold (aka with nothing cached).
 
We do this in three different ways:
 
=== MW warmup script ===
From a mw maintenance host make sure you have a ''urls-cluster.txt'' file:
 
root@mwmaint1002:/home/marostegui# cat urls-cluster.txt
# Purpose: Root redirect
https://%server/
# Purpose: Main Page, Skin cache, Sidebar cache, Localisation cache
https://%server/wiki/Main_Page
# Purpose: MobileFrontend, Main Page
https://%mobileServer/wiki/Main_Page
# Purpose: Login page
https://%server/wiki/Special:UserLogin
# Purpose: API, Recent changes
https://%server/w/api.php?format=json&action=query&list=recentchanges
 
If you want to warm up eqiad, simply run:
nodejs /var/lib/mediawiki-cache-warmup/warmup.js urls-cluster.txt spread appservers.svc.eqiad.wmnet
 
Normally you might want to run this several times for hours to make sure you touch as many wikis and hosts as possible. In the output of the script you'll be able to see how the loading times start to decrease over time.
 
This script is pretty harmless and can be run in a loop without much babysitting:
while true; do  nodejs /var/lib/mediawiki-cache-warmup/warmup.js urls-cluster.txt spread appservers.svc.eqiad.wmnet ; sleep 15; done
 
=== Comparing tables ===
Using ''db-compare'' over several hosts is a good way to bring some of the most accessed tables into memory. This is a very gentle script and can be run in the background for hours/days without many issues.
 
The idea here is to check data consistency between DCs and bring data into the buffer pool.
We have a file with the most important tables to be checked/warmed up, which is called ''tables_to_check.txt'' and lives at: https://raw.githubusercontent.com/wikimedia/operations-software/master/dbtools/tables_to_check.txt
 
The content:
actor actor_id
archive ar_id
change_tag ct_id
comment comment_id
content content_id
logging log_id
pagelinks pl_from
page page_id
revision rev_id
revision_actor_temp revactor_rev
revision_comment_temp revcomment_rev
slots slot_revision_id
text old_id
user user_id
watchlist wl_id
 
What we normally do is to iterate over all the databases on a given section and go through all the tables across all the wikis there. The idea is to select all the production hosts on the passive DC and compare them against the ''vslow,dump'' host on the active DC.
 
The list of hosts per section can be retrieved from zarcillo database, orchestrator or with the ''section'' [https://raw.githubusercontent.com/wikimedia/operations-software/master/dbtools/section tool]:
The list of databases per section can be retrieved from the ''sX.dblist'' files present on ''wmf-config'' [https://github.com/wikimedia/operations-mediawiki-config/tree/master/dblists repo]
 
Once those things are ready, we can simply iterate over all the tables with a loop (example for ''enwiki'' (s1)):

cat git/mediawiki-config/dblists/s1.dblist | grep -v "#" | while read db; do cat tables_to_check.txt | while read table index; do echo "$db.$table"; db-compare $db $table $index db2146.codfw.wmnet:3306 db1163 db1099:3311 db1105:3311 db1106 db1118 db1119 db1134 db1135 db1164 db1169 db1184 ; done ; done
 
'''If differences are found, they need to be investigated BEFORE giving greenlight for the DC switchover.'''
 
==== Warming up ES hosts with compare ====
External store hosts are different, as they only have one table per wiki; the table is named differently depending on the section, following the pattern ''blobs_clusterX''.
 
In Sept 2021 this is the mapping:
Read only sections:
es1 blobs_cluster1 until blobs_cluster23
es2 blobs_cluster24
es3 blobs_cluster25

Writable sections:
es4 blobs_cluster26
es5 blobs_cluster27


This comparison isn't reliable for detecting data drifts, as the PK is a ''blob_id'', but it can still be used to warm things up.
In this case, the iteration follows the same pattern as a normal one: you simply need to check the same table across all the databases existing on the hosts. That can be done with the following loop (as data drift isn't to be trusted, there is no need to compare against the active DC, so just compare hosts within the passive DC):


for i in `mysql.py -hes1023 -e "show databases" -BN`; do db-compare $i blobs_cluster27 blob_id es1023 es1024 es1025; done


==== Warming up parsercache hosts ====


For parsercache hosts, the easiest way is to simply do a table count on all its tables.


for i in `mysql.py -hpc1011 parsercache -BN -e "show tables"`; do echo $i; mysql.py -hpc1012 parsercache -e "select count(*) from $i"; mysql.py -hpc1011 parsercache -e "select count(*) from $i"; mysql.py -hpc1013 parsercache -e "select count(*) from $i"; done


==== Warming up x1 hosts ====
Hosts in x1 have all the wikis but only a few tables per wiki, and we mostly only warm up the ''echo'' tables. The databases present on x1 are listed in ''echo.dblist'' in the mediawiki-config repo: https://github.com/wikimedia/operations-mediawiki-config/blob/master/dblists/echo.dblist


The tables to check file is present at: https://raw.githubusercontent.com/wikimedia/operations-software/master/dbtools/echo_tables_to_check.txt


This is the content on Sept 2021:
echo_event event_id
echo_notification notification_event
echo_target_page etp_page


The check can be performed with a simple loop:


cat git/mediawiki-config/dblists/echo.dblist | grep -v "#" | while read db; do cat echo_tables_to_check.txt | while read table index; do echo "$db.$table"; db-compare $db $table $index db2096.codfw.wmnet db1103 db1120 db1137; done; done


=== Replaying traffic ===
'''Probably the most effective way to warm up tables, but the most time consuming one and definitely dangerous. So proceed with caution. '''


Another way to warm up more specific tables and hosts is to replay some of the most common/slow queries that are arriving at the active DC and send them to the passive DC.
This requires a bit more manual work, but this is a high level way of doing it.


==== Which queries to replay ====
===== Tendril =====
Using slow queries section on tendril and filtering for the last 24h and getting all the SELECTs into a file is a good way to start.
Then simply iterate that file over all the hosts in a given section and all the wikis on that section.


===== Live traffic =====
'''<big>If you are not 100% sure of what you are doing, do not proceed with this warm up method. </big>'''
'''<big> YOU COULD CORRUPT DATA</big>'''


On the passive DC, you can enable the general query log for a few seconds using the following commands. The kind of queries you can grab highly depends on the day, and in general you cannot really reuse them, as some of them have specific timestamps (especially ''recentchanges'').


First, make sure you are using ''FILE'' as the log output, and check which file it is:
> SHOW VARIABLES LIKE "general_log%";
+------------------+------------+
| Variable_name    | Value      |
+------------------+------------+
| general_log      | OFF        |
| general_log_file | db1169.log |
+------------------+------------+


Now you are ready to enable, capture half a minute or a minute of traffic and then disable it:
SET global general_log = 1; select sleep (30); SET global general_log = 0;


Keep in mind that if forgotten, this file can grow to huge values, so always try to use the enable and disable on the same command so you don't forget to disable it.
Unfortunately this captures all queries and not only SELECTs (as ''log_disabled_statements'' isn't dynamic)


'''Once you've got the file, you need to exclude ''INSERT DELETE UPDATE'' and ONLY leave SELECTs statements otherwise <big> YOU COULD CORRUPT DATA</big>'''


'''<big>If you are not 100% sure of what you are doing, do not proceed with this warm up method.</big>'''


Once the file only has SELECTs, you can replay them to all the hosts you want to warm up
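
A minimal sketch of that filtering and replay, assuming the general log file from the example above and using ''enwiki'' as the default database (both are just examples; statements that span several log lines are not handled, and the loop mirrors the mysql.py usage shown elsewhere on this page):
 awk '/Query/ { sub(/.*Query[ \t]+/, ""); print }' db1169.log | grep -iE '^SELECT' > selects.sql
 while read -r q; do mysql.py -hdb1169 enwiki -e "$q"; done < selects.sql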


[[Category:MySQL]]
[[Category:MariaDB]]
{{SRE/Data Persistence/Footer}}


For emergencies, check the subpage about MariaDB troubleshooting.

MariaDB is the main database management system used to run the Wikimedia sites.

For a general overview, check the MySQL@Wikipedia (2015) slides (MariaDB is a drop-in replacement for MySQL, which we used to use).

Sections and shards

For a full (?) list, look at db.php.

Core MediaWiki databases

  • s1: English Wikipedia
  • s2: 17 large wikis, including Chinese, Italian, Dutch, Polish, and Portuguese Wikipedias and English Wiktionary
  • s3: Most small wikis (~800)
  • s4: Commons
  • s5: German Wikipedia, some other large wikis, and most new wikis since mid-2020 (T259438)
  • s6: French, Japanese, and Russian Wikipedias, and Wikitech
  • s7: CentralAuth, Meta, 10 large Wikipedias, and French Wiktionary
  • s8: Wikidata

External storage

The external storage servers host the full text of all page revisions in a compressed format.

  • es1: read-only cluster
  • es2: read-only cluster
  • es3: read-only cluster
  • es4 and es5: read/write cluster

Parsercaches

Extension storage

x1

The x1 cluster is used by MediaWiki at WMF for databases that are "global" or "cross-wiki" in nature, and are typically associated with a MediaWiki extension. Apart from being used across wikis, it is otherwise configured, replicated, and queried in the same way as "core" databases.

Feature | Database | Phabricator project | Steward
BounceHandler | wikishared.bounce_* | #MediaWiki-extensions-BounceHandler |
Cognate | cognate_wiktionary.* | #Cognate |
ContentTranslation | wikishared.cx_* | #ContentTranslation | Language Team
Echo | wikishared.echo_* and [wiki].echo_* | #Notifications | Growth Team
Flow | flowdb.* | #StructuredDiscussions | Growth Team
GrowthExperiments | [wiki].growthexperiments_* | #MediaWiki-extensions-GrowthExperiments | Growth Team
ReadingLists | wikishared.reading_list_* | #Reading List Service |
UrlShortener | wikishared.urlshortcodes | #MediaWiki-extensions-UrlShortener |
WikimediaEditorTasks | wikishared.wikimedia_editor_tasks_* | #Product-Infrastructure-Team-Backlog |

x2

This cluster is reserved for use by the MainStash. It is read-write in all data centers and replicated in both directions.

As of June 2021, the x2 cluster is not yet in use and MainStash is still backed by Redis.

Launch task: T212129.

Miscellaneous

The miscellaneous servers host databases for various auxiliary services.

Database creation template

If you need to request a new database please create a task with the DBA tag and fill out the following fields if you know them (or can estimate them):

  • QPS:
  • Size: (An estimation is fine)
  • DB Name:
  • User:
  • Accessed from server (s):
  • Backup Policy: Needed? Frequency?
  • Grants needed:

Cloud Services (WMCS) Wiki replicas

Analytics

  • clouddb1021: Analytics dedicated replica. Used the first days of the month.
  • db1108: Eventlogging hosts
  • dbstore1003-1005: Multi-instance analytics hosts

Replicas

The master database is the primary server for a given shard. If the master database server fails, then a replica can be promoted to be a new master. See Master switch for more information.

A database replica (or slave database) is another MariaDB or MySQL instance that replicates data from the master database. Most often, we add replicas in production for the purpose of load-balancing read queries, and for backup/failover purposes.

Cloud Services Wiki Replicas

Copies of the core databases with private data removed are part of the Data Services hosted by Wikimedia Cloud Services for bot, tool, and analysis use. The redaction is done during an intermediate stop on Sanitarium. As of Feb 2021, wikireplicas run in multiple hosts, with multiple processes, see new wiki replicas setup for more details.

To see how to operate (pool/depool) the new multi-instances check the new wiki replicas setup

Recloning a Wiki replica

This is an old procedure for the multi-source hosts, which are deprecated. This needs updating.

In case a Wiki Replica needs to be recloned from another one, there are several steps that need to be done, as these hosts are running multi-source replication.

  • Stop all slaves on the source and gather its replication coordinates
  • Stop MySQL on the source
  • Stop MySQL on the target host
  • Remove /srv/sqldata on the target host
  • Transfer the data from the source to the target:
transfer.py --no-encrypt --no-checksum SOURCE.eqiad.wmnet:/srv/sqldata TARGET.eqiad.wmnet:/srv
  • Make sure replication won't start automatically on either of the hosts:
systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
  • Once everything is transferred: on the target host remove the following files
relay-log-sX.info
multi-master.info
master-sX.info
  • Start MySQL on both hosts
  • If necessary run
mysql_upgrade
  • On the target issue:
reset slave all;
  • On the target configure replication using the coordinates from the source that were gathered previously and start replication

Analytics MariaDB cluster

The Analytics MariaDB cluster contains full replicas of the core and extension storage databases.

Database backups

Main article: MariaDB/Backups

Replication protects (in general) against hardware issues, as well as physical corruption. However, it is not enough to prevent application or operation errors. Also many people (e.g. developers and service maintainers) sometimes require observing past state of the database to solve data application issues.

In the past, two boxes replicated all shards using MariaDB 10 multi-source replication: dbstore1001 and dbstore1002 (as well as dbstore2001 and dbstore2002 on codfw). dbstore1001 and dbstore2001 also used to be delayed 24 hours. Multisource was abandoned for several reasons (task T159423): TokuDB (used for compression) bugs, being difficult to recover and maintain, difficulty scaling as data grew more and more, and MariaDB bugs and GTID limitations, among others.

In addition to the delayed slave, a bacula-orchestrated dump was done weekly from dbstore1001 for misc and metadata core databases, and stored synchronously on bacula.

In 2017, multiple hardware and scalability issues led to a renewed focus on, and a complete redesign of, the database backup architecture, leading to the current one explained at MariaDB/Backups, with focus on full coverage, validation and testing, and fully automated recovery.

Start / stop

See MariaDB/Start_and_stop

Packages

Asher started us using stock upstream packages from mariadb.org with the debs "de-debianized" each time to have fewer hooks, allowing puppet to handle stuff. Simply:

apt install wmf-mariadb101

We currently have wmf-mariadb10, wmf-mariadb101 (yes, I know, the names are horrible) and wmf-mysql80 built from source, with no patches except forcing the usage of openssl instead of yassl, that install to /opt.

Puppet controls manually the rc.d script and the package uses update-alternatives to update the symlinks in /usr/local/bin. Unlike the upstream versions our packages are fine to install alongside one another.

Installation

See mariadb roles in puppet.

Hardware installation checklist

Once the physical racking and OS installation have been completed, normally by DC Ops, the following items must be checked before considering the host ready to receive data.

  • Amount of memory showing up on the OS matches the expected
  • RAID setup
    • RAID level (typically 10)
    • RAID stripe size (256K)
    • BBU in place and policy set to WriteBack
  • Amount of disk space available for srv matches the expected
  • Number of CPUs matches the expected value

Setting up a fresh server

  • To initialize the database: /opt/wmf-mariadb104/scripts/mysql_install_db --basedir=/opt/wmf-mariadb104/
    • If this is a multi-instance db server, you'll need to provide --datadir=/srv/sqldata.<SECTION> as well.
  • Start the service: systemctl start mariadb
    • For multi-instance, the service is mariadb@<SECTION>
  • Set up defaults: /opt/wmf-mariadb104/bin/mysql_secure_installation --basedir=/opt/wmf-mariadb104/ -S /run/mysqld/mysqld.sock
    • For multi-instance, change it to -S /run/mysqld/mysqld.<SECTION>.sock
    • Don't set root password, accept defaults for everything else.
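
Putting the multi-instance variants of the steps above together, for a hypothetical host that only runs an s7 instance:

/opt/wmf-mariadb104/scripts/mysql_install_db --basedir=/opt/wmf-mariadb104/ --datadir=/srv/sqldata.s7
systemctl start mariadb@s7
/opt/wmf-mariadb104/bin/mysql_secure_installation --basedir=/opt/wmf-mariadb104/ -S /run/mysqld/mysqld.s7.sock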

Loading Data / provisioning

Setting up a new replica (or repairing a current one) is done through the recovery/provisioning system: MariaDB/Backups

However, one can do still manual hot or cold copies of running or stopped servers through transfer.py script on the cluster management hosts (cumin1001.eqiad.wmnet, cumin2002.codfw.wmnet).

Schema Changes

See Auto schema

First decide if the schema change can be done online. This only works if:

  • The table has a PRIMARY KEY or UNIQUE KEY on NOT NULL field(s)
  • MediaWiki can ignore the change behind a feature flag
  • Table metadata locking has been considered

Offline Schema Changes

  • Schedule downtime
  • See operations/mediawiki-config/db-eqiad.php
  • May need a master rotation?

Online Schema Changes

Use pt-online-schema-change or the ops wrapper script operations/software/dbtools/osc_host.sh.

$ osc_host.sh --host=dbstore1001 --dblist=/a/common/s2.dblist --table=page_props \
  --method=percona "add pp_sortkey float default null"
  • --method=percona uses pt-online-schema-change and automatically checks slave lag on the host itself and its slaves.
  • --method=ddl uses normal ALTER TABLE which will probably cause replication lag.

The wrapper script prompts for confirmation of settings first and also does a dry run for each wiki. It generally requires little supervision but should never run unmonitored.
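
For reference, a single-wiki run of pt-online-schema-change roughly equivalent to what the wrapper does per wiki could look like the sketch below; the safety options and the db1001 replica used for lag checking are illustrative assumptions, not the exact flags osc_host.sh passes:

pt-online-schema-change \
  --alter "add pp_sortkey float default null" \
  --check-slave-lag h=db1001.eqiad.wmnet --max-lag 5 \
  --chunk-size 1000 \
  --execute D=enwiki,t=page_props,h=dbstore1001.eqiad.wmnet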

If modifying indexes, or adding new fields that are nullable and/or have a default value, the change can be applied to each slave first and their masters last, using --no-replicate; i.e., alter the replication tree bottom-to-top so that no slave ever tries to replicate a non-existent field from a master. This:

  • Avoids risking the entire shard all at once; one can only break a single slave at a time :-)
  • Allows the slave to be depooled if something goes wrong or load is too high.
$ osc_host.sh --host=dbstore1001 --dblist=/a/common/s2.dblist --table=page_props \
  --no-replicate --method=percona "add pp_sortkey float default null"

If the change simply must go to masters first and propagate to slaves through replication then the wrapper script may still be appropriate to use but such cases deserve careful consideration from a DBA. This especially includes any changes to primary keys!

Sanitarium and Labsdbs

See more: Sanitarium and labsdb

The sanitarium hosts have triggers defined on some tables which will clash with pt-online-schema-change. For small tables which should be fast to alter simply use --method=ddl. For large tables where DDL would cause unacceptable lag just schedule downtime.

Table Metadata Locking

Both ALTER TABLE and pt-online-schema-change need to hold the table metadata lock, even just briefly for the latter. They must wait for open transactions to close the table and also block new transactions opening the table. On a busy server and a hot table like page or revision this can easily result in many seconds of delay which is more than enough time for connections to pile up and hit max_connections.

Consider reducing load or depooling the box.

Replication Lag Trick

Issuing STOP SLAVE SQL_THREAD on a slave before starting pt-online-schema-change will cause immediate replication lag, which will in turn make the MediaWiki load balancing code reduce traffic on the slave until it catches up. Just a few seconds of leeway is enough to allow the tool to create its triggers, leading to far less interruption than waiting out the metadata locking traffic jam.
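
In practice the trick boils down to something like this (a sketch; db1106 is a hypothetical replica name):

mysql.py -hdb1106 -e "STOP SLAVE SQL_THREAD;"
# launch pt-online-schema-change / osc_host.sh in another terminal once lag shows up, then:
mysql.py -hdb1106 -e "START SLAVE SQL_THREAD;"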

Monitoring

See MariaDB/monitoring.

Replication lag

Replication lag can be checked on https://noc.wikimedia.org/dbtree/.

See MariaDB/troubleshooting#Replication lag and MySQL#Replication lag on how to handle replication lag.

Manipulating the Replication Tree

Tendril displays the full replication tree.

The following is interesting info, but thanks to gtid replication, implemented almost everywhere in production, except multi-source replicated slaves, you can move slaves just by executing:

STOP SLAVE; CHANGE MASTER TO MASTER_HOST='<new master fqdn>'; START SLAVE;

A common task is manipulating slaves within the replication hierarchy. A script exists to help out: operations/software/dbtools/repl.pl. It allows controlling two slaves in a parent/child or sibling/sibling relationship and will do sanity checks plus confirmation prompts. It has nothing to do with masters so don't try to use it for those :-)

Child to Sibling

Move a slave one layer upward in the hierarchy:

./repl.pl --switch-child-to-sibling --parent=db1007.eqiad.wmnet --child=db1035.eqiad.wmnet

Child must be replicating directly from parent.

Sibling to Child

Move a slave one layer downward in the hierarchy:

./repl.pl --switch-sibling-to-child --parent=db1007.eqiad.wmnet --child=db1035.eqiad.wmnet

Both slaves must be replicating from the same master.

Stop Siblings in Sync

Stop two slaves on the same layer in the hierarchy at a common binlog position:

./repl.pl --stop-siblings-in-sync --host1=db1007.eqiad.wmnet --host2=db1035.eqiad.wmnet

Both slaves must be replicating from the same master.

Multisource Slaves

A MariaDB 10 slave can have multiple masters. You have to set the default_master_connection session variable to indicate which replication stream is to be manipulated by subsequent commands:

./repl.pl --switch-child-to-sibling --parent=db1007.eqiad.wmnet --child=dbstore1001.eqiad.wmnet --child-set=default_master_connection=s7

Partitioned Tables

Links Tables

The *links tables tend to have data skewed toward certain namespaces depending on the wiki. In most cases this doesn't matter and the MariaDB optimizer always chooses a fast execution plan. However some edge cases on heavily used namespaces can cause massive filesorts. Historically MediaWiki used STRAIGHT_JOIN however that blunt instrument only introduced a different set of edge cases.

The *links tables respond well to range partitioning on namespace with ranges chosen appropriately on a case-by-case basis. Eg, commonswiki:

CREATE TABLE templatelinks (
  tl_from int(8) unsigned NOT NULL DEFAULT '0',
  tl_namespace int(11) NOT NULL DEFAULT '0',
  tl_title varbinary(255) NOT NULL DEFAULT '',
  UNIQUE KEY tl_from (tl_from,tl_namespace,tl_title),
  KEY tl_namespace (tl_namespace,tl_title,tl_from)
) ENGINE=InnoDB DEFAULT CHARSET=binary
PARTITION BY RANGE (tl_namespace) (
PARTITION p_9 VALUES LESS THAN (10),
PARTITION p_10 VALUES LESS THAN (11),
PARTITION p_827 VALUES LESS THAN (828),
PARTITION p_828 VALUES LESS THAN (829),
PARTITION p_max VALUES LESS THAN MAXVALUE)

Special Slaves

Shards S1-S7 have one slave each with table partitioning in place for revision and logging based on user id. These boxes handle special MediaWiki query groups like recentchangeslinked, contributions, and logpager.

Eg, from eswiki:

CREATE TABLE logging (
  log_id int(10) unsigned NOT NULL AUTO_INCREMENT,
  log_type varbinary(32) NOT NULL DEFAULT '',
  log_action varbinary(32) NOT NULL DEFAULT '',
  log_timestamp varbinary(14) NOT NULL DEFAULT '19700101000000',
  log_user int(10) unsigned NOT NULL DEFAULT '0',
  log_namespace int(11) NOT NULL DEFAULT '0',
  log_title varbinary(255) NOT NULL DEFAULT '',
  log_comment varbinary(255) NOT NULL DEFAULT '',
  log_params blob NOT NULL,
  log_deleted tinyint(3) unsigned NOT NULL DEFAULT '0',
  log_user_text varbinary(255) NOT NULL DEFAULT '',
  log_page int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (log_id,log_user),
  KEY type_time (log_type,log_timestamp),
  KEY type_action (log_type,log_action,log_timestamp),
  KEY user_time (log_user,log_timestamp),
  KEY page_time (log_namespace,log_title,log_timestamp),
  KEY times (log_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=binary
PARTITION BY RANGE (log_user)
(PARTITION p00 VALUES LESS THAN (100000),
 PARTITION p01 VALUES LESS THAN (200000),
 PARTITION p02 VALUES LESS THAN (300000),
 PARTITION p03 VALUES LESS THAN (400000),
 PARTITION p04 VALUES LESS THAN (500000),
 PARTITION p05 VALUES LESS THAN (1000000),
 PARTITION p06 VALUES LESS THAN (1500000),
 PARTITION p07 VALUES LESS THAN (2000000),
 PARTITION p08 VALUES LESS THAN (2500000),
 PARTITION p09 VALUES LESS THAN MAXVALUE);
CREATE TABLE revision (
  rev_id int(8) unsigned NOT NULL AUTO_INCREMENT,
  rev_page int(8) unsigned NOT NULL DEFAULT '0',
  rev_text_id int(8) unsigned NOT NULL DEFAULT '0',
  rev_comment varbinary(255) NOT NULL,
  rev_user int(5) unsigned NOT NULL DEFAULT '0',
  rev_user_text varbinary(255) NOT NULL DEFAULT '',
  rev_timestamp varbinary(14) NOT NULL DEFAULT '',
  rev_minor_edit tinyint(1) unsigned NOT NULL DEFAULT '0',
  rev_deleted tinyint(1) unsigned NOT NULL DEFAULT '0',
  rev_len int(8) unsigned DEFAULT NULL,
  rev_parent_id int(8) unsigned DEFAULT NULL,
  rev_sha1 varbinary(32) NOT NULL DEFAULT '',
  PRIMARY KEY (rev_id,rev_user),
  KEY rev_timestamp (rev_timestamp),
  KEY page_timestamp (rev_page,rev_timestamp),
  KEY user_timestamp (rev_user,rev_timestamp),
  KEY usertext_timestamp (rev_user_text,rev_timestamp)
) ENGINE=InnoDB AUTO_INCREMENT=74016150 DEFAULT CHARSET=binary
PARTITION BY RANGE (rev_user)
(PARTITION p00 VALUES LESS THAN (10000),
 PARTITION p01 VALUES LESS THAN (20000),
 PARTITION p02 VALUES LESS THAN (30000),
 PARTITION p03 VALUES LESS THAN (40000),
 PARTITION p04 VALUES LESS THAN (50000),
 PARTITION p05 VALUES LESS THAN (60000),
 PARTITION p06 VALUES LESS THAN (70000),
 PARTITION p07 VALUES LESS THAN (80000),
 PARTITION p08 VALUES LESS THAN (90000),
 PARTITION p09 VALUES LESS THAN (100000),
 PARTITION p10 VALUES LESS THAN (200000),
 PARTITION p11 VALUES LESS THAN (300000),
 PARTITION p12 VALUES LESS THAN (400000),
 PARTITION p13 VALUES LESS THAN (500000),
 PARTITION p14 VALUES LESS THAN (600000),
 PARTITION p15 VALUES LESS THAN (700000),
 PARTITION p16 VALUES LESS THAN (800000),
 PARTITION p17 VALUES LESS THAN (900000),
 PARTITION p18 VALUES LESS THAN (1000000),
 PARTITION p19 VALUES LESS THAN (2000000));

Queries that filter on log_user by equality or range run faster. Same for rev_user. The partition sizes are chosen on a per-wiki basis because one size does not fit all. Tendril has a report for choosing sizes based on row distribution.

Note the modified PRIMARY KEY definition that includes log_user. This is relatively safe for a slave but not appropriate for a master, so a partitioned slave should never be eligible for promotion. See coredb::$topology in puppet, or mediawiki-config/db-eqiad.php to identify them.

Extra notes: Having only 1 special node is a "Single Point of Slowdown". I am currently making sure that we have at least 2 nodes load-balancing this kind of traffic, which can be significant for uncached traffic, and allows to properly perform maintenance. Be careful, running the ALTER TABLE, as is, takes Query OK, 629385781 rows affected (5 days 54 min 6.51 sec) for the English wikipedia, and as much amount of space as the original table, so probably pt-online-schema-change or other methods can be considered to avoid failure and painful rollbacks.

HAProxy

Main page containing : HAProxy

dbproxy1XXX boxes run HAProxy. Besides making for easier Ops server rotations simply by having a proxy IP in the mix, there are two modes for other stuff: load balancing for slaves and failover for masters. So far, only misc shards masters and the latest labsdbs replicas uses it via dbproxy1XXX.

Failover

See role::mariadb::proxy::master

Load Balancing

See role::mariadb::proxy::slaves

Puppet

The main module for DBAs on the operations/puppet tree is "mariadb", which is in its own repo operations/puppet/mariadb. Remember to update the subrepo when committing changes to the mariadb module, otherwise it will not be caught by palladium, strontium, CI, etc.

There used to be a class coredb_mysql, used from 5.5 nodes. Right now it is obsolete, but it is still in use by nodes that have not been updated to mariadb10, in particular, the masters. When all nodes are updated to 5.6/10, we will discontinue it, but right now it is essential for everything to work correctly.

Despite all the previous, there are mariadb-related files on the main repo- shared among the clases. Those are the puppet/templates/mariadb/ files, where the main configuration and grants lie.

There is a fourth place where you will find mariadb-related files, and that is the private repository, but I suppose that requires no explanation (passwords, etc.).

Other DBA related repos are:

  • operations/software/dbtools: scripts used for maintenance
  • operations/software/redactatron: labs filtering
  • operations/software/dbtree: DBtree
  • operations/software/tendril: tendril DB monitoring

Long running queries

Main article: /troubleshooting#Overload_due_to_long_running_queries

You can identify long running queries on logstash. The query killer should stop read queries running longer than 60 seconds, but may not function properly under high load. db-kill can be used in emergencies to kill any query running for longer than 10 seconds.
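
To see what is currently running before resorting to db-kill, a read-only check along these lines can help (a sketch; db1106 is a hypothetical replica name):

mysql.py -hdb1106 -e "SELECT id, user, time, LEFT(info, 100) AS query FROM information_schema.processlist WHERE command = 'Query' AND time > 60 ORDER BY time DESC"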

Storage engines

TokuDB

In 2015, TokuDB was trialled but was found to be buggy. We're now back to using InnoDB.

Importing table spaces from other hosts with multi source replication

Transportable tablespaces

Since MySQL 5.6 it is possible to take advantage of transportable tablespaces to move .ibd files from one server to another (if file-per-table is enabled).

This feature provides a fast way of copying data from one host to another over the network using compression and nc for instance. To learn more about to copy .ibd files from one host to another, please check out MariaDB/ImportTableSpace.
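
As a rough sketch of what that procedure involves (assuming innodb_file_per_table is enabled, compatible versions on both hosts, and using page_props purely as an example table):

-- On the source, keep this session open while the files are copied:
FLUSH TABLES page_props FOR EXPORT;
-- copy page_props.ibd and page_props.cfg from the source datadir to the target (e.g. with nc or transfer.py)
UNLOCK TABLES;
-- On the target, where an empty table with the identical definition already exists:
ALTER TABLE page_props DISCARD TABLESPACE;
-- move the copied files into the target database directory, fix ownership, then:
ALTER TABLE page_props IMPORT TABLESPACE;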

Account handling

root@localhost account no longer uses passwords for authentication (task T150446), but the UNIX socket authentication plugin. This allows us to stop using passwords (that can be copied, exposed and compromised), and allows the root system user to login from localhost by running:

sudo mysql

or

sudo mysql --skip-ssl

Because mariadb uses the uid of the linux user, there is no need to write passwords to the filesystem anymore, and in the event of a host compromise, only localhost is affected, and not any other host sharing the same password. This could later be extended to other system accounts that only connect from localhost, such as icinga or prometheus. Note that if a superuser is compromised on a host, not having a password is not a further barrier, as root has the ability to manage, read and modify at will MySQL files.

How unix_socket authentication works

All mysql servers should have been configured to load the auth_socket plugin (https://gerrit.wikimedia.org/r/320822):

plugin-load = unix_socket=auth_socket.so

It can also be enabled at runtime with:

INSTALL PLUGIN unix_socket SONAME 'auth_socket';

That only enables to plugin, to use it, we have to alter the user we want to authenticate:

GRANT USAGE ON *.* TO root@localhost IDENTIFIED VIA unix_socket;

This will delete the user's password, and indicates that the account does not use MySQL native password authentication, but the unix socket plugin instead.

MariaDB [(none)]> SELECT user,host,password,plugin FROM mysql.user WHERE user='root';
+------+--------------+-------------------------------------------+-------------+
| user | host         | password                                  | plugin      |
+------+--------------+-------------------------------------------+-------------+
| root | localhost    |                                           | unix_socket |

By the way, note that the authentication name is unix_socket, but the plugin loaded is auth_socket.so. Do not use auth_socket on the GRANT/CREATE USER statements.

Also, if for some reason, you revert this change, make sure you put a password back:

GRANT USAGE ON *.* TO root@localhost IDENTIFIED BY '<super_secure_password>';

Dumping tables with mydumper

What is mydumper?

From the documentation:

* Parallelism (hence, speed) and performance (avoids expensive character set conversion routines, efficient code overall)
* Easier to manage output (separate files for tables, dump metadata, etc, easy to view/parse data)
* Consistency - maintains snapshot across all threads, provides accurate master and slave log positions, etc
* Manageability - supports PCRE for specifying database and tables inclusions and exclusions

Why are we starting to consider it?

Its speed and parallelism make it perfect to save time and load data a lot faster.

It is easier to recover single tables or rows.

It compresses nicely.

From the tests we have seen

  • Taking a full dump of s3 takes 1:15h - 88GB
  • Taking a full dump of s1 takes 53 minutes - 93GB
  • Taking a full dump of s5 takes 1:20h - 93G

Quick cheatsheet

Dump data

Right now, in order to start it - as it doesn't accept any flag to skip reading the default file - the following section needs to be commented out in the host's my.cnf:

# ssl-ca=/etc/ssl/certs/Puppet_Internal_CA.pem
# ssl-cert=/etc/mysql/ssl/cert.pem
# ssl-key=/etc/mysql/ssl/server.key
# ssl-verify-server-cert

Once that is done, a typical way to dump a database:

mydumper -c -h localhost -t 8 -u root -r 100000000 -B wikidatawiki -S /tmp/mysql.sock -o output_directory_name

-c To compress the data files

-h The host

-t How many parallel threads you want dumping at the same time

-u The user

-r Try to split tables into chunks of this many rows

-B Database you want to dump

-S Specify the socket to connect to

"-o" Specify the output directory where you want to save the files to. If not specified, it will generate a directory under your current path called: export-2017xxxxx-xxxxx with all the files.

Load data

Please check the dump data section to make sure you are able to connect (comment out the SSL options).

By default it disables the logging into the binlog

Once that is done, typically you want to run:

myloader -h localhost -t 8 -u root -S /tmp/mysql.sock -d export-xxxxx

-h The host

-t How many parallel threads you want loading the data at the same time

-u The user

-S Specify the socket to connect to

-d The directory you want to import the files from

Production section failover checklist

It has been suggested that this page should be merged with MariaDB/troubleshooting#Depooling_a_master_(a.k.a._promoting_a_new_slave_to_master)

Note there is a script. db-switchover that automates most of these steps.

In order to generate the task you can use the script switchover-tmpl.py that lives on operations/software/dbtools/ and will generate the task content. As an example, generate a task for the s1 switchover where db1163 will be the new master.

python3 switchover-tmpl.py s1 db1163

If you want to generate the task manually, use the following checklist:

NEW master: OLD master:

  1. Check configuration differences between new and old master
pt-config-diff h=NEW.eqiad.wmnet,F=/root/.my.cnf h=OLD.eqiad.wmnet,F=/root/.my.cnf

  2. Silence alerts on all hosts
  3. Set NEW master with weight 0 and depool it from its section
dbctl instance NEW edit
dbctl config commit -m "Set NEW with weight 0 TXXXXXX"
  4. Topology changes, connect everything to NEW
db-switchover --timeout=15 --only-slave-move OLD.eqiad.wmnet NEW.eqiad.wmnet
  5. Disable puppet @NEW and @OLD
 puppet agent --disable "switchover to NEW"
  6. Merge gerrit puppet change to promote NEW to master (Example): https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/538747/
  7. Start the failover
!log Starting sX failover from OLD to NEW - TXXXXXX
dbctl --scope eqiad section sX ro "Maintenance till 05:30AM UTC TXXXXXX" && dbctl config commit -m "Set sX as read-only for maintenance TXXXXXX"
  8. Check that sX is indeed on read-only
  9. Run the switchover script from one of the cluster management hosts (cumin1001.eqiad.wmnet, cumin2002.codfw.wmnet):
root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move OLD NEW ; echo OLD; mysql.py -hOLD -e "show slave status\G" ; echo NEW ; mysql.py -hNEW -e "show slave status\G"
  10. Promote NEW to master and remove read-only, leave OLD (old master) with weight 0 for now
dbctl --scope eqiad section sX set-master NEW && dbctl --scope eqiad section sX rw && dbctl config commit -m "Promote NEW to sX master and remove read-only from sX TXXXXXX"
  11. Restart puppet on old and new masters (for heartbeat): OLD and NEW
 run-puppet-agent -e "switchover to NEW"
 

(Equivalent manual steps)

  1. Set old master in read only: mysql.py -hOLD -e "set global read_only=ON;"
  2. Disable heartbeat @OLD
 killall perl
  3. Confirm the new master has caught up
 host=OLD; echo "== $host =="; mysql.py -h $host -e "show master status\G show slave status\G" | grep ' File:\| Position:\|Relay_Master_Log_File:\|Exec\_Master\_Log\_Pos:'; host=NEW; echo "== $host =="; mysql.py -h $host -e "select @@hostname; show master status\G show slave status\G" | grep ' File:\| Position:\|Relay_Master_Log_File:\|Exec\_Master\_Log\_Pos:'
  4. Stop slave on new master
 mysql.py -h NEW -e "STOP SLAVE;"
  5. Confirm pt-heartbeat has been started on NEW
 pgrep perl  /  mysql.py -h NEW -e "SELECT * FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1\G"
  6. Remove read-only from NEW: mysql.py -hNEW -e "set global read_only=OFF;"
  7. RESET SLAVE ALL on new master
 mysql.py -h NEW.eqiad.wmnet -e "reset slave all;"
  8. Change old master to replicate from new master
 mysql.py -hOLD: change master to master_host='NEW.eqiad.wmnet', master_user='repl', master_password=, master_port=3306, master_log_file=, master_log_pos=, master_ssl=1;

if all looks good: start slave; Enable GTID on old master (OLD)

(End of equivalent manual steps)

Clean up tasks:

  1. Change events for query killer:
   events_coredb_master.sql on the new master NEW
   events_coredb_slave.sql on the new slave OLD
  2. Update DNS (example): https://gerrit.wikimedia.org/r/#/c/operations/dns/+/538748/
  3. Clean up lag in orchestrator. Typically you need to go to the new master's heartbeat database and clean up the old master server_id from that table (with replication enabled):
select * from heartbeat; -- Get the server_id from the old master
delete from heartbeat where server_id=171970778; -- example id for db1183
  4. Give weight to OLD if needed
  5. If the old master will be decommissioned or will go under maintenance: depool it from dbctl: dbctl instance HOSTNAME depool
  6. Change candidate master note (generally remove it from NEW and add it to OLD)
dbctl instance OLD set-candidate-master --section sX true
dbctl instance NEW set-candidate-master --section sX false
  7. Update/resolve phabricator ticket about failover

External store section failover checklist

External store failovers differ a bit from current sX ones, as we need to disable writes on the given section first. RO external stores (as of today, es1, es2 and es3) do not need this, as those hosts are stand alone and only read only.

As of today, es4 and es5 are RW, so they have a normal replication topology.

Disabling writes on a section can be done safely, so the failover needs no rush. Even though it can be done hours ahead, writes should be disabled only a few minutes before the actual failover time, to avoid un-balancing both writable sections.

When one section has writes disabled, reads are still happening without any issues.

If the idea is to depool the old master, make sure to leave the new master with a bit of weight, to avoid having just one slave serving traffic. Usually leaving the master with weight 50 and the slave with weight 100 is enough.

Check list for failing over an es section with es1023 and es1024 as examples:

NEW master: es1024
OLD master: es1023
# Check configuration differences between new and old master
$ pt-config-diff h=es1023.eqiad.wmnet,F=/root/.my.cnf h=es1024.eqiad.wmnet,F=/root/.my.cnf  


# Silence alerts on all hosts
# Set NEW master with weight 50
dbctl instance es1024 edit
dbctl config commit -m "Set es1024 with weight 50 T255755"
# Topology changes, connect everything to es1024
db-switchover --timeout=15 --only-slave-move es1023.eqiad.wmnet es1024.eqiad.wmnet
# Disable puppet @es1023 and @es1024
 puppet agent --disable "switchover to es1024"
# Merge gerrit puppet change to promote es1024
es1024: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/607236/


# Start the failover  
!log "Starting es failover from es1023 to es1024 - https://phabricator.wikimedia.org/T255755"
Disable es5 writes https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/606663/
# Check that es5 is indeed on read-only
# run switchover script from one of the cluster management hosts (cumin1001.eqiad.wmnet, cumin2002.codfw.wmnet):
root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move es1023 es1024 ; echo es1023; mysql.py -hes1023 -e "show slave status\G" ; echo es1024 ; mysql.py -hes1024 -e "show slave status\G"
# Promote es1024 to master and remove read-only, leave es1023 (old master) with weight 0  
dbctl --scope eqiad section es5 set-master es1024 && dbctl config commit -m "Promote es1024 to es5 master T255755"
# Restart puppet on old and new masters (for heartbeat): es1023 and es1024
 run-puppet-agent -e "switchover to es1024"
# Enable es5 on MW (REVERT PATCH https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/606663/)

Clean up tasks:
# change events for query killer:
   events_coredb_master.sql on the new master es1024
   events_coredb_slave.sql on the new slave es1023 
# Update DNS: https://gerrit.wikimedia.org/r/609899
# Clean up lag in orchestrator. Typically you need to go to the new master's heartbeat database and clean up the old master server_id from that table (with replication enabled):
select * from heartbeat; -- Get the server_id from the old master
delete from heartbeat where server_id=171970778; -- example id for db1183
# Update/resolve phabricator ticket about failover

Misc section failover checklist (example with m2)

OLD MASTER: db1065

NEW MASTER: db1132

  1. Check configuration differences between new and old master
$ pt-config-diff h=db1065.eqiad.wmnet,F=/root/.my.cnf h=db1132.eqiad.wmnet,F=/root/.my.cnf
  2. Silence alerts on all hosts
  3. Topology changes: move everything under db1132
db-switchover --timeout=1 --only-slave-move db1065.eqiad.wmnet db1132.eqiad.wmnet
  4. Disable puppet @db1065, puppet @db1132
 puppet agent --disable "switchover to db1132"
  5. Merge gerrit: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519975/
  6. Run puppet on dbproxy1002 and dbproxy1007 and check the config
puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg
  7. Start the failover
!log Failover m2 from db1065 to db1132 - T226952
 root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1065 db1132
  8. Reload haproxies
dbproxy1002:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1007:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
  9. Kill connections on the old master (db1065)
 pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysql.sock
# 
# START Equivalent manual steps
# Disable GTID on db1132
# Disable heartbeat @db1065
  killall perl
# Set old m2 master in read only
  ./mysql.py -h db1065 -e "SET GLOBAL read_only=1"
# Confirm new master has caught up DONE
  host=db1065 echo "== $host =="; ./mysql.py -h $host -e "show master status\G show slave status\G" | grep ' File:\| Position:\|Relay_Master_Log_File:\|Exec\_Master\_Log\_Pos:'; host=db1132; echo "== $host =="; ./mysql.py -h $host -e "select @@hostname; show master status\G show slave status\G" | grep ' File:\| Position:\|Relay_Master_Log_File:\|Exec\_Master\_Log\_Pos:'
 # Reload haproxies
 dbproxy1002:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
 dbproxy1007:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
 # Restart puppet on old and new masters (for heartbeat): db1065 and db1132
  puppet agent --enable && puppet agent -tv
# Set new master as read-write and stop slave
  mysql -h db1132.eqiad.wmnet -e "STOP SLAVE; SET GLOBAL read_only=0;"
# RESET SLAVE ALL on new master
  ./mysql.py -h db1132.eqiad.wmnet -e "reset slave all;"
# Change old master to replicate from new master DONE
  ./mysql.py -hdb1065: change master to master_host='db1132.eqiad.wmnet', master_user='repl', master_password=, master_port=3306, master_log_file='xx', master_log_pos=xx, master_ssl=1;
if all looks good: start slave; 
# Update tendril and zarcillo master server id for m2 
  UPDATE shards set master_id = 1628 where name='m2' LIMIT 1;
 mysql.py -h db1115 zarcillo -e "UPDATE masters SET instance = 'db1132' WHERE dc = 'eqiad' and section = 'm2' LIMIT 1" 
# Enable GTID on the old master db1065
# STOP equivalent manual steps


  1. Restart puppet on old and new masters (for heartbeat): db1065 and db1132
 puppet agent --enable && puppet agent -tv
  2. Check services affected (otrs, debmonitor): DEBMONITOR and OTRS looking good
  3. Clean up lag in orchestrator. Typically you need to go to the new master's heartbeat database and clean up the old master server_id from that table (with replication enabled):
select * from heartbeat; -- Get the server_id from the old master
delete from heartbeat where server_id=171970778; -- example id for db1183
  4. Change events for query killer:
   events_coredb_master.sql on the new master db1132
   events_coredb_slave.sql on the new slave db1065
  5. Update/resolve phabricator ticket about failover https://phabricator.wikimedia.org/T226952
  6. Create decommissioning ticket for db1065


Special section: x1 master switchover

x1 is a "special" section which cannot be put on read-only on mediawiki, so it needs to rely on the db-switchover script which puts MySQL on read-only. When failing over, please tag the following Teams and people on the phabricator task so they can have a heads up as they are x1 stakeholders and need to know that x1 will have writes blocked for around one minute.

  • Subscribers: Tgr, JoeWalsh, Dbrant, Ladsgroup, Addshore, Legoktm
  • Tags: Cognate, Growth-Team, Language-Team, User-notice, UrlShortener, StructuredDiscussions, MediaWiki-extensions-BounceHandler, ContentTranslation, Reading List Service, WikimediaEditorTasks.

Example task: https://phabricator.wikimedia.org/T226358

Decommissioning a mariadb host server checklist

See MariaDB/Decommissioning_a_DB_Host

Depool a broken or lagged replica

From one of the cluster management hosts (cumin1001.eqiad.wmnet, cumin2002.codfw.wmnet):

dbctl instance dbXXXX depool
dbctl config commit
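
For example, to depool one instance of a multi-instance replica (the host/instance name is hypothetical; the commit message follows the convention used by the other dbctl examples on this page):

dbctl instance db1090:3317 depool
dbctl config commit -m "Depool db1090:3317: replication broken TXXXXXX"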

Create a task with the DBA tag so DBAs can follow up, check out what happened, apply a proper fix, etc.


THE SECTION BELOW IS DEPRECATED: Checkout the mediawiki-config repo if you haven't already:

git clone ssh://yourusername@gerrit.wikimedia.org:29418/operations/mediawiki-config
cd mediawiki-config
vim wmf-config/db-eqiad.php

Assuming eqiad is the active DC

  • A replica can run just one MySQL instance
  • If a replica runs more than one MySQL instance, it will be specified in its definition with the following format
hostname:port

Example

db1090:3317

This means db1090 has an instance running on 3317 (and probably others in different ports). The port is formed in the following way

3306 -> MySQL standard port
331X -> X meaning the section where the server belongs to. 
1 -> s1
2 -> s2
..
8 -> s8
3320 -> x1
3321 -> m1
3322 -> m2
3323 -> m3
3324 -> m4
3325 -> m5

So db1090:3317 is a replica that runs on s7

  • A replica can handle more than one type of traffic.

These are the types of traffic we have

main
recentchanges
api
vslow

They are normally specified in the definition files; let's use db-eqiad.php as it is the active datacenter. Let's examine s7, where db1090 belongs. This is the main traffic section, where the servers and weights are specified. Servers with weight 1 are normally still checked for lag by the MediaWiki LB, but they get almost no main traffic, because they are special replicas and have more weight for special traffic (check below).

's7' => [
'db1062' => 0,      # D4 2.8TB 128GB, master
'db1079' => 300,    # A2 3.6TB 512GB, api # master for sanitarium db1125
'db1086' => 400,    # B3 3.6TB 512GB, api # candidate master
'db1090:3317' => 1, # C3 3.6TB 512GB, vslow, dump: s2 and s7, old master
'db1094' => 500,    # D2 3.6TB 512GB
'db1098:3317' => 1, # B5 3.6TB 512GB # rc, log: s6 and s7
'db1101:3317' => 1, # C2 3.6TB 512GB # rc, log: s7 and s8

And there is also a special slaves section below; look for it. db1090:3317 is the single vslow/dump replica; db1079 and db1086 share API traffic (db1079 has more weight for API because db1086 has more weight for main traffic, see above); db1098:3317 and db1101:3317 are recentchanges replicas and share the same amount of load for all those special services.

's7' => [
 'vslow' => [
  'db1090:3317' => 1,
 ],
 'dump' => [
  'db1090:3317' => 1,
],
 'api' => [
  'db1079' => 3,
  'db1086' => 1,
],
 'watchlist' => [
  'db1098:3317' => 1,
  'db1101:3317' => 1,
],
 'recentchanges' => [
  'db1098:3317' => 1,
  'db1101:3317' => 1,
],
 'recentchangeslinked' => [
  'db1098:3317' => 1,
  'db1101:3317' => 1,
],
 'contributions' => [
  'db1098:3317' => 1,
  'db1101:3317' => 1,
],
 'logpager' => [
  'db1098:3317' => 1,
  'db1101:3317' => 1,
],
],


Locate the broken slave - keep in mind it can be in more than one section. The easiest way is probably to look for it on the file. For this example let's assume db1090:3317 is broken with replication broken, but its other instance (3312) is working fine, so no need to depool it.

This host is a vslow host, so we need to replace it with another host from the same section. Let's choose db1079 to replace it.

If the host is a special host and serves some special traffic (vslow, api, recentchanges), it needs to be replaced if it is the only one; we cannot leave a section without a replica for that traffic type.

We need to look for it and comment it out from every single line where db1090:3317 appears. However DO NOT COMMENT the line with its definition which is already really explicit:

    'db1090:3317' => '10.64.32.116:3317', # do not remove or comment out

So this should be the diff once we are ready to commit: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/447984/1/wmf-config/db-eqiad.php

git add wmf-config/db-eqiad.php
git commit
git review

Once we have +2 on gerrit and the change has been merged automatically:

ssh deployment.eqiad.wmnet
cd /srv/mediawiki-staging/
git fetch ; git rebase
git show (this should show our change)
scap sync-file wmf-config/db-eqiad.php "Depool db1090:3317 as it is broken"

Once the change is deployed, we should be able to see our change on: https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php

If a host is completely down and it has multiple instances it needs to be depooled from all the instances Let's say db1090 had a hardware crash and it is powered off, that means that both instances 3312 and 3317 are down. We need to also depool the instance db1090:3312 with the same method as described above.

Create a task with the DBA tag so DBAs can follow up, check out what happened, apply a proper fix, etc.

Testing servers

As of today (June 2021) there are two testing hosts on a shard called test-s4:

  • DB master (writable): db1124.eqiad.wmnet 256GB RAM + RAID10
  • DB slave: (read-only): db1125.eqiad.wmnet 256GB RAM + RAID10

Whilst these hosts are normally up, it is not guaranteed that they will always be available.

These hosts have static data, and are not updated with production data. More tables or wikis can be requested if needed by filing a ticket to DBAs. It is not guaranteed that they'll be included, but the request can be studied, and some expectations about the ETA and whether it is possible to have that data will be given on the ticket.

Access to these hosts is granted individually to those who already have an NDA and cluster access, if really needed. Please file a ticket and tag DBA to let us know why you need access so we can start a discussion about it.

People with access

  • Data Persistence Members

Stretch + 10.1 -> Buster + 10.4 known issues

* Fixed: the optimizer flag rowid_filter needs to be disabled until it is fixed upstream (and verified), as it causes a regression: T245489 Upstream: https://jira.mariadb.org/browse/MDEV-21794
* Possible bug with the optimizer choosing the wrong query plan on certain queries, pending upstream pick-up: T246069 Upstream: https://jira.mariadb.org/browse/MDEV-21813
* Some metrics have been deprecated in 10.4 and need to be removed from Grafana dashboards: T244696
* prometheus-mysqld-exporter:
* wmf-pt-kill seems to be working fine, but needs more checking on the wiki replicas.
* Tendril doesn't like this in-place upgrade, so it requires a disable + drop + add + enable after the upgrade, otherwise the Act. (last contact) field doesn't get updated.
* Fixed: events on 10.4 start disabled: T247728 https://jira.mariadb.org/browse/MDEV-21758 - to be fixed in 10.4.13
* On mysql_upgrade, the backup user gets an additional grant, "DELETE HISTORY" (as it has SUPER, which is needed to handle replication). This makes mydumper fail with "broken table detected" on sys and other system tables (my guess is that it can list the table but cannot read from it - this is on purpose, we only want backups of certain databases). Revoking the grant fixed the mydumper run:
REVOKE DELETE HISTORY ON *.* from 'dump'@'X.X.X.X';
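To confirm the grant is gone, you can list the account's grants again (a sketch; the host part of the account is a placeholder, as above):

 mysql.py -hdbXXXX -e "SHOW GRANTS FOR 'dump'@'X.X.X.X'"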

== How to enable a new external storage (es) section ==

The process of enabling a new external storage section can be complex and requires several steps.

=== Pre steps ===

* Check that all the involved servers have replication working and enabled (with SSL), and that replication lag is 0 (see the sketch after this list).
* Check that all the involved servers have weight (if not, see the section below).
* Check that pt-heartbeat is running.
* Check that all the involved servers have notifications enabled (and that Icinga looks clean).
* Check that the future active master has read_only set to OFF.
* Check that the future servers have all the wikis and the blobs_clusterXX tables created. An example ticket: T245720
* Check that wikiadmin and wikiuser have grants.
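A minimal sketch of the replication and read_only checks (the host name is a placeholder; the same mysql.py wrapper used elsewhere on this page is assumed):

 mysql.py -hesXXXX -e "SELECT @@read_only"
 mysql.py -hesXXXX -e "SHOW SLAVE STATUS\G" | grep -E "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master"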

=== Setting up the servers with dbctl ===

External storage sections normally have the following weights: master: 0, slave1: 100, slave2: 100.

In order to be able to do so, use dbctl to generate those configs.

dbctl instance esXXXX edit

And leave it like this for the master

# Editing object eqiad/esXXXX
host_ip: 10.XX.XX.XX
note: 
port: 3306
sections:
  esX: {percentage: 100, pooled: true, weight: 0}

And like this for a slave

# Editing object eqiad/esXXXX
host_ip: 10.XX.XX.XX
note: 
port: 3306
sections:
  esX: {percentage: 100, pooled: true, weight: 100}
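Before committing, it can help to review the pending configuration change (a sketch; this assumes the dbctl version in use provides the diff subcommand):

 dbctl config diff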

Commit the change

dbctl config commit -m "Set weights for the new es hosts - TXXXXXX"

=== MediaWiki patches and deployment ===

From the MediaWiki point of view there are two patches needed to enable the new external storage section. They could be squashed into one, but in order to catch errors more easily it is better to split them: a first patch that adds the new section to the configuration (still read-only), and a second one that makes it writable.

Once the first change is merged and deployed to the passive DC, use the Wikimedia Debug extension to browse the site via mwdebug2001.codfw.wmnet and monitor for errors. If everything looks fine, deploy on the active DC.

Use shell.php from a mwmaint host to fetch data from those new tables, to make sure grants and everything else are fine. Getting a false is a good thing; anything else might require further research.

root@mwmaint1002:~# mwscript shell.php --wiki=enwiki
Psy Shell (PHP 7.2.26 — cli) by Justin Hileman
>>> ExternalStore::fetchFromURL( 'DB://cluster26/1' )
Wikimedia/Rdbms/DBConnectionError with message 'Cannot access the database: No working replica DB server: Unknown error'
>>> ExternalStore::fetchFromURL( 'DB://cluster25/1' )
=> b"Ý][û~¸»¿vÇ(i╚


If everything looks fine, the next patch should be enabling that section to be writable. https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/577189/

Before fully deploying that change, test it on mwdebug: generate some writes by editing your talk page and checking the new blobs_clusterXX table. It might take a few edits for rows to show up there, as writes are balanced across the other existing external storage masters.

Try a few times, on a few projects, to make sure the rows show up there and on all the hosts (to double-check that replication is working as expected).
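One way to watch the new rows arrive on the new section (a sketch; the host, wiki and cluster number are placeholders - repeat against the replicas to confirm replication):

 mysql.py -hesXXXX enwiki -e "SELECT MAX(blob_id) FROM blobs_clusterXX"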

If the writes are showing up, and nothing else is erroring, you are good to go and can fully deploy to all DCs.

Once deployed, monitor for lag and errors, and periodically select from the new tables to make sure they keep getting new rows.

== How to set an external storage (es) section to read only ==

Whenever an existing external storage section needs to go read-only, these are the steps that need to be taken.

As always, first push to the passive DC, monitor the logs, and if all is OK, push to the active DC:

* Once the change is live, monitor the master binlog (skipping heartbeat entries) and make sure the INSERTs stop a few seconds after the change is live everywhere.
* Disable alerts for 10-15 minutes on the involved servers, to avoid any alerts.
* Once writes have stopped, prepare and push a puppet patch to change those hosts to standalone: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/578816/
* Set read_only=1 on the master.
* Once puppet has run on the master, check that pt-heartbeat has stopped there. Once done, run show master status; to make sure nothing is being written (see the sketch after this list).
* Run the following across all the hosts: show master status; show slave status\G and note those positions somewhere.
* Reset replication across all the hosts: stop slave; reset slave all
* Update tendril to hide that replication tree:
 update shards set display=0 where name='esX';
* Update zarcillo to mark those hosts as standalone:
 update sections set standalone=1 where name='esX';
* Slowly give some weight to the old masters (now standalone) - probably best done in a few commits, to slowly warm up their buffer pool:
 dbctl instance esXXXX edit
 dbctl config commit -m "Slowly give weight to old and now standalone es masters TXXXXXX"

At the moment dbctl doesn't treat standalone hosts any differently from a normal replicating section, so they will show up normally in both db-eqiad.json and db-codfw.json, even though there is no real master anymore, just standalone hosts.

== IPv6 and MariaDB ==

MariaDB (and MySQL) should have good support for IPv6. However, because account handling is based on IP addresses, not DNS, a lot of issues can arise, especially with authentication and grants:

* Hosts using DNS will try to contact mysqld over IPv6 first. That will cause timeouts (of several seconds) until the client falls back to IPv4.
* Privileges may fail completely (access denied):
** When setting up and performing mysql backups
** When accessing the host from remote mysql root clients (e.g. cumin)
** When using replication
** When using the watchdog user for tendril

db1108, as of July 2020, is the only host configured using IPv6 while holding a mysql service.
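If a client connection to a database host seems to hang for a few seconds before succeeding, a quick way to check whether the IPv6-first behaviour described above is involved is to look up whether the host has an AAAA record (a sketch; the host name is just an example):

 host -t AAAA db1108.eqiad.wmnet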

== Tables drift between code and production ==

There are known differences between the table definitions in MediaWiki and the live versions in production. We have historically worked hard to fix the most important ones, but there are still lots of minor ones that need to be solved.

Amir created a dashboard where those can be checked/observed: https://drift-tracker.toolforge.org/report/core/

== Warming up production databases ==

This is mostly done before a DC switchover, so the databases are not totally cold (i.e. with nothing cached).

We do this in three different ways:

=== MW warmup script ===

From a MediaWiki maintenance host, make sure you have a urls-cluster.txt file:

root@mwmaint1002:/home/marostegui# cat urls-cluster.txt
# Purpose: Root redirect
https://%server/
# Purpose: Main Page, Skin cache, Sidebar cache, Localisation cache
https://%server/wiki/Main_Page
# Purpose: MobileFrontend, Main Page
https://%mobileServer/wiki/Main_Page
# Purpose: Login page
https://%server/wiki/Special:UserLogin
# Purpose: API, Recent changes
https://%server/w/api.php?format=json&action=query&list=recentchanges

If you want to warm up eqiad, simply run:

nodejs /var/lib/mediawiki-cache-warmup/warmup.js urls-cluster.txt spread appservers.svc.eqiad.wmnet

Normally you might want to run this several times for hours to make sure you touch as many wikis and hosts as possible. In the output of the script you'll be able to see how the loading times start to decrease over time.

This script is pretty harmless and can be run without much babysitting in a loop:

while true; do  nodejs /var/lib/mediawiki-cache-warmup/warmup.js urls-cluster.txt spread appservers.svc.eqiad.wmnet ; sleep 15; done

=== Comparing tables ===

Using db-compare across several hosts is a good way to bring some of the most accessed tables into memory. It is a very gentle script and can be run in the background for hours or days without many issues.

The idea here is to check data consistency between DCs and bring data into the buffer pool. We have a file with the most important tables to be checked/warmed up, which is called tables_to_check.txt and lives at: https://raw.githubusercontent.com/wikimedia/operations-software/master/dbtools/tables_to_check.txt

The content:

actor actor_id
archive ar_id
change_tag ct_id
comment comment_id
content content_id
logging log_id
pagelinks pl_from
page page_id
revision rev_id
revision_actor_temp revactor_rev
revision_comment_temp revcomment_rev
slots slot_revision_id
text old_id
user user_id
watchlist wl_id

What we normally do is to iterate over all the databases on a given section and go through all the tables across all the wikis there. The idea is to select all the production hosts on the passive DC and compare them against the vslow,dump host on the active DC.

The list of hosts per section can be retrieved from the zarcillo database, from orchestrator, or with the section tool. The list of databases per section can be retrieved from the sX.dblist files in the wmf-config repo.

Once those things are ready, we can simply iterate over all the tables with a loop (example for enwiki (s1)):

cat git/mediawiki-config/dblists/s1.dblist | grep -v "#" | while read db; do cat tables_to_check.txt | while read table index; do echo "$db.$table"; db-compare $db $table $index db2146.codfw.wmnet:3306 db1163 db1099:3311 db1105:3311 db1106 db1118 db1119 db1134 db1135 db1164 db1169 db1184 ; done ; done

If differences are found, they need to be investigated BEFORE giving the green light for the DC switchover.

=== Warming up ES hosts with compare ===

External storage hosts are different, as they only have a single table per wiki; its name depends on the section, but they are all called blobs_clusterX.

In Sept 2021 this is the mapping:

Read-only sections:
* es1: blobs_cluster1 until blobs_cluster23
* es2: blobs_cluster24
* es3: blobs_cluster5

Writable sections:
* es4: blobs_cluster26
* es5: blobs_cluster27

This comparison isn't reliable for detecting data drifts, as the PK is a blob_id, but it can still be used to warm things up. In this case the iteration follows the same pattern as a normal one: you simply need to check the same table across all the databases existing on the hosts. That can be done with the following loop (as the drift check isn't to be trusted anyway, there is no need to compare against the active DC, so we just compare hosts within the passive DC):

for i in `mysql.py -hes1023 -e "show databases" -BN`; do db-compare $i blobs_cluster27 blob_id es1023 es1024 es1025; done

=== Warming up parsercache hosts ===

For parsercache hosts, the easiest way is to simply run a count(*) on all of their tables.

for i in `mysql.py -hpc1011 parsercache -BN -e "show tables"`; do echo $i; mysql.py -hpc1012 parsercache -e "select count(*) from $i"; mysql.py -hpc1011 parsercache -e "select count(*) from $i"; mysql.py -hpc1013 parsercache -e "select count(*) from $i"; done

=== Warming up x1 hosts ===

Hosts in x1 have all the wikis but only a few tables per wiki, and we mostly only warm up the echo tables. The databases present on x1 are listed in echo.dblist in the mediawiki-config repo: https://github.com/wikimedia/operations-mediawiki-config/blob/master/dblists/echo.dblist

The tables to check file is present at: https://raw.githubusercontent.com/wikimedia/operations-software/master/dbtools/echo_tables_to_check.txt

This is the content as of Sept 2021:

echo_event event_id
echo_notification notification_event
echo_target_page etp_page

The check can be performed with a simple loop:

cat git/mediawiki-config/dblists/echo.dblist | grep -v "#" | while read db; do cat echo_tables_to_check.txt | while read table index; do echo "$db.$table"; db-compare $db $table $index db2096.codfw.wmnet db1103 db1120 db1137; done; done

=== Replaying traffic ===

This is probably the most effective way to warm up tables, but also the most time-consuming one, and definitely dangerous, so proceed with caution.

Another way to warm up specific tables and hosts is to replay some of the most common/slow queries arriving at the active DC against the passive DC. This requires a bit more manual work; what follows is a high-level way of doing it.

==== Which queries to replay ====

===== Tendril =====

Using the slow-queries section on Tendril, filtering for the last 24h and dumping all the SELECTs into a file, is a good way to start. Then simply iterate over that file for all the hosts in a given section and all the wikis on that section.
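A rough sketch of such an iteration (the section, host and file names are illustrative; queries containing shell-sensitive characters may need extra quoting, so review the file first):

 cat git/mediawiki-config/dblists/s1.dblist | grep -v "#" | while read db; do
   while read -r query; do mysql.py -hdb1163 $db -e "$query"; done < tendril-selects.sql
 done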

===== Live traffic =====

If you are not 100% sure of what you are doing, do not proceed with this warm up method. YOU COULD CORRUPT DATA

On the passive DC, you can enable the general query log for a few seconds using the following commands. The kind of queries you can grab highly depends on the day, and in general you cannot really reuse them later, as some of them contain specific timestamps (especially recentchanges).

First, make sure you are using FILE as the log output, and check which file it is:

> SHOW VARIABLES LIKE "general_log%";
+------------------+------------+
| Variable_name    | Value      |
+------------------+------------+
| general_log      | OFF        |
| general_log_file | db1169.log |
+------------------+------------+

Now you are ready to enable it, capture half a minute or a minute of traffic, and then disable it again:

SET global general_log = 1; select sleep (30); SET global general_log = 0;

Keep in mind that, if forgotten, this file can grow to a huge size, so always enable and disable the log in the same command line so you don't forget to disable it. Unfortunately this captures all queries, not only SELECTs (as log_disabled_statements isn't dynamic).

Once you've got the file, you need to exclude INSERT, DELETE and UPDATE statements and ONLY leave SELECT statements, otherwise YOU COULD CORRUPT DATA.
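A naive, illustrative way of doing a first pass over the captured general log (the log file name matches the example above; multi-line statements and header lines will need manual review, so always inspect the result by hand before replaying anything):

 grep -E "Query[[:space:]]+SELECT" db1169.log > selects-only.log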

If you are not 100% sure of what you are doing, do not proceed with this warm up method.

Once the file only has SELECTs, you can replay them against all the hosts you want to warm up.

This page is a part of the SRE Data Persistence technical documentation.