You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

MariaDB/Start and stop

From Wikitech-static
< MariaDB
Revision as of 14:37, 8 June 2021 by imported>LSobanski
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search



Start / Stop

In most cases, MariaDB will not start automatically on server start. This is by design - a server is not ready to be put into production just after start for many reasons, plus it could be in a bad state (e.g. after crash, requiring upgrade or maintenance).

If a server has been shutdown normally, it can be started with:

systemctl start mariadb

(current MariaDB packages do not ship a systemd unit, although technically, jessie's backwards compatibility would make them work with systemctl. Avoid service, as it does not allow to pass extra parameters.

Right now, starting MariaDB does start the replica automatically - however, this again may not be desired if mysql_upgrade has to be run or the server otherwise checked. Prefer:

systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
systemctl start mariadb

this will require later to run at the mysql prompt:

mysql> START SLAVE;

or

mysql> START ALL SLAVES;

if it is using multi-source replication.

To shutdown, make sure mysql replication is not running first:

mysql> STOP SLAVE;
mysql> SHOW [ALL] SLAVE[S] STATUS\G

systemctl stop mariadb

If you just do "shutdown -[hr] now", there is a high chance that mysql will timeout and the operating system kills it uncleanly. This is usually due to one of these 3 reasons:

  • It takes a lot of time to empty the buffer pool (can take several minutes!)
  • The replication thread is killing/committing the ongoing transaction, and it takes a lot of time. This would seem that the slave is "blocked"
  • TokuDB has crashed and replication is "stuck" (in this last case, you actually have to kill the server)

If either action takes a long while, run:

journalctl -u mariadb

Do not assume that things will happen quickly, particularly if a box has been under heavy write load. Failing to wait for the server to stop will, with high probability, corrupt its data if non-transactional tables and GTID are being used - forcing to reload all data (a multi-hour or multi-day task!).

Consider also unmounting manually the /srv partition. Apparently, systemd does not respect the umounting time in some cases and it can lead to corruption:

puppet agent --disable "server shutdown" # prevents puppet messing up with the inexistent data
umount /srv

Buffer pool dump

In order to speed up the warming of the buffer pool (more frequent data should be in memory), automatic load/dump has been activated. See buffer pool dump for details.

InnoDB Shutdown Speed

InnoDB can take a while (hours) to write dirty pages out to disk on shutdown. If the instance must have a predictable downtime then make MariaDB begin flushing dirty pages well in advance of the shutdown process:

set global innodb_max_dirty_pages_pct=0;

Then poll:

show global status like '%dirty%';

When the counters approach zero, a shutdown and/or restart will be fast. Of course, since MariaDB is still handling traffic in the meantime, if write load is high the counters may never drop :-) and you will have to depool the box in that case.

Stopping a replica: avoiding replication corruption

Until GTID with transactional replication control is deployed, and probably forever on places like labs (where non-transactional tables are the norm), it is advised to stop replication first. In 99% percent of the cases, mysql already does that on shutdown, and shutdown does cleanly stop mysql - in theory. In reality, if shutdown takes too much time, systemd/rc/etc. kills mysql, which despite "server transactionality" (server is consistent on crash), that means that replication state is not itself transitional. This happens, for example, if long running updates (vslow, imports running on labs) prevent replication from stopping. In order to do that, just execute:

$ mysql --skip-ssl -e "STOP SLAVE"

This is specially true on crash: in most cases, reimaging a replica and reimporting it is the fastest way to assure data integrity (again, until production gets transactional replication state for InnoDB).



This page is a part of the SRE Data Persistence technical documentation
(go here for a list of all our pages)