MariaDB/Upgrading a section


Order of upgrades

  • Upgrade clouddb* hosts.
  • Upgrade Sanitarium hosts in both DCs
  • Upgrade the Sanitarium primaries in both DCs and ensure the Sanitarium hosts replicate from the 10.4 one in the active DC (see the check after this list)
  • Upgrade the candidate master on the standby DC
  • Upgrade the backup source in the standby DC (coordinate with Jaime)
  • Upgrade the master in the standby DC
  • Upgrade the candidate master in the primary DC
  • Upgrade the backup source in the primary DC (coordinate with Jaime)
  • Switch over the primary host in the primary DC to a Buster+10.4 host
  • Upgrade the old primary and make it a candidate primary
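
When moving hosts under an upgraded primary, it helps to confirm what each replica is actually replicating from. A minimal check from a cumin host, assuming the usual mysql.py wrapper (db1154 below is only a placeholder for the Sanitarium host):

# which primary is this replica connected to, and is replication running?
mysql.py -hdb1154 -e "show slave status\G" | grep -E "Master_Host|Slave_IO_Running|Slave_SQL_Running"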

Upgrade procedure

  • Patch the dhcp file: [example]
  • Run puppet on install1003 and install2003
  • Depool the host (if needed) using software/dbtools/depool-and-wait
  • Silence the host in Icinga (e.g. on a cumin host, cookbook sre.hosts.downtime xxxx.wmnet -D1 -t TXXXXXX -r "reimage for upgrade - TXXXXXX")
  • Stop MySQL on the host
  • Run umount /srv; swapoff -a
  • Run reimage: sudo -E sudo cookbook sre.hosts.reimage xxxx.wmnet -p TXXXXXX
  • Wait until the host is up
  • Run systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
  • Run systemctl start mariadb ; mysql_upgrade
  • Run systemctl restart prometheus-mysqld-exporter.service
  • Drop the host from Tendril and re-add it, otherwise its metrics won't be updated in Tendril
  • Check all the tables before starting replication (this can take up to 24h depending on the section)
    • In a screen run: mysqlcheck --all-databases
    • If any corruption is discovered, fix it with the following one-liner (an expanded, commented version is sketched after this list): journalctl -xe -u mariadb | grep table | grep Flagged | awk -F "table" '{print $2}' | awk -F " " '{print $1}' | tr -d "\`" | uniq >> /root/to_fix ; for i in `cat /root/to_fix`; do echo $i; mysql -e "set session sql_log_bin=0; alter table $i engine=InnoDB, force"; done
  • Start the replica
  • Wait until replication has caught up
  • Repool the host.
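
The corruption fix above, expanded into a more readable form. This is only a sketch under the same assumptions as the one-liner (the journal messages contain "Flagged" and the affected table name follows the word "table"); review /root/to_fix before running the ALTERs:

# collect the tables mariadb flagged as corrupted
journalctl -xe -u mariadb | grep table | grep Flagged \
  | awk -F "table" '{print $2}' | awk '{print $1}' | tr -d '`' | uniq >> /root/to_fix
# rebuild each flagged table, without writing the ALTERs to the binlog
for t in $(cat /root/to_fix); do
    echo "$t"
    mysql -e "set session sql_log_bin=0; alter table $t engine=InnoDB, force"
done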

Upgrading MariaDB minor version

dbctl instance db1153 depool
dbctl config diff
dbctl config commit -m "Depooling db1153 for mysql upgrade T295026"
mysql.py -hdb1153 -e "show processlist" and check that nothing is still querying the host (see the filter sketch after this block)
cookbook sre.hosts.downtime --hours 2 -r "Maintenance T295026" db1153.eqiad.wmnet
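
A rough way to filter the usual background threads out of the processlist when checking that nothing is still using the host (a sketch; the exact thread names depend on the setup):

mysql.py -hdb1153 -e "show processlist" | grep -Ev "system user|event_scheduler|Binlog Dump|show processlist"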

ssh into the host (and become root)

# in a mysql shell on the host:
stop slave;
SET GLOBAL innodb_buffer_pool_dump_at_shutdown = OFF;
# back in the root shell:
systemctl stop mariadb
apt full-upgrade
# log to SAL (IRC):
!log Upgrade db1153 T295026
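
After apt full-upgrade, a quick way to confirm which MariaDB package is now installed (the WMF builds are packaged as wmf-mariadb*; the exact name depends on the version):

dpkg -l | grep -i mariadb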

(if the Linux kernel was updated as well):
df -hT
umount /srv
reboot
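
After the reboot, it is worth checking that the new kernel is running and that /srv is mounted again before touching MariaDB (a quick sanity check):

uname -r
df -hT /srv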

systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
systemctl start mariadb
mysql_upgrade
mysql -e "start slave" (locally) or mysql.py -hdb1153 -e "start slave" from a cumin host
On masters in multi-master setups, also run: "set global read_only = false;"

Wait for it to catch up in replication
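
One way to watch it catch up (Seconds_Behind_Master should drop to 0 and both replication threads should show Yes); run it locally or via mysql.py from cumin:

mysql -e "show slave status\G" | grep -E "Seconds_Behind_Master|Slave_IO_Running|Slave_SQL_Running"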

(in a screen session on a cumin host)
./dbtools/repool db1153 "After upgrade T295026" 50 100
(for hosts getting traffic: 10 25 75 100)
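
After the repool steps complete, a quick way to double check that the host is pooled with the expected weight and that no uncommitted changes are left behind (dbctl prints the instance configuration as JSON):

dbctl instance db1153 get
dbctl config diff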

Mark the host as done in https://phabricator.wikimedia.org/T295026



This page is a part of the SRE Data Persistence technical documentation