MariaDB/misc

Revision as of 09:32, 8 July 2020

There are 5 "miscellaneous" shards: m1-m5.

  • m1: Basic ops utilities
  • m2: otrs, gerrit and others
  • m3: phabricator and other older task systems
  • m5: wikitech, openstack and other cloud-related dbs

On the last cleanup, many unused databases were archived and/or deleted, and a contact person was discovered for each of them.

Sections description

m1

Current schemas

These are the current dbs, and what was needed to fail them over:

  • bacula9: We make sure no backup is running at the time, so we avoid backup failures. Currently we stop bacula-dir (may require disabling puppet to prevent it from automatically restarting the director) to make sure no new backups start and potentially fail, as temporarily stopping the director should not have any user impact. If backups are running, stopping the daemon will cancel the ongoing jobs. Consider rescheduling them (run) if they are important and time-sensitive; otherwise they will be scheduled at a later time automatically, following the configuration. See the sketch after this list.
  • bacula (old, unused bacula db): Nothing
  • etherpadlite: seems like etherpad-lite errors out and terminates after the migration. Normally systemd takes care of it and restarts it instantly. However, if the maintenance window takes long enough, systemd will back off and stop trying to restart it, in which case a systemctl restart etherpad-lite will be required. etherpad crashes at least once a week anyway, if not more, so no big deal. Tested by opening a pad.
  • heartbeat: needs "manual migration": change the master role in puppet
  • librenms: required manual kill of its connections @netmon1001: apache reload
  • puppet: required manual kill of its connections; this caused the most puppet spam. Either restart the puppet-masters or kill the connections as soon as the failover happens (see the connection-kill sketch after this list).
  • racktables: went fine, no problems
  • rddmarc: ?
  • rt: required manual kill of its connections ; @ununpentium: apache reload
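
A minimal shell sketch of the bacula9 handling described above, on whichever host runs the director; the systemd unit name follows the old "service bacula-director restart" note, and the puppet disable reason string is only illustrative:

  # Before the failover:
  sudo puppet agent --disable "mariadb m1 failover"   # keep puppet from restarting the director
  echo "status director" | sudo bconsole              # confirm no jobs are currently running
  sudo systemctl stop bacula-director                 # no new backups can start; any running jobs are cancelled

  # After the failover:
  sudo systemctl start bacula-director
  sudo puppet agent --enable
  echo "list media" | sudo bconsole                   # sanity check that the director can talk to its database again

For the databases whose connections had to be killed manually (librenms, puppet, rt), one generic approach is to find and kill the offending threads on the old master. This is only a sketch: the account names and the thread id are placeholders, not the exact commands used in past failovers:

  mysql -e "SELECT id, user, host, db FROM information_schema.processlist WHERE user IN ('librenms', 'puppet', 'rt');"
  mysql -e "KILL 12345;"   # repeat for each thread id returned above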

Deleted/archived schemas

  • reviewdb: not really on m1 anymore (it was migrated to m2). To delete.
  • blog: to archive
  • bugzilla: to archive / kill. Archived and dropped.
  • bugzilla3: idem, kill. Archived and dropped.
  • bugzilla4: idem, archive. Actually, we also have this on dumps.wm.org (https://dumps.wikimedia.org/other/bugzilla/), but that is the sanitized version, so keep this archive just in case, I guess.
  • bugzilla_testing: idem, kill. Archived and dropped.
  • communicate: ? Archived and dropped.
  • communicate_civicrm: not fundraising! We're not sure what this is; we can check the users table to determine who administered it. Archived and dropped.
  • dashboard_production: Puppet dashboard db. Never used it in my 3 years here, product sucks. Kill with fire. - alex. Archived and dropped.
  • outreach_civicrm: not fundraising, this is the contacts.wm thing, not used anymore, but in turn it means I don't know what "communicate" is then; we can look at the users tables for info on the admin. Archived and dropped.
  • outreach_drupal: kill. Archived and dropped.
  • percona: jynus. Dropped.
  • query_digests: jynus. Archived and dropped.
  • test: archived and dropped
  • test_drupal: er, kill with fire? Kill. Archived and dropped.

owners (or in many cases just people who volunteer to help with the failover)

  • bacula9, bacula: Jaime
  • etherpadlite: Alex. Killed idle db connection.
  • heartbeat: will be handled as part of the failover process by DBAs
  • librenms: Arzhel. Killed idle db connection.
  • puppet: Alex
  • racktables: jmm
  • rt: Daniel, alex can help. Restarted apache2 on ununpentium to reset connections.

m2

Current schemas

These are the current dbs, and what was needed to fail them over:

  • reviewdb + reviewdb-test (deprecated, scheduled to be deleted): Gerrit: Normally needs a restart on gerrit1001, just in case (see the restart sketch after this list). People: akosiaris, hashar
  • otrs: Normally requires restart of otrs-daemon, apache on mendelevium. People: akosiaris
  • debmonitor: Normally nothing is required. People: volans, moritz
    • Django smoothly fails over without any manual intervention.
    • At most check sudo tail -F /srv/log/debmonitor/main.log on the active Debmonitor host (debmonitor1001 as of Jul. 2019).
      • Some failed writes logged with HTTP/1.1 500 and a stacktrace like django.db.utils.OperationalError: (1290, 'The MariaDB server is running with the --read-only option so it cannot execute this statement') are expected, followed by the resume of normal operations with most write operations logged as HTTP/1.1 201.
    • In case of issues it's safe to try a restart by running: sudo systemctl restart uwsgi-debmonitor.service
  • heartbeat: Nothing required
  • xhgui: performance team
  • recommendationapi: Normally requires a restart on scb. People: akosiaris
  • iegreview: Shared nothing PHP application; should "just work". People: bd808, Niharika
  • scholarships: Shared nothing PHP application; should "just work". People: bd808, Niharika
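
A hedged sketch of the post-failover restarts mentioned above. The exact systemd unit names for Gerrit and OTRS (gerrit, otrs-daemon) are assumptions based on the service names and should be double-checked on the hosts; the debmonitor commands are the ones from the list above:

  # Gerrit, on gerrit1001 (unit name assumed):
  sudo systemctl restart gerrit

  # OTRS, on mendelevium (unit names assumed):
  sudo systemctl restart otrs-daemon apache2

  # Debmonitor, only if it does not recover on its own:
  sudo systemctl restart uwsgi-debmonitor.service
  sudo tail -F /srv/log/debmonitor/main.log   # writes should go back to HTTP/1.1 201 after the failover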

dbproxies will need a reload (systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio). You can check which proxy is active with:

host m2-master.eqiad.wmnet

The passive one can be checked by running grep -iR m2 hieradata/hosts/* in the puppet repo (the full sequence is sketched below).
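
Putting the checks and the reload together (all commands are the ones given above; the grep runs from a checkout of the puppet repo, the reload on the active dbproxy):

  host m2-master.eqiad.wmnet        # resolves to the currently active dbproxy
  grep -iR m2 hieradata/hosts/*     # in the puppet repo: lists the active and passive dbproxies for m2
  systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio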

Deleted/archived schemas

  • testotrs: alex: kill it with ice and fire
  • testblog: archive it like blog
  • bugzilla_testing: archive it with the rest of bugzillas

owners (or in many cases just people who volunteer to help with the failover)

  • reviewdb/reviewdb-test: Daniel, Chad, Akosiaris on SRE side
  • otrs: Akosiaris
  • heartbeat: DBA
  • debmonitor: volans, moritzm
  • recommendationapi: bmansurov, #Research on Phabricator. Akosiaris on SRE side
  • xhgui: performance team

m3

Current schemas

  • phabricator_*: 57 schemas to support phabricator itself
  • rt_migration: schema needed for some crons related to phabricator jobs
  • bugzilla_migration: schema needed for some crons related to phabricator jobs

Dropped schemas

  • fab_migration

m5

Current schemas

  • labswiki: schema for wikitech (MediaWiki)
  • striker: schema for toolsadmin.wikimedia.org (Striker)
  • labsdbaccounts / test_labsdbaccounts: cloud team
  • testreduce / testreduce_vd: parsoid / ssastry
  • OpenStack schemas: see https://phabricator.wikimedia.org/T255950#6252977 and following
    • designate
    • glance
    • keystone
    • neutron
    • nova_api_eqiad1
    • nova_cell0_eqiad1
    • nova_eqiad1

Example Failover process

  1. Disable GTID on db1063, connect db2078 and db1001 to db1063 DONE
  2. Disable puppet @db1016, puppet @db1063 DONE
     puppet agent --disable "switchover to db1063"
  3. Merge gerrit: https://gerrit.wikimedia.org/r/420317 and https://gerrit.wikimedia.org/r/420318 DONE
  4. Run puppet and check config on dbproxy1001 and dbproxy1006 DONE
     puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg
  5. Disable heartbeat @db1016 DONE
     killall perl
  6. Set old m1 master in read only DONE
     mysql --skip-ssl -hdb1016 -e "SET GLOBAL read_only=1"
  7. Confirm the new master has caught up DONE
     mysql --skip-ssl -hdb1016 -e "select @@hostname; show master status\G show slave status\G"; mysql --skip-ssl -hdb1063 -e "select @@hostname; show master status\G show slave status\G"
  8. Start puppet on db1063 (for heartbeat)
     puppet agent -tv
  9. Switchover proxy master @dbproxy1001 and dbproxy1006 DONE
     systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
  10. Kill remaining connections DONE
     ? which command is used - it would be nice to document it and put everything on the wiki (a possible approach is sketched after this list)
  11. Run puppet on old master @db1016 DONE
     puppet agent -tv
  12. Set new master as read-write and stop slave DONE
     mysql -h db1063.eqiad.wmnet -e "SET GLOBAL read_only=0; STOP SLAVE;"
  13. Check services affected at https://phabricator.wikimedia.org/T189655 DONE
  14. RESET SLAVE ALL on new master DONE
  15. Change old master to replicate from new master (see the sketch after this list) DONE
  16. Update tendril master server id for m1 (no need to change dns) DONE
  17. Patch prometheus, dblists DONE
  18. Create decommissioning ticket for db1016 - https://phabricator.wikimedia.org/T190179
  19. Close T166344
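
The two steps above that are listed without a command (killing leftover connections, and re-pointing the old master at the new one) could look roughly like this. This is a sketch only: the replication user, password and binlog coordinates are placeholders, and the exact commands used during the actual failover were not recorded here.

  # Find and kill leftover client connections on the old master (thread id 12345 is a placeholder):
  mysql --skip-ssl -hdb1016 -e "SELECT id, user, host, db FROM information_schema.processlist WHERE user NOT IN ('root', 'system user') AND command != 'Binlog Dump';"
  mysql --skip-ssl -hdb1016 -e "KILL 12345;"

  # RESET SLAVE ALL on the new master:
  mysql --skip-ssl -hdb1063 -e "RESET SLAVE ALL;"

  # Re-point the old master at the new one (credentials and coordinates are placeholders):
  mysql --skip-ssl -hdb1016 -e "CHANGE MASTER TO MASTER_HOST='db1063.eqiad.wmnet', MASTER_USER='<repl_user>', MASTER_PASSWORD='<repl_password>', MASTER_LOG_FILE='<binlog_file>', MASTER_LOG_POS=<binlog_pos>; START SLAVE;"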