There are 4 "miscellaneous" shards: m1, m2, m3 and m5.


* '''m1''': Basic ops utilities (backups, etherpad, librenms)
* '''m2''': otrs, recommendations api and others
* '''m3''': phabricator and other older task systems
* '''m5''': wikitech and other cloud-related dbs + testreduce


On the last cleanup, many unused databases were archived and/or deleted, and a contact person was identified for each of them.


== Sections description ==
=== m1 ===


==== Current schemas ====
These are the current dbs, and what was needed to fail over them:


* '''bacula9''': The [[Bacula]] metadata database. Make sure no backup is running at the time, so backup failures are avoided. Currently we stop bacula-dir (this may require disabling puppet so it does not get restarted automatically) to make sure no new backups start and potentially fail; temporarily stopping the director should not have any user impact. If backups are running, stopping the daemon will cancel the ongoing jobs. Consider rescheduling them (run) if they are important and time-sensitive; otherwise they will be scheduled again automatically per the configuration. (See the sketch after this list.) Owners: Jaime, backup: Alex
* '''cas''': Database to store 2FA tokens registered via Apereo CAS (idp.wikimedia.org). Owners: John Bond, Moritz https://phabricator.wikimedia.org/T268327
* '''dbbackups''': Database backups metadata; on master failover it needs a manual update as it doesn't use the proxy. At the moment it requires manual migration of connections after failover: https://gerrit.wikimedia.org/r/c/operations/puppet/+/668449 Owners: Jaime
* '''etherpadlite''': etherpad-lite errors out and terminates after the migration. Normally systemd takes care of it and restarts it instantly. However, if the maintenance window takes long enough, systemd will back off and stop trying to restart it, in which case a <code>systemctl restart etherpad-lite</code> will be required. Etherpad crashes at least once a week anyway, so no big deal; tested by opening a pad. Killed idle db connection on failover. Owners: Alex.
* '''heartbeat''': Its writes should stop/start automatically when switching its puppet primary/replica config. Will need cleanup of old records after the switch, for Orchestrator; see [[MariaDB#Misc_section_failover_checklist_(example_with_m2)]]. Owners: DBAs.
* '''pki''': Database to store signed certificates managed by pki.discovery.wmnet. Owners: John Bond, Moritz (https://phabricator.wikimedia.org/T268329)
* '''librenms''': required a manual kill of its connections; <code>@netmon1001: apache reload</code>. Killed idle db connection. Owners: Netops (Arzhel).
* '''racktables''': Mostly migrated to Netbox, which uses Postgres. Kept for reference/read-only. jmm checked it after the failover: went fine, no problems. Owners: DC ops.
* '''rddmarc''': ?
* '''rt''': Old ticket manager, kept read-only for reference of contracts/orders, etc. Mostly used by RobH. Required a manual kill of its connections; <code>@unobtinium: apache reload</code>. Restarted apache2 on ununpentium to reset connections. Owners: Daniel; Alex can help.
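
A minimal sketch of the bacula9 steps above, assuming the director runs as the <code>bacula-director</code> systemd unit on the backup host and that <code>bconsole</code> is available there; WMF-specific wrappers (cookbooks, downtimes) are not shown:

 # Keep puppet from restarting the director while it is stopped
 sudo puppet agent --disable "m1 failover: stopping the bacula director"
 # Confirm no backup jobs are currently running
 echo "status director" | sudo bconsole
 # Stop the director so no new jobs can start and fail mid-failover
 sudo systemctl stop bacula-director
 # ... perform the database failover here ...
 sudo systemctl start bacula-director
 sudo puppet agent --enable
 # Smoke test: a catalog query proves the director can reach the new primary
 echo "list media" | sudo bconsole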


==== Deleted/archived schemas ====
* '''bacula''': old bacula database (for bacula 7.x). Archived into the backups "archive pool"
* '''blog''': to archive
* '''bugzilla''': to archive * kill  archived and dropped
* '''bugzilla3''': idem kill archived and dropped
* '''bugzilla4''': idem archive, actually, we also have this on dumps.wm.org https://dumps.wikimedia.org/other/bugzilla/ but that is the sanitized version, so keep this archive just in case i guess
* '''bugzilla_testing''': idem kill archived and dropped
* '''communicate''': ? archived and dropped
* '''communicate_civicrm''': not fundraising! we're not sure what this is, we can check the users table to determine who administered it archived and dropped
* '''dashboard_production''': Puppet dashboard db. Never used it in my 3 years here, product sucks. Kill with fire. - alex archived and dropped
* '''outreach_civicrm''': not fundraising, this is the contacts.wm thing, not used anymore, but in turn it means i don't know what "communicate" is then, we can look at the users tables for info on the
* '''admin''': archived and dropped
* '''outreach_drupal''': kill archived and dropped
* '''percona''': jynus dropped
* '''puppet''': required a manual kill of its connections (see the sketch below); this caused the most puppet spam. Either restart the puppet-masters or kill connections '''as soon''' as the failover happens. Puppet no longer uses mysql but its own postgres-backed storage; it was kept for a while for stats/observability. Owner: Alex
* '''query_digests''': jynus archived and dropped
* '''test''': archived and dropped
* '''test_drupal''': er, kill with fire ? kill archived and dropped
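
Several schemas in the lists above (librenms, rt, and formerly puppet) needed their leftover connections killed right after the failover. A minimal sketch, to be run on the ''old'' master; the schema name is only an example, and the generated <code>KILL</code> statements should be reviewed before being executed:

 # List connections still attached to a given schema on the old master
 sudo mysql -e "SELECT id, user, host, db, time, state FROM information_schema.processlist WHERE db = 'puppet';"
 # Generate KILL statements for review; paste them back in if they look right
 sudo mysql -N -e "SELECT CONCAT('KILL ', id, ';') FROM information_schema.processlist WHERE db = 'puppet';"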


=== m2 ===

==== Current schemas ====
These are the current dbs, and what was needed to fail over them:

* '''otrs''': Normally requires a restart of otrs-daemon and apache on ''mendelevium''. People: akosiaris
* '''debmonitor''': Normally nothing is required. People: volans, moritz
** Django smoothly fails over without any manual intervention.
** At most check <code>sudo tail -F /srv/log/debmonitor/main.log</code> on the active Debmonitor host (<code>debmonitor1001</code> as of Jul. 2019); the checks are combined into a sketch after this list.
*** Some failed writes logged with <code>HTTP/1.1 500</code> and a stacktrace like <code>django.db.utils.OperationalError: (1290, 'The MariaDB server is running with the --read-only option so it cannot execute this statement')</code> are expected, followed by the resumption of normal operations, with most write operations logged as <code>HTTP/1.1 201</code>.
** In case of issues it's safe to try a restart by performing: <code>sudo systemctl restart uwsgi-debmonitor.service</code>
* '''heartbeat''': Its writes should stop/start automatically when switching its puppet primary/replica config. Will need cleanup of old records after the switch, for Orchestrator; see [[MariaDB#Misc_section_failover_checklist_(example_with_m2)]]. Owners: DBAs.
* '''xhgui''': performance team
* '''recommendationapi''': k8s service, nothing required, should "just work". People: akosiaris
* '''iegreview''': Shared-nothing PHP application; should "just work". People: bd808, Niharika
* '''scholarships''': Shared-nothing PHP application; should "just work". People: bd808, Niharika
* '''sockpuppet''': Sockpuppet detection service (also known as the similar-users service). A PySpark model currently generates the CSV files and the application needs to be restarted to reload these files. Ideally the process that creates these files would simply update the database in place. https://phabricator.wikimedia.org/T268505. People: Hnowlan
* '''mwaddlink''': The Link Recommendation Service (https://phabricator.wikimedia.org/T267214) is an application hosted on kubernetes with an API accessible via HTTP. It responds to a POST request containing the wikitext of an article with a structured response of link recommendations for the article. It does not have caching or storage; the client (MediaWiki) is responsible for that. A MySQL table per wiki is used for caching the actual link recommendations (task T261411); each row contains serialized link recommendations for a particular article. https://wikitech.wikimedia.org/wiki/Add_Link . People: kostajh
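
The debmonitor checks above, combined into a single sequence. The host name and log path are the ones quoted in the list (as of Jul. 2019) and may have changed since:

 # On the active Debmonitor host, watch the application log during the switch;
 # a few HTTP/1.1 500 responses with the read-only OperationalError are expected,
 # then writes should resume as HTTP/1.1 201
 sudo tail -F /srv/log/debmonitor/main.log
 # Only if errors persist after the failover, restart the service
 sudo systemctl restart uwsgi-debmonitor.service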


dbproxies will need a reload (''systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio''). You can check which one is the active proxy with:
 host m2-master.eqiad.wmnet
The passive one can be checked by running ''grep -iR m2 hieradata/hosts/*'' on the puppet repo.
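
The same proxy steps as one sequence; the commands are those from the paragraph above, with <code>sudo</code> added where root is typically needed:

 # Find the currently active m2 proxy
 host m2-master.eqiad.wmnet
 # On that proxy: reload haproxy and confirm backend state via the admin socket
 sudo systemctl reload haproxy
 echo "show stat" | sudo socat /run/haproxy/haproxy.sock stdio
 # In a checkout of the puppet repo, find which host is the passive proxy
 grep -iR m2 hieradata/hosts/*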
 
==== Deleted/archived schemas ====
* '''testotrs''': alex: kill it with ice and fire
* '''testblog''': archive it like blog
* '''bugzilla_testing''': archive it with the rest of bugzillas
* '''reviewdb''' + '''reviewdb-test''' (deprecated & deleted): Gerrit: Normally needs a restart on ''gerrit1001'', just in case. People: akosiaris, hashar
 
=== m3 ===
 
==== Current schemas ====
* '''phabricator_*''': 57 schemas to support phabricator itself
* '''rt_migration''': schema needed for some crons related to phabricator jobs
* '''bugzilla_migration''': schema needed for some crons related to phabricator jobs
* '''heartbeat''': Its writes should stop/start automatically when switching its puppet primary/replica config. Will need cleanup of old records after the switch, for Orchestrator; see [[MariaDB#Misc_section_failover_checklist_(example_with_m2)]]. Owners: DBAs.


==== Dropped schemas ====
* fab_migration
=== m5 ===
==== Current schemas ====
* '''labswiki''': schema for wikitech (MediaWiki)
* '''striker''': schema for [[toolsadmin.wikimedia.org]] (Striker)
* '''labsdbaccounts''': cloud team
* '''test_labsdbaccounts''': cloud team (not in use) https://phabricator.wikimedia.org/T255950#6260581
* '''testreduce / testreduce_vd''': parsoid / ssastry
* '''heartbeat''': Its writes should stop/start automatically when switching its puppet primary/replica config. Will need cleanup of old records after the switch, for Orchestrator (a hedged cleanup sketch follows below); see [[MariaDB#Misc_section_failover_checklist_(example_with_m2)]]. Owners: DBAs.
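
The '''heartbeat''' entries in each section above mention cleaning up old records after the switch so Orchestrator is not confused by rows left over from the previous primary. A minimal, heavily hedged sketch, assuming the standard pt-heartbeat layout (a <code>heartbeat.heartbeat</code> table keyed by <code>server_id</code>); the failover checklist linked below is the authoritative procedure:

 # On the NEW primary: see which server_ids have written heartbeat rows
 sudo mysql -e "SELECT server_id, MAX(ts) FROM heartbeat.heartbeat GROUP BY server_id;"
 # Remove rows written by the old primary (any server_id other than this host's)
 sudo mysql -e "DELETE FROM heartbeat.heartbeat WHERE server_id != @@server_id;"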
== Example Failover process ==
See https://wikitech.wikimedia.org/wiki/MariaDB#Misc_section_failover_checklist_(example_with_m2)
[[Category:MySQL]]
