Analytics/Systems/Cluster/Mysql Meta

The Analytics Mysql Meta instance is a MariaDB database currently running on the Analytics Hadoop coordinator host (an-coord1001). It hosts some low-volume database schemas related to Hadoop tools like Hive, Oozie, Druid, etc.

Databases

MariaDB [(none)]> show databases;
+---------------------+
| Database            |
+---------------------+
| druid               |
| druid_public_eqiad  |
| hive_metastore      |
| hue                 |
| oozie               |
| search_airflow      |
| superset_production |
| superset_staging    |
+---------------------+
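
The same listing, and a quick check that these really are low-volume schemas, can be obtained from a mysql shell on an-coord1001 (a minimal sketch, assuming sudo mysql reaches the instance over its default local socket):

sudo mysql

MariaDB [(none)]> show databases;
MariaDB [(none)]> SELECT table_schema, ROUND(SUM(data_length + index_length)/1024/1024, 1) AS size_mb FROM information_schema.tables GROUP BY table_schema;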

Dependent services

The MariaDB instance stores metadata for the following systems:

  • Oozie (Analytics/Systems/Cluster/Oozie)
  • Hive Server2 and Metastore (Analytics/Systems/Cluster/Hive)
  • Druid
  • Superset
  • Hue
  • Airflow for the Search instance
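
Each of these services points at the instance through its own configuration file on an-coord1001. As an example, the Hive metastore's JDBC endpoint can be checked like this (a minimal sketch: the /etc/hive/conf path is the conventional location and an assumption, not something stated on this page):

# On an-coord1001: show which MariaDB endpoint the Hive metastore is configured to use
grep -A1 'javax.jdo.option.ConnectionURL' /etc/hive/conf/hive-site.xml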

Backup

There are currently two backup workflows running:

  • All the databases are periodically backed up in /srv/backup/mysql/analytics-meta on the Analytics Hadoop Master Standby (an-master1002). Daily snapshots are backed up into HDFS at /wmf/data/archive/backup/mysql/analytics-meta/.
  • All the above databases are replicated to db1108, see Analytics/Systems/DB_Replica

Long term, only the latter workflow will be kept (together with a periodic Bacula backup of db1108).
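
To verify that both workflows are actually producing data, the locations above can be inspected directly (a minimal sketch; the paths come from this page, file names and the exact replica check will vary):

# On an-master1002: local snapshots of the meta databases
ls -lh /srv/backup/mysql/analytics-meta

# From a Hadoop client: daily snapshots archived to HDFS
hdfs dfs -ls /wmf/data/archive/backup/mysql/analytics-meta/

# On db1108: confirm that replication of the meta databases is running
sudo mysql -S /run/mysqld/mysqld.analytics_meta.sock -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running'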

Failover

This procedure should be used only if an-coord1001 is unavailable for some reason, for example an unrecoverable hardware failure. There is no automatic failover, only the manual one described by this procedure. The high-level idea is the following:

  • Check which daemons are running on an-coord1001. The Hive metastore and Oozie probably need to be moved elsewhere; check which host/VM can hold them temporarily.
  • SSH to db1108 and stop replication on the Analytics Meta mysql instance (sudo mysql -S /run/mysqld/mysqld.analytics_meta.sock, then stop slave;); see the sketch after this list.
  • Stop Bacula backups for db1108's mysql instance, since we found in the past that the backup locking overhead causes timeouts and problems for daemons connecting to the Analytics Meta DB.
  • Set db1108's meta replica to read/write in puppet (check mariadb::config in profile::mariadb::misc::analytics::multiinstance).
  • Look for all occurrences of an-coord1001 in puppet and figure out which ones need to be moved to db1108. Keep in mind that the port of the mysql instance on db1108 is not 3306, so replacing an-coord1001.eqiad.wmnet with db1108.eqiad.wmnet in puppet is not enough. A high-level list is:
    • hive metastore mysql configuration (hive-site.xml)
    • oozie mysql configuration (oozie-site.xml)
    • druid clusters configuration (druid common config, in hiera)
    • superset configs (superset_config.py)
    • hue config for oozie and hive
    • airflow mysql config for Search (alert the Search team so they know Airflow will be working in degraded mode).
  • Merge the above change and run puppet across the hosts; you'll likely need to roll-restart the impacted daemons.
  • At this point, service should be recovered.
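
The db1108 steps above boil down to a short session on the analytics_meta instance plus a search of the puppet tree. A minimal sketch (the socket path is taken from this page; everything else, including whether read_only is flipped by hand or via puppet, should be adapted on the spot):

# In a local checkout of operations/puppet: find every reference that may need to move to db1108
git grep -n 'an-coord1001'

# On db1108: stop replication on the analytics_meta instance and check its state
sudo mysql -S /run/mysqld/mysqld.analytics_meta.sock

MariaDB [(none)]> STOP SLAVE;
MariaDB [(none)]> SHOW SLAVE STATUS\G
MariaDB [(none)]> SELECT @@port, @@read_only;

Once replication is stopped, SHOW SLAVE STATUS should report Slave_IO_Running and Slave_SQL_Running as No, and the port value is the non-3306 one to carry into the puppet changes; the read/write flip itself is best driven through the puppet change described above.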

Failback

This procedure should be used to restore service after a failover from an-coord1001 to db1108 has happened. Since the databases on an-coord1001 need to be rebuilt, this will require some downtime for all services.

The high level idea is the following:

  • Set up some downtime for all the daemons using the analytics meta instance (see the Failover section).
  • Dump the databases on db1108, copy them to an-coord1001 and restore them (see Analytics/Systems/DB_Replica for info on how to do it); a sketch follows this list. You will probably need to drop the existing databases/data first.
  • Move all daemons to an-coord1001 (read the Failover procedure for more info).
  • Set db1108's instance to read-only and set up replication again.
  • Re-enable Bacula backups.
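
The dump/restore and replication steps could look roughly like the sketch below. The database names come from the listing at the top of this page; the dump file path and the replication user/password are placeholders, and the CHANGE MASTER TO line still needs the right binlog/GTID position, so treat Analytics/Systems/DB_Replica as the authoritative reference:

# On db1108: dump all meta databases from the analytics_meta instance
sudo mysqldump -S /run/mysqld/mysqld.analytics_meta.sock --single-transaction \
    --databases druid druid_public_eqiad hive_metastore hue oozie search_airflow superset_production superset_staging \
    > /srv/analytics_meta_failback.sql

# On an-coord1001, after copying the dump over: restore it into the rebuilt instance
sudo mysql < /srv/analytics_meta_failback.sql

# Back on db1108: make the instance read-only again and point replication at an-coord1001
MariaDB [(none)]> SET GLOBAL read_only = 1;
MariaDB [(none)]> CHANGE MASTER TO MASTER_HOST='an-coord1001.eqiad.wmnet', MASTER_USER='<repl_user>', MASTER_PASSWORD='<repl_password>';
MariaDB [(none)]> START SLAVE;
MariaDB [(none)]> SHOW SLAVE STATUS\G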