You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Portal:Data Services/Admin/OSMDB

From Wikitech-static
Jump to navigation Jump to search

DON'T TRY THIS YET! This is a first pass.

The following is documentation for WMCS and OSM database administrators to be able to perform operations on the Postgresql database servers that make up the OSMDB service in Cloud Services.

Failing over OSMDB for maintenance

Most of the data on these instances is all replicated from the upstream OSM databases. However, there are a few writeable schemas that users have access to and may have no other backups of. Therefore, it is a good idea to fail things over and treat the data with some care. We have instituted no systems for automatic failover, and postgresql doesn't have a native one, so we must do things manually using the STONITH (Shoot The Other Node In The Head) method.

  • With that in mind, the best practice is to first announce the maintenace and that things will briefly read-only and possible unavailable for a few moments.
  • Disable puppet on both servers
  • Change the DNS entry for osm.db.svc.eqiad.wmflabs wherever that is
  • Once most things are pointing at the read-only replica, stop postgresql on the primary server with systemctl stop postgresql. Make sure it's stopped! If postgresql is still showing up in ps commands, try waiting for it to stop. The last resort is kill -9, but really try not to do that.
  • On the replica, become root with sudo -i and run su postgres -c '/usr/lib/postgresql/9.6/bin/pg_ctl promote -D /srv/postgres/9.6/main'. Check /var/log/postgresql/postgresql-9.6-main.log for messages that it restarted as the master and is accepting queries.
  • In this kind of arrangement postgresql doesn't have a beautiful way of demoting a primary to a standby server. The pg_rewind tool is not going to work because we don't have the right configuration. Therefore, you are going to have to rebuild the database of the primary to fail back or to remain in this configuration with replication enabled.
    • On the old primary, you need to delete the database or move it aside and then run sudo -u postgres pg_basebackup -D /srv/postgres/9.6/main/ -h clouddb1004.clouddb-services.eqiad.wmflabs --checkpoint=fast --xlog-method=stream -U replication -W and enter the replication password that you carefully extracted from the recovery.conf file at /srv/postgres/9.6/main/recovery.conf.
  • If everything is staying in this configuration, swap the puppet roles and enable puppet. If not, you should fail back once things are done by repeating the entire process.