You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

DBA Tasks

From Wikitech-static
Revision as of 00:56, 25 June 2013 by imported>Asher
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

DBA onboarding and upcoming projects, June 2013

The following list is unordered in terms of priority

Get familiar with

  • Puppet
    • manifests/role/coredb.pp
      • $role::coredb::config::topology - Currently a static representation of db server role assignments, used to both generate my.cnf and mha configs. Too static for MHA?
    • modules/coredb_mysql
      • Our puppet module to install mysql/mariadb, my.cnf, pt-query-digest collection, pt-heartbeat, ganglia metrics, nagios monitoring, lvs snapshotting.
    • manifests/role/mha.pp
      • MHA configuration. We don't currently actually run MHA for automated master failover - everything is in place for it except mediawiki integration, which will be an upcoming project. We do use MHA for manual master failover though, both for planned maintenance, and in the event that a master db server fails. In either case, mediawiki config changes need to be deployed.
      • MHA needs to be documented. Peter wrote some initial documentation based on using it fail over masters in the secondary datacenter at https://wikitech.wikimedia.org/wiki/Master_switch but dedicated documentation is needed.
      • MHA doesn't support mutli-tiered replication topologies, which we have. So for every shard, there are three sets of MHA configs, one for the active datacenter, with the master in the secondary datacenter included. One for the secondary datacenter, just including its members. And one for site failover.
  • DB Tree - http://noc.wikimedia.org/dbtree/
  • Slow query logs - https://ishmael.wikimedia.org/ (use your gerrit login)
  • Sampled query logs - https://ishmael.wikimedia.org/sample/
  • Mediawiki Logs
    • fluorine:/a/mw-log
      • fluorine:/a/mw-log/dberror.log - db related logs, mostly errors
      • fluorine:/a/mw-log/exception.log - php exceptions, occasionally db related
  • Mediawiki Config
    • Clone the operations/mediawiki-config git repo
    • mediawiki-config/wmf-config contains the production mediawiki configs
    • db-eqiad.php - db config for the active / primary datacenter
    • db-pmtpa.php - the same for the secondary datacenter that will soon by decommissioned
      • mediawiki always treats the host in position 0 of each shard array under sectionLoads as the master

To Do

  • Documentation - obviously needs a refresh :)
  • Schema changes - there is a backlog of production schema changes to be completed
  • Backups - We currently have one db server per shard, per datacenter configured as an lvm snapshot host, with two snapshots kept in 8 hour increments. I'd like to have binary backups stored for at least a few days though. We have a NetApp that should have space for this. Once this is done, building a new slave should be fairly automated via loading the most recent available binary backup.
  • Automating master failover - there are a few routes available for implementing this, whether or not we build off of the existing MHA work. One is that we could eliminate the need to deploy a new db.php file on master change by moving to vips that could be managed by either MHA or ha.d. If you have a preferred approach, make a proposal.
  • Mariadb upgrades - still several shards to upgrade
  • Improving schema changes - more scripting around pt-online-schema-change or related. Support for sharding.
  • Capacity planning - needed for the External Store as well as coredb shards, mostly around storage space.
  • Flow project review - introducing a new approach to mysql sharding, needs to be well thought out and will impact how we run mysql going forward - https://www.mediawiki.org/wiki/Flow_Portal/Architecture
  • Galera evaluation - Peter has built a test cluster and is running some benchmarks. The mediawiki schema isn't yet ready for galera (there are still tables without primary keys) but it's an interesting possibility for the long term, or even the near term for certain shards (like the upcoming sharded cluster for flow.)
  • DB Tree - needs external store shards, and the extension shard added