MariaDB/troubleshooting/external storage failover

14 August 2018
  1. Old master: es1014
  2. New master: es1017
  3. Check for passwords in the old format and fix grants if needed CHECKED (es1017 looks good)
  4. Set expire_logs_days to 30 on the new master DONE
set global expire_logs_days = 30;
  1. Check that pt-heartbeat is running with the latest puppet parameters (note the user, host and defaults file): CHECKED
/usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard={shard} --datacenter={dc} --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S {socket} --daemonize --pid /var/run/

and not with the older format, still present on some masters:

/usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/root/.my.cnf -D heartbeat --shard={shard} --datacenter={dc} --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S {socket} --daemonize --pid /var/run/
  1. Silence alerts on all hosts DONE
  2. Disable GTID on es1017 DONE
  3. Move replicas (es1019, es2017) under the new master (es1017) DONE
  4. Disable puppet on es1017, es1014 DONE
@es1014> puppet agent --disable "Switching over es3 from es1014 to es1017"
@es1017> puppet agent --disable "Switching over es3 from es1014 to es1017"
  1. Merge puppet patch and deploy it DONE
  2. Merge mediawiki patch and rebase on deployment host DONE
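
The preparation checks above (old-format passwords, pt-heartbeat parameters, disabling GTID) could be sketched roughly as below. This is an illustration, not the procedure that was actually run: the function names are hypothetical, the functions only print SQL or classify a command line, and nothing is sent to a server. It assumes old-format password hashes are the 16-character ones in mysql.user and that GTID is disabled with MariaDB's MASTER_USE_GTID option.

```shell
#!/bin/sh
# Sketch only: prints SQL for review; nothing is executed against a server.
# All function names here are hypothetical helpers, not existing tooling.

# Old-format (pre-4.1) password hashes are 16 characters long; current
# mysql_native_password hashes are 41 characters and start with '*'.
old_password_check_sql() {
    printf '%s\n' "SELECT user, host FROM mysql.user WHERE LENGTH(password) = 16;"
}

# Statements to stop using GTID-based replication on a replica (MariaDB syntax).
disable_gtid_sql() {
    printf '%s\n' \
        "STOP SLAVE;" \
        "CHANGE MASTER TO MASTER_USE_GTID = no;" \
        "START SLAVE;"
}

# Classify a pt-heartbeat command line (for example, one taken from ps output)
# by the --defaults-file value that distinguishes the two formats shown above.
heartbeat_format() {
    case "$1" in
        *--defaults-file=/dev/null*) echo "current" ;;
        *--defaults-file=/root/.my.cnf*) echo "old" ;;
        *) echo "unknown" ;;
    esac
}
```

The printed statements are meant to be reviewed and then pasted or piped into a mysql client session on the relevant host; heartbeat_format takes the full command line as a single quoted argument.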

(actual deployment starts here)

  1. !log the actions about to take place

!log switchover es3 eqiad master from es1014 to es1017 DONE

  1. Run switchover script from neodymium
./ --skip-slave-move es1014 es1017 DONE
[Servers sync at master: es1014-bin.002508:184384418 slave: es1017-bin.002491:41215873]
  1. Deploy mediawiki change (deployment.eqiad.wmnet) DONE
scap sync-file --force wmf-config/db-eqiad.php "Switchover es3 master eqiad from es1014 to es1017"

(main deployment finishes here)

  1. Run puppet on es1014 and es1017, and make sure it doesn't break anything DONE
@es1014> puppet agent --enable && puppet agent -tv
@es1017> puppet agent --enable && puppet agent -tv
  1. Check semisync and gtid status of all related servers DONE
  2. Update the DNS CNAME to reflect the change: DONE
  1. Update tendril, zarcillo: DONE
-A -h db1115 tendril -e "update shards set master_id=1231 WHERE name='es3' LIMIT 1"
-A -h db1115 zarcillo -e "UPDATE masters SET instance = 'es1017' WHERE section='es3' and dc = 'eqiad' LIMIT 1"
  1. Update and close the ticket
  2. Perform planned maintenance on es1014 (upgrade socket location, upgrade mysql, upgrade kernel, make sure firmware is deployed, change old format passwords if any)
  3. Remove accounts 'repl'@'10.%', 'repl'@'208.80.152.%', 'repl'@'10.0.%' (and possibly others) from es1017 (and possibly other hosts) DONE (es1017 only, for now)
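
For the semisync/GTID check and the account cleanup in the steps above, a minimal sketch follows. The function names are hypothetical and the functions only print statements for review. The semisync variables and status counters are assumed to be visible on the server, which on older MariaDB versions requires the semisync plugin to be installed.

```shell
#!/bin/sh
# Sketch only: prints SQL for review; nothing is executed against a server.
# All function names here are hypothetical helpers, not existing tooling.

# Read-only queries for the semisync and GTID state of one MariaDB server.
replication_status_sql() {
    printf '%s\n' \
        "SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync%';" \
        "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync%';" \
        "SELECT @@gtid_slave_pos, @@gtid_binlog_pos;" \
        "SHOW SLAVE STATUS;"
}

# Emit one DROP USER statement per host pattern for the given account name.
# Usage: drop_user_sql USER HOSTPATTERN...
drop_user_sql() {
    user=$1
    shift
    for host in "$@"; do
        printf "DROP USER '%s'@'%s';\n" "$user" "$host"
    done
}
```

For example, drop_user_sql repl '10.%' '208.80.152.%' '10.0.%' prints three DROP USER statements. Checking that no replica still connects with those accounts before running them is left to the operator.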