Procedure to fold 2 partitions into one: https://phabricator.wikimedia.org/T278424#7020076
mkdir /srv/sqldata
mv /var/lib/mysql/* /srv/sqldata
umount /var/lib/mysql
umount /srv
lvremove /dev/an-coord1001-vg/mysql
lvextend -l +100%FREE /dev/an-coord1001-vg/srv
resize2fs /dev/an-coord1001-vg/srv
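The same sequence can be sketched as a dry-run helper that only prints the commands for review and runs nothing. The function name `fold_plan` and its parameters are hypothetical, introduced here for illustration. The defaults mirror the an-coord1001 names used above.

```shell
#!/bin/sh
# Dry-run sketch: prints the partition-fold commands, executes none of them.
# fold_plan and its arguments are illustrative assumptions, not an existing tool.
fold_plan() {
    vg="${1:-an-coord1001-vg}"   # volume group holding both logical volumes
    old_lv="${2:-mysql}"         # logical volume to remove
    keep_lv="${3:-srv}"          # logical volume that absorbs the freed space
    cat <<EOF
mkdir /srv/sqldata
mv /var/lib/mysql/* /srv/sqldata
umount /var/lib/mysql
umount /srv
lvremove /dev/${vg}/${old_lv}
lvextend -l +100%FREE /dev/${vg}/${keep_lv}
resize2fs /dev/${vg}/${keep_lv}
EOF
}

# Print the plan with the default names.
fold_plan
```

Printing the plan first makes it easy to check the device paths before anything destructive such as `lvremove` is typed by hand.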
Also had to change the mysql data directory: https://gerrit.wikimedia.org/r/c/operations/puppet/+/681358/2/hieradata/role/common/analytics_cluster/coordinator.yaml
Got a ping on https://phabricator.wikimedia.org/T280367 - MySQL partition on an-coord1001: sudden change in growth rate since Apr 14th
The issue has been resolved, as the host overview graph shows: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&var-server=an-coord1001&var-datasource=eqiad%20prometheus%2Fops&from=1618272000000&to=1619913599000
Priority for today:
I had originally thought that we could do the upgrade with everything online, rather than in a maintenance window with read-only safe mode etc. I can see the benefit of safe mode for protecting against data loss, and there is always a chance that a reimage goes horribly wrong, but since all of this happens on a standby we shouldn't have to take writes offline.
What would happen if we took a snapshot, data kept getting written, and we then had to restore to the snapshot? There would be some unreferenceable data on the workers, but what data would actually be lost?
In safe mode, what would
Data builds up on Kafka
Need to understand all the data that flows into HDFS
How to drain the cluster?
Created a Kerberos principal for a user. It was as easy as running create and adding krb: present to data.yaml: https://phabricator.wikimedia.org/T281809