User:Razzi/new plan for reimaging an-masters
Update the haadmin commands to use kerberos-run-command
Respond: the purpose of the failover is just to test that both nodes are healthy
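A sketch of what the kerberos-wrapped haadmin calls might look like; the namenode service IDs (`an-master1001-eqiad-wmnet`, `an-master1002-eqiad-wmnet`) are assumptions and should be checked against the cluster's `hdfs-site.xml`:

```shell
# Check which namenode is active/standby (service IDs are assumed, verify in hdfs-site.xml)
sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

# Trigger the test failover, again as hdfs via kerberos-run-command
sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
```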
Need to add the timers to the systemctl commands; testing this out now:
> sudo systemctl list-units 'camus-*.timer'
ok looks good
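For the reimage itself the timers presumably need to be stopped, not just listed; a minimal sketch, assuming the camus units are the ones to pause:

```shell
# List the camus timers and their state first
sudo systemctl list-units 'camus-*.timer'

# Stop them for the duration of the maintenance (the glob matches loaded units only)
sudo systemctl stop 'camus-*.timer'
```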
Need to add a step to contact the Search team and ask them to pause their jobs
Remove the part of the plan about stopping Oozie coordinators
Need to make sense of:
This command is a little bit brutal; what we could do is something like:
- check `profile::analytics::cluster::hadoop::yarn_capacity_scheduler` and add something like `'yarn.scheduler.capacity.root.default.state' => 'STOPPED'`
- send a puppet patch and merge it (but at that point puppet is disabled, so you either add it manually or merge it beforehand)
- execute `sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues`
The above will instruct the Yarn RMs not to accept any new jobs.
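As a hedged sketch (the exact hiera structure under `profile::analytics::cluster::hadoop::yarn_capacity_scheduler` is an assumption), the override could look something like:

```yaml
# Hypothetical hiera fragment: mark the default queue STOPPED so Yarn
# refuses new submissions; the exact key nesting is an assumption.
profile::analytics::cluster::hadoop::yarn_capacity_scheduler:
  'yarn.scheduler.capacity.root.default.state': 'STOPPED'
```

After merging (or applying it by hand while puppet is disabled), `sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues` makes the RMs pick it up.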
! Need to learn about transfer.py and incorporate it into the plan
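A hypothetical invocation, assuming transfer.py is run from a cumin host and that the backup lives under /srv/backup; the host, paths, and any flags are assumptions to be checked against `transfer.py --help`:

```shell
# Hypothetical: copy the namenode metadata backup to a stat100x host before the reimage
transfer.py an-master1001.eqiad.wmnet:/srv/backup stat1004.eqiad.wmnet:/srv/
```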
Add a step to the plan to stop the hadoop-namenode timers
Add a step to the plan to check for hdfs / yarn processes before changing uids/gids
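These two checks could look something like the following; the timer unit glob is an assumption, and the ps output columns are just illustrative:

```shell
# Stop the namenode-related timers (unit name pattern is an assumption)
sudo systemctl stop 'hadoop-namenode-*.timer'

# Make sure nothing is still running as hdfs or yarn before touching uids/gids
ps -o pid,user,cmd -u hdfs,yarn
```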
Need to make sense of:
> Reimage an-master1002
> - `sudo -i wmf-auto-reimage-host -p T278423 an-master1002.eqiad.wmnet`
> - Will have to confirm the partitions since we're using reuse-parts-test
At this point I would probably check that all daemons are OK, logs are fine, metrics, etc. It is OK to do the maintenance over two separate days, to leave some time for any unexpected issue to come up. We could also test a failover from 1001 (still not reimaged) to 1002 and leave it running for a few hours while monitoring metrics. I know that on paper we don't expect any issues from the previous tests, but this is production and there may be some corner cases that we were not able to test before.
So to summarize: at this point I'd check the status of all the services and just re-enable timers/jobs/etc. Then after a bit I'd fail over to 1002 and test for a few hours that everything works as expected (heap pressure, logs, etc.)
Yes, 1001 should be active here
Update the plan to restart puppet
Update the plan to use a stat100x host for the backup