User:Razzi/Plan to drain hadoop cluster

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html

In production, draining the cluster starts with shutting off input: disable the camus timers on an-launcher.

With those disabled, no new data flows in.

Some jobs, like refine, run on a schedule.

The cluster should drain in less than an hour.

Kafka has 7-day retention and acts as a buffer while ingestion is stopped.

Now that we have the capacity scheduler, queues can be disabled.
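
As a rough illustration of disabling a queue: the capacity scheduler reads a per-queue `state` property from capacity-scheduler.xml, and a queue set to STOPPED accepts no new applications while running ones finish. The sketch below only prints the property for a given queue path. The helper name `queue_stop_property` and the queue path `root.default` are placeholders, not taken from this cluster's configuration.

```shell
# Sketch: print the capacity-scheduler.xml property that marks a queue STOPPED.
# The queue path used here (root.default) is a placeholder, not this cluster's real queue.
queue_stop_property() {
  printf 'yarn.scheduler.capacity.%s.state=STOPPED\n' "$1"
}

queue_stop_property root.default

# Once the property is set in capacity-scheduler.xml, the ResourceManager
# can reload queue definitions without a restart:
#   yarn rmadmin -refreshQueues
```

How the property reaches capacity-scheduler.xml (for example through puppet) is outside this sketch.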

Plan:

  • disable puppet on an-master1002
    • sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster'
  • Disable jobs on an-launcher1002
    • sudo systemctl stop 'camus-*'
    • sudo systemctl stop 'drop-*'
    • sudo systemctl stop 'hdfs-*'
    • sudo systemctl stop 'mediawiki-*'
    • sudo systemctl stop 'refine_*'
    • sudo systemctl stop 'refinery-*'
    • sudo systemctl stop 'reportupdater-*'
  • disable queue
    • sudo systemctl stop hadoop-yarn-resourcemanager[1]
  • kill yarn applications
    • for jobId in $(yarn application -list | awk 'NR > 2 { print $1 }'); do yarn application -kill "$jobId"; done
  • enable safe mode
    • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
  • checkpoint
    • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
  • create snapshot tar
    • sudo su
    • cd /srv/hadoop/namenode
    • tar -czf /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current
  • copy snapshot to elsewhere
    • (from my personal computer)
    • scp -3 an-master1001.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(gdate --iso-8601).tar.gz thorium.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(gdate --iso-8601).tar.gz
      • Based on scp-ing a test file, this will take about 30 minutes; that's acceptable, but if there's a faster way (distcp?) it'd be good to know
  • change uids
  • reimage
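
The "Disable jobs on an-launcher1002" step above could be wrapped in a small loop, sketched here. The `DRY_RUN` switch and the `stop_units` function are hypothetical additions for illustration. By default the sketch only prints the commands it would run.

```shell
# Sketch only: stop the unit groups listed in the plan.
# DRY_RUN is a hypothetical switch; set DRY_RUN=0 to actually call systemctl.
DRY_RUN="${DRY_RUN:-1}"

stop_units() {
  for pattern in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      # Print the command instead of running it.
      printf "sudo systemctl stop '%s'\n" "$pattern"
    else
      # Quoted so the shell passes the glob to systemctl unexpanded.
      sudo systemctl stop "$pattern"
    fi
  done
}

stop_units 'camus-*' 'drop-*' 'hdfs-*' 'mediawiki-*' 'refine_*' 'refinery-*' 'reportupdater-*'
```

Running it once in dry-run mode gives a chance to check the list of patterns before anything is stopped.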


stop the cluster

make a backup

change uids

reimage
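
Before the "make a backup" step, it may be worth confirming that safe mode is really on. The sketch below parses the output of `hdfs dfsadmin -safemode get`, assuming it reports "Safe mode is ON" or "Safe mode is OFF" (one line per NameNode in an HA setup). The function name `all_in_safemode` is hypothetical, and the example feeds it canned output so that it runs without a cluster.

```shell
# Sketch: read `hdfs dfsadmin -safemode get` output on stdin and succeed
# only if at least one NameNode reports ON and none reports OFF.
all_in_safemode() {
  output=$(cat)
  printf '%s\n' "$output" | grep -q 'Safe mode is ON' || return 1
  if printf '%s\n' "$output" | grep -q 'Safe mode is OFF'; then
    return 1
  fi
  return 0
}

# Canned output (assumed format) so the sketch runs without a cluster.
if printf 'Safe mode is ON\n' | all_in_safemode; then
  echo "safe mode confirmed; OK to checkpoint and archive"
fi

# Against the real cluster this would be something like:
#   sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode get | all_in_safemode
```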