Please note that every section's link is referenced in Puppet when defining the related alert, so if you change any of them, remember to follow up in Puppet as well!
HDFS Namenode RPC length queue
The HDFS Namenode handles operations on HDFS via RPCs (getfileinfo, mkdir, etc.) and has a fixed number of worker threads dedicated to handling incoming RPCs. Every RPC enters a queue and is then processed by a worker. If the queue length grows too much, the HDFS Namenode starts to lag in answering clients and Datanode health checks, and it may also end up thrashing due to heap pressure and GC activity. When Icinga alerts that the RPC queue is too long, it is usually sufficient to do the following:
ssh an-master1001.eqiad.wmnet tail -f /var/log/hadoop-hdfs/hdfs-audit.log
You will see a ton of entries logged every second, but it should usually be easy to spot a user making a ton of consecutive requests. Issues seen in the past:
- Too many getfileinfo RPCs sent (scanning directories with a ton of small files)
- Too many small/temporary files created in a short burst (on the order of millions)
Once the user hammering the Namenode is identified, check on yarn.wikimedia.org whether something is running for the same user, and kill it ASAP if the user doesn't answer within a few minutes. We don't care what the job is doing, the availability of the HDFS Namenode comes first :)
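The triage above can be sketched as a small shell helper that ranks users by request volume. The `ugi=` field name comes from the standard hdfs-audit.log line format; the helper name and the line count are illustrative, not an existing tool:

```shell
# Hypothetical helper: rank users by number of audit-log entries, to spot
# who is flooding the Namenode with RPCs. Feed it the audit log, e.g.:
#   tail -n 100000 /var/log/hadoop-hdfs/hdfs-audit.log | top_audit_users
top_audit_users() {
    # Each audit line carries a "ugi=<user>" field; count occurrences per user.
    grep -o 'ugi=[^ ]*' | sort | uniq -c | sort -rn | head -10
}
```

The top entry of the output is usually the user to chase on yarn.wikimedia.org.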
HDFS topology check
The HDFS Namenode has a view of the racking details of the HDFS Datanodes, and uses them to establish how best to spread blocks and their replicas for reliability and availability. The racking details are set in Puppet's Hiera, and if a Datanode is not added to them for any reason (new node, accidental changes, etc.) the Namenode will put it in the "default rack", which is not optimal.
A good follow up to this alarm is to:
1) SSH to an-master1001 (or, if it is a different cluster, to wherever the Namenode runs), run
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology
and look for hosts in the default rack.
2) Check for new nodes in the Hadoop Hiera config (hieradata/common.yaml in puppet).
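Step 1 can be scripted; here is a sketch. The `/default-rack` label is Hadoop's standard fallback rack name, while the helper name is illustrative:

```shell
# Hypothetical helper: print only the Datanodes that sit under the default
# rack in `hdfs dfsadmin -printTopology` output, e.g.:
#   sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology | default_rack_hosts
default_rack_hosts() {
    # Print the lines after a "Rack: /default-rack" header,
    # stopping at the next rack header.
    awk '/^Rack: \/default-rack/ {in_default=1; next}
         /^Rack:/               {in_default=0}
         in_default'
}
```

An empty output means every Datanode has proper racking details in Hiera.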
No active HDFS Namenode running
Normally there are two HDFS Namenodes running, one active and one standby. If neither of them is in the active state, we get an alert, since the Hadoop cluster cannot function properly.
A good follow up to this alarm is to ssh to the Namenode hosts (for example, an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet for the main cluster) and check /var/log/hadoop-hdfs. You should find a log file related to what's happening; look for exceptions or errors.
To be sure that it is not a false alert, check the status of the Namenodes via:
sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet
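The two checks can be combined into a small sketch. The helper name and messages are illustrative; the states reported by haadmin are `active` and `standby`:

```shell
# Hypothetical helper: given the service states of both Namenodes, report
# whether the "no active Namenode" alert is real. Usage sketch:
#   s1=$(sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet)
#   s2=$(sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet)
#   check_active "$s1" "$s2"
check_active() {
    for state in "$@"; do
        if [ "$state" = "active" ]; then
            echo "OK: one Namenode is active"
            return 0
        fi
    done
    echo "CRITICAL: no active Namenode"
    return 1
}
```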
HDFS corrupt blocks
In this case, the HDFS Namenode is registering blocks that are corrupted. This is not necessarily bad; it may be due to faulty Datanodes. Before worrying, check:
- How many corrupt blocks there are. We have very sensitive alarms, and keep in mind that we handle millions of blocks.
- What files have corrupt blocks. This can be done via
sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks
on the Hadoop master nodes (an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet for the main cluster).
- Whether a roll restart of the Hadoop HDFS Datanodes is in progress. In the past this was a source of false positives, due to the JMX metric temporarily reporting weird values. In this case, always trust what the fsck command above tells you: from past experience, it is way more reliable than the JMX metric.
Depending on how bad the situation is, fsck may or may not solve the problem (check how to run it to repair corrupt blocks if needed). If the issue is related to a specific Datanode host, it may need to be depooled by an SRE.
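For reference, the stock fsck repair options look like this. This is a sketch of the escalation path: `-move` relocates the affected files to /lost+found on HDFS and `-delete` removes them outright, so only use them once the data is confirmed expendable:

```shell
# List only the corrupt files (read-only, safe).
sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks

# Move the files with corrupt blocks to /lost+found on HDFS.
sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -move

# Delete the files with corrupt blocks outright, clearing the inconsistency.
sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -delete
```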
HDFS missing blocks
In this case, the HDFS Namenode is registering blocks that are missing, namely blocks for which no replica is available (hence the data they carry is not available at all). Some useful steps:
- Check how many missing blocks there are. We have very sensitive alarms, and keep in mind that we handle millions of blocks.
- What files have missing blocks. This can be done via
sudo -u hdfs kerberos-run-command hdfs hdfs fsck /
on the Hadoop master nodes (an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet for the main cluster), filtering for missing blocks.
At this point there are two cases: either the blocks are definitely gone for some reason (in which case check HDFS tutorials about what to do, like removing references to those files to fix the inconsistency), or they are temporarily gone (for example if multiple Datanodes are down for network reasons).
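Since the full fsck output is huge, filtering it down helps. The `MISSING` marker is what stock fsck prints for files with unavailable blocks; the helper name is illustrative:

```shell
# Hypothetical helper: keep only the fsck lines about missing blocks, e.g.:
#   sudo -u hdfs kerberos-run-command hdfs hdfs fsck / | missing_block_lines
missing_block_lines() {
    grep -i 'missing'
}
```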
Unhealthy Yarn Nodemanagers
On every Hadoop worker node there is a daemon called the Yarn Nodemanager, which is responsible for managing vcores and memory on behalf of the Resource Manager. If multiple Nodemanagers are down, jobs are probably not being scheduled on the affected nodes, reducing the performance of the cluster. Check the https://yarn.wikimedia.org/cluster/nodes/unhealthy page to see which nodes are affected, and ssh to them to check the Nodemanager's logs.
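The same information is available from the Yarn CLI on any host with a Yarn client configured. A sketch; the `-states` flag of the stock Yarn CLI takes a comma-separated list of node states:

```shell
# List Nodemanagers currently reported as unhealthy by the Resource Manager.
yarn node -list -states UNHEALTHY
```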
HDFS Namenode backup age
On the HDFS Namenode standby host (an-master1002.eqiad.wmnet) we run a systemd timer called hadoop-namenode-backup-fetchimage that periodically executes
hdfs dfsadmin -fetchImage
The hdfs command pulls the most recent HDFS FSImage from the HDFS Namenode active host (an-master1001.eqiad.wmnet) and saves it under a specific directory. Please ssh to an-master1002 and check the logs of the timer with
journalctl -u hadoop-namenode-backup-fetchimage
to see what the current problem is.
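A quick sketch of the check. The backup directory path below is a placeholder, not the real configured location; check the timer's unit definition for the actual path:

```shell
# Recent runs of the backup timer.
journalctl -u hadoop-namenode-backup-fetchimage --since '3 days ago'

# Verify that a reasonably fresh FSImage actually landed on disk.
# NOTE: the directory below is a placeholder, not the real configured path.
ls -lt /path/to/namenode-backup/ | head
```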