Analytics/Systems/Cluster/Hadoop/Alerts

HDFS Namenode RPC queue length

The HDFS Namenode handles operations on HDFS via RPCs (getfileinfo, mkdir, etc..) and it has a fixed number of worker threads dedicated to handling the incoming RPCs. Each RPC enters a queue and is then processed by a worker. If the queue length grows too much, the HDFS Namenode starts to lag in answering clients and Datanode health checks, and it may also end up thrashing due to heap pressure and GC activity. When Icinga alerts for the RPC queue being too long, it is usually sufficient to do the following:

ssh an-master1001.eqiad.wmnet

tail -f /var/log/hadoop-hdfs/hdfs-audit.log

You will see a ton of entries logged every second, but it is usually easy to spot a user making a large number of consecutive requests. Issues that happened in the past:

  • Too many getfileinfo RPCs sent (scanning directories with a ton of small files)
  • Too many small/temporary files created in a short burst (order of Millions)
  • etc..

Once the user hammering the Namenode is identified, check on yarn.wikimedia.org if there is something running for the same user, and kill it asap if the user doesn't answer within a few minutes. We don't care what the job is doing, the availability of the HDFS Namenode comes first :)
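
If eyeballing the audit log is not enough, a quick aggregation helps to find the top offender. This is a minimal sketch assuming the standard HDFS audit log format (whitespace-separated ugi= and cmd= fields); adjust the path and field names to what you actually see in the log:

# top users by number of audited RPCs in the last 100k audit log lines
tail -n 100000 /var/log/hadoop-hdfs/hdfs-audit.log | grep -o 'ugi=[^[:space:]]*' | sort | uniq -c | sort -rn | head
# top RPC types, to understand what the offending user is doing
tail -n 100000 /var/log/hadoop-hdfs/hdfs-audit.log | grep -o 'cmd=[^[:space:]]*' | sort | uniq -c | sort -rn | head

If the offending workload shows up in Yarn, yarn application -kill <application_id> (with the application id taken from yarn.wikimedia.org or yarn application -list) stops it.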

HDFS topology check

The HDFS Namenode has a view of the racking details of the HDFS Datanodes, and it uses them to establish how to best spread blocks and their replicas for reliability and availability. The racking details are set in puppet's hiera, and if a Datanode is not added to them for any reason (new node, accidental changes, etc..) the Namenode will put it in the "default rack", which is not optimal.

A good follow-up to this alarm is to:

1) SSH to an-master1001 (or, if it is a different cluster, check where the Namenode runs) and run sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology, looking for hosts in the default rack (see the sketch below).

2) Check for new nodes in the Hadoop hiera config (hieradata/common.yaml in puppet).
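
As a quick filter, something like the following should work; this is a sketch that assumes the Namenode reports unracked hosts under Hadoop's standard /default-rack label:

sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology | grep -A 20 'default-rack'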

No active HDFS Namenode running

Normally there are two HDFS Namenodes running, one active and one standby. If neither of them is in the active state, we get an alert, since the Hadoop cluster cannot function properly.

A good follow-up to this alarm is to ssh to the Namenode hosts (for example, an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet for the main cluster) and check /var/log/hadoop-hdfs. You should find a log file related to what's happening; look for exceptions or errors.

To be sure that it is not a false alert, check the status of the Namenodes via:

sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

HDFS corrupt blocks

In this case, the HDFS Namenode is registering blocks that are corrupted. This is not necessarily bad (it may be due to faulty Datanodes), so before worrying check:

  1. How many corrupt blocks there are. We have very sensitive alarms, and keep in mind that we handle millions of blocks.
  2. What files have corrupt blocks. This can be done via sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks on the Hadoop master nodes (an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet for the main cluster).
  3. If there is a rolling restart of Hadoop HDFS Datanodes/Namenodes in progress, or if one was performed recently. In the past this was a source of false positives due to the JMX metric temporarily reporting weird values. In this case always trust what the fsck command above tells you; from past experience it is way more reliable than the JMX metric.

Depending on how bad the situation is, fsck may or may not solve the problem (check its documentation for how to run it to repair corrupt blocks if needed). If the issue is related to a specific Datanode host, it may need to be depooled by an SRE.
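
Once you have the list of affected files, inspecting a single one tells you which blocks are corrupt and which Datanodes hold them. A minimal sketch using standard fsck options (the path is just a placeholder):

sudo -u hdfs kerberos-run-command hdfs hdfs fsck /path/to/affected/file -files -blocks -locations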

HDFS missing blocks

In this case, the HDFS Namenode is registering blocks that are missing, namely blocks for which no replica is available (hence the data that they carry is not available at all). Some useful steps:

  1. Check how many missing blocks there are. We have very sensitive alarms, and keep in mind that we handle millions of blocks.
  2. What files have missing blocks. This can be done via sudo -u hdfs kerberos-run-command hdfs hdfs fsck / on the Hadoop master nodes (an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet for the main cluster), filtering for missing blocks.

At this point there are two cases: either the blocks are definitely gone for some reason (in that case check the HDFS documentation for what to do, like removing the references to those files to fix the inconsistency), or they are temporarily gone (for example if multiple Datanodes are down for network reasons).
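
If the data is confirmed to be unrecoverable, fsck can also drop the references to the broken files. This is destructive, so double check with the data owners before running it; a sketch (the path is just a placeholder):

# list the affected files first
sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks
# only if the data is confirmed lost, remove the broken file references
sudo -u hdfs kerberos-run-command hdfs hdfs fsck /path/to/affected/dir -delete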

HDFS total files and heap size

We have always used https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.3/bk_command-line-installation/content/configuring-namenode-heap-size.html to decide how to increase the heap size of the HDFS Namenodes after a certain threshold of files stored. This alarm is meant as a reminder to bump the heap size when needed.

Next steps:

Number of files (millions)    Total Java Heap (Xmx and Xms)    Young Generation Size (-XX:NewSize -XX:MaxNewSize)
50-70                         36889m                           4352m
70-100                        52659m                           6144m
100-125                       65612m                           7680m
125-150                       78566m                           8960m
150-200                       104473m                          8960m

The above settings are for the CMS GC; since we are using G1GC, please also refer to the info in https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.2/bk_hdfs-administration/content/ch_g1gc_garbage_collector_tech_preview.html.
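
For reference, the total heap ends up in the Namenode JVM options; in our case the values are managed via puppet (Hadoop hiera config), not edited by hand on the hosts. A hypothetical sketch of what the resulting setting would look like in hadoop-env.sh, using the 70-100 million files row above:

# hypothetical example; the real values live in puppet, do not edit hadoop-env.sh manually
export HADOOP_NAMENODE_OPTS="-Xms52659m -Xmx52659m ${HADOOP_NAMENODE_OPTS}"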

Unhealthy Yarn Nodemanagers

On every Hadoop worker node there is a daemon called Yarn Nodemanager, which is responsible for managing vcores and memory on behalf of the Resourcemanager. If multiple Nodemanagers are down, jobs are probably not being scheduled on the affected nodes, reducing the performance of the cluster. Check the https://yarn.wikimedia.org/cluster/nodes/unhealthy page to see which nodes are affected, and ssh to them to check the Nodemanager's logs (/var/log/hadoop-yarn/..)
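
The unhealthy nodes can also be listed from the command line. A minimal sketch, assuming it is run from a host with the Yarn client configuration and valid credentials:

yarn node -list -states UNHEALTHY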

HDFS Namenode backup age

On the HDFS Namenode standby host (an-master1002.eqiad.wmnet) we run a systemd timer called hadoop-namenode-backup-fetchimage that periodically executes hdfs dfsadmin -fetchImage. The hdfs command pulls the most recent HDFS FSImage from the HDFS Namenode active host (an-master1001.eqiad.wmnet) and saves it under a specific directory. Please ssh to an-master1002 and check the logs of the timer with journalctl -u hadoop-namenode-backup-fetchimage to see what the current problem is.
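
A sketch of what that check can look like (the unit name is the one mentioned above; adjust the glob if it differs on the host):

# when did the timer last fire, and when will it fire next?
systemctl list-timers 'hadoop-namenode-backup-fetchimage*'
# recent runs of the backup job
journalctl -u hadoop-namenode-backup-fetchimage --since '2 days ago'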

HDFS Namenode process

There are two HDFS Namenodes for the main cluster, running on an-master100[1,2].eqiad.wmnet. They control the metadata and the state of the distributed file system, and they also handle client traffic. They are basically the head of the HDFS file system: without them no read/write operation can be performed on HDFS. Usually they are both up, one in the active state and the other one in the standby state, and you can check them via:

elukey@an-master1001:~$  sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
active
elukey@an-master1001:~$  sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet
standby

If one of them goes down, there are two use cases:

  • It is the active, so a failover happens and the previous standby is elected the new active.
  • It is the standby, so no failover happens.

In both cases, if at least one Namenode is up, the cluster should be able to keep working fine. It is nonetheless a situation that needs to be fixed asap, so here are a couple of things to do first:

  • Figure out how many Namenodes are down, and if one is active. You can use the commands above for a quick check. If both Namenodes are down we have another alert, so it should be clear in case.
  • Check logs on the host with the failed Namenode. You should be able to find them in /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master100[1,2].log. Look for anything weird, especially exceptions etc.. (see the sketch after this list).
  • Check if other alerts related to the Namenodes are pending. For example, the RPC queue getting hammered (see above) could well be the root cause of why the Namenode is down.
  • Check the Namenode panel in grafana to see if metrics can help.
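
A minimal sketch to surface recent errors in the Namenode log (the file name follows the pattern mentioned in the list above; adjust the host name):

grep -iE 'error|exception' /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log | tail -n 50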

This situation will likely need an SRE, so feel free to escalate early! It might take time for people to join, and you can do some investigation in the meantime.

HDFS ZKFC process

There are two HDFS ZKFC processes, running on the master nodes an-master100[1,2].eqiad.wmnet. Their job is to periodically health-check the HDFS Namenodes, and to trigger automatic failovers in case something fails (using Zookeeper to hold locks). Usually they are both up, and if one of them is down it may not be a big issue (the Namenodes can live without them for a bit). The first thing to do is to ssh to the master nodes and check /var/log/hadoop-hdfs/hadoop-hdfs-zkfc-an-master100[1,2].log, looking for Exceptions or similar problems.
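
Before digging into the logs, a quick check that the process is actually running; the systemd unit name here is an assumption based on the log file name, so adjust it if it differs on the host:

systemctl status hadoop-hdfs-zkfc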

This situation will likely need an SRE, so feel free to escalate early! It might take time for people to join, and you can do some investigation in the meantime.

HDFS Datanode process

There are a lot of HDFS Datanodes in the cluster, one on each worker node, so nothing major happens if a few of them are down. The Datanodes are responsible for managing the HDFS blocks on behalf of the Namenodes, providing them to clients when needed (or accepting writes from clients for new blocks). The clients first have to talk to the Namenodes to authenticate and get permission to perform a certain operation on HDFS (read/write/etc..), and then they can contact the Datanodes to get the real data.

The first thing to do is to figure out how many Datanodes are down; if the figure is up to 3/4 nodes it is not the end of the world (since we have 60+ nodes), but if more please ping an SRE as soon as possible. The most common cause of a failure is a Java OOM, which can be quickly checked using something like:

elukey@analytics1061:~$ grep java.lang.OutOfMemory /var/log/hadoop-hdfs/hadoop-hdfs-datanode-analytics1061.log -B 2 --color

A deeper inspection of the Datanode's logs will surely give some more information. Check also the Datanodes panel in grafana to see if metrics can help.
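
To get a quick count of live/dead Datanodes from the Namenode's point of view, a sketch to run on a master node (the exact wording of the report headers may vary across Hadoop versions):

sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -report | grep -iA 2 'dead datanodes'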

HDFS Journalnode process

The HDFS Namenodes (see above, the head of HDFS) use a special group of daemons called Journalnodes to keep a shared edit log, in order to facilitate failover when needed (so to implement high availability). We have 5 Journalnode daemons running on some worker nodes in the cluster (see hieradata/common.yaml and look for journalnode_hosts for the main cluster), and we need a minimum of three to keep the Namenodes working. If the quorum of three Journalnodes cannot be reached, the Namenodes shut down.

Things to do:

  • Figure out how many Journalnodes are down (see the sketch after this list). If one or two, the situation should be ok since the cluster should not be impacted. More than two is a big problem, and it will likely trigger other alerts (like the Namenode ones).
  • Check logs for Exceptions:
    elukey@an-worker1080:~$ less /var/log/hadoop-hdfs/hadoop-hdfs-journalnode-an-worker1080.log
    
  • Check in the Namenode UI how many Journalnodes are active using the following ssh tunnel: ssh -L 50470:localhost:50470 an-master1001.eqiad.wmnet
  • Check the Journalnodes panel in grafana to see if metrics can help.
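
To check whether the Journalnode process is running on a given host, a sketch; the systemd unit name is an assumption based on the log file name above, so adjust it if needed:

systemctl status hadoop-hdfs-journalnode
grep -iE 'error|exception' /var/log/hadoop-hdfs/hadoop-hdfs-journalnode-$(hostname).log | tail -n 20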

Yarn Resourcemanager process

There are two Yarn Resourcemanager processes in the cluster, running on the master nodes an-master100[1,2].eqiad.wmnet. Their job is to manage the overall compute resources of the cluster (cpus/ram) that all the worker nodes offer. They communicate with the Yarn Nodemanager processes on each worker node, getting info from them about available resources and allocating what is requested (if doable) by clients. Usually they are both up, one in the active state and the other one in the standby state, and you can check them via:

elukey@an-master1001:~$  sudo -u hdfs /usr/bin/yarn rmadmin -getServiceState an-master1001-eqiad-wmnet
active
elukey@an-master1001:~$  sudo -u hdfs /usr/bin/yarn rmadmin -getServiceState an-master1002-eqiad-wmnet
standby

If one of them goes down, there are two use cases:

  • It is the active, so a failover happens and the previous standby is elected the new active.
  • It is the standby, so no failover happens.

In both cases, if at least one Resourcemanager is up, the cluster should be able to keep working fine. It is nonetheless a situation that needs to be fixed asap, so here are a couple of things to do first:

  • Figure out how many Resourcemanagers are down, and if one is active.
  • Check logs on the host with the failed Resourcemanager. You should be able to find them in /var/log/hadoop-yarn/hadoop-yarn-resourcemanager-an-master100[1,2].log. Look for anything weird, especially exceptions etc..
  • Check the Resourcemanager panel in grafana to see if metrics can help.

This situation will likely need an SRE, so feel free to escalate early! It might take time for people to join, and you can do some investigation in the meantime.

Yarn Nodemanager process

There are a lot of Yarn Nodemanagers in the cluster, one on each worker node, so nothing major happens if a few of them are down. The Yarn Nodemanagers are responsible for the allocation and management of local resources on the worker (cpus/ram). They can be restarted anytime without affecting the running containers/jvms that they create on behalf of the Yarn Resourcemanagers.

The first thing to do is to figure out how many Nodemanagers are down; if the figure is up to 3/4 nodes it is not the end of the world (since we have 60+ nodes), but if more please ping an SRE as soon as possible. The most common cause of a failure is a Java OOM, which can be quickly checked using something like:

elukey@analytics1061:~$ grep java.lang.OutOfMemory /var/log/hadoop-yarn/hadoop-yarn-nodemanager-analytics1061.log -B 2 --color

A deeper inspection of the Nodemanager's logs will surely give some more information. Check also the Nodemanagers panel in grafana to see if metrics can help.
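
To see which Nodemanagers the Resourcemanager considers lost or unhealthy, a sketch (run from a host with the Yarn client configuration and valid credentials):

yarn node -list -states LOST,UNHEALTHY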

Mapreduce Historyserver process

The Mapreduce Historyserver runs only on an-master1001.eqiad.wmnet, and it offers an API for jobs about the status of finished applications. It is not critical for the health of the cluster, but it should nonetheless be fixed as soon as possible.

Check logs on the host to look for any sign of Exception or weird behavior:

elukey@an-master1001:~$ less /var/log/hadoop-mapreduce/mapred-mapred-historyserver-an-master1001.out
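
To quickly verify whether the Historyserver is answering requests, its REST API can be queried. This is a sketch that assumes the default Historyserver web port (19888); the port and the authentication requirements (Kerberos/SPNEGO) may differ in our setup:

curl -s --negotiate -u : http://an-master1001.eqiad.wmnet:19888/ws/v1/history/info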