You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Analytics/Systems/Cluster/Hadoop/Alerts: Difference between revisions

imported>Elukey
{{Warn|content=Please note that the link to each section is used in puppet when defining the related alert, so if you change any of them, remember to follow up in puppet!}}



Revision as of 16:00, 5 February 2021

HDFS Namenode RPC length queue

The HDFS Namenode handles operations on HDFS via RPCs (getfileinfo, mkdir, etc.) and has a fixed number of worker threads dedicated to handling them. Every incoming RPC enters a queue and is then processed by a worker. If the queue grows too long, the Namenode starts to lag in answering clients and Datanode health checks, and it may end up thrashing due to heap pressure and GC activity. When Icinga alerts that the RPC queue is too long, it is usually sufficient to do the following:

ssh an-master1001.eqiad.wmnet

tail -f /var/log/hadoop-hdfs/hdfs-audit.log

You will see many entries logged every second, but it is usually very easy to spot a user making a large number of consecutive requests. Issues that happened in the past:

  • Too many getfileinfo RPCs sent (scanning directories with a ton of small files)
  • Too many small/temporary files created in a short burst (on the order of millions)
  • etc.

Once the user hammering the Namenode is identified, check yarn.wikimedia.org for anything running as that user, and kill it as soon as possible if the user doesn't answer within a few minutes. We don't care what the job is doing: the availability of the HDFS Namenode comes first :)
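
Spotting the noisy user by eye in a fast-scrolling log can be hard, so a small aggregation may help. The sketch below is not part of the original runbook: the function name top_rpc_users is hypothetical, and it assumes every audit line carries whitespace-separated ugi=<user> and cmd=<operation> fields, as in the stock HDFS audit log layout.

```shell
# Summarise HDFS audit-log lines read from stdin as "<count> <user> <cmd>",
# busiest first. Assumption: each line contains "ugi=<user>" and "cmd=<op>"
# fields separated by whitespace (tabs or spaces).
# The optional first argument is the number of rows to print (default 10).
top_rpc_users() {
    awk '{
        user = ""; cmd = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^ugi=/) user = substr($i, 5)
            if ($i ~ /^cmd=/) cmd = substr($i, 5)
        }
        if (user != "" && cmd != "") count[user " " cmd]++
    }
    END { for (key in count) print count[key], key }' \
        | sort -rn | head -n "${1:-10}"
}

# Possible usage on the Namenode host (sample size chosen arbitrarily):
#   tail -n 200000 /var/log/hadoop-hdfs/hdfs-audit.log | top_rpc_users 5
```

A user and command pair that dominates the top of this list is a good candidate for the job to look up in Yarn.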

HDFS topology check

The HDFS Namenode has a view of the racking details of the HDFS Datanodes, and it uses it to decide how to spread blocks and their replicas for the best reliability and availability. The racking details are set in puppet's hiera, and if a Datanode is missing from them for any reason (new node, accidental changes, etc.), the Namenode will put it in the "default rack", which is not optimal.

A good follow-up to this alarm is to:

1) SSH to an-master1001 (or, for a different cluster, the host where the Namenode runs), run sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology, and look for hosts in the default rack.

2) Check for new nodes in the Hadoop hiera config (hieradata/common.yaml in puppet).
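
With many racks, the topology output is long, so a filter can make step 1 quicker. This is only a sketch under assumptions: the function name hosts_in_default_rack is hypothetical, the rack label is assumed to be /default-rack (Hadoop's stock default), and the output is assumed to consist of "Rack: <name>" headers followed by indented "<ip:port> (<hostname>)" lines.

```shell
# Print the Datanodes listed under the default rack in
# "hdfs dfsadmin -printTopology" output read from stdin.
# Assumptions: rack headers look like "Rack: <name>", the default rack is
# named "/default-rack", and the hostname is the last field on each node line.
hosts_in_default_rack() {
    awk '
        /^Rack:/ { in_default = ($2 == "/default-rack"); next }
        in_default && NF > 0 {
            host = $NF
            gsub(/[()]/, "", host)   # drop the parentheses around the hostname
            print host
        }
    '
}

# Possible usage on the Namenode host:
#   sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology | hosts_in_default_rack
```

An empty result suggests that no Datanode is currently in the default rack.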

No active HDFS Namenode running

Normally there are two HDFS Namenodes running, one active and one standby. If neither is in the active state, we get an alert, since the Hadoop cluster cannot function properly.

A good follow-up to this alarm is to ssh to the Namenode hosts (for example, an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet for the main cluster) and check /var/log/hadoop-hdfs. You should find a log file related to what's happening; look in it for exceptions or errors.

To be sure that it is not a false alert, check the status of the Namenodes via:

sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet
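
The two answers can be combined into a single verdict. The helper below is a hypothetical sketch, not something shipped on the hosts; it assumes that -getServiceState prints a single word such as active or standby for each Namenode.

```shell
# Given the states reported by "hdfs haadmin -getServiceState" (one argument
# per Namenode), print a one-line verdict.
# Assumption: each state is a single word such as "active" or "standby".
summarize_ha_states() {
    active=0
    for state in "$@"; do
        if [ "$state" = "active" ]; then
            active=$((active + 1))
        fi
    done
    if [ "$active" -eq 1 ]; then
        echo "OK: one active Namenode"
    elif [ "$active" -eq 0 ]; then
        echo "CRITICAL: no active Namenode"
    else
        echo "CRITICAL: $active active Namenodes"
    fi
}

# Possible usage, reusing the commands above:
#   s1=$(sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet)
#   s2=$(sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet)
#   summarize_ha_states "$s1" "$s2"
```

If a command fails and returns nothing, its state is simply not counted as active, so the verdict errs on the side of reporting a problem.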

HDFS corrupt blocks

In this case, the HDFS Namenode is registering blocks that are corrupted. This is not necessarily bad, since it may be due to faulty Datanodes, so before worrying check:

  1. How many corrupt blocks there are. We have very sensitive alarms, and keep in mind that we handle millions of blocks.
  2. What files have corrupt blocks. This can be done via sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corrupt on the Hadoop master nodes (an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet for the main cluster).

Depending on how bad the situation is, fsck may or may not solve the problem (if needed, check how to run it to repair corrupted blocks). If the issue is related to a specific Datanode host, the host may need to be depooled by an SRE.

HDFS missing blocks

In this case, the HDFS Namenode is registering blocks that are missing, meaning that no replica of them is available (hence the data that they carry is not available at all). Some useful steps:

  1. Check how many missing blocks there are. We have very sensitive alarms, and keep in mind that we handle millions of blocks.
  2. Check what files have missing blocks. This can be done via sudo -u hdfs kerberos-run-command hdfs hdfs fsck / on the Hadoop master nodes (an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet for the main cluster), filtering for missing blocks.
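
The filtering in step 2 could look like the sketch below. It rests on an assumption that should be verified against the Hadoop version in use: that fsck reports each affected file on a line of the form "<path>: MISSING <n> blocks ...". The function name missing_block_files is hypothetical.

```shell
# Read "hdfs fsck /" output from stdin and print the unique paths of files
# reported with missing blocks.
# Assumption: affected files appear as "<path>: MISSING <n> blocks ...".
missing_block_files() {
    awk -F': MISSING ' 'NF > 1 { print $1 }' | sort -u
}

# Possible usage on a Hadoop master node:
#   sudo -u hdfs kerberos-run-command hdfs hdfs fsck / | missing_block_files
```

The resulting list of paths often shows quickly whether the problem is confined to one dataset or spread across the cluster.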

At this point there are two cases: either the blocks are definitely gone for some reason (in which case, look at HDFS tutorials about what to do, such as removing references to those files to fix the inconsistency), or they are temporarily gone (for example, if multiple Datanodes are down for network reasons).

Unhealthy Yarn Nodemanagers

On every Hadoop worker node there is a daemon called Yarn Nodemanager, which is responsible for managing vcores and memory on behalf of the Resource Manager. If multiple Nodemanagers are down, jobs are probably not being scheduled on the affected nodes, reducing the performance of the cluster. Check the https://yarn.wikimedia.org/cluster/nodes/unhealthy page to see which nodes are affected, and ssh to them to check the Nodemanager's logs (/var/log/hadoop-yarn/..)