You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Analytics/Systems/Cluster/Hadoop/Alerts


Revision as of 11:28, 19 October 2021

HDFS Namenode RPC length queue

The HDFS Namenode handles operations on HDFS via RPCs (getfileinfo, mkdir, etc.) and has a fixed number of worker threads dedicated to handling incoming RPCs. Each RPC enters a queue and is then processed by a worker. If the queue length grows too much, the HDFS Namenode starts to lag when answering clients and Datanode health checks, and it may also end up thrashing due to heap pressure and GC activity. When Icinga alerts that the RPC queue is too long, it is usually sufficient to do the following:

ssh an-master1001.eqiad.wmnet

tail -f /var/log/hadoop-hdfs/hdfs-audit.log

You will see a ton of entries logged every second, but it should usually be easy to spot a user making a large number of consecutive requests. Issues that have happened in the past include:

  • Too many getfileinfo RPCs sent (scanning directories containing a huge number of small files)
  • Too many small/temporary files created in a short burst (on the order of millions)
  • etc.

Once the user hammering the Namenode is identified, check in yarn.wikimedia.org whether something is running for the same user, and kill it ASAP if the user doesn't answer within a few minutes. We don't care what the job is doing; the availability of the HDFS Namenode comes first :)
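To speed up spotting the heavy user, the audit log can be tallied per user. This is a sketch; it assumes the standard HDFS audit log format, where each entry contains a field like "ugi=someuser", so adjust the extraction if the layout differs.

```shell
# top_audit_users: print the top N users by audit-log entry count.
# Reads audit log lines on stdin; each line is assumed to contain a
# "ugi=<user>" field (standard HDFS audit log format).
top_audit_users() {
  grep -o 'ugi=[^ ]*' | sort | uniq -c | sort -nr | head -n "${1:-10}"
}

# Typical invocation on the Namenode host (an-master1001):
#   tail -n 100000 /var/log/hadoop-hdfs/hdfs-audit.log | top_audit_users 10
```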

HDFS topology check

The HDFS Namenode has a view of the racking details of the HDFS Datanodes, which it uses to decide how best to spread blocks and their replicas for the best reliability and availability. The racking details are set in Puppet's Hiera, and if a Datanode is missing from them for any reason (new node, accidental changes, etc.) the Namenode will place it in the "default rack", which is not optimal.

A good follow up to this alarm is to:

1) SSH to an-master1001 (or, for a different cluster, check where the Namenodes run), run sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology and look for hosts in the default rack.

2) Check for new nodes in the Hadoop hiera config (hieradata/common.yaml in puppet).
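Step 1 can be partly automated by filtering the printTopology output. A sketch, assuming the usual "Rack: <name>" header format of dfsadmin -printTopology, with indented datanode lines underneath each header:

```shell
# default_rack_hosts: list the datanodes that the Namenode has placed
# in the default rack. Reads "hdfs dfsadmin -printTopology" output on
# stdin; the format is assumed to be "Rack: <name>" headers followed by
# indented "<ip>:<port> (<hostname>)" lines.
default_rack_hosts() {
  awk '/^Rack:/ { in_default = ($2 ~ /default-rack/) }
       !/^Rack:/ && in_default && NF { print $1 }'
}

# Typical invocation on the Namenode host:
#   sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology | default_rack_hosts
```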

No active HDFS Namenode running

Normally there are two HDFS Namenodes running, one active and one standby. If neither of them is in the active state, we get an alert, since the Hadoop cluster cannot function properly.

A good follow-up to this alarm is to ssh to the Namenode hosts (for example, an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet for the main cluster) and check /var/log/hadoop-hdfs. You should find a log file related to what's happening; look for exceptions or errors.

To be sure that it is not a false alert, check the status of the Namenodes via:

sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet
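If you want a single number out of those two checks, a small helper can count the Namenodes reporting the active state. This is a sketch built on the same haadmin command as above; it assumes you can run it via sudo on the Namenode host:

```shell
# count_active_namenodes: count how many of the given HA service IDs
# report the "active" state, using the same haadmin command as above.
count_active_namenodes() {
  local count=0 id state
  for id in "$@"; do
    state=$(sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState "$id" 2>/dev/null)
    [ "$state" = "active" ] && count=$((count + 1))
  done
  echo "$count"
}

# count_active_namenodes an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
# prints 1 on a healthy cluster; 0 means this alert is real.
```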

HDFS corrupt blocks

In this case, the HDFS Namenode is registering blocks that are corrupted. This is not necessarily bad; it may be due to faulty Datanodes. So before worrying, check:

  1. How many corrupt blocks there are. Our alarms are very sensitive, and keep in mind that we handle millions of blocks.
  2. Which files have corrupt blocks. This can be done via sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks on the Hadoop master nodes (an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet for the main cluster).
  3. Whether a rolling restart of the Hadoop HDFS Datanodes/Namenodes is in progress, or whether one was performed recently. In the past this was a source of false positives due to the JMX metric temporarily reporting weird values. In this case always trust what the fsck command above tells you; in our past experience it is far more reliable than the JMX metric.

Depending on how bad the situation is, fsck may or may not solve the problem (if needed, check the documentation on how to run it to repair corrupted blocks). If the issue is related to a specific Datanode host, it may need to be depooled by an SRE.
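For a quick count from step 1, the -list-corruptfileblocks output can be tallied. This is a sketch assuming that output prints one blk_-prefixed line per corrupt block; verify that against the actual output before trusting the number.

```shell
# corrupt_block_count: count the corrupt file blocks reported by fsck.
# Reads "hdfs fsck / -list-corruptfileblocks" output on stdin; that
# output is assumed to list one "blk_<id>  <path>" line per block.
corrupt_block_count() {
  grep -c '^blk_'
}

# Typical invocation on a master node:
#   sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks | corrupt_block_count
```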

HDFS missing blocks

In this case, the HDFS Namenode is registering blocks that are missing, namely blocks for which no replica is available (hence the data that they carry is not available at all). Some useful steps:

  1. Check how many missing blocks there are. Our alarms are very sensitive, and keep in mind that we handle millions of blocks.
  2. Check which files have missing blocks. This can be done via sudo -u hdfs kerberos-run-command hdfs hdfs fsck / on the Hadoop master nodes (an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet for the main cluster), filtering for missing blocks.

At this point there are two cases: either the blocks are definitely gone for some reason (in which case consult the HDFS documentation on what to do, e.g. removing the references to those files to fix the inconsistency), or they are temporarily gone (for example because multiple Datanodes are down for network reasons).
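The filtering in step 2 can be sketched as follows. The "MISSING" and "Missing blocks:" patterns are assumptions about the fsck output format, so adjust them if your fsck version prints something different:

```shell
# fsck_missing_summary: from "hdfs fsck /" output on stdin, print the
# overall missing-block counter plus the per-file MISSING detail lines.
fsck_missing_summary() {
  awk '/Missing blocks:/ { print }   # summary counter line
       /MISSING/ { print }           # per-file detail lines
      '
}

# Typical invocation on a master node:
#   sudo -u hdfs kerberos-run-command hdfs hdfs fsck / | fsck_missing_summary
```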

HDFS total files and heap size

We have always used https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.3/bk_command-line-installation/content/configuring-namenode-heap-size.html to size the heap of the HDFS Namenodes once a certain threshold of stored files is crossed. This alarm is meant as a reminder to bump the heap size when needed.

Next steps:

Number of files (millions)    Total Java Heap (Xmx and Xms)    Young Generation Size (-XX:NewSize -XX:MaxNewSize)
50-70                         36889m                           4352m
70-100                        52659m                           6144m
100-125                       65612m                           7680m
125-150                       78566m                           8960m
150-200                       104473m                          8960m
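For use in an ad-hoc check script, the table can be transcribed into a small helper. This simply encodes the thresholds above; it is a sketch, not an official sizing tool.

```shell
# recommended_heap: print the recommended total Java heap (Xmx/Xms) for
# a given number of stored files in millions, per the sizing table above.
recommended_heap() {
  local files_m=$1
  if   [ "$files_m" -le 70 ];  then echo "36889m"
  elif [ "$files_m" -le 100 ]; then echo "52659m"
  elif [ "$files_m" -le 125 ]; then echo "65612m"
  elif [ "$files_m" -le 150 ]; then echo "78566m"
  else                              echo "104473m"
  fi
}

# Example: recommended_heap 130 prints 78566m.
```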

The above settings are for the CMS GC; since we are using G1GC, please also refer to the info in https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.2/bk_hdfs-administration/content/ch_g1gc_garbage_collector_tech_preview.html.

HDFS Capacity Remaining

This alert indicates that the available capacity on the HDFS cluster has dropped below a given threshold and is a concern. The thresholds are currently 15% for a warning, 10% for a critical alert, and 5% for a paging-level incident.

The first thing to do if this alert fires is not to panic. Check to see via the Hadoop Dashboard in Grafana whether the usage increase is rapid, or whether there has been sustained growth over some time.

This rate of growth, combined with the severity level of the alert, will inform you as to how urgently action is required.

A Phabricator ticket should be created for this work, but it is currently not done automatically.

Remember: always try to obtain consent from relevant stakeholders before performing any destructive operations on the data itself.

Paging

If the first alert is a page, this would suggest that there has been a rapid increase in disk space consumption and that action may be required urgently in order to prevent the cluster reaching 100% of capacity.

Try to find out any causes of rapid growth, before rectifying them by moving data away from HDFS or by deleting it. Check the YARN web interface for any unusual jobs running that might be producing data at a rapid rate.

Check for any user jobs running on the stat boxes, which might have high network I/O if they are in the process of writing data to HDFS.

Check for any unusual behaviour from an-launcher1002 which might indicate a problem with a regular scheduled task.

An invocation of hdfs dfs -du such as the following can be used to obtain information about the use of space. This example will list the 30 largest entries of a given path (/user), in reverse order of size.

btullis@an-launcher1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -du -h /user | awk '{print $1$2,$3}' | sort -hr | head -n 30

Critical

If the first alert is critical, it implies that the cluster has suddenly dropped below 10% free capacity, which is a cause for significant concern. However, it may not require immediate action to clear space; the cluster can still operate efficiently at this level of utilization. The same practices as for a paging-level alert apply: work should be tracked via a ticket, the rate of change should be assessed from the Hadoop Dashboard in Grafana, and any obvious causes of a sudden increase in utilization should be investigated as soon as practicable.

If the rate of growth is low then investigation should move toward finding where capacity might be freed. It is likely that you would need to contact the owners of any particularly large and/or stale data in order to enquire whether it can be removed or archived to a different location.

Warning

A warning-level alert indicates that the threshold value of 15% free capacity has been breached; however, it does not necessarily indicate that any data needs to be deleted immediately. This could be a transient condition, or it could be an indicator that the capacity of the cluster needs to be expanded. Create a ticket for the investigation and check the rate of growth from the Hadoop Dashboard in Grafana. If the rate of growth is low then the priority and courses of action can be assessed during weekly team meetings.

Unhealthy Yarn Nodemanagers

On every Hadoop worker node there is a daemon called the Yarn Nodemanager, which is responsible for managing vcores and memory on behalf of the Resourcemanager. If multiple Nodemanagers are down, jobs are probably not being scheduled on the affected nodes, reducing the performance of the cluster. Check the https://yarn.wikimedia.org/cluster/nodes/unhealthy page to see which nodes are affected, and ssh to them to check the Nodemanager's logs (/var/log/hadoop-yarn/..)
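The same information should also be available from the ResourceManager's REST API (the standard YARN /ws/v1/cluster/nodes endpoint, assumed here to be reachable via yarn.wikimedia.org). A sketch using crude text extraction; jq would be preferable if installed:

```shell
# unhealthy_nodes: extract node IDs from the YARN ResourceManager's
# nodes REST response. Reads the JSON on stdin; crude text extraction,
# use a real JSON parser (jq) if available.
unhealthy_nodes() {
  grep -o '"id":"[^"]*"' | cut -d'"' -f4
}

# Typical invocation (standard YARN REST endpoint, assumed reachable):
#   curl -s 'https://yarn.wikimedia.org/ws/v1/cluster/nodes?states=UNHEALTHY' | unhealthy_nodes
```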

HDFS Namenode backup age

On the HDFS Namenode standby host (an-master1002.eqiad.wmnet) we run a systemd timer called hadoop-namenode-backup-fetchimage that periodically executes hdfs dfsadmin -fetchImage. The hdfs command pulls the most recent HDFS FSImage from the HDFS Namenode active host (an-master1001.eqiad.wmnet) and saves it under a specific directory. Please ssh to an-master1002 and check the logs of the timer with journalctl -u hadoop-namenode-backup-fetchimage to see what the current problem is.
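Beyond the timer logs, you can sanity-check the age of the newest backed-up FSImage. A sketch assuming GNU find; the backup directory is left as a parameter, since the actual path is site-specific:

```shell
# newest_fsimage_age_hours: print the age in hours of the newest
# fsimage* file under the given backup directory (passed as $1 because
# the actual backup path is site-specific). Requires GNU find.
newest_fsimage_age_hours() {
  local dir=$1 newest now
  newest=$(find "$dir" -name 'fsimage*' -printf '%T@\n' | sort -nr | head -n 1)
  [ -z "$newest" ] && { echo "no fsimage found" >&2; return 1; }
  now=$(date +%s)
  echo $(( (now - ${newest%.*}) / 3600 ))
}
```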

HDFS Namenode process

There are two HDFS Namenodes for the main cluster, running on an-master100[1,2].eqiad.wmnet. They control the metadata and the state of the distributed file system, and they also handle client traffic. They are essentially the head of the HDFS file system; without them no read/write operation can be performed on HDFS. Usually they are both up, one in the active state and the other in standby, and you can check them via:

elukey@an-master1001:~$  sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
active
elukey@an-master1001:~$  sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet
standby

If one of them goes down, there are two cases:

  • It is the active one, so a failover happens and the previous standby is elected as the new active.
  • It is the standby, so no failover happens.

In both cases, if at least one Namenode is up, the cluster should be able to carry on fine. It is nonetheless a situation that needs to be fixed ASAP, so here are a few things to do first:

  • Figure out how many Namenodes are down, and whether one is active. You can use the commands above for a quick check. If both Namenodes are down we have another alert, so it should be clear either way.
  • Check the logs on the host with the failed Namenode. You should be able to find them in /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master100[1,2].log. Look for anything weird, especially exceptions.
  • Check whether other alerts related to the Namenodes are pending. For example, the RPC queue alert described above could well be the root cause of the Namenode being down.
  • Check the Namenode panel in Grafana to see if metrics can help.

This situation will likely need an SRE, so feel free to escalate early! It might take time for people to join, and you can do some investigation in the meantime.

HDFS ZKFC process

There are two HDFS ZKFC processes, running on the master nodes an-master100[1,2].eqiad.wmnet. Their job is to periodically health-check the HDFS Namenodes and trigger automatic failovers if something fails (using Zookeeper to hold locks). Usually they are both up, and if one of them is down it may not be a big issue (the Namenodes can live without them for a while). The first thing to do is to ssh to the master nodes and check /var/log/hadoop-hdfs/hadoop-hdfs-zkfc-an-master100[1,2].log, looking for exceptions or similar problems.

This situation will likely need an SRE, so feel free to escalate early! It might take time for people to join, and you can do some investigation in the meantime.

HDFS Datanode process

There are a lot of HDFS Datanodes in the cluster, one on each worker node, so nothing major happens if a few of them are down. The Datanodes are responsible for managing the HDFS blocks on behalf of the Namenodes, providing them to clients when needed (or accepting writes of new blocks from clients). Clients first have to talk to the Namenodes to authenticate and get permission to perform a certain operation on HDFS (read/write/etc.), and then they can contact the Datanodes to get the actual data.

The first thing to do is to figure out how many Datanodes are down. If the figure is up to 3-4 nodes it is not the end of the world (since we have 60+ nodes), but if it is more, please ping an SRE as soon as possible. The most common cause of a failure is a Java OOM, which can be quickly checked using something like:

elukey@analytics1061:~$ grep java.lang.OutOfMemory /var/log/hadoop-hdfs/hadoop-hdfs-datanode-analytics1061.log -B 2 --color

A deeper inspection of the Datanode's logs will surely give more information. Also check the Datanodes panel in Grafana to see if metrics can help.

HDFS Journalnode process

The HDFS Namenodes (see above; the head of HDFS) use a special group of daemons called Journalnodes to keep a shared edit log, in order to facilitate failover when needed (i.e. to implement high availability). We have 5 Journalnode daemons running on some worker nodes in the cluster (see hieradata/common.yaml and look for journalnode_hosts for the main cluster), and we need a minimum of three to keep the Namenodes working. If a quorum of three Journalnodes cannot be reached, the Namenodes shut down.

Things to do:

  • Figure out how many Journalnodes are down. If one or two, the situation should be OK, since the cluster should not be impacted. More than two is a big problem, and it will likely trigger other alerts (like the Namenode ones).
  • Check logs for Exceptions:
    elukey@an-worker1080:~$ less /var/log/hadoop-hdfs/hadoop-hdfs-journalnode-an-worker1080.log
    
  • Check in the Namenode UI how many Journalnodes are active using the following ssh tunnel: ssh -L 50470:localhost:50470 an-master1001.eqiad.wmnet
  • Check the Journalnodes panel in Grafana to see if metrics can help.

Yarn Resourcemanager process

There are two Yarn Resourcemanager processes in the cluster, running on the master nodes an-master100[1,2].eqiad.wmnet. Their job is to manage the overall compute resources (CPUs/RAM) that the worker nodes offer. They communicate with the Yarn Nodemanager processes on each worker node, getting info from them about available resources and allocating what clients request (when possible). Usually they are both up, one in the active state and the other in standby, and you can check them via:

elukey@an-master1001:~$  sudo -u hdfs /usr/bin/yarn rmadmin -getServiceState an-master1001-eqiad-wmnet
active
elukey@an-master1001:~$  sudo -u hdfs /usr/bin/yarn rmadmin -getServiceState an-master1002-eqiad-wmnet
standby

If one of them goes down, there are two cases:

  • It is the active one, so a failover happens and the previous standby is elected as the new active.
  • It is the standby, so no failover happens.

In both cases, if at least one Resourcemanager is up, the cluster should be able to carry on fine. It is nonetheless a situation that needs to be fixed ASAP, so here are a few things to do first:

  • Figure out how many Resourcemanagers are down, and if one is active.
  • Check the logs on the host with the failed Resourcemanager. You should be able to find them in /var/log/hadoop-yarn/hadoop-yarn-resourcemanager-an-master100[1,2].log. Look for anything weird, especially exceptions.
  • Check the Resourcemanager panel in Grafana to see if metrics can help.

This situation will likely need an SRE, so feel free to escalate early! It might take time for people to join, and you can do some investigation in the meantime.

Yarn Nodemanager process

There are a lot of Yarn Nodemanagers in the cluster, one on each worker node, so nothing major happens if a few of them are down. The Yarn Nodemanagers are responsible for the allocation and management of local resources on the worker (CPUs/RAM). They can be restarted at any time without affecting the running containers/JVMs that they create on behalf of the Yarn Resourcemanagers.

The first thing to do is to figure out how many Nodemanagers are down. If the figure is up to 3-4 nodes it is not the end of the world (since we have 60+ nodes), but if it is more, please ping an SRE as soon as possible. The most common cause of a failure is a Java OOM, which can be quickly checked using something like:

elukey@analytics1061:~$ grep java.lang.OutOfMemory /var/log/hadoop-yarn/hadoop-yarn-nodemanager-analytics1061.log -B 2 --color

A deeper inspection of the Nodemanager's logs will surely give more information. Also check the Nodemanagers panel in Grafana to see if metrics can help.

Mapreduce Historyserver process

The Mapreduce Historyserver runs only on an-master1001.eqiad.wmnet and offers an API for querying the status of finished applications. It is not critical for the health of the cluster, but it should nonetheless be fixed as soon as possible.

Check the logs on the host to look for any sign of exceptions or weird behaviour:

elukey@an-master1001:~$ less /var/log/hadoop-mapreduce/mapred-mapred-historyserver-an-master1001.out
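As a quick liveness check, the Historyserver should answer on its REST API. The /ws/v1/history/info endpoint and its startedOn field are part of the standard Hadoop JobHistory REST API; the port below (19888, the Hadoop default) is an assumption to verify against the local configuration.

```shell
# historyserver_started_on: extract the "startedOn" timestamp from the
# JSON returned by the Historyserver's /ws/v1/history/info endpoint.
# Crude text extraction; use jq if available.
historyserver_started_on() {
  grep -o '"startedOn":[0-9]*' | cut -d: -f2
}

# Typical invocation (port 19888 is the Hadoop default, verify locally):
#   curl -s http://an-master1001.eqiad.wmnet:19888/ws/v1/history/info | historyserver_started_on
```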