You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Analytics/Systems/Cluster/Hadoop/Alerts

From Wikitech-static
< Analytics‎ | Systems‎ | Cluster‎ | Hadoop
Revision as of 12:21, 2 February 2021 by imported>Elukey
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search


HDFS Namenode RPC length queue

The HDFS Namenode handles operations on HDFS via RPCs (getfileinfo, mkdir, etc..) and it has a fixed amount of worker threads dedicated to handle the incoming RPCs. Any RPC enters a queue, and then it is processed by a worker. If the queue length grows too much, the HDFS Namenode starts to lag in answering to clients and datanode health checks, and it also may end up in trashing due to heap pressure and GC activity. When icinga alerts for RPC queue too long, usually it is sufficient to do the following:

ssh an-master1001.eqiad.wmnet

tail -f /var/log/hadoop-hdfs/hdfs-audit.log

You will see a ton of entries logged for every second, but usually it should be very easy to spot a user making a ton of subsequent requests. Issues happened in the past:

  • Too many getfileinfo RPCs sent (scanning directories with a ton of small files)
  • Too many small/temporary files created in a short burst (order of Millions)
  • etc..

Once the user that hammers the Namenode is identified, check in yarn.wikimedia.org if there is something running for the same user, and kill it asap if the user doesn't answer in few minutes. We don't care what the job is doing, the availability of the HDFS Namenode comes first :)