Analytics/Archive/Hadoop - Logstash
Hadoop consists of several components running on several hosts with several logfiles, some stored inside HDFS. To make this less irritating, we send copies of all log events to a central location: Logstash!
"Logstash" actually consists of three components:
- Ingest (event stream processing): Logstash
- Storage: Elasticsearch
- Web UI: Kibana
Because LEK is an unpleasant acronym, people refer to this trinity as the "ELK stack", or simply "Logstash".
(Most) Hadoop components use the Log4j Java logging framework. To this we add logstash-gelf, a Log4j appender which sends events over the network in the Graylog Extended Log Format (GELF), a compressed JSON format.
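A minimal sketch of what that appender wiring looks like in log4j.properties (the hostname here is a placeholder; the real values live in our Puppet configs):

  # Route all log events to Logstash over GELF (logstash-gelf appender)
  log4j.rootLogger=INFO, gelf
  log4j.appender.gelf=biz.paluch.logging.gelf.log4j.GelfLogAppender
  # Placeholder host; the udp: prefix matches the Logstash GELF input transport
  log4j.appender.gelf.Host=udp:logstash.example.org
  log4j.appender.gelf.Port=12201
  log4j.appender.gelf.Facility=hadoop
  # Ship stack traces as a discrete GELF field
  log4j.appender.gelf.ExtractStackTrace=true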
Logstash: stream processing
Logstash processing occurs in three phases (input, filter, and output), all configured in /etc/logstash/conf.d/.
Input: The Logstash daemon is configured to listen for GELF events on 12201/UDP. These events have the property "type=gelf". UDP is selected in order to handle input overload without exhausting log server resources; the trade-off is that delivery of events is not guaranteed. Logstash can also ingest other event types from the network, such as syslog and Graphite, but for Hadoop we use only the GELF input. One advantage of GELF is that, because it is a JSON format, we get discrete named fields rather than one long line which must be parsed and split.
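A minimal sketch of the corresponding input stanza (the real file lives in /etc/logstash/conf.d/):

  input {
    gelf {
      # 12201/UDP is the GELF convention, matching the appender configuration above
      port => 12201
      type => "gelf"
    }
  }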
Filter: This is where events are processed. Conditional logic is used with filter plugins like 'grok', 'mutate', and 'prune' to extract values into properties, modify and remove properties, add tags to events, and so on. This is the only opportunity to modify events: there is no post-processing.
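An illustrative filter stanza in this style (the grok pattern and field names are invented for the example):

  filter {
    if [type] == "gelf" {
      grok {
        # Hypothetical pattern: copy a job ID into its own queryable field
        match => [ "message", "(?<jobid>job_[0-9]+_[0-9]+)" ]
        tag_on_failure => []
      }
      mutate {
        # Mark the event for delivery to Elasticsearch in the output phase
        add_tag => [ "es" ]
      }
    }
  }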
Output: In this stage the config sends events carrying the appropriate tag ("es") to Elasticsearch for storage.
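A sketch of such an output stanza (the elasticsearch output's parameter names vary between Logstash versions, and the host is a placeholder):

  output {
    if "es" in [tags] {
      elasticsearch {
        # Placeholder address for the Logstash Elasticsearch pool
        host => "logstash-es.example.org"
      }
    }
  }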
Elasticsearch: storage
For service isolation, the Logstash cluster operates its own Elasticsearch storage pool, independent of the pool used for CirrusSearch. Shards are replicated between nodes for redundancy.
Kibana: web UI
How do I find the error when my job fails?
Job IDs are extracted to a separate field for querying, and there is a dashboard panel which shows a list of Job IDs present in the search results. Click the magnifying glass Action to filter the query by that Job ID. A query such as Severity:ERROR or !Severity:INFO (to display both warnings and errors) can help identify the pertinent message; a combined example follows below. FIXME - this answer should be improved with some work on the dashboard.
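Such a combined query might look like this (the jobid field name and the ID value here are illustrative):

  jobid:"job_1410714699373_0001" AND Severity:ERROR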
How do I see the details of an event?
Clicking on an event in Kibana opens an expanded view showing each field and value, as well as offering syntax-highlighted JSON or raw views of the data.
How do I narrow my results?
Queries may be written with Lucene syntax, or, by clicking on the color circle to the left of the query field and changing the type, with regular expressions or Kibana's own "topN" syntax.
Example Lucene queries (illustrative; Severity and host are fields on the events):
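  Severity:ERROR
  !Severity:INFO
  Severity:ERROR AND host:analytics*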
Example regex queries (illustrative; a regex must match the whole value):
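  .*OutOfMemoryError.*
  job_[0-9]+_[0-9]+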
Example topN queries (topN is configured through the query's settings rather than typed as text):
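For instance, set the query type to topN, choose host as the field and 5 as the count, then enter Severity:ERROR in the query box to chart the five hosts producing the most errors (illustrative; any extracted field can be used).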
Filtering
Equivalent results can often be obtained either by refining your query as described above or by applying filters to its output. Kibana's filters, which modify query results, should not be confused with Logstash filters, which modify events during ingestion.
Time ranges
You can select preset ranges from the drop-down menu such as "Last 5m" or "Last 30d", specify a custom range manually, or click and drag to select the time range of interest in a Kibana visualization panel.
Excluding event types
When an event is expanded to show its details, each field features icons to add filters matching or excluding events which match the value of that field. Additionally, there is an icon which toggles display of that field as a column in the event list.
How do I automatically see new events as they arrive?
Click to open the time range selector in the nav bar, select Auto-Refresh, and choose your interval.
Why is my query so slow?
Kibana itself is purely a client-side application; the heavy lifting happens in Elasticsearch. Asking it to sort on large fields such as "message" (which can contain entire Java stack traces) forces Elasticsearch to load every value into memory, which leads to slow performance. The solutions are 1) don't do that, or 2) only sort on fields which are stored using the doc_values field data option (such as normalized_message.raw), which avoids loading the values into memory.
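A sketch of how such a multi-field can be mapped in Elasticsearch (illustrative; the real mappings live in the cluster's index templates):

  "normalized_message": {
    "type": "string",
    "fields": {
      "raw": {
        "type": "string",
        "index": "not_analyzed",
        "doc_values": true
      }
    }
  }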
How do I make my own dashboard?
- Begin by modifying your current view
- When you are satisfied, click the Save icon in the nav bar which looks like a floppy diskette
- Replace the current name with a name for your dashboard
- Click the button to the left of the name field to save your new dashboard
- Dashboards are saved in Elasticsearch!
- There is no protection against overwriting existing dashboards, and no way to revert edits. Please use care when saving dashboards.
Clicking the Share icon presents a unique permalink to your current view - queries, filters, date ranges and all. There is no need to save a custom dashboard first.
Hadoop logging configuration
FIXME. For now, see our notes at https://office.wikimedia.org/wiki/User:JGerard_%28WMF%29/Hadoop-Logstash and our Puppet configs for details on how Hadoop is configured to send log events to Logstash, mostly via log4j.properties and hadoop-env.sh.
The toughest part to configure is also the highest-value one: errors from YarnChild passed to MRAppMaster. Because the latter is launched by another Java process rather than by an init script, we have to modify the container-log4j.properties inside the NodeManager JAR, as sketched below.
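The patch amounts to appending an appender definition along these lines (a sketch with a placeholder hostname; the actual patching is handled by our Puppet configs):

  # Illustrative: define a GELF appender and attach it to the root logger
  # alongside the existing container log appender
  log4j.rootLogger=${hadoop.root.logger}, gelf
  log4j.appender.gelf=biz.paluch.logging.gelf.log4j.GelfLogAppender
  log4j.appender.gelf.Host=udp:logstash.example.org
  log4j.appender.gelf.Port=12201

The edited file can then be put back into the JAR with, for example, jar uf hadoop-yarn-server-nodemanager.jar container-log4j.properties.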