Incident documentation/20190723-logstash
document status: in-review
Summary
Logstash became overloaded during a network outage that caused an elevated error rate. Problematic log messages that were able to crash logstash were also observed, and had to be filtered out manually.
Impact
Logstash was unable to process incoming logs on all inputs until the problem log type was identified and filtered. This resulted in delayed logs and missing logs of the affected type (MediaWiki's SlowTimer).
Timeline
- 19:10 - First page (logstash failure, secondary fallout from error ingestion) - A flood of errors caused by network disruption during eqiad rack a6/a7 PDU maintenance overwhelmed logstash.
- Troubleshooting of the related network issue proceeds with higher priority; the logstash outage is presumed to be a secondary effect of the network issues.
- 20:10 - Network issue begins to improve; however, logstash does not recover on its own.
- 20:23 - Logstash is unable to process the backlog and is crashing due to a UTF-8 parsing error. Suspected message example: https://phabricator.wikimedia.org/P8790
[FATAL][logstash.runner ] An unexpected error occurred! {:error=>#<ArgumentError: invalid byte sequence in UTF-8>
- 21:00 - UTF-8 issue traced to the kafka rsyslog-shipper input. Logstash collectors restarted with the kafka rsyslog-shipper input disabled. Kafka-logging consumer lag begins to recover on non-rsyslog topics: https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&from=1563904908998&to=1563929943574&var-datasource=eqiad%20prometheus%2Fops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
- 21:45 - UTF-8 issue traced to SlowTimer messages; a temporary fix to drop messages matching [message] =~ /^SlowTimer/ is deployed (a sketch of this kind of filter follows the timeline).
- 22:00 - Temporary fix is active across eqiad/codfw logstash collectors and the kafka rsyslog-shipper input is re-enabled. Kafka consumer lag is now recovering on all topics.
- 00:16 (next day) - Logstash has caught up with the kafka backlog.
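The exact configuration deployed during the incident is not reproduced here; a minimal sketch of that kind of conditional drop, assuming the condition quoted above and the standard logstash drop filter plugin, would look like:

  filter {
    # Temporary mitigation: discard SlowTimer messages, which were carrying invalid UTF-8
    if [message] =~ /^SlowTimer/ {
      drop { }
    }
  }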
Actionables
- Get to the bottom of invalid UTF-8 sequences being produced to logstash. The errors were traced to SlowTimer events logging binary data, in the form of slow parser cache queries. For example: https://phabricator.wikimedia.org/P8790 (TODO: Create task)
- Investigate rate limiting options for logstash to avoid ingesting millions of copies of the same error message in failures like this (see the throttle sketch at the end of this list) (TODO: Create task)
- Look into separating logstash pipelines (TODO: discuss)
- Investigate filtering/parsing options to avoid the UTF-8 error observed (see the ruby filter sketch at the end of this list):
- https://github.com/logstash-plugins/logstash-filter-mutate/commit/ac4c440fba70e2ddb2a6a664af755f79893cfabd
- https://github.com/logstash-plugins/logstash-input-elasticsearch/issues/101
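For the rate limiting actionable, one candidate (an assumption for illustration, not a decided approach) is the logstash throttle filter, which tags events above a threshold so they can be dropped. A minimal sketch, assuming throttling of identical message text at an illustrative 1000 events per minute:

  filter {
    throttle {
      key         => "%{message}"   # group identical messages together
      after_count => 1000           # tag everything after the first 1000 per period
      period      => "60"           # one-minute window
      max_age     => 120
      add_tag     => "throttled"
    }
    # Drop the tagged overflow instead of indexing it
    if "throttled" in [tags] {
      drop { }
    }
  }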
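For the filtering/parsing actionable, one possible direction (a sketch under assumptions, not the chosen fix) is to scrub invalid byte sequences before events reach the outputs, for example with a ruby filter:

  filter {
    ruby {
      # Replace invalid byte sequences in [message] with the Unicode replacement character
      code => "
        m = event.get('message')
        if m.is_a?(String) && !m.valid_encoding?
          event.set('message', m.scrub)
        end
      "
    }
  }

The mutate-filter commit and elasticsearch-input issue linked above cover related upstream handling of the same class of error.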