Incident documentation/20190723-logstash

Latest revision as of 22:07, 13 September 2019

document status: in-review

Summary

Logstash became overloaded during a network outage that caused an elevated error rate.  Problematic log messages that could crash logstash were also observed, requiring manual filtering of the problem logs.

Impact

Logstash was unable to process incoming logs on any input until the problem log type was identified and filtered.  This resulted in delayed logs, and missing logs of the affected type (MediaWiki's SlowTimer).

Detection

Icinga alerts notified us of the issue.

Timeline

  • 19:10 - First page (logstash failure, secondary fallout from error ingestion) - A flood of errors caused by network disruption during eqiad rack a6/a7 PDU maintenance overwhelmed logstash.
    • Troubleshooting of the related network issue proceeds with higher priority; the logstash outage is presumed to be a secondary effect of the network issues.
  • 20:10 - The network issue begins to improve; however, logstash does not recover on its own.
  • 20:23 - logstash is unable to process the backlog and is crashing due to a UTF-8 parsing error.  Suspected message example: https://phabricator.wikimedia.org/P8790
    • [FATAL][logstash.runner          ] An unexpected error occurred! {:error=>#<ArgumentError: invalid byte sequence in UTF-8>
  • 21:00 - The UTF-8 issue is traced to the kafka rsyslog-shipper input.  Logstash collectors are restarted with the kafka rsyslog-shipper input disabled.  Kafka-logging consumer lag begins to recover on non-rsyslog topics: https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&from=1563904908998&to=1563929943574&var-datasource=eqiad%20prometheus%2Fops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
  • 21:45 - The UTF-8 issue is traced to SlowTimer messages, and a temporary fix to drop [message] =~ /^SlowTimer/ is deployed.
  • 22:00 - The temporary fix is active across the eqiad/codfw logstash collectors, and the kafka rsyslog-shipper input is re-enabled.  Kafka consumer lag is now recovering on all topics.
  • 00:16 (next day) - logstash has caught up with the kafka backlog.
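The temporary fix deployed at 21:45 corresponds to a Logstash filter conditional along these lines (a sketch only; the actual filter file and its position in the pipeline are assumptions, but the `[message] =~ /^SlowTimer/` condition is the one quoted above):

```
filter {
  # Drop MediaWiki SlowTimer messages that were crashing the pipeline.
  # Matches any event whose message field starts with "SlowTimer".
  if [message] =~ /^SlowTimer/ {
    drop { }
  }
}
```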

Conclusions

What went well?

  • We successfully isolated which stream was generating the problematic messages.

What went poorly?

  • It took us a while to determine where to install a filter that would drop the problematic messages.

Where did we get lucky?

  • The incident happened in the middle of the day for two of the engineers with context.

How many people were involved in the remediation?

  • 2 SREs

Links to relevant documentation

  • https://wikitech.wikimedia.org/wiki/Logstash

Actionables

  • [DONE] Investigate separating logstash pipelines
    • Possible with current architecture, but multiple pipelines support requires logstash 6.x or newer
  • [DONE] Investigate filtering/parsing options to avoid UTF-8 error observed
    • https://github.com/logstash-plugins/logstash-filter-mutate/commit/ac4c440fba70e2ddb2a6a664af755f79893cfabd
    • https://github.com/logstash-plugins/logstash-input-elasticsearch/issues/101
    • Requires logstash 7.x or newer
  • Upgrade to logstash 7.x or newer (https://phabricator.wikimedia.org/T183281), or otherwise upgrade the mutate plugin to 3.5.0
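The mutate plugin upgrade in the last actionable addresses this class of failure by scrubbing invalid bytes instead of raising, as logstash 6.x did.  A minimal Python sketch of the same idea (the sample bytes are invented for illustration and are not the actual SlowTimer payload):

```python
# Hypothetical example: 0xC3 starts a two-byte UTF-8 sequence, but 0x28
# ("(") is not a valid continuation byte, so strict decoding fails -- the
# same class of error as the ArgumentError that crashed logstash.
raw = b"SlowTimer: 15 [\xc3\x28]"

try:
    raw.decode("utf-8")  # strict decode raises, like logstash 6.x
    crashed = False
except UnicodeDecodeError:
    crashed = True

# Scrubbing replaces the invalid byte with U+FFFD and the event survives,
# analogous to what newer logstash-filter-mutate releases do.
cleaned = raw.decode("utf-8", errors="replace")
print(crashed)
print(cleaned)
```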