
Incident documentation/2019-04-23 varnish

Revision as of 19:35, 31 March 2021 by imported>Krinkle (Krinkle moved page Incident documentation/20190423-varnish to Incident documentation/2019-04-23 varnish)

Summary

Another occurrence of the Varnish 'mailbox lag' problem, which has caused similar incidents many times before.

Impact

Approximately 82k queries lost (HTTP 503 served instead). source

Detection

Automated monitoring -- Icinga alerts on traffic availability.

Timeline

All times in UTC.

  • 19:54 Varnish mailbox lag begins climbing on cp1083 OUTAGE BEGINS
  • 19:56 First Icinga alert for HTTP availability: "PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo"
  • 19:57 Varnish mailbox lag recovers on cp1083 but begins climbing on cp1085
  • 20:02 Varnish mailbox lag recovers on cp1085 OUTAGE ENDS

Graphs: Mailbox lag, HTTP availability
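
As a rough sanity check on the figures above, the outage window and the average error rate can be derived from the timeline and the Impact section. This is only a sketch. It assumes the approximately 82k failed requests all fell between the OUTAGE BEGINS and OUTAGE ENDS marks and were spread evenly, which the graphs may not bear out. The variable and function names are illustrative, not taken from any Wikimedia tooling.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline above.
OUTAGE_BEGINS = "19:54"
OUTAGE_ENDS = "20:02"
# Approximate count from the Impact section ("approximately 82k").
FAILED_REQUESTS = 82_000


def outage_minutes(start: str, end: str) -> float:
    """Return the length of the window in minutes (same-day HH:MM times)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


minutes = outage_minutes(OUTAGE_BEGINS, OUTAGE_ENDS)
# Assumption: errors spread uniformly over the window; real traffic was likely burstier.
avg_per_second = FAILED_REQUESTS / (minutes * 60)

print(f"outage window: {minutes:.0f} min")
print(f"average 503 rate: {avg_per_second:.0f}/s")
```

Under those assumptions the window is 8 minutes and the average comes to roughly 171 HTTP 503 responses per second.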

Conclusions

See Incident_documentation/20190416-varnish#Conclusions

Actionables

See Incident_documentation/20190416-varnish#Actionables