Incidents/2019-04-16 varnish



Summary

For approximately an hour, the traffic layer served bursts of HTTP 503 errors, peaking at roughly 50,000 errors per minute for several minutes at a time. It is unclear why this happened, or whether the misbehavior originated at the traffic layer or at the appserver layer.


Impact

Approximately 553,000 HTTP 503 errors were served across all sites.


Detection

Automated monitoring (Icinga alerts on traffic availability) detected the incident, and multiple staff and user reports followed in #wikimedia-operations.


Timeline

All times are in UTC.

  • 18:24: HTTP 503 error rate begins to rise OUTAGE BEGINS
  • 18:26: first alert from Icinga
    <+icinga-wm>	PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams
  • 18:52: jynus phones bblack
  • 18:55: cdanis depools cp1085 after looking at the Varnish mailbox lag console
  • 19:04: bblack performs varnish-backend-restart on cp1085
  • 19:07: bblack performs varnish-backend-restart on cp1083
  • 19:08: 503s taper off to 0 OUTAGE ENDS
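
The remediation steps in the timeline (depool the suspect cache host, then restart its Varnish backend) can be sketched as shell commands. The `varnish-backend-restart` wrapper is named in the timeline itself; the conftool (`confctl`) selector syntax and the fully-qualified host name are assumptions and may not match the exact tooling in use at the time.

```shell
# Hedged sketch of the remediation applied during the incident.
# Run on the affected cache host (cp1085); hostname and selector
# syntax are assumptions for illustration.

# 1. Depool the misbehaving cache host so it stops receiving traffic:
sudo confctl select 'name=cp1085.eqiad.wmnet' set/pooled=no

# 2. Restart the Varnish backend to clear the accumulated mailbox lag
#    (WMF wrapper script referenced in the timeline):
sudo varnish-backend-restart

# 3. Repool once the backend is healthy again:
sudo confctl select 'name=cp1085.eqiad.wmnet' set/pooled=yes
```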

Possibly-relevant other graphs:

It is unclear which of the above are symptoms and which are causes.


Conclusions

What went well?

  • automated monitoring detected the incident

What went poorly?

  • unable to root-cause incident

Where did we get lucky?

  • Whatever was causing the issue stopped happening on its own.
  • The outage was not more widespread.

Links to relevant documentation

Where is the documentation that someone responding to this alert should have (runbook, plus supporting docs)? If that documentation does not exist, there should be an action item to create it.


Actionables

  • Continue working on the migration to ATS (Apache Traffic Server). Similar incidents have happened before, and continuing to investigate these Varnish failures is not a good use of time.