Revision as of 17:44, 8 April 2022 by imported>Krinkle (Krinkle moved page Incident documentation/20151126-graphite-grafana to Incidents/20151126-graphite-grafana)

Summary

Graphite was overwhelmed with legitimate requests, causing 500 errors to be returned to clients.


  • 2015/11/25 14:48 - graphite starts throwing 500s to clients
  • 2015/11/25 14:52 - alert on icinga, investigation begins
  • 2015/11/25 14:57 - heavy query/dashboard suspected, uwsgi bounced
  • 2015/11/25 15:11 - big influx of requests on graphite1001's apache identified as the root cause, likely from a misbehaving dashboard
  • 2015/11/25 15:24 - labs-monitoring grafana dashboard's default refresh interval changed from 5s to 5m
  • 2015/11/25 15:53 - kafka dashboard also suspected and banned from apache


Grafana dashboards that rely on intensive graphite queries can easily overwhelm graphite itself, particularly dashboards that refresh frequently, resulting in denial of service.

In addition, it has been observed that misc varnish retries the request on a 5xx from a backend, further contributing to a thundering herd of requests.
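
To illustrate how refresh interval and retries combine, the following is a rough back-of-the-envelope sketch, not a measurement from the incident. The panel count, viewer count and number of retries are hypothetical values chosen for illustration; only the 5s and 5m refresh intervals come from the timeline above.

```python
def requests_per_second(panels: int, viewers: int, refresh_seconds: float,
                        retries_on_5xx: int = 0) -> float:
    """Estimate graphite requests per second generated by one dashboard.

    Assumes each panel issues one query per refresh for each open viewer.
    retries_on_5xx models the worst case in which every request fails and
    is retried that many times by a proxy in front of graphite.
    """
    base = panels * viewers / refresh_seconds
    return base * (1 + retries_on_5xx)


# Hypothetical dashboard: 20 panels, open in 10 browsers.
fast = requests_per_second(panels=20, viewers=10, refresh_seconds=5)
slow = requests_per_second(panels=20, viewers=10, refresh_seconds=300)

# Same dashboard at 5s refresh while the backend is failing, assuming
# one retry per failed request (the retry count is an assumption).
failing = requests_per_second(panels=20, viewers=10, refresh_seconds=5,
                              retries_on_5xx=1)

print(f"5s refresh: {fast:.2f} req/s")
print(f"5m refresh: {slow:.2f} req/s")
print(f"5s refresh with one retry on every 5xx: {failing:.2f} req/s")
```

Under these assumptions, moving from a 5s to a 5m refresh cuts the request rate by a factor of 60, while a single retry on every failed request doubles the load at exactly the moment the backend is least able to absorb it.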


  • Make it easier to ban misbehaving dashboards from graphite (bug T119718)
  • Enforce a minimum refresh period for grafana dashboards hitting graphite (bug T119719)
  • 500 errors from graphite shouldn't be retried by varnish (bug T119721)
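
As a minimal sketch of what enforcing a minimum refresh period could look like, the snippet below clamps a dashboard's refresh setting to a floor. It assumes the dashboard is available as a dict with a `refresh` string such as "5s" or "5m"; the function names and the "1m" floor are hypothetical and are not taken from the tasks above.

```python
import re

# Seconds per unit for interval strings such as "5s", "5m", "1h", "1d".
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_interval(text: str) -> int:
    """Convert an interval string like '5m' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unrecognised interval: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def enforce_min_refresh(dashboard: dict, minimum: str = "1m") -> dict:
    """Return a copy of the dashboard whose refresh is at least `minimum`.

    The "1m" default is a hypothetical policy value. Dashboards without
    auto-refresh (missing, empty or False) are returned unchanged.
    """
    refresh = dashboard.get("refresh")
    if not refresh:
        return dict(dashboard)
    if parse_interval(refresh) < parse_interval(minimum):
        return {**dashboard, "refresh": minimum}
    return dict(dashboard)


# Example: a dashboard set to refresh every 5 seconds is raised to 1 minute.
clamped = enforce_min_refresh({"title": "example", "refresh": "5s"})
print(clamped["refresh"])
```

Returning a copy rather than mutating the input keeps the original definition available for comparison, for example to log which dashboards were adjusted.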