Incident documentation/20160823-ToolsProxy

Revision as of 01:58, 6 January 2017 by imported>Krinkle (Krinkle moved page Incident documentation/ToolsProxy20160823 to Incident documentation/20160823-ToolsProxy: Unbreak Template:Days since incidents - conform to standard format)


The Tool Labs proxy (and thus all webservice accessibility from the internet) was down for approximately 2 minutes, between 05:46 and 05:48 UTC on Tuesday, 23 August 2016.


  • 0546: Yuvi gets a page for PAWS being down.
  • 0546: Yuvi investigates and notices that Tools is down too.
  • 0547: Yuvi SSHs into tools-proxy-01 and looks at error.log, which contains many "768 worker_connections are not enough" errors.
  • 0548: Yuvi restarts nginx, fixing the issue for now.
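
The error.log check in the timeline can be sketched as a small script. This is a minimal illustration, not the tooling used during the incident: the function name `count_exhaustion_errors` is hypothetical and the sample lines are made up for illustration. Only the message text "worker_connections are not enough" comes from the report.

```python
import re
from collections import Counter

# Matches the exhaustion message seen in error.log,
# e.g. "768 worker_connections are not enough".
PATTERN = re.compile(r"(\d+) worker_connections are not enough")


def count_exhaustion_errors(lines):
    """Count matching log lines, keyed by the limit quoted in the message."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Illustrative lines only (assumption); these are not from the real log.
    sample = [
        "2016/08/23 05:46:10 [alert] 100#0: 768 worker_connections are not enough",
        "2016/08/23 05:46:11 [alert] 100#0: 768 worker_connections are not enough",
        "2016/08/23 05:46:12 [error] 100#0: some unrelated error",
    ]
    for limit, hits in count_exhaustion_errors(sample).items():
        print(f"limit {limit}: {hits} exhaustion errors")
```

To scan a real log, an open file object can be passed in place of `sample`, since the function only iterates over lines.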


  • Our current worker_connections limit is too low.
  • There was no widespread paging for this. The PAWS alert is set to page only Yuvi, and it caught this outage only incidentally.
  • We had a higher worker_connections limit, but that was killed in favor of the default number in
  • There's no quick way to fail over tools-proxy, so intense debugging under pressure took priority over failing over and investigating calmly.
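
As a rough illustration of why a limit of 768 is low for a proxy: nginx counts every connection a worker holds, including those to proxied backends, against worker_connections. The sketch below assumes that each proxied client request holds two connections (one to the client, one to the backend). That is a common rule of thumb rather than a measured figure for tools-proxy. The function name and the worker_processes values are hypothetical, since the report does not state them.

```python
def estimated_max_clients(worker_processes, worker_connections, conns_per_client=2):
    """Estimate how many proxied clients nginx can serve at once.

    conns_per_client=2 is an assumption: one connection to the client
    plus one to the upstream backend for each proxied request.
    """
    return (worker_processes * worker_connections) // conns_per_client


if __name__ == "__main__":
    # 768 is the limit quoted in error.log; the worker counts are assumptions.
    for workers in (1, 4):
        print(workers, "worker(s):", estimated_max_clients(workers, 768))
```

Under these assumptions a single worker with a limit of 768 serves about 384 concurrent proxied clients.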


  • Increase the worker_connections limit and tune nginx properly (task T143637)
  • Set up paging with a super simple webservice, to replace the killed tools home page check (task T143638)
  • Make a script that facilitates failover of the tools / nova proxy (task T143639)
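
The paging check against a simple webservice could look roughly like the sketch below. It is only an illustration of the idea, not the check that was deployed: the function names, the timeout, and the three-failure threshold are all assumptions, and the URL to probe is left to the caller.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or an HTTP error status.
        return False


def should_page(history, consecutive_failures=3):
    """Page only when the most recent N probe results are all failures.

    history is a list of booleans, oldest first (True means the probe passed).
    The default threshold of 3 is an assumption, chosen to avoid paging on a
    single dropped request.
    """
    recent = history[-consecutive_failures:]
    return len(recent) == consecutive_failures and not any(recent)
```

A monitoring loop would append each `probe(...)` result to a list and call `should_page` after every probe. For an outage as short as this one, the probe interval and the threshold would need to be small enough to fire within two minutes.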