Incident documentation/20160823-ToolsProxy
Revision as of 01:58, 6 January 2017 by imported>Krinkle (Krinkle moved page Incident documentation/ToolsProxy20160823 to Incident documentation/20160823-ToolsProxy: Unbreak Template:Days since incidents - conform to standard format)
Summary
The Tool Labs proxy (and thus all webservice access from the internet) was down for approximately 2 minutes, between 0546 and 0548 UTC on Tue, Aug 23, 2016.
Timeline
- 0546: Yuvi gets a page for PAWS being down
- 0546: Yuvi investigates, notices tools is down too
- 0547: SSHs into tools-proxy-01, looks at error.log. Notices lots of "768 worker_connections are not enough" errors
- 0548: Restarts nginx, fixing the issue for now.
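The error.log check at 0547 can be approximated with a short script. This is a minimal sketch, not the tooling used during the incident: it only assumes that each alert line contains the text quoted above, and the function name and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches the alert text quoted in the timeline; the leading number is the
# worker_connections limit that nginx reports as exhausted.
PATTERN = re.compile(r"(\d+) worker_connections are not enough")


def count_exhaustion_alerts(lines):
    """Return a Counter mapping the reported limit to the number of alert lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines for illustration, not taken from the real log.
    sample = [
        "[alert] 768 worker_connections are not enough",
        "[alert] 768 worker_connections are not enough",
        "[error] some unrelated error",
    ]
    for limit, hits in count_exhaustion_alerts(sample).items():
        print(f"limit {limit}: {hits} alert line(s)")
```

To run it against a real log, pass an open file object instead of the sample list, since the function accepts any iterable of lines.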
Conclusions
- Our current worker_connections limit is too low.
- There was no widespread paging for this. The PAWS alert is set to page only Yuvi, and it caught this outage only incidentally.
- We had a higher worker_connections limit, but that was killed in favor of the default number in https://gerrit.wikimedia.org/r/#/c/297829/.
- There's no quick way to fail over tools-proxy, so intense debugging under pressure took priority over failing over and then investigating calmly.
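To see why a limit of 768 is low for a proxy, note that nginx counts every connection a worker holds against worker_connections, including connections to proxied servers as well as connections with clients. The sketch below is a rough estimate under stated assumptions: each proxied request is assumed to hold two connections (one to the client, one to the backend), and the number of worker processes on tools-proxy-01 is not given in this report, so the value used here is only a placeholder.

```python
def max_proxied_clients(worker_processes, worker_connections, conns_per_client=2):
    """Estimate how many proxied clients nginx can serve at the same time.

    conns_per_client=2 is an assumption: one connection to the client and
    one to the upstream webservice for each proxied request.
    """
    return (worker_processes * worker_connections) // conns_per_client


if __name__ == "__main__":
    # worker_processes=1 is a placeholder; the real value is not in this report.
    print(max_proxied_clients(worker_processes=1, worker_connections=768))
```

Under these assumptions, a single worker with the default limit would serve only a few hundred concurrent proxied clients.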
Actionables
- Increase the worker_connections limit, tune nginx properly task T143637
- Set up paging with a super simple webservice, to replace the killed tools home page check task T143638
- Make a script that facilitates failover of tools / nova proxy task T143639
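The "super simple webservice" for paging could look roughly like the following. This is a hedged sketch using only the Python standard library, not the service that was eventually deployed: the handler name, the make_server helper, the port, and the "OK" response body are all hypothetical choices.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answer every GET with a fixed 200 response for a monitoring check."""

    def do_GET(self):
        body = b"OK\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress per-request logging; the monitoring check hits this often.
        pass


def make_server(host="127.0.0.1", port=8000):
    """Build the server; the host and port defaults are placeholders."""
    return HTTPServer((host, port), HealthHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

Because the service has no logic of its own, a failed check against it through the proxy would point at the proxy or the network rather than at an individual tool.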