Incident documentation/2019-04-07 Zotero
On the morning of April 7, most of the zotero pods in codfw started consuming excessive memory and stopped responding to requests.
Given that restbase and citoid are active/active, any user coming from eqsin, codfw, or ulsfo and trying to get a citation would have seen a degraded service.
Both the LVS-level check of zotero and the service-checker test of citoid reported the issue.
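For illustration, a minimal manual probe along the same lines as those checks might look like the sketch below. The hostname, port, and path are assumptions invented for the example, not the actual check configuration.

```python
# Minimal sketch of a manual HTTP probe, in the spirit of the LVS and
# service-checker tests. The URL below is hypothetical; the real zotero
# service name and port may differ.
import urllib.request

URL = "http://zotero.svc.codfw.wmnet:1969/"  # hypothetical endpoint

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service answers an HTTP request within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:  # covers URLError, connection resets, and timeouts
        return False

if __name__ == "__main__":
    print("zotero responsive" if probe(URL) else "zotero unresponsive")
```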
All times in UTC.
- 04:58 sudden increase in the memory used by the first zotero pod
- 05:05 all zotero pods in codfw now have a high memory watermark. The service checker on citoid reports a problem
- 05:12 The zotero LVS endpoint has become unresponsive to monitoring. A page is sent. OUTAGE BEGINS
- 05:13 A recovery page arrives. The service keeps flapping, and more pages are sent at 05:23 (recovery at 05:37), at 05:41 (recovery at 05:42), and at 06:00
- 06:01 Alexandros, Giuseppe and marostegui respond to the pages, which are now reaching people in EU timezones
- 06:03 Giuseppe depools zotero in codfw, even though a recovery arrives, so that the issue can be analyzed further. OUTAGE ENDS
- 06:23 After some log analysis, it is decided to kill the pods that still show a high memory watermark (see the sketch below).
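The 06:23 cleanup amounts to finding the pods above the memory watermark and deleting them so the deployment controller replaces them with fresh ones. A rough sketch of that step follows; the namespace and threshold are invented for the example, and the real procedure should follow the kubernetes debugging documentation mentioned below.

```python
# Rough sketch of the 06:23 remediation: delete pods above a memory
# watermark so the deployment recreates them. Namespace and threshold
# are hypothetical values, not the real configuration.
import subprocess

NAMESPACE = "zotero"   # hypothetical namespace
THRESHOLD_MI = 1500    # hypothetical watermark, in MiB

def high_memory_pods(namespace: str, threshold_mi: int) -> list[str]:
    """Return names of pods whose memory usage exceeds the watermark."""
    out = subprocess.run(
        ["kubectl", "top", "pods", "-n", namespace, "--no-headers"],
        capture_output=True, text=True, check=True,
    ).stdout
    pods = []
    for line in out.splitlines():
        # Each line looks like: "zotero-abc123  10m  1800Mi"
        name, _cpu, mem = line.split()[:3]
        if mem.endswith("Mi") and int(mem[:-2]) > threshold_mi:
            pods.append(name)
    return pods

for pod in high_memory_pods(NAMESPACE, THRESHOLD_MI):
    # Deleting the pod lets the deployment controller start a fresh one.
    subprocess.run(["kubectl", "delete", "pod", "-n", NAMESPACE, pod], check=True)
```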
There isn't much to conclude, apart from the fact that we still have situations where zotero can fail because of sudden memory increases. This is a bug in the software, and unless we can create a reproducible test case (the failures seem to be triggered by specific user requests), there is little chance of seeing it fixed. On the positive side, this had not happened in a long time.
The root cause of this specific incident is still TBD.
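Until a reproducible test case exists, the memory watermark itself remains the main signal. As a sketch, the per-pod working set could be pulled from the Prometheus HTTP API using the standard cAdvisor metric; the Prometheus URL and label names here are assumptions and vary between setups.

```python
# Sketch: query the per-pod memory working set from Prometheus.
# The server URL, namespace value, and label names are assumptions.
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.svc.codfw.wmnet/k8s"  # hypothetical URL
QUERY = 'max by (pod) (container_memory_working_set_bytes{namespace="zotero"})'

url = PROM + "/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url, timeout=10) as resp:
    data = json.load(resp)

for result in data["data"]["result"]:
    mib = float(result["value"][1]) / 2**20
    print(f"{result['metric'].get('pod', '?')}: {mib:.0f} MiB")
```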
What went well?
- The problem is known, the monitoring adequate, and the response clear. This isn't a new kind of outage, and we're well equipped to respond to it.
What went poorly?
- No response to repeated pages/recoveries for almost an hour.
- No runbook exists for this kind of incident.
Where did we get lucky?
- It's fair to say that a malfunctioning zotero doesn't have a huge impact: request rates average 0.05 req/s. See https://grafana.wikimedia.org/d/NJkCVermz/citoid?refresh=5m&panelId=46&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-service=citoid&from=now-7d&to=now
- That some of us are light sleepers, probably?
Links to relevant documentation
As I understand it, the documentation on how to operate on kubernetes for debugging should be written this quarter.
- Document kubernetes debugging procedures (TODO: Create task)
- Identify and reproduce the situation that caused the memory leak (TODO: Create task)