Incident documentation/2019-04-07 Zotero

Revision as of 19:35, 31 March 2021 by imported>Krinkle (Krinkle moved page Incident documentation/20190407-Zotero to Incident documentation/2019-04-07 Zotero)

Summary

On the morning of April 7, 2019 (UTC), most of the zotero pods in codfw started using too much memory and stopped responding to requests.

Impact

Because restbase and citoid are active/active, any user coming from eqsin, codfw or ulsfo and trying to get a citation would have seen degraded service.

Detection

The LVS-level check of zotero and the service-checker test of citoid both reported the issue.

Timeline

All times in UTC.

  • 04:58 sudden increase in the memory used by the first zotero pod
  • 05:05 All zotero pods in codfw now show a high memory watermark. Service checker on citoid reports a problem
  • 05:12 The zotero LVS endpoint has become unresponsive to monitoring. A page is sent. OUTAGE BEGINS
  • 05:13 A recovery page arrives. The service keeps flapping and more pages are sent at 05:23 (recovery at 05:37), at 05:41 (recovery at 05:42) and at 06:00
  • 06:01 Alexandros, Giuseppe and marostegui respond to the page that is now being sent to people in EU timezones
  • 06:03 Giuseppe depools zotero in codfw, even though a recovery arrives, so that the issue can be better analyzed. OUTAGE ENDS
  • 06:23 After some log analysis, it is decided to kill the pods which still show a high memory watermark.
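The durations implied by the timeline above can be worked out with a short sketch. It uses only the timestamps listed in the timeline; the variable names are illustrative, and the 06:00 page is left out of the page/recovery pairs because the timeline records no recovery time for it.

```python
from datetime import datetime

def at(hhmm: str) -> datetime:
    """Parse an HH:MM timestamp (UTC) on the day of the incident."""
    return datetime.strptime("2019-04-07 " + hhmm, "%Y-%m-%d %H:%M")

def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two HH:MM timestamps."""
    return int((at(end) - at(start)).total_seconds() // 60)

# Taken from the timeline: first page (outage begins), first human
# response, and the depool of zotero in codfw (outage ends).
outage_minutes = minutes_between("05:12", "06:03")
minutes_to_response = minutes_between("05:12", "06:01")

# (page, recovery) pairs from the timeline. The 06:00 page is omitted
# because no recovery time is recorded for it.
page_recovery_pairs = [("05:12", "05:13"), ("05:23", "05:37"), ("05:41", "05:42")]
alerting_minutes = sum(minutes_between(p, r) for p, r in page_recovery_pairs)

print(f"outage duration: {outage_minutes} min")
print(f"time to first response: {minutes_to_response} min")
print(f"time spent in alerting state (recorded pairs only): {alerting_minutes} min")
```

Under these assumptions the outage window is 51 minutes and the first response comes 49 minutes after the first page, while the monitoring check was in a failing state for only 16 of those minutes across the recorded pairs.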

Conclusions

There isn't much to conclude, apart from the fact that zotero can still fail because of sudden memory increases. This is a bug in the software, and unless we can create a reproducible test case (the failure seems to be triggered by some user request), it is unlikely to be fixed. On the positive side, this hadn't happened in a long time.

The root cause of this specific incident is still TBD.

What went well?

  • The problem is known, the monitoring adequate, the response clear. This isn't a new kind of outage and we're well equipped to respond to it.

What went poorly?

  • No response to repeated pages/recoveries for almost 1 hour.
  • No runbook exists for this.

Where did we get lucky?

Links to relevant documentation

As I understand it, the documentation on how to operate on kubernetes for debugging should be written this quarter.

Actionables

  • Document kubernetes debugging procedures (TODO: Create task)
  • Identify and reproduce the situation that caused the memory leak (TODO: Create task)