Incident documentation/2021-09-06 Wikifeeds

From Wikitech-static
Revision as of 15:58, 10 September 2021 by imported>Elukey (→‎Detection)

document status: draft


A MediaWiki API outage on 2021-09-04 caused a rise in HTTP 503s returned by the Wikifeeds service. The issue was particularly subtle because only some requests ended in an HTTP 503: the service health checks failed intermittently and briefly throughout the weekend, and the problem went unnoticed until the following Monday. A rolling restart of the Wikifeeds Kubernetes pods restored the service to a healthy state.


All times in UTC.

  • 2021-09-04T02:40 - OUTAGE BEGINS
  • 2021-09-06T17:32 - An SRE starts investigating the problem after noticing an Icinga alert about service-checker failures on #wikimedia-operations (two more SREs join during the following hour).
  • 2021-09-06T20:00 - The three SREs in Europe working on the problem (US holiday) decide to reconvene the next morning, since the impact on the service does not seem to warrant a page.
  • 2021-09-07T07:13 - A connection is made between the timing of the MediaWiki API outage and the rise in HTTP 503s. As a consequence, a rolling restart of all the Wikifeeds Kubernetes pods is executed - OUTAGE ENDS
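
For reference, the gap between the start of the outage and the first investigation, as well as the total duration, can be worked out from the timestamps above. The snippet below is only a small arithmetic sketch; the variable names are illustrative and not part of any Wikimedia tooling.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all UTC).
outage_start = datetime.fromisoformat("2021-09-04T02:40")
first_investigation = datetime.fromisoformat("2021-09-06T17:32")
outage_end = datetime.fromisoformat("2021-09-07T07:13")

# Time elapsed before anyone started looking at the alert.
time_to_detect = first_investigation - outage_start
# Time elapsed between the start of the outage and the rolling restart.
total_duration = outage_end - outage_start

print(f"Time to detection: {time_to_detect}")       # 2 days, 14:52:00
print(f"Total outage duration: {total_duration}")   # 3 days, 4:33:00
```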


The issue was detected two days after it started, thanks to a service-checker Icinga alert:

17:32 +<icinga-wm> PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} 
                   (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for 
                   April 29, 2016 returned the unexpected status 504 (expecting: 200): 
                   /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: 
                   Test retrieve featu
17:32 +<icinga-wm> e data for April 29, 2016 returned the unexpected status 504 (expecting: 200): 
                   /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with 
                   aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned 
                   the unexpected status 504 (expecting: 200)

This alert fired and self-recovered intermittently throughout the weekend, when attention to errors reported on IRC is lower, and an SRE noticed it only by chance while looking at the #wikimedia-operations IRC channel. A lot of time was then spent figuring out how the service worked, how to reproduce the problem, and what its root cause might be.
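
One possible way to make this kind of fire-and-recover pattern more visible is to count how often a check flips into CRITICAL within a time window and escalate once it crosses a threshold. The following is a minimal sketch of that idea, not a description of how Icinga or service-checker behave: the function names, the 24-hour window, the threshold of 3, and the sample events are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta


def count_flaps(events, now, window):
    """Count transitions into CRITICAL during the `window` ending at `now`.

    `events` is a chronologically ordered list of (timestamp, state) tuples,
    where state is either "OK" or "CRITICAL". The check is assumed to be OK
    before the first event.
    """
    cutoff = now - window
    flaps = 0
    previous = "OK"
    for timestamp, state in events:
        entered_critical = state == "CRITICAL" and previous != "CRITICAL"
        if entered_critical and cutoff <= timestamp <= now:
            flaps += 1
        previous = state
    return flaps


def should_escalate(events, now, window=timedelta(hours=24), threshold=3):
    """Return True when the check flapped at least `threshold` times."""
    return count_flaps(events, now, window) >= threshold


# Hypothetical check history (NOT taken from the incident logs): three short
# failures that each recover within a few minutes.
day = datetime(2000, 1, 1)
events = [
    (day + timedelta(hours=3), "CRITICAL"),
    (day + timedelta(hours=3, minutes=5), "OK"),
    (day + timedelta(hours=11), "CRITICAL"),
    (day + timedelta(hours=11, minutes=4), "OK"),
    (day + timedelta(hours=19), "CRITICAL"),
    (day + timedelta(hours=19, minutes=3), "OK"),
]
now = day + timedelta(hours=23)

print(count_flaps(events, now, timedelta(hours=24)))  # 3
print(should_escalate(events, now))                   # True
```

Counting transitions rather than total time spent in CRITICAL is a deliberate choice in this sketch: each individual failure here is short, so a duration-based threshold might never trip, while a repeated flip is a signal in its own right.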


What weaknesses did we learn about and how can we address them?

What went well?

  • (Use bullet points) for example: automated monitoring detected the incident, outage was root-caused quickly, etc

What went poorly?

  • (Use bullet points) for example: documentation on the affected service was unhelpful, communication difficulties, etc

Where did we get lucky?

  • (Use bullet points) for example: user's error report was exceptionally detailed, incident occurred when the most people were online to assist, etc

How many people were involved in the remediation?

  • (Use bullet points) for example: 2 SREs and 1 software engineer troubleshooting the issue plus 1 incident commander

Links to relevant documentation

Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.


Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

  • To do #1 (TODO: Create task)
  • To do #2 (TODO: Create task)

TODO: Add the #Sustainability (Incident Followup) Phabricator tag to these tasks.