Incident documentation/2021-09-06 Wikifeeds

Revision as of 14:08, 13 September 2021

document status: draft

Summary

A MediaWiki API outage on 2021-09-04 caused a rise in HTTP 503s returned by the Wikifeeds service. The issue was particularly subtle since only some requests resulted in HTTP 503, so the service health checks failed intermittently for brief periods throughout the weekend and went unnoticed until the next Monday. A roll restart of the Wikifeeds Kubernetes pods restored the service to a healthy status.

Timeline

All times in UTC.

  • 2021-09-04T02:40 - OUTAGE BEGINS
  • 2021-09-06T17:32 - One SRE starts to investigate the problem after noticing an Icinga alert about service-checker failures on #wikimedia-operations (two more SREs join over the following hour).
  • 2021-09-06T20:00 - The three SREs in Europe working on the problem (US holiday) decided to reconvene the next morning, since the impact on the service did not seem to warrant a page.
  • 2021-09-07T07:13 - A connection is made between the MediaWiki API outage timings and the rise of HTTP 503s. As a consequence, a roll restart of all the Wikifeeds Kubernetes pods is executed (see the sketch below) - https://sal.toolforge.org/log/Zg0av3sB1jz_IcWuiMhu - OUTAGE ENDS

Metrics related to the above time window: https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?orgId=1&from=1630597675079&to=1630985454607
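
In practice the roll restart was done with the standard WMF deployment tooling; purely as an illustration, the sketch below shows the same idea with the official Kubernetes Python client, bumping the pod template's restart annotation so the pods are recreated. The deployment and namespace names, and the assumption of direct kubeconfig access, are hypothetical.

from datetime import datetime, timezone

from kubernetes import client, config  # pip install kubernetes


def roll_restart(deployment: str, namespace: str) -> None:
    """Trigger a rolling restart by bumping the pod template's restart annotation."""
    config.load_kube_config()  # assumes a kubeconfig with access to the cluster
    apps = client.AppsV1Api()
    patch = {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
                    }
                }
            }
        }
    }
    apps.patch_namespaced_deployment(name=deployment, namespace=namespace, body=patch)


if __name__ == "__main__":
    # Deployment and namespace names are hypothetical placeholders.
    roll_restart(deployment="wikifeeds", namespace="wikifeeds")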

Detection

The issue was detected two days after it started, thanks to a service-checker Icinga alert:

17:32 +<icinga-wm> PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} 
                   (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for 
                   April 29, 2016 returned the unexpected status 504 (expecting: 200): 
                   /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: 
                   Test retrieve featu
17:32 +<icinga-wm> e data for April 29, 2016 returned the unexpected status 504 (expecting: 200): 
                   /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with 
                   aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned 
                   the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds

This alert fired and self-recovered every now and then during the weekend, when attention to IRC-reported errors is lower, and it was noticed by chance by one SRE looking at the #wikimedia-operations IRC channel. A lot of time was spent trying to figure out how the service worked, how to reproduce the problem and what its root cause could be.
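
One way to reproduce the intermittent failures is to poll the same Wikifeeds routes that service-checker exercises and report anything that is not an HTTP 200. A minimal Python sketch follows; the paths come from the alert above, while the internal host and port are assumptions.

import time

import requests  # pip install requests

BASE = "https://wikifeeds.svc.codfw.wmnet:4101"  # assumed internal host and port
DOMAIN = "en.wikipedia.org"
PATHS = [
    f"/{DOMAIN}/v1/page/featured/2016/04/29",
    f"/{DOMAIN}/v1/media/image/featured/2016/04/29",
    f"/{DOMAIN}/v1/page/most-read/2016/01/01?aggregated=true",
]


def probe_once() -> None:
    """Request each checked route once and report anything that is not HTTP 200."""
    for path in PATHS:
        try:
            resp = requests.get(BASE + path, timeout=10)
            if resp.status_code != 200:
                print(f"{path} -> HTTP {resp.status_code}")
        except requests.RequestException as exc:
            print(f"{path} -> request failed: {exc}")


if __name__ == "__main__":
    # The failures were intermittent, so a single successful request proves little;
    # poll for a few minutes instead.
    for _ in range(30):
        probe_once()
        time.sleep(10)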

Conclusions

The main pain point was identifying which systems are involved in handling a request to the Wikifeeds API, and how to reproduce an error case. The documentation on Wikitech is good but generic, and some important details were not clear to all the SREs involved (above all, the fact that a Wikifeeds request involves RESTBase, then Wikifeeds on Kubernetes, and possibly RESTBase again to fetch some auxiliary data).
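
Given that request path, a quick way to narrow down which layer is failing is to compare a request that enters through RESTBase with one sent to the Wikifeeds service directly. The sketch below illustrates the idea; the direct host, port and route are assumptions, and the two URLs exercise roughly comparable (not identical) code paths.

import requests  # pip install requests

# Route that enters through RESTBase (public REST API).
VIA_RESTBASE = "https://en.wikipedia.org/api/rest_v1/feed/featured/2016/04/29"
# Route that hits the Wikifeeds service directly; host, port and path are assumptions.
DIRECT_WIKIFEEDS = "https://wikifeeds.svc.codfw.wmnet:4101/en.wikipedia.org/v1/page/featured/2016/04/29"

for label, url in (("via RESTBase", VIA_RESTBASE), ("direct to Wikifeeds", DIRECT_WIKIFEEDS)):
    try:
        print(f"{label}: HTTP {requests.get(url, timeout=10).status_code}")
    except requests.RequestException as exc:
        print(f"{label}: request failed: {exc}")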

What went well?

  • Automated monitoring (service-checker) detected the intermittent failures, although not in a very precise way.

What went poorly?

  • It was difficult to understand exactly how Wikifeeds works, and where to look for error logs.

Where did we get lucky?

  • We got lucky that the issue was fixed with a simple roll restart of the Wikifeeds service. We had no further ideas about where to look or what was happening, and the roll restart turned out to be the right thing to do.

How many people were involved in the remediation?

  • 3 SREs and 1 software engineer

Links to relevant documentation

Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.

Actionables

Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

  • To do #1 (TODO: Create task)
  • To do #2 (TODO: Create task)

TODO: Add the #Sustainability (Incident Followup) Phabricator tag to these tasks.