You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Incident documentation/2021-09-06 Wikifeeds

From Wikitech-static
< Incident documentation
Revision as of 08:36, 14 September 2021 by imported>JMeybohm (→‎Timeline)
Jump to navigation Jump to search

document status: draft

Summary

A MediaWiki API outage happened on 2021-09-04 caused a rise in HTTP 503s returned by the Wikifeeds service. The issue was particularly subtle since only some requests ended up in HTTP 503, so the service health checks failed intermittently for a brief amount of time every now and then during the weekend, getting unnoticed until the next Monday. A roll restart of the Wikifeeds Kubernetes pods restored the service to an healthy status.

Timeline

All times in UTC.

  • 2021-09-04T02:40 - OUTAGE BEGINS
  • 2021-09-06T17:32 - One SRE starts to investigate the problem after noticing an icinga alert about service-checker failures on #wikimedia-operations (two more SREs will join during the subsequent hour).
  • 2021-09-06T20:00 - The three SREs in Europe working on the problem (US holiday) decided to reconvene the next morning, the impact for the service seemed not worth a page.
  • 2021-09-07T07:13 - A connection is made between the MediaWiki API outage timings and the rise of HTTPS 503s. As consequence, a roll restart of all the Wikifeeds Kubernetes pods is executed - https://sal.toolforge.org/log/Zg0av3sB1jz_IcWuiMhu - OUTAGE ENDS

Metrics related to the above time window: https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?orgId=1&from=1630597675079&to=1630985454607

Wikifeeds tls-proxy (envoy) logs related to the above time window: https://logstash.wikimedia.org/goto/22555b41f9b7bbd2829081a041d76bb2

Detection

The detection of the issue happened two days after it started, thanks for a service-checker icinga alert:

17:32 +<icinga-wm> PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} 
                   (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for 
                   April 29, 2016 returned the unexpected status 504 (expecting: 200): 
                   /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: 
                   Test retrieve featu
17:32 +<icinga-wm> e data for April 29, 2016 returned the unexpected status 504 (expecting: 200): 
                   /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with 
                   aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned 
                   the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds

This alert fired and self recovered every now and then during the weekend, where attention to IRC reported errors is lower, and it got noticed by one SRE by chance while looking at the #wikimedia-operations IRC channel. A lot of time was spent trying to figure out how the service worked, how to reproduce the problem and what could be the root cause of it.

Conclusions

The main pain point was surely to identify what systems are involved in handling a request for the Wikifeeds API, and how to reproduce one error case. The documentation on Wikitech is good but generic, and there were some important details that not all SREs involved had clear in mind (one above all, the fact that a Wikifeeds request involves Restbase, Wikifeeds on Kubernetes, and possibly again Restbase to fetch some auxiliary data).

What went well?

  • Automated monitoring (service-checker) detected the intermittent failures, even if not in a very precise way.

What went poorly?

  • It was difficult to understand exactly how Wikifeeds work, and where to look for error logs.

Where did we get lucky?

  • We got lucky that the issue was fixed with a simple roll restart of the Wikifeeds service. We didn't have more idea about where to look or what was happening, and the roll restart ended up to be the right thing to do.

How many people were involved in the remediation?

  • 3 SREs and one Software Engineer

Links to relevant documentation

Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.

Actionables

Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

  • To do #1 (TODO: Create task)
  • To do #2 (TODO: Create task)

TODO: Add the #Sustainability (Incident Followup) Phabricator tag to these tasks.