You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Difference between revisions of "Incident documentation/2021-09-06 Wikifeeds"

From Wikitech-static
Jump to navigation Jump to search
imported>JMeybohm
imported>JMeybohm
(→‎What went poorly?: Link phab task)
Line 14: Line 14:
* 2021-09-04T02:40 - '''OUTAGE BEGINS'''
* 2021-09-04T02:40 - '''OUTAGE BEGINS'''
* 2021-09-06T17:32 - One SRE starts to investigate the problem after noticing an icinga alert about service-checker failures on #wikimedia-operations (two more SREs will join during the subsequent hour).
* 2021-09-06T17:32 - One SRE starts to investigate the problem after noticing an icinga alert about service-checker failures on #wikimedia-operations (two more SREs will join during the subsequent hour).
*2021-09-06T18:30 - https://phabricator.wikimedia.org/T290445 is created with a list of possible suspects to look into.
*2021-09-06T20:00 - The three SREs in Europe working on the problem (US holiday) decided to reconvene the next morning, the impact for the service seemed not worth a page.
*2021-09-06T20:00 - The three SREs in Europe working on the problem (US holiday) decided to reconvene the next morning, the impact for the service seemed not worth a page.
*2021-09-07T07:08 - One Software engineer notices a recurrent stacktrace listed in the logs, and files a patch for it (turned out to be a red herring, more details [[phab:T290445#7335379|T290445#7335379]]).
* 2021-09-07T07:13 - A connection is made between the MediaWiki API outage timings and the rise of HTTPS 503s. As consequence, a roll restart of all the Wikifeeds Kubernetes pods is executed - https://sal.toolforge.org/log/Zg0av3sB1jz_IcWuiMhu -  '''OUTAGE ENDS'''
* 2021-09-07T07:13 - A connection is made between the MediaWiki API outage timings and the rise of HTTPS 503s. As consequence, a roll restart of all the Wikifeeds Kubernetes pods is executed - https://sal.toolforge.org/log/Zg0av3sB1jz_IcWuiMhu -  '''OUTAGE ENDS'''


Line 35: Line 37:


== Conclusions ==
== Conclusions ==
The main pain point was surely to identify what systems are involved in handling a request for the Wikifeeds API, and how to reproduce one error case. The documentation on Wikitech is good but generic, and there were some important details that not all SREs involved had clear in mind (one above all, the fact that a Wikifeeds request involves Restbase, Wikifeeds on Kubernetes, and possibly again Restbase to fetch some auxiliary data).
The main pain point was surely to identify what systems are involved in handling a request for the Wikifeeds API, and how to reproduce one error case. The documentation on Wikitech is good but generic, and there were some important details that not all SREs involved had clear in mind (one above all, the fact that a Wikifeeds request involves Restbase, Wikifeeds on Kubernetes, and possibly again Restbase to fetch some auxiliary data). Due to the unlucky circumstances (weekend plus US holiday) there were fewer SREs available to work on the problem, and it was not clear if it was worth a page or not. This situation will probably get better when we'll introduce SLOs and more formal on-call process.


=== What went well? ===
=== What went well? ===
Line 43: Line 45:
=== What went poorly? ===
=== What went poorly? ===


* It was difficult to understand exactly how Wikifeeds work, and where to look for error logs.  
* It was difficult to understand exactly how Wikifeeds work, and where to look for error logs. The role of the tls-proxy container in the Wikifeeds pods was immediately recognized as one of the possible source of problems, but there was no in depth analysis of its state (for example, <code>nsenter</code> may have helped in checking if some socket state was inconsistent, causing the connection failures).
*The problem was "fixed" with a roll restart of the Wikifeeds pods, therefore it is possible that the root cause of the problem was an inconsistent state of the envoy proxy running in the tls-proxy container, but we are not completely sure.
*The {{Phabricator/en|T290445}} tracking task was cut a few hours after the first SRE started to check the problem, and the tls-proxy's 503s were listed as possible root cause. The three SREs involved didn't focus on it to concentrate on getting the full picture of the request flow. Nobody thought that the issue was worth a formal Incident Management workflow, but an incident commander would have probably helped in distributing/organizing the work in a better way (rather than all three SREs doing their own research with some communication on IRC about it).  


=== Where did we get lucky? ===
=== Where did we get lucky? ===
Line 55: Line 59:
== Links to relevant documentation ==
== Links to relevant documentation ==
<mark>Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.</mark>
<mark>Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.</mark>
* [[Wikifeeds|Wikifeeds documentation]]


== Actionables ==
== Actionables ==
<mark>Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.</mark>
* Add (more) documentation to the Wikifeeds Wikitech page to describe what is the relationship between Restbase, Wikifeeds and its tls-proxy container.
 
*Establish a formal SLO for Wikifeeds (this is part of an ongoing effort from SRE to introduce SLOs for all our services, not worth a task but good to mention).
* <mark>To do #1 (TODO: Create task)</mark>
*It would be great to find a way to reproduce the Envoy inconsistent state, or check if there is already literature/links/etc.. about it. We may be missing an option in the Envoy's config to prevent this from happening again.
* <mark>To do #2 (TODO: Create task)</mark>


<mark>TODO: Add the [[phab:project/view/4758/|#Sustainability (Incident Followup)]] Phabricator tag to these tasks.</mark>
<mark>TODO: Add the [[phab:project/view/4758/|#Sustainability (Incident Followup)]] Phabricator tag to these tasks.</mark>

Revision as of 15:23, 15 September 2021

document status: draft

Summary

A MediaWiki API outage happened on 2021-09-04 caused a rise in HTTP 503s returned by the Wikifeeds service. The issue was particularly subtle since only some requests ended up in HTTP 503, so the service health checks failed intermittently for a brief amount of time every now and then during the weekend, getting unnoticed until the next Monday. A roll restart of the Wikifeeds Kubernetes pods restored the service to an healthy status.

Timeline

All times in UTC.

  • 2021-09-04T02:40 - OUTAGE BEGINS
  • 2021-09-06T17:32 - One SRE starts to investigate the problem after noticing an icinga alert about service-checker failures on #wikimedia-operations (two more SREs will join during the subsequent hour).
  • 2021-09-06T18:30 - https://phabricator.wikimedia.org/T290445 is created with a list of possible suspects to look into.
  • 2021-09-06T20:00 - The three SREs in Europe working on the problem (US holiday) decided to reconvene the next morning, the impact for the service seemed not worth a page.
  • 2021-09-07T07:08 - One Software engineer notices a recurrent stacktrace listed in the logs, and files a patch for it (turned out to be a red herring, more details T290445#7335379).
  • 2021-09-07T07:13 - A connection is made between the MediaWiki API outage timings and the rise of HTTPS 503s. As consequence, a roll restart of all the Wikifeeds Kubernetes pods is executed - https://sal.toolforge.org/log/Zg0av3sB1jz_IcWuiMhu - OUTAGE ENDS

Metrics related to the above time window: https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?orgId=1&from=1630597675079&to=1630985454607

Wikifeeds tls-proxy (envoy) logs related to the above time window: https://logstash.wikimedia.org/goto/22555b41f9b7bbd2829081a041d76bb2

Detection

The detection of the issue happened two days after it started, thanks for a service-checker icinga alert:

17:32 +<icinga-wm> PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} 
                   (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for 
                   April 29, 2016 returned the unexpected status 504 (expecting: 200): 
                   /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: 
                   Test retrieve featu
17:32 +<icinga-wm> e data for April 29, 2016 returned the unexpected status 504 (expecting: 200): 
                   /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with 
                   aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned 
                   the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds

This alert fired and self recovered every now and then during the weekend, where attention to IRC reported errors is lower, and it got noticed by one SRE by chance while looking at the #wikimedia-operations IRC channel. A lot of time was spent trying to figure out how the service worked, how to reproduce the problem and what could be the root cause of it.

Conclusions

The main pain point was surely to identify what systems are involved in handling a request for the Wikifeeds API, and how to reproduce one error case. The documentation on Wikitech is good but generic, and there were some important details that not all SREs involved had clear in mind (one above all, the fact that a Wikifeeds request involves Restbase, Wikifeeds on Kubernetes, and possibly again Restbase to fetch some auxiliary data). Due to the unlucky circumstances (weekend plus US holiday) there were fewer SREs available to work on the problem, and it was not clear if it was worth a page or not. This situation will probably get better when we'll introduce SLOs and more formal on-call process.

What went well?

  • Automated monitoring (service-checker) detected the intermittent failures, even if not in a very precise way.

What went poorly?

  • It was difficult to understand exactly how Wikifeeds work, and where to look for error logs. The role of the tls-proxy container in the Wikifeeds pods was immediately recognized as one of the possible source of problems, but there was no in depth analysis of its state (for example, nsenter may have helped in checking if some socket state was inconsistent, causing the connection failures).
  • The problem was "fixed" with a roll restart of the Wikifeeds pods, therefore it is possible that the root cause of the problem was an inconsistent state of the envoy proxy running in the tls-proxy container, but we are not completely sure.
  • The task T290445 tracking task was cut a few hours after the first SRE started to check the problem, and the tls-proxy's 503s were listed as possible root cause. The three SREs involved didn't focus on it to concentrate on getting the full picture of the request flow. Nobody thought that the issue was worth a formal Incident Management workflow, but an incident commander would have probably helped in distributing/organizing the work in a better way (rather than all three SREs doing their own research with some communication on IRC about it).

Where did we get lucky?

  • We got lucky that the issue was fixed with a simple roll restart of the Wikifeeds service. We didn't have more idea about where to look or what was happening, and the roll restart ended up to be the right thing to do.

How many people were involved in the remediation?

  • 3 SREs and one Software Engineer

Links to relevant documentation

Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.

Actionables

  • Add (more) documentation to the Wikifeeds Wikitech page to describe what is the relationship between Restbase, Wikifeeds and its tls-proxy container.
  • Establish a formal SLO for Wikifeeds (this is part of an ongoing effort from SRE to introduce SLOs for all our services, not worth a task but good to mention).
  • It would be great to find a way to reproduce the Envoy inconsistent state, or check if there is already literature/links/etc.. about it. We may be missing an option in the Envoy's config to prevent this from happening again.

TODO: Add the #Sustainability (Incident Followup) Phabricator tag to these tasks.