Incident documentation/2021-09-06 Wikifeeds

Latest revision as of 19:04, 20 October 2021

document status: in-review

Summary

A MediaWiki API outage on 2021-09-04 was initially correlated with a rise in HTTP 503s returned by the Wikifeeds service over the following two days. The issue was particularly subtle: only some requests resulted in HTTP 503s, so the service health checks failed intermittently for brief periods throughout the weekend, and the problem went unnoticed until the next Monday. A rolling restart of the Wikifeeds Kubernetes pods appeared to restore the service to a healthy status.

While the possibility remains that the MediaWiki API outage was related, it was later discovered that end-user traffic to the Wikifeeds service more than quadrupled at peak, violating the Wikifeeds traffic SLO and leaving the service in a degraded state, returning many 503s and 504s. The tls-proxy container was CPU-throttled at times during the peaks of the incident, causing the timeouts.

Impact: For 3 days, the Wikifeeds API failed about 1% of its requests (roughly 5 of every 500 requests per second).
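
As a sanity check, the impact figure follows directly from the request and failure rates quoted above:

```python
# Sanity check of the impact figure: about 5 failed requests out of
# roughly 500 requests per second is a 1% error rate.
total_rps = 500   # approximate Wikifeeds request rate during the incident
failed_rps = 5    # approximate failed requests per second
error_rate = failed_rps / total_rps
print(f"{error_rate:.0%}")  # 1%
```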

Timeline

All times in UTC.

  • 2021-09-04T02:40 - OUTAGE BEGINS
  • 2021-09-06T17:32 - One SRE starts to investigate the problem after noticing an icinga alert about service-checker failures on #wikimedia-operations (two more SREs will join during the subsequent hour).
  • 2021-09-06T18:30 - https://phabricator.wikimedia.org/T290445 is created with a list of possible suspects to look into.
  • 2021-09-06T20:00 - The three SREs working on the problem, all in Europe (it was a US holiday), decided to reconvene the next morning, as the impact to the service did not seem to warrant a page.
  • 2021-09-07T07:08 - A software engineer notices a recurring stack trace in the logs and files a patch for it (this turned out to be a red herring; more details in T290445#7335379).
  • 2021-09-07T07:13 - A connection is made between the timing of the MediaWiki API outage and the rise in HTTP 503s. As a consequence, a rolling restart of all the Wikifeeds Kubernetes pods is executed - https://sal.toolforge.org/log/Zg0av3sB1jz_IcWuiMhu - OUTAGE ENDS
  • 2021-09-28T13:26 - After an incident review spanning multiple days, an engineer figures out that the errors were the result of increased end-user traffic generated by the Wikipedia app. The envoy-based tls-proxy was also mildly throttled during the incident, causing the timeouts. Log and metrics investigation did not reveal any other sign of the envoy-based tls-proxy misbehaving.

Metrics related to the above time window: https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?orgId=1&from=1630597675079&to=1630985454607

Wikifeeds tls-proxy (envoy) logs related to the above time window: https://logstash.wikimedia.org/goto/22555b41f9b7bbd2829081a041d76bb2

Detection

The detection of the issue happened two days after it started, thanks to a service-checker icinga alert:

17:32 +<icinga-wm> PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} 
                   (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for 
                   April 29, 2016 returned the unexpected status 504 (expecting: 200): 
                   /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: 
                   Test retrieve featu
17:32 +<icinga-wm> e data for April 29, 2016 returned the unexpected status 504 (expecting: 200): 
                   /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with 
                   aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned 
                   the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds

This alert fired and self-recovered every now and then during the weekend, when attention to IRC-reported errors is lower, and it was noticed by chance by one SRE looking at the #wikimedia-operations IRC channel. A lot of time was spent trying to figure out how the service worked, how to reproduce the problem, and what its root cause could be.
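
The check above can be approximated with a small probe script. This is only an illustrative sketch: the real service-checker derives its test cases from the service's OpenAPI spec, and the endpoint templates below are simply copied from the alert text.

```python
import urllib.error
import urllib.request

# Endpoint templates copied from the icinga alert above.
ENDPOINTS = [
    "/{domain}/v1/page/featured/{year}/{month}/{day}",
    "/{domain}/v1/media/image/featured/{year}/{month}/{day}",
    "/{domain}/v1/page/most-read/{year}/{month}/{day}",
]

def fill(template, domain="en.wikipedia.org", year="2016", month="04", day="29"):
    """Substitute the template placeholders, mimicking the alert's test case."""
    for key, value in dict(domain=domain, year=year, month=month, day=day).items():
        template = template.replace("{%s}" % key, value)
    return template

def probe(base_url, path):
    """Request one endpoint and return (path, HTTP status)."""
    try:
        with urllib.request.urlopen(base_url + fill(path), timeout=10) as resp:
            return path, resp.status
    except urllib.error.HTTPError as e:
        return path, e.code

def classify(results):
    """CRITICAL if any endpoint returned a non-200 status, like service-checker."""
    bad = [(path, status) for path, status in results if status != 200]
    return ("CRITICAL", bad) if bad else ("OK", [])
```

During the incident, a probe like this would intermittently see 503/504 statuses and flip to CRITICAL, matching the flapping alert described above.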

Conclusions

The main pain point was identifying which systems are involved in handling a request to the Wikifeeds API, and how to reproduce an error case. The documentation on Wikitech is good but generic, and some important details were not clear to all the SREs involved (above all, the fact that a Wikifeeds request involves Restbase, Wikifeeds on Kubernetes, and possibly Restbase again to fetch some auxiliary data). Due to the unlucky circumstances (a weekend plus a US holiday) fewer SREs were available to work on the problem, and it was not clear whether it warranted a page. This situation will probably improve once SLOs and a more formal on-call process are introduced.

What went well?

  • Automated monitoring (service-checker) detected the intermittent failures, albeit not very precisely.

What went poorly?

  • It was difficult to understand exactly how Wikifeeds works, and where to look for error logs. The role of the tls-proxy container in the Wikifeeds pods was immediately recognized as a possible source of problems, but there was no in-depth analysis of its state (for example, nsenter may have helped to check whether some socket state was inconsistent, causing the connection failures). A more in-depth analysis later on yielded no clear signs of the tls-proxy misbehaving.
  • The problem was "fixed" with a rolling restart of the Wikifeeds pods, so at the time it was deemed possible that the root cause was an inconsistent state of the envoy proxy running in the tls-proxy container, but this no longer looks plausible.
  • The T290445 tracking task was created a few hours after the first SRE started to look into the problem, and the tls-proxy's 503s were listed as a possible root cause. The three SREs involved didn't focus on it, concentrating instead on getting the full picture of the request flow. Nobody thought the issue warranted the formal Incident Management workflow, but an incident commander would probably have helped to distribute and organize the work better (rather than all three SREs doing their own research, with some communication about it on IRC).
  • Nominally, in a Kubernetes environment, it is possible for workloads to autoscale automatically based on the number of requests they receive or their CPU/memory consumption. Unfortunately, we don't have the infrastructure for that yet.
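
For illustration, request- or CPU-based autoscaling would normally be expressed as a standard Kubernetes HorizontalPodAutoscaler. The names and thresholds below are purely hypothetical assumptions, since (as noted above) our infrastructure did not support this at the time:

```yaml
# Hypothetical sketch only: deployment name, namespace, and thresholds
# are illustrative assumptions, not production configuration.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: wikifeeds          # assumed deployment name
  namespace: wikifeeds     # assumed namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: wikifeeds
  minReplicas: 8           # illustrative baseline
  maxReplicas: 32          # headroom for a ~4x traffic spike
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```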

Where did we get lucky?

  • We got lucky that the issue was fixed with a simple rolling restart of the Wikifeeds service. We had no better idea of where to look or what was happening, and the rolling restart ended up seeming like the right thing to do. At a later incident review, it became obvious that the organic traffic spike was dissipating by then, which caused this impression.

How many people were involved in the remediation?

  • 3 SREs and one Software Engineer
  • 1 SRE reviewed and investigated the incident a few weeks later.

Links to relevant documentation

  • Wikifeeds documentation

Actionables

  • Add (more) documentation to the Wikifeeds Wikitech page describing the relationship between Restbase, Wikifeeds, and its tls-proxy container. T291912
  • Increase the capacity for wikifeeds T291914
  • Originally, finding a way to reproduce the Envoy inconsistent state (or checking whether literature about it already exists) was deemed a worthy actionable. However, since the errors were related to the greatly increased end-user traffic levels, and since the incident review and investigation did not reveal signs of Envoy having an inconsistent state, this actionable was dropped.