Incident documentation/2021-11-10 cirrussearch commonsfile outage: Difference between revisions

imported>Herron
imported>Krinkle
 
{{irdoc|status=draft}} <!--
#REDIRECT [[Incidents/2021-11-10 cirrussearch commonsfile outage]]
The status field should be one of:
* {{irdoc|status=draft}} - Initial status. When you're happy with the state of your draft, change it to status=review.
* {{irdoc|status=review}} - The incident review working group will contact you then to finalise the report. See also the steps on [[Incident documentation]].
* {{irdoc|status=final}}
-->
==Summary and Metadata==
The metadata is aimed at helping provide a quick snapshot of context around what happened during the incident.
{| class="wikitable"
|'''Incident ID'''
|2021-11-10 cirrussearch commonsfile outage
|'''UTC Start Timestamp:'''
|YYYY-MM-DD hh:mm:ss
|-
|'''Incident Task'''
|https://phabricator.wikimedia.org/T299967
|'''UTC End Timestamp'''
|YYYY-MM-DD hh:mm:ss
|-
|'''People Paged'''
|<amount of people>
|'''Responder Count'''
| <amount of people>
|-
|'''Coordinator(s)'''
|Names - Emails
|'''Relevant Metrics / SLO(s) affected'''
|Relevant metrics
% error budget
|-
|'''Summary:'''
| colspan="3" |For about 2.5 hours (14:00-16:32 UTC), the Search results page was unavailable on many wikis (except for English Wikipedia). On Wikimedia Commons the search suggestions feature was unresponsive as well.
|}
On 10 November 2021, as part of verifying a bug report, a developer submitted a high volume of search queries against the active production Cirrus cluster (eqiad cirrussearch) via a tunnel from their local mw-vagrant environment. <code>vagrant provision</code> was (probably) later run without the tunnel first being properly closed, which, for reasons not yet understood, resulted in the deletion and recreation of the <code>commonswiki_file_1623767607</code> index.
 
As a direct consequence, any Elasticsearch queries that targeted media files from commonswiki encountered a hard failure.
 
During the incident, all media searches on Wikimedia Commons failed. Wikipedia projects were impacted as well,<ref>[https://logstash.wikimedia.org/goto/73a9d7e35f409c0d122888d42df94761 Log events of all affected requests] ('''note''': requires [[Logstash]] access)</ref> through the "cross-wiki" feature of the sidebar on Search results pages. This cross-wiki feature is enabled on most wikis by default, though notably not on English Wikipedia where the community disabled search integration to Commons.
 
Note that the search suggestions feature, as present on all article pages, was not affected (except on Wikimedia Commons itself). The search suggestions field is how most searches are performed on Wikipedia, and it was not impacted. Rather, the incident impacted the dedicated Search results page ("Special:Search"), which consistently failed to return results on wikis where the rendering of that page includes a sidebar with results from Wikimedia Commons.
 
'''Impact''': For about 2.5 hours (14:00-16:32 UTC), the Search results page was unavailable on many wikis (except for English Wikipedia). On Wikimedia Commons the search suggestions feature was unresponsive as well.
 
=== Timeline ===
'''15:21''' First ticket filed by impacted user https://phabricator.wikimedia.org/T295478
 
'''15:28''' Additional, largely duplicate ticket filed by user https://phabricator.wikimedia.org/T295480
 
'''15:32''' <code><Dylsss> Searching for files on Commons is currently impossible, I believe this is quite critical given the whole point of Commons is being a file repository </code>
 
'''15:52''' Initial attempt to shift cirrussearch traffic to codfw (did not work because the patch was missing a required line) (https://sal.toolforge.org/log/05mNCn0B1jz_IcWuO9iw)
 
'''16:32''' Search team operator successfully moves all cirrussearch traffic to codfw, resolving user impact (https://sal.toolforge.org/log/8p2xCn0Ba_6PSCT9sorW)
 
'''??? (In future)''' Index successfully restored, and traffic is returned to eqiad
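
The "about 2.5 hours" figure and the gaps between timeline entries can be checked with a little arithmetic. The following is a minimal sketch, assuming the 14:00 impact start quoted in the summary and the timeline timestamps above; the helper names <code>at</code> and <code>minutes_between</code> are illustrative only and are not part of any incident tooling.

```python
from datetime import datetime

# All timestamps are UTC on the day of the incident.
DAY = "2021-11-10"


def at(hhmm: str) -> datetime:
    """Parse an HH:MM timeline entry into a datetime on the incident day."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


# Times taken from the summary (impact start) and the timeline entries.
events = {
    "impact_start": at("14:00"),
    "first_ticket": at("15:21"),
    "failed_switch": at("15:52"),
    "traffic_moved": at("16:32"),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named events."""
    return int((events[end] - events[start]).total_seconds() // 60)


impact_minutes = minutes_between("impact_start", "traffic_moved")
report_delay = minutes_between("impact_start", "first_ticket")
report_to_fix = minutes_between("first_ticket", "traffic_moved")
retry_gap = minutes_between("failed_switch", "traffic_moved")

print(f"User impact: {impact_minutes} min ({impact_minutes / 60:.2f} h)")
print(f"Impact start to first ticket: {report_delay} min")
print(f"First ticket to mitigation: {report_to_fix} min")
print(f"Failed switch to successful switch: {retry_gap} min")
```

Under these assumptions, more than half of the impact window had passed before the first user report was filed.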
 
=== References: ===
<references />
==Scorecard==
{| class="wikitable"
| colspan="2" |'''Incident Engagement™  ScoreCard'''
|'''Score'''
|Notes
|-
| rowspan="5" |'''People'''
|Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt)
|0
|NA
|-
|Were the people who responded prepared enough to respond effectively (0/5pt)
|5
|
|-
|Did fewer than 5 people get paged (0/5pt)?
|0
|NA
|-
|Were pages routed to the correct sub-team(s)?
|0
|No pages logged, issue reported via task
|-
|Were pages routed to online (working hours) engineers (0/5pt)? (score 0 if people were paged after-hours)
|0
|No pages logged
|-
| rowspan="6" |'''Process'''
|Was the incident status section actively updated during the incident? (0/1pt)
|1
|
|-
|If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt)
|0
|
|-
|Is there a phabricator task for the incident? (0/1pt)
|1
|
|-
|Are the documented action items assigned?  (0/1pt)
|0
|
|-
|Is this a repeat of an earlier incident (-1 per prev occurrence)
|0
|
|-
|Is there an open task that would prevent this incident / make mitigation easier if implemented? (0/-1pt per task)
|0
|
|-
| rowspan="4" |'''Tooling'''
|Did the people responding have trouble communicating effectively during the incident due to existing tooling or a lack of tooling? (0/5pt)
|0
|
|-
|Did existing monitoring notify the initial responders? (1pt)
|0
|
|-
|Were all engineering tools required available and in service? (0/5pt)
|5
|
|-
|Was there a runbook for all known issues present? (0/5pt)
|0
|
|-
| colspan="2" |'''Total Score'''
|12
|
|}
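
As a cross-check of the total, the per-row scores can be summed by category. This is only a sketch of the arithmetic: the numbers are copied from the table rows in order, and the variable names are illustrative.

```python
# Scores copied from the scorecard rows above, in table order.
scores = {
    "People": [0, 5, 0, 0, 0],
    "Process": [1, 0, 1, 0, 0, 0],
    "Tooling": [0, 0, 5, 0],
}

# Subtotal per category, then the overall total.
subtotals = {category: sum(values) for category, values in scores.items()}
total = sum(subtotals.values())

for category, subtotal in subtotals.items():
    print(f"{category}: {subtotal}")
print(f"Total Score: {total}")
```

The row counts per category (5, 6 and 4) also match the <code>rowspan</code> values used in the table.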
== Actionables ==
<!--
<mark>Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.</mark>
 
* <mark>To do #1 (TODO: Create task)</mark>
* <mark>To do #2 (TODO: Create task)</mark>
 
<mark>TODO: Add the [[phab:project/view/4758/|#Sustainability (Incident Followup)]] Phabricator tag to these tasks.</mark>
-->
* Future one-off debugging of the sort that triggered this incident, when it requires production data, should be done on <code>cloudelastic</code>, which is an up-to-date read-only Elasticsearch cluster. If production data is needed but data up to one week stale is acceptable, <code>relforge</code> should be used instead.
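
The guidance above amounts to a small decision rule. The sketch below is one hedged way to write it down: the function <code>pick_debug_cluster</code> and its parameters are hypothetical and exist only for illustration, and it returns the cluster names used in this report rather than real hostnames. The case where no production data is needed is not covered by the actionable, so the sketch returns <code>None</code> there as an assumption.

```python
from typing import Optional


def pick_debug_cluster(needs_production_data: bool,
                       tolerates_week_old_data: bool) -> Optional[str]:
    """Suggest a cluster for one-off debugging (hypothetical helper).

    Returns the cluster name as used in this report, or None when no
    shared cluster is needed (an assumption, not stated in the report).
    The production eqiad/codfw clusters are never suggested.
    """
    if not needs_production_data:
        # Assumption: a local development index is sufficient.
        return None
    if tolerates_week_old_data:
        # Data up to about one week stale is acceptable.
        return "relforge"
    # Fresh production data is required; use the read-only replica.
    return "cloudelastic"


print(pick_debug_cluster(True, False))
print(pick_debug_cluster(True, True))
print(pick_debug_cluster(False, False))
```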

Latest revision as of 17:49, 8 April 2022