
Incident documentation/2021-11-10 cirrussearch commonsfile outage

imported>Ryan Kemper
(→‎Summary: Make timeline a bit more accurate / detailed)
 
imported>Krinkle
 
{{irdoc|status=draft}} <!--
#REDIRECT [[Incidents/2021-11-10 cirrussearch commonsfile outage]]
The status field should be one of:
* {{irdoc|status=draft}} - Initial status. When you're happy with the state of your draft, change it to status=review.
* {{irdoc|status=review}} - The incident review working group will contact you then to finalise the report. See also the steps on [[Incident documentation]].
* {{irdoc|status=final}}
-->
 
== Summary ==
 
To investigate a bug, queries were being run against the active production CirrusSearch cluster (eqiad cirrussearch) through a tunnel from mw-vagrant. <code>vagrant provision</code> was (probably) run later without the tunnel having been properly closed. For reasons not fully understood, this resulted in the index <code>commonswiki_file_1623767607</code> being deleted and recreated by the script.
 
As a result, all search queries for commonswiki files failed outright. Furthermore, any "cross-wiki" searches<ref>[https://logstash.wikimedia.org/goto/73a9d7e35f409c0d122888d42df94761 Log events of all affected requests] ('''note''': requires [[Logstash]] access)</ref> that included Commons, such as those behind the search sidebar of many wikis, failed as well. Notably, English Wikipedia was not affected, because its community disables the Commons integration.
 
For context, when the Wikipedia search page <code>Special:Search</code> is used, most Wikipedias query their sister wikis along with Commons. Any wiki that included Commons in its "sidebar" (right side of the page) would therefore have had the query fail.
 
Note that, for Wikipedia search, the "Go box" in the top-right corner (how most users search for articles) was not impacted. Only the full search page <code>Special:Search</code> failed, and only on wikis that had Commons as one of the possible sister search results in the right sidebar.
 
'''Impact''': Users were impacted between 14:00 and 16:32 (about 2.5 hours). All Commons file searches failed, as did Special:Search on many wikis (but notably not English Wikipedia).
 
=== Timeline ===
'''15:21''' First ticket filed by impacted user https://phabricator.wikimedia.org/T295478
 
'''15:28''' Additional, largely duplicate ticket filed by user https://phabricator.wikimedia.org/T295480
 
'''15:32''' <code><Dylsss> Searching for files on Commons is currently impossible, I believe this is quite critical given the whole point of Commons is being a file repository </code>
 
'''15:52''' Initial attempt to shift cirrussearch traffic to codfw; it did not take effect because the patch was missing a required line (https://sal.toolforge.org/log/05mNCn0B1jz_IcWuO9iw)
 
'''16:32''' Search team operator successfully moves all cirrussearch traffic to codfw, resolving user impact (https://sal.toolforge.org/log/8p2xCn0Ba_6PSCT9sorW)
 
'''??? (In future)''' Index successfully restored, and traffic is returned to eqiad
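
The intervals implied by the timestamps above can be checked with a short calculation. This is only an illustrative sketch: the event labels and the <code>minutes_between</code> helper are made up for this example, and it assumes all timestamps fall on the same day and in the same timezone.

```python
from datetime import datetime

# Timestamps copied from the summary and timeline of this report.
# The dictionary keys are illustrative labels, not official event names.
EVENTS = {
    "impact_start": "14:00",
    "first_ticket": "15:21",
    "failed_shift_to_codfw": "15:52",
    "traffic_moved_to_codfw": "16:32",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    resolved = EVENTS["traffic_moved_to_codfw"]
    for label, stamp in EVENTS.items():
        if label != "traffic_moved_to_codfw":
            print(f"{label} -> resolution: {minutes_between(stamp, resolved)} min")
```

By this calculation, user impact lasted 152 minutes, the first ticket was filed 81 minutes after impact began, and mitigation came 71 minutes after that first ticket.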
 
'''References''':
<references />
 
== Actionables ==
<!--
<mark>Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.</mark>
 
* <mark>To do #1 (TODO: Create task)</mark>
* <mark>To do #2 (TODO: Create task)</mark>
 
<mark>TODO: Add the [[phab:project/view/4758/|#Sustainability (Incident Followup)]] Phabricator tag to these tasks.</mark>
-->
* Future one-off debugging of the kind that triggered this incident, when it requires production data, should be done on <code>cloudelastic</code>, an up-to-date, read-only Elasticsearch cluster. If production data is needed but data up to one week stale is acceptable, <code>relforge</code> should be used instead.
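
The guidance above amounts to a small decision rule, sketched below. This is a minimal illustration and not an existing tool: the function name, its parameters, and the <code>"local"</code> fallback for debugging that needs no production data are assumptions made for this example. The point it illustrates is that the production search clusters are never a valid answer.

```python
def choose_debug_cluster(needs_production_data: bool,
                         tolerates_week_old_data: bool) -> str:
    """Pick a cluster for one-off debugging, following the actionable above.

    Hypothetical helper for illustration only. The production search
    clusters are deliberately never returned.
    """
    if not needs_production_data:
        # Assumption: without a need for production data, the development
        # environment's own Elasticsearch instance is sufficient.
        return "local"
    if tolerates_week_old_data:
        # Production data that may be up to one week stale.
        return "relforge"
    # Up-to-date, read-only copy of production data.
    return "cloudelastic"


if __name__ == "__main__":
    print(choose_debug_cluster(needs_production_data=True,
                               tolerates_week_old_data=False))
```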

Latest revision as of 17:49, 8 April 2022