
Difference between revisions of "Incident documentation/2021-11-10 cirrussearch commonsfile outage"

From Wikitech-static
imported>Ryan Kemper
(→‎Summary: Make timeline a bit more accurate / detailed)
 
imported>Krinkle
 
Line 8: Line 8:
== Summary ==
== Summary ==


In order to test a bug, queries were being run against the active production cirrus cluster (eqiad cirrussearch) via a tunnel from mw-vagrant. <code>vagrant provision</code> was (probably) later run without the tunnel being properly closed, resulting in (for reasons not fully understood) the index `commonswiki_file_1623767607` being deleted and recreated by the script.
On 10 November, as part of verifying a bug report, a developer submitted a high volume of search queries against the active production Cirrus cluster (eqiad cirrussearch) via a tunnel from their local mw-vagrant environment. <code>vagrant provision</code> was (probably) later run without the tunnel being properly closed first, which resulted in (for reasons not yet understood) the deletion and recreation of the <code>commonswiki_file_1623767607</code> index.


As a result, any search queries for commonswiki files directly failed. Furthermore, any "cross-wiki" searches<ref>[https://logstash.wikimedia.org/goto/73a9d7e35f409c0d122888d42df94761 Log events of all affected requests] ('''note''': requires [[Logstash]] access)</ref> that searched Commons, such as the sidebar of many wikis (notably, not English b/c the English Wikipedia community disables the commons integration), failed as well.
As a direct consequence, any Elasticsearch queries that targeted media files from commonswiki encountered a hard failure.


For context, when using the Wikipedia search function <code>Special:Search</code>, most wikipedias queries their sister wikis along with commons. So any wiki who included Commons in their "sidebar" (right side of page) would have had the query fail.
During the incident, all media searches on Wikimedia Commons failed. Wikipedia projects were impacted as well,<ref>[https://logstash.wikimedia.org/goto/73a9d7e35f409c0d122888d42df94761 Log events of all affected requests] ('''note''': requires [[Logstash]] access)</ref> through the "cross-wiki" feature of the sidebar on Search results pages. This cross-wiki feature is enabled on most wikis by default, though notably not on English Wikipedia where the community disabled search integration to Commons.


Note that with respect to Wikipedia search, the "Go box" in the top-right corner (how most users search for articles) was not impacted. It was only the full search page <code>Special:Search</code> that failed on any Wikis that had Commons as one of the possible sister search results in the right sidebar.
Note that the search suggestions feature, as present on all article pages, was not affected (except on Wikimedia Commons itself). The search suggestions field is how most searches are performed on Wikipedia. Rather, the incident impacted the dedicated Search results page ("Special:Search"), which consistently failed to return results on wikis where the rendering of that page includes a sidebar with results from Wikimedia Commons.


'''Impact''': Users were impacted between 14:00-16:32 (about 2.5 hours). All commons file searches failed, as well as Special:Search for many wikis (but notably not English wikipedia)
'''Impact''': For about 2.5 hours (14:00-16:32 UTC), the Search results page was unavailable on many wikis (except for English Wikipedia). On Wikimedia Commons the search suggestions feature was unresponsive as well.


=== Timeline ===
=== Timeline ===
Line 31: Line 31:
'''??? (In future)''' Index successfully restored, and traffic is returned to eqiad
'''??? (In future)''' Index successfully restored, and traffic is returned to eqiad


'''References''':
=== References: ===
<references />
<references />



Latest revision as of 21:44, 1 December 2021

document status: draft

Summary

On 10 November, as part of verifying a bug report, a developer submitted a high volume of search queries against the active production Cirrus cluster (eqiad cirrussearch) via a tunnel from their local mw-vagrant environment. vagrant provision was (probably) later run without the tunnel being properly closed first, which resulted in (for reasons not yet understood) the deletion and recreation of the commonswiki_file_1623767607 index.

As a direct consequence, any Elasticsearch queries that targeted media files from commonswiki encountered a hard failure.

During the incident, all media searches on Wikimedia Commons failed. Wikipedia projects were impacted as well,[1] through the "cross-wiki" feature of the sidebar on Search results pages. This cross-wiki feature is enabled on most wikis by default, though notably not on English Wikipedia where the community disabled search integration to Commons.

Note that the search suggestions feature, as present on all article pages, was not affected (except on Wikimedia Commons itself). The search suggestions field is how most searches are performed on Wikipedia. Rather, the incident impacted the dedicated Search results page ("Special:Search"), which consistently failed to return results on wikis where the rendering of that page includes a sidebar with results from Wikimedia Commons.

Impact: For about 2.5 hours (14:00-16:32 UTC), the Search results page was unavailable on many wikis (except for English Wikipedia). On Wikimedia Commons the search suggestions feature was unresponsive as well.

Timeline

15:21 First ticket filed by impacted user https://phabricator.wikimedia.org/T295478

15:28 Additional, largely duplicate ticket filed by user https://phabricator.wikimedia.org/T295480

15:32 <Dylsss> Searching for files on Commons is currently impossible, I believe this is quite critical given the whole point of Commons is being a file repository

15:52 Initial attempt to shift cirrussearch traffic to codfw (did not work because the patch was missing a required line) (https://sal.toolforge.org/log/05mNCn0B1jz_IcWuO9iw)

16:32 Search team operator successfully moves all cirrussearch traffic to codfw, resolving user impact (https://sal.toolforge.org/log/8p2xCn0Ba_6PSCT9sorW)

??? (In future) Index successfully restored, and traffic is returned to eqiad
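
The intervals implied by the timestamps above can be worked out with a short Python sketch. The times are copied from the Impact line and the Timeline and are assumed to all be UTC on the same day; the variable names and labels are illustrative only.

```python
from datetime import datetime

FMT = "%H:%M"

# Timestamps copied from the Impact line and the Timeline (assumed UTC, same day).
impact_start = datetime.strptime("14:00", FMT)
first_ticket = datetime.strptime("15:21", FMT)
first_shift_attempt = datetime.strptime("15:52", FMT)
resolved = datetime.strptime("16:32", FMT)


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


durations = {
    "impact start to first ticket": minutes_between(impact_start, first_ticket),
    "first ticket to resolution": minutes_between(first_ticket, resolved),
    "failed shift attempt to resolution": minutes_between(first_shift_attempt, resolved),
    "total user impact": minutes_between(impact_start, resolved),
}

for label, minutes in durations.items():
    print(f"{label}: {minutes} min")
```

The total of 152 minutes matches the "about 2.5 hours" stated under Impact.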

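References:

  1. Log events of all affected requests: https://logstash.wikimedia.org/goto/73a9d7e35f409c0d122888d42df94761 (note: requires Logstash access)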

Actionables

  • Future one-off debugging of the sort that triggered this incident, when it requires production data, should be done on cloudelastic, which is an up-to-date read-only Elasticsearch cluster. If production data is needed but data up to one week stale is acceptable, relforge should be used instead.
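
The rule in this actionable can be written down as a small decision helper. This is only a sketch of one reading of the guidance, not an existing tool: the function name, its parameters, and the "local" label for debugging that needs no production data are all hypothetical.

```python
def choose_debug_cluster(needs_production_data: bool, week_old_data_ok: bool) -> str:
    """Pick a target for one-off search debugging, per the actionable above.

    The production cirrussearch clusters are never returned.
    """
    if not needs_production_data:
        # Hypothetical label: stay on the local mw-vagrant Elasticsearch instance.
        return "local"
    if week_old_data_ok:
        # Data up to about one week stale is acceptable.
        return "relforge"
    # Up-to-date, read-only copy of production data.
    return "cloudelastic"


print(choose_debug_cluster(needs_production_data=True, week_old_data_ok=False))
print(choose_debug_cluster(needs_production_data=True, week_old_data_ok=True))
```

Under this reading, relforge is preferred whenever stale data suffices, and cloudelastic is used only when current data is required.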