Incident documentation/2021-11-10 cirrussearch commonsfile outage
document status: draft
Summary
To test a bug, queries were being run against the active production CirrusSearch cluster (eqiad cirrussearch) via a tunnel from mw-vagrant. `vagrant provision` was (probably) run later without the tunnel having been properly closed, which, for reasons not fully understood, resulted in the index `commonswiki_file_1623767607` being deleted and recreated by the script.
As a result, all direct search queries for commonswiki files failed. Furthermore, any "cross-wiki" search[1] that included Commons, such as the sidebar search on many wikis, failed as well. The English Wikipedia was notably not affected, because its community disables the Commons integration.
For context, when a user searches through Special:Search, most Wikipedias query their sister wikis along with Commons. Any wiki that included Commons in its "sidebar" (right side of the page) would therefore have had the query fail.
Note that with respect to Wikipedia search, the "Go box" in the top-right corner (how most users search for articles) was not impacted. Only the full search page Special:Search failed, and only on wikis that had Commons as one of the possible sister search results in the right sidebar.
Impact: Users were impacted between 14:00 and 16:32 (about 2.5 hours). All Commons file searches failed, as did Special:Search on many wikis (but notably not the English Wikipedia).
Timeline
15:21 First ticket filed by impacted user https://phabricator.wikimedia.org/T295478
15:28 Additional, largely duplicate ticket filed by user https://phabricator.wikimedia.org/T295480
15:32 <Dylsss> Searching for files on Commons is currently impossible, I believe this is quite critical given the whole point of Commons is being a file repository
15:52 Initial attempt to shift cirrussearch traffic to codfw (did not work because the patch was missing a required line) (https://sal.toolforge.org/log/05mNCn0B1jz_IcWuO9iw)
16:32 Search team operator successfully moves all cirrussearch traffic to codfw, resolving user impact (https://sal.toolforge.org/log/8p2xCn0Ba_6PSCT9sorW)
??? (In future) Index successfully restored, and traffic is returned to eqiad
References:
- [1] Log events of all affected requests (note: requires Logstash access)
Actionables
- Future one-off debugging of the sort that triggered this incident, when it requires production data, should be done on cloudelastic, which is an up-to-date read-only Elasticsearch cluster. If production data is needed but data up to one week stale is acceptable, relforge should be used instead.
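
As an illustration of the failure mode described in this incident (a destructive script reaching production through a forgotten tunnel), the following is a minimal sketch of a pre-flight check that a development script could run before deleting or recreating an index. It is not part of mw-vagrant or CirrusSearch. The function name `ensure_disposable_cluster` and the `production-` prefix are hypothetical placeholders. The only assumption about Elasticsearch itself is that the root endpoint (`GET /`) returns a JSON body containing a `cluster_name` field.

```python
# Hypothetical guard: refuse destructive index operations on a protected cluster.
# The prefix below is a placeholder assumption, not a confirmed cluster name.
PROTECTED_CLUSTER_PREFIXES = ("production-",)


def ensure_disposable_cluster(root_info: dict) -> str:
    """Return the cluster name if it is safe to modify, otherwise raise.

    `root_info` is the parsed JSON body of Elasticsearch's root endpoint
    (`GET /`), which includes a `cluster_name` field.
    """
    name = root_info.get("cluster_name")
    if not name:
        # Fail closed: a cluster that cannot be identified is treated as protected.
        raise RuntimeError("cluster_name missing; refusing destructive operation")
    if name.startswith(PROTECTED_CLUSTER_PREFIXES):
        raise RuntimeError(f"{name!r} looks like a protected cluster; refusing")
    return name
```

The check uses the cluster name reported by the server rather than the hostname in the client configuration, because through a tunnel a production cluster appears to the client as a local address.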