
Incident documentation/20200412-eqiad down: Difference between revisions

m (Let's not leave the page in a broken state)
(Replaced content with "{{delete|This incident document (and especially its title) is inaccurate}}")
Line 1: Line 1:
{{delete|This incident document (and especially its title) is inaccurate}}
'''document status''': {{irdoc-draft}} <!--
The status field should be one of:
* {{tl|irdoc-draft}} - Initial status. When you're happy with the state of your draft, change it to {{tl|irdoc-review}}.
* {{tl|irdoc-review}} - The incident review working group will contact you then to finalise the report. See also the steps on [[Incident documentation]].
* {{tl|irdoc-final}}
-->
== Summary ==
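The [[eqiad]] cluster had connection problems (HTTP 502, "Next Hop Connection Failed") for about 29 minutes, from 05:05 to 05:34 UTC according to the timeline below.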
'''Impact''': All users whose reading or editing requests were served through the eqiad caches.
See [[phab:T250025]].
== Timeline ==
'''All times in UTC.'''
* 05:05 Connection problems (Error: 502, Next Hop Connection Failed)
* 05:34 Issues resolved
== Detection ==
The issue was first detected by a user editing en.wp, when requests to the site failed.
== Conclusions ==
<mark>What weaknesses did we learn about and how can we address them?</mark>
=== What went well? ===
* <mark>(Use bullet points) for example: automated monitoring detected the incident, outage was root-caused quickly, etc</mark>
=== What went poorly? ===
* <mark>(Use bullet points) for example: documentation on the affected service was unhelpful, communication difficulties, etc</mark>
=== Where did we get lucky? ===
* <mark>(Use bullet points) for example: user's error report was exceptionally detailed, incident occurred when the most people were online to assist, etc</mark>
=== How many people were involved in the remediation? ===
* <mark>(Use bullet points) for example: 2 SREs and 1 software engineer troubleshooting the issue plus 1 incident commander</mark>
== Links to relevant documentation ==
<mark>Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.</mark>
== Actionables ==
<mark>Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.</mark>
* <mark>To do #1 (TODO: Create task)</mark>
* <mark>To do #2 (TODO: Create task)</mark>
<mark>TODO: Add the [[phab:tag/wikimedia-incident/|#Wikimedia-Incident]] Phabricator tag to these tasks and move them to the "Follow-up" column.</mark>
[[Category:Incident documentation]]

Latest revision as of 17:57, 13 April 2020

This page is queued for deletion. Given reason: This incident document (and especially its title) is inaccurate. Last modified: Mon, 13 Apr 2020 17:57:47 +0000.