You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Incident documentation/20200412-eqiad down: Difference between revisions

From Wikitech-static
imported>Peachey88
m (Lets not leave the page in a broken state)
 
imported>CDanis
(Replaced content with "{{delete|This incident document (and especially its title) is inaccurate}}")
 
{{irdoc}}
{{delete|This incident document (and especially its title) is inaccurate}}
 
'''document status''': {{irdoc-draft}} <!--
The status field should be one of:
* {{tl|irdoc-draft}} - Initial status. When you're happy with the state of your draft, change it to {{tl|irdoc-review}}.
* {{tl|irdoc-review}} - The incident review working group will contact you then to finalise the report. See also the steps on [[Incident documentation]].
* {{tl|irdoc-final}}
-->
 
== Summary ==
The [[eqiad]] cluster had connection problems (Error: 502, Next Hop Connection Failed) between 05:05 and 05:34 UTC.
 
'''Impact''': All users reading or editing through the eqiad caches.
 
See [[phab:T250025]].
 
{{TOC|align=right}}
 
== Timeline ==
 
'''All times in UTC.'''
* 05:05 Connection problems (Error: 502, Next Hop Connection Failed)
* 05:34 Issues resolved
 
== Detection ==
The issue was first detected while editing en.wp, when the site failed.
 
== Conclusions ==
<mark>What weaknesses did we learn about and how can we address them?</mark>
 
=== What went well? ===
* <mark>(Use bullet points) for example: automated monitoring detected the incident, outage was root-caused quickly, etc</mark>
 
=== What went poorly? ===
* <mark>(Use bullet points) for example: documentation on the affected service was unhelpful, communication difficulties, etc</mark>
 
=== Where did we get lucky? ===
* <mark>(Use bullet points) for example: user's error report was exceptionally detailed, incident occurred when the most people were online to assist, etc</mark>
 
=== How many people were involved in the remediation? ===
* <mark>(Use bullet points) for example: 2 SREs and 1 software engineer troubleshooting the issue plus 1 incident commander</mark>
 
== Links to relevant documentation ==
<mark>Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.</mark>
 
== Actionables ==
<mark>Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.</mark>
 
* <mark>To do #1 (TODO: Create task)</mark>
* <mark>To do #2 (TODO: Create task)</mark>
 
<mark>TODO: Add the [[phab:tag/wikimedia-incident/|#Wikimedia-Incident]] Phabricator tag to these tasks and move them to the "Follow-up" column.</mark>
 
 
[[Category:Incident documentation]]

Latest revision as of 17:57, 13 April 2020

This page is queued for deletion. Given reason: This incident document (and especially its title) is inaccurate. Last modified: Mon, 13 Apr 2020 17:57:47 +0000.