Incident documentation/2018-02-14 labvirt1008-failure

imported>Krinkle
== Summary ==
Labvirt1008 seems to have overheated and gone down. This affects tenants as well as virtual VPS infrastructure.
 
== Timeline ==
* 7:20 UTC labvirt1008 rebooted
* 8:00 UTC moritzm logged a task about it
* There was no paging or alerting. This is a problem.
* 11:00 UTC Chase woke up and started investigating the extent of the outage, looking especially for impact on Toolforge
* 12:58 UTC Chase sent an email about impact to cloud-announce https://lists.wikimedia.org/pipermail/cloud-announce/2018-February/000023.html with a list of [https://phabricator.wikimedia.org/T187292#3971559 affected instances].
 
== Conclusions ==
We know that our instance storage is local and ephemeral. We should ensure that this is documented for tenants in easy-to-find places, and re-verify that the mechanisms that keep critical redundant components spread across labvirts are working. In our world, though, a single hypervisor is a special snowflake, and I believe we should have been paged on this outage, but we seem not to have been. It was my understanding that, if nothing else, a full instance storage partition would have paged, and that in this case the failure of that check should have paged as well.
 
== Actionables ==
<onlyinclude>
* Coordinate with DC OPS to deal with overheating [[phab:T187292]]
* Look at moving tenant instances to another labvirt (we should have a standing spare)
* Investigate what should have paged and why it did not (and fix it)
</onlyinclude>
 
{{#ifeq:{{SUBPAGENAME}}|Report Template||
[[Category:Incident documentation]]
}}

Latest revision as of 17:46, 8 April 2022