Incident documentation/2018-02-14 labvirt1008-failure

Revision as of 21:30, 31 March 2021 by imported>Krinkle (Krinkle moved page Incident documentation/20180214-labvirt1008-failure to Incident documentation/2018-02-14 labvirt1008-failure)


Labvirt1008 seems to have overheated and gone down. This affects tenants as well as virtual VPS infrastructure.



We know that our instance storage is local and ephemeral. We should ensure that this is documented for tenants in easy-to-find places, and re-verify that our mechanisms for keeping critical redundant components spread across labvirts are working. In our world, though, a single hypervisor is a special snowflake, and I believe we should have been paged on this outage, but we seem not to have been. It was my understanding that, if nothing else, a full instance storage partition should have paged, or, in this case, the failure of that check should have.


  • Coordinate with DC OPS to deal with overheating phab:T187292
  • Look at moving tenant instances to another labvirt (we should have a standing spare)
  • Investigate what should have paged and why it did not (and fix it)