Incident documentation/20160608-gallium-disk-failure
== Summary ==
* On Wednesday June 8th, shortly after midnight, gallium encountered an unrecoverable hard disk / RAID failure. That made the whole CI infrastructure entirely unavailable, since that server hosts Jenkins and Zuul.
== Timeline ==
All times are in UTC.
* 23:56 Icinga ** PROBLEM alert - gallium/MD RAID is CRITICAL **
* At this point Jenkins has lost most of its executors and Zuul is misbehaving; the Zuul status page shows changes piling up. Jobs are no longer triggering.
* 02:xx YuviPanda looks at the RAID alarm
* 02:51 legoktm: / partition on gallium is currently read-only for some reason
* 02:56 legoktm files a task
* 03:07-03:50 yuvipanda runs an fsck -n. Reports back on . The output shows corruption
* 04:10 yuvipanda and Kunal agree not to page, since the issue has been going on for a while and European ops are about to come online anyway. gallium is intentionally NOT rebooted, for fear that the unpuppetized parts would be lost or the issue would get worse.
* 04:58 Moritz diagnoses the RAID and finds /dev/sda2 has failed; the array needs a rebuild.
* 07:30-07:58 Giuseppe investigates the RAID further and acknowledges the alarms in Icinga.
* 08:15 hashar shows up and catches up with ops. Stops Zuul and Jenkins, which are useless at this point.
Most of the initial delay is due to the incident happening at an odd time (SF evening, European night) and only impacting CI / developers, which tends to be a low-traffic period.
During the European morning:
* Jaime took backups to db1085 and dealt with the disk failure + RAID, with confirmation/support from Faidon/Mark.
* Giuseppe allocated a server and installed Jessie on it, pairing with Antoine to polish up the Puppet scripts.
* Antoine rebuilt a Zuul package for Jessie and tested it, provided info about the CI context, and indicated which data are important and which can be dropped.
* 15:00 contint1001.eqiad.wmnet is being worked on by Giuseppe and passes Puppet with all the proper roles, a new partition layout, and Zuul masked in systemd. Jenkins and the docroots still need to be restored.
* 15:00 faulty disk is being replaced on gallium by Chris
* 17:00-18:00 Jenkins data are pushed to contint1001. The RAID array is rebuilding. We agreed to keep CI down until the rebuild completes, to avoid additional I/O from Jenkins.
* 18:55 gallium is rebooted. Mark confirms the RAID is all good. Jenkins and Zuul come up just fine and service resumes.
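The disk checks and RAID repair above follow the usual Linux md workflow: a read-only <code>fsck -n</code> to gauge filesystem damage, <code>mdadm</code> to drop and re-add the failed member after the disk swap, and <code>/proc/mdstat</code> to watch the rebuild. A minimal sketch of that workflow (device names are illustrative, not gallium's actual layout; the privileged commands are shown as comments, and only the harmless <code>/proc/mdstat</code> parsing at the end actually runs):

```shell
# Diagnosis (read-only) -- root required on a real host:
#   fsck -n /dev/sda1            # check the filesystem without writing anything
#   cat /proc/mdstat             # "[_U]" means the first member dropped out of the mirror
#   mdadm --detail /dev/md0      # per-member state: active / faulty / removed
#
# Repair after the physical disk swap:
#   mdadm --manage /dev/md0 --remove /dev/sda2   # drop the failed member
#   mdadm --manage /dev/md0 --add /dev/sda2      # re-add the new partition; starts the resync
#
# While the array resyncs, extra I/O (e.g. from Jenkins) competes with the
# rebuild, hence keeping CI down. Progress appears in /proc/mdstat; a sample
# degraded-and-rebuilding entry:
mdstat_sample='md0 : active raid1 sdb2[1] sda2[2]
      487759872 blocks [2/1] [_U]
      [========>............]  recovery = 42.7% (208338944/487759872) finish=23.1min speed=201000K/sec'

# Pull out the recovery percentage, as a small monitoring loop might:
printf '%s\n' "$mdstat_sample" | awk '/recovery/ {for (i = 1; i <= NF; i++) if ($i ~ /%$/) print $i}'
```

On a real host, <code>mdadm --detail</code> also reports a "Rebuild Status" percentage while the resync runs, which serves the same purpose.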
== Conclusions ==
''What weaknesses did we learn about and how can we address them?''
* The host lacked a backup, despite the need having been identified a year or so ago by operations ({{Bug|T80385}})
* gallium was 5 years old and still on Precise. It should have been migrated to a newer host / Jessie in early 2016.
* A single machine hosting both Jenkins and Zuul turned out to be a huge SPOF
== Actionables ==
''Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.''
* {{Status}} Add contint to backup ({{Bug|T80385}})
* {{Status}} Validate and publish Zuul Debian package for Jessie {{Bug|T137279}}
* {{Status}} Migrate to contint1001 asap {{Bug|T137358}} and others
* {{Status}} Decide on best option to replace gallium {{Bug|T133300}}
* Have more than one Jenkins master for CI (co-masters)
* Setup a dedicated Jenkins for daily jobs / long jobs not triggered by Zuul
== References ==
* A lot of activity and synchronization happened on a non-public channel.
[[Category:Incident documentation]]
