
Incident documentation/20160623-etherpad

Revision as of 21:43, 23 June 2016 by imported>Alexandros Kosiaris

Etherpad suffered a catastrophic and unrecoverable outage. Recovering from it in a timely manner meant restoring a day-old snapshot of the database. Exact cause: still unknown.


At 14:27 UTC it became apparent that etherpad was offline. Efforts to restore the service through restarts and software debugging were inconclusive. In the end we decided to restore the database from a 1-day-old snapshot to bring the service back, which means some pad content may have been irrevocably lost. We are still working on that.


  • 14:27: It becomes apparent from wikimedia-operations that etherpad seems down
  • 14:29: Jaime restarts etherpad. Service is not restored
  • 14:30: Jaime is debugging the service.
  • 14:30: First icinga check noting etherpad is down
  • 14:32: <jynus> restarting etherpad-lite.service
  • 14:33: Alex joins the debugging.
  • 14:34: Icinga reports etherpad down.
  • 14:38: <akosiaris> stopping etherpad-lite on etherpad1001, disabling puppet
  • 14:38: Jaime, after consulting with Alex, starts restoring the database so that we have it handy in case debugging goes south.
  • 14:45: <akosiaris> debugging etherpad. Started the service with a blank db, looks like it's working
  • 14:47: <akosiaris> change the default message in etherpad to indicate problems
  • 15:17: Discovering and commenting on Seems the closest we got to our problem, but no response or further help. Stack trace is exactly the same; the suspicion that the database is corrupted is reinforced.
  • 15:20: Restoring the corrupted db and trying to debug the problem. Alex restricts access to etherpad via ferm to himself only (via ssh tunnel). Visits various pads, no reproduction. Starts visiting pads in an effort to find a corrupted pad, delete it and fix the problem. Log entries immediately preceding the crashes are the source of pad names. Efforts are in vain.
  • 15:26: <akosiaris> stop etherpad-lite, etherpad is down
  • 15:27-16:10: investigation continues with no result
  • 16:10: Jaime completes a restore of the database from backups
  • 16:11: ferm restriction is lifted. Service is restored.
  • Deliberations between Jaime, Yuvi and Alex as to what to do about the 1 day of lost pads. The proposal to keep the service running with the 1-day-old data, because of Wikimania, wins. We will make efforts to replay the logs and set up a clone of the service with the database frozen in time, to allow users to access lost pads.
  • 21:40 created to allow users to access restored versions of their pads and, in a self-serve fashion, migrate copies of them to the production instance. Unfortunately it's impossible to do that automatically
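
The 15:20 step above used log entries close to the crashes as a source of candidate pad names. A minimal sketch of that kind of search follows. It is not the procedure actually used during the incident, and the report does not show the real log format, so both regular expressions (pad names appearing as `/p/<name>`, and stack frames starting with indented `at `) are assumptions, as are the function name and the `window` parameter.

```python
import re
from collections import Counter, deque

# Hypothetical patterns: the real etherpad-lite log format is not shown in the report.
PAD_RE = re.compile(r"/p/([^\s/?#]+)")   # assumed: pad names appear as /p/<name>
CRASH_RE = re.compile(r"^\s+at\s")       # assumed: Node.js-style stack frame line


def pads_before_crashes(lines, window=20):
    """Count pad names seen in the `window` log lines preceding each stack trace."""
    counts = Counter()
    recent = deque(maxlen=window)  # sliding window of the latest non-trace lines
    in_trace = False
    for line in lines:
        if CRASH_RE.match(line):
            if not in_trace:
                # First frame of a new trace: tally pads seen just before it.
                for prev in recent:
                    counts.update(PAD_RE.findall(prev))
                recent.clear()
            in_trace = True
            continue
        in_trace = False
        recent.append(line)
    return counts
```

Pads that show up before several crashes would be the first candidates to inspect, for example with `pads_before_crashes(open(path)).most_common(10)`, where `path` is whatever log file is available. As the timeline notes, this approach did not locate a corrupted pad in this incident.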


  • Somehow the etherpad database was corrupted in a way that caused etherpad's ueberDB component to emit a stack trace and terminate. As a result, the service was down from 14:27 to 16:11 despite efforts to find and fix the problem. Pad content that has not made it to the production instance may be found on