You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
Gerrit's CPU consumption has been elevated for some time, possibly due to
git gc having been disabled for some time. Slow pageloads have been noticeable. Today gerrit stopped responding to queries and had to be restarted, after which it worked again.
- 800 requests returned some 5xx code from apache during the outage
- 5 servers failed their puppet runs
- Over the past several days there have been CI failures due to git pull timeouts
- Over the past several days users have reported slow clones and slow patch-uploads
Automated monitoring and humans noticed Gerrit going fully unresponsive approximately simultaneously.
This is a step by step outline of what happened to cause the incident and how it was remedied. Include the lead-up to the incident, as well as any epilogue, and clearly indicate when the user-visible outage began and ended.
All times in UTC.
- 2019-03-05: regularly-scheduled
git gcdisabled to prevent corruption of packfile list due to concurrent access during GC jGit issue 544199
- approx 2019-03-20: gerrit CPU usage begins to be higher than usual
- 2019-03-27 18:39: two users report that gerrit seems down OUTAGE BEGINS
- 18:40: icinga reports socket timeouts for gerrit
- 18:47: thcipriani attempts to use jstack to gather stack traces; gets error: "Unable to open socket file: target process not responding or HotSpot VM not loaded"; re-runs with
Attaching to process ID 13929, please wait... Debugger attached successfully. Server compiler detected. JVM version is 25.181-b13 Deadlock Detection: Can't print deadlocks:Unable to deduce type of thread from address 0x00007f3a5445e800 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)
- 18:48: thcipriani restarts gerrit service
- 18:51: icinga reports recovery OUTAGE ENDS
- 19:27: thcipriani makes a local backup and runs
- 19:39: that
gerrit gccompletes; size difference approx 353MB
- 19:51: thcipriani repeats the process for
- 20:0x (didn't note time):
gerrit gccompletes for
operations/puppetwith a 121MB savings
What weaknesses did we learn about and how can we address them?
The following sub-sections should have a couple brief bullet points each.
What went well?
- Automated monitoring quickly detected the actual outage
- Recovered from outage relatively quickly
- Lots of people paying attention/quick response
- Staggered puppet runs mean that impact for servers was minimal
What went poorly?
- Unclear if our theory of the root cause of the issue is actually true; time will tell
Where did we get lucky?
- Incident occurred when thcipriani was around
- release train was not impacted
Links to relevant documentation
Where is the documentation that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, there should be an action item to create it.
Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.
NOTE: Please add the #wikimedia-incident Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.
- task T184086 Install gerrit's prometheus_exporter plugin so we have clearer visibility into gerrit internals
- Now that it should be safe to do so, re-enable regular git gc
- TODO update gerrit documentation for things to check in case of high-load
- TODO update gerrit documentation how to grab a stack trace (thcipriani frantically ctrl-r/looking for irc scrollback)