Incident documentation/20160509-CirrusSearch

Revision as of 16:01, 10 May 2016 by imported>Gehel (Created page with "== Summary == At 21:43 UTC on 2016-05-09 Elasticsearch started to slow down (as seen on [")

== Summary ==
At 21:43 UTC on 2016-05-09 Elasticsearch started to slow down (as seen on Grafana). Investigation showed high CPU consumption on elastic1026. The Elasticsearch service was restarted and response times returned to normal. Investigation traced the cause to a large segment merge on elastic1026, which triggered excessive garbage collection.

== Timeline ==
  • 2016-05-09T21:43 increase in overall response time (95%-ile) on elasticsearch
  • 2016-05-09T21:51 issue with search reported on IRC by odder
  • 2016-05-09T22:14 elasticsearch service restarted on elastic1026
  • 2016-05-09T22:20 response time back to normal

== Conclusions ==
  • more details on Phabricator
  • We did not get an alert via Icinga. There is currently a check on response time, but it runs against prefix search, which now receives a low volume of traffic. This again shows the fragility of Graphite-based checks.
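A more robust check would watch the 95%-ile latency of a high-volume metric and require several consecutive breaching samples before alerting. A minimal sketch of that threshold logic, assuming Graphite's JSON render format (the metric it would be fed, and the 500 ms threshold, are illustrative):

```python
def p95_breached(datapoints, threshold_ms, min_points=3):
    """Return True when the last `min_points` non-null samples all
    exceed `threshold_ms`.

    `datapoints` follows Graphite's render output format:
    [[value, timestamp], ...], where value may be None (null)
    for missing samples.
    """
    values = [v for v, _ts in datapoints if v is not None]
    recent = values[-min_points:]
    return len(recent) == min_points and all(v > threshold_ms for v in recent)

# Simulated 95%-ile response times (ms); the spike mirrors the incident.
series = [[120, 1], [None, 2], [130, 3], [900, 4], [950, 5], [1100, 6]]
p95_breached(series, threshold_ms=500)  # True during the spike
```

Requiring several consecutive breaching samples and skipping nulls makes the check less sensitive to the missing datapoints that make naive Graphite checks fragile.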
  • Analysis of GC timings indicates that time was spent mainly in young GC and that memory was successfully reclaimed. This usually indicates an excessively high allocation rate, not a memory leak or a too-small heap.
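The young-GC-dominated pattern can be confirmed by summing pause times per collector from the GC log. A rough sketch, assuming classic HotSpot GC log lines (the exact format depends on JVM version and logging flags; the lines in the comment are illustrative):

```python
import re

# Matches classic HotSpot GC log entries such as:
#   [GC (Allocation Failure) 1024K->512K(2048K), 0.5000 secs]
#   [Full GC (Ergonomics) 900K->300K(2048K), 0.1000 secs]
GC_LINE = re.compile(r"\[(Full GC|GC)[^,]*,\s*([\d.]+)\s*secs\]")

def gc_pause_totals(log_lines):
    """Sum pause time in seconds per collector.

    'GC' entries are young-generation collections; 'Full GC'
    entries are full collections.
    """
    totals = {"GC": 0.0, "Full GC": 0.0}
    for line in log_lines:
        match = GC_LINE.search(line)
        if match:
            totals[match.group(1)] += float(match.group(2))
    return totals
```

If the "GC" total dwarfs the "Full GC" total while heap usage keeps returning to its baseline, the problem is allocation pressure rather than leaked or undersized heap, which matches what was observed on elastic1026.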
  • The GC activity is strongly correlated with a large segment merge on elastic1026, though this does not explain why it caused a problem only this time.
  • Graphite's currentAbove() function is a good tool for identifying which server is under the most load.
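For reference, currentAbove() can be dropped into any Graphite target to graph only the series whose latest value exceeds a threshold; a sketch with an illustrative metric path and threshold (the real metric names in production will differ):

```
# Render only the elastic* servers whose current CPU usage is above 80:
currentAbove(servers.elastic*.cpu.percent, 80)
```

During an incident like this one, such a target surfaces the outlier host (here, elastic1026) at a glance instead of requiring a scan through dozens of per-server graphs.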