Incident documentation/20160509-CirrusSearch

== Summary ==
 
At 21:43 UTC on 2016-05-09, Elasticsearch started to slow down (as seen on [https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?from=1462827119921&to=1462836047591&var-cluster=All Grafana]). Investigation showed high CPU consumption on elastic1026. The Elasticsearch service on that host was restarted and response times went back to normal. Investigation traced the cause to a large segment merge on elastic1026, which drove the JVM garbage collector into near-constant activity.
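For reference, the restart at the time would have looked something like the following (the exact command depends on the host's init system and the Elasticsearch packaging in use):

<pre>
sudo service elasticsearch restart
</pre>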
 
== Timeline ==
* 2016-05-09T21:43 increase in overall response time (95th percentile) on Elasticsearch
* 2016-05-09T21:51 issue with search reported on IRC by odder
* 2016-05-09T22:14 Elasticsearch service restarted on elastic1026
* 2016-05-09T22:20 response times back to normal
 
== Analysis ==
* More details on [[phab:T134829|Phabricator]].
* We did not get an alert via Icinga. There is currently a check on response time, but it only covers prefix search, which now receives too little traffic to be representative. This shows again the fragility of Graphite-based checks.
* Analysis of GC timings indicates that time was spent mainly in young-generation GC and that memory was successfully reclaimed. This usually points to an excessively high allocation rate, not to a memory leak or a too-small heap.
* The GC activity is strongly correlated with a large segment merge on elastic1026, although this does not explain why such a merge was an issue only this time.
* The Graphite [http://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.currentAbove currentAbove()] function is a good tool to identify which servers are under the most load; a sketch of how it can be queried is shown below.
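A minimal sketch of using <code>currentAbove()</code> through the Graphite render API, assuming the graphite.wikimedia.org endpoint and a hypothetical per-host CPU metric path (adapt both to the actual setup); the function keeps only the series whose current value is above the given threshold:

<syntaxhighlight lang="python">
import requests

# Hypothetical metric path and threshold, for illustration only:
# currentAbove() narrows the wildcard down to the series (hosts) whose
# most recent value is above 50, i.e. the servers under the most load.
params = {
    "target": "currentAbove(servers.elastic*.cpu.total.user, 50)",
    "from": "-1h",
    "format": "json",
}
response = requests.get("https://graphite.wikimedia.org/render", params=params)
for series in response.json():
    print(series["target"])
</syntaxhighlight>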
 
== Actionables ==
* Task created: [[phab:T134853|Enable GC logs on Elasticsearch JVM]], which might help with a more detailed analysis if a similar issue happens again. Note that the size of those logs might be an issue; log rotation (see the example below) could keep it bounded.
* Task created: [[phab:T134852|Check Icinga alert on CirrusSearch response time]], which should lead to either fixing this alert or removing it completely.
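As an illustration of what the GC logging task could enable, the following HotSpot flags (paths and sizes are only examples; how they are injected, e.g. via ES_JAVA_OPTS, depends on the Elasticsearch packaging) turn on detailed GC logging with rotation so that the log size stays bounded:

<pre>
-Xloggc:/var/log/elasticsearch/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=8
-XX:GCLogFileSize=64m
</pre>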
 
[[Category:Incident documentation]]
