Incidents/20150901-Elasticsearch

Revision as of 17:44, 8 April 2022 by imported>Krinkle (Krinkle moved page Incident documentation/20150901-Elasticsearch to Incidents/20150901-Elasticsearch)

Summary

The Elasticsearch service (on the elastic*.eqiad.wmnet nodes) backing the search functionality went red for a few minutes. We did not lose any real data, but we failed to serve some searches for about 10 minutes.

Timeline

  • 05:28: dcausse pauses writes before the firewall rules are applied to the master (elastic1001)
  • 05:32: chasemp applies the rules
  • 05:32: the master starts to lose track of its nodes
  • 05:33: the cluster is red
  • 05:33: chasemp reverts the rules
  • 05:34: the cluster starts to recover
  • 05:39: the cluster is back to yellow
  • 05:48: there is a 10-minute spike of "Pool errors"; dcausse and chasemp test some queries on enwiki, and they all work
  • 07:58: the cluster is back to green
  • 08:00: dcausse unfreezes the indices
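
The red, yellow and green states in the timeline are the values Elasticsearch reports in the "status" field of its _cluster/health API. As a minimal sketch of how that status could be checked before and after a change such as the firewall rollout, the Python below parses a health response and compares it against a minimum acceptable state. The URL, the helper names and the choice of "yellow" as a threshold are assumptions for illustration; the report does not say which tooling was used during the incident.

```python
import json
from urllib.request import urlopen

# Hypothetical endpoint: host and port are assumptions, not taken from the report.
HEALTH_URL = "http://localhost:9200/_cluster/health"

# Elasticsearch cluster health states, ranked from worst to best.
STATUS_RANK = {"red": 0, "yellow": 1, "green": 2}


def parse_status(body: str) -> str:
    """Extract the "status" field from a _cluster/health JSON response."""
    status = json.loads(body)["status"]
    if status not in STATUS_RANK:
        raise ValueError(f"unexpected cluster status: {status!r}")
    return status


def is_at_least(status: str, minimum: str) -> bool:
    """Return True if `status` is as healthy as `minimum` or healthier."""
    return STATUS_RANK[status] >= STATUS_RANK[minimum]


def fetch_status(url: str = HEALTH_URL, timeout: float = 5.0) -> str:
    """Query the cluster health endpoint and return its status string."""
    with urlopen(url, timeout=timeout) as response:
        return parse_status(response.read().decode("utf-8"))
```

For example, is_at_least(fetch_status(), "yellow") would have been False at 05:33 and True again from 05:39 onward, and a check against "green" would only have passed from 07:58.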

Conclusions

Actionables