Incident documentation/20160505-ChangeProp RESTBase Parsoid

Revision as of 04:47, 6 May 2016 by imported>BryanDavis

Summary

See phab:T134537.

Timeline

  • 2016-05-04T15:02:45 first requests to parsoid for oldId 106801025 (from ChangePropagation/WMF)
  • <date change>
  • 2016-05-05T02:16:10 first INFO [HintedHandoff:2] 2016-05-03 02:16:10,523 HintedHandOffManager.java:486 - Timed out replaying hints to /10.192.32.137; aborting (25847 delivered) (on restbase1014)
  • 2016-05-05T02:28 org.apache.cassandra.db.compaction.CompactionTask log events elevated
  • 2016-05-05T02:3X bytes_out and cpu_user start to spike on the RESTBase cluster
  • 2016-05-05T02:3X load spikes on the Parsoid cluster
  • 2016-05-05T02:34:16 first "Retry count exceeded" error in /srv/log/changeprop/main.log (for http://fr.wikipedia.org/api/rest_v1/page/html/%EA%9D%AE)
  • 2016-05-05T02:38:14 first alert: <icinga-wm> PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
  • 2016-05-05T03:20:00 (approx.) Ori stops the changeprop service
  • 2016-05-05T03:30:00 (approx.) Load on Parsoid and RESTBase drops; service recovers.

Conclusions

See phab:T134537.

Actionables