
Incident documentation/2017-01-26 API Slowdown: Difference between revisions

:''Very WIP'': still under heavy research.
On 2017-01-26, from 17:51 to 18:15 (all times UTC), the [[mw:API:Main_page|MediaWiki Action API]] on Wikimedia wikis slowed down and returned an increased number of 500 responses. Scheduled maintenance was under way at the time, but it should not have caused any user impact; the underlying cause is still being researched.
== Summary ==
* A core router has been rebooting and behaving erratically since 12 January {{phabricator|T155875}}
* The impact on databases and other services was mitigated by moving essential services (e.g. the s1 master) away from the affected rack {{phabricator|T155875}}
* Maintenance on the router started at 17:51. MediaWiki should have simply depooled the affected services (the databases) and continued unaffected, as usual, but this did not work as expected
* API latency/throughput impact can be seen at:
* Depooling affected API servers resolved the issue
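For readers unfamiliar with the term, the depooling mentioned in the summary above can be sketched as follows. This is a minimal illustration only, not MediaWiki's actual load-balancing code: the function <code>pick_replica</code>, the weights and the host <code>db-elsewhere</code> are hypothetical. Only the host names db1055, db1056, db1057 and db1059 come from this report. The idea is that a depooled host is skipped when a replica is chosen, so traffic should move to the remaining hosts.

```python
import random


def pick_replica(weights, depooled, rng=random):
    """Pick one host at random, proportionally to its weight.

    Hosts listed in `depooled` (or with a weight of 0) are never returned.
    This is an illustrative model, not MediaWiki's implementation.
    """
    candidates = {
        host: weight
        for host, weight in weights.items()
        if host not in depooled and weight > 0
    }
    if not candidates:
        raise RuntimeError("no pooled replicas left")
    hosts = list(candidates)
    return rng.choices(hosts, weights=[candidates[h] for h in hosts], k=1)[0]


# Hypothetical weights; "db-elsewhere" stands in for a host outside the affected rack.
weights = {
    "db1055": 100,
    "db1056": 100,
    "db1057": 100,
    "db1059": 100,
    "db-elsewhere": 100,
}

# The hosts depooled at 18:15 according to the timeline.
depooled = {"db1055", "db1056", "db1057", "db1059"}

if __name__ == "__main__":
    print(pick_replica(weights, depooled))
```

In this model, a request made after the depool can only land on a host that is still pooled. Why the real system did not behave this way during the maintenance is the open question of this report.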
== Timeline ==
* 17:46 paravoid: stopping pybal on lvs1001/lvs1002/lvs1003
* 17:51 paravoid: replacing asw-c2-eqiad
* 17:57 elukey: boostrapping aqs1007-a cassandra instance
* 18:14 paravoid: rebooting newly provisioned asw-c2-eqiad to enable mixed mode
* 18:15 jynus@tin: Synchronized wmf-config/db-eqiad.php: Depool db1055, 56, 57, 59 (duration: 00m 54s)
* 18:32 paravoid: starting pybal on lvs1001/lvs1002/lvs1003
== Conclusions ==
More research is needed to understand why the issue happened, how the MediaWiki model works, and whether it has a bug in this particular scenario.
== Actionables ==
* {{phabricator|T156475}}
[[Category:Incident documentation]]

Latest revision as of 17:45, 8 April 2022