Incident documentation/2018-02-29 Train-1.31.0-wmf.27
== Summary ==
''Train deployment of 1.31.0-wmf.27 rolled back due to an increase in replication wait errors.''
 
== Timeline ==
 
 
* 19:24 Synchronized php: group1 wikis to 1.31.0-wmf.26 (duration: 01m 17s)
* 19:22 rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.26
* 19:20 Rolling back to wmf.26 due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query"
* 19:19 rolling back to wmf.26
* 19:18 icinga-wm: PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
* 19:17 twentyafterfour: I'm seeing quite a few "[{exception_id}] {exception_url} Wikimedia\Rdbms\DBExpectedError: Replication wait failed: Lost connection to MySQL server during query"
* 19:06 twentyafterfour@tin: Synchronized php: group1 wikis to 1.31.0-wmf.27 (duration: 01m 17s)
* 19:05 twentyafterfour@tin: rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.27
 
Error graph from the same time period:
https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen&from=1522263260081&to=1522265839537
 
After the rollback, Mukunda filed {{Phabricator|T190960}} to document the incident. The culprit was determined to be [[phab:rMWceb7d61ee7ef3edc6705abd41ec86b3afcd9c491|this commit]] by Aaron Schulz, which was intended to address the issues described in {{Phabricator|T180918}}.
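
For background on the failure mode: a MySQL-level replication wait amounts to reading the primary's binary-log position and then blocking on each replica until it has applied events up to that position. If the replica connection is lost while that call is blocking, the client surfaces exactly the "Lost connection to MySQL server during query" error quoted above. Below is a rough sketch of that mechanism (not the actual Wikimedia\Rdbms code), using the pymysql client with hypothetical connections and timeout:

<syntaxhighlight lang="python">
# Rough sketch of a MySQL-level "wait for replication" check, independent of
# MediaWiki's Wikimedia\Rdbms classes. Connections and timeout are hypothetical.
import pymysql

def wait_for_replica(primary, replica, timeout=10):
    """Return True once the replica has applied everything the primary has written."""
    with primary.cursor() as cur:
        cur.execute("SHOW MASTER STATUS")
        binlog_file, binlog_pos = cur.fetchone()[:2]

    with replica.cursor() as cur:
        # MASTER_POS_WAIT blocks until the replica reaches the given binlog
        # position; it returns -1 on timeout and NULL if it cannot wait at all.
        # If the replica connection drops while blocked here, the client sees
        # "Lost connection to MySQL server during query" -- the error in this incident.
        cur.execute("SELECT MASTER_POS_WAIT(%s, %s, %s)",
                    (binlog_file, binlog_pos, timeout))
        (status,) = cur.fetchone()
    return status is not None and status >= 0
</syntaxhighlight>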
 
== Conclusions ==
''What weaknesses did we learn about and how can we address them?''
 
It is exceedingly difficult to thoroughly test some changes outside of production. Testing replication lag detection properly would require a simulation of production databases plus realistic traffic to stress them. Our current deployment process prevented this from having an impact on production databases or site reliability; however, we spent a lot of time deploying and then reverting changes, and we blocked testing of other changes in the pipeline. This points to weaknesses in the weekly branching strategy that we currently employ, as well as weaknesses in our testing environments. A change such as this one should really have its own staged deployment process rather than "riding the train" alongside a bunch of unrelated changes.
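
One partial mitigation for the testing gap is to induce artificial lag on a non-production replica and exercise the lag-detection code against it. A rough sketch of doing that with MySQL's delayed replication follows; the host name, credentials, and 30-second delay are hypothetical:

<syntaxhighlight lang="python">
# Rough sketch: introduce artificial replication lag on a *test* replica so
# that lag-detection and wait-for-replication code paths can be exercised.
# Host, credentials, and the delay value are hypothetical.
import pymysql

replica = pymysql.connect(host="test-replica.example", user="admin", password="secret")
with replica.cursor() as cur:
    # Delayed replication (MySQL 5.6+): the replica applies each transaction
    # at least MASTER_DELAY seconds after it committed on the primary.
    cur.execute("STOP SLAVE SQL_THREAD")
    cur.execute("CHANGE MASTER TO MASTER_DELAY = 30")
    cur.execute("START SLAVE SQL_THREAD")
replica.close()
</syntaxhighlight>

With the delay in place, reads routed to that replica see stale data and replication waits against it block, which is enough to exercise the error handling without production traffic.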
 
== Actionables ==
 
'''NOTE''': Please add the [https://phabricator.wikimedia.org/tag/wikimedia-incident/ #wikimedia-incident] Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.
* [[phab:T193258]]
 
{{#ifeq:{{SUBPAGENAME}}|Report Template||
[[Category:Incident documentation]]
}}