
Incident documentation/2018-07-24 Train

== Summary ==
 
There were several problems with [https://phabricator.wikimedia.org/T191060 1.32.0-wmf.14]. Tasks are sorted from oldest to newest.
 
* [https://phabricator.wikimedia.org/T200257 T200257] `scap sync` fails with `Error: You are missing some external dependencies.`
* [https://phabricator.wikimedia.org/T200340 T200340] Wikibase\DataModel\Entity\EntityIdParsingException $serialization must not be an empty string
* [https://phabricator.wikimedia.org/T200346 T200346] wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure"
* [https://phabricator.wikimedia.org/T200412 T200412] PageTriage requires ORES to be installed
* [https://phabricator.wikimedia.org/T200420 T200420] Wikidata dispatching stuck (not releasing lockmanager locks)
* [https://phabricator.wikimedia.org/T200456 T200456] MapCacheLRU::has called with invalid key. Must be string or integer.
 
== Timeline ==
 
Events are sorted from newest to oldest. Times are UTC.
 
=== 2018-07-30 Monday ===
* 🚂 '''wmf.14→2''' 13:58 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.14
 
=== 2018-07-26 Thursday ===
 
* ✅ 21:34 Tgr closed subtask '''T200456''': ''MapCacheLRU::has called with invalid key. Must be string or integer.'' as Resolved.
* 🚂 '''wmf.14→1''' 18:19 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: Revert "all wikis to 1.32.0-wmf.14"
* 💣 18:16 zeljkofilipin added a subtask: '''T200456''': ''MapCacheLRU::has called with invalid key. Must be string or integer.''
* 🚂 '''wmf.14→2''' 18:13 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.14
* ✅ 17:01 zeljkofilipin removed a subtask: '''T200420''': ''Wikidata dispatching stuck (not releasing lockmanager locks).''
* 💣 13:18 zeljkofilipin added a subtask: '''T200420''': ''Wikidata dispatching stuck (not releasing lockmanager locks).''
* 🚂 '''wikidatawiki→wmf.13''' 12:38 <reedy@deploy1001> rebuilt and synchronized wikiversions files: wikidatawiki back to .13 T200420
* ✅ 10:51 zeljkofilipin closed subtask '''T200412''': ''PageTriage requires ORES to be installed'' as Resolved.
* 🚂 '''wmf.14→1''' 10:45 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.14
* 🚂 '''wmf.14→0''' 10:08 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.32.0-wmf.14"
* 💣 10:00 zeljkofilipin added a subtask: '''T200412''': ''PageTriage requires ORES to be installed.''
* 🚂 '''wmf.14→1''' 09:49 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.14
 
=== 2018-07-25 Wednesday ===
 
* ✅ 20:13 Krinkle closed subtask '''T200346''': ''wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure"'' as Resolved.
* ✅ 17:09 Krinkle removed a subtask: '''T200340''': ''Wikibase\DataModel\Entity\EntityIdParsingException $serialization must not be an empty string.''
* 💣 15:15 Krinkle added a subtask: '''T200346''': ''wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure".''
* 🚂 '''wmf.14→0''' 14:39 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: (no justification provided) (Revert "group1 wikis to 1.32.0-wmf.14")
* 💣 14:28 zeljkofilipin added a subtask: '''T200340''': ''Wikibase\DataModel\Entity\EntityIdParsingException $serialization must not be an empty string.''
* 🚂 '''wmf.14→1''' 13:59 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.14
 
=== 2018-07-24 Tuesday ===
 
* ✅ 12:50 thcipriani closed subtask '''T200257''': ''`scap sync` fails with `Error: You are missing some external dependencies.`'' as Resolved.
* 💣 12:04 zeljkofilipin added a subtask: '''T200257''': ''`scap sync` fails with `Error: You are missing some external dependencies.`''
 
== Conclusions ==
 
''What weaknesses did we learn about and how can we address them?''
 
* Scap should perform canary checks for sync-wikiversions.
* 1 problem was caused by the train conductor's inexperience and occurred before 1.32.0-wmf.14 was deployed to group 0.
* 4 problems were noticed after deploying 1.32.0-wmf.14 to group 1.
* 1 problem was noticed after deploying 1.32.0-wmf.14 to group 2.
 
=== Before wmf.14 → group0 ===
 
* '''T200257''' ''`scap sync` fails with `Error: You are missing some external dependencies`''
** {{done}} Feedback needed from Željko Filipin (Release Engineering).
** It was caused by the train conductor's lack of experience with train deployments. On Tuesday, 1.32.0-wmf.13 had still not been deployed to all wikis (see [[Incident documentation/20180717-Train]]), but it was time to cut the 1.32.0-wmf.14 branch. He misunderstood the process and thought that cutting the branch meant doing all the steps from [[Heterogeneous deployment/Train deploys#Before_the_deploy_window]], when it only means doing [[Heterogeneous deployment/Train deploys#Create_the_new_branch_in_Gerrit]]. No further steps, such as updating the documentation, are needed; the documentation is clear enough. He has learned the lesson.
 
=== wmf.14 → group1 ===
 
* '''T200340''' ''Wikibase\DataModel\Entity\EntityIdParsingException $serialization must not be an empty string''
** {{done}} Feedback needed from Adam Shorland (Wikimedia Deutschland).
 
* '''T200346''' ''wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure"''
** {{done}} Feedback needed from Gergő Tisza (Readers), Ian Marlier (Performance), Brad Jorsch (MediaWiki Platform), Timo Tijhof (Performance).
** This was not a new error, but rather an error being indicated incorrectly. A change that was unrelated to the ThumbnailRender job itself resulted in an MWHttpRequest returning an HTTP status of 0 instead of an HTTP status of 200. ThumbnailRender was configured to consider a status of 200 to be successful, but did not consider a status of 0 to be successful, and thus logged an error message. Realistically this should not have stopped the train, but it did require investigation to realize that. The actual remediation for this is the work that BPirkle is doing in [[phab:T202110]] and related tasks.
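The status check described above can be illustrated with a minimal Python sketch. This is not the MediaWiki code: the function name <code>request_succeeded</code> and the sample status list are hypothetical, and the sketch only assumes what the item states, namely that a status of 200 counts as success and a status of 0 does not.

```python
def request_succeeded(status: int) -> bool:
    """Strict check: only HTTP 200 counts as success (hypothetical helper)."""
    return status == 200


# Statuses as reported by the HTTP layer. A request that is reported
# with status 0 instead of 200 is flagged by the strict check, even if
# nothing about the job itself has changed.
reported_statuses = [200, 0, 500]
failures = [s for s in reported_statuses if not request_succeeded(s)]
print(failures)
```

Under this check, the status 0 entry is logged as a failure alongside the genuine 500, which is why the log message needed investigation before it could be dismissed.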
 
* '''T200412''' ''PageTriage requires ORES to be installed''
** {{done}} Feedback needed from Amir Sarabadani (Wikimedia Deutschland), Adam Wight (Scoring Platform), Stephane Bisson (Contributors).
** {{done}} It could have been prevented transparently by softening the dependency (done in {{Gerrit|448098}}), and could have been mitigated manually by knowing that it was necessary to enable ORES.
 
* '''T200420''' ''Wikidata dispatching stuck (not releasing lockmanager locks)''
** {{done}} Feedback needed from Adam Shorland (Wikimedia Deutschland).
 
=== wmf.14 → group2 ===
 
* '''T200456''' ''MapCacheLRU::has called with invalid key. Must be string or integer''
** {{done}} Feedback needed from Gergő Tisza (Readers), Aaron Schulz (Performance).
 
== Links to relevant documentation ==
''Where is the documentation that someone responding to this alert should have (cookbook / runbook)? If that documentation does not exist, there should be an action item to create it.''
 
* [[Heterogeneous deployment/Train deploys]]
 
== Actionables ==
 
''Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.''
 
'''NOTE''': Please add the [https://phabricator.wikimedia.org/tag/wikimedia-incident/ #wikimedia-incident] Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.
 
Feedback from various teams is needed on how each problem could have been prevented:
 
* (Release Engineering) [[phab:T200257]] ''`scap sync` fails with `Error: You are missing some external dependencies.`''
** {{done}} No further action needed.
 
* (Wikidata) [[phab:T200340]] ''EntityIdParsingException $serialization must not be an empty string''
** {{done}} The fix contains a regression test.
 
* (Reading Infrastructure) [[phab:T200346]] ''Failing to execute ThumbnailRender jobs''
** [[phab:T172480]] ''Add jobrunner servers to Scap canary process''
 
* (ORES/Wikidata) [[phab:T200412]] ''PageTriage requires ORES to be installed''
** [[phab:T200944]] ''Detect missing extension dependencies before production''
 
* (Wikidata) [[phab:T200420]] ''Wikidata dispatching stuck (not releasing lockmanager locks)''
** [[gerrit:448103]] ''Use getClientLockName value for releaseClientLock when dispatching''
** {{done}} The above patch fixed the issue.
** What triggered the dispatching issues is still not clear.
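The title of the patch suggests that a lock was acquired under a derived name and released under a different one. The following Python sketch illustrates that class of bug under that assumption only; it is not the Wikibase code. The <code>LockManager</code> class, the <code>dispatch-lock:</code> naming scheme and the site ID are all hypothetical.

```python
class LockManager:
    """Tiny in-memory stand-in for a lock manager (hypothetical)."""

    def __init__(self):
        self._held = set()

    def lock(self, name):
        # Refuse to take a lock that is already held.
        if name in self._held:
            return False
        self._held.add(name)
        return True

    def unlock(self, name):
        # Releasing a name that was never locked does nothing.
        if name not in self._held:
            return False
        self._held.remove(name)
        return True

    def is_held(self, name):
        return name in self._held


def get_client_lock_name(site_id):
    # Hypothetical naming scheme; the real lock name format is not
    # described in this report.
    return f"dispatch-lock:{site_id}"


locks = LockManager()
name = get_client_lock_name("enwiki")

acquired = locks.lock(name)                      # lock taken under the derived name
released_with_raw_id = locks.unlock("enwiki")    # wrong name: nothing is released
still_held = locks.is_held(name)                 # the lock is stuck
released_with_derived_name = locks.unlock(name)  # same name as at acquisition
```

Releasing with the same derived name that was used for acquisition is the only call that frees the lock in this sketch; any other name leaves it held until it expires or is cleared by hand.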
 
* (Readers) [[phab:T200456]] ''MapCacheLRU::has called with invalid key. Must be string or integer''
** [[phab:T201200]] ''Introduce soft assertions in MediaWiki''
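One possible reading of a "soft assertion" is a check that logs a warning instead of raising, so that an invalid argument does not take down the request. The Python sketch below illustrates that reading on a small LRU cache whose <code>has</code> method accepts only string or integer keys, as in the error message above. It is not MapCacheLRU and not the design proposed in the task; <code>soft_assert</code>, <code>LRUCache</code> and the logger name are hypothetical.

```python
import logging
from collections import OrderedDict

logger = logging.getLogger("soft_assert")


def soft_assert(condition, message):
    """Log a warning instead of raising when the condition fails."""
    if not condition:
        logger.warning(message)
    return bool(condition)


class LRUCache:
    """Minimal least-recently-used cache with key validation (hypothetical)."""

    def __init__(self, max_keys):
        self._max_keys = max_keys
        self._items = OrderedDict()

    def has(self, key):
        # bool is a subclass of int in Python, so it is excluded explicitly.
        valid = isinstance(key, (str, int)) and not isinstance(key, bool)
        if not soft_assert(valid, "has() called with invalid key; must be str or int"):
            return False
        return key in self._items

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        # Evict the least recently set entries once the cache is over capacity.
        while len(self._items) > self._max_keys:
            self._items.popitem(last=False)


cache = LRUCache(max_keys=2)
cache.set("a", 1)
invalid_lookup = cache.has(None)  # logs a warning and returns False
```

With a hard assertion the <code>None</code> lookup would raise and surface as a production error; with the soft variant it is logged and treated as a cache miss, which trades strictness for availability.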
 
* (Release Engineering) [[phab:T198640]] ''Perform scap canary checks after sync-wikiversions''
 
[[Category:Incident documentation]]

Latest revision as of 17:46, 8 April 2022