Incident documentation/2021-11-05 TOC language converter
document status: draft
On wikis with language variants enabled, tables of contents (TOCs) were no longer having language-variant conversion applied. Rolling the train back made the problem worse on *all* wikis, since the old version could not handle the updated ParserCache entries produced by the TOC changes. The train was therefore rolled forward again, leaving broken TOCs on only a far smaller subset of pages on language-variant wikis. A fix addressing language conversion in TOCs was deployed late Friday evening PST to mitigate the worst impacts.
Impact: Initially, language-variant wikis served TOCs without variant conversion applied. During the rollback, the TOC disappeared entirely on all wikis for a brief period. After the train was rolled forward again, only a subset of pages on language-variant wikis still had broken TOCs.
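The ParserCache incompatibility behind the rollback problem can be illustrated with a minimal sketch. This is hypothetical Python, not MediaWiki's actual ParserCache code: it only shows the general failure mode where a new parser writes the TOC as structured metadata while the rolled-back parser still expects it embedded in the cached HTML, so cached pages render with no TOC at all.

```python
# Illustrative sketch (NOT MediaWiki code) of a cache-format change
# breaking an older reader after a rollback.

def new_writer(parsed_page):
    # New parser: stores the TOC as structured metadata
    # instead of embedding it in the cached HTML.
    return {"html": parsed_page["body"], "toc": parsed_page["toc"]}

def old_reader(cache_entry):
    # Old parser: expects the TOC to already be part of the HTML,
    # so entries written by the new code appear to have no TOC at all.
    return cache_entry["html"]

entry = new_writer({"body": "<p>Article text</p>",
                    "toc": ["Section 1", "Section 2"]})
rendered = old_reader(entry)
assert "Section 1" not in rendered  # TOC silently lost on rollback
print(rendered)
```

The asymmetry is the key point: the new code tolerates old cache entries, but the old code silently drops data it does not know about, which is why rolling back widened the blast radius instead of shrinking it.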
September 15th: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/721115 work begins on table of contents patch (author: Jdlrobson, CScott reviewing)
October 27th: Patch merged.
November 2nd: Patch begins rolling out on the train. At this point there is no reason to believe it is risky.
Wednesday November 3rd: https://phabricator.wikimedia.org/T295003 is reported, which hints at potential problems with code that consumes the table of contents.
Thursday November 4th: With the train rolled out, Chinese-language wikis are impacted by the yet-to-be-reported T295187.
Friday November 5th:
- 10:56 PST https://phabricator.wikimedia.org/T295187 is reported
- 11:18 Subbu flags issue to cscott and jdlrobson, who begin analyzing the issue
- 12:41 The issue is escalated to "Unbreak Now!" priority.
- 13:09 Train is rolled back. This causes a brief client error spike relating to the previously documented error https://phabricator.wikimedia.org/T295079
- 14:23 Subbu alerts jdlrobson and cscott on Slack that the train was rolled back; neither had seen this happen.
- 15:20 jdlrobson reports that ParserCache entries are not compatible between the two versions; as a result, the table of contents is no longer present on cached pages.
- 15:28 cscott comments on a patch he's working on to provide a solution
- 15:32 The train is rolled forward again, returning to the 11:18 state.
- 18:08 Cscott's patch https://gerrit.wikimedia.org/r/737079 is backported to fix T295187.
- 18:43 Cscott's patch is merged and deployed, resolving T295187.
The table of contents incorrectly rendered simplified Chinese characters instead of traditional ones on several projects (from Wednesday November 3rd on zh.wikivoyage, and from Thursday November 4th on zh.wikipedia) until Friday November 5th at 18:43.
On Friday, from 13:09 to 15:32 PST, while the train was rolled back, the table of contents disappeared from cached pages on all Wikipedias.
Write how the issue was first detected. Was automated monitoring first to detect it? Or a human reporting an error?
Copy the relevant alerts that fired in this section.
Did the appropriate alert(s) fire? Was the alert volume manageable? Did they point to the problem with as much accuracy as possible?
TODO: If human only, an actionable should probably be to "add alerting".
What weaknesses did we learn about and how can we address them?
What went well?
- (Use bullet points) for example: automated monitoring detected the incident, outage was root-caused quickly, etc
What went poorly?
- (Use bullet points) for example: documentation on the affected service was unhelpful, communication difficulties, etc
- The patch was not identified as risky, due to the complexity of the parser code, so it was not flagged to train conductors on Monday as a risky change. Changes to the parser should likely always be flagged as risky, even when we cannot articulate the specific risk.
- Key individuals analyzing the problem were not involved in the decision to roll back the train on a Friday. The rollback caused the table of contents to disappear, a side effect that had been identified but not yet documented at the time. Ideally, we should reach out to those closest to the problem before making such decisions, even if that delays fixing the bug.
- The LanguageConverter's use of the table of contents was not documented in any test, so it was not obvious while the patch was being written. A well-written unit test could have caught this issue.
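As a sketch of the kind of coverage that was missing: a test asserting that TOC entries pass through variant conversion. This is hypothetical Python; MediaWiki's real tests are PHPUnit, and `convert_variant`/`build_toc` are illustrative stand-ins, not MediaWiki APIs.

```python
# Hypothetical regression-test sketch (not MediaWiki code): TOC entries
# must be run through language-variant conversion, just like body text.

# Tiny simplified -> traditional Chinese mapping for illustration.
ZH_HANT = {"汉": "漢", "语": "語"}

def convert_variant(text, table):
    # Stand-in for LanguageConverter: per-character variant mapping.
    return "".join(table.get(ch, ch) for ch in text)

def build_toc(section_headings, table):
    # The behavior under test: TOC entries get variant conversion applied.
    return [convert_variant(h, table) for h in section_headings]

def test_toc_gets_variant_conversion():
    toc = build_toc(["汉语"], ZH_HANT)
    assert toc == ["漢語"], "TOC entries must have variant conversion applied"

test_toc_gets_variant_conversion()
print("ok")
```

A test like this fails the moment TOC generation stops routing headings through the converter, surfacing the regression at review time rather than after a train deploy.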
Where did we get lucky?
- (Use bullet points) for example: user's error report was exceptionally detailed, incident occurred when the most people were online to assist, etc
- Numerous user reports from Chinese-language speakers made it clear this was a real problem.
- Despite it being late on a Friday, we managed to get the right people together to ship a fix before the weekend. We were lucky that certain individuals went above and beyond, working late on Friday to make sure the problem was dealt with.
How many people were involved in the remediation?
- (Use bullet points) for example: 2 SREs and 1 software engineer troubleshooting the issue plus 1 incident commander
Links to relevant documentation
Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.
Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.
- To do #1 (TODO: Create task)
- To do #2 (TODO: Create task)
TODO: Add the #Sustainability (Incident Followup) Phabricator tag to these tasks.