Incident documentation/2021-11-05 TOC language converter

Revision as of 01:46, 10 November 2021 by imported>C. Scott Ananian (→‎Epilogue)

document status: draft

Summary

On wikis with language variants enabled (ban/crh/gan/iu/kk/ku/shi/sr/tg/uz/zh, but noticed and [[phab:T295187|reported on Chinese Wikipedia]]), the Table of Contents was not being converted to the selected language variant. Rolling back the train made the problem worse on *all* wikis: any page put into the parser cache by the new release had *no table of contents at all* once the train was rolled back (at least until the page was manually purged), because the rolled-back version of MediaWiki did not know how to handle the Table of Contents marker that the newer version had left in the ParserCache contents.

The train was rolled forward again, as the "lesser of two evils". A fix to properly convert Table of Contents into the selected language variant was rolled out late Friday evening PST to mitigate the worst impacts.

Impact: Initially, affected wikis had Tables of Contents displayed in an incorrect or inconsistent language variant on all pages. (Depending on the language, this may or may not have rendered them unreadable to a subset of visitors.) On rollback, the Table of Contents was lost entirely, on all wikis, from pages rendered since the initial train deploy. The train was rolled forward again to restore the original impact, then a patch was deployed to correct variant rendering in the table of contents on all but a small subset of pages on those wikis.

Timeline

Link to a specific offset in SAL using the SAL tool at https://sal.toolforge.org/ (example)

September 15

October 27

  • Patch merged ahead of a project status update scheduled for the morning of Oct 28.

November 2

  • 20:01 UTC: Patch begins roll out, first only to test wikis due to US holiday. No reason at this point to believe it's risky.

November 3

  • 19:15 UTC: Train rolls out to group 0, delayed due to US holiday.
  • 19:51 UTC: About thirty minutes after group 0, train rolls out to group 1, which includes Wikivoyage as well as non-Wikipedia Chinese projects. Both ToC issues (T295003 and T295187) would have begun to appear on group 1 wikis.

November 4

  • 06:56 UTC: phab:T295003 is reported on Wikivoyage: an incompatibility with the mw:Extension:WikidataPageBanner extension causes tables of contents to appear on Wikivoyage pages (the extension usually suppresses them on mainspace pages and inserts a custom page banner instead). A temporary workaround using site CSS/JS is developed, and this bug does not block the train.
  • 19:29 UTC: Train rolls out to group 2, which includes Chinese Wikipedia (zhwiki). Chinese Wikipedia begins to be affected by T295187.

November 5

  • 16:17 UTC: gerrit:737075 is written and merged (16:57 UTC) to fix the issues with wikivoyage; however, it is not immediately backported as a temporary fix is already in place.
  • 17:56 UTC: phab:T295187 is reported: "Since yesterday" language conversion has failed to be applied to the table of contents "on Chinese Wikipedia".
  • 18:18 UTC: Subbu flags issue to cscott and jdlrobson, who begin analyzing the issue.
  • 19:41 UTC: Phab task T295187 is set to Unbreak Now.
  • 20:08 UTC: User:dduvall sets T295187 to be a train blocker (phab:T293948).
  • 20:17 UTC: Train rolled back to 1.38.0-wmf.6 on group 0/1/2 wikis. Missing Tables of Contents begin to appear on all wikis: phab:????. Brief client error spike relating to the previously documented error https://phabricator.wikimedia.org/T295079.
  • 21:23 UTC: Subbu alerts jdlrobson and cscott on Slack that the train was rolled back. Neither had seen this.
  • 22:20 UTC: jdlrobson reports that the parser cache is not compatible between versions; as a result, the table of contents is no longer present on cached pages.
  • 22:28 UTC: cscott comments on a patch he's working on to provide a solution.
  • 22:19 UTC-22:32 UTC: The train is rolled forward to 1.38.0-wmf.7 on all wikis again.

November 6

  • 00:10 UTC: User:cscott's first "quick-and-dirty" patch gerrit:737150 is uploaded; it is a bit safer but would not fix pages which already have "non-language-converted" renders in the ParserCache (which would initially be all pages in the ParserCache).
  • 00:27 UTC: User:cscott's follow-up patch gerrit:737079 is uploaded; by doing the language conversion in ParserOutput::getText, it ensures that cached pages are converted properly.
  • 01:08 UTC: User:cscott's patch gerrit:737079 is backported to fix T295187.
  • 01:43 UTC: User:cscott's patch is deployed, resolving T295187.
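
The difference between converting at parse time and converting when the cached text is read can be illustrated with a minimal sketch. It is written in Python rather than MediaWiki's PHP, and every name in it (ParserOutputSketch, convert_variant, the two-character conversion table) is hypothetical. It models only the idea behind the follow-up patch: if conversion runs each time the text is read, entries that were cached before the fix are converted as well.

```python
# Toy model only: this is not MediaWiki code and all names are hypothetical.

# Stand-in for a language converter: maps two traditional characters to
# their simplified forms when the reader's variant is "zh-hans".
TOY_TABLE = {"zh-hans": str.maketrans({"歷": "历", "參": "参"})}


def convert_variant(text: str, variant: str) -> str:
    """Convert text to the requested variant; unknown variants pass through."""
    table = TOY_TABLE.get(variant)
    return text.translate(table) if table else text


class ParserOutputSketch:
    """A cached render: body HTML plus a table of contents that was
    stored without language conversion."""

    def __init__(self, body_html: str, toc_html: str):
        self.body_html = body_html  # assumed to be handled elsewhere
        self.toc_html = toc_html

    def get_text(self, variant: str) -> str:
        # Conversion happens on every read, so entries written to the
        # cache before the fix are converted too.
        return convert_variant(self.toc_html, variant) + self.body_html


# An entry cached before the fix: its ToC is still in the author's variant.
cache = {"Page": ParserOutputSketch("<p>body</p>", "<ul><li>歷史</li></ul>")}
html = cache["Page"].get_text("zh-hans")
```

A fix applied only at parse time would leave the entry in `cache` untouched until the page was re-rendered, which matches the limitation noted for the first patch.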

November 8

  • 19:39 UTC: The fix for wikivoyage is backported and deployed. Admins remove their workarounds and confirm the fix.

Epilogue

The table of contents was incorrectly presenting the author's original variant (instead of converting to consistent simplified or traditional characters) on Chinese wikivoyage and other non-Wikipedia projects from Nov 3 19:51 UTC to Nov 6 01:43 UTC. Other non-Wikipedia projects in ban/crh/gan/iu/kk/ku/shi/sr/tg/uz/zh languages were also affected during the same time frame.

Chinese Wikipedia and Wikipedia projects in ban/crh/gan/iu/kk/ku/shi/sr/tg/uz/zh languages were incorrectly rendered from Nov 4 19:29 UTC to Nov 6 01:43 UTC.

In some of these languages, readers are not typically literate in more than one script; for these readers the Table of Contents would have been unintelligible for that time period. In other language regions, such as Serbian, most readers are literate in both dominant scripts for the language, and the issue would have been mostly cosmetic.

From Nov 5 20:17 UTC (when the train was rolled back to wmf.6) to 22:32 UTC (when we restored wmf.7), the table of contents disappeared on all wiki projects from pages which had been rendered (most likely due to an edit) between Nov 3 19:15 UTC (group 0/1) or Nov 4 19:29 UTC (group 2) and Nov 5 20:17 UTC.
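
The rollback failure can be sketched as a cache-format mismatch between two releases. The following is a toy model in Python, not MediaWiki code: the marker string and both function names are hypothetical, and it assumes the older release simply passes unknown markup through, which the report does not spell out.

```python
# Toy model only: the marker string and function names are hypothetical.

TOC_MARKER = "<toc-placeholder/>"  # stand-in for the marker written by the newer release


def render_new(cached_html: str, toc_html: str) -> str:
    """Newer release: swap the marker for the table of contents at output time."""
    return cached_html.replace(TOC_MARKER, toc_html)


def render_old(cached_html: str) -> str:
    """Older release: expects the ToC to be inline already and does not
    know the marker, so the cached HTML is returned without a ToC."""
    return cached_html


toc = "<ul><li>History</li></ul>"
entry_written_by_new = "<h1>Title</h1>" + TOC_MARKER + "<p>body</p>"

after_rollforward = render_new(entry_written_by_new, toc)  # ToC present
after_rollback = render_old(entry_written_by_new)          # ToC missing
```

In this model, purging the page under the older release would rewrite the cache entry in the older format, which is consistent with the observation that a manual purge restored the table of contents.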

Detection

Write how the issue was first detected. Was automated monitoring first to detect it? Or a human reporting an error?

Copy the relevant alerts that fired in this section.

Did the appropriate alert(s) fire? Was the alert volume manageable? Did they point to the problem with as much accuracy as possible?

TODO: If human only, an actionable should probably be to "add alerting".

Conclusions

What weaknesses did we learn about and how can we address them?

What went well?

  • (Use bullet points) for example: automated monitoring detected the incident, outage was root-caused quickly, etc

What went poorly?

  • Because of complexities in the parser code, the patch was not correctly identified as risky, so it was not flagged to train conductors on Monday as a risky patch. Changes to the parser should probably always be flagged as risky, even if we don't know how they are risky.
  • Key individuals analyzing the problem were not involved in the decision to roll back the train on a Friday. The rollback led to the disappearance of the table of contents, a consequence which had been identified but not documented at the time. Ideally, we should always reach out to those closest to the problem at hand before making such decisions, even if that delays fixing the bug.
  • The LanguageConverter's use of the table of contents was not documented in a test, so it was not obvious when the patch was being written. This issue could have been caught with a well-written unit test.
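
As an illustration of the kind of test meant in the last point, here is a minimal sketch using Python's unittest. The converter and renderer are toy stand-ins with hypothetical names, not the MediaWiki API; a real test would exercise the actual parser and LanguageConverter instead.

```python
import unittest

# Toy stand-ins with hypothetical names; not the MediaWiki API.
TOY_TABLE = str.maketrans({"歷": "历"})


def convert(text, variant):
    """Convert to simplified characters for zh-hans; otherwise pass through."""
    return text.translate(TOY_TABLE) if variant == "zh-hans" else text


def render_page(headings, variant):
    """Return (toc, body), with both run through the converter."""
    toc = [convert(h, variant) for h in headings]
    body = [convert("== " + h + " ==", variant) for h in headings]
    return toc, body


class TocVariantTest(unittest.TestCase):
    def test_toc_matches_selected_variant(self):
        toc, body = render_page(["歷史"], "zh-hans")
        # The ToC entry must be in the selected variant...
        self.assertEqual(toc, ["历史"])
        # ...and must agree with the heading as shown in the body.
        self.assertIn(toc[0], body[0])
```

Saved as a module, this can be run with `python -m unittest`. The assertion that the ToC entry agrees with the body heading is the check that would fail if a change stopped routing the table of contents through conversion.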

Where did we get lucky?

  • Lots of user reports from Chinese-language speakers made it clear this was a problem.
  • Despite it being late on a Friday, we managed to get the right people together to provide a fix for the weekend. We were lucky that certain individuals went above and beyond, working late on Friday to make sure the problem was dealt with.

How many people were involved in the remediation?

  • (Use bullet points) for example: 2 SREs and 1 software engineer troubleshooting the issue plus 1 incident commander

Links to relevant documentation

Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.

Actionables

Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

  • To do #1 (TODO: Create task)
  • To do #2 (TODO: Create task)

TODO: Add the #Sustainability (Incident Followup) Phabricator tag to these tasks.