
Difference between revisions of "Incident documentation/2021-11-05 TOC language converter"

imported>Jdlrobson
imported>Quiddity
(clarify footnote)

Revision as of 21:15, 11 November 2021

document status: draft

Summary

On wikis with language variants enabled (ban/crh/gan/iu/kk/ku/shi/sr/tg/uz/zh, but noticed and reported on Chinese Wikipedia), the Table of Contents was not being converted to the selected language variant. Train rollback made the problem worse on all wikis: any page put into the parser cache by the new release had no table of contents at all when the train was rolled back (at least until the page was manually purged), since the rollback version of MediaWiki didn't know how to handle the Table of Contents marker in the ParserCache contents left by the newer version.

The train was rolled forward again, as the "lesser of two evils". A fix to properly convert Table of Contents into the selected language variant was rolled out late Friday evening PST to mitigate the worst impacts.

Impact: Initially, affected wikis had Tables of Contents displayed in an incorrect or inconsistent language variant on all pages. (Depending on the language, this may or may not render it unreadable to a subset of visitors.) On rollback, the Table of Contents was entirely lost on pages rendered since the initial train deploy on all wikis. The train was rolled forward again to restore the original impact, then a patch was deployed to correct variant rendering in the table of contents on all but a small subset of pages on those wikis.
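
The rollback failure described above can be illustrated with a small Python model. This is a sketch only, not MediaWiki's PHP code: the marker string, the CacheEntry fields, and the parse_* and get_text_* functions are hypothetical stand-ins for the ParserCache contents and the ParserOutput::getText behavior.

```python
from dataclasses import dataclass

# Hypothetical placeholder; the real marker used by MediaWiki may differ.
TOC_MARKER = "<toc-placeholder/>"
TOC_HTML = '<div id="toc"><ul><li>History</li></ul></div>'


@dataclass
class CacheEntry:
    """Toy stand-in for a ParserCache entry."""
    html: str
    toc_html: str = ""


def parse_old() -> CacheEntry:
    # Older release: the table of contents is baked into the cached HTML.
    return CacheEntry(html=TOC_HTML + "<p>Body</p>")


def parse_new() -> CacheEntry:
    # Newer release: only a marker is cached; the TOC is stored separately.
    return CacheEntry(html=TOC_MARKER + "<p>Body</p>", toc_html=TOC_HTML)


def get_text_old(entry: CacheEntry) -> str:
    # Older release knows nothing about the marker and emits the HTML as-is.
    return entry.html


def get_text_new(entry: CacheEntry) -> str:
    # Newer release swaps the marker for the TOC at output time.
    return entry.html.replace(TOC_MARKER, entry.toc_html)


def has_toc(html: str) -> bool:
    return 'id="toc"' in html


# Entries written before the deploy still render a TOC on the newer release.
print(has_toc(get_text_new(parse_old())))  # True
# Entries written by the newer release lose their TOC after a rollback.
print(has_toc(get_text_old(parse_new())))  # False
```

In this model only a re-parse (purge) or a roll-forward brings the table of contents back for entries written by the newer release, which matches the behavior reported in the summary.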

Timeline

Link to a specific offset in SAL using the SAL tool at https://sal.toolforge.org/ (example)

September 15

October 27

  • Patch merged ahead of a project status update scheduled for the morning of Oct 28.

November 2

  • 20:01 UTC: Patch begins roll out, first only to test wikis due to US holiday. No reason at this point to believe it's risky.

November 3

  • 19:15 UTC: Train rolls out to group 0, delayed due to US holiday.
  • 19:51 UTC: Thirty-six minutes after group 0, train rolls out to group 1, which includes Wikivoyage as well as non-Wikipedia Chinese projects. Both ToC issues (T295003 and T295187) would have begun to appear on group 1 wikis.

November 4

  • 06:56 UTC: phab:T295003 is reported on Wikivoyage, an incompatibility with the mw:Extension:WikidataPageBanner extension which causes tables of contents to appear on Wikivoyage pages (they are usually suppressed on mainspace pages by the extension, and a custom pagebanner inserted). A temporary workaround using site CSS/JS is developed, and this bug does not block the train.
  • 19:29 UTC: Train rolls out to group 2, which includes Chinese Wikipedia (zhwiki). Chinese Wikipedia begins to be affected by T295187.

November 5

  • 16:17 UTC: gerrit:737075 is written and merged (16:57 UTC) to fix the issues with Wikivoyage; however, it is not immediately backported as a temporary fix is already in place.
  • 17:56 UTC: phab:T295187 is reported: "Since yesterday" language conversion has failed to be applied to the table of contents "on Chinese Wikipedia".
  • 18:18 UTC: Subbu flags issue to cscott and jdlrobson, who begin analyzing the issue.
  • 19:41 UTC: Phab task T295187 is set to Unbreak Now.
  • 19:42 UTC: legoktm points out the issue being UBN to dduvall as that week's train conductor. Discussion on whether it's rollback-worthy happens in #wikimedia-releng.
  • 20:08 UTC: User:dduvall sets T295187 to be a train blocker (phab:T293948).
  • 20:17 UTC: Train rolled back to 1.38.0-wmf.6 on group 0/1/2 wikis. Missing Tables of Contents begin to appear on all wikis: phab:????. Brief client error spike relating to the previously documented error https://phabricator.wikimedia.org/T295079.
  • 21:23 UTC: Subbu alerts jdlrobson and cscott on Slack that the train was rolled back. Neither had seen this.
  • 22:20 UTC: jdlrobson reports that the parser cache is not compatible between versions; as a result, the table of contents is no longer present on cached pages.
  • 22:28 UTC: cscott comments on a patch he's working on to provide a solution.
  • 22:19 UTC-22:32 UTC: The train is rolled forward to 1.38.0-wmf.7 on all wikis again.

November 6

  • 00:10 UTC: User:cscott's first "quick-and-dirty" patch gerrit:737150 is uploaded; it is a bit safer but would not fix pages which already have "non-language-converted" renders in the ParserCache (which would initially be all pages in the ParserCache)
  • 00:27 UTC: User:cscott's follow-up patch gerrit:737079 is uploaded; by doing the language conversion in ParserOutput::getText it would ensure that cached pages are converted properly.
  • 01:08 UTC: User:cscott's patch gerrit:737079 is backported to fix T295187.
  • 01:43 UTC: User:cscott's patch is deployed, resolving T295187.
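
Why converting at output time repairs already-cached pages can be sketched in Python. This is an illustrative model, not the code in either Gerrit patch: toy_convert, the toy-upper variant, and the cache layout are hypothetical, and the "trust the cache" reader is only an assumption about what an approach that cannot fix existing cache entries looks like.

```python
def toy_convert(text: str, variant: str) -> str:
    """Hypothetical stand-in for a language converter."""
    return text.upper() if variant == "toy-upper" else text


# A cache entry written while the bug was live: TOC headings are unconverted.
stale_entry = {"toc": ["history", "geography"]}


def get_text_trusting_cache(entry: dict, variant: str) -> list:
    # Relies on conversion having happened before caching,
    # so stale entries stay unconverted until they are re-parsed.
    return entry["toc"]


def get_text_output_time(entry: dict, variant: str) -> list:
    # Converts on every read, so stale entries are repaired as well.
    return [toy_convert(heading, variant) for heading in entry["toc"]]


print(get_text_trusting_cache(stale_entry, "toy-upper"))  # ['history', 'geography']
print(get_text_output_time(stale_entry, "toy-upper"))     # ['HISTORY', 'GEOGRAPHY']
```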

November 8

  • 19:39 UTC: The fix for Wikivoyage is backported and deployed. Admins remove their workarounds and confirm the fix.

Epilogue

The table of contents was incorrectly presenting the author's original variant (instead of converting to consistent simplified or traditional characters) on Chinese Wikivoyage and other non-Wikipedia projects from Nov 3 19:51 UTC to Nov 6 01:43 UTC. Other non-Wikipedia projects in ban/crh/gan/iu/kk/ku/shi/sr/tg/uz/zh languages were also affected during the same time frame.

Chinese Wikipedia and Wikipedia projects in ban/crh/gan/iu/kk/ku/shi/sr/tg/uz/zh languages were incorrectly rendered from Nov 4 19:29 UTC to Nov 6 01:43 UTC.

In certain of these languages, readers are not typically literate in more than one script; for these readers the Table of Contents would have been unintelligible for that time period. In other language regions, like Serbian, most readers are literate in both dominant scripts for the language and the issue would have been mostly cosmetic.

From Nov 5 20:17 UTC (when the train was rolled back to wmf.6) to 22:32 UTC (when we restored wmf.7), the table of contents disappeared in all wiki projects on pages which were rendered (most likely due to an edit) between Nov 3 19:15 (group 0/1) or Nov 4 19:29 (group 2) and Nov 5 20:17.
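
For reference, the length of each impact window can be computed from the timestamps given in this section. The sketch below uses only those timestamps; the labels and the helper names are illustrative.

```python
from datetime import datetime, timezone


def utc(day: int, hour: int, minute: int) -> datetime:
    # All timestamps in this report fall in November 2021 (UTC).
    return datetime(2021, 11, day, hour, minute, tzinfo=timezone.utc)


WINDOWS = {
    "unconverted TOC, non-Wikipedia projects": (utc(3, 19, 51), utc(6, 1, 43)),
    "unconverted TOC, Wikipedia projects": (utc(4, 19, 29), utc(6, 1, 43)),
    "missing TOC after rollback": (utc(5, 20, 17), utc(5, 22, 32)),
}


def duration_hm(start: datetime, end: datetime) -> tuple:
    """Return the window length as (hours, minutes)."""
    total_minutes = int((end - start).total_seconds()) // 60
    return divmod(total_minutes, 60)


for label, (start, end) in WINDOWS.items():
    hours, minutes = duration_hm(start, end)
    print(f"{label}: {hours}h {minutes:02d}m")
# unconverted TOC, non-Wikipedia projects: 53h 52m
# unconverted TOC, Wikipedia projects: 30h 14m
# missing TOC after rollback: 2h 15m
```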

Detection

The issue was first detected by users of Chinese Wikipedia. There was no automated monitoring.

On investigation, there do not appear to be any parser tests or other test cases which exercise language conversion on the table of contents (phab:T295187) or which verify the correct operation of the WikidataPageBanner extension (phab:T295003).

Rollback removed the table of contents from many articles on all wikis; this was also not detected by any monitoring.

Rollback did cause spurious (but unrelated to the table of contents issue) alerts, as discussed above: phab:T295079.
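
As an illustration of the kind of check that could have detected the missing tables of contents, the following Python sketch inspects rendered HTML. It is not an existing monitor. It assumes the table of contents is emitted as an element with id="toc" and normally appears once a page has at least four headings; both are assumptions about the default output, and pages that deliberately suppress the table of contents would need to be excluded.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h2", "h3", "h4", "h5", "h6"}


class TocProbe(HTMLParser):
    """Counts section headings and notes whether a TOC element is present."""

    def __init__(self):
        super().__init__()
        self.heading_count = 0
        self.has_toc = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.heading_count += 1
        if dict(attrs).get("id") == "toc":  # assumed TOC element id
            self.has_toc = True


def toc_missing(html: str, min_headings: int = 4) -> bool:
    """True when a page has enough headings to expect a TOC but has none."""
    probe = TocProbe()
    probe.feed(html)
    return probe.heading_count >= min_headings and not probe.has_toc


sections = "".join(f"<h2>Section {i}</h2><p>text</p>" for i in range(4))
print(toc_missing(sections))                          # True
print(toc_missing('<div id="toc"></div>' + sections))  # False
```

Sampling a handful of recently edited pages per wiki after each train step and alerting when the share of flagged pages jumps would be one way to use such a probe.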


Actionables:

  • Add test cases which exercise language conversion on the table of contents
  • Add test cases to WikidataPageBanner extension
  • Add test cases for any changes to ParserOutput::getText to check backwards compatibility with previous values

Conclusions

This incident exposed weaknesses in test coverage of the Table of Contents, and in the way that Parser Cache content interacts with our deployment and versioning systems. Content stored in RESTBase has the potential for similar issues, as discussed below, but has slightly better purging and versioning systems to allow prevention and/or mitigation of version mismatch issues such as these.

In addition, procedural weaknesses were exposed in flagging potentially "risky" patches, and in the forum used for rollback discussions. A related issue is that, due to time zone skew, detecting and reacting to failures in Chinese Wikipedia (deployed late UTC time on Thursday) can easily push timelines past 5pm local time on a Friday for engineers involved in the response. The community involved in the smaller group 1 projects, like Chinese Wikivoyage, would in theory have noticed both ToC problems a full day earlier, but those community members did not successfully relay the issue to WMF staff. It may be advisable to move Chinese Wikipedia from group 2 to group 1 in order to accelerate detection and response to issues.

What went well?

  • We went into the weekend in a more or less stable state thanks to many engineers going above and beyond and staying late on a Friday.
  • Some good lessons learned are likely to come out of this :-)

What went poorly?

  • The patch was not correctly identified as risky due to complexities in the parser code so this was not flagged on Monday as a risky patch to train conductors. Changes to the parser should likely always be flagged as risky even if we don't know how, and as part of the code review process we should have considered rollback plans. The ParserOutput does not have any concept of versioning and the ParserOutput::getText method is not documented in such a way that makes it clear such changes are risky.
  • This exposed issues in communication protocols. Key individuals analyzing the problem were conversing on Slack (product engineers are required to be there), while train conductors were conversing on IRC (release engineers are required to be there). Other individuals were offline due to Silent Fridays [1]. The decision to roll back the train on a Friday led to the disappearance of tables of contents, a consequence that had been identified but not documented at the time. Ideally, we should always reach out to those closest to the problem at hand before making such decisions, even if that delays fixing the bug, but the communication protocols failed us here.
  • The LanguageConverter's use of the table of contents was not documented in a test, so it was not obvious when the patch was being written. This issue could have been caught with a well-written unit test.
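
To make the versioning point concrete, here is a minimal Python sketch of what a version stamp on cached output could look like. It is not how ParserOutput works today (the report notes it has no concept of versioning); VersionedEntry, format_version, and load are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VersionedEntry:
    """Hypothetical cache entry stamped with the format that wrote it."""
    format_version: int
    html: str


def load(entry: VersionedEntry, max_supported: int) -> Optional[str]:
    """Return cached HTML, or None to signal a cache miss and force a re-parse."""
    if entry.format_version > max_supported:
        # Written by a newer release than this reader understands.
        return None
    return entry.html


new_entry = VersionedEntry(format_version=2, html="<toc-placeholder/><p>Body</p>")

print(load(new_entry, max_supported=2))  # newer release: cache hit
print(load(new_entry, max_supported=1))  # rolled-back release: None, so re-parse
```

With a stamp like this, a rolled-back release would pay for extra parses instead of serving entries it cannot render correctly.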

Where did we get lucky?

  • Lots of user reports from Chinese language speakers made it clear this was a problem.
  • Despite being late on a Friday we managed to (eventually) get the right people together to provide a fix for the weekend. We were lucky that certain individuals went above and beyond to work late on Friday to make sure the problem was dealt with.

How many people were involved in the remediation?

  • (Use bullet points) for example: 2 SREs and 1 software engineer troubleshooting the issue plus 1 incident commander

Links to relevant documentation

Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.

Actionables

Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

Technical changes:

  • Add test cases which exercise language conversion on the table of contents
  • Add test cases to WikidataPageBanner extension
  • Add test cases for any changes to ParserOutput::getText to check backwards compatibility with previous values
  • Add documentation to ParserOutput::getText with guidance around how best to make backwards compatible changes and that any changes there should always be considered risky and have a rollback plan.
  • Enable Pig Latin variant on English Wikipedia beta cluster.
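
One possible shape for the backwards-compatibility tests is a set of frozen cache samples, one per historical format, that every future renderer must still handle. The Python sketch below is a model of that idea, not MediaWiki test code: the fixture contents, the marker string, and the function names are hypothetical.

```python
import json

TOC_MARKER = "<toc-placeholder/>"  # hypothetical marker
TOC_HTML = '<div id="toc">Contents</div>'

# Frozen samples of cache entries as earlier releases would have written them.
FIXTURES = {
    "format 1 (TOC inlined)": json.dumps({"html": TOC_HTML + "<p>Body</p>"}),
    "format 2 (marker, TOC stored separately)": json.dumps(
        {"html": TOC_MARKER + "<p>Body</p>", "toc_html": TOC_HTML}
    ),
}


def get_text(blob: str) -> str:
    """Current renderer: must cope with every format in FIXTURES."""
    entry = json.loads(blob)
    return entry["html"].replace(TOC_MARKER, entry.get("toc_html", ""))


def get_text_marker_unaware(blob: str) -> str:
    """A renderer that ignores the marker, as a rolled-back release would."""
    return json.loads(blob)["html"]


def fixtures_without_toc(render, fixtures: dict) -> list:
    """Names of the fixtures whose rendered output lacks a table of contents."""
    return [name for name, blob in fixtures.items() if 'id="toc"' not in render(blob)]


print(fixtures_without_toc(get_text, FIXTURES))                 # []
print(fixtures_without_toc(get_text_marker_unaware, FIXTURES))  # flags format 2
```

Adding a new fixture whenever the cached format changes would keep the suite exercising both roll-forward and rollback reads.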

Process changes

  • Revise communication protocols based on the current expectations of communication medium of WMF employees. For example, if product engineers are required to be available on IRC, that should be communicated broadly. If that's not a requirement, we should perhaps use email/Phabricator as the primary communication channel.

TODO: Add the #Sustainability (Incident Followup) Phabricator tag to these tasks.

  1. This is a link to a private wiki. In summary, it describes how employees are encouraged to use Friday as focus time and feel free to limit communication mediums, in order to reduce problems occurring before the weekend, and avoid meetings during Friday evenings in many timezones.