Incident documentation/2021-11-04 large file upload timeouts


document status: draft

Summary

Since the upgrade to Debian Buster, large file uploads (anecdotally, anything over 300MB) had been failing because of timeouts when uploading the file to Swift cross-datacenter (all uploads are sent to both datacenters). The cause was determined to be the libcurl upgrade enabling HTTP/2 by default, which is generally slower than HTTP/1 at these kinds of transfers (see this Cloudflare blog post for a brief explainer). Forcing all MediaWiki requests to Swift to use HTTP/1 immediately fixed the issue, and we have since disabled HTTP/2 on internal nginx instances that are used only for TLS termination. The nginx TLS termination code was originally used for public traffic, which is why it made sense to enable HTTP/2 for that use case.
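
The essence of the change can be sketched in a few lines of PHP (a minimal sketch with a placeholder Swift URL and file path, not the actual MediaWiki/MultiHttpClient patch): pin the libcurl handle to HTTP/1.1 so it never negotiates HTTP/2 via ALPN.

  <?php
  // Placeholder internal Swift endpoint and local file; not real values.
  $url  = 'https://swift.example.internal/v1/AUTH_test/container/object';
  $file = '/tmp/large-upload.bin';

  $ch = curl_init($url);
  $fp = fopen($file, 'rb');
  curl_setopt_array($ch, [
      CURLOPT_UPLOAD         => true,            // PUT with a request body
      CURLOPT_INFILE         => $fp,
      CURLOPT_INFILESIZE     => filesize($file),
      CURLOPT_RETURNTRANSFER => true,
      // The key line: from libcurl 7.62.0 the default prefers HTTP/2 over
      // TLS when the server offers it, which was far slower for these
      // large cross-datacenter PUTs.
      CURLOPT_HTTP_VERSION   => CURL_HTTP_VERSION_1_1,
  ]);

  if (curl_exec($ch) === false) {
      fwrite(STDERR, 'Upload failed: ' . curl_error($ch) . "\n");
  }
  curl_close($ch);
  fclose($fp);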

Impact: Editors wishing to upload large files, primarily on Commons, were simply unable to do so. Adding to the frustration, UploadWizard and other tools could not provide detailed error messages about the failure, reporting only that it had failed. Editors either gave up or resorted to server-side upload requests, consuming more sysadmin time. Some editors also downscaled files so they could be uploaded, sacrificing quality and providing a poorer experience for readers.

Timeline

  • Feb 11: T274589: No atomic section is open (got LocalFile::lockingTransaction) filed
  • Feb 25: T275752: Jobrunner on Buster occasional timeout on codfw file upload filed
  • March 1: Link between file timeouts and buster migration identified (or at least theorized)
  • mid-March: video2commons is unblocked by YouTube, and people attempt to upload large files at a much higher rate than before
  • Late March-early April: One jobrunner is temporarily reimaged back to stretch and appears not to show the same timeout symptoms.
  • April 13: Help:Maximum file size on Commons edited to say that files over 100MB need a server-side upload (discussion ensues on the talk page)
  • mid-April: Remaining servers reimaged to buster, including that one jobrunner.
  • Oct. 11: priority raised to high, with this being the most likely cause of failing uploads
  • ... slow progress in figuring out a reproducible test case
  • Oct. 27: [Wikimedia-l] Upload for large files is broken
  • Oct. 28: A reproducible test case shows that from the same host, the CLI curl command works fine, while PHP via libcurl does not
    • libcurl is transferring data using very small requests, causing lots of round-trips which add up for cross-datacenter requests
    • It's noticed that CLI curl is using HTTP/1 while PHP is using HTTP/2; forcing PHP to use HTTP/1 fixes the issue (a comparison is sketched after this timeline)
    • libcurl 7.62.0 enabled HTTP/2 multiplexing when available; this change arrived as part of the Buster upgrade (libcurl 7.52.1 to 7.64.0)
    • Patches to force SwiftFileBackend's MultiHttpClient to use HTTP/1 are submitted
  • Oct. 29: Patches are deployed; large file uploads begin working again at a much faster speed, in line with expectations
  • Nov. 2: HTTP/2 disabled on nginxes that provide TLS termination in front of Swift
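
The comparison referenced in the Oct. 28 entry above could look roughly like the following standalone PHP diagnostic (placeholder URL and file path; this is not the script actually used during the investigation): run the same large PUT twice, once with libcurl's HTTP/2-over-TLS default and once with HTTP/1.1 forced, and report the negotiated protocol and elapsed time.

  <?php
  // Hypothetical diagnostic: compare the same large PUT over libcurl's
  // HTTP/2-over-TLS default and over forced HTTP/1.1.
  function timedPut(string $url, string $file, int $httpVersion): array {
      $ch = curl_init($url);
      $fp = fopen($file, 'rb');
      curl_setopt_array($ch, [
          CURLOPT_UPLOAD         => true,
          CURLOPT_INFILE         => $fp,
          CURLOPT_INFILESIZE     => filesize($file),
          CURLOPT_RETURNTRANSFER => true,
          CURLOPT_HTTP_VERSION   => $httpVersion,
      ]);
      $start = microtime(true);
      curl_exec($ch);
      $elapsed = microtime(true) - $start;
      // CURLINFO_HTTP_VERSION (PHP >= 7.3) reports the protocol actually
      // used: 2 means HTTP/1.1, 3 means HTTP/2.
      $proto = curl_getinfo($ch, CURLINFO_HTTP_VERSION);
      curl_close($ch);
      fclose($fp);
      return [$proto, $elapsed];
  }

  $url  = 'https://swift.example.internal/v1/AUTH_test/container/object';
  $file = '/tmp/large-upload.bin';

  [$proto, $secs] = timedPut($url, $file, CURL_HTTP_VERSION_2TLS); // libcurl >= 7.62 default
  printf("HTTP/2 if offered: protocol code %d, %.1f seconds\n", $proto, $secs);

  [$proto, $secs] = timedPut($url, $file, CURL_HTTP_VERSION_1_1);  // the fix
  printf("forced HTTP/1.1:   protocol code %d, %.1f seconds\n", $proto, $secs);

Per the timeline, the forced-HTTP/1.1 run behaves like the CLI curl test, which is what pointed at libcurl's protocol default rather than Swift or the network itself.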

Detection

Write how the issue was first detected. Was automated monitoring first to detect it? Or a human reporting an error?

Copy the relevant alerts that fired in this section.

Did the appropriate alert(s) fire? Was the alert volume manageable? Did they point to the problem with as much accuracy as possible?

TODO: If human only, an actionable should probably be to "add alerting".

Conclusions

What weaknesses did we learn about and how can we address them?

What went well?

  • (Use bullet points) for example: automated monitoring detected the incident, outage was root-caused quickly, etc

What went poorly?

  • (Use bullet points) for example: documentation on the affected service was unhelpful, communication difficulties, etc

Where did we get lucky?

  • (Use bullet points) for example: user's error report was exceptionally detailed, incident occurred when the most people were online to assist, etc

How many people were involved in the remediation?

  • (Use bullet points) for example: 2 SREs and 1 software engineer troubleshooting the issue plus 1 incident commander

Links to relevant documentation

Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.

Actionables

Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

  • To do #1 (TODO: Create task)
  • To do #2 (TODO: Create task)

TODO: Add the #Sustainability (Incident Followup) Phabricator tag to these tasks.