Incidents/2021-11-04 large file upload timeouts


document status: in-review

Summary and Metadata

Incident ID: 2021-11-04 large file upload timeouts
Incident Task: https://phabricator.wikimedia.org/T299965
UTC Start Timestamp: YYYY-MM-DD hh:mm:ss
UTC End Timestamp: YYYY-MM-DD hh:mm:ss
People Paged: <amount of people>
Responder Count: <amount of people>
Coordinator(s): Names - Emails
Relevant Metrics / SLO(s) affected: Relevant metrics, % error budget

Summary: Since the upgrade of the MediaWiki appservers to Debian Buster early in 2021, large file uploads (anecdotally anything over 300MB) had been failing because of timeouts when uploading the file to Swift cross-datacenter (all uploads are sent to both primary datacenters). The cause was determined to be the libcurl upgrade enabling HTTP/2 by default, which is generally slower than HTTP/1.1 or HTTP/1.0 for these kinds of transfers (see this Cloudflare blog post for a brief explainer). Forcing internal requests from MediaWiki PHP to Swift to use HTTP/1 immediately resolved the issue. We have since also disabled HTTP/2 more generally in the internal nginx proxy we use for TLS termination in front of Swift. The nginx Puppet code used here was originally written for public-traffic TLS termination (which now uses ATS), which explains why it had HTTP/2 enabled.
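
For illustration only, the MediaWiki-side fix amounts to pinning the curl handles used for Swift traffic to HTTP/1.1 so that libcurl never negotiates HTTP/2. The sketch below shows that idea with plain PHP curl options; it is not the actual MultiHttpClient patch, and the Swift hostname, container, and auth token are placeholders.

    <?php
    // Minimal sketch: stream a large file to Swift over HTTP/1.1. Forcing the
    // protocol version is the key line; everything else is a plain PUT.
    // The hostname, container, and token are placeholders, not production values.
    $path = '/tmp/large-file.webm';
    $src  = fopen( $path, 'rb' );
    $ch   = curl_init( 'https://swift-frontend.example.wmnet/v1/AUTH_mw/some-container/large-file.webm' );
    curl_setopt_array( $ch, [
        CURLOPT_UPLOAD         => true,                  // issue a PUT with a streamed body
        CURLOPT_INFILE         => $src,
        CURLOPT_INFILESIZE     => filesize( $path ),
        CURLOPT_HTTPHEADER     => [ 'X-Auth-Token: PLACEHOLDER' ],
        CURLOPT_HTTP_VERSION   => CURL_HTTP_VERSION_1_1, // the fix: never negotiate HTTP/2
        CURLOPT_RETURNTRANSFER => true,
    ] );
    curl_exec( $ch );
    echo curl_getinfo( $ch, CURLINFO_RESPONSE_CODE ), "\n";
    curl_close( $ch );
    fclose( $src );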


Impact: For 9 months, editors were entirely unable to upload large files (e.g. on Wikimedia Commons). The UploadWizard feature and other upload tools displayed generic error messages about this failure, typically after a timeout was reached. Editors reportedly gave up, sacrificed image, video, and document quality (which provided a poorer experience to readers), or resorted to manual "server-side upload" requests (which demanded additional sysadmin time during these months).

Timeline

  • Feb 11: T274589: No atomic section is open (got LocalFile::lockingTransaction) filed
  • Feb 25: T275752: Jobrunner on Buster occasional timeout on codfw file upload filed
  • March 1: Link between the file-upload timeouts and the Buster migration identified (or at least theorized)
  • mid-March: video2commons is unblocked by YouTube, and people attempt to upload large files at a much higher rate than before
  • Late March-early April: One jobrunner is temporarily reimaged back to Stretch and appears not to show the same timeout symptoms.
  • April 13: Help:Maximum file size on Commons edited to say that files over 100MB need a server-side upload (discussion ensues on the talk page)
  • mid-April: Remaining servers reimaged to Buster, including that one jobrunner.
  • Oct. 11: Priority raised to high, with these timeouts being the most likely cause of failing uploads
  • ... slow progress in figuring out a reproducible test case
  • Oct. 27: [Wikimedia-l] Upload for large files is broken
  • Oct. 28: A reproducible test case shows that from the same host, the CLI curl command works fine, while PHP via libcurl does not
    • libcurl is transferring data using very small requests, causing lots of round-trips which add up for cross-datacenter requests
    • It's noticed that CLI curl is using HTTP/1, while PHP is using HTTP/2. Forcing PHP to use HTTP/1 fixes the issue (see the protocol-check sketch after this timeline).
    • It is identified that:
      • libcurl 7.62.0 enabled HTTP/2 multiplexing when available, a change that arrived as part of the Buster upgrade (libcurl 7.52.1 to 7.64.0)
      • A prerequisite of that change was also switching the default to HTTP/2 when available.
    • Patches to force SwiftFileBackend's MultiHttpClient to use HTTP/1 are submitted
  • Oct. 29: Patches are deployed; large file uploads begin working again, at the expected (much faster) speed
  • Nov. 2: HTTP/2 disabled on nginx instances that provide TLS termination in front of Swift
  • Nov. 8: HTTP/2 disabled on remaining nginx instances that use the tlsproxy::localssl Puppet role.
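
To make the Oct. 28 comparison concrete, the sketch below shows how the protocol that libcurl actually negotiated can be read back from a PHP curl handle (CURLINFO_HTTP_VERSION requires PHP 7.3+ and curl 7.50+). It is an illustrative check against a placeholder URL, not the script used during the investigation.

    <?php
    // Sketch: report which HTTP version libcurl negotiated for a request. This is
    // the signal that distinguished CLI curl (HTTP/1.1) from PHP (HTTP/2).
    // The URL below is a placeholder.
    function negotiatedHttpVersion( string $url ): string {
        $ch = curl_init( $url );
        curl_setopt_array( $ch, [
            CURLOPT_NOBODY         => true, // a HEAD-style probe is enough to trigger ALPN
            CURLOPT_RETURNTRANSFER => true,
        ] );
        curl_exec( $ch );
        $ver = curl_getinfo( $ch, CURLINFO_HTTP_VERSION ); // PHP 7.3+ / curl 7.50+
        curl_close( $ch );
        switch ( $ver ) {
            case CURL_HTTP_VERSION_1_0: return 'HTTP/1.0';
            case CURL_HTTP_VERSION_1_1: return 'HTTP/1.1';
            case CURL_HTTP_VERSION_2_0: return 'HTTP/2';
            default:                    return "unknown ($ver)";
        }
    }

    echo negotiatedHttpVersion( 'https://swift-frontend.example.wmnet/healthcheck' ), "\n";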

Detection

The issue was first detected as a train blocker because the timeouts were causing database query errors on the MediaWiki side. Further manual investigation discovered the timeouts in MediaWiki/Swift communication.

In this specific case, alerting would not have helped: the issue was noticed almost immediately, but no one actively looked into it for a significant amount of time.

For the general case, collecting metrics on the throughput of large-file uploads to Swift, and alerting (non-paging) when the average degrades, might be useful (see actionables).
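
As a rough sketch of what such a measurement could look like (the metric name and the 100MB threshold below are made up for illustration, and production code would report through MediaWiki's own stats pipeline rather than the error log), the average upload speed can be read straight off the curl handle after a transfer:

    <?php
    // Sketch: after an upload to Swift finishes, record the average upload speed
    // that libcurl observed so it can be graphed and alerted on.
    // The metric name and size threshold are hypothetical.
    function reportUploadThroughput( $ch, int $sizeInBytes ): void {
        $bytesPerSecond = (float)curl_getinfo( $ch, CURLINFO_SPEED_UPLOAD ); // average over the transfer
        if ( $sizeInBytes > 100 * 1024 * 1024 ) { // only large uploads are interesting here
            error_log( sprintf( 'swift.upload.large_file.bytes_per_second %.0f', $bytesPerSecond ) );
        }
    }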

Conclusions

What went well?

  • Once the underlying issue was identified (use of HTTP/2), we were able to prepare patches right away and deploy them in less than a day.

What went poorly?

  • There is no owner for MediaWiki backend multimedia work, leaving this to languish for months. In the course of this investigation, many other bugs were found and filed, but there is no one to work on them.
  • The errors emitted to users when uploads fail are vague and not helpful, leading to user frustration and many duplicate issues.
    • Users also did not know which component to use, so bugs ended up scattered across MediaWiki-Uploading, MediaWiki-File-management and SRE-swift-storage.
  • Chunked uploads use the job queue, and that is where they were failing; the job queue is much harder to debug in production because jobs can't be routed to mwdebug* hosts the way web requests can.

Where did we get lucky?

  • We were able to reproduce the issue using plain PHP+curl via a script that extracted the useful parts of MultiHttpClient and SwiftFileBackend.
  • Pasting verbose traces onto Phabricator allowed a volunteer to spot the use of HTTP/2 when it was not intended (a minimal version of such a trace capture is sketched below).
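
Those verbose traces can be produced with stock curl options from plain PHP. The sketch below (placeholder URL, no authentication) writes libcurl's negotiation chatter, including the ALPN and HTTP-version lines that gave the problem away, to a file that can be pasted into a task.

    <?php
    // Sketch: capture libcurl's verbose trace for one request so the negotiated
    // protocol can be inspected and shared. The URL is a placeholder.
    $trace = fopen( '/tmp/curl-trace.log', 'w' );
    $ch = curl_init( 'https://swift-frontend.example.wmnet/healthcheck' );
    curl_setopt_array( $ch, [
        CURLOPT_VERBOSE        => true,   // log the TLS/ALPN handshake and request framing
        CURLOPT_STDERR         => $trace, // write the verbose output to a file instead of stderr
        CURLOPT_RETURNTRANSFER => true,
    ] );
    curl_exec( $ch );
    curl_close( $ch );
    fclose( $trace );
    // /tmp/curl-trace.log now shows which protocol was negotiated for the request.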

How many people were involved in the remediation?

  • ~4 SREs
  • ~3 software engineers
  • Special thanks to Xover who noticed the use of HTTP/2 and the corresponding change in libcurl, plus Inductiveload who helped with reproduction and testing once it was fixed.

Links to relevant documentation

  • Swift/How To is marked as outdated. It would be more useful if it explained how to use the Python Swift client to upload files to Swift, and if it listed designated test users and buckets (or recommended using test.wikipedia.org's buckets).
  • As far as I know there is no detailed infographic, or even text, that explains how the MediaWiki chunked upload process works. Broadly: users upload chunks, the chunks get reassembled on a jobrunner, and the result is eventually published. Some steps result in actual uploads, others use Swift's x-copy-from (a sketch of such a copy request follows this list). Some parts use temporary names, others the actual filename. Also, everything is uploaded to both DCs, but not in parallel.
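
For the x-copy-from part specifically, a server-side copy in Swift is a zero-length PUT to the destination object with an X-Copy-From header, so no file bytes pass through MediaWiki again. The sketch below is illustrative only; the hostname, container and object names, and the token are placeholders, and the real code path goes through SwiftFileBackend rather than raw curl.

    <?php
    // Sketch: ask Swift to copy an object server-side. The destination is PUT with
    // an empty body and an X-Copy-From header pointing at the source object.
    // Hostname, containers, object names, and token are all placeholders.
    $ch = curl_init( 'https://swift-frontend.example.wmnet/v1/AUTH_mw/some-public-container/Final_name.webm' );
    curl_setopt_array( $ch, [
        CURLOPT_CUSTOMREQUEST  => 'PUT',
        CURLOPT_HTTPHEADER     => [
            'X-Auth-Token: PLACEHOLDER',
            'X-Copy-From: /some-temp-container/Temporary_name.webm', // source container/object
            'Content-Length: 0',                                     // no body is re-uploaded
        ],
        CURLOPT_RETURNTRANSFER => true,
    ] );
    curl_exec( $ch );
    echo curl_getinfo( $ch, CURLINFO_RESPONSE_CODE ), "\n";
    curl_close( $ch );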

Actionables

The biggest issue identified here is the need for a maintainer/owner for the backend file upload and file management components.

  • There is a general need for better metrics collection and logging of the individual actions in SwiftFileBackend.

Scorecard

Read more about Incident Scorecard.

Incident Engagement™ ScoreCard

People
  • Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt): 0. This was an unusual incident involving many people over 9 months.
  • Were the people who responded prepared enough to respond effectively? (0/5pt): 0
  • Did fewer than 5 people get paged? (0/5pt): 0. No page was sent.
  • Were pages routed to the correct sub-team(s)?: 0
  • Were pages routed to online (working hours) engineers? (0/5pt; score 0 if people were paged after-hours): 0

Process
  • Was the incident status section actively updated during the incident? (0/1pt): 0. The task was updated, but the incident documentation was created after the issue was resolved.
  • If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt): 0. The issue was eventually given high priority but was not considered UBN; troubleshooting status was logged in the task, and the incident status was updated after resolution.
  • Is there a Phabricator task for the incident? (0/1pt): 1. Yes.
  • Are the documented action items assigned? (0/1pt): 0. Some are, some are not.
  • Is this a repeat of an earlier incident? (-1 per previous occurrence): 0
  • Is there an open task that would prevent this incident / make mitigation easier if implemented? (0/-1pt per task): 0

Tooling
  • Did the people responding have trouble communicating effectively during the incident due to existing or missing tooling? (0/5pt): 5. No tooling issues were reported, although communication was slow due to unclear ownership.
  • Did existing monitoring notify the initial responders? (1pt): 0
  • Were all required engineering tools available and in service? (0/5pt): 5. Yes.
  • Was there a runbook for all known issues present? (0/5pt): 0

Total score: 11