Incidents/2021-11-04 large file upload timeouts
document status: in-review
Summary and Metadata
|Incident ID||2021-11-04 large file upload timeouts||UTC Start Timestamp||YYYY-MM-DD hh:mm:ss|
|Incident Task||https://phabricator.wikimedia.org/T299965||UTC End Timestamp||YYYY-MM-DD hh:mm:ss|
|People Paged||<amount of people>||Responder Count||<amount of people>|
|Coordinator(s)||Names - Emails||Relevant Metrics / SLO(s) affected||Relevant metrics, % error budget|
|Summary:||Since the upgrade of MediaWiki appservers to Debian Buster early in 2021, large file uploads (anecdotally, anything over 300MB) had been failing because of timeouts when uploading the file to Swift cross-datacenter (all uploads are sent to both primary datacenters). The cause was determined to be the libcurl upgrade enabling HTTP/2 by default; HTTP/2 is generally slower at these kinds of large transfers than HTTP/1.1 or HTTP/1.0 (see this Cloudflare blog post for a brief explainer). Forcing internal requests from MediaWiki PHP to Swift to use HTTP/1 immediately resolved the issue. We have since also disabled HTTP/2 more generally in the internal Nginx proxy we use for TLS termination in front of Swift. The nginx Puppet code we used here was originally written for public traffic TLS termination (which now uses ATS), which explains why it had HTTP/2 enabled.
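The nginx side of the fix amounts to dropping `http2` from the proxy's listen directive. A minimal sketch of such a server block (hostnames, ports, and paths are illustrative, not the actual Puppet-managed configuration):

```nginx
server {
    # Before the fix: "listen 443 ssl http2;" — HTTP/2 was enabled here only
    # because the template was originally written for public-traffic TLS termination.
    listen 443 ssl;                         # HTTP/1.1 only
    server_name ms-fe.svc.example.wmnet;    # illustrative name

    ssl_certificate     /etc/ssl/example.crt;
    ssl_certificate_key /etc/ssl/example.key;

    location / {
        proxy_pass http://127.0.0.1:8080;   # local Swift frontend
        proxy_http_version 1.1;
    }
}
```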
Timeline
- Feb 11: T274589: "No atomic section is open (got LocalFile::lockingTransaction)" filed
- Feb 25: T275752: Jobrunner on Buster occasional timeout on codfw file upload filed
- March 1: Link between file timeouts and the Buster migration identified (or at least theorized)
- mid-March: video2commons is unblocked by YouTube, and people attempt to upload large files at a much higher rate than before
- Late March to early April: One jobrunner is temporarily reimaged back to Stretch and appears not to show the same timeout symptoms.
- April 13: Help:Maximum file size on Commons edited to say that files over 100MB need a server-side upload (discussion ensues on the talk page)
- mid-April: Remaining servers reimaged to Buster, including that one jobrunner.
- Oct. 11: priority raised to high, with this being the most likely cause of failing uploads
- ... slow progress in figuring out a reproducible test case
- Oct. 27: [Wikimedia-l] Upload for large files is broken
- Oct. 28: A reproducible test case shows that from the same host, the CLI curl command works fine, while PHP via libcurl does not
- libcurl is transferring data using very small requests, causing lots of round-trips which add up for cross-datacenter requests
- It's noticed that CLI curl is using HTTP/1, while PHP is using HTTP/2. Forcing PHP to use HTTP/1 fixes the issue.
- It is identified that:
- libcurl 7.62.0 enabled HTTP/2 multiplexing when available, which arrived as part of the Buster upgrade (7.52.1 to 7.64.0)
- A prerequisite of that change was switching to HTTP/2 by default when available.
- Patches to force SwiftFileBackend's MultiHttpClient to use HTTP/1 are submitted
- Oct. 29: Patches are deployed; large file uploads begin working again, at the expected, much faster speed
- Nov. 2: HTTP/2 disabled on nginx instances that provide TLS termination in front of Swift
- Nov. 8: HTTP/2 disabled on remaining nginx instances that use the tlsproxy::localssl Puppet role.
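The "very small requests causing lots of round-trips" observation from the timeline can be made concrete with a back-of-envelope model. A hypothetical sketch (all numbers are illustrative assumptions, not measurements; the real transfer dynamics involve window growth and are more complex):

```python
def transfer_time(file_size_mb: float, chunk_kb: float, rtt_ms: float) -> float:
    """Worst-case model: each chunk costs one cross-datacenter round-trip,
    i.e. the sender stalls on a small transfer window before continuing."""
    chunks = (file_size_mb * 1024) / chunk_kb
    return chunks * rtt_ms / 1000  # seconds

# Illustrative numbers: a 300 MB file and an assumed ~30 ms cross-DC RTT.
slow = transfer_time(300, chunk_kb=64, rtt_ms=30)    # small per-request writes
fast = transfer_time(300, chunk_kb=4096, rtt_ms=30)  # large buffered writes
print(f"{slow:.0f}s vs {fast:.0f}s")  # prints "144s vs 2s"
```

The point of the model: at a fixed round-trip time, total latency scales with the number of round-trips, so shrinking the per-request size by 64x inflates the transfer time by the same factor.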
Detection
The issue was first detected as a train blocker because the timeouts were causing database query errors on the MediaWiki side. Further manual investigation uncovered the timeouts in MediaWiki/Swift communication.
In this specific case, alerting would not have helped: the issue was noticed almost immediately, but no one actively looked into it for a significant amount of time.
For the general case, having metrics about throughput of uploads to Swift for large files and alerting (non-paging) based on that average might be useful (see actionables).
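Such an alert could be as simple as comparing a rolling average of upload throughput against a floor. A hypothetical sketch (the class name, window size, and 1 MB/s floor are made up for illustration; real samples would come from Swift/MediaWiki metrics):

```python
from collections import deque

class ThroughputAlert:
    """Tracks recent large-upload throughput samples (MB/s) and flags when
    the rolling average drops below a floor. Non-paging by design."""

    def __init__(self, window: int = 20, floor_mb_s: float = 1.0):
        self.samples = deque(maxlen=window)  # keeps only the last `window` samples
        self.floor = floor_mb_s

    def record(self, size_mb: float, seconds: float) -> None:
        self.samples.append(size_mb / seconds)

    def should_alert(self) -> bool:
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data for a meaningful average yet
        return sum(self.samples) / len(self.samples) < self.floor

alert = ThroughputAlert(window=3, floor_mb_s=1.0)
for size_mb, secs in [(300, 900), (500, 1600), (400, 1300)]:  # ~0.3 MB/s each
    alert.record(size_mb, secs)
print(alert.should_alert())  # prints "True": the average is below the floor
```

Averaging over a window rather than alerting on single slow uploads avoids noise from individual users on slow connections.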
What went well?
- Once the underlying issue was identified (use of HTTP/2), we were able to prepare patches right away and deploy them in less than a day.
What went poorly?
- There is no owner for MediaWiki backend multimedia work, leaving this to languish for months. In the course of this investigation, many other bugs were found and filed, but there is no one to work on them.
- The errors emitted to users when uploads fail are vague and not helpful, leading to user frustration and many duplicate issues.
- Users also did not know which component to use, so bugs ended up scattered across MediaWiki-Uploading, MediaWiki-File-management and SRE-swift-storage.
- Chunked uploads use the job queue and were failing there, which is much harder to debug in production because jobs can't be routed to mwdebug* the way web requests can.
Where did we get lucky?
- We were able to reproduce the issue using plain PHP+curl via a script that extracted the useful parts of MultiHttpClient and SwiftFileBackend
- Pasting verbose traces onto Phabricator allowed a volunteer to spot the use of HTTP/2 when it was not intended
How many people were involved in the remediation?
- ~4 SREs
- ~3 software engineers
- Special thanks to Xover who noticed the use of HTTP/2 and the corresponding change in libcurl, plus Inductiveload who helped with reproduction and testing once it was fixed.
Links to relevant documentation
- Swift/How To is marked as outdated and would be more useful if it explained how to use the Python Swift client to upload files to Swift, and if it had designated test users and buckets (or recommended using test.wikipedia.org's buckets).
- As far as I know there is no detailed infographic or even text that explains how the MediaWiki chunked upload process works. Broadly: users upload chunks, the chunks are reassembled on a jobrunner, and the file is eventually published. Some steps result in actual uploads; others use Swift's x-copy-from. Some parts use temporary names; others use the actual filename. Also, everything is uploaded to both DCs, but not in parallel.
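The broad flow described above can be sketched as a toy model (a sketch only: the function names and the 5 MB chunk size are illustrative and do not mirror MediaWiki's actual classes or limits):

```python
import hashlib

CHUNK_SIZE = 5 * 1024 * 1024  # illustrative; the real chunk size differs

def split_into_chunks(data: bytes) -> list:
    """Client side: the user uploads the file piece by piece."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def reassemble(chunks: list) -> bytes:
    """Jobrunner side: chunks are concatenated under a temporary name."""
    return b"".join(chunks)

def publish(stashed: bytes, datacenters=("eqiad", "codfw")) -> dict:
    """Final step: the object is written to both primary DCs, one after
    the other (not in parallel, per the note above)."""
    return {dc: hashlib.sha1(stashed).hexdigest() for dc in datacenters}

data = b"x" * (12 * 1024 * 1024)      # a 12 MB "file"
chunks = split_into_chunks(data)       # 3 chunks: 5 + 5 + 2 MB
stashed = reassemble(chunks)
assert stashed == data                 # reassembly is lossless
copies = publish(stashed)
print(len(chunks), sorted(copies))     # prints: 3 ['codfw', 'eqiad']
```

Even this toy version shows why debugging is hard: the failure in this incident happened inside the `publish`-like step on a jobrunner, several stages removed from the user's original request.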
Actionables
The biggest issue identified here is the need for a maintainer/owner for the backend file upload/management components.
- There is a general need for better metrics collection and logging of the individual actions in SwiftFileBackend.
- T283045: LocalFile::lock() rearchitecting - don't hold a database transaction open for the duration of the upload
- T295008: MediaWiki uploads files to Swift in eqiad and codfw in serial, not parallel
- T295482: Track throughput of large file uploads to Swift
- Log headers in MultiHttpClient
Read more about Incident Scorecard.
|Incident Engagement™ ScoreCard||Score||Notes|
|People||Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt)||0||This is an unusual incident involving many people over 9 months|
|Were the people who responded prepared enough to respond effectively (0/5pt)||0|
|Did fewer than 5 people get paged (0/5pt)?||0||No page|
|Were pages routed to the correct sub-team(s)?||0|
|Were pages routed to online (working hours) engineers (0/5pt)? (score 0 if people were paged after-hours)||0|
|Process||Was the incident status section actively updated during the incident? (0/1pt)||0||Task was updated, but incident docs were created after resolving the issue|
|If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt)||0||Issue was eventually given high priority, but wasn't considered UBN. Troubleshooting status was logged in the task, and incident status was updated after resolution|
|Is there a phabricator task for the incident? (0/1pt)||1||Yes|
|Are the documented action items assigned? (0/1pt)||0||Some are, some are not|
|Is this a repeat of an earlier incident (-1 per prev occurrence)||0|
|Is there an open task that would prevent this incident / make mitigation easier if implemented? (0/-1p per task)||0|
|Tooling||Did the people responding have trouble communicating effectively during the incident due to the existing or lack of tooling? (0/5pt)||5||No tooling issues were reported, although communication was slow due to unclear ownership|
|Did existing monitoring notify the initial responders? (1pt)||0|
|Were all engineering tools required available and in service? (0/5pt)||5||Yes|
|Was there a runbook for all known issues present? (0/5pt)||0|