You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Incidents/2021-11-04 large file upload timeouts: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>Krinkle
No edit summary
imported>Herron
(Updating to reflect scorecard amendments)
 
Line 84: Line 84:
Read more about [[Incident Scorecard]].
Read more about [[Incident Scorecard]].
{| class="wikitable"
{| class="wikitable"
| colspan="2" |'''Incident Engagement™  ScoreCard'''
!
|'''Score'''
!Question
|'''Notes'''
! Score
!Notes
|-
|-
| rowspan="5" |'''People'''
! rowspan="5" | People
|Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt)
|Were the people responding to this incident sufficiently different than the previous five incidents? (score 1 for yes, 0 for no)
|0
|0
|This is an unusual incident involving many people over 9 months
|This is an unusual incident involving many people over 9 months
|-
|-
|Were the people who responded prepared enough to respond effectively (0/5pt)
|Were the people who responded prepared enough to respond effectively (score 1 for yes, 0 for no)
|0
|0
|
|
|-
|-
|Did fewer than 5 people get paged (0/5pt)?
|Were more than 5 people paged? (score 0 for yes, 1 for no)
|0
|0
|No page
|No page
|-
|-
|Were pages routed to the correct sub-team(s)?
|Were pages routed to the correct sub-team(s)? (score 1 for yes, 0 for no)
|0
|0
|
|
|-
|-
|Were pages routed to online (working hours) engineers (0/5pt)? (score 0 if people were paged after-hours)  
|Were pages routed to online (business hours) engineers? (score 1 for yes,  0 if people were paged after business hours)
|0
|0
|
|
|-
|-
| rowspan="6" |'''Process'''
! rowspan="5" |Process
|Was the incident status section actively updated during the incident? (0/1pt)
|Was the incident status section actively updated during the incident? (score 1 for yes, 0 for no)
|0
|0
|Task was updated, but incident docs were created after resolving the issue
|Task was updated, but incident docs were created after resolving the issue
|-
|-
|If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt)
|Was the public status page updated? (score 1 for yes, 0 for no)
|0
|0
|Issue was eventually given high priority, but wasn't considered UBN. troubleshooting status was logged in task, and incident status was updated after resolution
|Issue was eventually given high priority, but wasn't considered UBN. troubleshooting status was logged in task, and incident status was updated after resolution
|-
|-
|Is there a phabricator task for the incident? (0/1pt)
|Is there a phabricator task for the incident? (score 1 for yes, 0 for no)
|1
|1
|Yes
|Yes
|-
|-
|Are the documented action items assigned?  (0/1pt)
|Are the documented action items assigned?  (score 1 for yes, 0 for no)
|0
|0
|Some, some are not
|Some, some are not
|-
|-
|Is this a repeat of an earlier incident (-1 per prev occurrence)
| Is this a repeat of an earlier incident (score 0 for yes, 1 for no)  
|0
|0
|
|
|-
|-
|Is there an open task that would prevent this incident / make mitigation easier if implemented? (0/-1p per task)
! rowspan="5" |Tooling
|Was there, before the incident occurred, open tasks that would prevent this incident / make mitigation easier if implemented? (score 0 for yes, 1 for no)  
|0
|0
|
|
|-
|-
| rowspan="4" |'''Tooling'''
|Were the people responding able to communicate effectively during the incident with the existing tooling? (score 1 for yes, 0 or no)
|Did the people responding have trouble communicating effectively during the incident due to the existing or lack of tooling? (0/5pt)
|1
|5
|No tooling issues were reported, although communication was slow due to unclear ownership
|No tooling issues were reported, although communication was slow due to unclear ownership
|-
|-
|Did existing monitoring notify the initial responders? (1pt)
|Did existing monitoring notify the initial responders? (score 1 for yes, 0 for no)
|0
|0
|
|
|-
|-
|Were all engineering tools required available and in service? (0/5pt)
|Were all engineering tools required available and in service? (score 1 for yes, 0 for no)
|5
|1
|Yes
|Yes
|-
|-
|Was there a runbook for all known issues present? (0/5pt)
| Was there a runbook for all known issues present? (score 1 for yes, 0 for no)
|0
|0
|
|
|-
|-
| colspan="2" |'''Total Score'''
! colspan="2" align="right" | Total score
|11
|3
|
|
|}
|}

Latest revision as of 19:16, 26 April 2022

document status: in-review

Summary

Incident metadata (see Incident Scorecard)
Incident ID 2021-11-04 large file upload timeouts Start YYYY-MM-DD hh:mm:ss
Task T299965 End YYYY-MM-DD hh:mm:ss
People paged Responder count
Coordinators Affected metrics/SLOs
Impact For 9 months, editors were entirely unable to upload large files (e.g. on Wikimedia Commons). The UploadWizard feature, and other upload tools, displayed generic error messages about this failure, typically after a timeout was reached. Editors reportedly gave up, or sacrificed image, video, and document quality (which provides a poorer user experience to readers), or resorted to manual "server-side upload" requests (which demanded additional sysadmin time during these months).

Since the upgrade of MediaWiki appservers to Debian Buster early in 2021, large file uploads (anecdotally anything over 300MB) have been failing because of timeouts when uploading the file to Swift cross-datacenter (all uploads are sent to both primary datacenters). The cause was determined to be the libcurl upgrade enabling HTTP/2 by default, which is generally slower at these kinds of transfers than HTTP/1.1 or HTTP/1.0 is (see this Cloudflare blog post for a brief explainer). Forcing internal requests from MediaWiki PHP to Swift to use HTTP/1 immediately resolved the issue. We have since also disabled HTTP/2 more generally in the internal Nginx proxy we use for TLS termination to Swift. The nginx puppet code we used here was originally written for public traffic TLS termination (this now uses ATS), which explains why it had HTTP/2 enabled.

Timeline

  • Feb 11: T274589: No atomic section is open (got LocalFile::lockingTransaction) filed
  • Feb 25: T275752: Jobrunner on Buster occasional timeout on codfw file upload filed
  • March 1: Link between file timeouts and buster migration identified (or at least theorized)
  • mid-March: video2commons is unblocked by YouTube, and people attempt to upload large files at a much higher rate than before
  • Late March-early April: One jobrunner is temporarily reimaged back to stretch and it appears to not show the same timeout symptoms.
  • April 13: Help:Maximum file size on Commons edited to say that files over 100MB need a server-side upload (discussion ensues on the talk page)
  • mid-April: Remaining servers reimaged to buster, including that one jobrunner.
  • Oct. 11: priority raised to high, with this being the most likely cause of failing uploads
  • ... slow progress in figuring out a reproducible test case
  • Oct. 27: [Wikimedia-l] Upload for large files is broken
  • Oct. 28: A reproducible test case shows that from the same host, the CLI curl command works fine, while PHP via libcurl does not
    • libcurl is transferring data using very small requests, causing lots of round-trips which add up for cross-datacenter requests
    • It's noticed that CLI curl is using HTTP/1, while PHP is using HTTP/2. Forcing PHP to use HTTP/1 fixes the issue.
    • It is identified that:
      • libcurl 7.62.0 enabled HTTP/2 multiplexing when available, which arrived as part of the Buster upgrade (7.52.1 to 7.64.0)
      • A prerequisite of above was to also switch by default to HTTP/2 when available.
    • Patches to force SwiftFileBackend's MultiHttpClient to use HTTP/1 are submitted
  • Oct. 29: Patches are deployed, large file uploads being working again, at a much faster/expected speed
  • Nov. 2: HTTP/2 disabled on nginx instances that provide TLS termination in front of Swift
  • Nov. 8: HTTP/2 disabled on remaining nginx instances that use the tlsproxy::localssl Puppet role.

Detection

The issue was first detected as a train blocker because the timeouts were causing database query errors on the MediaWiki side. Further manual investigation discovered the timeouts in MediaWiki/Swift communication.

In this specific case, alerting would not have helped given that the issue was noticed almost immediately, there was just no one actively looking into it for a significant amount of time.

For the general case, having metrics about throughput of uploads to Swift for large files and alerting (non-paging) based on that average might be useful (see actionables).

Conclusions

What went well?

  • Once the underlying issue was identified (use of HTTP/2), we were able to prepare patches right away and deploy them in less than a day.

What went poorly?

  • There is no owner for MediaWiki backend multimedia work, leaving this to languish for months. In the course of this investigation, many other bugs were found and filed, but there is no one to work on them.
  • The errors emitted to users when uploads fail are vague and not helpful, leading to user frustration and many duplicate issues.
    • Users also did not know which component to use, so bugs ended up scattered across MediaWiki-Uploading, MediaWiki-File-management and SRE-swift-storage.
  • Chunked uploads use and were failing during the job queue, which is much harder to debug in production because it can't be routed to mwdebug* like web requests can.

Where did we get lucky?

  • We were able to reproduce the issue using plain PHP+curl via a script that extracted the useful parts of MultiHttpClient and SwiftFileBackend
  • Pasting verbose traces onto Phabricator allowed a volunteer to spot the use of HTTP/2 when it was not intended

How many people were involved in the remediation?

  • ~4 SREs
  • ~3 software engineers
  • Special thanks to Xover who noticed the use of HTTP/2 and the corresponding change in libcurl, plus Inductiveload who helped with reproduction and testing once it was fixed.

Links to relevant documentation

  • Swift/How To is marked as outdated and would be more useful if it explained how to use the Python Swift client to upload files to Swift, as well as had designated test users and buckets. Or maybe recommend to use test.wikipedia.org's buckets.
  • As far as I know there is no detailed infographic or even text that explains how the MediaWiki chunked upload process works. Broadly users upload chunks, they get reassembled on a job runner, and then eventually published. Some steps result in actual uploads, others use Swift's x-copy-from. Some parts use temporary names, others use the actual filename. Also everything is uploaded to both DCs, but not in parallel.

Actionables

The biggest issue identified here is needing a maintainer/owner for backend file upload/management related components.

  • There is a general need for better metrics collection and logging of the individual actions in SwiftFileBackend.

Scorecard

Read more about Incident Scorecard.

Question Score Notes
People Were the people responding to this incident sufficiently different than the previous five incidents? (score 1 for yes, 0 for no) 0 This is an unusual incident involving many people over 9 months
Were the people who responded prepared enough to respond effectively (score 1 for yes, 0 for no) 0
Were more than 5 people paged? (score 0 for yes, 1 for no) 0 No page
Were pages routed to the correct sub-team(s)? (score 1 for yes, 0 for no) 0
Were pages routed to online (business hours) engineers? (score 1 for yes,  0 if people were paged after business hours) 0
Process Was the incident status section actively updated during the incident? (score 1 for yes, 0 for no) 0 Task was updated, but incident docs were created after resolving the issue
Was the public status page updated? (score 1 for yes, 0 for no) 0 Issue was eventually given high priority, but wasn't considered UBN. troubleshooting status was logged in task, and incident status was updated after resolution
Is there a phabricator task for the incident? (score 1 for yes, 0 for no) 1 Yes
Are the documented action items assigned?  (score 1 for yes, 0 for no) 0 Some, some are not
Is this a repeat of an earlier incident (score 0 for yes, 1 for no) 0
Tooling Was there, before the incident occurred, open tasks that would prevent this incident / make mitigation easier if implemented? (score 0 for yes, 1 for no) 0
Were the people responding able to communicate effectively during the incident with the existing tooling? (score 1 for yes, 0 or no) 1 No tooling issues were reported, although communication was slow due to unclear ownership
Did existing monitoring notify the initial responders? (score 1 for yes, 0 for no) 0
Were all engineering tools required available and in service? (score 1 for yes, 0 for no) 1 Yes
Was there a runbook for all known issues present? (score 1 for yes, 0 for no) 0
Total score 3