Incident documentation/2021-03-30 Jobqueue overload

Revision as of 14:25, 31 March 2021 by imported>RLazarus

document status: draft

Summary

An upload of 65 4K video files via the server-side upload process caused high CPU usage and socket timeout errors on the jobqueue, specifically on the videoscalers. This increased the job backlog and made several MediaWiki-related servers (job queue runners, etc.) unavailable. The likely cause appears to be a combination of two factors: the files were 4K, so each required many different downscales, and they were uploaded from a local server (mwmaint) with a fast connection to the rest of our infrastructure. Together, these placed too much load on the jobqueue infrastructure.

Halting the uploads and temporarily splitting the jobqueue into videoscalers and other jobrunners allowed the infrastructure to catch up.

Actionables