Incident documentation/2021-03-30 Jobqueue overload
document status: draft
An upload of 65 4K video files via the server-side upload process caused high CPU usage and socket timeout errors on the jobqueue hosts, specifically the video scalers. This increased the job backlog and made several MediaWiki-related servers (job queue runners, etc.) unavailable. Two factors appear to have combined to place too much load on the jobqueue infrastructure: the files were 4K (and thus required many different downscales), and they were uploaded from a local server (mwmaint) with a fast connection to the rest of our infrastructure.
Halting the uploads and temporarily splitting the jobqueue into videoscalers and other jobrunners allowed the infrastructure to catch up.
- Document that users should use --sleep to pause between files when running the server-side upload process
- Rate limit the process for uploading large files
- Add rate limiting to the jobqueue videoscalers
- Add alerting for Memcached timeout errors
- Update Runbook wikis for the application and LVS servers
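
The idea behind the first two action items, pausing between files so that transcode jobs are enqueued gradually, can be illustrated with a minimal sketch. This is not the actual upload script: the function name upload_with_pause, the upload_one callback, and the pause_seconds parameter are all hypothetical and exist only for illustration.

```python
import time
from typing import Callable, Iterable, List


def upload_with_pause(
    files: Iterable[str],
    upload_one: Callable[[str], None],
    pause_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Upload files one at a time, pausing between consecutive uploads.

    `upload_one` is a hypothetical callback that performs a single upload.
    `sleep` is injectable so the pacing can be tested without waiting.
    """
    uploaded: List[str] = []
    for index, path in enumerate(files):
        if index > 0:
            # Pause before every upload except the first, so downstream
            # job runners get time to work through the jobs already queued.
            sleep(pause_seconds)
        upload_one(path)
        uploaded.append(path)
    return uploaded
```

With three files and a 30 second pause, the callback runs three times and the loop sleeps twice. A fixed pause is the simplest form of throttling; it does not react to the actual backlog, which is why rate limiting on the videoscaler side is listed as a separate action item.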