
Incident documentation/2020-09-09 mobileapps config change

Revision as of 19:09, 31 March 2021 by imported>Krinkle (Krinkle moved page Incident documentation/20200909-mobileapps config change to Incident documentation/2020-09-09 mobileapps config change)

document status: draft

Summary

While reconfiguring all services to make remote procedure calls through our service proxy middleware, yours truly deployed a faulty configuration for mobileapps at 08:40. This caused mobileapps to generate mobile-html content with broken CSS and JS links for pages regenerated during the day.

The issue was reported at 16:40 and the change was quickly reverted. Clearing all the caching layers (RESTBase, edge caches) then took a few more hours. All affected pages were purged by 20:20.

Actionables

  • The biggest actionable is of course to always wait for validation from service owners before merging a patch; the whole outage would have been avoided had that been done. Anything else listed here is purely a second-order actionable.
  • While this deployment was the result of bad judgement, SRE need to be able to deploy a configuration change with confidence. The fact that the mobileapps spec tests all passed in staging lulled SRE into a false sense of security. The OpenAPI spec should be extended to include a test for the CSS and JS URLs mentioned above. (TODO: create task)
  • We need staging to become a functional environment where we can test more than just a swagger spec test. Linking it to restbase-dev and making it possible to compare the results of URLs with production might help. (TODO: create task)
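
As a rough illustration of the kind of check the last two actionables describe, the sketch below extracts stylesheet and script URLs from a mobile-html response and flags any whose host is not on an expected list. It is not the mobileapps test suite. The function names, the allowed-host list and the example URLs are all hypothetical, and it assumes that "broken" links would show up as assets pointing at an unexpected host, which may not match what actually happened in this incident.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AssetLinkParser(HTMLParser):
    """Collect stylesheet and script URLs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" in rel and attrs.get("href"):
            self.assets.append(attrs["href"])
        elif tag == "script" and attrs.get("src"):
            self.assets.append(attrs["src"])


def extract_asset_urls(html, page_url):
    """Return absolute CSS/JS URLs referenced by the page."""
    parser = AssetLinkParser()
    parser.feed(html)
    # Resolve relative and protocol-relative URLs against the page URL.
    return [urljoin(page_url, url) for url in parser.assets]


def find_suspect_assets(html, page_url, allowed_hosts):
    """Return asset URLs whose host is not in allowed_hosts (hypothetical rule)."""
    return [
        url
        for url in extract_asset_urls(html, page_url)
        if urlparse(url).hostname not in allowed_hosts
    ]


# Hypothetical sample: one asset on an expected host, one on an internal one.
SAMPLE_HTML = (
    '<html><head>'
    '<link rel="stylesheet" href="//assets.example.org/css/base">'
    '<script src="http://localhost:6012/js/pcs"></script>'
    '</head><body></body></html>'
)
SAMPLE_PAGE_URL = "https://wiki.example.org/page/mobile-html/Cat"

suspects = find_suspect_assets(SAMPLE_HTML, SAMPLE_PAGE_URL, {"assets.example.org"})
```

Running the same extraction against a staging response and a production response for the same page, then comparing the two lists, would be one way to approximate the comparison suggested above.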

TODO: Add the #Sustainability (Incident Followup) Phabricator tag to these tasks.