
Incident documentation/20160809-MediaWiki

Revision as of 20:14, 10 August 2016 by imported>BryanDavis (→‎Conclusions)

TLDR: Migrating 2 extensions to wfLoadExtension() caused problems, and Logstash wasn't displaying the resulting errors.

Timeline

Previous days

  • In a massive effort by many people, lots of extensions were converted to extension.json, including
  • These changes were not compatible with our current production configuration, so they had to be accompanied by mediawiki-config changes and should probably have been deployed separately to minimize the chance of a screwup.
  • Furthermore, even cursory testing of the above Timeline change would have shown that it was broken.

August 9

  • After 12:00 SF time, Mukunda deploys the train to stage 0 wikis.
  • At 16:00 Max prepares for SWAT but sees errors in fatalmonitor and investigates:
    • Creating default object from empty value in /srv/mediawiki/wmf-config/CommonSettings.php on line 686
    • Undefined variable: wgContactConfig in /srv/mediawiki/wmf-config/CommonSettings.php on line 968
  • Max sees no such errors in Logstash.
  • After identifying the cause, Max starts reverting the affected extensions. However, there were many intermediate commits and Reedy was already committing fixes, so Max proceeds with deploying the fixes instead.
  • The fixes produced more problems. Max contemplates reverting group0 back to wmf.13 but decides against it because he has never done that before and fixes kept coming. In hindsight, this was a mistake.
  • Config fixes to accommodate wmf.14 started causing notices in wmf.13, so Max resets the wmf.13 Timeline to wmf.14.
  • Errors indicating more breakages in Timeline prompt another batch of fixes.
  • At 17:42, everything is back to normal.

Casualties

  • Evening SWAT didn't happen.
  • For about 10 minutes, new timeline generation on production wikis was broken.

Conclusions

  • Our code review practices are lax, including merging hairy patches without testing and self-merges.
  • Timeline has 0 (zero) tests, while just a single parser test would have allowed the problems to be detected during code review.
  • Logstash fatalmonitor dashboard isn't displaying HHVM warnings/errors right now.
    Done: The dashboard was set up to filter out all NOTICE, INFO, and WARNING messages. It has been updated to exclude those event levels only when the event type is "mediawiki". This has restored the display of HHVM warnings. BryanDavis (talk) 20:07, 10 August 2016 (UTC)
  • And Logstash is used by scap to verify error levels, rendering this check useless.
  • Logstash/Kibana is probably too complex a beast to be trusted to be the definitive source of MediaWiki health information, fatalmonitor is still more reliable. Invest time in improving it and merging with exceptionmonitor?
  • During an ongoing outage with logs full of noise, testing on canary servers is hard because non-fatal errors are easy to miss on fluorine. Deployers need access to HHVM logs on all appservers.
    The hhvm error log should be available in /var/log/hhvm/error.log on all MW servers. This file is readable by the www-data group which all deployers can sudo to: sudo -u www-data tail -f /var/log/hhvm/error.log. The logs are also aggregated via rsyslog+udp2log on fluorine as /a/mw-log/hhvm.log. Maybe we need better documentation and/or a helper script on the deploy servers to make tailing these logs on some random MW server easier? BryanDavis (talk) 20:13, 10 August 2016 (UTC)
  • Beta cluster isn't serving its purpose of being the first line of defense against bugs (other than "oh, whole thing is down"). Errors in beta should be watched as closely as in prod and should be treated with the same level of seriousness, because otherwise the former will eventually turn into the latter.
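
The dashboard filter change described in the reply about HHVM warnings can be sketched as a predicate over log events. This is only an illustration in Python, not the actual Kibana dashboard query: the field names `type` and `level` and the function names are assumptions made for the example.

```python
# Levels that the dashboard hides, as listed in the reply above.
HIDDEN_LEVELS = {"NOTICE", "INFO", "WARNING"}


def shown_before_fix(event):
    """Old behaviour: hide NOTICE/INFO/WARNING regardless of event type."""
    level = str(event.get("level", "")).upper()  # "level" is an assumed field name
    return level not in HIDDEN_LEVELS


def shown_after_fix(event):
    """New behaviour: hide those levels only for events of type "mediawiki"."""
    level = str(event.get("level", "")).upper()
    is_mediawiki = event.get("type") == "mediawiki"  # "type" is an assumed field name
    return not (is_mediawiki and level in HIDDEN_LEVELS)


# Hypothetical event resembling the HHVM warnings seen in fatalmonitor.
hhvm_warning = {"type": "hhvm", "level": "WARNING"}
print(shown_before_fix(hhvm_warning), shown_after_fix(hhvm_warning))
```

Under the old predicate the HHVM warning is dropped along with MediaWiki noise; under the new one it is displayed, while low-severity "mediawiki" events stay hidden.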
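
As one possible shape for the helper script suggested in the reply about HHVM logs, the sketch below builds an ssh command line that tails the HHVM error log on a given MW server. Only the log path and the `sudo -u www-data tail -f` invocation come from that reply; the function name, the use of plain `ssh`, and any host name passed in are assumptions, and no such script is known to exist on the deploy servers.

```python
import shlex
from typing import List

# Path taken from the reply above.
HHVM_ERROR_LOG = "/var/log/hhvm/error.log"


def tail_hhvm_command(host: str, follow: bool = True) -> List[str]:
    """Build an argv list that tails the HHVM error log on `host` over ssh.

    Hypothetical helper: assumes the deployer can ssh to the host and
    sudo to www-data there, as described in the reply.
    """
    remote = ["sudo", "-u", "www-data", "tail"]
    if follow:
        remote.append("-f")
    remote.append(HHVM_ERROR_LOG)
    # ssh takes the remote command as a single string; quote it safely.
    return ["ssh", host, shlex.join(remote)]


# To run it (host name is a placeholder, not a real server):
#   import subprocess
#   subprocess.run(tail_hhvm_command("mw-host.example"))
```

Returning an argv list rather than a shell string keeps the local invocation free of shell quoting issues; only the remote part is joined into one string, because that is how ssh passes it on.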