Incident documentation meeting/QR201407/group1/notes
Revision as of 22:13, 2 April 2020 by (Krinkle moved page Incident documentation/QR201407/group1/notes to Incident documentation meeting/QR201407/group1/notes: rename so that prefixindex/from works without these showing up as sorting "after" 2020)
- migrated to m2 shard, shouldn't have too many load issues in future
- analytics is responsible for responding to alerts that come from analytics
- ops is responsible for generic looking database alerts
- EL can be down or lagging for up to 48 hours (weekends) - "Tier 2" support
- would have been good to have Ariel on the call
- greg to follow up on explicit next steps with Bryan and Reedy
- Add to next group's list
- all green :)
- seems all bases are covered here, any disagreement? :)
- blog work, loop back with RobH re future of that box? HA? etc?
- how far away to get rid of blog?
- still need to create reproducible steps for this to be reported upstream
- still need to manually remove a sick node (on purpose)
- Friday! :)
- need a bug for "Add monitoring for individual job types on single machines."
- Should deploy https://gerrit.wikimedia.org/r/#/c/144612/ before hhvm goes to jobrunners
- MediaWiki failed to stop trying to use the bogged-down machine
- Greg: need to get this diagnosed and tracked
- HHVM's impact here?
- proposal 4 related to Rashomon?
- epically awesome fix: http://ur1.ca/hpjpa ;)
- need the Swift bandwidth
- metrics: https://bugzilla.wikimedia.org/show_bug.cgi?id=67116 Please help :)
- Still have the feature request for scap here