
Wikimedia Cloud Services team/Clinic duties: Difference between revisions

imported>Andrew Bogott
(New puppet versions seem to say 'Failed to apply catalog' rather than 'Catalog fetch fail')
imported>BryanDavis
(→‎Maintenance tasks (probably not all weeks): Fix link to DNS leak checks)
Line 12: Line 12:
* Check the [https://grafana-labs.wikimedia.org/dashboard/db/tools-basic-alerts tools grafana board] for trends
* Check the [https://grafana-labs.wikimedia.org/dashboard/db/tools-basic-alerts tools grafana board] for trends
* Triage [https://phabricator.wikimedia.org/maniphest/?project=PHID-PROJ-msyn2z45n7mw45bfuscb&statuses=open()&group=none&order=newest#R|new tasks added to Phabricator in the #cloud-services project]
* Triage [https://phabricator.wikimedia.org/maniphest/?project=PHID-PROJ-msyn2z45n7mw45bfuscb&statuses=open()&group=none&order=newest#R|new tasks added to Phabricator in the #cloud-services project]
* Check for broken puppet on VMs (owners get daily emails from [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/base/files/labs/puppet_alert.py puppetalert.py] but you can contact them if an instance is un-puppetized for a particularly long time):
* Check for broken puppet on VMs (owners get daily emails from [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/base/files/labs/puppet_alert.py puppetalert.py] but you can contact them if an instance is un-puppetized for a particularly long time). From a [[Cumin|cumin master]] (e.g. cloud-cumin-01.cloudinfra.eqiad.wmflabs):
  andrew@cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json  "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep "Failed to apply catalog"
  andrew@cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json  "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep "Failed to apply catalog"
    
    
Line 22: Line 22:
* Summarize work reported by the team on the weekly meeting etherpad and add summary to:
* Summarize work reported by the team on the weekly meeting etherpad and add summary to:
** Add outgoing updates to weekly SRE meeting document (put important notes in bold)
** Add outgoing updates to weekly SRE meeting document (put important notes in bold)
** Bryan will also copy this summary to the weekly Tech Managers meeting notes
** Nicholas will also copy this summary to the weekly Tech Managers meeting notes
* Archive the weekly meeting etherpad to https://office.wikimedia.org/wiki/Wikimedia_Cloud_Services/Meeting_Notes/YYYY-MM-DD
* Archive the weekly meeting etherpad to https://office.wikimedia.org/wiki/Wikimedia_Cloud_Services/Meeting_Notes/YYYY-MM-DD


Line 28: Line 28:
* [[Portal:Cloud_VPS/Admin/Managing package upgrades|Check for attended package upgrades]]
* [[Portal:Cloud_VPS/Admin/Managing package upgrades|Check for attended package upgrades]]
* [[Portal:Cloud_VPS/Admin/Troubleshooting#Nova-fullstack|Check for Nova fullstack VM leaks]]
* [[Portal:Cloud_VPS/Admin/Troubleshooting#Nova-fullstack|Check for Nova fullstack VM leaks]]
* [[Portal:Cloud_VPS/Admin/DNS#Detecting_leaked_records|Check for OpenStack resource leaks]]
* [[Portal:Cloud VPS/Admin/DNS/Designate#Detecting leaked records|Check for DNS record leaks]]

Revision as of 17:44, 31 July 2020

The WMCS team practices a clinic duty rotation that runs from one weekly team meeting to the next. Team members take turns performing these duties in sequence.

🦄 of the week duties

andrew@cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep "Failed to apply catalog"

andrew@cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep -i unknown
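
The two commands above run the same cumin check twice and differ only in the grep filter. As a minimal sketch, the two filters could be combined into one pass. The function name `filter_puppet_failures` is hypothetical and not part of any existing tooling. Note that this version matches "Failed to apply catalog" case-insensitively, which is slightly broader than the original first command.

```shell
# Hypothetical helper (not part of cumin or the puppet tooling):
# reads check output on stdin and prints only the lines that mention
# a failed catalog application or an "unknown" state, in any letter case.
filter_puppet_failures() {
    grep -i -E 'failed to apply catalog|unknown'
}

# Intended use on a cumin master, reusing the invocation shown above:
#   sudo cumin --force --timeout 500 -o json "A:all" \
#       "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" \
#       | filter_puppet_failures
```

Like plain grep, the helper exits non-zero when nothing matches, so an empty result with exit status 1 means no host reported either condition.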

Maintenance tasks (probably not all weeks)