You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Wikimedia Cloud Services team/Clinic duties: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>Andrew Bogott
No edit summary
imported>Andrew Bogott
(New puppet versions seem to say 'Failed to apply catalog' rather than 'Catalog fetch fail')
Line 13: Line 13:
* Triage [https://phabricator.wikimedia.org/maniphest/?project=PHID-PROJ-msyn2z45n7mw45bfuscb&statuses=open()&group=none&order=newest#R|new tasks added to Phabricator in the #cloud-services project]
* Triage [https://phabricator.wikimedia.org/maniphest/?project=PHID-PROJ-msyn2z45n7mw45bfuscb&statuses=open()&group=none&order=newest#R|new tasks added to Phabricator in the #cloud-services project]
* Check for broken puppet on VMs (owners get daily emails from [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/base/files/labs/puppet_alert.py puppetalert.py] but you can contact them if an instance is un-puppetized for a particularly long time):
* Check for broken puppet on VMs (owners get daily emails from [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/base/files/labs/puppet_alert.py puppetalert.py] but you can contact them if an instance is un-puppetized for a particularly long time):
  andrew@cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json  "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep "Catalog fetch fail"
  andrew@cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json  "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep "Failed to apply catalog"
    
    
  andrew@cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json  "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep  -i unknown
  andrew@cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json  "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep  -i unknown

Revision as of 16:25, 5 June 2020

The WMCS team practices a clinic duty rotation that runs from one weekly team meeting to the next. Each team member takes a turn sequentially performing these duties.

🦄 of the week duties

andrew@cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json  "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep "Failed to apply catalog"
 
andrew@cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json  "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep  -i unknown

Maintenance tasks (probably not all weeks)