Wikimedia Cloud Services team/Clinic duties: Difference between revisions
Revision as of 16:59, 18 May 2022
The WMCS team practices a clinic duty rotation that runs from one weekly team meeting to the next. Each team member takes a turn performing these duties in sequence. Your shift begins after the weekly meeting and ends with the next one.
Start of clinic duty
- Change the oncall entry in the topic of #wikimedia-cloud-admin to yourself
- Archive the weekly meeting etherpad to https://office.wikimedia.org/wiki/Wikimedia_Cloud_Services/Meeting_Notes/YYYY-MM-DD
🦄 of the week duties
Phabricator
- Complete tasks under clinic duty on Phabricator board
- Help triage new and incoming tasks on Phabricator
Monitoring
- Monitor alertmanager, vps alertmanager and icinga for alerts, especially WMCS hardware.
- Watch for wmcs-related cronspam and fix the causes when possible
- Check the tools grafana board for trends (For gridengine monitoring, https://sge-status.toolforge.org/ is also helpful)
- Check for broken puppet on VMs (owners get daily emails from puppetalert.py but you can contact them if an instance is un-puppetized for a particularly long time). From a cumin master:
cloud-cumin-03:~$ sudo cumin --force --timeout 500 -o json "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep "Failed to apply catalog"
cloud-cumin-03:~$ sudo cumin --force --timeout 500 -o json "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep -i unknown
- Monitor Vps alertmanager
- Platform Health, NFS Storage Utilization, Openstack, Ceph Cluster, WMCS Grafana Dashboards
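The two cumin commands in the puppet check above run the same fleet-wide check twice with different greps. A minimal sketch of doing it in one pass follows. The function names `filter_puppet_problems` and `check_puppet_fleet` are hypothetical and not part of the runbook; the cumin invocation is copied unchanged from the commands above, and the filter keeps the first pattern case-sensitive and the second case-insensitive, as in the original greps.

```shell
# Hypothetical helper (not part of the runbook): reads check output on stdin
# and prints only the lines that either of the two original greps would match.
filter_puppet_problems() {
    # "Failed to apply catalog" is matched case-sensitively (plain grep in the
    # original); "unknown" is matched case-insensitively (grep -i in the original).
    awk '/Failed to apply catalog/ || tolower($0) ~ /unknown/'
}

# Assumed usage on a cumin master: run the check once and filter the output.
check_puppet_fleet() {
    sudo cumin --force --timeout 500 -o json "A:all" \
        "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" \
        | filter_puppet_problems
}
```

Running `check_puppet_fleet` once should surface the same lines as the two separate commands, at the cost of no longer seeing the two failure classes in separate listings.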
Community
IRC
- #wikimedia-cloud monitoring
- Respond to help requests
- Watch for pings to other team members and intercept if appropriate
- Watch for pings to !help
- Call people out for poor behavior in the channel
- Praise people for helping constructively
Community Requests
Check for and respond to incoming requests. For new project requests or quota requests, obtain at least one other person's approval before approving and granting the request, and make sure that approval is explicitly documented on the Phabricator ticket. Bring all floating IP requests, and any request you are unsure about, to the weekly meeting. Requests that represent an increase of more than double the quota or more than 300GB of storage should also be reviewed at the weekly meeting.
- DB Requests
- Related docs: Add_a_wiki#Cloud_Services
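The review rules for community requests described above can be sketched as a small decision helper. This is only an illustration: the function name, its arguments, and the reading of "more than double the quota" as "requested value greater than twice the current value" are all assumptions, and the "any request you are unsure about" rule is left to human judgment.

```shell
# Hypothetical helper; name, arguments, and the interpretation of
# "more than double" are assumptions, not part of the runbook.
# Usage: needs_meeting_review CURRENT_QUOTA REQUESTED_QUOTA STORAGE_GB FLOATING_IP
#   FLOATING_IP is "yes" or "no"; the other arguments are whole numbers.
# Returns 0 (true) when the request should go to the weekly meeting.
needs_meeting_review() {
    # All floating IP requests go to the meeting.
    [ "$4" = "yes" ] && return 0
    # Assumed reading: requested quota exceeds twice the current quota.
    [ "$2" -gt $(( $1 * 2 )) ] && return 0
    # More than 300GB of storage.
    [ "$3" -gt 300 ] && return 0
    return 1
}

# Example: current quota 8, requested 20, no storage, no floating IP.
if needs_meeting_review 8 20 0 no; then
    echo "bring to the weekly meeting"
fi
```

A request that falls outside these rules still needs a second person's approval recorded on the ticket before it is granted.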
Maintenance tasks (probably not all weeks)
End of clinic duty
- Summarize work reported by the team on the weekly meeting etherpad and add summary to:
- Add outgoing updates to the weekly SRE meeting document (put important notes in bold)
- Nicholas will also copy this summary to the weekly Tech Managers meeting notes