
Wikimedia Cloud Services team/Clinic duties: Difference between revisions

imported>Nskaggs
(Use anchor link for cumin master)
imported>Nskaggs
(Add direct links for platform health dashboards)
Line 13: Line 13:
=== Monitoring ===
=== Monitoring ===
* Monitor [https://icinga.wikimedia.org/icinga/ icinga] for alerts, especially [https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?hostgroup=wmcs_eqiad&style=overview WMCS hardware].
* Monitor [https://icinga.wikimedia.org/icinga/ icinga] for alerts, especially [https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?hostgroup=wmcs_eqiad&style=overview WMCS hardware].
* Watch for wmcs-related cronspam and fix the causes when possible
** Watch for wmcs-related cronspam and fix the causes when possible
* Check the [https://grafana-labs.wikimedia.org/dashboard/db/tools-basic-alerts tools grafana board] for trends (For gridengine monitoring, https://sge-status.toolforge.org/ is also helpful)
* Check the [https://grafana-labs.wikimedia.org/dashboard/db/tools-basic-alerts tools grafana board] for trends (For gridengine monitoring, https://sge-status.toolforge.org/ is also helpful)
* Check for broken puppet on VMs (owners get daily emails from [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/base/files/labs/puppet_alert.py puppetalert.py] but you can contact them if an instance is un-puppetized for a particularly long time). From a [[Cumin#WMCS_Cloud_VPS_infrastructure|cumin master]]:
* Check for broken puppet on VMs (owners get daily emails from [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/base/files/labs/puppet_alert.py puppetalert.py] but you can contact them if an instance is un-puppetized for a particularly long time). From a [[Cumin#WMCS_Cloud_VPS_infrastructure|cumin master]]:
Line 21: Line 21:
</syntaxhighlight>
</syntaxhighlight>
* Monitor [https://vpsalertmanager.toolforge.org/?q= Vps alertmanager]
* Monitor [https://vpsalertmanager.toolforge.org/?q= Vps alertmanager]
* Platform Health [https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 NFS Storage Utilization] [https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?orgId=1 Openstack] [https://grafana.wikimedia.org/d/7TjJENEWz/cloudvps-ceph-cluster?orgId=1 Ceph Cluster] [https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?search=open&folder=current&orgId=1&refresh=1m WMCS Grafana Dashboards]


=== Community ===
=== Community ===

Revision as of 20:53, 23 December 2020

The WMCS team practices a clinic duty rotation in which each team member takes a turn, in sequence, performing these duties. A shift begins after one weekly team meeting and ends with the next.

Start of clinic duty

🦄 of the week duties

Phabricator

Monitoring

cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json  "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep "Failed to apply catalog"
cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json  "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep  -i unknown
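The two checks above differ only in their final grep, so the fleet can be queried once and filtered for both patterns together. The following is a minimal sketch rather than an established team tool: the function name `filter_puppet_problems` is hypothetical, and it matches both patterns case-insensitively, whereas the original first command matches "Failed to apply catalog" case-sensitively.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original page): keep only the lines
# of check_puppetrun output that indicate a failed catalog or an unknown state.
# Note: -i makes BOTH patterns case-insensitive, which is slightly broader
# than the original pair of greps.
filter_puppet_problems() {
    grep -iE 'Failed to apply catalog|unknown'
}

# Intended usage on a cumin master, reusing the exact command from above:
#   sudo cumin --force --timeout 500 -o json "A:all" \
#     "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" \
#     | filter_puppet_problems
```

Running the query once halves the time spent waiting on the 500-second timeout, at the cost of mixing the two kinds of result in a single listing.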

Community

IRC

  • Monitor the #wikimedia-cloud channel
    • Respond to help requests
    • Watch for pings to other team members and intercept if appropriate
    • Watch for pings to !help
    • Call people out for poor behavior in the channel
    • Praise people for helping constructively

Requests

Maintenance tasks (probably not all weeks)

End of clinic duty

  • Summarize the work reported by the team on the weekly meeting etherpad, then:
    • Add the summary as outgoing updates to the weekly SRE meeting document (put important notes in bold)
    • Nicholas will also copy this summary to the weekly Tech Managers meeting notes