
Wikimedia Cloud Services team/Clinic duties: Difference between revisions

From Wikitech-static
imported>Andrew Bogott
imported>Nskaggs
(clarify WMCS does own some cloud vps projects :-))
(31 intermediate revisions by 7 users not shown)
Line 1: Line 1:
The WMCS team practices a '''clinic duty''' rotation that runs from one weekly team meeting to the next. Each team member takes a turn sequentially performing these duties.
The WMCS team practices a '''clinic duty''' rotation, with each team member taking a turn in sequence. Your shift begins after one weekly team meeting and ends with the next.

In a similar fashion, we have two '''oncall duty''' rotations that also run for one week ([https://portal.victorops.com/api/v1/org/wikimedia/team/team-ZXiOeQOoOEqFncON/calendar/67EF88CA9AB3C469AC1AC50F6ACBC0B4.ics see the calendar]).

== Start of clinic duty ==
* Change oncall in title of {{Irc|wikimedia-cloud-admin}} to yourself
* Archive the weekly meeting etherpad to https://office.wikimedia.org/wiki/Wikimedia_Cloud_Services/Meeting_Notes/YYYY-MM-DD


== 🦄 of the week duties ==
=== Phabricator ===
* Complete tasks under clinic duty on [[phab:project/board/2774/|Phabricator board]]
* Help triage [[phab:maniphest/query/zYEPBGlDH9xk/|new / incoming tasks]] on Phabricator
=== Community ===
==== IRC ====
* {{Irc|wikimedia-cloud}} monitoring
** Respond to help requests
Line 8: Line 22:
** Call people out for poor behavior in the channel
** Praise people for helping constructively
* Monitor [https://icinga.wikimedia.org/icinga/ icinga] and [http://shinken.wmflabs.org/problems shinken] for alerts
* Watch for wmcs-related cronspam and fix the causes when possible
==== Community Requests ====
* Check the [https://grafana-labs.wikimedia.org/dashboard/db/tools-basic-alerts tools grafana board] for trends
Check for and respond to incoming requests. For new project requests or quota requests, obtain at least one other person's approval before approving and granting the request, and make sure that approval is explicitly documented on the Phabricator ticket. Bring all floating IP requests, and any request you are unsure about, to the weekly meeting. Requests that represent an increase of more than double the quota or more than 300GB of storage should also be reviewed at the weekly meeting.
* Triage [https://phabricator.wikimedia.org/maniphest/?project=PHID-PROJ-msyn2z45n7mw45bfuscb&statuses=open()&group=none&order=newest#R|new tasks added to Phabricator in the #cloud-services project]
* Check for broken puppet on VMs (owners get daily emails from [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/base/files/labs/puppet_alert.py puppetalert.py] but you can contact them if an instance is un-puppetized for a particularly long time):
*[https://phabricator.wikimedia.org/project/board/2875 VPS Project Requests]
andrew@cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep "Catalog fetch fail"
**Related docs: [[Portal:Cloud VPS/Admin/Projects lifecycle#Creating a new project]]
* Run [[Add_a_wiki#Cloud_Services|<code>maintain-views</code> and <code>maintain-meta_p</code> on labsdb*]] as needed for new tables/wikis
**Related docs: [[Portal:Cloud VPS/Admin/Projects lifecycle#Deleting_a_project]]
* Monitor https://toolsadmin.wikimedia.org/tools/membership/ for new requests and process them
** Related docs: [[Portal:Toolforge/Admin#Users_and_community]]
*[https://phabricator.wikimedia.org/project/board/4481/ DB Requests]
**Related docs: [[Add_a_wiki#Cloud_Services]]
*[https://phabricator.wikimedia.org/project/board/2880/ VPS Quota Requests]
**Related docs: [[Portal:Cloud_VPS/Admin/Projects_lifecycle#Modifying_project_quotas]]
**Related docs: [[Portal:Cloud_VPS/Admin/Trove#Adjusting_per-project_Trove_quotas]]
*[https://phabricator.wikimedia.org/project/board/4834/ Toolforge Quota Requests]
**Related docs: [[Portal:Toolforge/Admin/Kubernetes#Quota management]]
*[https://toolsadmin.wikimedia.org/tools/membership/ Toolforge account requests]
**Related docs: [[Portal:Toolforge/Admin#Users_and_community]]
== Maintenance tasks (probably not all weeks) ==
* [[Portal:Cloud_VPS/Admin/Managing package upgrades|Check for attended package upgrades]]
* [[Portal:Cloud_VPS/Admin/Troubleshooting#Nova-fullstack|Check for Nova fullstack VM leaks]]
* [[Portal:Cloud VPS/Admin/DNS/Designate#Detecting leaked records|Check for DNS record leaks]]
== End of clinic duty ==
* Summarize work reported by the team on the weekly meeting etherpad and add summary to:
** Add outgoing updates to weekly SRE meeting document (put important notes in bold)
** Bryan will also copy this summary to the weekly Tech Managers meeting notes
* Archive the weekly meeting etherpad to https://office.wikimedia.org/wiki/Wikimedia_Cloud_Services/Meeting_Notes/YYYY-MM-DD
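
The archive page name above ends in the meeting date written as YYYY-MM-DD. A minimal sketch of building that URL, assuming the placeholder is simply the ISO date of the meeting; the helper name <code>archive_url</code> and the example date are hypothetical:

```python
from datetime import date

# Base path taken from the archive link above; the trailing date is appended.
ARCHIVE_BASE = (
    "https://office.wikimedia.org/wiki/"
    "Wikimedia_Cloud_Services/Meeting_Notes/"
)


def archive_url(meeting_day: date) -> str:
    """Return the archive page URL for a meeting held on meeting_day.

    Assumption: YYYY-MM-DD in the page name is the ISO 8601 meeting date.
    """
    return ARCHIVE_BASE + meeting_day.isoformat()


# Arbitrary example date, not a real meeting date.
print(archive_url(date(2022, 1, 5)))
```

Using <code>date.isoformat()</code> keeps the month and day zero-padded, so archived pages sort chronologically by title.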


==Maintenance tasks (probably not all weeks)==
= Oncall Duty =
* [[Portal:Cloud_VPS/Admin/Managing package upgrades|Check for attended package upgrades]]
During your shift, you are expected to monitor and react to alerts, and to give high priority to tasks that improve the alerting, monitoring, and stability of the platforms. See [[Wikimedia Cloud Services team/EnhancementProposals/Decision record T310598 Team oncall alerting schedules and processes|Decision Record]] for more information.
* [[Portal:Cloud_VPS/Admin/Troubleshooting#Nova-fullstack|Check for Nova fullstack VM leaks]]
* [[Portal:Cloud_VPS/Admin/DNS#Detecting_leaked_records|Check for OpenStack resource leaks]]
==Monitoring==
===Alerts===
*phaultfinder (This [[Alertmanager#Notifications|bot automatically opens tasks]] for non-paging alerts). Please ensure all open requests get assigned and worked (whether yourself or someone else).
**[[phab:maniphest/query/LXZs.g30DfOi/#R|Open tasks]]
**[[phab:maniphest/query/0PZ2LPsDUKL5/#R|All tasks]]
*[https://alerts.wikimedia.org/?q=team%3Dwmcs&q=%40state%3Dactive alertmanager (team=wmcs)]
**You might also find [https://logstash.wikimedia.org/goto/d5c0cb63bb13bc883685c3d90d87cfb7 this dashboard] useful to browse alert history
*[https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?hostgroup=wmcs_eqiad&style=overview Icinga for WMCS hardware].
*Watch for wmcs-related emails (cron, puppet failing on our projects, etc.) and fix the underlying causes
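
The alertmanager link in the list above encodes two filters, <code>team=wmcs</code> and <code>@state=active</code>, as repeated <code>q</code> query parameters. A minimal sketch of building such a link for other filters, assuming only what is visible in that URL; the helper name <code>alerts_url</code> is hypothetical, and the filter syntax the dashboard accepts is not described here:

```python
from urllib.parse import urlencode

# Base URL taken from the alertmanager link in the list above.
ALERTS_BASE = "https://alerts.wikimedia.org/"


def alerts_url(*filters: str) -> str:
    """Build a dashboard link with one percent-encoded q= parameter per filter."""
    query = urlencode([("q", f) for f in filters])
    return f"{ALERTS_BASE}?{query}"


# Reproduces the link used above:
# https://alerts.wikimedia.org/?q=team%3Dwmcs&q=%40state%3Dactive
print(alerts_url("team=wmcs", "@state=active"))
```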
===Cloud VPS alerts===
These include things WMCS isn't directly responsible for, so most of these alerts aren't critical and aren't WMCS's problem to solve. However, projects that WMCS owns or administers, such as tools and admin, are important, and we should respond as the responsible party.
*[https://prometheus-alerts.wmcloud.org/?q=%40state%3Dactive Alertmanager for Cloud VPS projects]
===Dashboards===
This list isn't exhaustive. Dashboards can be utilized to debug or confirm issues within WMCS platforms.
*Platform Health
**[https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 NFS Storage Utilization]
**[https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?orgId=1 OpenStack]
**[https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&folder=current&tag=ceph&tag=health Ceph Cluster]
**[https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?search=open&folder=current&orgId=1&refresh=1m WMCS Grafana Dashboards]
**[https://grafana-labs.wikimedia.org/d/000000012/tools-basic-alerts tools trends]
**Gridengine, https://sge-status.toolforge.org/ and https://sge-jobs.toolforge.org/
==Improvements==
If nothing currently requires immediate attention, you should work on improving tooling in this area. Consider:
*Moving alerts from [https://icinga.wikimedia.org/ Icinga] to [https://alerts.wikimedia.org/ Alertmanager] (e.g. [https://gerrit.wikimedia.org/r/c/operations/puppet/+/813275 novafullstack], [https://gerrit.wikimedia.org/r/c/operations/puppet/+/813228 ceph])
*Adding new alerts or removing stale alerts (e.g. [https://gerrit.wikimedia.org/r/c/operations/alerts/+/822319 Adding neutron alert], [https://gerrit.wikimedia.org/r/c/operations/alerts/+/812706 Adding ceph alerts], [https://gerrit.wikimedia.org/r/c/operations/alerts/+/813274 Adding novafullstack alerts])
*Improving [[Portal:Cloud VPS/Admin/Runbooks|runbooks]] and documentation
*Writing cookbooks to automate tasks (e.g. [https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/774385 remove grid errors], [https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/801785 remove grid node], [https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/810914 ceph_reboot], [https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/806429 increase quotas])
*Cleaning up puppet code and adding tests
*Improving, fixing, or upgrading the team's dashboards in [https://grafana-rw.wikimedia.org/d/-K8NgsUnz/home?orgId=1&search=open&tag=WMCS grafana].

Revision as of 16:37, 9 September 2022

The WMCS team practices a clinic duty rotation, with each team member taking a turn in sequence. Your shift begins after one weekly team meeting and ends with the next.

In a similar fashion, we have two oncall duty rotations that also run for one week (see the calendar).

Start of clinic duty

🦄 of the week duties

Phabricator

Community

IRC

  • #wikimedia-cloud monitoring
    • Respond to help requests
    • Watch for pings to other team members and intercept if appropriate
    • Watch for pings to !help
    • Call people out for poor behavior in the channel
    • Praise people for helping constructively

Community Requests

Check for and respond to incoming requests. For new project requests or quota requests, obtain at least one other person's approval before approving and granting the request, and make sure that approval is explicitly documented on the Phabricator ticket. Bring all floating IP requests, and any request you are unsure about, to the weekly meeting. Requests that represent an increase of more than double the quota or more than 300GB of storage should also be reviewed at the weekly meeting.
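
The routing rules in the paragraph above can be sketched as a small decision function. This is only an illustration under stated assumptions: the request kinds, field names, and the Route labels are hypothetical, and "more than double the quota" is read here as the requested value exceeding twice the current one, which the text does not spell out.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    WEEKLY_MEETING = "bring to the weekly meeting"
    SECOND_APPROVAL = "get one other person's approval, documented on the ticket"


@dataclass
class Request:
    kind: str                     # hypothetical labels: "project", "quota", "floating_ip"
    current_quota: float = 0.0    # existing quota for the resource, if any
    requested_quota: float = 0.0  # quota being asked for
    storage_gb: float = 0.0       # storage being asked for, in GB
    unsure: bool = False          # set when the person on clinic duty has doubts


def route_request(req: Request) -> Route:
    """Decide how a request is handled, following the rules quoted above."""
    if req.kind not in {"project", "quota", "floating_ip"}:
        raise ValueError(f"request kind not covered by this sketch: {req.kind}")
    # Floating IP requests and anything doubtful always go to the meeting.
    if req.kind == "floating_ip" or req.unsure:
        return Route.WEEKLY_MEETING
    # Large storage requests are reviewed at the meeting.
    if req.storage_gb > 300:
        return Route.WEEKLY_MEETING
    # Assumed reading of "more than double the quota".
    if req.current_quota > 0 and req.requested_quota > 2 * req.current_quota:
        return Route.WEEKLY_MEETING
    # Everything else still needs a second person's approval.
    return Route.SECOND_APPROVAL


print(route_request(Request(kind="quota", current_quota=8, requested_quota=12)).value)
```

No path in the sketch lets the person on clinic duty approve a project or quota request alone, which matches the rule that a second approval is always required.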

Maintenance tasks (probably not all weeks)

End of clinic duty

  • Summarize work reported by the team on the weekly meeting etherpad and add summary to:
    • Add outgoing updates to weekly SRE meeting document (put important notes in bold)

Oncall Duty

During your shift, you are expected to monitor and react to alerts, and to give high priority to tasks that improve the alerting, monitoring, and stability of the platforms. See Decision Record for more information.

Monitoring

Alerts

Cloud VPS alerts

These include things WMCS isn't directly responsible for, so most of these alerts aren't critical and aren't WMCS's problem to solve. However, projects that WMCS owns or administers, such as tools and admin, are important, and we should respond as the responsible party.

Dashboards

This list isn't exhaustive. Dashboards can be utilized to debug or confirm issues within WMCS platforms.

Improvements

If nothing currently requires immediate attention, you should work on improving tooling in this area. Consider: