Wikimedia Cloud Services team/Clinic duties

The WMCS team practices a clinic duty rotation in which each team member takes a turn sequentially performing these duties. A shift runs from one weekly team meeting to the next: it begins after the weekly meeting and ends with the following one.

In a similar fashion, we have two oncall duty rotations that also run for one week each (see the calendar).

Start of clinic duty

🦄 of the week duties

Phabricator

Community

IRC

  • Monitor the #wikimedia-cloud channel
    • Respond to help requests
    • Watch for pings to other team members and intercept if appropriate
    • Watch for pings to !help
    • Call people out for poor behavior in the channel
    • Praise people for helping constructively

Community Requests

Check for and respond to incoming requests. For new project requests or quota requests, seek and obtain at least one other person's approval before granting the request, and document that approval explicitly on the Phabricator ticket. Bring all floating IP requests, and any request you are unsure about, to the weekly meeting. Requests that would more than double an existing quota, or that ask for more than 300GB of storage, should also be reviewed at the weekly meeting.
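
When reviewing a quota request, it helps to look at the project's current quota and usage first. A minimal sketch, assuming shell access to a cloudcontrol host with the OpenStack CLI available (the wmcs-openstack wrapper name and the project name are illustrative):

# Show the project's current quota before approving an increase
# ("project-name" is a placeholder).
sudo wmcs-openstack quota show project-name

# Compare against current usage to see how much headroom is left.
sudo wmcs-openstack limits show --absolute --project project-name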

Maintenance tasks (probably not all weeks)

End of clinic duty

  • Summarize the work reported by the team on the weekly meeting etherpad and share that summary:
    • Add it to the outgoing updates in the weekly SRE meeting document (put important notes in bold)
    • Nicholas will also copy this summary to the weekly Tech Managers meeting notes

Oncall Duty

Monitoring

  • Monitor the following for alerts:
    • alertmanager (team=wmcs): https://alerts.wikimedia.org/?q=team%3Dwmcs&q=%40state%3Dactive
      • You might also find this logstash dashboard useful for browsing alert history: https://logstash.wikimedia.org/goto/d5c0cb63bb13bc883685c3d90d87cfb7
    • vps alertmanager: https://prometheus-alerts.wmcloud.org/?q=%40state%3Dactive
    • icinga (https://icinga.wikimedia.org/icinga/), especially WMCS hardware: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?hostgroup=wmcs_eqiad&style=overview
  • Watch for wmcs-related emails (cron, puppet failing on our projects, etc.) and fix the causes when possible
  • Check the tools grafana board for trends: https://grafana-labs.wikimedia.org/d/000000012/tools-basic-alerts (for gridengine monitoring, https://sge-status.toolforge.org/ and https://sge-jobs.toolforge.org/ are also helpful)

To find hosts where puppet is failing, run the checks below from cloud-cumin-03; the first command lists hosts whose catalog failed to apply, the second lists hosts reporting an unknown state:

cloud-cumin-03:~$ sudo cumin --force --timeout 500 -o json "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep "Failed to apply catalog"
cloud-cumin-03:~$ sudo cumin --force --timeout 500 -o json "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep -i unknown
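
Once a failing host is identified, a common next step is to trigger a fresh puppet run there and read the error directly. A minimal sketch, assuming the standard run-puppet-agent wrapper is installed on the host (the hostname is a placeholder):

# Re-run puppet on a single host and show the agent output
# ("some-host.some-project.eqiad1.wikimedia.cloud" is a placeholder).
cloud-cumin-03:~$ sudo cumin --force "some-host.some-project.eqiad1.wikimedia.cloud" "run-puppet-agent"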

Day to day

You are expected to monitor, investigate, and fix any tasks created by the alert system on Phabricator.
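
To see what is currently firing without opening the UI, you can also query the Alertmanager API directly. A minimal sketch, assuming alerts.wikimedia.org exposes the standard Alertmanager v2 API and that curl and jq are available:

# List currently firing alerts labeled team=wmcs; the jq filter
# pulls out the alert name and the summary annotation.
curl -s 'https://alerts.wikimedia.org/api/v2/alerts?filter=team%3D%22wmcs%22' \
  | jq '.[] | {alertname: .labels.alertname, summary: .annotations.summary}'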

During your shift, you are expected to give high priority to tasks that improve the alerting, monitoring, and stability of the platforms (see the sketch after this list), for example:

  • Moving alerts from Icinga to Alertmanager
  • Adding new alerts or removing stale alerts
  • Improving runbooks and documentation
  • Writing cookbooks to automate tasks
  • Cleaning up puppet code
  • ...
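
When adding a new alert, the usual shape is a Prometheus alerting rule that carries a team label so Alertmanager routes it to WMCS. The sketch below is illustrative only: the metric, threshold, and file name are placeholders, and real rules belong in the team's alerting configuration repository rather than a local file:

# Sketch of an Alertmanager-routed alerting rule; every value here
# is a placeholder, not a real production alert.
cat > wmcs_example_alert.yaml <<'EOF'
groups:
  - name: wmcs_example
    rules:
      - alert: ExampleInstanceDown
        expr: up{project="example"} == 0
        for: 5m
        labels:
          team: wmcs        # label used for routing to the WMCS receiver
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
EOF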