To measure progress beyond standard incident counts and severities, the SRE team has designed an incident scorecard. It is to be applied to every major incident and measures the team’s incident response and engagement.
Incident Assessment Overview
Defining a scorecard for the incident management effort makes it easier to tune the process toward our defined objectives. The scorecard is structured in two layers:
- Per Incident: this should be a list of items to assess the management of the incident.
- Per Month/Quarter: an aggregate of the scorecards, used to review trends over time and year over year (YoY).
With these two views, the organization can extrapolate and report on yearly efforts from the collected data. The SRE team’s collective progress is then visible at different levels of metric resolution, depending on the level of reporting needed.
- Incident: An incident is an outage, security issue, or other operational issue whose severity demands an immediate human response.
- SLI: An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided. The measurements are often aggregated: i.e., raw data is collected over a measurement window and then turned into a rate, average, or percentile. Ideally, the SLI directly measures a service level of interest, but sometimes only a proxy is available because the desired measure may be hard to obtain or interpret.
- SLO: A service level objective (SLO) is an understanding between teams about expectations for reliability and performance: a target value or range of values for a service level measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound (e.g., more than 99% of all requests are successful). SLOs are defined on the SLO page on Wikitech.
Incident Engagement: Incident engagement is a set of metrics defined within the WMF that help track how we (SRE) as an organization are responding to incidents based on a PPT model: People, Process, and Tooling.
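As an illustration of the SLI and SLO definitions above, here is a minimal Python sketch. It assumes the SLI is a request success rate aggregated over one measurement window and that the SLO is a lower bound on that rate. The class and function names and the request counts are hypothetical and are not taken from any published SLO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A target range for an SLI: lower_bound <= SLI <= upper_bound."""
    name: str
    lower_bound: float = float("-inf")
    upper_bound: float = float("inf")

    def is_met(self, sli: float) -> bool:
        return self.lower_bound <= sli <= self.upper_bound


def success_rate(successful: int, total: int) -> float:
    """Aggregate raw request counts from one measurement window into a rate."""
    if total <= 0:
        raise ValueError("the measurement window contains no requests")
    return successful / total


# Hypothetical numbers for a single measurement window.
availability = SLO(name="request success rate", lower_bound=0.99)
sli = success_rate(successful=995_000, total=1_000_000)
print(f"SLI={sli:.4f} met={availability.is_met(sli)}")
```

The sketch uses an inclusive lower bound, which matches the "lower bound ≤ SLI ≤ upper bound" structure. A strict "more than 99%" target would need a strict comparison instead.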
The incident document (Full Report) is the authoritative compendium of information for any given incident. In the case of an incident involving private data/information, a skeleton incident document will be posted to Wikitech as a pointer. This document is the completed artifact produced at the end of any incident. It should consist of all the contextual information needed to understand the incident. Currently, Incident status on Wikitech is that format; the proposal is to augment it by expanding its metadata and adding a “scorecard” section to assess whether the incident was managed effectively, based on the criteria defined to better track incident engagement.
The following brief metadata table is proposed for inclusion at the top of each incident document (and for eventual storage in a database, such as Phabricator). The metadata is meant to provide a quick snapshot of the context around what happened during the incident.
|Incident ID||datestamp + service and event||Start||YYYY-MM-DD hh:mm:ss|
|People paged||Responder count|
|Impact||Who was affected and how?|
Some of the above fields are easy to fill in retrospectively; others are not. Below are some proposals for how to fill them in during the ritual:
- Incident ID: Use whatever is already the name of the corresponding page in Wikitech. For a not yet publicly available doc, use the internal incident status doc. Make sure it is in the format described above
- Start: Use whatever the timeline in the corresponding Wikitech page says. For a not yet publicly available doc, use the internal incident status doc. Make sure it is in UTC
- End: Use whatever the timeline in the corresponding Wikitech page says. For a not yet publicly available doc, use the internal incident status doc. Make sure it is in UTC
- Task: If one does not exist, ask that one be created. It could simply be an umbrella task for the actionables. The title of the task could be “Incident: <Incident ID>”
- People paged: Go through the VictorOps timeline. Make sure to deduplicate SMS/PUSH notifications
- Responder count: This currently requires grepping IRC logs, which is onerous. We suggest updating the Incident Status doc maintained by the IC to record the people responding.
- Coordinators: Straight out of the Incident Status doc. Remember to parse the timeline for IC role handovers
- Affected metrics/SLOs: If an SLO exists in the Published SLOs page, use that. Eventually there will be Grafana dashboards containing SLOs and SLIs as well as a calculation of the error budget. While SLOs are still being rolled out, most services are not expected to have one, in which case enter “No relevant SLOs exist” and add any relevant metrics that help quantify the incident’s impact. It’s imperative to add them when dealing with incidents with direct end-user impact (e.g. edge traffic requests dropped by X%)
- Impact and Summary: Impact in one or two sentences. Summary in one or two paragraphs. Copied from the public Incident doc, or the Incident Status doc if the former does not exist yet.
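The fill-in rules above can be sketched as a small record type. This is only an illustration under stated assumptions: the field names mirror the metadata fields listed above, the `(person, channel)` shape of a page event is hypothetical and is not the VictorOps export format, and all values in the example are made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


def count_people_paged(page_events):
    """Count distinct people paged, collapsing SMS/PUSH duplicates.

    `page_events` is an iterable of (person, channel) pairs. That shape is
    an assumption for this sketch, not the VictorOps export format.
    """
    return len({person for person, _channel in page_events})


@dataclass
class IncidentMetadata:
    incident_id: str
    start: datetime
    end: datetime
    task: str
    people_paged: int
    responder_count: int
    coordinators: list
    impact: str
    # Placeholder used while no published SLO covers the service.
    affected_slos: list = field(default_factory=lambda: ["No relevant SLOs exist"])

    def __post_init__(self):
        # Start and End must be in UTC; naive timestamps are rejected too.
        for name in ("start", "end"):
            if getattr(self, name).utcoffset() != timedelta(0):
                raise ValueError(f"{name} must be a timezone-aware UTC timestamp")
        if self.end < self.start:
            raise ValueError("end precedes start")


# Hypothetical incident: two people paged, one of them on two channels.
pages = [("alice", "SMS"), ("alice", "PUSH"), ("bob", "PUSH")]
meta = IncidentMetadata(
    incident_id="2022-01-01 example-service outage",
    start=datetime(2022, 1, 1, 10, 0, 0, tzinfo=timezone.utc),
    end=datetime(2022, 1, 1, 10, 45, 0, tzinfo=timezone.utc),
    task="Incident: 2022-01-01 example-service outage",
    people_paged=count_people_paged(pages),
    responder_count=3,
    coordinators=["alice"],
    impact="Hypothetical: some users saw errors for 45 minutes.",
)
print(meta.people_paged, meta.end - meta.start)
```

Validating the timestamps in `__post_init__` catches the most likely mistake (a local-time or naive timestamp) at the moment the record is created rather than during later aggregation.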
The Incident status page contains the ongoing status updates and notes/timeline during an incident. In addition, this notepad will feed into the overall incident (post-mortem review) document. The “Create a new incident report” box lets you quickly create an incident report.
The following proposal is based on three assessment rubrics for an incident’s response efforts, each with a point scale and an assessment bracket per item. Low scores indicate poor performance; high scores indicate good performance. The intent is not to blame or raise concern, but to introspect effectively on how an incident played out, without fear of blame or retribution. If anything, low scores should help indicate where to direct attention and priority at an organizational level.
|People||Were the people responding to this incident sufficiently different from those in the previous N incidents? (0/5pt)|
|Were the people who responded prepared enough to respond effectively? (0/5pt)|
|Did fewer than 5 people get paged? (0/5pt)|
|Were pages routed to the correct sub-team(s)?|
|Were pages routed to online (working hours) engineers? (0/5pt; score 0 if people were paged after-hours)|
|Process||Was the incident status section actively updated during the incident? (0/1pt)|
|If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt)|
|Is there a Phabricator task for the incident? (0/1pt)|
|Are the documented action items assigned? (0/1pt)|
|Is this a repeat of an earlier incident? (-1pt per previous occurrence)|
|Is there an open task that would prevent this incident / make mitigation easier if implemented? (0/-1pt per task)|
|Tooling||Did the people responding have trouble communicating effectively during the incident due to existing tooling or a lack of tooling? (0/5pt)|
|Did existing monitoring notify the initial responders? (1pt)|
|Were all engineering tools required available and in service? (0/5pt)|
|Was there a runbook for all known issues present? (0/5pt)|
This scorecard is meant to be filled out as part of an incident review effort, after the incident is complete and the document is written. Part of the metadata and questions can be filled out before the incident review as needed. The goal of the scorecard is to serve as a reflection piece and a conversation starter that helps identify gaps in our current IR efforts. Bad scores are not meant to reflect poorly on responders, but to increase visibility of the gaps and help drive action on them.
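To make the point scales concrete, here is a minimal Python sketch that tallies one incident’s scorecard per rubric. It rests on several assumptions: each binary item scores either 0 or its full value, the communication question scores its 5 points when responders had no trouble, and the two Process penalties subtract one point per occurrence or task. The sub-team routing question is left out because no point scale is given for it. The answer keys are hypothetical names chosen for this sketch.

```python
# (rubric, answer key, points awarded for a favourable answer).
# The answer keys are hypothetical names, not part of the scorecard itself.
BINARY_ITEMS = [
    ("People", "responders_differ_from_previous_incidents", 5),
    ("People", "responders_prepared", 5),
    ("People", "fewer_than_5_paged", 5),
    ("People", "paged_only_in_working_hours", 5),
    ("Process", "status_section_updated", 1),
    ("Process", "status_page_or_org_updated", 1),
    ("Process", "phabricator_task_exists", 1),
    ("Process", "action_items_assigned", 1),
    ("Tooling", "communication_unhindered", 5),
    ("Tooling", "monitoring_notified_responders", 1),
    ("Tooling", "tools_available", 5),
    ("Tooling", "runbooks_present", 5),
]


def score_incident(answers, previous_occurrences=0, open_preventive_tasks=0):
    """Return the per-rubric totals for one incident."""
    totals = {"People": 0, "Process": 0, "Tooling": 0}
    for rubric, key, points in BINARY_ITEMS:
        # Unanswered items count as unfavourable (0 points).
        if answers.get(key, False):
            totals[rubric] += points
    # Penalties: -1 per previous occurrence and -1 per open preventive task.
    totals["Process"] -= previous_occurrences + open_preventive_tasks
    return totals


# Hypothetical answers for one incident.
answers = {
    "responders_differ_from_previous_incidents": True,
    "responders_prepared": True,
    "fewer_than_5_paged": False,
    "paged_only_in_working_hours": True,
    "status_section_updated": True,
    "status_page_or_org_updated": True,
    "phabricator_task_exists": True,
    "action_items_assigned": False,
    "communication_unhindered": True,
    "monitoring_notified_responders": True,
    "tools_available": True,
    "runbooks_present": False,
}
scores = score_incident(answers, previous_occurrences=1)
print(scores, "total:", sum(scores.values()))
```

Keeping the per-rubric totals separate, rather than returning a single number, preserves the People/Process/Tooling breakdown that the scorecard uses to show where attention is needed.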
The aggregate scorecard is the average of the scores of all incidents within a specific time period (a quarter in this case). We mainly use the monthly scorecard to tabulate the end-of-quarter results.
|Incidents Count (per severity)|
|SLO Delta (global or affected SLOs only)|
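The aggregation can be sketched in Python as follows. This is a rough illustration, assuming each incident record carries a date, a severity label, and a total score; the dictionary keys, the severity labels, and the sample data are all hypothetical. The SLO delta is not computed here because the source does not specify how it is derived.

```python
from collections import Counter, defaultdict
from datetime import date
from statistics import mean


def quarter_of(day):
    """Map a date to a calendar-quarter label such as '2022-Q1'."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


def aggregate(incidents):
    """Average the incident scores and count incidents per severity, by quarter."""
    by_quarter = defaultdict(list)
    for incident in incidents:
        by_quarter[quarter_of(incident["date"])].append(incident)
    report = {}
    for quarter, items in sorted(by_quarter.items()):
        report[quarter] = {
            "incident_count": Counter(item["severity"] for item in items),
            "average_score": mean(item["score"] for item in items),
        }
    return report


# Hypothetical incident records.
incidents = [
    {"date": date(2022, 1, 10), "severity": "major", "score": 28},
    {"date": date(2022, 3, 2), "severity": "minor", "score": 20},
    {"date": date(2022, 4, 5), "severity": "major", "score": 30},
]
report = aggregate(incidents)
for quarter, row in report.items():
    print(quarter, dict(row["incident_count"]), row["average_score"])
```

The same function yields a monthly view if `quarter_of` is swapped for a month label, which fits the practice of tabulating monthly results into the end-of-quarter figures.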