Incident Scorecard

To measure progress outside of standard incident counts and severities, the SRE team has designed an incident scorecard, to be applied to every major incident, measuring the team’s incident response and engagement score.


== Incident Assessment Overview ==
Defining a scorecard for the incident management effort facilitates tuning the process to achieve our defined objectives. The scorecard is structured in two layers:

* Per Incident: a list of items to assess the management of the incident.
* Per Month/Quarter: an aggregate of the scorecard to review trends over time and year over year.

The organization can extrapolate and report on yearly efforts based on the collected data with these two views. Assessment of the SRE team’s collective progress would be visible at different levels of metrics resolution according to the level of reporting needed.


== Definitions ==


* Incident: An incident is an outage, security issue, or other operational issue whose severity demands an immediate human response.
* SLI: A service level indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is provided. The measurements are often aggregated: raw data is collected over a measurement window and then turned into a rate, average, or percentile. Ideally, the SLI directly measures a service level of interest, but sometimes only a proxy is available because the desired measure may be hard to obtain or interpret.
* SLO: A service level objective (SLO) is an understanding between teams about expectations for reliability and performance: a target value or range of values for a service level measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound (e.g., more than 99% of all requests are successful). Published objectives are listed on the [[SLO]] page.
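
To make the SLI/SLO relationship concrete, here is a minimal sketch of checking an availability SLI against an SLO target; the metric values and the 99% target are illustrative, not taken from any published WMF SLO.

<syntaxhighlight lang="python">
# Illustrative only: evaluate a simple availability SLI against an SLO target.
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully over the measurement window."""
    return successful_requests / total_requests

def meets_slo(sli: float, target: float = 0.99) -> bool:
    """SLO of the form SLI >= target, e.g. 'more than 99% of all requests are successful'."""
    return sli >= target

# Example measurement window: 1,000,000 requests, 1,200 of them failed.
sli = availability_sli(successful_requests=998_800, total_requests=1_000_000)
print(f"SLI = {sli:.4%}, SLO met: {meets_slo(sli)}")  # SLI = 99.8800%, SLO met: True
</syntaxhighlight>
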
Incident Engagement: Incident engagement is a set of metrics defined within the WMF that help track how we (SRE) as an organization are responding to incidents based on a PPT model: People, Process, and Tooling.


== Incident document ==
The incident document (Full Report) is the authoritative compendium of information for any given incident. In the case of an incident involving private data/information, a skeleton incident document will be posted to Wikitech as a pointer. This document is the completed artifact produced at the end of any incident and should contain all the contextual information needed to understand the incident. Currently, [[Incident status|Incident status]] on Wikitech is that format; the proposal is to augment it by expanding its metadata and adding a “scorecard” section to assess whether the incident was managed effectively, based on criteria defined to better track incident engagement.


=== Metadata ===
Following is a brief metadata table, proposed to be included at the top of each incident document (and eventually stored in a database such as Phabricator). The metadata is aimed at providing a quick snapshot of context around what happened during the incident.
{{Incident scorecard|demo=1}}
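
To make the proposed fields concrete, here is a hypothetical sketch of the metadata as a structured record for the eventual database storage mentioned above; the field names and types are assumptions for illustration, not an existing WMF schema.

<syntaxhighlight lang="python">
# Hypothetical sketch of the proposed incident metadata as a structured record;
# field names and types are assumptions, not an existing WMF data model.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class IncidentMetadata:
    incident_id: str                  # datestamp + service and event, per the Incident ID format
    start_utc: datetime               # UTC start timestamp
    end_utc: datetime                 # UTC end timestamp
    task: str                         # Phabricator task for the incident
    people_paged: int                 # deduplicated count of people paged
    responder_count: int              # people who actually responded
    coordinators: list[str] = field(default_factory=list)
    affected_slos: list[str] = field(default_factory=list)  # or note "No relevant SLOs exist"
    impact: str = ""                  # who was affected and how, in one or two sentences
    summary: str = ""                 # one or two paragraphs
</syntaxhighlight>
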


==== Metadata dictionary ====
Some of the above fields are easy to fill in retrospectively; others not so much. Below are some proposals for how to fill these in during the incident review ritual.

* '''Incident ID''': Use whatever is already the name of the corresponding page in Wikitech. For a not yet publicly available doc, use the internal incident status doc. Make sure it is in the format described above.
* '''Start''': Use whatever the timeline in the corresponding Wikitech page says. For a not yet publicly available doc, use the internal incident status doc. Make sure it is in UTC.
* '''End''': Use whatever the timeline in the corresponding Wikitech page says. For a not yet publicly available doc, use the internal incident status doc. Make sure it is in UTC.
* '''Task''': If one does not exist, ask that one be created. It could just be an umbrella task for the actionables. The title of the task could be “Incident: <Incident ID>”.
* '''People paged''': Go through the VictorOps timeline. Make sure to deduplicate SMS/PUSH notifications so that each person is counted once.
* '''Responder count''': This currently requires grepping IRC logs, which is fairly onerous. We suggest updating the Incident Status doc maintained by the IC to record the people responding.
* '''Coordinators''': Straight out of the Incident Status doc. Remember to parse the timeline for IC role handovers.
* '''Affected metrics/SLOs''': If an SLO exists on the [[SLO#Published%20SLOs|Published SLOs]] page, use that. Eventually there will be Grafana dashboards containing SLOs and SLIs as well as a calculation of the error budget (see the sketch after this list). While SLOs are being rolled out, this is expected to not be true for most services; in that case, enter “No relevant SLOs exist” and add any relevant metrics that will help quantify the incident’s impact. It is imperative to add them when dealing with incidents with direct end-user impact (e.g. edge traffic requests dropped by X%).
* '''Impact''' and '''Summary''': Impact in one or two sentences; Summary in one or two paragraphs. Copied from the public incident doc, or the Incident Status doc if the former does not exist yet.
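
As a rough illustration of the error budget calculation mentioned above: the sketch below is generic (the 99.9% target and 30-day window are made-up numbers, and this is not the formula used by any particular Grafana dashboard), but it shows how an incident’s downtime can be expressed as a share of the budget allowed by an SLO.

<syntaxhighlight lang="python">
# Generic error-budget sketch; the 99.9% target and 30-day window are illustrative assumptions.
from datetime import timedelta

def error_budget_consumed(slo_target: float, window: timedelta, downtime: timedelta) -> float:
    """Fraction of the error budget used by the given downtime within the window.

    The error budget is the time the service is allowed to miss its SLO:
    budget = (1 - slo_target) * window.
    """
    budget_seconds = (1 - slo_target) * window.total_seconds()
    return downtime.total_seconds() / budget_seconds

# Example: a 45-minute outage against a 99.9% availability SLO over a 30-day window.
consumed = error_budget_consumed(
    slo_target=0.999,
    window=timedelta(days=30),
    downtime=timedelta(minutes=45),
)
print(f"{consumed:.1%} of the error budget consumed")  # about 104.2%: the budget is exhausted
</syntaxhighlight>
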


=== Incident status ===
The [[Incident status]] page contains the ongoing status updates and notes/timeline during an incident. In addition, this notepad will feed into the overall incident (post-mortem review) document. Using the ''Create a new incident report'' box will allow you to quickly create an incident report.

=== Scorecard ===
Following is a proposal based on the three assessment rubrics (People, Process, and Tooling) for this incident’s response efforts, each with its own point scale and assessment bracket per item. Low scores indicate poor performance; high scores indicate positive performance. The intent is not to assign blame or raise concern, but to introspect on how an incident played out without fear of retribution. If anything, low scores should help indicate where to direct attention and priority at an organizational level.
{| class="wikitable"
|+ [[Incident Scorecard|Incident Engagement ScoreCard]]
! Rubric
! Question
! Score
|-
! rowspan="5" | People
|Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt)
|
|-
|Were the people who responded prepared enough to respond effectively? (0/5pt)
|
|-
|Did fewer than 5 people get paged? (0/5pt)
|
|-
|Were pages routed to the correct sub-team(s)?
|
|-
|Were pages routed to online (working hours) engineers? (0/5pt) (score 0 if people were paged after-hours)
|
|-
! rowspan="6" | Process
|Was the incident status section actively updated during the incident? (0/1pt)
|
|-
|If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt)
|
|-
|Is there a phabricator task for the incident? (0/1pt)
|
|-
|Are the documented action items assigned? (0/1pt)
|
|-
|Is this a repeat of an earlier incident? (-1 per previous occurrence)
|
|-
|Is there an open task that, if implemented, would have prevented this incident or made mitigation easier? (0/-1pt per task)
|
|-
! rowspan="4" | Tooling
|Did the people responding have trouble communicating effectively during the incident due to the existing tooling or a lack of tooling? (0/5pt)
|
|-
|Did existing monitoring notify the initial responders? (1pt)
|
|-
|Were all required engineering tools available and in service? (0/5pt)
|
|-
|Was there a runbook present for all known issues? (0/5pt)
|
|-
! colspan="2" align="right" | Total score
|
|}
Notes:


This scorecard is meant to be filled out as part of an incident review effort after the incident is complete and the document is written. Part of the metadata and questions can be filled out before the incident review as needed. The goal is to use the scorecard as a reflection piece and a conversation starter to help identify gaps in our current IR efforts. Bad scores are not meant to reflect poorly on responders, but to increase visibility and help drive action on the gaps.
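
To make the scoring mechanics concrete, here is a minimal sketch of tallying a filled-in scorecard; the answers, the dictionary layout, and any automation around it are invented for illustration, since the table above is filled in by hand during the review.

<syntaxhighlight lang="python">
# Illustrative tally of a filled-in scorecard; the answers and layout are invented examples.
scorecard = {
    "People": {
        "Responders sufficiently different from previous incidents (0-5)": 3,
        "Responders prepared enough to respond effectively (0-5)": 4,
        "Fewer than 5 people paged (0-5)": 5,
        "Pages routed to the correct sub-team(s)": 5,
        "Pages routed to online (working hours) engineers (0-5)": 0,
    },
    "Process": {
        "Incident status section actively updated (0-1)": 1,
        "Status page / rest of organization updated (0-1)": 1,
        "Phabricator task exists (0-1)": 1,
        "Documented action items assigned (0-1)": 0,
        "Repeat of an earlier incident (-1 per previous occurrence)": -1,
        "Open task that would have prevented or mitigated it (-1 per task)": 0,
    },
    "Tooling": {
        "No trouble communicating due to tooling (0-5)": 5,
        "Existing monitoring notified the initial responders (0-1)": 1,
        "Required engineering tools available and in service (0-5)": 4,
        "Runbooks present for all known issues (0-5)": 2,
    },
}

per_rubric = {rubric: sum(answers.values()) for rubric, answers in scorecard.items()}
total = sum(per_rubric.values())
print(per_rubric, "total:", total)  # {'People': 17, 'Process': 2, 'Tooling': 12} total: 31
</syntaxhighlight>
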


== Aggregate ScoreCard ==
The aggregate scorecard is an average of the scores of all incidents within a specific time period (a quarter in this case). We mainly use the monthly scorecard to tabulate the end-of-quarter results.
{| class="wikitable"
! rowspan="2" | FY2022
! colspan="3" | Q1
! colspan="3" | Q2
! colspan="3" | Q3
! colspan="3" | Q4
|-
! Jul !! Aug !! Sep !! Oct !! Nov !! Dec !! Jan !! Feb !! Mar !! Apr !! May !! Jun
|-
! Incidents Count (per severity)
| || || || || || || || || || || ||
|-
! SLO Delta (global or affected SLOs only)
| || || || || || || || || || || ||
|-
! Score
| || || || || || || || || || || ||
|}
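
A small sketch of how the aggregation could work; the sample scores, field names, and the choice of a plain mean are assumptions rather than an agreed-on method, but the grouping follows the table above (fiscal quarters starting in July).

<syntaxhighlight lang="python">
# Illustrative aggregation of per-incident scorecard totals into monthly and quarterly averages.
# Sample data, field names, and the plain mean are assumptions, not an agreed-on method.
from collections import defaultdict
from statistics import mean

# (incident id, month "YYYY-MM", total scorecard score) -- made-up sample data.
incidents = [
    ("2022-07-12 appservers", "2022-07", 31),
    ("2022-07-29 dns",        "2022-07", 24),
    ("2022-08-03 cache",      "2022-08", 28),
    ("2022-09-18 swift",      "2022-09", 35),
]

def fiscal_quarter(month: str, fy_start_month: int = 7) -> str:
    """Map a calendar month to a fiscal quarter (Q1 starts in July, as in the table above)."""
    m = int(month.split("-")[1])
    return f"Q{((m - fy_start_month) % 12) // 3 + 1}"

monthly, quarterly = defaultdict(list), defaultdict(list)
for _incident_id, month, score in incidents:
    monthly[month].append(score)
    quarterly[fiscal_quarter(month)].append(score)

print({m: mean(s) for m, s in sorted(monthly.items())})    # {'2022-07': 27.5, '2022-08': 28, '2022-09': 35}
print({q: mean(s) for q, s in sorted(quarterly.items())})  # {'Q1': 29.5}
</syntaxhighlight>
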