Incident response

Templates: Lightweight, Full report

Working group:


This is a brief, at-a-glance description of the steps to take when responding to an ongoing incident.

Don’t panic. Even when the wikis are down, you have time to communicate.

If you’ve been paged

  • Stop everything else you’re doing. If you can, respond even if you’re not at your desk.
  • Speak up in #wikimedia-operations to say you got the page and you’re looking at it. Read up in that channel for context.
  • If the alert is a clear false alarm, you can stop here.
  • If the alert may be caused by a (D)DoS or other attack or security issue, move to #mediawiki_security. If there’s too much alert noise, move to #wikimedia-sre. Otherwise, stay in #wikimedia-operations.
  • Every genuine page needs an Incident Coordinator (IC). If you're an SRE and there's no IC yet, you should become the IC.
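
As an illustration only, the channel choice described above can be written as a small decision function. This is a minimal sketch: the function name and its two boolean inputs are hypothetical, and the judgment of whether an alert "may be" an attack or whether the channel is "too noisy" remains a human call.

```python
def pick_channel(possible_attack: bool, too_noisy: bool) -> str:
    """Return the IRC channel to coordinate in.

    The checks run in the same order as the bullet above:
    security concerns first, then alert noise, then the default.
    """
    if possible_attack:
        # (D)DoS, other attack, or security issue
        return "#mediawiki_security"
    if too_noisy:
        # Too much alert noise in the default channel
        return "#wikimedia-sre"
    return "#wikimedia-operations"


print(pick_channel(possible_attack=False, too_noisy=True))
```

The security check is placed first so that a possible attack moves the discussion to #mediawiki_security even when alert noise is also a problem.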

If there was no page, but...

  • If the issue affects users, and three or more people are working on it, there should be an IC.
  • If the issue needs continuous attention and will have to be handed off from person to person until it’s resolved, there should be an IC.
  • If you’re not sure whether there should be an IC, it’s better to have one. If it turns out to be unnecessary, you can stop later.
  • If you're an SRE and there's no IC yet when one is needed, you should become the IC. Alert others by mentioning #page in IRC, and proceed below.
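
The criteria above amount to a short rule of thumb, sketched here as a function. This is an assumption-laden illustration, not an official tool: the function and parameter names are hypothetical, and "three or more people" is taken literally from the first bullet.

```python
def needs_ic(affects_users: bool, responders: int,
             needs_handoffs: bool, unsure: bool = False) -> bool:
    """Return True if an Incident Coordinator should be appointed."""
    # User-facing issue with three or more people working on it
    if affects_users and responders >= 3:
        return True
    # Needs continuous attention and handoffs until resolved
    if needs_handoffs:
        return True
    # When in doubt, it is better to have an IC
    return unsure


print(needs_ic(affects_users=True, responders=3, needs_handoffs=False))
```

Defaulting to an IC when unsure mirrors the third bullet: an unnecessary IC can stand down later, while a missing one is harder to recover from.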

To become the Incident Coordinator (IC)

  • If there is an outgoing IC, ensure that you are both in agreement about the handoff.
  • Announce in IRC, “I am the IC.” You are now the IC.
  • If there’s not yet a status doc, start one by making a copy of the template (File -> Make a copy).
  • Update the status doc to say “IC: <your name and IRC nick>” and add the IC handoff to the timeline.
  • If it's not already, put a link to the status doc in the topic of #mediawiki_security, along with a few words identifying the incident (“foobaroid OOMs”) or at least the date.

When you are the IC

  • Communicate, don’t deep dive. Resist the temptation to troubleshoot the issue; let others do that. Your job is to keep the big picture. If you’re uniquely suited to solve the problem yourself, hand off the IC role to someone else.
  • Keep track of what needs to be done, and what everyone is working on. Assign tasks as needed to make sure everything is covered and no one is doing conflicting work.
  • Keep the status doc up to date. When new information comes in, or engineers take action to work on the problem, update the doc.
    • Set a timer for yourself: every half hour, make sure the status doc is correct.
  • Ask questions. It’s important for you to be fully informed, and it’s also likely that if you don’t know the answer, others don’t either.
    • If you’re not sure what someone is doing, ask them.
    • If someone was investigating a question and you never saw an answer, follow up.
    • If the team agrees “we should do X,” ask who is going to do it -- or assign it to someone.
  • Using the guidelines on officewiki, evaluate whether you need to notify SRE Directors, Legal, Comms, or WMF leadership. If so, either contact them yourself or assign someone to do so.
  • Continue to actively work as the IC until you hand off the role to a specific person or until the incident is over.

When you are not the IC

  • Watch IRC while you work. If others are talking to you, make sure you’ll know.
  • Talk in IRC while you work. Don’t take any action without announcing it first. Keep the channel free of unnecessary chatter during the incident.
  • Log your actions to the SAL. It’s better to log too much than too little.
    • In #wikimedia-operations, say !log Restarted foobaroid on xyz1234.
    • If the incident is security-sensitive, instead use !log-private in #mediawiki_security for visibility, even though it doesn’t actually log anywhere.
  • If you need more people to help you, tell the IC.
  • If you have a question no one has asked, or you know something no one is talking about, speak up -- even if you think someone must have thought of it already.
  • If one person has been the IC for several hours, or if it’s near the end of their workday, consider asking them if they would like a replacement IC.

To hand over the IC role to another person

  • If the incident is in progress, you are the IC until someone takes over from you.
  • Make sure the status doc is up-to-date with everything you know.
  • Make sure the new IC has a full understanding of the situation so far: what’s known, what’s unknown, and who’s working on what.
  • Make sure they know they are the IC.
  • Make sure they update IRC and the status doc to show they’re the IC.
  • You are no longer the IC. Good job!

To resolve the incident and stop being IC

  • Even if there’s still work to do, you may not need an IC if that work is no longer urgent. When remaining tasks can wait until normal working hours, the IC can end the incident.
  • Update the status doc with everything you know. Remind others to do the same. This is much easier now than it will be later. Update the incident status to “resolved.”
  • Make sure unfinished work is tracked in Phabricator, and tasks are linked from the doc.
  • Announce in IRC, “I am resolving the incident.” Mention the status of any continuing issues. Make sure to update each channel where the incident was discussed.
  • You are no longer the IC. Good job!
  • Consider writing an incident report. This decision can wait until regular working hours, but the writing will be easiest while the details are fresh in your mind.
  • Add a bullet to the SRE meeting notes for discussion at the next meeting.

Deciding whether to contact others & how to contact others

Information on when and how to involve other teams in WMF is on officewiki, since it includes staff members' contact information.