Incident response#

When an Incident is declared, we trigger a special response to ensure that it is resolved quickly. This section describes our incident response process, its major roles and terminology, and what to expect. [1][2][3]

In Beta!

We are currently working out our Incident Response process. The content on this page might change over time, and we welcome suggested changes and pull requests!

Roles and team structure#

An Incident Response Team is formed when an Incident has been declared. The goal of the Incident Response Team is to collectively resolve incidents.

An Incident Response Team is generally made up of:

Incident Response Team#

The group of roles that collectively understand, plan, resolve, and communicate our actions around an Incident. The people in these roles may change in a fluid manner, and one person may serve in multiple roles. A rough way to approximate this team is “the people that have communicated in internal and external channels to resolve an incident.”

Incident Commander#

The Incident Commander has the authority to plan and delegate action to others on the Incident Response Team. They are not expected to take actions themselves. Their goal is to help the team make consistent and deliberate progress towards resolving an incident. They are the Source of Truth about the current state and action plan surrounding an incident.

External Liaison#
External Liaisons#

The person who is responsible for communicating with external stakeholders during an incident. This is either the Incident Commander or somebody to whom they delegate this role. Every few working hours, they should communicate the status of the incident, our current thinking, what we have tried, and any changes we expect next.

Subject Matter Expert#
Subject Matter Experts#

A member of the Incident Response Team with expertise in an area relevant to an Incident. SMEs have a variety of backgrounds and abilities, and the Incident Commander should pull them into the Response Team as needed. Their goal is to take actions, as directed by the Incident Commander, to resolve an incident.
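The relationship between these roles can be sketched as a small data structure. This is only an illustration, assuming a hypothetical `IncidentResponseTeam` class with made-up field names; it is not part of any tooling we use. It shows two rules from the definitions above: the Incident Commander is the External Liaison unless they delegate the role, and one person may serve in multiple roles.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class IncidentResponseTeam:
    """Hypothetical model of the roles; all names are illustrative only."""

    incident_commander: str
    # None means the Incident Commander has not delegated external communication.
    delegated_liaison: Optional[str] = None
    subject_matter_experts: List[str] = field(default_factory=list)

    @property
    def external_liaison(self) -> str:
        """The Incident Commander is the External Liaison unless they delegate."""
        return self.delegated_liaison or self.incident_commander

    def members(self) -> Set[str]:
        """Everyone on the team, as a set because one person may hold several roles."""
        return {
            self.incident_commander,
            self.external_liaison,
            *self.subject_matter_experts,
        }
```

For example, a team whose Incident Commander is also listed as a Subject Matter Expert still counts that person once in `members()`.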

Communication channels#

External communication#

  • The Incident Commander acts as the primary point of communication with external stakeholders like the Community Representatives.

  • They may delegate this responsibility to another team member if they wish (e.g., to the Support Steward team).

  • We may interact with external stakeholders via comments in Incident Response issues if it helps resolve the incident more quickly.

Internal communication#

Incident response process#

Incidents are a special kind of support ticket, because they are related to degraded service that immediately impacts communities. We prioritize the resolution of incidents above all other kinds of work, and have a special process for tracking conversation and progress with them.

Here is the process that we follow for incidents:

  1. Acknowledge the incident. Communicate with the Community Representative that there is an incident. Use this canned response as a start for responding:

    Incident first response template

  2. Open an incident issue. For each Incident we create a dedicated issue to track its progress. Open an incident issue and notify our engineering team via Slack.

  3. Try resolving the issue and take notes while you gather information about it.

  4. If the issue is not solved after 30 minutes, or you know you cannot resolve it:

    • Ping our engineering team and our Project Manager in the #support-freshdesk channel so that they are aware of the incident.

    • Add the incident issue to our team backlog.

  5. Designate an Incident Commander. Do this in the Incident issue. By default, this is the Support Steward.

    • Confirm that the Incident Commander has the bandwidth and ability to serve in this role.

    • If not, delegate this to another team member. [4]

  6. Designate an External Liaison. Do this in the Incident issue. By default, this is the Incident Commander, though they may delegate this to others. [4]

  7. Investigate and resolve the incident. The Incident Commander should follow the structure of the incident issue opened in step 2.

  8. Delegate to Subject Matter Experts as needed. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly. [4]

  9. Communicate our status every few hours. The External Liaison is expected to communicate the incident status and plan to the Community Representatives. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:

    Incident update template

  10. Communicate when the incident is resolved. When we believe the incident is resolved, tell the Community Representative that things should be back to normal. Mark the FreshDesk ticket as Resolved.

  11. Fill in the Incident Report. The Incident Commander should do this in partnership with the Incident Response Team.

  12. Close the incident ticket. Once we have confirmation from the community (or no response after 48 working hours) and have filled in the Incident Report, close the incident by:

    • Closing the incident issue on GitHub

    • Marking the FreshDesk ticket as Closed
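The two timing rules in the process above (escalate after 30 minutes, close after confirmation or 48 working hours of silence) can be written down as simple checks. The sketch below is an illustration only: the `Incident` class, its field names, and both functions are hypothetical, and it assumes that somebody else has already worked out how many working hours have passed.

```python
from dataclasses import dataclass

ESCALATE_AFTER_MINUTES = 30     # escalate if still unresolved after 30 minutes
CLOSE_AFTER_WORKING_HOURS = 48  # close after 48 working hours with no response


@dataclass
class Incident:
    """Hypothetical record of an incident; field names are illustrative only."""

    minutes_since_opened: float
    resolved: bool = False
    responder_cannot_resolve: bool = False
    community_confirmed: bool = False
    # Assumption: working hours are computed elsewhere and passed in as a number.
    working_hours_since_resolution_notice: float = 0.0
    report_filled_in: bool = False


def should_escalate(incident: Incident) -> bool:
    """True when unresolved after 30 minutes, or when the responder knows they are stuck."""
    if incident.resolved:
        return False
    return (
        incident.responder_cannot_resolve
        or incident.minutes_since_opened >= ESCALATE_AFTER_MINUTES
    )


def can_close(incident: Incident) -> bool:
    """True once resolved, the report is filled in, and the community confirmed or stayed silent."""
    community_done = (
        incident.community_confirmed
        or incident.working_hours_since_resolution_notice >= CLOSE_AFTER_WORKING_HOURS
    )
    return incident.resolved and incident.report_filled_in and community_done
```

Note that `can_close` requires the Incident Report even when the community has confirmed the fix, which mirrors the ordering of the last two steps.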

Handing off Incident Commander status#

During an incident, it may be necessary to designate another person as the Incident Commander. For example, it may be getting late in the current IC’s time zone, they may feel burnt out from leading the incident response, or someone else may have better visibility or experience. Handing off is encouraged and expected, especially for more complex or longer incidents!

To designate another team member as the Incident Commander, follow these steps:

  1. Confirm with them that they are able and willing to serve as the Incident Commander.

  2. Update the Incident Commander name in the top comment of the Incident Report issue.

  3. Notify the team with a comment in the Incident Report issue.
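The handoff steps above can be sketched as a single function that refuses to proceed until the new person has agreed. This is a minimal illustration under stated assumptions: `IncidentIssue` is a hypothetical stand-in for the Incident Report issue (not a GitHub API object), and the `accepted` flag stands for the confirmation gathered in the first step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentIssue:
    """Hypothetical stand-in for the Incident Report issue."""

    incident_commander: str
    comments: List[str] = field(default_factory=list)


def hand_off(issue: IncidentIssue, new_commander: str, accepted: bool) -> None:
    """Apply the three handoff steps in order."""
    # Step 1: the new Incident Commander must confirm they are able and willing.
    if not accepted:
        raise ValueError(
            f"{new_commander} has not agreed to serve as Incident Commander"
        )
    previous = issue.incident_commander
    # Step 2: update the name recorded in the top comment (modeled here as a field).
    issue.incident_commander = new_commander
    # Step 3: notify the team with a comment on the issue.
    issue.comments.append(
        f"Incident Commander handed off from {previous} to {new_commander}."
    )
```

Checking `accepted` first means a declined handoff leaves the issue untouched, so the current Incident Commander stays the Source of Truth until somebody else has agreed to take over.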

Key terms#

Incident Report#
Incident Reports#

A document that describes what went wrong during an incident and what we’ll do to avoid it in the future. When we have an Incident, we create an Incident Report issue. Its goal is to identify improvements to process, technology, and team dynamics that can prevent similar incidents. It is not meant to point fingers at anybody, and care should be taken to avoid making it seem like any one person is at fault. [5]


[1] The Google SRE Incident response guide has a wealth of information about incident response and distributed SRE teams.

[2] This ACM blog post describes the complexity of coordinating across a team of distributed responders during an incident, and notes places where Incident Commander roles may actually hinder responsiveness. It is a good lesson in the complexity of incidents with distributed teams!

[3] The Wikimedia Clinic Duty process also inspired our process here, and is a great overall workflow around distributed SRE.

[4] If you cannot find somebody to take on this work, or feel uncomfortable delegating, the Project Manager should help you, and is empowered to delegate on your behalf.

[5] See the Google SRE post-mortem culture and the Blameless guide to post-mortems for some guidelines.