A Crash Course on Incident Response, Part 1
Shit happens. No amount of planning, preparation, or conservative practices can prevent incidents. Some items are simply outside of our control and it isn’t possible to plan for every imaginable scenario. Low probability events will still occur. What we can do is develop effective incident management practices so that we can quickly, effectively, and efficiently recover with minimal stress and impact. While incidents are not within our control, how we respond to those incidents is something we can manage.
There’s nothing worse than a minor issue unnecessarily resulting in a major outage or stressful situation. How we manage incidents directly impacts our overall outcome — especially the time to resolution. It also impacts the likelihood of a repeat incident. Addressing problems the right way is at the heart of incident management.
What are the important aspects of a good incident management practice? First and foremost, our goal is to bring a sense of calm focus to the situation. We need to reduce and eliminate stress and unnecessary decisions so that all available resources are focused on resolving the situation. This post is a distillation of actual experience building an incident management program that works well in real life.
We should define “incident” before going further. Within this post, the term incident refers to an abnormal situation that requires out-of-normal-operating-procedures efforts to resolve. Specifically, we are focused on situations that are currently impacting our users or have a high probability of doing so.
Why do I call this an incident management program? Because resolving the incident is only one component. There are other key aspects too: our preparation, communication during and after the event, and our post-event follow-up. These all fit and play together to create an effective program that improves over time.
Think about the last incident you were involved in. It doesn’t have to be work; it could be a situation from your personal life. What mattered the most? What would have made it go smoother and reduced stress? There are two crucial aspects: 1) communication and 2) getting resolution. Let’s break each of those down a bit more.
If you’ve ever been notified about a situation without being given much information, or given inaccurate information, you understand the importance of communication during an incident. Communication isn’t just important, it is critical. We need to ensure that everyone who needs to know gets the correct information at the correct time and is kept in the loop. We’ll talk more about specific practices to achieve this later.
The second aspect is getting resolution. Having insufficient or incorrect resources working to resolve an issue is problematic. It increases the chances of making poor, possibly reckless, decisions that might only exacerbate the incident. Adding unnecessary people will only increase stress and distraction in an already stressful situation. A timely resolution requires rapidly engaging the minimum set of resources needed to drive resolution.
Incident Stakeholders
Focused, timely, crisp, accurate communication during an incident is absolutely critical. We need to ensure we minimize stress both inside and outside the incident response team so the response team can focus. The best way to do that is with high quality communication. If you’ve worked an incident and had the pleasure of a senior manager or executive asking for updates every 90 seconds, you’ll understand this. Stakeholders want to know what’s going on. It isn’t always obvious to them that they are adding more stress to a stressful situation, or they might believe adding more pressure will drive to a quicker resolution.
In order to think about our communications, we need to understand our stakeholders. We generally have different groups of stakeholders we need to keep informed. Each group has different concerns and needs. The key groups are often:
Executive / Senior Leadership: These groups need extremely focused, consistent communication. They want to understand the impact in terms of risk to the business. That means ensuring they have insight into the severity, customers impacted, and expected time-to-resolution. They want to know the right people are working on the problem. Be brief but accurate.
Customer Support Resources: These groups are the boots on the ground in terms of helping your customers through a frustrating and stressful time. They might need information that helps them explain what risks the customer might be facing, severity, impact, and expected time-to-resolution. Ensure they have information needed to address and help comfort customers. It is not uncommon for the direct incident management communications to be filtered and revised by a specialist group.
Customers: Customers might be internal or external. They will be frustrated, confused, and in the dark. They generally just want to know you’re working hard on the issue, a time-to-resolution, and if/how they might be impacted longer term. Depending on your business, this information will be filtered and provided in a summary form externally.
Engineering / Operations Leadership: This group usually cares about the cause and the resolution. Their job during an incident should be to help get the resources needed to resolve the issue. They want to know the right people are working on the problem. They’ll need information to answer the hard questions they’ll be getting from their superiors about what’s happening, how it happened, why it happened, etc.
Sales / Professional Service Leadership: These groups need to be informed about issues. They might be onsite with customers, giving demos, or fielding questions about the situation. They often just need to understand the expected time-to-resolution, know the risk the situation poses to customers, and perhaps high level talking points.
Engineering / Operations Teams: These people comprise the majority of our Standby Resources. Keeping the engineering and operations teams informed is valuable. They’ll generally be most interested in the detailed technical information, certainly more so than the other groups. These teams often want to know if they can help or if needed resources are missing. We want to keep them informed without needing them to hang out on a conference bridge.
Effective Incident Communication
All communications should be high quality. That means they are written in grammatically correct sentences and focused on clarity. Avoid jargon, acronyms, or anything that might lead to confusion or reduce clarity. Avoid unneeded verbosity. I generally suggest using a combination of “event-driven” and “time-based” updates. Leaving stakeholders to wonder what is going on causes tremendous unneeded stress, because they will attempt to figure out what’s happening and pull attention away from getting resolution. We want to avoid that. We also want to ensure there is one clear, accessible source of truth where stakeholders can get information. We need to minimize rumors, communication race-conditions, speculation, and so on.
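One way to picture the combination of “event-driven” and “time-based” updates is as a small decision rule: send an update as soon as something important changes, and otherwise send one when a fixed interval has passed. The sketch below is only an illustration. The function name, its parameters, and the 30-minute interval in the example are assumptions, not something this post prescribes.

```python
from datetime import datetime, timedelta


def should_send_update(
    now: datetime,
    last_sent: datetime,
    max_interval: timedelta,
    status_changed: bool,
) -> bool:
    """Decide whether a stakeholder update is due (hypothetical helper)."""
    # Event-driven: an important status change always triggers an update.
    if status_changed:
        return True
    # Time-based: otherwise send once the agreed interval has elapsed.
    return now - last_sent >= max_interval


# Example with made-up times and an assumed 30-minute interval.
last_sent = datetime(2024, 1, 1, 15, 0)
interval = timedelta(minutes=30)

due_on_change = should_send_update(datetime(2024, 1, 1, 15, 5), last_sent, interval, True)
due_on_timer = should_send_update(datetime(2024, 1, 1, 15, 30), last_sent, interval, False)
not_due_yet = should_send_update(datetime(2024, 1, 1, 15, 10), last_sent, interval, False)
```

In the example, the first two calls report that an update is due and the third does not.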
I suggest all communications start from the following template:
Title: A brief, clear, summary of the problem.
Severity: A high-level assessment of the severity.
Impact: An indicator of customer impact (count, high impact customers, etc.).
Resources Needed [optional]: A list of teams or groups additional resources are needed from.
Status: A clear, well-written description of the current status. It needs to contain enough context and be well-written enough to not further increase confusion. Whether or not you’ve got a viable path to resolution should be clearly articulated.
Next Update At: The approximate time you’ll issue the next status update if there are no important status changes sooner.
Engaged Teams: A (summarized) list of the resources engaged in getting resolution.
Coordination Room/Call: Name of the chat room or conference call the team is using to coordinate response efforts.
Incident Commander Contact: Name, phone, email, chat alias of the incident commander or person who can be contacted for additional information.
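If you generate these messages from tooling, the template above maps naturally onto a small record type. The following is a minimal sketch, assuming a plain-text message with one labeled line per field. The class name, field names, and all example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentUpdate:
    """One status update with the template's fields (names are illustrative)."""

    title: str
    severity: str  # e.g. "High", "Medium", or "Low"
    impact: str
    status: str
    next_update_at: str
    engaged_teams: List[str]
    coordination_room: str
    commander_contact: str
    resources_needed: List[str] = field(default_factory=list)  # optional field

    def render(self) -> str:
        """Return the update as plain text, one labeled line per field."""
        # An empty list is rendered as an explicit "None" rather than omitted.
        needed = ", ".join(self.resources_needed) or "None"
        lines = [
            f"Title: {self.title}",
            f"Severity: {self.severity}",
            f"Impact: {self.impact}",
            f"Resources Needed: {needed}",
            f"Status: {self.status}",
            f"Next Update At: {self.next_update_at}",
            f"Engaged Teams: {', '.join(self.engaged_teams)}",
            f"Coordination Room/Call: {self.coordination_room}",
            f"Incident Commander Contact: {self.commander_contact}",
        ]
        return "\n".join(lines)


# Example with entirely made-up values.
update = IncidentUpdate(
    title="Checkout requests failing in EU region",
    severity="High",
    impact="Roughly 5% of customers, including two large accounts",
    status="Cause identified as a bad config push; rollback is in progress.",
    next_update_at="21:15, or as the situation evolves",
    engaged_teams=["Payments", "Platform Operations"],
    coordination_room="#incident-checkout-eu",
    commander_contact="Jane Doe, jane.doe@example.com",
)
text = update.render()
```

Every message then carries the same fields in the same order, so recipients do not have to hunt for information.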
Having standardized communications is essential. You don’t want recipients hunting for information. I advocate providing training to both the incident commanders, who are responsible for ensuring clear communication is happening, and to stakeholders so they understand how to read incident communications and know your expectations around their behavior.
Let’s break down each component of this format and its intent.
The Title serves to identify an issue clearly and very concisely. Sometimes we’ll have multiple issues happen at once, so don’t use something meaningless like “SEV1” or something so generic that stakeholders can’t tell which incident a communication pertains to. We want to use a title that helps us disentangle issues and reduce speculation. You don’t want people referring to an incident by multiple names, causing further confusion. I do not like using ticket numbers because they are easily confused and mean nothing to people who are unfamiliar with the tickets.
Severity is used so that people can quickly assess the impact of an incident. We’ll dive into some scales later, but this doesn’t need to be fancy. A simple High, Medium, Low is often sufficient. We just want our stakeholders to quickly understand the impact or potential impact of an incident. Also keep in mind that this might change and evolve throughout the life of an incident. It isn’t set in stone.
Giving an indicator of impact will help alleviate a lot of concerns and help make clear if additional resources should be brought in to assist. This is a proxy that helps us understand how widespread an issue might be. Is it impacting three customers out of 1,000,000 or is it impacting all customers? Is it impacting your largest, highest paying, or most influential customers? You should concisely represent this information. Some stakeholders can use this information to begin crafting communication about the issue.
Resources Needed is helpful to include when there are needed resources. This will give those stressed out leaders something to do: they can go wake up and harass the resources you’re trying to raise to get resolution. It is OK to leave this line in and clearly state “None” if there are no additional resources needed at present.
Status is obviously the meat of the message. It is used to convey specifics about what’s currently going on. Don’t be skimpy, but absolutely do not write a novel here. Do everything possible to make this as clear and well written as you can. This is what will calm fears, let everyone know that the situation is under control, and that the situation is being well managed. I used to have incident commanders practice their incident status writing each month.
Include a Next Update At section so that people know when to expect updates; this helps set expectations properly. If it is 15:13 and the solution you’re working on is expected to take 6 hours, say so: “next update at 21:15, or as the situation evolves.” With high severity or broad impact incidents, it might be prudent to send updates at fixed intervals, even if they only report marginal changes in status.
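The arithmetic in that example can be sketched in a few lines. Note that 15:13 plus 6 hours is 21:13, so the quoted 21:15 implies rounding to a friendlier time. The rounding up to the next quarter hour below is an assumption made for illustration, and the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def next_update_at(
    now: datetime,
    expected_work: timedelta,
    round_to_minutes: int = 15,
) -> datetime:
    """Add the expected work time, then round up to a multiple of round_to_minutes."""
    target = now + expected_work
    midnight = target.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = target - midnight
    step = timedelta(minutes=round_to_minutes)
    # Ceiling division: the number of whole steps needed to cover `elapsed`.
    steps = -(-elapsed // step)
    return midnight + steps * step


# The article's example: 15:13 plus 6 hours of expected work (date is made up).
now = datetime(2024, 1, 1, 15, 13)
promised = next_update_at(now, timedelta(hours=6))
message = f"next update at {promised:%H:%M}, or as the situation evolves"
# message == "next update at 21:15, or as the situation evolves"
```

Rounding up rather than down errs on the side of promising a time you can actually meet.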
Engaged Teams is a useful piece of information to let stakeholders know who’s working on getting resolution. This lets everyone know the issue is in good hands and the right people are engaged and working on it. It doesn’t generally need to be a list of individual names, but team or area of concern is useful.
It can be helpful to include the chat room name or conference bridge information so that those who want to observe the efforts can do so, depending on your culture and tool set. You don’t want noise or distractions in the mediums being used for active work, but this can help reduce stakeholder stress if they can “see” work happening.
Lastly, tell people how to reach the incident commander. This lets them know who to contact if they need more information. You should set expectations with leadership that they should only engage with the incident commander or another designated resource, so as to maximize the focus on getting resolution. Stakeholders will want more information. Make sure it is in line with expectations and tell them where and how to get it.
So far we have spent a lot of time discussing communication. That’s because it is critical. Without it, you’ll have confusion, frustration, and stress. Even relatively minor incidents will feel like a calamity. Just improving the quality of communication will massively reduce the stress around incidents and lead to quicker resolutions because you’ll reduce distractions, improve focus, and get the correct resources more quickly.
In Part 2, we’ll dig into some more tactical aspects of incident response. Specifically, we’ll talk about how to get good information to share, the notion of an Incident Commander, and communication protocols.
Real Kinetic helps companies develop effective incident management programs. Learn more about working with us.