A Crash Course on Incident Response, Part 2
In Part 1, I discussed the important aspects of a good incident management practice including effective communication, clearly defined stakeholders, and getting timely resolution. Now let’s dig into some tactical considerations of incident response.
Before you can give quality updates during an incident, you need to have good information to share. This section will dive into some practical aspects of getting good information and managing an incident to resolution.
Incident Commander
The first step should be designating an incident commander. This is the most crucial role in the incident management process. Their focus is on ensuring the right resources are engaged, that we’re sending high-quality communications, and that we’re making the calls about how to proceed. For the time being, they outrank everyone else. They are the boss. If they need resources, they have authority to call VPs to wake their people up.
The incident commander is tasked with assessing the quality of information, determining when more information is needed, and ultimately making the best decisions possible under duress. This means they’ve got a lot of responsibility to complement their authority.
Because of this you should favor people who are calm and collected under pressure, know how to focus and how to help others focus, can communicate exceptionally well, and aren’t afraid to tell someone who’s not adding value to get out of the room — even if that person has a big fancy title. The last part is often key to supporting the larger incident management effort. A good incident commander knows how to “manage up” to allow those working on an issue to focus.
People who perform well as incident commanders often come from roles that emphasize communication, teamwork, organization, and follow-through. Many of the best incident commanders I have worked with had very strong technical aptitude but came from customer-support-type organizations. They do not need to be senior, but good familiarity with your product(s) and organization can be very helpful. I have rarely seen people from technical roles, such as software engineering or operations, or from project management do well as incident commanders.
Coordination Mediums
Next, determine your coordination mediums and make sure they match your company’s leadership culture.
Are you a war-room shop? War rooms are meant to bring all resources together into a single space to facilitate communication and coordination. In practice, these often look more like everyone working on the problem together in a pressure-cooker-like environment, some higher-up people pacing around and periodically freaking out, and a few random people who pop in every 17 minutes asking for updates. A well-run war room can be very effective, but one that is not carefully managed can be very stressful. Most war rooms are not well-run.
Another common option, especially with more companies working from home, is a conference bridge. A conference bridge is similar to a virtual war room but can sometimes amass very large numbers of people lingering on the call. These are often awkward conference calls where most of the people sit wondering what to do while a few people actually work to fix the problem.
My preferred option is a chat room combined with a conference bridge if needed. I prefer this option because I find it balances active, low-friction communication with lower stress. If you have a small collocated team where everyone is already together, then a war room can replace the conference bridge.
Don’t have a conference bridge with hundreds of people hanging around. It isn’t a good use of anyone’s time. You can pull people back in as they are needed, so have them drop if they’re not actively engaging. That’s why I like chat rooms. They provide a nice written account of what’s going on, the current thinking, ideas on solutions, and new issues. That lets people passively observe the situation, and it keeps a timeline to discuss in the after-action retro.
Irrespective of your coordination mechanisms, I suggest discouraging too broad an audience from attending. Adding more people into a mix can sometimes help overall throughput but rarely increases raw speed. Incident resolution is usually about speed more than throughput. Throwing more people into the mix often just serves as a distraction.
Someone should be tasked with keeping track of the rough timeline of information discovery, steps taken, theories about the issue and its solutions, validation of those theories, communications, and actions being taken. This is true even if you’re using a chat room. This timeline will be important much later, but it can also be useful for checking what has and has not been tried, and why. Keep tabs on general updates and flows of information. Doing so is useful for improving your communication processes, and it might provide ideas about how to identify weak spots in observational tooling. That might help you prevent issues or resolve them more quickly in the future.
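If whoever keeps the timeline prefers something more structured than free-form notes, a small script can do the job. The following is only a minimal sketch in Python: the entry kinds mirror the items listed above, and every name in it (EntryKind, TimelineEntry, IncidentTimeline) is hypothetical rather than part of any real tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EntryKind(Enum):
    """Hypothetical categories, taken from the kinds of events worth tracking."""
    DISCOVERY = "discovery"
    THEORY = "theory"
    VALIDATION = "validation"
    ACTION = "action"
    COMMUNICATION = "communication"


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    kind: EntryKind
    author: str
    note: str


@dataclass
class IncidentTimeline:
    entries: list = field(default_factory=list)

    def record(self, kind, author, note, at=None):
        """Append an entry, stamping it with the current UTC time by default."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), kind, author, note)
        self.entries.append(entry)
        return entry

    def of_kind(self, kind):
        """Return entries of one kind, e.g. every theory raised so far."""
        return [e for e in self.entries if e.kind is kind]

    def render(self):
        """Return the timeline as text, oldest entry first."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M} [{e.kind.value}] {e.author}: {e.note}" for e in ordered
        )


if __name__ == "__main__":
    timeline = IncidentTimeline()
    timeline.record(EntryKind.DISCOVERY, "ana", "checkout error rate spiking")
    timeline.record(EntryKind.THEORY, "sam", "possibly the last deploy")
    print(timeline.render())
```

Tagging each entry with a kind makes it easy to pull out just the theories or just the actions when someone asks what has already been tried, and the rendered text can be pasted straight into the retro document.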
Incident Management Operations
Now that we have our leader, we know how we’ll coordinate resources, and we know how we’ll update stakeholders (see Part 1), we should talk about the operational aspects of incident management.
Let’s revisit our goals: resolve the incident quickly and safely.
Every company is unique in terms of what “quickly” and “safely” mean. The ratio of importance between those two depends on the company as well. For some groups, like social game companies, getting quick resolution is the most important because downtime is incredibly costly. For other organizations, like financial institutions, safety may be more important.
Resolving problems centers around three aspects:
Understand the problem
Identify possible solutions
Implement a solution
You’ll have loads of time to dissect what allowed the issue to happen, why other secondary problems occurred, and so on. For now, we need to understand the issue well enough to choose a solution that will resolve the problem without causing more problems. You also need to select a solution that can be reasonably implemented. The perfect solution often cannot be put in place quickly enough, so you’ll need a stop-gap to buy enough time.
How do we accomplish those aspects?
First we need to understand the issue. That requires information about the problem and the systems. This is often from a combination of observability tooling (metrics and logs) and user reports. Make sure your systems provide this information or you’ll be debugging in the dark, which isn’t fun.
Using that data, we determine which team or teams might need to be engaged. You probably will not get this perfect, but that’s OK. Pull in resources from the teams most likely involved and they can help determine who else might be required. You need this group to be small enough for people to freely discuss the situation, but large enough to have the resources you need. I suggest that as people are no longer actively engaged, you ask them to drop off the call but remain available if needed (on standby).
Once you have the resources, they need to be able to communicate, have access to information, have space to develop ideas and theories and test hypotheses, and ultimately implement the solution and monitor the situation. Establish either a call or a chat room for them to discuss the situation and share ideas. If you opt for a call, be sure someone is taking notes or transcribing the high points to a chat room for passive observers.
Setting up a chat room for the broader team and interested parties can be helpful so they can follow along and contribute where possible. I like to have someone summarizing the high points of the call to that channel. That will enable a broader team to contribute ideas and help out where possible. It will also help them stay generally informed.
Throughout this process, you should be sending communications to your stakeholders as previously discussed.
I have found this two-tier structure to work well. It allows the team actively working to focus, and it provides a means to keep a larger support team informed and ready to assist. It is completely reasonable and acceptable for sub-groups to split out for more focused work provided they keep the incident commander informed of what’s going on. They might do that via the larger chat room or by re-joining the active team call.
The incident commander is responsible for ensuring communications happen, but that doesn’t mean they have to author all of them. The larger chat room can be a great resource to get help drafting and proof-reading the updates.
Once a solution has been identified, I suggest posting it to the larger chat room to make sure that no one has serious concerns. This acts like a code review. Since these resources were not directly involved in developing the idea, they might spot issues from angles that others missed.
One of the most important considerations for your solution should be how you will determine whether it was successful and how you will monitor the situation going forward. Be sure that you have answers for both of those questions before you apply it.
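One way to make "was it successful?" concrete is to write the success criterion down before applying the fix. The sketch below, in Python, is an illustration under stated assumptions: it assumes success means a metric (say, an error rate) staying at or below a threshold for the most recent few samples. The names SuccessCriterion and fix_looks_successful, and the example numbers, are hypothetical. Where the samples come from depends entirely on your own observability tooling.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SuccessCriterion:
    metric: str               # label only, e.g. "error_rate"
    threshold: float          # a sample is healthy if it is <= this value
    consecutive_samples: int  # how many recent healthy samples we require

    def __post_init__(self):
        if self.consecutive_samples < 1:
            raise ValueError("consecutive_samples must be at least 1")


def fix_looks_successful(samples: Sequence[float], criterion: SuccessCriterion) -> bool:
    """Return True only if the most recent N samples are all healthy.

    Looking at the tail of the series, rather than any healthy streak,
    means a metric that recovered and then regressed does not count.
    """
    recent = list(samples)[-criterion.consecutive_samples:]
    if len(recent) < criterion.consecutive_samples:
        return False  # not enough data yet to call it
    return all(value <= criterion.threshold for value in recent)


if __name__ == "__main__":
    criterion = SuccessCriterion("error_rate", threshold=0.01, consecutive_samples=3)
    print(fix_looks_successful([0.20, 0.005, 0.004, 0.003], criterion))
```

The same criterion can keep running after the incident is declared resolved, which covers the second question: how to monitor the situation going forward.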
Apply the solution and monitor. Remember, if the solution carries serious risks and your team has been working feverishly for many hours, it is sometimes wise to get some rest before applying the solution.
Note on Process
If applying your solution requires bypassing all of your normal processes and safeguards, I would — in the strongest possible terms — suggest you review your processes very critically. Your normal processes should allow you to work quickly and safely. If they don’t, your processes are broken and ineffective when you need them the most.
In the midst of an incident is the worst possible time to take shortcuts and bypass safeguards such as code reviews. I would suggest pair coding any time you’re making changes to infrastructure or running scripts in production environments. You’re moving rapidly, under pressure, in a situation where you’ve had limited time to consider impacts. Don’t take shortcuts and don’t be reckless with your implementation.
Incident Retros
In the final part of this series, we’ll talk about the most important thing after communications: incident retros.
Real Kinetic helps companies develop effective incident management programs. Learn more about working with us.