What IT incident response can learn from emergency services: Incident Management Structure

Edmund Hintz
6 min read · Apr 20, 2021


In the previous article we saw a standard Fire and Emergency New Zealand (FENZ) incident response to a small-scale and easily resolved incident. But let’s choose the other path on our wee adventure. On the way to that trash can on fire, the radio advises that they’re now receiving multiple calls of a structure fire, persons reported (meaning we may have a rescue on our hands). In the distance there’s a hefty plume of smoke. Right here we’re probably escalating to 2nd alarm while still on the way; that’s going to immediately tip out 5 more trucks. On arrival we have bystanders reporting sounds of fighting and a family member absconding from the scene. So now we need ambulance and police, and due to the fire going lickety-split we’re gonna escalate to 3rd alarm (2 more trucks). We had 2 trucks on the first turnout, so with the escalation to 3rd we now have 9 trucks on the way, along with an ambulance or two and who knows how many police. This is where large-scale incident management comes into its own.

Going with our storage analogy, let’s say that we’re now getting reports of timeouts on our service, mixed with error messages. When the load balancers shifted traffic off the problem area, it completely overwhelmed and crashed the remaining resources, and then the load balancer itself lost the plot because everything else is bork3n. Now we need to call in vendor support to help diagnose and recover. It’s a big shambles: customers are screaming, C-levels are asking questions, and social media is losing its mind. It’s game on.

So how are we gonna manage all this chaos? CIMS is our friend (and a friend to our ambo and police pals that are on the way too).

About CIMS

New Zealand has implemented the Coordinated Incident Management System (CIMS). This system is derived from California’s Incident Command System (ICS) and its further derivative, the Australasian Inter-Service Incident Management System (AIIMS). All participating agencies (Police, Ambulance, Civil Defence, Fire, Military, regional/local councils, etc.) are trained in the structure, so they share a common understanding and framework when incidents occur and they come together to resolve them. The full 3rd edition of CIMS from the NZ Government is linked at the conclusion of this document.

The entire intent of CIMS is to be modular and scalable. Take the principles and apply them to your own incident types, be they Covid-19 (health sector), earthquake (Police/USAR), fire (dur), oil spill (environmental authorities), or an Amazon US East 1-b failure (IT).

The CIMS structure is expressly designed to scale. At a small incident the command responsibilities may all be managed by a single person, but the same structure can scale out to a national incident with multiple agencies. From CIMS 3rd Edition:

CIMS 3rd Edition page 5

This model has obvious application for an IT organization, where incidents can range from small issues contained to a single product all the way out to full-scale responses involving multiple teams, workstreams and external vendors/products.
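To make the scaling idea concrete, here is a minimal sketch of one way an IT team might model it. This is not taken from CIMS itself: the `Incident` class, its methods, and the people named are all hypothetical, and the function list is limited to functions mentioned in this article (the full set is in CIMS 3rd Edition). The point it illustrates is that every function always has an owner, and at a small incident that owner is simply the controller.

```python
from dataclasses import dataclass, field

# Only functions named in this article; the full list lives in CIMS 3rd Edition.
FUNCTIONS = ("control", "operations", "welfare")


@dataclass
class Incident:
    """Hypothetical model: the controller owns every function until it is delegated."""
    name: str
    controller: str
    delegated: dict[str, str] = field(default_factory=dict)

    def delegate(self, function: str, person: str) -> None:
        if function not in FUNCTIONS:
            raise ValueError(f"unknown function: {function}")
        if function == "control":
            raise ValueError("control stays with the Incident Controller")
        self.delegated[function] = person

    def owner(self, function: str) -> str:
        if function not in FUNCTIONS:
            raise ValueError(f"unknown function: {function}")
        # Anything not explicitly delegated falls back to the controller.
        return self.delegated.get(function, self.controller)


# Small incident: one person wears every hat.
small = Incident("disk alert on one node", controller="alice")
print({f: small.owner(f) for f in FUNCTIONS})

# Same structure, scaled up: functions are handed off as the incident grows.
big = Incident("regional storage outage", controller="alice")
big.delegate("operations", "bob")
big.delegate("welfare", "carol")
print({f: big.owner(f) for f in FUNCTIONS})
```

The design choice worth noting is the fallback in `owner`: scaling up only ever adds delegations, so there is never a moment where a function has nobody responsible for it.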

More than just response

It’s important to note that while response is the portion which is most visible, a key part of CIMS is preparedness. Again, from CIMS 3rd Edition:

CIMS 3rd Edition page 106

As the saying goes, better a fence at the top of the cliff than an ambulance at the bottom. By applying risk reduction, and being ready to respond if it fails, a large amount of the potential damage can be mitigated before things have gone pear-shaped, and this also sets up a faster recovery.

Some background information/insight

Because CIMS is designed for interagency coordination, some of its issues and terminology are of less concern in the context of IT, but they are still related and important to understand.

Coordination: As this structure is intended for common understanding between multiple emergency services, this takes on significant import in that context. There are of course parallels with the IT model, specifically various teams, reporting lines, product lines, etc. Like the emergency services, we function best when we clearly understand who is responsible for what, and what their deliverables are.

Command: While there are parallels, these are perhaps less likely to apply except in very sizeable incidents. A key element is authority, which applies vertically within agencies; for example, Police do not generally have legislative authority to order another agency’s personnel to accomplish a task. In a large-scale incident the management team would define which agency has such responsibility, and the application of operations by this agency would cascade down vertically from the persons representing the agency at the incident control layer. In a corporate context this might be analogous to a GM coordinating various teams within their portfolio.

Control: This is the previously referred to application of operations. At the Incident Control layer the actual Incident Controller will have designated leadership authority. Participating agencies have a representative at the control layer, and tasks delegated to an agency progress via this representation. Control does not imply command; command authority of the agency remains vertical as noted above. These principles of command and control can be applied within agencies if required for scaling appropriately, all the way down to team level.

Again, as multiple agencies can and will be involved, the terms “Lead Agency” and “Support Agency” take on extra importance. In the context of emergency services this is usually defined by legislative authority. In the context of IT it could be defined by the primary product/fault being responded to, with the team/group responsible for this product as the lead, and other teams/groups as supporting. However, it likely has less importance as there are no legal implications, and an organic approach may be sufficient.
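If you did want to make the lead/support split explicit rather than organic, a rough sketch might look like the following. The product names, team names, and the `assign_roles` helper are all invented for illustration; the only idea borrowed from above is that whoever owns the primary faulting product leads, and everyone else involved supports.

```python
def assign_roles(
    primary_product: str,
    product_owners: dict[str, str],
    involved_teams: list[str],
) -> tuple[str, list[str]]:
    """Return (lead_team, support_teams) for an incident.

    Assumption: the team that owns the primary faulting product is the lead;
    every other involved team is a supporting team.
    """
    lead = product_owners[primary_product]
    support = sorted(team for team in set(involved_teams) if team != lead)
    return lead, support


# Hypothetical ownership map and incident participants.
owners = {"object-storage": "storage-team", "load-balancer": "network-team"}
lead, support = assign_roles(
    "object-storage",
    owners,
    ["network-team", "storage-team", "vendor-support"],
)
print(f"lead: {lead}, support: {support}")
```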

Response Management

CIMS functions

The CIMS functions have obvious parallels to an IT response, and the definitions should be fairly self-explanatory. From CIMS 3rd edition:

CIMS 3rd edition page 84
CIMS 3rd edition page 36

CIMS 3rd edition documents these functions in exhaustive detail on pages 35 through 68, so for detailed guidance on each function it’s best to simply refer to the source.

It’s worth calling out the Welfare function. As large incidents carry on for longer periods of time, an important command consideration is the welfare of the individuals involved, particularly those who were first responders. In a longer-duration incident it’s highly advisable to bring in relief personnel when possible; this is easily and often overlooked. This is why having a designated Welfare command is desirable, as they will be cognizant of these needs and can help increase awareness of them for the IC.
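As a small illustration of what a Welfare lead might automate in an IT incident, here is a sketch that flags responders who have been on the incident past a chosen limit. The 8-hour threshold, the `needs_relief` helper, and the names are assumptions made for the example, not anything prescribed by CIMS.

```python
from datetime import datetime, timedelta

# Assumed threshold for illustration only; pick whatever suits your team.
MAX_SHIFT = timedelta(hours=8)


def needs_relief(
    on_since: dict[str, datetime],
    now: datetime,
    max_shift: timedelta = MAX_SHIFT,
) -> list[str]:
    """Return responders who have been on the incident for max_shift or longer."""
    return sorted(
        person for person, start in on_since.items() if now - start >= max_shift
    )


# Hypothetical responders and the time each one joined the incident.
joined = {
    "alice": datetime(2021, 4, 20, 1, 0),   # first responder
    "bob": datetime(2021, 4, 20, 7, 30),
}
now = datetime(2021, 4, 20, 10, 0)
print(needs_relief(joined, now))
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the check easy to test and to replay after the incident.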

Incident Management Team

In addition to the core CIMS functions described above, the controller may wish to bring in experts in other functions to assist in incident management. These would be in the form of advisors, and may or may not be members of other agencies involved. Examples might be subject matter experts, representatives of external vendors, etc. While they are part of the decision making process they are not necessarily part of the command chain, depending on the context.

The incident itself is an entity to support the process of identifying, mitigating, and resolving a problem (in the context of IT, presumably technical), and the related requirements. The actual operational activity, from troubleshooting to resolution, is carried out by the Operations function (and if scaling up, the resulting Sector commands) and the team(s) working with them. The primary responsibility of an Incident Controller is managing all of the other incident activity (such as communications, escalations, status updates, resourcing, etc), allowing Operations to focus on the problem at hand.
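The split described above can be sketched as a simple routing rule. The activity names come from the paragraph above; the `route` function and the default for unlisted activities are my own assumptions for the example.

```python
# Activities the article assigns to each role.
OPERATIONS_ACTIVITIES = {"troubleshooting", "resolution"}
CONTROLLER_ACTIVITIES = {"communications", "escalations", "status updates", "resourcing"}


def route(activity: str) -> str:
    """Decide who handles an incoming piece of incident activity.

    Assumption: anything not explicitly operational goes to the Incident
    Controller, so that Operations is never interrupted by default.
    """
    if activity in OPERATIONS_ACTIVITIES:
        return "Operations"
    return "Incident Controller"


for activity in ("troubleshooting", "status updates", "vendor contract query"):
    print(f"{activity} -> {route(activity)}")
```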

So, that’s a good overview of the Response function and the Incident Command structures. Next up, I’ll talk about specific Operational Response tactics which may be useful in the IT context.

https://edmund-hintz.medium.com/what-it-incident-response-can-learn-from-emergency-services-operational-response-2773ac9cd92b


Written by Edmund Hintz

25 yr IT vet. Mostly unix systems administration, network administration, and operations management. Also a Station Officer with Plimmerton Volunteer Fire.
