What IT incident response can learn from emergency services: An introduction

Edmund Hintz
6 min read · Apr 20, 2021


I’ve spent about 25 years in the trenches of IT, both in the US and in New Zealand after emigrating in 2003. So, I’ve been there on many occasions when everything was figuratively on fire.

I’ve also been a volunteer firefighter with Plimmerton Volunteer Fire Brigade (Fire and Emergency New Zealand (FENZ), and New Zealand Fire Service (NZFS) before that) for 17 years. So, I’ve been there on many occasions when everything was, quite literally and rather impressively, on fire.

As if getting rousted at 0-dark-30 for a living to deal with bork3n computers wasn’t enough, I signed up to do it for fun and relaxation, in a setting where, if you get it wrong, people (including yourself or your close comrades) can get seriously injured or worse. I probably need my head examined… ¯\_(ツ)_/¯

The author, far right, compiling an extensive Dangerous Goods list from the driver’s manifest after a large ChemCouriers truck rollover, 0430 19 October 2018. Photo courtesy of Carl Mills.

The basics

On the IT side of the world, I figure we’ve gotten pretty good at the basics, but when things really start to scale out we tend to struggle a bit. There’s quite a lot we can learn from firefighting (and other emergency services), for whom incident response isn’t so much part of the job as it is THE job. As a result, the operational practices need to apply to anything from a trash can on fire right up to national-level disasters. Internationally, emergency services have been working on this problem, and getting better at it, for decades (arguably even centuries). So, I’m writing this series of articles to explain how emergency services go about their business, with the idea that we in IT can take something from the experience and apply it to our processes. Specifically, this is from the context of fire, since that’s my particular window of visibility. But as we’ll see later on, our incident management/command and control follows a common structure shared by the likes of Police, Ambulance, Civil Defense (spelled ‘murican style), coastguard, SAR, military, etc.

There’s a lot of ground to cover, and those at different levels of the game will be interested in different parts, so don’t feel as if you need to understand all of it on the first go. Take the bits that resonate, and run with them. To begin with, I’ll be describing the very basic process of initial incident response as an urban fire crew, as well as the background structure necessary for context. While I believe we in IT are for the most part pretty solid in the introductory phases (in the case of smaller incidents, it doesn’t get past that point), I’ll be describing it primarily so that when we start to scale out there’s a common understanding of where we started.

Rank structure

Fire tends to have a slightly military organization (particularly around rank). The skillsets of individuals involved in response will have relevant ranks; it’s considerably more structured than your average IT shop. The relevant ones to this series:

Senior Firefighter: As the title implies, there’s a fair bit of experience. As part of SFF coursework there’s a section on basic response command principles and incident management; it’s not comprehensive, but it’s enough that they should at least be able to hold the fort until the cavalry arrives. At a certain level of experience and training an SFF can be deemed capable of filling the role of Officer In Charge (OIC) of an appliance (that’s our fancy word for fire truck), in the event that someone holding the rank of Station Officer or higher is not available. Note as well that it’s not uncommon for an SFF to have completed the coursework for SO and be qualified for promotion, but still be waiting for an opening.

Station Officer/Senior Station Officer: This is the rank at which riding OIC is the primary responsibility, and is the first level of command at an incident.

Executive Officer: In the context of volunteers, this will be Deputy Chief or Chief Fire Officer. On the career side this will be the likes of Area Commander, Regional Commander, all the way up to National Commander.

It Begins

So let’s start at the very beginning (a very good place to start). An incident has been logged, it sounds small-scale, say a trash can on fire, and it’s time to respond. We’re sending out a single fire appliance, with either a qualified SFF or an SO riding as OIC. On arrival, we’re using the mnemonic RECEO to establish our incident structure and tactics. Borrowing from ‘FENZ M1 Command and Control technical manual’:

FENZ M1 Command and Control technical manual page 51

This is how we establish our initial strategy and tactics. The technical manual offers much more detail around this, but from a condensed IT perspective I’ll note that this is a simple overview of the steps necessary to bring the incident under control, and thence to closure. An IT parallel might be something like investigate, identify, workaround/mitigate, resolve, monitor. Let’s say for the sake of this discussion that storage is at critical level, an IT equivalent of a trash can on fire. Needs to be dealt to, but unless things go horribly wrong, and provided nothing else has happened, it’s probably a small-scale, easily resolved incident. The sort of thing we’re all pretty comfortable with and tend to deal to reasonably effectively with minimal stress.

So, on arrival our OIC checks for Risks (could happen, for instance maybe our trash fire is part of civil unrest). I’m hard pressed to think of an instance where IT response risks injury, but maybe if you got to a physical rack (anybody remember those?) and found a live AC wire or something…

Anyway, on to Exposures: we don’t want that trash can to light the adjacent building on fire, so if that’s a thing we’ll mitigate it before we move on. For our IT comparison, we’re trying to verify that our storage issue is in fact all we’re facing, rather than a symptom of something bigger, or something threatening to become bigger.

Containment of our fire is easy: it’s already contained in that trash bin. For our IT equivalent we’ll say we’re migrating traffic to a different storage cluster until we can solve the problem. Or stopping a runaway process that’s writing logs at wild rates, whatever.

Next, Extinguishment (that’s putting the wet stuff on the red stuff). In our IT analogy maybe we’ll dump some unused data from the storage, expand our capacity, or fix whatever it is that’s causing unusual writes in the first place.

And finally the Overhaul phase. Maybe dig around in the bin a tad, make sure there isn’t something down there smouldering that’s gonna light back up in a few hours. With our IT example perhaps this would be poking around related storage systems and looking for others that may be getting close to the edge…
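To make the walkthrough above a little more concrete, here is a minimal Python sketch of the five phases as an ordered checklist, paired with the storage example used in this article. It is only an illustration: the `Phase` class, the `overhaul_candidates` helper, the 85% threshold, and the volume names are all hypothetical and do not come from the FENZ manual or any real tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str         # RECEO phase as named in this article
    it_parallel: str  # the storage-incident example used above


# Kept in mnemonic order, the order the walkthrough above follows.
RECEO = (
    Phase("Risks", "check for anything that could hurt the responders"),
    Phase("Exposures", "confirm the storage issue is not a symptom of something bigger"),
    Phase("Containment", "migrate traffic away, or stop the runaway writer"),
    Phase("Extinguishment", "free space, expand capacity, or fix the cause of the writes"),
    Phase("Overhaul", "look for related storage that is close to the edge"),
)


def overhaul_candidates(used_fraction, threshold=0.85):
    """Return (volume, used fraction) pairs at or above the threshold, fullest first.

    The threshold is an assumed value; pick whatever suits your environment.
    """
    hot = [(name, frac) for name, frac in used_fraction.items() if frac >= threshold]
    return sorted(hot, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for step, phase in enumerate(RECEO, start=1):
        print(f"{step}. {phase.name}: {phase.it_parallel}")
    sample = {"vol-a": 0.97, "vol-b": 0.40, "vol-c": 0.88}  # made-up figures
    print(overhaul_candidates(sample))
```

The tuple keeps the phases in a fixed order so a responder (or a runbook) can step through them one at a time, and the helper is the Overhaul idea in miniature: once the original problem is out, go looking for the next thing that might flare up.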

So that’s the very basics of the fire response, and an example of how it can relate to an IT perspective. Yup, nothing particularly noteworthy so far. But now that we’ve established the groundwork, let’s take the other path on this pick-your-own-adventure. In the next post, we’ll see that initial small response scale out to a reasonably large multi-agency event, and I’ll explain the foundations of the incident management command and control structure and the principles that apply to the resulting operation.

https://edmund-hintz.medium.com/what-it-incident-response-can-learn-from-emergency-services-incident-management-structure-8135a53a4fb2


Written by Edmund Hintz

25 yr IT vet. Mostly unix systems administration, network administration, and operations management. Also a Station Officer with Plimmerton Volunteer Fire.
