What IT incident response can learn from emergency services: Operational Response
So in the previous pieces I laid out the background around how Fire and Emergency New Zealand (FENZ) incident response and management are done, and drew some parallels to an IT context. This piece is intended to tie together some of the concepts and show the value of their application in that IT context.
Rank
Firstly, rank. As previously discussed, FENZ (like other fire, police, and similar services) has a structure reminiscent of the military, and a large part of that is rank. This is quite useful, in that rank is immediately identifiable from the uniform, and in turn there is a very reasonable expectation of the skillset that particular individual can bring to the incident.
A Senior Firefighter (SFF) would be somewhat equivalent to a journeyman-level IT staffer: someone who’s stepping up in responsibility levels, and capable of managing small incidents on their own. SFFs often enroll in the Station Officer (SO) course, and upon completion may stay at their SFF rank for some time, but they have the required training and can ride as Officer In Charge (OIC) of the appliance in the absence of a Station Officer. They’ll have had a smidgeon of CIMS training as an SFF, and more if they’ve qualified as an SO, so they’ll understand what’s going on, but probably won’t be particularly comfortable taking a role on the incident management team of a large complex job. Likewise, your journeyman IT staffer can probably handle the simple incidents but may be a bit apprehensive on larger incidents, and benefit from some support.
The Station Officer is analogous to a Senior Engineer, Principal Engineer, Architect, that sort of level. SOs will routinely ride as officer in charge of an appliance, they’ll have applied CIMS concepts on a lot of smaller and medium scale jobs, and can take on specific roles within the incident management team of larger incidents. They’ve got some good experience, and like in the IT context they’ll quite often tend to get stuck in at the sharp end in roles like sector command.
The Executive Officer has undergone significantly more CIMS and incident management training, and will tend to be perfectly comfortable in a primary CIMS function role. This is much like an IT General Manager, Executive General Manager, etc. And much like a large scale IT incident, they will often arrive after the initial response has assessed the scale of the incident and established strategy/tactics, receive a briefing, and assume command.
Training
A notable difference between what I’ve experienced in my IT life vs FENZ is training. In IT, certainly in my career, there’s quite often been very little if any training on how to manage incidents. Most of my experience came from actually responding to incidents, and observing those around me doing the same; the more senior folk had more experience, and I learned from that. By contrast, with NZFS/FENZ, I received quite extensive training, with live action simulations, role playing, proper multi-day courses at the training centers, etc. It’s worth noting here, I’ve been getting paid reasonable sums of cash for my IT work, and I get zilch as a volunteer firefighter. Yet, my training as a volunteer is orders of magnitude better than that in my professional life. This is one key area where I feel IT can dramatically up our game. Running training courses where we teach the theory to our people, and giving them simulated opportunities to role-play various scenarios then run debriefs would pay quite handsome dividends later when the excrement comes into contact with the wind generation device.
Of course I had some nerves the first time I managed a major IT incident, and likewise the first time I managed a FENZ one. The difference is, as a firefighter I was considerably better prepared for it, had a higher level of confidence, and spent a lot less time improvising. There’s a lot of psychological safety to be found in adequate training and preparation, which in turn leads to a more measured response with a higher probability of accurate and comprehensive resolution. And, with an excellent syllabus comes a consistent understanding and expectation of the various levels of skill; there’s no guesswork, I know when I have an SFF on the fireground that I can likely slot them in to a sector command role and they’ll both understand the responsibilities and also how they fit within the larger scale operation. Likewise a Station Officer can be relied on to run a CIMS command role. In the IT context I’ve usually found this is much more about an interpersonal relationship; a Sr Dev in one team may have wildly different skills and experience than one in another, and I can’t really know what to expect from them if I haven’t come across them at a previous incident. Nor do I know what roles they may or may not be familiar with and comfortable assuming.
And, of course, when an Area Commander wanders on to a fireground, I know immediately that person has significant experience, and should perform well in a command role (or in coaching someone stepping up into that command role). I can’t say I’ve found the same consistency with GMs in the IT world; most of them did have that experience, but some didn’t, and in the absence of any training those that didn’t had an unfortunate tendency to create unhelpful noise and chaos rather than facilitate order and resolution.
Response Process
In FENZ, our response process is a well defined cycle… Gather information, analyze and plan backwards, prioritize actions, develop strategy/tactics, resource strategy/tactics, task operations, document the action plan, then review the results and loop back to gathering information. A graphic illustration of this, from the command and control tech manual:
If you’ve been around IT for a while, you may note that this is conceptually quite similar to the SDLC model; that was certainly my first reaction. Obviously it’s quite applicable to the IT incident response process too, and with a reasonable understanding of the SDLC this process is pretty immediately comprehensible.
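For the IT-minded, here is a minimal sketch of that cycle as code. The step names come straight from the cycle described above; the `Step`, `next_step`, and `run_cycle` names are my own inventions for illustration and are not from any FENZ material. The only point it makes is that the last step feeds back into the first.

```python
from enum import Enum


class Step(Enum):
    """The eight steps of the response cycle, in the order listed above."""
    GATHER_INFORMATION = "gather information"
    ANALYZE_AND_PLAN = "analyze and plan backwards"
    PRIORITIZE_ACTIONS = "prioritize actions"
    DEVELOP_STRATEGY_TACTICS = "develop strategy/tactics"
    RESOURCE_STRATEGY_TACTICS = "resource strategy/tactics"
    TASK_OPERATIONS = "task operations"
    DOCUMENT_ACTION_PLAN = "document the action plan"
    REVIEW_RESULTS = "review the results"


def next_step(current: Step) -> Step:
    """Return the step after `current`, wrapping from review back to gathering."""
    steps = list(Step)
    return steps[(steps.index(current) + 1) % len(steps)]


def run_cycle(start: Step, count: int) -> list[Step]:
    """Walk `count` steps of the cycle, beginning at `start`."""
    walked = []
    current = start
    for _ in range(count):
        walked.append(current)
        current = next_step(current)
    return walked


if __name__ == "__main__":
    # Nine steps: one full lap plus the return to information gathering.
    for step in run_cycle(Step.GATHER_INFORMATION, 9):
        print(step.value)
```

In a real incident you obviously wouldn't march through these mechanically, but having the loop written down somewhere your responders can see it is half the battle.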
As described in the previous blog posts, as an officer I’m using the RECEO mnemonic as a key driver to formulate my action plan. This is where the training is a great thing; there’s a model for me to follow, which I understand well due to my training. Also, due to my training, resourcing has been well and truly beaten into me. We emphasize asking for help early and often; there’s no shame in raising the alarm early and getting those resources on the way, and nobody is going to criticize me if it turns out I don’t need everything I asked for (that would be, again, psychological safety in action). In the fire services, you might say one of our mottos is “better to have it and not need it, than to need it and not have it”.
Indeed, one afternoon at a call I observed the officer of another appliance engaging with some enthusiastic and curious kids… He explained to them that the Big Red Truck was basically a toolbox on wheels, and that’s a great way to describe it. We’ve thought about the needs beforehand, equipped ourselves with a collection of tools we think we may need, and we’ve trained all of our people in a consistent and controlled fashion such that we’re all generally on the same page and working together. Once again, this is a space where I can see significant application within the IT context.
So back to my response as an officer… After RECEO, as part of developing my action plan for response, there’s a nice concisely defined hierarchy:
It’s a straightforward enough process: aim, then strategy, then tactics, then operations. As examples, with the FENZ version first and the IT version second, the aim might be ‘resume life as usual’ or ‘get the systems back to normal’. Strategy would be ‘extinguish the fire in that building’, or ‘find and eliminate the source of the errors’. Tactics might be ‘internal fire attack, search and rescue’ or ‘migrate services to unaffected nodes while mitigating’. And lastly, ops is turning those tactics into action, tasking individuals/teams to take on responsibility for the various requirements. All the while we continue to loop back and reevaluate for new information or conditions which may alter our decisions.
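To make that hierarchy concrete for the IT crowd, here is a rough sketch of it as a data structure. The aim, strategy, and first tactic are the IT examples from the paragraph above; the class names, the second tactic, the task description, and the team name are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Ops level: one piece of work tasked to a person or team."""
    description: str
    assigned_to: str


@dataclass
class Tactic:
    """A tactic, plus the operational tasks that turn it into action."""
    description: str
    tasks: list[Task] = field(default_factory=list)


@dataclass
class ActionPlan:
    """Aim -> strategy -> tactics -> ops, top to bottom."""
    aim: str
    strategy: str
    tactics: list[Tactic] = field(default_factory=list)

    def untasked_tactics(self) -> list[str]:
        """Tactics nobody has been tasked with yet, i.e. gaps in the plan."""
        return [t.description for t in self.tactics if not t.tasks]


def example_plan() -> ActionPlan:
    """Build the IT example; the second tactic and the tasking are invented."""
    migrate = Tactic("migrate services to unaffected nodes while mitigating")
    migrate.tasks.append(Task("drain traffic from affected nodes", "platform team"))
    comms = Tactic("keep customers informed")  # hypothetical, not yet tasked
    return ActionPlan(
        aim="get the systems back to normal",
        strategy="find and eliminate the source of the errors",
        tactics=[migrate, comms],
    )


if __name__ == "__main__":
    print(example_plan().untasked_tactics())
```

The useful bit is `untasked_tactics`: a tactic with nobody assigned to it is just a wish, and each time you loop back to reevaluate is a good moment to check for those.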
Incident Management Tools
Along with all that aforementioned training, we have tools… One of my favorites as an officer is the M1 FN Command and control field notebook. It’s basically just the same 8 pages of content, repeated over and over again, and it’s standard issue on every truck. It both acts as a reminder of what needs to be done, and an opportunity to document what’s actually happened. One reason I find this particularly helpful is due to my role as a volunteer (not dissimilar to my role as an incident responder in IT, actually). It’s not my full time job, I do it when bad things happen and I need to respond to them. Yeah, I’ve had the training, but other than the yearly reinforcement scenarios, it may have been as much as a few years since the last time I had to do this at scale in anger; having this tool is amazingly helpful in guiding my thoughts and remembering my training. Examples of an action plan, operational tasking, and the CIMS structure in the field notebook:
One of the key things this structure defines is ‘sectorization’. Having identified our aim and strategy, we’ve established the tactics and are starting to break the work down into logical partitions, and assigning them to specific persons who are managing the teams to accomplish the goals (SO Brown and Green, above). In a fire context sectors are quite often geographical; Sector 1 is the front and we go clockwise from there (2 is left, 3 is rear, 4 is right). But, they can also be functional, and in the IT context that would be the norm. For example, perhaps one sector would be database, another might be network/traffic manipulation, customer comms, perhaps there’s a security sector… Point being the job has been broken into manageable chunks, someone has been appointed to manage each of those chunks, and the Incident Controller is focusing on keeping it all going, dealing with different teams or external pressures, and generally ensuring that the requirements of the various teams are met.
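Here is a small sketch of what functional sectorization could look like in an IT incident tool. The sector names are the examples from the paragraph above; the `Sector` and `Incident` classes and the people's names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sector:
    """One manageable chunk of the job, with at most one person running it."""
    name: str
    commander: Optional[str] = None


@dataclass
class Incident:
    """The Incident Controller plus the sectors the job has been split into."""
    controller: str
    sectors: dict[str, Sector] = field(default_factory=dict)

    def add_sector(self, name: str) -> None:
        """Carve out a new sector; it starts with nobody in charge."""
        self.sectors[name] = Sector(name)

    def assign(self, name: str, commander: str) -> None:
        """Appoint someone to manage an existing sector."""
        if name not in self.sectors:
            raise KeyError(f"unknown sector: {name}")
        self.sectors[name].commander = commander

    def unmanaged(self) -> list[str]:
        """Sectors that still need someone appointed to run them."""
        return [s.name for s in self.sectors.values() if s.commander is None]


if __name__ == "__main__":
    incident = Incident(controller="Alex")  # hypothetical name
    for name in ("database", "network", "customer comms", "security"):
        incident.add_sector(name)
    incident.assign("database", "Sam")  # hypothetical name
    print(incident.unmanaged())
```

Note that the controller doesn't appear as a sector commander anywhere; their job is keeping the whole thing going, not running one of the chunks.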
But wait, there’s more! Having started this process out with my field notebook, and also having escalated for help early when it looked like I was going to need it, some 30-odd minutes later I’m going to have yet another tool delivered to me, Wellington 2118, one of the many regional command units.
This beauty is basically a giant mobile office (there are more interior photos if you follow the 111 link); it’s loaded to the gills with all manner of communications, staffed by full-timers very familiar with operating all the bells and whistles, and it serves as our ‘war room’ at a job. This is a physical manifestation of the level of preparedness and resource that we’ve set up to help our people succeed. Let’s face it, in IT the vast majority of us aren’t anywhere close to this level of support for our people. Inside of 2118, among other things, are whiteboards which are ready and waiting for the incident management team to jump in and get to work with; in the photo below it’s that command structure from the notebook:
This photo is from a training exercise in which numerous local brigades (rural and urban) ran an incident simulation: our training in action. And, as part of my Executive Officer course at the National Training Centre, we spent several days running simulated incidents with an indoor mockup of a command unit. We’d take turns roleplaying with simulations and the instructors acting as external agencies (ambulance, power, police, etc). The simulations are designed to escalate and require a lot of resources; when you’re in the hot seat as first responding officer you then call out for help and escalate, and the rest of the students ‘arrive’ with their ‘fire crews’, and are tasked into operational roles within the CIMS structure, while you practice the incident controller role.
These are examples in action of the training breadth and depth that we’re providing to our people (ones that don’t even get paid), and something I could envision us doing a whole lot more of in the IT world. Naturally we’d be doing things differently; we wouldn’t have a great big red truck for starters. But we could have pre-made templates in online collaborative platforms, ready for our IT responders, and we could run a few exercises using those resources in order to get folks familiar and comfortable with them. Of course this would require buy-in and support from the business, which quite often may not immediately understand the value, but it’s definitely an important lesson we can get for free from the fireys, if we want it.
Training Investment
I’ve mentioned this several times, but it’s worthy of a special note. We invest heavily in training our people, far more than I’ve ever seen in an IT context. Notably:
- National Training Centre Rotorua (approx 10mil NZD)
- Incident control/management training from SFF onwards
- Simulations. Lots and lots of simulations. Rly, lots
- Weekly training at brigade level (did I mention simulations?)
That training is invaluable when the unexpected day arrives. In the first post of this series there’s a photo of yours truly, 19 Oct 2018. About 0430 we were tipped out to a rolled big-rig loaded with hazardous materials, second appliance arriving about 30 seconds behind the rescue tender. These processes rolled out just as you’d expect: escalated alarms, command unit, Assistant Area Commander (the executive officer on call). And, both myself and the Chief Fire Officer of Plimmerton ended up on the incident management team. I was perfectly comfortable stepping into a management team role at an incident of national significance because of all that training; at the equivalent point of my IT career I would not have been at all comfortable stepping into an incident of this magnitude due to the lack of preparation (I had to attend a lot of IT incidents, not to mention a few fires and fire simulations, to pick up that experience and confidence).
Likewise, some years earlier, I was a relatively green first responding officer to a 3rd alarm garage fire. Ran my RECEO, laid out my strategy and tactics, got the crew to work before the cavalry arrived. Pretty much nailed it (and the senior officers attending specifically noted this during the debriefs). Not because I’m some natural genius, but because my training had me primed and ready, so when the real deal came I was reasonably comfortable and ready for it.
That covers several of the operational and training thoughts. I’ll close out the series with a look at communications, the first ‘R’ (readiness), and the final ‘R’ (recovery).