What IT incident response can learn from emergency services: Communications, Recovery, and Readiness
In previous articles we’ve discussed basic incident response, CIMS command structures, and operational applications of firefighting concepts to IT. This final piece covers Communications, Recovery, and Readiness.
There are two areas of comms I thought useful to highlight here: one is the command handover, the other is incident ground communications.
Tying back to training: by policy, the arrival of a more senior officer at an incident does not automatically constitute a change of command. There are instances where the senior officer is required to take over; specifically, if they believe there is some sort of danger, they must assume command immediately. Likewise, if the junior officer asks them to take over, they must. Otherwise, our policies state that the senior officer should exercise restraint, discuss with the junior officer, and if the junior wishes to proceed, the senior will hang around and provide coaching and assistance as necessary. Obviously this practice assists in the development of officers.
Personally, I’ve been here. As I mentioned in the previous blog, fairly early on in my officer-ing I was first arrival at a 3rd alarm garage fire. I did all my strategy, tactics, and tasking, and we got the job underway. Then the Senior Station Officer from Porirua arrived, assessed the scene, and noted we should start standing up our incident management structure. He asked whether I’d like to take on incident command or stay on as operations sector command. Being fairly junior at the time, I opted to stick with operations, but the offer was quite valuable, and it also illustrated to me that I should probably pursue further work in command structures to be ready should it be necessary in the future.
It’s quite an important facility; it allows the individual to grow into the role when they’re ready, but also protects them from being thrown in the deep end when they aren’t (and thus avoids the resulting stress, PTSD, etc.).
However, let’s stipulate that there is going to be a command handover. We have a nice mnemonic for this, SHURTS. The example from the tech manual is fairly self-explanatory; I’d suggest it could be used as a model to develop something more apropos to IT. You can see how it gives a comprehensive overview of the situation, and having a standardized, complete format for the information exchange helps minimize the chances of miscommunication.
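As a sketch of what an IT-flavored handover format might look like, something like the following could be filled in before transferring command. Note the field names here are my own invention for illustration, not the FENZ SHURTS fields:

```python
from dataclasses import dataclass

@dataclass
class Handover:
    """Hypothetical structured handover for an IT incident.

    Field names are illustrative only, not the FENZ SHURTS mnemonic.
    The point is the same: a fixed, complete format so nothing is
    lost in the exchange between outgoing and incoming commanders.
    """
    situation: str        # what's broken, blast radius, customer impact
    actions_taken: str    # mitigations attempted so far, and their results
    risks: list           # known hazards: data loss, cascading failures, etc.
    resources: list       # people/teams engaged, and in which sectors
    next_steps: list      # planned tactics the incoming commander inherits

    def briefing(self) -> str:
        """Render the handover as a single message for the incident channel."""
        return "\n".join([
            f"SITUATION: {self.situation}",
            f"ACTIONS TAKEN: {self.actions_taken}",
            "RISKS: " + "; ".join(self.risks),
            "RESOURCES: " + "; ".join(self.resources),
            "NEXT STEPS: " + "; ".join(self.next_steps),
        ])

handover = Handover(
    situation="Checkout API 5xx rate at 40%, started 14:02 UTC",
    actions_taken="Rolled back v2.3.1; error rate unchanged",
    risks=["Possible DB connection pool exhaustion"],
    resources=["DB team engaged", "SRE on-call engaged"],
    next_steps=["Fail over primary DB", "Page network on-call if no change by 14:45"],
)
print(handover.briefing())
```

Posting the rendered briefing into the incident channel at the moment of handover also leaves a clean timestamped record for the postmortem.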
Incident ground communications
This is one of the key analogies I see as valuable for the IT contexts I’ve routinely witnessed. When big incidents happen, invariably a lot of folks start dropping into the relevant Slack channels. It can sometimes turn into quite a circus, with a lot of folks making suggestions, observations, etc. Not only does this hinder the performance of the incident controller and management team, it also makes the postmortem/discovery process after the fact quite challenging due to the poor signal-to-noise ratio.
Naturally we can have this problem on the fireground as well. Our solution is, surprise surprise, sectorization. Just like we sectorized our operations earlier in the piece, we sectorize our radio channels too. In the IT world, where most of us seem to be using Slack these days, the logical equivalent is to create channels within Slack, ideally with naming conventions that identify them as sectors of the main incident.
Our incident ground communications radios have around 15 pre-programmed channels, and the field notebook has a nice prepared template: just jot down which channels you assign and off you go. The colors indicate the different networks, and it’s easy to see the equivalent Slack (or other chat) channels, and their value. Note that the sector commanders sit on both networks; in the FENZ context this means carrying two radios, one tuned to each channel. The same logic applies to IT: in their sectorized Slack channel the individual acts as incident control, while in the main incident channel they act as sector command, reporting progress or issues to the rest of the incident management team.
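As an illustration of the naming-convention idea, a small helper like this could derive sector channel names from the parent incident channel (the scheme itself is my own invention, not a FENZ or Slack standard):

```python
import re

def sector_channel(incident: str, sector: str) -> str:
    """Derive a sector channel name from a parent incident channel.

    Mirrors fireground radio sectorization: one parent command channel,
    one child channel per sector, with the parent name embedded so the
    hierarchy is obvious at a glance in the channel list.

    e.g. "inc-2024-0117-checkout" + "Database"
         -> "inc-2024-0117-checkout-database"
    """
    # Slack-style channel names: lowercase, hyphen-separated
    slug = re.sub(r"[^a-z0-9]+", "-", sector.lower()).strip("-")
    return f"{incident}-{slug}"

print(sector_channel("inc-2024-0117-checkout", "Database"))
print(sector_channel("inc-2024-0117-checkout", "Front End / CDN"))
```

Anyone dropping in can then tell at a glance which channels are sectors of which incident, and follow the noise into the right room instead of the command channel.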
This is a discipline that I think IT tends to do relatively well. But it’s worth a quick look at some of the FENZ resources for some ideas.
Station Management System (SMS)
Firstly, we have a system known as SMS, which acts as a log of the job. On receipt of a job, Firecom will kick this off and dispatch us, and we’ll put in periodic reports via radio which Firecom logs. It’s sort of our version of a Slack incident channel, and it’s also a legal document: in the event that an incident becomes a crime scene, it could well be used as evidence in court (and the officer or others may find themselves called as witnesses). Obviously it’s also a useful document for post-incident analysis. A few of these have been released under Official Information Act requests (with various bits and pieces redacted for privacy reasons).
The correlation to a Slack channel in a postmortem is reasonably obvious; I simply figured it was worth showing for the sake of completeness.
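A minimal sketch of the same idea in IT terms: an append-only, centrally timestamped log that every periodic report goes into, so the postmortem has a clean timeline to work from. (This is illustrative; in practice a bot pinned to the incident channel usually plays this role.)

```python
from datetime import datetime, timezone

class IncidentLog:
    """Append-only incident log, loosely modelled on the FENZ SMS job log.

    Entries can be added but never edited or removed; that immutability
    is what makes the log trustworthy for after-the-fact analysis (or,
    in the fire service's case, usable as evidence).
    """

    def __init__(self, incident_id: str):
        self.incident_id = incident_id
        self._entries = []  # list of (timestamp, source, message) tuples

    def report(self, source: str, message: str) -> None:
        """Record a periodic report; timestamps are applied centrally."""
        self._entries.append((datetime.now(timezone.utc), source, message))

    def timeline(self):
        """Return an immutable copy of the entries for the postmortem."""
        return tuple(self._entries)

log = IncidentLog("inc-2024-0117-checkout")
log.report("ops-sector", "Rollback of v2.3.1 complete, no change in error rate")
log.report("db-sector", "Connection pool at 100%, failover initiated")
for ts, source, msg in log.timeline():
    print(f"{ts:%H:%M:%S} [{source}] {msg}")
```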
Debriefs, formal and informal
The flowchart-like graphic below may again be quite familiar conceptually to IT folk; it even uses one of our go-to processes, ‘Continuous Improvement’. The importance of being a ‘learning organization’ is also specifically called out, and since that’s absolutely foundational to a resilient IT organization, I’m including it too:
Starting at the base and working up, the ‘hot’ debrief is usually held at the station level at the conclusion of a call, particularly one which may have psychological impacts. Fatalities will merit this; we need to keep an eye on each other’s psychological wellbeing, and it’s possibly one of the most important responsibilities we have as leaders, because severe PTSD can result in all manner of carnage up to and including suicide. It’s ok to not be ok. The debrief itself is informal, most likely a go-around-the-room recap and discussion of what worked and what didn’t, covering both operations and learnings, and how individuals are feeling about the situation. Note, however, that at each level we can trigger a policy or procedure review; it’s important that, as part of being a learning organization, there are no barriers to escalating the learnings.
The second layer, the OIC incident debrief, is a more formal process. There’s an example 8-page form of the debrief output in the appendix of the FENZ M1 Command and Control technical manual (pages 169–176), if you really want to dig deeper. But the general process flow as pictured will likely be somewhat familiar to IT organizations which regularly run postmortems:
Lastly, we have the formal Operational Review. This is a very formalized process, with a dedicated team driving it. A couple of things are notable. First, the Operational Efficiency and Readiness team reports directly to the CEO. I think this is very sensible; it mitigates some of the potential for political infighting or departmentalization. If I were standing up a similar sort of review board in a corporate structure, I’d intentionally put it at the C-level, in an attempt to sidestep things like not-invented-here syndrome. While it might make logical sense to put it under the CTO, for instance, folks who have control issues around the CTO may reject its findings; if they come more or less directly from the CEO’s authority, that should help mitigate any potential for infighting.
The format is also notable: it spells out expectations, then details findings and whether or not they meet those expectations, and, of course, any policy or procedure recommendations.
An Operational Review can be performed for any level, but will quite commonly occur after large jobs, or jobs where there may be significant learnings (again, at any level); a few released examples are well worth reading.
Readiness: the Operational Site Plan
Lastly, an example of our prepared readiness: the Operational Site Plan. At a brigade level we carry these on the truck for any relevant sites within our area (in my case that’s a school, a few petrol/gas stations, and some industrial businesses with a lot of chemicals hanging around). It’s a fairly self-explanatory document; the important thing to note is that it’s been well and truly prepared in advance. When the hooter goes, I’ll glance at my pager message, and on the 3-odd minute drive to the station I’ll start thinking about what my strategy/tactics might be if I end up as Officer In Charge. Then, while we drive to the incident, I can pull the plan out of the console and run over it, so by the time we get there I’ve got a fighting chance of understanding some of the complexities we’re about to face, and perhaps even some default strategy/tactics. On the IT side we sometimes sort of do this with docs and runbooks, but it’s usually not terribly complete or comprehensive, again probably due to the tendency to treat it as a secondary priority after rolling out actual software or products.
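The runbook analogue could be just as deliberately pre-prepared. As a hedged sketch (the fields and the service name are invented for illustration), a site-plan-style record per critical service might look like:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServicePlan:
    """A pre-prepared 'site plan' for one service: written calmly in
    advance, read under pressure. Fields are illustrative."""
    service: str
    known_hazards: tuple    # the "chemicals on site": risky dependencies
    default_tactics: tuple  # first moves if this service is the incident
    key_contacts: tuple     # who to raise, decided before you need them

# One plan per critical service, maintained ahead of time
PLANS = {
    "checkout-api": ServicePlan(
        service="checkout-api",
        known_hazards=("shares DB cluster with billing",
                       "no graceful degradation"),
        default_tactics=("check DB connection pool first",
                         "feature-flag off recommendations"),
        key_contacts=("payments team channel", "db on-call"),
    ),
}

def preread(service: str) -> ServicePlan:
    """The 'glance in the truck console' on the way to the incident."""
    return PLANS[service]

plan = preread("checkout-api")
print(f"{plan.service}: hazards={plan.known_hazards}")
```

As with the fire service version, the value isn’t the data structure; it’s that the hazards and default tactics were thought through in calm conditions, not invented three minutes into the page.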
So. Whew. That’s my big data dump of applying fire command, operations, and preparedness concepts to IT. I hope it stirs some interest, enthusiasm, and maybe gives y’all some ideas on how to make things better in your own environments. And if you made it this far, congrats on your perseverance. :)
If you’d like to discuss or whatever, feel free to drop me an email, I’m ed at hintz dot org, or ehintz at pvfb dot org dot nz, and I’ll try to engage in intelligent banter whenever I have the time. YMMV.