When the Fire Is Lit, Let the People Who Know How to Fight It Fight It

I was on a team once where sales went completely dark on a marketplace app. Not slow, not degraded, completely down. We knew what was wrong. The fix was sitting right there. And the CTO would not authorize anyone to touch production because he didn't want to risk "making things worse."

Two days. The company hemorrhaged sales for two days while the fix waited for a risk tolerance conversation that never actually resolved anything, because the CTO's answer to "what would making things worse look like" was essentially "I don't know, something unexpected." Which is a real concern. It's also not a reason to leave sales down for two days when you have engineers who understand the system and know exactly what happened.

The fix went in on day three. It took less than an hour. Nothing unexpected happened.

NIST, ISO, and SOC 2 all agree with the engineer

This is the thing that makes me actually crazy, because the frameworks that get cited to justify this kind of process gate don't support it. NIST SP 800-61 is explicit that during containment, authorized responders act to limit damage, and the documentation follows after. ISO 27001 requires escalation paths specifically so decision-making authority moves to people who can act fast, not so it moves to a seven-person approval chain. SOC 2's availability criteria want you to demonstrate a working incident response capability, not evidence that you made your responders wait.

Every one of these frameworks has a carveout for the moment the alarm is going off. The post-mortem, the documentation, the review, those all happen after. During the incident, the job is to stop the bleeding, and the people who know where the bleeding is should be the ones doing it.

That carveout exists because the people who wrote these standards understood something that gets lost between the standard and its implementation: incident response is a skill, and you don't get to use it if you've buried the person who has it under an approval workflow.

The difference between process and security theater

Process is genuinely good. Change management is good. Code review is good. Requiring a second set of eyes on production changes during normal operations is good. The problem is applying those same controls to an incident, which is a different situation with different stakes and a different cost structure. In steady state, a four-eyes review costs you an hour and catches something real. During an incident, a four-eyes review costs you an hour of downtime and catches the same thing the engineer who's been in the codebase for two years already knows. Those are not equivalent situations, and treating them like they are is just being slow.
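If you wanted to write that distinction down as policy instead of arguing it in the middle of an outage, it could look something like the sketch below. This is a minimal illustration, not anyone's real tooling: `Change`, `may_deploy`, the `responders` set, and the idea of an explicit `incident_declared` flag are all hypothetical names I'm assuming for the example.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class Change:
    """A production change; every field name here is hypothetical."""
    author: str
    reviewer: Optional[str] = None


def may_deploy(change: Change, incident_declared: bool,
               responders: Set[str]) -> Tuple[bool, str]:
    """Return (allowed, note) for a proposed production change."""
    # Steady state: a second person must have reviewed the change.
    if change.reviewer is not None and change.reviewer != change.author:
        return True, "reviewed before deploy"
    # Declared incident: a pre-authorized responder may act first,
    # and the review moves to after the fix instead of disappearing.
    if incident_declared and change.author in responders:
        return True, "deploy now, review after the incident"
    return False, "needs a second reviewer"


fix = Change(author="alice")
print(may_deploy(fix, incident_declared=False, responders={"alice"}))
print(may_deploy(fix, incident_declared=True, responders={"alice"}))
```

The point of the second branch is that the review is deferred, not dropped, and that who counts as a responder gets decided before the incident rather than during it.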

The thing I see happen is that "following process" gets treated as synonymous with "being responsible," and then nobody can distinguish between the two. The CTO who blocks access to a broken production system because he doesn't want to risk "making things worse" is following process, and the engineer who pushes the fix and documents it immediately after is not following process. The process-follower looks responsible. Two days of lost sales tells a different story.

When the tool is wrong and nobody will listen

The version of this that's become significantly more common in the last couple of years involves security tooling, and it has a failure mode that's uniquely hard to escape.

A security scanner runs against an environment. It finds vulnerabilities. It generates a report with severity labels, and those labels get escalated to people who have authority over access and deployments but who aren't engineers and can't evaluate whether the findings are real. The findings look alarming, the response is to shut things down until they're resolved, and none of that is unreasonable. But then it turns out the findings are wrong. The scanner was looking at a stale image. Or a dependency that's in the manifest but never actually called. Or a code path that hasn't been reachable since a refactor six months ago. The engineers who built the system can see this immediately, sometimes in the time it takes to read the report. But the people with authority to restore access don't have the context to evaluate whether the engineers are right, and they don't want to be the ones who reopened access and then something happened. So they defer to the tool.

I've seen services stay completely down for over a week under these circumstances. Not because the vulnerabilities were real, but because the process for proving they weren't real was slower than the process for acting on them. The engineers saying "this finding is based on an outdated scan, the current deployment doesn't have this issue" are asking the security team to trust their judgment over the tool's output. And the security team's incentive runs exactly the wrong way, because if they trust the engineers and something later goes wrong, that's on them, but if they trust the tool and the outage extends, they followed the process.

The tool has no idea what version of the code is actually running in production, no idea that the image it scanned was weeks old, no idea which dependencies are vendored and isolated or what the actual blast radius of the finding would be if it were real. The engineers have all of that, and in a lot of organizations, that knowledge now has to survive an approval chain that was designed to handle a different kind of risk entirely.

Security scanners are useful, and I'm not making a case against running them. The problem is treating their output as more authoritative than the judgment of the engineers who built the system they're scanning, especially when the system is down and the question isn't whether to fix the finding but whether the finding is even accurate.
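The first question, whether the finding even applies to what's running, is often mechanical enough to check before anyone shuts anything down. Here's a rough sketch of that check, assuming you can get the digest of the image the scanner looked at and the digest of the image actually deployed. The `Finding` fields, the `triage` function, and the seven-day freshness window are all assumptions for illustration, not the output format of any real scanner.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Finding:
    """One scanner finding; all field names here are hypothetical."""
    finding_id: str
    severity: str
    scanned_image_digest: str
    scanned_at: datetime


def triage(finding: Finding, deployed_digest: str, now: datetime,
           max_scan_age: timedelta = timedelta(days=7)) -> str:
    """Classify a finding before anyone acts on its severity label."""
    # The scanner looked at an image that is not what production runs.
    if finding.scanned_image_digest != deployed_digest:
        return "stale-image"
    # Same image, but the scan is older than the assumed freshness window.
    if now - finding.scanned_at > max_scan_age:
        return "stale-scan"
    # The finding matches the running image: an engineer should assess it.
    return "needs-review"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
old = Finding("F-1", "critical", "sha256:aaa", now - timedelta(days=21))
print(triage(old, deployed_digest="sha256:bbb", now=now))
```

This doesn't settle reachability or blast radius, which still takes someone who knows the system. It just filters out the findings that were never about the running deployment in the first place.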

What actually happens in the post-mortem

The two-day marketplace outage I described earlier ended up in a post-mortem, as it should have. The contributing factors listed included things like "approval process added latency to resolution" and "risk communication gaps." That's a polite way of saying it. What actually happened was that the person who knew how to fix it was held in place by people who didn't, and the outage ran for the time it took to navigate the approval chain rather than the time it took to write the fix.

Nobody writes in the post-mortem "we made the engineer wait." The post-mortem language is always something like "communication overhead" or "process friction" or "escalation path clarity." Same thing, different words, and the words are chosen carefully enough that nobody has to own the actual decision.

The incentive structure here is brutal if you look at it honestly. If the engineer pushes the fix without approval and something goes wrong, it's on them. If the engineer waits for approval and the outage runs long, nobody's accountable because everyone followed the process. So the rational move, from a career preservation standpoint, is to wait, to let it run longer, to not be the person who stepped outside the process even when the process was clearly the wrong tool for the moment.

That's the version of this that doesn't show up in the incident metrics. It shows up in engineer burnout, in the quiet calculation every on-call person makes about whether to move fast or to protect themselves, and in the slow accumulation of incidents that ran longer than they needed to because the people who could have ended them faster were busy following steps designed for a different situation entirely.

Following process during an incident is not the same as having good security posture, and the frameworks know this, which is why they all carve out space for someone to just go fix it. The documentation can happen after the fire is out.