The Undeclared Crisis
Why software has learned to live with regular critical incidents, and what it would take to stop
Most engineering organisations have dozens of critical incidents a year, sometimes far more, and they have largely stopped finding that remarkable. A major incident lands, people scramble, it eventually gets resolved, everyone moves on, and the next one arrives a week or two later.
Now, take that same rhythm into aviation and it stops looking normal immediately. An airline that logged a serious reportable event every couple of weeks, indefinitely, with the same underlying causes recurring, would not be an airline for very long. Take it into nuclear power and the regulator would effectively move into the building. These industries are not calmer than software because their people are better or their systems simpler. They are calmer because somewhere along the way they made it very hard to have a critical event and not declare it.
That is the idea I want to put a name to, and I have come to call it the undeclared crisis: a situation that is already critical from someone’s point of view, usually the customer or the person on the front line, but that the organisation has not classified, declared, or responded to as a crisis. The harm is real and it is accumulating. What is missing is the act of calling it what it really is.
Two ways a crisis stays undeclared
What follows are patterns, not stories about anyone in particular. I have watched each of them repeat across very different organisations over the past decades, and none of it is unique to the company it happens to.
The first way is below the line. Most organisations gate their more serious response on a severity label, and that label is negotiable. I wrote a whole post on that argument a few weeks ago. I have lost count of the times I have watched an incident argued from a P1 down to a P2 or lower while it was still unfolding. And that negotiation did not happen because the impact had changed but because the lower severity carried less work. It came with no formal review, no incident analysis write-up, fewer people pulled in. More than once the incident being argued down was one the customer had already escalated well over the team’s heads. The severity number, which is meant to describe how bad things are, had become a negotiation about how much the organisation would have to do about it. It is also a number the organisation reports on, often with a target to drive it down, which is its own trap: chase the target and the label becomes the thing you manage, so relabelling a P1 a P2 improves the metric without changing what the customer felt. That is surrogation, the measure taking the place of the goal it was meant to stand for. And because deep analysis was reserved for the top severities, the great majority of incidents produced no learning the organisation could carry forward. Most of what an organisation actually runs on sits below the line where anything gets declared. Take the negotiation away and the honest count of critical events runs far higher than the one the organisation reports, because the reported figure is the negotiated one. It is also part of why a status or availability page can stay green while customers are plainly having a bad day: the page tracks what was declared, and declarations have drifted from what people actually experienced.
The second way is more unsettling, because the crisis is named by someone in the organisation and still nothing happens. More than once I have read an internal message to senior leadership saying, in plain words, that the organisation was on fire, that monitoring had large holes in it, that the team was in crisis, drowning, and out of control. It asked for help. The message was either ignored, or the reply was a suggestion to borrow a few people from another team, and then silence. What the warning earned its author, as often as not, was a reputation for being too negative, an alarmist, or not quite a team player. So the next warning, if it came at all, cost more to send. And then, often months later, the exact gap that message had called out takes a chunk of the customer base offline. The crisis had been named, in writing, by someone close enough to see it coming, but it simply never became the organisation’s crisis. It was absorbed.
That second pattern is the one that should worry you most, because the problem there was never that nobody knew. The organisation knew. Someone even sounded the alarm.
Why healthcare is not the counter-example
It would be reasonable to assume the fix is simple: name the harm, publish the numbers, make the failure impossible to ignore, and the response will follow. Healthcare says otherwise. It has higher stakes than almost any software system and more public awareness of its own failures than almost any industry, and it normalises critical failure all the same.
In 1999 a landmark public report put a number on American hospital deaths from preventable medical error, and the number was somewhere between 44,000 and 98,000 a year. In Europe, estimates have also pointed to a very large burden of avoidable harm, with studies and official reviews repeatedly concluding that patient safety failures cause substantial death and injury across the region. Yet more than two decades later, the problem appears to have improved far less than anyone hoped, and researchers in the field write openly about the normalisation of deviance on hospital wards, the workarounds that quietly become standard practice, the alarms that staff learn to tune out.
So healthcare named its crisis, at the highest possible volume, and then in large part kept operating. That is the same move as a leader writing that the organisation is in crisis and out of control and watching the words get absorbed, only at the scale of an entire profession. Naming a crisis does not declare it. You can know the number, publish the number, agree the number is appalling, and still not build the machinery that turns the knowing into a response, because building it is the hard part. Healthcare has the hardest version of it with harm spread across thousands of separate institutions, no single regulator with authority over all of them, and numbers that stay contested rather than self-evident. Two decades of work has moved it less than anyone hoped. That is how hard the work is at the scale of a whole profession.
What aviation and nuclear actually did
It is tempting to look at aviation and nuclear and conclude that they simply care more, but that would be a mistake. The more useful lesson is in what they actually built.
Aviation made reporting easy, safe, and expected. NASA’s Aviation Safety Reporting System, operating since 1976, takes confidential, non-punitive reports from anyone in the aviation system and has collected and analysed over 2 million of them. The whole design rests on a just culture: people will only report what went wrong, and what nearly went wrong, if reporting will not be used against them. Near-misses are treated as precursors and captured through voluntary reporting, while accidents and serious incidents trigger mandatory independent investigation.
Nuclear power uses a similar but stricter model. Events are rated on the INES scale from minor deviations to major accidents, operators must report reportable events into shared operating-experience systems, and significant incidents are investigated under regulatory oversight rather than left to the operator’s discretion. The threshold for reportable events is intentionally low, and operators do not get to decide privately to wave away something that meets the reporting rules.
Put those together and the pattern is plain. These industries did not eliminate hidden crises by relying on virtue; they reduced them by making declaration easy, required, protected from punishment, and partly independent of the people who had the strongest incentive to stay quiet. Pressure built that design, over decades: regulators with statutory power, investigations the operator did not control, and accidents with outside witnesses and a body count. Software has none of those forces, so the same design has to be chosen with nothing compelling it, which is the harder road. Left to its incentives, software often inverts it instead: severity is negotiable, reporting is discretionary, investigation lands on the same team already under pressure, and the incentive structure rewards silence because a declaration creates scrutiny and worsens the dashboard. The undeclared crisis then becomes the path of least resistance.
Why it survives
The undeclared crisis survives because, from inside the organisation, nothing about it feels like a crisis. It feels like Tuesday.
This is well-trodden ground in the safety literature, even if it has barely reached software. Diane Vaughan, studying the loss of the Challenger, described the normalisation of deviance: each small departure from what should be acceptable, once it does not immediately cause disaster, stops being seen as a departure and becomes the new baseline, so the next one is measured against that lowered bar rather than the original. Barry Turner, decades earlier, described the long incubation period before man-made disasters, during which the warning signs are present but do not fit the prevailing mental model, so they are noted and not appreciated. More recently, crisis scholars have written about the creeping crisis, the kind that has no clear beginning, where damage accumulates while authorities, in their phrase, sleepwalk into greater trouble.
Every incident argued down a notch resets what a real incident looks like, every risk register that gets shelved teaches its author that writing it down was pointless, and every quarter without catastrophe gets read as proof the current state is fine, when it may only be proof that you were lucky and your people compensated hard enough to keep the surface calm. Nobody is ignoring an obvious emergency. The organisation has recalibrated, one reasonable-looking step at a time, until the emergency stopped reading as one.
That is the view from below. From above, a second force holds the threshold in place, and it is the one that makes this a leadership and culture problem, not a tooling one. The resistance runs deepest exactly where the authority to fix it sits.
Declaring a crisis is never only a statement about the system. It is also a statement about the people who were meant to be running it. To call something a crisis now is to concede two things at once: that it was not caught and declared earlier, when it was smaller and cheaper, and that it was not prevented at all. Both are reasonable things to wish you had done, and both land on whoever is accountable. So the person with the most power to lower the threshold is often the person with the most to lose by lowering it, because a sudden run of newly visible crises reads, in the moment, as a run of things that went wrong on their watch. The incentive at the top is to keep the bar high and let the borderline cases stay blurred.
Chris Argyris gave this its proper name: organisational defensive routines, the patterns of action that protect people from threat or embarrassment while preventing the organisation from addressing whatever caused the threat in the first place. The defensive routine is rarely malice and usually not even conscious. It is the rational behaviour of people who have learned that surfacing a problem at full size is more dangerous to them than letting it stay vague. Argyris noticed that people become skilled at the moves that keep a problem undiscussable while sincerely believing they are trying to solve it. That is why a named warning is rarely engaged head-on. Nobody has to argue against it. The routine just absorbs it, the meeting nods, the message is acknowledged, and the crisis stays undeclared. And when a warning is too loud to be metabolised that quietly, the same routine has a louder version of the same move: it turns on the person rather than the substance. Discrediting the one who raised the alarm settles the matter without anyone having to engage what the alarm actually said, and it is not a failure of the defensive routine but the routine working as designed, protecting the organisation from an uncomfortable message by disqualifying its source.
But the move does more than silence one person. It happens in full view of everyone who might have raised the next alarm, so what is really under attack is the feedback loop itself, the channel through which an organisation finds out it is in trouble. Punish enough sources and it goes quietly blind, not because the signals have stopped existing but because they have stopped being sent.
Leaders in aviation and nuclear manage this more easily, and not because they are braver. Their harm is simply harder to deny. A crash is a discrete event with outside witnesses, and nobody can relabel it into something smaller. Software harm is the opposite kind, spread thin and gradual, felt mostly by the customer and rarely seen by anyone outside, so there is always room to argue it down a notch. But that is a difference in how contestable the harm is, not how serious it is. Healthcare has already argued down the highest stake there is, so the size of the harm was never what forced the issue. What forces it is whether anyone built the channel that puts the harm in front of people who cannot pretend not to see it.
So what do you actually do
It is at once a leadership problem, a culture problem, and a measurement problem, and the three hold one another in place. But there is an order of operations, and it begins by taking the threat out of declaring.
You cannot exhort people past a defensive routine. Telling leaders to be braver about admitting crises asks them to pay a personal cost for an organisational good, and most of them will not, repeatedly, however sincere the values on the wall. What works is changing what a declaration costs the person who makes it. Aviation did exactly this with just culture: it separated reporting an event from blame for honest error, so that surfacing something was no longer the same act as confessing to it. That separation is what software is missing. A declared crisis has to be able to mean we are in a bad situation and we are on it, without automatically also meaning someone failed and should be afraid.
The cleanest version of this is not in software at all. On a Toyota line, any worker who spots a defect can pull the andon cord. Pulling it calls a team leader over, and the line stops only if the problem cannot be sorted within the cycle. What is worth borrowing is the leader’s first move on arriving: they thank the operator for pulling it. The signal is welcomed, not punished, because catching a fault on the line is far cheaper than letting it reach the customer, and the cord gets pulled often by design rather than treated as a sign the factory is failing. It is the exact inverse of turning on the person who raised the alarm: the declaration is met with thanks, so it costs nothing to send, so it keeps being sent. That is the feedback loop a software organisation needs and rarely builds.
The practical moves are therefore not overly complicated. Make declaring cheap and safe, so the cost to the person who raises a hand is close to nothing. Decouple how much you learn from how the incident was labelled, so that arguing the severity down no longer makes the investigation disappear, which removes one of the reasons to argue it down. Stop treating the quarter with the fewest declared incidents as the best quarter, and drop any target to push the count or severity down, both of which turn the label into the thing people manage. Often the cleanest-looking quarter is just the one with the most successful suppression. And let the customer’s definition of critical count. Yes, customers also over-inflate as readily as you downgrade, but at least their incentives never bend toward your dashboard looking green. None of this means declaring everything. A bar set too low trains people to tune out the alarm just as surely as a bar set too high hides the fire, and the aim is a threshold that tracks reality rather than one that simply rings more often.
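A rough sketch of what decoupling the learning from the label could look like in code. The field names, the impact measure, and the threshold are all hypothetical, chosen only to illustrate the idea; the point is that the function deciding whether an incident gets a review never reads the negotiated severity.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    label: str                # negotiated severity, e.g. "P1" or "P2"
    customer_escalated: bool  # the customer called it critical
    customer_minutes: int     # hypothetical impact measure: affected customers x minutes


# Hypothetical threshold; a real one has to be tuned so the bar tracks reality.
IMPACT_THRESHOLD = 10_000


def needs_review(incident: Incident) -> bool:
    """Decide on a review from observed impact only.

    incident.label is deliberately ignored, so arguing a P1 down
    to a P2 cannot make the review disappear.
    """
    return (
        incident.customer_escalated
        or incident.customer_minutes >= IMPACT_THRESHOLD
    )
```

Under these assumptions, relabelling an incident changes nothing about whether it gets analysed, which takes away one of the reasons to relabel it. The threshold still matters: set it too low and every blip triggers a review, which is its own way of training people to tune out.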
That handles the crisis that never gets declared. The harder pattern, the one I called the more unsettling of the two, is the crisis that gets declared in writing and absorbed anyway, and the moves above do nothing for it, because the person who sent the warning already paid the cost of sending it. There the failure is on the receiving end, so the receiving end needs the same discipline aviation gave the reporting end. A written warning should carry a clock: a logged decision to accept or reject it, with a reason, within a fixed window, owned by a named person. Warnings that get absorbed should be reviewed by someone who did not receive them and has no stake in keeping them quiet. None of that asks anyone to be braver. It removes the option of doing nothing, by making it visible when nothing was done. I have watched organisations run exactly this: every risk logged, and a named leader required to accept it or reject it on the record. The machinery is not self-enforcing. A clock turns into a rubber-stamp the moment a named owner can write “accept, deprioritised” on a stack of risks without reading them, so it holds only when the volume stays small enough to take seriously and the reviewer has nothing to gain from looking away. What stops it sliding into ritual is visibility upward. When the log of accepted and rejected risks is open to the levels above the named owner, accepting one becomes a decision a director or VP has to own in front of their own leadership, which forces the real conversation: fix the risk, or carry it knowingly and on the record. At AWS I watched this work reasonably well: the risk list rolled up the chain, so a logged risk could not just sit in the team that filed it. Someone senior had to address it or put their name to accepting it, and that exposure was what kept the logging from sliding into a formality.
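As a minimal sketch of a warning that carries a clock: the class name, the fields, and the 14-day window are my assumptions for illustration, not taken from any particular tool or from how any specific organisation runs it. It shows only the two properties that matter: a decision cannot be logged without a reason, and an undecided warning becomes visibly overdue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical window; the only requirement is that it is fixed and known.
DECISION_WINDOW = timedelta(days=14)


@dataclass
class RiskWarning:
    summary: str
    raised_by: str
    owner: str                       # the named person who must decide
    raised_at: datetime
    decision: Optional[str] = None   # "accept" or "reject" once decided
    reason: Optional[str] = None
    decided_at: Optional[datetime] = None

    def decide(self, decision: str, reason: str, when: datetime) -> None:
        """Log an accept/reject decision; a blank reason is refused."""
        if decision not in ("accept", "reject"):
            raise ValueError("decision must be 'accept' or 'reject'")
        if not reason.strip():
            raise ValueError("a decision needs a written reason")
        self.decision = decision
        self.reason = reason
        self.decided_at = when

    def is_overdue(self, now: datetime) -> bool:
        """True when no decision is logged and the window has passed."""
        return self.decision is None and now - self.raised_at > DECISION_WINDOW


def overdue(warnings: List[RiskWarning], now: datetime) -> List[RiskWarning]:
    """The list to show the levels above the owner: warnings left unanswered."""
    return [w for w in warnings if w.is_overdue(now)]
```

Nothing in this stops a rubber-stamp, since a reason can be written without the risk being read. What the sketch does is make silence show up: the output of `overdue` is what would roll up the chain.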
The hardest part of this sits at the very top, and it cannot be delegated. A leader who answers a surfaced problem with “good, now we can deal with it” rather than “how did we let this happen” does more to lower the threshold than any policy or tool. The permission to declare has to come, visibly, from the people who would otherwise be most exposed by the declaration. That is a great deal to ask of them, and nothing else on this list works without it.
What this leaves you with
So can it be turned around? Yes, but not by anyone insisting more loudly that people should care. It turns when declaring a crisis stops being treated as a confession and becomes ordinary, competent practice. That is a change in what an organisation measures, what it rewards, and what its leaders are seen to do, which is slow and uncomfortable, and it is precisely the change the high-reliability industries chose to make rather than stumbled into.
The incidents that will hurt you most are not the ones you declared and fought. Those got your attention, your best people, and your learning. The ones that will hurt you are the ones you never called crises at all: the chronic condition everyone had stopped seeing, the warning that was filed and absorbed, the customer who was in trouble while your dashboard stayed green. They do their damage in the gap between the moment something becomes critical and the moment someone is finally willing to say so out loud.
Most organisations pour their energy into responding to crises faster. The better move, but also the harder one, is to make a crisis surface whether or not anyone in the room feels brave that day. What an organisation measures, what it rewards, and what its leaders model are all set at the top, and nothing here holds until the people at the top own it.
//Adrian
—
Sources and further reading
Aviation and just culture. NASA’s Aviation Safety Reporting System has run a confidential, voluntary, non-punitive reporting channel since 1976 and has collected more than two million reports. The idea of a “just culture” is set out in James Reason, Managing the Risks of Organizational Accidents (Ashgate, 1997), developed in Sidney Dekker, Just Culture: Balancing Safety and Accountability (Ashgate, 2007), and brought into healthcare by David Marx, Patient Safety and the “Just Culture” (2001).
Toyota and the andon cord. Stopping the line for quality is part of jidoka in the Toyota Production System, set out by its architect in Taiichi Ohno, Toyota Production System: Beyond Large-Scale Production (Productivity Press, 1988), and in Jeffrey K. Liker, The Toyota Way (McGraw-Hill, 2004). The detail that the leader thanks the operator for pulling the cord comes from accounts of Toyota’s shop-floor culture; see John Willis, The Andon Cord, and Mark Graban’s blog on how the andon system actually works.
Nuclear. The IAEA’s International Nuclear and Radiological Event Scale (INES) rates events on a seven-level scale, and operators file event reports (such as Licensee Event Reports in the United States) into shared operating-experience databases.
Healthcare. The figure of 44,000 to 98,000 deaths a year comes from the Institute of Medicine, To Err Is Human: Building a Safer Health System, eds. Kohn, Corrigan and Donaldson (National Academies Press, 2000; released November 1999). On normalisation specifically, see John Banja, The Normalization of Deviance in Healthcare Delivery, Business Horizons 53, no. 2 (2010): 139-148.
Normalisation, incubation, and creeping crises. Diane Vaughan, The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA (University of Chicago Press, 1996), on the normalisation of deviance; Barry A. Turner, Man-Made Disasters (Wykeham, 1978; 2nd ed. with Nick Pidgeon, 1997), on the disaster incubation period; and Arjen Boin, Magnus Ekengren and Mark Rhinard, Hiding in Plain Sight: Conceptualizing the Creeping Crisis, Risk, Hazards & Crisis in Public Policy 11, no. 2 (2020), and Understanding the Creeping Crisis (Palgrave Macmillan, 2021).
Defensive routines. Chris Argyris, Overcoming Organizational Defenses: Facilitating Organizational Learning (Allyn & Bacon, 1990); see also “Teaching Smart People How to Learn,” Harvard Business Review (1991).


