The Invoice That Arrives After the Incident Is Over
Why learning always loses when the other side of the comparison is zero
When a major outage happens, the engineering team fixes it. A week later there is a postmortem where someone reads through the timeline that everyone already lived through, a few action items get written down, and everybody goes back to work. The ticket closes. At no point in this process does anyone ask what the outage cost the business. The question is not in the workflow, it is not in the incident template, and it is not in anyone’s job description. The outage is treated as an operational event, not a financial one, and the cost of it remains completely invisible.
This is not an oversight by careless people. It is the normal consequence of how engineering organisations are structured. Engineering teams are organised around systems, not business outcomes. They measure availability, latency, error rates — technical metrics that describe the health of the system. Revenue, margin, customer lifetime value sit in a different function, behind a different reporting line, and the information does not flow between the two because the organisational structure does not carry it there. This is Conway’s Law applied to cost: the structure of the organisation determines which information reaches which people, and in most organisations the financial impact of an outage reaches nobody.
Even in organisations that have tried to bridge this gap, the bridge is partial. At Amazon, the “you build it, you run it” model gave engineering teams operational ownership, and we had regular P&L reporting to leadership where the whole team was invited. But that was the cost of running our team, not the cost of our outage to the whole business. We could see our budget, our headcount, our infrastructure spend. We could not see the customer deals that stalled because of a two-hour incident, or the renewal conversations where procurement suddenly had leverage, or the enterprise pipeline that paused while a buyer’s legal team reviewed whether our platform was reliable enough to depend on. Operational ownership is not financial ownership. In my whole career as an engineer, I never once saw anyone put a monetary figure on an outage.
So the cost stays invisible. And because it is invisible, the organisation makes decisions as if it were zero.
What the invisible cost actually looks like
Take a six-hour outage at a large financial services firm doing $500 million in annual revenue, with about half of its business affected by the incident. The acute phase, the hours during which things are actually broken, involves lost revenue, employee productivity, SLA credits, and the recovery team’s time, and it comes to roughly $2 million. That is already a large number, and nobody calculates it.
But the acute phase is the smaller part of the invoice. What follows over the next two and a half months is considerably worse, and even less visible, because it happens gradually and across multiple functions that do not talk to each other.
Revenue does not snap back to pre-incident levels on day one. It recovers gradually as customers regain confidence, as the sales pipeline that stalled during the outage gets restarted, as the enterprise deals that paused while legal reviewed the incident resume their cycle. Splunk and Oxford Economics found that revenue recovery after a downtime event takes an average of 75 days, stock price recovery takes 79 days, and brand health restoration takes roughly 60 days. That gradual depression, even a few percentage points of daily revenue over that period, adds roughly $500,000 for this scenario, and that is before churn, before regulatory exposure, before brand repair.
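To see how a small daily dip adds up to a figure of that size, here is a rough back-of-envelope sketch in Python. Only the $500 million revenue, the half-affected share, and the 75-day window come from the scenario above. The 2% day-one dip and the straight-line recovery are assumptions chosen for illustration, not figures from the Splunk report or from any particular model.

```python
def revenue_depression(annual_revenue, affected_share, initial_dip, recovery_days):
    """Sum the daily revenue shortfall while revenue recovers in a straight line.

    initial_dip is the assumed fractional shortfall on day one (hypothetical);
    it shrinks linearly to zero by the end of the recovery window.
    """
    daily_affected = annual_revenue * affected_share / 365
    shortfall = 0.0
    for day in range(recovery_days):
        dip_today = initial_dip * (1 - day / recovery_days)
        shortfall += daily_affected * dip_today
    return shortfall


# Scenario from the text: $500M annual revenue, half affected, 75-day recovery.
# The 2% day-one dip is an assumption for illustration only.
loss = revenue_depression(500_000_000, 0.5, 0.02, 75)
print(f"Estimated revenue depression: ${loss:,.0f}")
```

With these assumed inputs the result lands a little above $500,000. Change the dip or the recovery shape and the number moves, which is the point: it is never zero.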
Customer churn does not happen during the outage. It happens in the weeks that follow, when contract renewals come up and the procurement team mentions the incident, when a competitor’s sales rep uses your outage in their pitch, when the customers who were quietly unhappy before the outage now have a concrete reason to leave. Bain’s research, now over two decades old but directionally still sound, established that acquiring a new customer costs five to twenty-five times more than retaining one. The replacement cost is not just the lost revenue but the full cost of winning that revenue back from scratch.
Regulatory exposure builds on its own timeline. DORA, now in full effect across the EU, allows fines up to 2% of global annual turnover. The FCA’s enforcement actions have been trending sharply upward. Healthcare data breaches, where HIPAA applies, cost an average of $9.77 million according to IBM and Ponemon, the highest of any industry for more than a decade. None of these costs arrive during the outage. They arrive months later, after the investigation, after the reporting, after the regulator’s own review cycle.
Add brand repair, the PR campaigns, the customer outreach, and everything that follows the acute phase comes to roughly $7 million, more than three and a half times the acute cost. I call this the tail. It is not a rounding error. It is the majority of the invoice, and it accumulates entirely outside the window that the incident management system considers the incident.
There is also a cost that no financial model captures well. Richard Cook, in a piece published posthumously earlier this year, described what he called Organizational Second Hit Syndrome: a major incident creates a vulnerable period during which a second incident, even a minor one, produces reactions that are qualitatively different — structural reorganisations, personnel changes, resource redirections. The first hit is treated as an aberration. The second hit calls the organisation itself into question. The tail is not just the financial cost of the first incident but the increased fragility to the next one, and the cost of a second event during that window is not additive. It compounds.
Why learning always loses the argument
After the outage, someone who lived it, viscerally, hands on, still carrying the adrenaline of the response, proposes a proper investigation: interviews with the people who responded, architectural analysis, a written report, implementing the fixes. The response from the people who did not live it is predictable: that sounds expensive, we have a backlog, we cannot afford to take engineers off delivery for three weeks. And so the organisation opts for efficiency: one meeting, one document, and on to the next thing. The learning never happens.
What is going on here is not laziness or indifference. It is a predictable outcome of comparing a visible cost against an invisible one. The cost of the investigation is concrete. It is this team, this sprint, this quarter’s roadmap. It comes out of a budget that someone owns and will be asked about. The people who would do the work can see exactly what it would cost them in time they do not have, on a backlog they are already behind on.
The cost of not investigating is the next similar outage the organisation will fail to prevent. But that cost does not have a number attached to it, because nobody calculated it, and it lives in an uncertain future that feels abstract and far away. Trope and Liberman’s construal level theory describes this asymmetry precisely: psychologically near events are processed in concrete terms, with specific costs and trade-offs, while psychologically distant events are processed abstractly, in terms of vague categories and good intentions. The learning investment is near. The prevented outage is distant. So the learning feels heavy and specific while the risk feels light and theoretical, even when the risk is orders of magnitude larger.
Frederick, Novemsky, Wang, Dhar and Nowlis identified the mechanism behind this and called it opportunity cost neglect: people systematically fail to consider the alternatives that their decisions displace. In their experiments, simply reminding participants of the opportunity cost of a purchase, just making it visible, changed decisions dramatically. The cost did not change. Only its visibility did. This is the same mechanism at work in the post-incident conversation. The cost of not learning is the full cost of the next outage the organisation will fail to prevent, including the tail that accumulates for months afterwards. But nobody generates that number spontaneously, so the learning investment gets compared to nothing, and nothing always wins.
It gets worse because of the asymmetry in how certain and uncertain costs are experienced. The learning investment is a certain loss: you will spend this time and money. The next outage is an uncertain loss: it might happen, might not, and even if it does, nobody knows when or how bad it will be. Kahneman and Tversky’s prospect theory shows that people weigh a certain loss far more heavily than an uncertain one, and will often gamble on a larger possible loss to avoid a smaller sure one. The $32,000 learning investment triggers that aversion. The $9 million outage cost does not, because it is wrapped in enough uncertainty that it does not feel like a real loss yet.
This is the prevention paradox and the hyperbolic discounting I wrote about in The Hidden Cost of Delayed Resilience. Nobody walks into a planning meeting with the tail cost of the last outage, because nobody calculated it. The number does not exist. So the investment case never gets made, and the next outage produces the same reactive scramble, the same surface-level postmortem, and the same deferred investment.
Making the comparison fair
I built a calculator to fix this. It is free, it runs at resiliumlabs.com, and it does one thing differently from any other downtime cost calculator I have seen: it shows the tail.
To use it, you pick your industry and company size. The model pre-fills defaults from published research (Splunk, ITIC, IBM/Ponemon, Siemens, BLS, DORA, HIPAA, FCC enforcement data), all adjustable, all sourced. It shows the acute cost, the tail cost, the learning investment, and the ratio between them. That should make the difference obvious enough.
I built it precisely because of what the research by Frederick and his colleagues shows: making the invisible cost visible is often all it takes to change the decision. You do not need to convince anyone that resilience matters. You do not need a better argument. You need the number to exist so that the comparison is no longer between a concrete investment and nothing.
Like I said, every default is adjustable. If you think the churn probability is too high for your context, lower it. If your regulatory exposure is higher than the default, raise it. The point is not to produce a precise number, because no model can do that for a future event that has not happened yet. The point is to make the comparison fair, so that the conversation about investing in learning includes the costs that are currently invisible.
What you will notice if you start using it is that when you do put the number on the table, nobody expects it to be that large. A six-hour outage at a $500 million financial services firm does not cost $2 million. It costs $9 million. And the cost of learning from that incident — the investigation, the fixes, the process improvements — comes to about $32,000, less than half a percent of the total.
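The comparison can be made concrete with a few lines of arithmetic. The sketch below uses only the rounded figures from this scenario. The break-even framing (how much the investigation has to lower the odds of a repeat to pay for itself) is a simplification for illustration, and it assumes a repeat outage would cost about the same as this one.

```python
def break_even_risk_reduction(learning_cost, outage_cost):
    """Reduction in repeat-outage probability at which the expected saving
    equals the learning cost (assumes the repeat costs the same as this one)."""
    return learning_cost / outage_cost


acute_cost = 2_000_000     # rounded acute-phase cost from the scenario
tail_cost = 7_000_000      # rounded tail cost from the scenario
learning_cost = 32_000     # investigation, fixes, process improvements

total_cost = acute_cost + tail_cost
learning_share = learning_cost / total_cost
break_even = break_even_risk_reduction(learning_cost, total_cost)

print(f"Total cost of the outage: ${total_cost:,}")
print(f"Learning as a share of total: {learning_share:.2%}")
print(f"Break-even reduction in repeat probability: {break_even:.2%}")
```

The share and the break-even figure are the same number by construction. On these inputs, the investigation pays for itself if it lowers the chance of a comparable outage by well under one percentage point.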
Remember that the outage already happened. It is an unplanned investment, a very large one. For a fraction of that cost, you can turn it into a lesson you never have to pay for again, or you can close the ticket and move on.
Allspaw has made the distinction between learning and fixing. Fixing addresses the parts involved in the event. Learning develops a richer understanding of where the event came from, what was difficult for the people responding, and what mattered about it at the time. The return on that understanding is far greater than the return on the fix alone, because it includes the fix and the context that prevents the next incident, the one that would have been different in its specifics but identical in its pattern. Hochstein puts it well: the same incident never happens twice, because the system has changed since it happened, but the patterns recur over and over. If you focus only on preventing the specific details of the last incident, you will miss the higher-level patterns that enable the next one. The learning investment is what breaks that cycle.
The question worth asking
The next time someone in your organisation says the investigation is too expensive, ask them: compared to what? If the answer is the backlog, the sprint, the quarterly plan, they are comparing the cost of learning against the cost of doing nothing. And the cost of doing nothing is not zero. It is the full cost of the next outage they will fail to prevent, including the tail that keeps accumulating long after the dashboard turns green.
Hebert has argued that the incident review itself is the action item — not the tickets that come out of it, but the understanding it produces. If teams cannot apply what they learned because they lack the autonomy to schedule preventive work, that is not a resilience problem. It is an organisational one. My book Why We Still Suck at Resilience is largely about this very problem; the reason most organisations still struggle with resilience is not technical. It is organisational. The lack of learning after incidents is one of the clearest examples. The tail cost is what you pay for leaving it unresolved.
//Adrian



