Resilience Bites

The Invoice That Arrives After the Incident Is Over

Adrian Hornsby — Wed, 20 May 2026 09:06:39 GMT

When a major outage happens, the engineering team fixes it. A week later there is a postmortem where someone reads through the timeline that everyone already lived through, a few action items get written down, and everybody goes back to work. The ticket closes. At no point in this process does anyone ask what the outage cost the business. The question is not in the workflow, it is not in the incident template, and it is not in anyone’s job description. The outage is treated as an operational event, not a financial one, and the cost of it remains completely invisible.

This is not an oversight by careless people. It is the normal consequence of how engineering organisations are structured. Engineering teams are organised around systems, not business outcomes. They measure availability, latency, error rates — technical metrics that describe the health of the system. Revenue, margin, customer lifetime value sit in a different function, behind a different reporting line, and the information does not flow between the two because the organisational structure does not carry it there. This is Conway’s Law applied to cost: the structure of the organisation determines which information reaches which people, and in most organisations the financial impact of an outage reaches nobody.

Even in organisations that have tried to bridge this gap, the bridge is partial. At Amazon, the “you build it, you run it” model gave engineering teams operational ownership, and we had regular P&L reporting to leadership where the whole team was invited. But that was the cost of running our team, not the cost of our outage to the whole business. We could see our budget, our headcount, our infrastructure spend. We could not see the customer deals that stalled because of a two-hour incident, or the renewal conversations where procurement suddenly had leverage, or the enterprise pipeline that paused while a buyer’s legal team reviewed whether our platform was reliable enough to depend on. Operational ownership is not financial ownership. In my whole career as an engineer, I never once saw anyone put a monetary figure on an outage.

So the cost stays invisible. And because it is invisible, the organisation makes decisions as if it were zero.

What the invisible cost actually looks like

Take a six-hour outage at a large financial services firm doing $500 million in annual revenue, with about half of its business affected by the incident. The acute phase, the hours during which things are actually broken, involves lost revenue, employee productivity, SLA credits, and the recovery team’s time, and it comes to roughly $2 million. That is already a large number, and nobody calculates it.

But the acute phase is the smaller part of the invoice. What follows over the next two and a half months is considerably worse, and even less visible, because it happens gradually and across multiple functions that do not talk to each other.

Revenue does not snap back to pre-incident levels on day one. It recovers gradually as customers regain confidence, as the sales pipeline that stalled during the outage gets restarted, as the enterprise deals that paused while legal reviewed the incident resume their cycle. Splunk and Oxford Economics found that revenue recovery after a downtime event takes an average of 75 days, stock price recovery takes 79 days, and brand health restoration takes roughly 60 days. That gradual depression, even a few percentage points of daily revenue over that period, adds roughly $500,000 for this scenario, and that is before churn, before regulatory, before brand repair.

Customer churn does not happen during the outage. It happens in the weeks that follow, when contract renewals come up and the procurement team mentions the incident, when a competitor’s sales rep uses your outage in their pitch, when the customers who were quietly unhappy before the outage now have a concrete reason to leave. Bain’s research, now over two decades old but directionally still sound, established that acquiring a new customer costs five to twenty-five times more than retaining one. The replacement cost is not just the lost revenue but the full cost of winning that revenue back from scratch.

Regulatory exposure builds on its own timeline. DORA, now in full effect across the EU, allows fines up to 2% of global annual turnover. The FCA’s enforcement actions have been trending sharply upward. HIPAA breach costs average $9.77 million, the highest of any industry for twelve consecutive years. None of these fines arrive during the outage. They arrive months later, after the investigation, after the reporting, after the regulator’s own review cycle.

Add brand repair, the PR campaigns, the customer outreach, and the total comes to roughly $7 million, about 3.7 times the acute cost. I call this the tail. It is not a rounding error. It is the majority of the invoice, and it accumulates entirely outside the window that the incident management system considers the incident.

There is also a cost that no financial model captures well. Richard Cook, in a piece published posthumously earlier this year, described what he called Organizational Second Hit Syndrome: a major incident creates a vulnerable period during which a second incident, even a minor one, produces reactions that are qualitatively different — structural reorganisations, personnel changes, resource redirections. The first hit is treated as an aberration. The second hit calls the organisation itself into question. The tail is not just the financial cost of the first incident but the increased fragility to the next one, and the cost of a second event during that window is not additive. It compounds.

Why learning always loses the argument

After the outage, someone who lived it — viscerally, hands on, still carrying the adrenaline of the response — proposes a proper investigation. Interviews with the people who responded, architectural analysis, a written report, implementing the fixes. The response from the people who did not live it is predictable: that sounds expensive, we have a backlog, we cannot afford to take engineers off delivery for three weeks. And so the organisation opts for efficiency, for one meeting, one document, and move on. The learning never happens.

What is going on here is not laziness or indifference. It is a predictable outcome of comparing a visible cost against an invisible one. The cost of the investigation is concrete. It is this team, this sprint, this quarter’s roadmap. It comes out of a budget that someone owns and will be asked about. The people who would do the work can see exactly what it would cost them in time they do not have, on a backlog they are already behind on.

The cost of not investigating is the next similar outage the organisation will fail to prevent. But that cost does not have a number attached to it, because nobody calculated it, and it lives in an uncertain future that feels abstract and far away. Trope and Liberman’s construal level theory describes this asymmetry precisely: psychologically near events are processed in concrete terms, with specific costs and trade-offs, while psychologically distant events are processed abstractly, in terms of vague categories and good intentions. The learning investment is near. The prevented outage is distant. So the learning feels heavy and specific while the risk feels light and theoretical, even when the risk is orders of magnitude larger.

Frederick, Novemsky, Wang and Dhar identified the mechanism behind this and called it opportunity cost neglect: people systematically fail to consider the alternatives that their decisions displace. In their experiments, simply reminding participants of the opportunity cost of a purchase, just making it visible, changed decisions dramatically. The cost did not change. Only its visibility did. This is the same mechanism at work in the post-incident conversation. The cost of not learning is the full cost of the next outage the organisation will fail to prevent, including the tail that accumulates for months afterwards. But nobody generates that number spontaneously, so the learning investment gets compared to nothing, and nothing always wins.

It gets worse because of the asymmetry in how certain and uncertain costs are experienced. The learning investment is a certain loss — you will spend this time and money. The next outage is an uncertain loss — it might happen, might not, and even if it does, nobody knows when or how bad it will be. Kahneman and Tversky’s work on loss aversion shows that certain losses feel disproportionately larger than uncertain ones. The $32,000 learning investment triggers loss aversion. The $9 million outage cost does not, because it is wrapped in enough uncertainty that it does not feel like a real loss yet.

This is the prevention paradox and the hyperbolic discounting I wrote about in The Hidden Cost of Delayed Resilience. Nobody walks into a planning meeting with the tail cost of the last outage, because nobody calculated it. The number does not exist. So the investment case never gets made, and the next outage produces the same reactive scramble, the same surface-level postmortem, and the same deferred investment.

Making the comparison fair

I built a calculator to fix this. It is free, it runs at resiliumlabs.com, and it does one thing differently from other downtime cost calculator I have seen: it shows the tail.

To use it, you pick your industry and company size. The model pre-fills defaults from published research (Splunk, ITIC, IBM/Ponemon, Siemens, BLS, DORA, HIPAA, FCC enforcement data), all adjustable, all sourced. It shows the acute cost, the tail cost, the learning investment, and the ratio between them. That should make the difference obvious enough.

The reason I built it is precisely because of what Frederick’s research shows: making the invisible cost visible is often all it takes to change the decision. You do not need to convince anyone that resilience matters. You do not need a better argument. You need the number to exist so that the comparison is no longer between a concrete investment and nothing.

Like I said, every default is adjustable. If you think the churn probability is too high for your context, lower it. If your regulatory exposure is higher than the default, raise it. The point is not to produce a precise number, because no model can do that for a future event that has not happened yet. The point is to make the comparison fair, so that the conversation about investing in learning includes the costs that are currently invisible.

What you will notice if you start using it is that when you do put the number on the table, nobody expects it to be that large. A six-hour outage at a $500 million financial services firm does not cost $2 million. It costs $9 million. And the cost of learning from that incident — the investigation, the fixes, the process improvements — comes to about $32,000, less than half a percent of the total.

Remember that the outage already happened. It is an unplanned investment, a very large one. For a fraction of that cost, you can turn it into a lesson you never have to pay for again, or you can close the ticket and move on.

Allspaw has made the distinction between learning and fixing. Fixing addresses the parts involved in the event. Learning develops a richer understanding of where the event came from, what was difficult for the people responding, and what mattered about it at the time. The return on that understanding is far greater than the return on the fix alone, because it includes the fix and the context that prevents the next incident, the one that would have been different in its specifics but identical in its pattern. Hochstein puts it well: the same incident never happens twice, because the system has changed since it happened, but the patterns recur over and over. If you focus only on preventing the specific details of the last incident, you will miss the higher-level patterns that enable the next one. The learning investment is what breaks that cycle.

The question worth asking

The next time someone in your organisation says the investigation is too expensive, ask them: compared to what? If the answer is the backlog, the sprint, the quarterly plan, they are comparing the cost of learning against the cost of doing nothing. And the cost of doing nothing is not zero. It is the full cost of the next outage they will fail to prevent, including the tail that keeps accumulating long after the dashboard turns green.

Hebert has argued that the incident review itself is the action item — not the tickets that come out of it, but the understanding it produces. If teams cannot apply what they learned because they lack the autonomy to schedule preventive work, that is not a resilience problem. It is an organisational one. My book Why We Still Suck at Resilience is largely about this very problem; the reason most organisations still struggle with resilience is not technical. It is organisational. The lack of learning after incidents is one of the clearest examples. The tail cost is what you pay for leaving it unresolved.

//Adrian

The Severity Argument You Keep Having

Adrian Hornsby — Sat, 16 May 2026 18:02:58 GMT

Photo by thisGUYshoots on Unsplash

A client of mine had been through a series of outages. At some point, one of their customers told them:

“You seem more interested in negotiating priority than fixing my problem.”

That sentence has stayed with me, because it captures something most engineering organisations would rather not see clearly. The customer was not in the room when the severity meetings happened, but they could tell from the response, from the cadence of communication, from the shape of what was happening on their side, that a meaningful chunk of the organisation’s attention had gone to a procedural question that had nothing to do with their outage.

If you have spent any time working incidents, you probably have experienced this type of situation. Someone pulls up the severity rubric mid-incident or just after. The number is high, or low, or in dispute, and the argument starts: should this be a P1 or a P2, does the customer impact count as “major,” does the contract automatically promote it, last quarter we called something similar a P2 so shouldn’t this also be one, should we wait until we know the full impact, should we downgrade now that the immediate fire is out. Everyone has a reasonable-sounding rationale, everyone is trying to land the right number, and every few weeks the same argument comes back with a different incident attached.

The customer in the opening saw this more clearly than the people inside the room. The visible argument is about severity levels. The underlying argument is about a measurement that has quietly become a goal. Severity counts show up on scorecards, in quarterly reports, in promotion criteria, in the SLA penalty clauses of contracts. Once a number gets that much attention, the conversation stops being about the impact on the customer and becomes about what the number will be. Economists call this surrogation: the proxy substitutes for the thing it was supposed to indicate.

You wanted reliability and teams that learn from incidents. You measured high-severity incidents. You now have an organisation working to reduce the count, which is not the same thing. The count goes down through suppression: incidents that don’t get declared, problems classified as something else, issues worked around quietly. Healthier teams report more incidents, not fewer, because psychological safety raises the count. Fewer high-severity incidents is the side effect of teams that learn, not the route to it. Make the count the goal directly and teams will hit it through argument, downgrade, and quieter classification. The learning never happens.

Fred Hebert at Honeycomb made the boundary-object case directly: severity scales fail because a single linear number can’t carry the multiple things different stakeholders need it to mean. His prescription is to drop the scale and use descriptive types instead — isolated, major, security, time bomb, ambiguous — so each term encodes its own workflow rather than its own rank. This post stays with Hebert’s diagnosis and adds a second cause underneath it.

The meanings severity is asked to carry — impact, workload, visibility, contractual exposure — give everyone in the room a lever to argue the number down. The boundary-object problem makes the argument possible. The measurement makes it inevitable. No rubric refinement fixes either.

What severity is being asked to do

Most severity scales run P1 to P4 or Sev1 to Sev5. One number per incident. Looks simple, on the surface.

That number is supposed to communicate, simultaneously:

- Impact — how many customers, how badly, what scope

- Workload — who has to be involved, how many people, for how long, what process activates, whether an RCA gets triggered

- Visibility — what shows up on leadership dashboards, what gets reported externally

- Contractual obligation — what was promised to this specific customer, what the legal and commercial consequences are

These four things rarely move together, and the gap between them is where most severity arguments live. An incident can have low impact but high contractual exposure, or high workload but low visibility, or low operational severity but real regulatory consequence, or it can affect only one customer but the wrong customer at the wrong moment, in a way that makes the call subtle in ways no rubric quite captures.

Inside workload, the question of whether the incident triggers an RCA is often the heaviest lever. An RCA is hours of write-up, review, and follow-up that the team would rather avoid, and in my customer engagements it is the single most common reason severity gets argued down.

The single number forces compression, and compression is what makes severity meetings argue about which of the four meanings will dominate this time, rather than agreeing on what actually happened.

A boundary object is useful in normal times because it lets people coordinate without agreeing on definitions. Under load it breaks. Engineers reach for severity to negotiate workload, customer success reaches for it to communicate customer pain, incident commanders reach for it to gate process, and leadership reaches for it to filter visibility. They are all using the same word and arguing about different things, and most of the time none of them realise that is what is happening.

What looks like an argument about which severity fits is really an argument about which meaning of severity will dominate the conversation.

Where the pull goes

If you pay attention during these meetings, the energy is almost always directed at lowering severity, rarely about bumping it. The arguments offered are not primarily about customer impact, they are about response shape: who would have to be involved, how visible it would be, how much process would activate, what the workload would look like.

Where the pull goes tells you what the term is doing. If severity were really a good description of impact, you would expect the arguments to go in both directions, with customer-facing roles pushing up about as often as engineers push down. The vocabulary would reach for the customer’s contract, the customer’s user count, the customer’s downstream consequence. The meeting would reach for facts.

Instead, the meeting often reaches for examples of similar incidents and asks how those were classified, with people looking for precedent that supports the lower number. The customer’s contract often does not get opened, and the customer-facing roles are often not in the room or are outnumbered by the people who will carry the operational consequences of the call. That is the behaviour of a workload gate, not a description of impact.

This is also where the customer’s perception comes from. When the negotiation pulls consistently toward less workload, less visibility, less formal response, the implicit message is we are arguing toward the version of this that is least disruptive to us, even if no one means it this way. The customer is downstream of that, and they feel it.

The events that get lost

Severity arguments produce winners and losers among incidents. The clear-cut cases go through quickly. The contested cases get downgraded under pressure, more often than not, because the pull is consistently toward less workload.

Which is to say: the incidents that get downgraded under negotiation are usually the ambiguous, complex, partially-customer-impacting cases that do not fit neatly into the standing categories. These are exactly the incidents most worth examining carefully.

A clear-cut P1 is informative in the way a textbook example is informative, because it confirms the categories. An ambiguous incident that the team had to argue about is informative in a different way: it reveals where the categories are wrong, where the system is producing failure modes the rubric did not anticipate, where the assumptions encoded in the severity scheme do not match the territory.

When the standing process punishes ambiguity by routing it toward the lower-severity bucket, which is also often the bucket that gets less analysis, smaller postmortems, weaker follow-up, the organisation systematically loses access to the events that contain the most information. The stuff worth learning from disappears into the bucket designed to require the least attention.

Why the argument never ends

Once you see both problems together — the term overloaded with meanings, and the term being measured — the recurring nature of the argument makes sense. It is not a debate that anyone can win, because there is no rubric refinement that turns an overloaded, measured term into a coherent one. Better criteria make the boundaries sharper, clearer rules make the call more defensible, a standing committee makes outcomes more consistent, a customer-impact rubber stamp makes one dimension explicit, but none of these change the underlying issue, which is that the term is being asked to encode multiple things that cannot be encoded together.

So the argument resurfaces on every ambiguous incident or when customers finally decide to share their frustrations. Someone proposes a refinement, the discussion happens, a decision is made, the argument quietens, and then it comes back at the next incident. It feels like progress because there is real intellectual work happening, but it has no terminal state.

This is why severity meetings feel exhausting. People walk out feeling they spent two hours on something that should have taken ten minutes, and they are right. They spent two hours on a problem that has no solvable form in its current shape.

What is actually being optimised

Cultural fixes — “we should care more about customers, let’s call out negotiation when we see it” — do not work because they aim at the wrong layer. The metric is the layer.

Recurring patterns in mature organisations almost always have an incentive structure underneath them. Sometimes a team scorecard counts P1s and a high count looks bad on someone’s quarterly review. Sometimes P1s trigger upward reporting that someone is trying to avoid. Sometimes SLA penalties are tied to severity, while the team making the severity call is shielded from the financial consequence but accountable for the call itself. Sometimes the incident management team’s own workload is implicitly part of how they are measured.

These metrics are usually invisible to the people inside the system, and often invisible to leadership too, because they were installed years ago by someone who has since moved on. People who ask why does severity keep getting argued? are often told it’s the process. That answer is honest, but it is probably not the whole answer.

If you cannot see the incentive driving the negotiation, you cannot dismantle it. Every cultural intervention will be working against an unmeasured current.

Three layers, one drift

Step back from all of that, and a shape comes into view. Each piece of the diagnosis lives at a different layer of how the organisation actually works, and the layers are out of alignment with each other.

The frame I use in my book Why We Still Suck at Resilience is to examine three layers at once: outcomes, rewards, and rituals.

Outcomes are what the organisation claims it wants: customer success, reliability, learning. The values on the wall, the language in the strategy deck, the things leadership says in town halls.

Rewards are what actually gets measured, celebrated, and promoted. The metrics on dashboards, the recognition in all-hands meetings, the promotion criteria, the things that determine careers.

Rituals are what the standing processes actually produce. Severity meetings, postmortems, performance evaluations, sprint planning. Not what the rituals are supposed to do, but what they actually do when you sit and watch them.

When these three layers point in different directions, when outcomes say customer success but rewards celebrate low P1 counts while rituals systematically argue severity downward, drift is happening. The direction of the misalignment reveals the direction of the drift.

Most attempts to fix the severity argument operate only on the outcomes layer: new value statements, town hall reaffirmations, posters about putting customers first. Restating outcomes while rewards and rituals stay flat does not close the gap; it deepens it, because it teaches people that outcome statements are decoupled from real consequence. The work has to happen where behaviour actually gets shaped, at the rewards and ritual layers, and that requires leadership willing to look at what is actually being measured and decide whether it is doing useful work.

What works better

If the severity argument lives at the rituals and rewards layers, that is where the work has to happen. Five directions are worth thinking about: four on the rituals, one on the rewards. Rituals move faster, and over time put pressure on the rewards. The rewards work moves slower but goes deeper. Doing any one of them helps. Doing all five together produces compounding effects.

Reduce the load on the term

If severity is overloaded, reduce the load. Hebert’s incident-types prescription is the cleanest version, but for most organisations it is too big a leap — severity language is baked into customer contracts and reporting structures. You cannot just remove it.

The practical move I found useful is to supplement rather than replace. Keep the severity number where it has to live, in contracts and in formal reporting, but put a small descriptive layer alongside it that captures the things severity is currently doing badly: customer impact (who, how many, what scope), workload (which teams, expected duration), visibility (who needs to be told), and contract reference (which clauses apply). Four short fields, filled in at incident declaration, kept alive through the response.

A small example. An incident gets declared at P3 because it affects a non-critical internal tool used by a small number of users for a few hours, and by the rubric that is the right number. The four-field layer alongside the P3 might read:

Customer impact: 12 internal users on the analytics team, blocked from quarter-end reporting due in 36 hours.
Workload: two engineers, expected 4 hours.
Visibility: finance leadership notified, no external comms.
Contract: none affected.

The P3 stays, and nothing has changed about the workflow it triggers, but anyone walking into the response or reading about it later now has the actual situation, not just the rank. The question of whether the severity level fits stops being the only thing the meeting can argue about, because there are now four other things on the page that are easier to discuss directly.

When the next severity meeting happens, you have something to argue from instead of arguing about. The compression problem does not go away, but the meeting now has a chance of being about the situation instead of about which meaning of severity will dominate.

A small antidote that pairs well with this, especially when you notice the meeting drifting back into rank-arguing: imagine the customer in the room. Not as a rhetorical device, but seriously. What would you actually tell them, in plain language, about why this is or is not a P1? On the other side of the service is a customer trying to get work done, and behind them are usually their own customers waiting on something. The severity meeting is invisible to all of them, but the consequences of the call are not. Running the argument past an imagined customer before settling on the number tends to surface what the discussion has been talking around.

Lower the cost of declaring

If suppression is the upstream version of the surrogation problem, the way to disarm it is to make declaration itself low-friction and non-punitive. Martha Lambert at incident.io made the practical case at SEV0 2025: lower the bar, declare more, train the response muscle on small things so the bigger ones go smoothly. She also suggests auto-creating incidents from errors, taking the declaration decision out of human discretion entirely. There is nothing left to suppress. This is the Andon Cord pattern applied to incident declaration: lower the cost of surfacing a problem, and never punish the surfacing. Her customer-side framing lands the rest of it:

”Customers understand that things go wrong. They just want to know that you’re dealing with them really well.”

Decouple learning from severity

Most severity scales gate two things: who responds and whether learning happens. The two do not have to be linked.

When learning happens only for P1s, or only for the incidents that survived the negotiation as P1s, the organisation has built an incentive to argue severity downward. The lower number is also the one that makes the postmortem go away, the action items disappear, the reflection stop. Of course the pull is downward. The severity decision is also a decision about whether the team has to write anything down afterward.

Decoupling these is structural. Severity decides response shape, learning happens regardless. AWS’s COE process is one well-known version of this. A COE (Correction of Errors) is Amazon’s version of the postmortem, triggered by any incident with customer impact, even one customer. Customer obsession makes the call. Any team can ask another to take one on, even for a near miss. So is the practice of writing a learning artefact for any incident that surprised the team, independent of how it was officially classified.

Once you decouple, the argument over severity stops mattering for the learning question. The high-stakes argument quietly becomes lower-stakes, because less rides on the answer. This is also the move that makes ambiguity productive instead of expensive: the contested cases, the ones the rubric struggles with, are now the ones most likely to get studied, because the studying is no longer gated on the rubric resolving them.

Decoupling is the precondition for learning, not a guarantee of it. The artefact itself can become theater: a defensible argument that learning is occurring while no actual learning happens. That is a separate failure mode my book spends time on.

Default high, track the trajectory

Two tactical moves work alongside the structural ones, especially when a real argument is happening in the room.

Default to the higher severity when there’s disagreement. Treat the incident as the highest plausible level until proven otherwise. Mobilise the response that level warrants. Once the customer is taken care of and the immediate situation is resolved, the argument can resume. In practice it rarely does, because everyone is tired and the urgency has passed. The downward pull only works when it’s exercised *before* the response is committed; once committed, the argument loses its purpose.

Treat severity as a trajectory rather than a single number. An incident can start at P3, escalate to P1 when impact becomes clear, then return to P3 or P4 once the situation is stable. Track the peak severity alongside the current severity, and record every change in the incident timeline. (I owe this framing to a recent conversation with Brent Chapman. The peak severity becomes useful data: what severity was actually doing during the response, not what someone settled on at the end. It also removes heat from the meeting, because the call is no longer a one-shot judgment. It can be revised as the picture clarifies.

Both moves work because they remove the tactical incentive to argue. Default-high pushes the argument to the after-the-fact moment when no one wants to have it. Trajectory tracking spreads the severity decision across the incident timeline; no single judgment carries the full weight.

Take severity off the scorecard

The deeper move is at the rewards layer. Use severity counts as information. Keep them off scorecards, performance reviews, and promotion criteria. Once a number gets that kind of attention, the thing the number was supposed to indicate becomes invisible. That is the surrogation problem this piece opened with, surfacing in the place hardest to fix.

This is the harder of the three remedies, but it addresses surrogation at its source. In my experience, leaders understand surrogation easily once it is explained. They have rarely chosen the measurement system that produces it. Most of the structures around incident severity were inherited from someone who has moved on, and the people operating within them now do so without ever having decided the structures made sense. Naming the measurement and asking whether anyone would design it this way from scratch is often enough to unstick the conversation.

Most of the time, the conversation simply doesn’t happen. The people closer to the harm assume leadership won’t hear it; the people senior enough to be heard haven’t done the diagnosis. When someone with the standing walks a leader through it calmly, the metric stops being defended more often than people fear.

Some leaders will not unstick even then, because the measurement is doing political work for them they aren’t ready to give up. That is a different problem from inheritance, and it doesn’t yield to the same conversation. Naming the political function the metric serves is where that one starts.

The remaining work is consistency. Someone with standing has to say “we are no longer tracking P1 counts as a performance indicator,” and they have to hold that line through the next quarterly review when someone asks where the metric went. It is uncomfortable to remove a measurement that has been part of the operating rhythm for years, even when everyone agrees it is doing harm. But the alternative is the argument this piece opened with, in perpetuity.

Why this matters

The reason severity arguments are worth getting right is not that severity itself matters. It is that severity meetings are where the organisation talks to itself about which incidents count.

When the meetings systematically pull toward lower severity, the organisation tells itself a story about its own performance that is more flattering than reality: fewer high-severity incidents, better dashboards. The proxy has become the goal, and the goal has been forgotten. That story shapes investment decisions, what gets prioritised, what gets celebrated, what counts as a good quarter.

It also shapes how the organisation appears to its customers. The customer who said you seem more interested in negotiating priority than fixing my problem was reading the same room from outside. They saw an organisation whose internal language was about classification, when their language was about consequence. The mismatch is visible, even when the people inside the meeting do not realise it is happening.

A measurement system that points the organisation in this direction will keep producing the same arguments quarter after quarter, regardless of how the rubrics are dressed up. There is no better technique waiting to be found. The way out is to recognise that the question severity was designed to answer is not the question incident response is actually asking.

Tell the truth in the dimensions where it actually lives. Notice when outcomes, rewards, and rituals point in different directions. The misalignment is the message.

//Adrian

---

Second in a short series on the metrics and categories incident response argues about. The first piece was on MTTR.

When No One in the Room Has Carried the Pager

Adrian Hornsby — Fri, 15 May 2026 06:41:17 GMT

Photo by charlesdeluvio on Unsplash

When I was at AWS, one of my areas of focus in the Resilience org (the team behind AWS FIS) was bar-raising ORRs, Operational Readiness Reviews, for the services we shipped. Bar-raising is an Amazon term, but the idea is older than Amazon. In a review, someone has to be the person whose only job is to hold the line on quality, independent of the team being reviewed and independent of the chain of command. Their only role is to ask the question nobody else wants to ask.

We worked with some of the best engineers in the world but it didn’t matter. Humans are humans. We have biases, we forget things, and not everyone has the operational scar tissue that comes from carrying a pager for ten years. The ORR existed to close that gap, and the bar-raiser existed to make sure the ORR did.

What an ORR actually was, in practice, was a sequence of conversations across the life of a service. We’d sit with a team and talk about past incidents, about what we’d learned the hard way, about how the architecture would behave at 3 a.m. when nobody was around to nurse it. Because the system you design at a whiteboard with a cup of coffee is rarely the system that exists at 3 a.m. with a half-broken dependency and an on-call engineer who joined three weeks ago. That’s the work-as-imagined versus work-as-done gap, and the ORR was the structured way we forced ourselves to look at it.

The thing that made it work wasn’t the template. It was the people in the room. Operationally savvy engineers, senior enough that they’d carried the pager for years, seen plans fall apart in production, watched checklists hide more than they revealed, and worked their way through runbooks clearly written by someone who’d never actually been on call. They’d been burned enough times that they couldn’t help but ask the awkward follow-up question. *”You said you have circuit breakers. Fine. So, walk me through what happens when one trips. Where does the request go? What does the user see? How do they behave under load?”* When someone gave a shallow answer, they noticed and pushed because they knew, from experience, exactly where the team’s mental model would crack first.

That’s the part you can’t extract from a checklist.

When I started working with customers outside AWS, this became obvious very quickly. Many organizations don’t have those people in the room, often because of the separation between application and operations. They have senior engineers and architects, some of them deeply experienced, but their experience is often architectural rather than operational. They’ve designed things; they haven’t necessarily watched those things degrade at 3 a.m., or stayed up through a postmortem trying to work out why the runbook said one thing and the system did another. So when the team runs an ORR, there’s no one whose instinct is to push on the answer that sounds clean but isn’t.

What happens then is predictable. The form gets filled in, the boxes get checked, and six weeks later the service goes down because nobody asked the question that would have surfaced the assumption that was wrong all along. The ORR happened; the learning didn’t.

The Resilience Companion is my attempt to put something useful in the room when those people aren’t there.

It’s an AI facilitator that runs the kind of conversation I used to run; Socratic, a little pushy, anchored in actual incidents. It asks the team to predict before they look things up. It traces failure paths step by step. It surfaces relevant industry post-mortems when the conversation gets near a known pattern. When someone says “we have retries,” it asks them to explain the backoff strategy in more detail. The gap between what the team can answer and what they need the companion to go look up for them is itself one of the most useful signals you can capture.

The friction is the feature. Cognitive science has a name for it: productive struggle, or what Robert and Elizabeth Bjork called desirable difficulties. When you force someone to generate an answer rather than recognise one, retention goes up. Make them predict before they look, and the prediction becomes the scaffold the new information attaches to. Most enterprise tools optimise for frictionless retrieval and call it a productivity win. The Companion goes the other way on purpose. Learning requires friction, and most reviews don’t have any.

Two things I want to be clear about.

It is NOT a replacement for an operationally savvy senior engineer. Real scar tissue carries context no model has, and the conversation a seasoned SRE has with a team in a room will always be richer than anything a tool can run. I don’t claim otherwise and I won’t.

It IS a way to bootstrap the process, to get teams running ORRs that actually probe, rather than ones that just generate paperwork, while the organization builds the wider community of engineers who can eventually take that role. It’s scaffolding, not a building.

The Companion is also a working argument. My book Why We Still Suck at Resilience makes the case that most resilience practices degrade into checklists and compliance theater because organizations optimize for performance over learning. The Resilience Companion is what the alternative looks like in code. It refuses to become a checklist. It treats productive struggle as the point. It measures itself on whether teams learned during the session, not on whether the form got filled in.

If you don’t want to run anything, the template the Companion uses is on GitHub as a standalone markdown file: ORRtemplate.md. A skilled human facilitator with the template can run the same conversation, and probably better. The Companion is one way to use the template.

It’s experimental, pre-1.0, Apache-licensed, self-hosted. `docker compose up` and go.

If you’ve ever sat through an ORR where everyone politely agreed that everything was fine, and then watched that same service fall over a month later, this is for you.

Try it. Open an issue if you find a bug. Open a pull request if you’re brave enough.

// Adrian

The MTTR Argument You Keep Having

Adrian Hornsby — Sun, 03 May 2026 06:54:41 GMT

Photo by Peter Herrmann on Unsplash

If you have spent any time in an organisation that operates software, you have probably had this conversation more than once. Someone pulls up the MTTR dashboard, the number is too high, or has not moved in a quarter, or has gone up despite years of investment, and the argument starts.

Should we exclude that one six-hour incident, given that it was clearly an outlier? Should we trim the top and bottom 5%, or use Winsorization, or median absolute deviation, or some combination of the three? Should we separate by severity, by service, by detection path, by phase of the moon? Everyone has a reasonable-sounding statistical technique to propose, everyone is trying to make the number more honest, and every few months the same argument comes back, usually with a new technique attached.

What I find interesting is that the people having these arguments often already know MTTR has problems. The conversation I keep seeing is some variation of “we know the metric is flawed, but leadership wants something simple, so let’s at least make it less wrong.” That framing is honest about the constraint, and I have a lot of sympathy for it. The pressure to produce a single number that fits on an executive slide is real, and pretending it is not real does not help anyone.

But I want to suggest that the argument that follows from this constraint is not what people think it is. It is not really a debate about statistics, but a symptom of a deeper problem that no amount of mathematical trickery is going to fix, and the constraint that produces it is worth examining directly rather than working around.

Before going further, a small note. MTTR means different things in different organisations: Mean Time To Repair, Recovery, Resolve, Respond, or Restore, depending on who you ask and which framework they grew up with. The argument I am about to make does not depend on which version you use. The statistical and structural problems apply to all of them, because they are all averages of incident durations and incidents are the same kind of heterogeneous events regardless of where you put the start and end markers (which is often another subject of argument). So wherever you see MTTR in what follows, read in whichever variant your organisation uses.

Where MTTR comes from

MTTR was not invented for software. It comes from industrial reliability engineering, where the same machine fails the same way over and over again. Picture a production line that turns out the same product, day after day, using the same machines running the same processes in the same sequence. Nothing about the line changes from week to week. A pump fails, you repair it, you record how long the repair took, and a month later the same pump fails again in roughly the same way and you repair it again. Over hundreds of failures, the mean repair time tells you something useful about your maintenance operation: how good your technicians are, how well-stocked your parts inventory is, how clear your procedures are when something goes wrong.

The metric works in this setting because the underlying events are repeatable, and they are repeatable because the system producing them is stable. The line in March is the same line as in January, the pump is the same pump, the failure mode is the same, the repair procedure is the same, and averaging across them is a coherent thing to do because the variation between events is small relative to what is being measured. This is the world MTTR was designed for, and in that world it works well.

Where MTTR breaks

Software incidents are not pump failures. The system that produced the incident in January is not the same system that produces the incident in March, because in the intervening months new code has shipped, dependencies have updated, usage patterns have shifted, architecture has evolved, and the people operating it have learned things that have changed how they think about it. Incidents in complex software systems rarely if ever repeat themselves, because the system underneath them is constantly changing. Every incident is closer to a novel event than a repeat of a known one, and the container restart that took five minutes last month and the database corruption that took six hours yesterday are not two samples from the same distribution. They are not even the same kind of thing.

When you average across them, you are not smoothing out variation in a repeatable process. You are computing the mean of events that have very little to do with each other, and that mean does not represent operational efficiency in the way it does for the pump. It represents nothing in particular.

This is what people are running into when they argue about outliers. The instinct is correct because the number really does not feel like it represents anything real, but the mistake is in thinking the problem is the outliers when the problem is that the events being averaged were never the right kind of thing to average.

Why the argument keeps coming back

Once you see this, the recurring debate about how to clean up MTTR makes more sense. It is not an argument anyone can win, because there is no statistical treatment that turns an incoherent measurement into a coherent one. Trimming makes the number more stable, Winsorization makes it more defensible against critique, MAD makes it more robust to extreme values, but none of these change the underlying issue: the thing being measured was never a single coherent quantity to begin with.

So the argument resurfaces every few months, and it usually surfaces from the top. A large incident lands and leadership wants to know why it took so long, who was responsible, and what is being done about it. The MTTR dashboard comes back onto the table, often as the entry point to a broader conversation about accountability, with questions about why the number is what it is, why it has not improved, and what can be done to make it move. The team, sitting with the same flawed metric they have been sitting with for years, does what they can. They propose a normalisation technique, argue about which incidents to include, reach a decision, and the number stabilises for a while before drifting again, at which point the next large incident reopens the same conversation.

It feels like progress, because there is intellectual work happening, but the argument has no terminal state. It cannot, because the question it is trying to answer was the wrong question, and the dynamic that keeps reopening it is not really about statistics either. It is a search for somewhere to put the discomfort that comes after a bad incident.

Outliers are not the problem

There is a second thing worth saying, which often gets lost in the statistical conversation: outliers in incident data are not noise. They are frequently the most important events in the dataset.

The six-hour database corruption that dominates your MTTR is the incident that taught the team something they did not know about their architecture, that revealed gaps in monitoring, runbooks, staffing assumptions, and coordination patterns, and that produced lasting changes to how the system gets built and operated. The five-minute container restarts produced none of that. They are routine, they reflect operational competence, but they do not generate learning the way the long, painful incidents do.

Treating outliers as data quality problems to be cleaned up is therefore exactly backwards. They are where the signal is, in a sociotechnical sense, and reducing them as a goal is not in the organisation’s interest even if doing so might make a dashboard look better. This is the deeper irony of the argument about how to handle outliers: the events being argued about are often the most informative events the organisation has access to, and the statistical techniques being proposed to minimise their influence on the headline number are minimising the influence of the most important data points.

What works better

I have written before about percentiles as a more honest alternative, and I will not repeat the full case here. The short version is that P50, P90, and P99 tell three different and useful stories, none of which collapse a heavy-tailed distribution into a single mean that represents nobody’s experience.

The practical move I would suggest for organisations currently committed to MTTR is not to remove it overnight, because that tends to produce resistance disproportionate to what the change is actually asking for. Instead, put percentiles next to MTTR in the existing reporting and walk leadership through what each one actually tells them.

A small worked example helps. Imagine a quarter with 20 incidents: 18 of them resolved in under 10 minutes (small, well-understood failures handled by automation or quick rollback), one took 90 minutes (a regression that needed a careful diagnosis), and one took 8 hours (a novel failure that pulled in three teams and rewrote how a subsystem gets monitored). Most people in the org would say this was a good quarter. The team caught problems early, handled the routine ones quickly, and learned something significant from the long incident.

Now look at what the metrics say. MTTR for the quarter is around 33 minutes, a number that suggests typical incidents take more than half an hour to resolve, which is not true of any incident in the dataset. Look at the percentiles instead and the picture changes. P50 is around 6 minutes, which tells leadership that half the incidents are resolved in under 6 minutes, an honest reflection of the team’s standard operational capability. P90 is around 10 minutes, which says that even when things are not routine the response is usually quick. P99 is 8 hours, which surfaces the long incident as what it is: a rare, complex event worth a separate conversation, not a number to be averaged away.

Three numbers, three different conversations. The team’s day-to-day capability, their handling of harder cases, and their worst event of the quarter, each visible without being collapsed into the others. Compare that to a single MTTR of 33 minutes, which represents none of those three things and invites the wrong follow-up question: “why is MTTR so high, and what are you doing to bring it down?”

Once leadership sees this side by side, the conversation about which view represents reality becomes much easier to have. If trimmed percentiles turn out to be useful later, they are available, but that is a refinement on a more coherent measurement rather than a rescue attempt on an incoherent one.

The thing I would push back on is the assumption that more sophisticated mathematics applied on top of MTTR is going to produce a better answer. It will produce a momentarely more stable number, but it will not produce a more meaningful one, because the events being measured are not the kind of events an average can usefully describe and no amount of trimming, Winsorization, or MAD is going to change that.

Why this matters

The reason it is worth getting this right is because metrics shape what organisations pay attention to and what they invest in, and when the headline metric is an average of incident durations the implicit goal becomes reducing that average. Teams optimise for it, behaviours follow, and outliers, which is to say the incidents most worth learning from, become problems to be made smaller rather than events to be understood.

A measurement system that points the organisation in this direction will keep producing the same arguments quarter after quarter, regardless of how the statistics are dressed up, and the way out is not a better technique but recognising that the question MTTR was designed to answer is not the question software incident response is actually asking.

The leadership pressure driving the original constraint, the request for a single simple number, is not unreasonable on its face. Leaders do need ways to compare across teams, track direction over time, and have short conversations about complex things, and the honest response to that need is not to refuse the simplification but to offer one that points at something real. Percentiles do that better than MTTR. The work, in the end, is in showing leadership the difference, rather than in producing a more defensible version of the wrong measurement.

//Adrian

The Interpretation Layer

Adrian Hornsby — Sat, 25 Apr 2026 06:14:33 GMT

Photo by Martin Sanchez on Unsplash

On April 20, 2026, a security researcher named @weezerOSINT publicly disclosed that chat histories and source code from public projects on Lovable, the AI (vibe)coding platform, could be accessed by any authenticated user. After some back and forth, in what became a great lesson in how not to respond to vulnerability reports, Lovable shipped a fix. It published a detailed postmortem two days later. Most of the coverage has focused on the Public Relations handling (or lack thereof) and security failure itself: a backend regression reintroduced access that had been deliberately removed over the course of 2025, exposing user data for roughly ten weeks.

That’s the story most people are telling. I want to tell a different one.

Buried in Lovable’s postmortem is a detail that changes what the incident is about. Researchers caught the regression. They filed reports through the proper channel, HackerOne, starting February 22. The first report landed within weeks of the regression shipping. The detection system worked exactly as designed.

And then, for two months, every report was closed as “expected behavior.”

The signal was not missed. The signal was received, interpreted, and dismissed. Not by malice, not by incompetence, but by a triage layer operating on documentation that no longer matched the product it was describing. The detection system worked. The interpretation layer failed.

This is the failure mode nobody writes about, and it lives in almost every organization I’ve worked with.

Chinese whispers at organizational scale

Photo by Nguyễn Phúc on Unsplash

You have probably played Chinese whispers, often called telephone. A group of kids sits in a line. The first whispers a message into the second’s ear. The second passes it to the third. By the time it reaches the end, “purple umbrella in the rain” has become “people wrestling trains,” and everyone laughs because the punchline is always the mangled message at the end.

What the game illustrates is more useful than the laugh.

Notice what does not fail in Chinese whispers. The signal is intact at the source. Every person in the chain is acting in good faith. Every person is passing along what they heard, faithfully, with some confidence. No individual link is broken. The failure is distributed across the chain itself, in the coupling between links, and crucially, nobody in the middle knows how the message has drifted. Each person only hears one version and passes it on. They cannot check their interpretation against the original because they do not have access to the original. They only have what the previous link handed them.

This is exactly the shape of the Lovable failure, and of most interpretation layer failures I have seen. The researchers filed accurate reports. The triagers compared those reports to the documentation they had been given. The documentation was an artifact from an earlier version of the product. Each link in the chain acted correctly against its own inputs. The whole chain produced the wrong output, with complete confidence, for ten weeks.

The game is a children’s game because the stakes are a laugh. At organizational scale, with real systems and real users, the stakes are incidents, lawsuits, and eroded trust. The mechanism is identical.

The three-layer model

Think about how signals flow through your organization. There are at least three distinct layers involved in turning an external or internal event into an organizational response.

Detection is the layer that captures the signal. Monitoring, alerting, bug bounties, customer support tickets, audit logs, researcher reports. Most organizations invest heavily here, because detection failures are visible and embarrassing. When an outage hits and nobody saw it coming, the question “why didn’t we detect this?” is the first one asked.

Response is the layer that acts on the signal. Runbooks, on-call rotations, incident commanders, remediation workflows. This layer gets moderate investment, usually in the form of process and tooling.

Between them sits interpretation. The layer that decides what the signal means. Is this alert real or noise? Is this report a vulnerability or expected behavior? Is this customer complaint a bug or a misunderstanding? Interpretation almost always happens implicitly, inside the heads of whoever is looking at the signal, supported by whatever documentation was accurate the last time someone updated it.

The interpretation layer is where Lovable’s incident lived for ten weeks. And it’s the layer most organizations treat as if it doesn’t exist.

Why interpretation drifts

Three mechanisms push the interpretation layer out of sync with reality, and they operate continuously in every system I have ever seen.

The first is that interpretation context is static while the system it describes is dynamic. Documentation is written once, updated occasionally, and consulted continuously. Products change every deploy. The half-life of accuracy in any piece of interpretive context is short, and almost nobody is measuring it. The moment you ship a change that affects how an external signal should be interpreted, you have created a drift window. The question is only how long the window stays open.

The second is that interpretation is often outsourced, literally or functionally. Triage partners like HackerOne, support vendors, junior staff, automated classifiers, LLM-based first-line responders. They operate on context handed to them, with no independent way to verify that context against current reality. They also have no incentive to question it. Questioning the context means slower ticket closure, more escalations, and the appearance of not doing your job. The incentive structure rewards confident application of whatever context you were given.

The third is the most insidious. Successful interpretation looks identical to correct interpretation from the outside. A closed ticket is a closed ticket. The dashboard shows healthy throughput. The SLA metrics look fine. You only discover the difference between “interpreted correctly” and “interpreted consistently with outdated context” when the suppressed signal becomes undeniable, usually through public disclosure, a major outage, or a regulator knocking on the door. By which point you have months of accumulated evidence that your process was working, which makes the eventual failure both more surprising and more damaging.

The work-as-imagined gap

If you have read any of my previous work, you know I spend a lot of time on the gap between work-as-imagined and work-as-done. The Lovable incident is a near-perfect illustration.

The product team imagined the disclosure process as: researcher files report, we investigate, we fix. Clean loop, clear ownership, visible outcomes.

The actual work-as-done was: researcher files report, triager compares the reported behavior against Lovable-provided documentation, documentation describes the behavior as intended, ticket closes as “not a vulnerability.” Also a clean loop, with clear ownership and visible outcomes. The two loops produce identical metrics. They diverge only in what they do with actual vulnerabilities.

Nobody in the product loop could see the triage loop. Nobody in the triage loop could see that the product had changed. The organization had built a pipeline where each stage was functioning correctly by its own lights, and the failure existed only in the coupling between stages, which nobody owned.

This is where it helps to name the deeper pattern. Failure in complex systems is overwhelmingly a property of emergence, not of components. Each component can be tested, verified, documented, and owned. The interactions between components cannot be fully tested in advance, because the interaction space is combinatorial and the only place the real combinations happen is production. Woods, Hollnagel, and Cook have been writing about this for decades. Complex systems fail at the joints, not at the parts.

Interpretation layer failures are a particularly clean example of this principle. The product was working. The triage process was working. The researchers were doing their job. The failure lived entirely in the coupling, in the flow of interpretive context from product to triage, and coupling is structurally harder to own than components. Components have teams. Interfaces between components tend to be orphaned by default, because assigning ownership of a coupling requires someone to actively notice that the coupling exists and matters. In fast-moving organizations, that noticing usually happens after the incident, not before.

This is how interpretation layer failures almost always look. Not dramatic. Not obvious. A slowly widening gap between what the system is doing and what the interpreters think it is doing, invisible until the gap produces an incident big enough to reach the public.

Read the remediation

Here is where Lovable’s postmortem gets interesting, because it contains both the right diagnosis and the wrong remediation, and the ordering tells you something.

The postmortem announces two categories of fix. Training: “extensive product training to all HackerOne triagers,” “retraining our HackerOne triage team on Lovable’s current permission model.” And infrastructure: “restructuring our escalation workflow to ensure that any product change affecting user data access automatically triggers an update to our triage documentation.”

Training appears first in the communication. Infrastructure appears as a subordinate clause inside a longer paragraph.

This is backwards, and it matters. Not because training is wrong. Training matters. The humans in the interpretation layer are the interpretation layer. You cannot automate them out, and you should not want to. What matters is what training can and cannot carry on its own.

Training fixes knowledge gaps in people. It does not fix freshness gaps in artifacts. If you retrain the triage team today on the current permission model, you have solved the problem for exactly as long as the permission model stays current. At any platform shipping changes continuously, that window is measured in weeks. Then you are back in the same state, except now you have a false sense of having addressed it, because you did the visible thing.

The automated trigger, the coupling between product changes and triage documentation updates, is the fix that holds across time. It does not decay the moment you stop paying attention. It makes the interpretation layer self-updating, or at least self-alerting when it falls behind. It gives the trained humans fresh context to work with, instead of making them rely on memory of context that is aging out. Training and infrastructure are not substitutes for each other. They are complements, and most organizations invest in the first while neglecting the second.

Sidney Dekker has written extensively about the temptation to locate systemic failures at the human layer. It is the cheapest intervention to announce. It shows care for the people involved. It produces a deliverable. The problem is not that it is wrong, but that it is incomplete. Training humans to interpret stale context correctly does not prevent the context from going stale again next quarter. The system that created the conditions for drift keeps producing them, and humans alone cannot hold back drift they have no way to see.

This gives you a small diagnostic you can apply to any incident postmortem, your own or someone else’s. Look at the remediation list. Count the training items versus the infrastructure items. The ratio tells you where the organization believes the failure lived. If training leads, the organization still thinks this was a people problem. If infrastructure leads, they have absorbed that it was a systems problem.

Lovable is doing both, which is better than most. But the prominence of training in the communication is a tell. The organization is not yet convinced that the interpretation layer is infrastructure. It still thinks of triagers as people who need better preparation, rather than as sensors that need fresher input signals.

What to do about it

If you accept the premise that interpretation is infrastructure, a few concrete patterns follow. All of them share a common move: making the coupling visible, ownable, and instrumented. Breaking the Chinese whispers structure by giving the chain ways to check itself against the original.

Treat interpretation context as a deployment artifact. If product changes ship through a pipeline, the context that downstream interpreters need to correctly handle those changes ships through the same pipeline. Not as a Notion doc someone might update. As a versioned artifact with ownership, review, and a release cadence tied to the underlying system. Lovable’s automated-trigger fix is a version of this. It can be done more rigorously.

Instrument the interpretation layer itself. What is the rate of external reports being closed as “expected behavior” or “not reproducible” or “working as intended”? Is that rate changing over time? Are specific reporters being repeatedly dismissed? These are leading indicators of interpretation drift, and almost nobody tracks them. You do not need fancy tooling. A monthly review of closure reasons, segmented by category, would catch most drift before it becomes an incident.

Build a feedback path from response back to interpretation. When a real incident happens, the first question in the postmortem should be: did our interpretation layer suppress earlier signal, and if so, why was the context wrong? Lovable is doing this retrospectively for one incident. The resilient version does it as standing practice, for every significant incident, every time.

Assume the interpretation layer is drifting right now. You do not know in which direction. Periodic red-team exercises where someone files a known-real issue through the normal channel and watches what happens. Synthetic reports, seeded bugs, planted scenarios. Cheap, high-signal, almost never done. If you cannot bring yourself to do this, at minimum review a random sample of closed tickets every quarter and ask whether the closure decisions still hold up against the current state of the product.

The adverse incentive

One last thing worth naming. There is an incentive problem in how incident communications get read.

Training remediations are legible. They sound decisive. They show empathy for the people who were let down. They produce a clear before and after story. The organization that announces comprehensive retraining looks responsive.

Infrastructure remediations are invisible when they work. An automated documentation trigger that quietly keeps triage context fresh for the next five years produces no headlines. It generates no narrative. It looks like nothing happened, because nothing did, which was the goal.

The infrastructure half is invisible when it works. Training initiatives announce themselves. Automated documentation triggers do not. This is the prevention paradox applied to incident response, and it creates pressure to lead with training in communications even when the organization knows better. Training is visible. Infrastructure is not. The combination is what produces resilience, but only one half of it tells a good story.

Naming this pressure is the first step in resisting it. If you are writing a postmortem, lead with the coupling fix. If you are reading one, discount the training and look for the infrastructure underneath. The training tells you what the organization wants you to see. The infrastructure tells you what will change.

Closing

Every mature organization has invested heavily in detection, because detection failures are visible, embarrassing, and easy to blame on missing tools. Response has received moderate investment, because response failures produce dramatic incidents that demand process improvement.

The interpretation layer sits between them, quietly translating signals into action, drifting continuously, invisible until it produces the kind of failure Lovable had. Most of the time, nobody owns it and nobody measures it. When it fails, organizations reach for the remedy they can see, training, because training is legible and infrastructure is not. Training alone is not enough. It has never been enough. The next frontier of resilience work is not better detection or faster response, it is treating the layer between them as infrastructure that supports the humans doing the interpretation, with the ownership, instrumentation, and deployment discipline that implies.

The detection system worked. The interpretation layer failed. That sentence will fit almost every major incident you read about this year. The question for you to think about is whether your organization is building the muscle to notice when it applies to you, before the researcher writes the blog post.

//Adrian

When the pair programmer is confidently wrong

Adrian Hornsby — Fri, 24 Apr 2026 05:47:22 GMT

Photo by Sean Lim on Unsplash

A few weeks ago, I migrated the resiliumlabs.com website off Squarespace. That part you can probably guess at: the renewal was expiring, Squarespace had gotten expensive and bloated, and moving to a static site on Cloudflare Pages meant faster loads, something fun to do, and about a tenth of the monthly cost.

What I want to write about isn’t the migration process itself, but the fact that I did most of it with Claude as a pair programmer. And that somewhere around mid-day, we took the site down in public for about an hour. And that if you’ve read Chapter 14 of my book, you already know what happened.

The plan

The goal was fairly straightforward. A static Astro site deployed to Cloudflare Pages. DNS moved from AWS Route 53 to Cloudflare (needed because Route 53 doesn’t support apex CNAMEs to non-AWS targets). Newsletter kept on Substack at newsletter.resiliumlabs.com, rebranded to match the new design. Old LinkedIn-shared URLs 301-redirected to the right Substack posts so nothing that anyone had ever shared would break.

Claude was really useful for this. He created the redirect map, the CSP with sha256 hashes for inline scripts auto-computed as a postbuild step, security headers, OG tags, and Schema.org JSON-LD. The kind of boilerplate that I would know how to write but would take me a few days to assemble correctly, and that Claude produced in about twenty minutes with me reviewing and trimming.

That part of the day felt like what the AI marketing promises: I do the architecture and the judgment, the machine handles the drudgery, we ship faster than either of us alone.

Then we got to the cutover.

The outage

Here’s what you need to know about Squarespace: their custom domain feature is powered by Cloudflare for SaaS. When you connect a domain to Squarespace, they register it as a “Custom Hostname” in their Cloudflare account. That binding tells Cloudflare’s edge: “when you see a request for this hostname, route it to Squarespace.”

I didn’t know that. Claude didn’t know that either, or “forgot” to mention it.

The cutover sequence we executed was, in order: migrate DNS to Cloudflare, point the apex and www CNAMEs at the Pages project, add the domain as a custom domain in Pages, disconnect the domain from Squarespace.

Each step looked fine in isolation. DNS propagated. Pages showed the domains as “Active.” The apex returned a 200 when I tested it.

But for the public, the site was serving Domain Not Claimed from Squarespace’s edge.

The reason, which I understand now but didn’t at the time: even though I’d pointed my DNS at Pages, Cloudflare’s edge was seeing the hostname in two places — Squarespace’s Cloudflare for SaaS Custom Hostname registration, and my new Pages Custom Hostname registration. When Cloudflare gets a request, it has to pick one. It was picking Squarespace. And since I’d just disconnected the domain from my Squarespace site, Squarespace was responding “I don’t serve this domain anymore.”

Result: 404 “Domain Not Claimed” in every browser that wasn’t on my local DNS cache.

I noticed when I got the message: “it is mid day, your site is down!!!”

What AI looked like during an outage

I want to be specific about what happened next, because this is where the AI-assisted-work pattern gets interesting.

Claude’s first instinct was to wait for Squarespace’s backend sweep to release the Custom Hostname eventhough we had no idea how long that would take. Claude set up a polling loop to detect the moment it flipped. This is exactly the kind of thing Claude is good at: small, mechanical, instrumented.

It was also exactly the wrong suggestion. Waiting and hoping for something to happen fast enough is not a plan at midday. I needed the site back now, or I needed a rollback that would work in five minutes.

When I pushed back, Claude pivoted to the rollback path — delete the Pages custom domains, restore the old Squarespace A records in DNS, reconnect the domain in Squarespace. And then admitted, when I pushed back again: “Honest: I don’t know with certainty that the rollback will work fast either. The Squarespace reconnect flow: I don’t know if it lets you immediately re-add a just-disconnected domain. I haven’t done it.”

That very moment is the important “bite”.

Because Claude was confident about the cutover sequence when we planned it, confident during execution, and confident about the initial diagnosis, the uncertainty only surfaced when I explicitly pushed on it. And the sequence Claude confidently proposed was wrong in a “this will take your production site down for an unknown amount of time” way.

In the book, I wrote: “AI that drafts the postmortem outsources the sensemaking. AI that surfaces related incidents from organizational history while the team does their own analysis supports it.”

That’s exactly what I felt in real time. When I let Claude confidently drive the plan, I was outsourcing the sensemaking. When I stopped and asked “what do you actually know and what are you guessing,” the conversation got better immediately.

In the end, I decided to open a support ticket with Squarespace asking them to release the hostname from their Cloudflare for SaaS account. They released it within minutes. DNS settled. Cache purge. Site back up. Total outage: about an hour.

What I’d do differently

Three things.

Build an inventory before you execute. The Squarespace-uses-Cloudflare-for-SaaS fact was knowable. It’s in Squarespace’s docs. Neither of us looked. Claude happily generated a plan without that context. I got lazy and happily followed it because it looked reasonable. The failure mode of AI-assisted planning isn’t that the AI is stupid like many people still say; it’s that it’s plausible-sounding enough to skip the five-minute research step that would have surfaced the constraint.

Get the old vendor to release the hostname before you cut over, not during. The Squarespace Custom Hostname was the actual blocker. I should have opened the support ticket asking them to release it while the site was still live under Squarespace — wait for confirmation, then flip DNS to Pages and add the custom domains. The overlap window I was afraid of doesn’t exist if you treat “release the binding” as a separate, prerequisite step rather than something that happens automatically when you click Disconnect.

Support tickets are a tool, not a failure. My instinct during the outage was to try to fix it myself, with Claude’s help, using tools I already had. What actually worked was opening a Squarespace support ticket and asking them to do the one thing on their side that nobody else could do. Five minutes of writing the ticket, a few minutes for them to act. I wasted longer than that trying to figure out how to force Cloudflare’s edge to re-route on my own. The human on the other side of the vendor had the specific privileges I needed. The AI didn’t, and couldn’t know to suggest it.

That last one is worth sitting with. When you pair with an AI, you risk inheriting its frame of reference. AI tools operate in the shape of developer-tool problem-solving: commands, APIs, config edits. The longer I stayed in that loop, the more my own thinking stayed in that loop too — and “open a support ticket with the vendor” falls outside it. To be clear, that’s nobody’s fault but mine. It’s Work-as-Imagined versus Work-as-Done at a smaller scale: my imagined toolkit, shaped by the pairing session, didn’t include “ask a human at the other vendor.” Took me twenty minutes of trying to force the problem through code to remember that not every infrastructure problem has a code solution.

The irony

Few months ago, I wrote the final chapter of Why We Still Suck at Resilience. That chapter argues that AI risks outsourcing exactly the kind of learning that makes organizations resilient. That the danger isn’t that AI is bad, it’s that it’s good enough to let people stop engaging with how their systems actually work. That the productive struggle is where the understanding lives.

Then I used Claude to migrate my website, and at the first real incident, I caught myself doing the thing I’d just warned against. Taking the plan at face value. Not pushing on the uncertainty. Watching the outage tick by while the AI proposed an elegant polling loop.

I didn’t fire Claude… yet. It was actually really useful for most of the work. The fix was for me to stop treating Claude’s confidence as a signal about reality and start treating it as what it is: a stylistic default. The real information is what Claude doesn’t know, and you only get at that by asking what it actually knows and being an active participant in that conversation.

Resilience engineers call this the left-over principle. You automate the parts that are mechanical, and humans are left with the parts that aren’t — judgment, ambiguity, the phone call to the vendor, the gut sense that a plan smells wrong. The trap is that humans do those parts worse after prolonged exposure to automation, because the automation has taken the routine practice that built the competence. Bainbridge described this in her 1983 paper “Ironies of Automation.” AI didn’t change the lesson. It just raised the stakes of forgetting it, by expanding what automation looks like it can handle.

//Adrian

When Guidance Becomes Compliance

Adrian Hornsby — Thu, 23 Apr 2026 05:29:39 GMT

Photo by Robert Lukeman on Unsplash

About ten years ago, when I started as a Solutions Architect at AWS, one of my early customers was based in Iceland. Good team, serious engineers, running large production systems for international customers.

On my first visit, they told me they did not want to run a Well-Architected Review (WAR).

Not ever.

A few months earlier, another AWS SA had come out, run a WAR, and basically turned it into an audit. The output was a verdict. He told them something like “your maturity is low, you are not doing things right, here is a list of what you are missing.” No conversation about context, no acknowledgment that some of their choices were deliberate trade-offs, no curiosity about why the system looked the way it did.

They were still angry about it months later, and they made sure I knew.

When I eventually did run the proper review with them, I did not open a spreadsheet. I asked them to walk me through how the system actually worked. Where the pain was. What they had tried and backed out of. What kept them up at night. We talked through trade-offs as peers. I shared some of my experiences and what I had seen other teams do, and where those choices had gone wrong. At one point they told me, quite directly, that if I had shown up doing what the previous SA had done, it would have ended badly. Icelanders are scary when they want to be.

We ended up at the restaurant that evening. We had a good night.

I think about this story a lot, because it sits right on the fault line of something I want to write about: the way guidance, over time, drifts from a tool that helps teams reason into a control that judges them.

The drift is not an accident

Here is a pattern I have watched play out repeatedly, inside AWS and outside it.

Someone smart writes down what they learned. “Here is what we figured out, adapt it to your context.” It is a practice. It lives in someone’s notebook or a team wiki. People read it, argue with it, take what works, and move one.

Then it gets popular. More people want to use it. So someone writes it up more carefully, adds structure, makes it easier to consume. Now it is a guideline. Still useful. Teams use it as a starting point for their own thinking.

Then the guideline gets adopted by more teams. And as the conversations multiply, they start to feel expensive. Each one takes a couple of hours. Multiply that by dozens of teams, then hundreds, and it looks like a lot of senior engineering time spent on something that, cumulatively, starts to look like a thing that should obviously be automated. So the content gets extracted. A checklist appears. The reasoning behind each item is still there in someone’s head, but the artifact itself no longer carries it. The checklist gets reviewed quarterly. Now it is a control.

Then the control gets audited. Teams are measured against it. Scores appear. Green, yellow, red. The checklist no longer describes a way of thinking, it describes a thing you either did or did not do. The reasoning behind each item has faded. People who implement the items often cannot tell you why they matter. They just know they get flagged if the box is not ticked.

Now it is compliance.

At each step of this drift, something is lost. The context disappears. The trade-offs disappear. The “it depends” disappears. What started as help ends as judgment.

What the drift reveals

It is easy to describe this as an accident of institutional life, but I do not think that is quite right. Organizations drift toward compliance because compliance is cheaper than learning.

Compliance is fast. You can audit a hundred teams against a checklist in a week. You cannot run a real conversation with a hundred teams in a week. Compliance scales. Learning does not.

Compliance is legible to leadership. Green-yellow-red dashboards make a VP feel like the organization is under control. A conversation about trade-offs does not fit in a dashboard.

Compliance creates accountability without requiring judgment. If something goes wrong and the team was compliant, nobody is in trouble. If something goes wrong and the team made a judgment call that did not pan out, someone is in trouble.

So when I see an organization drift from practice to control, I do not see bad intent. I see an organization optimizing for efficiency, legibility, and risk transfer. Those are real values. They are just not the same as the values that produce good systems.

Good systems come from teams who can reason about their context, make trade-offs deliberately, and adapt when conditions change. That requires learning. Learning is slow, messy, and hard to measure. So organizations that are under pressure to move fast and look accountable drift away from it. Not maliciously. Just gravitationally.

This is one of the core tensions I write about in Why We Still Suck at Resilience. Organizations say they want to learn from incidents, improve resilience, understand how their systems actually work. Then they build operating models that reward efficiency, legibility, and control. The two sets of values are in direct conflict, and when they conflict, efficiency wins. Drift toward compliance is what that looks like in the artifacts.

Well-Architected as a case study

Well-Architected started as a very useful artifact. In its early form, it was a structured way for AWS to have an informed conversation with a customer about their system. The pillars were prompts. The questions were designed to open up discussion. The output was shared understanding, not a judgment.

Then it scaled. More SAs needed to run reviews. TAMs got involved. Training got standardized. Lenses got added. A tool was built. Dashboards appeared. Partner programs were built on top of it. Customers and partners started asking for “Well-Architected certifications.”

Somewhere in there, the practice became a control. In some corners of the ecosystem, it became a compliance exercise. The reasoning behind the questions got compressed into answers you needed to give. The conversation got replaced with a form.

This is not a story about anyone behaving badly. Most of the people involved at each step were trying to help. The SA who audited my Icelandic customer a few months before me was probably trying to be rigorous. The person who built the first checklist was trying to make it easier for more people to benefit from the framework. The team that built the tool was trying to scale a good thing.

But the sum of all those well-intentioned steps was drift. And by the time I arrived in Iceland, a team had already learned that Well-Architected meant being judged by someone who did not understand their context.

To be clear, there are still SAs out there running great Well-Architected Reviews. I know some of them. They still treat the questions as prompts, the pillars as openings for a real conversation, the output as shared understanding. When it works, it really does work. The framework in the hands of a thoughtful reviewer remains a useful thing.

But the more customers I talk to, the more the word itself has become a problem. Well-Architected, for a lot of people now, means a monster process. Means judgment. Means a scorecard they do not want to submit to. I hear it in how they brace when the term comes up. They have been burned, or they have watched another team get burned, and the institutional memory of that is hard to undo. Whatever good work individual SAs are doing inside the framework is fighting against the reputation the framework itself has earned.

What the on-call engineer knows

The people who run systems at 3am know something that compliance frameworks cannot capture. They know which trade-offs were deliberate. They know which risks they are carrying on purpose. They know the history of the system, why it is shaped the way it is, what has been tried and why it did not work.

Guidance that respects that knowledge helps them reason. Guidance that ignores it judges them.

The test, for any framework, is whether it makes the team on call better at their job or whether it makes them better at passing audits. Those are different things. Most frameworks start out doing the first and, if nobody resists the drift, end up doing the second.

Resisting the drift is a choice, and it is not a free one. It means accepting that the framework will scale more slowly. It means leadership has to be willing to tolerate variance in how teams apply it. It means auditors have to accept that “it depends” is a legitimate answer. It means the framework itself has to be treated as a living thing, revised as the work changes, not frozen the moment it gets popular.

Most organizations will not make that choice. The gravity is too strong.

The organizations that resist the drift end up with teams who can actually reason about their systems. The ones that do not typically end up with a lot of green dashboards and a lot of incidents that nobody saw coming.

Closing

The Icelandic team I worked with is still out there, still running systems, still making trade-offs they understand better than I ever could. That day, 10 years ago, I learned more from them than they did from me. That is usually how it goes when guidance is used as guidance.

The drift toward compliance is the nature of things. But noticing the drift, naming it, and choosing to resist it, in your own corner of the work, is also a thing you can do.

It just takes longer. And you do not get a dashboard for it.

//Adrian

The Resilience Myths List

Adrian Hornsby — Tue, 14 Apr 2026 13:41:39 GMT

Last updated: April 2026. I add new myths every few months based on what I see in consulting, workshops, and conversations with engineering leaders. If you think one is missing, reply and tell me.

Photo by Sharon Waldron on Unsplash

I keep a running list of beliefs about resilience that sound right but quietly undermine the organizations that hold them. Some are widespread. Some are subtle. All of them show up regularly in the teams I work with.

What makes something a myth here? It fits a pattern: the belief simplifies a complex reality into something comforting and manageable, and in doing so, it closes off learning. Most of these aren’t wrong in every context. They become myths when they’re treated as settled truths rather than assumptions worth questioning.

The list is loosely grouped, but many of these cut across categories. That’s part of the point.

Metrics & Measurement

Track incident counts as a measure of resilience. Incident counts measure reporting culture, not system health. A team that reports more is often learning more.

MTTR tells us how good we are at recovery. MTTR is a mean. It hides variance, and recovery from real incidents looks nothing like recovery from minor blips. The number flattens everything that matters.

MTTD shows our detection capabilities. Detection depends on what you’re looking for. MTTD only captures what you already know to monitor. The things that hurt you are the things you weren’t watching.

Fewer reported incidents means better resilience. Or it means people stopped reporting. Fewer reports can signal suppression as easily as improvement.

We can measure resilience with simple metrics. Resilience is a property of how a system adapts to surprise. You can measure proxies, but no single number captures adaptive capacity.

If the metrics are green, it means my customers are happy. Green dashboards reflect what you chose to measure. Customer experience includes everything you didn’t.

No alerts means no problems. Alerts fire on known conditions. The absence of alerts tells you nothing about unknown failure modes.

Everything important will generate an alert. This assumes you’ve already imagined every way things can go wrong. You haven’t. Nobody has.

Resilience scores measure resilience. A score requires a model. The model requires simplification. The simplification removes exactly the things that make resilience hard.

Process & Planning

More documentation prevents incidents. Documentation helps when people read it, when it’s current, and when the situation matches what was documented. Those conditions rarely align during an actual incident.

Runbooks should be as detailed as possible. Overly detailed runbooks become brittle. When the situation deviates from the script (and it will), people freeze because they’ve been trained to follow, not to think.

Perfect runbooks prevent all problems. See above, but louder. Runbooks encode past knowledge. Incidents are defined by novelty.

Perfect processes eliminate human error. Processes are operated by people in context. Tighter processes don’t remove error, they change where it shows up.

Incidents are always preventable with better planning. If you accept that complex systems produce emergent behavior, you accept that some incidents are genuinely surprising. Better planning reduces some failures and creates blind spots for others.

Documented procedures capture how work happens. Work-as-imagined and work-as-done are different. Procedures describe the intended path. Actual work involves adaptation, shortcuts, and judgment calls that never make it into the document.

We can standardize our way to resilience. Standards create consistency, which is valuable. But resilience requires adaptation, which standards constrain. The tension between the two is where the real work happens.

More controls mean more safety. Controls add complexity. Each control introduces new interactions, new failure modes, and new cognitive load. Past a point, more controls make the system harder to understand and operate safely.

Culture & Learning

Psychological safety is nice-to-have but not mandatory. Without psychological safety, people hide problems, avoid reporting near-misses, and optimize for blame avoidance. You get a system that looks fine until it isn’t.

Psychological safety means comfort. Psychological safety means people can speak up without fear of punishment. It doesn’t mean the work is comfortable. Honest feedback, disagreement, and accountability are all uncomfortable and all require safety to function.

People resist change because they don’t understand it. People often understand the change perfectly well. They resist because they see costs the change advocates haven’t acknowledged, or because they’ve seen similar initiatives fail before.

Experience automatically translates to better incident response. Experience without reflection produces confidence, not competence. Ten years of unexamined incident response is one year repeated ten times.

Learning happens automatically from doing exercises. Exercises create experience. Learning requires structured reflection on that experience. Without a debrief, you’ve just had an event.

Incident reviews should focus on what went wrong. Reviews that focus only on failure miss how the system usually succeeds. Understanding normal work is essential to understanding why it occasionally breaks.

If you do incident postmortems, you will be resilient. Postmortems are a ritual. Resilience requires that the insights from those rituals actually change how the organization operates. Most don’t.

Blame-free means accountability-free. Blame-free means separating the question of what happened from the question of who to punish. Accountability for learning and improvement is central to the whole approach.

Blame-free means we don’t name who was involved. Naming who was involved is necessary for understanding the context. The point is to explore their perspective, not to assign fault.

Doing blameless postmortems means we are psychologically safe. Blameless postmortems can be theater. If people still feel judged, if the “lessons” always land on the same teams, if leadership isn’t present, the label is doing nothing.

Fixing this specific problem means we’ve learned. Fixing the proximate cause is a repair. Learning means understanding the conditions that made the failure possible and likely. Those conditions are still there after the fix.

Insights naturally spread across the organization. They don’t. Knowledge stays local unless you build explicit mechanisms to move it. Most organizations don’t.

We can change culture fast. Culture is the accumulated residue of what gets rewarded, tolerated, and punished over time. You can change incentives quickly. The culture catches up on its own schedule.

AI makes human expertise in operations less important. When AI handles 99% of decisions, the humans who step in for the remaining 1% still need the judgment and skills to do so. But practicing a skill 1% of the time erodes it. The fallback capability quietly atrophies under the automation it provides.

Incident Analysis

The postmortem is where learning happens. The meeting is where stories get told. Learning happens when those stories change how the organization operates. If the action items sit in a ticket queue and the insights stay with the people in the room, the postmortem produced a document, not learning.

Action items from incident reviews are the measure of learning. Action items measure intent to change. Completion of action items measures follow-through. Neither measures whether the organization actually understood the conditions that produced the incident. You can close every action item and still not have learned.

We need to find the root cause. Complex failures have contributing factors, not a single root. The search for one root cause directs attention to the most obvious factor and away from the systemic conditions, the interactions, the organizational pressures, the design tradeoffs, that made the failure possible and likely.

Severity determines whether an incident deserves a review. Low-severity incidents and near-misses often reveal the same systemic conditions that produce high-severity ones. They’re cheaper to learn from. Reviewing only major incidents means you only learn from the expensive lessons.

The people who caused the incident are the ones who need to learn from it. The people closest to the incident have the most context, but the conditions that produced it usually extend beyond their control. If only the responders learn, the organizational factors stay untouched.

AI-powered postmortems mean we’ve automated learning. AI can produce the timeline and summary faster. The learning comes from the conversation that surfaces why actions made sense to people at the time. You can automate the document. You can’t automate the understanding.

Chaos Engineering

Chaos engineering is just breaking things in production. Chaos engineering is hypothesis-driven experimentation. Breaking things randomly is just breaking things.

You need perfect systems before starting chaos engineering. If you wait for perfect, you never start. Chaos engineering is most valuable in systems that have unknown weaknesses, which is all of them.

Chaos engineering will cause more outages. Well-designed experiments have controlled blast radius and abort conditions. The outages chaos engineering prevents are the ones you weren’t prepared for.

Only Netflix-scale companies need chaos engineering. Every system that matters to someone has failure modes worth understanding. Scale affects the approach, not the need.

Chaos engineering requires expensive tools and infrastructure. You can start with a script that kills a process. The tooling helps at scale, but the practice starts with curiosity and a hypothesis.

Comprehensive preparation makes chaos experiments safer. Over-preparing for an experiment defeats the purpose. The value comes from encountering the unexpected. If you’ve prepared for everything, you’re testing your preparation, not your resilience.

More chaos experiments equals more resilience. Volume without learning is just activity. One well-designed experiment with a thorough review teaches more than twenty drive-by fault injections.

Chaos engineering without load produces good results. Systems behave differently under load. Testing failure modes in an idle system tells you how idle systems fail, which is rarely how production fails.

Passing one chaos experiment is enough. One experiment tests one hypothesis in one set of conditions. Systems change. Dependencies change. The experiment that passed last quarter may fail today.

Operational Readiness Reviews

An ORR is a checklist exercise. A checklist gets passed around. Someone fills in “yes” next to “Do you have monitoring?” Nobody asks what gets monitored, or what happens when the alert fires at 3am and the one person who knows the system is on vacation. Checklists describe Work-as-Imagined. The value of an ORR is in the conversation that probes Work-as-Done.

Passing an ORR means the system is ready for production. Passing means the review didn’t find disqualifying gaps. It doesn’t mean there are none. An ORR is a sample, not a proof.

An ORR and a regular ops health check are the same thing. An ORR is forward-looking: is this system ready for what’s coming? A regular operational health check is backward-looking: how did the last week go? One probes assumptions about the future. The other monitors the recent past. When organizations blur them together, the health check absorbs the ORR and the deep exploratory conversation disappears.

Teams can effectively review their own operational readiness. The people who built the system share the same assumptions about how it works. An effective ORR needs an external perspective, someone who wasn’t involved in building the thing and whose job is to ask the questions the team didn’t think to ask.

ORRs are a one-time gate before launch. Systems change after launch. Dependencies change. Teams change. An ORR that happened six months ago describes a system that no longer exists. Readiness is perishable.

Load Testing

Load testing validates that the system can handle expected traffic. If your test only confirms expected capacity, you’ve learned that your model is correct under the conditions you modeled. You haven’t learned where the model breaks down. Testing to expected capacity is validation. Testing beyond it is discovery.

Load tests with simplified traffic patterns represent production. Real production traffic has bursts, correlations, unusual request patterns, and hot keys that synthetic traffic doesn’t capture. Simplified patterns pass through common code paths and skip the edge cases that cause production surprises.

Load testing is done when the test passes. A passing test means the system handled one specific scenario. Change the traffic pattern, the data distribution, or the concurrency model and the result may be entirely different. One passing test is one data point.

Load tests should be predictable and reproducible. Reproducibility is useful for regression testing. But predictable tests don’t discover anything new. The most valuable load tests are the ones where you’re genuinely uncertain about the outcome.

Load testing in staging tells you how production will behave. Staging differs from production in data volume, traffic patterns, dependency behavior, and infrastructure configuration. It’s a useful approximation, but the approximation can hide the failures that matter.

If the system didn’t break during load testing, it won’t break under real load. Load testing covers the scenarios you thought to test. Production generates the scenarios you didn’t. The absence of failure in a test tells you about the test, not about the system.

Load testing is only about throughput and latency. Throughput and latency are the metrics people watch. But load also reveals resource exhaustion patterns, connection pool behavior, garbage collection pressure, dependency timeouts under contention, and cascading degradation paths. The system might meet its latency target while quietly exhausting something that will cause a much worse failure later.

Running load tests regularly means we understand our capacity. Regular tests with the same profiles produce the same results. Capacity understanding comes from varying the conditions: different traffic shapes, different failure modes injected during load, different dependency behaviors. Regular execution without variation is load testing theater.

Strategy & Architecture

If you are compliant, you are resilient. Compliance is a minimum bar defined by regulators. Resilience is what happens beyond the checklist, when the situation is novel and the playbook doesn’t apply.

Root cause analysis prevents future incidents. Complex failures have multiple contributing factors, not a single root cause. The belief in root cause directs attention to one factor and away from the systemic conditions that matter more.

What works at [Famous Company] will work here. Their context is not your context. Their culture, scale, history, and constraints produced their approach. Transplanting practices without transplanting context produces cargo cults.

Best practices are universally applicable. A best practice is a practice that worked well somewhere. Whether it works here depends on conditions that the label “best practice” conveniently obscures.

Industry benchmarks represent our reality. Benchmarks are averages across organizations with different contexts. Your reality is specific. Benchmarks tell you where you stand relative to a statistical abstraction, not whether you’re doing well.

One-size-fits-all solutions exist for resilience. If they did, resilience wouldn’t be hard.

More redundancy equals more resilience. Redundancy adds complexity. Failover mechanisms can fail. Failback processes are often untested. Past a point, redundancy creates the conditions for failures that wouldn’t otherwise exist.

You can buy resilience. You can buy tools. Tools support practices. Practices require people, learning, and organizational commitment. The part you can’t buy is the part that matters.

AI will solve your resilience problems. AI can accelerate detection, assist with triage, and draft postmortems. It can’t navigate the organizational dynamics that suppress learning or make judgment calls during novel failures. Resilience is an organizational capability. Tools don’t produce it on their own.

Resilience is the same as reliability. Reliability is about expectations of how well systems will perform, based on past history. Resilience is about adapting to unforeseen disruptions that couldn’t be specified in advance.

Resilience means being unaffected by disruption. Resilience means absorbing disruption and adapting. Being unaffected is invulnerability, which doesn’t exist in complex systems.

If nothing failed, we are resilient. The absence of failure tells you nothing about your capacity to handle it. A system that hasn’t been tested hasn’t been proven.

We can prevent all failures with better processes / tests / developers. Pick your version. All three assume failure is caused by a deficit that can be filled. In complex systems, failure is emergent. You can reduce it. You cannot eliminate it.

Zero incidents is a realistic goal. See above. Zero incidents is a goal that punishes reporting and rewards hiding.

Think one is missing? Reply and tell me what it is. This list grows because people share the myths they’ve bumped into.

When AI is a single point of failure you can’t audit

Adrian Hornsby — Thu, 09 Apr 2026 11:44:02 GMT

Photo by Tommy Diner on Unsplash

Much of resilience work centres on understanding failure. Sometimes that means asking what happens when this thing fails, so you can prepare. More often it means asking what happened after something already went wrong, so you can learn from it. Either way, the discipline assumes you can reason about the failure, that you can look at a component and say with some confidence what it does, how it broke, and what broke with it.

AI undoes that assumption in ways most organisations adopting it haven’t reckoned with yet. When a traditional component fails, the failure is legible because the system leaves evidence that humans can interpret, evidence whose logic was written by humans in the first place. When AI fails, that legibility suddenly vanishes. The model produced an output, the output was wrong, and the path from input to output runs through billions of parameters that nobody, including the people who trained the model, can fully explain. You can observe that it failed. You frequently cannot explain why, which means you cannot anymore say with confidence what conditions will trigger the same failure again.

There is research trying to change this. Mechanistic interpretability, most visibly pursued by Anthropic, aims to reverse-engineer the internal computation of large language models by tracing circuits from input to output. MIT Technology Review named it a breakthrough technology for 2026, and Anthropic has open-sourced tooling that lets researchers trace paths through a model’s features. The ambition is real, and the progress is there. But the honest state of the field is sobering: core concepts like “feature” still lack rigorous definitions, many interpretability queries are computationally intractable, and it currently takes hours of human effort to understand the circuits behind prompts of only tens of words. Even Anthropic’s CEO has framed this as a race between interpretability and model intelligence, one where interpretability is behind. And there’s a practical constraint that should feel familiar to anyone who has built observability into complex systems: someone has to decide which circuits to examine, which prompts to trace, which behaviours to investigate. Those decisions are shaped by what researchers already understand or expect, in the same way that monitoring coverage reflects what engineers anticipated being important. The parts of the model that nobody thought to look at are where the surprises live, and in a system with billions of parameters, that’s most of it.

This changes what containment means. Traditional blast radius engineering works because you can draw boundaries around components and reason about what crosses them. With AI, the boundaries blur because the model’s behaviour depends on context in ways no specification captures. The same model, given slightly different phrasing, might produce a completely different decision, and the difference between a correct decision and a catastrophic one can hinge on nuances that nobody documented. You can’t draw a blast radius around something whose failure modes shift with every input.

Organisations deal with this by treating AI the way they used to treat senior engineers: trusting it because it’s usually right. That works fine during normal operations, which is exactly why it’s dangerous. The trust builds invisibly over months of good performance, and by the time the AI makes a confidently wrong call, the organisational reflexes that would have caught a human making the same mistake have atrophied. Nobody double-checks the AI’s work anymore because it’s been right a thousand times, and the thousand-and-first time is when it matters.

The fallback problem compounds this. In well-designed systems, when a component fails you (hopefully) fall back to something simpler and more predictable. The fallback for AI-driven judgment requires the very skills that AI’s success has been quietly eroding: engineers who can still triage incidents manually, who can assess risk without a model score, and who remember how to build a judgement from first principles.

Researchers at Aalto University have documented this dynamic in detail. A study of an accounting firm found that reliance on automation fostered complacency and progressively eroded staff awareness, competence, and the ability to assess outputs. When the automated system was removed, the firm discovered that employees could no longer perform core accounting tasks. A 2026 Springer chapter on AI competency erosion identifies four pathways through which this happens: individual skill atrophy, structural erosion of expertise development systems, systemic organisational vulnerability, and what the authors call “false expertise transitions,” where apparent competence masks underlying knowledge gaps. Separately, human factors researchers have introduced the concept of a “Cognitive Infrastructure Threat,” arguing that the problem goes beyond losing the ability to execute tasks manually. What erodes is the reconstructive reasoning capacity needed to regain control during anomalies. A vigilant human who lacks that reconstructive capacity may detect that something is wrong but still be unable to intervene effectively. Aviation has already confronted this pattern. After a rise in near misses linked to declining manual flying skills, the US Federal Aviation Administration recommended that pilots periodically hand-fly for the majority of flights, essentially mandating the maintenance of fallback capability. Organisations routing critical decisions through AI face the same dynamic but have no equivalent mandate, and most (will) discover the gap only at the moment the fallback is needed, which is the worst possible time to learn that a capability has atrophied.

What makes AI a single point of failure rather than just another component is the breadth of decisions that organisations route through it. A failed database affects one service. A failed load balancer affects one cluster. But a failed AI that’s been integrated into incident response, capacity planning, pricing decisions, customer risk assessment, and operational monitoring affects all of them simultaneously. The failure is correlated because the same model, with the same blind spots, makes the same category of error across every domain it touches. Traditional single points of failure are well understood and heavily mitigated precisely because we learned the hard way what happens when they go down. AI as a single point of failure is new enough that most organisations haven’t even mapped it as one.

The audit problem ties all of this together. You can audit a rule-based system by reading the rules, and you can audit a human decision-maker by asking them to explain their reasoning, but you cannot meaningfully audit an AI system in the same way. The growing industry of AI evaluation frameworks is an attempt to close this gap, and those frameworks matter, but they share a fundamental limitation: every eval is Work-as-Imagined. Someone had to think of a test, which means the eval covers the scenarios its designer anticipated. The failure mode you care about is the one your test suite didn’t cover, the one that arose from a combination of context and input that nobody imagined, handled by the AI with the same confidence it brings to everything else. More sophisticated evals don’t escape this constraint, they just push it further out. The gap between the eval’s specification of how the system should behave and how it actually behaves in production under conditions nobody modelled is the Work-as-Imagine vs Work-as-Done gap applied to AI, and it’s where the failures live.

The EU AI Act’s transparency provisions take effect in August 2026, and regulators are beginning to require that organisations explain why their AI made a particular decision. The intent is sound, but the tooling isn’t ready. A CHI 2025 study surveyed AI audit tools and found the landscape fragmented, with transparency infrastructure still nascent and most tools focused narrowly on accuracy metrics, explainability, or fairness in isolation rather than addressing auditability as a coherent discipline. Even where tools exist, using them demands skills that most engineering teams don’t have. Mechanistic interpretability, circuit tracing, feature decomposition: these are specialised research techniques, not standard engineering practices. Debugging an LLM is nothing like debugging a microservice, and the people who know how to do it are concentrated in a handful of research labs. The broader engineering community is playing catch-up with a discipline that barely existed five years ago, and the gap between what regulators are about to require and what practitioners can actually deliver in production is significant.

This leaves organisations with a genuinely uncomfortable choice. Treating AI as a critical dependency with opaque failure modes is the right starting point: assume it will fail, assume the failure will be surprising, assume you won’t understand it immediately, and build the containment and fallback capability to operate without it. But that last part is where most strategies quietly fall apart, because maintaining the human expertise to operate without AI means investing continuously in exactly the capability that AI was supposed to replace. It means paying for engineers who can work without model scores, keeping engineers who can make judgments calls by hand, training people in matters that the AI handles 99% of the time. That’s an ongoing cost with no visible return during normal operations, which makes it the first thing to get cut when budgets tighten, and the thing you most desperately need when the AI gets it wrong in a way nobody anticipated.

//Adrian

—

References used in the post

MIT Technology Review, “Mechanistic interpretability: 10 Breakthrough Technologies 2026” (January 2026) — Link

Anthropic interpretability research and circuit tracing — Link

Dario Amodei, “The Urgency of Interpretability” — Link

“Mechanistic interpretability: 2026 status report” (GitHub gist summarising field state, limitations, and open problems) — Link

Aalto University / TechRadar, “Researchers warn that skill erosion caused by AI could have a devastating and lasting impact on businesses” (September 2025), referencing “The Vicious Circles of Skill Erosion” (2023) — Link

Yadav, P.S. (2026), “AI Competency Erosion: Understanding Expertise Decay” in The AI Competency Paradox, Springer — Link

“Position: Human-Centric AI Requires a Minimum Viable Level of Human Understanding” (arxiv, January 2026) — the Cognitive Infrastructure Threat concept — Link

FAA recommendation on manual flying skills — referenced in “Does using artificial intelligence assistance accelerate skill decay?” (PMC) — Link

Ojewale et al., “Towards AI Accountability Infrastructure: Gaps and Opportunities in AI Audit Tooling,” CHI 2025 — Link

What 1,000 Executives Know But Can't Fix

Adrian Hornsby — Tue, 31 Mar 2026 13:21:45 GMT

I have no idea how I missed this report when it came out. Cockroach Labs published their State of Resilience 2025 back in late 2024, surveying 1,000 senior technology executives across North America, Europe, and Asia-Pacific, and it somehow slipped past me entirely. Better late than never, because there's a lot here worth digging into, at least if you're willing to ignore Part 6, which makes the same mistake almost every vendor resilience report makes: five sections of genuinely useful organizational data followed by a pivot to "and here's how our product fixes it." It's understandable. They sell distributed databases. But it's also disappointing, because their own data argues against the conclusion, and because it reinforces a persistent myth in this industry that you can buy your way to resilience. More on that in a moment.

The data was collected in August-September 2024, so about 18 months old. If anything, that makes the findings more relevant now. Since this survey was fielded, DORA has gone into full effect, NIS2 enforcement is ramping up across EU member states, AI agents are being woven into operational workflows at a pace that would have seemed aggressive even a year ago, and the organizations that told CrowdStrike-shaken researchers they were "significantly improving their planning" have had a full year and a half to either follow through or quietly drift back to the status quo. The structural dynamics this report captures don't resolve themselves in 18 months, they compound. And the rapid adoption of AI in operations is adding new failure modes to systems that were already struggling with the old ones.

The headline numbers first: the average enterprise experiences 86 outages per year, averaging 196 minutes each. Every single company surveyed lost revenue to outages in the past twelve months. For large enterprises, outage-related losses averaged $495K annually.

But the number that made me pause for a second: 95% of executives say they are aware of at least one unresolved operational weakness that puts their organization at risk. 72% say they have multiple. And 48% say their organizations are doing insufficient work to address it. Nearly every leader knows where the cracks are, and almost half say nothing adequate is being done. The prevention paradox, playing out at scale with hard data behind it.

The blockers are familiar too. Other teams' priorities take precedence (38%). Budget constraints (36%). Lack of leadership buy-in (32%). Meanwhile 92% of teams must deprioritize essential work to fight fires, 48% work overtime and weekends to restore operations, and 39% report a growing backlog of post-mortems. The feedback loop I keep coming back to in client work is right there in the numbers: less time for improvement leads to more incidents, which leads to even less time for improvement.

Then there's the human cost that rarely shows up in resilience reports. 82% of leaders said they or their team members fear losing their jobs following a significant outage. Think about what that does to everything else. If you've ever wondered why your blameless post-mortems don't feel blameless, there's your answer. You can design the most thoughtful incident analysis process imaginable, but 82% job-fear will override any process document every single time.

Now, Part 6 and why it's a missed opportunity. Look at the causes of downtime: network issues (38%), software issues (36%), cyberattacks (36%), cloud provider reliability (35%), third-party failures (33%), environmental factors (31%), human error (31%), capacity issues (30%), hardware failures (30%). That distribution is remarkably flat. Nothing really dominates. And the report itself notes it's consistent regardless of company size, sector, or geography.

To me, that flat distribution is the most important finding in the entire report, and the authors barely pause on it.

Here's why it matters so much. The conventional reading is that these are nine independent failure modes, each requiring its own technical solution. Network problems need better network architecture. Software issues need better testing. Capacity problems need better scaling. And each vendor can point to their slice of the chart and say "we fix that part," and they're often right about their slice.

But the flatness itself is in fact the diagnostic clue that something else is going on. Nine unrelated failure categories don't land within eight percentage points of each other by coincidence. They land that way when they're all symptoms of the same underlying condition. Network issues, software bugs, human error, capacity problems: these aren't nine independent diseases. They're nine ways that the same organizational gaps express themselves in production. Poor feedback loops, misaligned incentives, the distance between how leaders imagine work happens and how it actually happens, insufficient learning from failure: these systemic causes don't prefer one failure category over another. They just make all of them more likely, roughly equally.

Think of it like a doctor seeing a patient with fatigue, headaches, joint pain, and skin problems all at similar severity. You could treat each symptom with a different specialist and a different prescription. But the flat distribution across unrelated systems is itself the signal that something systemic is driving all of it. Treating the headaches won't help the joints, because neither is the actual problem.

This is what makes tool-first approaches to resilience so seductive and so ultimately inadequate. If one cause dominated, say network issues at 70% and everything else in single digits, you'd have a clear technical problem with a clear technical solution. But when the failure profile is this even, the slices are the wrong unit of analysis entirely. No single technical investment will meaningfully shift the overall failure rate, because the technical categories are where the problems surface, not where they originate. The only intervention that touches all nine simultaneously is the organization's ability to detect, respond to, and learn from whatever breaks next, regardless of category. That's an organizational capability, not a technology purchase.

This inability to see symptoms as symptoms, to keep treating the surface categories as root causes and reaching for technical fixes to organizational problems, is exactly why I wrote Why We Still Suck at Resilience. And here, in a vendor's own dataset, is the evidence for the entire thesis of the book.

The report also shows that 100% of organizations already do some form of resilience testing, yet 71% do no failover testing, 62% skip regular backup and restoration exercises, and the average outage still takes over three hours to resolve. The tooling exists. The organizational capability to use it effectively does not.

Cockroach Labs collected data that illuminates exactly this, but unfortunately drew a technical conclusion from organizational evidence. Of course, I understand why. But somebody needs to measure the layer they skipped: the feedback loops, the gap between how leaders think work happens and how it actually happens, the tensions that keep organizations stuck even when they know what's broken. That's the data nobody is collecting systematically, and it's where the actual answers live. I’ll get to that in 2026.

Full report here if you want to read it yourself: https://www.cockroachlabs.com/guides/the-state-of-resilience-2025/

//Adrian

When Architecture Becomes Fluid

Adrian Hornsby — Thu, 26 Mar 2026 08:26:09 GMT

A few days ago, AWS announced that the AWS Serverless Agent Plugin is now in the Anthropic plugins marketplace. Install it in Claude Code, Kiro, or Cursor, and your AI agent can analyze your codebase, recommend services, generate infrastructure as code, estimate costs, run security scans, and deploy. In the same two-week window, AWS shipped two more capabilities: one that lets agents initialize SAM projects, wire up event-driven architectures, enforce least-privilege IAM, and instrument observability from the start; and another that guides developers through building checkpointed, durable Lambda workflows that can run for up to a year.

The pitch was "best practices by default." Security, observability, resilience baked into the AI-guided workflow from day one. The agent doesn’t just write the code, it now architects the application.

I want to take that claim seriously and follow it somewhere uncomfortable.

***

For most of my career, architecture was the thing you fought about in design reviews. Microservices or monolith. Event-driven or request-response. Step Functions or roll your own. Saga pattern or two-phase commit. These were consequential decisions because they were hard to reverse, expensive to get wrong, and shaped how your system would fail for years to come. The architect's value was in knowing which tradeoffs to make given the specific constraints of a team, a product, a moment.

But here's the thing. If an agent can scaffold an event-driven architecture with EventBridge, SQS, and DynamoDB Streams in the time it takes me to open a Miro board, and if a different agent can rearchitect that same system next quarter when the requirements shift, then the architecture starts to matter less as a decision and more as a snapshot. It's whatever the system happens to be shaped like right now. It's not the thing you chose. It's the thing the agent chose on your behalf, and it might choose differently tomorrow.

I went looking for people making this argument; that architecture is becoming an implementation detail rather than a design decision. I found two camps.

The first camp says architecture matters more now because AI collapses feedback loops. When you can scaffold an API, generate tests, and wire monitoring in minutes, bad architectural decisions surface immediately. AI doesn't eliminate the need for architecture, it amplifies the cost of getting it wrong. That's true, but I think it's backward-looking because it assumes humans will continue to be the ones making and evaluating those decisions.

The second camp talks about architects moving from "in the loop" to "on the loop" to eventually "out of the loop," shifting from making decisions to designing the system's ability to design itself. That's closer, but it still frames the architect as the essential human in the picture, just at a higher level of abstraction. It imagines a gentle transition rather than a structural shift.

Neither camp follows the logic to its endpoint yet, at least that I'm aware of.

***

Here's where I think this goes.

If agents can architect, deploy, maintain, and rearchitect systems to deliver a given function, then the architecture becomes a runtime variable. Nobody "chooses" it any more than you choose which TCP packets get retransmitted. The agent optimizes and the system runs. That’s it. The patterns shift underneath you based on load, cost, failure conditions, whatever the agent is optimizing for at that time. Architecture stops being the thing you decided in a design review six months ago and becomes something closer to a continuously evolving state that the agent manages.

At that point, the question "what's the architecture?" becomes roughly as interesting as "what's the current state of the routing table?" Technically answerable, practically irrelevant to most of the humans involved.

And if architecture becomes fluid, something that agents can swap to maintain function, then the whole discipline of making architectural decisions starts to look like something temporary. Not because the decisions were wrong, but because the decisions stop needing to be made by humans at all.

I know this sounds like a story about architects losing their jobs, and it is, partly. But it's also a story about something much more delicate.

***

An agent that maintains function "at all cost and at all architecture" is optimizing for one thing: keeping the system running. And chances are that it will eventually be very good at it. It will rearchitect around failures. It will find workarounds for degraded dependencies. It will swap patterns, add retries, reroute traffic, spin up compensating services. From the outside, the system will look healthy. The dashboards will be green. The SLOs will be met.

But "running" and "healthy" are not the same thing.

The agent is unlikely to notice that the system has drifted so far from anyone's mental model that no human can reason about it anymore. It won't flag that the reason it keeps having to rearchitect is that an upstream dependency changed its data contract six months ago and nobody told anyone. Of course, in theory, agents could track contract changes across teams by pulling the latest API spec, reconcile, adapt. And mostly they will. But "mostly" is where the trouble lives. When agents handle 99% of cross-team coordination flawlessly, the 1% they miss becomes invisible precisely because everything else is compensating for it. The system holds together until it doesn't, and when it doesn't, every successful compensation that masked the gap becomes part of the blast radius. It won't recognize that the system is technically meeting its SLOs while slowly becoming incomprehensible.

This is a version of the prevention paradox running at machine speed.

When human operators kept systems running, there was a natural limit: the operators themselves. They got tired. They complained. They filed tickets. They said "this is getting ridiculous" in postmortems. The friction of human maintenance was a signal. It was an ugly, expensive, inefficient signal, but it told you something about the health of the system that no dashboard could capture. The operator's frustration was information about the gap between how the system was supposed to work and how it actually worked.

Agents don't get frustrated. They don't have the felt sense that something is getting ridiculous. They just keep compensating. And every successful compensation is a small act of hiding the true state of the system from the humans who are nominally responsible for it.

In my last newsletter, I wrote about David Woods' observation that AI doesn't solve your problems, it moves them somewhere you can't see. This is the next step in that sequence. AI didn't just move the problems, it actually moved the architecture. And when the architecture is fluid, managed by agents, shifting underneath you to maintain function, the gap between what the system is doing and what you think it's doing doesn't shrink. Instead it grows. Mostly because the agents are papering over it constantly, and every successful paper-over makes the gap a little harder to see.

***

There's a concept from resilience engineering that I keep returning to: the WAI-WAD gap; Work-As-Imagined versus Work-As-Done. In every organization, there's a difference between how people think the system works and how it actually works. The interesting failures happen in that gap.

In a world where architectures are fluid, managed as a runtime variable by agents, the WAI-WAD gap takes on a new dimension. It's no longer just that humans have an outdated mental model of a stable system. It's that the system itself is changing, continuously, underneath a mental model that was never designed to track continuous change. The architecture you reviewed last quarter might bear no resemblance to what's running today. And nobody noticed because the function never degraded.

This is what makes the "best practices by default" pitch from the AWS announcement both true and misleading. The practices are there, security and observability are instrumented, resilience patterns are in place. At time of deployment, the system is well-architected. But architecture is not a point-in-time property anymore. It's an ongoing relationship between a system, its operators, and its environment. And that relationship degrades when nobody can see the system clearly anymore.

***

I don't think the response to this is to resist agents managing architecture. That ship has already sailed. Developers will use them because they're pretty useful and the productivity gains are there.

But I think the response is to recognize that what matters is shifting. The important question was never really "what architecture should we use?" It was always "is this system healthy?" Those two questions used to be tightly coupled because architecture was stable and you could reason about health by reasoning about structure. If the architecture was sound, the system was probably healthy. If the architecture had known weaknesses, you knew where to look.

When architecture becomes fluid, that coupling breaks. You can no longer infer health from structure because the structure keeps changing. Health becomes something you have to measure directly, continuously, and independently of whatever the agents are doing underneath.

That's a different discipline than architecture. It's closer to what I'd call operational awareness. It’s the ability to see the gap between what the system is doing and what you think it's doing, even when (especially when) the metrics say everything is fine. It requires understanding not just the function but the cost of the function, the drift of the function, the comprehensibility of the function.

Agents that architect applications are a real and meaningful capability. But the thing they're automating was never the hard part. The hard part was always understanding whether the system you built was actually doing what you thought it was, in the way you thought it was, at a cost you could sustain. That question just got harder, not easier.

//Adrian

We Mistake "Hasn't Failed Yet" for "Won't Fail"

Adrian Hornsby — Sat, 07 Mar 2026 07:29:24 GMT

May 17, 2010

AWS had four regions. US East, US West, Europe, and Asia Pacific. That was the whole cloud, more or less. Most companies running serious workloads were still asking whether they could trust it at all.

That morning, Jeff Barr published a blog post. It was short, technical, conversational in the way Jeff always was. He was announcing that RDS now had a “High Availability” option. He called it Multi-AZ. One parameter, set to true, and Amazon would spin up a hot standby in a second availability zone, synchronously replicate every write, and fail over automatically in about three minutes if the primary went down. Your application wouldn't even need to know it happened.

In 2010, high availability for a managed database meant your DBA had a plan and a phone number to call at 2am. It meant runbooks, manual failover scripts, and potentially someone driving to a data center. The notion that a piece of infrastructure could sense its own failure and reconstitute itself, invisibly, while your application kept serving traffic, was new to most of the people reading that post.

Jeff wrote that availability zones had "independent power, cooling, and network connectivity." Fourteen words that would quietly become load-bearing assumptions for an entire industry.

For the next fifteen years, those fourteen words held. Until they didn't.

***

For about sixteen years, I believed in multi-AZ the way you believe in gravity.

Every architecture review I sat in, every Well-Architected assessment, every "is this prod-ready?" Operational Readiness Review pointed to the same thing: spread your workload across availability zones and you've handled the big one. I never really questioned it, neither did the people I worked with. At some point it stopped feeling like a design decision and started feeling almost like a law of nature.

Then a single event took down two Asian regions simultaneously. And the assumption didn't bend. It shattered.

Around the same time, a different assumption cracked. Not in one dramatic moment, but across a series of events that added up to the same thing. Trump sanctioned the International Criminal Court, and Microsoft implemented those sanctions, locking the ICC's chief prosecutor out of his Outlook email. A Microsoft executive then told a French Senate committee, under oath, that the company cannot guarantee European customer data will never be handed to US authorities because the CLOUD Act requires US companies to comply with US government requests regardless of where the data physically sits. European data stored in European data centers, operated by a US company, is still subject to US law.

Two load-bearing assumptions, gone inside a few months.

*******

There's a pattern here that I keep coming back to, and I think it sits at the intersection of psychology, probability, and how organizations actually work.

When an assumption holds long enough, it stops being an assumption. It becomes furniture. You stop seeing it because it was always there. And crucially, every day it holds true feels like evidence that it was always correct. The confidence compounds and the scrutiny fades.

That's a measurement error. We're tracking frequency, not fragility. The assumption isn't getting stronger with each passing day. The exposure is just accumulating quietly, out of sight.

This is what makes the prevention paradox so insidious. The longer nothing goes wrong, the more confident you feel. The more confident you feel, the less you invest in questioning the foundations. And so the fragility grows precisely because things have been going so well.

Scientists have a name for the thing that prevents this trap: falsifiability. A good scientific hypothesis is one that can, in principle, be proven wrong. You hold it provisionally. You actively look for the counterexample. The absence of failure doesn't confirm the hypothesis; it just hasn't been refuted yet.

Organizations are terrible at this because falsifiability is uncomfortable. Treating your foundational assumptions as provisional feels destabilizing. It requires admitting that what you built on might not hold. Most organizational cultures punish that kind of questioning, or at least fail to reward it.

So we get this strange situation: some of the most technically sophisticated organizations in the world run on assumptions they've never seriously stress-tested; Multi-AZ as physics, cloud providers as neutral infrastructure, and government as a stable, predictable background condition.

*******

Taleb's black swan is often misread as a story about rare events but I think it's more a story about accumulated fragility. Everything that was quietly wrong beforehand, the untested assumptions, the confidence that compounded without scrutiny, the floor that was never as solid as it felt. The event just ends the period of not knowing. What follows is disorienting precisely because the ground shifted under something we stopped questioning years ago.

I think this is worth sitting with, because the reflex after a shattered assumption is usually to over-react, replace it with a new one, and move on. We update the architecture, patch the policy, and add multi-region to the checklist. And then, gradually, the new assumption starts to calcify too.

The harder question you should ask yourself now is: what are we still treating as physics that isn't?

I don't have a clean answer. But I think the practice, the actual resilience practice, is to hold your foundational assumptions more loosely. To ask periodically: if this turns out to be wrong, what breaks? To make the questioning normal rather than exceptional. To treat "hasn't failed yet" as exactly what it is: a run of confirming evidence, not proof.

That's uncomfortable and it requires a kind of epistemic humility that organizations tend to select against. But the alternative is waiting for the next black swan to do the work for you.

//Adrian

AI doesn't solve your problems. It moves them somewhere you can't see yet.

Adrian Hornsby — Mon, 02 Mar 2026 08:47:58 GMT

Estimated read time: 9 minutes

There's a seductive story about AI in operations that goes something like this: we have problems, let's deploy AI, it will fix them. Incidents will resolve faster, anomalies will get caught earlier, postmortems will draft themselves in minutes instead of hours, metrics will improve. I've been hearing this story recently and I don't doubt the promise, but improved metrics and solved problems are not the same thing. The problems don't go away when the metrics get better; they go somewhere else, take new forms, and show up in places nobody thought to look. The question is where they go, and what they look like once they get there.

I've been circling this question for a while. About a year ago I wrote about AI meta-operators and system responsibility, trying to work through what happens when AI agents start making operational decisions that used to require human judgment. More recently I wrote about chaos engineering for AI-generated code, arguing that the velocity and opacity of AI-assisted development demands systematic stress-testing because human review alone can't keep pace. In both cases I could feel the problems but I couldn't connect them, couldn't see why the same kinds of trouble kept showing up in different forms. In my book I argue that a common vocabulary is one of the most underrated tools in resilience work, because once you can name something you can discuss it, and once you can discuss it you can start to act on it. What I was missing was exactly that: a vocabulary for what AI does to the mess rather than just the mess itself.

***

David Woods has been mapping exactly that. He has been developing a heuristic he calls the Messy 9, which he recently premiered on the Fine Pod podcast and discussed on in the Resilience in Software Foundation Slack channel. It's designed to bridge the science of how complex systems actually work and the practical need to do something about it. The setup is what Woods calls GCA: patterns over cycles of Growth, Complexification, and Adaptation. When new technological capabilities affect ongoing worlds of practice, processes of growth, complexification, and adaptation play out in lawful patterns, and stories of technology change should capture or envision the new forms of messiness that arise when apparent benefits get hijacked. The core message is that the messiness of the real world is conserved over attempts to improve systems, conserved in the formal sense described by the No Free Lunch and Robust Yet Fragile theorems: you don't eliminate messes, you move them.

Woods organises the recurring forms into nine patterns, grouped in threes: (1) congestion, cascades, and conflict; (2) saturation, lag, and friction; (3) tempos, surprises, and tangles. He describes them as a small set of generic keys you can use to unlock any episode of change to see how messes reappear, with much of the action living in the cross-connects and overlaps between them. Each points to processes that play out over time as systems grow, and each takes on unfamiliar forms when AI enters the picture.

Congestion, in Woods' framing, is what happens when a bunch of things are going on simultaneously and you have to deal with them all in the time available. Cascades are disturbances propagating across lines of interdependency, where one failure dumps load onto adjacent functions and the effects spread. Conflict is the question of who loses, who sacrifices, what gets prioritised when there's overload, what gets sacrificed first and what gets sacrificed later. These three are the most visible forms of messiness, and in traditional distributed systems we've built tooling to handle them: circuit breakers, bulkheads, load shedding, runbooks. But when AI is the operator, these patterns migrate into territory that existing tooling can't see.

Consider what cascading failure looks like when one model's output feeds another model's judgment, which triggers a third model's action. The failure propagates through reasoning rather than network calls and retry logic. A subtly wrong interpretation becomes a confident decision becomes an automated action, and the whole chain looks healthy from the outside because every component is performing exactly as designed. The cascade is there, running through inference rather than infrastructure, and it remains invisible until the consequences arrive in a form nobody anticipated.

Saturation, in Woods' framework, is what happens when a system approaches the boundary where it runs out of capacity to deal with challenges, where different subsystems start dumping more overload onto other places and the saturation spreads. In traditional systems this means hitting a resource limit you can see, measure, and plan for. AI introduces a different kind: decision saturation, where AI handles enough operational decisions that the humans nominally overseeing it lose the ability to meaningfully evaluate what it's doing, simply because the volume and speed of AI-driven decisions exceed what human attention can track. The oversight saturates while the system hums along, and nobody notices because the dashboards still look green.

Lag is what Woods describes as the pattern where organisations cut the resources they need to integrate new capabilities before they have actually integrated them, anticipating productivity gains and reducing experience, expertise, and people before the ostensible benefits have materialised. This is playing out everywhere right now: teams are being restructured around AI productivity assumptions while the actual work of figuring out where AI fits, where it doesn't, and what new failure modes it introduces is still in its earliest stages. The resources being cut are the very resources needed to discover whether the new capability actually works as advertised.

Friction undergoes an equally counterintuitive transformation. Woods frames friction as a necessary feature of bringing capabilities into practice, the offsetting costs and workload that arise when something new meets the complexity of the real world, and warns that if you underplay it, the things you try to deploy turn out not to work as well as you would like. The old friction in operations was obvious: manual processes, handoffs between teams, slow approval chains. AI removes it, which feels like progress, but friction served a function. It slowed things down enough for people to notice when something was off, created natural pause points where someone might say "wait, does this actually make sense?" Removing the friction removes the speed bumps that gave humans a chance to catch problems before they compounded, making the system faster and smoother while also making it more brittle in ways that only surface when speed and smoothness were exactly the wrong thing.

Tempos, the seventh pattern, describe what happens when different rates of change collide. In DevOps this is already familiar: the tempo of development influences operations, operations constrains development, and incident response introduces its own urgency that overrides both. AI adds new tempos that don't match existing ones: the speed at which models make decisions versus the speed at which humans can review them, the rate at which AI-driven changes accumulate versus the rate at which organisations can absorb their implications. These tempo mismatches create their own congestion as decisions pile up faster than the capacity to evaluate them.

Surprises, the eighth pattern, are not about rare events at the tail of a distribution. Woods insists that the dragons of surprise don't get weaker and more infrequent as systems improve; instead, systems generate new categories of surprise as they change. AI is particularly good at producing novel categories because it fails differently from humans. When a human operator makes a mistake, other humans can usually reconstruct the reasoning and see how someone under pressure with incomplete information made a wrong call. When AI makes a mistake the reasoning is opaque, nobody can explain why, which means nobody can confidently say it won't happen again in a different form, which means the organisation can't learn from it in the way it has always learned from human error.

The tangles might be the most troubling of all. Woods describes tangles as circular dependencies and strange loops, the kind of thing he first encountered in nuclear power plants where a critical function depended on an instantiation of itself. When multiple AI systems operate with overlapping domains, one monitoring infrastructure, another triaging incidents, a third managing capacity, they develop implicit dependencies that exist nowhere in any architecture diagram, learning to compensate for each other's behaviour in ways their operators never specified. These tangles are invisible during normal operations and surface during failure, when one system's unexpected behaviour cascades through compensation patterns that nobody knew existed, creating a debugging problem that is qualitatively different from anything human operators have encountered before. Woods gives a vivid example from the current AI gold rush itself: we need critical infrastructure to support AI computations, AI is being deployed to reduce the people who operate that infrastructure, and the AI doing the operating depends on the infrastructure it's supposed to be operating. The circular dependency is already there.

All of this converges on what Woods identifies as the key constraint: extra adaptive capacity is most needed when least affordable. AI adoption is a period of significant system change that demands more adaptive capacity, more ability to recognise and respond to novel situations, precisely because the system is in flux and the new failure modes haven't been mapped yet. AI adoption also consumes adaptive capacity, because organisations use it as a reason to reduce the human expertise that provides it. The need goes up while the supply goes down, and that gap is where the real risk lives, in a place that improved operational metrics will never show you.

None of this means AI is bad for operations; the improvements are real and sometimes substantial. The story that AI is going to solve your problems is almost certainly incomplete, because as Woods puts it, the Messy 9 exists to counter the tendency we all have to see whatever we develop as solving something instead of moving things and shifting processes. AI solves the problems you are measuring while generating new ones you haven't learned to see yet, and the messes migrate to places that your current instrumentation and your current organisational structure are not designed to detect. If you're deploying AI into your operations and your metrics are getting better, the question worth asking is where the mess went, because messiness is conserved over cycles of change. It takes new forms and it operates at new scales, and that puts a higher premium on exactly the experience, skill, and expertise to figure out how the system is working when it's not working the way you thought it was.

//Adrian

Why We Still Suck at Resilience and Why I Wrote a Book About It

Adrian Hornsby — Wed, 18 Feb 2026 16:07:32 GMT

The last few months have been pretty quiet in here, and the reason is the same reason this piece exists. I wrote a book. It is called "Why We Still Suck at Resilience," and you can get it here.

If you have been following me for any length of time, you will recognise the argument even if the packaging is new. Organizations confuse performing resilience with actually being resilient, and the five practices that are supposed to manage that gap (chaos engineering, load testing, GameDays, incident analysis, operational readiness reviews) have drifted so far from their original purpose that they often make the problem worse. The book is an attempt to say clearly what I have been circling around in LinkedIn posts and on my blog, that the gap between how we imagine our systems work and how they actually work is growing, and most of what we are doing about it is theater.

The idea started during my time at AWS, where I helped build the Fault Injection Service and spent a decade working with some of the world's largest organizations on how they practice resilience at scale. I saw teams running chaos experiments that confirmed what they already believed rather than discovering what they did not know, load tests scoped to pass rather than to learn, incident reviews that produced action items nobody followed up on because the organizational incentives pointed elsewhere. The patterns were consistent enough across companies, industries, and team sizes that I became convinced the problem was not individual teams failing to execute. Something structural was going wrong, something about the way organizations relate to learning under pressure, and I wanted to understand what it was.

Writing the book took longer than I expected, partly because I kept discovering that the problem was deeper than I had initially framed it. What started as a practitioner's guide to doing these five practices well became something more like an investigation into why organizations systematically undermine their own capacity to learn from failure. The research pulled me into safety science, into Bainbridge and Rasmussen and Woods, into the gap between Work-as-Imagined and Work-as-Done, and into the uncomfortable recognition that I had been part of the problem myself. I built tooling that optimised for individual practice metrics while ignoring the learning system those practices were supposed to form. The value of resilience work lives in the connections between practices, in how an incident finding becomes a chaos experiment becomes a readiness review question, and nothing we had built supported those connections.

The final chapter turned out to be the one that mattered most, both for the book and for where my thinking is already going. It is about AI, but not in the way you might expect. Every observability vendor now offers AI-powered anomaly detection, every incident platform promises AI-drafted postmortems, and the marketing suggests these tools will finally eliminate the gap between how we imagine our systems work and how they actually work.

Just this week, Norberto Lopes at incident.io posted about their AI-generated postmortem capability, calling it a mind-blowing moment, celebrating that the write-up was accurate, well-contextualised, and took zero time to draft. Around the same time, Ozan Unlu published a piece describing what he calls Observability 3.0, a vision in which AI agents handle incident triage, root cause analysis, and postmortems autonomously, with humans elevated to "creativity and innovation" while machines do the interpretive work. I do not doubt that either of these tools produces excellent output. What worries me is that the productive struggle of writing a postmortem, or correlating signals during an incident, or reasoning through why a system behaved the way it did, is where most of the actual learning occurs. The document was never really the point; the thinking that produced it was. The interpretation was never the bottleneck; it was the training ground. When you skip the thinking and get a better document, or remove humans from the sensemaking and call it empowerment, you have made a trade that is easy to celebrate and difficult to measure the cost of until much later, when something breaks and the organization discovers it no longer has the understanding it assumed it was building all along.

Bainbridge identified this dynamic in 1983: automation designed to remove humans from a system paradoxically makes the human role both more critical and more difficult, because it removes the routine experience that builds expertise. AI is a higher form of the same problem, one that does not just automate tasks but automates the judgment and reasoning that create the deepest understanding.

I believed that argument completely while writing it, and the speed at which I have started questioning it has caught me off guard. Over the past few weeks, there has been a growing conversation among engineers and researchers ( here for example) about a noticeable leap forward in the latest generation of models, not just in raw capability but in the quality of reasoning. For the first time, my experience starts to mirror what others are describing.

I have been using Anthropic's Claude, both the Opus 4.6 model and its extended thinking variant, and what unsettled me was the texture of how it worked through problems. On coding tasks that required sound architectural judgment, the kind where there is no clean answer and the trade-offs depend on context that has to be reasoned about rather than looked up, the model was both producing correct outputs and thinking through the problem in ways that felt uncomfortably close to how a senior engineer thinks through it. It would express uncertainty about its own choices, flag trade-offs I had not raised, change direction mid-reasoning when it encountered something that complicated its initial approach.

Doubt, revision, context-sensitivity; these are what learning looks like. They are, in cognitive science and education research, the established preconditions for understanding: cognitive conflict that exposes mismatches between a mental model and reality, willingness to reorganise prior knowledge rather than just accumulate new information, and attention to context as the thing that determines whether knowledge transfers or stays brittle. The assumption that has underpinned my thinking, that when automation fails it will be humans who catch the fall, rests on the belief that machines cannot do the interpretive work. If that belief is wrong, or even if it is only wrong for long enough that organizations stop maintaining human capability in the meantime, then the question of who fixes what when things break becomes genuinely open in a way it has never been before.

I do not say that to be dramatic, and I do not have a clean resolution. The book makes a case I still think is largely right: that organizations need to protect the productive struggle, the learning that comes from humans wrestling with ambiguous problems, because that struggle is where durable understanding forms. What I am less sure about is whether the time horizon for that argument is decades or years, and the difference between those two answers changes everything about how I think about the work ahead.

Which brings me to what is next. In my consulting work at Resilium Labs, AI in operations has slowly become a recurring topic in nearly every engagement. Clients who originally brought me in to diagnose their chaos engineering, load testing, operational readiness, and incident review process are now asking questions about what AI is doing to their teams' ability to understand their own systems. The conversations are shifting because the problem is shifting, and the methodology I have always used (embedded observation, stakeholder interviews, tracing the gap between how people imagine work happens and how it actually happens) turns out to apply directly to organizations trying to figure out whether their AI adoption is building genuine capability or creating new blind spots they have not learned to see yet. The work is evolving because the organizations I work with are evolving, and the questions they need answered are no longer only about resilience practices in the traditional sense.

There will be more writing here in the weeks ahead, pieces that work through these questions. If anything in the book resonates, or if it does not, I would like to hear about it.

//Adrian

The Prevention Paradox at Civilizational Scale

Adrian Hornsby — Tue, 17 Feb 2026 07:24:01 GMT

“La flamme de la résistance française ne doit pas s’éteindre et ne s’éteindra pas” - Charles de Gaulle, appel du 18 juin 1940

When I was a kid growing up near Grenoble in France, veterans of the Second World War came to our school to talk about the resistance. The region around Grenoble was a hotspot for the French Resistance, and many of the surrounding villages lost generations of men in the war. The veterans were old by then, and they knew they didn't have many visits left.

I remember very clearly what one of them said:

"My biggest worry is that once we are all gone, no one will be there to tell the story, and people will make the same mistakes again."

I thought about him this week while reading Ray Dalio's latest piece, "It's Official: The World Order Has Broken Down". At the Munich Security Conference, multiple world leaders said the same thing: the post-1945 order no longer exists. German Chancellor Merz: "The world order as it has stood for decades no longer exists." Macron: Europe must prepare for war. Rubio: we are in a "new geopolitics era" because the "old world" is gone.

The veterans are gone now. And here we are.

Dalio frames this through his "Big Cycle," a pattern that repeats across centuries. A dominant power wins a conflict, builds institutions and rules, and establishes an order that works. Over time, the stability that was produced by all that investment starts to look like it was always there. Investment in maintaining it declines. The institutions hollow out. Then a crisis arrives and everyone discovers the order was already gone long before the crisis hit.

I kept reading, and I kept seeing the prevention paradox.

Effective prevention creates doubt about its necessity

In my book and this blog post, I describe a pattern that plays out with painful regularity inside technology organizations. A serious incident creates the political will to invest in resilience. A strong coalition builds institutions: incident review processes, chaos engineering programs, operational readiness reviews, feedback loops that institutionalize resilience thinking. It works. Stability becomes normal. And then the stability that was produced by the investment starts looking like the natural state of things rather than something that requires active maintenance.

Then a newly appointed leader asks:

"What exactly does your team do? I see a lot of salary costs, but what are the deliverables?"

The team scrambles to explain their value but doesn't have the data to back up their claims. "We prevented failures" doesn't translate well to budget spreadsheets. The practices get cut. The degradation that follows gets attributed to other causes. The cycle repeats.

The post-1945 world order followed the same arc. The catastrophe of two world wars created the political will to invest in international institutions: the United Nations, NATO, the Bretton Woods system, trade agreements, alliances. A strong coalition (led by the US) built and maintained them. They worked. Decades of relative peace and prosperity followed.

And then the stability that those institutions produced started looking like it was always there.

The better you prevent problems, the fewer problems exist to justify preventing them

This is the core of the prevention paradox. Our brains struggle to value non-events. The availability heuristic means we judge importance by how easily examples come to mind. Dramatic incidents are memorable. Non-events leave no trace. There's no story to tell about the war that didn't happen.

Hindsight bias compounds the problem. Once we know the benign outcome, we reconstruct the past as if failure was never very likely. After decades of relative peace, people conclude "clearly the threat was overstated" rather than recognizing "nothing happened because we took appropriate precautions."

At the organizational level, I've watched this play out with chaos engineering programs, incident review processes, and operational readiness reviews. The more effective they are, the more unnecessary they appear. At the geopolitical level, the same dynamic hollowed out commitment to the institutions that prevented great power conflict for 80 years.

The costs of maintaining the order were always visible: military spending, diplomatic effort, economic concessions, compromises on sovereignty. The benefits were counterfactual: wars that didn't happen, conflicts that got resolved before they escalated, trade disputes that didn't become economic wars. When decisions require weighing visible costs against hypothetical benefits, visible costs carry more weight.

New leaders who never experienced the crisis

In the chapter, I describe how the leaders who question resilience investment often aren't people who forgot what the work does. They're people who never experienced what the work prevents. They joined during the stable period that effective resilience created, and from their vantage point the stability looks like the natural state of things.

The VP who asked "what exactly does your team do?" wasn't ignoring history. He had no history to ignore. He inherited a system that worked and saw a team whose purpose he couldn't connect to any problem he'd experienced.

At Munich, we're watching the geopolitical version. An entire generation of leaders grew up inside the stability the post-war order produced. They inherited institutions that worked and saw costs they couldn't connect to any catastrophe they'd lived through. The people who built the post-war order, the people who remembered the destruction that justified it, are gone. Their memory of why it mattered left with them.

That veteran in my school saw this coming forty years ago.

Dalio points out that the leaders now declaring the old order dead are in fact acknowledging something that's been happening for years, through eroding commitments, hollowed-out alliances, and institutional decay that went mostly unnoticed while the surface structures remained in place.

Stage 6 creeps in

One of Dalio's most useful observations is that the breakdown doesn't start with the shooting war. It starts years earlier with economic wars, technology wars, capital wars, and geopolitical maneuvering. By the time the shooting starts, the underlying order has been gone for a long time.

I describe the same pattern at the organizational level. Practices continue but without the depth. Chaos experiments drift toward safe scenarios. Incident reviews stay at surface-level fixes. GameDays become scripted performances. The practices become theater while appearing unchanged. The degradation is invisible because the surface activities continue, even as the learning capacity underneath erodes.

The international institutions didn't collapse overnight either. The UN still meets. NATO still exists. Trade agreements are still on paper. But the commitment behind them has been eroding for years. The practices became theater while appearing unchanged.

By the time a significant incident occurs, the causation is thoroughly obscured. Months or years have passed between the reduction in investment and the consequences. The delay prevents connecting the cuts to the outcomes. Each generation of leadership learns through fresh pain.

The cycle isn't inevitable

Dalio ends his chapter with something that I think is worth holding onto. He says the cycle doesn't have to end in catastrophe, if countries "stay productive, earn more than they spend, make the system work well for most of their populations, and figure out ways of creating and sustaining win-win relationships."

In my book, I argue the same: the pattern is predictable, but it can be navigated. It requires treating stability as an output that needs continuous input, not a resting state. It requires building institutional memory that survives the people who experienced the original crisis. It requires celebrating prevention rather than only celebrating heroic response. And it requires leadership that remembers why the institutions were built in the first place, even when everything looks fine.

Whether we're talking about an engineering organization or the world order, the prevention paradox operates through the same mechanism. Success erases its own evidence. The world order didn't break down last week in Munich. It broke down gradually, over years, as the commitment to maintaining it eroded while everyone assumed it would just persist on its own.

One of the commenters on Dalio's piece, Mike Wagner, put it nicely: "A lot of what we built assumed stability was just there in the background, and that assumption is gone."

That's the prevention paradox, at every scale.

//Adrian

Why Your Chaos Experiments Give You False Confidence

Adrian Hornsby — Fri, 09 Jan 2026 13:58:37 GMT

You've done everything right.

You ran the hypothesis conversation. Your team discovered they had different mental models of how database failover works. You investigated the gaps. You fixed the connection pool configuration and the health check logic. You added monitoring for connection pool state.

Then you ran the experiment. Database fails over, circuit breaker trips, traffic routes to the replica, recovery completes in 30 seconds. Everything worked exactly as expected.

Three months later, the database fails during peak traffic. The system enters a death spiral. Circuit breakers trip and reset in rapid cycles. Connection pools exhaust. The replica becomes overloaded. Health checks start failing on healthy instances. Retries amplify the load. Recovery takes 23 minutes of manual intervention.

You tested this exact scenario. It worked perfectly. What happened?

You tested at the wrong load.

During your experiment, you ran with minimal traffic. Maybe some synthetic requests, maybe just manual testing. Production was handling 800 requests per second when the database failed.

That difference activated completely different system dynamics.

Why load changes everything

Systems with spare capacity behave as if they were deterministic. The same inputs produce the same outputs. Failures are reproducible. You can reason about what will happen and predict outcomes.

When systems run near their limits, they become non-deterministic. The Universal Scalability Law predicts this: contention and coherency costs cause non-linear performance collapse beyond critical load thresholds. Several mechanisms combine to create this shift.

Bimodal components. A cache either hits or misses. A circuit breaker is either closed or open. A connection pool either has available connections or is exhausted. A health check either passes or fails. Each is a binary state change that delivers radically different behavior

At low load, mode transitions don't matter much. A cache miss? The database has spare capacity. A circuit breaker opens? Other instances absorb the traffic. You have operational margin to absorb these transitions.

At high load, the same transitions cascade. A cache miss means the database is already near its limit, so the additional query causes queueing, which slows all queries, which triggers timeouts and retries. A circuit breaker opening means remaining instances are already near capacity, so redirected traffic pushes them over, causing their circuits to open.

Queueing effects. At low load, a cache miss adds one query to a queue of 10, and the impact is marginal. At high load, a cache miss adds one query to a queue of 10,000. The queue is already near capacity, and that additional query pushes it deeper into the non-linear region where waiting time explodes. This affects all subsequent requests, not just the one that missed.

Concurrency races. At low load, if two instances' circuit breakers both approach their trip thresholds, they trip at different times because request patterns vary enough. The first trips, load redistributes, the system absorbs it.

At high load, many instances approach thresholds simultaneously. One trips, load redistributes, and the redistribution pushes others over. They all trip within seconds of each other, and you get synchronized state transitions across your entire fleet rather than gradual failover.

Resource cascades. At low load, one slow request holds a thread a bit longer, but other threads are available and nothing cascades. At high load, one slow request holds a thread when no other threads are available. Incoming requests queue, the backup causes timeouts, timeouts trigger retries, and retries push thread pools deeper into exhaustion. Circuit breakers watching these timeouts approach their thresholds and trip. Three systems have now changed state because one request was slow at the wrong moment.

Timing variations. Which request becomes slow depends on factors you can't control. A database query lands on a disk performing compaction. A network packet gets delayed by congestion. A GC pause happens at an unfortunate moment. Under low load, these variations don't matter. Under high load, whichever request is slow can trigger a cascade.

Run the same chaos experiment five times at low load and you get consistent results. Run it five times at high load and you get five different outcomes. You're not seeing randomness. You're seeing emergent behavior from multiple interacting mechanisms whose states you can't precisely control.

Metastability: when resilience mechanisms prevent recovery

This pattern has a name: metastability. Researchers studying large-scale system failures identified this pattern and gave it a precise definition: failure states sustained by the system's own behavior, not by any ongoing external problem.

The trigger is gone. The database is back. The network is healthy. Every component passes health checks. But the system serves almost no useful work because the resilience mechanisms you built now create more load than the system can handle.

The cache can't rewarm because database load is too high. The circuit breaker can't close because retry load prevents downstream recovery. The connection pool can't replenish because failures happen faster than connections establish. The health checks can't pass because load redistribution keeps instances above the timeout threshold.

Your system occupies one of three states:

This diagram shows the relationship between load and goodput, with the three distinct system states.

Inspired by: https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf

The green curve shows the stable state. Goodput scales with load. The system has spare capacity, so it can absorb triggers and self-recover. This is where your chaos experiments typically run.

The blue line shows the vulnerable state. The system operates efficiently near its limits. Goodput stays high, but there's little margin. A trigger (the fire icon) can push the system off this line.

The red dashed line shows metastability. Goodput collapses to near zero even though load remains high. The system has fallen off the curve entirely. Positive feedback loops prevent recovery.

The dotted arrows show what happens: a trigger pushes the system from vulnerable down into metastable failure, and it takes intervention (and reducing load significantly) to get back up to stable.

The state diagram in the corner shows the transitions. Stable can drift into vulnerable. Vulnerable can fall into metastable failure when a trigger activates sustaining effects. Metastable requires intervention to escape. You can't just remove the trigger and wait for recovery.

Two practices that should be one

Most organizations run load testing and chaos engineering as separate activities. The load testing team validates capacity. The chaos engineering team validates failure degradation. The two rarely meet.

Load testing without failure injection shows you how the system performs when nothing goes wrong. Chaos engineering without realistic load shows you how the system handles failures when it has spare capacity to absorb them. Neither tells you what happens when failures occur under production load.

That combination is exactly what production delivers. And that vulnerable region is exactly where you need to run your experiments to find the triggers that will collapse your system into metastability.

Finding those triggers before production does is what chaos engineering and load testing are for. But you can only find them by testing in the vulnerable region where production operates.

What this means for your experiments

The hypothesis conversation reveals gaps in your team's mental models. Investigation and fixing address the knowable problems. But there's a third category we didn't fully explore: emergent behavior that only appears under load.

When you design experiments for uncertainty gaps, the ones where behavior emerges from component interactions, you need to test under realistic load. Not "some" load. Production-equivalent load.

Your experiment design from the last newsletter asked: "We want to know how our application's retry logic interacts with database failover under realistic traffic load."

Realistic traffic load is the key phrase. If you test at 50 requests per second and production runs at 800, you're testing a different system. The stable regime behaves deterministically. The vulnerable regime behaves non-deterministically because queueing effects amplify, concurrency races synchronize, resource cascades chain, and timing variations that didn't matter suddenly determine outcomes.

Same code. Same architecture. Same failure injection. Completely different outcomes.

Try this

Take one of the uncertainty gaps you identified in your hypothesis conversation. Design the experiment as we discussed, with specific predictions about timing, behavior, and recovery.

Then ask yourself what load level does production actually experience during peak hours? Can you run your experiment at that load level?

If you can't, you're testing in the stable region. Your results will give you false confidence about triggers that only activate in the vulnerable region where production operates.

You want to find those fire icons in the diagram before production finds them for you. That requires chaos engineering and load testing together.

Until then,

Adrian

What to do after the hypothesis conversation

Adrian Hornsby — Sun, 14 Dec 2025 14:26:27 GMT

Last time, I walked you through the hypothesis conversation, how to discover that your team has completely different mental models of how your system behaves, all before running any chaos experiment.

Several of you replied with stories. One team discovered that nobody knew whether their cache actually had an eviction policy. Another found out their "automatic" failover required some manual steps. A third learned that their monitoring would completely miss the failure mode they were worried about.

Good. That's exactly what should happen.

Now comes the harder question: what do you actually do with all these gaps you just discovered?

Most teams make one of two mistakes here. Either they panic and try to fix everything immediately, or they shrug and run the experiment anyway "just to see what happens."

Both are wrong. Let me show you a better path.

Three types of gaps

When you run a hypothesis conversation, you uncover three distinct types of gaps. Each requires a different response.

Type 1: Knowledge gaps

These are gaps in understanding where the answer exists somewhere. You just need to find it.

"Does the circuit breaker have a timeout?" "What's our connection pool size?" Someone knows. It's in the code or configuration. You just need to look it up.

What to do: Investigate before experimenting.

Don't run a chaos experiment to answer questions you can answer by reading code, checking configuration, or talking to the team that built the thing. That's using a sledgehammer when you need a magnifying glass.

Spend an hour investigating:

Read the relevant code
Check configuration files
Look at past incidents where this failed
Talk to the team that owns the component
Review any existing documentation

Update your mental models based on what you find. Then reconvene and share what you learned.

Half the time, this investigation reveals more gaps. "Oh, I thought Steve configured that, but he left six months ago and nobody knows if it's still set up that way."

Perfect. Now you know what you don't know.

Type 2: Uncertainty gaps

These are gaps where nobody knows the answer because the behavior emerges from interactions between components.

You can understand every component separately and still have no idea what happens when they all run together under specific conditions.

You checked the code. The connection pool has reconnection logic. The database has failover logic. The load balancer has health check logic. The application has retry logic. Each piece makes sense. The implementations look correct.

But when database failover happens during peak traffic while the cache is cold, what actually happens? Do the retry storms from 50 application instances create a thundering herd that prevents the database from recovering? Does the load balancer pull instances out of rotation before they stop retrying, or after? Do the health checks start passing before the connection pool is actually ready, sending traffic to instances that will just error?

Nobody knows for certain and it is hard to reason about. You can't know from reading the code because the answer depends on timing, load, and how these different components interact under load.

That's emergence. The behavior comes from the interaction, not from the individual components.

What to do: Design experiments specifically to explore these interactions.

This is what chaos engineering is actually good for. Testing things where behavior emerges from complexity, where understanding the parts doesn't tell you how the whole system behaves.

Before you design the experiment, get specific about what interaction you're uncertain about:

"We're uncertain how our retry logic interacts with database failover during high traffic. Each component looks correct in isolation, but we've never validated whether they work together without creating cascading problems. We need to know because retry storms could turn a 30-second database blip into a 90-minute outage."

Now you can design an experiment:

Inject database unavailability with realistic load
Watch how retry behavior scales across all instances
Monitor whether health checks and retries synchronize badly
Track whether the database can actually accept connections when it comes back
Observe the recovery timeline end-to-end

The experiment has a clear learning goal. You're testing how components interact under specific conditions, not whether individual components work.

You'll probably need to run this experiment under multiple conditions. Behavior that emerges from interaction often depends on load, timing, system state. What happens at 2 AM with low traffic might be completely different from what happens during peak hours.

Type 3: Design gaps

These are gaps where you discover something is missing or wrong in your system design.

"Wait, we don't have a circuit breaker at all? I thought we did."

"Our health check doesn't actually validate database connectivity, it just returns 200 OK?"

"There's no monitoring for connection pool state?"

These are real problems you just discovered. These aren't knowledge gaps. These aren't uncertainties.

What to do: Fix them before experimenting.

Don't run a chaos experiment to confirm that something you know is missing or broken is actually missing or broken. That's not learning, that's theater.

If you discover your health check is shallow when it should be deep, fix the health check. If you find out you're missing critical monitoring, add it. If the circuit breaker doesn't exist, decide whether you need one.

Some teams resist this. "But we want to see how bad it is!"

No. You don't learn from deliberately breaking things you know are broken. You just create risk and waste limited resources. And your system will anyway behave differently once to address the missing or broken components.

Fix the known problems first. Then experiment to find the unknown problems.

The investigation phase

Let's say you ran your hypothesis conversation and discovered 15 different gaps. Don't immediately schedule 15 chaos experiments.

Instead, spend time investigating. Create a gaps document. List everything you discovered. Sort the list into these three buckets. Knowledge gaps we need to investigate. Uncertainties we need to experiment on. And design problems we need to fix.

Assign investigation work. Split the knowledge gaps among team members. Give people a week to investigate their assigned areas. This is building shared understanding before you create any risk.

Gather again. Have each person share what they learned. You'll find that investigating knowledge gaps often reveals more uncertainties or design problems. That's good. You're getting more precise about what you actually don't know.

Update your gaps document based on the investigation.

Prioritizing what to test.

You've investigated the knowledge gaps. You've fixed the obvious design problems. Now you're left with genuine uncertainties about how components interact.

You probably can't test all of them immediately and because resources are limited. So prioritize.

Priority 1: High-impact, high-uncertainty

These are failure modes that would cause significant customer impact if they happened, and the behavior emerges from complex interactions you can't predict.

High stakes. Real uncertainty from emergence. Test this first.

Priority 2: High-impact, low-uncertainty

These are scary scenarios where you think you know how components interact, but the stakes are high enough that validation is worth it.

You probably won't learn too much. But confirming that critical interactions work as expected has confidence value.

Priority 3: Low-impact, high-uncertainty

These are things you're uncertain about but wouldn't cause major problems if the interactions failed.

"We're not sure how cache warming and background jobs interact during deployment, but worst case some requests are slower."

Interesting to know. Not urgent to test.

Priority 4: Low-impact, low-uncertainty

Don't test these at all. You're confident about how components interact and the impact is minimal.

Save your time and energy for experiments that actually teach you something important.

Designing your first experiment

Let's say you've prioritized and you're ready to design your first experiment. You've picked: "Test how retry logic and database failover interact during realistic load."

Here's how to design it:

Start with your learning goal

Be explicit: "We want to know how our application's retry logic interacts with database failover under realistic traffic load. Specifically, whether retries from multiple instances create problems for database recovery, and how long the system actually takes to return to normal."

Document the hypothesis clearly (before you start)

Start with your system properties:

"Each application instance maintains a connection pool of 20 connections. Connection timeout is set to 5 seconds. Under normal load, instances handle 50 requests/second. When a connection attempt fails, the pool marks that slot as failed and retries. Applications implement retry logic with 100ms base delay and 2x exponential backoff for a maximum of 4 retries (100ms, 200ms, 400ms, 800ms)."

Now you can make predictions:

"When the database becomes unavailable, each instance will experience connection failures at 5-second intervals. At 50 requests/second with 20 available connections, the pool will exhaust in approximately 0.4 seconds (20 connections / 50 requests/second). Applications will begin returning errors immediately. The retry sequence will complete in 1.5 seconds per request (100 + 200 + 400 + 800). After 4 failed retries, requests will return 500 errors to clients."
"When the database becomes available again, connection pools will attempt reconnection on the next incoming request. With staggered health checks running every 10 seconds across 12 instances, we expect the fleet to detect database availability within 10 seconds. Each instance will establish 20 new connections, taking approximately 100ms per connection (2 seconds total per instance). We expect 90% of traffic to succeed within 15 seconds of database recovery.”

This gives you specific measurements. You know what to instrument (pool exhaustion rate, retry timing, connection establishment time, error rates). You know what success looks like (90% traffic succeeding within 15 seconds). You can validate each number independently.

Specific. Measurable. Testable.

Plan what you'll observe

List all the specific things you want to watch. Be specific. If you can't observe these things, stop. Add the missing observability first. Don't run experiments you can't interpret.

Start small and safe

Do your first run in the development environment, limited blast radius, low traffic, manual injection.

Don't overcomplicate things. Dont’ start with production. Don't start with peak traffic. Don't start with automated injection. Build confidence progressively.

Run it and observe

Actually run the experiment. Watch what happens. Take notes. Don't try to fix things during the experiment. Just observe and learn.

Compare hypothesis to reality. What matched your expectations? What surprised you?

Write a detailed summary of the experiment. Ideally using the same process as your real incidents.

"The connection pool exhausted in 0.6 seconds, roughly matching the 0.4-second prediction. Applications returned 500 errors immediately. The retry sequence timing matched our implementation exactly: 1.5 seconds per request before final failure.
When the database came back online, the first instance detected it in 4 seconds through health checks. It started establishing connections. At 18 connections established, the instance crashed. Out of memory error.
We had overlooked connection cleanup. When the database went down, failed connections stayed in the pool marked as dead but not garbage collected. When the instance tried to establish 20 new connections, it actually had 20 dead connections plus 20 new ones. Memory usage spiked. The JVM killed the process.
The crash triggered our orchestration system to start a replacement instance. That instance came up, detected the healthy database, tried to establish connections, and crashed for the same reason. This happened to 8 instances before we caught it.
The remaining 4 instances stayed alive because they had been restarted recently for unrelated reasons. Their connection pools were clean. They successfully connected and started handling all traffic. Four instances serving the entire load meant each was now processing 150 requests/second instead of 50. Response times jumped to 800ms. Error rates climbed to 12% due to request timeouts.
We had to disable automatic instance replacement and manually restart each instance with connection pool cleanup logic added. Recovery took 23 minutes. The hypothesis predicted connection establishment time correctly. We never considered connection lifecycle management or the interaction between pool state and instance stability."

That's learning. The individual components worked correctly but the interaction between them, at scale, under realistic conditions, created new behavior hard to predict.

Now you know and have choices.

Option 1: Add explicit pool cleanup to your health check logic.

When the health check detects database unavailability, call pool.close() to force cleanup of all connections, then reinitialize the pool. This guarantees a clean state before attempting reconnection. The downside is a brief period where the instance can't serve requests during pool recreation, maybe 200-300ms. You need to ensure your load balancer can handle instances briefly going unhealthy.

Option 2: Configure connection max lifetime in the pool settings.

Set maxLifetime to something like 30 minutes. The pool will automatically evict and replace connections that exceed this age, regardless of their state. This prevents accumulation of dead connections over time. The tradeoff is ongoing connection churn during normal operation, which adds latency (probably 5-10ms per replaced connection). You also need to tune the lifetime value. Too short and you create unnecessary overhead. Too long and you don't solve the problem.

Option 3: Implement connection validation on checkout.

Configure the pool to test connections before handing them to the application (testOnBorrow or equivalent). Dead connections get removed when the application requests them. This distributes cleanup across request processing rather than concentrating it at recovery time. The cost is added latency on every request, typically 1-5ms depending on your validation query. During the outage, you still accumulate dead connections, but they get cleaned up gradually as traffic arrives rather than all at once during reconnection.

You'll probably want to test this again after selecting the option for the fix. Emergent behavior depends on conditions, so you validate solutions under the conditions that matter.

What's next

Last time I showed you how hypothesis conversations reveal what your team actually believes about your system. This time you learned what to do with those gaps: investigate the knowable, fix the broken, experiment on the emergent.

Most teams skip straight to breaking things. You now know better. The learning happens in the conversation, the investigation, and the careful observation of how components interact under real conditions. The chaos experiment is just one tool in that process.

Try this with your team. Start with the hypothesis conversation from last newsletter. Work through the investigation phase. Pick one uncertainty about emergent behavior and design an experiment around it.

Then tell me what you discovered.

Until then,

Adrian

Your best chaos engineering happens before you break anything

Adrian Hornsby — Sun, 30 Nov 2025 13:03:54 GMT

Dr. David Woods, one of the pioneers of resilience engineering, said something a few of us have been repeating and it captures the essence of why chaos engineering is valuable:

"Just planning to inject a fault usually reveals that the system works differently than you thought."

Most teams skip straight to running chaos experiments. They pick a tool, design a failure scenario, inject the fault, observe what happens. They think the learning happens during the experiment.

They're wrong. The deepest learning often happens before you break anything at all.

Here's what I mean.

The meeting you've probably been in

You're discussing what happens when X fails. Someone says "obviously the circuit breaker kicks in." Everyone nods. The conversation moves on. Meeting ends.

Three months later, X actually fails in production. The circuit breaker doesn't work the way anyone thought. Or it doesn't exist at all. Or it exists but was disabled during a migration last quarter and nobody remembered to re-enable it.

The problem is that everyone had different mental models of how the system works. Nobody discovered this until real failure exposed it.

This happens constantly. Teams operate under the illusion of shared understanding. Everyone uses the same words. Everyone nods at the same moments.

Then something breaks, and you discover nobody actually agreed about how anything worked.

The hypothesis conversation

Here's what to do instead. Before running any chaos experiment, gather your team and have what I call the hypothesis conversation.

Step 1: Pick a specific failure scenario

Don't start with "what if the database goes down?" It’s too vague.

Instead start with something like: "What happens when our primary PostgreSQL instance becomes unavailable during peak traffic?"

The specificity matters. Vague scenarios produce vague answers. Specific scenarios force people to articulate their actual mental models.

Step 2: Silent writing first

This is the critical step most teams skip.

Before any discussion, have everyone write down:

What they expect to happen
How long they expect recovery to take
What customer impact they expect
What monitoring and alerts they expect to see

Give people 5 minutes of silence to write.

Why silent writing? Because it prevents groupthink. In open discussion, the loudest person always speaks first. Suddenly everyone's nodding along. The quieter team members, who might have spotted something nobody else considered, stay silent. You've just lost the diversity of perspective that makes this valuable.

Silent writing forces everyone to commit to their understanding before social dynamics kick in.

Step 3: Share and compare

Go around the room. Have each person read what they wrote.

No judgment. No critique. Just listen.

This is where it gets interesting. You'll discover:

Someone assumes automatic failover that doesn't exist. Another person thinks there's queuing that isn't implemented. The new engineer admits they have no idea what happens. That’s valuable honesty the rest of the team needs to hear. The senior engineer describes behavior from the old architecture, before the migration two quarters ago. Nobody agrees on expected recovery time.

Your team has been working together for months or years. You thought you understood how the system works. You're discovering right now that you don't share the same understanding at all.

Step 4: Investigate the gaps

Don't just note the disagreements and move on. Dig into them:

"Why did you think queuing was implemented?"

"When was the last time we actually tested failover?"

"Where is that behavior documented?"

"Who would know the answer to this?"

Often you'll discover that nobody knows for certain. The system evolved. Documentation didn't keep up. People made assumptions based on how they thought things worked. Those assumptions never got validated.

Step 5: Decide what to do

Now you have options:

Run the chaos experiment to find out what actually happens. Check the code or configuration to verify behavior. Talk to the team that owns that component. Update documentation based on what you learned. Fix the gap you just discovered.

You might decide the experiment is still valuable. But you've already learned something critical: your team doesn't share the same mental model of how your system behaves. That's worth knowing regardless of what you do next.

Real example

I watched this play out at a company planning their first database failover experiment few years ago.

The team gathered to discuss expectations. Everyone seemed aligned. "Database fails over to replica, maybe 30 seconds of elevated errors, everything recovers." Heads nodded. The consensus felt solid.

Then they did silent writing.

When everyone shared, the room got quiet.

It turns out the DBA expected 10-15 seconds of failover time based on the configuration settings. Application engineers expected 30-60 seconds based on what they'd seen in past incidents. The SRE thought the connection pool would need manual restart because that's how it worked in the old system. A junior engineer thought reads would continue but writes would fail. That was actually a reasonable assumption given the architecture. The architect assumed the application would automatically reconnect after failover. Nobody was sure if the health check would detect the failover or keep routing traffic to the failed instance.

Six people. Six different mental models of the same system behavior.

They spent the next 30 minutes investigating:

Checked the database configuration. Failover timeout was actually 30 seconds, but nobody had tested it recently. They looked at application code. Connection pool settings were wrong, they'd never refresh connections after a failover. Reviewed health check logic. It would completely miss the failover and keep sending traffic to the dead instance. Found monitoring gaps. No visibility into connection pool state, so they wouldn't see the problem during the experiment.

They fixed three critical issues before running any experiment:

Updated connection pool configuration to handle failover
Fixed the health check logic
Added monitoring for connection pool state

When they finally ran the experiment two weeks later, they knew what to expect and what to watch for. The failover worked. More importantly, they'd built shared understanding of how their system actually behaves.

Return on investment

The hypothesis conversation took 45 minutes. In that time, they discovered:

Critical gaps in monitoring
Incorrect connection pool configuration that would have caused extended downtime
Missing health check logic that would have routed traffic to a dead instance
Misaligned team understanding of basic system behavior

All before creating any risk, touching any systems, or needing any special tools.

Compare this to what usually happens: run the experiment, something unexpected occurs, scramble to understand what went wrong, argue about whether the system is working correctly or the experiment is flawed, end up more confused than when you started.

The hypothesis conversation is chaos engineering!

You're systematically exploring how your system behaves under failure. Sometimes that exploration happens through conversation. Sometimes through code review. Sometimes through documentation analysis. Sometimes through actual experiments.

But the learning doesn't wait for the experiment. It starts the moment you begin asking the right questions.

Start tomorrow

Here's your homework for this week:

Pick your scenario

Choose something specific that your team worries about. "What happens when [specific service] becomes unavailable during [specific condition]?"

Don't pick the scariest scenario. Pick something bounded and concrete. You're learning the practice, not stress-testing your team.

Get the right people in the room

You need diversity of perspective. Developers who wrote the code. Operators who run it in production. New team members who haven't absorbed all the assumptions yet. The architect who designed it. The SRE who monitors it. Include product managers too.

Six to eight people is ideal. Enough for diverse perspectives, small enough for real conversation.

Set the context

Before you start, frame it clearly:

"We're not testing anyone's knowledge. We're discovering where our mental models differ. There are no wrong answers. The goal is to find gaps in our shared understanding before they surprise us in production."

This framing matters. People need to feel safe admitting uncertainty.

Do the silent writing

Give people 5 minutes. Remind them to be specific. What exactly do you expect to happen? Not "it'll probably be fine" but "the circuit breaker will trip after three failed requests, fallback logic will return cached data, customers will see stale information for 30-60 seconds."

The specificity reveals the mental model.

Share without judgment

Go around the room. Have each person read what they wrote.

Don't critique. Don't correct. Don't debate yet. Just listen and note where understanding differs.

Resist the urge to immediately resolve disagreements. First, just hear all the perspectives.

Investigate together

Pick the biggest gaps. The places where mental models differ most.

Dig into them as a group. Check the code. Look at configuration. Review documentation. Talk to other teams.

Find out what actually happens. Update your understanding together.

You'll learn more in 30 minutes than most chaos experiments reveal.

What's next

Try this with your team this week. Then hit reply and tell me:

What gaps did you discover? Were you surprised by anything? Did you fix something before running any experiment? What happened when you tried to investigate the gaps?

I read every reply. Your experiences help me understand what actually works in practice, not just in theory.

Next newsletter: "What to do after the hypothesis conversation". How to design experiments that actually test what you're uncertain about, not just what you already know.

Until then,

Adrian

When AI Writes Your Code, Chaos Engineering Writes Your Insurance Policy

Adrian Hornsby — Thu, 02 Oct 2025 08:55:33 GMT

AI-generated code has moved from curiosity to almost standard practice with breathtaking speed. Teams use AI to build entire features and even applications. Code generation tools translate requirements directly into pull requests. What took a developer days now takes minutes and a carefully crafted prompt. When done right, the AI productivity gains are real. Teams shipping AI-assisted code go faster.

The question facing engineering organizations today isn't whether to adopt these tools—developers already have—but how to understand and operate systems built with them safely.

The Acceleration of an Old Problem

Let’s be honest, code opacity didn't start with AI. Every engineering organization battles the same demons: developers leave, context evaporates, documentation rots, deadline pressure produces shortcuts. Six months after launch, someone gets paged at 3 AM to fix a system they didn't write and barely understand. This story is older than version control.

AI-generated code didn’t create this problem but it accelerates it to speeds we've never seen before.

A developer writing code makes choices—explicit or implicit—about tradeoffs, edge cases, and failure modes. They might not always document these choices well, but they existed as conscious thoughts, at least momentarily. You can pull them into a meeting room and ask: "Why did you implement it this way?" They might not remember perfectly, but there's a conversation to have, a human memory to question.

AI-generated code compresses this process into statistical inference. The model chose an implementation based on patterns in its training data, optimizing for likelihood rather than your specific domain constraints. Three months later, when that code fails under unexpected load, there's no developer to ask. The model that generated it no longer exists in the same form—weights have shifted, training has evolved, the context window that held your requirements has long since evaporated. You're maintaining systems written by "intelligences" that can't yet fully be recalled for questioning.

The velocity compounds everything. When one developer writes problematic code, that's one problem to debug. When AI helps fifty developers write code twice as fast, you've potentially scaled both the productivity and the technical debt by orders of magnitude. The same old problems happens at unprecedented speed and volume.

Humans Are Still in the Loop

Let's be clear about what actually happens: AI doesn't drop code directly into production. A developer prompts the AI, reviews what it generates, modifies the output, integrates it with existing systems, AI writes tests, and shepherds it through code review. Humans remain deeply involved.

The challenge emerges from volume and velocity. When you're reviewing thirty AI-generated pull requests weekly instead of ten human-written ones, can you maintain the same focus and scrutiny? When the code looks correct—follows conventions, passes linters, handles obvious failure cases—how often do you catch the subtle problems that only emerge under production conditions the AI never considered?

Human review remains essential, but it faces scaling limits. We need systematic methods for stress-testing AI-generated code that complement human judgment. This is where chaos engineering can help.

Why Chaos Engineering Fits This Moment

Traditional code review asks: "Does this code do what it claims?" Chaos engineering asks: "What does this code do when everything goes wrong?" That second question becomes critical when you're shipping code whose internal logic you inherited rather than invented.

Run a chaos experiment that injects latency into downstream services. Watch which components start failing in ways nobody predicted. Maybe the AI-generated service client retries aggressively without backoff, turning a minor slowdown into a cascading failure. Code review might have caught this if the reviewer specifically looked for retry logic and thought to question its implementation. Chaos engineering catches it by creating the conditions where the problem reveals itself.

The difference becomes clearer over time. You discover patterns: this model's code tends to assume infinite memory; that model's error handling releases resources inconsistently. These insights feed directly into how you prompt AI going forward, what you hunt for in code review, and where you concentrate your testing.

How Chaos Engineering Must Evolve

Traditional chaos engineering assumed code written by humans. Humans you could ask questions to. That assumption breaks down when substantial portions of your codebase emerge from statistical models.

Automated Knowledge Capture

Every chaos experiment that reveals unexpected behavior should generate structured documentation automatically. When an experiment discovers that an AI-generated service degrades catastrophically under certain conditions, the experiment should produce:

- Structured description of the failure mode(s)

- Metrics thresholds indicating the problem emerging

- Potential mitigation steps based on observed and learned behavior

- Links to specific code exhibiting the issue

Engineers review and refine these auto-generated artifacts rather than writing them from scratch. This automation matters because AI code generation produces more components faster than humans can document manually. The documentation gap will become a problem without systematic capture.

Feedback Loops into Code Generation

Here's where things get interesting. Your chaos experiments discover that AI-generated clients consistently lack good retry logic. This gap appears across multiple services. You document it, but more importantly, you can now update your prompts to account for this shortfall: "Implement retry logic with exponential backoff and jitter pattern with these specific thresholds […]"

AI-generated code improves because chaos engineering taught you what to demand. You can feed experiments findings directly into prompt libraries—an identified gap becomes a constraint in every subsequent request.

With such feedback loops, you can continuously update your prompts based on chaos experiments findings, shortening the critical learning cycle.

Pattern Detection at Scale

AI code generation creates an unusual opportunity: many components share similar origins. They emerged from similar models, responding to similar prompts, applying similar patterns from training data. Studies about code generation errors often identify recurring patterns or clusters of failures specific to certain models.

Chaos engineering tools can exploit this and systematically search for patterns common to specific AI models.

When you find these patterns once, hunt for them everywhere across all AI-generated components simultaneously, discovering not just that Service A has a problem, but that twelve services share variants of the same flaw because they were all generated using similar models.

Continuous Experimentation

AI enables shipping dozens of features weekly, each containing substantial generated code. Chaos experiments need to keep pace. However, embedding chaos experiments directly into CI/CD pipelines creates significant problems—the non-deterministic nature of chaos experiments conflicts with the deterministic requirements of deployment pipelines, experiments require production-like load and extended runtime. The solution is a separate, dedicated chaos pipeline running parallel to your CI/CD pipeline, allowing experiments to operate on their own schedule without blocking deployments while still feeding findings back into development practices.

Note: For teams seeking deterministic validation in their CI/CD pipelines, deterministic simulators offer a middle ground. Tools like Antithesis model distributed system behavior deterministically, allowing exploration of failure scenarios with reproducible results. While they require significant investment to build and maintain, and can't capture all real-world complexities, they provide faster feedback than full chaos experiments while being more comprehensive than mocked tests. They work well in pipelines but should complement, not replace, chaos experiments against production-like environments.

The AI Testing AI Question

Could we use AI to design chaos experiments for AI-generated code? The idea is indeed seductive: provide an AI with code it generated previously plus observability data, then ask it to design experiments that would catch emerging issues.

This approach shows promise in my own experiments. With the right prompt, AI-designed chaos experiments sometimes target edge cases humans overlook, precisely because the AI recognizes patterns in generated code that humans don't consciously notice.

But we should still remain skeptical and in the loop. Both AIs operate from similar statistical foundations. They share similar blind spots—the AI experiment designer often misses the same edge cases the code generator misses, for exactly the same reasons. An AI trained primarily on historical data still struggles to imagine creative, realistic failure scenarios that don't appear commonly in training data.

Like for other AI generated artifacts, the key seems to use AI to generate candidate experiments, but require human engineers to review, refine, and prioritize them. AI handles the breadth—proposing dozens of scenarios quickly. Humans handle the depth—identifying which scenarios matter most for your specific systems and constraints.

We're learning by doing, and the lessons keep changing.

Final Thoughts

You just simply can't afford to opt out of the race while competitors embrace AI-assisted development. The productivity gains is just too substantial. Teams using these tools effectively ship features faster. Like with most innovations, organizations that hesitate eventually lose ground against competitors who've figured out how to capture the upside while managing the downside.

The choice facing engineering leaders isn't "AI-generated code: yes or no?" Market forces already made that decision. The real choice is: "How do we operate systems built with AI assistance safely and sustainably?"

Chaos engineering offers a practical answer. By systematically exploring how systems fail, you build the operational understanding that rapid AI-assisted development threatens to erode. You discover hidden dependencies before they cause outages. You document failure modes before customers encounter them. You create feedback loops that improve both your code generation practices and your operational capabilities.

The machines write the code. Chaos engineering helps us understand what we've built and guides what we build next.

Controls vs Guardrails: Why Organizations Struggle with Resilience Despite Having All the Right Pieces

Adrian Hornsby — Thu, 14 Aug 2025 15:12:48 GMT

After a decade of helping organizations improve their resilience practices, I keep seeing the same puzzle. Companies invest heavily in operational readiness reviews, well-architected reviews, incident reviews, chaos engineering, alerting, monitoring, etc. They have leadership buy-in, dedicated teams, and sophisticated tooling. Yet despite having all the right pieces, many still struggle to build genuine resilience.

The more I observe this pattern across industries, the more convinced I become that we're dealing with something fundamental about human psychology, and how our natural responses to uncertainty systematically undermine the very resilience we're trying to build.

The Evolutionary Trap

The human brain evolved to treat uncertainty as potential danger. When we don't know what's coming next, our amygdala activates, triggering stress and the urgent need to regain control. This served our ancestors well when uncertainty meant immediate physical threats, but it creates problems in complex organizational systems.

We're pattern-seeking creatures who prefer clear cause-and-effect relationships. When faced with ambiguity, we instinctively impose structure with rules, procedures, and approval gates that provide the illusion of predictability, even when they can't actually remove the underlying uncertainty.

Think about what happens after an outage. Teams naturally default to adding more approval gates, extending checklists, requiring additional sign-offs. Each control feels rational in isolation and provides psychological relief by making us feel we're "doing something" about the problem.

But together, these controls create systems so constrained that engineers can't respond effectively when something truly unexpected happens. The very mechanisms designed to prevent problems end up preventing the adaptive response that could have avoided a bigger failure.

Research reveals exactly why this backfires. Organizations that handle crises well are those that can flexibly navigate different responses during real emergencies, rather than simply following rigid procedures. Yet our natural response to past failures is to create more rigid procedures.

When outcomes are uncertain, our decision-making shifts from calculation to heuristics and mental shortcuts. We fall back on availability bias (overweighting recent incidents) and confirmation bias (seeking information that supports our existing beliefs about what went wrong). This leads to controls that address the specific failure we just experienced while missing the broader patterns that create system brittleness.

The desire for predictability is so strong that we often choose the feeling of control over the reality of safety. This explains why organizations continue tracking metrics like MTTR even when teams understand it's mathematically meaningless, or why they maintain cumbersome approval processes that become rubber stamps but create friction during emergencies.

Controls vs Guardrails

The key to understanding and solving this problem is recognizing that most organizations blur a crucial line between two fundamentally different approaches to safety: controls and guardrails.

Controls dictate how work gets done. They're prescriptive, active during normal operations, and create friction for everyone, regardless of whether there's any actual danger. Like tollbooths on a highway, they slow down every single person, every single time, even when there's no safety issue.

I've seen many organizations create elaborate chaos engineering processes with good intentions. They want to prevent teams from causing unintended damage. But these weeks-long coordination requirements create cognitive overload that makes teams avoid learning activities entirely. That's a control masquerading as a safety practice.

The most telling sign that controls have gone too far is when engineers stop raising concerns because "the process doesn't allow for that" or "nobody would listen anyway." That's adaptive capacity disappearing in real time.

Guardrails, on the other hand, define safe operating boundaries while preserving flexibility within those bounds. Like highway guardrails, they activate only when you're approaching real danger, not during normal operations. They make the safe path also the easy path.

Think of it like ziplining in a forest. The controls approach says "Ziplining is dangerous, so we'll require permits, 6 weeks of training, and supervised access only on Tuesdays." Result? Nobody ziplines, or people sneak in and zipline without any safety equipment because the official process is too cumbersome.

The guardrails approach says "Ziplining is dangerous, so here's a harness, safety line, and helmet." The safety equipment enables the risky activity rather than preventing it. People zipline frequently and safely because the gear only constrains them when there's real danger such as weight limits exceeded, equipment failure, or bad weather.

A guardrail approach to chaos engineering might provide lightweight frameworks with ready-made integrations, but allow teams to adapt scope, timing, and focus based on what they're trying to learn about their specific systems. The safety comes from built-in blast radius limits, automatic rollback procedures, and environment isolation, not from bureaucratic overhead.

This distinction shows up everywhere once you know to look for it. I've seen incident reports describing production access processes as "cumbersome," creating friction during the exact moments when adaptive capacity matters most. The irony is that these access controls often become rubber stamps. When so many people need production access for legitimate work, approval processes default to "yes" without real scrutiny.

Meanwhile, guardrails like automated safety checks, environment-specific tooling, and default read-only permissions would actually prevent dangerous actions without slowing down normal troubleshooting.

The pattern is often exacerbated during incident reviews. Organizations naturally gravitate toward quick fixes after outages. The urgency to "do something" drives teams to immediately jump to finding root causes and generating action items. But as John Allspaw observes in his excellent talk Incident Analysis: How *Learning* is Different Than *Fixing*, when "your goal is to fix, you're gonna fix something whether or not it was the right thing to fix," and "once teams find a plausible fix, time and production pressure cause them to stop exploring other options."

Incident Analysis: How *Learning* is Different Than *Fixing* - John Allspaw

Here's the trap I see many organization fall into: when incident reviews become focused on quick fixing rather than deep understanding, they systematically generate more controls. Each incident generates new approval processes, additional checkpoints, and longer procedures. Every. Single. Time. The very mechanism meant to build resilience ends up eroding it through control accumulation.

A guardrails approach to incident learning resists the temptation for quick fixes. Instead, it focuses on understanding "what was difficult for people to understand during the incident, what was surprising for people about the incident.” Answering these difficult questions helps design better guardrails.

The difference is critical: quick incident reviews create controls and bureaucracy, while deep understanding-focused reviews help create good guardrails and in turn build adaptability.

But wait, what happens when engineers take shortcuts that bring down critical systems?

This question becomes crucial when we consider why people work around safety measures. In my experience, there are two distinct patterns:

Systematic workarounds where multiple people consistently bypass the same controls reveal design problems. When everyone ignores a safety measure because it conflicts with getting work done effectively, that's feedback that the control isn't designed for the reality of the work environment.

Individual violations where someone consciously ignores a safety protocol despite understanding the risks represent accountability issues requiring different responses such as training, supervision, or removal from roles.

The key difference is in the data pattern. One person removing a safety device represents an individual problem. Everyone consistently working around the same procedure indicates a system design problem requiring guardrail redesign, not more enforcement. When operational staff or engineers consistently bypass safety measures, it's usually because those measures force them to choose between being safe and being effective. That's a design problem, not a people problem.

For the small percentage who truly are bad actors, attempting to prevent malicious behavior through process controls is futile. If someone really wants to cause trouble, they'll find a way around controls. This is where foundational security principles become essential: comprehensive auditing that records every action, immutable infrastructure that can't be tampered with, and "detect, isolate, replace" strategies. These work as guardrails. They don't prevent every possible action through approvals, but they make malicious changes visible and automatically containable.

You can't control your way out of determined bad actors, but you can architect systems that make their actions obvious and limited in scope.

Breaking the Cycle

The human tendency toward adding controls against uncertainty is so deeply wired that even organizations with excellent resilience instinct and intentions fall into this trap. After incidents, the cultural pressure to "do something" combined with our psychological need for control creates an almost irresistible urge to add approval gates, extend procedures, and create more detailed documentation.

The path forward requires consciously auditing our practices through a controls vs guardrails lens. Where are we creating friction during normal operations when we should be creating safety boundaries? Where are we demanding compliance when we should be enabling adaptation?

The goal isn't eliminating all structure, it's not! Instead it’s ensuring our structure enables the adaptive capacity that makes systems genuinely resilient rather than just compliant.

Here's the important bit: resilience comes from systems that can learn and adapt, not from preventing all possible changes. When we build tollbooths instead of guardrails, we optimize for the feeling of control rather than the reality of safety.

Smart guardrails enable adaptation by making the safe path also the effective path. Rigid controls kill adaptation by forcing people to choose between following procedures and solving problems.

To really improve resilience, organizations need to understand this distinction and design safety mechanisms that activate when needed but don't interfere with normal operations. They need to measure outcomes that matter rather than compliance metrics that feel good. They need to create psychological safety that enables people to surface problems early rather than hide them to avoid bureaucratic friction.

Most importantly, they need to recognize that our instinctive response to uncertainty—adding more controls—is often the enemy of the adaptive capacity that creates real resilience.

You don't make dangerous activities safe by preventing access. You make them safe by giving people the right safety equipment and boundaries.

Addendum: Intent and Context Matter

A recent discussion with Fred Hebert made me realize I had left out something important: the controls vs. guardrails framework isn't just about technical implementation, it's also about human psychology, organizational context, and the often-unconscious biases that shape how we respond to uncertainty.

Whether something functions as a control or guardrail often depends on intent and context, not just implementation. Consider peer code review:

As a Control: Management mandate for compliance, or engineers implementing it because they don't trust each other
As a Guardrail: Engineers wanting shared awareness and collective understanding

The same technical mechanism serves different purposes and creates different psychological effects based on who decided and why.

This reveals something important. Many controls start as well-intentioned guardrails but drift toward control thinking over time, especially when designed by people detached from the actual work. I frequently see well-intentioned engineers create elaborate controls for things they won't use directly, such as runbooks, reviews, checklists, and templates. This "work-as-imagined" vs. "work-as-done" gap leads to controls that feel logical in theory but create friction in practice.

This tendency is amplified by our engineering training itself. As Fred noted in our conversation, "our whole engineering discipline is often founded on breaking things down analytically, creating the right abstractions to constrain variability, and moving on!" We're taught to reduce complexity by controlling it, which works well for technical systems but can backfire when applied to human systems.

The challenge is that controls themselves become part of the environment and increase potential for unexpected interactions. When we try to limit environmental complexity through rigid procedures, we often create new complexities in how people work around those procedures.