<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Resilience Bites]]></title><description><![CDATA[Resilience Bites, the newsletter by Resilium Labs - Essays and newsletter issues on why engineering organizations keep having the same incidents — and what the feedback loops, organizational patterns, and tensions actually look like in practice.]]></description><link>https://newsletter.resiliumlabs.com</link><image><url>https://substackcdn.com/image/fetch/$s_!9N0S!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa465f403-b7af-4feb-a7ca-ea53007ad3fc_872x872.png</url><title>Resilience Bites</title><link>https://newsletter.resiliumlabs.com</link></image><generator>Substack</generator><lastBuildDate>Wed, 08 Apr 2026 09:06:06 GMT</lastBuildDate><atom:link href="https://newsletter.resiliumlabs.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Adrian Hornsby]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[adhorn@resiliumlabs.com]]></webMaster><itunes:owner><itunes:email><![CDATA[adhorn@resiliumlabs.com]]></itunes:email><itunes:name><![CDATA[Adrian Hornsby]]></itunes:name></itunes:owner><itunes:author><![CDATA[Adrian Hornsby]]></itunes:author><googleplay:owner><![CDATA[adhorn@resiliumlabs.com]]></googleplay:owner><googleplay:email><![CDATA[adhorn@resiliumlabs.com]]></googleplay:email><googleplay:author><![CDATA[Adrian Hornsby]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[What 1,000 Executives Know But Can't Fix]]></title><description><![CDATA[I have no idea how I missed this report when it came 
out.]]></description><link>https://newsletter.resiliumlabs.com/p/what-1000-executives-know-but-cant-fix</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/what-1000-executives-know-but-cant-fix</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Tue, 31 Mar 2026 13:21:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bLnP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca22b599-3e4e-4700-abf3-a4a57cbbf293_500x333.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bLnP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca22b599-3e4e-4700-abf3-a4a57cbbf293_500x333.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bLnP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca22b599-3e4e-4700-abf3-a4a57cbbf293_500x333.heic 424w, https://substackcdn.com/image/fetch/$s_!bLnP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca22b599-3e4e-4700-abf3-a4a57cbbf293_500x333.heic 848w, https://substackcdn.com/image/fetch/$s_!bLnP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca22b599-3e4e-4700-abf3-a4a57cbbf293_500x333.heic 1272w, https://substackcdn.com/image/fetch/$s_!bLnP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca22b599-3e4e-4700-abf3-a4a57cbbf293_500x333.heic 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!bLnP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca22b599-3e4e-4700-abf3-a4a57cbbf293_500x333.heic" width="500" height="333" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca22b599-3e4e-4700-abf3-a4a57cbbf293_500x333.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:333,&quot;width&quot;:500,&quot;resizeWidth&quot;:500,&quot;bytes&quot;:19991,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.resiliumlabs.com/i/193363204?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca22b599-3e4e-4700-abf3-a4a57cbbf293_500x333.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bLnP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca22b599-3e4e-4700-abf3-a4a57cbbf293_500x333.heic 424w, https://substackcdn.com/image/fetch/$s_!bLnP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca22b599-3e4e-4700-abf3-a4a57cbbf293_500x333.heic 848w, https://substackcdn.com/image/fetch/$s_!bLnP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca22b599-3e4e-4700-abf3-a4a57cbbf293_500x333.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!bLnP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca22b599-3e4e-4700-abf3-a4a57cbbf293_500x333.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>I have no idea how I missed this report when it came out. 
<a href="https://www.cockroachlabs.com">Cockroach Labs</a> published their <a href="https://www.cockroachlabs.com/guides/the-state-of-resilience-2025/">State of Resilience 2025</a> back in late 2024, surveying 1,000 senior technology executives across North America, Europe, and Asia-Pacific, and it somehow slipped past me entirely. Better late than never, because there's a lot here worth digging into, at least if you're willing to ignore Part 6, which makes the same mistake almost every vendor resilience report makes: five sections of genuinely useful organizational data followed by a pivot to "and here's how our product fixes it." It's understandable. They sell distributed databases. But it's also disappointing, because their own data argues against the conclusion, and because it reinforces a persistent myth in this industry that you can buy your way to resilience. More on that in a moment.</p><p>The data was collected in August-September 2024, so about 18 months old. If anything, that makes the findings more relevant now. Since this survey was fielded, DORA has gone into full effect, NIS2 enforcement is ramping up across EU member states, AI agents are being woven into operational workflows at a pace that would have seemed aggressive even a year ago, and the organizations that told CrowdStrike-shaken researchers they were "significantly improving their planning" have had a full year and a half to either follow through or quietly drift back to the status quo. The structural dynamics this report captures don't resolve themselves in 18 months, they compound. And the rapid adoption of AI in operations is adding new failure modes to systems that were already struggling with the old ones.</p><p>The headline numbers first: the average enterprise experiences 86 outages per year, averaging 196 minutes each. Every single company surveyed lost revenue to outages in the past twelve months. 
For large enterprises, outage-related losses averaged $495K annually.</p><p>But the number that made me pause for a second: 95% of executives say they are aware of at least one unresolved operational weakness that puts their organization at risk. 72% say they have multiple. And 48% say their organizations are doing insufficient work to address it. Nearly every leader knows where the cracks are, and almost half say nothing adequate is being done. The prevention paradox, playing out at scale with hard data behind it.</p><p>The blockers are familiar too. Other teams' priorities take precedence (38%). Budget constraints (36%). Lack of leadership buy-in (32%). Meanwhile 92% of teams must deprioritize essential work to fight fires, 48% work overtime and weekends to restore operations, and 39% report a growing backlog of post-mortems. The feedback loop I keep coming back to in client work is right there in the numbers: less time for improvement leads to more incidents, which leads to even less time for improvement.</p><p>Then there's the human cost that rarely shows up in resilience reports. 82% of leaders said they or their team members fear losing their jobs following a significant outage. Think about what that does to everything else. If you've ever wondered why your blameless post-mortems don't feel blameless, there's your answer. You can design the most thoughtful incident analysis process imaginable, but 82% job-fear will override any process document every single time.</p><p>Now, Part 6 and why it's a missed opportunity. Look at the causes of downtime: network issues (38%), software issues (36%), cyberattacks (36%), cloud provider reliability (35%), third-party failures (33%), environmental factors (31%), human error (31%), capacity issues (30%), hardware failures (30%). That distribution is remarkably flat. Nothing really dominates. 
And the report itself notes it's consistent regardless of company size, sector, or geography.</p><p>To me, that flat distribution is the most important finding in the entire report, and the authors barely pause on it.</p><p>Here's why it matters so much. The conventional reading is that these are nine independent failure modes, each requiring its own technical solution. Network problems need better network architecture. Software issues need better testing. Capacity problems need better scaling. And each vendor can point to their slice of the chart and say "we fix that part," and they're often right about their slice.</p><p>But the flatness itself is in fact the diagnostic clue that something else is going on. Nine unrelated failure categories don't land within eight percentage points of each other by coincidence. They land that way when they're all symptoms of the same underlying condition. Network issues, software bugs, human error, capacity problems: these aren't nine independent diseases. They're nine ways that the same organizational gaps express themselves in production. Poor feedback loops, misaligned incentives, the distance between how leaders imagine work happens and how it actually happens, insufficient learning from failure: these systemic causes don't prefer one failure category over another. They just make all of them more likely, roughly equally.</p><p>Think of it like a doctor seeing a patient with fatigue, headaches, joint pain, and skin problems all at similar severity. You could treat each symptom with a different specialist and a different prescription. But the flat distribution across unrelated systems is itself the signal that something systemic is driving all of it. Treating the headaches won't help the joints, because neither is the actual problem.</p><p>This is what makes tool-first approaches to resilience so seductive and so ultimately inadequate. 
If one cause dominated, say network issues at 70% and everything else in single digits, you'd have a clear technical problem with a clear technical solution. But when the failure profile is this even, the slices are the wrong unit of analysis entirely. No single technical investment will meaningfully shift the overall failure rate, because the technical categories are where the problems surface, not where they originate. The only intervention that touches all nine simultaneously is the organization's ability to detect, respond to, and learn from whatever breaks next, regardless of category. That's an organizational capability, not a technology purchase.</p><p>This inability to see symptoms as symptoms, to keep treating the surface categories as root causes and reaching for technical fixes to organizational problems, is exactly why I wrote <a href="https://www.resiliumlabs.com/book">Why We Still Suck at Resilience</a>. And here, in a vendor's own dataset, is the evidence for the entire thesis of the book.</p><p>The report also shows that 100% of organizations already do some form of resilience testing, yet 71% do no failover testing, 62% skip regular backup and restoration exercises, and the average outage still takes over three hours to resolve. The tooling exists. The organizational capability to use it effectively does not.</p><p>Cockroach Labs collected data that illuminates exactly this, but unfortunately drew a technical conclusion from organizational evidence. Of course, I understand why. But somebody needs to measure the layer they skipped: the feedback loops, the gap between how leaders think work happens and how it actually happens, the tensions that keep organizations stuck even when they know what's broken. That's the data nobody is collecting systematically, and it's where the actual answers live. 
I&#8217;ll get to that in 2026.</p><p>Full report here if you want to read it yourself: <a href="https://www.cockroachlabs.com/guides/the-state-of-resilience-2025/">https://www.cockroachlabs.com/guides/the-state-of-resilience-2025/</a></p><p>//Adrian</p>]]></content:encoded></item><item><title><![CDATA[When Architecture Becomes Fluid]]></title><description><![CDATA[A few days ago, AWS announced that the AWS Serverless Agent Plugin is now in the Anthropic plugins marketplace.]]></description><link>https://newsletter.resiliumlabs.com/p/when-architecture-becomes-fluid</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/when-architecture-becomes-fluid</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Thu, 26 Mar 2026 08:26:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MXKS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d1cdbd3-99bb-4335-af9d-a2399802948c_500x334.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MXKS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d1cdbd3-99bb-4335-af9d-a2399802948c_500x334.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MXKS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d1cdbd3-99bb-4335-af9d-a2399802948c_500x334.heic 424w, https://substackcdn.com/image/fetch/$s_!MXKS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d1cdbd3-99bb-4335-af9d-a2399802948c_500x334.heic 848w, 
https://substackcdn.com/image/fetch/$s_!MXKS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d1cdbd3-99bb-4335-af9d-a2399802948c_500x334.heic 1272w, https://substackcdn.com/image/fetch/$s_!MXKS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d1cdbd3-99bb-4335-af9d-a2399802948c_500x334.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MXKS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d1cdbd3-99bb-4335-af9d-a2399802948c_500x334.heic" width="500" height="334" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d1cdbd3-99bb-4335-af9d-a2399802948c_500x334.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:334,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:11930,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.resiliumlabs.com/i/193363205?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d1cdbd3-99bb-4335-af9d-a2399802948c_500x334.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MXKS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d1cdbd3-99bb-4335-af9d-a2399802948c_500x334.heic 424w, 
https://substackcdn.com/image/fetch/$s_!MXKS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d1cdbd3-99bb-4335-af9d-a2399802948c_500x334.heic 848w, https://substackcdn.com/image/fetch/$s_!MXKS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d1cdbd3-99bb-4335-af9d-a2399802948c_500x334.heic 1272w, https://substackcdn.com/image/fetch/$s_!MXKS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d1cdbd3-99bb-4335-af9d-a2399802948c_500x334.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>A few days ago, AWS announced that the <a href="https://github.com/awslabs/agent-plugins?tab=readme-ov-file#aws-serverless">AWS Serverless Agent Plugin</a> is now in the Anthropic plugins marketplace. Install it in Claude Code, Kiro, or Cursor, and your AI agent can analyze your codebase, recommend services, generate infrastructure as code, estimate costs, run security scans, and deploy. In the same two-week window, AWS shipped two more capabilities: one that lets agents initialize SAM projects, wire up event-driven architectures, enforce least-privilege IAM, and instrument observability from the start; and another that guides developers through building checkpointed, durable Lambda workflows that can run for up to a year.</p><p>The pitch was "best practices by default." Security, observability, resilience baked into the AI-guided workflow from day one. The agent doesn&#8217;t just write the code, it now architects the application.</p><p>I want to take that claim seriously and follow it somewhere uncomfortable.</p><p><strong>***</strong></p><p>For most of my career, architecture was the thing you fought about in design reviews. Microservices or monolith. Event-driven or request-response. Step Functions or roll your own. Saga pattern or two-phase commit. These were consequential decisions because they were hard to reverse, expensive to get wrong, and shaped how your system would fail for years to come. The architect's value was in knowing which tradeoffs to make given the specific constraints of a team, a product, a moment.</p><p>But here's the thing. 
If an agent can scaffold an event-driven architecture with EventBridge, SQS, and DynamoDB Streams in the time it takes me to open a Miro board, and if a different agent can rearchitect that same system next quarter when the requirements shift, then the architecture starts to matter less as a decision and more as a snapshot. It's whatever the system happens to be shaped like right now. It's not the thing you chose. It's the thing the agent chose on your behalf, and it might choose differently tomorrow.</p><p>I went looking for people making this argument: that architecture is becoming an implementation detail rather than a design decision. I found two camps.</p><p>The first camp says architecture matters <em><strong>more</strong></em> now because AI collapses feedback loops. When you can scaffold an API, generate tests, and wire monitoring in minutes, bad architectural decisions surface immediately. AI doesn't eliminate the need for architecture, it amplifies the cost of getting it wrong. That's true, but I think it's backward-looking because it assumes humans will continue to be the ones making and evaluating those decisions.</p><p>The second camp talks about architects moving from "in the loop" to "on the loop" to eventually "out of the loop," shifting from making decisions to designing the system's ability to design itself. That's closer, but it still frames the architect as the essential human in the picture, just at a higher level of abstraction. It imagines a gentle transition rather than a structural shift.</p><p>Neither camp follows the logic to its endpoint yet, at least not that I'm aware of.</p><p><strong>***</strong></p><p>Here's where I think this goes.</p><p>If agents can architect, deploy, maintain, and rearchitect systems to deliver a given function, then the architecture becomes a runtime variable. Nobody "chooses" it any more than you choose which TCP packets get retransmitted. The agent optimizes and the system runs. That&#8217;s it. 
The patterns shift underneath you based on load, cost, failure conditions, whatever the agent is optimizing for at that time. Architecture stops being the thing you decided in a design review six months ago and becomes something closer to a continuously evolving state that the agent manages.</p><p>At that point, the question "what's the architecture?" becomes roughly as interesting as "what's the current state of the routing table?" Technically answerable, practically irrelevant to most of the humans involved.</p><p>And if <strong>architecture becomes fluid,</strong> something that agents can swap to maintain function, then the whole discipline of making architectural decisions starts to look like something temporary. Not because the decisions were wrong, but because the decisions stop needing to be made by humans at all.</p><p>I know this sounds like a story about architects losing their jobs, and it is, partly. But it's also a story about something much more delicate.</p><p><strong>***</strong></p><p>An agent that maintains function "at all cost and at all architecture" is optimizing for one thing: keeping the system running. And chances are that it will eventually be very good at it. It will rearchitect around failures. It will find workarounds for degraded dependencies. It will swap patterns, add retries, reroute traffic, spin up compensating services. From the outside, the system will look healthy. The dashboards will be green. The SLOs will be met.</p><p>But "running" and "healthy" are not the same thing.</p><p>The agent is unlikely to notice that the system has drifted so far from anyone's mental model that no human can reason about it anymore. It won't flag that the reason it keeps having to rearchitect is that an upstream dependency changed its data contract six months ago and nobody told anyone. Of course, in theory, agents could track contract changes across teams by pulling the latest API spec, reconciling, and adapting. And mostly they will. 
But "mostly" is where the trouble lives. When agents handle 99% of cross-team coordination flawlessly, the 1% they miss becomes invisible precisely because everything else is compensating for it. The system holds together until it doesn't, and when it doesn't, every successful compensation that masked the gap becomes part of the blast radius. The agent won't recognize that the system is technically meeting its SLOs while slowly becoming incomprehensible.</p><p>This is a version of the <a href="https://www.resiliumlabs.com/blog/the-prevention-paradox">prevention paradox</a> running at machine speed.</p><p>When human operators kept systems running, there was a natural limit: the operators themselves. They got tired. They complained. They filed tickets. They said "this is getting ridiculous" in postmortems. The friction of human maintenance was a signal. It was an ugly, expensive, inefficient signal, but it told you something about the health of the system that no dashboard could capture. The operator's frustration was information about the gap between how the system was supposed to work and how it actually worked.</p><p>Agents don't get frustrated. They don't have the felt sense that something is getting ridiculous. They just keep compensating. And every successful compensation is a small act of hiding the true state of the system from the humans who are nominally responsible for it.</p><p>In <a href="https://www.resiliumlabs.com/resilience-bites-issues/ai-doesnt-solve-your-problems-it-moves-them">my last newsletter</a>, I wrote about David Woods' observation that AI doesn't solve your problems, it moves them somewhere you can't see. This is the next step in that sequence. AI didn't just move the problems, it actually moved the architecture. And when the architecture is fluid, managed by agents, shifting underneath you to maintain function, the gap between what the system is doing and what you think it's doing doesn't shrink. Instead it grows. 
Mostly because the agents are papering over it constantly, and every successful paper-over makes the gap a little harder to see.</p><p><strong>***</strong></p><p>There's a concept from resilience engineering that I keep returning to: the WAI-WAD gap, Work-As-Imagined versus Work-As-Done. In every organization, there's a difference between how people think the system works and how it actually works. The interesting failures happen in that gap.</p><p>In a world where architectures are fluid, managed as a runtime variable by agents, the WAI-WAD gap takes on a new dimension. It's no longer just that humans have an outdated mental model of a stable system. It's that the system itself is changing, continuously, underneath a mental model that was never designed to track continuous change. The architecture you reviewed last quarter might bear no resemblance to what's running today. And nobody noticed because the function never degraded.</p><p>This is what makes the "best practices by default" pitch from the AWS announcement both true and misleading. The practices are there, security and observability are instrumented, resilience patterns are in place. At time of deployment, the system is well-architected. But architecture is not a point-in-time property anymore. It's an ongoing relationship between a system, its operators, and its environment. And that relationship degrades when nobody can see the system clearly anymore.</p><p><strong>***</strong></p><p>I don't think the response to this is to resist agents managing architecture. That ship has already sailed. Developers will use them because they're pretty useful and the productivity gains are there.</p><p>But I think the response is to recognize that what matters is shifting. The important question was never really "what architecture should we use?" It was always "is this system healthy?" 
Those two questions used to be tightly coupled because architecture was stable and you could reason about health by reasoning about structure. If the architecture was sound, the system was probably healthy. If the architecture had known weaknesses, you knew where to look.</p><p>When architecture becomes fluid, that coupling breaks. You can no longer infer health from structure because the structure keeps changing. Health becomes something you have to measure directly, continuously, and independently of whatever the agents are doing underneath.</p><p>That's a different discipline than architecture. It's closer to what I'd call <a href="https://www.resiliumlabs.com/resilience-bites-issues/ai-doesnt-solve-your-problems-it-moves-them">operational awareness</a>. It&#8217;s the ability to see the gap between what the system is doing and what you think it's doing, even when (especially when) the metrics say everything is fine. It requires understanding not just the function but the cost of the function, the drift of the function, the comprehensibility of the function.</p><p>Agents that architect applications are a real and meaningful capability. But the thing they're automating was never the hard part. The hard part was always understanding whether the system you built was actually doing what you thought it was, in the way you thought it was, at a cost you could sustain. 
That question just got harder, not easier.</p><p>//Adrian</p>]]></content:encoded></item><item><title><![CDATA[We Mistake "Hasn't Failed Yet" for "Won't Fail"]]></title><description><![CDATA[May 17, 2010]]></description><link>https://newsletter.resiliumlabs.com/p/cloud-resilience-assumptions</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/cloud-resilience-assumptions</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Sat, 07 Mar 2026 07:29:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xiSd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c3a9f43-0147-4c70-ba1b-e1862e13e3c4_500x331.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xiSd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c3a9f43-0147-4c70-ba1b-e1862e13e3c4_500x331.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xiSd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c3a9f43-0147-4c70-ba1b-e1862e13e3c4_500x331.heic 424w, https://substackcdn.com/image/fetch/$s_!xiSd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c3a9f43-0147-4c70-ba1b-e1862e13e3c4_500x331.heic 848w, https://substackcdn.com/image/fetch/$s_!xiSd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c3a9f43-0147-4c70-ba1b-e1862e13e3c4_500x331.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!xiSd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c3a9f43-0147-4c70-ba1b-e1862e13e3c4_500x331.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xiSd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c3a9f43-0147-4c70-ba1b-e1862e13e3c4_500x331.heic" width="500" height="331" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c3a9f43-0147-4c70-ba1b-e1862e13e3c4_500x331.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:331,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:38527,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.resiliumlabs.com/i/193363207?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c3a9f43-0147-4c70-ba1b-e1862e13e3c4_500x331.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xiSd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c3a9f43-0147-4c70-ba1b-e1862e13e3c4_500x331.heic 424w, https://substackcdn.com/image/fetch/$s_!xiSd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c3a9f43-0147-4c70-ba1b-e1862e13e3c4_500x331.heic 848w, 
https://substackcdn.com/image/fetch/$s_!xiSd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c3a9f43-0147-4c70-ba1b-e1862e13e3c4_500x331.heic 1272w, https://substackcdn.com/image/fetch/$s_!xiSd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c3a9f43-0147-4c70-ba1b-e1862e13e3c4_500x331.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4><em><strong>May 17, 2010</strong></em></h4><p><em>AWS had four regions. US East, US West, Europe, and Asia Pacific. 
That was the whole cloud, more or less. Most companies running serious workloads were still asking whether they could trust it at all.</em></p><p><em>That morning, Jeff Barr published a <a href="https://aws.amazon.com/blogs/aws/amazon-rds-multi-az-deployment/">blog post</a>. It was short, technical, conversational in the way Jeff always was. He was announcing that RDS now had a &#8220;High Availability&#8221; option. He called it Multi-AZ. One parameter, set to true, and Amazon would spin up a hot standby in a second availability zone, synchronously replicate every write, and fail over automatically in about three minutes if the primary went down. Your application wouldn't even need to know it happened.</em></p><p><em>In 2010, high availability for a managed database meant your DBA had a plan and a phone number to call at 2am. It meant runbooks, manual failover scripts, and potentially someone driving to a data center. The notion that a piece of infrastructure could sense its own failure and reconstitute itself, invisibly, while your application kept serving traffic, was new to most of the people reading that post.</em></p><p><em>Jeff wrote that availability zones had "independent power, cooling, and network connectivity." Fourteen words that would quietly become load-bearing assumptions for an entire industry.</em></p><p><em>For the next fifteen years, those fourteen words held. Until they didn't.</em></p><p><strong>***</strong></p><p>For about sixteen years, I believed in multi-AZ the way you believe in gravity.</p><p>Every architecture review I sat in, every Well-Architected assessment, every "is this prod-ready?" Operational Readiness Review pointed to the same thing: spread your workload across availability zones and you've handled the big one. I never really questioned it, and neither did the people I worked with.
At some point it stopped feeling like a design decision and started feeling almost like a law of nature.</p><p>Then a single event took down two Asian regions simultaneously. And the assumption didn't bend. It shattered.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!luWH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F673a7aff-03fb-4855-803d-b0a33c311469_1000x413.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!luWH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F673a7aff-03fb-4855-803d-b0a33c311469_1000x413.png 424w, https://substackcdn.com/image/fetch/$s_!luWH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F673a7aff-03fb-4855-803d-b0a33c311469_1000x413.png 848w, https://substackcdn.com/image/fetch/$s_!luWH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F673a7aff-03fb-4855-803d-b0a33c311469_1000x413.png 1272w, https://substackcdn.com/image/fetch/$s_!luWH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F673a7aff-03fb-4855-803d-b0a33c311469_1000x413.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!luWH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F673a7aff-03fb-4855-803d-b0a33c311469_1000x413.png" width="1114" height="460" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/673a7aff-03fb-4855-803d-b0a33c311469_1000x413.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:460,&quot;width&quot;:1114,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!luWH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F673a7aff-03fb-4855-803d-b0a33c311469_1000x413.png 424w, https://substackcdn.com/image/fetch/$s_!luWH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F673a7aff-03fb-4855-803d-b0a33c311469_1000x413.png 848w, https://substackcdn.com/image/fetch/$s_!luWH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F673a7aff-03fb-4855-803d-b0a33c311469_1000x413.png 1272w, https://substackcdn.com/image/fetch/$s_!luWH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F673a7aff-03fb-4855-803d-b0a33c311469_1000x413.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Around the same time, a different assumption cracked. Not in one dramatic moment, but across a series of events that added up to the same thing. Trump sanctioned the International Criminal Court, and Microsoft implemented those sanctions, locking the ICC's chief prosecutor out of his Outlook email. A Microsoft executive then told a French Senate committee, under oath, that the company cannot guarantee European customer data will never be handed to US authorities because the CLOUD Act requires US companies to comply with US government requests regardless of where the data physically sits. 
European data stored in European data centers, operated by a US company, is still subject to US law.</p><p>Two load-bearing assumptions, gone inside a few months.</p><h4><strong>***</strong></h4><p>There's a pattern here that I keep coming back to, and I think it sits at the intersection of psychology, probability, and how organizations actually work.</p><p>When an assumption holds long enough, it stops being an assumption. It becomes furniture. You stop seeing it because it was always there. And crucially, every day it holds true feels like evidence that it was always correct. The confidence compounds and the scrutiny fades.</p><p>That's a measurement error. We're tracking frequency, not fragility. The assumption isn't getting stronger with each passing day. The exposure is just accumulating quietly, out of sight.</p><p>This is what makes the <a href="https://www.resiliumlabs.com/blog?tag=Prevention%20paradox">prevention paradox</a> so insidious. The longer nothing goes wrong, the more confident you feel. The more confident you feel, the less you invest in questioning the foundations. And so the fragility grows precisely because things have been going so well.</p><p>Scientists have a name for the thing that prevents this trap: falsifiability. A good scientific hypothesis is one that can, in principle, be proven wrong. You hold it provisionally. You actively look for the counterexample. The absence of failure doesn't confirm the hypothesis; it just hasn't been refuted yet.</p><p>Organizations are terrible at this because falsifiability is uncomfortable. Treating your foundational assumptions as provisional feels destabilizing. It requires admitting that what you built on might not hold.
Most organizational cultures punish that kind of questioning, or at least fail to reward it.</p><p>So we get this strange situation: some of the most technically sophisticated organizations in the world run on assumptions they've never seriously stress-tested. Multi-AZ as physics, cloud providers as neutral infrastructure, and government as a stable, predictable background condition.</p><h4><strong>***</strong></h4><p>Taleb's black swan is often misread as a story about rare events, but I think it's more a story about accumulated fragility. It's about everything that was quietly wrong beforehand: the untested assumptions, the confidence that compounded without scrutiny, the floor that was never as solid as it felt. The event just ends the period of not knowing. What follows is disorienting precisely because the ground shifted under something we stopped questioning years ago.</p><p>I think this is worth sitting with, because the reflex after a shattered assumption is usually to overreact, replace it with a new one, and move on. We update the architecture, patch the policy, and add multi-region to the checklist. And then, gradually, the new assumption starts to calcify too.</p><p>The harder question you should ask yourself now is: what are we still treating as physics that isn't?</p><p>I don't have a clean answer. But I think the practice, the actual resilience practice, is to hold your foundational assumptions more loosely. To ask periodically: if this turns out to be wrong, what breaks? To make the questioning normal rather than exceptional. To treat "hasn't failed yet" as exactly what it is: a run of confirming evidence, not proof.</p><p>That's uncomfortable and it requires a kind of epistemic humility that organizations tend to select against. But the alternative is waiting for the next black swan to do the work for you.</p><p>//Adrian</p>]]></content:encoded></item><item><title><![CDATA[AI doesn't solve your problems. 
It moves them somewhere you can't see yet.]]></title><description><![CDATA[Estimated read time: 9 minutes]]></description><link>https://newsletter.resiliumlabs.com/p/ai-doesnt-solve-your-problems-it-moves-them</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/ai-doesnt-solve-your-problems-it-moves-them</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Mon, 02 Mar 2026 08:47:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!I2fV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec49a0b-b965-48dc-ab29-bd2175c46bf3_500x375.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I2fV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec49a0b-b965-48dc-ab29-bd2175c46bf3_500x375.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I2fV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec49a0b-b965-48dc-ab29-bd2175c46bf3_500x375.heic 424w, https://substackcdn.com/image/fetch/$s_!I2fV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec49a0b-b965-48dc-ab29-bd2175c46bf3_500x375.heic 848w, https://substackcdn.com/image/fetch/$s_!I2fV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec49a0b-b965-48dc-ab29-bd2175c46bf3_500x375.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!I2fV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec49a0b-b965-48dc-ab29-bd2175c46bf3_500x375.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I2fV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec49a0b-b965-48dc-ab29-bd2175c46bf3_500x375.heic" width="500" height="375" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ec49a0b-b965-48dc-ab29-bd2175c46bf3_500x375.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:375,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72264,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.resiliumlabs.com/i/193363208?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec49a0b-b965-48dc-ab29-bd2175c46bf3_500x375.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I2fV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec49a0b-b965-48dc-ab29-bd2175c46bf3_500x375.heic 424w, https://substackcdn.com/image/fetch/$s_!I2fV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec49a0b-b965-48dc-ab29-bd2175c46bf3_500x375.heic 848w, 
https://substackcdn.com/image/fetch/$s_!I2fV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec49a0b-b965-48dc-ab29-bd2175c46bf3_500x375.heic 1272w, https://substackcdn.com/image/fetch/$s_!I2fV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec49a0b-b965-48dc-ab29-bd2175c46bf3_500x375.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Estimated read time: 9 minutes</p><div><hr></div><p>There's a seductive story about AI in operations that goes something like 
this: we have problems, let's deploy AI, it will fix them. Incidents will resolve faster, anomalies will get caught earlier, postmortems will draft themselves in minutes instead of hours, metrics will improve. I've been hearing this story recently and I don't doubt the promise, but improved metrics and solved problems are not the same thing. The problems don't go away when the metrics get better; they go somewhere else, take new forms, and show up in places nobody thought to look. The question is where they go, and what they look like once they get there.</p><p>I've been circling this question for a while. About a year ago I wrote about <a href="https://adhorn.medium.com/when-ai-makes-the-call-b10b094e1b8f">AI meta-operators and system responsibility</a>, trying to work through what happens when AI agents start making operational decisions that used to require human judgment. More recently I wrote about <a href="https://www.resiliumlabs.com/blog/chaos-engineering-ai-generated-code">chaos engineering for AI-generated code</a>, arguing that the velocity and opacity of AI-assisted development demands systematic stress-testing because human review alone can't keep pace. In both cases I could feel the problems but I couldn't connect them, couldn't see why the same kinds of trouble kept showing up in different forms. In <a href="https://www.resiliumlabs.com/book">my book</a> I argue that a common vocabulary is one of the most underrated tools in resilience work, because once you can name something you can discuss it, and once you can discuss it you can start to act on it. What I was missing was exactly that: a vocabulary for what AI does to the mess rather than just the mess itself.</p><h4>***</h4><p><a href="https://en.wikipedia.org/wiki/David_Woods_(safety_researcher)">David Woods</a> has been mapping exactly that. 
He has been developing a heuristic he calls the Messy 9, which he recently premiered on the <a href="https://www.youtube.com/watch?v=0mum2JW2C3w">Fine Pod podcast</a> and discussed in the <a href="https://resilienceinsoftware.org">Resilience in Software Foundation</a> Slack channel. It's designed to bridge the science of how complex systems actually work and the practical need to do something about it. The setup is what Woods calls GCA: patterns over cycles of Growth, Complexification, and Adaptation. When new technological capabilities affect ongoing worlds of practice, processes of growth, complexification, and adaptation play out in lawful patterns, and stories of technology change should capture or envision the new forms of messiness that arise when apparent benefits get hijacked. The core message is that the messiness of the real world is conserved over attempts to improve systems, conserved in the formal sense described by the No Free Lunch and Robust Yet Fragile theorems: you don't eliminate messes, you move them.</p><p>Woods organises the recurring forms into nine patterns, grouped in threes: (1) congestion, cascades, and conflict; (2) saturation, lag, and friction; (3) tempos, surprises, and tangles. He describes them as a small set of generic keys you can use to unlock any episode of change to see how messes reappear, with much of the action living in the cross-connects and overlaps between them. Each points to processes that play out over time as systems grow, and each takes on unfamiliar forms when AI enters the picture.</p><p><strong>Congestion</strong>, in Woods' framing, is what happens when a bunch of things are going on simultaneously and you have to deal with them all in the time available. Cascades are disturbances propagating across lines of interdependency, where one failure dumps load onto adjacent functions and the effects spread. 
Conflict is the question of who loses, who sacrifices, what gets prioritised when there's overload, what gets sacrificed first and what gets sacrificed later. These three are the most visible forms of messiness, and in traditional distributed systems we've built tooling to handle them: circuit breakers, bulkheads, load shedding, runbooks. But when AI is the operator, these patterns migrate into territory that existing tooling can't see.</p><p>Consider what cascading failure looks like when one model's output feeds another model's judgment, which triggers a third model's action. The failure propagates through reasoning rather than network calls and retry logic. A subtly wrong interpretation becomes a confident decision becomes an automated action, and the whole chain looks healthy from the outside because every component is performing exactly as designed. The <strong>cascade</strong> is there, running through inference rather than infrastructure, and it remains invisible until the consequences arrive in a form nobody anticipated.</p><p><strong>Saturation</strong>, in Woods' framework, is what happens when a system approaches the boundary where it runs out of capacity to deal with challenges, where different subsystems start dumping more overload onto other places and the saturation spreads. In traditional systems this means hitting a resource limit you can see, measure, and plan for. AI introduces a different kind: decision saturation, where AI handles enough operational decisions that the humans nominally overseeing it lose the ability to meaningfully evaluate what it's doing, simply because the volume and speed of AI-driven decisions exceed what human attention can track. 
The oversight saturates while the system hums along, and nobody notices because the dashboards still look green.</p><p><strong>Lag</strong> is what Woods describes as the pattern where organisations cut the resources they need to integrate new capabilities before they have actually integrated them, anticipating productivity gains and reducing experience, expertise, and people before the ostensible benefits have materialised. This is playing out everywhere right now: teams are being restructured around AI productivity assumptions while the actual work of figuring out where AI fits, where it doesn't, and what new failure modes it introduces is still in its earliest stages. The resources being cut are the very resources needed to discover whether the new capability actually works as advertised.</p><p><strong>Friction</strong> undergoes an equally counterintuitive transformation. Woods frames friction as a necessary feature of bringing capabilities into practice, the offsetting costs and workload that arise when something new meets the complexity of the real world, and warns that if you underplay it, the things you try to deploy turn out not to work as well as you would like. The old friction in operations was obvious: manual processes, handoffs between teams, slow approval chains. AI removes it, which feels like progress, but friction served a function. It slowed things down enough for people to notice when something was off, created natural pause points where someone might say "wait, does this actually make sense?" Removing the friction removes the speed bumps that gave humans a chance to catch problems before they compounded, making the system faster and smoother while also making it more brittle in ways that only surface when speed and smoothness were exactly the wrong thing.</p><p><strong>Tempos</strong>, the seventh pattern, describe what happens when different rates of change collide. 
In DevOps this is already familiar: the tempo of development influences operations, operations constrains development, and incident response introduces its own urgency that overrides both. AI adds new tempos that don't match existing ones: the speed at which models make decisions versus the speed at which humans can review them, the rate at which AI-driven changes accumulate versus the rate at which organisations can absorb their implications. These tempo mismatches create their own congestion as decisions pile up faster than the capacity to evaluate them.</p><p><strong>Surprises</strong>, the eighth pattern, are not about rare events at the tail of a distribution. Woods insists that the dragons of surprise don't get weaker and more infrequent as systems improve; instead, systems generate new categories of surprise as they change. AI is particularly good at producing novel categories because it fails differently from humans. When a human operator makes a mistake, other humans can usually reconstruct the reasoning and see how someone under pressure with incomplete information made a wrong call. When AI makes a mistake the reasoning is opaque, nobody can explain why, which means nobody can confidently say it won't happen again in a different form, which means the organisation can't learn from it in the way it has always learned from human error.</p><p>The <strong>tangles</strong> might be the most troubling of all. Woods describes tangles as circular dependencies and strange loops, the kind of thing he first encountered in nuclear power plants where a critical function depended on an instantiation of itself. When multiple AI systems operate with overlapping domains, one monitoring infrastructure, another triaging incidents, a third managing capacity, they develop implicit dependencies that exist nowhere in any architecture diagram, learning to compensate for each other's behaviour in ways their operators never specified. 
These tangles are invisible during normal operations and surface during failure, when one system's unexpected behaviour cascades through compensation patterns that nobody knew existed, creating a debugging problem that is qualitatively different from anything human operators have encountered before. Woods gives a vivid example from the current AI gold rush itself: we need critical infrastructure to support AI computations, AI is being deployed to reduce the people who operate that infrastructure, and the AI doing the operating depends on the infrastructure it's supposed to be operating. The circular dependency is already there.</p><p>All of this converges on what Woods identifies as the key constraint: extra adaptive capacity is most needed when least affordable. AI adoption is a period of significant system change that demands more adaptive capacity, more ability to recognise and respond to novel situations, precisely because the system is in flux and the new failure modes haven't been mapped yet. AI adoption also consumes adaptive capacity, because organisations use it as a reason to reduce the human expertise that provides it. The need goes up while the supply goes down, and that gap is where the real risk lives, in a place that improved operational metrics will never show you.</p><p>None of this means AI is bad for operations; the improvements are real and sometimes substantial. The story that AI is going to solve your problems is almost certainly incomplete, because as Woods puts it, the Messy 9 exists to counter the tendency we all have to see whatever we develop as solving something instead of moving things and shifting processes. AI solves the problems you are measuring while generating new ones you haven't learned to see yet, and the messes migrate to places that your current instrumentation and your current organisational structure are not designed to detect. 
If you're deploying AI into your operations and your metrics are getting better, the question worth asking is where the mess went, because messiness is conserved over cycles of change. It takes new forms and it operates at new scales, and that puts a higher premium on exactly the experience, skill, and expertise to figure out how the system is working when it's not working the way you thought it was.</p><p>//Adrian</p>]]></content:encoded></item><item><title><![CDATA[Why We Still Suck at Resilience and Why I Wrote a Book About It]]></title><description><![CDATA[The last few months have been pretty quiet in here, and the reason is the same reason this piece exists.]]></description><link>https://newsletter.resiliumlabs.com/p/why-we-still-suck-at-resilience-the-book</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/why-we-still-suck-at-resilience-the-book</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Wed, 18 Feb 2026 16:07:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CjVo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa90effd9-a7cb-4c92-8ae3-b44b992fb8ff_500x333.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CjVo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa90effd9-a7cb-4c92-8ae3-b44b992fb8ff_500x333.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CjVo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa90effd9-a7cb-4c92-8ae3-b44b992fb8ff_500x333.heic 424w, 
https://substackcdn.com/image/fetch/$s_!CjVo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa90effd9-a7cb-4c92-8ae3-b44b992fb8ff_500x333.heic 848w, https://substackcdn.com/image/fetch/$s_!CjVo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa90effd9-a7cb-4c92-8ae3-b44b992fb8ff_500x333.heic 1272w, https://substackcdn.com/image/fetch/$s_!CjVo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa90effd9-a7cb-4c92-8ae3-b44b992fb8ff_500x333.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CjVo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa90effd9-a7cb-4c92-8ae3-b44b992fb8ff_500x333.heic" width="500" height="333" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a90effd9-a7cb-4c92-8ae3-b44b992fb8ff_500x333.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:333,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:21955,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.resiliumlabs.com/i/193363209?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa90effd9-a7cb-4c92-8ae3-b44b992fb8ff_500x333.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!CjVo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa90effd9-a7cb-4c92-8ae3-b44b992fb8ff_500x333.heic 424w, https://substackcdn.com/image/fetch/$s_!CjVo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa90effd9-a7cb-4c92-8ae3-b44b992fb8ff_500x333.heic 848w, https://substackcdn.com/image/fetch/$s_!CjVo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa90effd9-a7cb-4c92-8ae3-b44b992fb8ff_500x333.heic 1272w, https://substackcdn.com/image/fetch/$s_!CjVo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa90effd9-a7cb-4c92-8ae3-b44b992fb8ff_500x333.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The last few months have been pretty quiet in here, and the reason is the same reason this piece exists. I wrote a book. It is called "Why We Still Suck at Resilience," and you can get it <a href="https://leanpub.com/whywestillsuckatresilience">here</a>.</p><p>If you have been following me for any length of time, you will recognise the argument even if the packaging is new. Organizations confuse performing resilience with actually being resilient, and the five practices that are supposed to manage that gap (chaos engineering, load testing, GameDays, incident analysis, operational readiness reviews) have drifted so far from their original purpose that they often make the problem worse. The book is an attempt to say clearly what I have been circling around in LinkedIn posts and on my blog, that the gap between how we imagine our systems work and how they actually work is growing, and most of what we are doing about it is theater.</p><p>The idea started during my time at AWS, where I helped build the Fault Injection Service and spent a decade working with some of the world's largest organizations on how they practice resilience at scale. I saw teams running chaos experiments that confirmed what they already believed rather than discovering what they did not know, load tests scoped to pass rather than to learn, incident reviews that produced action items nobody followed up on because the organizational incentives pointed elsewhere. The patterns were consistent enough across companies, industries, and team sizes that I became convinced the problem was not individual teams failing to execute. 
Something structural was going wrong, something about the way organizations relate to learning under pressure, and I wanted to understand what it was.</p><p>Writing the book took longer than I expected, partly because I kept discovering that the problem was deeper than I had initially framed it. What started as a practitioner's guide to doing these five practices well became something more like an investigation into why organizations systematically undermine their own capacity to learn from failure. The research pulled me into safety science, into Bainbridge and Rasmussen and Woods, into the gap between Work-as-Imagined and Work-as-Done, and into the uncomfortable recognition that I had been part of the problem myself. I built tooling that optimised for individual practice metrics while ignoring the learning system those practices were supposed to form. The value of resilience work lives in the connections between practices, in how an incident finding becomes a chaos experiment becomes a readiness review question, and nothing we had built supported those connections.</p><p>The final chapter turned out to be the one that mattered most, both for the book and for where my thinking is already going. It is about AI, but not in the way you might expect. 
Every observability vendor now offers AI-powered anomaly detection, every incident platform promises AI-drafted postmortems, and the marketing suggests these tools will finally eliminate the gap between how we imagine our systems work and how they actually work.</p><p>Just this week, Norberto Lopes at <a href="https://incident.io">incident.io</a> <a href="https://www.linkedin.com/posts/norbertomlopes_this-was-a-mind-blowing-moment-the-team-activity-7429128709061836801-6h-A?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAAABHL5IBabIttkhSnpXrGDQ3gKQFFJVH-K4">posted</a> about their AI-generated postmortem capability, calling it a mind-blowing moment, celebrating that the write-up was accurate, well-contextualised, and took zero time to draft. Around the same time, Ozan Unlu <a href="https://www.linkedin.com/pulse/welcome-observability-30-evolution-from-search-telemetry-ozan-unlu-9k61c/">published</a> a piece describing what he calls Observability 3.0, a vision in which AI agents handle incident triage, root cause analysis, and postmortems autonomously, with humans elevated to "creativity and innovation" while machines do the interpretive work. I do not doubt that either of these tools produces excellent output. What worries me is that the productive struggle of writing a postmortem, or correlating signals during an incident, or reasoning through why a system behaved the way it did, is where most of the actual learning occurs. The document was never really the point; the thinking that produced it was. The interpretation was never the bottleneck; it was the training ground. 
When you skip the thinking and get a better document, or remove humans from the sensemaking and call it empowerment, you have made a trade that is easy to celebrate and difficult to measure the cost of until much later, when something breaks and the organization discovers it no longer has the understanding it assumed it was building all along.</p><p>Bainbridge <a href="https://www.sciencedirect.com/science/article/abs/pii/0005109883900468">identified</a> this dynamic in 1983: automation designed to remove humans from a system paradoxically makes the human role both more critical and more difficult, because it removes the routine experience that builds expertise. AI is a higher form of the same problem, one that does not just automate tasks but automates the judgment and reasoning that create the deepest understanding.</p><p>I believed that argument completely while writing it, and the speed at which I have started questioning it has caught me off guard. Over the past few weeks, there has been a growing conversation among engineers and researchers (<a href="https://shumer.dev/something-big-is-happening">here</a> for example) about a noticeable leap forward in the latest generation of models, not just in raw capability but in the quality of reasoning. For the first time, my experience starts to mirror what others are describing.</p><p>I have been using Anthropic's Claude, both the Opus 4.6 model and its extended thinking variant, and what unsettled me was the texture of how it worked through problems. On coding tasks that required sound architectural judgment, the kind where there is no clean answer and the trade-offs depend on context that has to be reasoned about rather than looked up, the model was both producing correct outputs and thinking through the problem in ways that felt uncomfortably close to how a senior engineer thinks through it. 
It would express uncertainty about its own choices, flag trade-offs I had not raised, change direction mid-reasoning when it encountered something that complicated its initial approach.</p><p>Doubt, revision, context-sensitivity: these are what learning looks like. They are, in cognitive science and education research, the established preconditions for understanding: cognitive conflict that exposes mismatches between a mental model and reality, willingness to reorganise prior knowledge rather than just accumulate new information, and attention to context as the thing that determines whether knowledge transfers or stays brittle. The assumption that has underpinned my thinking, that when automation fails it will be humans who catch the fall, rests on the belief that machines cannot do the interpretive work. If that belief is wrong, or even if it is only wrong for long enough that organizations stop maintaining human capability in the meantime, then the question of who fixes what when things break becomes genuinely open in a way it has never been before.</p><p>I do not say that to be dramatic, and I do not have a clean resolution. The book makes a case I still think is largely right: that organizations need to protect the productive struggle, the learning that comes from humans wrestling with ambiguous problems, because that struggle is where durable understanding forms. What I am less sure about is whether the time horizon for that argument is decades or years, and the difference between those two answers changes everything about how I think about the work ahead.</p><p>Which brings me to what is next. In my consulting work at <a href="https://www.resiliumlabs.com/home">Resilium Labs</a>, AI in operations has slowly become a recurring topic in nearly every engagement. 
Clients who originally brought me in to diagnose their chaos engineering, load testing, operational readiness, and incident review processes are now asking questions about what AI is doing to their teams' ability to understand their own systems. The conversations are shifting because the problem is shifting, and the methodology I have always used (embedded observation, stakeholder interviews, tracing the gap between how people imagine work happens and how it actually happens) turns out to apply directly to organizations trying to figure out whether their AI adoption is building genuine capability or creating new blind spots they have not learned to see yet. The work is evolving because the organizations I work with are evolving, and the questions they need answered are no longer only about resilience practices in the traditional sense.</p><p>There will be more writing here in the weeks ahead, pieces that work through these questions. If anything in the book resonates, or if it does not, I would like to hear about it.</p><p>//Adrian</p>]]></content:encoded></item><item><title><![CDATA[The Prevention Paradox at Civilizational Scale]]></title><description><![CDATA[&#8220;La flamme de la r&#233;sistance fran&#231;aise ne doit pas s&#8217;&#233;teindre et ne s&#8217;&#233;teindra pas&#8221; - Charles de Gaulle, appel du 18 juin 1940]]></description><link>https://newsletter.resiliumlabs.com/p/prevention-paradox-civilizational-scale</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/prevention-paradox-civilizational-scale</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Tue, 17 Feb 2026 07:24:01 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ec1f9242-65b7-421a-822e-36d2f15099f2_1000x673.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://www.vercors-resistance.fr" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IAn5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce0eafa-c9f0-4e70-bab2-14063a2e47fa_1000x673.jpeg 424w, https://substackcdn.com/image/fetch/$s_!IAn5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce0eafa-c9f0-4e70-bab2-14063a2e47fa_1000x673.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IAn5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce0eafa-c9f0-4e70-bab2-14063a2e47fa_1000x673.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IAn5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce0eafa-c9f0-4e70-bab2-14063a2e47fa_1000x673.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IAn5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce0eafa-c9f0-4e70-bab2-14063a2e47fa_1000x673.jpeg" width="1619" height="1090" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ce0eafa-c9f0-4e70-bab2-14063a2e47fa_1000x673.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1090,&quot;width&quot;:1619,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.vercors-resistance.fr&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!IAn5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce0eafa-c9f0-4e70-bab2-14063a2e47fa_1000x673.jpeg 424w, https://substackcdn.com/image/fetch/$s_!IAn5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce0eafa-c9f0-4e70-bab2-14063a2e47fa_1000x673.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IAn5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce0eafa-c9f0-4e70-bab2-14063a2e47fa_1000x673.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IAn5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce0eafa-c9f0-4e70-bab2-14063a2e47fa_1000x673.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div><p>&#8220;La flamme de la r&#233;sistance fran&#231;aise ne doit pas s&#8217;&#233;teindre et ne s&#8217;&#233;teindra pas&#8221; (&#8220;The flame of French resistance must not be extinguished and shall not be extinguished&#8221;) - Charles de Gaulle, appel du 18 juin 1940</p><p>When I was a kid growing up near Grenoble in France, veterans of the Second World War came to our school to talk about the resistance. The region around Grenoble was a hotspot for the French Resistance, and many of the surrounding villages lost generations of men in the war. The veterans were old by then, and they knew they didn't have many visits left.</p><p>I remember very clearly what one of them said:</p><blockquote><h4>"My biggest worry is that once we are all gone, no one will be there to tell the story, and people will make the same mistakes again."</h4></blockquote><p>I thought about him this week while reading Ray Dalio's latest piece, <a href="https://www.linkedin.com/pulse/its-official-world-order-has-broken-down-ray-dalio-cuofe/">"It's Official: The World Order Has Broken Down"</a>. At the Munich Security Conference, multiple world leaders said the same thing: the post-1945 order no longer exists. German Chancellor Merz: "The world order as it has stood for decades no longer exists." Macron: Europe must prepare for war. Rubio: we are in a "new geopolitics era" because the "old world" is gone.</p><p>The veterans are gone now. And here we are.</p><p>Dalio frames this through his "Big Cycle," a pattern that repeats across centuries. A dominant power wins a conflict, builds institutions and rules, and establishes an order that works. 
Over time, the stability that was <em>produced</em> by all that investment starts to look like it was always there. Investment in maintaining it declines. The institutions hollow out. Then a crisis arrives and everyone discovers the order was already gone long before the crisis hit.</p><p>I kept reading, and I kept seeing the prevention paradox.</p><h3>Effective prevention creates doubt about its necessity</h3><p>In my <a href="https://leanpub.com/whywestillsuckatresilience">book</a> and this <a href="https://www.resiliumlabs.com/blog/the-prevention-paradox">blog post</a>, I describe a pattern that plays out with painful regularity inside technology organizations. A serious incident creates the political will to invest in resilience. A strong coalition builds institutions: incident review processes, chaos engineering programs, operational readiness reviews, feedback loops that institutionalize resilience thinking. It works. Stability becomes normal. And then the stability that was produced by the investment starts looking like the natural state of things rather than something that requires active maintenance.</p><p>Then a newly appointed leader asks:</p><p>"What exactly does your team do? I see a lot of salary costs, but what are the deliverables?"</p><p>The team scrambles to explain their value but doesn't have the data to back up their claims. "We prevented failures" doesn't translate well to budget spreadsheets. The practices get cut. The degradation that follows gets attributed to other causes. The cycle repeats.</p><p>The post-1945 world order followed the same arc. The catastrophe of two world wars created the political will to invest in international institutions: the United Nations, NATO, the Bretton Woods system, trade agreements, alliances. A strong coalition (led by the US) built and maintained them. They worked. 
Decades of relative peace and prosperity followed.</p><p>And then the stability that those institutions produced started looking like it was always there.</p><h3>The better you prevent problems, the fewer problems exist to justify preventing them</h3><p>This is the core of the prevention paradox. Our brains struggle to value non-events. The availability heuristic means we judge importance by how easily examples come to mind. Dramatic incidents are memorable. Non-events leave no trace. There's no story to tell about the war that didn't happen.</p><p>Hindsight bias compounds the problem. Once we know the benign outcome, we reconstruct the past as if failure was never very likely. After decades of relative peace, people conclude "clearly the threat was overstated" rather than recognizing "nothing happened because we took appropriate precautions."</p><p>At the organizational level, I've watched this play out with chaos engineering programs, incident review processes, and operational readiness reviews. The more effective they are, the more unnecessary they appear. At the geopolitical level, the same dynamic hollowed out commitment to the institutions that prevented great power conflict for 80 years.</p><p>The costs of maintaining the order were always visible: military spending, diplomatic effort, economic concessions, compromises on sovereignty. The benefits were counterfactual: wars that didn't happen, conflicts that got resolved before they escalated, trade disputes that didn't become economic wars. When decisions require weighing visible costs against hypothetical benefits, visible costs carry more weight.</p><h3>New leaders who never experienced the crisis</h3><p>In the chapter, I describe how the leaders who question resilience investment often aren't people who forgot what the work does. They're people who never experienced what the work prevents. 
They joined during the stable period that effective resilience created, and from their vantage point the stability looks like the natural state of things.</p><p>The VP who asked "what exactly does your team do?" wasn't ignoring history. He had no history to ignore. He inherited a system that worked and saw a team whose purpose he couldn't connect to any problem he'd experienced.</p><p>At Munich, we're watching the geopolitical version. An entire generation of leaders grew up inside the stability the post-war order produced. They inherited institutions that worked and saw costs they couldn't connect to any catastrophe they'd lived through. The people who built the post-war order, the people who remembered the destruction that justified it, are gone. Their memory of why it mattered left with them.</p><p>That veteran in my school saw this coming forty years ago.</p><p>Dalio points out that the leaders now declaring the old order dead are in fact acknowledging something that's been happening for years, through eroding commitments, hollowed-out alliances, and institutional decay that went mostly unnoticed while the surface structures remained in place.</p><h3>Stage 6 creeps in</h3><p>One of Dalio's most useful observations is that the breakdown doesn't start with the shooting war. It starts years earlier with economic wars, technology wars, capital wars, and geopolitical maneuvering. By the time the shooting starts, the underlying order has been gone for a long time.</p><p>I describe the same pattern at the organizational level. Practices continue but without the depth. Chaos experiments drift toward safe scenarios. Incident reviews stay at surface-level fixes. GameDays become scripted performances. The practices become theater while appearing unchanged. The degradation is invisible because the surface activities continue, even as the learning capacity underneath erodes.</p><p>The international institutions didn't collapse overnight either. The UN still meets. 
NATO still exists. Trade agreements are still on paper. But the commitment behind them has been eroding for years. The practices became theater while appearing unchanged.</p><p>By the time a significant incident occurs, the causation is thoroughly obscured. Months or years have passed between the reduction in investment and the consequences. The delay prevents connecting the cuts to the outcomes. Each generation of leadership learns through fresh pain.</p><h3>The cycle isn't inevitable</h3><p>Dalio ends his chapter with something that I think is worth holding onto. He says the cycle doesn't have to end in catastrophe, if countries "stay productive, earn more than they spend, make the system work well for most of their populations, and figure out ways of creating and sustaining win-win relationships."</p><p>In my <a href="https://www.resiliumlabs.com/book">book</a>, I argue the same: the pattern is predictable, but it can be navigated. It requires treating stability as an output that needs continuous input, not a resting state. It requires building institutional memory that survives the people who experienced the original crisis. It requires celebrating prevention rather than only celebrating heroic response. And it requires leadership that remembers why the institutions were built in the first place, even when everything looks fine.</p><p>Whether we're talking about an engineering organization or the world order, the prevention paradox operates through the same mechanism. Success erases its own evidence. The world order didn't break down last week in Munich. 
It broke down gradually, over years, as the commitment to maintaining it eroded while everyone assumed it would just persist on its own.</p><p>One of the commenters on Dalio's piece, Mike Wagner, put it nicely: "A lot of what we built assumed stability was just there in the background, and that assumption is gone."</p><p>That's the prevention paradox, at every scale.</p><p>//Adrian</p>]]></content:encoded></item><item><title><![CDATA[Why Your Chaos Experiments Give You False Confidence]]></title><description><![CDATA[You've done everything right.]]></description><link>https://newsletter.resiliumlabs.com/p/chaos-experiments-false-confidence</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/chaos-experiments-false-confidence</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Fri, 09 Jan 2026 13:58:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e873a20e-885e-45e6-b79d-ea1de54f06e5_1000x560.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div><hr></div><p>You've done everything right.</p><p>You ran the hypothesis conversation. Your team discovered they had different mental models of how database failover works. You investigated the gaps. You fixed the connection pool configuration and the health check logic. You added monitoring for connection pool state.</p><p>Then you ran the experiment. Database fails over, circuit breaker trips, traffic routes to the replica, recovery completes in 30 seconds. Everything worked exactly as expected.</p><p>Three months later, the database fails during peak traffic. The system enters a death spiral. Circuit breakers trip and reset in rapid cycles. Connection pools exhaust. The replica becomes overloaded. Health checks start failing on healthy instances. Retries amplify the load. Recovery takes 23 minutes of manual intervention.</p><p>You tested this exact scenario. It worked perfectly. 
What happened?</p><p><strong>You tested at the wrong load.</strong></p><p>During your experiment, you ran with minimal traffic. Maybe some synthetic requests, maybe just manual testing. Production was handling 800 requests per second when the database failed.</p><p>That difference activated completely different system dynamics.</p><h3>Why load changes everything</h3><p>Systems with spare capacity behave as if they were deterministic. The same inputs produce the same outputs. Failures are reproducible. You can reason about what will happen and predict outcomes.</p><p>When systems run near their limits, they become non-deterministic. The <a href="https://www.perfdynamics.com/Manifesto/USLscalability.html">Universal Scalability Law</a> predicts this: contention and coherency costs cause non-linear performance collapse beyond critical load thresholds. Several mechanisms combine to create this shift.</p><p><strong>Bimodal components.</strong> A cache either hits or misses. A circuit breaker is either closed or open. A connection pool either has available connections or is exhausted. A health check either passes or fails. Each is a binary state change that delivers radically different behavior.</p><p>At low load, mode transitions don't matter much. A cache miss? The database has spare capacity. A circuit breaker opens? Other instances absorb the traffic. You have operational margin to absorb these transitions.</p><p>At high load, the same transitions cascade. A cache miss means the database is already near its limit, so the additional query causes queueing, which slows all queries, which triggers timeouts and retries. A circuit breaker opening means remaining instances are already near capacity, so redirected traffic pushes them over, causing their circuits to open.</p><p><strong>Queueing effects.</strong> At low load, a cache miss adds one query to a queue of 10, and the impact is marginal. At high load, a cache miss adds one query to a queue of 10,000. 
The queue is already near capacity, and that additional query pushes it deeper into the non-linear region where waiting time explodes. This affects all subsequent requests, not just the one that missed.</p><p><strong>Concurrency races.</strong> At low load, if two instances' circuit breakers both approach their trip thresholds, they trip at different times because request patterns vary enough. The first trips, load redistributes, the system absorbs it.</p><p>At high load, many instances approach thresholds simultaneously. One trips, load redistributes, and the redistribution pushes others over. They all trip within seconds of each other, and you get synchronized state transitions across your entire fleet rather than gradual failover.</p><p><strong>Resource cascades.</strong> At low load, one slow request holds a thread a bit longer, but other threads are available and nothing cascades. At high load, one slow request holds a thread when no other threads are available. Incoming requests queue, the backup causes timeouts, timeouts trigger retries, and retries push thread pools deeper into exhaustion. Circuit breakers watching these timeouts approach their thresholds and trip. Three systems have now changed state because one request was slow at the wrong moment.</p><p><strong>Timing variations.</strong> Which request becomes slow depends on factors you can't control. A database query lands on a disk performing compaction. A network packet gets delayed by congestion. A GC pause happens at an unfortunate moment. Under low load, these variations don't matter. Under high load, whichever request is slow can trigger a cascade.</p><p>Run the same chaos experiment five times at low load and you get consistent results. Run it five times at high load and you get five different outcomes. You're not seeing randomness. 
You're seeing emergent behavior from multiple interacting mechanisms whose states you can't precisely control.</p><h3>Metastability: when resilience mechanisms prevent recovery</h3><p>This pattern has a name: metastability. Researchers studying large-scale system failures <a href="https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf">identified this pattern</a> and gave it a precise definition: failure states sustained by the system's own behavior, not by any ongoing external problem.</p><p>The trigger is gone. The database is back. The network is healthy. Every component passes health checks. But the system serves almost no useful work because the resilience mechanisms you built now create more load than the system can handle.</p><p>The cache can't rewarm because database load is too high. The circuit breaker can't close because retry load prevents downstream recovery. The connection pool can't replenish because failures happen faster than connections establish. The health checks can't pass because load redistribution keeps instances above the timeout threshold.</p><h3>Your system occupies one of three states:</h3><p>This diagram shows the relationship between load and goodput, with the three distinct system states.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8pyn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cea08d9-3de6-4a2e-a7f0-824447f67d24_1000x560.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8pyn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cea08d9-3de6-4a2e-a7f0-824447f67d24_1000x560.png 424w, 
https://substackcdn.com/image/fetch/$s_!8pyn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cea08d9-3de6-4a2e-a7f0-824447f67d24_1000x560.png 848w, https://substackcdn.com/image/fetch/$s_!8pyn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cea08d9-3de6-4a2e-a7f0-824447f67d24_1000x560.png 1272w, https://substackcdn.com/image/fetch/$s_!8pyn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cea08d9-3de6-4a2e-a7f0-824447f67d24_1000x560.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8pyn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cea08d9-3de6-4a2e-a7f0-824447f67d24_1000x560.png" width="2438" height="1366" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8cea08d9-3de6-4a2e-a7f0-824447f67d24_1000x560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1366,&quot;width&quot;:2438,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8pyn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cea08d9-3de6-4a2e-a7f0-824447f67d24_1000x560.png 424w, 
https://substackcdn.com/image/fetch/$s_!8pyn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cea08d9-3de6-4a2e-a7f0-824447f67d24_1000x560.png 848w, https://substackcdn.com/image/fetch/$s_!8pyn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cea08d9-3de6-4a2e-a7f0-824447f67d24_1000x560.png 1272w, https://substackcdn.com/image/fetch/$s_!8pyn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cea08d9-3de6-4a2e-a7f0-824447f67d24_1000x560.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div><p>Inspired by: <a href="https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf">https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf</a></p><p>The green curve shows the <strong>stable state</strong>. Goodput scales with load. The system has spare capacity, so it can absorb triggers and self-recover. This is where your chaos experiments typically run.</p><p>The blue line shows the <strong>vulnerable state</strong>. The system operates efficiently near its limits. Goodput stays high, but there's little margin. A trigger (the fire icon) can push the system off this line.</p><p>The red dashed line shows <strong>metastability</strong>. Goodput collapses to near zero even though load remains high. The system has fallen off the curve entirely. Positive feedback loops prevent recovery.</p><p>The dotted arrows show what happens: a trigger pushes the system from vulnerable down into metastable failure, and it takes intervention (and reducing load significantly) to get back up to stable.</p><p>The state diagram in the corner shows the transitions. Stable can drift into vulnerable. Vulnerable can fall into metastable failure when a trigger activates sustaining effects. Metastable requires intervention to escape. You can't just remove the trigger and wait for recovery.</p><h3>Two practices that should be one</h3><p>Most organizations run load testing and chaos engineering as separate activities. The load testing team validates capacity. The chaos engineering team validates failure degradation. The two rarely meet.</p><p>Load testing without failure injection shows you how the system performs when nothing goes wrong. Chaos engineering without realistic load shows you how the system handles failures when it has spare capacity to absorb them.
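</p><p>The sustaining feedback that defines metastability can be sketched in a few lines of toy simulation. Nothing here models a real system; the numbers (800 requests/second of client load, 1,000 of capacity, every failure retried on the next tick, overload wasting capacity) are illustrative:</p>

```python
def step(retry_backlog, client_rps=800, capacity=1000):
    """One tick of a toy retry-amplification model.

    Offered load = fresh client requests + retries of last tick's failures.
    Past saturation, goodput degrades as capacity * (capacity / offered),
    modeling work wasted on requests that will time out anyway.
    """
    offered = client_rps + retry_backlog
    goodput = offered if offered <= capacity else capacity * capacity / offered
    return offered - goodput  # failures; all retried next tick

def backlog_after(trigger_backlog, ticks=100):
    """Retry backlog remaining long after the trigger itself is gone."""
    r = trigger_backlog
    for _ in range(ticks):
        r = step(r)
    return r

print(backlog_after(300))  # small trigger: backlog drains to 0, system recovers
print(backlog_after(600))  # bigger trigger: retries alone sustain the overload
```

<p>In this toy model, any backlog below roughly 450 drains on its own, while anything above it keeps growing even though full capacity is restored. That threshold is exactly the kind of trigger size you only discover by injecting failures while the system carries realistic load.</p><p>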
Neither tells you what happens when failures occur under production load.</p><p>That combination is exactly what production delivers. And that vulnerable region is exactly where you need to run your experiments to find the triggers that will collapse your system into metastability.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dtiD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd8315d4-d6ff-4951-a4eb-9412cf52919d_1000x562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dtiD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd8315d4-d6ff-4951-a4eb-9412cf52919d_1000x562.png 424w, https://substackcdn.com/image/fetch/$s_!dtiD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd8315d4-d6ff-4951-a4eb-9412cf52919d_1000x562.png 848w, https://substackcdn.com/image/fetch/$s_!dtiD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd8315d4-d6ff-4951-a4eb-9412cf52919d_1000x562.png 1272w, https://substackcdn.com/image/fetch/$s_!dtiD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd8315d4-d6ff-4951-a4eb-9412cf52919d_1000x562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dtiD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd8315d4-d6ff-4951-a4eb-9412cf52919d_1000x562.png" width="2430" height="1366" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd8315d4-d6ff-4951-a4eb-9412cf52919d_1000x562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1366,&quot;width&quot;:2430,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!dtiD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd8315d4-d6ff-4951-a4eb-9412cf52919d_1000x562.png 424w, https://substackcdn.com/image/fetch/$s_!dtiD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd8315d4-d6ff-4951-a4eb-9412cf52919d_1000x562.png 848w, https://substackcdn.com/image/fetch/$s_!dtiD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd8315d4-d6ff-4951-a4eb-9412cf52919d_1000x562.png 1272w, https://substackcdn.com/image/fetch/$s_!dtiD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd8315d4-d6ff-4951-a4eb-9412cf52919d_1000x562.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div><p>Finding those triggers before production does is what chaos engineering and load testing are for. But you can only find them by testing in the vulnerable region where production operates.</p><h3>What this means for your experiments</h3><p>The hypothesis conversation reveals gaps in your team's mental models. Investigation and fixing address the knowable problems. But there's a third category we didn't fully explore: emergent behavior that only appears under load.</p><p>When you design experiments for uncertainty gaps, the ones where behavior emerges from component interactions, you need to test under realistic load. Not "some" load. Production-equivalent load.</p><p>Your experiment design from the last newsletter asked: "We want to know how our application's retry logic interacts with database failover under realistic traffic load."</p><p>Realistic traffic load is the key phrase.
If you test at 50 requests per second and production runs at 800, you're testing a different system. The stable regime behaves deterministically. The vulnerable regime behaves non-deterministically because queueing effects amplify, concurrency races synchronize, resource cascades chain, and timing variations that didn't matter suddenly determine outcomes.</p><p>Same code. Same architecture. Same failure injection. Completely different outcomes.</p><h3>Try this</h3><p>Take one of the uncertainty gaps you identified in your hypothesis conversation. Design the experiment as we discussed, with specific predictions about timing, behavior, and recovery.</p><p>Then ask yourself: what load level does production actually experience during peak hours? Can you run your experiment at that load level?</p><p>If you can't, you're testing in the stable region. Your results will give you false confidence about triggers that only activate in the vulnerable region where production operates.</p><p>You want to find those fire icons in the diagram before production finds them for you.
That requires chaos engineering and load testing together.</p><p>Until then,</p><p>Adrian</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[What to do after the hypothesis conversation]]></title><description><![CDATA[Last time, I walked you through the hypothesis conversation, how to discover that your team has completely different mental models of how your system behaves, all before running any chaos experiment.]]></description><link>https://newsletter.resiliumlabs.com/p/what-to-do-after-hypothesis-conversation</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/what-to-do-after-hypothesis-conversation</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Sun, 14 Dec 2025 14:26:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9N0S!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa465f403-b7af-4feb-a7ca-ea53007ad3fc_872x872.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div><hr></div><p><a href="https://www.resiliumlabs.com/resilience-bites-issues/hypothesis-conversation-chaos-engineering">Last time</a>, I walked you through the hypothesis conversation, how to discover that your team has completely different mental models of how your system behaves, all before running any chaos experiment.</p><p>Several of you replied with stories. One team discovered that nobody knew whether their cache actually had an eviction policy. Another found out their "automatic" failover required some manual steps. A third learned that their monitoring would completely miss the failure mode they were worried about.</p><p>Good. That's exactly what should happen.</p><p><strong>Now comes the harder question: what do you actually do with all these gaps you just discovered?</strong></p><p>Most teams make one of two mistakes here. 
Either they panic and try to fix everything immediately, or they shrug and run the experiment anyway "just to see what happens."</p><p>Both are wrong. Let me show you a better path.</p><h3>Three types of gaps</h3><p>When you run a hypothesis conversation, you uncover three distinct types of gaps. Each requires a different response.</p><h4><strong>Type 1: Knowledge gaps</strong></h4><p>These are gaps in understanding where the answer exists somewhere. You just need to find it.</p><p>"Does the circuit breaker have a timeout?" "What's our connection pool size?" Someone knows. It's in the code or configuration. You just need to look it up.</p><p><strong>What to do: Investigate before experimenting.</strong></p><p>Don't run a chaos experiment to answer questions you can answer by reading code, checking configuration, or talking to the team that built the thing. That's using a sledgehammer when you need a magnifying glass.</p><p>Spend an hour investigating:</p><ul><li><p>Read the relevant code</p></li><li><p>Check configuration files</p></li><li><p>Look at past incidents where this failed</p></li><li><p>Talk to the team that owns the component</p></li><li><p>Review any existing documentation</p></li></ul><p>Update your mental models based on what you find. Then reconvene and share what you learned.</p><p>Half the time, this investigation reveals more gaps. "Oh, I thought Steve configured that, but he left six months ago and nobody knows if it's still set up that way."</p><p>Perfect. Now you know what you don't know.</p><h4><strong>Type 2: Uncertainty gaps</strong></h4><p>These are gaps where nobody knows the answer because the behavior emerges from interactions between components.</p><p>You can understand every component separately and still have no idea what happens when they all run together under specific conditions.</p><p>You checked the code. The connection pool has reconnection logic. The database has failover logic. The load balancer has health check logic. 
The application has retry logic. Each piece makes sense. The implementations look correct.</p><p>But when database failover happens during peak traffic while the cache is cold, what actually happens? Do the retry storms from 50 application instances create a thundering herd that prevents the database from recovering? Does the load balancer pull instances out of rotation before they stop retrying, or after? Do the health checks start passing before the connection pool is actually ready, sending traffic to instances that will just error?</p><p>Nobody knows for certain, and it's hard to reason about. You can't know from reading the code because the answer depends on timing, load, and how these components interact in practice.</p><p>That's emergence. The behavior comes from the interaction, not from the individual components.</p><p><strong>What to do: Design experiments specifically to explore these interactions.</strong></p><p>This is what chaos engineering is actually good for. Testing things where behavior emerges from complexity, where understanding the parts doesn't tell you how the whole system behaves.</p><p>Before you design the experiment, get specific about what interaction you're uncertain about:</p><p>"We're uncertain how our retry logic interacts with database failover during high traffic. Each component looks correct in isolation, but we've never validated whether they work together without creating cascading problems.
We need to know because retry storms could turn a 30-second database blip into a 90-minute outage."</p><p>Now you can design an experiment:</p><ul><li><p>Inject database unavailability with realistic load</p></li><li><p>Watch how retry behavior scales across all instances</p></li><li><p>Monitor whether health checks and retries synchronize badly</p></li><li><p>Track whether the database can actually accept connections when it comes back</p></li><li><p>Observe the recovery timeline end-to-end</p></li></ul><p>The experiment has a clear learning goal. You're testing how components interact under specific conditions, not whether individual components work.</p><p>You'll probably need to run this experiment under multiple conditions. Behavior that emerges from interaction often depends on load, timing, system state. What happens at 2 AM with low traffic might be completely different from what happens during peak hours.</p><h4><strong>Type 3: Design gaps</strong></h4><p>These are gaps where you discover something is missing or wrong in your system design.</p><p>"Wait, we don't have a circuit breaker at all? I thought we did."</p><p>"Our health check doesn't actually validate database connectivity, it just returns 200 OK?"</p><p>"There's no monitoring for connection pool state?"</p><p>These are real problems you just discovered. These aren't knowledge gaps. These aren't uncertainties.</p><p><strong>What to do: Fix them before experimenting.</strong></p><p>Don't run a chaos experiment to confirm that something you know is missing or broken is actually missing or broken. That's not learning, that's theater.</p><p>If you discover your health check is shallow when it should be deep, fix the health check. If you find out you're missing critical monitoring, add it. If the circuit breaker doesn't exist, decide whether you need one.</p><p>Some teams resist this. "But we want to see how bad it is!"</p><p>No. You don't learn from deliberately breaking things you know are broken. 
You just create risk and waste limited resources. And your system will behave differently anyway once you address the missing or broken components.</p><p>Fix the known problems first. Then experiment to find the unknown problems.</p><h3>The investigation phase</h3><p>Let's say you ran your hypothesis conversation and discovered 15 different gaps. Don't immediately schedule 15 chaos experiments.</p><p>Instead, spend time investigating. Create a gaps document. List everything you discovered. Sort the list into three buckets: knowledge gaps we need to investigate, uncertainties we need to experiment on, and design problems we need to fix.</p><p>Assign investigation work. Split the knowledge gaps among team members. Give people a week to investigate their assigned areas. This is building shared understanding before you create any risk.</p><p>Gather again. Have each person share what they learned. You'll find that investigating knowledge gaps often reveals more uncertainties or design problems. That's good. You're getting more precise about what you actually don't know.</p><p>Update your gaps document based on the investigation.</p><h3>Prioritizing what to test</h3><p>You've investigated the knowledge gaps. You've fixed the obvious design problems. Now you're left with genuine uncertainties about how components interact.</p><p>You probably can't test all of them immediately because resources are limited. So prioritize.</p><h4>Priority 1: High-impact, high-uncertainty</h4><p>These are failure modes that would cause significant customer impact if they happened, and the behavior emerges from complex interactions you can't predict.</p><p>High stakes. Real uncertainty from emergence. Test this first.</p><h4>Priority 2: High-impact, low-uncertainty</h4><p>These are scary scenarios where you think you know how components interact, but the stakes are high enough that validation is worth it.</p><p>You probably won't learn much.
But confirming that critical interactions work as expected has confidence value.</p><h4>Priority 3: Low-impact, high-uncertainty</h4><p>These are things you're uncertain about but wouldn't cause major problems if the interactions failed.</p><p>"We're not sure how cache warming and background jobs interact during deployment, but worst case some requests are slower."</p><p>Interesting to know. Not urgent to test.</p><h4>Priority 4: Low-impact, low-uncertainty</h4><p>Don't test these at all. You're confident about how components interact and the impact is minimal.</p><p>Save your time and energy for experiments that actually teach you something important.</p><h3>Designing your first experiment</h3><p>Let's say you've prioritized and you're ready to design your first experiment. You've picked: "Test how retry logic and database failover interact during realistic load."</p><p>Here's how to design it:</p><h4><strong>Start with your learning goal</strong></h4><p>Be explicit: "We want to know how our application's retry logic interacts with database failover under realistic traffic load. Specifically, whether retries from multiple instances create problems for database recovery, and how long the system actually takes to return to normal."</p><h4><strong>Document the hypothesis clearly (before you start)</strong></h4><p>Start with your system properties:</p><blockquote><p>"Each application instance maintains a connection pool of 20 connections. Connection timeout is set to 5 seconds. Under normal load, instances handle 50 requests/second. When a connection attempt fails, the pool marks that slot as failed and retries. Applications implement retry logic with 100ms base delay and 2x exponential backoff for a maximum of 4 retries (100ms, 200ms, 400ms, 800ms)."</p></blockquote><p>Now you can make predictions:</p><blockquote><p>"When the database becomes unavailable, each instance will experience connection failures at 5-second intervals. 
At 50 requests/second with 20 available connections, the pool will exhaust in approximately 0.4 seconds (20 connections / 50 requests/second). Applications will begin returning errors immediately. The retry sequence will complete in 1.5 seconds per request (100 + 200 + 400 + 800). After 4 failed retries, requests will return 500 errors to clients."</p><p>"When the database becomes available again, connection pools will attempt reconnection on the next incoming request. With staggered health checks running every 10 seconds across 12 instances, we expect the fleet to detect database availability within 10 seconds. Each instance will establish 20 new connections, taking approximately 100ms per connection (2 seconds total per instance). We expect 90% of traffic to succeed within 15 seconds of database recovery."</p></blockquote><p>This gives you specific measurements. You know what to instrument (pool exhaustion rate, retry timing, connection establishment time, error rates). You know what success looks like (90% traffic succeeding within 15 seconds). You can validate each number independently.</p><p>Specific. Measurable. Testable.</p><h4><strong>Plan what you'll observe</strong></h4><p>List the things you want to watch. Be specific. If you can't observe these things, stop. Add the missing observability first. Don't run experiments you can't interpret.</p><h4><strong>Start small and safe</strong></h4><p>Do your first run in the development environment, with a limited blast radius, low traffic, and manual injection.</p><p>Don't overcomplicate things. Don&#8217;t start with production. Don't start with peak traffic. Don't start with automated injection. Build confidence progressively.</p><h4><strong>Run it and observe</strong></h4><p>Actually run the experiment. Watch what happens. Take notes. Don't try to fix things during the experiment. Just observe and learn.</p><p>Compare hypothesis to reality. What matched your expectations?
What surprised you?</p><p>Write a detailed summary of the experiment. Ideally using the same process as your real incidents.</p><blockquote><p>"The connection pool exhausted in 0.6 seconds, roughly matching the 0.4-second prediction. Applications returned 500 errors immediately. The retry sequence timing matched our implementation exactly: 1.5 seconds per request before final failure.</p><p>When the database came back online, the first instance detected it in 4 seconds through health checks. It started establishing connections. At 18 connections established, the instance crashed. Out of memory error.</p><p>We had overlooked connection cleanup. When the database went down, failed connections stayed in the pool marked as dead but not garbage collected. When the instance tried to establish 20 new connections, it actually had 20 dead connections plus 20 new ones. Memory usage spiked. The JVM killed the process.</p><p>The crash triggered our orchestration system to start a replacement instance. That instance came up, detected the healthy database, tried to establish connections, and crashed for the same reason. This happened to 8 instances before we caught it.</p><p>The remaining 4 instances stayed alive because they had been restarted recently for unrelated reasons. Their connection pools were clean. They successfully connected and started handling all traffic. Four instances serving the entire load meant each was now processing 150 requests/second instead of 50. Response times jumped to 800ms. Error rates climbed to 12% due to request timeouts.</p><p>We had to disable automatic instance replacement and manually restart each instance with connection pool cleanup logic added. Recovery took 23 minutes. The hypothesis predicted connection establishment time correctly. We never considered connection lifecycle management or the interaction between pool state and instance stability."</p></blockquote><p>That's learning. 
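</p><p>The timing arithmetic in the hypothesis is worth checking mechanically, too. A few lines reproduce the predicted numbers (all values taken from the example above):</p>

```python
# Values from the hypothesis: 20-connection pool, 50 requests/second,
# 100 ms base delay, 2x exponential backoff, 4 retries.
POOL_SIZE = 20
REQUESTS_PER_SECOND = 50
BASE_DELAY_S = 0.1
MAX_RETRIES = 4

# Pool exhaustion: how long until 50 rps consumes 20 connections.
exhaustion_s = POOL_SIZE / REQUESTS_PER_SECOND
print(exhaustion_s)  # 0.4 seconds

# Retry schedule: 100 ms, 200 ms, 400 ms, 800 ms; 1.5 s total
# before the final 500 error.
delays = [BASE_DELAY_S * 2 ** attempt for attempt in range(MAX_RETRIES)]
print(delays, round(sum(delays), 3))
```

<p>Writing the prediction as executable arithmetic makes any mismatch obvious later: the observed 0.6-second exhaustion sits right next to the predicted 0.4 seconds.</p><p>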
The individual components worked correctly, but the interaction between them, at scale, under realistic conditions, created new behavior that was hard to predict.</p><p>Now you know, and you have choices.</p><p><strong>Option 1: Add explicit pool cleanup to your health check logic.</strong></p><p>When the health check detects database unavailability, call pool.close() to force cleanup of all connections, then reinitialize the pool. This guarantees a clean state before attempting reconnection. The downside is a brief period where the instance can't serve requests during pool recreation, maybe 200-300ms. You need to ensure your load balancer can handle instances briefly going unhealthy.</p><p><strong>Option 2: Configure connection max lifetime in the pool settings.</strong></p><p>Set maxLifetime to something like 30 minutes. The pool will automatically evict and replace connections that exceed this age, regardless of their state. This prevents accumulation of dead connections over time. The tradeoff is ongoing connection churn during normal operation, which adds latency (probably 5-10ms per replaced connection). You also need to tune the lifetime value. Too short and you create unnecessary overhead. Too long and you don't solve the problem.</p><p><strong>Option 3: Implement connection validation on checkout.</strong></p><p>Configure the pool to test connections before handing them to the application (testOnBorrow or equivalent). Dead connections get removed when the application requests them. This distributes cleanup across request processing rather than concentrating it at recovery time. The cost is added latency on every request, typically 1-5ms depending on your validation query. During the outage, you still accumulate dead connections, but they get cleaned up gradually as traffic arrives rather than all at once during reconnection.</p><p>You'll probably want to test again after you pick and implement a fix.
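</p><p>To make option 3 concrete, here is a toy pool that validates connections at checkout. It is a sketch of the idea only; names like <code>is_alive</code> are invented, and a real pool (HikariCP, SQLAlchemy, and so on) exposes this as a configuration flag rather than code you write yourself:</p>

```python
import queue

class ValidatingPool:
    """Toy connection pool: dead connections are evicted lazily, at checkout,
    instead of accumulating until a mass reconnection event."""

    def __init__(self, size, connect):
        self._connect = connect            # factory for new connections
        self._idle = queue.SimpleQueue()
        for _ in range(size):
            self._idle.put(connect())

    def checkout(self):
        conn = self._idle.get()
        if not conn.is_alive():            # the "validation query" (invented name)
            conn.close()                   # evict the dead connection...
            conn = self._connect()         # ...and replace it, one at a time
        return conn

    def checkin(self, conn):
        self._idle.put(conn)
```

<p>The cleanup cost is spread across requests, one validation per checkout, instead of being concentrated at recovery time, which is what crashed the instances in the story above.</p><p>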
Emergent behavior depends on conditions, so you validate solutions under the conditions that matter.</p><h3>What's next</h3><p>Last time I showed you how hypothesis conversations reveal what your team actually believes about your system. This time you learned what to do with those gaps: investigate the knowable, fix the broken, experiment on the emergent.</p><p>Most teams skip straight to breaking things. You now know better. The learning happens in the conversation, the investigation, and the careful observation of how components interact under real conditions. The chaos experiment is just one tool in that process.</p><p>Try this with your team. Start with the hypothesis conversation from last newsletter. Work through the investigation phase. Pick one uncertainty about emergent behavior and design an experiment around it.</p><p>Then tell me what you discovered.</p><p>Until then,</p><p>Adrian</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[Your best chaos engineering happens before you break anything]]></title><description><![CDATA[Dr.]]></description><link>https://newsletter.resiliumlabs.com/p/hypothesis-conversation-chaos-engineering</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/hypothesis-conversation-chaos-engineering</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Sun, 30 Nov 2025 13:03:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9N0S!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa465f403-b7af-4feb-a7ca-ea53007ad3fc_872x872.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div><hr></div><p><a href="https://ise.osu.edu/people/woods.2">Dr. 
David Woods</a>, one of the pioneers of resilience engineering, <a href="https://www.profound-deming.com/profound-podcast/s4-e25-dr-david-woods-resilience-and-complexity-part-two">said something</a> a few of us have been repeating and it captures the essence of why chaos engineering is valuable:</p><blockquote><p>"Just planning to inject a fault usually reveals that the system works differently than you thought."</p></blockquote><p>Most teams skip straight to running chaos experiments. They pick a tool, design a failure scenario, inject the fault, observe what happens. They think the learning happens during the experiment.</p><p>They're wrong. The deepest learning often happens before you break anything at all.</p><p>Here's what I mean.</p><h3>The meeting you've probably been in</h3><p>You're discussing what happens when X fails. Someone says "obviously the circuit breaker kicks in." Everyone nods. The conversation moves on. Meeting ends.</p><p>Three months later, X actually fails in production. The circuit breaker doesn't work the way anyone thought. Or it doesn't exist at all. Or it exists but was disabled during a migration last quarter and nobody remembered to re-enable it.</p><p>The problem is that everyone had different mental models of how the system works. Nobody discovered this until real failure exposed it.</p><p>This happens constantly. Teams operate under the illusion of shared understanding. Everyone uses the same words. Everyone nods at the same moments.</p><p>Then something breaks, and you discover nobody actually agreed about how anything worked.</p><h3>The hypothesis conversation</h3><p>Here's what to do instead. Before running any chaos experiment, gather your team and have what I call the hypothesis conversation.</p><p><strong>Step 1: Pick a specific failure scenario</strong></p><p>Don't start with "what if the database goes down?" 
It&#8217;s too vague.</p><p>Instead start with something like: "What happens when our primary PostgreSQL instance becomes unavailable during peak traffic?"</p><p>The specificity matters. Vague scenarios produce vague answers. Specific scenarios force people to articulate their actual mental models.</p><p><strong>Step 2: Silent writing first</strong></p><p>This is the critical step most teams skip.</p><p>Before any discussion, have everyone write down:</p><ul><li><p>What they expect to happen</p></li><li><p>How long they expect recovery to take</p></li><li><p>What customer impact they expect</p></li><li><p>What monitoring and alerts they expect to see</p></li></ul><p>Give people 5 minutes of silence to write.</p><p>Why silent writing? Because it prevents groupthink. In open discussion, the loudest person always speaks first. Suddenly everyone's nodding along. The quieter team members, who might have spotted something nobody else considered, stay silent. You've just lost the diversity of perspective that makes this valuable.</p><p>Silent writing forces everyone to commit to their understanding before social dynamics kick in.</p><p><strong>Step 3: Share and compare</strong></p><p>Go around the room. Have each person read what they wrote.</p><p>No judgment. No critique. Just listen.</p><p>This is where it gets interesting. You'll discover:</p><p>Someone assumes automatic failover that doesn't exist. Another person thinks there's queuing that isn't implemented. The new engineer admits they have no idea what happens. That&#8217;s valuable honesty the rest of the team needs to hear. The senior engineer describes behavior from the old architecture, before the migration two quarters ago. Nobody agrees on expected recovery time.</p><p>Your team has been working together for months or years. You thought you understood how the system works. 
You're discovering right now that you don't share the same understanding at all.</p><p><strong>Step 4: Investigate the gaps</strong></p><p>Don't just note the disagreements and move on. Dig into them:</p><p>"Why did you think queuing was implemented?"</p><p>"When was the last time we actually tested failover?"</p><p>"Where is that behavior documented?"</p><p>"Who would know the answer to this?"</p><p>Often you'll discover that nobody knows for certain. The system evolved. Documentation didn't keep up. People made assumptions based on how they thought things worked. Those assumptions never got validated.</p><p><strong>Step 5: Decide what to do</strong></p><p>Now you have options:</p><p>Run the chaos experiment to find out what actually happens. Check the code or configuration to verify behavior. Talk to the team that owns that component. Update documentation based on what you learned. Fix the gap you just discovered.</p><p>You might decide the experiment is still valuable. But you've already learned something critical: your team doesn't share the same mental model of how your system behaves. That's worth knowing regardless of what you do next.</p><h3>Real example</h3><p>I watched this play out at a company planning their first database failover experiment a few years ago.</p><p>The team gathered to discuss expectations. Everyone seemed aligned. "Database fails over to replica, maybe 30 seconds of elevated errors, everything recovers." Heads nodded. The consensus felt solid.</p><p>Then they did silent writing.</p><p>When everyone shared, the room got quiet.</p><p>It turns out the DBA expected 10-15 seconds of failover time based on the configuration settings. Application engineers expected 30-60 seconds based on what they'd seen in past incidents. The SRE thought the connection pool would need a manual restart because that's how it worked in the old system. A junior engineer thought reads would continue but writes would fail. 
That was actually a reasonable assumption given the architecture. The architect assumed the application would automatically reconnect after failover. Nobody was sure if the health check would detect the failover or keep routing traffic to the failed instance.</p><p>Six people. Six different mental models of the same system behavior.</p><p>They spent the next 30 minutes investigating:</p><p>They checked the database configuration: the failover timeout was actually 30 seconds, but nobody had tested it recently. They looked at the application code: connection pool settings were wrong, so connections would never refresh after a failover. They reviewed the health check logic: it would completely miss the failover and keep sending traffic to the dead instance. They found monitoring gaps: no visibility into connection pool state, so they wouldn't see the problem during the experiment.</p><p>They fixed three critical issues before running any experiment:</p><ul><li><p>Updated connection pool configuration to handle failover</p></li><li><p>Fixed the health check logic</p></li><li><p>Added monitoring for connection pool state</p></li></ul><p>When they finally ran the experiment two weeks later, they knew what to expect and what to watch for. The failover worked. More importantly, they'd built shared understanding of how their system actually behaves.</p><h3>Return on investment</h3><p>The hypothesis conversation took 45 minutes. 
In that time, they discovered:</p><ul><li><p>Critical gaps in monitoring</p></li><li><p>Incorrect connection pool configuration that would have caused extended downtime</p></li><li><p>Missing health check logic that would have routed traffic to a dead instance</p></li><li><p>Misaligned team understanding of basic system behavior</p></li></ul><p>All before creating any risk, touching any systems, or needing any special tools.</p><p>Compare this to what usually happens: run the experiment, something unexpected occurs, scramble to understand what went wrong, argue about whether the system is working correctly or the experiment is flawed, end up more confused than when you started.</p><p>The hypothesis conversation is chaos engineering!</p><p>You're systematically exploring how your system behaves under failure. Sometimes that exploration happens through conversation. Sometimes through code review. Sometimes through documentation analysis. Sometimes through actual experiments.</p><p>But the learning doesn't wait for the experiment. It starts the moment you begin asking the right questions.</p><h3>Start tomorrow</h3><p>Here's your homework for this week:</p><p><strong>Pick your scenario</strong></p><p>Choose something specific that your team worries about. "What happens when [specific service] becomes unavailable during [specific condition]?"</p><p>Don't pick the scariest scenario. Pick something bounded and concrete. You're learning the practice, not stress-testing your team.</p><p><strong>Get the right people in the room</strong></p><p>You need diversity of perspective. Developers who wrote the code. Operators who run it in production. New team members who haven't absorbed all the assumptions yet. The architect who designed it. The SRE who monitors it. Include product managers too.</p><p>Six to eight people is ideal. 
Enough for diverse perspectives, small enough for real conversation.</p><p><strong>Set the context</strong></p><p>Before you start, frame it clearly:</p><p>"We're not testing anyone's knowledge. We're discovering where our mental models differ. There are no wrong answers. The goal is to find gaps in our shared understanding before they surprise us in production."</p><p>This framing matters. People need to feel safe admitting uncertainty.</p><p><strong>Do the silent writing</strong></p><p>Give people 5 minutes. Remind them to be specific. What exactly do you expect to happen? Not "it'll probably be fine" but "the circuit breaker will trip after three failed requests, fallback logic will return cached data, customers will see stale information for 30-60 seconds."</p><p>The specificity reveals the mental model.</p><p><strong>Share without judgment</strong></p><p>Go around the room. Have each person read what they wrote.</p><p>Don't critique. Don't correct. Don't debate yet. Just listen and note where understanding differs.</p><p>Resist the urge to immediately resolve disagreements. First, just hear all the perspectives.</p><p><strong>Investigate together</strong></p><p>Pick the biggest gaps. The places where mental models differ most.</p><p>Dig into them as a group. Check the code. Look at configuration. Review documentation. Talk to other teams.</p><p>Find out what actually happens. Update your understanding together.</p><p>You'll learn more in 30 minutes than most chaos experiments reveal.</p><h3>What's next</h3><p>Try this with your team this week. Then <strong>hit reply</strong> and tell me:</p><p>What gaps did you discover? Were you surprised by anything? Did you fix something before running any experiment? What happened when you tried to investigate the gaps?</p><p>I read every reply. Your experiences help me understand what actually works in practice, not just in theory.</p><p>Next newsletter: "What to do after the hypothesis conversation". 
How to design experiments that actually test what you're uncertain about, not just what you already know.</p><p>Until then,</p><p>Adrian</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[When AI Writes Your Code, Chaos Engineering Writes Your Insurance Policy]]></title><description><![CDATA[AI-generated code has moved from curiosity to almost standard practice with breathtaking speed.]]></description><link>https://newsletter.resiliumlabs.com/p/chaos-engineering-ai-generated-code</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/chaos-engineering-ai-generated-code</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Thu, 02 Oct 2025 08:55:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5d1d556e-da70-400d-b4ee-bb2d587376d0_1000x564.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8ZWj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad0276b-9865-45d9-b613-cd1bb05a234f_1000x564.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8ZWj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad0276b-9865-45d9-b613-cd1bb05a234f_1000x564.jpeg 424w, https://substackcdn.com/image/fetch/$s_!8ZWj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad0276b-9865-45d9-b613-cd1bb05a234f_1000x564.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!8ZWj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad0276b-9865-45d9-b613-cd1bb05a234f_1000x564.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!8ZWj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad0276b-9865-45d9-b613-cd1bb05a234f_1000x564.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8ZWj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad0276b-9865-45d9-b613-cd1bb05a234f_1000x564.jpeg" width="2500" height="1410" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ad0276b-9865-45d9-b613-cd1bb05a234f_1000x564.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1410,&quot;width&quot;:2500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8ZWj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad0276b-9865-45d9-b613-cd1bb05a234f_1000x564.jpeg 424w, https://substackcdn.com/image/fetch/$s_!8ZWj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad0276b-9865-45d9-b613-cd1bb05a234f_1000x564.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!8ZWj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad0276b-9865-45d9-b613-cd1bb05a234f_1000x564.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!8ZWj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad0276b-9865-45d9-b613-cd1bb05a234f_1000x564.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>AI-generated code has moved from curiosity to <em>almost</em> standard practice with breathtaking speed. 
Teams use AI to build entire features and even applications. Code generation tools translate requirements directly into pull requests. What took a developer days now takes minutes and a carefully crafted prompt. When done right, the AI productivity gains are real. Teams shipping AI-assisted code go faster.</p><p>The question facing engineering organizations today isn't whether to adopt these tools&#8212;developers already have&#8212;but how to understand and operate systems built with them safely.</p><h3>The Acceleration of an Old Problem</h3><p>Let&#8217;s be honest, code opacity didn't start with AI. Every engineering organization battles the same demons: developers leave, context evaporates, documentation rots, deadline pressure produces shortcuts. Six months after launch, someone gets paged at 3 AM to fix a system they didn't write and barely understand. This story is older than version control.</p><p>AI-generated code didn&#8217;t create this problem but it accelerates it to speeds we've never seen before.</p><p>A developer writing code makes choices&#8212;explicit or implicit&#8212;about tradeoffs, edge cases, and failure modes. They might not always document these choices well, but they existed as conscious thoughts, at least momentarily. You can pull them into a meeting room and ask: "Why did you implement it this way?" They might not remember perfectly, but there's a conversation to have, a human memory to question.</p><p>AI-generated code compresses this process into statistical inference. The model chose an implementation based on patterns in its training data, optimizing for likelihood rather than your specific domain constraints. Three months later, when that code fails under unexpected load, there's no developer to ask. The model that generated it no longer exists in the same form&#8212;weights have shifted, training has evolved, the context window that held your requirements has long since evaporated. 
You're maintaining systems written by "intelligences" that can't yet fully be recalled for questioning.</p><p>The velocity compounds everything. When one developer writes problematic code, that's one problem to debug. When AI helps fifty developers write code twice as fast, you've potentially scaled both the productivity <em>and</em> the technical debt by orders of magnitude. The same old problems happen at unprecedented speed and volume.</p><h3>Humans Are Still in the Loop</h3><p>Let's be clear about what actually happens: AI doesn't drop code directly into production. A developer prompts the AI, reviews what it generates, modifies the output, integrates it with existing systems, writes tests (often with AI help), and shepherds it through code review. Humans remain deeply involved.</p><p>The challenge emerges from volume and velocity. When you're reviewing thirty AI-generated pull requests weekly instead of ten human-written ones, can you maintain the same focus and scrutiny? When the code <em>looks</em> correct&#8212;follows conventions, passes linters, handles obvious failure cases&#8212;how often do you catch the subtle problems that only emerge under production conditions the AI never considered?</p><p>Human review remains essential, but it faces scaling limits. We need systematic methods for stress-testing AI-generated code that complement human judgment. This is where chaos engineering can help.</p><h3>Why Chaos Engineering Fits This Moment</h3><p>Traditional code review asks: "Does this code do what it claims?" Chaos engineering asks: "What does this code do when everything goes wrong?" That second question becomes critical when you're shipping code whose internal logic you inherited rather than invented.</p><p>Run a chaos experiment that injects latency into downstream services. Watch which components start failing in ways nobody predicted. Maybe the AI-generated service client retries aggressively without backoff, turning a minor slowdown into a cascading failure. 
Code review might have caught this if the reviewer specifically looked for retry logic and thought to question its implementation. Chaos engineering catches it by creating the conditions where the problem reveals itself.</p><p>The difference becomes clearer over time. You discover patterns: this model's code tends to assume infinite memory; that model's error handling releases resources inconsistently. These insights feed directly into how you prompt AI going forward, what you hunt for in code review, and where you concentrate your testing.</p><h3>How Chaos Engineering Must Evolve</h3><p><em>Traditional</em> chaos engineering assumed code written by humans. Humans you could ask questions to. That assumption breaks down when substantial portions of your codebase emerge from statistical models.</p><h4>Automated Knowledge Capture</h4><p>Every chaos experiment that reveals unexpected behavior should generate structured documentation automatically. When an experiment discovers that an AI-generated service degrades catastrophically under certain conditions, the experiment should produce:</p><p>- Structured description of the failure mode(s)</p><p>- Metrics thresholds indicating the problem emerging</p><p>- Potential mitigation steps based on observed and learned behavior</p><p>- Links to specific code exhibiting the issue</p><p>Engineers review and refine these <em>auto-generated</em> artifacts rather than writing them from scratch. This automation matters because AI code generation produces more components faster than humans can document manually. The documentation gap will become a problem without systematic capture.</p><h4>Feedback Loops into Code Generation</h4><p>Here's where things get interesting. Your chaos experiments discover that AI-generated clients consistently lack good retry logic. This gap appears across multiple services. 
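</p><p>The retry pattern chaos experiments keep pushing you toward is exponential backoff with jitter. A minimal sketch (the function name and parameters here are illustrative, not from a specific library):</p>

```python
import random

def backoff_delays(base=0.1, cap=10.0, attempts=5, rng=random.random):
    """Exponential backoff with 'full jitter': each delay is drawn uniformly
    from [0, min(cap, base * 2**attempt)] seconds, so retrying clients spread
    out instead of hammering a struggling service in lockstep."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
        delays.append(rng() * ceiling)             # jitter below the ceiling
    return delays
```

<p>With the jitter pinned to its maximum, base=1.0 and cap=8.0 give ceilings of 1, 2, 4, 8, and 8 seconds across five attempts; real calls land uniformly below each ceiling.</p><p>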
You document it, but more importantly, you can now update your prompts to account for this shortfall: "Implement retry logic with exponential backoff and jitter pattern with these specific thresholds [&#8230;]"</p><p>AI-generated code improves because chaos engineering taught you what to demand. You can feed experiment findings directly into prompt libraries&#8212;an identified gap becomes a constraint in every subsequent request.</p><p>With such feedback loops, you can continuously update your prompts based on chaos experiment findings, shortening the critical learning cycle.</p><h4>Pattern Detection at Scale</h4><p>AI code generation creates an unusual opportunity: many components share similar origins. They emerged from similar models, responding to similar prompts, applying similar patterns from training data. <a href="https://arxiv.org/html/2506.11021v1">Studies</a> about code generation errors often identify recurring patterns or clusters of failures specific to certain models.</p><p>Chaos engineering tools can exploit this and systematically search for patterns common to specific AI models.</p><p>When you find these patterns once, hunt for them everywhere across all AI-generated components simultaneously, discovering not just that Service A has a problem, but that twelve services share variants of the same flaw because they were all generated using similar models.</p><h4>Continuous Experimentation</h4><p>AI enables shipping dozens of features weekly, each containing substantial generated code. Chaos experiments need to keep pace. However, embedding chaos experiments directly into CI/CD pipelines creates significant problems&#8212;the non-deterministic nature of chaos experiments conflicts with the deterministic requirements of deployment pipelines, and experiments require production-like load and extended runtime. 
The solution is a separate, <a href="https://medium.com/@adhorn/decoupling-chaos-and-delivery-pipelines-624f1c9cadba">dedicated chaos pipeline</a> running parallel to your CI/CD pipeline, allowing experiments to operate on their own schedule without blocking deployments while still feeding findings back into development practices.</p><p><em>Note: For teams seeking deterministic validation in their CI/CD pipelines, deterministic simulators offer a middle ground. Tools like <a href="https://antithesis.com">Antithesis</a> model distributed system behavior deterministically, allowing exploration of failure scenarios with reproducible results. While they require significant investment to build and maintain, and can't capture all real-world complexities, they provide faster feedback than full chaos experiments while being more comprehensive than mocked tests. They work well in pipelines but should complement, not replace, chaos experiments against production-like environments.</em></p><h3>The AI Testing AI Question</h3><p>Could we use AI to design chaos experiments for AI-generated code? The idea is indeed seductive: provide an AI with code it generated previously plus observability data, then ask it to design experiments that would catch emerging issues.</p><p>This approach shows promise in my own experiments. With the right prompt, AI-designed chaos experiments sometimes target edge cases humans overlook, precisely because the AI recognizes patterns in generated code that humans don't consciously notice.</p><p>But we should still remain skeptical and in the loop. Both AIs operate from similar statistical foundations. They share similar blind spots&#8212;the AI experiment designer often misses the same edge cases the code generator misses, for exactly the same reasons. 
An AI trained primarily on historical data still struggles to imagine creative, realistic failure scenarios that don't appear commonly in training data.</p><p>Like with other AI-generated artifacts, the key seems to be to use AI to generate candidate experiments, but require human engineers to review, refine, and prioritize them. AI handles the breadth&#8212;proposing dozens of scenarios quickly. Humans handle the depth&#8212;identifying which scenarios matter most for your specific systems and constraints.</p><p>We're learning by doing, and the lessons keep changing.</p><h2>Final Thoughts</h2><p>You simply can't afford to opt out of the race while competitors embrace AI-assisted development. The productivity gains are just too substantial. Teams using these tools effectively ship features faster. Like with most innovations, organizations that hesitate eventually lose ground against competitors who've figured out <strong>how to capture the upside while managing the downside.</strong></p><p>The choice facing engineering leaders isn't "AI-generated code: yes or no?" Market forces already made that decision. The real choice is: "How do we operate systems built with AI assistance safely and sustainably?"</p><p>Chaos engineering offers a practical answer. By systematically exploring how systems fail, you build the operational understanding that rapid AI-assisted development threatens to erode. You discover hidden dependencies before they cause outages. You document failure modes before customers encounter them. You create feedback loops that improve both your code generation practices and your operational capabilities.</p><p>The machines write the code. 
Chaos engineering helps us understand what we've built and guides what we build next.</p>]]></content:encoded></item><item><title><![CDATA[Controls vs Guardrails: Why Organizations Struggle with Resilience Despite Having All the Right Pieces]]></title><description><![CDATA[After a decade of helping organizations improve their resilience practices, I keep seeing the same puzzle.]]></description><link>https://newsletter.resiliumlabs.com/p/controls-vs-guardrails-why-organizations-struggle-with-resilience-despite-having-all-the-right-pieces</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/controls-vs-guardrails-why-organizations-struggle-with-resilience-despite-having-all-the-right-pieces</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Thu, 14 Aug 2025 15:12:48 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f8877586-276d-4826-a849-d05e926bfc54_1000x667.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!097S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77b4e06a-35b7-44ae-99a0-67aa411f0c5d_1000x667.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!097S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77b4e06a-35b7-44ae-99a0-67aa411f0c5d_1000x667.jpeg 424w, https://substackcdn.com/image/fetch/$s_!097S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77b4e06a-35b7-44ae-99a0-67aa411f0c5d_1000x667.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!097S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77b4e06a-35b7-44ae-99a0-67aa411f0c5d_1000x667.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!097S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77b4e06a-35b7-44ae-99a0-67aa411f0c5d_1000x667.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!097S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77b4e06a-35b7-44ae-99a0-67aa411f0c5d_1000x667.jpeg" width="2500" height="1667" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77b4e06a-35b7-44ae-99a0-67aa411f0c5d_1000x667.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1667,&quot;width&quot;:2500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!097S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77b4e06a-35b7-44ae-99a0-67aa411f0c5d_1000x667.jpeg 424w, https://substackcdn.com/image/fetch/$s_!097S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77b4e06a-35b7-44ae-99a0-67aa411f0c5d_1000x667.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!097S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77b4e06a-35b7-44ae-99a0-67aa411f0c5d_1000x667.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!097S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77b4e06a-35b7-44ae-99a0-67aa411f0c5d_1000x667.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>After a decade of helping organizations improve their resilience practices, I keep seeing the same puzzle. 
Companies invest heavily in operational readiness reviews, well-architected reviews, incident reviews, chaos engineering, alerting, monitoring, etc. They have leadership buy-in, dedicated teams, and sophisticated tooling. Yet despite having all the right pieces, many still struggle to build genuine resilience.</p><p>The more I observe this pattern across industries, the more convinced I become that we're dealing with something fundamental about human psychology, and how our natural responses to uncertainty systematically undermine the very resilience we're trying to build.</p><h2><strong>The Evolutionary Trap</strong></h2><p>The human brain evolved to treat uncertainty as potential danger. When we don't know what's coming next, our amygdala activates, triggering stress and the urgent need to regain control. This served our ancestors well when uncertainty meant immediate physical threats, but it creates problems in complex organizational systems.</p><p>We're pattern-seeking creatures who prefer clear cause-and-effect relationships. When faced with ambiguity, we instinctively impose structure with rules, procedures, and approval gates that provide the illusion of predictability, even when they can't actually remove the underlying uncertainty.</p><p>Think about what happens after an outage. Teams naturally default to adding more approval gates, extending checklists, requiring additional sign-offs. Each control feels rational in isolation and provides psychological relief by making us feel we're "doing something" about the problem.</p><p>But together, these controls create systems so constrained that engineers can't respond effectively when something truly unexpected happens. The very mechanisms designed to prevent problems end up preventing the adaptive response that could have avoided a bigger failure.</p><p>Research reveals exactly why this backfires. 
Organizations that handle crises well are those that can flexibly navigate different responses during real emergencies, rather than simply following rigid procedures. Yet our natural response to past failures is to create more rigid procedures.</p><p>When outcomes are uncertain, our decision-making shifts from calculation to heuristics and mental shortcuts. We fall back on availability bias (overweighting recent incidents) and confirmation bias (seeking information that supports our existing beliefs about what went wrong). This leads to controls that address the specific failure we just experienced while missing the broader patterns that create system brittleness.</p><p>The desire for predictability is so strong that we often choose the feeling of control over the reality of safety. This explains why organizations continue tracking metrics like <a href="https://www.resiliumlabs.com/blog/mttr-problems-better-incident-metrics">MTTR even when teams understand it's mathematically meaningless</a>, or why they maintain cumbersome approval processes that become rubber stamps but create friction during emergencies.</p><h2><strong>Controls vs Guardrails</strong></h2><p>The key to understanding and solving this problem is recognizing that most organizations blur a crucial line between two fundamentally different approaches to safety: <strong>controls</strong> and <strong>guardrails</strong>.</p><p><strong>Controls</strong> dictate how work gets done. They're prescriptive, active during normal operations, and create friction for everyone, regardless of whether there's any actual danger. Like tollbooths on a highway, they slow down every single person, every single time, even when there's no safety issue.</p><p>I've seen many organizations create elaborate chaos engineering processes with good intentions. They want to prevent teams from causing unintended damage. 
But these weeks-long coordination requirements create cognitive overload that makes teams avoid learning activities entirely. That's a control masquerading as a safety practice.</p><p>The most telling sign that controls have gone too far is when engineers stop raising concerns because "the process doesn't allow for that" or "nobody would listen anyway." That's adaptive capacity disappearing in real time.</p><p><strong>Guardrails</strong>, on the other hand, define safe operating boundaries while preserving flexibility within those bounds. Like highway guardrails, they activate only when you're approaching real danger, not during normal operations. They make the safe path also the easy path.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QIQX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23c37f7-6f6a-47f6-abe6-f0fcafeab984_1000x552.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QIQX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23c37f7-6f6a-47f6-abe6-f0fcafeab984_1000x552.png 424w, https://substackcdn.com/image/fetch/$s_!QIQX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23c37f7-6f6a-47f6-abe6-f0fcafeab984_1000x552.png 848w, https://substackcdn.com/image/fetch/$s_!QIQX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23c37f7-6f6a-47f6-abe6-f0fcafeab984_1000x552.png 1272w, 
https://substackcdn.com/image/fetch/$s_!QIQX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23c37f7-6f6a-47f6-abe6-f0fcafeab984_1000x552.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QIQX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23c37f7-6f6a-47f6-abe6-f0fcafeab984_1000x552.png" width="2368" height="1308" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d23c37f7-6f6a-47f6-abe6-f0fcafeab984_1000x552.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1308,&quot;width&quot;:2368,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!QIQX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23c37f7-6f6a-47f6-abe6-f0fcafeab984_1000x552.png 424w, https://substackcdn.com/image/fetch/$s_!QIQX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23c37f7-6f6a-47f6-abe6-f0fcafeab984_1000x552.png 848w, https://substackcdn.com/image/fetch/$s_!QIQX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23c37f7-6f6a-47f6-abe6-f0fcafeab984_1000x552.png 1272w, 
https://substackcdn.com/image/fetch/$s_!QIQX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23c37f7-6f6a-47f6-abe6-f0fcafeab984_1000x552.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Think of it like ziplining in a forest. The controls approach says "Ziplining is dangerous, so we'll require permits, 6 weeks of training, and supervised access only on Tuesdays." Result?
Nobody ziplines, or people sneak in and zipline without any safety equipment because the official process is too cumbersome.</p><p>The guardrails approach says "Ziplining is dangerous, so here's a harness, safety line, and helmet." The safety equipment enables the risky activity rather than preventing it. People zipline frequently and safely because the gear only constrains them when there's real danger such as weight limits exceeded, equipment failure, or bad weather.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NTT-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b25c201-639a-4790-a8d9-d902a93e54ea_1000x667.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NTT-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b25c201-639a-4790-a8d9-d902a93e54ea_1000x667.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NTT-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b25c201-639a-4790-a8d9-d902a93e54ea_1000x667.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NTT-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b25c201-639a-4790-a8d9-d902a93e54ea_1000x667.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NTT-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b25c201-639a-4790-a8d9-d902a93e54ea_1000x667.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!NTT-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b25c201-639a-4790-a8d9-d902a93e54ea_1000x667.jpeg" width="2500" height="1667" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b25c201-639a-4790-a8d9-d902a93e54ea_1000x667.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1667,&quot;width&quot;:2500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!NTT-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b25c201-639a-4790-a8d9-d902a93e54ea_1000x667.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NTT-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b25c201-639a-4790-a8d9-d902a93e54ea_1000x667.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NTT-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b25c201-639a-4790-a8d9-d902a93e54ea_1000x667.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NTT-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b25c201-639a-4790-a8d9-d902a93e54ea_1000x667.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A guardrail approach to chaos engineering might provide lightweight frameworks with ready-made integrations, but allow teams to adapt scope, timing, and focus based on what they're trying to learn about their specific systems. The safety comes from built-in blast radius limits, automatic rollback procedures, and environment isolation, not from bureaucratic overhead.</p><p>This distinction shows up everywhere once you know to look for it. I've seen incident reports describing production access processes as "cumbersome," creating friction during the exact moments when adaptive capacity matters most. The irony is that these access controls often become rubber stamps. 
When so many people need production access for legitimate work, approval processes default to "yes" without real scrutiny.</p><p>Meanwhile, guardrails like automated safety checks, environment-specific tooling, and default read-only permissions would actually prevent dangerous actions without slowing down normal troubleshooting.</p><p>The pattern is often exacerbated during incident reviews. Organizations naturally gravitate toward quick fixes after outages. The urgency to "do something" drives teams to immediately jump to finding root causes and generating action items. But as <a href="https://www.linkedin.com/in/jallspaw/">John Allspaw</a> observes in his excellent talk <a href="https://www.youtube.com/watch?v=Zw_ASI-rk1s">Incident Analysis: How *Learning* is Different Than *Fixing*</a>, when "your goal is to fix, you're gonna fix something whether or not it was the right thing to fix," and "once teams find a plausible fix, time and production pressure cause them to stop exploring other options."</p><p><a href="https://www.youtube.com/watch?v=Zw_ASI-rk1s">Incident Analysis: How *Learning* is Different Than *Fixing* - John Allspaw</a></p><p>Here's the trap I see many organizations fall into: when incident reviews become focused on quick fixing rather than deep understanding, they systematically generate more controls. Each incident adds new approval processes, additional checkpoints, and longer procedures. Every. Single. Time. The very mechanism meant to build resilience ends up eroding it through control accumulation.</p><p>A guardrails approach to incident learning resists the temptation for quick fixes. 
Instead, it focuses on understanding "what was difficult for people to understand during the incident, what was surprising for people about the incident." Answering these difficult questions helps design better guardrails.</p><p>The difference is critical: quick incident reviews create controls and bureaucracy, while deep understanding-focused reviews help create good guardrails and in turn build adaptability.</p><h3>But wait, what happens when engineers take shortcuts that bring down critical systems?</h3><p>This question becomes crucial when we consider why people work around safety measures. In my experience, there are two distinct patterns:</p><p><strong>Systematic workarounds</strong> where multiple people consistently bypass the same controls reveal design problems. When everyone ignores a safety measure because it conflicts with getting work done effectively, that's feedback that the control isn't designed for the reality of the work environment.</p><p><strong>Individual violations</strong> where someone consciously ignores a safety protocol despite understanding the risks represent accountability issues requiring different responses such as training, supervision, or removal from roles.</p><p>The key difference is in the data pattern. One person removing a safety device represents an individual problem. Everyone consistently working around the same procedure indicates a system design problem requiring guardrail redesign, not more enforcement. When operational staff or engineers consistently bypass safety measures, it's usually because those measures force them to choose between being safe and being effective. That's a design problem, not a people problem.</p><p>For the small percentage who truly are bad actors, attempting to prevent malicious behavior through process controls is futile. If someone really wants to cause trouble, they'll find a way around controls. 
This is where foundational security principles become essential: comprehensive auditing that records every action, immutable infrastructure that can't be tampered with, and "detect, isolate, replace" strategies. These work as guardrails. They don't prevent every possible action through approvals, but they make malicious changes visible and automatically containable.</p><p>You can't control your way out of determined bad actors, but you can architect systems that make their actions obvious and limited in scope.</p><h2><strong>Breaking the Cycle</strong></h2><p>The human tendency to add controls in response to uncertainty is so deeply wired that even organizations with excellent resilience instincts and intentions fall into this trap. After incidents, the cultural pressure to "do something" combined with our psychological need for control creates an almost irresistible urge to add approval gates, extend procedures, and create more detailed documentation.</p><p>The path forward requires consciously auditing our practices through a controls vs guardrails lens. Where are we creating friction during normal operations when we should be creating safety boundaries? Where are we demanding compliance when we should be enabling adaptation?</p><p>The goal isn't eliminating all structure. Instead, it&#8217;s ensuring our structure enables the adaptive capacity that makes systems genuinely resilient rather than just compliant.</p><p>Here's the important bit: resilience comes from systems that can learn and adapt, not from preventing all possible changes. When we build tollbooths instead of guardrails, we optimize for the feeling of control rather than the reality of safety.</p><p>Smart guardrails enable adaptation by making the safe path also the effective path. 
Rigid controls kill adaptation by forcing people to choose between following procedures and solving problems.</p><p>To really improve resilience, organizations need to understand this distinction and design safety mechanisms that activate when needed but don't interfere with normal operations. They need to measure outcomes that matter rather than compliance metrics that feel good. They need to create psychological safety that enables people to surface problems early rather than hide them to avoid bureaucratic friction.</p><p>Most importantly, they need to recognize that our instinctive response to uncertainty&#8212;adding more controls&#8212;is often the enemy of the adaptive capacity that creates real resilience.</p><p>You don't make dangerous activities safe by preventing access. You make them safe by giving people the right safety equipment and boundaries.</p><div><hr></div><h2><strong>Addendum: Intent and Context Matter</strong></h2><p>A recent discussion with <a href="https://www.linkedin.com/in/fredth/">Fred Hebert</a> made me realize I had left out something important: the controls vs. guardrails framework isn't just about technical implementation, it's also about human psychology, organizational context, and the often-unconscious biases that shape how we respond to uncertainty.</p><p>Whether something functions as a control or guardrail often depends on intent and context, not just implementation. Consider peer code review:</p><ul><li><p>As a Control: Management mandate for compliance, or engineers implementing it because they don't trust each other</p></li><li><p>As a Guardrail: Engineers wanting shared awareness and collective understanding</p></li></ul><p>The same technical mechanism serves different purposes and creates different psychological effects based on who decided and why.</p><p>This reveals something important. 
Many controls start as well-intentioned guardrails but drift toward control thinking over time, especially when designed by people detached from the actual work. I frequently see well-intentioned engineers create elaborate controls for things they won't use directly, such as runbooks, reviews, checklists, and templates. This "work-as-imagined" vs. "work-as-done" gap leads to controls that feel logical in theory but create friction in practice.</p><p>This tendency is amplified by our engineering training itself. As Fred noted in our conversation, "our whole engineering discipline is often founded on breaking things down analytically, creating the right abstractions to constrain variability, and moving on!" We're taught to reduce complexity by controlling it, which works well for technical systems but can backfire when applied to human systems.</p><p>The challenge is that controls themselves become part of the environment and increase potential for unexpected interactions. When we try to limit environmental complexity through rigid procedures, we often create new complexities in how people work around those procedures.</p>]]></content:encoded></item><item><title><![CDATA[Why MTTR is a Misleading Metric (And What to Track Instead)]]></title><description><![CDATA[Many engineering teams have that dashboard, the one they've been staring at for months, watching MTTR stubbornly refuse to budge despite all their hard work.]]></description><link>https://newsletter.resiliumlabs.com/p/mttr-problems-better-incident-metrics</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/mttr-problems-better-incident-metrics</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Sat, 12 Jul 2025 08:13:35 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2cbe4f3e-fcb5-466c-b662-e043b63ddb97_1000x649.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Many engineering teams have that dashboard, the one they've been 
staring at for months, watching MTTR stubbornly refuse to budge despite all their hard work. Leadership keeps asking why the number isn't improving. The team knows they're doing better work, but the metric tells a different story.</p><p>Here's a simple math problem that breaks many organizations.</p><p>You have 10 incidents. 9 resolve in 5 minutes each. 1 takes 6 hours (360 minutes).</p><p>Your MTTR says 40.5 minutes.</p><p>Do you think it tells a good story?</p><p>Most of the incidents were short, but your MTTR suggests everything takes 40+ minutes.</p><p>The message is simple: Stop using MTTR as your north star metric. The math just doesn't add up and almost never makes any sense.</p><p>Similar to the <a href="https://www.resiliumlabs.com/blog/the-prevention-paradox">prevention paradox</a> I recently wrote about, where successful resilience work can make itself appear unnecessary, MTTR creates its own illusion while hiding the real story of your system's health.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ArAh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8165e8b-d4d5-499e-b0b0-3b9455ff6cf7_1000x649.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ArAh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8165e8b-d4d5-499e-b0b0-3b9455ff6cf7_1000x649.png 424w, https://substackcdn.com/image/fetch/$s_!ArAh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8165e8b-d4d5-499e-b0b0-3b9455ff6cf7_1000x649.png 848w, 
https://substackcdn.com/image/fetch/$s_!ArAh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8165e8b-d4d5-499e-b0b0-3b9455ff6cf7_1000x649.png 1272w, https://substackcdn.com/image/fetch/$s_!ArAh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8165e8b-d4d5-499e-b0b0-3b9455ff6cf7_1000x649.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ArAh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8165e8b-d4d5-499e-b0b0-3b9455ff6cf7_1000x649.png" width="2554" height="1658" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8165e8b-d4d5-499e-b0b0-3b9455ff6cf7_1000x649.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1658,&quot;width&quot;:2554,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ArAh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8165e8b-d4d5-499e-b0b0-3b9455ff6cf7_1000x649.png 424w, https://substackcdn.com/image/fetch/$s_!ArAh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8165e8b-d4d5-499e-b0b0-3b9455ff6cf7_1000x649.png 848w, 
https://substackcdn.com/image/fetch/$s_!ArAh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8165e8b-d4d5-499e-b0b0-3b9455ff6cf7_1000x649.png 1272w, https://substackcdn.com/image/fetch/$s_!ArAh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8165e8b-d4d5-499e-b0b0-3b9455ff6cf7_1000x649.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>The Demoralization Effect</h2><p>A few months ago, I spoke with a platform engineering team that had spent over two years 
dramatically improving their systems. They'd implemented comprehensive monitoring, automated recovery procedures, and streamlined their incident response. The team was proud of their work, and rightfully so.</p><p>But their MTTR dashboard told a different story. Despite all their improvements, the number stubbornly hovered around 85 minutes. Some months it even went up.</p><p>The team was demoralized because leadership questioned their efforts. "If you're really improving things, why isn't MTTR going down?"</p><p>When we dug deeper, the true story came to light. The team had become really good at detecting small issues early, problems that would previously have cascaded into major outages. They were catching these issues in minutes and resolving them quickly. But they were also tackling increasingly complex infrastructure problems that genuinely took hours to solve properly.</p><p>Their stubborn MTTR was actually masking significant progress: minimal customer-impacting outages in 18 months, a 90% reduction in alert fatigue, and a team that had transformed from a reactive firefighting approach to proactive system improvement.</p><p>MTTR not only failed to reflect their success; it actively undermined it.</p><h3>The Math Simply Doesn't Work</h3><p>MTTR has become the default way organizations measure incident response. Teams dutifully display their MTTR dashboards, track improvements over time, and use these numbers to justify investments, performing operational excellence for leadership rather than actually achieving it.</p><p>But there's a fundamental problem. Incident duration data follows a <a href="https://en.wikipedia.org/wiki/Power_law">power-law distribution</a>. Most incidents are resolved quickly: a container restart here, a cache flush there. 
But occasionally, you get the big ones: database corruptions, complex cascading failures, or novel security breaches that can take hours or days to resolve.</p><p>When you average that 5-minute container restart with a 6-hour database recovery, you get a number that represents neither reality.</p><p>As discussed in the <a href="https://www.thevoid.community/report-2024">VOID Report</a>, incident duration data follows a positively-skewed distribution where "measures of central tendency like the mean, aren't a good representation of positively-skewed data, in which most values are clustered around the left side of the distribution while the right tail of the distribution is longer and contains fewer values."</p><p>The mean gets pulled toward the outliers, creating a metric that doesn't reflect the typical experience of either your users or your incident response teams. In essence, you're using a statistic designed for normal distributions on data that's anything but normal.</p><p>This is exactly why teams see their MTTR stay flat or even increase despite genuine improvements. <strong>As teams become better at detecting problems early, they catch more small issues that can be resolved quickly.</strong></p><p>But as systems continue to evolve, they also tackle more complex problems that naturally take longer to solve properly. The occasional 6-hour incident will dominate the average, making months of 5-minute fixes invisible.</p><h3>Time Doesn't Equal Impact</h3><p>But the statistical issues are just the beginning. MTTR misses the actual point.</p><p>A 30-minute incident affecting 100,000 users is fundamentally different from a 2-hour incident affecting 5 users. MTTR treats them identically.</p><p>Consider these real-world scenarios:</p><p><strong>Live streaming platform</strong>: Video playback fails for 10 minutes during the final quarter of a playoff game. 50,000 concurrent viewers lose service at the most critical moment. Social media explodes with complaints. 
Potential subscription cancellations. Customer support is overwhelmed.</p><p>Versus: "My watchlist" feature breaks for 2 hours during off-peak. A few dozen users noticed. Minimal business impact, easily communicated via banner notification.</p><p><strong>E-commerce site</strong>: Your checkout process crashes for 15 minutes during a Black Friday flash sale. Direct revenue loss of an estimated $500K. Abandoned carts. Frustrated customers switching to competitors. Marketing spend wasted.</p><p>Versus: The "recommended items" widget fails for 3 hours on a Tuesday morning. Slight decrease in discovery metrics. No immediate revenue impact. Most users don't even notice.</p><p>In both cases, MTTR would suggest the longer incidents were "worse" than the shorter ones.</p><p>But which ones actually damaged the business?</p><p>Losing a non-critical service for 5 hours isn't the same as losing your most critical one for 10 minutes. MTTR can't distinguish between these scenarios.</p><h3>The Human Reality</h3><p>There is an even more fundamental challenge beyond the statistical issues: incidents are inherently human processes, and MTTR completely ignores human and organizational factors.</p><p>Real incident response involves complex dynamics that just can't be captured in a simple time measurement.</p><p>First, complex incidents often require multiple teams from different organizations, e.g., database specialists, network engineers, security experts, and product managers. Each handoff introduces delays, communication gaps, and potential misalignment that MTTR treats as "inefficiency" rather than important and necessary collaborative work.</p><p>Additionally, during high-stress incidents, engineers frequently face new failure modes and must diagnose novel problems, often with incomplete information.
The time spent carefully analyzing symptoms to avoid making things worse isn't captured meaningfully by MTTR.</p><p>Shift changes, escalation procedures, approval processes, and dependencies on third-party vendors all introduce delays that have nothing to do with technical competence but significantly impact resolution time.</p><p>Finally, thorough incident response often involves deliberately slowing down to understand the real problems, verify fixes, and prevent recurrence or cascading effects, especially for large-scale events. MTTR incentivizes speed over precaution and learning, potentially making systems less resilient in the long term.</p><p>These human factors introduce variability that makes MTTR practically misleading, while missing the very aspects of incident response that distinguish operational excellence.</p><h3>When Good Metrics Go Bad</h3><p>MTTR actively drives counterproductive behaviors when used as a performance metric. When teams are measured by average resolution time, they will often end up optimizing for the metric rather than actual resilience (a process called <a href="https://hbr.org/2019/09/dont-let-metrics-undermine-your-business">surrogation</a>), prioritizing quick fixes over understanding the contributing factors (root causes) and rushing to close tickets rather than implementing lasting solutions. This pressure fosters a culture where teams prioritize finding someone responsible rather than addressing systemic issues, while also promoting superficial verification by skipping thorough post-incident testing to close tickets faster. Teams may even game the numbers by avoiding declaring incidents or manipulating start and stop times to improve their averages. 
These behaviors make organizations less resilient, not more, and ultimately create a culture focused on looking good on dashboards rather than building genuinely resilient systems.</p><h2>The "Better Metric" Trap</h2><p>I often see teams recognize MTTR's limitations and try alternatives, such as MTTM (Mean Time to Mitigate) or MTTD (Mean Time to Detect). The thinking goes: "If we measure time to mitigation instead of full resolution, we'll better capture when customer pain stops."</p><p>But this also misses the actual issue.</p><p>MTTM has exactly the same statistical problem as MTTR. You're still averaging highly skewed data. Whether you measure "time to resolve" or "time to mitigate," the math doesn't change. That same 6-hour outlier will dominate your MTTM just like it dominated MTTR.</p><p>The real issue is using averages at all on this type of messy data, not where we draw the finish line.</p><h2>What to Track Instead</h2><p>So, what should you measure instead? Great question! And like much of computer science, it depends!</p><p>It depends on what you're trying to achieve, but here are better alternatives:</p><h3>The Dashboard That Actually Tells Your Story</h3><p>Picture this: You're in a quarterly review. Instead of staring at an MTTR dashboard showing 45 minutes (making your team look bad), you pull up a different set of metrics:</p><ul><li><p><strong>"99.97% login success rate, only 12 users affected by incidents this quarter."</strong></p></li><li><p><strong>"Zero revenue-impacting outages in the last 6 months."</strong></p></li><li><p><strong>"95% of incidents resolved in under 8 minutes."</strong></p></li></ul><p>Suddenly, the conversation shifts from "why is your MTTR so high?" to "how did you achieve such results?"</p><p>A big part of our work on resilience is about telling the story our work actually deserves.</p><p>Metrics shape the story about your team's work. 
And that story determines everything, from budget approvals to whether leadership views resilience investments as beneficial or not. So get it right!</p><p>MTTR often tells a story of mediocrity and failure. Impact metrics and percentiles tell a story of excellence, learning, and genuine improvement.</p><p>You must tell the resilience stories in a way that leadership can understand and appreciate.</p><h3>Focus on Customer Experience</h3><p>Use Service Level Objectives (SLOs) that measure availability, latency, and error rates from the customer's point of view. Instead of asking "How long did it take to fix?" ask "What percentage of user requests succeeded?"</p><p>I worked with a team whose MTTR was consistently "terrible" at 95 minutes. Leadership was frustrated. However, their SLOs told a different story: 99.95% uptime, with the vast majority of their "incidents" being minor maintenance work that users never even noticed.</p><p>The team had been solving the right problems and preventing customer impact, but the MTTR made them look like they were failing.</p><p>Start with user-facing metrics, such as "login requests succeed 99.9% of the time." Focus on what users actually experience, not what your infrastructure is doing.</p><h3>Impact-Focused Metrics</h3><p>Impact-focused metrics measure the actual effect of incidents on users, business operations, and system performance:</p><ul><li><p>Number of users affected</p></li><li><p>Duration of user impact</p></li><li><p>Revenue loss or business disruption</p></li><li><p>Service Level Objective (SLO) violations</p></li><li><p>Error rates and availability percentages</p></li><li><p>Customer satisfaction scores during and after incidents</p></li></ul><p>Here's a real example: An e-commerce team had two incidents in the same week.
Their MTTR dashboard showed:</p><ul><li><p><strong>Incident A: 3 hours to resolve</strong></p></li><li><p><strong>Incident B: 20 minutes to resolve</strong></p></li></ul><p>Leadership asked why the team seemed more concerned about Incident B. But the impact metrics told the real story:</p><ul><li><p><strong>Incident A: Affected 5 internal users, $0 revenue impact</strong></p></li><li><p><strong>Incident B: Affected 50,000 customers during peak hours, estimated $200K revenue loss</strong></p></li></ul><p>MTTR suggested Incident A was "9x worse." Impact metrics revealed the truth. Again, stories matter!</p><p>Categorize incidents by business impact&#8212;Critical (revenue-affecting, customer-facing), High (internal productivity loss), Medium (degraded experience), Low (no user impact). Track the count and duration of each category separately.</p><p>Impact-focused metrics prioritize what matters most by measuring how many users are impacted and for how long, ensuring teams focus on reducing real-world pain rather than just closing tickets quickly. This approach aligns engineering work with actual business risk and value. These metrics also expose hidden risks: two incidents with the same MTTR can have vastly different impacts, and tracking impact reveals which incidents truly hurt the business and deserve deeper attention, while uncovering chronic issues or fragile components that consistently cause high user impact.</p><p>Unlike MTTR, which incentivizes quick fixes, impact metrics encourage healthy behaviors: teams address underlying causes and prevent recurrence, improving long-term resilience.</p><h3>Use Percentiles, Not Averages</h3><p>Use percentiles (P95, P99) instead of averages to understand your real incident distribution.
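As a quick sanity check of that recommendation, here is a minimal Python sketch (illustrative only: it uses the nearest-rank percentile convention, and exact percentile values vary slightly between methods and libraries) comparing the mean with percentiles for ten incidents, nine resolved in 5 minutes and one in 6 hours:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value that covers p% of the data."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Nine quick fixes and one long outage, in minutes.
durations = [5] * 9 + [360]

mttr = sum(durations) / len(durations)  # 40.5 -- matches no actual incident
p50 = percentile(durations, 50)         # 5
p90 = percentile(durations, 90)         # 5
p99 = percentile(durations, 99)         # 360
```

The mean lands on a duration no incident ever had, while the percentiles separate the typical case (P50, P90) from the outlier (P99).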
These metrics are less influenced by outliers and give you a clearer picture of what&#8217;s actually happening.</p><p>Let's return to our original 10 incidents example to see how percentiles tell a completely different story than MTTR: 9 incidents at 5 minutes, 1 incident at 360 minutes (6 hours)</p><ul><li><p><strong>P50 (median)</strong>: 5 minutes - This tells you that half of your incidents resolve in 5 minutes or less</p></li><li><p><strong>P90</strong>: 5 minutes - This tells you that 90% of your incidents resolve in 5 minutes or less</p></li><li><p><strong>P95</strong>: 5 minutes - This tells you that 95% of your incidents resolve in 5 minutes or less</p></li><li><p><strong>P99</strong>: 360 minutes - This tells you about your worst-case scenario, the 1% of incidents that take much longer</p></li><li><p><strong>MTTR (mean)</strong>: 40.5 minutes - This tells you... what exactly?</p></li></ul><p>What does that tell us?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VLwX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35c9c967-a71a-4f5a-be11-1483382eac9f_1000x650.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VLwX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35c9c967-a71a-4f5a-be11-1483382eac9f_1000x650.png 424w, https://substackcdn.com/image/fetch/$s_!VLwX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35c9c967-a71a-4f5a-be11-1483382eac9f_1000x650.png 848w, 
https://substackcdn.com/image/fetch/$s_!VLwX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35c9c967-a71a-4f5a-be11-1483382eac9f_1000x650.png 1272w, https://substackcdn.com/image/fetch/$s_!VLwX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35c9c967-a71a-4f5a-be11-1483382eac9f_1000x650.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VLwX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35c9c967-a71a-4f5a-be11-1483382eac9f_1000x650.png" width="2556" height="1662" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/35c9c967-a71a-4f5a-be11-1483382eac9f_1000x650.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1662,&quot;width&quot;:2556,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!VLwX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35c9c967-a71a-4f5a-be11-1483382eac9f_1000x650.png 424w, https://substackcdn.com/image/fetch/$s_!VLwX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35c9c967-a71a-4f5a-be11-1483382eac9f_1000x650.png 848w, 
https://substackcdn.com/image/fetch/$s_!VLwX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35c9c967-a71a-4f5a-be11-1483382eac9f_1000x650.png 1272w, https://substackcdn.com/image/fetch/$s_!VLwX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35c9c967-a71a-4f5a-be11-1483382eac9f_1000x650.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The percentiles tell you that your incident response is actually excellent for the vast majority of cases. 
95% of your incidents resolve in just 5 minutes! The P99 shows you have an occasional complex incident that takes 6 hours, which is important to know and address separately.</p><p>MTTR, however, suggests that a "typical" incident takes over 40 minutes, which is completely false. No incident in your dataset actually took 40 minutes. It's a mathematical artifact that doesn't represent anyone's actual experience.</p><p><strong>Why this matters for decision-making</strong>:</p><ul><li><p><strong>P50 and P90</strong> help you understand your standard operational capability</p></li><li><p><strong>P95 and P99</strong> help you identify outliers that need special attention</p></li><li><p><strong>Trends in percentiles</strong> show whether you're improving typical performance (P50/P90) or getting better at handling complex incidents (P95/P99)</p></li></ul><p>P99 resolution time tells you what your most challenging incidents look like. P50 tells you about your typical response. Both are more useful than a skewed average that represents nobody's actual experience.</p><p>When you track percentiles over time, you <em><strong>might</strong></em> see improvements in both your standard incident response (lower P50/P90) and your complex incident handling (lower P95/P99), insights that MTTR completely obscures.</p><p>Noticed the "<em><strong>might</strong></em>"?</p><p>Here's the thing: even percentiles won't give you a simple, linear story of improvement. Modern systems are complex, distributed, and constantly evolving. 
Features are deployed continuously, infrastructure updates are regular, and teams are constantly changing.</p><p>Expecting any single metric to capture this complexity and show steady improvement is like wrapping yourself in a warm blanket during a snowstorm; it feels comforting, but it doesn't actually change the weather outside.</p><p>Here's the practical advantage of percentiles: they're a relatively good transitional metric if your organization is currently committed to MTTR. Because percentiles tell such a dramatically different and more accurate story than MTTR, leadership is unlikely to reject them. When you show that P50 is 5 minutes while MTTR claims 40+ minutes, the mathematical issue becomes obvious. But more importantly, percentiles don't threaten existing processes because they are fairly similar to MTTR. This makes them perfect for organizations that want to wrap themselves in a warm blanket but also tell a better story.</p><p>To further improve percentiles, consider categorizing your incidents into multiple classes, including deployment issues, bugs, regressions, testing issues, infrastructure failures, operational issues, security incidents, and third-party failures.</p><p>Each category has different root causes, recovery patterns, and prevention strategies. Averaging them together hides the specific improvements needed for each type.</p><h3>What the Numbers Can't Tell You</h3><p>Remember that platform engineering team from the beginning? The one with the stubborn 85-minute MTTR despite two years of improvements? When we dug deeper into their incidents, we discovered something very interesting.</p><p>Their "worst" incident, a 6-hour database corruption that dominated their MTTR, had actually taught them more about system resilience than dozens of quick fixes combined. The team had to coordinate across five different groups, improvise solutions when their runbooks failed, and ultimately redesign their backup strategy.
Six months later, they prevented three similar issues from happening at all.</p><p>But their MTTR dashboard captured none of this. It just recorded "6 hours" and moved on.</p><p>This is what resilience engineering teaches us: the most valuable insights from incidents aren't about metrics, they're about adaptation and learning.</p><p><strong>Here are some of the lessons they actually learned:</strong></p><ol><li><p>Their mental model of how database replication worked was completely wrong. The incident revealed assumptions they'd held for years about their architecture that turned out to be false.</p></li><li><p>The team discovered that their monitoring was blind to a specific type of database issue they didn't even know existed.</p></li><li><p>When their standard recovery procedures failed, the team had to improvise. The database specialist was on vacation, the network team was handling a separate issue, and the junior engineer on-call had limited experience. Despite all that, they succeeded. This taught them something crucial: their runbooks were incomplete, their staffing assumptions were wrong, and their "junior" engineers were more capable than anyone realized.</p></li><li><p>The incident required five teams, seven approval processes, and coordination across three time zones. It showed them exactly where their architecture was too tightly coupled and where their processes created unnecessary bottlenecks.</p></li></ol><p>Each of these stories contains insights that make the organization more resilient. But MTTR reduces all of this richness to a single number that obscures the very learning that prevents future incidents.</p><h3>What Resilient Organizations Actually Measure</h3><p>Organizations that really care about resilience don't obsess over incident duration.
They track how well they learn from each incident and how psychologically safe teams feel when reporting problems:</p><ul><li><p>Do we detect similar problems faster each time?</p></li><li><p>Are teams getting better at improvising when standard procedures fail?</p></li><li><p>What surprises us about our systems, and how do we turn those surprises into knowledge?</p></li><li><p>Are we reducing coordination overhead through better architecture and clearer ownership?</p></li><li><p>Which recovery strategies actually work under pressure?</p></li><li><p>How quickly do insights from one incident improve our response to future ones?</p></li></ul><p>Going back to that original team: when leadership asked "If you're really improving things, why isn't MTTR going down?" they were asking the wrong question.</p><p>The right question was: "Are you building systems that learn, adapt, and get stronger after each failure?"</p><p>The answer, hidden beneath that stubborn MTTR, was absolutely yes.</p><p>Their 90% reduction in alert fatigue meant they could focus on real problems. Their early detection systems meant small issues stayed small. Their systematic approach to complex incidents meant they were building institutional knowledge that would prevent future outages.</p><p>None of this showed up in MTTR. All of it showed up in their actual resilience.</p><p>The goal isn't to reduce incident response to an elusive average resolution time. The goal is to build systems and teams that learn from every incident, adapt when plans fail, and turn today's surprises into tomorrow's preventive measures.</p><p><strong>That's the story great engineering teams should be telling. And it's the story that MTTR will never capture.</strong></p><div><hr></div><h3>References</h3><p>I'm not the first to question the usefulness of MTTR. 
People have been highlighting these problems for years:</p><p>John Allspaw argued in 2018 that shallow incident data like MTTR "generates very little insight" because incidents are "dynamic events with people making decisions under time pressure" that can't be captured in simple averages.</p><p>The VOID Report confirmed these concerns empirically, finding that "measures of central tendency like the mean aren't a good representation of positively-skewed data."</p><p>&#352;t&#283;p&#225;n Davidovi&#269;'s "Incident Metrics in SRE: Critically Evaluating MTTR and Friends" used Monte Carlo simulations to show that reliable calculations of incident duration improvements aren't possible with MTTR.</p><p>Lorin Hochstein has demonstrated through statistical analysis that when incident durations follow power-law distributions, "observed MTTR trends convey no useful information at all." His work with power-law-distributed data shows how sample means become unreliable indicators of system performance.</p><p>What I hope to contribute here is a practical guide for the many engineering teams still using MTTR today, helping them understand not just why these metrics are misleading, but what they can implement instead to actually improve their resilience.</p><p>Allspaw, J. (2018). "<a href="https://www.adaptivecapacitylabs.com/2018/03/23/moving-past-shallow-incident-data/">Moving Past Shallow Incident Data</a>." Adaptive Capacity Labs.</p><p>Nash, C. et al. (2021-24). "<a href="https://www.thevoid.community/report-2024">The VOID Report</a>."</p><p>Davidovi&#269;, &#352;. (2021). "<a href="https://static.googleusercontent.com/media/sre.google/en//static/pdf/IncidentMeticsInSre.pdf">Incident Metrics in SRE: Critically Evaluating MTTR and Friends</a>."</p><p>Hochstein, L. (2024). "<a href="https://surfingcomplexity.blog/2024/12/01/mttr-when-sample-means-and-power-laws-combine-trouble-follows/">MTTR: When sample means and power laws combine, trouble follows</a>." 
Surfing Complexity.</p>]]></content:encoded></item><item><title><![CDATA[The Prevention Paradox: Why Successful Resilience Work Becomes Its Own Enemy]]></title><description><![CDATA[I recently spoke with a Staff Engineer whose team was being "rightsized" (I'm keeping details vague to protect their privacy.) Five years earlier, after a catastrophic Black Friday outage, leadership had given them carte blanche to build world-class resilience.]]></description><link>https://newsletter.resiliumlabs.com/p/the-prevention-paradox</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/the-prevention-paradox</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Tue, 03 Jun 2025 07:04:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/05ac0f69-339b-4961-9c6f-f0902ca6c463_1000x667.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J9bR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14077e78-77f8-406e-af28-2e4efc26760d_1000x667.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J9bR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14077e78-77f8-406e-af28-2e4efc26760d_1000x667.jpeg 424w, https://substackcdn.com/image/fetch/$s_!J9bR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14077e78-77f8-406e-af28-2e4efc26760d_1000x667.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!J9bR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14077e78-77f8-406e-af28-2e4efc26760d_1000x667.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!J9bR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14077e78-77f8-406e-af28-2e4efc26760d_1000x667.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!J9bR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14077e78-77f8-406e-af28-2e4efc26760d_1000x667.jpeg" width="2500" height="1667" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14077e78-77f8-406e-af28-2e4efc26760d_1000x667.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1667,&quot;width&quot;:2500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!J9bR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14077e78-77f8-406e-af28-2e4efc26760d_1000x667.jpeg 424w, https://substackcdn.com/image/fetch/$s_!J9bR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14077e78-77f8-406e-af28-2e4efc26760d_1000x667.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!J9bR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14077e78-77f8-406e-af28-2e4efc26760d_1000x667.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!J9bR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14077e78-77f8-406e-af28-2e4efc26760d_1000x667.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I recently spoke with a Staff Engineer whose team was being "rightsized" (I'm keeping details vague to protect their privacy.) 
Five years earlier, after a catastrophic Black Friday outage, leadership had given them carte blanche to build world-class resilience. They hired the best engineers money could buy, put a comprehensive incident learning process in place, implemented operational readiness reviews, automated and gradual deployments with continuous testing, built a chaos engineering program, and created feedback loops that institutionalized resilience thinking across the engineering organization. Name it: they were either doing it or had it planned.</p><p>No major outages in almost 2 years. Some smaller ones of course, but nothing catastrophic. Customer satisfaction was high. Resilience that made many envious. More importantly, the organization was healthy. No blaming culture. Eagerness to learn, and do what it could to limit the number of customer-facing failures.</p><p>It worked great. Everyone seemed happy. They even talked publicly and published about their practices in their own engineering blog.</p><p>Then came a quarterly business review.</p><p>"What exactly does your team do? I see a lot of salary costs, but what are the deliverables? Can you quantify the ROI?" asked the newly appointed VP of Engineering.</p><p>The team had spent the better part of the past 5 years preventing disasters and building operational and organizational resilience capabilities, not gathering evidence or preparing PowerPoints. They scrambled to explain their value, and didn&#8217;t have the data to back up their claims. If you are not prepared for the VP's questions, "we prevented failures" and "we learned a lot" don't cut it. It simply doesn&#8217;t translate well to budget spreadsheets.</p><p>"Nothing's breaking right now. We need to refocus resources on features that drive revenue. We can revisit this stuff later if we need to," said the VP.</p><p>Six months later, they were down to only two engineers, and just a few months after that, the bigger and longer outages started again.
Finger-pointing started again. People focused on themselves, saving their own jobs, and trying to show the VP how cost-efficient they were. The organization was already losing its operational resilience culture.</p><p>This is the prevention paradox, and it has played out countless times throughout history. The <a href="https://en.wikipedia.org/wiki/Year_2000_problem">Y2K bug</a> offers perhaps the most famous example. After years of preparation and billions in investments to address the Y2K bug, January 1, 2000, passed without major incidents. And rather than celebrating this success, many dismissed Y2K concerns as overblown hysteria precisely because the preventive measures worked as intended. The very absence of catastrophe made it difficult to justify the resources that had been spent on prevention.</p><h3>Why is this happening?</h3><p>The prevention paradox cycle happens because our brains struggle to value "non-events", things that didn't happen. We're wired to respond to and remember actual events and visible outcomes. When a disaster is prevented, there's no dramatic story to hold onto or share, so no one hears about it.</p><p>After a non-event, people conclude:</p><p>"See? Nothing happened, so clearly it wasn't a real problem."</p><p>Rather than recognizing:</p><p>"Nothing happened because we took appropriate precautions."</p><p>Paradoxically, the more successful the resilience efforts, the more that effort appears unnecessary in retrospect.</p><p>Organizations tend to focus on systems that are currently working while ignoring the preventive work that keeps them working. A database that hasn't failed in two years seems "naturally reliable" rather than "well-maintained."</p><p>When systems work well, we often attribute it to good initial design or "stable technology." When they fail, we blame the incident on bad luck or external factors.</p><p>The immediate cost of prevention work is visible and concrete. 
The future cost of potential failures is abstract and uncertain.</p><p>And leadership isn&#8217;t immune to it. They experience the same cognitive bias as everyone else: if nothing bad is happening, maybe nothing bad was going to happen anyway. They see the salary of the chaos engineering team; they don't see the &#8364;2M outage that never happened.</p><p>The prevention paradox is the most common, predictable, yet devastating cycle I see in software organizations, and it impacts even the most &#8220;advanced&#8221; organizations out there. Literally NO ONE is immune to it.</p><h2>The Resilience Amnesia Cycle</h2><p>Here's how the prevention paradox typically takes hold in an organization:</p><h4><strong>1. The Crisis</strong></h4><p>A major outage hits. Revenue lost. Customers angry. Executives embarrassed. "This can never happen again!"</p><h4><strong>2. The Investment</strong></h4><p>Leadership opens the checkbook. "Hire the best SREs and engineers. Build comprehensive incident learning processes. Implement operational excellence. Make it resilient. Whatever it takes!"</p><h4><strong>3. The Success</strong></h4><p>The team delivers. They build robust technical systems, adopt chaos engineering, and develop advanced observability. They establish incident response processes to learn from every incident. They implement Operational Readiness Reviews (ORR) and Continuous Integration and Continuous Deployment (CI/CD) practices that catch issues before deployment. They create feedback loops between development and operations that institutionalize resilience thinking across the entire engineering organization.</p><p>Systems become resilient. Large outages become rare. Customer satisfaction improves. The organization develops the capability to anticipate problems and adapt quickly to changing conditions.</p><h4><strong>4. The Return-On-Investment (ROI) Questions</strong></h4><p>New leadership. New budget cycle. Different priorities. "What does the team actually do? 
Can we quantify their impact? Are we over-invested here?"</p><h4><strong>5. The Cuts</strong></h4><p>"Nothing's broken, so clearly we don't need this level of investment. Let's move some people to features. We can always scale back up if needed."</p><h4><strong>6. The Return</strong></h4><p>Outages start happening again. The organization has lost its learning capabilities. Teams make the same mistakes repeatedly. "How did we get here? We need to invest in resilience!"</p><p>And the cycle repeats.</p><p>The team that prevented the crisis gets disbanded. The team that responds to the new crisis gets celebrated as heroes.</p><p>This is what I call the <strong>Resilience Amnesia Cycle</strong>. It is a predictable pattern where organizations systematically forget why they invested in prevention, precisely because that prevention worked.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ok1S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1fc2a4-6b1e-4677-bbd6-300551d9d0ec_1000x631.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ok1S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1fc2a4-6b1e-4677-bbd6-300551d9d0ec_1000x631.png 424w, https://substackcdn.com/image/fetch/$s_!Ok1S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1fc2a4-6b1e-4677-bbd6-300551d9d0ec_1000x631.png 848w, https://substackcdn.com/image/fetch/$s_!Ok1S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1fc2a4-6b1e-4677-bbd6-300551d9d0ec_1000x631.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Ok1S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1fc2a4-6b1e-4677-bbd6-300551d9d0ec_1000x631.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ok1S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1fc2a4-6b1e-4677-bbd6-300551d9d0ec_1000x631.png" width="3286" height="2074" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d1fc2a4-6b1e-4677-bbd6-300551d9d0ec_1000x631.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2074,&quot;width&quot;:3286,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Ok1S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1fc2a4-6b1e-4677-bbd6-300551d9d0ec_1000x631.png 424w, https://substackcdn.com/image/fetch/$s_!Ok1S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1fc2a4-6b1e-4677-bbd6-300551d9d0ec_1000x631.png 848w, https://substackcdn.com/image/fetch/$s_!Ok1S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1fc2a4-6b1e-4677-bbd6-300551d9d0ec_1000x631.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Ok1S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1fc2a4-6b1e-4677-bbd6-300551d9d0ec_1000x631.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div><p>The Resilience Amnesia Cycle</p><h2>Why Teams Struggle to Justify Their Existence</h2><p>The truth is that most resilience teams (including reliability, SRE, and Ops teams) are unprepared for budget scrutiny because they were never asked to justify themselves in the first place. 
They were hired in crisis mode with a simple mandate: "Fix this."</p><p>They optimized for technical and organizational excellence, not business justification.</p><p>When the inevitable budget questions come, they face several challenges:</p><h4><strong>1. Success Is Invisible</strong></h4><p>"We prevented X potential outages" and "we improved organizational learning velocity" don't hit the same as "We shipped 12 new features." Prevention work creates an absence of problems and improved adaptive capacity, both of which are inherently hard to measure and communicate.</p><h4><strong>2. No Business Metrics Framework</strong></h4><p>Teams might track technical metrics, but rarely translate these to business impact. They can point to improved velocity and to preparation work from incidents and operational reviews, but they can't answer "What's the ROI?" because they never built a framework to calculate it, or collected the data needed to even start.</p><h4><strong>3. Different Languages</strong></h4><p>Engineers speak in availability percentages, bug fixes, deployment and rollback counts, and learning cycles from incidents. Finance speaks in revenue impact and cost per outcome. Neither group is fluent in the other's language.</p><h4><strong>4. Temporal Mismatch</strong></h4><p>Prevention work pays off over long time horizons. Budget cycles demand quarterly justification. The engineer who spent six months building chaos experiments and incident-learning processes that will prevent next year's outage struggles to show this quarter's value.</p><h4><strong>5. Attribution Challenges</strong></h4><p>When systems are resilient, is it because of good initial architecture? Tested dependencies? Or the daily work of the teams building organizational resilience? 
It's genuinely hard to know, and teams rarely build systems to prove their contribution.</p><h3>The Leadership Perspective</h3><p>To be fair to leadership, their questions aren't completely unreasonable. They're managing competing priorities with limited resources. From their perspective:</p><ul><li><p>The resilience team was expensive to build and maintain</p></li><li><p>Current systems appear stable without obvious ongoing issues</p></li><li><p>Feature development has clear, measurable business impact</p></li><li><p>Market pressure demands shipping new capabilities</p></li><li><p>"We can always rebuild the team if we need it" is an appealing justification</p></li></ul><p>The problem isn't that leadership doesn't value resilience; in my experience, they always do. It's that, unfortunately, successful resilience work makes itself appear unnecessary.</p><h2>The Real Cost of Resilience Decay</h2><p>What leadership (and, to some extent, everyone in an organization) typically underestimates is the lag time and the compound nature of resilience degradation. The decay follows a predictable pattern, unfolding slowly until it&#8217;s too late to recognize what has happened.</p><p>In the first few months after cuts, everything appears fine. Systems continue running on the momentum of previous investments while technical debt accumulates slowly in the background. Incident response processes start being skipped and reviews become optional, but there's no immediate pain to signal the danger. Around month six, small issues begin appearing: individual incidents that seem unrelated, response times that gradually increase, and no single smoking gun to point to. More critically, the organization stops learning from incidents effectively, losing the institutional knowledge that once made it resilient.</p><p>The degradation accelerates dramatically as problems compound. What used to be small, contained failures start cascading across systems. 
Teams spend more time firefighting and less time building features, and fingers start pointing at people rather than at systemic issues. The organization makes the same mistakes repeatedly because institutional learning, the very capability that once prevented these failures, has degraded.</p><p>By 18-24 months, major outages return with a vengeance. Leadership now faces the same crisis that triggered their original resilience investment, but with 18 months of accumulated technical debt and lost organizational learning capability stacked on top.</p><p>Finally comes the frantic rebuilding phase, but now they're rebuilding resilience capabilities while simultaneously managing active fires, a much harder and exponentially more expensive proposition than simply maintaining prevention capabilities would have been.</p><p>Read our blog post&#8212;&#8220;<a href="https://www.resiliumlabs.com/blog/the-quiet-erosion-how-organizations-drift-into-failure">The Quiet Erosion</a>&#8221;&#8212;for more details on the slow process by which an organization drifts into failure.</p><p><em><strong>Disclaimer:</strong> The degradation timeline is based on my experience working in the software industry, where I've observed this pattern repeatedly across different company sizes and industries. While I don&#8217;t have the data to quantify this timeline accurately, industry practitioners consistently report similar degradation patterns. The timeline varies based on factors like system complexity, organizational size, and the depth of original resilience investments. Still, the general pattern of slow-then-sudden degradation is a common experience across the software industry. 
Organizations with more mature practices may see slower degradation, while those with less institutional knowledge may experience faster decline.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pYD5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7d12f50-327c-459b-a769-4205e9c88e79_1000x636.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pYD5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7d12f50-327c-459b-a769-4205e9c88e79_1000x636.png 424w, https://substackcdn.com/image/fetch/$s_!pYD5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7d12f50-327c-459b-a769-4205e9c88e79_1000x636.png 848w, https://substackcdn.com/image/fetch/$s_!pYD5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7d12f50-327c-459b-a769-4205e9c88e79_1000x636.png 1272w, https://substackcdn.com/image/fetch/$s_!pYD5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7d12f50-327c-459b-a769-4205e9c88e79_1000x636.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pYD5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7d12f50-327c-459b-a769-4205e9c88e79_1000x636.png" width="3280" height="2088" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7d12f50-327c-459b-a769-4205e9c88e79_1000x636.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2088,&quot;width&quot;:3280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!pYD5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7d12f50-327c-459b-a769-4205e9c88e79_1000x636.png 424w, https://substackcdn.com/image/fetch/$s_!pYD5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7d12f50-327c-459b-a769-4205e9c88e79_1000x636.png 848w, https://substackcdn.com/image/fetch/$s_!pYD5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7d12f50-327c-459b-a769-4205e9c88e79_1000x636.png 1272w, https://substackcdn.com/image/fetch/$s_!pYD5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7d12f50-327c-459b-a769-4205e9c88e79_1000x636.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p>The Cost of Resilience Cuts: How organizational resilience degrades over time after prevention investments are cut</p><h2>Breaking The Cycle: Making Prevention Visible</h2><p>The key to limiting the Resilience Amnesia Cycle is making invisible work visible and translating technical prevention into business value. This requires both concrete measurement frameworks and deep cultural change&#8212;tactical metrics without culture change won't stick, and culture change without metrics won't convince leadership.</p><h3>Measurement Frameworks</h3><h4><strong>1. 
Calculate Downtime Cost and ROI</strong></h4><p>In 2024, the <a href="https://itic-corp.com/itic-2024-hourly-cost-of-downtime-part-2/">Information Technology Intelligence Consulting (ITIC)</a> firm estimated that 90 percent of enterprises face costs exceeding $300,000 per hour of downtime, with 41 percent reporting $1 million to over $5 million per hour.</p><p>Downtime costs for smaller businesses range from approximately $137 to $427 per minute, and for larger enterprises, they can reach $16,000 per minute.</p><p>The industry average cost of downtime is estimated at about $9,000 per minute.</p><p>Downtime costs can be approximated using the formula:</p><p><strong>Downtime Cost = Minutes of Downtime &#215; Cost per Minute</strong></p><p>Start with that equation. It is simple and straightforward.</p><p>You can, of course, develop models that better estimate the ROI of resilience work:</p><p><strong>Prevention ROI = (Potential Failure Cost &#215; Probability &#215; Prevention Success Rate) / Prevention Investment Cost</strong></p><p>Even imperfect models are better than no models, and you can always improve them iteratively.</p><p>The key is to start putting a price on prevention.</p><h4><strong>2. Quantify Prevented Failures</strong></h4><p>Document the "alternate reality" and use simulations to show what could have happened without prevention.</p><p>Track and document near-misses with their potential business impact and share these with leadership:</p><ul><li><p>"ORR process caught a database scaling issue that could have caused a 4-hour outage during Black Friday, saving $1.2M in potential revenue loss."</p></li><li><p>"Incident review process from a previous incident prevented a similar failure pattern across three other teams, saving $500K in potential revenue loss."</p></li><li><p>"Chaos experiment identified a memory leak that would have led to failure during a product launch, saving $300K in potential revenue loss."</p></li></ul><h4><strong>3. 
Create Prevention Dashboards</strong></h4><p>Build visibility into prevention work with business-relevant metrics:</p><ul><li><p>Issues caught before customer impact</p></li><li><p>System resilience improvements over time</p></li><li><p>Technical debt prevented vs. remediated</p></li><li><p>Learning from incidents and reviews</p></li><li><p>Time-to-detect, time-to-respond, and time-to-recover trends</p></li><li><p>On-call confidence index</p></li></ul><h4><strong>4. Build Compelling Narratives</strong></h4><p>Document how prevention work creates measurable business value:</p><ul><li><p>Share "near miss" stories and highlight specific instances where measures prevented failures</p></li><li><p>Before/after resilience work comparisons</p></li><li><p>Customer satisfaction correlations with resilience investments</p></li><li><p>Developer productivity gains from improved systems</p></li><li><p>Innovation velocity enabled by confidence in system resilience</p></li><li><p>Learn from others' mistakes and point to organizations that failed to prepare</p></li><li><p>Break prevention into measurable achievements</p></li></ul><h4><strong>5. Establish Prevention SLAs</strong></h4><p>Just as you have uptime SLAs, consider creating accountability for prevention work:</p><ul><li><p>Complete an ORR for 100% of new major feature deployments</p></li><li><p>Conduct post-incident reviews for 100% of Sev-1 and Sev-2 incidents</p></li><li><p>Allocate at least 20% of engineering time to addressing technical debt</p></li><li><p>Execute a minimum of two chaos experiments quarterly for all critical dependencies</p></li><li><p>Maintain test coverage above 85% across all services</p></li><li><p>Conduct a full-scale simulation (GameDay) once a month to validate incident response capabilities</p></li><li><p>Review, verify, and exercise at least one runbook weekly</p></li><li><p>Ensure that no runbook's last verification date exceeds six months</p></li></ul><h4><strong>6. 
Build Institutional Memory</strong></h4><p>Document the reasoning behind protective measures and continuously educate stakeholders about:</p><ul><li><p>Why specific resilience investments were made</p></li><li><p>What disasters they're designed to prevent</p></li><li><p>How past incidents shaped current practices</p></li><li><p>The compound value of organizational learning capabilities</p></li><li><p>The ongoing value that resilience work delivers</p></li></ul><h3>Cultural Transformation</h3><p>Even with perfect measurement frameworks, organizations will resist investing in prevention due to what are called "<a href="https://hbr.org/2012/05/get-the-corporate-antibodies-o">organizational antibodies</a>"&#8212;the people and processes that extinguish new ideas as soon as they begin to course through the organization. Overcoming the Resilience Amnesia Cycle requires deep cultural change:</p><h4><strong>1. Leadership Modeling</strong></h4><p>Executives must visibly value and celebrate prevention work. When an ORR catches a critical issue or a chaos engineering process prevents repeat failures, it should be celebrated as much as shipping a major feature.</p><h4><strong>2. Career Path Recognition</strong></h4><p>Create senior career paths for prevention specialists. Principal SREs and Distinguished Engineers in reliability and resilience should be as valued as their counterparts in product development.</p><h4><strong>3. Shared Context</strong></h4><p>Regularly share the cost and impact of outages across the organization. When teams understand that a 4-hour outage costs $800K, they better appreciate the team that prevents such outages.</p><h4><strong>4. Prevention Storytelling</strong></h4><p>Develop and share organizational narratives around prevention heroes, not just feature development or incident response heroes. 
The engineer who prevents a disaster through thoughtful ORR or systematic chaos engineering should get the same recognition as the one who fixes it.</p><h3>Moving forward</h3><p>As systems become more complex and customer expectations continue rising, the impacts of the prevention paradox become even more important to understand. Organizations that can't maintain prevention investments will find themselves in a continuous cycle of failure and reactive investment.</p><p>The most successful organizations I work with treat prevention work as a strategic capability, not a cost center. They understand that in a world where software is eating everything, resilience is a competitive advantage.</p><p>More importantly, they recognize that organizational learning and adaptation capabilities are what separate resilient organizations from fragile ones. The ability to anticipate problems, learn from incidents, and continuously improve is what enables sustainable success in uncertain environments.</p><p>If you recognize the prevention paradox in your organization, here's how to start addressing it today:</p><ol><li><p><strong>Audit your current prevention work</strong> - What's already happening that leadership doesn't see?</p></li><li><p><strong>Calculate your prevention ROI</strong> - What failures have been avoided and what would they have cost?</p></li><li><p><strong>Document near-misses</strong> - Start building a catalog of prevented failures and lessons learned.</p></li><li><p><strong>Create visibility dashboards</strong> - Make prevention work as visible as feature delivery.</p></li><li><p><strong>Build business-impact narratives</strong> - Connect technical prevention to business outcomes.</p></li><li><p><strong>Measure learning velocity</strong> - Show how quickly the organization adapts and improves.</p></li></ol><p>Don&#8217;t wait until it is too late. The prevention paradox isn't inevitable. 
Organizations that recognize and actively counter it build more reliable systems, retain better engineers, create stronger learning capabilities, and develop sustainable competitive advantages.</p><p>However, it requires conscious effort to value invisible work and resist the cognitive biases that make successful prevention appear unnecessary.</p><p>The best failure is the one that never happens. The challenge is proving it.</p>]]></content:encoded></item><item><title><![CDATA[The Quiet Erosion: How Organizations Drift Into Failure]]></title><description><![CDATA[Photo by Raul Ling]]></description><link>https://newsletter.resiliumlabs.com/p/the-quiet-erosion-how-organizations-drift-into-failure</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/the-quiet-erosion-how-organizations-drift-into-failure</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Sun, 25 May 2025 07:07:57 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8c663908-8592-4397-ace6-7b1d2c9f1a81_1000x562.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pqnq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61aa0e1e-8169-4979-bbc0-c786d3a12da8_1000x562.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pqnq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61aa0e1e-8169-4979-bbc0-c786d3a12da8_1000x562.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pqnq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61aa0e1e-8169-4979-bbc0-c786d3a12da8_1000x562.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!pqnq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61aa0e1e-8169-4979-bbc0-c786d3a12da8_1000x562.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pqnq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61aa0e1e-8169-4979-bbc0-c786d3a12da8_1000x562.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pqnq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61aa0e1e-8169-4979-bbc0-c786d3a12da8_1000x562.jpeg" width="2500" height="1406" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61aa0e1e-8169-4979-bbc0-c786d3a12da8_1000x562.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1406,&quot;width&quot;:2500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!pqnq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61aa0e1e-8169-4979-bbc0-c786d3a12da8_1000x562.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pqnq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61aa0e1e-8169-4979-bbc0-c786d3a12da8_1000x562.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!pqnq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61aa0e1e-8169-4979-bbc0-c786d3a12da8_1000x562.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pqnq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61aa0e1e-8169-4979-bbc0-c786d3a12da8_1000x562.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div><p>Photo by <a 
href="https://www.pexels.com/photo/dramatic-icelandic-coastal-landscape-with-volcanic-rocks-32190263/">Raul Ling</a></p><p><em>The Slack notification arrived at 3:17 AM on a Tuesday: "Payment system down. All hands."</em></p><p><em>Six hours later, as the team sat exhausted around the incident room table, the CTO asked the question we hear after every major outage:</em></p><h3>"How did we get here?"</h3><p>The post-mortem timeline showed no single catastrophic decision. Instead, it revealed dozens of small compromises, each one reasonable at the time:</p><p>"We only reduced test coverage for non-critical features."</p><p>"We temporarily bypassed code review for this urgent fix."</p><p>"We'll address that performance issue next sprint."</p><p>Every decision made perfect sense in isolation. Yet somehow, these incremental changes had accumulated into a system operating at the edge of failure, where one memory leak during peak traffic brought down their entire payment infrastructure.</p><p>This is the story of organizational drift, one of the most dangerous and least understood risks facing software teams, because drift happens slowly, quietly, through rational adaptations to competitive pressure and operational realities.</p><p>What makes drift particularly challenging is recognizing it while it's happening, before it's too late.</p><p>To illustrate how drift happens, I've created the story of TrendCart, a fictional e-commerce platform that experiences this exact pattern. While the company and characters are entirely fictional, the problems they experience are real. They are based on my own experiences and observations.</p><p><em><strong>Disclaimer: TrendCart and all characters in this story are fictional. Any resemblance to real companies, people, or events is purely coincidental. 
This narrative is created solely to illustrate common organizational patterns and does not reference any actual organization.</strong></em></p><div><hr></div><h2>The Story of TrendCart's Gradual Decline</h2><p>Maya joined TrendCart as Lead Developer when the e-commerce platform was still celebrating its Series A funding. With 50,000 monthly active users and a reputation for reliability, TrendCart had carved out a niche for independent fashion designers.</p><p>"Our philosophy is simple," explained Raj, the CTO, during the onboarding process. "We deploy twice monthly, after full testing. Every commit gets two code reviews, unit and integration tests, and a security review. We take our customers' trust seriously." Maya was impressed by the disciplined development process: thorough yet not bureaucratic, with clear guidelines that developers consistently followed.</p><h3>The Pressure Mounts</h3><p>The first cracks appeared during a routine customer advisory call. Designer after designer asked about features that TrendCart didn't have, features that their competitor, CompeteCart, had already launched.</p><p>"When will you have social login?"</p><p>"Why can't I bulk upload products?"</p><p>"CompeteCart's analytics show me exactly which products are trending in real time."</p><p>Maya watched the sales team's faces during these calls. Each missing feature represented lost revenue. By week's end, they were maintaining a "feature gap spreadsheet" that grew longer daily.</p><p>The emergency board meeting happened on a Friday. Maya wasn't invited, but the tension was palpable when leadership came out of the room.</p><p>"Eighteen months," the CEO announced to the engineering team. "We have eighteen months of runway. If we don't close the feature gap and accelerate growth, there won't be a TrendCart to run."</p><p>It was about survival.</p><h3>The Reasonable Compromise</h3><p>Raj called an all-hands engineering meeting. 
"We need to move faster without breaking what works," he began. "I'm proposing we shift to weekly deployments by streamlining our processes."</p><p>The plan seemed logical: Keep thorough testing for payment processing and user data. Reduce the scope of testing for "non-critical" features. Maintain code review, but expedite for urgent fixes. And automate security checks where possible.</p><p>Maya raised her hand. "What counts as 'non-critical'?"</p><p>"Features that don't directly impact transactions or user data," Raj replied. "UI elements, analytics dashboards, recommendation engines, all important, but not catastrophic if they break."</p><p>Weekly deployments meant faster iteration, quicker feature delivery, and better competitive positioning. The compromise felt reasonable: maintain safety for core functions while accelerating everything else.</p><p>The leadership team and developers were happy with the faster pace, and for a while, everything seemed to be going fine.</p><p>For three months, it worked great.</p><h3>Early Warning Signs Most Teams Miss</h3><p>Six months later, Maya noticed troubling patterns. Deployment rollbacks had increased from once every few months to twice in the last month. The on-call rotation was getting paged more frequently for "minor" issues that developers would fix with quick patches rather than proper investigations.</p><p>Test coverage showed 71%, but Maya knew the number lied. Developers had learned to game the metrics: tests that verified mocks instead of actual functionality, integration tests marked as "flaky" and skipped in continuous integration (CI), and business logic validated only through happy-path scenarios.</p><p>"Look at this payment processing change," Maya said to her teammate Alex during a code review. "The tests pass, but they're mocking the actual payment gateway. We're testing that our mocks work, not that payments work."</p><p>During the monthly retrospective, Maya presented her concerns. 
"Yesterday's deploy broke user avatars for three hours. We only discovered it because a customer tweeted a screenshot. We're accumulating technical debt and we're getting comfortable with it.</p><p>"These issues aren't critical individually," she continued, "but they're creating systematic brittleness."</p><p>The product manager's response was fast: "Every platform has technical debt. Our KPIs look excellent. Look, cart abandonment is down 15%, and transaction volume is up 23% this quarter. Let's not slow momentum over edge cases."</p><p>The team moved on.</p><h3>The First Incident</h3><p>Black Friday weekend. Traffic peaked at 3x normal levels. At 2:47 PM, the entire platform went down.</p><p>Thirty-seven minutes of complete outage during the busiest shopping period of the year.</p><p>$127,000 in lost revenue, nearly 3% of their yearly revenue, plus 3,000 abandoned carts and 12 customer complaints. The numbers didn&#8217;t look good.</p><p>But Maya knew the real cost was likely higher.</p><p>The investigation revealed a memory leak that had been flagged during sprint planning three months earlier. Categorized as "performance optimization," it lived in the ever-growing backlog of "technical debt to address later." Under normal load, it was invisible. During peak traffic, it consumed all the available memory, causing the application servers to crash.</p><p>The post-mortem focused on immediate fixes, including memory monitoring alerts, load balancing improvements, and automated restart procedures.</p><p>"We'll review the performance backlog next quarter," Raj concluded.</p><p>Within two weeks, the performance issues were again labeled "not customer-impacting" and deprioritized for feature work.</p><p>Maya started noticing a pattern in the incident reports. 
Each one ended with "this specific issue has been resolved," rather than addressing the category of problems that made this possible.</p><h3>The Normalization of Deviance Pattern</h3><p>A month later, a more serious incident happened: a privacy breach. Customer addresses were exposed to the wrong users, and the bug had actually merged customer profiles, so orders from one customer appeared in another's account history. It took six hours to fully identify the scope because logging wasn't detailed enough to trace which accounts had been affected.</p><p>The investigation revealed a cascade of small failures: A developer had pushed directly to main to "quickly fix" a merge conflict. The automated tests passed because they didn't test cross-user data isolation. The staging environment didn't have enough data volume to reproduce the issue. The code review was marked as "approved," but clearly it hadn&#8217;t been thorough, since none of the comments had been addressed. Instead, it was merged with the message &#8220;Added a few comments, but approving to make the feature release date. Cutting a ticket to the backlog.&#8221;</p><p>Maya expected this to be a wake-up call. Instead, at the next incident review, she heard familiar patterns: "The root cause was the direct push to main, again. We'll fix that by preventing it altogether." The conversation never addressed why the developer felt pressure to bypass the process.</p><h3>How Small Compromises Compound</h3><p>Maya, feeling things were getting out of control, started mapping their journey. 
Using incident reports, deployment metrics, and her own observations, she created a timeline showing TrendCart's drift from disciplined engineering to normalized risk-taking.</p><p>She highlighted each small decision that, while logical in isolation, had collectively eroded their safeguards.</p><p>When the fourth incident occurred, a billing bug that undercharged customers for three weeks, Maya presented her diagram to the executive team.</p><p>"We didn't make a single decision to be unsafe," she explained. "We made hundreds of small trade-offs, each seeming reasonable at the time. But look where we've ended up."</p><p>She traced their path across the boundaries. "Here's where we first reduced the testing scope. Here's where we started accepting pull requests with failing tests if they weren't in 'critical paths.' Here's where we stopped requiring security reviews for changes to the user data models."</p><h2>Recognizing Drift Before It's Too Late</h2><p>Maya's presentation to executives didn't immediately change everything. The team asked good questions, but the first response was predictable: "We need better tooling to prevent these specific issues."</p><p>Then Maya showed them data from their own customer support tickets. The volume of "weird bugs" had tripled over the past six months. Customers were reporting issues that individually seemed minor but collectively painted a picture of a platform becoming unreliable.</p><p>"Our Net Promoter Score has dropped eight points," the customer success manager added. "Customers are saying we 'used to be reliable' but now they're not sure they can trust us with their business."</p><p>"When did we leave the safe zone?" the CEO asked.</p><p>"About seven months ago," Maya replied. 
"When we started treating the tests as a suggestion rather than a requirement."</p><p>"And the red zone?"</p><p>"We've been operating in the red zone since we decided that certain types of bugs were 'acceptable business risks' rather than issues to be fixed. That normalization of problems is what concerns me most."</p><h3>The Recognition</h3><p>The executive team finally understood they weren't looking at isolated incidents requiring specific fixes. They were seeing symptoms of systematic drift from engineering discipline toward normalized risk-taking.</p><p>What had felt like necessary adaptations to competitive pressure had gradually transformed into dangerous corner-cutting. Speed and agility had become excuses for compromising foundational practices.</p><p>But Maya proposed a solution that surprised them.</p><p>"We don't need to slow down or revert to monthly deployments," she explained. "We need to make drift visible and intentional rather than invisible and accidental."</p><p>The changes that followed weren't dramatic. They didn't slow down development or return to monthly deployments. Instead, they implemented what Maya called "intentional friction": small barriers designed to make unsafe shortcuts visible and require conscious decisions, rather than letting things drift.</p><p>They established "guardrail reviews": quarterly assessments specifically designed to call out drift in their development practices. 
They created clear, non-negotiable boundaries for security, testing, and code quality.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RySj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69e863e-e5bc-4745-aae0-69701be397c7_1000x649.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RySj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69e863e-e5bc-4745-aae0-69701be397c7_1000x649.png 424w, https://substackcdn.com/image/fetch/$s_!RySj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69e863e-e5bc-4745-aae0-69701be397c7_1000x649.png 848w, https://substackcdn.com/image/fetch/$s_!RySj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69e863e-e5bc-4745-aae0-69701be397c7_1000x649.png 1272w, https://substackcdn.com/image/fetch/$s_!RySj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69e863e-e5bc-4745-aae0-69701be397c7_1000x649.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RySj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69e863e-e5bc-4745-aae0-69701be397c7_1000x649.png" width="2628" height="1706" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f69e863e-e5bc-4745-aae0-69701be397c7_1000x649.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1706,&quot;width&quot;:2628,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!RySj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69e863e-e5bc-4745-aae0-69701be397c7_1000x649.png 424w, https://substackcdn.com/image/fetch/$s_!RySj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69e863e-e5bc-4745-aae0-69701be397c7_1000x649.png 848w, https://substackcdn.com/image/fetch/$s_!RySj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69e863e-e5bc-4745-aae0-69701be397c7_1000x649.png 1272w, https://substackcdn.com/image/fetch/$s_!RySj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69e863e-e5bc-4745-aae0-69701be397c7_1000x649.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div><p>Timeline showing TrendCart's organizational drift from safe practices through warning signs to major incidents and recovery</p><h3>The Adaptation Awareness Framework</h3><p>Most importantly, they created a framework that distinguished between two types of changes:</p><ul><li><p><strong>Reactive adaptation</strong>: Responding to immediate pressures (what led to drift)</p></li></ul><ul><li><p><strong>Proactive adaptation</strong>: Anticipating systemic risks (what Maya demonstrated)</p></li></ul><p>The framework didn't restrict adaptation. 
Instead, it encouraged more Maya-style adaptation while making pressure-driven adaptations visible for organizational learning.</p><p>Maya had demonstrated exactly the adaptive capacity every organization needs: the ability to sense emerging risks, connect patterns across incidents, and proactively adjust the system's safety boundaries.</p><p>The goal was to cultivate more people who could adapt like Maya, not prevent adaptation.</p><p>The framework included:</p><p><strong>Pattern Recognition</strong>: Regular analysis of when, why, and how teams adapted standard processes <br><strong>Systemic Learning</strong>: Understanding what adaptations reveal about underlying system pressures <br><strong>Adaptive Capacity Building</strong>: Strengthening the organization's ability to handle unexpected situations safely <br><strong>Continuous Sense-Making</strong>: Using adaptations as data to improve the system rather than just controlling behavior</p><h3>The Recovery</h3><p>The transformation wasn't immediate, but it was measurable. Within a few months, they had restored test coverage to a healthy 83% through systematic debt reduction. They also reduced deployment rollbacks by 60% through improved quality gates and practically eliminated &#8220;hot fix&#8221; pushes to production. They began to see improvements in customer tickets and on-call operations.</p><p>"The key insight," Maya explained during an all-hands meeting, "wasn't choosing between speed and safety. It was choosing between intentional trade-offs and invisible drift."</p><h3>Epilogue</h3><p>A few months later, TrendCart's main competitor suffered a major data breach that affected all of its customer records. The breach, caused by accumulated security compromises, forced the company to rebuild its platform and resulted in severe fines and extensive negative publicity.</p><p>Designers who had initially left TrendCart for CompeteCart's faster feature delivery began returning. 
They cited reliability and trust as primary factors.</p><p>The breach proved so costly that CompeteCart ultimately shut down its service.</p><p>"The irony," Maya reflected during a conference presentation about their journey, "is that we thought we were making necessary trade-offs to remain competitive. But by recognizing and managing our drift, we ended up with both speed and safety, which turned out to be the real competitive advantage all along."</p><p>Maya had adapted, too. When she first noticed the test quality issues, she could have focused solely on her own code and remained quiet. Instead, she adapted from individual contributor to system advocate.</p><p>That's what resilient organizations need: people who adapt their perspective from local optimization to system health. By encouraging this kind of positive adaptation, organizations actually become more adaptable as a whole.</p><p>A few months later, Maya was promoted to principal engineer.</p><div><hr></div><h2>The Cost of Invisible Drift</h2><p>Why am I telling this story? Organizational failure rarely announces itself with dramatic warning signs. Instead, it accumulates through thousands of small compromises, each reasonable in isolation, collectively lethal in combination.</p><p>The timeline illustrated above shows what most post-mortems forget to mention. The "root cause" is rarely the final trigger itself; it's the slow erosion of safety boundaries that made that incident inevitable. The memory leak, the privacy breach, and the billing bug weren't separate problems requiring separate solutions. They were symptoms of a system that had gradually drifted from good engineering practices toward normalized risk-taking.</p><p>The most insidious aspect of drift is how it makes teams complicit in their own degradation. When test coverage drops over time, no single day feels dangerous. When minor incidents become routine, each one seems manageable. 
When shortcuts become standard practice, organizations believe they're adapting to business realities rather than compromising their foundation.</p><h3>Why Traditional Approaches Fail</h3><p>Most organizations fail to detect drift because they focus on symptoms rather than causes. They detect problems after drift has already occurred, rather than preventing drift from accumulating in the first place.</p><p><strong>Drift happens in the space between formal processes and daily reality.</strong></p><p>It's the accumulation of small adaptations that individually seem reasonable but collectively undermine system safety.</p><h3>Intentional Trade-offs</h3><p>Every organization needs people who can adapt the way Maya did: sensing patterns, surfacing concerns, and strengthening the system's capacity to handle surprises.</p><p>Speed and safety aren't mutually exclusive, but they require constant vigilance to maintain together. The goal isn't to eliminate all risk or prevent all adaptations. Instead, it's making risk visible and consciously managed while building the organization's capacity to adapt safely.</p><p>To successfully balance speed and safety, you need to:</p><ul><li><p>Track leading indicators of drift, not just incident outcomes</p></li><li><p>Treat process adaptations as valuable learning data, not violations to prevent</p></li><li><p>Celebrate teams that surface systemic issues early, not just those that respond to incidents quickly</p></li><li><p>Build adaptive capacity&#8212;the ability to handle unexpected situations safely</p></li></ul><h3>The Tale of Two Adaptations</h3><p>This story reveals something important that I almost overlooked when writing this blog post. 
TrendCart has experienced two completely different types of adaptation.</p><p><strong>Drift-Inducing Adaptations</strong> (what the team did):</p><ul><li><p>Individual responses to immediate pressure</p></li><li><p>Invisible to the broader organization</p></li><li><p>Focused on local optimization</p></li><li><p>Accumulated without systemic awareness</p></li></ul><p><strong>Resilience-Building Adaptations</strong> (what Maya demonstrated):</p><ul><li><p>Proactive pattern recognition across the system</p></li><li><p>Made concerns visible to leadership</p></li><li><p>Focused on long-term system health</p></li><li><p>Enhanced organizational learning capacity</p></li></ul><h3>Next Steps</h3><p>The next time someone asks, "How did we get here?" after a major incident, remember that the answer probably isn't in the immediate timeline. It's in the months or years of small compromises that made that moment inevitable.</p><p><strong>Ask yourself:</strong></p><ul><li><p>What processes has your team adapted or "streamlined" in the past year?</p></li><li><p>How many of your quality metrics could be gamed without affecting actual quality?</p></li><li><p>When did you last review the cumulative impact of individual process adaptations?</p></li><li><p>What would your customers say about your reliability compared to six months ago?</p></li></ul><p><strong>Then take action:</strong></p><ol><li><p><strong>This week:</strong> Implement basic drift tracking for your most critical processes</p></li><li><p><strong>This month:</strong> Conduct a retrospective focused on accumulated technical debt and process adaptations</p></li><li><p><strong>This quarter:</strong> Establish guardrail reviews and leading indicator monitoring</p></li><li><p><strong>This year:</strong> Build systematic adaptation awareness into your organizational culture</p></li></ol><p><em><strong>Note: See Annexes for a more comprehensive set of actions</strong></em></p><p>Transformation is available to every team willing 
to choose conscious trade-offs over invisible drift.</p><p>The question isn't whether your organization is making compromises; every organization does. The question is whether you're making them intentionally, with full awareness of their cumulative effect, or allowing them to accumulate invisibly until the next major incident.</p><p><strong>Choose intentional drift detection. Choose systematic quality practices. Choose competitive advantage through reliability.</strong></p><p>Your customers, your team, and your future self will thank you.</p><div><hr></div><h1>Implementation Toolkit</h1><p><em>Note: This toolkit represents my current practices, informed by research and field experience; however, organizational resilience is deeply contextual. I welcome suggestions, corrections, and additional strategies based on your own experiences. Please reach out with improvements or examples of what has worked (or hasn't worked) in your organization.</em></p><h2>The Anatomy of Organizational Drift</h2><p>There are typically four drift phases that an organization experiences:</p><h3>Phase 1: The &#8220;Reasonable&#8221; Compromise</h3><ul><li><p>External pressure creates the need for adaptation</p></li><li><p>Leadership proposes logical process changes</p></li><li><p>Team maintains core safety practices while "optimizing" peripheral ones</p></li><li><p>Early metrics show positive results</p></li><li><p>Gap emerges between work-as-imagined (official processes) and work-as-done (actual practice)</p></li></ul><p><strong>Warning Signs:</strong></p><ul><li><p>Increased deployment frequency without proportional quality investment</p></li><li><p>Redefinition of "critical" vs "non-critical" systems</p></li><li><p>Process adaptations that become regular patterns</p></li></ul><h3>Phase 2: Metric Gaming</h3><ul><li><p>Teams adapt to new incentives by optimizing numbers rather than outcomes</p></li><li><p>Quality indicators lose correlation with actual quality</p></li><li><p>Small issues 
accumulate but remain below the visibility threshold</p></li><li><p>Teams engage in "satisficing" behavior, doing just enough to meet targets rather than achieve actual quality</p></li></ul><p><strong>Warning Signs:</strong></p><ul><li><p>Test coverage numbers hold steady while actual test quality deteriorates</p></li><li><p>Increased incidents that get &#8220;quick-fixed&#8221; rather than properly investigated</p></li><li><p>Growing backlog of "technical debt to address later"</p></li></ul><h3>Phase 3: Normalization</h3><ul><li><p>Incidents become less exceptional</p></li><li><p>Each problem gets treated in isolation rather than as part of a pattern</p></li><li><p>Team culture shifts from "how do we prevent this?" to "how do we fix it fast?"</p></li><li><p>Organization loses "chronic unease", the healthy skepticism about system safety that prevents complacency</p></li></ul><p><strong>Warning Signs:</strong></p><ul><li><p>Post-mortems focus on specific fixes rather than systemic improvements</p></li><li><p>Increasing acceptance of "edge cases" and "known issues"</p></li><li><p>Customer complaints about reliability start appearing</p></li></ul><h3>Phase 4: Crisis or Recovery</h3><ul><li><p>Accumulated risk manifests as a major incident or competitive threat</p></li><li><p>Organization either recognizes the pattern and implements systematic changes, or continues drifting until catastrophic failure forces dramatic restructuring</p></li><li><p>Important note: Even successful recovery requires ongoing vigilance, as drift is a continuous process that can restart at any time</p></li></ul><p><em>This framework draws from Sidney Dekker's concepts of "drift into failure" and "work-as-imagined vs work-as-done," Diane Vaughan's "normalization of deviance," and supporting research in organizational safety by James Reason, Charles Perrow, and Karl Weick.</em></p><div><hr></div><h2>Building Drift-Aware Organizations</h2><p>Here are strategies for making drift visible before 
it becomes dangerous:</p><h3>Leading Indicators to Monitor</h3><p><strong>Quality Metrics</strong></p><ul><li><p>Test coverage trends, not just absolute numbers</p></li><li><p>Test execution time and flakiness rates</p></li><li><p>Code review rejection rates and bypass frequency</p></li><li><p>Deployment rollback frequency and time-to-restore</p></li><li><p>Work-as-done vs work-as-imagined gaps (actual vs documented processes)</p></li></ul><p><strong>Process Health</strong></p><ul><li><p>Pattern analysis of process adaptations and their justifications</p></li><li><p>Time between incident detection and customer notification</p></li><li><p>Percentage of incidents that are repeat issues</p></li><li><p>Cross-team dependency failure rates</p></li><li><p>Weak signal detection rate (near-misses identified and reported)</p></li></ul><p><strong>Cultural Indicators</strong></p><ul><li><p>Employee survey responses about process confidence</p></li><li><p>Voluntary overtime trends during deployment periods</p></li><li><p>Knowledge transfer effectiveness between team members</p></li><li><p>Incident response stress levels and team satisfaction</p></li><li><p>"Chronic unease" levels (healthy skepticism about system safety)</p></li><li><p>Psychological safety scores for reporting problems and concerns</p></li></ul><h3>Systematic Drift Detection</h3><p><strong>Guardrail Reviews</strong></p><ol><li><p>Map all process adaptations from the previous quarter</p></li><li><p>Understand why teams felt adaptations were necessary</p></li><li><p>Assess the cumulative impact of individual compromises</p></li><li><p>Identify patterns that indicate systematic drift</p></li><li><p>Review and update boundaries based on emerging risks and system changes</p></li></ol><p><strong>Adaptation Awareness Implementation</strong></p><ol><li><p>Regular pattern analysis of when, why, and how teams adapt processes</p></li><li><p>Focus on understanding systemic pressures that drive adaptations</p></li><li><p>Use 
adaptations as learning opportunities rather than compliance violations</p></li><li><p>Build organizational capacity to adapt safely rather than just limiting adaptations</p></li><li><p>Create psychological safety for discussing process pressures without blame</p></li></ol><p><strong>Pattern Recognition Training</strong></p><ol><li><p>Train incident responders to identify systemic issues</p></li><li><p>Implement post-mortems that surface contributing factors</p></li><li><p>Maintain cross-incident trend analysis</p></li><li><p>Regularly review themes across multiple incidents</p></li><li><p>Focus on "how the system normally succeeds," not just how it fails (Safety-II approach)</p></li></ol><h3>Recovery Strategies</h3><p>For organizations already experiencing drift:</p><p><strong>Stabilization</strong></p><ol><li><p>Implement automated quality gates that cannot be bypassed</p></li><li><p>Establish an emergency change review process</p></li><li><p>Create a visible dashboard of leading indicators</p></li><li><p>Begin systematic documentation of the current state</p></li><li><p>Restore "chronic unease" through leadership modeling of safety-conscious behavior</p></li></ol><p><strong>Assessment</strong></p><ol><li><p>Conduct an audit of current practices vs. 
stated policies</p></li><li><p>Map all informal processes and shortcuts currently in use</p></li><li><p>Identify the highest-risk areas where drift has progressed furthest</p></li><li><p>Create a prioritized remediation plan</p></li><li><p>Interview frontline workers to understand work-as-done vs work-as-imagined gaps</p></li></ol><p><strong>Recovery</strong></p><ol><li><p>Implement changes gradually to avoid disrupting operations</p></li><li><p>Provide training and support for teams returning to disciplined practices</p></li><li><p>Establish positive incentives for quality behaviors</p></li><li><p>Measure and communicate progress regularly</p></li><li><p>Build adaptive capacity&#8212;ability to respond to unexpected situations</p></li></ol><p><strong>Reinforcement</strong></p><ol><li><p>Celebrate examples of teams surfacing systemic issues early</p></li><li><p>Share stories of how quality practices prevented incidents</p></li><li><p>Make adaptation awareness a regular part of team retrospectives</p></li><li><p>Maintain executive visibility into adaptation metrics</p></li><li><p>Institutionalize "productive failure"&#8212;learning from near-misses and small failures</p></li><li><p>Create psychological safety for reporting concerns without blame</p></li></ol><h3>Maintaining Vigilance</h3><ul><li><p><strong>Continuous monitoring</strong>: Drift detection is ongoing, not a one-time fix</p></li><li><p><strong>Boundary management</strong>: Regularly review and update safety boundaries as systems evolve</p></li><li><p><strong>Learning orientation</strong>: Treat drift detection as organizational learning, not compliance checking</p></li><li><p><strong>Leadership commitment</strong>: Executive teams must model and reinforce drift-resistant behaviors</p></li><li><p><strong>Adaptive capacity building</strong>: Strengthen the organization's ability to handle unexpected situations safely</p></li></ul><div><hr></div><h2>Further Reading</h2><p><strong>Sidney Dekker</strong> - 
"Drift into Failure: From Hunting Broken Components to Understanding Complex Systems" (2011)</p><p>Dekker's work provides the conceptual framework and vocabulary for understanding how complex systems gradually move toward failure boundaries through small adaptations.</p><p><strong>Diane Vaughan</strong> - "The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA" (1996)</p><p>Vaughan's research introduced the concept of "normalization of deviance" - how organizations systematically lower their standards for what constitutes acceptable risk.</p><p><strong>Charles Perrow</strong> - "Normal Accidents: Living with High-Risk Technologies" (Updated Edition, 1999)</p><p>Perrow argues that multiple and unexpected failures are built into society's complex and tightly coupled systems, and that accidents are unavoidable and cannot be designed around. His concept of "normal accidents" explains why some failures are inevitable in complex systems, regardless of safety measures.</p><p><strong>Erik Hollnagel, David Woods &amp; Nancy Leveson</strong> (Eds.) - "Resilience Engineering: Concepts and Precepts" (2006)</p><p>This book charts the efforts being made by researchers, practitioners and safety managers to enhance resilience by looking for ways to understand the changing vulnerabilities and pathways to failure.</p><p><strong>James Reason</strong> - "Human Error" (1990) and "Managing the Risks of Organizational Accidents" (1997)</p><p>Reason introduces the Swiss cheese model, a conceptual framework for the description of accidents based on the notion that accidents will happen only if multiple barriers fail.</p><p><strong>James Reason</strong> - "The Human Contribution: Unsafe Acts, Accidents and Heroic Recoveries" (2008)</p><p>Reason's later work that explores both the positive and negative aspects of human performance in complex systems.</p><p><strong>Karl Weick</strong> - "Sensemaking in Organizations" (1995)</p><p>Karl E. 
Weick's book highlights how the "sensemaking" process &#8212; the creation of reality as an ongoing accomplishment that takes form when people make retrospective sense of the situations in which they find themselves &#8212; shapes organizational structure and behavior.</p><p><strong>Karl Weick &amp; Kathleen Sutcliffe</strong> - "Managing the Unexpected: Sustained Performance in a Complex World" (3rd Edition, 2015)</p><p>Essential reading on high-reliability organizations and how some organizations maintain safety despite operating in hazardous environments.</p><p><strong>Nancy Leveson</strong> - "Engineering a Safer World: Systems Thinking Applied to Safety" (2011)</p><p>Leveson introduces STAMP, a systems-thinking paradigm for system safety engineering that has seen increasing adoption across the transportation industry.</p><p><strong>Sidney Dekker</strong> - "Safety Differently: Human Factors for a New Era" (2014)</p><p>Dekker's evolution from traditional safety thinking toward a more nuanced understanding of how safety is created in practice.</p><p><strong>Erik Hollnagel</strong> - "Safety-I and Safety-II: The Past and Future of Safety Management" (2014)</p><p>The traditional safety concept, known as Safety-I, and its associated methods and models have significantly contributed to enhancing the safety of industrial systems. However, they have proven insufficient for application to complex socio-technical systems. 
It marks the shift from reactive to proactive safety thinking.</p>]]></content:encoded></item><item><title><![CDATA[Beyond Root Cause: A Better Approach to Understanding Complex System Failures]]></title><description><![CDATA[The Limitations of Traditional Root Cause Analysis]]></description><link>https://newsletter.resiliumlabs.com/p/beyond-root-cause-analysis</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/beyond-root-cause-analysis</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Tue, 20 May 2025 06:43:14 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a2ac74de-846f-4aeb-874e-0f20a750e8c4_1000x649.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The Limitations of Traditional Root Cause Analysis</h2><p>In a recent <a href="https://iluminr.io/leadership/gamechangers-resilience-the-prevention-paradox/">interview</a> with <a href="https://iluminr.io">Iluminr</a>, I was asked the following question:</p><blockquote><h3>"Which framework do you think needs to be retired or radically rethought?"</h3></blockquote><p>My answer was clear: the traditional <strong>"root cause analysis"</strong> and <strong>"5 whys"</strong> frameworks.</p><p>This is the director&#8217;s cut version of that answer.</p><p>However, I should first confess: I once believed deeply in the 5 Whys method. It's featured in numerous books and endorsed by industry leaders. As a young engineer, questioning these established practices wasn't my first instinct. I not only used it for years but also passionately shared it with colleagues and taught it to others.</p><p>Who could blame me? The framework is intuitive, easy to explain, and simple to implement. It makes perfect sense&#8230; on the surface. Keep asking "why" until you find the ultimate cause. 
Doing that, the promise goes, will prevent the incident from recurring.</p><h4><strong>The Turning Point</strong></h4><p>As I grew in seniority and gained more experience with complex systems, I became increasingly uncomfortable with the limitations I was seeing. The implicit accusatory tone of "why" questions started to bother me. Team members would become defensive rather than reflective during post-mortems.</p><p>Something seemed wrong.</p><p>More importantly, I noticed that our "solutions" weren't preventing similar incidents from occurring. We'd fix the specific issue we identified, only to have a different manifestation of the same systemic problem appear again and again.</p><p>Once I understood that the goal of incident investigation was learning, everything changed.</p><p>And once you see the benefits of a more nuanced approach, there's no going back.</p><h4><strong>The Fundamental Problem</strong></h4><p>These traditional approaches are based on outdated, linear thinking that assumes failures have single, identifiable causes that can be eliminated. But the truth is that's not how complex systems work.</p><p>Failures in complex systems never have one single root cause. Instead, they have multiple contributing factors that combine to create failures. And it is the accumulation of these contributing factors over time that eventually breaks the system.</p><p>And these failures are non-deterministic, meaning that repeating the same conditions would likely lead to different outcomes. That's because systems operate in dynamic environments where conditions and context continuously change.</p><h4><strong>Why These Frameworks Persist</strong></h4><p>Yet despite these obvious limitations, these frameworks persist in organizations worldwide. They're comforting in their simplicity. They give us the illusion of control. 
You can find the root cause, fix it, and the problem is solved.</p><p>Organizations also like them because they often lead to solutions that appear straightforward and actionable. "Retrain the engineer" or "Add another approval step to the process" are easier actions to document than "Our system has fundamental design flaws that interact in unpredictable ways."</p><h2>Real-World Examples: When the 5 Whys Lead to Wrong Conclusions</h2><p>In the interview, I shared an example where a company had experienced a 2-hour database outage. Their initial 5 Whys analysis went something like this:</p><ol><li><p><strong>Why did the database go down?</strong> <em>Because it ran out of storage space.</em></p></li><li><p><strong>Why did it run out of storage space?</strong> <em>Because the log files grew too large.</em></p></li><li><p><strong>Why did the log files grow too large?</strong> <em>Because log rotation wasn't functioning properly.</em></p></li><li><p><strong>Why wasn't log rotation functioning properly?</strong> <em>Because the engineer who set it up used incorrect settings.</em></p></li><li><p><strong>Why did the engineer use incorrect settings?</strong> <em>Because they weren't properly trained in database configuration.</em></p></li></ol><p>Conclusion: <em>&#8220;We need better training for engineers on database configuration.&#8221;</em></p><p>And that seems OK. It makes sense, right?</p><p>But if you really think about it, this analysis practically engineered a linear story ending with "insufficient training." 
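The linearity of that analysis is easy to see if you write the chain down as data. Here is a toy Python sketch (the cause table and helper are illustrative paraphrases of the five answers above, not how any real analysis tool works):

```python
# Toy model of the 5 Whys above: a strictly linear chain of single causes.
# The lookup table is illustrative, paraphrased from the analysis in the post.
single_cause_of = {
    "database down": "out of storage space",
    "out of storage space": "log files too large",
    "log files too large": "log rotation misconfigured",
    "log rotation misconfigured": "engineer used incorrect settings",
    "engineer used incorrect settings": "insufficient training",
}

def five_whys(symptom: str) -> list[str]:
    """Follow exactly one 'why' edge at a time, five times."""
    chain = [symptom]
    for _ in range(5):
        chain.append(single_cause_of[chain[-1]])
    return chain
```

By construction, the walk can only ever surface one path and one terminal "root cause", whatever else was going on in the system at the time.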
It implicitly blamed the engineer and missed systemic issues.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0kzx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb190b01-6868-4465-995b-5753c2c63dd3_1000x649.png" data-component-name="Image2ToDOM"><img src="https://substackcdn.com/image/fetch/$s_!0kzx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb190b01-6868-4465-995b-5753c2c63dd3_1000x649.png" width="2202" height="1430" alt="" loading="lazy"></a></figure></div><h2>Complex Systems Demand Different Thinking</h2><p>Instead of accepting that conclusion, we added some different questions alongside their existing framework.</p><p>The instruction was to be curious about the context and to explore different dimensions: culture, processes, and tools.</p><p>After a few iterations, we eventually ended up with something like this:</p><ol><li><p><strong>How did you first become aware of the issue?</strong> <em>I noticed alerts showing unusual disk usage patterns an hour before the crash, but they weren't critical alerts, so I was finishing another urgent task first.</em></p></li><li><p><strong>How did the system appear to be functioning at that time?</strong> <em>It seemed normal except for the disk usage. We've had similar warnings before that resolved themselves, so I wasn't immediately concerned.</em></p></li><li><p><strong>What were you focusing on when making decisions about priorities?</strong> <em>I was trying to balance multiple alerts. 
Since we typically prioritize customer-facing issues, I was working on a payment processing issue first.</em></p></li><li><p><strong>How was the log rotation system originally set up?</strong> <em>It was configured during our migration six months ago. We copied settings from our test environment, which had different usage patterns. The rotation was set for weekly rather than daily because test data volumes were much smaller.</em></p></li><li><p><strong>How do changes to these systems typically get reviewed?</strong> <em>We usually have a checklist for infrastructure changes, but during the migration period, we moved quickly to meet deadlines, and some review steps were abbreviated.</em></p></li></ol><p>This new approach led the team to different conclusions. Same incident, different ending. They ended up implementing several improvements instead of just pushing for &#8220;more training&#8221;:</p><ul><li><p>Revising alert classification to better distinguish critical issues</p></li><li><p>Establishing dedicated maintenance periods</p></li><li><p>Enhancing the infrastructure change review process</p></li><li><p>Creating more accurate test environments</p></li><li><p>Addressing workload prioritization issues</p></li></ul><p>The key difference wasn't just in asking better questions. 
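One way to see the shift is in the shape of the output each method produces: the 5 Whys yields a single chain ending in a single fix, while the contextual questions yield a set of interacting contributing factors. A toy sketch (factor names and descriptions are illustrative paraphrases of the answers above):

```python
# Toy contrast between the two outputs for the same outage.

# The 5 Whys produced one linear chain and therefore one fix:
root_cause_analysis = ["insufficient training"]

# The contextual questions produced a set of interacting contributing factors,
# each pointing at its own systemic improvement:
contributing_factors = {
    "alerting": "disk-usage warnings were non-critical and easy to defer",
    "workload": "customer-facing issues were, reasonably, handled first",
    "configuration": "rotation settings were copied from a low-volume test environment",
    "process": "migration deadlines abbreviated the usual change review",
}

# A set of factors supports several improvements; a single 'root cause' supports one.
assert len(contributing_factors) > len(root_cause_analysis)
```

That difference in shape is why the team shipped a handful of systemic changes instead of a single training action item.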
The approach fundamentally recognized that incidents emerge from complex interactions between people, technology, and organizational factors, rather than from a single cause or person's mistake.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S5hu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c04a116-0751-493b-809a-38708c45a19b_1000x653.png" data-component-name="Image2ToDOM"><img src="https://substackcdn.com/image/fetch/$s_!S5hu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c04a116-0751-493b-809a-38708c45a19b_1000x653.png" width="2200" height="1436" alt="" loading="lazy"></a></figure></div><p>Here is another one, where a critical service went down during a deployment:</p><ol><li><p><strong>Why did the service fail?</strong> Because an invalid configuration was deployed.</p></li><li><p><strong>Why was an invalid configuration deployed?</strong> Because it wasn't properly tested.</p></li><li><p><strong>Why wasn't it tested?</strong> Because the engineer was rushing to meet a deadline.</p></li><li><p><strong>Why was there a rush?</strong> Because the project was running behind schedule.</p></li><li><p><strong>Why was it behind schedule?</strong> Because the estimate was too optimistic.</p></li></ol><p>Conclusion: <em>&#8220;We need to improve the estimation process.&#8221;</em></p><p>Instead, the new approach revealed:</p><ul><li><p>The deployment tools made it too easy to accidentally include unrelated changes</p></li><li><p>The monitoring system didn't catch the issue because it was designed to detect hard failures, not degraded performance</p></li><li><p>An 
engineer had been doing manual system checks that caught several issues early, but this wasn't a formal practice</p></li><li><p>The system degraded gradually rather than failing immediately, making cause-and-effect relationships harder to establish</p></li></ul><p>This, too, led to multiple improvements, including better deployment tooling, enhanced monitoring, formalized morning system checks, and a deeper understanding of how our services degrade under specific conditions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7Ffh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1c60f6-3c8d-4bd3-9674-40e7f5cb4d20_1000x667.jpeg" data-component-name="Image2ToDOM"><img src="https://substackcdn.com/image/fetch/$s_!7Ffh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1c60f6-3c8d-4bd3-9674-40e7f5cb4d20_1000x667.jpeg" width="6240" height="4160" alt="" loading="lazy"></a></figure></div><h2>The Trojan Horse Approach: Implementing Change Without Resistance</h2><p>After years of trying to improve how teams analyze incidents, I've learned that announcing "your approach is wrong!" rarely works. Instead, I've developed what I call the "Trojan Horse" approach to changing incident analysis practices.</p><p>Rather than launching a frontal assault on established methodologies, I've found it more effective to introduce new thinking from within existing frameworks. 
Just like that old wooden horse, this approach appears harmless but carries within it ideas that can transform how organizations understand incidents.</p><p>Here's how I typically introduce it:</p><ol><li><p>Start with the familiar format of a post-incident review that leadership expects</p></li><li><p>Gradually introduce open-ended "what" and "how" questions alongside the traditional "why" questions</p></li><li><p>Be curious about the context and explore its various dimensions, including culture, processes, and tools</p></li><li><p>Highlight the richer insights and context these alternative questions produce</p></li><li><p>Expand the scope beyond finding a single root cause to mapping the system interactions and dynamics</p></li></ol><p>Over time, organizations shift from simplistic root cause thinking to a more nuanced understanding of complex systems, without the resistance that often comes with rejecting established methods outright.</p><h2>Practical Questions That Transform Incident Analysis</h2><p>What I've found most effective is introducing subtle but powerful questions into existing processes:</p><ul><li><p>"What surprised you during this incident?"</p></li><li><p>"Where did your understanding of the system prove incorrect?"</p></li><li><p>"How did this make sense to everyone involved at the time?"</p></li><li><p>"What pressures and constraints shaped the environment in which decisions were made?"</p></li><li><p>"Who knew things that others didn't?"</p></li><li><p>"What were we afraid to talk about before this problem happened?"</p></li></ul><p>These questions don't disrupt the familiar framework but gently expand thinking beyond simplistic cause-and-effect reasoning.</p><h2>Language Evolution: Shifting From "Root Cause" to "Contributing Factors"</h2><p>Similarly, I've found you can transform how teams conceptualize incidents by gradually shifting terminology:</p><ul><li><p>From "root cause" to "contributing 
factors"</p></li><li><p>From "human error" to "systemic conditions"</p></li><li><p>From "failure" to "unexpected behavior" or "surprise"</p></li><li><p>From "preventing" to "learning"</p></li><li><p>From "cause" to "influence"</p></li></ul><p>Most people won't even notice these subtle shifts happening in conversations and documentation, but over time, they profoundly change how incidents are understood.</p><h2>Building Sustainable Improvement in Your Organization</h2><p>The Trojan Horse approach works because it acknowledges the real-world constraints we all face:</p><ul><li><p>Teams have limited time for incident reviews</p></li><li><p>People have varying levels of expertise in systems thinking</p></li><li><p>Leaders want clear, actionable outcomes</p></li><li><p>Everyone feels pressure to "just fix it and move on"</p></li></ul><p>It works precisely because it respects these constraints rather than fighting against them. It enables continuous improvement without demanding radical change all at once.</p><h4><strong>Be Patient, Be Persistent</strong></h4><p>The most successful change often happens not through revolution, but through evolution, making each incident review just a little bit better than the last one.</p><p>Remember: people don't resist change; they resist being changed.</p><p>By meeting teams where they are and gradually expanding their perspective, we create sustainable improvement rather than resistance.</p><h4><strong>Resilience needs to be nurtured, not imposed.</strong></h4>]]></content:encoded></item><item><title><![CDATA[Beyond Traditional Resilience]]></title><description><![CDATA["The path to resilience isn't paved with more complexity, but with elegant simplicity."]]></description><link>https://newsletter.resiliumlabs.com/p/beyond-traditional-resilience-the-resilium-labs-approach</link><guid 
isPermaLink="false">https://newsletter.resiliumlabs.com/p/beyond-traditional-resilience-the-resilium-labs-approach</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Fri, 16 May 2025 07:24:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f6cdf36d-96be-4610-946d-a3e906dedcda_1000x562.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zQau!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89b5b8f-365d-4c12-9bf2-ed06c22a08dc_1000x562.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zQau!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89b5b8f-365d-4c12-9bf2-ed06c22a08dc_1000x562.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zQau!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89b5b8f-365d-4c12-9bf2-ed06c22a08dc_1000x562.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zQau!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89b5b8f-365d-4c12-9bf2-ed06c22a08dc_1000x562.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!zQau!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89b5b8f-365d-4c12-9bf2-ed06c22a08dc_1000x562.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zQau!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89b5b8f-365d-4c12-9bf2-ed06c22a08dc_1000x562.jpeg" 
width="2500" height="1406" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d89b5b8f-365d-4c12-9bf2-ed06c22a08dc_1000x562.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1406,&quot;width&quot;:2500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!zQau!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89b5b8f-365d-4c12-9bf2-ed06c22a08dc_1000x562.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zQau!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89b5b8f-365d-4c12-9bf2-ed06c22a08dc_1000x562.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zQau!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89b5b8f-365d-4c12-9bf2-ed06c22a08dc_1000x562.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!zQau!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89b5b8f-365d-4c12-9bf2-ed06c22a08dc_1000x562.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><blockquote><h3><em>"The path to resilience isn't paved with more complexity, but with elegant simplicity."</em></h3></blockquote><div><hr></div><h3>Why Most Resilience Programs Fail</h3><p>Let's be honest: most resilience efforts end up as elaborate checkbox exercises that consume resources and deliver little actual resilience. You've probably seen it yourself: comprehensive frameworks, meticulously planned runbooks, and detailed incident management protocols that collapse at first contact with a real crisis.</p><p>Why? Because traditional resilience frameworks fundamentally misunderstand the nature of failure in complex systems.</p><h4>The Resilium Labs Difference</h4><p>At <a href="https://www.resiliumlabs.com/home">Resilium Labs</a>, we're challenging conventional wisdom about what makes systems resilient. Our approach isn't just different for the sake of being different.
It's different because decades of research in <a href="https://www.resiliumlabs.com/blog/what-is-resilience-engineering">resilience engineering</a> and complex systems theory demand a paradigm shift.</p><p>Here's how we're redefining resilience:</p><h3>1. Shifting from "Root Cause" to Context and Complexity</h3><p><strong>Traditional approach</strong>: Failures are treated as linear and deterministic. Find the root cause, fix it, create a new rule, and the problem is solved.</p><p><strong>Our approach</strong>: We recognize that failures emerge from interactions within complex systems. Instead of hunting for scapegoats or simplistic causes, we look at the full context in which failures occur and the conditions that allowed them to manifest.</p><p>Our approach shifts the perspective from finding fault to understanding context. Instead of simply identifying what went wrong, we focus on understanding why certain decisions made sense to engineers at the time they were made.</p><h3>2. Championing Uncertainty and Vulnerability</h3><p><strong>Traditional approach</strong>: Command and control rules. Detailed runbooks and rigid processes intended to eliminate uncertainty.</p><p><strong>Our approach</strong>: We embrace the reality that uncertainty is inherent in complex systems. Rather than pretending it can be eliminated, we build systems and teams that thrive in uncertain conditions. We create psychological safety that enables teams to acknowledge vulnerabilities instead of hiding them.</p><p>Resilient systems don't pretend to be invulnerable. Instead, they acknowledge their vulnerabilities and prepare accordingly.</p><h3>3. Dismantling Complexity in Favor of Elegant Simplicity</h3><p><strong>Traditional approach</strong>: Adds complexity on top of complexity, with elaborate frameworks and intricate response protocols.</p><p><strong>Our approach</strong>: We obsessively simplify.
Complex systems already have enough moving parts; your resilience approach shouldn't add more. As <a href="https://en.wikiquote.org/wiki/C._A._R._Hoare">Sir Tony Hoare</a> famously said, "The price of reliability is the pursuit of the utmost simplicity."</p><p>Our resilience strategies are elegantly minimalist, focusing on maximum impact with minimal complexity.</p><h3>4. Prioritizing Recovery Over Prevention</h3><p><strong>Traditional approach</strong>: Avoids failure at all costs, investing heavily in prevention measures.</p><p><strong>Our approach</strong>: We recognize that failures will happen despite our best efforts. While reasonable prevention is important, we prioritize rapid recovery capabilities. The difference between a minor incident and a catastrophe often isn't whether something fails, but how quickly you can recover.</p><p>When you optimize for recovery speed rather than zero failures, your systems become naturally more resilient.</p><h3>5. Resilience as Ongoing Practice, Not a Static State</h3><p><strong>Traditional approach</strong>: Treats resilience like a maturity model with checkpoints and an end state.</p><p><strong>Our approach</strong>: We see resilience as a continuous practice, something you do, not something you have. Organizations don't "achieve resilience". Instead, they practice it daily through learning, adaptation, and evolution.</p><p>This perspective shifts resilience from a project to be completed to a capability to be cultivated.</p><h3>6. Calling Out the "Prevention Paradox"</h3><p><strong>Traditional approach</strong>: Many resilience practitioners actively choose to remain invisible or separate from the business.</p><p><strong>Our approach</strong>: We confront the prevention paradox&#8212;the idea that when resilience efforts succeed, nothing happens, making it difficult to demonstrate value.
Rather than hiding from this challenge, we make resilience visible by connecting it directly to business outcomes and strategic objectives. We make the invisible visible.</p><p>Resilience isn't separate from your business. Instead, it's what enables your business to thrive in an increasingly unpredictable world.</p><h3>A Different Kind of Resilience Partner</h3><p>If you're tired of resilience initiatives that consume resources without delivering real results, let's talk. At Resilium Labs, we're not interested in checking boxes or implementing rigid frameworks. Instead, we're committed to building genuine resilience that enables your organization to adapt and thrive in the face of uncertainty.</p><p>Our approach is grounded in decades of research in complex systems, human factors, and resilience engineering. We've distilled these insights into practical approaches that deliver measurable results without unnecessary complexity.</p><p>Remember: <strong>Resilience isn't about preventing failure. It's about designing systems that can adapt and recover when the inevitable occurs.</strong></p><p>Ready to build real resilience? 
<a href="https://www.resiliumlabs.com/contact">Let's talk</a>.</p>]]></content:encoded></item><item><title><![CDATA[Transform Disruption into Competitive Advantage]]></title><description><![CDATA[Blog RSS]]></description><link>https://newsletter.resiliumlabs.com/p/the-business-case-for-resilience-engineering</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/the-business-case-for-resilience-engineering</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Tue, 13 May 2025 12:28:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d982bc8e-ec38-4ab6-8102-697c24e6eabe_1000x667.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FtFl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b24fd69-d8a8-40c7-9fcf-ecc329676f75_1000x667.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FtFl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b24fd69-d8a8-40c7-9fcf-ecc329676f75_1000x667.jpeg 424w, https://substackcdn.com/image/fetch/$s_!FtFl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b24fd69-d8a8-40c7-9fcf-ecc329676f75_1000x667.jpeg 848w, https://substackcdn.com/image/fetch/$s_!FtFl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b24fd69-d8a8-40c7-9fcf-ecc329676f75_1000x667.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!FtFl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b24fd69-d8a8-40c7-9fcf-ecc329676f75_1000x667.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FtFl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b24fd69-d8a8-40c7-9fcf-ecc329676f75_1000x667.jpeg" width="2500" height="1668" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b24fd69-d8a8-40c7-9fcf-ecc329676f75_1000x667.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1668,&quot;width&quot;:2500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!FtFl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b24fd69-d8a8-40c7-9fcf-ecc329676f75_1000x667.jpeg 424w, https://substackcdn.com/image/fetch/$s_!FtFl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b24fd69-d8a8-40c7-9fcf-ecc329676f75_1000x667.jpeg 848w, https://substackcdn.com/image/fetch/$s_!FtFl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b24fd69-d8a8-40c7-9fcf-ecc329676f75_1000x667.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!FtFl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b24fd69-d8a8-40c7-9fcf-ecc329676f75_1000x667.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Let&#8217;s be honest: disruption is the norm, not the exception. Headlines regularly feature outages affecting banks, e-commerce platforms, entertainment providers, and airlines. Failure has become an everyday reality.
From technical outages to market shifts, organizations face an increasingly complex landscape of potential failures.</p><p>But what if I told you that these disruptions could actually become your competitive advantage?</p><h3>Reframing Resilience: From Cost Center to Strategic Asset</h3><p>Most executive conversations about resilience start in the wrong place. They begin with questions like "How much will this cost?" or "What's the ROI?" These questions fundamentally misunderstand what resilience engineering delivers.</p><blockquote><h4><strong>Resilience is not about making money. Resilience is about not losing money.</strong></h4></blockquote><p>This distinction is critical. Unlike features that directly generate revenue, resilience measures typically prevent losses that would occur during failures or outages. This prevention-focused value proposition requires a different calculation framework than traditional ROI models.</p><p>Consider this: Nine major UK banks experienced a staggering 803 hours&#8212;equivalent to 33 full days&#8212;of technology outages in just two years. The compensation costs alone were substantial (<a href="https://www.bbc.com/news/articles/cjd3yzx3xgvo">source</a>):</p><ul><li><p>Barclays: Up to &#163;12.5 million</p></li><li><p>Bank of Ireland: &#163;350,000</p></li><li><p>NatWest: &#163;348,000</p></li><li><p>HSBC: &#163;232,697</p></li></ul><p>And these figures represent only direct compensation payments, not accounting for the lost business, damaged customer trust, or increased regulatory scrutiny.</p><p>The question isn't whether you can afford to invest in resilience; it's whether you can afford not to.</p><h3>Beyond Prevention: The Eight Dimensions of Resilience Value</h3><p>Resilience engineering delivers value far beyond mere incident prevention. Here's how a comprehensive resilience approach transforms organizations:</p><h4><strong>1. 
Economic Value</strong></h4><p>When calculating the business case for resilience, we must consider both direct and indirect costs of failure. Direct costs include compensation payments, emergency remediation, and lost revenue during outages. Indirect costs&#8212;often far larger&#8212;include reputation damage, lost future business, regulatory penalties, and decreased employee morale.</p><p>By preventing these losses, resilience initiatives can deliver tremendous economic value, even if they don't directly generate revenue.</p><h4><strong>2. Competitive Advantage</strong></h4><p>Organizations with advanced resilience capabilities can thrive in increasingly uncertain environments characterized by volatility, uncertainty, complexity, and ambiguity (VUCA). While competitors stumble during market disruptions or supply chain issues, resilient organizations maintain service continuity, building invaluable trust with customers.</p><p>This isn't theoretical&#8212;companies like Netflix and Amazon have turned their resilience investments into significant competitive advantages, maintaining availability when competitors cannot.</p><h4><strong>3. Operational Excellence</strong></h4><p>Resilience engineering isn't just about crisis management&#8212;it promotes operational excellence by fostering continuous improvement. Rather than viewing incidents as failures to be avoided, resilient organizations treat them as learning opportunities.</p><p>This perspective shift transforms how teams approach operations. Instead of hiding problems, they surface them. Instead of blaming individuals, they examine systems. This creates a virtuous cycle of continuous improvement that extends far beyond crisis response.</p><h4><strong>4. Innovation Enablement</strong></h4><p>Counterintuitively, resilient systems enable faster innovation. 
When organizations have confidence in their recovery capabilities, they can move more quickly without compromising reliability.</p><p>This is particularly crucial in today's business environment, where digital transformation has made business and IT inseparable. The speed at which your technology organization can implement changes without compromising quality directly limits how fast your business can respond to market movements.</p><h4><strong>5. Risk Management</strong></h4><p>Resilience engineering provides a systematic approach to managing risks in complex systems. Unlike traditional risk management, which focuses on known risks and checklists, resilience engineering-based approaches help organizations prepare for unforeseen challenges, often referred to as the "black swan."</p><p>This comprehensive approach encompasses not just technical failures but also business-level threats such as competitive disruptions, market shifts, or regulatory changes.</p><h4><strong>6. Human-Centered Value</strong></h4><p>One of the most overlooked aspects of resilience is its human foundation. Technical systems alone cannot handle all failures, especially unexpected ones. Resilience engineering recognizes this reality and leverages the creativity, adaptability, and problem-solving capabilities of people.</p><p>By incorporating the human element into system design, resilient organizations can address challenges that automated systems cannot, particularly in novel situations that weren't anticipated during the design phase.</p><h4><strong>7. Adaptive Capacity</strong></h4><p>The most valuable aspect of resilience engineering is that it builds adaptive capacity&#8212;the ability to respond to changing threats and circumstances. This differs significantly from stability, which prevents failures, or robustness, which handles expected failures.</p><p>True resilience includes handling surprises and continuously adapting to changing conditions. 
Organizations with high adaptive capacity don't just survive disruptions&#8212;they emerge stronger from them.</p><h4><strong>8. Balanced Approach</strong></h4><p>Ultimately, resilience engineering provides a balanced approach to achieving both efficiency and effectiveness. While traditional approaches often focus exclusively on efficiency (doing things right), resilience brings in effectiveness (doing the right things).</p><p>This balance is crucial in today's business environment. Pure efficiency optimization creates brittle systems that collapse under unexpected stress. Resilience approaches strike a balance between efficiency and the flexibility needed to adapt to changing circumstances.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TZvi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6f5f71-6df8-43d6-95c0-cc17d37bcf72_1000x651.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TZvi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6f5f71-6df8-43d6-95c0-cc17d37bcf72_1000x651.png 424w, https://substackcdn.com/image/fetch/$s_!TZvi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6f5f71-6df8-43d6-95c0-cc17d37bcf72_1000x651.png 848w, https://substackcdn.com/image/fetch/$s_!TZvi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6f5f71-6df8-43d6-95c0-cc17d37bcf72_1000x651.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TZvi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6f5f71-6df8-43d6-95c0-cc17d37bcf72_1000x651.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TZvi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6f5f71-6df8-43d6-95c0-cc17d37bcf72_1000x651.png" width="2000" height="1301" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b6f5f71-6df8-43d6-95c0-cc17d37bcf72_1000x651.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1301,&quot;width&quot;:2000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!TZvi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6f5f71-6df8-43d6-95c0-cc17d37bcf72_1000x651.png 424w, https://substackcdn.com/image/fetch/$s_!TZvi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6f5f71-6df8-43d6-95c0-cc17d37bcf72_1000x651.png 848w, https://substackcdn.com/image/fetch/$s_!TZvi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6f5f71-6df8-43d6-95c0-cc17d37bcf72_1000x651.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TZvi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6f5f71-6df8-43d6-95c0-cc17d37bcf72_1000x651.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>From Theory to Practice</h3><p>The value of resilience isn&#8217;t theoretical&#8212;it&#8217;s practical and measurable.
Organizations that implement comprehensive resilience programs consistently outperform those that don't, especially during periods of disruption.</p><p>At Resilium Labs, we help organizations build practical resilience capabilities while avoiding common pitfalls:</p><ul><li><p>Instead of chasing "perfect availability," we focus on rapid recovery</p></li><li><p>Instead of building complex defense mechanisms, we create simple, elegant solutions</p></li><li><p>Instead of treating resilience as a project to be completed, we embed it as a continuous practice</p></li></ul><p>The result? Organizations that not only survive disruptions but also transform threats into opportunities for growth.</p><h3>The Path Forward</h3><p>If you're ready to move beyond traditional approaches to resilience and build genuine adaptive capacity in your organization, we should talk. Resilium Labs specializes in practical resilience approaches that deliver measurable business value without unnecessary complexity.</p><p>Remember: The most resilient organizations aren't those that never fail&#8212;they're those that learn and grow from every challenge they face.</p><p><a href="https://www.resiliumlabs.com/#contact">Contact us</a> to start your resilience journey today.</p>]]></content:encoded></item><item><title><![CDATA[Gamechangers in Resilience - Interview with Iluminr]]></title><description><![CDATA[Permalink]]></description><link>https://newsletter.resiliumlabs.com/p/gamechangers-resilience-the-prevention-paradox</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/gamechangers-resilience-the-prevention-paradox</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Tue, 13 May 2025 04:19:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9N0S!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa465f403-b7af-4feb-a7ca-ea53007ad3fc_872x872.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://www.resiliumlabs.com/blog/gamechangers-in-resilience">Permalink</a></p>]]></content:encoded></item><item><title><![CDATA[What is Resilience Engineering?]]></title><description><![CDATA[Beyond Reliability in Complex Systems]]></description><link>https://newsletter.resiliumlabs.com/p/what-is-resilience-engineering-ffd59932adf8</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/what-is-resilience-engineering-ffd59932adf8</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Mon, 12 May 2025 07:35:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/78d2ea74-cfd8-4a55-9a32-71f6f4fa19c5_1024x683.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h4>Beyond Reliability in Complex&nbsp;Systems</h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W2U3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccfe502f-34ee-426b-a3ad-6290a740f757_1024x683.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W2U3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccfe502f-34ee-426b-a3ad-6290a740f757_1024x683.jpeg 424w, https://substackcdn.com/image/fetch/$s_!W2U3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccfe502f-34ee-426b-a3ad-6290a740f757_1024x683.jpeg 848w, https://substackcdn.com/image/fetch/$s_!W2U3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccfe502f-34ee-426b-a3ad-6290a740f757_1024x683.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!W2U3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccfe502f-34ee-426b-a3ad-6290a740f757_1024x683.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W2U3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccfe502f-34ee-426b-a3ad-6290a740f757_1024x683.jpeg" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ccfe502f-34ee-426b-a3ad-6290a740f757_1024x683.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;A resilient lone tree grows horizontally from a rocky mountainside, with half its branches covered in green leaves while others remain bare. The tree extends over a dramatic landscape of gray limestone rock formations, with distant mountains, valleys, and a cloudy sky in the background. The scene captures nature&#8217;s remarkable adaptability in harsh conditions.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A resilient lone tree grows horizontally from a rocky mountainside, with half its branches covered in green leaves while others remain bare. The tree extends over a dramatic landscape of gray limestone rock formations, with distant mountains, valleys, and a cloudy sky in the background. The scene captures nature&#8217;s remarkable adaptability in harsh conditions." title="A resilient lone tree grows horizontally from a rocky mountainside, with half its branches covered in green leaves while others remain bare. 
The tree extends over a dramatic landscape of gray limestone rock formations, with distant mountains, valleys, and a cloudy sky in the background. The scene captures nature&#8217;s remarkable adaptability in harsh conditions." srcset="https://substackcdn.com/image/fetch/$s_!W2U3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccfe502f-34ee-426b-a3ad-6290a740f757_1024x683.jpeg 424w, https://substackcdn.com/image/fetch/$s_!W2U3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccfe502f-34ee-426b-a3ad-6290a740f757_1024x683.jpeg 848w, https://substackcdn.com/image/fetch/$s_!W2U3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccfe502f-34ee-426b-a3ad-6290a740f757_1024x683.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!W2U3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccfe502f-34ee-426b-a3ad-6290a740f757_1024x683.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>As our systems become increasingly complex and interconnected, the question isn&#8217;t whether failures will occur, but when and how we&#8217;ll respond. This reality has given rise to resilience engineering, a discipline that transforms how we think about failure, recovery, and adaptation.</p><h4><strong>Beyond Simple Reliability</strong></h4><p>Resilience engineering isn&#8217;t just about preventing failures or building reliable systems. While reliability focuses on avoiding failures, resilience acknowledges their inevitability and focuses on successfully responding to them. 
It&#8217;s the difference between trying to build an impenetrable fortress and creating a system that can take a hit and quickly&nbsp;recover.</p><p>At its core, resilience engineering is about developing the ability to cope with adverse events and situations successfully. This includes handling expected adverse events (robustness), managing unexpected adverse events (coping with surprises), and improving due to adverse events (learning).</p><p>What sets resilience engineering apart is its focus on socio-technical systems, recognizing that technology and human operators function as an integrated whole. It considers not just your technical infrastructure, but also your people, processes, and organizational structure.</p><h4><strong>Pioneers of Resilience Engineering</strong></h4><p>With a rich 20-year history as a scientific field, resilience engineering emerged as a discipline focused on how complex systems maintain function during disturbances rather than just preventing failures. The discipline spans far beyond software systems. It applies equally to aviation, energy distribution, transportation, financial services, emergency services, aerospace, healthcare, telecommunications, and many other domains where complex systems must function reliably despite unpredictable challenges.</p><p>Several key figures established the field: <a href="https://erikhollnagel.com/">Erik Hollnagel</a> and <a href="https://en.wikipedia.org/wiki/David_Woods_(safety_researcher)">David D. Woods</a> are widely recognized as the primary founders, co-authoring the seminal &#8220;<a href="https://www.researchgate.net/publication/50232053_Resilience_Engineering_Concepts_and_Precepts">Resilience Engineering: Concepts and Precepts</a>&#8221; (2006).
Hollnagel introduced the influential <a href="https://erikhollnagel.com/books/safety-i-safety-ii-2014">Safety-I vs Safety-II </a>paradigm, while Woods developed concepts like &#8220;<a href="https://www.researchgate.net/publication/327427067_The_Theory_of_Graceful_Extensibility_Basic_rules_that_govern_adaptive_systems">graceful extensibility</a>.&#8221;</p><p>Other foundational contributors include <a href="https://en.wikipedia.org/wiki/Nancy_Leveson">Nancy Leveson</a> with her <a href="https://www.sciencedirect.com/science/article/abs/pii/S092575350300047X">STAMP (System-Theoretic Accident Model and Processes)</a> methodology, Sidney Dekker, who explored how systems &#8220;drift into failure,&#8221; <a href="https://en.wikipedia.org/wiki/Richard_Cook_(safety_researcher)">Richard Cook</a>, whose paper &#8220;<a href="https://how.complexsystems.fail/">How Complex Systems Fail</a>&#8221; became a classic text, and John Wreathall, who helped organize the first resilience engineering symposium in 2004. <a href="https://en.wikipedia.org/wiki/Diane_Vaughan">Diane Vaughan</a> made crucial contributions with her work on the &#8220;<a href="https://en.wikipedia.org/wiki/Normalization_of_deviance">normalization of deviance</a>,&#8221; showing how organizations gradually accept increasingly risky decisions because nothing bad has happened&nbsp;yet.</p><p>The application of safety science principles to software engineering represents a natural evolution of these foundational concepts. While drawing heavily from these safety science pioneers, the field of software resilience engineering has developed its own distinct practices tailored to the unique challenges of distributed systems. Key figures like <a href="https://en.wikipedia.org/wiki/John_Allspaw">John Allspaw </a>played a pivotal role in this adaptation, bringing these concepts into software operations and DevOps culture. 
Similarly, <a href="https://en.wikipedia.org/wiki/Jesse_Robbins">Jesse Robbins</a>&#8202;&#8212;&#8202;known as Amazon&#8217;s &#8220;Master of Disaster&#8221;&#8202;&#8212;&#8202;made significant contributions through his pioneering GameDay exercises, which introduced simulated failure scenarios designed to build organizational resilience in technical environments.</p><p>Today, resilience engineering principles are fundamental to managing complex distributed software systems, though the field continues to evolve with unique practices specific to software challenges.</p><h4><strong>Adaptive Capacity: The Heart of Resilience</strong></h4><p>&#8220;Adaptive capacity&#8221;&#8202;&#8212;&#8202;the uniquely human ability to respond creatively to unexpected challenges&#8202;&#8212;&#8202;forms the foundation of resilience. While adaptive capacity represents potential, resilience is its successful application when confronting adversity. Organizations practicing resilience engineering deliberately invest in cultivating this adaptive capacity, often confronting what I call the &#8220;prevention paradox&#8221;: companies must spend money preparing for problems they can&#8217;t foresee, and their biggest wins are simply the disasters that never&nbsp;happen.</p><p>This human element is critical. While our technical systems can be designed to handle known failure modes, only human operators can improvise solutions to novel problems. Resilience engineering acknowledges and enhances this capability by fostering environments where adaptation can flourish.</p><h4><strong>The Journey to Resilience</strong></h4><p>Becoming resilient isn&#8217;t an overnight transformation.
Organizations typically progress through several&nbsp;stages:</p><ol><li><p><strong>Stability</strong>&#8202;&#8212;&#8202;Initially focusing on preventing failures through technical means</p></li><li><p><strong>Robustness</strong>&#8202;&#8212;&#8202;Embracing failures and handling them gracefully</p></li><li><p><strong>Basic Resilience</strong>&#8202;&#8212;&#8202;Preparing for surprises and considering the entire socio-technical system</p></li><li><p><strong>Advanced Resilience</strong>&#8202;&#8212;&#8202;Treating adversities as opportunities for improvement</p></li></ol><p>Each step along this journey involves not just technical changes but shifts in mindset, culture, and organizational practices.</p><h4><strong>Prepared to be Unprepared</strong></h4><p>Perhaps the most profound insight from resilience engineering is the importance of being &#8220;prepared to be unprepared.&#8221; No matter how thorough our planning and testing are, we will encounter situations we didn&#8217;t anticipate. Our systems&#8217; resilience depends not on preventing every possible failure but on our ability to detect, respond to, and learn from the unexpected.</p><p>This perspective transforms how we approach system design, operations, and organizational culture. 
Instead of fruitlessly pursuing perfect reliability, we build systems and organizations that can gracefully handle the inevitable imperfections of complex technological environments.</p><h4><strong>Resilience in&nbsp;Practice</strong></h4><p>In practical terms, this means organizations build capabilities across multiple dimensions:</p><ul><li><p>Develop flexible processes that allow for adaptation when conditions change unexpectedly</p></li><li><p>Implement comprehensive monitoring to detect weak signals before incidents escalate</p></li><li><p>Learn from both successes and&nbsp;failures</p></li><li><p>Support rather than constrain human performance variability</p></li><li><p>Take a holistic, systems-thinking approach to understanding interactions between components</p></li></ul><p>Resilient organizations exemplify these capabilities through practices like chaos engineering.</p><p>This intersection of technical, human, and organizational factors in resilience engineering will be the focus of an upcoming blog post. In it, we&#8217;ll explore how organizations at different maturity levels implement these principles, practical examples across various industries, and strategies for balancing resilience with efficiency.</p><h4><strong>Why Resilience Matters Now More Than&nbsp;Ever</strong></h4><p>As our dependency on digital systems continues to grow, so does the impact of their failures. The cost of downtime has never been higher, both in financial terms and in terms of eroded trust and reputation.</p><p>Meanwhile, the complexity of our systems continues to increase, making traditional approaches to reliability increasingly inadequate. We can no longer predict and prevent all possible failure modes&#8202;&#8212;&#8202;we must develop the capacity to respond effectively to the unexpected.</p><p>Resilience engineering offers a path forward in this challenging future. 
By embracing its principles, organizations can build systems that not only survive but thrive amid uncertainty and change. It&#8217;s not about avoiding failure at all costs&#8202;&#8212;&#8202;it&#8217;s about failing gracefully, recovering quickly, and emerging stronger than&nbsp;before.</p><p>In a world of inevitable surprises, resilience isn&#8217;t just a nice-to-have; it&#8217;s an essential characteristic of successful organizations and the systems they&nbsp;build.</p>]]></content:encoded></item><item><title><![CDATA[What is Resilience Engineering?]]></title><description><![CDATA[Blog RSS]]></description><link>https://newsletter.resiliumlabs.com/p/what-is-resilience-engineering</link><guid isPermaLink="false">https://newsletter.resiliumlabs.com/p/what-is-resilience-engineering</guid><dc:creator><![CDATA[Adrian Hornsby]]></dc:creator><pubDate>Mon, 12 May 2025 07:06:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/93ad9d39-32ac-42a8-b844-094f0eddfa67_1000x667.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://www.resiliumlabs.com/blog?format=rss" title="Blog RSS">Blog RSS</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Mmhx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc27e01f4-b8f9-4847-8e36-099f9fe1b068_1000x667.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Mmhx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc27e01f4-b8f9-4847-8e36-099f9fe1b068_1000x667.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!Mmhx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc27e01f4-b8f9-4847-8e36-099f9fe1b068_1000x667.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Mmhx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc27e01f4-b8f9-4847-8e36-099f9fe1b068_1000x667.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Mmhx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc27e01f4-b8f9-4847-8e36-099f9fe1b068_1000x667.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Mmhx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc27e01f4-b8f9-4847-8e36-099f9fe1b068_1000x667.jpeg" width="2500" height="1667" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c27e01f4-b8f9-4847-8e36-099f9fe1b068_1000x667.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1667,&quot;width&quot;:2500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Mmhx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc27e01f4-b8f9-4847-8e36-099f9fe1b068_1000x667.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!Mmhx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc27e01f4-b8f9-4847-8e36-099f9fe1b068_1000x667.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Mmhx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc27e01f4-b8f9-4847-8e36-099f9fe1b068_1000x667.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Mmhx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc27e01f4-b8f9-4847-8e36-099f9fe1b068_1000x667.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p>A resilient lone tree grows horizontally from a rocky mountainside, with half its branches covered in green leaves while others remain bare. The tree extends over a dramatic landscape of gray limestone rock formations, with distant mountains, valleys, and a cloudy sky in the background. The scene captures nature's remarkable adaptability in harsh conditions.</p><p>As our systems become increasingly complex and interconnected, the question isn't whether failures will occur, but when and how we'll respond. This reality has given rise to resilience engineering, a discipline that transforms how we think about failure, recovery, and adaptation.</p><p><strong>Beyond Simple Reliability</strong></p><p>Resilience engineering isn't just about preventing failures or building reliable systems. While reliability focuses on avoiding failures, resilience acknowledges their inevitability and focuses on successfully responding to them. It's the difference between trying to build an impenetrable fortress and creating a system that can take a hit and quickly recover.</p><p>At its core, resilience engineering is about developing the ability to cope with adverse events and situations successfully. This includes handling expected adverse events (robustness), managing unexpected adverse events (coping with surprises), and improving due to adverse events (learning).</p><p>What sets resilience engineering apart is its focus on socio-technical systems, recognizing that technology and human operators function as an integrated whole.
It considers not just your technical infrastructure, but also your people, processes, and organizational structure.</p><p><strong>Pioneers of Resilience Engineering</strong></p><p>With a rich 20-year history as a scientific field, resilience engineering emerged as a discipline focused on how complex systems maintain function during disturbances rather than just preventing failures. The discipline spans far beyond software systems. It applies equally to aviation, energy distribution, transportation, financial services, emergency services, aerospace, healthcare, telecommunications, and many other domains where complex systems must function reliably despite unpredictable challenges.</p><p>Several key figures established the field: <a href="https://erikhollnagel.com">Erik Hollnagel</a> and <a href="https://en.wikipedia.org/wiki/David_Woods_(safety_researcher)">David D. Woods</a> are widely recognized as the primary founders, co-authoring the seminal "<a href="https://www.researchgate.net/publication/50232053_Resilience_Engineering_Concepts_and_Precepts">Resilience Engineering: Concepts and Precepts</a>" (2006).
Hollnagel introduced the influential <a href="https://erikhollnagel.com/books/safety-i-safety-ii-2014">Safety-I vs Safety-II </a>paradigm, while Woods developed concepts like "<a href="https://www.researchgate.net/publication/327427067_The_Theory_of_Graceful_Extensibility_Basic_rules_that_govern_adaptive_systems">graceful extensibility</a>."</p><p>Other foundational contributors include <a href="https://en.wikipedia.org/wiki/Nancy_Leveson">Nancy Leveson</a> with her <a href="https://www.sciencedirect.com/science/article/abs/pii/S092575350300047X">STAMP (System-Theoretic Accident Model and Processes)</a> methodology, <a href="https://en.wikipedia.org/wiki/Sidney_Dekker">Sidney Dekker</a>, who explored how systems "<a href="https://www.researchgate.net/publication/306219866_Drift_into_Failure_From_Hunting_Broken_Components_to_Understanding_Complex_Systems">Drift into failure</a>," <a href="https://en.wikipedia.org/wiki/Richard_Cook_(safety_researcher)">Richard Cook</a>, whose paper "<a href="https://how.complexsystems.fail">How Complex Systems Fail</a>" became a classic text, and John Wreathall, who helped organize the first resilience engineering symposium in 2004. <a href="https://en.wikipedia.org/wiki/Diane_Vaughan">Diane Vaughan</a> made crucial contributions with her work on the "<a href="https://en.wikipedia.org/wiki/Normalization_of_deviance">Normalization of deviance</a>," showing how organizations gradually accept increasingly risky decisions because nothing bad has happened yet.</p><p>The application of safety science principles to software engineering represents a natural evolution of these foundational concepts. While drawing heavily from these safety science pioneers, the field of software resilience engineering has developed its own distinct practices tailored to the unique challenges of distributed systems. 
Key figures like <a href="https://en.wikipedia.org/wiki/John_Allspaw">John Allspaw </a>played a pivotal role in this adaptation, bringing these concepts into software operations and DevOps culture. Similarly, <a href="https://en.wikipedia.org/wiki/Jesse_Robbins">Jesse Robbins</a>&#8212;known as Amazon's "Master of Disaster"&#8212;made significant contributions through his pioneering <a href="https://www.youtube.com/watch?v=zoz0ZjfrQ9s">GameDay</a> exercises, which introduced simulated failure scenarios designed to build organizational resilience in technical environments.</p><p>Today, resilience engineering principles are fundamental to managing complex distributed software systems, though the field continues to evolve with unique practices specific to software challenges.</p><p><strong>Adaptive Capacity: The Heart of Resilience</strong></p><p>"Adaptive capacity"&#8212;the uniquely human ability to respond creatively to unexpected challenges&#8212;forms the foundation of resilience. While adaptive capacity represents potential, resilience is its successful application when confronting adversity. Organizations practicing resilience engineering deliberately invest in cultivating this adaptive capacity, often confronting what I call the "prevention paradox": companies must spend money preparing for problems they can't foresee, and their biggest wins are simply the disasters that never happen.</p><p>This human element is critical. While our technical systems can be designed to handle known failure modes, only human operators can improvise solutions to novel problems. Resilience engineering acknowledges and enhances this capability by fostering environments where adaptation can flourish.</p><p><strong>The Journey to Resilience</strong></p><p>Becoming resilient isn't an overnight transformation.
Organizations typically progress through several stages:</p><ol><li><p><strong>Stability</strong> - Initially focusing on preventing failures through technical means</p></li><li><p><strong>Robustness</strong> - Embracing failures and handling them gracefully</p></li><li><p><strong>Basic Resilience</strong> - Preparing for surprises and considering the entire socio-technical system</p></li><li><p><strong>Advanced Resilience</strong> - Treating adversities as opportunities for improvement</p></li></ol><p>Each step along this journey involves not just technical changes but shifts in mindset, culture, and organizational practices.</p><p><strong>Prepared to be Unprepared</strong></p><p>Perhaps the most profound insight from resilience engineering is the importance of being "prepared to be unprepared." No matter how thorough our planning and testing are, we will encounter situations we didn't anticipate. Our systems' resilience depends not on preventing every possible failure but on our ability to detect, respond to, and learn from the unexpected.</p><p>This perspective transforms how we approach system design, operations, and organizational culture. 
Instead of fruitlessly pursuing perfect reliability, we build systems and organizations that can gracefully handle the inevitable imperfections of complex technological environments.</p><p><strong>Resilience in Practice</strong></p><p>In practical terms, this means organizations build capabilities across multiple dimensions:</p><ul><li><p>Develop flexible processes that allow for adaptation when conditions change unexpectedly</p></li><li><p>Implement comprehensive monitoring to detect weak signals before incidents escalate</p></li><li><p>Learn from both successes and failures</p></li><li><p>Support rather than constrain human performance variability</p></li><li><p>Take a holistic, systems-thinking approach to understanding interactions between components</p></li></ul><p>Resilient organizations exemplify these capabilities through practices like chaos engineering.</p><p>This intersection of technical, human, and organizational factors in resilience engineering will be the focus of an upcoming blog post. In it, we'll explore how organizations at different maturity levels implement these principles, practical examples across various industries, and strategies for balancing resilience with efficiency.</p><p><strong>Why Resilience Matters Now More Than Ever</strong></p><p>As our dependency on digital systems continues to grow, so does the impact of their failures. The cost of downtime has never been higher, both in financial terms and in terms of eroded trust and reputation.</p><p>Meanwhile, the complexity of our systems continues to increase, making traditional approaches to reliability increasingly inadequate. We can no longer predict and prevent all possible failure modes&#8212;we must develop the capacity to respond effectively to the unexpected.</p><p>Resilience engineering offers a path forward in this challenging future. By embracing its principles, organizations can build systems that not only survive but thrive amid uncertainty and change. 
It's not about avoiding failure at all costs&#8212;it's about failing gracefully, recovering quickly, and emerging stronger than before.</p><p>In a world of inevitable surprises, resilience isn't just a nice-to-have; it's an essential characteristic of successful organizations and the systems they build.</p>]]></content:encoded></item></channel></rss>