Why We Still Suck at Resilience and Why I Wrote a Book About It
The last few months have been pretty quiet around here, and the reason is the same reason this piece exists. I wrote a book. It is called "Why We Still Suck at Resilience," and you can get it here.
If you have been following me for any length of time, you will recognise the argument even if the packaging is new. Organizations confuse performing resilience with actually being resilient, and the five practices that are supposed to manage that gap (chaos engineering, load testing, GameDays, incident analysis, operational readiness reviews) have drifted so far from their original purpose that they often make the problem worse. The book is an attempt to say clearly what I have been circling around in LinkedIn posts and on my blog: that the gap between how we imagine our systems work and how they actually work is growing, and most of what we are doing about it is theater.
The idea started during my time at AWS, where I helped build the Fault Injection Service and spent a decade working with some of the world's largest organizations on how they practice resilience at scale. I saw teams running chaos experiments that confirmed what they already believed rather than discovering what they did not know, load tests scoped to pass rather than to learn, incident reviews that produced action items nobody followed up on because the organizational incentives pointed elsewhere. The patterns were consistent enough across companies, industries, and team sizes that I became convinced the problem was not individual teams failing to execute. Something structural was going wrong, something about the way organizations relate to learning under pressure, and I wanted to understand what it was.
Writing the book took longer than I expected, partly because I kept discovering that the problem was deeper than I had initially framed it. What started as a practitioner's guide to doing these five practices well became something more like an investigation into why organizations systematically undermine their own capacity to learn from failure. The research pulled me into safety science, into Bainbridge and Rasmussen and Woods, into the gap between Work-as-Imagined and Work-as-Done, and into the uncomfortable recognition that I had been part of the problem myself. I built tooling that optimised for individual practice metrics while ignoring the learning system those practices were supposed to form. The value of resilience work lives in the connections between practices, in how an incident finding becomes a chaos experiment becomes a readiness review question, and nothing we had built supported those connections.
The final chapter turned out to be the one that mattered most, both for the book and for where my thinking is already going. It is about AI, but not in the way you might expect. Every observability vendor now offers AI-powered anomaly detection, every incident platform promises AI-drafted postmortems, and the marketing suggests these tools will finally eliminate the gap between how we imagine our systems work and how they actually work.
Just this week, Norberto Lopes at incident.io posted about their AI-generated postmortem capability, calling it a mind-blowing moment, celebrating that the write-up was accurate, well-contextualised, and took zero time to draft. Around the same time, Ozan Unlu published a piece describing what he calls Observability 3.0, a vision in which AI agents handle incident triage, root cause analysis, and postmortems autonomously, with humans elevated to "creativity and innovation" while machines do the interpretive work.

I do not doubt that either of these tools produces excellent output. What worries me is that the productive struggle of writing a postmortem, or correlating signals during an incident, or reasoning through why a system behaved the way it did, is where most of the actual learning occurs. The document was never really the point; the thinking that produced it was. The interpretation was never the bottleneck; it was the training ground. When you skip the thinking and get a better document, or remove humans from the sensemaking and call it empowerment, you have made a trade that is easy to celebrate and difficult to measure the cost of until much later, when something breaks and the organization discovers it no longer has the understanding it assumed it was building all along.
Bainbridge identified this dynamic in her 1983 paper "Ironies of Automation": automation designed to remove humans from a system paradoxically makes the human role both more critical and more difficult, because it removes the routine experience that builds expertise. AI is the same problem in a more acute form, one that does not just automate tasks but automates the judgment and reasoning that build the deepest understanding.
I believed that argument completely while writing it, and the speed at which I have started questioning it has caught me off guard. Over the past few weeks, there has been a growing conversation among engineers and researchers (here, for example) about a noticeable leap forward in the latest generation of models, not just in raw capability but in the quality of their reasoning. For the first time, my own experience is starting to mirror what others are describing.
I have been using Anthropic's Claude, both the Opus 4.6 model and its extended thinking variant, and what unsettled me was the texture of how it worked through problems. On coding tasks that required sound architectural judgment, the kind where there is no clean answer and the trade-offs depend on context that has to be reasoned about rather than looked up, the model was not just producing correct outputs but thinking through the problem in ways that felt uncomfortably close to how a senior engineer would. It would express uncertainty about its own choices, flag trade-offs I had not raised, and change direction mid-reasoning when it encountered something that complicated its initial approach.
Doubt, revision, context-sensitivity: these are what learning looks like. They are, in cognitive science and education research, the established preconditions for understanding: cognitive conflict that exposes mismatches between a mental model and reality, willingness to reorganise prior knowledge rather than just accumulate new information, and attention to context as the thing that determines whether knowledge transfers or stays brittle. The assumption that has underpinned my thinking, that when automation fails it will be humans who catch the fall, rests on the belief that machines cannot do the interpretive work. If that belief is wrong, or even if it is only wrong for long enough that organizations stop maintaining human capability in the meantime, then the question of who fixes what when things break becomes genuinely open in a way it has never been before.
I do not say that to be dramatic, and I do not have a clean resolution. The book makes a case I still think is largely right: that organizations need to protect the productive struggle, the learning that comes from humans wrestling with ambiguous problems, because that struggle is where durable understanding forms. What I am less sure about is whether the time horizon for that argument is decades or years, and the difference between those two answers changes everything about how I think about the work ahead.
Which brings me to what is next. In my consulting work at Resilium Labs, AI in operations has slowly become a recurring topic in nearly every engagement. Clients who originally brought me in to diagnose their chaos engineering, load testing, operational readiness, and incident review practices are now asking questions about what AI is doing to their teams' ability to understand their own systems. The conversations are shifting because the problem is shifting, and the methodology I have always used (embedded observation, stakeholder interviews, tracing the gap between how people imagine work happens and how it actually happens) turns out to apply directly to organizations trying to figure out whether their AI adoption is building genuine capability or creating new blind spots they have not learned to see yet. The work is evolving because the organizations I work with are evolving, and the questions they need answered are no longer only about resilience practices in the traditional sense.
There will be more writing here in the weeks ahead, pieces that work through these questions. If anything in the book resonates, or if it does not, I would like to hear about it.
//Adrian