When No One in the Room Has Carried the Pager
Why I built the Resilience Companion — and why “bootstrap” is the right word for it.
When I was at AWS, one of my areas of focus in the Resilience org (the team behind AWS FIS) was bar-raising ORRs, Operational Readiness Reviews, for the services we shipped. Bar-raising is an Amazon term, but the idea is older than Amazon: in a review, someone has to hold the line on quality, independent of the team being reviewed and independent of the chain of command. Their only role is to ask the question nobody else wants to ask.
We worked with some of the best engineers in the world, but it didn’t matter. Humans are humans. We have biases, we forget things, and not everyone has the operational scar tissue that comes from carrying a pager for ten years. The ORR existed to close that gap, and the bar-raiser existed to make sure the ORR did.
What an ORR actually was, in practice, was a sequence of conversations across the life of a service. We’d sit with a team and talk about past incidents, about what we’d learned the hard way, about how the architecture would behave at 3 a.m. when nobody was around to nurse it. Because the system you design at a whiteboard with a cup of coffee is rarely the system that exists at 3 a.m. with a half-broken dependency and an on-call engineer who joined three weeks ago. That’s the work-as-imagined versus work-as-done gap, and the ORR was the structured way we forced ourselves to look at it.
The thing that made it work wasn’t the template. It was the people in the room. Operationally savvy engineers, senior enough that they’d carried the pager for years, seen plans fall apart in production, watched checklists hide more than they revealed, and worked their way through runbooks clearly written by someone who’d never actually been on call. They’d been burned enough times that they couldn’t help but ask the awkward follow-up question. *“You said you have circuit breakers. Fine. So, walk me through what happens when one trips. Where does the request go? What does the user see? How do they behave under load?”* When someone gave a shallow answer, they noticed and pushed because they knew, from experience, exactly where the team’s mental model would crack first.
That’s the part you can’t extract from a checklist.
When I started working with customers outside AWS, this became obvious very quickly. Many organizations don’t have those people in the room, often because of the separation between application and operations. They have senior engineers and architects, some of them deeply experienced, but their experience is often architectural rather than operational. They’ve designed things; they haven’t necessarily watched those things degrade at 3 a.m., or stayed up through a postmortem trying to work out why the runbook said one thing and the system did another. So when the team runs an ORR, there’s no one whose instinct is to push on the answer that sounds clean but isn’t.
What happens then is predictable. The form gets filled in, the boxes get checked, and six weeks later the service goes down because nobody asked the question that would have surfaced the assumption that was wrong all along. The ORR happened; the learning didn’t.
The Resilience Companion is my attempt to put something useful in the room when those people aren’t there.
It’s an AI facilitator that runs the kind of conversation I used to run: Socratic, a little pushy, anchored in actual incidents. It asks the team to predict before they look things up. It traces failure paths step by step. It surfaces relevant industry post-mortems when the conversation gets near a known pattern. When someone says “we have retries,” it asks them to explain the backoff strategy in more detail. The gap between what the team can answer and what they need the companion to go look up for them is itself one of the most useful signals you can capture.
The friction is the feature. Cognitive science has a name for it: productive struggle, or what Robert and Elizabeth Bjork called desirable difficulties. When you force someone to generate an answer rather than recognize one, retention goes up. Make them predict before they look, and the prediction becomes the scaffold the new information attaches to. Most enterprise tools optimize for frictionless retrieval and call it a productivity win. The Companion goes the other way on purpose. Learning requires friction, and most reviews don’t have any.
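The predict-then-verify loop is simple enough to sketch. This is a minimal illustration of the idea, not the Companion’s actual code: every name here (`ProbeResult`, `run_probe`, the example question and answers) is hypothetical, invented to show how forcing a prediction before retrieval turns the prediction/reality gap into a captured signal.

```python
# Hypothetical sketch of a predict-before-lookup probe.
# None of these names come from the Resilience Companion codebase;
# they only illustrate the desirable-difficulty loop described above.
from dataclasses import dataclass


@dataclass
class ProbeResult:
    question: str
    prediction: str  # what the team said before looking anything up
    observed: str    # what the source of truth actually said

    @property
    def gap(self) -> bool:
        # The mismatch between prediction and reality is the learning signal.
        return self.prediction.strip().lower() != self.observed.strip().lower()


def run_probe(question: str, ask_team, look_up) -> ProbeResult:
    """Force the team to commit to an answer *before* retrieval is allowed."""
    prediction = ask_team(question)   # generation first: no looking things up yet
    observed = look_up(question)      # only then consult the config/docs/dashboards
    return ProbeResult(question, prediction, observed)


# Example: a team predicts their retry backoff, then checks the real config.
result = run_probe(
    "What backoff strategy do your retries use?",
    ask_team=lambda q: "exponential, max 3 attempts",
    look_up=lambda q: "fixed 1s interval, unbounded attempts",
)
print(result.gap)  # a gap like this is exactly what's worth writing down
```

The ordering is the whole point: swap the two calls inside `run_probe` and you get frictionless retrieval back, along with the shallow review it produces.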
Two things I want to be clear about.
It is NOT a replacement for an operationally savvy senior engineer. Real scar tissue carries context no model has, and the conversation a seasoned SRE has with a team in a room will always be richer than anything a tool can run. I don’t claim otherwise and I won’t.
It IS a way to bootstrap the process, to get teams running ORRs that actually probe, rather than ones that just generate paperwork, while the organization builds the wider community of engineers who can eventually take that role. It’s scaffolding, not a building.
The Companion is also a working argument. My book Why We Still Suck at Resilience makes the case that most resilience practices degrade into checklists and compliance theater because organizations optimize for performance over learning. The Resilience Companion is what the alternative looks like in code. It refuses to become a checklist. It treats productive struggle as the point. It measures itself on whether teams learned during the session, not on whether the form got filled in.
If you don’t want to run anything, the template the Companion uses is on GitHub as a standalone markdown file: ORRtemplate.md. A skilled human facilitator with the template can run the same conversation, and probably better. The Companion is one way to use the template.
It’s experimental, pre-1.0, Apache-licensed, self-hosted. `docker compose up` and go.
If you’ve ever sat through an ORR where everyone politely agreed that everything was fine, and then watched that same service fall over a month later, this is for you.
Try it. Open an issue if you find a bug. Open a pull request if you’re brave enough.
// Adrian


