The Resilience Myths List
Things we keep telling ourselves about resilience that aren't true
Last updated: April 2026. I add new myths every few months based on what I see in consulting, workshops, and conversations with engineering leaders. If you think one is missing, reply and tell me.
I keep a running list of beliefs about resilience that sound right but quietly undermine the organizations that hold them. Some are widespread. Some are subtle. All of them show up regularly in the teams I work with.
What makes something a myth here? It fits a pattern: the belief simplifies a complex reality into something comforting and manageable, and in doing so, it closes off learning. Most of these aren’t wrong in every context. They become myths when they’re treated as settled truths rather than assumptions worth questioning.
The list is loosely grouped, but many of these cut across categories. That’s part of the point.
Metrics & Measurement
Incident counts are a good measure of resilience. Incident counts measure reporting culture, not system health. A team that reports more is often learning more.
MTTR tells us how good we are at recovery. MTTR is a mean. It hides variance, and recovery from real incidents looks nothing like recovery from minor blips. The number flattens everything that matters.
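To make the flattening concrete, here is a toy calculation with hypothetical recovery times: nine routine blips and one real outage produce a mean that looks perfectly healthy.

```python
# Hypothetical recovery times in minutes for ten incidents:
# nine minor blips and one major outage. Invented numbers.
recovery_minutes = [2, 3, 2, 4, 3, 2, 3, 4, 2, 175]

mttr = sum(recovery_minutes) / len(recovery_minutes)
worst = max(recovery_minutes)

print(f"MTTR: {mttr:.0f} min")  # prints "MTTR: 20 min" and looks fine
print(f"Worst: {worst} min")    # the one recovery that actually mattered
```

Reporting a distribution (percentiles, or even just the worst case) keeps the 175-minute outage visible; the mean quietly erases it.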
MTTD shows our detection capabilities. Detection depends on what you’re looking for. MTTD only captures what you already know to monitor. The things that hurt you are the things you weren’t watching.
Fewer reported incidents means better resilience. Or it means people stopped reporting. Fewer reports can signal suppression as easily as improvement.
We can measure resilience with simple metrics. Resilience is a property of how a system adapts to surprise. You can measure proxies, but no single number captures adaptive capacity.
If the metrics are green, it means my customers are happy. Green dashboards reflect what you chose to measure. Customer experience includes everything you didn’t.
No alerts means no problems. Alerts fire on known conditions. The absence of alerts tells you nothing about unknown failure modes.
Everything important will generate an alert. This assumes you’ve already imagined every way things can go wrong. You haven’t. Nobody has.
Resilience scores measure resilience. A score requires a model. The model requires simplification. The simplification removes exactly the things that make resilience hard.
Process & Planning
More documentation prevents incidents. Documentation helps when people read it, when it’s current, and when the situation matches what was documented. Those conditions rarely align during an actual incident.
Runbooks should be as detailed as possible. Overly detailed runbooks become brittle. When the situation deviates from the script (and it will), people freeze because they’ve been trained to follow, not to think.
Perfect runbooks prevent all problems. See above, but louder. Runbooks encode past knowledge. Incidents are defined by novelty.
Perfect processes eliminate human error. Processes are operated by people in context. Tighter processes don’t remove error, they change where it shows up.
Incidents are always preventable with better planning. If you accept that complex systems produce emergent behavior, you accept that some incidents are genuinely surprising. Better planning reduces some failures and creates blind spots for others.
Documented procedures capture how work happens. Work-as-imagined and work-as-done are different. Procedures describe the intended path. Actual work involves adaptation, shortcuts, and judgment calls that never make it into the document.
We can standardize our way to resilience. Standards create consistency, which is valuable. But resilience requires adaptation, which standards constrain. The tension between the two is where the real work happens.
More controls mean more safety. Controls add complexity. Each control introduces new interactions, new failure modes, and new cognitive load. Past a point, more controls make the system harder to understand and operate safely.
Culture & Learning
Psychological safety is nice-to-have but not mandatory. Without psychological safety, people hide problems, avoid reporting near-misses, and optimize for blame avoidance. You get a system that looks fine until it isn’t.
Psychological safety means comfort. Psychological safety means people can speak up without fear of punishment. It doesn’t mean the work is comfortable. Honest feedback, disagreement, and accountability are all uncomfortable and all require safety to function.
People resist change because they don’t understand it. People often understand the change perfectly well. They resist because they see costs the change advocates haven’t acknowledged, or because they’ve seen similar initiatives fail before.
Experience automatically translates to better incident response. Experience without reflection produces confidence, not competence. Ten years of unexamined incident response is one year repeated ten times.
Learning happens automatically from doing exercises. Exercises create experience. Learning requires structured reflection on that experience. Without a debrief, you’ve just had an event.
Incident reviews should focus on what went wrong. Reviews that focus only on failure miss how the system usually succeeds. Understanding normal work is essential to understanding why it occasionally breaks.
If you do incident postmortems, you will be resilient. Postmortems are a ritual. Resilience requires that the insights from those rituals actually change how the organization operates. Most don’t.
Blame-free means accountability-free. Blame-free means separating the question of what happened from the question of who to punish. Accountability for learning and improvement is central to the whole approach.
Blame-free means we don’t name who was involved. Naming who was involved is necessary for understanding the context. The point is to explore their perspective, not to assign fault.
Doing blameless postmortems means we are psychologically safe. Blameless postmortems can be theater. If people still feel judged, if the “lessons” always land on the same teams, if leadership isn’t present, the label is doing nothing.
Fixing this specific problem means we’ve learned. Fixing the proximate cause is a repair. Learning means understanding the conditions that made the failure possible and likely. Those conditions are still there after the fix.
Insights naturally spread across the organization. They don’t. Knowledge stays local unless you build explicit mechanisms to move it. Most organizations don’t.
We can change culture fast. Culture is the accumulated residue of what gets rewarded, tolerated, and punished over time. You can change incentives quickly. The culture catches up on its own schedule.
AI makes human expertise in operations less important. When AI handles 99% of decisions, the humans who step in for the remaining 1% still need the judgment and skills to do so. But practicing a skill 1% of the time erodes it. The fallback capability quietly atrophies beneath the automation that depends on it.
Incident Analysis
The postmortem is where learning happens. The meeting is where stories get told. Learning happens when those stories change how the organization operates. If the action items sit in a ticket queue and the insights stay with the people in the room, the postmortem produced a document, not learning.
Action items from incident reviews are the measure of learning. Action items measure intent to change. Completion of action items measures follow-through. Neither measures whether the organization actually understood the conditions that produced the incident. You can close every action item and still not have learned.
We need to find the root cause. Complex failures have contributing factors, not a single root. The search for one root cause directs attention to the most obvious factor and away from the systemic conditions that made the failure possible and likely: the interactions, the organizational pressures, the design tradeoffs.
Severity determines whether an incident deserves a review. Low-severity incidents and near-misses often reveal the same systemic conditions that produce high-severity ones. They’re cheaper to learn from. Reviewing only major incidents means you only learn from the expensive lessons.
The people who caused the incident are the ones who need to learn from it. The people closest to the incident have the most context, but the conditions that produced it usually extend beyond their control. If only the responders learn, the organizational factors stay untouched.
AI-powered postmortems mean we’ve automated learning. AI can produce the timeline and summary faster. The learning comes from the conversation that surfaces why actions made sense to people at the time. You can automate the document. You can’t automate the understanding.
Chaos Engineering
Chaos engineering is just breaking things in production. Chaos engineering is hypothesis-driven experimentation. Breaking things randomly is just breaking things.
You need perfect systems before starting chaos engineering. If you wait for perfect, you never start. Chaos engineering is most valuable in systems that have unknown weaknesses, which is all of them.
Chaos engineering will cause more outages. Well-designed experiments have controlled blast radius and abort conditions. The outages chaos engineering prevents are the ones you weren’t prepared for.
Only Netflix-scale companies need chaos engineering. Every system that matters to someone has failure modes worth understanding. Scale affects the approach, not the need.
Chaos engineering requires expensive tools and infrastructure. You can start with a script that kills a process. The tooling helps at scale, but the practice starts with curiosity and a hypothesis.
Comprehensive preparation makes chaos experiments safer. Over-preparing for an experiment defeats the purpose. The value comes from encountering the unexpected. If you’ve prepared for everything, you’re testing your preparation, not your resilience.
More chaos experiments equals more resilience. Volume without learning is just activity. One well-designed experiment with a thorough review teaches more than twenty drive-by fault injections.
Chaos engineering without load produces good results. Systems behave differently under load. Testing failure modes in an idle system tells you how idle systems fail, which is rarely how production fails.
Passing one chaos experiment is enough. One experiment tests one hypothesis in one set of conditions. Systems change. Dependencies change. The experiment that passed last quarter may fail today.
Operational Readiness Reviews
An ORR is a checklist exercise. A checklist gets passed around. Someone fills in “yes” next to “Do you have monitoring?” Nobody asks what gets monitored, or what happens when the alert fires at 3am and the one person who knows the system is on vacation. Checklists describe work-as-imagined. The value of an ORR is in the conversation that probes work-as-done.
Passing an ORR means the system is ready for production. Passing means the review didn’t find disqualifying gaps. It doesn’t mean there are none. An ORR is a sample, not a proof.
An ORR and a regular ops health check are the same thing. An ORR is forward-looking: is this system ready for what’s coming? A regular operational health check is backward-looking: how did the last week go? One probes assumptions about the future. The other monitors the recent past. When organizations blur them together, the health check absorbs the ORR and the deep exploratory conversation disappears.
Teams can effectively review their own operational readiness. The people who built the system share the same assumptions about how it works. An effective ORR needs an external perspective, someone who wasn’t involved in building the thing and whose job is to ask the questions the team didn’t think to ask.
ORRs are a one-time gate before launch. Systems change after launch. Dependencies change. Teams change. An ORR that happened six months ago describes a system that no longer exists. Readiness is perishable.
Load Testing
Load testing validates that the system can handle expected traffic. If your test only confirms expected capacity, you’ve learned that your model is correct under the conditions you modeled. You haven’t learned where the model breaks down. Testing to expected capacity is validation. Testing beyond it is discovery.
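The difference between validation and discovery can be sketched in a few lines. The capacity model below is a toy stand-in for a real system, with invented numbers; the point is that the ramp keeps going past the expected capacity instead of stopping at it.

```python
# A toy model standing in for a real system: invented numbers, not a
# benchmark. The point is the shape of the test, not the values.

def error_rate(rps: int, capacity: int = 500) -> float:
    """Toy system: errors stay at baseline until load exceeds capacity."""
    if rps <= capacity:
        return 0.001  # baseline noise
    return min(1.0, (rps - capacity) / capacity)

def find_breaking_point(start: int = 100, step: int = 100,
                        threshold: float = 0.05) -> int:
    """Ramp offered load in steps until errors cross the threshold.

    Stopping at the expected capacity (500 here) would only validate
    the model. Continuing past it discovers where the model breaks.
    """
    rps = start
    while error_rate(rps) < threshold:
        rps += step
    return rps

print(find_breaking_point())  # prints 600: the first step past capacity
```

A real harness would drive a load generator instead of a formula, but the structure is the same: the interesting result is the number you didn't already know.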
Load tests with simplified traffic patterns represent production. Real production traffic has bursts, correlations, unusual request patterns, and hot keys that synthetic traffic doesn’t capture. Simplified patterns pass through common code paths and skip the edge cases that cause production surprises.
Load testing is done when the test passes. A passing test means the system handled one specific scenario. Change the traffic pattern, the data distribution, or the concurrency model and the result may be entirely different. One passing test is one data point.
Load tests should be predictable and reproducible. Reproducibility is useful for regression testing. But predictable tests don’t discover anything new. The most valuable load tests are the ones where you’re genuinely uncertain about the outcome.
Load testing in staging tells you how production will behave. Staging differs from production in data volume, traffic patterns, dependency behavior, and infrastructure configuration. It’s a useful approximation, but the approximation can hide the failures that matter.
If the system didn’t break during load testing, it won’t break under real load. Load testing covers the scenarios you thought to test. Production generates the scenarios you didn’t. The absence of failure in a test tells you about the test, not about the system.
Load testing is only about throughput and latency. Throughput and latency are the metrics people watch. But load also reveals resource exhaustion patterns, connection pool behavior, garbage collection pressure, dependency timeouts under contention, and cascading degradation paths. The system might meet its latency target while quietly exhausting something that will cause a much worse failure later.
Running load tests regularly means we understand our capacity. Regular tests with the same profiles produce the same results. Capacity understanding comes from varying the conditions: different traffic shapes, different failure modes injected during load, different dependency behaviors. Regular execution without variation is load testing theater.
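A sketch of what variation might look like, with invented shapes and parameter ranges: instead of replaying one fixed profile, each run draws a different traffic shape, so the harness can still surprise you.

```python
import random

# A sketch of varying test conditions instead of replaying one fixed
# profile. The shapes and parameter ranges are invented placeholders;
# a real harness would feed these values to a load generator.

def steady(duration_s: int, rps: int) -> list[int]:
    """Constant rate: the profile most teams replay run after run."""
    return [rps] * duration_s

def burst(duration_s: int, base: int, peak: int, every: int) -> list[int]:
    """A baseline rate with periodic spikes, closer to real traffic."""
    return [peak if t % every == 0 else base for t in range(duration_s)]

def pick_profile(seed: int) -> list[int]:
    """Each run draws a different shape from a seeded generator."""
    rng = random.Random(seed)
    if rng.choice(["steady", "burst"]) == "steady":
        return steady(60, rng.randint(200, 800))
    return burst(60, rng.randint(100, 300), rng.randint(800, 1500), every=10)
```

Seeding keeps any individual run reproducible for debugging while the run-to-run variation does the discovering.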
Strategy & Architecture
If you are compliant, you are resilient. Compliance is a minimum bar defined by regulators. Resilience is what happens beyond the checklist, when the situation is novel and the playbook doesn’t apply.
Root cause analysis prevents future incidents. Complex failures have multiple contributing factors, not a single root cause. The belief in root cause directs attention to one factor and away from the systemic conditions that matter more.
What works at [Famous Company] will work here. Their context is not your context. Their culture, scale, history, and constraints produced their approach. Transplanting practices without transplanting context produces cargo cults.
Best practices are universally applicable. A best practice is a practice that worked well somewhere. Whether it works here depends on conditions that the label “best practice” conveniently obscures.
Industry benchmarks represent our reality. Benchmarks are averages across organizations with different contexts. Your reality is specific. Benchmarks tell you where you stand relative to a statistical abstraction, not whether you’re doing well.
One-size-fits-all solutions exist for resilience. If they did, resilience wouldn’t be hard.
More redundancy equals more resilience. Redundancy adds complexity. Failover mechanisms can fail. Failback processes are often untested. Past a point, redundancy creates the conditions for failures that wouldn’t otherwise exist.
You can buy resilience. You can buy tools. Tools support practices. Practices require people, learning, and organizational commitment. The part you can’t buy is the part that matters.
AI will solve your resilience problems. AI can accelerate detection, assist with triage, and draft postmortems. It can’t navigate the organizational dynamics that suppress learning or make judgment calls during novel failures. Resilience is an organizational capability. Tools don’t produce it on their own.
Resilience is the same as reliability. Reliability is about preventing failure. Resilience is about what happens when failure occurs anyway, which it will.
Resilience means being unaffected by disruption. Resilience means absorbing disruption and adapting. Being unaffected is invulnerability, which doesn’t exist in complex systems.
If nothing failed, we are resilient. The absence of failure tells you nothing about your capacity to handle it. A system that hasn’t been tested hasn’t been proven.
We can prevent all failures with better processes / tests / developers. Pick your version. All three assume failure is caused by a deficit that can be filled. In complex systems, failure is emergent. You can reduce it. You cannot eliminate it.
Zero incidents is a realistic goal. See above. Zero incidents is a goal that punishes reporting and rewards hiding.
Think one is missing? Reply and tell me what it is. This list grows because people share the myths they’ve bumped into.



Great list! I would like to propose adding “We collect data on everything important”. I have experienced several incidents caused by TLS certificate and handshake errors, yet what happens at the TLS layer is rarely monitored. Clients that fail to connect never receive an HTTP error response, which is often the basis for measuring availability.