The MTTR Argument You Keep Having
A metric inherited from manufacturing, applied to systems that don't behave like production lines
If you have spent any time in an organisation that operates software, you have probably had this conversation more than once. Someone pulls up the MTTR dashboard: the number is too high, or has not moved in a quarter, or has gone up despite years of investment, and the argument starts.
Should we exclude that one six-hour incident, given that it was clearly an outlier? Should we trim the top and bottom 5%, or use Winsorization, or median absolute deviation, or some combination of the three? Should we separate by severity, by service, by detection path, by phase of the moon? Everyone has a reasonable-sounding statistical technique to propose, everyone is trying to make the number more honest, and every few months the same argument comes back, usually with a new technique attached.
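To make those options concrete, here is a minimal sketch, in Python, of what each technique actually does to a set of incident durations. The durations and the trim fraction are invented for illustration, and the helpers are simplified textbook versions rather than anything from a particular library.

    from statistics import mean, median

    # Hypothetical incident durations in minutes; the 360-minute value
    # plays the role of the six-hour outlier everyone argues about.
    durations = [5, 7, 8, 9, 12, 15, 18, 25, 40, 360]

    def trimmed_mean(data, frac):
        # Drop the top and bottom `frac` of values, then average the rest.
        data = sorted(data)
        k = int(len(data) * frac)
        return mean(data[k:len(data) - k]) if k else mean(data)

    def winsorized_mean(data, frac):
        # Clamp the top and bottom `frac` of values to the nearest kept value.
        data = sorted(data)
        k = int(len(data) * frac)
        if k:
            data = [data[k]] * k + data[k:len(data) - k] + [data[-k - 1]] * k
        return mean(data)

    def mad(data):
        # Median absolute deviation: a robust measure of spread.
        m = median(data)
        return median(abs(x - m) for x in data)

    print(f"raw mean:        {mean(durations):.1f} min")                  # 49.9
    print(f"trimmed mean:    {trimmed_mean(durations, 0.10):.1f} min")    # 16.8
    print(f"winsorized mean: {winsorized_mean(durations, 0.10):.1f} min") # 18.1
    print(f"MAD:             {mad(durations):.1f} min")                   # 6.0

Each technique yields a different, more stable number, and that is rather the point: the argument about which of the three is right has no resolution, because none of them addresses whether the 360-minute event and the 5-minute events were ever draws from the same process.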
What I find interesting is that the people having these arguments often already know MTTR has problems. The conversation I keep seeing is some variation of “we know the metric is flawed, but leadership wants something simple, so let’s at least make it less wrong.” That framing is honest about the constraint, and I have a lot of sympathy for it. The pressure to produce a single number that fits on an executive slide is real, and pretending it is not real does not help anyone.
But I want to suggest that the argument that follows from this constraint is not what people think it is. It is not really a debate about statistics, but a symptom of a deeper problem that no amount of mathematical trickery is going to fix, and the constraint that produces it is worth examining directly rather than working around.
Before going further, a small note. MTTR means different things in different organisations: Mean Time To Repair, Recovery, Resolve, Respond, or Restore, depending on who you ask and which framework they grew up with. The argument I am about to make does not depend on which version you use. The statistical and structural problems apply to all of them, because they are all averages of incident durations, and incidents are the same kind of heterogeneous events regardless of where you put the start and end markers (which is often another subject of argument). So wherever you see MTTR in what follows, substitute whichever variant your organisation uses.
Where MTTR comes from
MTTR was not invented for software. It comes from industrial reliability engineering, where the same machine fails the same way over and over again. Picture a production line that turns out the same product, day after day, using the same machines running the same processes in the same sequence. Nothing about the line changes from week to week. A pump fails, you repair it, you record how long the repair took, and a month later the same pump fails again in roughly the same way and you repair it again. Over hundreds of failures, the mean repair time tells you something useful about your maintenance operation: how good your technicians are, how well-stocked your parts inventory is, how clear your procedures are when something goes wrong.
The metric works in this setting because the underlying events are repeatable, and they are repeatable because the system producing them is stable. The line in March is the same line as in January, the pump is the same pump, the failure mode is the same, the repair procedure is the same, and averaging across them is a coherent thing to do because the variation between events is small relative to what is being measured. This is the world MTTR was designed for, and in that world it works well.
Where MTTR breaks
Software incidents are not pump failures. The system that produced the incident in January is not the same system that produces the incident in March, because in the intervening months new code has shipped, dependencies have updated, usage patterns have shifted, architecture has evolved, and the people operating it have learned things that have changed how they think about it. Incidents in complex software systems rarely if ever repeat themselves, because the system underneath them is constantly changing. Every incident is closer to a novel event than a repeat of a known one, and the container restart that took five minutes last month and the database corruption that took six hours yesterday are not two samples from the same distribution. They are not even the same kind of thing.
When you average across them, you are not smoothing out variation in a repeatable process. You are computing the mean of events that have very little to do with each other, and that mean does not represent operational efficiency in the way it does for the pump. It represents nothing in particular.
This is what people are running into when they argue about outliers. The instinct is sound, because the number really does not represent anything real, but the diagnosis is wrong: the problem is not the outliers, it is that the events being averaged were never the right kind of thing to average.
Why the argument keeps coming back
Once you see this, the recurring debate about how to clean up MTTR makes more sense. It is not an argument anyone can win, because there is no statistical treatment that turns an incoherent measurement into a coherent one. Trimming makes the number more stable, Winsorization makes it more defensible against critique, MAD makes it more robust to extreme values, but none of these change the underlying issue: the thing being measured was never a single coherent quantity to begin with.
So the argument resurfaces every few months, and it usually surfaces from the top. A large incident lands and leadership wants to know why it took so long, who was responsible, and what is being done about it. The MTTR dashboard comes back onto the table, often as the entry point to a broader conversation about accountability, with questions about why the number is what it is, why it has not improved, and what can be done to make it move. The team, sitting with the same flawed metric they have been sitting with for years, does what they can. They propose a normalisation technique, argue about which incidents to include, reach a decision, and the number stabilises for a while before drifting again, at which point the next large incident reopens the same conversation.
It feels like progress, because there is intellectual work happening, but the argument has no terminal state. It cannot, because the question it is trying to answer was the wrong question, and the dynamic that keeps reopening it is not really about statistics either. It is a search for somewhere to put the discomfort that comes after a bad incident.
Outliers are not the problem
There is a second thing worth saying, which often gets lost in the statistical conversation: outliers in incident data are not noise. They are frequently the most important events in the dataset.
The six-hour database corruption that dominates your MTTR is the incident that taught the team something they did not know about their architecture, that revealed gaps in monitoring, runbooks, staffing assumptions, and coordination patterns, and that produced lasting changes to how the system gets built and operated. The five-minute container restarts produced none of that. They are routine, they reflect operational competence, but they do not generate learning the way the long, painful incidents do.
Treating outliers as data quality problems to be cleaned up is therefore exactly backwards. They are where the signal is, in a sociotechnical sense, and making their suppression a goal is not in the organisation’s interest, even if it makes a dashboard look better. This is the deeper irony of the argument about how to handle outliers: the events being argued about are often the most informative events the organisation has access to, and the statistical techniques proposed to minimise their influence on the headline number are minimising the influence of the most important data points.
What works better
I have written before about percentiles as a more honest alternative, and I will not repeat the full case here. The short version is that P50, P90, and P99 tell three different and useful stories, none of which collapses a heavy-tailed distribution into a single mean that represents nobody’s experience.
The practical move I would suggest for organisations currently committed to MTTR is not to remove it overnight, because that tends to produce resistance disproportionate to what the change is actually asking for. Instead, put percentiles next to MTTR in the existing reporting and walk leadership through what each one actually tells them.
A small worked example helps. Imagine a quarter with 20 incidents: 18 of them resolved in under 10 minutes (small, well-understood failures handled by automation or quick rollback), one took 90 minutes (a regression that needed a careful diagnosis), and one took 8 hours (a novel failure that pulled in three teams and rewrote how a subsystem gets monitored). Most people in the org would say this was a good quarter. The team caught problems early, handled the routine ones quickly, and learned something significant from the long incident.
Now look at what the metrics say. MTTR for the quarter is around 33 minutes, a number that suggests a typical incident takes more than half an hour to resolve, which describes no incident in the dataset: the routine ones were far quicker, and the two long ones took far longer. Look at the percentiles instead and the picture changes. P50 is around 6 minutes, which tells leadership that half the incidents are resolved in about 6 minutes or less, an honest reflection of the team’s standard operational capability. P90 is around 10 minutes, which says that even when things are not routine the response is usually quick. P99 is 8 hours, which surfaces the long incident as what it is: a rare, complex event worth a separate conversation, not a number to be averaged away.
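For anyone who wants to check the arithmetic, here is a small sketch of that quarter in Python. The 18 short durations are hypothetical values chosen to fit the description above (all under 10 minutes, median around 6), and the percentile helper uses the nearest-rank method, so every value it reports is an actual incident from the quarter rather than an interpolation.

    from math import ceil
    from statistics import mean

    # Hypothetical durations in minutes for the quarter described above:
    # 18 quick resolutions, one 90-minute regression, one 8-hour incident.
    short = [2, 3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 9]
    durations = sorted(short + [90, 8 * 60])

    def percentile(data, p):
        # Nearest-rank percentile: the smallest value with at least
        # p% of the data at or below it.
        data = sorted(data)
        return data[ceil(p / 100 * len(data)) - 1]

    print(f"MTTR: {mean(durations):.1f} min")        # ~33.6, matches no incident
    print(f"P50:  {percentile(durations, 50)} min")  # 6, the routine case
    print(f"P90:  {percentile(durations, 90)} min")  # 9, still quick
    print(f"P99:  {percentile(durations, 99)} min")  # 480, the 8-hour event

With only twenty incidents, P99 is effectively the maximum, which is exactly what you want: the tail percentile refuses to let the worst event of the quarter disappear into the average.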
Three numbers, three different conversations. The team’s day-to-day capability, their handling of harder cases, and their worst event of the quarter, each visible without being collapsed into the others. Compare that to a single MTTR of 33 minutes, which represents none of those three things and invites the wrong follow-up question: “why is MTTR so high, and what are you doing to bring it down?”
Once leadership sees this side by side, the conversation about which view represents reality becomes much easier to have. If trimmed percentiles turn out to be useful later, they are available, but that is a refinement on a more coherent measurement rather than a rescue attempt on an incoherent one.
The thing I would push back on is the assumption that more sophisticated mathematics applied on top of MTTR is going to produce a better answer. It will produce a momentarily more stable number, but it will not produce a more meaningful one, because the events being measured are not the kind of events an average can usefully describe, and no amount of trimming, Winsorization, or MAD is going to change that.
Why this matters
The reason it is worth getting this right is because metrics shape what organisations pay attention to and what they invest in, and when the headline metric is an average of incident durations the implicit goal becomes reducing that average. Teams optimise for it, behaviours follow, and outliers, which is to say the incidents most worth learning from, become problems to be made smaller rather than events to be understood.
A measurement system that points the organisation in this direction will keep producing the same arguments quarter after quarter, regardless of how the statistics are dressed up, and the way out is not a better technique but recognising that the question MTTR was designed to answer is not the question software incident response is actually asking.
The leadership pressure driving the original constraint, the request for a single simple number, is not unreasonable on its face. Leaders do need ways to compare across teams, track direction over time, and have short conversations about complex things, and the honest response to that need is not to refuse the simplification but to offer one that points at something real. Percentiles do that better than MTTR. The work, in the end, is in showing leadership the difference, rather than in producing a more defensible version of the wrong measurement.
//Adrian


