Discussion about this post

Nune Isabekyan

Really enjoyed this.

Classical statistics assumes your measurements are independent samples from a fixed distribution, and then the average converges to something real. Software incidents violate every part of that at once:

- Every incident you've recorded was generated by a different version of the system: different code, architecture, team, dependencies. Averaging over a long window summarizes that historical sequence of systems, not the one you have today. It tells you what happened, not how the current system behaves.

- Because the system keeps changing, the process is non-stationary, so even with infinite data the time average wouldn't equal the ensemble average. Pooling incidents from across years is a category error, not a sampling problem.

- Each incident is drawn from a different sub-population (failure mode, service, severity). A change in MTTR could come from the underlying durations getting worse, or just from the mix of incident types shifting, and you can't tell which from the average alone.

- When durations are this skewed, the sample mean doesn't even concentrate well. One incident can dominate the estimate.
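
To make the last two points concrete, here is a small sketch with made-up numbers. The incident types, durations, and the helper names `mttr` and `mttr_by_kind` are all hypothetical, chosen only for illustration. In the first half, the per-type means are identical across two quarters and only the mix changes, yet overall MTTR roughly doubles. In the second half, one long incident accounts for most of the mean.

```python
from statistics import mean, median

def mttr(incidents):
    """Mean duration (minutes) over a list of (kind, minutes) pairs."""
    return mean(minutes for _, minutes in incidents)

def mttr_by_kind(incidents):
    """Mean duration per incident kind."""
    by_kind = {}
    for kind, minutes in incidents:
        by_kind.setdefault(kind, []).append(minutes)
    return {kind: mean(vals) for kind, vals in by_kind.items()}

# Made-up data: same per-kind durations in both quarters, different mix.
q1 = [("config", 20)] * 8 + [("database", 300)] * 2
q2 = [("config", 20)] * 5 + [("database", 300)] * 5

print(mttr(q1), mttr(q2))                    # 76 vs 160
print(mttr_by_kind(q1) == mttr_by_kind(q2))  # True: nothing got slower

# Made-up skewed sample: nineteen short incidents and one very long one.
durations = [15] * 19 + [2000]
skewed_mean = mean(durations)                # 114.25
skewed_median = median(durations)            # 15
top_share = max(durations) / sum(durations)  # share of total time from one incident
print(skewed_mean, skewed_median, round(top_share, 3))
```

The overall average moved from 76 to 160 minutes while every per-type average stayed put, which is exactly the ambiguity the average alone can't resolve.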

More fundamentally, any summary statistic is a *projection* of a high-dimensional, time-varying object (the full distribution of incidents, evolving over time) onto a single number, so information loss is built in. So the real question isn't "what's the right average?" but "what projection preserves the structure I care about?" That forces three explicit choices: what functional of the distribution (mean, median, tail, shape), over what reference class of events, with what weighting over time.

Your percentiles argument is the first choice done better. Severity and service segmentation is the second. Rolling windows are the third. The recurring MTTR debate is really an argument about all three at once, disguised as an argument about arithmetic, which is why no statistical technique resolves it.
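
One way to see that these are three separate knobs is to make each an explicit argument. This is only a sketch under my own assumptions: the `Incident` fields, `weighted_quantile`, `summarize`, and the data are all hypothetical, and the quantile is just one possible functional. The reference class is a predicate, and the time weighting is a function from incident to weight (a rolling window is the special case of weights 0 or 1).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Incident:
    day: int        # days since an arbitrary epoch
    service: str
    severity: int
    minutes: float

def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q (0 < q <= 1)."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    if total <= 0:
        raise ValueError("no incidents with positive weight")
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= q * total:
            return value
    return pairs[-1][0]

def summarize(incidents, q, in_class, weight):
    """Choice 1: quantile q. Choice 2: predicate in_class. Choice 3: weight(incident)."""
    chosen = [i for i in incidents if in_class(i)]
    return weighted_quantile(
        [i.minutes for i in chosen], [weight(i) for i in chosen], q
    )

# Made-up incidents.
incidents = [
    Incident(1, "api", 1, 500),
    Incident(10, "api", 2, 30),
    Incident(80, "api", 1, 40),
    Incident(85, "api", 1, 60),
    Incident(88, "db", 1, 900),
    Incident(89, "api", 1, 50),
]
today = 90

def sev1_api(i):
    return i.service == "api" and i.severity == 1

def last_30_days(i):
    return 1.0 if today - i.day <= 30 else 0.0

def all_time(i):
    return 1.0

p90_recent = summarize(incidents, 0.9, sev1_api, last_30_days)  # 60
p90_all = summarize(incidents, 0.9, sev1_api, all_time)         # 500
print(p90_recent, p90_all)
```

Same functional and same reference class, but the answer goes from 60 to 500 minutes purely because of the time weighting. Changing any one of the three arguments is a different question, not a different estimate of the same one.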
