The Severity Argument You Keep Having
The argument no rubric will ever settle — and what would
A client of mine had been through a series of outages. At some point, one of their customers told them:
“You seem more interested in negotiating priority than fixing my problem.”
That sentence has stayed with me, because it captures something most engineering organisations would rather not see clearly. The customer was not in the room when the severity meetings happened, but they could tell from the response, from the cadence of communication, from the shape of what was happening on their side, that a meaningful chunk of the organisation’s attention had gone to a procedural question that had nothing to do with their outage.
If you have spent any time working incidents, you have probably lived this situation. Someone pulls up the severity rubric mid-incident or just after. The number is high, or low, or in dispute, and the argument starts: should this be a P1 or a P2, does the customer impact count as “major,” does the contract automatically promote it, last quarter we called something similar a P2 so shouldn’t this also be one, should we wait until we know the full impact, should we downgrade now that the immediate fire is out. Everyone has a reasonable-sounding rationale, everyone is trying to land the right number, and every few weeks the same argument comes back with a different incident attached.
The customer in the opening saw this more clearly than the people inside the room. The visible argument is about severity levels. The underlying argument is about a measurement that has quietly become a goal. Severity counts show up on scorecards, in quarterly reports, in promotion criteria, in the SLA penalty clauses of contracts. Once a number gets that much attention, the conversation stops being about the impact on the customer and becomes about what the number will be. Management researchers call this surrogation: the proxy substitutes for the thing it was supposed to indicate.
You wanted reliability and teams that learn from incidents. You measured high-severity incidents. You now have an organisation working to reduce the count, which is not the same thing. The count goes down through suppression: incidents that don’t get declared, problems classified as something else, issues worked around quietly. Healthier teams report more incidents, not fewer, because psychological safety raises the count. Fewer high-severity incidents is the side effect of teams that learn, not the route to it. Make the count the goal directly and teams will hit it through argument, downgrade, and quieter classification. The learning never happens.
Fred Hebert at Honeycomb made the boundary-object case directly: severity scales fail because a single linear number can’t carry the multiple things different stakeholders need it to mean. His prescription is to drop the scale and use descriptive types instead — isolated, major, security, time bomb, ambiguous — so each term encodes its own workflow rather than its own rank. This post stays with Hebert’s diagnosis and adds a second cause underneath it.
The meanings severity is asked to carry — impact, workload, visibility, contractual exposure — give everyone in the room a lever to argue the number down. The boundary-object problem makes the argument possible. The measurement makes it inevitable. No rubric refinement fixes either.
What severity is being asked to do
Most severity scales run P1 to P4 or Sev1 to Sev5. One number per incident. It looks simple on the surface.
That number is supposed to communicate, simultaneously:
- Impact — how many customers, how badly, what scope
- Workload — who has to be involved, how many people, for how long, what process activates, whether an RCA gets triggered
- Visibility — what shows up on leadership dashboards, what gets reported externally
- Contractual obligation — what was promised to this specific customer, what the legal and commercial consequences are
These four things rarely move together, and the gap between them is where most severity arguments live. An incident can have low impact but high contractual exposure, or high workload but low visibility, or low operational severity but real regulatory consequence, or it can affect only one customer but the wrong customer at the wrong moment. Each of these combinations makes the call subtle in ways no rubric quite captures.
Inside workload, the question of whether the incident triggers an RCA is often the heaviest lever. An RCA is hours of write-up, review, and follow-up that the team would rather avoid, and in my customer engagements it is the single most common reason severity gets argued down.
The single number forces compression, and compression is what makes severity meetings argue about which of the four meanings will dominate this time, rather than agreeing on what actually happened.
A boundary object is useful in normal times because it lets people coordinate without agreeing on definitions. Under load it breaks. Engineers reach for severity to negotiate workload, customer success reaches for it to communicate customer pain, incident commanders reach for it to gate process, and leadership reaches for it to filter visibility. They are all using the same word and arguing about different things, and most of the time none of them realise that is what is happening.
What looks like an argument about which severity fits is really an argument about which meaning of severity will dominate the conversation.
Where the pull goes
If you pay attention during these meetings, the energy is almost always directed at lowering severity, rarely at raising it. The arguments offered are not primarily about customer impact; they are about response shape: who would have to be involved, how visible it would be, how much process would activate, what the workload would look like.
Where the pull goes tells you what the term is doing. If severity were really a good description of impact, you would expect the arguments to go in both directions, with customer-facing roles pushing up about as often as engineers push down. The vocabulary would reach for the customer’s contract, the customer’s user count, the customer’s downstream consequence. The meeting would reach for facts.
Instead, the meeting often reaches for examples of similar incidents and asks how those were classified, with people looking for precedent that supports the lower number. The customer’s contract often does not get opened, and the customer-facing roles are often not in the room or are outnumbered by the people who will carry the operational consequences of the call. That is the behaviour of a workload gate, not a description of impact.
This is also where the customer’s perception comes from. When the negotiation pulls consistently toward less workload, less visibility, less formal response, the implicit message is we are arguing toward the version of this that is least disruptive to us, even if no one means it this way. The customer is downstream of that, and they feel it.
The events that get lost
Severity arguments produce winners and losers among incidents. The clear-cut cases go through quickly. The contested cases get downgraded under pressure, more often than not, because the pull is consistently toward less workload.
Which is to say: the incidents that get downgraded under negotiation are usually the ambiguous, complex, partially-customer-impacting cases that do not fit neatly into the standing categories. These are exactly the incidents most worth examining carefully.
A clear-cut P1 is informative in the way a textbook example is informative, because it confirms the categories. An ambiguous incident that the team had to argue about is informative in a different way: it reveals where the categories are wrong, where the system is producing failure modes the rubric did not anticipate, where the assumptions encoded in the severity scheme do not match the territory.
When the standing process punishes ambiguity by routing it toward the lower-severity bucket, which is also often the bucket that gets less analysis, smaller postmortems, weaker follow-up, the organisation systematically loses access to the events that contain the most information. The stuff worth learning from disappears into the bucket designed to require the least attention.
Why the argument never ends
Once you see both problems together — the term overloaded with meanings, and the term being measured — the recurring nature of the argument makes sense. It is not a debate that anyone can win, because there is no rubric refinement that turns an overloaded, measured term into a coherent one. Better criteria make the boundaries sharper, clearer rules make the call more defensible, a standing committee makes outcomes more consistent, a customer-impact rubber stamp makes one dimension explicit, but none of these change the underlying issue, which is that the term is being asked to encode multiple things that cannot be encoded together.
So the argument resurfaces on every ambiguous incident or when customers finally decide to share their frustrations. Someone proposes a refinement, the discussion happens, a decision is made, the argument quietens, and then it comes back at the next incident. It feels like progress because there is real intellectual work happening, but it has no terminal state.
This is why severity meetings feel exhausting. People walk out feeling they spent two hours on something that should have taken ten minutes, and they are right. They spent two hours on a problem that has no solvable form in its current shape.
What is actually being optimised
Cultural fixes — “we should care more about customers, let’s call out negotiation when we see it” — do not work because they aim at the wrong layer. The metric is the layer.
Recurring patterns in mature organisations almost always have an incentive structure underneath them. Sometimes a team scorecard counts P1s and a high count looks bad on someone’s quarterly review. Sometimes P1s trigger upward reporting that someone is trying to avoid. Sometimes SLA penalties are tied to severity, while the team making the severity call is shielded from the financial consequence but accountable for the call itself. Sometimes the incident management team’s own workload is implicitly part of how they are measured.
These metrics are usually invisible to the people inside the system, and often invisible to leadership too, because they were installed years ago by someone who has since moved on. People who ask *why does severity keep getting argued?* are often told it’s the process. That answer is honest, but it is probably not the whole answer.
If you cannot see the incentive driving the negotiation, you cannot dismantle it. Every cultural intervention will be working against an unmeasured current.
Three layers, one drift
Step back from all of that, and a shape comes into view. Each piece of the diagnosis lives at a different layer of how the organisation actually works, and the layers are out of alignment with each other.
The frame I use in my book Why We Still Suck at Resilience is to examine three layers at once: outcomes, rewards, and rituals.
Outcomes are what the organisation claims it wants: customer success, reliability, learning. The values on the wall, the language in the strategy deck, the things leadership says in town halls.
Rewards are what actually gets measured, celebrated, and promoted. The metrics on dashboards, the recognition in all-hands meetings, the promotion criteria, the things that determine careers.
Rituals are what the standing processes actually produce. Severity meetings, postmortems, performance evaluations, sprint planning. Not what the rituals are supposed to do, but what they actually do when you sit and watch them.
When these three layers point in different directions, when outcomes say customer success but rewards celebrate low P1 counts while rituals systematically argue severity downward, drift is happening. The direction of the misalignment reveals the direction of the drift.
Most attempts to fix the severity argument operate only on the outcomes layer: new value statements, town hall reaffirmations, posters about putting customers first. Restating outcomes while rewards and rituals stay flat does not close the gap; it deepens it, because it teaches people that outcome statements are decoupled from real consequence. The work has to happen where behaviour actually gets shaped, at the rewards and ritual layers, and that requires leadership willing to look at what is actually being measured and decide whether it is doing useful work.
What works better
If the severity argument lives at the rituals and rewards layers, that is where the work has to happen. Five directions are worth thinking about: four at the rituals layer, one at the rewards layer. The ritual moves are faster to make and over time put pressure on the rewards; the rewards work moves slower but goes deeper. Doing any one of them helps. Doing all five together produces compounding effects.
Reduce the load on the term
If severity is overloaded, reduce the load. Hebert’s incident-types prescription is the cleanest version, but for most organisations it is too big a leap — severity language is baked into customer contracts and reporting structures. You cannot just remove it.
The practical move I found useful is to supplement rather than replace. Keep the severity number where it has to live, in contracts and in formal reporting, but put a small descriptive layer alongside it that captures the things severity is currently doing badly: customer impact (who, how many, what scope), workload (which teams, expected duration), visibility (who needs to be told), and contract reference (which clauses apply). Four short fields, filled in at incident declaration, kept alive through the response.
A small example. An incident gets declared at P3 because it affects a non-critical internal tool used by a small number of users for a few hours, and by the rubric that is the right number. The four-field layer alongside the P3 might read:
- Customer impact: 12 internal users on the analytics team, blocked from quarter-end reporting due in 36 hours.
- Workload: two engineers, expected 4 hours.
- Visibility: finance leadership notified, no external comms.
- Contract: none affected.
The P3 stays, and nothing has changed about the workflow it triggers, but anyone walking into the response or reading about it later now has the actual situation, not just the rank. The question of whether the severity level fits stops being the only thing the meeting can argue about, because there are now four other things on the page that are easier to discuss directly.
When the next severity meeting happens, you have something to argue from instead of arguing about. The compression problem does not go away, but the meeting now has a chance of being about the situation instead of about which meaning of severity will dominate.
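If your incident tooling supports custom fields, the descriptive layer can be as small as a record attached at declaration. A minimal sketch in Python; the field names are illustrative, not taken from any particular incident tool:

```python
from dataclasses import dataclass

# A sketch of the four-field layer. Field names are illustrative,
# not any particular incident tool's schema.
@dataclass
class IncidentContext:
    severity: str            # the rank contracts and reporting need, e.g. "P3"
    customer_impact: str     # who, how many, what scope
    workload: str            # which teams, expected duration
    visibility: str          # who needs to be told
    contract_reference: str = "none affected"  # which clauses apply, if any

# The P3 example from above, carried alongside the rank:
ctx = IncidentContext(
    severity="P3",
    customer_impact="12 internal users on the analytics team, blocked from "
                    "quarter-end reporting due in 36 hours",
    workload="two engineers, expected 4 hours",
    visibility="finance leadership notified, no external comms",
)
```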
A small antidote that pairs well with this, especially when you notice the meeting drifting back into rank-arguing: imagine the customer in the room. Not as a rhetorical device, but seriously. What would you actually tell them, in plain language, about why this is or is not a P1? On the other side of the service is a customer trying to get work done, and behind them are usually their own customers waiting on something. The severity meeting is invisible to all of them, but the consequences of the call are not. Running the argument past an imagined customer before settling on the number tends to surface what the discussion has been talking around.
Lower the cost of declaring
If suppression is the upstream version of the surrogation problem, the way to disarm it is to make declaration itself low-friction and non-punitive. Martha Lambert at incident.io made the practical case at SEV0 2025: lower the bar, declare more, train the response muscle on small things so the bigger ones go smoothly. She also suggests auto-creating incidents from errors, taking the declaration decision out of human discretion entirely. There is nothing left to suppress. This is the Andon Cord pattern applied to incident declaration: lower the cost of surfacing a problem, and never punish the surfacing. Her customer-side framing lands the rest of it:
“Customers understand that things go wrong. They just want to know that you’re dealing with them really well.”
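On the auto-creation point, a minimal sketch of what it can look like. The endpoint, threshold, and payload below are placeholders, not incident.io’s API; the point is that declaration becomes a mechanical consequence of the signal:

```python
import requests

# Placeholder endpoint and threshold; this is not incident.io's API.
INCIDENT_WEBHOOK = "https://incidents.example.com/declare"
ERROR_RATE_THRESHOLD = 0.05  # declare at a sustained 5% error rate

def maybe_declare(service: str, error_rate: float) -> None:
    """Declare an incident mechanically when the signal crosses the threshold."""
    if error_rate >= ERROR_RATE_THRESHOLD:
        requests.post(INCIDENT_WEBHOOK, json={
            "service": service,
            "trigger": f"error rate {error_rate:.1%} >= {ERROR_RATE_THRESHOLD:.0%}",
            "declared_by": "automation",  # no human discretion, nothing to suppress
        }, timeout=5)
```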
Decouple learning from severity
Most severity scales gate two things: who responds and whether learning happens. The two do not have to be linked.
When learning happens only for P1s, or only for the incidents that survived the negotiation as P1s, the organisation has built an incentive to argue severity downward. The lower number is also the one that makes the postmortem go away, the action items disappear, the reflection stop. Of course the pull is downward. The severity decision is also a decision about whether the team has to write anything down afterward.
Decoupling these is structural. Severity decides response shape; learning happens regardless. AWS’s COE process is one well-known version of this: a COE (Correction of Errors) is Amazon’s version of the postmortem, triggered by any incident with customer impact, even a single customer. Customer obsession makes the call, and any team can ask another to take one on, even for a near miss. Another version is the practice of writing a learning artefact for any incident that surprised the team, independent of how it was officially classified.
Once you decouple, the argument over severity stops mattering for the learning question. The high-stakes argument quietly becomes lower-stakes, because less rides on the answer. This is also the move that makes ambiguity productive instead of expensive: the contested cases, the ones the rubric struggles with, are now the ones most likely to get studied, because the studying is no longer gated on the rubric resolving them.
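In tooling terms, the decoupling is simply that the learning trigger never reads the severity field. A sketch, with hypothetical function and field names:

```python
# Severity gates response shape; the learning trigger never reads it.
def response_shape(severity: str) -> str:
    """Who responds and how fast is still severity's job."""
    shapes = {"P1": "full incident command", "P2": "on-call plus owning team"}
    return shapes.get(severity, "owning team only")

def needs_learning_review(customer_impact: bool, surprised_team: bool) -> bool:
    """Learning is triggered by impact or surprise; severity is not an input."""
    return customer_impact or surprised_team
```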
Decoupling is the precondition for learning, not a guarantee of it. The artefact itself can become theatre: a defensible argument that learning is occurring while no actual learning happens. That is a separate failure mode my book spends time on.
Default high, track the trajectory
Two tactical moves work alongside the structural ones, especially when a real argument is happening in the room.
Default to the higher severity when there’s disagreement. Treat the incident as the highest plausible level until proven otherwise. Mobilise the response that level warrants. Once the customer is taken care of and the immediate situation is resolved, the argument can resume. In practice it rarely does, because everyone is tired and the urgency has passed. The downward pull only works when it’s exercised *before* the response is committed; once committed, the argument loses its purpose.
Treat severity as a trajectory rather than a single number. An incident can start at P3, escalate to P1 when impact becomes clear, then return to P3 or P4 once the situation is stable. Track the peak severity alongside the current severity, and record every change in the incident timeline. (I owe this framing to a recent conversation with Brent Chapman.) The peak severity becomes useful data: what severity was actually doing during the response, not what someone settled on at the end. It also removes heat from the meeting, because the call is no longer a one-shot judgment. It can be revised as the picture clarifies.
Both moves work because they remove the tactical incentive to argue. Default-high pushes the argument to the after-the-fact moment when no one wants to have it. Trajectory tracking spreads the severity decision across the incident timeline; no single judgment carries the full weight.
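A sketch of what trajectory tracking can look like in code, assuming a simple P1–P4 rank; the class and method names are illustrative, not from any incident platform:

```python
from datetime import datetime, timezone

SEVERITY_RANK = {"P4": 0, "P3": 1, "P2": 2, "P1": 3}

class SeverityTrajectory:
    """Severity as a series of timestamped changes rather than one number."""

    def __init__(self, initial: str):
        self.changes = [(datetime.now(timezone.utc), initial)]

    def set(self, severity: str) -> None:
        # Record the change; earlier calls are never overwritten.
        self.changes.append((datetime.now(timezone.utc), severity))

    @property
    def current(self) -> str:
        return self.changes[-1][1]

    @property
    def peak(self) -> str:
        # Derived from the timeline, not negotiated at the end.
        return max((s for _, s in self.changes), key=SEVERITY_RANK.__getitem__)

# P3 at declaration, escalated to P1, stabilised back to P3:
t = SeverityTrajectory("P3")
t.set("P1")
t.set("P3")
assert t.peak == "P1" and t.current == "P3"
```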
Take severity off the scorecard
The deeper move is at the rewards layer. Use severity counts as information. Keep them off scorecards, performance reviews, and promotion criteria. Once a number gets that kind of attention, the thing the number was supposed to indicate becomes invisible. That is the surrogation problem this piece opened with, surfacing in the place hardest to fix.
This is the hardest of the five moves, but it addresses surrogation at its source. In my experience, leaders understand surrogation easily once it is explained. They have rarely chosen the measurement system that produces it. Most of the structures around incident severity were inherited from someone who has moved on, and the people operating within them now do so without ever having decided the structures made sense. Naming the measurement and asking whether anyone would design it this way from scratch is often enough to unstick the conversation.
Most of the time, the conversation simply doesn’t happen. The people closer to the harm assume leadership won’t hear it; the people senior enough to be heard haven’t done the diagnosis. When someone with the standing walks a leader through it calmly, the metric stops being defended more often than people fear.
Some leaders will not unstick even then, because the measurement is doing political work for them they aren’t ready to give up. That is a different problem from inheritance, and it doesn’t yield to the same conversation. Naming the political function the metric serves is where that one starts.
The remaining work is consistency. Someone with standing has to say “we are no longer tracking P1 counts as a performance indicator,” and they have to hold that line through the next quarterly review when someone asks where the metric went. It is uncomfortable to remove a measurement that has been part of the operating rhythm for years, even when everyone agrees it is doing harm. But the alternative is the argument this piece opened with, in perpetuity.
Why this matters
The reason severity arguments are worth getting right is not that severity itself matters. It is that severity meetings are where the organisation talks to itself about which incidents count.
When the meetings systematically pull toward lower severity, the organisation tells itself a story about its own performance that is more flattering than reality: fewer high-severity incidents, better dashboards. The proxy has become the goal, and the goal has been forgotten. That story shapes investment decisions, what gets prioritised, what gets celebrated, what counts as a good quarter.
It also shapes how the organisation appears to its customers. The customer who said you seem more interested in negotiating priority than fixing my problem was reading the same room from outside. They saw an organisation whose internal language was about classification, when their language was about consequence. The mismatch is visible, even when the people inside the meeting do not realise it is happening.
A measurement system that points the organisation in this direction will keep producing the same arguments quarter after quarter, regardless of how the rubrics are dressed up. There is no better technique waiting to be found. The way out is to recognise that the question severity was designed to answer is not the question incident response is actually asking.
Tell the truth in the dimensions where it actually lives. Notice when outcomes, rewards, and rituals point in different directions. The misalignment is the message.
//Adrian
---
Second in a short series on the metrics and categories incident response argues about. The first piece was on MTTR.


