Deployment Failure Rate: The Metric Every Team Measures Wrong
Change failure rate (CFR) is supposed to be one of the cleaner DORA metrics. In theory, you count the percentage of deployments that cause degraded service and require remediation. In practice, teams carve away uncomfortable cases until the number looks respectable.
I often hear a company say its failure rate is 12%, only to discover it counts nothing but P1 incidents opened during business hours. A slow checkout flow after a release, a rollback that happened before customers complained, and emergency config fixes after midnight all disappear from the metric.
1. The formal definition is broader than most dashboards
A failed change is not limited to a catastrophic outage. It includes releases that degrade latency or push up error rates, as well as anything that ends in a rollback, a hotfix, or other post-deployment remediation. If a release forces engineers to intervene to restore expected performance, the change failed in a meaningful way.
That matters because the metric is designed to connect delivery speed with delivery quality. When the definition is narrowed too far, teams protect the score and lose the signal.
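To make the broader definition concrete, here is a minimal sketch of how a team might model remediation per deployment. The category names and the Deployment record are illustrative assumptions, not a standard DORA schema.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Remediation(Enum):
    """Post-release interventions that put a change in the failure bucket."""
    ROLLBACK = auto()       # redeploy the previous version
    HOTFIX = auto()         # fix-forward patch release
    FLAG_DISABLE = auto()   # feature flag turned off to stop the bleeding
    CONFIG_FIX = auto()     # emergency configuration correction
    INCIDENT = auto()       # incident opened for degraded latency or error rates


@dataclass
class Deployment:
    release_id: str
    remediations: list[Remediation] = field(default_factory=list)

    @property
    def failed(self) -> bool:
        # Any remediation at all counts, not just a catastrophic outage.
        return bool(self.remediations)
```

The property is deliberately blunt: "degraded but not down" stays in scope because it still required intervention.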
2. Rollback-only counting misses fix-forward reality
Some organisations count only rollbacks. That is understandable because rollback events are easy to detect, but it ignores modern deployment patterns. Many teams fix forward with a fast patch, a feature-flag disable, or a config correction. The customer still experienced a failed change even if the application version number did not move backward.
I prefer a remediation-based definition: if the deployment required active correction after release, it belongs in the numerator.
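As a sketch of that rule (with the event shape and field names assumed for illustration), a deployment lands in the numerator the moment any remediation event references its release:

```python
def change_failure_rate(deployments: list[str],
                        remediation_events: list[dict]) -> float:
    """Share of deployments that needed active correction after release.

    `deployments` holds release identifiers; each remediation event
    (rollback, hotfix, flag disable, config fix) names the release it
    corrected under a hypothetical 'release_id' key.
    """
    if not deployments:
        return 0.0
    remediated = {event["release_id"] for event in remediation_events}
    failed = sum(1 for release_id in deployments if release_id in remediated)
    return failed / len(deployments)


# Four releases, one rolled back and one patched forward: CFR = 0.5.
cfr = change_failure_rate(
    ["r-101", "r-102", "r-103", "r-104"],
    [{"release_id": "r-102", "type": "rollback"},
     {"release_id": "r-104", "type": "hotfix"}],
)
```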
3. Teams undercount in four predictable ways
- They exclude degraded performance unless it crosses a severe incident threshold.
- They count rollbacks but ignore hotfixes and flag reversals.
- They attribute failures to infrastructure and remove them from delivery reporting entirely.
- They omit incidents discovered hours later, even when the deployment introduced the problem.
Each shortcut produces a cleaner metric and a less useful one. The score becomes easier to present upward and harder to improve from.
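One way to close those gaps is to attribute incidents to the most recent deployment within a fixed window, regardless of severity or whether the ticket was filed under infrastructure. A minimal sketch, assuming a 24-hour window and hypothetical field names:

```python
from datetime import datetime, timedelta

ATTRIBUTION_WINDOW = timedelta(hours=24)  # assumed window; tune per service


def attribute_incident(incident_time: datetime,
                       deploy_times: dict[str, datetime]) -> str | None:
    """Return the release_id of the latest deployment preceding the incident
    within the window, or None if no deployment plausibly caused it.

    Severity and ownership labels are deliberately ignored: a minor latency
    regression discovered six hours later still lands on the release.
    """
    candidates = [
        (deployed_at, release_id)
        for release_id, deployed_at in deploy_times.items()
        if deployed_at <= incident_time <= deployed_at + ATTRIBUTION_WINDOW
    ]
    if not candidates:
        return None
    return max(candidates)[1]  # release_id of the most recent qualifying deploy
```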
4. Deployment frequency and failure rate usually move together
One of the more counterintuitive patterns in DORA data is that teams that deploy more frequently tend to have a lower failure rate. Small, reversible changes are easier to reason about. Massive weekly releases carry more risk, make the blast radius harder to isolate, and complicate attribution when something breaks.
That is why honest CFR instrumentation matters. Teams that undercount often keep defending infrequent “safe” releases that are neither safe nor easy to remediate.
5. Instrument the metric close to the pipeline
The best CFR systems I have seen join deployment events, rollback actions, incident creation, and feature-flag reversal into one timeline. You do not need a perfect platform to start. A practical first step is tagging deployments with a release identifier and linking that identifier to rollback jobs, hotfix PRs, and post-release incidents.
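A minimal sketch of that join, assuming every event source can carry the same release identifier (the 'release_id' and 'timestamp' field names are illustrative, not a fixed schema):

```python
from collections import defaultdict


def build_release_timelines(deploys, rollbacks, hotfix_prs, incidents, flag_reversals):
    """Merge every event stream into one ordered timeline per release.

    Each argument is a list of dicts carrying at least 'release_id' and
    'timestamp'; the source label records where each event came from.
    """
    timelines = defaultdict(list)
    sources = {
        "deploy": deploys,
        "rollback": rollbacks,
        "hotfix_pr": hotfix_prs,
        "incident": incidents,
        "flag_reversal": flag_reversals,
    }
    for source, events in sources.items():
        for event in events:
            timelines[event["release_id"]].append((event["timestamp"], source, event))
    for events in timelines.values():
        events.sort(key=lambda item: item[0])
    return timelines
```

From there the numerator falls out directly: a release failed if its timeline contains anything besides the deploy event itself.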
Once that thread exists, the metric stops being a negotiation exercise. It becomes a factual record of how often shipping work triggers extra recovery work.