DORA metrics: elite vs average teams

DORA metrics are the best framework we have for measuring software delivery performance. They are also a frequent victim of Goodhart's Law: “When a measure becomes a target, it ceases to be a good measure.” If you use these numbers to evaluate or punish teams, expect the numbers to improve while the underlying performance does not.

The four metrics — precise definitions

According to the DORA State of DevOps research, elite teams are defined by these thresholds:

Deployment Frequency (DF): how often code is deployed to production. Elite: multiple times per day. Count only changes merged to trunk and deployed; merges into a feature branch do not count.
Lead Time for Changes (LTC): time from code committed to code running in production. Elite: less than one hour (thresholds vary by report year; the 2023 report lists less than one day). Note: the common proxy mergedAt - createdAt measures how long a PR stayed open, so it is dominated by review queue time and leaves out the time from merge to production. Long lead times usually indicate PR review queues, not slow tests.
Change Failure Rate (CFR): percentage of deployments that cause a degraded service or require a hotfix. Elite: 0–5% (per 2023+ DORA State of DevOps reports; earlier reports used 0–15%).
Time to Restore Service (TTRS): mean time to recover from a production incident. Elite: less than one hour.
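
As a rough illustration, the Elite thresholds listed above can be written as a small check. This is a minimal sketch, not an official DORA scoring method: the DeliveryMetrics fields and the elite_checks function are hypothetical names, "multiple times per day" is interpreted as more than one deployment per day, and the limits are parameters so they can follow whichever report year you benchmark against.

```python
from dataclasses import dataclass


@dataclass
class DeliveryMetrics:
    """Hypothetical container for one team's measured values."""
    deploys_per_day: float       # production deployments per day
    lead_time_hours: float       # commit to running in production
    change_failure_rate: float   # fraction of deployments, 0.0 to 1.0
    restore_hours: float         # mean time to restore service


def elite_checks(m, lead_time_limit_hours=1.0, cfr_limit=0.05,
                 restore_limit_hours=1.0):
    """Return, per metric, whether the value meets the Elite threshold.

    Defaults mirror the thresholds quoted in this article; pass other
    limits to match a different report year.
    """
    return {
        # Assumption: "multiple times per day" means more than one per day.
        "DF": m.deploys_per_day > 1.0,
        "LTC": m.lead_time_hours < lead_time_limit_hours,
        "CFR": m.change_failure_rate <= cfr_limit,
        "TTRS": m.restore_hours < restore_limit_hours,
    }


team = DeliveryMetrics(deploys_per_day=0.5, lead_time_hours=30.0,
                       change_failure_rate=0.04, restore_hours=0.5)
print(elite_checks(team))
```

Keeping the limits as arguments avoids hard-coding one year's benchmark into the check.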

How your overall profile is calculated

Your overall DORA tier is determined by your lowest single metric, not the average. One Low metric makes your entire profile Low — regardless of how the other three look. This is intentional: a team deploying ten times per day (Elite DF) but taking a week to restore service after an incident (Low TTRS) has a critical reliability gap that an average would hide.

The practical consequence: fix your worst metric first. A team at Medium/Medium/Low/High improves faster by addressing the Low metric than by optimising any of the others.
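
The lowest-metric rule described above is simple to express in code. The sketch below assumes the four tier names Low, Medium, High, and Elite; the function names and the mapping of the Medium/Medium/Low/High example onto specific metrics are illustrative assumptions only.

```python
# Tiers ordered from worst to best; a lower index means a lower tier.
TIER_ORDER = ["Low", "Medium", "High", "Elite"]


def overall_tier(metric_tiers):
    """Overall profile is the lowest tier among the individual metrics."""
    return min(metric_tiers.values(), key=TIER_ORDER.index)


def worst_metrics(metric_tiers):
    """Names of the metrics sitting at the lowest tier (fix these first)."""
    lowest = overall_tier(metric_tiers)
    return [name for name, tier in metric_tiers.items() if tier == lowest]


# Assumed assignment of the Medium/Medium/Low/High example to metrics.
profile = {"DF": "Medium", "LTC": "Medium", "CFR": "Low", "TTRS": "High"}
print(overall_tier(profile))   # Low
print(worst_metrics(profile))  # ['CFR']
```

Using the minimum rather than a mean is what lets a single Low metric pull the whole profile down.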

What DORA does not measure

You can achieve Elite DORA metrics while your engineering organisation is accumulating technical debt, burning out on-call engineers, and shipping features nobody uses. DORA is a delivery health check, not a comprehensive engineering health check. It does not measure:

On-call burden — hitting a 30-minute TTRS target by waking engineers at 3 AM is not sustainable performance.
Technical debt accumulation — high deployment frequency achieved by skipping tests or deferring refactoring is borrowing from future velocity.
Developer satisfaction — the DORA research program does separately track well-being, but the four core metrics do not capture it.

The gaming problem

If Deployment Frequency is tied to team performance reviews, teams will split cohesive features into smaller deployments to increase the count. If Change Failure Rate is tracked, teams may avoid classifying incidents as “caused by a deployment” or roll back quietly without logging the event. Use DORA metrics as a diagnostic tool to find bottlenecks, not as performance targets.

The 30-day baseline without SaaS tooling

You do not need a $20,000 platform tool to establish a baseline. The GitHub CLI gives you the raw data:

# Pull merged PRs with timestamps
gh pr list --state merged --json number,createdAt,mergedAt,title \
  --limit 500 > prs.json

# Lead Time = mergedAt - createdAt (for simple pipelines)
# For accurate LTC, join with your deployment timestamps
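
One possible way to turn that export into a number is a short Python script. This is a sketch under assumptions: it expects the createdAt and mergedAt fields requested above, assumes the timestamps are ISO 8601 UTC strings ending in "Z", and reports the median PR open-to-merge time, which is the proxy rather than true lead time. The function names are hypothetical.

```python
import json
from datetime import datetime
from statistics import median


def parse_ts(value):
    """Parse an ISO 8601 timestamp; 'Z' is rewritten for older Pythons."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def pr_open_hours(prs):
    """Hours between PR creation and merge, skipping unmerged entries."""
    return [
        (parse_ts(pr["mergedAt"]) - parse_ts(pr["createdAt"])).total_seconds() / 3600
        for pr in prs
        if pr.get("mergedAt")
    ]


def median_pr_open_hours(prs):
    """Median open-to-merge time in hours, or None if there is no data."""
    hours = pr_open_hours(prs)
    return median(hours) if hours else None


def load_prs(path="prs.json"):
    """Load the JSON array written by the gh command above."""
    with open(path) as fh:
        return json.load(fh)


# Inline sample so the sketch runs without a prs.json file.
sample = [
    {"createdAt": "2024-03-01T09:00:00Z", "mergedAt": "2024-03-01T11:00:00Z"},
    {"createdAt": "2024-03-01T09:00:00Z", "mergedAt": "2024-03-02T09:00:00Z"},
    {"createdAt": "2024-03-04T10:00:00Z", "mergedAt": "2024-03-04T16:00:00Z"},
]
print(median_pr_open_hours(sample))  # 6.0
```

With real data, call median_pr_open_hours(load_prs()). The median is used instead of the mean so that a few long-lived PRs do not dominate the result.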

For incident data, export your PagerDuty or Opsgenie alerts and join them against your deployment log. A simple spreadsheet with four columns (deployment timestamp, incident timestamp, incident resolved timestamp, caused by deploy yes/no) gives you Deployment Frequency, Change Failure Rate, and Time to Restore Service; combined with the PR timestamps above, it covers all four DORA metrics.
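
The same four columns can be summarised in a few lines of Python instead of spreadsheet formulas. The sketch below is illustrative: DeployRow and summarize are hypothetical names, one row per deployment is assumed, and it computes the three metrics these columns support (DF, CFR, TTRS). Lead time needs commit or merge timestamps, so it is not derived here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DeployRow:
    """One spreadsheet row: a deployment and any incident tied to it."""
    deployed_at: datetime
    incident_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    caused_by_deploy: bool = False


def summarize(rows, window_days):
    """Compute DF, CFR, and mean restore time over a window of days."""
    deploys = len(rows)
    failures = sum(1 for r in rows if r.caused_by_deploy)
    # Assumption: every row with both incident timestamps counts
    # toward restore time, whether or not a deploy caused it.
    restore_hours = [
        (r.resolved_at - r.incident_at).total_seconds() / 3600
        for r in rows
        if r.incident_at and r.resolved_at
    ]
    return {
        "deploys_per_day": deploys / window_days,
        "change_failure_rate": failures / deploys if deploys else None,
        "mean_restore_hours": (
            sum(restore_hours) / len(restore_hours) if restore_hours else None
        ),
    }


rows = [
    DeployRow(datetime(2024, 3, 1, 10, 0)),
    DeployRow(datetime(2024, 3, 1, 15, 0),
              incident_at=datetime(2024, 3, 1, 15, 30),
              resolved_at=datetime(2024, 3, 1, 17, 0),
              caused_by_deploy=True),
    DeployRow(datetime(2024, 3, 2, 9, 0)),
    DeployRow(datetime(2024, 3, 2, 14, 0)),
]
print(summarize(rows, window_days=2))
```

For a 30-day baseline, pass window_days=30 and feed in every deployment from that period.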

Once you have your baseline, the bottleneck is almost always visible: PRs sitting in review for days (LTC problem), or a high CFR pointing to missing test coverage or insufficient staging environments.

To skip the spreadsheet entirely, use the DORA Metrics Calculator — enter your last 30 or 90 days of data, see your tier for each metric, and get a shareable link to the results.