SLA and SLO (service level agreement vs objective)
An SLA (Service Level Agreement) is a contract between a service provider and a customer that specifies measurable guarantees — typically availability, latency, or error rate — with financial or legal consequences for breaches. An SLO (Service Level Objective) is the internal engineering target, deliberately set stricter than the SLA to create an error budget: a cushion that allows teams to detect and fix problems before the external contract is violated.
Formula
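The downtime figures quoted for an availability target follow from one relation: allowed downtime = (1 - availability) x period length. The sketch below is a minimal illustration, assuming a 365-day year; the name `allowed_downtime_hours` is hypothetical and not taken from any library.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; assumes a non-leap year


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime an availability target permits over a period."""
    return (1 - availability_pct / 100) * period_hours


sla_downtime = allowed_downtime_hours(99.9)    # external contract
slo_downtime = allowed_downtime_hours(99.95)   # internal target
cushion = sla_downtime - slo_downtime          # headroom between the two

print(f"SLA 99.9%:  {sla_downtime:.2f} h/year")   # 8.76 h/year
print(f"SLO 99.95%: {slo_downtime:.2f} h/year")   # 4.38 h/year
print(f"Cushion:    {cushion:.2f} h/year")        # 4.38 h/year
```

Passing a different `period_hours` (for example `28 * 24`) gives the same figure for a shorter measurement window.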
Example: the SLA guarantees 99.9% availability (about 8.76 hours of downtime per year). The internal SLO is set at 99.95% (about 4.38 hours per year). The 0.05% difference is the error budget.
Why it matters in practice
The error budget model, popularised by Google SRE, changes the relationship between reliability and velocity. Teams with remaining error budget can deploy aggressively. Teams that have exhausted their budget must slow down and focus on stability. This makes reliability a shared engineering concern rather than an ops-only problem. SLOs are typically measured over a rolling 28-day window rather than a calendar month, to avoid end-of-month gaming.
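The budget policy described above can be sketched as a small rolling-window tracker. This is an illustration under stated assumptions, not a standard implementation: the class `RollingErrorBudget`, its methods, and the daily traffic numbers are all hypothetical, and the budget is expressed as the fraction of requests allowed to fail (0.05% here).

```python
from collections import deque

WINDOW_DAYS = 28  # rolling window length, as described in the text


class RollingErrorBudget:
    """Track error budget spend over the most recent `window_days` days."""

    def __init__(self, budget_fraction: float, window_days: int = WINDOW_DAYS):
        self.budget_fraction = budget_fraction
        # deque with maxlen drops the oldest day automatically
        self.days = deque(maxlen=window_days)

    def record_day(self, total_requests: int, failed_requests: int) -> None:
        self.days.append((total_requests, failed_requests))

    def remaining(self) -> float:
        """Fraction of the budget still unspent; negative when overspent."""
        total = sum(t for t, _ in self.days)
        failed = sum(f for _, f in self.days)
        allowed = self.budget_fraction * total
        if allowed == 0:
            return 1.0  # no traffic in the window, so nothing has been spent
        return 1 - failed / allowed

    def can_deploy(self) -> bool:
        return self.remaining() > 0


budget = RollingErrorBudget(budget_fraction=0.0005)
for _ in range(28):
    budget.record_day(total_requests=1_000_000, failed_requests=200)
print(f"{budget.remaining():.0%}", budget.can_deploy())  # 60% True

# One bad day pushes the oldest day out of the window and overspends the budget.
budget.record_day(total_requests=1_000_000, failed_requests=10_000)
print(f"{budget.remaining():.0%}", budget.can_deploy())  # -10% False
```

Because the window rolls forward daily, a bad day keeps counting against the budget for the next 28 days instead of being reset at a month boundary.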
Common mistakes
- Setting SLOs at 100%: this is unachievable and creates an environment where any incident is a crisis rather than an expected event to manage.
- Measuring availability as "uptime" rather than successful request rate: a server that is "up" but returning 500 errors is not available to users.
- Confusing an SLO breach with an SLA breach: SLO breaches should trigger engineering action, not customer compensation.
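The last two mistakes can be made concrete with a short sketch that measures availability from request outcomes and keeps the two thresholds separate. The function names and the sample data are hypothetical, and counting only 5xx responses as failures is an assumption; a real service may also count timeouts or slow responses.

```python
from typing import Iterable


def availability(status_codes: Iterable[int]) -> float:
    """Share of requests that did not end in a server error (5xx)."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests observed")
    successful = sum(1 for code in codes if code < 500)
    return successful / len(codes)


def classify(measured: float, slo: float = 0.9995, sla: float = 0.999) -> str:
    """Separate internal target misses from contract violations."""
    if measured < sla:
        return "sla_breach"   # contractual consequences apply
    if measured < slo:
        return "slo_breach"   # engineering action, no customer compensation
    return "ok"


# The server stayed "up" the whole time, yet 8 of 10,000 requests failed.
codes = [200] * 9992 + [500] * 8
measured = availability(codes)
print(f"{measured:.4f}", classify(measured))  # 0.9992 slo_breach
```

An uptime check would report this period as fully available, while the request-based measure shows the SLO was missed and the SLA was still met.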