What’s Wrong With SLAs?
“A service-level agreement (SLA) is a commitment between a service provider and a client,” according to Wikipedia. “Particular aspects of the service – quality, availability, responsibilities – are agreed between the service provider and the service user.”
SLAs are thus contracts between two parties – contracts that can drive the wrong business outcomes.
SLAs require the service provider to do the bare minimum to comply with the contract. Anything more and they’d simply be spending money they don’t have to. This focus on providing the minimum required level of service forces IT organizations to become commodity purveyors and not strategic, value-added partners to the business.
At best, SLAs are marginally useful when there’s a contractual relationship between service provider and customer. However, when the service provider is internal to the organization – a common pattern among large enterprises – SLAs make even less sense. We need a better approach.
Forget SLAs. Focus on SLOs
Rather than focusing on the contractual relationship between the service provider and its client, IT leadership should shift its attention toward a business expectation model that lets the business define targets that match IT delivery models to business outcomes.
Fundamentally, availability is but one of many business priorities. Also on the list: the reliability of a deployed system, the ability for developers to roll out new features quickly, and the quality of the deployed software.
What organizations require is a quantitative measure of how well a service is meeting users’ expectations along any particular dimension of reliability: what we call the service-level indicator, or SLI. The SLI is the proportion of valid events that were good, expressed as a percentage.
In this context, ‘good’ can refer to availability, latency, the freshness of the information provided to users at the user interface, or other key performance metrics that are important to the business. For example, an SLI might state that 99.9% of valid requests for the page index.html were successful (returned a 200 ‘OK’ HTTP code).
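To make the definition concrete, here is a minimal Python sketch of an availability SLI. The `Request` record, the `availability_sli` function, and the sample log are hypothetical names and data invented for illustration; a real system would read events from its own logs or monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class Request:
    path: str
    status: int  # HTTP status code returned to the client


def availability_sli(requests, path="/index.html"):
    """Return the percentage of valid requests for `path` that were good."""
    # Valid events: requests for the page being measured (an assumption
    # about what counts as valid; each team decides this for itself).
    valid = [r for r in requests if r.path == path]
    if not valid:
        return None  # no valid events, so the SLI is undefined
    # Good events: valid requests that returned 200 OK.
    good = sum(1 for r in valid if r.status == 200)
    return 100.0 * good / len(valid)


# Made-up sample data: four valid requests, three of them good.
log = [
    Request("/index.html", 200),
    Request("/index.html", 200),
    Request("/index.html", 500),
    Request("/about.html", 404),  # not a valid event for this SLI
    Request("/index.html", 200),
]
sli = availability_sli(log)
print(f"{sli:.1f}%")
```

The important design choice is the split between valid events and good events: the request for another page is excluded from the denominator rather than counted as a failure.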
Each SLI provides a guideline for each dimension of reliability an organization wants to measure for a given user journey. Once the ops team has specified the SLIs important to the business, they must make the appropriate decisions about measurement and validity, essentially classifying which events are ‘good.’
Availability is the most important consideration when making such decisions, as an unavailable system will fail by default. Furthermore, past availability metrics can inform the probability the system will perform properly in the future.
The service-level objective (SLO) for a system, in turn, is a precise numerical target for an SLI, such as the availability of the system. To define your SLO, start with your SLIs. Make sure each has an event and a success criterion. The SLO then specifies the proportion of SLI events that must be good over a given time window.
For example, your SLO might require that 99.9% of the valid requests for the page index.html over the last 30 days return a 200 ‘success’ code in 150 milliseconds or less.
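The example SLO combines a success criterion, a latency threshold, a target, and a time window. A rough sketch of how such a check could look follows; the `Event` record and `meets_slo` function are hypothetical, and treating an empty window as compliant is an assumption rather than an established rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    timestamp: datetime
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def meets_slo(events, now, target_pct=99.9, window_days=30, max_latency_ms=150):
    """Check whether the events inside the window meet the SLO target."""
    start = now - timedelta(days=window_days)
    in_window = [e for e in events if start <= e.timestamp <= now]
    if not in_window:
        return True  # assumption: no traffic means no violation
    # An event is good only if it succeeded AND was fast enough.
    good = sum(
        1 for e in in_window
        if e.status == 200 and e.latency_ms <= max_latency_ms
    )
    sli = 100.0 * good / len(in_window)
    return sli >= target_pct


now = datetime(2024, 1, 31)
recent = now - timedelta(days=1)
# Made-up traffic: 1,999 fast successes and one slow success.
events = [Event(recent, 200, 80.0) for _ in range(1999)]
events.append(Event(recent, 200, 400.0))
print(meets_slo(events, now))  # 99.95% good against a 99.9% target
```

Note that the slow request counts against the SLO even though it returned a 200 code, because the objective covers latency as well as success.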
What to Do With Your SLOs
Once you’ve defined your SLOs, you should frame all discussions about whether a particular system is running sufficiently reliably in terms of whether the system is meeting its SLOs.
In particular, it’s always important to frame this discussion in terms of cost. After all, the more reliable the service, the more expensive it is to operate. With that cost in mind, define the lowest level of reliability for each service that the business is willing to pay for, and set that level as the service’s SLO.
Based upon this SLO, then, the ops team and its stakeholders can make fact-based judgments about whether to increase a service’s reliability (and hence, its cost), or lower its reliability and cost in order to increase the speed of development of the applications leveraging the service.
While system reliability is a good thing, this focus on SLOs will prevent your team from making services overly reliable – a critical mistake that can increase costs and impede development.
Focus on the Error Budget
Once your team gets its collective head around the fact that perfect reliability is both unattainable and undesirable, the question becomes just how far short of perfect reliability you should aim. We call this quantity the error budget.
The error budget represents the number of allowable errors in a given time window that results from an SLO target of less than 100%. In other words, this budget represents the total number of errors a particular service can accumulate over time before users become dissatisfied with the service.
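The arithmetic behind the error budget is simple enough to sketch. The Python below is illustrative only: `error_budget` and `budget_remaining` are hypothetical helper names, the traffic figures are made up, and the SLO is passed as a string so that `Decimal` avoids floating-point rounding.

```python
from decimal import Decimal


def error_budget(slo_pct: str, valid_events: int) -> int:
    """Number of bad events the SLO allows in the window."""
    # An SLO of 99.9% leaves 0.1% of valid events as the budget.
    allowed_fraction = (Decimal(100) - Decimal(slo_pct)) / Decimal(100)
    return int(allowed_fraction * valid_events)  # round down to whole events


def budget_remaining(slo_pct: str, valid_events: int, bad_events: int) -> int:
    """Budget left after the errors observed so far; negative means overspent."""
    return error_budget(slo_pct, valid_events) - bad_events


# Made-up figures: 1,000,000 valid requests in the window, 400 of them bad.
print(error_budget("99.9", 1_000_000))
print(budget_remaining("99.9", 1_000_000, 400))
```

With a 99.9% target and one million valid requests, the budget is 1,000 bad requests; after 400 errors, 600 remain to be spent on releases, experiments, or plain bad luck.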
No matter how reliable the infrastructure, there is always a risk that something will go wrong. Error budgets are an effective approach to quantifying and managing this risk.
The reason the error budget is so important is that it represents the tension between the fast pace of development (and hence innovation) vs. the need for service reliability. In essence, the organization can ‘spend’ the error budget, but should never exceed it.
The Intellyx Take
Enterprise technology has always been a matter of tradeoffs. As organizations move toward cloud-native architectures and DevOps-driven development cultures, these tradeoffs become an important part of day-to-day operations.
SLAs are too inflexible and limited to manage these increasingly sophisticated tradeoffs. Instead, SLOs and their corresponding error budgets give ops teams a quantifiable approach for balancing the reliability vs. innovation tradeoff.
Remember, however, that setting SLOs isn’t solely an engineering decision. It’s a business decision that requires input from multiple stakeholders.
It’s essential, therefore, for any modern IT shop to consider whether its vendors are providing increasingly problematic SLAs or helping to build a reliable infrastructure that will support the needs of the business moving forward.