@squadcast ・ Aug 18,2024 ・ 12 min read ・ 367 views ・ Originally posted on www.squadcast.com
Readers should note that the term SLA has taken different meanings over time. Some companies define SLA as the service quality clause in a contractual agreement and refer to SLOs as the measurable objectives that substantiate the SLA. In this article, we adhere to Google’s definitions in the context of site reliability engineering practices, as summarized below:
Service-Level Indicator (SLI)
SLIs are metrics, such as latency and error rate, used to measure service quality.
Service-Level Objective (SLO)
An SLO is a target value (or a value range) for service quality as measured by an SLI.
Composite SLO
The resulting SLO when combining sub-services with varying levels of SLO.
Error Budget
The amount of time a system can fail or the number of errors it can sustain before causing an SLO breach.
Service-Level Agreement (SLA)
An SLA is an agreement between a service provider and its users stipulating SLO guarantees and a penalty payable upon their breach.
Before the proliferation of software as a service (SaaS) and hosted applications, SLOs and SLAs were mostly used in the IT industry by telecom carriers when offering data services such as internet access, committing to service quality metrics such as 99.99% availability or a minimum bandwidth of 50 Megabits per second (Mbps).
The principles haven’t changed for software and infrastructure service providers, but the metrics focus instead on indicators such as latency and error rates. For example, an application programming interface (API) may have an internal service quality objective to process a minimum of 100 requests per second 99.99% of the time with an error rate of less than 0.5% and a query response time of less than 200 milliseconds.
The contractual SLA commitment may be based on fewer indicators for the sake of simplicity and use more conservative values such as guaranteeing that the average response time calculated over an hour won’t exceed 300 milliseconds as conceptually illustrated in the diagram below.
A service provider’s internal SLO is more aggressive than its external SLA (source).
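The gap between the internal objective and the contractual commitment can be sketched in a few lines of Python. This is only an illustration built on the numbers above (a 200 ms per-request objective and a 300 ms hourly-average guarantee). The function names and the sample latencies are hypothetical.

```python
from statistics import mean

SLO_MAX_LATENCY_MS = 200     # internal per-request objective from the example
SLA_MAX_HOURLY_AVG_MS = 300  # contractual hourly-average limit from the example


def slo_compliance(latencies_ms, limit_ms=SLO_MAX_LATENCY_MS):
    """Fraction of requests answered faster than the internal latency objective."""
    within = sum(1 for ms in latencies_ms if ms < limit_ms)
    return within / len(latencies_ms)


def sla_met(latencies_ms, limit_ms=SLA_MAX_HOURLY_AVG_MS):
    """True if the average response time over the hour does not exceed the SLA limit."""
    return mean(latencies_ms) <= limit_ms


# Hypothetical response times (in milliseconds) collected during one hour.
hour_of_latencies = [120, 150, 180, 250, 400]

print(f"Requests within internal SLO: {slo_compliance(hour_of_latencies):.0%}")
print(f"External SLA met: {sla_met(hour_of_latencies)}")
```

With these sample values, the hourly average stays under the contractual limit even though two of the five requests miss the stricter internal objective, which is the kind of early warning the internal SLO is meant to provide.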
The software industry has come a long way in recent years in defining and implementing SLOs. For instance, the OpenSLO project (https://openslo.com/) helps companies configure vendor-agnostic SLOs using YAML files in the same way that DevOps teams use them as part of their continuous code delivery processes. In another example, Squadcast has open-sourced its internal SLO tools, known as the SLO Tracker (https://slotracker.com/), to help the SaaS industry improve the stability of software services. In another sign of industry maturity and cooperation, SLOConf (https://www.sloconf.com/) is a community resource for learning about vendors who are developing new tools and services for implementing SLOs.
SLA and SLO concepts, explained
Service-level indicators (SLIs) are the metrics that measure service performance, accuracy, and availability. The core SLI metrics for mobile and web applications are uptime, latency, error rate, and throughput. It’s worth noting that one service can have multiple service level indicators. The table below provides a list of common SLIs.
Common Service Level Indicators (SLIs): Definition
Availability (or uptime): A percentage of time the service has been fully functioning and available to users over a time interval (e.g., 99.95% of the time over a 24-hour period).
Latency: The time it takes for a web page or an application programming interface to return a response to a request (e.g., 200 milliseconds).
Error Rate: The percentage of the requests resulting in an error over a period of time (e.g., 0.1%). An example of an HTTP error is a 404 code meaning a page was not found.
Throughput: The capacity of an API to support requests, and typically expressed in terms of requests per minute (RPM). In networking, the throughput would typically be measured in terms of megabits per second.
Mean time between failures (MTBF): The average amount of time separating two consecutive failures (e.g., 5 days, 4 hours, and 34 minutes).
Mean time to repair (MTTR): The average time it takes the service provider to remedy a service failure (e.g., 1 hour)
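As a rough illustration of how a few of these indicators could be derived from raw request data, the sketch below computes error rate, latency, and throughput from a list of request records. The `Request` record, the function names, and the sample data are hypothetical. Treating every 4xx and 5xx status as an error is an assumption that follows the 404 example above; many teams count only server-side (5xx) errors.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to return a response
    status: int        # HTTP status code


def error_rate(requests):
    """Fraction of requests that returned an HTTP error (assumed: status >= 400)."""
    errors = sum(1 for r in requests if r.status >= 400)
    return errors / len(requests)


def latency_percentile(requests, pct):
    """Latency at the given percentile, using the nearest-rank method."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def throughput_rpm(requests, window_minutes):
    """Requests per minute (RPM) over the measurement window."""
    return len(requests) / window_minutes


# Hypothetical sample: five requests observed during a one-minute window.
sample = [
    Request(120, 200),
    Request(180, 200),
    Request(95, 200),
    Request(450, 500),
    Request(210, 404),
]

print(f"Error rate: {error_rate(sample):.1%}")
print(f"Median latency: {latency_percentile(sample, 50)} ms")
print(f"Throughput: {throughput_rpm(sample, window_minutes=1)} RPM")
```

In production these values would normally come from a monitoring system rather than an in-memory list, but the arithmetic is the same.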
Service-Level Objectives are targets set by DevOps teams for measuring service quality based on a service level indicator (SLI). For example, a service may aspire to be available 99.99% of the time, or limit errors (such as an HTTP 500 error) to less than 0.5% of the time.
SLOs are increasing in popularity because they provide multiple benefits, such as:
Service providers often target a more aggressive SLO value internally compared to the value published for end-users. For example, a service provider may require its site reliability engineering team to deliver a service availability of 99.99% while only advertising an SLO of 99.9% to its end-users. The difference between the two SLO values is viewed as a safety buffer of execution.
Modern applications rely on a multitude of independent services to operate. For example, a web application requires its frontend web server farm to be running in conjunction with the backend services including a database service. However, a web application often won’t function properly unless the content delivery network (CDN), and domain name service (DNS) are also fully operational.
Before a service provider contractually commits to a service level objective, it must consider the SLOs from all its constituent services and calculate a composite SLO.
The value of a composite SLO is calculated by multiplying the SLOs of its sub-services, which may not be intuitive at first. This formula is derived from the compound probability of two independent events occurring at the same time.
In the example shown below, the application’s composite SLO is 99.899% based on the following multiplication: 0.999 (SLO of service A) x 0.99999 (SLO of service B) = 0.99899001 (SLO of the application service).
A composite SLO is calculated based on the SLOs of its supporting sub-services.
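The multiplication can be expressed as a short Python sketch. It assumes the sub-services fail independently of each other, as the formula itself does. The function name is hypothetical, and the two values are the ones used in the example above.

```python
import math


def composite_slo(sub_service_slos):
    """Multiply the SLOs of independent sub-services (each a fraction between 0 and 1)."""
    return math.prod(sub_service_slos)


# Service A at 99.9% and service B at 99.999%, as in the example.
application_slo = composite_slo([0.999, 0.99999])
print(f"Composite SLO: {application_slo:.3%}")
```

Because every factor is below 1, the composite value is always lower than the weakest sub-service SLO, and it keeps shrinking as more dependencies (such as a CDN or DNS) are added to the list.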
An error budget is the amount of downtime or errors a service can absorb before an SLO is breached. For example, an uptime commitment of 99.9% per month means that a service can be down 43.83 minutes in a month without breaching the SLO. Suppose a service suffers 30 minutes of downtime during the first fifteen days of a month. That leaves 13.83 minutes of error budget that the operations team can afford to spend before they fail to meet their objectives.
Error budgets are traded off against the pace of innovation. In other words, a high velocity of code release in a production software environment supports innovation but causes instability. Companies that measure error budgets can course-correct their strategies mid-month. For example, if they have sustained outages in the early part of the month, they would instead focus their efforts on testing and documentation so as to reduce consumption of the Error Budget during the latter half of the month.
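The error budget arithmetic from the 99.9% example can be reproduced with a minimal Python sketch. It assumes an average month of 43,830 minutes (365.25 days divided by 12), which is the convention that yields the 43.83-minute figure. The function names are hypothetical.

```python
# Average month length in minutes: 365.25 days * 24 hours * 60 minutes / 12 months.
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12


def error_budget_minutes(slo, period_minutes=MINUTES_PER_MONTH):
    """Total downtime allowed by an availability SLO over the period."""
    return (1 - slo) * period_minutes


def remaining_budget_minutes(slo, downtime_minutes, period_minutes=MINUTES_PER_MONTH):
    """Budget left after the downtime already incurred; negative means the SLO is breached."""
    return error_budget_minutes(slo, period_minutes) - downtime_minutes


budget = error_budget_minutes(0.999)
remaining = remaining_budget_minutes(0.999, downtime_minutes=30)
print(f"Monthly budget: {budget:.2f} min, remaining after 30 min of downtime: {remaining:.2f} min")
```

A team could evaluate the remaining budget mid-month and slow down releases when it approaches zero, which is the course correction described above.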
A service-level agreement (SLA) represents an agreement between a service provider and end-user that establishes service performance, accuracy, and availability standards (we refer to them collectively as service quality in this article) based on SLOs. By definition, SLAs involve a contractual obligation to the customers upon breach of the committed SLO values.
SLA and SLO case study: An Internet Service Provider
Let’s consider an internet access service to describe in practical terms how SLAs and SLOs are offered and implemented by a service provider.
We use the example of a dedicated internet access service. Dedicated access is a significant investment that requires the installation of on-premise equipment and usually a multi-year contract.
In return, the Internet Service Provider (ISP) promises a higher availability and throughput as compared to a shared internet service. If the ISP violates this agreement, the contract’s SLA will involve a penalty resulting in service credits or a refund.
The list below shows the terms and conditions of a typical contract. The first few bullets establish the rules of engagement, followed by the SLA clause that makes quantitative guarantees.
Service Availability: 99.99% (or a maximum downtime of 4.38 minutes per month)
Throughput as measured by https://www.speedtest.net/: > 50 Megabits per second
Mean Time To Repair (MTTR) upon an outage: 2 hours
An ISP must invest in an infrastructure architecture designed for high availability to sustain its service through standard equipment and infrastructure failures. High availability requires the fiber optics and networking equipment to be redundant, but the infrastructure must also have redundant power supplies and switches to handle hardware failures.
Even with all of this planning, service interruptions will still inevitably occur. Outages could take the form of failed maintenance or an underground cable break caused by accidental damage during construction.
A typical ISP would establish an internal SLO, leaving a margin of error for its engineering and operations teams. In our example, supporting an SLA of 99.99% (4.38 minutes of allowed downtime per month) may require an SLO of 99.999% (26.30 seconds per month). The reasoning is that strict internal SLOs give the provider the best chance to catch and mitigate issues before they result in SLA violations.
Ultimately, the engineering team may decide that offering such a high level of SLA requires excessive capital investment and convince the legal department to consider a lower level of contractual commitment such as 99.95%, which translates to about 21.9 minutes of acceptable downtime per month (roughly 4.38 hours per year) instead of 4.38 minutes per month.
The table below shows how each “nine” places a significant operational burden on the service provider’s engineering and operations teams.
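The relationship between each additional "nine" and the allowed downtime can be recomputed with the Python sketch below. It assumes an average month of 43,830 minutes (365.25 days divided by 12), consistent with the figures quoted in this article. The function names and the list of availability levels are illustrative choices.

```python
# Average month length in seconds: 365.25 days * 24 h * 3600 s / 12 months.
SECONDS_PER_MONTH = 365.25 * 24 * 3600 / 12


def allowed_downtime_seconds(availability):
    """Downtime per month permitted by an availability target given as a fraction."""
    return (1 - availability) * SECONDS_PER_MONTH


def format_duration(seconds):
    """Render a duration in the largest convenient unit."""
    if seconds >= 3600:
        return f"{seconds / 3600:.2f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.2f} minutes"
    return f"{seconds:.2f} seconds"


for availability in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    downtime = allowed_downtime_seconds(availability)
    print(f"{availability:.3%} availability allows {format_duration(downtime)} of downtime per month")
```

Each extra nine divides the allowed downtime by ten, which is why moving from 99.9% to 99.99% shrinks the monthly allowance from roughly 43 minutes to roughly 4 minutes.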
Best practices for SLAs and SLOs
Consider the following recommendations when planning to introduce a new SLO or SLA.
Introducing SLAs typically requires months of planning, testing, and upgrading tools and processes. Business stakeholders like sales and legal departments should collaborate with stakeholders from engineering, support, and operations organizations to create a well-defined SLA support plan and practice responding to incidents using internal SLOs.
SLAs often remain buried as a clause in a legal contract with the hope customers forget and don’t request a refund upon breach. SLAs displayed on a public service status page help align a provider’s operations with the expectations of its clients.
Some providers go as far as displaying the SLOs (used in the service level agreements with clients) on physical monitors in their offices to embed them in the company’s culture.
Choose SLOs that are as simple as possible with clear service level indicators that can be easily monitored and calculated. It’s best to start with only one SLI.
In practice, even simple SLO calculations can get complicated. For example, suppose an application is performing well (less than 500 ms of access time for most of the web pages that make up the application’s user interface), but one of its reports is generating slowly (taking 2 minutes due to the large size of the data covered by the report combined with an unoptimized database query). Does this scenario constitute a breach of SLA? The service provider would say no, but a user of that particular report would disagree.
SLAs should be measured using a third-party testing tool outside of the company’s network to simulate the behavior of end-users who reach the platform from a remote location. An example of such a test might be a ping test conducted by a third-party testing provider with globally-distributed locations.
SLOs and SLAs are based on average measurements during an hour, a day, or a month. However, the timing of the outages, slowness, and errors contributing to the SLO degradation is equally important. For example, two services may meet the SLO of 99.9% uptime by having no more than 43 minutes of downtime in a month; however, one of them had its outages late at night on weekends while the other had them mid-morning on weekdays, resulting in very different customer satisfaction outcomes. Some DevOps teams avoid releasing code or making certain types of configuration changes during peak business hours to reduce the risk to service quality.
Customers expect rapid service restoration but may not provide enough information about the problems they are experiencing. For example, an application may be slow in certain regions and only from mobile devices while all other locations are operating normally from desktops. It’s important to require customers to file support tickets to report problems that can result in SLA penalties. Support tickets should include mandatory fields for information such as the OS version and browser version of the platform where the problem was experienced, and should include screenshots or browser logs. The more information a service provider has, the more likely it is to shorten its mean time to repair (MTTR) and meet its SLA obligations.
It’s best to start with a lower level of commitment, even if it’s not the industry standard. This approach gives your teams time to adjust. For example, if your competitor offers a 99.99% commitment, start with 99.9% for the first few months.
Make sure your internal processes and architecture support it before increasing your commitment to 99.99%. Until your operations team is used to regularly enforcing the SLA, it will appreciate the difference between 43 minutes and 4 minutes of allowed downtime per month.
Establishing SLOs helps organizations drive towards a common measurable goal and reach the level of client satisfaction needed for a company to prosper. It’s best to start measuring and privately sharing SLOs inside a company many months or even years before contractually committing them to customers. Start simple to give your company the time to evolve the processes, tools, and service architecture necessary to honor legally binding commitments.
The difference between an SLA and an SLO boils down to a formal commitment and its consequences. A service level objective (SLO) is a best-effort target, while a service level agreement (SLA) is a commitment with financial implications.
Originally published at https://www.squadcast.com