Building Sustainable SLOs: How to Align User Needs with Business Goals (and Keep Your Customers Happy)

This post covers Service Level Objectives (SLOs) and how to create sustainable ones that benefit your users, your technology platform, and your business. The steps below can help you build robust systems and keep your customers happy.

What are SLOs and Why They Matter

SLOs are metric-based targets for how a service should behave. They limit how much user-facing disruption, such as maintenance or failed deployments, a service is allowed to cause. Traditionally, SLOs appeared within Service Level Agreements (SLAs) as guarantees for IT platforms (SaaS, IaaS, PaaS). However, their applications extend far beyond that:

Improved User Experience: SLOs guide process improvement and technological advancements to enhance user satisfaction.
Data-Driven Decision Making: SLOs rely on user data to pinpoint areas for system improvement and resource allocation.

Building SLOs Based on User Needs

Establishing data-driven SLOs that deliver positive user outcomes is a two-stage process: first gather data, then define the SLOs.

Data Gathering:

User Input: Conduct surveys or user interviews to understand user behavior and pain points.
System Analysis: Analyze system logs to assess performance and identify bottlenecks.
Business Process Review: Evaluate maintenance and support lifecycles to understand downtime requirements.

Remember: The Pareto Principle (80/20 rule) applies here. Focus on establishing SLOs for the most frequently used system functionalities to deliver the most value.
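
One way to apply the 80/20 rule to log data is to rank features by request count and keep the smallest set that covers roughly 80% of traffic. The sketch below assumes the logs have already been reduced to one feature name per request; the feature names and the `top_features` helper are hypothetical.

```python
from collections import Counter


def top_features(requests, coverage=0.8):
    """Return the most-used features that together cover `coverage` of requests."""
    counts = Counter(requests)
    total = sum(counts.values())
    selected = []
    covered = 0
    # most_common() yields features from most to least used.
    for feature, count in counts.most_common():
        if covered / total >= coverage:
            break
        selected.append(feature)
        covered += count
    return selected


# Hypothetical request log: one feature name per request.
log = ["search"] * 50 + ["checkout"] * 30 + ["profile"] * 12 + ["export"] * 8
print(top_features(log))
```

With this sample, `search` and `checkout` account for 80 of 100 requests, so they would be the first candidates for SLOs.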

Sample Questions to Consider:

When are our users most active?
How often is system maintenance required?
What downtime tolerance do our users have?
Is our application critical to their business?
What’s our current system performance level?
What performance levels do our users expect?

Defining SMART SLOs

Once you’ve gathered your data, it’s time to define your SLOs. Here’s a helpful framework:

Specific: Clearly define what’s being measured (e.g., availability by testing server requests, not just server uptime).
Measurable: The SLO should be quantifiable (e.g., disk latency less than 5ms, not just “fast disk”).
Achievable: Set attainable SLOs (e.g., if an underlying service has a 95% SLO, you can’t guarantee 100%).
Relevant: SLOs should reflect user experience (e.g., web server response time, not CPU activity).
Time-bound: Consider user behavior when setting timeframes (e.g., if users access the system between 9 AM and 5 PM, a 24-hour SLO might mask issues).
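
To see how a 24-hour window can mask a problem, compare availability over the whole day with availability over business hours only. The sketch below uses made-up uptime figures and a hypothetical `availability` helper; the 9 AM to 5 PM window comes from the Time-bound example.

```python
def availability(up_minutes_per_hour, start_hour=0, end_hour=24):
    """Fraction of time the system was up between start_hour (inclusive) and end_hour (exclusive)."""
    window = up_minutes_per_hour[start_hour:end_hour]
    return sum(window) / (60 * len(window))


# Hypothetical day: fully up, except for a two-hour outage from 13:00 to 15:00.
day = [60] * 24
day[13] = 0
day[14] = 0

full_day = availability(day)         # 22 of 24 hours up
business = availability(day, 9, 17)  # 6 of 8 business hours up

print(f"24-hour availability:      {full_day:.1%}")
print(f"9 AM to 5 PM availability: {business:.1%}")
```

The same outage reads as about 91.7% availability over 24 hours but 75.0% over the hours when users are actually working.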

Example:

Let’s say a system processes stock trades, and regulations require every request to be finalized within 300ms. The company aims to offer an SLO guaranteeing that requests are completed in under 250ms, measured over 30 days. Currently, the system responds to 98% of requests within 232ms over a rolling 30-day window. Here’s a well-written SLO based on this data:

Over a 30-day period, at least 98% of user requests will be processed within 250 milliseconds.

Why This SLO Works:

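Achievable: The system already beats the target, with 98% of requests finishing within 232ms against a 250ms limit.
Relevant: Request processing time is what users experience, and the 250ms limit sits inside the 300ms regulatory deadline.
Specific: It clearly defines the guaranteed metric (the share of requests processed within 250ms).
Time-bound: The 30-day window sets a clear reporting period.
Measurable: The target is quantifiable, and request latency can be tracked with a Prometheus metric.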
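
Checking the example SLO against recorded latencies might look like the following sketch. The `slo_compliance` and `meets_slo` helpers and the sample latencies are hypothetical; in practice the latencies would come from your monitoring system for the 30-day window.

```python
def slo_compliance(latencies_ms, threshold_ms=250.0):
    """Fraction of requests completed within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests recorded in the window")
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms)


def meets_slo(latencies_ms, threshold_ms=250.0, target=0.98):
    """True if at least `target` of requests finished within threshold_ms."""
    return slo_compliance(latencies_ms, threshold_ms) >= target


# Hypothetical sample: 99 fast requests and one slow one.
sample = [120.0] * 99 + [400.0]
print(slo_compliance(sample))
print(meets_slo(sample))
```

Note that the check counts requests against a threshold rather than averaging latencies, because the SLO is stated as a percentage of requests. An average could hide a small number of very slow trades.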

Accounting for Maintenance and Downtime

Schedule maintenance downtime into your SLOs. If your system delivers 97% availability over a month but also requires 14 hours of planned maintenance (about 2% of a 30-day month), then offer a 95% SLO.
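
The arithmetic can be worked through as below, assuming a 30-day month (720 hours). The helper names are hypothetical, and rounding down to a whole percent is an assumption made here so that the published target never overstates what the system can deliver.

```python
import math


def maintenance_fraction(maintenance_hours, days=30):
    """Share of the period consumed by planned maintenance."""
    return maintenance_hours / (days * 24)


def slo_after_maintenance(measured_availability, maintenance_hours, days=30):
    """Availability left after reserving maintenance time, rounded down to a whole percent."""
    remaining = measured_availability - maintenance_fraction(maintenance_hours, days)
    return math.floor(remaining * 100) / 100


print(f"{maintenance_fraction(14):.2%}")   # 14 of 720 hours
print(slo_after_maintenance(0.97, 14))
```

Fourteen hours is about 1.94% of 720 hours, which leaves roughly 95.06% and rounds down to the 95% target.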

Exceeding SLO Targets: Optimizing System Performance

If you’re not meeting your SLO targets or want to exceed them, here are some approaches:

Technical Enhancements: Streamline proxy configurations for faster requests, consider high-performance storage options for faster disk reads, or right-size instances for quicker batch job processing. In some cases, you might need to consider changes to operating systems, database platforms, or even development frameworks.
Improved Monitoring: Robust monitoring helps identify and address issues before they impact SLOs.
Disaster Recovery: A solid disaster recovery plan minimizes downtime caused by unforeseen events.

Conclusion

By prioritizing user needs and aligning SLOs with business goals, you can achieve a win-win situation. Here’s how:

Improved Customer Satisfaction: Meeting or exceeding SLOs leads to a reliable and positive user experience, fostering customer trust and loyalty.
Enhanced System Performance: The focus on SLOs compels continuous improvement of system performance and infrastructure, leading to a more robust and efficient technological foundation.
Streamlined Operations: Clear SLOs guide resource allocation and prioritization, optimizing team efforts and ensuring efficient incident response.

In essence, SLOs bridge the gap between technical operations and business objectives. By establishing SLOs that are user-centric, measurable, and achievable, you can ensure a reliable and performant system that keeps your users happy and your business thriving.
