
Posts from the community tagged with slo...

Story
@squadcast shared a post, 3 months, 2 weeks ago

How to Implement SRE Practices Even Without a Dedicated SRE Team

This blog post tackles how to implement core Site Reliability Engineering (SRE) principles even if you don't have a dedicated SRE team. It simplifies complex SRE concepts like error budgets, SLAs, SLOs, and SLIs, making them understandable for beginners.

The blog post offers a step-by-step guide to get you started with SRE, including:

Defining what matters to your customers (SLIs)

Setting achievable targets for those metrics (SLOs)

Considering how much downtime you can afford (error budgets)

Identifying and automating repetitive tasks (toil)

Implementing ways to easily roll back deployments if necessary

Prioritizing team well-being to avoid burnout

Maintaining open communication to set realistic expectations

Overall, the blog emphasizes that SRE is a gradual process that can significantly improve your system's reliability and provide a better customer experience.

Story
@squadcast shared a post, 3 months, 2 weeks ago

Understanding SLOs, SLAs, and SLIs: Essential Metrics for Service Quality

This blog post explains the concepts of SLAs, SLOs, and SLIs, all of which are important for measuring and ensuring service quality.

SLI (Service Level Indicator): A measurable value that reflects how well a service is performing. Common examples include uptime, latency, error rate, and throughput.

SLO (Service Level Objective): A target value for an SLI. It essentially defines the desired level of service quality.

SLA (Service Level Agreement): A formal agreement between a service provider and its customers that outlines the service quality guarantees, often based on SLOs. SLAs typically involve penalties if the SLOs are not met.
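
The relationship between an SLI and an SLO can be sketched in a few lines. This is a minimal illustration rather than anything taken from the blog post: the function names, the request counts, and the 99.9% target are all assumptions chosen for the example.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return the fraction of events that were good (the SLI).

    Assumption: a window with no traffic counts as fully available.
    """
    if total_events == 0:
        return 1.0
    return good_events / total_events


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is a target for the SLI; it is met when the SLI reaches it."""
    return sli >= slo_target


# Hypothetical numbers: 99,950 successful requests out of 100,000.
sli = availability_sli(good_events=99_950, total_events=100_000)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

An SLA would sit on top of this check as a contractual commitment, typically with a looser target than the internal SLO.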

The blog post also highlights the benefits of SLOs and provides best practices for implementing SLAs and SLOs. Some key takeaways include:

SLOs help teams collaborate and set measurable goals for service quality.

SLAs should be transparent and based on realistic SLOs.

It's better to start with simpler SLOs and gradually increase complexity.

Timing of outages can significantly impact customer satisfaction.

By understanding these concepts, organizations can establish a framework to deliver high-quality services and maintain a competitive edge.

Story
@squadcast shared a post, 3 months, 4 weeks ago

Understanding SLO, SLI, and SLA: A Guide with a Free, Open-Source SLO Tracker Tool

This blog post explains the concepts of SLO, SLI, and SLA, which are all important for ensuring that a service meets expectations for reliability. It also introduces a free, open-source tool named SLO Tracker that helps users track SLOs and Error Budgets.

Here are the key takeaways:

SLO (Service Level Objective): A target for how often a specific aspect of a service should be available or functional (e.g., 99.9% uptime).

SLI (Service Level Indicator): A measurable metric against which an SLO is evaluated (e.g., percentage of time a service is up).

SLA (Service Level Agreement): A formal agreement between a service provider and its customers that outlines the expected level of service (including SLOs and consequences for not meeting them).
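
To make the error budget idea concrete, here is a rough sketch of the arithmetic behind a target such as 99.9% uptime. It is not code from SLO Tracker; the function names and the 30-day window are assumptions made for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
print(f"Budget: {error_budget_minutes(0.999):.1f} min")
print(f"Remaining after 10.8 min of downtime: {budget_remaining(0.999, 10.8):.0%}")
```

A dashboard or alert would then track the remaining fraction over time instead of raw uptime.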

The blog post also highlights the challenges of SLO monitoring and how SLO Tracker can help by providing features like:

A unified dashboard for viewing SLOs and SLIs.

Error Budget visualization and alerts.

Integration with observability tools.

Ability to manage false positive alerts.

Story
@yair_stark shared a post, 2 years, 6 months ago

Error Budget Is All You Need - Part 2

In part 1, I proposed a simple modification to Google’s Multi-Window Multi-Burn Rate alerting setup and showed how this modification addresses the cases of varying-traffic services and typical latency SLOs.
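
For readers unfamiliar with the baseline, the sketch below shows the general shape of a multi-window, multi-burn-rate check as described in Google's SRE Workbook, not the author's modification, which is not detailed here. The two paging rules (14.4x over 1h/5m and 6x over 6h/30m) follow the Workbook's example for a 30-day SLO; the class, function names, and sample error ratios are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str    # window that establishes significant budget spend
    short_window: str   # window that confirms the problem is still ongoing
    threshold: float    # burn rate both windows must exceed


# Paging rules from the SRE Workbook example for a 30-day SLO period.
RULES = (
    BurnRateRule("1h", "5m", 14.4),
    BurnRateRule("6h", "30m", 6.0),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def firing_rules(error_ratios: dict[str, float], slo_target: float) -> list[BurnRateRule]:
    """Return rules whose long AND short windows both exceed the threshold."""
    return [
        rule for rule in RULES
        if burn_rate(error_ratios[rule.long_window], slo_target) > rule.threshold
        and burn_rate(error_ratios[rule.short_window], slo_target) > rule.threshold
    ]


# Hypothetical error ratios per window for a 99.9% SLO.
ratios = {"1h": 0.02, "5m": 0.02, "6h": 0.004, "30m": 0.004}
for rule in firing_rules(ratios, slo_target=0.999):
    print(f"Page: burn rate above {rule.threshold}x over {rule.long_window}/{rule.short_window}")
```

Requiring the short window to agree with the long one is what keeps the reset time low: once errors stop, the short window drops below the threshold and the alert clears.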

Story
@yair_stark shared a post, 2 years, 6 months ago

Error Budget Is All You Need - Part 1

One of the great chapters of Google’s second Site Reliability Engineering (SRE) book is Chapter 5, Alerting on SLOs (Service Level Objectives). This chapter takes you on a comprehensive journey through several SLO alerting setups, starting with the simplest, non-optimized one and iterating until it reaches the final one, which is optimized with respect to the four main alerting attributes: recall, precision, detection time, and reset time.

Story
@squadcast shared a post, 2 years, 8 months ago

What can SREs do to make holiday season’s peak traffic less chaotic?

Holiday season's peak traffic is the most challenging period for SREs and on-call engineers. This blog post highlights what SREs can do to make the holiday season less chaotic.
