Posts from the community tagged with reliability
Story
@squadcast shared a post, 7 months, 4 weeks ago

Striking a Balance: Reliability Management for Innovation-Driven Companies

This blog post dives into the world of reliability management for SRE teams. It emphasizes the importance of achieving a balance between innovation and system stability. The article explores various frameworks and best practices that SRE teams can leverage to achieve this equilibrium. Some of the key takeaways include implementing SLOs and error budgets, adopting DevOps practices, and utilizing Infrastructure as Code (IaC). The blog also highlights the importance of fostering a culture of collaboration and learning within the SRE team.
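
To make the SLO and error budget idea concrete, here is a minimal sketch of the usual arithmetic, assuming a time-based availability SLO over a rolling 30-day window. The function names are hypothetical and are not taken from the linked post.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO.

    slo_target is a fraction such as 0.999 (for 99.9%).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    # The error budget is whatever the SLO does not promise.
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, bad_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget
```

Under these assumptions, a 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget, so 21.6 minutes of downtime would consume about half of it. Teams often use the remaining fraction as the signal for when to slow releases in favor of stability work.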

Story
@boldlink shared a post, 2 years, 5 months ago
AWS DevOps Consultancy, Boldlink

An Overview of AWS Well-Architected Framework

Thinking of getting started with AWS cloud computing or migrating your existing workloads to AWS? Here is a quick guide on how the 5 pillars of the AWS Well-Architected Framework will help you build a secure, high-performing, resilient and efficient cloud infrastructure for your workloads. So basically...

Story
@yair_stark shared a post, 2 years, 11 months ago

Error Budget Is All You Need - Part 2

In part 1 I proposed a simple modification to Google’s Multi-Window Multi-Burn Rate alerting setup and I showed how this modification addresses the cases of varying-traffic services and typical latency SLOs.
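
For readers unfamiliar with the baseline setup, the following is a rough sketch of a multi-window, multi-burn-rate check as it is commonly described for Google's SRE Workbook. It is not the modification proposed in the post. The window pairs and thresholds are the commonly quoted defaults for a 30-day SLO period, and the names `BurnRateRule`, `burn_rate` and `firing_alerts` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str   # window that establishes a significant budget spend
    short_window: str  # window that confirms the problem is still ongoing
    threshold: float   # burn rate both windows must exceed
    severity: str


# Assumed defaults: 2% of budget in 1h, 5% in 6h, 10% in 3d.
RULES = (
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def firing_alerts(error_ratios: dict, slo_target: float) -> list:
    """Return the rules whose long AND short windows both exceed the threshold.

    error_ratios maps a window label (e.g. "1h") to the observed error ratio.
    """
    fired = []
    for rule in RULES:
        long_rate = burn_rate(error_ratios[rule.long_window], slo_target)
        short_rate = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_rate > rule.threshold and short_rate > rule.threshold:
            fired.append(rule)
    return fired
```

Requiring the short window to agree with the long one is what shortens the reset time: once the errors stop, the short window drops below the threshold quickly and the alert clears.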

Story
@yair_stark shared a post, 2 years, 11 months ago

Error Budget Is All You Need - Part 1

One of the great chapters of Google’s second Site Reliability Engineering (SRE) book is chapter 5, Alerting on SLOs (Service Level Objectives). This chapter takes you on a comprehensive journey through several setups of alerts on SLOs, starting with the simplest non-optimized one and iterating through several setups until it reaches the final one, which is optimized with respect to the four main alerting attributes: recall, precision, detection time and reset time.

Story
@tharunshiv shared a post, 3 years ago
Site Reliability Engineer, PhonePe

#1 What's Site Reliability Engineering [SRE] | Roles & Responsibilities | Technologies involved

Site Reliability Engineering, commonly referred to as SRE, is a role in computer science engineering whose main purpose is to provision, maintain, monitor, and manage infrastructure in order to maximize application uptime and reliability. SRE is an emerging role, but the tasks an SRE performs have existed ever since the first application was developed. The scope of software developers ends once they have written the code for the application. Everything else is the duty of an SRE: setting up the infrastructure, the services that run on it, and the required network connectivity, providing a platform for the application to run on, and making sure every part of the application is up and running reliably 24x7. In fact, we can consider Site Reliability Engineers the strong bridge between users and a reliable application.

Link
@prathamesh-sonpatki shared a link, 1 year, 6 months ago
SRE, Last9.io

MTTF vs. MTBF vs. MTTD vs. MTTR