Stories, tutorials, & tips | The fastest way for busy developers to keep up with technologies 🚀

Story

@squadcast shared a post, 11 months, 2 weeks ago

The Complete On-Call Scheduling Guide of 2024 - All You Need to Know

Discover the secrets to effective on-call scheduling. Learn about follow-the-sun vs. rotation schedules, best practices, and essential software features. Optimize your team's workload, reduce burnout, and ensure rapid incident resolution.

Story

@squadcast shared a post, 11 months, 2 weeks ago

Curb alert noise for better productivity: How-To’s and Best Practices | Squadcast

#on call... #alert n... #inciden...

Blog Summary: Reducing Alert Noise with Squadcast

Problem: Modern software platforms rely on complex interconnected microservices, which can lead to cascading failures and an overwhelming number of alerts.

Solution: Squadcast, an incident management platform, offers advanced deduplication features to reduce alert noise and improve on-call productivity.

Key Points:

Alert Noise: Excessive alerts can hinder productivity and lead to alert fatigue.

Microservices Complexity: Interdependent microservices increase the likelihood of cascading failures and alert storms.

Squadcast Deduplication:

Status-based deduplication: Controls alert generation based on incident status (triggered, suppressed, acknowledged).

Service dependency-based deduplication: Combines alerts from dependent services into a single incident.

Benefits:

Reduced alert fatigue

Improved incident response time

Better focus on critical issues

Use Cases:

High-failure rate services

Dependent services (e.g., database and payment gateway)

Overall: Squadcast's deduplication features provide granular control over alert management, helping organizations effectively handle complex alert scenarios and improve on-call efficiency.
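
The two deduplication modes summarized above can be illustrated with a minimal Python sketch. This is not Squadcast's implementation or API: the `Incident` class, the `DEPENDS_ON` map, the `DEDUP_STATUSES` set, and the `ingest` function are hypothetical names chosen for illustration, and the rule that only triggered or acknowledged incidents absorb new alerts is an assumption.

```python
from dataclasses import dataclass, field

# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {"payment-gateway": {"database"}}

# Assumption: new alerts are merged only into incidents in these statuses.
DEDUP_STATUSES = {"triggered", "acknowledged"}


@dataclass
class Incident:
    service: str
    status: str = "triggered"
    alerts: list = field(default_factory=list)


def ingest(incidents, service, message):
    """Attach an alert to a matching open incident, or open a new one."""
    # An alert matches incidents on its own service or on a service it depends on.
    related = {service} | DEPENDS_ON.get(service, set())
    for incident in incidents:
        if incident.status in DEDUP_STATUSES and incident.service in related:
            incident.alerts.append(message)  # deduplicated into existing incident
            return incident
    incident = Incident(service=service, alerts=[message])
    incidents.append(incident)
    return incident


incidents = []
ingest(incidents, "database", "connection pool exhausted")
ingest(incidents, "database", "connection pool exhausted")  # status-based merge
ingest(incidents, "payment-gateway", "upstream timeout")    # dependency-based merge
print(len(incidents), len(incidents[0].alerts))
```

In this sketch the repeated database alert and the payment gateway alert both land in the single open database incident, so the on-call engineer sees one incident rather than three notifications. Once that incident leaves the statuses in `DEDUP_STATUSES`, the next alert opens a new incident.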

Story

@laura_garcia shared a post, 11 months, 2 weeks ago

Software Developer, RELIANOID

Netdev Recap

Just wrapped up an incredible Netdev 0x18! From cutting-edge innovations in Linux networking to insightful talks from industry leaders, this year’s event was packed with highlights. Curious about what went down? Check out our full recap article here! https://www.relianoid.com/blog/netdev-conference-0..

Story

@squadcast shared a post, 11 months, 2 weeks ago

Observability: A Deep Dive into Tools, Best Practices, and Examples

#O11y #Squadca... #observa... #observa...

Observability is a critical component of modern software development, providing insights into system performance, availability, and quality. The blog delves into the concept of observability, differentiating it from traditional monitoring.

Key points covered include:

Evolution of observability: From system-centric monitoring to service-focused observability in microservices architectures.

Three pillars of observability: Metrics, logs, and traces, their roles, and popular tools (Prometheus, ELK Stack, Jaeger).

Building a comprehensive observability strategy: Best practices like data centralization, quality, alerting, visualization, correlation, anomaly detection, and continuous improvement.

Challenges: Data volume, complexity, tooling, and skillset requirements.

Overall, the blog emphasizes the importance of observability for understanding system behavior, improving performance, and ensuring reliability.

Story

@squadcast shared a post, 11 months, 2 weeks ago

Conquering On-Call Challenges: A Guide and Best Practices for SRE Teams

#on call... #Squadca... #on call...

The blog provides a comprehensive guide to effective on-call scheduling for SRE teams. It emphasizes the importance of on-call management for maintaining system reliability and preventing team burnout.

Key points include:

The role of on-call scheduling software in automating and optimizing the process.

Strategies for creating balanced and efficient on-call rotations, such as the "follow-the-sun" approach.

The importance of clear communication, documentation, and escalation plans.

The need for regular post-mortem meetings and SRE training.

Tips for fostering a supportive on-call culture.

Ultimately, the blog aims to help SRE teams implement best practices for on-call scheduling, leading to improved team morale, incident response, and overall system reliability.
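
As a rough illustration of the "follow-the-sun" approach mentioned above, the sketch below routes pages to whichever regional team is inside its working window. The region names, the UTC hour windows in `SHIFTS`, and the `on_call_region` function are assumptions made for this example and do not come from the blog.

```python
from datetime import datetime, timezone

# Hypothetical regional teams with UTC hour windows [start, end).
SHIFTS = [
    ("APAC", 0, 8),
    ("EMEA", 8, 16),
    ("AMER", 16, 24),
]


def on_call_region(now):
    """Return the region whose shift window contains the given aware datetime."""
    hour = now.astimezone(timezone.utc).hour
    for region, start, end in SHIFTS:
        if start <= hour < end:
            return region
    raise ValueError(f"shift table does not cover hour {hour}")


print(on_call_region(datetime.now(timezone.utc)))
```

Because each team is only paged during its own window, nobody in this arrangement carries an overnight shift. A real schedule would also need handoff notes, overrides, and an escalation path when the primary responder does not acknowledge.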

Story

@adammetis shared a post, 11 months, 2 weeks ago

DevRel, Metis

The Importance of Being Agile in the Database World

This agility in managing database schema changes is key to maintaining speed and flexibility in our database strategies. But how can we move fast around databases? How can we be agile in the database world? Read on to see.

Story

@squadcast shared a post, 11 months, 2 weeks ago

Runbook Automation: Achieving Faster Incident Recovery | Squadcast

#Runbook... #Squadca... #runbook

A runbook is a predefined set of steps or procedures that is usually executed manually by a systems engineer. For instance, say you want to upgrade an application in production, and you have a documented set of steps for doing so. We call this a runbook. It contains procedures to begin, stop, sup..

Story

@ketbostoganashvili shared a post, 11 months, 2 weeks ago

Technical Content Writer

How to Create an HTML Template That Email Clients Render Well

A developer can’t code an HTML email template using the same technologies and approaches as one would when building a web page. It may sound ridiculous, but it’s the truth. So, let’s try to figure out how valid this statement is.

Story

@laura_garcia shared a post, 11 months, 2 weeks ago

Software Developer, RELIANOID

Discover the key differences between Active-Active and Active-Standby failover strategies

Ensuring network resilience is critical for maintaining continuous business operations. Discover the key differences between Active-Active and Active-Standby failover strategies, their benefits, use cases, and implementation considerations. Learn how to choose the right approach to keep your network..

Story

@laura_garcia shared a post, 11 months, 2 weeks ago

Software Developer, RELIANOID

Understanding the CrowdStrike Outage

Understanding the CrowdStrike Outage: The Largest IT Disruption in History. A recent software update from CrowdStrike caused an unprecedented global IT outage, disrupting millions of devices and affecting key sectors like airlines, healthcare, and emergency services. This incident highlights the crit..