Posts & Updates about "SRE"


Posts tagged with SRE

Story

@squadcast shared a post, 1 year, 8 months ago

How Developers Can Help SREs with Observability

#observa... #inciden... #SRE

This blog post argues that collaboration between developers and SREs is essential for building reliable software. The blog post outlines five ways that developers can improve SRE observability:

Embrace the 12-Factor App Methodology: This methodology creates applications that are easier to deploy and monitor.

Share Performance Testing Data: This data helps SREs understand how the application should function under pressure.

Maintain Clear and Concise Documentation: Clear documentation empowers SREs to resolve issues faster.

Leverage AIOps for System Administration: AIOps automates tasks and improves IT operations.

Increase System Observability Through Code: Expose relevant metrics within the code to provide SREs with real-time insights.
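As a rough illustration of the last point, exposing metrics from within the code, here is a minimal sketch using only the Python standard library. The `Metrics` class, the `handle_request` wrapper, and the metric names are hypothetical and not taken from the original post; a real service would more likely use an established metrics client library.

```python
import threading
import time
from collections import defaultdict


class Metrics:
    """Tiny in-process metrics registry (hypothetical, for illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(float)

    def inc(self, name, amount=1.0):
        # Add `amount` to the named counter in a thread-safe way.
        with self._lock:
            self._counters[name] += amount

    def render(self):
        # One "name value" pair per line, sorted for stable output.
        with self._lock:
            items = sorted(self._counters.items())
        return "\n".join(f"{name} {value:g}" for name, value in items)


METRICS = Metrics()


def handle_request(work):
    """Run `work` and record request count, error count, and time spent."""
    start = time.perf_counter()
    METRICS.inc("requests_total")
    try:
        return work()
    except Exception:
        METRICS.inc("request_errors_total")
        raise
    finally:
        METRICS.inc("request_seconds_sum", time.perf_counter() - start)
```

Calling `METRICS.render()` after a few requests returns plain text that an SRE could scrape or log, giving request volume, error count, and cumulative latency without reading the application code.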


Story

@squadcast shared a post, 1 year, 8 months ago

How to Implement SRE Principles Even Without a Dedicated SRE Team

#slo vs ... #SRE #slo

This blog post targets beginners who want to learn about SRE (Site Reliability Engineering) but are intimidated by the idea of needing a dedicated SRE team. The blog assures readers that anyone can begin implementing SRE principles to improve their service reliability and performance.

The core of the blog focuses on understanding SLOs (Service Level Objectives), SLIs (Service Level Indicators), and error budgets. SLOs define the targets you want your service to meet for metrics like uptime and latency. SLIs are the specific metrics you track to see whether you're meeting your SLOs. Error budgets define how much downtime or unreliability is acceptable before users or business goals are affected.
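To make the error budget idea concrete, here is a small arithmetic sketch. It assumes an availability SLO expressed as a fraction (for example 0.999) and a 30-day window; the function names and the window length are illustrative choices, not something the original post specifies.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of downtime an availability SLO allows over the window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1 - downtime_minutes / budget


# A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
# After 21.6 minutes of downtime, about half of the budget is left.
print(round(budget_remaining(0.999, 21.6), 2))
```

A negative result from `budget_remaining` would mean the budget is overspent, which many teams treat as a signal to prioritize reliability work over new features.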

Choosing the right SLOs and SLIs is crucial and should start with considering what matters most to your customers. The blog recommends focusing on a few key metrics, gathering historical data to set achievable SLOs, and continuously monitoring and improving your approach over time.

Beyond SLOs and SLIs, the blog highlights other important SRE practices:

Eliminating toil (repetitive manual tasks) through automation.

Implementing rollback strategies to quickly recover from problematic deployments.

Managing stress and burnout for IT teams.

Keeping customers informed about limitations and downtime.

The overall message is that SRE is a journey of continuous improvement, and even organizations without a dedicated SRE team can benefit by adopting these core practices.

Story

@squadcast shared a post, 1 year, 8 months ago

How Developers Can Help SREs with Observability

#observa... #inciden... #SRE

This blog post outlines five ways developers can improve collaboration with SREs and boost overall system reliability. Effective collaboration is essential because SREs (site reliability engineers) are responsible for maintaining system health and performance, while developers focus on building the software.

The five ways developers can improve SRE observability are:

Building with the 12-Factor App Methodology: This approach promotes creating stateless and immutable applications, simplifying deployment across various cloud environments.

Sharing Performance Testing Data Insights: Providing SREs with data from performance testing helps them understand application thresholds and make informed decisions for optimization.

Maintaining Clear Documentation and Configuration Files: Well-documented code and configuration files allow SREs to efficiently troubleshoot outages and implement changes without modifying the source code.

Utilizing AIOps-Enabled System Administration Functionalities: AIOps (Artificial Intelligence for IT Operations) automates tasks and streamlines workflows, reducing the burden on SREs during deployments and updates.

Increasing System Observability: Enhancing observability involves making it easier to understand how the system functions and identify potential problems. Developers can achieve this by enabling debug support and providing SREs with relevant metrics.
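One common reading of the 12-factor and configuration points above is to keep settings outside the code, for example in environment variables, so they can be changed without touching the source. The sketch below assumes that approach; the `Settings` fields and the `APP_*` variable names are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    log_level: str
    request_timeout_s: float
    debug: bool


def load_settings(env=None):
    """Build settings from environment variables, with safe defaults."""
    env = os.environ if env is None else env
    return Settings(
        log_level=env.get("APP_LOG_LEVEL", "INFO").upper(),
        request_timeout_s=float(env.get("APP_REQUEST_TIMEOUT_S", "5")),
        debug=env.get("APP_DEBUG", "false").lower() in ("1", "true", "yes"),
    )
```

With this shape, an SRE could raise the log level or enable debug output during an incident by setting `APP_LOG_LEVEL=DEBUG` or `APP_DEBUG=1` and restarting the process, with no code change or rebuild.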

Story

@squadcast shared a post, 1 year, 9 months ago

Transparency in Incident Response: How SLIs Drive Team Success

#slo mea... #SRE #slo #sli

This blog post argues that transparency is a vital but often overlooked aspect of SRE (Site Reliability Engineering). It discusses the benefits of transparency, including reduced finger-pointing, improved trust, and better decision-making. The blog post also outlines four levels of transparency that SRE teams can adopt, ranging from internal engineering transparency to complete public transparency. It emphasizes that Service Level Indicators (SLIs) are fundamental to achieving transparency because they provide a common understanding of how well a service is performing. The blog post concludes by highlighting the importance of using the right tools to support transparent incident response and mentions Squadcast as an example.

Story

@squadcast shared a post, 1 year, 9 months ago

From SysAdmin to SRE: How to Evolve Your Skillset with SRE Tools

#SRE Too... #SRE

This blog post targets SysAdmins who are interested in becoming SREs. It outlines the key skills and tools needed to make the switch.

The first part of the blog highlights the growing popularity of SRE roles and how they differ from SysAdmins. While both deal with IT operations, SREs leverage software engineering principles to manage systems at scale.

The blog then dives into the specific areas where SysAdmins need to develop their skillset. This includes adopting a new mindset that embraces calculated risks and prioritizes automation. It also emphasizes the importance of learning from failures and using data to inform decision-making.

Several crucial SRE tools are introduced throughout the blog. These include programming languages like Python and Go, infrastructure as code (IaC) tools, cloud and containerization technologies, modern monitoring tools, and statistical analysis skills.

Finally, the blog concludes by emphasizing the transferable skills SysAdmins already possess and the bright future of SRE careers.

Story

@squadcast shared a post, 1 year, 9 months ago

Understanding SLOs, SLAs, and SLIs: Essential Metrics for Service Quality

#slo #inciden... #sla #SRE #sli

This blog post explains the concepts of SLAs, SLOs, and SLIs, all of which are important for measuring and ensuring service quality.

SLI (Service Level Indicator): A measurable value that reflects how well a service is performing. Common examples include uptime, latency, error rate, and throughput.

SLO (Service Level Objective): A target value for an SLI. It essentially defines the desired level of service quality.

SLA (Service Level Agreement): A formal agreement between a service provider and its customers that outlines the service quality guarantees, often based on SLOs. SLAs typically involve penalties if the SLOs are not met.
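The relationship between an SLI and an SLO can be sketched in a few lines. This is an illustrative example only: the `Request` record, the 300 ms threshold, and the targets are assumptions made for the demonstration, not values from the original post.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool            # whether the request succeeded
    latency_ms: float   # how long it took to serve


def availability_sli(requests):
    """SLI: fraction of requests that succeeded."""
    if not requests:
        raise ValueError("need at least one request")
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests, threshold_ms):
    """SLI: fraction of requests served within the latency threshold."""
    if not requests:
        raise ValueError("need at least one request")
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


def meets_slo(sli, target):
    """SLO check: is the measured SLI at or above the target?"""
    return sli >= target


requests = [
    Request(True, 100),
    Request(True, 250),
    Request(False, 900),
    Request(True, 120),
]
print(availability_sli(requests))                    # 0.75
print(latency_sli(requests, threshold_ms=300))       # 0.75
print(meets_slo(availability_sli(requests), 0.99))   # False
```

An SLA would then sit on top of this: a contractual commitment, usually set looser than the internal SLO, with consequences when the measured value falls below it.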

The blog post also highlights the benefits of SLOs and provides best practices for implementing SLAs and SLOs. Some key takeaways include:

SLOs help teams collaborate and set measurable goals for service quality.

SLAs should be transparent and based on realistic SLOs.

It's better to start with simpler SLOs and gradually increase complexity.

Timing of outages can significantly impact customer satisfaction.

By understanding these concepts, organizations can establish a framework to deliver high-quality services and maintain a competitive edge.

Story

@squadcast shared a post, 1 year, 9 months ago

The Vital Role of SRE Observability in Ensuring System Reliability

#observa... #SRE #SRE aut...

This blog post explains the importance of SRE observability for building reliable systems. Observability, unlike traditional monitoring, goes beyond just checking if something is wrong. It allows SREs to understand what's happening inside a system by looking at its external outputs like metrics, traces, and logs. This data is crucial for troubleshooting, maintaining, and developing scalable systems.

The blog post also highlights the benefits of SRE observability for businesses. By understanding user satisfaction through SLOs (Service Level Objectives), businesses can make better decisions about feature development and resource allocation. Additionally, observability tools can reduce the workload for engineers by automating tasks and providing better insights into system behavior. Overall, SRE observability is essential for ensuring system reliability and business success.

Story

@squadcast shared a post, 1 year, 10 months ago

Building and Maintaining a Strong SRE Team in Your Company: 7 Key Tips

#SRE Too... #SRE #DevOps

This blog post offers guidance on building and maintaining an SRE team. It emphasizes the importance of SRE in today's world and outlines seven key tips to achieve success. Here's a summary of those tips:

Start small and focus internally: Begin by assigning staff from existing departments to focus on maintaining service reliability.

Recruit the right people: Look for SRE professionals with problem-solving skills, automation expertise, and a commitment to continuous learning. They should also be excellent team players with a broad perspective. Consider using SRE tooling to improve team efficiency.

Define your SLOs: Establish clear and achievable performance indicators for your systems.

Establish a holistic incident management system: Implement a system for tracking on-call duties and streamlining the incident resolution process. SRE tooling can be helpful here.

Accept failure as inevitable: Recognize that failures are part of the development process. Focus on creating a minimum viable product and improving over time.

Conduct incident postmortems to learn from mistakes: Analyze incidents to identify root causes and develop solutions to prevent future occurrences.

Maintain a user-friendly incident management system: Choose an incident management system that is easy to use, fosters communication, and integrates with other relevant tools.

By following these steps and leveraging SRE tooling, you can establish a strong SRE team that keeps your systems reliable and your customers satisfied.

Story

@squadcast shared a post, 2 years, 9 months ago

Prometheus Blackbox Exporter: Guide & Tutorial

#SRE too... #Squadca... #SRE #prometh...

Learn how Prometheus Blackbox Exporter can monitor external systems with multiple protocols and custom endpoints to provide rich metrics, alerting, increased visibility, and faster issue resolution.
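The idea behind black-box probing can be sketched without the exporter itself: check a target from the outside and record whether it responded and how long that took. The example below is not the Blackbox Exporter and does not use its configuration; `tcp_probe` is a hypothetical helper, and the result keys simply mirror the exporter's `probe_success` and `probe_duration_seconds` metric names.

```python
import socket
import time


def tcp_probe(host, port, timeout_s=2.0):
    """Try to open a TCP connection and report success and duration."""
    start = time.perf_counter()
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout_s):
            success = 1
    except OSError:
        success = 0
    duration = time.perf_counter() - start
    return {"probe_success": success, "probe_duration_seconds": duration}
```

Running a probe like this on a schedule and alerting when `probe_success` is 0 is a simplified version of what the post describes; the real exporter adds further probe types such as HTTP, DNS, and ICMP.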


Story

@squadcast shared a post, 2 years, 9 months ago

Scaling Site Reliability Engineering Teams the Right Way

#SRE too... #Runbook... #SRE

This blog post covers how to scale an SRE team: the common indicators that it is time to scale, and the steps to take when you do. It uses the People-Process-Tools approach to structure the explanation.
