
Scaling Site Reliability Engineering Teams the Right Way

This blog post discusses how to scale Site Reliability Engineering (SRE) teams effectively. It emphasizes that adding more people is not always the best solution and explores alternative methods such as utilizing SRE tools and improving processes.

The blog post highlights specific categories of SRE tools that can help teams handle more load, reduce errors and rework, eliminate certain tasks, and delegate work to other teams. It cautions against implementing these tools without a cost-benefit analysis as they can be expensive and disruptive.

When adding people to the team is necessary, the post advises on capacity planning including using data to project workload and considering the experience level of new hires. It also emphasizes the importance of building a diverse team with the right cultural fit.

How SRE Tools Can Help

Most SRE teams eventually reach a point where they can’t meet all the demands placed on them. This is when these teams need to scale. However, adding more people isn’t always the answer. Let’s explore what scaling a team is about, what the indicators are, steps you can take, and how you know when you’re done.

SRE Tools for Scaling

The subject of SRE tools is vast. Rather than listing specific tools, let’s discuss how to think about them for scaling.

Different tools address different scaling challenges. Analyze your team’s needs to determine which improvements would have the most impact. The data for this analysis may already sit in your project management or ticketing systems, but often you’ll also need feedback from the team.
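As a rough illustration of that analysis, the sketch below totals the hours logged against each kind of work and ranks the categories by share of team time. The ticket export, category names, and numbers are all hypothetical; a real ticketing system would need its own export and field mapping.

```python
from collections import defaultdict

# Hypothetical export from a ticketing system: (category, hours spent).
# The categories and numbers are illustrative, not real data.
TICKETS = [
    ("manual deploys", 6.0),
    ("access requests", 1.5),
    ("alert triage", 4.0),
    ("manual deploys", 5.0),
    ("access requests", 2.0),
    ("alert triage", 3.5),
    ("data pulls for other teams", 2.0),
]


def hours_by_category(tickets):
    """Sum hours per category and return (category, hours) pairs, largest first."""
    totals = defaultdict(float)
    for category, hours in tickets:
        totals[category] += hours
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    ranked = hours_by_category(TICKETS)
    total = sum(hours for _, hours in ranked)
    for category, hours in ranked:
        # Print each category with its hours and its share of the total.
        print(f"{category:<28} {hours:5.1f} h  {hours / total:5.1%}")
```

The categories at the top of the ranking are the natural first candidates for tooling, though team feedback may reveal pain that never makes it into tickets.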

Generally, effective SRE tools can:

  • Handle more load with the same team: Tools like pssh or Ansible can manage large server fleets. Modern incident response platforms often scale well and are easier to configure. Incident management tools like Squadcast can prioritize and deduplicate incidents, allowing engineers to focus on critical tasks.
  • Reduce rework by reducing errors: Script libraries, runbooks, and runbook automation systems promote task repeatability. Using containers with immutable servers avoids errors caused by configuration drift.
  • Eliminate certain kinds of work: Container orchestration systems like Kubernetes eliminate tasks like setting up process supervisors and managing load balancers. Distributed tracing, instrumented with frameworks like OpenTelemetry, reduces the need for complex log aggregation systems to track transactions through distributed systems.
  • Delegate work: Tools like RunDeck allow secure, role-based access to scripts. This empowers dependent teams to work independently without adding to the SRE workload. Similarly, tools like Metabase, Kibana, and Grafana can provide self-service access to production data, logs, or metrics to product management, customer support, or management. This frees SREs from low-value tasks.
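To make the deduplication idea from the first bullet concrete, here is a minimal sketch that folds repeated alerts for the same problem into a single incident and sorts the result by urgency. It is a generic illustration, not how Squadcast or any other product implements it; the alert fields (service, check, severity) and the "lower number is more urgent" convention are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    fingerprint: tuple  # (service, check) identifies "the same problem"
    severity: int       # lower number = more urgent (assumed convention)
    count: int = 1      # how many raw alerts were folded into this incident


def deduplicate(alerts):
    """Fold alerts with the same (service, check) into one incident.

    Each alert is a dict with hypothetical keys: service, check, severity.
    Returns incidents sorted most urgent first, noisier ones breaking ties.
    """
    incidents = {}
    for alert in alerts:
        key = (alert["service"], alert["check"])
        if key in incidents:
            existing = incidents[key]
            existing.count += 1
            # Keep the most urgent severity seen for this fingerprint.
            existing.severity = min(existing.severity, alert["severity"])
        else:
            incidents[key] = Incident(key, alert["severity"])
    return sorted(incidents.values(), key=lambda i: (i.severity, -i.count))
```

With this shape, an on-call engineer sees one entry per underlying problem instead of one page per alert, which is the effect the bullet describes.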

There are no silver bullets

Don’t view SRE tools as a cure-all. Introducing new tools can be expensive and disruptive. A cost-benefit analysis is necessary before investing.
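One simple way to frame such a cost-benefit analysis is a payback period: how many months of saved engineer time it takes to recover the upfront cost of adopting a tool. The sketch below assumes you can estimate four inputs, all of them hypothetical parameters rather than figures from this post; the upfront cost should include migration and training, since that is where the disruption shows up.

```python
def payback_months(upfront_cost, monthly_running_cost,
                   hours_saved_per_month, hourly_cost):
    """Months until cumulative net savings cover the upfront cost.

    Returns None when the tool costs more to run each month than it saves,
    in which case it never pays for itself.
    """
    net_monthly_saving = hours_saved_per_month * hourly_cost - monthly_running_cost
    if net_monthly_saving <= 0:
        return None
    return upfront_cost / net_monthly_saving
```

A short payback period does not settle the decision on its own, but a result of None or several years is a strong hint to look at process changes first.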

When to Add People

Once you’ve exhausted other options, you can start adding people.

Capacity Planning

Capacity planning is an art, requiring a blend of data and judgment. Here are some tips:

  • Use existing load data to make projections (in ideal person-hours or story points) for the services under management. From these you should be able to estimate the workload impact of adding new services.
  • Factor in the relative productivity and cost of senior vs. junior engineers. Juniors take longer on tasks, while seniors have other responsibilities. Quantify both effects so you can reason about real capacity rather than headcount.
  • High utilization (ratio of task hours to available working hours) isn’t a good efficiency measure. Less slack time reduces innovation and can lead to burnout. Plan for 30% slack. It’s better to have slightly more capacity than less.
  • Be conservative in capacity projections and liberal in demand projections. Add buffers.
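The tips above can be combined into a back-of-the-envelope calculation. The sketch below is one possible way to do it, not a standard formula: the 30% slack comes from the text, while the demand buffer, the share of a senior's time spent on tasks, and the junior productivity factor are illustrative assumptions you would replace with your own data.

```python
# All default values below are illustrative assumptions except the 30% slack,
# which is the figure suggested in the text.

def projected_demand(hours_per_service, services, demand_buffer=0.2):
    """Weekly task hours expected, padded with a buffer (liberal demand)."""
    return hours_per_service * services * (1 + demand_buffer)


def planned_capacity(seniors, juniors, hours_per_week=40.0,
                     senior_task_share=0.6, junior_productivity=0.5,
                     slack=0.30):
    """Weekly task hours the team can absorb while keeping slack.

    Seniors are discounted for their other responsibilities, juniors for
    taking longer on tasks (conservative capacity).
    """
    effective_engineers = seniors * senior_task_share + juniors * junior_productivity
    return effective_engineers * hours_per_week * (1 - slack)


def shortfall(demand, capacity):
    """Hours per week of demand that capacity cannot cover (0 if none)."""
    return max(0.0, demand - capacity)
```

For example, calling shortfall(projected_demand(6, 20), planned_capacity(3, 2)) compares 20 services at 6 hours each per week against a team of three seniors and two juniors. A positive result is the gap that tooling, process changes, or new hires would have to close.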

Team Composition

Consider these factors when planning your team composition:

  • Experience: Balance the experience mix. Generally, teams have juniors, intermediates, and seniors. The definitions of these categories will vary depending on your location, tech stack, and business domain.
  • Diversity: There are good reasons to seek a diverse team. Multiple perspectives lead to greater creativity and innovation. Diverse teams also tend to be better behaved and more professional.
  • Culture Fit: Focus on excluding jerks, not those who don’t conform to a stereotype. Jerks kill team productivity.

Candidate Sources

  • Internal hiring: Poach from elsewhere in your company. They’re a known quantity and often cheaper than external hires. Look for System Administration, Build, or DevOps teams with potential SREs. Software developers can bring engineering rigor to the team.
  • External hiring: If internal hiring moves the scaling problem to another team, consider external sources like employee referrals, recruitment consultants, job boards, advertising, and your careers page. Employee referrals are typically cheaper and have a better hit rate because they’re pre-filtered by the employee. Provide referral rewards and incentives.

Conclusion

Scaling SRE teams requires careful analysis and planning. Adding people is slow, expensive, and risky, so consider process or technology improvements first. When hiring, plan capacity requirements with data, and think about team composition for long-term success.

Squadcast is an Incident Management tool that’s purpose-built for SRE. Get rid of unwanted alerts, receive relevant notifications and integrate with popular ChatOps tools. Work in collaboration using virtual incident war rooms and use automation to eliminate toil.


Squadcast Inc

@squadcast
Squadcast is a cloud-based software designed around Site Reliability Engineering (SRE) practices with best-of-breed Incident Management & On-call Scheduling capabilities.