How SRE Tools Can Help
Most SRE teams eventually reach a point where they can’t meet all the demands placed on them. This is when these teams need to scale. However, adding more people isn’t always the answer. Let’s explore what scaling a team is about, what the indicators are, steps you can take, and how you know when you’re done.
SRE Tools for Scaling
The subject of SRE tools is vast. Rather than listing specific tools, let’s discuss how to think about them for scaling.
Different tools address different scaling challenges. Analyze where your team’s time actually goes to find the most impactful improvements. Some of that data may live in your project management or ticketing systems, but often you’ll also need feedback from the team.
Generally, effective SRE tools can:
- Handle more load with the same team: Tools like pssh or Ansible let a few engineers manage large server fleets. Modern incident response platforms tend to scale well and need little configuration. Incident management tools like Squadcast can prioritize and deduplicate incidents, allowing engineers to focus on critical tasks.
- Reduce rework by reducing errors: Script libraries, runbooks, and runbook automation systems promote task repeatability. Using containers with immutable servers avoids errors caused by configuration drift.
- Eliminate certain kinds of work: Container orchestration systems like Kubernetes take over tasks like setting up process supervisors and much of the work of managing load balancers. Distributed tracing, instrumented with a framework like OpenTelemetry, reduces the need for complex log aggregation systems to track transactions through distributed systems.
- Delegate work: Tools like Rundeck allow secure, role-based access to scripts. This empowers dependent teams to work independently without adding to the SRE workload. Similarly, tools like Metabase, Kibana, and Grafana can give product management, customer support, or leadership self-service access to production data, logs, or metrics. This frees SREs from low-value tasks.
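To make the "prioritize and deduplicate" idea from the first bullet concrete, here is a minimal Python sketch. It is not how Squadcast or any other product works; the `Alert` fields, the `SEVERITY_RANK` table, and the rule of grouping by service and check are all assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical severity ranking: a lower number means more urgent.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


@dataclass(frozen=True)
class Alert:
    service: str   # assumed field: the service that raised the alert
    check: str     # assumed field: the failing check, e.g. "latency"
    severity: str  # one of the keys in SEVERITY_RANK


def dedupe_and_prioritize(alerts):
    """Collapse alerts that share (service, check) into one incident,
    then order incidents by worst severity, then by alert count."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.service, alert.check)].append(alert)

    incidents = []
    for (service, check), members in groups.items():
        worst = min(members, key=lambda a: SEVERITY_RANK[a.severity])
        incidents.append({
            "service": service,
            "check": check,
            "severity": worst.severity,
            "count": len(members),
        })

    # Most severe first; among equals, the noisier incident first.
    incidents.sort(key=lambda i: (SEVERITY_RANK[i["severity"]], -i["count"]))
    return incidents
```

With this rule, three alerts for the same failing check become one incident for the on-call engineer to handle instead of three pages.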
There Are No Silver Bullets
Don’t view SRE tools as a cure-all. Introducing new tools can be expensive and disruptive. A cost-benefit analysis is necessary before investing.
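One simple way to frame that cost-benefit analysis is a break-even estimate: how many months until the engineer time a tool saves pays back what it costs. The sketch below is a rough model, not a complete analysis. The function name, its parameters, and the idea of valuing saved hours at a flat hourly rate are assumptions, and it ignores disruption costs that are harder to quantify.

```python
def breakeven_months(upfront_cost, monthly_cost, hours_saved_per_month, hourly_rate):
    """Return the number of months until savings cover the tool's cost.

    Returns None when the tool never pays for itself, i.e. when the
    monthly value of saved hours does not exceed the monthly running cost.
    """
    monthly_net = hours_saved_per_month * hourly_rate - monthly_cost
    if monthly_net <= 0:
        return None
    return upfront_cost / monthly_net


# Illustrative numbers only: 12,000 to roll out, 500 per month to run,
# 40 engineer-hours saved per month, valued at 75 per hour.
months = breakeven_months(12_000, 500, 40, 75)
```

If the result is `None`, or longer than you expect to keep the tool, the investment is hard to justify on time savings alone.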
When to Add People
Once you’ve exhausted other options, you can start adding people.
Capacity Planning
Capacity planning is an art, requiring a blend of data and judgment. Here are some tips:
- Use existing workload data to project demand, in ideal person-hours or story points, for each service you manage. From that baseline you should be able to estimate the workload impact of adding new services.
- Factor in the relative productivity and cost of senior vs. junior engineers. Juniors take longer on tasks, while seniors have other responsibilities that limit their availability. Put numbers on both so you can reason about capacity explicitly.
- High utilization (ratio of task hours to available working hours) isn’t a good efficiency measure. Less slack time reduces innovation and can lead to burnout. Plan for 30% slack. It’s better to have slightly more capacity than less.
- Be conservative in capacity projections and liberal in demand projections. Add buffers.
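The tips above can be turned into a small back-of-the-envelope calculation. The sketch below is a simplification: the `Engineer` fields, the productivity and availability factors, the 40-hour week, and the `demand_buffer` parameter are assumptions for illustration. Only the 30% slack figure comes from the guidance above.

```python
import math
from dataclasses import dataclass


@dataclass
class Engineer:
    level: str
    productivity: float  # assumed: output relative to a fully ramped engineer
    availability: float  # assumed: share of the week available for team work


def effective_hours(team, hours_per_week=40.0, slack=0.30):
    """Weekly task hours the team can absorb after reserving slack."""
    return sum(
        hours_per_week * e.productivity * e.availability * (1 - slack)
        for e in team
    )


def engineers_needed(demand_hours, hours_per_week=40.0, slack=0.30, demand_buffer=0.0):
    """Fully productive engineers needed to cover weekly demand.

    demand_buffer inflates the demand projection (0.2 means plan for 20% more).
    """
    if not 0 <= slack < 1:
        raise ValueError("slack must be in [0, 1)")
    usable = hours_per_week * (1 - slack)
    return math.ceil(demand_hours * (1 + demand_buffer) / usable)


# Illustrative numbers only.
team = [Engineer("senior", 1.0, 0.6), Engineer("junior", 0.6, 1.0)]
current_capacity = round(effective_hours(team), 1)   # 33.6 task hours per week
baseline = engineers_needed(200)                     # 8
buffered = engineers_needed(200, demand_buffer=0.2)  # 9
```

Note that the senior and the junior in this example contribute the same effective hours for different reasons, which is why it helps to model productivity and availability separately.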
Team Composition
Consider these factors when planning your team composition:
- Experience: Balance the experience mix. Generally, teams have juniors, intermediates, and seniors. The definitions of these categories will vary depending on your location, tech stack, and business domain.
- Diversity: There are good reasons to seek a diverse team. Multiple perspectives lead to greater creativity and innovation. Diverse teams also tend to be better behaved and more professional.
- Culture Fit: Focus on excluding jerks, not those who don’t conform to a stereotype. Jerks kill team productivity.
Candidate Sources
- Internal hiring: Poach from elsewhere in your company. Internal candidates are a known quantity and often cheaper than external hires. Look for potential SREs in System Administration, Build, or DevOps teams. Software developers can bring engineering rigor to the team.
- External hiring: If internal hiring moves the scaling problem to another team, consider external sources like employee referrals, recruitment consultants, job boards, advertising, and your careers page. Employee referrals are typically cheaper and have a better hit rate because they’re pre-filtered by the employee. Provide referral rewards and incentives.
Conclusion
Scaling SRE teams requires careful analysis and planning. Adding people is slow, expensive, and risky, so consider process or technology improvements first. When hiring, plan capacity requirements with data, and think about team composition for long-term success.