How to Make On-Call Rotations Less Stressful for Your Team

Incident management is inherently stressful, especially during holidays. This article provides a checklist to ensure your on-call team stays calm and collected if an incident occurs.

Why On-Call Rotations Are Stressful

Unclear processes and undefined procedures can make on-call rotations a nightmare, particularly around holidays. Imagine your phone interrupting you constantly during Christmas dinner or while you unwrap presents.

The stress level of your on-call team directly reflects the health of your systems, code quality, and overall company culture. Therefore, it’s crucial to do everything possible to make on-call rotations easier for your team. After all, a happy on-call team translates to a smoother-running organization.

How to Make On-Call Rotations Easier

Define a Clear Framework and Pre-Holiday Checklist

Establish a framework with a well-defined set of rules, especially during the holiday season. Create a pre-holiday checklist to ensure everything is in order before employees take time off.

Create a Sensible On-Call Rotation Schedule

In most cases, the burden of on-call duties falls on a small group of engineers. On-call burnout is a significant issue in the SRE and DevOps world, especially during holidays, when fewer people are willing to be on call.

To start, expand your on-call team to distribute the stress among more people. Everyone deserves vacation time, and spreading the load across a larger team makes a big difference.
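As a rough illustration of spreading the load, here is a minimal Python sketch that assigns weekly shifts so that nobody works back-to-back weeks and holiday weeks are shared out. The function name, the weekly shift length, and the tie-breaking rules are all assumptions made for this example, not a feature of any particular tool.

```python
from collections import Counter
from datetime import date, timedelta


def build_rotation(engineers, start, weeks, holiday_weeks=()):
    """Assign one engineer per week, sharing holiday weeks fairly.

    Hypothetical helper: `holiday_weeks` holds the start dates of the
    weeks that count as holiday shifts.
    """
    holiday_weeks = set(holiday_weeks)
    total = Counter({name: 0 for name in engineers})
    holiday = Counter({name: 0 for name in engineers})
    schedule = []
    previous = None
    for i in range(weeks):
        week_start = start + timedelta(weeks=i)
        is_holiday = week_start in holiday_weeks
        # Avoid back-to-back shifts whenever someone else is available.
        candidates = [e for e in engineers if e != previous] or list(engineers)
        # Prefer fewest holiday shifts (on holiday weeks), then fewest shifts
        # overall, then list order as a stable tie-breaker.
        chosen = min(
            candidates,
            key=lambda e: (
                holiday[e] if is_holiday else 0,
                total[e],
                engineers.index(e),
            ),
        )
        total[chosen] += 1
        if is_holiday:
            holiday[chosen] += 1
        schedule.append((week_start, chosen))
        previous = chosen
    return schedule


if __name__ == "__main__":
    team = ["ana", "ben", "chen", "dee"]
    festive = {date(2024, 12, 23), date(2024, 12, 30)}
    for week_start, name in build_rotation(team, date(2024, 12, 2), 8, festive):
        print(week_start, name)
```

With a larger team list, each person's share of both regular and holiday weeks shrinks, which is the point of expanding the rotation.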

Implement a System to Override Schedules When Necessary

Allow automatic schedule overrides for emergencies where a specific person or team is clearly best suited to handle the incident. You can do this with custom, automatically applied incident tags that route notifications to the right people or trigger predefined actions or scripts.
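The idea of tag-based overrides can be sketched in a few lines of Python. Everything here is hypothetical: the `Incident` class, the tag names, and the team names in `OVERRIDE_RULES` are placeholders, and a real incident management platform would expose its own routing configuration.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    title: str
    tags: dict = field(default_factory=dict)


# Hypothetical override rules: (tag key, tag value) -> who to notify instead
# of whoever is on the regular schedule.
OVERRIDE_RULES = {
    ("service", "payments"): "payments-team",
    ("component", "database"): "dba-on-call",
}


def resolve_responder(incident, scheduled_on_call, rules=OVERRIDE_RULES):
    """Return the override target for the first matching tag.

    Falls back to the scheduled on-call person when no rule matches.
    """
    for (key, value), target in rules.items():
        if incident.tags.get(key) == value:
            return target
    return scheduled_on_call


if __name__ == "__main__":
    incident = Incident("Card charges failing", {"service": "payments"})
    print(resolve_responder(incident, scheduled_on_call="alice"))
```

Keeping the fallback explicit means an incident with no matching tag still reaches the scheduled on-call engineer rather than nobody.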

Utilize Vacation Mode for On-Call Shifts

On-call schedules and rotations provide some structure, but they don’t account for planned or unplanned time off. A “Vacation Mode” feature allows team members to hand off shifts to others, ensuring coverage during emergencies or vacations.

Here are some best practices for using Vacation Mode:

Give your team ample notice before planned vacations to allow for adjustments to the on-call schedule.
If you’re the primary on-call for a service or system, activate Vacation Mode and find someone to cover your shift before your time off begins.
Reciprocate the favor for colleagues when they need coverage, if possible. Track your on-call hours and those of your teammates to avoid overburdening anyone.
In emergencies requiring a last-minute shift change, ask a colleague with sufficient bandwidth to cover for you. Ideally, choose someone who hasn’t been on-call recently.
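To make the last two practices concrete, here is a small Python sketch for choosing who covers a shift: it prefers the teammate who was on call least recently and, on a tie, the one with the fewest tracked hours. The function and its inputs are assumptions for illustration; how you record on-call history will depend on your tooling.

```python
from datetime import date


def pick_cover(candidates, absent, last_on_call, hours_on_call, unavailable=()):
    """Pick a teammate to cover a shift for `absent`.

    Hypothetical inputs:
      last_on_call  -- name -> date of that person's most recent shift
      hours_on_call -- name -> tracked on-call hours for the current period
    Returns None when nobody is eligible.
    """
    blocked = set(unavailable) | {absent}
    eligible = [c for c in candidates if c not in blocked]
    if not eligible:
        return None
    # Earliest "last shift" date first, then fewest tracked hours.
    return min(
        eligible,
        key=lambda c: (last_on_call.get(c, date.min), hours_on_call.get(c, 0)),
    )


if __name__ == "__main__":
    team = ["ana", "ben", "chen"]
    last = {"ana": date(2024, 12, 16), "ben": date(2024, 11, 4), "chen": date(2024, 12, 2)}
    hours = {"ana": 80, "ben": 40, "chen": 60}
    print(pick_cover(team, absent="ana", last_on_call=last, hours_on_call=hours))
```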

Adopt a “No Deploys” Policy During Weekends and Holidays

This builds upon the common practice of “No Deploy Fridays” familiar within the on-call community. Ideally, your infrastructure should be able to automatically detect and roll back failed deployments. While this may not be feasible for all systems and teams, having these practices in place helps teams quickly identify errors. It’s also standard practice to be available for at least a full workday following new deployments to monitor functionality and respond swiftly to any issues.
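One way to enforce such a policy is a simple gate in the deployment pipeline. The Python sketch below allows a deploy only on a workday that is followed by another workday, which covers weekends, holidays, and Fridays in one rule. The `HOLIDAYS` set is a made-up example calendar, and the one-day follow-up window is an assumption based on the "full workday" guideline above.

```python
from datetime import date, timedelta

# Hypothetical company holiday calendar; replace with your own dates.
HOLIDAYS = {date(2024, 12, 25), date(2025, 1, 1)}


def is_workday(day, holidays=HOLIDAYS):
    """True for Monday to Friday, excluding listed holidays."""
    return day.weekday() < 5 and day not in holidays


def deploy_allowed(day, holidays=HOLIDAYS):
    """Allow a deploy only if `day` and the following day are both workdays.

    Requiring a workday afterwards keeps someone around to watch the
    release, and it blocks Friday and pre-holiday deploys as a side effect.
    """
    return is_workday(day, holidays) and is_workday(day + timedelta(days=1), holidays)


if __name__ == "__main__":
    for day in (date(2024, 12, 18), date(2024, 12, 20), date(2024, 12, 24)):
        print(day, "deploy allowed" if deploy_allowed(day) else "deploy frozen")
```

A CI job could call `deploy_allowed(date.today())` and fail early, with a manual override reserved for urgent fixes.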

Make Incidents Context-Rich

A significant portion of on-call stress stems from a lack of information about why something malfunctioned. Valuable time is spent gathering context instead of resolving the incident, which raises Mean Time to Resolve (MTTR).

Here’s how to add context to incidents:

Ensure all relevant tags are attached to incidents, either automatically or manually. Examples include “Backend issue” or “Frontend issue” and “Severity: High” or “Severity: Low.”
Clearly define and update severity levels for each incident. This clarity allows your on-call team to determine if immediate action is required or if the fix can wait until after the holidays.
On-call teams often struggle to switch between various tools to find the information they need. Carefully configure your alert source integrations within your incident management tool to ensure valuable contextual information is automatically added to every incident. This could include your knowledge base, runbooks, or relevant data from monitoring, logging, tracing, or visualization tools. Time series data, graphs, or post-mortems of similar past incidents can provide valuable context to aid in faster decision-making.
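The points above can be sketched as a small enrichment step that runs when an alert arrives. This Python example is illustrative only: the alert fields, the 5% error-rate threshold for "high" severity, and the runbook URL are assumptions, and real alert payloads differ from tool to tool.

```python
# Hypothetical runbook index keyed by service name.
RUNBOOKS = {"checkout-api": "https://wiki.example.com/runbooks/checkout-api"}


def enrich_incident(alert, runbooks, past_incidents, high_error_rate=0.05):
    """Turn a raw alert dict into an incident with tags and useful links.

    `high_error_rate` is an assumed threshold; tune it per service.
    """
    service = alert.get("service", "unknown")
    error_rate = alert.get("error_rate", 0.0)
    severity = "high" if error_rate >= high_error_rate else "low"
    # Link post-mortems of earlier incidents on the same service.
    related = [p["id"] for p in past_incidents if p.get("service") == service]
    return {
        "title": alert.get("title", "Untitled alert"),
        "tags": {"service": service, "severity": severity},
        "runbook": runbooks.get(service),
        "related_postmortems": related,
    }


if __name__ == "__main__":
    alert = {"title": "5xx spike", "service": "checkout-api", "error_rate": 0.12}
    history = [{"id": "PM-1", "service": "checkout-api"}, {"id": "PM-2", "service": "search"}]
    print(enrich_incident(alert, RUNBOOKS, history))
```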

Proactive Incident Management with SLOs and Error Budgets

A proactive approach to incident management means anticipating potential incidents and having a plan in place; a reactive approach means scrambling when they occur. Trends in your Service Level Objectives (SLOs) and error budget graphs are a valuable input for proactive incident management. By correlating error budget consumption with past incidents, you can anticipate potential customer-impacting downtime. Analyze the types of incidents that have occurred and develop automated scripts to mitigate and resolve them.
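For readers new to error budgets, the arithmetic is simple enough to sketch in Python. The example assumes an availability SLO measured in minutes of downtime over a rolling window; the function names and the 30-day default are choices made for this illustration.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed downtime in the window for an availability SLO."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    return 1 - downtime_minutes / error_budget_minutes(slo_target, window_days)


def burn_rate(slo_target, downtime_minutes, elapsed_days, window_days=30):
    """How fast the budget is being spent relative to an even pace.

    A value above 1 means the budget runs out before the window ends
    if the current pace continues.
    """
    even_pace = error_budget_minutes(slo_target, window_days) * elapsed_days / window_days
    return downtime_minutes / even_pace


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1), "minutes of budget")
    print(round(budget_remaining(0.999, 21.6), 2), "of budget remaining")
    print(round(burn_rate(0.999, 21.6, elapsed_days=10), 2), "burn rate")
```

A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime. A burn rate that climbs above 1 shortly before a holiday is a signal to slow down risky changes while there is still time to act.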

Implement a Resolution and Remediation Plan

Services fail for various reasons, some known and some unknown. Having solutions readily available makes troubleshooting and resolution smoother.

The first step in incident resolution is minimizing customer impact as quickly as possible. The next step is long-term remediation, achieved through maintaining playbooks or creating a knowledge base for different incident types. These resources guide on-call personnel through the resolution process.

Use Tools to Reduce MTTR

Squadcast Actions: Having a predefined remediation plan in place is crucial. Integrate the tools you use for taking action, such as your CI/CD platform or infrastructure automation tools, into your incident management platform so you can execute actions directly from it when an incident occurs. For instance, you might roll back a feature or rebuild a project in response to an alert. These integrations can help resolve incidents before they impact customers.
Runbooks: If you already know the resolution steps for an incident, an executable script (runbook) can save significant time. Runbooks automate resolution, reducing a manual, repetitive process to a single click.
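As a rough sketch of what an executable runbook can look like, the Python example below maps an incident type to an ordered list of steps and runs them with one call. The incident type, step functions, and host name are placeholders invented for this example; the steps only return descriptions, where real ones would call your infrastructure tooling.

```python
def rotate_logs(context):
    # Placeholder step: a real implementation would invoke log rotation.
    return f"rotated logs on {context['host']}"


def purge_temp_files(context):
    # Placeholder step: a real implementation would clean temp directories.
    return f"purged temp files on {context['host']}"


# Hypothetical registry: incident type -> ordered resolution steps.
RUNBOOK_STEPS = {
    "disk_full": [rotate_logs, purge_temp_files],
}


def run_runbook(incident_type, context, registry=RUNBOOK_STEPS):
    """Run every step for the incident type and return a log of results.

    Returns None when no runbook exists, so the caller can escalate
    to a human instead.
    """
    steps = registry.get(incident_type)
    if steps is None:
        return None
    return [step(context) for step in steps]


if __name__ == "__main__":
    for line in run_runbook("disk_full", {"host": "web-01"}):
        print(line)
```

Returning a log of what ran gives the on-call engineer a record to attach to the incident timeline.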

Conclusion

There are numerous ways to improve the on-call experience for your team. Understanding why these practices are important and communicating this to the broader engineering team is vital. Remember, the well-being of your on-call team reflects the health of your systems and your overall company culture.

Therefore, the entire team has a responsibility to ensure a positive on-call experience for everyone. Let’s make improvements in incident management a priority!
