Runbook Automation: Achieving Faster Incident Recovery

A runbook is a predefined set of steps or procedures that is usually executed manually by a systems engineer. For instance, say you want to upgrade an application in production and you have a documented set of steps for doing so. We call this a runbook. It contains procedures to start, stop, supervise, and debug the system.

Recent research shows that 80% of the time spent by engineering teams goes into triaging incidents. Over the past few years, the shift to microservices has sharply increased code-base complexity. Managing and monitoring many microservice endpoints means a large number of checkpoints and alerts. As a result, we end up with too many incidents during outages, and engineering teams get buried in operational work. To get a better handle on incidents, teams can use automated and executable runbooks to set up auto-mitigation or remediation. These runbooks should be triggered by events or logs and should create incidents for engineers only when necessary. Broadly speaking, runbooks can be categorized as:

Procedural Runbooks: Procedural runbooks are manual runbooks where you simply follow the technical documents and run the steps. Here, a systems engineer uses standard tools to access production systems and follows the procedure by hand.
Executable Runbooks: Executable runbooks are like procedural runbooks in that a systems engineer follows the procedure as described. In addition, the engineer can run an automation task from their own machine (a shell script, a PowerShell script, or any other script) against a target system to fix the problem.
Automated Runbooks: As the name suggests, automated runbooks run automatically without any manual interaction. This blog talks about automated runbooks and a few automation tools.

Automated runbooks allow us to automate time-consuming and repetitive tasks on one or more servers.

Listed below are a few instances where automated runbooks can potentially save the day:

Active Directory:
We can use automated runbooks to update Active Directory whenever a new user is onboarded. A runbook can create the user account and assign the user to multiple groups, which ensures that they have the appropriate permissions and are part of the organizational domain. We can also add any other activities that are needed when a new employee joins. Automating these manual tasks helps users get onboarded quickly.
Virtual Machine/Service Management:
We can use automated runbooks to manage our virtual machines (VMs) or services, in scenarios such as:
* Restarting VMs after patching
* Checking the status of a service
* Restarting services running in VMs after deployments
When you see VMs in a hung state or not serving any traffic or requests, you can create a quick-fix runbook that runs against the active incident and mitigates it.
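As a rough illustration of the "check a service, restart it if needed" scenario, here is a minimal Python sketch. It assumes a Linux host managed by systemd; the unit name and the function names `run_command` and `ensure_service_running` are hypothetical and chosen only for this example.

```python
import subprocess
from typing import Callable, List

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[List[str]], int]


def run_command(argv: List[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(argv, check=False).returncode


def ensure_service_running(unit: str, runner: Runner = run_command) -> str:
    """Restart `unit` only if systemd reports that it is not active."""
    # `systemctl is-active --quiet` exits with 0 when the unit is active.
    if runner(["systemctl", "is-active", "--quiet", unit]) == 0:
        return "already-running"
    # The unit is inactive or failed, so attempt a restart.
    if runner(["systemctl", "restart", unit]) == 0:
        return "restarted"
    return "restart-failed"


if __name__ == "__main__":
    # "nginx.service" is only an example unit name.
    print(ensure_service_running("nginx.service"))
```

The runner is passed in as a parameter so the decision logic can be exercised without touching a real service; in a real runbook the returned status would typically be written back to the incident.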
Log Archive:
Another use case is automating log management with runbooks that either delete old data or archive it into Azure log tables. Later, you can use these Azure log tables to analyze the data and extract themes from it, for example, which types of errors the Web Apps servers encountered in the last 30 days. Looking into that data helps you improve the reliability of the product.
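To make the archive-or-delete idea concrete, here is a small Python sketch. It is not tied to Azure: a local archive directory stands in for the log tables mentioned above, and the function name, the `*.log` pattern, and the 30-day default are all assumptions made for illustration.

```python
import gzip
import shutil
import time
from pathlib import Path
from typing import List


def archive_old_logs(log_dir: Path, archive_dir: Path, max_age_days: int = 30) -> List[Path]:
    """Gzip every *.log file older than max_age_days into archive_dir.

    The original file is deleted once its compressed copy has been written.
    """
    cutoff = time.time() - max_age_days * 86400
    archive_dir.mkdir(parents=True, exist_ok=True)
    archived = []
    for log_file in sorted(log_dir.glob("*.log")):
        if log_file.stat().st_mtime >= cutoff:
            continue  # still recent, leave it in place
        target = archive_dir / (log_file.name + ".gz")
        with log_file.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        log_file.unlink()
        archived.append(target)
    return archived
```

A scheduled runbook could call `archive_old_logs(Path("/var/log/myapp"), Path("/var/archive/myapp"))`, where both paths are placeholders, and then ship the compressed files to whatever long-term store the team uses.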
Monitoring:
Another use case is monitoring. Using runbooks, we can monitor computer responsiveness. Is the host available on the network? How much disk space is left on the machine? How healthy are the daemons or services? What is the resource utilization of the servers? With any scripting language we can fetch these details and attach them to incidents or use them to start an investigation.
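Two of the questions above (is the host reachable, and how much disk space is left) can be answered with the Python standard library alone. The sketch below is one possible shape for such a check; the function names and the choice of a plain TCP connection as the reachability test are assumptions, not a prescribed method.

```python
import shutil
import socket


def host_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def disk_free_percent(path: str = "/") -> float:
    """Return the percentage of free space on the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total * 100


def health_report(host: str, port: int, path: str = "/") -> dict:
    """Collect the checks into one dictionary that can be attached to an incident."""
    return {
        "host": host,
        "reachable": host_reachable(host, port),
        "disk_free_percent": round(disk_free_percent(path), 1),
    }
```

Calling `health_report("10.0.0.5", 22)` (a made-up address) returns a small dictionary that a runbook could post as a note on the incident.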
Configurations Management:
Standard baseline configurations can be deployed using runbooks. These configurations could relate to services, clients, or network equipment; even mobile devices can be configured. This way we can meet a minimum security standard as required by the organizational security policy. We can also deploy OS and application configuration using runbooks, and if any software or patches need to be deployed, we can do that with runbook automation as well.

Here are a few runbook automation tools that we can use for the above:

Azure Automation:

Azure Automation is Microsoft’s cloud-hosted automation and configuration service that provides consistent management across your Azure and non-Azure environments. It consists of process automation, update management, and configuration features. Azure Automation provides complete control during deployment, operations, and decommissioning of workloads and resources. Runbooks can be written as PowerShell, PowerShell Workflow, graphical, or Python runbooks. We can trigger these runbooks from Azure Alerts, webhooks, schedules, Logic Apps, another runbook, or watcher tasks.

As an example, consider a PowerShell runbook that restarts the Web Apps servers. We can run this runbook on a schedule as required, or trigger it from a webhook.
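A webhook trigger is just an HTTP POST to the URL that Azure Automation generates when the webhook is created. The Python sketch below shows what such a call could look like from a monitoring script. The URL and the payload field `webapp` are placeholders invented for this example; what the runbook does with the request body depends entirely on how the runbook itself is written.

```python
import json
import urllib.request


def build_webhook_request(webhook_url: str, payload: dict) -> urllib.request.Request:
    """Build the POST request that starts the runbook behind webhook_url."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def trigger_runbook(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code of the response."""
    request = build_webhook_request(webhook_url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder URL: replace it with the webhook URL shown once at creation time.
    url = "https://example.invalid/webhooks?token=PLACEHOLDER"
    request = build_webhook_request(url, {"webapp": "demo-webapp"})
    print(request.get_method(), request.full_url)
```

Because the webhook URL carries its own token, it should be stored like any other secret rather than committed alongside the script.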

Rundeck:

Rundeck is a web-accessible console for dispatching commands and scripts to your nodes. It can also be used for deployments, operations tasks, and more. Rundeck lets you create jobs from existing scripts, run commands on selected nodes, or schedule jobs to run at a later time. In short, using Rundeck you can automate routine or ad hoc tasks by creating runbooks.

Rundeck Features:

Rundeck supports multi-step workflows.
It supports distributed command execution.
Jobs can be executed on demand or set up with the scheduler.
Rundeck provides a graphical web console for running jobs and commands.
It offers a command-line interface and a Web API so that it can be operated from code.
It logs the execution history of every command and job for audit purposes.

Rundeck can integrate with tools in several ways:

Rundeck Plugin.
A Rundeck plugin implemented in Java or shell script, installed into a Rundeck server.
External Service.
An external service that is used by a Rundeck plugin or the Rundeck core.
External Plugin.
A plugin installed in another tool that interacts with Rundeck through its API.
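Since Rundeck exposes a Web API, an external tool can start a job with a single authenticated POST. The following Python sketch builds such a request against the job-run endpoint using an API token header. The server address, API version number, job ID, token, and the `service` option are all placeholders; check them against your own Rundeck installation before relying on this.

```python
import json
import urllib.request
from typing import Dict, Optional


def build_run_job_request(
    base_url: str,
    api_version: int,
    job_id: str,
    token: str,
    options: Optional[Dict[str, str]] = None,
) -> urllib.request.Request:
    """Build a POST request for Rundeck's /api/<version>/job/<id>/run endpoint."""
    url = f"{base_url.rstrip('/')}/api/{api_version}/job/{job_id}/run"
    # Job options are passed as a JSON object under the "options" key.
    body = json.dumps({"options": options or {}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Rundeck-Auth-Token": token,
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


if __name__ == "__main__":
    # Every value below is a placeholder for illustration only.
    request = build_run_job_request(
        "https://rundeck.example.invalid", 41, "JOB-ID", "API-TOKEN", {"service": "nginx"}
    )
    print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` would start the job; the sketch stops at building it so it can be inspected safely.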

Ansible:

Ansible is a powerful open-source configuration management tool. Ansible uses playbooks to deploy, manage, and configure anything from a single server to multi-server environments. A playbook is similar to a runbook: it is where you define a set of procedures.

Ansible Features:

Agentless: There is no need to install any software or client agent on the nodes you manage, unlike Puppet or Chef.
Python Supported: Ansible is written in Python and ships with a large collection of Python modules. Python is required on the control node, and most modules also need a Python interpreter on the managed nodes.
Secure SSH: Ansible connects to servers over Secure Shell (SSH) to carry out any operation, typically with key-based, password-less authentication. Because it relies on SSH rather than a custom agent, there is no extra service to run and secure on each node.
Push Architecture: Ansible follows a push-based architecture for deploying configuration. Whenever you want to roll out a change, just update the playbook and run it; Ansible takes care of the rest. In short, the central server manages all the configuration and pushes it to the target servers.

Ansible playbooks are written in YAML and declaratively define your configuration. A typical example is a playbook that installs Nginx on a group of servers.
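Here is a minimal sketch of what an Nginx playbook could look like, wrapped in a few lines of Python that write it to disk and build the command to run it. It assumes Debian or Ubuntu targets (hence the `apt` module); the inventory group `webservers`, the file names, and the helper `write_playbook` are made up for this example.

```python
from pathlib import Path
from typing import List

# A small playbook kept as text. The host group "webservers" is hypothetical.
NGINX_PLAYBOOK = """\
---
- name: Install and start Nginx
  hosts: webservers
  become: true
  tasks:
    - name: Install the nginx package
      ansible.builtin.apt:
        name: nginx
        state: present
        update_cache: true

    - name: Ensure nginx is running and enabled at boot
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
"""


def write_playbook(path: Path, inventory: str = "inventory.ini") -> List[str]:
    """Write the playbook to path and return the command that would run it."""
    path.write_text(NGINX_PLAYBOOK, encoding="utf-8")
    return ["ansible-playbook", "-i", inventory, str(path)]


if __name__ == "__main__":
    command = write_playbook(Path("nginx.yml"))
    print(" ".join(command))
```

Running the printed command from the control node pushes the configuration to every host in the `webservers` group; on other distributions the package task would use a different module.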


Squadcast:

Squadcast Runbooks let you level up your incident management with a next-generation reliability orchestration engine based on Site Reliability Engineering (SRE) practices. It is designed to host and execute runbook automation in response to operational events or incidents. By using Squadcast Runbooks you can remove toil and repetitive tasks from your system. We have already seen how a runbook can be created with Azure Automation; now let's see how easily one can be created with Squadcast.

Let’s say your Squadcast dashboard shows that your Web Apps servers are using a large amount of resources, possibly because of high CPU usage or high traffic. To mitigate this, we want an automation that checks whether resource utilization on the Web Apps servers has increased and crossed a certain threshold, say 65%. To do this, we would create a runbook in Squadcast and schedule it to run on incident tickets automatically.
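The threshold check itself is simple enough to sketch in a few lines of Python3, one of the script types listed below. This is only an illustration of the logic, not Squadcast's own API: the function names are invented, and using the 1-minute load average divided by the core count is a rough stand-in for real CPU utilization that works on Unix-like systems only.

```python
import os


def cpu_utilization_percent() -> float:
    """Approximate CPU utilization from the 1-minute load average (Unix only)."""
    load_1min, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    # Cap at 100 because the load average can exceed the number of cores.
    return min(load_1min / cores * 100, 100.0)


def needs_mitigation(utilization_percent: float, threshold_percent: float = 65.0) -> bool:
    """Return True once utilization has crossed the threshold."""
    return utilization_percent > threshold_percent


if __name__ == "__main__":
    current = cpu_utilization_percent()
    if needs_mitigation(current):
        print(f"CPU at {current:.1f}%: above threshold, mitigation required")
    else:
        print(f"CPU at {current:.1f}%: within limits")
```

Keeping the comparison in its own function makes the 65% figure easy to change per service without touching the measurement code.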

Runbooks Support:

Currently, Squadcast Runbooks support the following languages:

Shell script
Lua script
Python3 script
NodeJS script
Ansible configuration

Here are some best practices for runbooks:

Know your Application:
Within our application, we need to consider which processes need improvement. Once we have identified processes that could benefit from runbook automation, we can start gathering requirements.
Gather Requirements:
While gathering requirements, we should focus on determining the input and output values of our runbook, and whether each input is supplied automatically or has to be entered by the user.
Use of Integration Packs:
An integration pack gives us additional runbook activities. For example, if you want to automate user onboarding and that includes working with an Active Directory user account, you will need to register and deploy the integration pack for Active Directory.
Single or multiple host runbook:
We need to know whether the runbook will run against a single host or against multiple hosts at the same time, because the design of the runbook depends on that decision.
Runbook Execution Trigger:
We should know how the runbook is going to be executed. Will it run on a schedule? Will it be run manually from time to time? Will it need any kind of user interaction?
Runbook Logs:
We should also decide which logs are needed once the runbook has executed and where these logs will be saved for future reference or debugging.
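Several of these practices (explicit inputs and outputs, multi-host execution, and saved logs) can be seen together in a small skeleton. The Python sketch below is one possible layout rather than a standard: `run_on_hosts`, the log format, and the idea of passing the remediation step in as a function are all assumptions made for illustration.

```python
import logging
from typing import Callable, Dict, List


def run_on_hosts(hosts: List[str], step: Callable[[str], None], log_path: str) -> Dict[str, str]:
    """Run `step` against every host, log each outcome, and return a result map.

    Inputs: the host list and the step to execute. Output: host -> "ok" or "failed".
    """
    logger = logging.getLogger("runbook")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    results: Dict[str, str] = {}
    try:
        for host in hosts:
            try:
                step(host)
                results[host] = "ok"
                logger.info("step succeeded on %s", host)
            except Exception as exc:
                # One failing host should not stop the run on the remaining hosts.
                results[host] = "failed"
                logger.error("step failed on %s: %s", host, exc)
    finally:
        # Detach and close the handler so repeated runs do not duplicate log lines.
        logger.removeHandler(handler)
        handler.close()
    return results
```

The returned dictionary is the runbook's output, and the log file gives the audit trail that is useful when debugging a run after the fact.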

How do you write a runbook?

A runbook is a collection of procedures for dealing with common issues. Runbooks help your team deal with these situations more efficiently. Here are a few general steps to keep in mind while writing a runbook:

Get a full understanding of your system's architecture. Identify all processes, configurations, and dependencies.
Brainstorm the most common issues that come up. What problems do you see people running into again and again? What kind of information does it take to resolve them?
Create a flowchart or diagram of the steps involved in resolving each issue from start to finish, that is, from when someone first encounters the problem until they have resolved it and gotten back to work. Add details of key personnel (such as an incident lead) who can help keep systems and processes running.
Before you deploy your runbooks, make sure that they have been thoroughly tested. Keep them in a place where everyone who needs them can find them easily. Review them periodically to make sure they are up to date.

What should a runbook include?

Detailed, clear, and concise steps for dealing with specific problems, such as system failures and security breaches.
Who is on call to resolve an incident, what resources are available to them, and who can assist them in resolving it.
A runbook may also include emergency contact information, procedures for data backup and recovery, and a list of critical systems with their dependencies.
Keep runbooks in a place where everyone who needs them can easily find them. Review them periodically to make sure they are up to date.

Difference between a runbook and an SOP?

A runbook is a predefined set of technical steps, procedures, or documentation that is usually executed manually by a systems engineer. A runbook can also contain information related to application deployment, monitoring, and maintenance. Standard operating procedures (SOPs), by contrast, are descriptions of the steps required to complete specific activities or tasks. They can be used to ensure that industry rules and regulations are followed in an organization.

Playbooks versus Runbooks

A runbook is a step-by-step procedure that helps ensure the technical aspects of an organization’s systems continue to function smoothly. A playbook is more general: it outlines an organization’s approach to a task and the responsibilities of its workers. While both a runbook and a playbook include information on technical aspects, a playbook will likely go into greater detail about the cultural, compliance, or user experience aspects of a task.

Conclusion

With the right amount of automation and strategic process management, you can improve incident remediation instructions and keep runbooks updated in a timely manner. This ensures that when the next incident occurs, the documentation is current and available to the right person at the right time.

Originally published at https://www.squadcast.com.


Squadcast Inc

@squadcast

Squadcast is a cloud-based software designed around Site Reliability Engineering (SRE) practices with best-of-breed Incident Management & On-call Scheduling capabilities.
