How important is Observability for SRE?

Observability is the practice of assessing a system’s internal state by observing its external outputs. Through instrumentation, systems can provide telemetry such as metrics, traces, and logs that help organizations better understand, debug, maintain and evolve their platforms.

SREs use many tools and practices to manage services at scale, and observability is a crucial one of them. It enhances SRE by allowing its practitioners to infer a system’s internal state. SREs need actionable data in order to develop and maintain scalable, reliable, and secure systems. Observability provides that data, helping SREs understand what is happening in their systems and why.

What is Observability?

In traditional monitoring systems, you usually have a series of dashboards that help you understand when something is going wrong. In cloud native environments that use a microservices architecture, we usually assume services are run by software rather than by humans. This increases complexity, and the dynamic nature of these environments makes it difficult to reason about problems. You need to make your systems observable so that you can dig into what’s going on.

Observability gives you the capacity to measure the internal state of your systems by checking their external outputs. It is built around three key pillars: metrics, traces, and logs.

Metrics are measurements of some aspect of your system, such as request latency or memory usage. They are numeric values recorded over time, usually with associated metadata (e.g., timestamp, name). They can be raw, calculated, or aggregated over a period of time. They can come from a variety of sources like servers or APIs. Metrics are structured by default and can be stored in open source systems like Prometheus and Riemann or in off-the-shelf solutions like Amazon CloudWatch and Azure Monitor. These optimized storage systems allow you to query metrics, create alerts, and retain the data for long periods of time.
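As a minimal sketch of the description above, a metric can be modeled as named, timestamped numeric samples with labels, which are then aggregated over a time window. The names `Sample` and `average_over_window`, the metric name, and the sample values are hypothetical; none of this is the data model of Prometheus or any other storage system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sample:
    """One numeric measurement plus its metadata (illustrative names only)."""
    name: str
    value: float
    timestamp: float  # seconds since an arbitrary starting point
    labels: dict = field(default_factory=dict)


def average_over_window(samples, name, start, end):
    """Aggregate raw samples of one metric into a mean over [start, end)."""
    values = [s.value for s in samples
              if s.name == name and start <= s.timestamp < end]
    if not values:
        return None  # no data in the window
    return sum(values) / len(values)


# Hypothetical request latencies, in seconds, for a made-up "checkout" service.
samples = [
    Sample("http_request_duration_seconds", 0.25, 100.0, {"service": "checkout"}),
    Sample("http_request_duration_seconds", 0.75, 130.0, {"service": "checkout"}),
    Sample("http_request_duration_seconds", 1.50, 190.0, {"service": "checkout"}),
]

# Only the first two samples fall inside the 60-second window.
window_mean = average_over_window(
    samples, "http_request_duration_seconds", 100.0, 160.0)
print(window_mean)  # 0.5
```

Dedicated metric stores perform this kind of windowed aggregation at query time, which is what makes alerting on a trend rather than on a single data point practical.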

Traces are the record of the execution path of a program or system. They represent the flow of a request through your services and allow you to see the end-to-end path of execution. Distributed tracing is particularly important in modern distributed architectures, like microservices. The primary building block of a trace is the span. In the OpenTracing specification, spans encapsulate the following information:

Operation name
Start and finish timestamp
key:value span Tags
key:value span Logs
SpanContext

A trace is a group of multiple spans that usually contain “References” to each other. They can be displayed using open source solutions like Jaeger or Zipkin as well as in SaaS offerings like Honeycomb or Datadog.
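To make the span fields listed above concrete, here is a minimal sketch of a span as a plain data structure and of a trace as a list of spans linked by parent references. This is not the OpenTracing API: `Span`, `SpanContext`, `slowest_child`, and all the values are hypothetical, and real tracers carry more information.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    """Identifiers that tie a span to its trace (illustrative only)."""
    trace_id: str
    span_id: str


@dataclass
class Span:
    operation_name: str
    start: float   # start timestamp, in seconds
    finish: float  # finish timestamp, in seconds
    context: SpanContext
    parent_span_id: Optional[str] = None  # a "child of" reference
    tags: dict = field(default_factory=dict)  # key:value span tags
    logs: list = field(default_factory=list)  # key:value span logs

    @property
    def duration(self) -> float:
        return self.finish - self.start


def slowest_child(spans, parent_span_id):
    """Return the longest-running direct child of a span, or None."""
    children = [s for s in spans if s.parent_span_id == parent_span_id]
    return max(children, key=lambda s: s.duration, default=None)


# One trace: a root request span with two child operations.
trace = [
    Span("GET /checkout", 0.0, 1.0, SpanContext("t1", "a")),
    Span("reserve-stock", 0.125, 0.25, SpanContext("t1", "b"),
         parent_span_id="a"),
    Span("charge-card", 0.25, 1.0, SpanContext("t1", "c"),
         parent_span_id="a", tags={"payment.provider": "example"}),
]

bottleneck = slowest_child(trace, "a")
print(bottleneck.operation_name)  # charge-card
```

Following the references from the root span down to the slowest child is, in essence, what a tracing UI lets you do visually.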

Logs are text records that describe discrete events at a specific point in time (e.g., an error occurred or an important operation was executed). They’re typically the first place you’ll look to find out what is going on with your systems. They include a timestamp and a payload to provide context. Logs come in three major formats: plain text, structured, and binary. Structured logs, which include additional metadata, can be stored in systems like Elasticsearch or Loki to be easily and efficiently queried.
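As an illustration of a structured log, the sketch below uses Python’s standard `logging` module with a small JSON formatter. The `JsonFormatter` class, the `context` key passed through `extra`, and the field names are assumptions made for this example, not a standard schema. The handler writes to an in-memory stream only so the output can be inspected; a real service would typically write to stdout or a file.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object (illustrative schema)."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge optional metadata supplied via logger.<level>(..., extra=...).
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output in one place

logger.error("payment failed",
             extra={"context": {"order_id": "o-123", "attempt": 2}})

log_line = stream.getvalue().strip()
print(log_line)
```

Because every entry carries the same machine-readable fields, a log store can filter on something like `order_id` directly instead of relying on free-text search.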

SREs can leverage this information to better understand, maintain and design systems that work at scale.

How can SREs leverage Observability?

According to the 2020 SRE Report, only 53% of respondents said they were using observability tools. This is a surprisingly low number, considering that the pressure to iterate faster and meet customer expectations has increased the demand for observability.

The increasing complexity of systems results in more unknowns, and teams need to answer specific questions about their systems. Observability tools can help you act proactively and fix issues before they have a major impact on users. In order to leverage observability, you’ll need to put in place the proper tooling and services to collect the necessary telemetry. Whether you use open source software or commercial solutions, you’ll need to:

Instrument your services to collect telemetry. This telemetry can come from servers, containers, or services and will provide information about your entire infrastructure
Correlate data between multiple sources, creating context, enhancing visualization, and enhancing automation

By using relevant metrics that track user satisfaction, you’ll be able to understand when your services are not reliable enough. By using traces, you’ll be able to understand the flow of requests through your systems and pinpoint where bottlenecks are forming. By using logs, you’ll be able to track and understand meaningful events in your services. Armed with this information, you’ll be able to detect issues faster, before they compromise your SLOs. Mean time to repair/recovery (MTTR) can be greatly reduced, and mean time between failures (MTBF) and mean time to failure (MTTF) can be improved, thanks to the better insights and alerts that observability provides. Well-crafted alerts, based on SLOs and powered by observability, can help reduce alerts to a sustainable number of actionable events. This helps reduce burnout and creates a culture that supports sustainable innovation.
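The reliability measures named above are simple arithmetic over incident records. The sketch below assumes one common set of definitions: MTTR as the average downtime per incident, and MTBF as the average uptime per failure over an observation window. Definitions vary between organizations, and the function name and incident data are made up for illustration.

```python
def mttr_and_mtbf(incidents, window_hours):
    """Return (MTTR, MTBF) in hours.

    incidents: list of (failure_start, recovered) pairs, in hours since the
    start of the observation window. Assumes incidents do not overlap.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    downtime = sum(recovered - start for start, recovered in incidents)
    uptime = window_hours - downtime
    count = len(incidents)
    return downtime / count, uptime / count


# Three hypothetical incidents in a 606-hour window: 6 hours of downtime.
incidents = [(100, 102), (300, 301), (500, 503)]
mttr, mtbf = mttr_and_mtbf(incidents, window_hours=606)
print(mttr, mtbf)  # 2.0 200.0
```

Faster detection and diagnosis shorten each incident, which lowers MTTR. Catching problems before they turn into failures means fewer incidents in the same window, which raises MTBF.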

Incident analysis and postmortems benefit greatly from observability. It lets you know what’s happening under the hood and what needs to be improved or fixed. It gives you end-to-end visibility into a request’s path, enabling faster root cause analysis and fixes.

By gathering telemetry in a consistent and automated way, you’ll be able to implement MLOps and AIOps practices. These practices use Machine Learning and Artificial Intelligence techniques to simplify and enhance operations and accelerate problem resolution. They’ll allow you to replace repetitive manual tasks with intelligent, automated solutions that let you be proactive in the event of slowdowns or outages. Observability generates huge amounts of information that humans can’t possibly analyze and correlate. By ingesting all that data from the various observability solutions, these techniques can determine what is relevant to focus on and point SREs in the right direction.

How SRE and Observability can enhance business

SRE work and business goals are directly intertwined. Users are the ones who decide whether a system is reliable, which makes reliability one of its most important features. Happy users generate value (e.g., revenue, product popularity), and as such, understanding users and keeping them satisfied is of the utmost importance.

Observability provides the tooling needed to craft SLOs that reflect user happiness. SLOs (Service Level Objectives) are targets for the level of service users should experience, tracked through measurements that stand in for user satisfaction. Instead of judging how reliable your systems are through indirect measurements (e.g., server metrics like CPU and memory usage), SLOs can be crafted around what users actually experience (e.g., whether users can buy the products they want). You can leverage projects like Sloth to help craft SLOs, create dashboards and meaningful alerts. Businesses can use the metrics to make decisions about what features to develop and what type of work needs to be prioritized. SLO-based approaches allow organizations to have informed discussions, backed by data, about when reliability work should be a priority and when feature work should be prioritized.
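As a worked example of an SLO built around user experience, the sketch below computes the fraction of successful checkout attempts and how much of the error budget remains. It assumes the common convention of measuring good events over total events against a target; the function names, the 99.9% target, and the event counts are hypothetical, and this is not the specification format used by Sloth.

```python
def sli(good_events, total_events):
    """Fraction of events that met the user-facing success criterion."""
    if total_events == 0:
        return 1.0  # no traffic, so nothing failed
    return good_events / total_events


def error_budget_remaining(good_events, total_events, slo_target):
    """Share of the error budget still unspent (negative when overspent)."""
    allowed_failures = (1 - slo_target) * total_events
    actual_failures = total_events - good_events
    if allowed_failures == 0:
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1 - actual_failures / allowed_failures


# Hypothetical month: 1,000,000 checkout attempts, 250 of which failed.
total, good = 1_000_000, 999_750
target = 0.999  # SLO: 99.9% of checkout attempts succeed

meets_slo = sli(good, total) >= target
remaining = error_budget_remaining(good, total, target)
print(meets_slo, round(remaining, 3))  # True 0.75
```

With 75% of the budget left, the data supports prioritizing feature work. A budget that is nearly spent or negative is the signal to shift effort toward reliability.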

Having better insight into systems allows organizations to reduce the cognitive load engineers face when developing and maintaining services. Smaller, multifunctional, autonomous teams will be able to operate their services with increased productivity. Toil reduction is made easier since you now have ways to quickly measure and assess the impact of any change introduced to the system.

Conclusion

The increasing complexity of systems drives the need for better ways to understand them. Observability bridges the gap between your mental model of a system and how that system really behaves. Metrics, traces, and logs provide the necessary information for you to develop and maintain services at scale.

SREs can leverage observability in order to enhance their understanding of systems. Increased visibility allows engineers to more easily understand what is happening under the hood and what actions need to be performed. Well-crafted SLOs and alerts help SREs reduce burnout and be more effective.

Businesses benefit from observability by leveraging it to understand user satisfaction. By understanding how happy users are with your services, you can make informed decisions about the type of work that needs to be prioritized. A better understanding of systems also reduces the cognitive load needed to develop and maintain them, opening the door for smaller, multifunctional teams to be more effective.

Keeping users happy and engineers more productive will help businesses thrive. Site Reliability Engineering will leverage observability tools to make that a reality.