Monitoring vs Observability: What's the Difference?

Monitoring and observability are two closely related concepts that help IT professionals understand and troubleshoot their systems. This article explains the key components of both concepts, with a focus on observability, to differentiate between the two.

Microservices and distributed architectures have become the new model for building and serving applications. This model increases scalability, but the complexity and dynamism of the system grow along with it. The growing number of interactions between services in a microservice architecture makes it hard for IT professionals to understand, identify, and resolve abnormal behavior in their environments. They look to concepts like monitoring and observability to solve these problems.

What is monitoring?

Monitoring in IT refers to the periodic tracking of software and infrastructure performance by systematically collecting, analyzing, and acting on data from the system. The purpose of monitoring is to determine how well your software and the underlying infrastructure perform in real time and to ensure that performance stays at the expected level. Monitoring IT environments involves a combination of tools and technologies that track infrastructure performance and support the resolution of identified issues at the same time. While monitoring is a well-known concept among developers, DevOps engineers, and IT professionals, observability is just gaining momentum.

What is observability?

According to Wikipedia, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. Observability aims to figure out what is going on in an IT environment by simply examining its outputs.

The term originated in the 20th century but was first used in IT by Twitter's engineering team in 2013.

It was used to define the ability of the team to determine or estimate the current state of their system by only using external information from its outputs.

Since then, the term has greatly evolved and is gradually becoming an important concept in modern IT infrastructure management.

White-box monitoring vs black-box monitoring

White-box monitoring is the monitoring of the internal metrics of applications running on a server, such as the number of HTTP requests to a server or the MySQL queries running on a database server.

Black-box monitoring, on the other hand, refers to the monitoring of servers and server resources such as CPU usage, memory, and load averages. In modern IT environments, application developers, DevOps engineers, and SREs share the responsibilities of white-box and black-box monitoring.

Both white-box and black-box monitoring come in handy to gain insights into the application and the underlying infrastructure and ensure optimal performance.

Observability is a function of white-box monitoring. White-box monitoring targets the logs, metrics, and traces of events in a system, which are the pillars of observability.

Pillars of observability

To achieve observability and understand why a system develops a fault or behaves in an unexpected way, you need to gather and examine telemetry data from the system.

Logs, metrics, and traces are the three types of telemetry data widely referred to as the pillars of observability.

Logs

Logs are structured messages or unstructured lines of text generated by a system when certain code runs. Logs are comprehensive records of events in a system. They provide details of a system event such as an error: why it occurred, where it occurred, and the specific time it happened. Logs are especially helpful in a microservices environment, where they uncover the details of unknown faults or emergent behaviors exhibited by the system.

The data in a log is important to achieving observability. By analyzing log data, you can debug and troubleshoot where, why, and when an error occurred in the system.
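
As a minimal sketch of what a structured log can look like, the example below uses Python's standard `logging` module to emit one JSON object per event, recording when, where, and why something happened. The service name `checkout-service`, the `request_id` field, and the error message are illustrative assumptions, not part of any particular system.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # When the event happened (UTC, ISO 8601).
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            # Where it happened: the logger is named after a hypothetical service.
            "service": record.name,
            # What happened and why.
            "message": record.getMessage(),
        }
        # Optional context passed through `extra=`; `request_id` is an assumed field.
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            entry["request_id"] = request_id
        return json.dumps(entry)


# Capture the output in memory so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error("payment declined: card expired", extra={"request_id": "req-123"})

# A structured log line can be parsed back into fields for analysis.
event = json.loads(stream.getvalue())
print(event["time"], event["level"], event["service"], event["message"])
```

Because every field has a name, a log store can filter on `service` or `request_id` instead of searching free text.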

Metrics

Metrics are numeric values that measure the aggregate performance of a system over a period. Unlike logs, which record individual events, metrics give a holistic view of the events and performance of a system over that period.

You can gather metrics such as system uptime, number of requests, response time, failure rate, memory, and other resource usage over time. DevOps engineers typically use metrics to trigger alerts or certain actions when the metric value goes above or below a specified threshold.

Metrics are also easy to correlate across multiple systems to observe trends in performance and identify issues in the system.
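
The threshold-based alerting described above can be sketched roughly as follows. The sample values, the metric names, and the limits (250 ms and 20%) are made up for illustration; real limits depend on the system being monitored.

```python
from statistics import mean
from typing import Optional, Sequence


def failure_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of requests that failed (True marks a failure)."""
    return sum(outcomes) / len(outcomes)


def check_threshold(name: str, value: float, limit: float) -> Optional[str]:
    """Return an alert message when the metric exceeds its limit, else None."""
    if value > limit:
        return f"ALERT {name}={value:.2f} exceeds {limit:.2f}"
    return None


# Hypothetical samples collected over one period.
response_times_ms = [120, 95, 310, 101, 98, 870]
failed = [False, False, False, False, False, True]

avg_ms = mean(response_times_ms)
rate = failure_rate(failed)

# Keep only the checks that actually fired.
alerts = [
    alert
    for alert in (
        check_threshold("avg_response_ms", avg_ms, limit=250),
        check_threshold("failure_rate", rate, limit=0.20),
    )
    if alert is not None
]
print(alerts)
```

Here only the average response time crosses its limit, so a single alert is raised even though one request failed.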

Tracing

Tracing, the third type of telemetry data that makes up the observability pillars, refers to tracking the root source of a fault, especially in distributed systems. Tracing records the journey of a request or action as it moves from one service to another in a microservice architecture. This enables professionals to identify system bottlenecks and resolve issues faster.

Tracing is especially useful when debugging complex applications because it allows us to follow a request's journey from its starting point and identify which service a fault originated from in a microservice architecture. Even though the first two pillars, logs and metrics, provide adequate information about the behavior and performance of a system, tracing enhances this information with details about the lifecycle of requests in the system.
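
To make the idea concrete, here is a simplified sketch of a trace: every hop of a request becomes a span that shares one trace ID and points at its parent span. The `Span` class, the helper functions, and the service names are hypothetical and are not the API of any tracing library.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One hop of a request through one service."""
    trace_id: str                      # shared by every span of the same request
    service: str
    parent_id: Optional[str] = None    # span that called this one (None for the root)
    error: bool = False
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])


def child_of(parent: Span, service: str, error: bool = False) -> Span:
    """Create a span for a downstream call, propagating the trace ID."""
    return Span(trace_id=parent.trace_id, service=service,
                parent_id=parent.span_id, error=error)


def fault_origin(spans: List[Span]) -> Optional[str]:
    """Return the service of a failing span that has no failing child."""
    failing = [s for s in spans if s.error]
    parents_of_failing = {s.parent_id for s in failing}
    for span in failing:
        if span.span_id not in parents_of_failing:
            return span.service
    return None


# A request enters the gateway, which calls catalog and orders;
# orders calls payments, where the fault actually starts.
root = Span(trace_id=uuid.uuid4().hex, service="gateway", error=True)
catalog = child_of(root, "catalog")
orders = child_of(root, "orders", error=True)
payments = child_of(orders, "payments", error=True)

trace = [root, catalog, orders, payments]
print(fault_origin(trace))
```

The gateway and orders spans also report errors, but only because the failure propagated upward; following the parent links identifies payments as the origin.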

From these three types of data, you can estimate or determine the current state of an IT system without further investigation, which makes the system observable.

However, these pillars are only components of observability and are not actionable on their own.

To implement observability in your system, the Twitter engineering team highlighted four actionable steps: collection, storage, query, and visualization.

Collection involves aggregating telemetry data (logs, metrics, and traces), along with their unique identifiers and timestamps, from various endpoints in the system. After collecting the data, you need to store it in a database responsible for filtering, validating, indexing, and reliably storing it for future use. To use the stored data, you need to query the relevant information from the storage system. Visualization is the last step, where the queried data is displayed in charts and dashboards for analysis purposes. As the team put it: "While collecting and storing the data is important, it is of no use to our engineers unless it is visualized in a way that can immediately tell a relevant story." By analyzing the visualized telemetry data, you can achieve observability in any system.
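
The four steps above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not how Twitter implemented them: the data points are invented, an in-memory SQLite database stands in for the storage layer, and a text bar chart stands in for a dashboard.

```python
import sqlite3
from typing import List, Tuple

Point = Tuple[str, str, float]  # (service, timestamp, latency in ms)


def collect() -> List[Point]:
    """Collection: gather data points with identifiers and timestamps."""
    return [
        ("orders", "2024-01-01T00:00:00Z", 120.0),
        ("orders", "2024-01-01T00:01:00Z", 80.0),
        ("payments", "2024-01-01T00:00:00Z", 300.0),
        ("payments", "2024-01-01T00:01:00Z", 500.0),
    ]


def store(conn: sqlite3.Connection, points: List[Point]) -> None:
    """Storage: persist the points so they can be queried later."""
    conn.execute("CREATE TABLE IF NOT EXISTS latency (service TEXT, ts TEXT, ms REAL)")
    conn.executemany("INSERT INTO latency VALUES (?, ?, ?)", points)
    conn.commit()


def query(conn: sqlite3.Connection) -> List[Tuple[str, float]]:
    """Query: pull out the relevant information, here average latency per service."""
    return conn.execute(
        "SELECT service, AVG(ms) FROM latency GROUP BY service ORDER BY service"
    ).fetchall()


def visualize(rows: List[Tuple[str, float]], ms_per_mark: float = 20.0) -> str:
    """Visualization: render one text bar per service."""
    return "\n".join(
        f"{service:<10} {'#' * int(avg / ms_per_mark)} {avg:.0f} ms"
        for service, avg in rows
    )


conn = sqlite3.connect(":memory:")
store(conn, collect())
rows = query(conn)
chart = visualize(rows)
print(chart)
```

Even in this small form, the bar for payments is visibly longer than the one for orders, which is the kind of story a dashboard is meant to tell at a glance.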

Four golden signals of monitoring

Monitoring has to do with tracking and recording various metrics from an environment. According to Google, there are four golden signals of monitoring: latency, traffic, errors, and saturation. These are key metrics that can help you achieve optimal performance in your system when measured properly.

Latency

Latency is the time taken for the system to respond to a request. It helps to track the latency of failed requests and successful requests separately, which makes failures easier to diagnose properly.
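
One possible way to keep the two kinds of latency apart is to time every call and file the measurement under its outcome. The wrapper and the two handlers below are hypothetical names used only for this sketch.

```python
import time
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

# Latencies in seconds, grouped by outcome ("success" or "failure").
latencies: DefaultDict[str, List[float]] = defaultdict(list)


def timed_call(handler: Callable[..., Any], *args: Any) -> Any:
    """Run a handler and record its latency under the matching outcome."""
    start = time.perf_counter()
    outcome = "failure"  # assume failure until the handler returns normally
    try:
        result = handler(*args)
        outcome = "success"
        return result
    finally:
        latencies[outcome].append(time.perf_counter() - start)


def healthy_handler(order_id: int) -> str:
    return f"order {order_id} confirmed"


def broken_handler(order_id: int) -> str:
    raise ValueError(f"order {order_id} not found")


timed_call(healthy_handler, 1)
try:
    timed_call(broken_handler, 2)
except ValueError:
    pass  # the failure is still timed before the exception propagates

print({outcome: len(values) for outcome, values in latencies.items()})
```

Keeping the two lists separate prevents fast failures from making the average latency look better than what successful requests actually experience.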

Traffic

Traffic is a measure of how many requests are sent to a system over a period. The unit differs based on the type of service the system provides. For a web service, traffic is measured in HTTP requests per second. In contrast, traffic for an audio streaming system is measured by network I/O rate or concurrent sessions. In a database or key-value storage system, it is measured in the number of transactions or retrievals per second. Measuring traffic helps you understand the workload on your system, which is why it is an important metric.
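
For the web service case, requests per second can be estimated with a sliding window over request timestamps. The `TrafficMeter` class is a hypothetical helper, and the timestamps are simulated so the result does not depend on the wall clock.

```python
from collections import deque
from typing import Deque


class TrafficMeter:
    """Requests per second over a sliding time window."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window = window_seconds
        self.timestamps: Deque[float] = deque()

    def record(self, now: float) -> None:
        """Register one request that arrived at time `now` (in seconds)."""
        self.timestamps.append(now)

    def rate(self, now: float) -> float:
        """Average requests per second over the last `window` seconds."""
        # Discard requests that have fallen out of the window.
        while self.timestamps and self.timestamps[0] < now - self.window:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window


meter = TrafficMeter(window_seconds=10.0)

# Simulated clock: two requests per second for 20 seconds.
for second in range(20):
    meter.record(float(second))
    meter.record(second + 0.5)

current_rate = meter.rate(now=20.0)
print(f"{current_rate:.1f} requests/second")
```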

Errors

Rather than just filtering out errors, keep a record of the errors users encounter, since this helps in improving the system. You should monitor the rate of requests that fail, storing the type of request and its latency alongside each failure.
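
A rough sketch of that bookkeeping follows. It assumes a request counts as failed when its HTTP status is 500 or higher, which is a simplification chosen for this example; the request log itself is invented.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Request = Tuple[str, int, float]  # (request type, HTTP status, latency in ms)


def is_failure(status: int) -> bool:
    # Assumption for this sketch: only server-side errors count as failures.
    return status >= 500


def error_rates(requests: Iterable[Request]) -> Dict[str, float]:
    """Fraction of failed requests for each request type."""
    total: Counter = Counter()
    failed: Counter = Counter()
    for request_type, status, _latency_ms in requests:
        total[request_type] += 1
        if is_failure(status):
            failed[request_type] += 1
    return {kind: failed[kind] / total[kind] for kind in total}


request_log: List[Request] = [
    ("GET /cart", 200, 35.0),
    ("GET /cart", 200, 41.0),
    ("GET /cart", 500, 1200.0),
    ("GET /cart", 200, 38.0),
    ("POST /pay", 200, 80.0),
    ("POST /pay", 503, 30.0),
]

rates = error_rates(request_log)

# Keep the type and latency of each failed request for later diagnosis.
failures = [(kind, ms) for kind, status, ms in request_log if is_failure(status)]

print(rates)
print(failures)
```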

Saturation

Saturation is a measure of how "full" a system is. A system is saturated when its underlying resources (memory, I/O, CPU) cannot handle any further requests. Saturation is a golden signal because it allows you to test the limit of how much traffic or workload the system can handle. It also lets you predict what the state of your resources will be over time.

These four golden signals are important to achieve quality monitoring and, eventually, observability in any system.
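
Both ideas, how full a resource is now and when it is likely to fill up, can be sketched with simple arithmetic. The linear extrapolation below is a deliberately naive assumption, and the disk usage samples are invented for illustration.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (hour of measurement, amount of resource used)


def utilization(used: float, capacity: float) -> float:
    """How full the resource is, as a fraction between 0 and 1."""
    return used / capacity


def hours_until_full(samples: List[Sample], capacity: float) -> Optional[float]:
    """Estimate the time to saturation, assuming growth stays linear.

    Returns None when usage is flat or shrinking.
    """
    (first_hour, first_used), (last_hour, last_used) = samples[0], samples[-1]
    growth_per_hour = (last_used - first_used) / (last_hour - first_hour)
    if growth_per_hour <= 0:
        return None
    return (capacity - last_used) / growth_per_hour


# Hypothetical disk usage in GB, sampled over 10 hours on a 1000 GB volume.
disk_samples = [(0.0, 400.0), (5.0, 450.0), (10.0, 500.0)]
capacity_gb = 1000.0

current = utilization(disk_samples[-1][1], capacity_gb)
remaining = hours_until_full(disk_samples, capacity_gb)

print(f"disk is {current:.0%} full")
print(f"estimated hours until full: {remaining}")
```

A production forecast would account for bursts and seasonality, but even a linear estimate gives an early warning before the resource is exhausted.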

Monitoring vs Observability: What's the difference?

Monitoring is a prerequisite for observability. While monitoring deals with collecting data, observability collects, stores, queries, and visualizes this data to give professionals an easy way of understanding the reasons behind a system's behavior.

Monitoring gives you information about a problem or failure in your system, while observability lets you understand what caused the failure, where, and why it happened. A system that is not monitored is not observable.