Golden Signals: Monitoring from Fundamental Principles for Zabbix and Nagios Users

Building a robust monitoring process is crucial for ensuring high availability in your applications. In this post, we’ll look at the four SRE Golden Signals for metrics-driven measurement and their role in crafting a comprehensive monitoring strategy. This is especially valuable for Zabbix and Nagios users who are already familiar with these monitoring tools.

Understanding Monitoring: The Cornerstone of Effective Systems Management

Monitoring is the foundation of efficiently operating any software system or application. The more visibility you have into how your software and hardware behave, the better you can serve your customers. It tells you whether you’re on the right track and how far you have deviated from your goals.

What Should We Expect from an Incident Monitoring Tool?

Many monitoring concepts applicable to information systems can be applied to other projects and systems as well. Any monitoring system should be able to:

Gather information regarding the system under observation.
Analyze and process the collected data.
Present the derived information in a way that is comprehensible for system operators and consumers.

The valuable information we aim to gather from the system is called signals. The focus should always be on collecting signals relevant to the system’s health. The terminology is borrowed from radio communication, and the analogy carries over: unwanted and often irrelevant information, known as noise, can interfere with these signals.

Traditional monitoring relied on active and passive checks, along with near real-time metrics. Classic tools like Nagios and RRDtool functioned in this manner. Monitoring gradually evolved to favor metrics-based approaches, leading to the popularity of platforms like Prometheus and Grafana.

Centralized log analysis and extracting metrics from logs became standard practice, with the ELK stack at the forefront of this change. However, the focus is now shifting towards traces, and the term “monitoring” is being replaced by “observability.” Beyond this, we also have a plethora of APM (Application Performance Monitoring) and Synthetic monitoring vendors offering various observability and control functionalities.

While these platforms provide the tools to monitor anything, they don’t dictate what to monitor. So, how do we choose the relevant metrics from all this data? The multitude of monitoring and observability tools can make this task more challenging, not to mention the additional effort required to identify the right metrics and differentiate between signal and noise.

When things become intricate, one solution is to approach the problem from fundamental principles. We need to break down the problem, identify the core elements, and build upon them. In the context of monitoring, this translates to identifying the absolute minimum set of metrics we need to track and then constructing a strategy around that. Now, let’s explore a popular strategy for selecting the right metrics:

SRE Golden Signals: A Foundation for Building Monitoring Strategies

SRE Golden Signals, introduced in the Google SRE book, define the essential core metrics required to monitor any service. This model emphasizes considering metrics from first principles and serves as a base for building application-centric monitoring. The strategy is straightforward: for any system, monitor at least these four metrics: Latency, Traffic, Errors, and Saturation.

Understanding Latency

Latency refers to the time it takes to service a request. While the definition seems simple, latency needs to be measured from the client or server application’s perspective. For an application that serves web requests, the measurable latency is the time difference between the moment the application receives the first byte of a request and the moment the last byte of the response to that request leaves the application. This includes the time the application takes to process the request and construct the response, along with everything in between, such as disk seek latencies, downstream database queries, and time spent in the CPU queue.
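As a minimal sketch of the server-side view, the snippet below times a single request handler with a monotonic clock. The names `timed_call` and `slow_echo` are hypothetical and exist only for illustration. Wrapping the handler captures processing time only; a real measurement from first byte received to last byte sent would need to be taken at the web server or framework layer.

```python
import time
from typing import Callable, Tuple


def timed_call(handler: Callable[[str], str], request: str) -> Tuple[str, float]:
    """Run handler(request) and return (response, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    response = handler(request)
    elapsed = time.perf_counter() - start
    return response, elapsed


def slow_echo(request: str) -> str:
    # Hypothetical handler: the sleep stands in for work such as a database query.
    time.sleep(0.01)
    return request.upper()


response, elapsed = timed_call(slow_echo, "ping")
print(f"response={response!r} latency={elapsed * 1000:.1f} ms")
```

Each measured duration would then be shipped to the monitoring system as one latency sample, tagged with the request's outcome.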

Things become a bit more complex when measuring latency from the client’s perspective because the network between the client and server also influences the latency. There are two main client types:

The first is another upstream service within your infrastructure (think Zabbix server monitoring another web server).
The second, and more intricate, is real users located anywhere on the internet. There’s no way to guarantee a consistently stable network between them and the server.

For the first type, you have control and can measure latencies directly from the upstream application using Zabbix or Nagios. For internet users, you can employ synthetic monitoring or Real User Monitoring (RUM) to get an approximation of latencies. These measurements become even more complex when firewalls, load balancers, and reverse proxies exist between the client and the server.

Here are some crucial aspects to consider when measuring latencies:

Distinguishing Between Good and Bad Latency: Separate the latencies of successful requests from those of failed requests. As the SRE book points out, the latency of an HTTP 500 response is bad latency and shouldn’t be allowed to contaminate the HTTP 200 latencies. In both Zabbix and Nagios, this can be handled by keeping separate metrics for successful and failed requests.
Metric Choice for Latency: Averages are not an ideal choice for latency metrics because a few large outliers can skew the average, and an average that looks healthy can hide a slow tail. Zabbix and Nagios can both be configured to use percentiles or histograms for latency, allowing you to identify and investigate outliers effectively. For instance, the 95th percentile latency is the value that 95% of requests complete within, so it tells you how long the slowest 5% of requests take at minimum.
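The two points above can be illustrated with a small sketch. The sample data is made up, and `percentile` is a hypothetical helper using the nearest-rank method; monitoring tools may compute percentiles differently (for example, by interpolation or from histogram buckets).

```python
import math
from statistics import mean
from typing import List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical samples: (HTTP status, latency in milliseconds).
samples: List[Tuple[int, float]] = (
    [(200, 20.0)] * 18 + [(200, 900.0), (200, 1000.0)] + [(500, 5.0)] * 3
)

# Keep failed-request latencies out of the success distribution.
ok = [ms for status, ms in samples if status < 500]
failed = [ms for status, ms in samples if status >= 500]

print(f"success: mean={mean(ok):.1f} ms p50={percentile(ok, 50)} ms p95={percentile(ok, 95)} ms")
print(f"failure: mean={mean(failed):.1f} ms")
```

With this data the success mean is 113 ms even though the median request takes 20 ms, while the 95th percentile (900 ms) exposes the slow tail directly. The fast-failing 500 responses would also have pulled the combined average down had they not been separated.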

Traffic: Understanding Demand on Your System

Traffic refers to the demand placed on your system by its clients. The exact metric used to represent traffic will vary depending on the system’s function. Here are some common examples:

Web Applications: Number of requests served in a specific timeframe.
Streaming Services (e.g., YouTube): Amount of video content served.
Databases: Number of queries served.
Cache: Number of cache misses and cache hits.

Traffic metrics can be further broken down based on the nature of requests. For instance, in a web application, you might categorize traffic by:

HTTP status code (2xx, 4xx, 5xx)
HTTP method (GET, POST, PUT, etc.)
Content type (HTML, images, videos)

Understanding geographical distribution or other relevant user characteristics can also be valuable. Zabbix and Nagios both allow you to configure various items and triggers to monitor these different aspects of traffic.
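A rough sketch of such a breakdown for a web application follows. The record layout and the `traffic_summary` function are assumptions made for illustration; in practice the records would come from access logs or from the application's own instrumentation.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical access-log record: (unix timestamp in seconds, HTTP method, status code).
Record = Tuple[float, str, int]


def traffic_summary(records: List[Record], window_start: float, window_seconds: float) -> Dict[str, object]:
    """Summarize requests inside [window_start, window_start + window_seconds)."""
    window_end = window_start + window_seconds
    in_window = [r for r in records if window_start <= r[0] < window_end]
    by_class = Counter(f"{status // 100}xx" for _, _, status in in_window)
    by_method = Counter(method for _, method, _ in in_window)
    return {
        "requests_per_second": len(in_window) / window_seconds,
        "by_status_class": dict(by_class),
        "by_method": dict(by_method),
    }


records: List[Record] = [
    (0.5, "GET", 200), (1.2, "GET", 200), (2.8, "POST", 201),
    (3.1, "GET", 404), (4.9, "GET", 500), (10.0, "GET", 200),
]
summary = traffic_summary(records, window_start=0.0, window_seconds=10.0)
print(summary)
```

Each value in the summary maps naturally onto a separate monitored item, so that a change in one category (say, a spike in POST requests) is visible on its own.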

Error Rates: Identifying Issues That Impact Functionality

Errors are measured by counting the number of errors originating from the application and calculating the error rate over a specific time interval. Common error metrics include:

Server-side errors (5xx status codes)
Client-side errors (4xx status codes)
2xx responses with application-level errors (wrong content, no data found)

It’s crucial to define what constitutes an error within the context of your system and business logic. Not all unexpected responses are necessarily errors, and not all errors surface as error status codes. For instance, a service might return a 200 status code but deliver incorrect content. This wouldn’t show up as a server error (5xx) but still warrants attention.
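One way to make that definition explicit is to encode it in a single function, as in the sketch below. The `Response` type, its `body_ok` flag, and the choice to exclude 4xx responses by default are all assumptions for illustration; the right rules depend on your own system.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Response:
    status: int
    body_ok: bool = True  # hypothetical flag: False when the payload fails an application-level check


def is_error(resp: Response, count_client_errors: bool = False) -> bool:
    """Decide whether a response counts as an error under this (assumed) policy."""
    if resp.status >= 500:
        return True  # server-side errors always count
    if 400 <= resp.status < 500:
        return count_client_errors  # client errors count only if the policy says so
    return not resp.body_ok  # a 2xx/3xx response with a bad payload is still an error


def error_rate(responses: Iterable[Response], count_client_errors: bool = False) -> float:
    """Fraction of responses in the interval that count as errors."""
    items = list(responses)
    if not items:
        return 0.0
    errors = sum(is_error(r, count_client_errors) for r in items)
    return errors / len(items)


window = (
    [Response(200)] * 90
    + [Response(200, body_ok=False)] * 4
    + [Response(404)] * 3
    + [Response(503)] * 3
)
print(f"error rate: {error_rate(window):.1%}")
print(f"error rate including 4xx: {error_rate(window, count_client_errors=True):.1%}")
```

Keeping the policy in one place makes it easy to see that the same traffic yields different error rates depending on what is counted, which is exactly why the definition needs to be agreed on up front.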

Both Zabbix and Nagios enable you to create alerts and notifications based on these error metrics, allowing you to proactively address potential issues before they impact user experience.

Saturation: Recognizing Resource Constraints

Saturation indicates how utilized or “full” your system’s resources are. While 100% resource utilization might seem ideal in theory, performance often degrades well before a system reaches full capacity. Saturation can occur with various resources, including:

System resources: Memory, CPU, disk space
Open file counts
Network queues
Database connections
Application-level request queues

Saturation is typically measured as a “gauge” metric, indicating a value on a spectrum between a defined upper and lower bound. Zabbix and Nagios can be configured to monitor these resource metrics and trigger alerts when they approach saturation levels. Additionally, the 99th percentile latency, which identifies request outliers, can serve as an early warning sign for potential resource constraints.
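A threshold check on a gauge might look like the following sketch. The status values follow the Nagios plugin exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL); the function name and the 80% and 90% thresholds are assumptions chosen for illustration, not recommended values.

```python
from typing import Tuple

# Nagios plugin convention: exit code 0 = OK, 1 = WARNING, 2 = CRITICAL.
OK, WARNING, CRITICAL = 0, 1, 2


def saturation_status(used: float, capacity: float, warn: float = 0.80, crit: float = 0.90) -> Tuple[int, float]:
    """Return (status, utilization), where utilization is used / capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    utilization = used / capacity
    if utilization >= crit:
        return CRITICAL, utilization
    if utilization >= warn:
        return WARNING, utilization
    return OK, utilization


# Hypothetical database connection pool with 100 slots.
for in_use in (50, 85, 95):
    status, utilization = saturation_status(used=in_use, capacity=100)
    print(f"connections in use: {in_use} utilization: {utilization:.0%} status: {status}")
```

The same function applies to any resource that has a known upper bound, such as disk space, open file handles, or queue depth. Setting the warning threshold well below 100% reflects the point above that degradation tends to begin before a resource is completely full.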

By monitoring these SRE Golden Signals (Latency, Traffic, Errors, and Saturation), you gain a foundational understanding of your system’s health. Zabbix and Nagios provide the tools to collect, analyze, and visualize these metrics, empowering you to build a comprehensive monitoring strategy for your applications.
