How to Write Effective Prometheus Alert Rules

Prometheus is a powerful monitoring and alerting system widely used in cloud-native and Kubernetes environments. One of its critical features is its ability to create and trigger alerts based on metrics it collects from various sources. This blog post will cover everything you need to know about Prometheus alerts, including alert template fields, alert expression syntax, Prometheus sample alert rules, limitations of Prometheus, best practices for Prometheus alerts configuration, and incident response handling.

Alert Template Fields

Prometheus alert templates, formally called alerting rules, give you a standard set of fields for defining alerts. You define them in YAML rule files, which the Prometheus configuration file loads through its rule_files setting. Writing every alert with the same fields keeps your alert configuration clean, maintainable, and understandable.
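As a minimal sketch of how rule files are wired in, the main configuration file lists them under rule_files. The file path, evaluation interval, and Alertmanager address below are placeholders chosen for illustration, not values from this post.

```yaml
# prometheus.yml (sketch; paths and addresses are placeholders)
global:
  evaluation_interval: 1m          # default interval at which rules are evaluated

rule_files:
  - "rules/*.yml"                  # hypothetical directory holding alert rule files

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager.example.com:9093"   # hypothetical Alertmanager address
```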

The following are the main fields available in Prometheus alert templates:

Alert: This field specifies the alert’s name. It identifies the alert, so choose a name that is unique within a Prometheus instance to avoid ambiguity.
Expr: This field specifies the Prometheus query expression that evaluates the alert condition. It is the most important field in an alert template, and you must specify it.
Labels: This field adds additional information to the alert. You can use it to specify the severity of the alert, the affected service or component, and any other relevant information.
Annotations: This field provides additional context and human-readable information about the alert. You can include a summary of the alert, a description of the issue, or any other relevant information.
For: This field specifies the duration for which the alert condition must be true before Prometheus triggers the alert. Until that duration has elapsed, the alert stays in the pending state.
Groups: This top-level field collects related rules into named groups. Prometheus evaluates the rules in a group sequentially at the same interval. Each alert in a group still fires independently, based on its own expression.
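To show where each field sits in a rule file, here is a minimal annotated sketch. The group name, alert name, and annotation text are illustrative assumptions; up is the built-in metric Prometheus records for every scrape target. Note that the field names are lowercase in the actual YAML.

```yaml
groups:
  - name: example_group            # groups: a named collection of rules
    interval: 1m                   # optional: overrides the global evaluation interval
    rules:
      - alert: InstanceDown        # alert: the alert's name
        expr: up == 0              # expr: the PromQL condition (required)
        for: 5m                    # for: how long the condition must hold before firing
        labels:                    # labels: extra labels attached to the alert
          severity: critical
        annotations:               # annotations: human-readable context
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for 5 minutes."
```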

Alert Expression Syntax

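Prometheus uses PromQL (Prometheus Query Language) to create alerting rules. The alert expression is the core of a Prometheus alert. You use PromQL to define the condition that triggers an alert. For example, the following expression is true when the average CPU utilization on a host, measured over a 5-minute window, exceeds 80%. It assumes the metric names exposed by node_exporter 0.16 or later, where the CPU metric is a counter of seconds spent in each mode, so utilization is derived from the rate of idle time:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80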

Basic alert syntax

The basic syntax of an alert expression is as follows:

<metric_name>{<label_name>="<label_value>", ...} <operator> <value>

The <metric_name> is the name of the metric being queried.

The {<label_name>="<label_value>", ...} is an optional part of the query that specifies the labels that should be used to filter the metric.

The <operator> is a comparison operator, such as >, <, ==, !=, >=, or <=.

The <value> is the value that the metric must be compared against using the specified operator.
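As a small sketch of this pattern inside a rule, the example below compares a single labeled metric against a fixed value. The metric node_load1 comes from node_exporter; the job label value, the threshold of 4, and the alert name are assumptions for illustration.

```yaml
groups:
  - name: basic_syntax_example
    rules:
      - alert: HighLoadAverage
        # Pattern: <metric_name>{<label_name>="<label_value>"} <operator> <value>
        expr: node_load1{job="node"} > 4
```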

Advanced alert queries

For more complex scenarios, you can use functions such as rate and aggregation operators such as avg, sum, min, and max in the expression to aggregate the metrics and make more complex comparisons. For instance, the following query triggers an alert if the average per-second rate of HTTP requests across all series of the “api” service, computed over a 5-minute window, exceeds 50.

avg(rate(http_requests_total{service="api"}[5m])) > 50

Other advanced features include:

Logical operators, like and, or, and unless
The on or ignoring keywords for vector matching
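The sketch below combines both features: the error-ratio condition only counts when the same service is also receiving meaningful traffic. The service label, the 5xx status matcher, and both thresholds are assumptions for illustration.

```yaml
groups:
  - name: advanced_expression_example
    rules:
      - alert: HighErrorRatioWithTraffic
        expr: |
          (
            sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (service) (rate(http_requests_total[5m]))
          ) > 0.05
          and on (service)
          sum by (service) (rate(http_requests_total[5m])) > 1
        for: 5m
        labels:
          severity: warning
```

Here, and keeps a series from the left-hand side only if the right-hand side has a matching series, and on (service) restricts that matching to the service label. This avoids alerting on a high error ratio that comes from only a handful of requests.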

Prometheus Sample Alert Rules

Here are some examples of Prometheus alert rules that you can use as-is or adapt to fit your specific needs:

High CPU utilization alert

groups:
  - name: example_alerts
    rules:
      - alert: HighCPUUtilization
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High CPU utilization on host {{ $labels.instance }}
          description: The CPU utilization on host {{ $labels.instance }} has exceeded 80% for 5 minutes.

Low disk space alert

groups:
  - name: example_alerts
    rules:
      - alert: LowDiskSpace
        expr: node_filesystem_free_bytes{fstype="ext4"} < 1e9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Low disk space on host {{ $labels.instance }}
          description: The free disk space on host {{ $labels.instance }} has dropped below 1 GB.

High request error rate alert

groups:
  - name: example_alerts
    rules:
      - alert: HighRequestErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status="500"}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High request error rate
          description: The error rate for HTTP requests has exceeded 5% for 5 minutes.

This rule:

Triggers: When the rate of HTTP requests with a 500 status code (server errors) over the last 5 minutes exceeds 5% of all HTTP requests over the same period.
Labels: Assigns a “severity” label of “critical” to this alert.
Annotations: Provides a summary and a more detailed description of the alert.

Limitations of Prometheus Alerts

Like any tool, Prometheus has its own set of challenges and limitations. Here are some of the common limitations to consider:

Excessive alerts for noisy metrics: Prometheus alerts are based on metrics, and sometimes metrics can be noisy and difficult to interpret. This may lead to false positives or false negatives, which can be difficult to troubleshoot.
Scaling challenges: As the number of metrics and alerting rules increases, Prometheus becomes resource-intensive and may require additional scaling or optimization. Too many complex alerting rules can also become challenging to understand and troubleshoot. Additionally, Prometheus does not have built-in dashboards, so you have to use external dashboarding tools, like Grafana, for metric visualization.
Inability to detect dependent services: Prometheus evaluates each alert rule against metrics alone and has no built-in model of service dependencies. When a service’s metrics degrade because a different service is misbehaving, the alert points at the symptom rather than the cause, which makes it harder to act on.
No alert suppression: The Prometheus server itself only evaluates rules and emits alerts. It does not suppress, group, or deduplicate the resulting notifications. Depending on your configuration, you could have a high volume of alerts for non-critical issues. To mitigate this, run Alertmanager, the companion component from the Prometheus project, to group, deduplicate, silence, and route alerts to the appropriate channel.
Limited integration with other tools: While you can integrate Prometheus with various notification channels, it presents limited integration opportunities with other monitoring and alerting tools. You may already have existing monitoring infrastructure that is incompatible with Prometheus.
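As a hedged sketch of the Alertmanager mitigation mentioned in the list above, the configuration below groups related alerts into one notification and mutes warnings while a matching critical alert is firing. The receiver name, webhook URL, label names, and timings are assumptions for illustration; the source_matchers and target_matchers fields assume Alertmanager 0.22 or later.

```yaml
# alertmanager.yml (sketch; names, URL, and timings are placeholders)
route:
  receiver: team-default
  group_by: ["alertname", "service"]   # alerts sharing these labels become one notification
  group_wait: 30s                      # wait before the first notification for a new group
  group_interval: 5m                   # wait before notifying about new alerts in a group
  repeat_interval: 4h                  # wait before re-sending a still-firing notification

inhibit_rules:
  # Mute warning alerts while a critical alert with the same
  # alertname and instance labels is firing.
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ["alertname", "instance"]

receivers:
  - name: team-default
    webhook_configs:
      - url: "https://example.com/alert-hook"   # hypothetical endpoint
```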

Best Practices for Prometheus Alerts Configuration

Despite some challenges, you can customize Prometheus to meet your organization’s needs. Proper planning and configuration help you identify and resolve issues proactively, before they become critical. Here are some best practices to follow when using Prometheus alerting rules:

Create meaningful Prometheus alert templates:

Write alert templates and configurations that even new team members can understand. For example:
Choose alert names that clearly describe the metric and scenario they monitor.
Write descriptive annotations for each alert.
Assign appropriate severity levels to your alerts, such as critical, warning, or info.
Group related alerts together in a single alert group to improve manageability.
These best practices provide more context about the alert and improve response and troubleshooting time.

Set the appropriate alert frequency:

Make sure the time window specified in the ‘for’ clause of an alert is appropriate for the metric you are monitoring. A short time window may result in too many false positive alerts, while a long time window may delay detecting real issues. For example, some user actions may cause your application’s CPU usage to spike quickly before subsiding again. You may not want to act on every small spike.
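One way to balance the two risks, sketched below, is to pair a lower threshold with a longer ‘for’ window and a higher threshold with a shorter one. The thresholds, durations, and alert names are assumptions for illustration, and the expression assumes node_exporter 0.16 or later metric names.

```yaml
groups:
  - name: cpu_alert_tiers
    rules:
      # Sustained moderate load: wait longer so brief spikes are ignored.
      - alert: HighCPUUtilizationWarning
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
        labels:
          severity: warning
      # Near saturation: fire sooner because the impact is more likely real.
      - alert: HighCPUUtilizationCritical
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 5m
        labels:
          severity: critical
```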

Test Prometheus before deployment:

Test your alert rules in a test environment before deploying them to production. This helps to ensure that the rules are working as expected and eliminates the risk of unintended consequences. Additionally, you can:
Monitor the Prometheus Alertmanager to ensure it functions properly and handles alerts as expected.
Regularly review and update your alert rules to ensure that they continue to accurately reflect your system state and incorporate environment changes.
Use alert templates to reduce the amount of duplication in your alert rules, as duplication increases management complexity.
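Rules can also be tested offline with promtool, the command-line tool that ships with Prometheus, by running promtool test rules against a test file. The sketch below assumes a hypothetical alerts.yml containing the InstanceDown rule shown in the comments; the file name, series labels, and timings are illustrative.

```yaml
# alerts_test.yml (sketch), run with: promtool test rules alerts_test.yml
#
# Assumes alerts.yml defines this rule (hypothetical):
#   - alert: InstanceDown
#     expr: up{job="node"} == 0
#     for: 5m
#     labels:
#       severity: critical
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The target reports 0 (down) at every sample from 0m to 10m.
      - series: 'up{job="node", instance="host1:9100"}'
        values: '0+0x10'
    alert_rule_test:
      # After 2 minutes the alert is still pending, so nothing should fire.
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts: []
      # After 6 minutes the 5m "for" window has passed, so the alert should fire.
      - eval_time: 6m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: host1:9100
```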

Use incident response systems:

Automate alert handling where possible to reduce the time required to respond to alerts and to minimize human error. You can also use your Prometheus metrics and alerts for productive incident retrospectives or build runbooks to handle similar issues.
You can use tools like Squadcast to route alerts to applicable teams. Squadcast extends beyond basic incident response functionality to provide many other features like documenting retrospectives, tracking service level objectives (SLOs), and error budgets.

Incident Response Handling

Your organization’s incident response workflow could be as simple as sending an email to your team letting them know that a failure is imminent. More complex alerts may trigger runbooks to automate the resolution process. For example, your ruleset could be defined to automatically scale services when the rate at which an error budget is being consumed exceeds a predefined threshold. Should the error rate continue to climb, a tool like Squadcast contacts the on-call administrator to step in and handle the incident.
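As a rough sketch of that escalation path, Alertmanager can send everything to email by default and route only critical alerts to an incident response tool’s webhook. The SMTP host, addresses, receiver names, and webhook URL are placeholders; the real endpoint depends on the tool you integrate with.

```yaml
# alertmanager.yml (sketch; hosts, addresses, and URL are placeholders)
global:
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"

route:
  receiver: team-email                   # default: lower-urgency alerts go to email
  routes:
    - matchers: ['severity="critical"']  # critical alerts escalate to on-call
      receiver: on-call-webhook

receivers:
  - name: team-email
    email_configs:
      - to: "team@example.com"
  - name: on-call-webhook
    webhook_configs:
      - url: "https://example.com/incident-webhook"   # hypothetical endpoint
```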

Conclusion

As you can see, Prometheus is an excellent tool to alert on key metrics in cloud-native environments. Prometheus’s flexible query language and integration capabilities make it a versatile solution for efficient monitoring and alerting at scale. Our Prometheus sample alert rules and best practices should help you get more out of Prometheus alerting in your Kubernetes environments.