Sharding and Federation: Scaling Strategies for Prometheus
Federation
What is Federation?
In the previous section, we saw how sharding can help scale Prometheus by distributing the scraping workload across multiple instances. However, querying data across shards can be challenging because aggregates are not directly available: manually querying each shard and combining the results is cumbersome and inefficient. Federation can address this concern.
How Federation Works
We started with a single Prometheus instance scraping a pool of servers. We then scaled out by adding a second Prometheus instance and sharding the workload, so that each instance scrapes a subset of the total metrics. Federation introduces a third Prometheus instance, called the federating Prometheus, which aggregates data from the shards.
In general, federation in Prometheus works by enabling a higher-level Prometheus server to scrape metrics from other Prometheus instances, sometimes called "leaf nodes," using their /federate endpoint.
This is achieved by configuring scrape_configs on the federating Prometheus instance, specifying the target instances and defining match[] parameters to select specific metrics or label sets to pull.
The honor_labels option is often used to retain the original labels from the source instances to ensure that metric identity is preserved. Each target Prometheus aggregates and exposes its metrics, and the federating Prometheus scrapes these metrics at regular intervals.
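For illustration, a scrape of the /federate endpoint returns series in the standard Prometheus exposition format, with the original labels and sample timestamps attached. The host names, series, and values below are made up:

```
# Hypothetical response from http://prometheus-shard-1:9090/federate?match[]={job="node"}
node_cpu_seconds_total{cpu="0",mode="idle",instance="host-a:9100",job="node"} 184503.27 1700000000000
up{instance="host-a:9100",job="node"} 1 1700000000000
```

With honor_labels enabled, the federating instance keeps the `instance` and `job` labels shown here instead of overwriting them with its own target labels.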
Advantages and Limitations of Federation
The main advantage of federation is its simplicity: setting up a higher-level Prometheus instance to scrape metrics from other Prometheus instances requires minimal configuration. Because the feature is built into Prometheus, it is a convenient solution for many use cases.
However, this approach comes with limitations. It is not ideal for handling high-cardinality metrics or for fully duplicating metrics across instances. It also introduces data delays due to periodic scraping, which makes it less suitable for real-time alerting or systems requiring immediate responses.
High volumes of federated data can also strain both the central and leaf Prometheus instances, leading to timeouts or performance bottlenecks. For scenarios requiring high availability, deduplication, and unified querying without these limitations, tools like Thanos are often a better alternative.
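One common way to mitigate the volume problem is to federate only pre-aggregated series: each shard evaluates recording rules locally, and the central instance pulls just the resulting aggregates. A sketch of a leaf-side rules file follows; the rule name and the `{job="node"}` selector are illustrative and should be adapted to your metrics:

```yaml
groups:
  - name: federation_aggregates
    rules:
      # Pre-aggregate per-job CPU usage on the shard so the
      # federating instance only needs to pull one series per job.
      - record: job:node_cpu_seconds:rate5m
        expr: sum by (job) (rate(node_cpu_seconds_total{job="node"}[5m]))
```

The central instance can then use a match[] selector such as `{__name__=~"job:.*"}` to federate only these aggregated series instead of the raw per-instance data.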
Configuring Federation
To configure a Prometheus instance to federate metrics from other Prometheus instances, define a new scrape_config in its configuration file. Here is an example configuration for a central Prometheus instance that selectively federates metrics from two leaf Prometheus instances. Adapt it to your requirements, especially the match[] parameters, which select the metrics to pull, and the targets, which specify the addresses of the leaf instances.
global:
  external_labels:
    # The labels to add to any time series or alerts
    # when communicating with external systems
    federate: 'true'

scrape_configs:
  # Job name for the federation scrape
  - job_name: 'federate'
    # How often to scrape metrics from the target Prometheus instances
    scrape_interval: 15s
    # Preserve the labels from the scraped metrics
    honor_labels: true
    # Path to scrape for federated metrics
    metrics_path: '/federate'
    # Selectors for the metrics to pull; adapt these to your metrics
    params:
      'match[]':
        - '{job="node"}'
        - 'up'
    static_configs:
      # Example addresses; replace with your two leaf instances
      - targets:
          - 'prometheus-shard-1:9090'
          - 'prometheus-shard-2:9090'
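Once federation is running, aggregates that previously required querying each shard separately become available in one place. For example, assuming both shards expose an `up` series for their targets (and honor_labels preserves the original `job` label), the federating instance can compute a fleet-wide count with a single query:

```promql
# Number of healthy scrape targets across both shards
sum(up{job="node"})
```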
