Observability with Prometheus and Grafana

A Complete Hands-On Guide to Operational Clarity in Cloud-Native Systems

Prometheus High Availability vs. Scalability: Strategies and Considerations

Fundamentals of Scaling Prometheus

There are different strategies for scaling Prometheus, each with its own advantages and limitations. Some of them require third-party tools such as Thanos or Cortex. The main topic of this guide is Prometheus itself, so third-party tools will only be mentioned briefly.

Which strategy to choose depends on the specific requirements of your monitoring environment, such as the number of scrape targets, the volume of metrics, and the desired level of fault tolerance.

To better illustrate the scaling strategies, let's consider a scenario: you have a pool of four servers running a heavily loaded service that exposes a huge number of metrics.

# First server
10.135.0.5:5000/metrics

# Second server
10.135.0.6:5000/metrics

# Third server
10.135.0.7:5000/metrics

# Fourth server
10.135.0.8:5000/metrics

You have a single Prometheus instance scraping them, and you've noticed that it's struggling to keep up with the increasing load. It's also consuming more and more resources, and you're worried that it might not be able to keep up in the long run.

This is your initial configuration:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'my_heavily_loaded_service'
    static_configs:
      - targets:
          - '10.135.0.5:5000'
          - '10.135.0.6:5000'
          - '10.135.0.7:5000'
          - '10.135.0.8:5000'
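
Before adding more Prometheus instances, it's worth confirming that the single instance really is the bottleneck. Below is a minimal sketch of PromQL queries you could run against it. The per-target series are generated automatically for every scrape; the TSDB queries assume that Prometheus also scrapes its own /metrics endpoint, which is not part of the configuration above.

# Per-target cost: how many samples each scrape returns and how long it takes
scrape_samples_scraped{job="my_heavily_loaded_service"}
scrape_duration_seconds{job="my_heavily_loaded_service"}

# Overall ingestion pressure on the instance (requires Prometheus to scrape itself)
prometheus_tsdb_head_series
rate(prometheus_tsdb_head_samples_appended_total[5m])

If scrape durations start approaching the 15-second scrape interval, or the number of active series keeps growing, that's a strong signal the instance is genuinely overloaded rather than just busy.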

We are going to look at some common ways to handle scaling in the next sections. Before that, however, there are a few important points to keep in mind:

Do Not Scale Prometheus Unnecessarily
