How we reduced our Prometheus infrastructure footprint by a third

This article discusses sharding in Prometheus, a technique for distributing metric-collection load across multiple instances. It describes a problem where the number of scraped metrics was growing non-linearly, driving up memory and CPU costs.
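Sharding in Prometheus is commonly implemented with `hashmod` relabeling, where each instance keeps only the targets whose address hashes to its own shard. A minimal sketch of that pattern (the job name, shard count, and shard index are illustrative assumptions, not taken from the article):

```yaml
# Hypothetical sharded scrape config: each Prometheus replica hashes
# every target address modulo the shard count and keeps only the
# targets matching its own shard index.
scrape_configs:
  - job_name: kubernetes-pods       # illustrative job name
    relabel_configs:
      - source_labels: [__address__]
        modulus: 3                  # total number of shards (assumed)
        target_label: __tmp_shard
        action: hashmod
      - source_labels: [__tmp_shard]
        regex: "0"                  # this replica's shard index (assumed)
        action: keep
```

Each replica runs the same config with a different `regex` value, so the target set is partitioned roughly evenly without central coordination.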

The root cause was identified as a scalability limit of `drop` rules in Prometheus's `metric_relabel_configs`, which run only after every sample has already been scraped and parsed. To address this, the team moved the filtering into the scrape itself, so unwanted metrics are never collected, reducing the overall footprint of their Prometheus setup.
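The article does not reproduce the team's exact configuration, but the contrast can be sketched as follows. The metric prefix, collector list, and target are illustrative assumptions; the `collect[]` URL parameter is a real node_exporter feature, though not every exporter supports an equivalent:

```yaml
scrape_configs:
  # Post-scrape filtering: every sample is still scraped and parsed,
  # then discarded, so cost grows with the full metric set.
  - job_name: node
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "node_netstat_.*"        # illustrative metric prefix
        action: drop

  # Scrape-time filtering: ask the exporter for less. node_exporter
  # honors collect[] parameters, so unwanted series are never
  # produced or transferred in the first place.
  - job_name: node-filtered
    params:
      collect[]: [cpu, meminfo, filesystem]
    static_configs:
      - targets: ["node-exporter:9100"]  # placeholder target
```

The design difference is where the work happens: a `drop` rule pays scrape, transfer, and parse costs for every series before discarding it, while scrape-time filtering shifts the reduction to the exporter side.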

This change yielded significant memory and CPU savings and improved the overall performance of the system.

