Observability with Prometheus and Grafana

What you'll learn

	Getting Started with Prometheus: Discover what Prometheus is, its origins, and why it has become the de facto standard for monitoring and metric-based observability.
	The Internal Design of Prometheus: Learn in detail about the head, chunking, compaction, the write-ahead log, blocks, and the other concepts that make Prometheus a powerful monitoring tool. Understanding how Prometheus stores and queries data is key to using it at scale.
	Installing and Configuring Prometheus: Follow a hands-on, step-by-step guide to installing Prometheus, configuring it to scrape metrics, and setting up a robust monitoring environment.
	Exploring the Prometheus Web Interface: Learn how to effectively navigate the Prometheus UI, query collected data using PromQL, and understand the status of targets, TSDB, and alerting rules.
	Exploring and Querying Metrics with PromQL (Prometheus Query Language): Dive deep into PromQL with practical examples, from basic queries to advanced functions like rate calculations, aggregations, and mathematical transformations.
	Relabeling and Advanced Configuration: Master the art of relabeling, configuring service discovery, and advanced configuration options to make Prometheus as flexible as possible. Use these techniques to monitor complex environments and make your monitoring experience adaptive and efficient.
	Building Dynamic Dashboards with Grafana: Understand and implement dynamic dashboards in Grafana to create interactive visualizations and explore data across different dimensions.
	Visualizing Metrics with Grafana: Import intuitive and powerful dashboards using Grafana to visualize Prometheus metrics and gain actionable insights. Build on the open-source community dashboards and extend them to meet your specific needs.
	Monitoring *nix Systems (Linux, Unix, FreeBSD, etc.) with Node Exporter: Collect system-level metrics such as CPU, memory, disk usage, and network statistics using the Node Exporter.
	Monitoring External Services with Blackbox Exporter: Probe endpoints over HTTP, TCP, DNS, and ICMP to monitor availability and response times, using the Blackbox Exporter.

	Monitoring Kubernetes with Prometheus: Deploy Prometheus and kube-prometheus-stack using Helm, scrape Kubernetes endpoints, and collect cluster-wide metrics with kube-state-metrics and other integrations.
	Monitoring Docker and Containerized Workloads: Track container resource usage, running instances, and performance metrics using cAdvisor, Docker Engine metrics, and Prometheus-native monitoring tools.
	Custom Exporters for Non-Native Integrations: Learn how to create and deploy custom exporters for applications and services that don’t natively expose Prometheus metrics.
	Handling High Cardinality and Label Best Practices: Strategies for managing high-cardinality metrics and designing efficient labels for scalable monitoring.
	Prometheus Service Discovery: Learn how Prometheus automatically discovers targets using mechanisms like Kubernetes, Docker Swarm, and file-based discovery.
	Code Instrumentation and Custom Metrics: Learn how to shift monitoring left, instrument your applications with Prometheus client libraries, and expose custom metrics to Prometheus.
	Understanding Prometheus Metric Types: A deep dive into counters, gauges, histograms, and summaries, covering how they work and when to use each.
	Setting Up Alerts with Alertmanager: Learn how to configure alerting rules, manage notifications, and integrate Alertmanager with tools like Slack, email, and others for real-time alerting.
	Pushgateway for Short-Lived Jobs: Understand how to monitor batch jobs and ephemeral workloads that are not directly exposed to Prometheus using the Pushgateway.
	Understand the Bottlenecks and Performance Tuning: Gain a practical understanding of the performance bottlenecks of Prometheus and how to easily identify and resolve them.
	Debugging and Troubleshooting Prometheus: Learn the important techniques for diagnosing slow queries, missing metrics, and performance issues in Prometheus.
	Retention Policies and Storage Management: Learn how to manage data retention, configure TSDB, and optimize disk usage for long-term efficiency.
	Scaling and Long-Term Storage: Understand Prometheus’ limitations and how solutions like Thanos and Cortex can help with scaling and long-term storage. Master the advanced techniques of sharding, federation, remote write, and more.
	Best Practices: Learn practical tips for fine-tuning Prometheus, optimizing its resource usage, avoiding high-cardinality pitfalls, implementing monitoring best practices, reducing alert fatigue, designing effective dashboards, and many other strategies to help you get the most out of this powerful tool.
	Real-World Use Cases: Learn how operations and observability teams use Prometheus in production, monitor containers, Kubernetes clusters, and VMs, and integrate it with other tools like Alertmanager and Grafana.


Description

If you are on a journey to improve your observability, Prometheus is the perfect tool to start with. The goal of this guide - Observability with Prometheus and Grafana - is to help you not only get started with Prometheus but also master its advanced features and set you on the path to becoming a Prometheus expert.

This guide is designed for anyone, beginner or experienced professional, who wants to master Prometheus and build a strong foundation in modern monitoring and observability. A basic understanding of monitoring concept…

k3s

Docker

Grafana

GNU/Linux

Prometheus

Kubernetes

Learning path

Follow the winding road from start to finish

Preface

5 sections · 31m read

Who This Guide is For 7m · What You Will Learn 17m · About the Author 3m · Join the Community 2m · Your Feedback Matters 2m

How to Use This Guide

2 sections · 14m read

General Recommendations 4m · Technical Recommendations and Standards Used in This Guide 10m

What is Prometheus and What Makes it Unique?

2 sections · 24m read

What is Prometheus? 15m · What Makes Prometheus Unique? 9m

Understanding Prometheus: Internals and Architecture

2 sections · 47m read

How Does Prometheus Work? 32m · Prometheus Architecture 15m

Prometheus: Limitations, Trade-offs, and Solutions

9 sections · 43m read

Reliability Over Completeness 5m · Long-Term and Clustered Storage 4m · There is No Advanced User Management 1m · Visualization is not Prometheus' Value Proposition 1m · Prometheus is not a Logging System 2m · The Pull Model is Perfect but not Always 3m · Short-Lived Jobs and the Pull Model 3m · High Cardinality 19m · Do One Thing and Do It Well 5m

Exploring Prometheus: Installation and Configuration

3 sections · 37m read

Requirements 15m · Installing Prometheus 10m · Configuring Prometheus 12m

Exploring the Prometheus Web Interface

6 sections · 43m read

The Graph Page 12m · Understanding Prometheus Targets 2m · Exploring and Interpreting Prometheus Metrics & Queries 9m · The TSDB Status 7m · Using the TSDB Status Information 6m · The Runtime and Build Information 7m

Integrating Prometheus with Grafana

6 sections · 40m read

A User-Unfriendly Web UI: A Blessing, not a Curse 3m · Grafana: Visualizing Time Series Data 2m · Deploying Grafana 5m · Adding Prometheus as a Data Source 2m · Adding Prometheus Dashboards 22m · Monitoring Grafana Using Prometheus 6m

Alertmanager: Metric-Based Alerting in Prometheus

2 sections · 14m read

Installing Alertmanager 8m · Integrating Prometheus with Alertmanager 6m

Alertmanager: Rules, Receivers, and Grafana Integration

5 sections · 72m read

Alerting Rules, Expressions, and Groups 33m · Alertmanager Receivers 15m · Silencing Alerts 2m · Grafana Alerting vs. Prometheus Alerting 8m · Choosing Between Prometheus and Grafana for Alerting 14m

Relabeling in Prometheus

2 sections · 20m read

Understanding Relabeling 8m · Target, Metric, Write, and Alert Relabeling: What's the Difference? 12m

Relabeling: Rules and Actions

2 sections · 99m read

Relabeling Rules 5m · Relabeling Actions 94m

Relabeling: Best Practices

8 sections · 22m read

Monitor the Performance Impact 1m · Watch Out For Cardinality Explosion 4m · Use Relabeling Sparingly and Purposefully 1m · Prefer Metadata Labels for Dynamic Rules 1m · Version Control and Documentation 1m · Test Before Production Deployment 1m · Understand Edge Cases and Special Labels 7m · Use the Right Relabeling Option - A Cheat Sheet 6m

Monitoring *NIX Systems with Prometheus

4 sections · 21m read

Introduction to Exporters 2m · Adding Scraping Targets 7m · Installing the Node Exporter 4m · Node Exporter: Metrics and Collectors 8m

Building Grafana Dashboards for Node Exporter Metrics

2 sections · 21m read

Building Custom Dashboards 14m · The Node Exporter Full Dashboard 7m

Understanding Network Black Box Monitoring with Prometheus

3 sections · 18m read

What is Network Black Box Monitoring? 10m · The Blackbox Exporter 2m · How to Install the Blackbox Exporter 6m

Monitoring Endpoints with Prometheus and Blackbox Exporter

3 sections · 33m read

Your First Steps with the Blackbox Exporter 12m · Debugging the Probe 2m · Integrating the Blackbox Exporter with Prometheus 19m

Blackbox Exporter: Advanced Configurations, Probes, and Tools

4 sections · 26m read

The ICMP Probe 7m · The DNS Probe 8m · Advanced Configurations 4m · Beyond the Blackbox Exporter 7m

Building User-Friendly Grafana Dashboards

23m read

Increase, Rate and Instant Rate

4 sections · 21m read

Understanding the Usage and Importance of Rate Functions 11m · Rates and SRE Practices 2m · Rate vs. Instant Rate 6m · Rate vs. Increase 2m

Instrumenting Applications with Prometheus

2 sections · 11m read

What is Instrumentation? 2m · How Prometheus Instrumentation Works 9m

Instrumentation with Prometheus in Practice

6 sections · 78m read

Counters: Tracking Metrics that Only Go Up 12m · Gauges: Tracking Metrics that Go Up and Down 12m · Histograms: Tracking the Distribution of Values Over Time 33m · Summaries: High-Accuracy Quantiles with Limitations 11m · The Four Core Metric Types in Prometheus 5m · Additional Custom Metric Types 5m

Prometheus Pushgateway: The Push Model for Short-Lived Jobs

5 sections · 24m read

When Is the Pushgateway Needed? 6m · How it Works 3m · Configuring the Pushgateway 4m · The Push Process 7m · Configuring Prometheus to Scrape Metrics 4m

Prometheus and Docker

2 sections · 21m read

Docker Metrics: Enabling and Configuring 10m · cAdvisor: Richer Container Metrics 11m

Monitoring Docker Swarm with Prometheus

4 sections · 44m read

Installation and Initial Configurations 15m · Configuring Prometheus 25m · Security Aspects 3m · Cleaning Up 1m

Monitoring Kubernetes with Prometheus

5 sections · 53m read

Kubernetes Service Discovery for Prometheus 12m · Setting Up a Kubernetes Cluster 4m · Exporters and Metrics in Kubernetes 15m · Monitoring Kubernetes with Prometheus: A Practical Example 18m · Key Takeaways 4m

Using Prometheus to Monitor Kubernetes: Grafana and Prometheus Operator

3 sections · 43m read

Do Not Reinvent the Wheel 3m · An All-in-One Kubernetes Monitoring with Prometheus 38m · Key Takeaways 2m

Prometheus High Availability vs. Scalability: Strategies and Considerations

2 sections · 15m read

Fundamentals of Scaling Prometheus 12m · A Highly Available Setup 3m

Sharding and Federation: Scaling Strategies for Prometheus

2 sections · 38m read

Sharding 20m · Federation 18m

Strategies to Scale Prometheus: Remote Write and Agent Mode

1 section · 25m read

Remote Write 25m

Prometheus on the Edge

4 sections · 15m read

Prometheus Agent Mode 3m · When to Use the Agent Mode 4m · What Are the Limitations? 2m · Setting Up the Agent Mode 6m

Strategies to Scale Prometheus: Managed Prometheus Services

5 sections · 89m read

Google Cloud Managed Service for Prometheus 13m · Amazon Managed Service for Prometheus (AMP) 10m · Grafana Cloud (Managed Prometheus Metrics) 11m · Logz.io Infrastructure Monitoring (Prometheus-as-a-Service) 18m · Sysdig Monitor (Enterprise Prometheus Service) 37m

Afterword

5m read

The author

Aymen El Amri

@eon01

Aymen El Amri is a software and cloud-native engineer, trainer, author, and technopreneur with 15+ years of experience in building and scaling distributed systems, cloud architectures, and modern software delivery pipelines.

He founded FAUN.dev(), one of the web's most active developer communities focused on Kubernetes, cloud-native engineering, modern software delivery, and other related topics.

He has trained thousands of engineers on DevOps, SRE, Kubernetes, microservices, and cloud architectures, helping teams build reliable and scalable systems. His technical guides and courses are widely used by engineers and organizations looking to adopt cloud-native practices.

His work has earned several honors, including a national open-source award, and TechBeacon listed him among the top 100 DevOps professionals to follow. He also advises companies on shaping their cloud-native and platform engineering direction.

Find him on FAUN.dev(), LinkedIn or X.