Kubernetes can be installed using different tools, whether open-source, third-party vendor, or in a public cloud. In most cases, default installations have limited monitoring capabilities. Therefore, once a Kubernetes cluster is running, administrators must implement monitoring solutions to meet their requirements.
Typical use cases for Kubernetes monitoring include:
- Ensuring workload reliability
- Achieving high-level visibility into your workload
- Alerting and enabling incident management
Effective Kubernetes monitoring requires a mix of tools, strategy, and technical expertise. To help you get it right, this article will explore seven essential Kubernetes monitoring best practices in detail.
Summary of key Kubernetes monitoring best practices concepts
The table below summarizes the Kubernetes monitoring best practices we will explore in this article.
| Concept | Description |
| --- | --- |
| Monitoring vs. observability | Observability means gaining insights into your workload's performance using external indicators. Monitoring means checking such indicators over time. |
| Determining requirements | Accurately determine your requirements and monitoring goals. |
| Identify appropriate metrics | Identify which metrics you need to achieve your monitoring goals. |
| Select the right tools | Selecting the right tools given your requirements is a critical best practice. A key decision here is whether to build something in-house using open-source software or buy a more complete SaaS solution with better support. |
| Monitor the monitoring system | In a production workload, it is important to monitor the monitoring system itself to ensure it is reliable and highly available. |
| Consider data storage | Monitoring data must be stored and managed effectively. |
| Monitor the control plane | Monitoring the Kubernetes control plane is easy to overlook, so teams should be intentional about control plane monitoring. |
| Account for incident response | Monitoring outputs can enhance incident response coordination, which can reduce MTTR (mean time to resolve). |
Monitoring vs. observability
Before we go into more detail, let's unpack an often-confused topic: monitoring vs. observability. The term "monitoring" is the more traditional one and covers the collection of metrics and logs used to monitor application infrastructure components. The idea is to "monitor" a workload by constantly evaluating the real-time performance of its underlying infrastructure.
Observability is a relatively new concept, and even though it overlaps with monitoring, its end goal is to isolate a performance bottleneck along a transaction path rather than to monitor the application infrastructure. Observability gained traction in application environments designed around the microservices paradigm, where an application comprises modularized services hosted in ephemeral containers and interacting with each other via application programming interfaces (APIs). In such an environment, monitoring the servers and containers in isolation isn't meaningful, so a new perspective was required, giving rise to the notion of observability.
In addition to metrics and logs, observability includes distributed tracing to follow the path of a transaction through the application infrastructure. Distributed tracing enables operations engineers to understand the path a user's request takes, including:
- When the workload received the request,
- The stages or microservices it went through, and
- When the response was sent to the user.
Observability allows operations engineers to quickly understand the upstream and downstream impact of application services on each other. Typically, observability tools combine metrics, logs, and tracing to give engineers a coherent view of the entire transaction path across the infrastructure. Read this article if you want to learn more about observability (also called "o11y").
7 essential Kubernetes monitoring best practices
The seven Kubernetes monitoring best practices below can help DevOps and SRE (site reliability engineering) teams achieve SLOs (service level objectives) and improve overall infrastructure observability.
Kubernetes monitoring best practice #1: Determine what you want to achieve
Determining business goals is the first (and arguably the most important) Kubernetes monitoring best practice. Examples of such goals are:
- Gain visibility into your cluster's health
- Gain visibility into the end user's experience
- Be alerted when certain events occur
- Anticipate potential problems
- Identify trends and patterns in workload utilization, such as a steady increase in disk utilization that will lead to the disk being full within a certain period of time
- Identify trends and patterns that fall outside ordinary or expected behavior
- Scale pods in and out when certain conditions are met
- Evaluate the reliability of the application against expected criteria
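As a concrete illustration of the disk utilization trend mentioned above, a steady growth rate can be linearly extrapolated to estimate when the disk will be full. This is a hypothetical calculation sketched for this article, not the output of any specific monitoring tool:

```python
def days_until_full(capacity_gb: float, used_gb: float,
                    growth_gb_per_day: float) -> float:
    """Linear extrapolation: days until used space reaches capacity."""
    if growth_gb_per_day <= 0:
        # No growth: at the current rate the disk never fills.
        return float("inf")
    return (capacity_gb - used_gb) / growth_gb_per_day

# A 500 GB disk with 380 GB used, growing 4 GB/day, fills in 30 days.
print(days_until_full(500, 380, 4))  # → 30.0
```

Real monitoring systems perform this kind of projection with regression over a sliding window rather than a single rate, but the principle is the same: turn a trend into an actionable lead time.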
While planning is important, itâs also essential not to overthink it. Teams just getting started with monitoring should avoid analysis paralysis and instead take an iterative approach to developing a plan. Additional requirements can be added later to address new information and requirements.
Kubernetes monitoring best practice #2: Identify metrics to monitor
Once you've identified your business goals, you can identify which metrics you need to collect to achieve those goals. This step also includes defining related configuration parameters, such as the collection rate and how long you need to store the metrics data.
Some metrics are usually readily available, typically system metrics. These metrics include:
- CPU utilization
- Memory utilization
- Free space available on disks
- Disk input/output data
- Network usage
System metrics are usually necessary as part of any monitoring strategy and tend to show the overall stress the cluster is under. However, they are quite basic and usually won't give enough actionable information beyond telling whether the cluster seems healthy.
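As a minimal sketch of what a collector might report for one of these system metrics (free disk space), using only the Python standard library; the function name and the GiB units are choices made for this example:

```python
import shutil

def disk_metrics(path: str = "/") -> dict:
    """Report disk capacity, usage, and free space for a mount point, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total_gib": usage.total / gib,
        "used_gib": usage.used / gib,
        "free_gib": usage.free / gib,
        "used_pct": 100 * usage.used / usage.total,
    }

m = disk_metrics("/")
print(f"{m['used_pct']:.1f}% used, {m['free_gib']:.1f} GiB free")
```

In a Kubernetes cluster, agents such as the kubelet and node exporters gather equivalent readings per node and per volume, then expose them for scraping rather than printing them.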
Additionally, more complex metrics are often required. These metrics are often tied to the software you run. For example, they could measure:
- How responsive is the website or the app?
- How many users are currently logged in?
- What is the average number of concurrent users at 10am on weekdays?
- How fast does your support team respond to the initial requests?
- What's the rate of 5xx errors reported by your web server?
- What's the average number of jobs in an input queue per day?
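To make one of these concrete, the 5xx error rate can be derived from web server access logs by counting status codes. The sketch below assumes a Combined Log Format-style line with the status code as the 9th whitespace-separated field; both the log format and the field position are assumptions that would need adjusting for your server's actual configuration:

```python
def error_rate_5xx(log_lines: list[str], status_field: int = 8) -> float:
    """Fraction of requests whose HTTP status code is in the 5xx range.

    Assumes each line resembles Combined Log Format, where the status
    code is the field at index `status_field` after splitting on spaces.
    """
    total = errors = 0
    for line in log_lines:
        fields = line.split()
        if len(fields) <= status_field:
            continue  # skip malformed or truncated lines
        status = fields[status_field]
        if status.isdigit():
            total += 1
            if status.startswith("5"):
                errors += 1
    return errors / total if total else 0.0

# Hypothetical sample log lines for illustration.
logs = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:00:01 +0000] "GET /api HTTP/1.1" 500 128',
    '10.0.0.3 - - [01/Jan/2024:10:00:02 +0000] "POST /buy HTTP/1.1" 503 64',
    '10.0.0.4 - - [01/Jan/2024:10:00:03 +0000] "GET / HTTP/1.1" 404 89',
]
print(error_rate_5xx(logs))  # → 0.5
```

In practice this kind of ratio is usually computed by the monitoring pipeline itself (for example, as a rate over a time window) rather than by ad hoc log parsing, but the underlying metric is the same.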