Kubernetes can be installed using different tools, whether open-source, third-party vendor, or in a public cloud. In most cases, default installations have limited monitoring capabilities. Therefore, once a Kubernetes cluster is running, administrators must implement monitoring solutions to meet their requirements.
Typical use cases for Kubernetes monitoring include:
- Ensuring workload reliability
- Achieving high-level visibility into your workload
- Alerting and enabling incident management
Effective Kubernetes monitoring requires a mix of tools, strategy, and technical expertise. To help you get it right, this article will explore seven essential Kubernetes monitoring best practices in detail.
Summary of key Kubernetes monitoring best practices concepts
The table below summarizes the Kubernetes monitoring best practices we will explore in this article.
| Concept | Description |
| --- | --- |
| Monitoring vs. observability | Observability means gaining insights into your workload's performance using external indicators. Monitoring means checking such indicators over time. |
| Determining requirements | Accurately determine your requirements and monitoring goals. |
| Identify appropriate metrics | Identify which metrics you need to achieve your monitoring goals. |
| Select the right tools | Selecting the right tools given your requirements is a critical best practice. A key decision here is whether to build something in-house using open-source software or buy a more complete SaaS solution with better support. |
| Monitor the monitoring system | In a production workload, it is important to monitor the monitoring system itself to ensure it is reliable and highly available. |
| Consider data storage | Monitoring data must be stored and managed effectively. |
| Monitor the control plane | Monitoring the Kubernetes control plane is easy to overlook, so teams should be intentional about control plane monitoring. |
| Account for incident response | Monitoring outputs can enhance incident response coordination, which can reduce MTTR (mean time to resolve). |
Monitoring vs. observability
Before we go into more detail, let's unpack an often-confused topic: monitoring vs. observability. The term "monitoring" is the more traditional one and covers the collection of metrics and logs used to monitor application infrastructure components. The idea is to "monitor" a workload by constantly evaluating the real-time performance of its underlying infrastructure.
Observability is a relatively new concept, and even though it overlaps with monitoring, its end goal is to isolate a performance bottleneck along a transaction path rather than to monitor the application infrastructure. Observability gained traction in application environments designed around the microservices paradigm, where an application comprises modularized services hosted in ephemeral containers and interacting with each other via application programming interfaces (APIs). In such an environment, monitoring the servers and containers in isolation isn't meaningful, so a new perspective was required, giving rise to the notion of observability.
In addition to metrics and logs, observability includes distributed tracing to follow the path of a transaction through the application infrastructure. Distributed tracing enables operations engineers to understand the path a user's request takes, including:
- When the workload received the request,
- The stages or microservices it went through, and
- When the response was sent to the user.
Observability allows operations engineers to quickly understand the upstream and downstream impact of application services on each other. Typically, observability tools combine metrics, logs, and tracing to give engineers a coherent view of the entire transaction path across the infrastructure. Read this article if you want to learn more about observability (also called "o11y").
7 essential Kubernetes monitoring best practices
The seven Kubernetes monitoring best practices below can help DevOps and SRE (site reliability engineering) teams achieve SLOs (service level objectives) and improve overall infrastructure observability.
Kubernetes monitoring best practice #1: Determine what you want to achieve
Determining business goals is the first (and arguably the most important) Kubernetes monitoring best practice. Examples of such goals are:
- Gain visibility into your cluster's health
- Gain visibility into the end user's experience
- Be alerted when certain events occur
- Anticipate potential problems
- Identify trends and patterns in workload utilization, such as a steady increase in disk utilization that will lead to the disk being full within a certain period of time
- Identify trends and patterns that fall outside ordinary or expected behavior
- Scale pods in and out when certain conditions are met
- Evaluate the reliability of the application against expected criteria
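As a concrete illustration of the disk utilization trend mentioned above, a steady growth rate can be linearly extrapolated to estimate when the disk will be full. This is a hypothetical calculation sketched for this article, not the output of any specific monitoring tool:

```python
def days_until_full(capacity_gb: float, used_gb: float,
                    growth_gb_per_day: float) -> float:
    """Linear extrapolation: days until used space reaches capacity."""
    if growth_gb_per_day <= 0:
        # No growth: at the current rate the disk never fills.
        return float("inf")
    return (capacity_gb - used_gb) / growth_gb_per_day

# A 500 GB disk with 380 GB used, growing 4 GB/day, fills in 30 days.
print(days_until_full(500, 380, 4))  # → 30.0
```

Real monitoring systems perform this kind of projection with regression over a sliding window rather than a single rate, but the principle is the same: turn a trend into an actionable lead time.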
While planning is important, itâs also essential not to overthink it. Teams just getting started with monitoring should avoid analysis paralysis and instead take an iterative approach to developing a plan. Additional requirements can be added later to address new information and requirements.
Kubernetes monitoring best practice #2: Identify metrics to monitor
Once you've identified your business goals, you can identify which metrics you need to collect to achieve those goals. This step also includes defining related configuration parameters, such as the collection rate and how long you need to store the metrics data.
Some metrics are usually readily available, typically system metrics. These metrics include:
- CPU utilization
- Memory utilization
- Free space available on disks
- Disk input/output data
- Network usage
System metrics are usually necessary as part of any monitoring strategy and tend to show the overall stress the cluster is under. However, they are quite basic and usually won't give enough actionable information beyond telling whether the cluster seems healthy.
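As a minimal sketch of what a collector might report for one of these system metrics (free disk space), using only the Python standard library; the function name and the GiB units are choices made for this example:

```python
import shutil

def disk_metrics(path: str = "/") -> dict:
    """Report disk capacity, usage, and free space for a mount point, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total_gib": usage.total / gib,
        "used_gib": usage.used / gib,
        "free_gib": usage.free / gib,
        "used_pct": 100 * usage.used / usage.total,
    }

m = disk_metrics("/")
print(f"{m['used_pct']:.1f}% used, {m['free_gib']:.1f} GiB free")
```

In a Kubernetes cluster, agents such as the kubelet and node exporters gather equivalent readings per node and per volume, then expose them for scraping rather than printing them.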
Additionally, more complex metrics are often required. These metrics are often tied to the software you run. For example, they could measure:
- How responsive is the website or the app?
- How many users are currently logged in?
- What is the average number of concurrent users at 10am on weekdays?
- How fast does your support team respond to the initial requests?
- What's the rate of 5xx errors reported by your web server?
- What's the average number of jobs in an input queue per day?
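To make one of these concrete, the 5xx error rate can be derived from web server access logs by counting status codes. The sketch below assumes a Combined Log Format-style line with the status code as the 9th whitespace-separated field; both the log format and the field position are assumptions that would need adjusting for your server's actual configuration:

```python
def error_rate_5xx(log_lines: list[str], status_field: int = 8) -> float:
    """Fraction of requests whose HTTP status code is in the 5xx range.

    Assumes each line resembles Combined Log Format, where the status
    code is the field at index `status_field` after splitting on spaces.
    """
    total = errors = 0
    for line in log_lines:
        fields = line.split()
        if len(fields) <= status_field:
            continue  # skip malformed or truncated lines
        status = fields[status_field]
        if status.isdigit():
            total += 1
            if status.startswith("5"):
                errors += 1
    return errors / total if total else 0.0

# Hypothetical sample log lines for illustration.
logs = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:00:01 +0000] "GET /api HTTP/1.1" 500 128',
    '10.0.0.3 - - [01/Jan/2024:10:00:02 +0000] "POST /buy HTTP/1.1" 503 64',
    '10.0.0.4 - - [01/Jan/2024:10:00:03 +0000] "GET / HTTP/1.1" 404 89',
]
print(error_rate_5xx(logs))  # → 0.5
```

In practice this kind of ratio is usually computed by the monitoring pipeline itself (for example, as a rate over a time window) rather than by ad hoc log parsing, but the underlying metric is the same.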