EKS: Enhancing Observability and Debugging

Teams running applications on EKS, a widely used managed Kubernetes service, often struggle to troubleshoot and maintain them because of the disconnect between development and production environments. Lightrun, a dynamic observability platform, addresses these issues by providing real-time insights and dynamic instrumentation capabilities, enabling developers to debug and monitor applications seamlessly in EKS clusters.

Disclaimer: This blog post was written as part of a collaboration with Lightrun.

Challenges in EKS Troubleshooting

Since its launch in 2018, EKS has experienced exponential growth, becoming the preferred choice for both enterprises and startups. Its rise to prominence reflects the increasing adoption of cloud-native architectures and the demand for scalable and resilient container orchestration platforms.

Today, EKS supports a vast array of applications, ranging from small-scale microservices to large-scale distributed systems. It has quickly become the most widely used managed Kubernetes service, according to a survey from the CNCF.

However, the rapid growth and widespread adoption of EKS clusters have brought unique challenges in troubleshooting and maintaining applications. The disconnect between developers and production applications has become more pronounced. Local and remote development environments often fail to capture the subtleties and complexities of the actual production environment running in EKS. This gap limits developers' ability to accurately reproduce and debug issues that surface only in the production EKS environment.

Additionally, managing a production-grade distributed cluster of pods, each of which may have its own network, storage, and security requirements, makes troubleshooting a complex process. Network restrictions, access controls, and limited visibility into running containers further slow debugging and performance analysis.

The main objective of this article is to explore how Lightrun, a dynamic observability platform, can enhance observability and debugging capabilities in EKS clusters. We will examine how real-time insights and dynamic instrumentation capabilities help developers and operators gain a deeper understanding of their applications' behavior in EKS production clusters.

Introducing Lightrun: A Developer-Oriented Observability Platform

Lightrun, a powerful observability platform tailored for developers, is specifically designed to address these challenges of troubleshooting in AWS EKS and other Kubernetes platforms. By providing real-time insights and dynamic instrumentation, Lightrun enhances developers' ability to identify and resolve issues without disrupting the production environment.

Seamlessly integrated with EKS, Lightrun enables developers to debug, log, and monitor their applications on the fly. They can set breakpoints, inspect variables, and analyze logs and metrics in real-time, all without requiring code changes or redeployments. This seamless integration streamlines the debugging process, accelerates issue resolution, and ensures optimal performance in EKS clusters.

In the next section, we will explore the step-by-step process of integrating Lightrun into your AWS EKS cluster. We will dive into the features and capabilities that Lightrun brings to the table.

Dynamic Instrumentation and Live Troubleshooting with Lightrun

Before starting, you need an EKS cluster up and running. You then set up the Lightrun agent on your cluster, and there are multiple ways to do this:

You can integrate Lightrun directly into your code. If you are using Node.js, you will need to install the “lightrun” package using npm (npm install lightrun) and then configure it in your application while replacing the LIGHTRUN_SECRET and FULL_PATH_TO_METADATA_FILE with their values:

If you’re using Python, start by installing the required dependency (python -m pip install lightrun), then add the following code at the beginning of your application after replacing <COMPANY_SECRET> with your own value:
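A minimal sketch of that initialization step, assuming the agent exposes lightrun.enable(company_key=...) as described in Lightrun's Python agent documentation; check the exact call against the current docs before relying on it. The helper name enable_lightrun and the environment variable LIGHTRUN_SECRET are hypothetical choices made for this example so that the secret stays out of the source code.

```python
import os


def enable_lightrun(secret):
    """Try to start the Lightrun agent; return True only if it was enabled."""
    if not secret:
        # No secret configured: skip the agent rather than fail at startup.
        return False
    try:
        import lightrun  # installed via: python -m pip install lightrun
    except ImportError as exc:
        print("Lightrun is not installed:", exc)
        return False
    # Assumed call: verify the keyword name against the official documentation.
    lightrun.enable(company_key=secret)
    return True


# LIGHTRUN_SECRET is a hypothetical variable name holding <COMPANY_SECRET>.
agent_enabled = enable_lightrun(os.environ.get("LIGHTRUN_SECRET"))
```

Wrapping the call this way lets the application start normally in environments where the agent or the secret is absent, such as a local test run.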

To get the secret, you can create a free account here. You can also view more configuration options for Java and .NET in the official documentation.

You can also install Lightrun by adding its agent to your Docker image. This is a quick example of how to do it if you are using Python:

Java developers have another option: the Lightrun Kubernetes Operator, which installs and configures agents in your Kubernetes workloads without requiring changes to your Docker or manifest files. This installation method can be performed using the available YAML files:

The next step is deploying the agent configuration. This is an example:

You can find more details about the above configuration here.

If you prefer using Helm, it is also possible to install and configure your agents using a “values.yaml” file.

Dynamic Observability for EKS Microservices

In a cloud-native world, an application is usually a collection of independently deployed, loosely coupled microservices. Each service typically has its own database, and services communicate with each other through REST APIs. Although this architecture has proven effective for building scalable and resilient systems, it has drawbacks, chiefly the difficulty of debugging a single microservice and understanding where it fits among all the microservices that make up the application.

Adding observability to your API endpoints is a good way to start: a service's API is its front door, so all incoming traffic has to go through it. Luckily, once the Lightrun agent is in place, we can dynamically understand how a service’s API works and debug it without updating the application or its resources and without installing any additional tool. This works regardless of the number of replicas you run for your pod.

Let’s see a practical example now!

This is the Dockerfile we are going to use:

And this is our API code for a to-do app:
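The listing below is a minimal, framework-free stand-in for such an endpoint, not the original application code. It keeps the names the walkthrough refers to (add_task, response, response_code), while the task schema (title, done) and the in-memory TASKS store are assumptions made for illustration. A real service would wrap this logic in a web framework's POST route.

```python
# Hypothetical stand-in for the to-do API's POST handler.
ALLOWED_KEYS = {"title", "done"}  # assumed task schema
TASKS = []                        # in-memory store, for illustration only


def add_task(payload):
    """Validate an incoming task and return (response, response_code)."""
    unknown = set(payload) - ALLOWED_KEYS
    if unknown:
        # Unknown keys are rejected so that client typos surface as errors.
        response = {"error": "unknown keys: " + ", ".join(sorted(unknown))}
        response_code = 400
    elif "title" not in payload:
        response = {"error": "missing required key: title"}
        response_code = 400
    else:
        task = {
            "id": len(TASKS) + 1,
            "title": payload["title"],
            "done": bool(payload.get("done", False)),
        }
        TASKS.append(task)
        response = task
        response_code = 201
    # A snapshot or log placed on the return line sees both variables.
    return response, response_code
```

Calling add_task({"title": "buy milk"}) stores the task and returns a 201 code, while a payload with a misspelled key returns a 400 code and an error message.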

We have two goals:

Understanding what is being sent to our API (what’s being received).
Examining what is being sent by the API (what’s being sent).

To start with the first one, we can set up a snapshot from VS Code using the Lightrun extension.

If you’re using other IDEs, check this list:

Go to the POST method, and find the first line where request is called:

Right-click, choose “Lightrun” then add a new snapshot.

Now add a new task using:

You will be able to capture the call stack. Whenever you click on a step in the stack (e.g., add_task), you’ll see different variables, including the received data, the headers of the client request, and so on.

We can also add a log line to see the content of the data sent to our API:

To examine what is being sent by the API, we can add a new snapshot on the following line where we return the response:

At this stage, in addition to adding a log line or a snapshot, you can add a condition. For example, suppose we only want responses with a return code greater than 399 to be captured automatically. This is how to do it:

Right-click and add a new snapshot.
Add the following condition: response_code > 399
Add response to the “Watch expression”.
Apply the changes to create the new snapshot.
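Conceptually, the condition and the watch expression behave like the plain-Python sketch below. It only illustrates the semantics (capture the watched values when the condition is true, otherwise do nothing) and is not Lightrun's implementation or API. The helper name conditional_snapshot is hypothetical.

```python
def conditional_snapshot(condition, watch, local_vars):
    """Return the watched values if the condition holds, otherwise None.

    condition  -- expression evaluated against the local variables
    watch      -- list of expressions to capture when the condition is true
    local_vars -- mapping of variable names to values at the chosen line
    """
    scope = dict(local_vars)
    no_builtins = {"__builtins__": {}}
    # eval() is used purely for illustration; never evaluate untrusted input.
    if not eval(condition, no_builtins, scope):
        return None
    return {expr: eval(expr, no_builtins, scope) for expr in watch}


# Variables as they might look on the line that returns the response.
failing = {"response_code": 400, "response": {"error": "unknown keys: titel"}}
passing = {"response_code": 201, "response": {"id": 1, "title": "buy milk"}}

captured = conditional_snapshot("response_code > 399", ["response"], failing)
skipped = conditional_snapshot("response_code > 399", ["response"], passing)
```

Here captured holds the error response, while skipped is None because the condition was false for the successful request.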

Try sending data containing errors like an unknown key:
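One way to send such a payload from Python is sketched below using only the standard library. The base URL http://localhost:5000, the /tasks path, and the misspelled key titel are hypothetical values chosen for illustration; substitute the address and schema of your own service.

```python
import json
import urllib.error
import urllib.request


def build_task_request(base_url, payload):
    """Build a JSON POST request for the (assumed) /tasks endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/tasks",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the HTTP status code, including errors."""
    try:
        with urllib.request.urlopen(request, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses raise HTTPError; the code is what we want.
        return err.code


# "titel" is a deliberate typo, so the API should answer with an error code.
bad_request = build_task_request("http://localhost:5000", {"titel": "buy milk"})
# status = send(bad_request)  # uncomment once the service is reachable
```

The send call is left commented out so the sketch can be read and run without a live service.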

Lightrun filters all requests and only shows those with a response code greater than 399.

To debug our API, we used two powerful features of Lightrun: dynamic logs and snapshots. Other options, such as custom metrics and counters, are also available. As demonstrated, we were able to quickly diagnose and filter requests containing errors.

The debugging process was straightforward and didn’t require any new deployment. We only had to integrate Lightrun into our code once. After that, without leaving our IDE, we were able to abstract away the networking complexity of EKS clusters.

In many cases, developers find themselves using port forwarding, creating services to expose their application, or creating complex SSH tunnels to run some basic tests, but by using Lightrun we were able to simplify and streamline these tasks. Lightrun provided a seamless debugging experience that eliminated the need for such workarounds.

Improving Observability and Mean Time to Resolution (MTTR)

By leveraging the comprehensive logs and snapshots provided by Lightrun, you can gain profound insights into the behavior of your application within a production EKS cluster. All of this can be done without the need to go through the laborious process of deploying code, building containers, and updating Kubernetes deployments and service manifests.

Without Lightrun, the alternative approach would involve meticulously inserting log lines throughout your application using your programming language's logging libraries, followed by the time-consuming tasks of building and redeploying the application and probably creating new services to expose your pods. This is what usually makes the mean time to resolution (MTTR) long and unpredictable.

Enhancing observability and achieving optimal Mean Time to Resolution (MTTR) are essential factors in effectively managing applications deployed on AWS EKS. Lightrun provides developers with a rich set of tools and features that significantly improve their productivity and accelerate the process of incident resolution.

The seamless integration of Lightrun with popular Integrated Development Environments (IDEs) provides developers with an effortless experience of setting breakpoints, capturing variables, and conducting real-time analysis of the application's behavior, all from the comfort of their familiar development environment.

As a matter of fact, with Lightrun's real-time troubleshooting capabilities, InsideTracker developers were able to diagnose and resolve application issues on the fly, saving valuable debugging time. Their mean time to resolution improved by an impressive 50%, resulting in dozens of hours saved each month. Before using Lightrun, InsideTracker developers were unable to troubleshoot or access remote Kubernetes environments from their local machines, which resulted in long and inefficient debugging sessions. These sessions used to require hotfixes and redeployments that took hours.

In addition, traditional logging systems can become a substantial problem in terms of cost. As your application grows and produces a large volume of logs, the associated storage and processing costs can quickly add up.

What’s next?

Lightrun provides a comprehensive solution for troubleshooting and monitoring in AWS EKS setups. Its real-time insights and dynamic instrumentation help developers discover and address issues quickly, resulting in a significant reduction in Mean Time to Resolution (MTTR). Furthermore, Lightrun's efficient logging method improves resource use and lowers logging costs, resulting in a smooth and cost-effective debugging experience.

Take the next step in upgrading your troubleshooting and observability approach with confidence. Explore Lightrun today for free and see the impact it can have on your AWS EKS environments. Click here to get started today!