The Four Golden Signals, SLO, SLI, and Kubernetes

It's 2018 at KubeCon North America, a loud echo in the microphone, and then Ben Sigelman is on the stage.

There is conventional wisdom that observing microservices is hard. Google and Facebook solved this problem, right? They solved it in a way that allowed Observability to scale by multiple orders of magnitude to suit their use cases.

The prevailing assumption that we needed to sacrifice features in order to scale is wrong. In other words, the notion that people need to solve scalability problems as a tradeoff for having a powerful set of features is incorrect.

People assume that you need these three pillars of Observability: metrics, logging, and tracing, and all of a sudden, everything is solved. However, more often than not, this is not the case.

Today we are going to discuss Observability and why this is a critical day-2 operation in Kubernetes. Next, we will discuss the problems with Observability and leverage its three pillars to dive deep into some concepts like service level objectives, service level indicators, and finally, service level agreements.

Welcome to episode 6!

Moving from a monolithic world to a microservices world solved a lot of problems. This is true for the scalability of machines but also for the teams working on them. Kubernetes largely empowered us to migrate these monolithic applications to microservices. However, it made our applications distributed in nature.

The nature of distributed computing added more complexity to how microservices interact. Each service has multiple dependencies, which produces a higher monitoring overhead.

Observability became more critical in this context.

According to some, Observability is another soundbite without much meaning. However, not everyone thought this way. Charity Majors, a proponent of Observability, defines it as the power to answer any questions about what’s happening on the inside of the system just by observing the outside of the system, without having to ship new code to answer new questions. It’s truly what we need our tools to deliver now that system complexity is outpacing our ability to predict what’s going to break.

According to Charity, you need Observability so that you can “completely own” your system. You have the ability to make changes based on data you have observed from the system. This makes Observability a powerful tool in highly complex systems like microservices and distributed architectures.

Imagine you are sleeping one night and suddenly your phone rings.

According to your boss, an application crashed in production - let's say a payment service. Now, you are being asked to investigate what happened and why.

According to your investigations, the service crashed around 1 AM. So you looked at what transactions the application was processing at that time. This is what we call logs.

Logs describe discrete events and transactions within a system. They tell you a precise story about what's happening through messages generated by your application.

You saw that there was an increase in errors within those logs. Therefore you determined that there must be something wrong with the host.

The host where your application is running exposes multiple metrics, and one of the metrics you observed shows that the payment service is consistently throwing gateway timeouts. Since there were no successful requests, you were able to determine that the service or one of its dependencies is down.

Talking about metrics, they are time-series data that describe a measurement of resource utilization or behavior. They are useful because they provide insights into the behavior and health of a system, especially when aggregated.

Finally, you determined that a downstream service was causing high latency - not the payment service itself but a third-party payment processor. You were able to observe this by looking at request IDs and tracking them back to their origins.

Traces, or what we also call "distributed tracing," track down individual requests using unique IDs as they hop from one service to another. They allow you to know how a request travels from one end to the other.
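To make the idea of following a request ID concrete, here is a minimal sketch in Python. It is not how any particular tracing product works. The span records, the field names (request_id, service, start_ms, self_ms), and the service names are all hypothetical, and self_ms is assumed to mean the time spent inside that one service only.

```python
from collections import defaultdict


def group_by_request_id(spans):
    """Group span records by request ID and order each group by start time."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["request_id"]].append(span)
    for hops in traces.values():
        hops.sort(key=lambda span: span["start_ms"])
    return dict(traces)


def slowest_hop(hops):
    """Return the hop that spent the most time in its own service."""
    return max(hops, key=lambda span: span["self_ms"])


# Hypothetical spans, arriving out of order from different services.
spans = [
    {"request_id": "req-42", "service": "payment-processor", "start_ms": 15, "self_ms": 4200},
    {"request_id": "req-42", "service": "gateway", "start_ms": 0, "self_ms": 5},
    {"request_id": "req-42", "service": "payment-service", "start_ms": 5, "self_ms": 20},
    {"request_id": "req-7", "service": "gateway", "start_ms": 3, "self_ms": 4},
]

traces = group_by_request_id(spans)
path = [span["service"] for span in traces["req-42"]]
culprit = slowest_hop(traces["req-42"])["service"]
print(path, culprit)
```

With this made-up data, the path for req-42 runs from the gateway through the payment service to the payment processor, and the processor is the hop that accounts for almost all of the time.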

That was just an example, and if we summarize what we did, we'll say that looking at logs, metrics, and traces gave us a fast yet coherent way to identify the root cause of the problem, not just its symptoms.

Metrics, Logging, and Tracing are the so-called three pillars of observability.

You are in a game show, and there are three boxes in front of you. The first game's objective is to understand what's inside the first box without opening it.

You are allowed to ask questions about the box. Also, you can write on a sticky note and tag it on a box.

The more questions you ask and tag on the box, the easier it is for you to understand what is inside it. However, the more you tag, the more your prize money goes down.

In the world of observability, the first box is metrics. And the prize money going down because of too many tags is the cost of cardinality.

Tags allow you to describe your metrics in order to understand them better. However, as you keep on tagging them, you increase the cardinality of your metrics.

Cardinality is the number of unique combinations of metric names and their tags. Adding more tags increases cardinality. But having no tags makes your metric indecipherable like the box in our game show.

Too much cardinality is bad because it means you have more data to ingest and keep. This is a significant pain point for most of the people trying to run observability systems.
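A small sketch can show how quickly tags multiply. The metric and its tags below are hypothetical, and the function only computes an upper bound: it assumes every combination of tag values actually occurs, which real systems rarely reach.

```python
from math import prod


def series_cardinality(tag_values):
    """Upper bound on the number of series for one metric name:
    the product of the number of distinct values of each tag."""
    return prod(len(values) for values in tag_values.values())


# Hypothetical tags on a single request-count metric.
tags = {
    "method": ["GET", "POST", "PUT"],
    "status": ["200", "404", "500"],
    "zone": ["a", "b", "c"],
}
before = series_cardinality(tags)

# Adding a per-user tag multiplies the count by the number of users.
tags["user_id"] = [f"user-{i}" for i in range(10_000)]
after = series_cardinality(tags)

print(before, after)
```

Three tags with three values each give at most 27 series. Adding one tag with 10,000 possible values raises that ceiling to 270,000, which is why unbounded values such as user IDs are usually the first thing to avoid as tags.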

Onto the second game. You have another box, a hefty one.

The objective of the second game is to find a specific puzzle piece.

This box is already open; however, it is full of very similar puzzle pieces, and finding the right one to win the game is hard.

In the world of observability, this second box is logging.

Logging has a fatal flaw, which is the raw volume of data itself.

Historically, if you want to observe if a specific action happened, you would generate a transaction log.

The raw volume of the transaction log is multiplied by the number of interactions with other services. And in a microservices world, you might as well look for a needle in a haystack.

Combined with the cost of storage, the overhead of networking, and weeks of retention, this makes the price of logging exorbitant.

You can mitigate this issue in both metrics and logs by sampling. For example, you could keep only the first transaction out of every 10,000 transactions.
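As a rough sketch of that kind of sampling, the Python snippet below keeps the first record of every group of n. The function name and the log lines are made up for illustration, and real pipelines often sample randomly or per request rather than by position.

```python
def sample_one_in_n(records, n):
    """Keep the first record of every consecutive group of n records."""
    return [record for index, record in enumerate(records) if index % n == 0]


# Hypothetical transaction log with 25,000 entries.
logs = [f"transaction {i}" for i in range(25_000)]
kept = sample_one_in_n(logs, 10_000)
print(kept)
```

Out of 25,000 entries, only three survive. That is the trade-off: the volume drops dramatically, but a rare event that falls between the kept records is gone.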

Finally, this is the last game and the last box. When you open it, you find a blueprint of the edges and the edge pieces.

The objective of this game is to recreate the entire puzzle with only the blueprint of the edges and the edge pieces.

The obvious problem is that there are not enough puzzle pieces to build the puzzle.

In the world of observability, this third box is called "traces" or distributed traces.

By their nature, distributed traces only account for a subset of data. They only trace the generic path of travel between one system and another.

Consequently, it is hard to find a solution that scales gracefully, accounts for all data, and is immune to cardinality issues by using only logging, only metrics, or only tracing. You need a combination of two or three.

This is why most companies adopt a variation of two pillars or all three pillars combined.

However, having three pillars is not enough. They are a starting point for building observability.

This is why there is an emerging pattern of defining SLIs, or service level indicators, using the three pillars as the foundational data.

In the Kubernetes world, there is a myriad of things to observe. The key is to understand the architecture to know where we need to set our Service Level Indicator.

There are two types of components in Kubernetes. The first one is the Control Plane.

The Control Plane makes global decisions about the cluster. A good example is when a pod needs to be scheduled, or when a cloud load balancer needs to be created.

The other type is the Node Components. These are localized components that provide a runtime environment. Node Components also talk to the Control Plane components to orchestrate workloads, networking, storage, and event messaging within the cluster.

Here’s the million-dollar question: How do you know when your workload in the Kubernetes Cluster is healthy?

The answer, of course, is “it depends.” This is where Service Level Indicator comes in.

One component you can put in your SLI, for example, is uptime. How many nines is my current uptime? Looking at logs, how many times did my service crash? Looking at metrics, how many times was my service unreachable with 500 errors? Were there traces that couldn’t be completed at times?
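One simple way to turn those questions into a number is to count good requests against all requests. The sketch below assumes that a request is "good" when it does not return a 5xx status code, and that a window with no traffic counts as fully available. Both are assumptions for illustration, not a standard definition.

```python
def availability(status_codes):
    """Fraction of requests that did not fail with a 5xx server error."""
    if not status_codes:
        return 1.0  # assumption: no traffic counts as fully available
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


# Hypothetical window: 10,000 requests, 10 of them failed with a 500.
codes = [200] * 9_990 + [500] * 10
sli = availability(codes)
print(f"{sli:.2%}")
```

With 10 failures in 10,000 requests, the indicator comes out at 99.9%, or "three nines" for that window.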

Well, aside from uptime, you can also use request latency, aggregating all the metrics, logs, and traces.

How long does a request in my service take? Do the logs provide timestamps for when a request begins and when it ends? What do the traces show end to end?

Looking into the hosting of the workload, you also need an indicator that tells you whether a Control Plane component is unavailable, which could cause a downstream impact on your services.

Assuming you have all the indicators you need, the next step is to set objectives. The natural progression to Service Level Objectives becomes very important once the key Service Level Indicators are put into place.

SLOs are how you expect these indicators to behave, given a measurement period or a target measurement.

For example, a system receiving 100 queries per second might have an SLO of 99.99% uptime, 0 errors, and less than 1000 milliseconds of request latency.

Another one could be that response time will be less than 200 milliseconds for 80% of requests and less than 800 milliseconds for 95% of requests.
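A latency objective like this one can be checked directly against a list of measured response times. The sketch below is only an illustration: the function names and the sample latencies are invented, and an empty window is assumed to pass.

```python
def fraction_under(latencies_ms, threshold_ms):
    """Share of requests that finished faster than the threshold."""
    if not latencies_ms:
        return 1.0  # assumption: an empty window counts as passing
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return fast / len(latencies_ms)


def meets_latency_slo(latencies_ms):
    """True when 80% of requests are under 200 ms and 95% are under 800 ms."""
    return (
        fraction_under(latencies_ms, 200) >= 0.80
        and fraction_under(latencies_ms, 800) >= 0.95
    )


# Hypothetical window of 100 requests.
latencies = [120] * 85 + [500] * 12 + [1500] * 3
ok = meets_latency_slo(latencies)
print(ok)
```

In this made-up window, 85% of requests are under 200 milliseconds and 97% are under 800 milliseconds, so both targets are met even though three requests were very slow.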

Finally, here is an easier example: Availability could be at 99.95% for a month, given that you are running in three zones.
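It helps to translate an availability percentage into the amount of downtime it allows. The sketch below assumes a 30-day month, and the function name is made up for illustration.

```python
def allowed_downtime_minutes(slo_percent, days):
    """Minutes of downtime that still satisfy the availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


# Assumption: the month has 30 days, which is 43,200 minutes.
budget = allowed_downtime_minutes(99.95, 30)
print(f"{budget:.1f} minutes")
```

At 99.95% over 30 days, the service can be down for about 21.6 minutes in total. Raising the target to 99.99% would shrink that to roughly 4.3 minutes.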

Observability is an important day-2 operation in Kubernetes. It will help you build confidence in your workloads and understand key performance indicators such as Service Level Indicators and Service Level Objectives.

Don't forget to follow FAUN on twitter.com/@joinFAUN and subscribe to our hand-curated weekly newsletters: Visit faun.dev/join, choose the topics you would like to subscribe to, confirm your email subscription, and start receiving the best tutorials and stories from the web about DevOps, Cloud Native, Kubernetes, Serverless, and other must-follow topics.

If you want to reach us, we will be glad to read your feedback and suggestions. Just email us at community@faun.dev.