Current monitoring solutions focus on capturing raw data and various signals from multiple layers. This lets us build dashboards and alarms that automatically monitor the performance of our applications. While this approach works, it puts too high a burden on developers without giving them adequate tools and processes. Even though it sounds contradictory, current monitoring solutions neither show us what is really going on nor let us analyze performance properly.
The world has changed significantly over the last decades. We work with multiple databases and microservices, we deploy changes many times a day, and we face issues that span multiple systems. Changes are often independent and asynchronous. We very rarely deploy big monoliths nowadays. Instead, we split our systems into small parts, develop them independently, and deploy them with automated CI/CD pipelines. On top of that, we capture metrics from the hardware and runtime platforms, as well as business metrics emitted by the applications depending on the actual workflows executed by users. Since we deploy new systems every day, we need to either build big dashboards that show more and more charts or aggregate the data into Key Performance Indicators that hide the details.
Apart from seeing things, we also need understanding. We need to be able to reason about the signals and draw proper conclusions about how to improve our systems. We do that by setting thresholds for the metrics based on our intuition, knowledge, experience, and understanding of the business domain we work in. We then configure alarms and wire them into CI/CD pipelines to control deployments or automatically roll back invalid changes.
The current state is not enough, though. To effectively debug and analyze issues, we need to develop a new approach that scales well with the increasing number of moving parts. Just like we developed new tools and patterns when scaling our architecture and infrastructure, we need to build a new way of tracking the changes and performance of our applications. That is because the current approach does not scale well. We configure alarms manually and spend time on a case-by-case basis setting thresholds on the metrics. We work with raw data instead of semantic signals. When something goes wrong, we do not know the reason because we deal with general metrics instead of an end-to-end explanation of the situation. We can set thresholds manually, but such an approach is too time-consuming and expensive. We need new automated approaches that will grow with our businesses. We need observability and understanding. We need to share the responsibility within the team and make developers capable of taking ownership of all the components. Let's see how we do that.
Components of Monitoring
Let's see which components make up monitoring solutions and what they lack.
Telemetry
Telemetry is the process of gathering signals of various kinds. This includes logs, traces, metadata about operations, and details of running workflows. It is the first step that lets us see what is happening in our systems. Let's see which steps are part of telemetry.
The general plan is:
- We need to specify the signals. This involves planning what we're going to capture and how exactly. Developers need to implement solutions to actually calculate and emit signals from the code. This can be done manually with in-house-built solutions, or we can use libraries like OpenTelemetry to automatically emit signals from our code. Nowadays, it's very likely that your libraries, frameworks, drivers, and infrastructure already emit signals thanks to integration with OpenTelemetry and other solutions. A minimal example of emitting and exporting such signals appears after this list.
- We need to transmit the signals to a centralized place from all the hosts, components, and systems inside our infrastructure. Signals are typically first stored on local disks and then transmitted asynchronously to the centralized storage by background daemons.
- We need to process the signals once they're delivered to a single store. Users can search them, read them, and build more systems using the data captured in one place. We often need to transform signals into a uniform schema to make signal processing much easier.
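As a minimal sketch of the first two steps, the snippet below uses the OpenTelemetry Python SDK to emit a span and export it to a central collector. The service name and collector endpoint are assumptions for illustration; your setup will differ.

```python
# Minimal sketch: specify a signal (a span) and transmit it asynchronously
# to a centralized collector. The endpoint is a hypothetical example.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Describe the service emitting the signals.
resource = Resource.create({"service.name": "orders-service"})

# Batch spans and export them in the background to the collector.
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Emit a span around a unit of work; attributes become searchable metadata.
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "12345")
    # ... business logic goes here ...
```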
Telemetry is now a common part of our systems. Just as we log messages with logging libraries, we emit metrics and traces from our applications using standardized solutions. However, that's only the beginning of successful application monitoring.
Visibility
Once we have telemetry, we can start building visibility. The idea here is to integrate all the components of the system together. This consists of two parts: making sure that each component emits signals, and making sure that signals are properly correlated.
Making each component emit signals is nowadays easier with OpenTelemetry and other standardized solutions. We need to modify every code component that we own and plug in OpenTelemetry or another library. We also need to properly configure each off-the-shelf component to make sure it emits signals accordingly. This may be harder with external components owned by infrastructure providers, legacy components, or infrastructure in between that we don't own (like routers in the Internet backbone).
Next, we need to make sure we can correlate all the signals together. It's not enough to just capture the web request and the SQL query sent to the database as part of processing the request. We need to make sure that we can later correlate the SQL query and the web request with each other. We can do that by reusing identifiers (called correlation IDs). It's important not to leave any gaps - if we don't propagate identifiers through one layer, then we won't be able to correlate things properly and we won't build full visibility.
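As a rough sketch of what propagation looks like in practice, the example below uses OpenTelemetry's W3C Trace Context propagation to carry a correlation context across an HTTP hop. The service names and the HTTP call are illustrative assumptions.

```python
# Sketch: propagate the trace context (the correlation id) across services so
# downstream spans, including database spans, join the same trace.
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

def call_downstream_service():
    # Caller side: start a span and inject its context into outgoing headers.
    with tracer.start_as_current_span("checkout-request"):
        headers = {}
        inject(headers)  # adds the "traceparent" header
        requests.post("http://inventory-service/reserve", headers=headers, timeout=5)

def handle_incoming_request(headers):
    # Callee side: extract the caller's context so spans emitted here share
    # the same trace id as the original web request.
    ctx = extract(headers)
    with tracer.start_as_current_span("reserve-items", context=ctx):
        pass  # ... run SQL queries here; their spans correlate with the request ...
```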
Our goal is to be able to see everything. Once we achieve that, we can move on to Application Performance Management (APM).
Recommended reading: Mastering SQL Database Auditing: Best Practices & Tips
Application Performance Management
Application Performance Management is the ability to capture all the signals from all of the places, aggregate them, and build solutions showing the current health of the system. We typically achieve that by using dashboards, charts, and alarms. Let's see them one by one.
Dashboards can be configured as per our needs. We can build dashboards showing the global state without details. We can build separate dashboards for particular parts of the system that would show more details. We can have different dashboards for different scenarios, roles, or people. Dashboards for database administrators may focus on database-specific details, while dashboards for product teams may focus on product metrics.
All dashboards include charts. Charts present metrics with different granularity and scale. For instance, we can capture infrastructure metrics from all of the nodes in the system, aggregate them to calculate averages or percentiles, and then show them for the last two weeks. This way we can track weekly patterns and see when something changes over time.
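A small sketch of that aggregation step: turning raw per-node latency samples into the averaged and percentile values a chart would plot. The data shape and numbers are assumptions for illustration.

```python
# Aggregate raw latency samples (ms) into the values a chart would show.
from statistics import mean, quantiles

samples_by_day = {
    "2024-05-01": [12, 14, 13, 250, 15],
    "2024-05-02": [11, 13, 12, 14, 16],
}

for day, samples in samples_by_day.items():
    p99 = quantiles(samples, n=100)[98]  # 99th percentile
    print(day, "avg:", round(mean(samples), 1), "p99:", round(p99, 1))
```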
Finally, we can configure alarms on the metrics that we present with charts. We can set thresholds based on our experience or knowledge, and then get automated notifications when things break. APM can show us the health of the system and notify us accordingly. The ultimate idea here is to be able to see at a glance whether everything works well, and to see where the fire is in case of issues. That all sounds very promising; however, it is not enough. Let's see what we lack in the current world.
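Conceptually, such an alarm is just a manually chosen threshold compared against an aggregated metric, as in the sketch below. The threshold value and the notify helper are hypothetical, not part of any particular tool.

```python
# Sketch of a threshold-based alarm: a hand-picked limit plus a notification.
P99_LATENCY_THRESHOLD_MS = 200  # chosen from experience, not computed

def notify(message: str) -> None:
    # Placeholder: a real setup would page on-call or post to chat.
    print("ALARM:", message)

def evaluate_alarm(p99_latency_ms: float) -> None:
    if p99_latency_ms > P99_LATENCY_THRESHOLD_MS:
        notify(f"p99 latency {p99_latency_ms} ms exceeded {P99_LATENCY_THRESHOLD_MS} ms")

evaluate_alarm(p99_latency_ms=230.0)
```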
Why It Is Not Enough
There are many issues with the current state of monitoring and performance management. These include too much raw data, a lack of anomaly detection, and a lack of understanding in general. Let's go through them.
APM captures signals and presents them in dashboards with metrics and alarms. This all sounds great; however, we can't just use these dashboards. We need to understand how they were constructed, what they show, how to set thresholds, and what to do when something is wrong. Dashboards present raw data, they swamp us with metrics and charts, and while we can see what's there, we don't know what to do with it. We don't know how to set alarms, how to configure thresholds, or whether particular anomalies are expected due to activity in the environment or completely undesired and indicative of underlying issues. Unfortunately, this takes understanding and knowledge of the system. This is not something that current tools do for us automatically. Quite the opposite.
Another problem is detecting anomalies. Charts and diagrams do not show us issues. They can sometimes detect anomalies based on previous trends, but they can't give us the full picture. We need to know which anomalies to look for, configure mechanisms for anomaly detection, and set their sensitivity properly so we are not overloaded with false alarms. We need to spend a lot of time getting these mechanisms stable; it's not just fire and forget. Also, machine learning solutions need data. They can't just work out of the box; they need historical metrics to learn the patterns. These metrics should also include signals about what we do. A spike in CPU may be caused by an issue with data or just an ongoing deployment. If the metrics don't carry that context, then we'll have alarms that mislead us.
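The sketch below illustrates the point: a naive 3-sigma check over historical CPU samples flags a spike, and only a deployment annotation keeps the alarm from being misleading. The numbers and the 3-sigma rule are illustrative assumptions, not a production recipe.

```python
# Naive anomaly detection that needs both history and deployment context.
from statistics import mean, stdev

history = [41, 43, 40, 42, 44, 43, 41, 42]   # past CPU utilization (%)
current = 78                                  # latest sample
deployment_in_progress = True                 # signal from the CI/CD pipeline

baseline = mean(history)
spread = stdev(history)

if abs(current - baseline) > 3 * spread:
    if deployment_in_progress:
        print("CPU spike detected, but a deployment is running - suppressing alarm")
    else:
        print("CPU anomaly detected - raising alarm")
```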
Next, there is the issue of aggregating the data. To have dashboards that we can review quickly and easily, we need to aggregate signals from all the hosts across all dimensions. However, this way we lose insight into particular subsets of our operations. If we average the web service latency across all the countries, then we won't see changes in small regions. It is possible that the average of all the requests remains stable while one particular region is severely affected by the new deployment. Global dashboards won't show that. We need other dashboards that show metrics aggregated differently across the dimensions. However, this way we lose the ability to quickly see what's going on. We can dive deep, but that takes time and is not straightforward anymore.
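A quick worked example of how aggregation hides a regression; the request counts and latencies are made up for illustration.

```python
# The global average stays flat while one small region regresses badly.
regions = {
    # region: (request count, average latency in ms)
    "us": (100_000, 100),
    "eu": (80_000, 105),
    "small-region": (2_000, 400),   # severely affected after the deployment
}

total_requests = sum(count for count, _ in regions.values())
global_avg = sum(count * latency for count, latency in regions.values()) / total_requests

print(f"global average: {global_avg:.1f} ms")                     # ~105 ms, looks healthy
print(f"small-region average: {regions['small-region'][1]} ms")   # 400 ms, clearly broken
```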
Finally, current monitoring solutions don't understand the big picture. The CPU may have spiked for many reasons - a slow machine, issues with memory, different traffic, a spike of user requests, a missing index, an index not being used anymore, a deployment taking longer, a background task, operating system updates, database reorganization, a slow algorithm, or a bug in the application.
Recommended reading: Troubleshooting PostgreSQL High CPU Usage
Monitoring solutions don't know that; they don't see what's going on. We need something that can connect all the dots and build a story like "You merged this change last week, today it reached production, the index is not used anymore, we finally see traffic using the change, and therefore the database is slow now". This is something we can't achieve with current solutions. We need something better. Let's see what that looks like.
How to Move Forward with Observability and Understanding to Build Database Guardrails
We need database guardrails. We need to move from static monitoring to observability and understanding. We need a solution that can prevent bad code from reaching production, monitor all our databases and database-related pieces, and automatically troubleshoot issues as they appear.
First, we need to automate everything. We can't keep reviewing dashboards manually or digging into metrics to see which customer cohort is affected after a deployment. We need a solution that understands what changes, how it changes, and how it may affect the business. This should include things like the following (a rough sketch of how such metadata could be modeled appears after the list):
- History of CI/CD with deployment dates
- Details of the changes - is it schema, code, configuration, or something else
- Ongoing activities around databases - like statistics recalculation, defragmentation, table rewrite, purging, auditing
- Details of requests coming to the platform - if they are "general" queries or if they fall into some "edge case" bucket
- Current configuration of the database and how it differs between environments
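As a hypothetical sketch, the metadata above could be modeled as simple records that a guardrails platform joins with metrics by time. The field names below are assumptions, not a defined schema.

```python
# Hypothetical data structures for change and activity metadata.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DeploymentEvent:
    deployed_at: datetime
    service: str
    change_kind: str                          # "schema", "code", "configuration", ...
    migration_ids: list[str] = field(default_factory=list)

@dataclass
class DatabaseActivity:
    started_at: datetime
    kind: str                                 # "statistics", "table rewrite", "purging", ...
    database: str

# A troubleshooting story can then join metrics with these events by time:
# "latency rose right after deployment X, while a table rewrite was running".
```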
There are many more aspects that we should incorporate in the solution. Anything that our business produces must be added to the database guardrails platform.
Next, we need a solution that understands databases. It's not only about monitoring the host; it's also about understanding how SQL databases work, how they organize the data, and what operations they execute daily. This should account for the database type, whether it's SQL or NoSQL, whether it's replicated, partitioned, or sharded, which ORMs we use to talk to the database, what libraries we have around, and which piece manages the migrations and the scheduled maintenance tasks.
Next, we need to push all the checks to the left, as early as possible. We can't wait for the load tests to happen. We can't deploy wrong code to production just to find out that it's too slow or doesn't work. We need to let developers know as early as possible that the code they wrote is not going to perform well. See our article explaining how to test your databases to learn more.
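One possible shape of such a shift-left check, sketched under assumptions: a CI step that runs EXPLAIN against a staging PostgreSQL database and fails the build if a query falls back to a sequential scan. The connection string, query, and failure criterion are illustrative only.

```python
# Sketch of a CI check: fail early when a query plan looks problematic.
import json
import psycopg2

QUERY = "SELECT * FROM orders WHERE customer_id = %s"

def check_query_plan(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("EXPLAIN (FORMAT JSON) " + QUERY, (42,))
        result = cur.fetchone()[0]
        # The driver may return the plan already parsed; parse it if it's a string.
        plan = (json.loads(result) if isinstance(result, str) else result)[0]["Plan"]
        if plan["Node Type"] == "Seq Scan":
            raise SystemExit(
                f"Query uses a sequential scan on {plan['Relation Name']} - add an index?"
            )
        print("Plan looks fine:", plan["Node Type"])

# check_query_plan("postgresql://ci_user@staging-db/app")  # run inside the pipeline
```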
Finally, all the monitoring we have must be built with developers in mind. They own the database and they understand their solutions. We must give them tools that let them own the database, help them analyze the performance, and make it easy to reason about changes. We can't keep database maintenance within a separate team. Just like developers now manage their clouds as part of the DevOps movement, we need to let them own their databases. Let's see how Metis achieves this.
How Metis Stays Ahead of the Database Guardrails Movement
Metis is your ultimate database guardrails solution. It prevents bad code from reaching production, monitors all your databases, and automatically troubleshoots issues for you.
How It Works
Metis is a Software as a Service (SaaS) that integrates with your application and database, analyzes queries and schemas, and provides insights into how to improve your ecosystem.
To understand how Metis works, we need to discuss its two building blocks: the SDKs and the Metadata Collector.
An SDK is just a library that wraps OpenTelemetry and adds another sink that delivers spans and traces to Metis. It plugs into the regular OpenTelemetry features you most likely already have in your application. Many frameworks and libraries, including web frameworks, ORMs, and SQL drivers, already integrate with OpenTelemetry and emit appropriate signals. Metis reuses the same infrastructure to capture metadata about interactions, requests, and SQL queries. You can read more about that in the documentation.
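The exact Metis SDK packages and configuration are described in the documentation. Purely as a hypothetical illustration of what "adding another sink" means in OpenTelemetry terms, the sketch below attaches a second span exporter next to the one an application already has; the endpoint and header are made up and are not the real Metis API.

```python
# Hypothetical illustration: attach an additional span exporter to an
# already configured OpenTelemetry SDK tracer provider.
from opentelemetry import trace
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Assumes your application has already set up an SDK TracerProvider.
provider = trace.get_tracer_provider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://example-apm-backend/v1/traces",   # hypothetical sink
            headers={"x-api-key": "YOUR_API_KEY"},              # hypothetical auth
        )
    )
)
```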
The SDK is disabled in production by default so that it does not decrease the performance of your application and does not extract confidential information. To install the SDK, you just need to add a single dependency to your application with the ordinary package manager you use every day (like NPM or PIP).
Once signals are delivered to the Metis platform, they are processed and insights are generated. To do that, we'd like to use as much information about your production database as possible. We capture it using the Metadata Collector (MC) - an open-source Docker container that you can run in your environment and connect to all your databases to extract schemas, statistics, row counts, running configuration, and much more. MC can be deployed easily in the cloud (for instance, using an AWS CloudFormation template) or in an on-premise environment (just run the Docker container the way you normally do). MC connects to your databases every couple of hours and collects information. It also uses the extensions you have, like pg_stat_statements or the slow query log. The more metadata it can access, the better.
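As an illustration only, not the actual Metadata Collector implementation, the sketch below shows the kind of metadata such a component could pull from PostgreSQL, assuming psycopg2 and the pg_stat_statements extension are available.

```python
# Illustrative metadata collection: approximate table sizes and top statements.
import psycopg2

def collect_metadata(dsn: str) -> dict:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT relname, n_live_tup
            FROM pg_stat_user_tables
            ORDER BY n_live_tup DESC
            LIMIT 20
        """)
        tables = cur.fetchall()

        # mean_exec_time exists in PostgreSQL 13+; older versions use mean_time.
        cur.execute("""
            SELECT query, calls, mean_exec_time
            FROM pg_stat_statements
            ORDER BY mean_exec_time DESC
            LIMIT 20
        """)
        slow_statements = cur.fetchall()

    return {"largest_tables": tables, "slowest_statements": slow_statements}
```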
Let's now see what Metis can do for you.
Monitor Everything
Metadata Collector can connect to your databases and provide a summarized observability dashboard: