Current monitoring solutions focus on capturing raw data and various signals from multiple layers. This lets us build dashboards and alarms that automatically monitor the performance of our applications. While this approach works, it puts too high a burden on developers without giving them adequate tools and processes. Even though it sounds contradictory, current monitoring solutions neither show us what is really going on nor let us analyze performance properly.
The world has changed significantly over the last decades. We work with multiple databases and microservices, we deploy changes many times a day, and we face issues that span multiple systems. Changes are often independent and asynchronous. We very rarely deploy big monoliths nowadays. Instead, we split our systems into small parts, develop them independently, and deploy them with automated CI/CD pipelines. On top of that, we capture metrics from the hardware and runtime platforms, as well as business metrics emitted by the applications depending on the actual workflows executed by users. Since we deploy new systems every day, we need to either build big dashboards that show more and more charts or aggregate the data into Key Performance Indicators that hide the details.
Apart from seeing things, we also need understanding. We need to be able to reason about the signals and draw proper conclusions about how to improve our systems. We do that by setting thresholds for the metrics based on our intuition, knowledge, experience, and understanding of the business domain we work in. We then configure alarms and wire them into CI/CD pipelines to control deployments or automatically roll back invalid changes.
The current state is not enough, though. To effectively debug and analyze issues, we need to develop a new approach that scales well with the increasing number of moving parts. Just like we developed new tools and patterns when scaling our architecture and infrastructure, we need to build a new way of tracking the changes and performance of our applications. That is because the current approach does not scale well. We configure alarms manually and spend time on a case-by-case basis setting thresholds on the metrics. We work with raw data instead of semantic signals. When something goes wrong, we do not know the reason because we deal with general metrics instead of an end-to-end explanation of the situation. We can set thresholds manually, but such an approach is too time-consuming and expensive. We need new automated approaches that will grow with our businesses. We need observability and understanding. We need to share the responsibility within the team and make developers capable of taking ownership of all the components. Let's see how we do that.
Components of Monitoring
Let's see which components make up monitoring solutions and what they lack.
Telemetry
Telemetry is the process of gathering signals of various kinds. This includes logs, traces, metadata about operations, and details of running workflows. It is the first step that lets us see what is happening in our systems. Let's see which steps are part of telemetry.
The general plan is:
- We need to specify the signals. This involves planning what we're going to capture and how exactly. Developers need to implement solutions to actually calculate and emit signals from the code. This can be done manually with in-house-built solutions, or we can use libraries like OpenTelemetry to automatically emit signals from our code. Nowadays, it's very likely that your libraries, frameworks, drivers, and infrastructure already emit signals thanks to integration with OpenTelemetry and other solutions. A minimal example of emitting and exporting such signals appears after this list.
- We need to transmit the signals to a centralized place from all the hosts, components, and systems inside our infrastructure. Signals are typically first stored on local disks and then transmitted asynchronously to the centralized storage by background daemons.
- We need to process the signals once they're delivered to a single store. Users can search them, read them, and build more systems using the data captured in one place. We often need to transform signals into a uniform schema to make signal processing much easier.
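As a minimal sketch of the first two steps, the snippet below uses the OpenTelemetry Python SDK to emit a span and export it to a central collector. The service name and collector endpoint are assumptions for illustration; your setup will differ.

```python
# Minimal sketch: specify a signal (a span) and transmit it asynchronously
# to a centralized collector. The endpoint is a hypothetical example.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Describe the service emitting the signals.
resource = Resource.create({"service.name": "orders-service"})

# Batch spans and export them in the background to the collector.
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Emit a span around a unit of work; attributes become searchable metadata.
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "12345")
    # ... business logic goes here ...
```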
Telemetry is now a common part of our systems. Just as we log messages with logging libraries, we emit metrics and traces from our applications using standardized solutions. However, that's only the beginning of successful application monitoring.
Visibility
Once we have telemetry, we can start building visibility. The idea here is to integrate all the components of the system together. This consists of two parts: making sure that each component emits signals, and making sure that signals are properly correlated.
Making each component emit signals is nowadays easier with OpenTelemetry and other standardized solutions. We need to modify every code component that we own and plug in OpenTelemetry or another library. We also need to properly configure each off-the-shelf component to make sure it emits signals accordingly. This may be harder with external components owned by infrastructure providers, legacy components, or infrastructure in between that we don't own (like routers in the Internet backbone).
Next, we need to make sure we can correlate all the signals together. It's not enough to just capture the web request and the SQL query sent to the database as part of processing the request. We need to make sure that we can later correlate the SQL query and the web request with each other. We can do that by reusing identifiers (called correlation IDs). It's important not to leave any gaps - if we don't propagate identifiers through one layer, then we won't be able to correlate things properly and we won't build full visibility.
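As a rough sketch of what propagation looks like in practice, the example below uses OpenTelemetry's W3C Trace Context propagation to carry a correlation context across an HTTP hop. The service names and the HTTP call are illustrative assumptions.

```python
# Sketch: propagate the trace context (the correlation id) across services so
# downstream spans, including database spans, join the same trace.
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

def call_downstream_service():
    # Caller side: start a span and inject its context into outgoing headers.
    with tracer.start_as_current_span("checkout-request"):
        headers = {}
        inject(headers)  # adds the "traceparent" header
        requests.post("http://inventory-service/reserve", headers=headers, timeout=5)

def handle_incoming_request(headers):
    # Callee side: extract the caller's context so spans emitted here share
    # the same trace id as the original web request.
    ctx = extract(headers)
    with tracer.start_as_current_span("reserve-items", context=ctx):
        pass  # ... run SQL queries here; their spans correlate with the request ...
```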
Our goal is to be able to see everything. Once we achieve that, we can move on to Application Performance Management (APM).
Recommended reading: Mastering SQL Database Auditing: Best Practices & Tips
Application Performance Management
Application Performance Management is the ability to capture all the signals from all of the places, aggregate them, and build solutions showing the current health of the system. We typically achieve that by using dashboards, charts, and alarms. Let's see them one by one.
Dashboards can be configured as per our needs. We can build dashboards showing the global state without details. We can build separate dashboards for particular parts of the system that would show more details. We can have different dashboards for different scenarios, roles, or people. Dashboards for database administrators may focus on database-specific details, while dashboards for product teams may focus on product metrics.
All dashboards include charts. Charts present metrics with different granularity and scale. For instance, we can capture infrastructure metrics from all of the nodes in the system, aggregate them to calculate averages or percentiles, and then show them for the last two weeks. This way we can track weekly patterns and see when something changes over time.
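A small sketch of that aggregation step: turning raw per-node latency samples into the averaged and percentile values a chart would plot. The data shape and numbers are assumptions for illustration.

```python
# Aggregate raw latency samples (ms) into the values a chart would show.
from statistics import mean, quantiles

samples_by_day = {
    "2024-05-01": [12, 14, 13, 250, 15],
    "2024-05-02": [11, 13, 12, 14, 16],
}

for day, samples in samples_by_day.items():
    p99 = quantiles(samples, n=100)[98]  # 99th percentile
    print(day, "avg:", round(mean(samples), 1), "p99:", round(p99, 1))
```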
Finally, we can configure alarms on the metrics that we present with charts. We can set thresholds based on our experience or knowledge, and then get automated notifications when things break. APM can show us the health of the system and notify us accordingly. The ultimate idea here is to be able to see at a glance whether everything works well, and to see where the fire is in case of issues. That all sounds very promising; however, it is not enough. Let's see what we lack in the current world.
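Conceptually, such an alarm is just a manually chosen threshold compared against an aggregated metric, as in the sketch below. The threshold value and the notify helper are hypothetical, not part of any particular tool.

```python
# Sketch of a threshold-based alarm: a hand-picked limit plus a notification.
P99_LATENCY_THRESHOLD_MS = 200  # chosen from experience, not computed

def notify(message: str) -> None:
    # Placeholder: a real setup would page on-call or post to chat.
    print("ALARM:", message)

def evaluate_alarm(p99_latency_ms: float) -> None:
    if p99_latency_ms > P99_LATENCY_THRESHOLD_MS:
        notify(f"p99 latency {p99_latency_ms} ms exceeded {P99_LATENCY_THRESHOLD_MS} ms")

evaluate_alarm(p99_latency_ms=230.0)
```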
Why It Is Not Enough
There are many issues with the current state of monitoring and performance management. These include too much raw data, a lack of anomaly detection, and a lack of understanding in general. Let's go through them.
APM captures signals and presents them in dashboards with metrics and alarms. This all sounds great; however, we can't just use these dashboards. We need to understand how they were constructed, what they show, how to set thresholds, and what to do when something is wrong. Dashboards present raw data, they swamp us with metrics and charts, and while we can see what's there, we don't know what to do with it. We don't know how to set alarms, how to configure thresholds, or whether particular anomalies are expected due to activity in the environment or completely undesired and indicative of underlying issues. Unfortunately, this takes understanding and knowledge of the system. This is not something that current tools do for us automatically. Quite the opposite.
Another problem is detecting anomalies. Charts and diagrams do not show us issues. They can sometimes detect anomalies based on previous trends, but they can't give us the full picture. We need to know which anomalies to look for, configure mechanisms for anomaly detection, and set their sensitivity properly so we are not overloaded with false alarms. We need to spend a lot of time getting these mechanisms stable; it's not just fire and forget. Also, machine learning solutions need data. They can't just work out of the box; they need historical metrics to learn the patterns. These metrics should also include signals about what we do. A spike in CPU may be caused by an issue with data or just an ongoing deployment. If the metrics don't carry that context, then we'll have alarms that mislead us.
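The sketch below illustrates the point: a naive 3-sigma check over historical CPU samples flags a spike, and only a deployment annotation keeps the alarm from being misleading. The numbers and the 3-sigma rule are illustrative assumptions, not a production recipe.

```python
# Naive anomaly detection that needs both history and deployment context.
from statistics import mean, stdev

history = [41, 43, 40, 42, 44, 43, 41, 42]   # past CPU utilization (%)
current = 78                                  # latest sample
deployment_in_progress = True                 # signal from the CI/CD pipeline

baseline = mean(history)
spread = stdev(history)

if abs(current - baseline) > 3 * spread:
    if deployment_in_progress:
        print("CPU spike detected, but a deployment is running - suppressing alarm")
    else:
        print("CPU anomaly detected - raising alarm")
```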
Next, there is the issue of aggregating the data. To have dashboards that we can review quickly and easily, we need to aggregate signals from all the hosts across all dimensions. However, this way we lose insight into particular subsets of our operations. If we average the web service latency across all the countries, then we won't see changes in small regions. It is possible that the average of all the requests remains stable while one particular region is severely affected by the new deployment. Global dashboards won't show that. We need other dashboards that show metrics aggregated differently across the dimensions. However, this way we lose the ability to quickly see what's going on. We can dive deep, but that takes time and is not straightforward anymore.
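A quick worked example of how aggregation hides a regression; the request counts and latencies are made up for illustration.

```python
# The global average stays flat while one small region regresses badly.
regions = {
    # region: (request count, average latency in ms)
    "us": (100_000, 100),
    "eu": (80_000, 105),
    "small-region": (2_000, 400),   # severely affected after the deployment
}

total_requests = sum(count for count, _ in regions.values())
global_avg = sum(count * latency for count, latency in regions.values()) / total_requests

print(f"global average: {global_avg:.1f} ms")                     # ~105 ms, looks healthy
print(f"small-region average: {regions['small-region'][1]} ms")   # 400 ms, clearly broken
```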
Finally, current monitoring solutions don't understand the big picture. The CPU may have spiked for many reasons - a slow machine, issues with memory, different traffic, a spike of user requests, a missing index, an index not being used anymore, a deployment taking longer, a background task, operating system updates, database reorganization, a slow algorithm, or a bug in the application.
Recommended reading: Troubleshooting PostgreSQL High CPU Usage
Monitoring solutions don't know that; they don't see what's going on. We need something that can connect all the dots and build a story like "You merged this change last week, today it reached production, the index is not used anymore, we finally see traffic using the change, and therefore the database is slow now". This is something we can't achieve with current solutions. We need something better. Let's see what that looks like.
How to Move Forward with Observability and Understanding to Build Database Guardrails
We need database guardrails. We need to move from static monitoring to observability and understanding. We need a solution that can prevent bad code from reaching production, monitor all our databases and database-related pieces, and automatically troubleshoot issues as they appear.
First, we need to automate everything. We can't keep reviewing dashboards manually or digging into metrics to see which customer cohort is affected after a deployment. We need a solution that understands what changes, how it changes, and how it may affect the business. This should include things like the following (a rough sketch of how such metadata could be modeled appears after the list):
- History of CI/CD with deployment dates
- Details of the changes - is it schema, code, configuration, or something else
- Ongoing activities around databases - like statistics recalculation, defragmentation, table rewrite, purging, auditing
- Details of requests coming to the platform - if they are "general" queries or if they fall into some "edge case" bucket
- Current configuration of the database and how it differs between environments
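As a hypothetical sketch, the metadata above could be modeled as simple records that a guardrails platform joins with metrics by time. The field names below are assumptions, not a defined schema.

```python
# Hypothetical data structures for change and activity metadata.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DeploymentEvent:
    deployed_at: datetime
    service: str
    change_kind: str                          # "schema", "code", "configuration", ...
    migration_ids: list[str] = field(default_factory=list)

@dataclass
class DatabaseActivity:
    started_at: datetime
    kind: str                                 # "statistics", "table rewrite", "purging", ...
    database: str

# A troubleshooting story can then join metrics with these events by time:
# "latency rose right after deployment X, while a table rewrite was running".
```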
There are many more aspects that we should incorporate in the solution. Anything that our business produces must be added to the database guardrails platform.
Next, we need a solution that understands databases. It's not only about monitoring the host; it's also about understanding how SQL databases work, how they organize the data, and what operations they execute daily. This should account for the database type, whether it's SQL or NoSQL, whether it's replicated, partitioned, or sharded, which ORMs we use to talk to the database, what libraries we have around, and which piece manages the migrations and the scheduled maintenance tasks.
Next, we need to push all the checks to the left, as early as possible. We can't wait for the load tests to happen. We can't deploy wrong code to production just to find out that it's too slow or doesn't work. We need to let developers know as early as possible that the code they wrote is not going to perform well. See our article explaining how to test your databases to learn more.
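One possible shape of such a shift-left check, sketched under assumptions: a CI step that runs EXPLAIN against a staging PostgreSQL database and fails the build if a query falls back to a sequential scan. The connection string, query, and failure criterion are illustrative only.

```python
# Sketch of a CI check: fail early when a query plan looks problematic.
import json
import psycopg2

QUERY = "SELECT * FROM orders WHERE customer_id = %s"

def check_query_plan(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("EXPLAIN (FORMAT JSON) " + QUERY, (42,))
        result = cur.fetchone()[0]
        # The driver may return the plan already parsed; parse it if it's a string.
        plan = (json.loads(result) if isinstance(result, str) else result)[0]["Plan"]
        if plan["Node Type"] == "Seq Scan":
            raise SystemExit(
                f"Query uses a sequential scan on {plan['Relation Name']} - add an index?"
            )
        print("Plan looks fine:", plan["Node Type"])

# check_query_plan("postgresql://ci_user@staging-db/app")  # run inside the pipeline
```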
Finally, all the monitoring we have must be built with developers in mind. They own the database and they understand their solutions. We must give them tools that let them own the database, help them analyze the performance, and make it easy to reason about changes. We can't keep database maintenance within a separate team. Just like developers now manage their clouds as part of the DevOps movement, we need to let them own their databases. Let's see how Metis achieves this.
How Metis Stays Ahead of the Database Guardrails Movement
Metis is your ultimate database guardrails solution. It prevents bad code from reaching production, monitors all your databases, and automatically troubleshoots issues for you.
How It Works
Metis is a Software as a Service (SaaS) that integrates with your application and database, analyzes queries and schemas, and provides insights into how to improve your ecosystem.
To understand how Metis works, we need to discuss its two building blocks: the SDKs and the Metadata Collector.
An SDK is just a library that wraps OpenTelemetry and adds another sink that delivers spans and traces to Metis. It plugs into the regular OpenTelemetry features you most likely already have in your application. Many frameworks and libraries, including web frameworks, ORMs, and SQL drivers, already integrate with OpenTelemetry and emit appropriate signals. Metis reuses the same infrastructure to capture metadata about interactions, requests, and SQL queries. You can read more about that in the documentation.
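The exact Metis SDK packages and configuration are described in the documentation. Purely as a hypothetical illustration of what "adding another sink" means in OpenTelemetry terms, the sketch below attaches a second span exporter next to the one an application already has; the endpoint and header are made up and are not the real Metis API.

```python
# Hypothetical illustration: attach an additional span exporter to an
# already configured OpenTelemetry SDK tracer provider.
from opentelemetry import trace
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Assumes your application has already set up an SDK TracerProvider.
provider = trace.get_tracer_provider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://example-apm-backend/v1/traces",   # hypothetical sink
            headers={"x-api-key": "YOUR_API_KEY"},              # hypothetical auth
        )
    )
)
```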
The SDK is disabled in production by default so that it does not decrease the performance of your application and does not extract confidential information. To install the SDK, you just need to add a single dependency to your application with the ordinary package manager you use every day (like NPM or PIP).
Once signals are delivered to the Metis platform, they are processed and insights are generated. To do that, we'd like to use as much information about your production database as possible. We capture it using the Metadata Collector (MC) - an open-source Docker container that you can run in your environment and connect to all your databases to extract schemas, statistics, row counts, running configuration, and much more. MC can be deployed easily in the cloud (for instance, using an AWS CloudFormation template) or in an on-premise environment (just run the Docker container the way you normally do). MC connects to your databases every couple of hours and collects information. It also uses the extensions you have, like pg_stat_statements or the slow query log. The more metadata it can access, the better.
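As an illustration only, not the actual Metadata Collector implementation, the sketch below shows the kind of metadata such a component could pull from PostgreSQL, assuming psycopg2 and the pg_stat_statements extension are available.

```python
# Illustrative metadata collection: approximate table sizes and top statements.
import psycopg2

def collect_metadata(dsn: str) -> dict:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT relname, n_live_tup
            FROM pg_stat_user_tables
            ORDER BY n_live_tup DESC
            LIMIT 20
        """)
        tables = cur.fetchall()

        # mean_exec_time exists in PostgreSQL 13+; older versions use mean_time.
        cur.execute("""
            SELECT query, calls, mean_exec_time
            FROM pg_stat_statements
            ORDER BY mean_exec_time DESC
            LIMIT 20
        """)
        slow_statements = cur.fetchall()

    return {"largest_tables": tables, "slowest_statements": slow_statements}
```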
Let's now see what Metis can do for you.
Monitor Everything
Metadata Collector can connect to your databases and provide a summarized observability dashboard: