@adammetis · Dec 14, 2023 · 10 min read · Originally posted on www.metisdata.io
The database is the most crucial part of every application. While we can accept some issues with the frontend or the backend, we simply cannot let our database fail or become slow. In this post we will look at the crucial database performance metrics and how to improve database performance. We will focus on monitoring, configuration, and various database engines (PostgreSQL or MySQL), and we will see why they matter.
A database consists of data stored in files. No matter whether it's a SQL or NoSQL database, the data is ultimately stored on the hard drive. To make processing faster, each engine implements various techniques for memory management, caching, parallelization, or speculative execution.
All these techniques improve performance by increasing throughput or reducing execution time. However, they are based on assumptions about how data is stored or what operations are executed, and those assumptions may not hold in distributed databases or heterogeneous environments. Each operation may also be affected by the transaction isolation level and suffer performance degradation. Therefore, we need to track database performance metrics to identify issues and fix performance problems as early as possible.
A database performance metric is a key indicator describing some aspect of the system's operation. There are multiple areas we can track, like the memory subsystem, CPU consumption, disk usage, availability, or throughput. We need to remember that the overall performance of the system suffers if any of these metrics degrades. For instance, high memory usage may lead to high CPU consumption, which leaves fewer CPU cycles for other tasks (like clearing memory pages), which in turn affects other transactions. In short, issues spread quickly.
Let's now dig deeper into the actual performance metrics.
The first metrics we need to track are the basics of the machine our database runs on. These include available working memory (both paged-in and paged-out), CPU usage (for each core), available disk space, disk fragmentation, I/O operation latency (and queue length), network traffic, number of processes, number of threads, number of operating system handles, and many more.
These metrics should be provided by our hosting provider, or they can be obtained from operating system counters. No matter what is wrong with our database, these metrics should indicate there is an issue. However, we very rarely need to fix these metrics per se. If the CPU spikes, we need to understand what is running on the machine, not simply add more CPU.
We need to find the right balance between leaving room for unexpected spikes and utilizing the system well. We should aim for around 70% CPU utilization, roughly 70% of memory in use, low disk fragmentation, and so on.
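As a rough illustration, here is a minimal Python sketch (using the psutil library, and assuming the 70% rule of thumb above as the threshold) that samples a few of these host-level counters and flags breaches:

```python
# Minimal sketch: sample host-level metrics with psutil and flag breaches.
# The 70% threshold is an assumption taken from the rule of thumb above.
import psutil

UTILIZATION_TARGET = 70.0  # percent; adjust to your own headroom policy

def collect_host_metrics() -> dict:
    """Gather a few of the OS-level counters mentioned above."""
    mem = psutil.virtual_memory()
    disk = psutil.disk_usage("/")
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),  # averaged over all cores
        "memory_percent": mem.percent,
        "disk_percent": disk.percent,
        "process_count": len(psutil.pids()),
    }

def flag_breaches(metrics: dict) -> list[str]:
    """Return the names of metrics that exceed the utilization target."""
    return [name for name, value in metrics.items()
            if name.endswith("_percent") and value > UTILIZATION_TARGET]

if __name__ == "__main__":
    snapshot = collect_host_metrics()
    print(snapshot)
    print("Over target:", flag_breaches(snapshot) or "none")
```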
The next group of metrics shows "how much we can do". It includes the number of writes and transactions, the number of connections, cache and page hit ratios, and the time it takes to complete a query.
These metrics show how fast the system is. Again, we want to target around 70% utilization of the available resources, and we want to minimize the number of cache misses or page misses. For any cache we have, we want the cache hit ratio to be as high as possible.
Regarding the time to complete a query: it's important to understand that the query must be as fast as needed, but not faster. There is generally no point in reducing the query execution time to nanoseconds. Performance is a feature, but human perception is also limited.
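For PostgreSQL, a minimal sketch like the one below can read throughput-style counters from the pg_stat_database view (committed transactions and cache hit ratio). The connection string and the psycopg2 driver are assumptions for the example; MySQL exposes comparable counters via SHOW GLOBAL STATUS.

```python
# Sketch: pull throughput-style metrics from PostgreSQL's pg_stat_database.
# Connection details are placeholders; adapt them to your environment.
import psycopg2

QUERY = """
SELECT datname,
       xact_commit,                                   -- committed transactions
       xact_rollback,                                 -- rolled-back transactions
       blks_hit,
       blks_read,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname IS NOT NULL;
"""

def fetch_throughput_metrics(dsn: str = "dbname=app user=monitor"):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        return cur.fetchall()

if __name__ == "__main__":
    for row in fetch_throughput_metrics():
        print(row)
```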
Now we focus on metrics specific to databases. We should look for deadlocks, locks, timeouts, data distribution, slow queries, errors, and exceptions.
These metrics should be grouped by database and replica. They show the ongoing processes in the database and should be reviewed periodically to make sure we don't have any performance degradation. When things slow down, we should be able to correlate them with other metrics (like throughput or infrastructure) to identify scenarios like "many more users came to the platform, which led to longer query times and higher CPU usage". Building a plausible scenario that explains the metrics is the first step toward improving performance.
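Again for PostgreSQL, a hedged sketch along these lines can surface slow, still-running queries and deadlock counters; the 30-second threshold and the connection details are assumptions for illustration:

```python
# Sketch: spot slow, still-running queries and deadlock counts in PostgreSQL.
# The threshold and DSN are illustrative assumptions.
import psycopg2

SLOW_QUERIES = """
SELECT pid,
       datname,
       now() - query_start AS runtime,
       wait_event_type,                -- non-null when the backend is waiting (e.g. on a lock)
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '30 seconds'
ORDER BY runtime DESC;
"""

DEADLOCKS = "SELECT datname, deadlocks FROM pg_stat_database WHERE datname IS NOT NULL;"

def report(dsn: str = "dbname=app user=monitor") -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SLOW_QUERIES)
        for row in cur.fetchall():
            print("slow:", row)
        cur.execute(DEADLOCKS)
        for row in cur.fetchall():
            print("deadlocks:", row)

if __name__ == "__main__":
    report()
```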
Other metrics include indicators around security (number of users, number of account changes, how often authentication errors occur, configuration changes) and logging (how many errors or exceptions we have, how many errors per specific error code, etc.).
These metrics can help us pinpoint things that can easily go unnoticed (like a brute-force attack), but they should be actioned separately.
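As one hedged example, a small script can count failed logins in the PostgreSQL log; the log path and the exact message text depend on your logging configuration, so treat both as assumptions:

```python
# Sketch: count authentication failures in a PostgreSQL log file.
# The log path and the message pattern depend on your logging configuration;
# both are assumptions here.
from collections import Counter
import re

LOG_FILE = "/var/log/postgresql/postgresql.log"   # hypothetical path
PATTERN = re.compile(r'password authentication failed for user "([^"]+)"')

def count_auth_failures(path: str = LOG_FILE) -> Counter:
    failures: Counter = Counter()
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = PATTERN.search(line)
            if match:
                failures[match.group(1)] += 1
    return failures

if __name__ == "__main__":
    for user, count in count_auth_failures().most_common():
        print(f"{user}: {count} failed logins")
```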
Analyzing database performance metrics may not be trivial. We need to follow multiple leads and incorporate metrics from all the available layers to understand exactly what is going on.
The first thing to understand is that metrics from various layers are connected. It's very rare to see spikes in CPU usage without spikes in other metrics. High CPU usage leads to slower query performance, delayed transactions, more locking contention, or issues with scheduled tasks. To improve database performance, we need to understand all layers of metrics and find the issues that affect the system as a whole.
Second, we need to cure the source of the trouble, not the metric itself. If we see a CPU spike, we need to work on what caused the spike instead of just scaling up the CPU. Obviously, sometimes we do need to scale up our database, but most of the time we can address the causes around it. That's why we need to build a narrative that explains the metrics anomaly across all layers. For instance: the CPU spiked because there were more tasks running. We had more tasks because each task takes longer to complete. Each task takes longer because the amount of data we store in the database doubled over the last three months. With a narrative like this, we can answer the important question: do we want to hold this much data? If yes, then maybe we need to adjust the schedule of the maintenance tasks. If not, then maybe we need to archive some data and make the database smaller again.
Third, we need to remember that metrics are interconnected. High memory usage will often lead to higher CPU usage because the CPU will spend time dealing with fragmentation or cleaning garbage. Similarly, too small a limit on the transaction log length may lead to more disk activity, which degrades performance. Whenever we deal with performance indicators, we need to look for the source of the problem. Acting on a breached metric in isolation only hides the issue for a short time; it does not solve the problem.
Last but not least, we're still humans. We need dashboards and alarms configured for the metrics. However, if we set too many alarms, we will face too much noise and won't be able to act accordingly. Therefore, it's crucial to have metrics and charts that show the reason behind the problem, not raw metrics that we can't reason about. See the Metis dashboard:
This dashboard doesn't try to swamp you with the details of each metric or alarms for every anomaly it detects. Instead, it connects the dots for you and makes sure you have all the important information available instantly. When something goes wrong, you can dig into the details and find what's happening with one of the databases:
This way you can find the problem much more easily, without being overloaded with metrics at the start.
Finally, metrics must be grouped by logical dimensions. You need metrics per database, per replica, per version, per region, per country, grouped by date and time. This makes it easier to look for weekly patterns or country-specific behavior, or even to tune alarms based on the characteristics of a particular region. There is no one-size-fits-all; you need to adjust the solution to your needs.
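One way to attach such dimensions is to emit every metric with labels. The sketch below uses the prometheus_client library as one possible (assumed) tool; any metrics pipeline with tags or labels works the same way:

```python
# Sketch: emit a metric with logical dimensions as labels, here via the
# prometheus_client library (one possible tool; not prescribed by the post).
from prometheus_client import Gauge, start_http_server
import random
import time

QUERY_LATENCY = Gauge(
    "db_query_latency_seconds",
    "Observed query latency",
    ["database", "replica", "region"],   # the dimensions to group and filter by
)

if __name__ == "__main__":
    start_http_server(8000)              # scrape endpoint at :8000/metrics
    while True:
        # Placeholder values; in practice these come from the database views.
        QUERY_LATENCY.labels(database="app", replica="replica-1",
                             region="eu-west").set(random.uniform(0.01, 0.2))
        time.sleep(15)
```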
To implement database performance monitoring the right way, we need to use proper tooling and systems that understand the broader context. We then need to tune alarms and settings, and finally get proper analysis and reporting.
Good tools will give you configurable dashboards that you can adjust to your needs. You can create dashboards per system or per region. You also need to integrate the tools with the rest of your ecosystem: a paging system for alarms, a metrics collector to gather statistics from other data sources, your cloud and infrastructure provider, or even messaging platforms for sharing details easily during status meetings.
Good tooling must be extensible. Each company is different, and you need to be able to configure the platform to match your needs, especially in heterogeneous environments that utilize multiple platforms, databases, or operating systems.
Finally, good tooling must analyze data in real time. We can't let the metrics be delayed or out of date. They must be as accurate as possible and always reflect the current state of the system.
When an anomaly is detected, the performance monitoring system must alert us. Each alert must be configurable in multiple ways: how early it is triggered, how many data points it needs, whether it alerts on missing metrics, who is paged, how it is escalated, what the SLA for the fix is, and many more settings that each team needs to be aware of.
However, the most important part of an alert is its thresholds. We need to be able to configure which anomalies are acceptable and which should be actioned immediately. Finding the right thresholds isn't trivial and requires a lot of experience and hard work. Machine-learning-based solutions also help because they can run anomaly detection for us based on observed patterns.
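As a minimal stand-in for such anomaly detection (a simple statistical baseline, not the machine-learning approach itself), a rolling mean and standard deviation can already flag unusual data points:

```python
# Sketch: a simple statistical baseline for anomaly detection on a metric
# series (deviation beyond 3 standard deviations from the recent mean).
# This is a stand-in for the ML-based approaches mentioned above.
from statistics import mean, stdev

def is_anomaly(history: list[float], value: float, sigmas: float = 3.0) -> bool:
    """Flag a new data point that deviates strongly from recent history."""
    if len(history) < 10:          # not enough data points to judge yet
        return False
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return value != mu
    return abs(value - mu) > sigmas * sd

if __name__ == "__main__":
    cpu_history = [62, 65, 64, 66, 63, 61, 67, 65, 64, 66]
    print(is_anomaly(cpu_history, 68))   # False: within normal variation
    print(is_anomaly(cpu_history, 95))   # True: likely worth alerting on
```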
Last but not least, we need reports about performance. These reports should be reviewed manually every now and then. The team should take a look at the number of breaches, look for common sources of issues, and find ways to prevent them in the long run. Some alerts can be ignored because they resolve themselves automatically; however, if they happen too often, we should either relax the constraints or fix the underlying problems.
Reports need to be generated automatically and include enough historical data to show the big picture. Performance tuning is never finished. We need to look for long-term trends and changes that we can improve in the future.
A good database performance monitoring system covers a wide range of metrics, including metrics around infrastructure, throughput, database performance, and operations. It needs to allow for configurable alerts that do not swamp administrators with raw numbers but show actual issues with performance. It needs proper tooling that we can configure to our needs, alerts based on automated anomaly detection, and reporting that we can review periodically to find recurring issues.
To improve database performance, we need to build a narrative that explains the metrics across all layers. This is a never-ending story that we need to repeat every now and then. Our datasets grow over time, and our business gets bigger and bigger. We can't solve these problems once and for all; we need to constantly monitor and improve. Good visualization and tooling that understands databases is the way to go for building approachable monitoring metrics. It also increases our understanding of the system and makes sure we run smoothly over time.
Start building your performance monitoring today. Use Metis as a base for your observability dashboards. Explore our other resources about configuring your database to avoid common issues.
Quick recap of the metrics discussed above:
- Metric groups: infrastructure (memory, CPU, disk, network), throughput (connections, transactions, files, logs, caches, pages), database (deadlocks, locks, timeouts, data distribution, slow queries, errors, exceptions), and operational (users, authentications, authorizations, connections).
- Primary metrics are aggregated numbers per server or database engine; secondary metrics are split by region, country, date and time, version, language, etc.
- CPU is rarely a bottleneck in itself. More often it indicates issues with configuration, transactions, data size growth, or other aspects of the running configuration.
- Throughput metrics worth tracking: number of writes, I/O queue length and latency, disk size, number of spills to disk, number of network connections, and total network utilization.