@milapsingh ・ Dec 01,2023 ・ 5 min read ・ 625 views ・ Originally posted on dyte.io
Learn how we streamlined our NRQL alert setup process by adopting Terraform and the problems we tackled to enhance the scope of alerts.
At Dyte, we have recognized the potential of Terraform in streamlining our alert setup process. By adopting Terraform, we have empowered our engineering teams to set alerts for their respective services without relying on the SRE/DevOps team.
Setting up alerts in New Relic by hand is tedious and repetitive. New Relic offers a Terraform provider, so alerts can instead be defined and created as code.
Adopting Terraform has also enabled us to better manage our New Relic alerts, with greater control and flexibility in setting up thresholds and conditions. This has led to a more efficient and streamlined process, allowing our teams to focus on other essential tasks. Overall, it has been a game-changer for us, helping us to automate and optimize our alert setup process.
What was our problem in setting New Relic alerts for services?
Our infrastructure consists of multiple clusters, each offering a range of services. The issue we ran into was setting NRQL (New Relic Query Language) alerts with a different threshold per cluster without repeating the same code for each one. The New Relic Terraform provider addresses this concern: we define the alert condition once and configure it with different thresholds across all of our clusters, without writing the same code over and over again.
How have we done it?
NRQL queries let us filter and aggregate data from our applications, infrastructure, and services, so we can create targeted alerts specific to our needs. In provider.tf, we first set up the provider.
terraform {
  required_version = ">= 0.12"
  required_providers {
    newrelic = {
      source  = "newrelic/newrelic"
      version = "~> 3.11.0"
    }
  }
}
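The provider also needs an account and credentials to talk to New Relic. A minimal sketch of how the rest of provider.tf might look, assuming the values are passed in as input variables: var.newrelic_account_id matches the name used in the alert resource later in this post, while var.newrelic_api_key and the region value are illustrative assumptions.

```hcl
# Assumed variable declarations; only newrelic_account_id appears in this post.
variable "newrelic_account_id" {
  type        = number
  description = "New Relic account that owns the alert conditions"
}

variable "newrelic_api_key" {
  type        = string
  description = "New Relic User API key (hypothetical variable name)"
  sensitive   = true # the sensitive flag requires Terraform 0.14 or later
}

provider "newrelic" {
  account_id = var.newrelic_account_id
  api_key    = var.newrelic_api_key
  region     = "US" # assumption; use "EU" for EU-region accounts
}
```

Keeping the key in a variable means it can come from the CI environment or the workspace settings instead of being committed to the repository.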
We used a simple local value and looped over it to set up New Relic alerts for our various services. Each object in the map carries the values for one service, such as its cluster name, policy, and threshold, so every alert is tailored to the needs of the service it watches.
Overall, this approach has made it much easier for our engineering teams to set up their own alerts without relying on SRE/DevOps teams.
locals {
  alerts = {
    cluster-1 = {
      cluster_name       = "cluster-1"
      policy_id          = "policy_id" # placeholder for a real alert policy ID
      threshold          = 5
      threshold_duration = 300
      aggregation_window = 60
      description        = "Description about alerts"
      operator           = "above"
    }
    cluster-2 = {
      cluster_name       = "production-us"
      # Policy ID looked up through a newrelic_alert_policy data source
      policy_id          = data.newrelic_alert_policy.production_infra_alert_policy.id
      threshold          = 5
      threshold_duration = 300
      aggregation_window = 60
      description        = "Log link: https://one.newrelic.com/logger?account=xxxxx&duration=1800000&state=xxxxxxxxxxxxx"
      operator           = "above"
    }
  }
}
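The cluster-2 entry reads its policy ID from a data source that is not shown in the snippet. A minimal sketch of what that lookup could be, assuming the policy already exists in New Relic; the policy name here is a placeholder, not the real one:

```hcl
# Looks up an existing alert policy by name so its ID can be referenced
# as data.newrelic_alert_policy.production_infra_alert_policy.id.
data "newrelic_alert_policy" "production_infra_alert_policy" {
  name = "Production Infra Alerts" # placeholder policy name
}
```

Using a data source instead of a hard-coded ID keeps the configuration readable and avoids copying IDs out of the New Relic UI.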
We then create one alert condition per entry by looping over this local value with for_each.
resource "newrelic_nrql_alert_condition" "alert" {
  for_each                       = local.alerts
  account_id                     = var.newrelic_account_id
  policy_id                      = each.value.policy_id
  type                           = "static"
  name                           = "${each.value.cluster_name}-error-percentage"
  description                    = each.value.description
  enabled                        = true
  violation_time_limit_seconds   = 1800
  aggregation_window             = each.value.aggregation_window
  aggregation_method             = "event_flow"
  aggregation_delay              = 1
  open_violation_on_expiration   = false
  close_violations_on_expiration = false
  nrql {
    query = "SELECT percentage(count(*),where message like '%error%' AND message NOT LIKE '%ignore%') FROM Log WHERE cluster_name = '${each.value.cluster_name}'"
  }
  critical {
    operator              = each.value.operator
    threshold             = each.value.threshold
    threshold_duration    = each.value.threshold_duration
    threshold_occurrences = "ALL"
  }
}
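Most fields in local.alerts are the same for every cluster. One possible refinement, sketched here under assumptions and not necessarily the exact setup described in this post, is to keep the shared values in one place and merge per-cluster overrides on top. The names alert_defaults, alert_overrides, and alerts_merged, and the overridden threshold of 10, are hypothetical.

```hcl
locals {
  # Values shared by every cluster unless overridden below.
  alert_defaults = {
    threshold          = 5
    threshold_duration = 300
    aggregation_window = 60
    description        = "Description about alerts"
    operator           = "above"
  }

  # Only the fields that differ per cluster.
  alert_overrides = {
    cluster-1 = {
      cluster_name = "cluster-1"
      policy_id    = "policy_id" # placeholder
    }
    cluster-2 = {
      cluster_name = "production-us"
      policy_id    = "policy_id" # placeholder
      threshold    = 10 # hypothetical per-cluster threshold
    }
  }

  # merge() lets later arguments win, so overrides replace defaults.
  alerts_merged = {
    for name, overrides in local.alert_overrides :
    name => merge(local.alert_defaults, overrides)
  }
}
```

With this in place, the resource would use for_each = local.alerts_merged, and adding an alert for a new cluster needs only the two or three fields that are specific to it.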
How are we using heredoc (EOT) for the NRQL query?
When a complex query sits on a single line, the code is hard to read and understand. To help with this, we can use heredoc strings (EOT is the delimiter identifier; see the Terraform documentation on heredoc strings for more detail) to write multi-line queries. This is especially useful for NRQL queries with multiple conditions, as each condition gets its own line and the query becomes easier to comprehend.
The syntax for an NRQL heredoc query
query = <<EOT
SELECT count(*)
FROM Log
WHERE cluster_name = '${local.cluster}'
AND `container_name` = 'Name'
AND level = 'ERROR'
AND NOT (
  `message` LIKE '%condition name 1%'
  OR `message` LIKE '%condition name 2%'
)
EOT
We used a heredoc for our NRQL query to make it clearer and more readable. A code reviewer should be able to spot a change to the query easily; when the whole query is on one line, the diff is hard to compare.
A problem we faced with the heredoc syntax <<EOT...EOT is that any indentation in the Terraform file is kept as part of the string, so the query shown in the New Relic console has leading spaces, which is, again, not so readable. We then switched to the indented form <<-EOT...EOT, which trims the leading indentation shared by all lines of the query.
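A small sketch of the indented form, so the difference is visible. The local names cluster and error_query and the cluster value are hypothetical; the query is a shortened version of the one above.

```hcl
locals {
  cluster = "cluster-1" # hypothetical value

  # With <<-EOT, Terraform finds the line with the fewest leading spaces
  # and trims that many spaces from every line, so the query can be
  # indented along with the surrounding code.
  error_query = <<-EOT
    SELECT count(*)
    FROM Log
    WHERE cluster_name = '${local.cluster}'
    AND level = 'ERROR'
  EOT
}

# Run `terraform output error_query` to inspect the resulting string.
output "error_query" {
  value = local.error_query
}
```

Lines that are indented further than the rest keep their extra indentation, which is handy for nested conditions inside parentheses.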
How to enrich alerts with the information we are receiving?
To ensure that our live-support team can quickly and accurately identify the issue and affected service, we have conducted numerous iterations to define the alerts we receive. By doing so, we have focused on providing rich and detailed information within each alert, including contextual details, relevant error codes, and any other pertinent data that could assist the support team in resolving the issue.
We started putting this information in the description field of each alert condition, directly in the IaC (Infrastructure as Code) definition. You can see these values on the alert, along with its tags.
Visit New Relic for more detail.
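As a sketch of what such a description could look like, the value below packs a short summary, a log link, and a runbook pointer into one multi-line string. The local name enriched_description, the summary wording, and the runbook URL are made-up placeholders; the log link is the same redacted one used in the locals example earlier.

```hcl
locals {
  # Hypothetical description; pass it as the description of an alert entry.
  enriched_description = <<-EOT
    Error percentage in production-us is above the configured threshold.
    Log link: https://one.newrelic.com/logger?account=xxxxx&duration=1800000&state=xxxxxxxxxxxxx
    Runbook: https://example.com/runbooks/error-percentage
  EOT
}
```

Because the description lives next to the thresholds in code, changes to it go through the same review as any other change to the alert.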
How is the CI/CD flow working?
The core problem with standard infra provisioning is that other teams are blocked on the DevOps team for their tasks or features. As DevOps, we like to automate things and resolve all kinds of blockers. To solve this particular problem, we implemented CI/CD using GitHub Actions.
By automating the whole CI/CD pipeline for infra, we have unblocked all other teams: they code the resources they need and simply raise a PR on our infra GitHub repo. After the PR is raised, a few steps run to provision the infra.
Note: We are using Terraform Cloud to store our Terraform state files.
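For readers who want to do the same, this is roughly how state storage in Terraform Cloud can be configured. It is a sketch under assumptions: the organization and workspace names are hypothetical, and the settings would be merged into the existing terraform block in provider.tf rather than kept as a second block.

```hcl
terraform {
  # The "remote" backend stores state in Terraform Cloud.
  backend "remote" {
    hostname     = "app.terraform.io"
    organization = "example-org" # hypothetical organization name

    workspaces {
      name = "newrelic-alerts" # hypothetical workspace name
    }
  }
}
```

On Terraform 1.1 and later, the cloud block is the newer way to express the same connection; the remote backend is shown here because it also works with older versions.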
Setting alerts manually is a hectic and repetitive task for DevOps/SRE teams in any organization. In this blog, we explained how to use Terraform to create NRQL alerts as code, in an automated manner. We also covered heredocs, which make NRQL queries more readable and easier to understand. In the last part, we described the CI/CD side of IaC and how easily other dev teams can add alerts for their applications without depending on the SRE/DevOps team.
I hope you found this post resourceful. If you have any thoughts or feedback, please get in touch with me on LinkedIn and Twitter. Stay tuned for more related blog posts in the future!
If you haven't heard about Dyte yet, head over to dyte.io to learn how we are revolutionizing communication through our SDKs and libraries and how you can get started quickly on your 10,000 free minutes, which renew every month. You can reach us at support@dyte.io or ask our developer community if you have any questions.