Transcript
This transcript was autogenerated.
Hello. This talk is about actionable alerts
and how we can use them to automate responses
to incidents. Alerting is usually perceived as a system-to-human notification solution. However, alerts are not only about messaging. This talk looks into maximizing the efficiency of alerts through better targeting and reduced response times. My name is Leonid. I'm a senior developer relations engineer at Google Cloud. In my work I help improve workload observability using cloud infrastructure, telemetry, alerting, and other services and tools. You can reach me via LinkedIn, through my posts on Medium, or using the link to the feedback form on the last slide of this presentation.
Being actionable is one of the properties of efficient alerts. So first we will talk about what efficient alerting is and how we can make our alerts efficient. Then we will look into when an alert response can be automated and how to use an alerting solution for automating responses. At the end, I will demonstrate how to implement automated responses using the alerting solution in Google Cloud. Let's begin.
Efficient means achieving maximum productivity with minimum wasted effort or expense. We can ask: isn't every alert like that? The truth is that many companies report that their alerts overload support systems. Processing a large number of alerts leads to extended response times and churn in DevOps and SRE teams. Companies end up changing the priorities of their alerts, which in turn results in opening tickets or bugs as an automatic response to new alerts. Many such bugs end up abandoned, increasing the technical debt of engineering teams. Alerting should help us address urgent problems and fix the problems we find quickly.
So what does it mean to have an efficient alert? Let's look into the three components of the alert. "When" focuses on the alerting condition; it is responsible for controlling the frequency and relevancy of created alerts. "What" is responsible for the information related to the alert; the right information can increase the efficiency of the alert's outcome and improve the time-to-resolve metric of alerts. The last one, "how", covers the messaging part of the alert; notifying about the alert can take different forms and formats. Let's look closer into each of these components of the efficient alert.
One of the methods to make alerts efficient is not to raise them too frequently. Frequently raised alerts are like the story about the shepherd boy who cried "Wolf! Wolf!" all the time. When to raise an alert depends on several factors. A good practice is to start from the end: think about the relevance of the alert. What is the problem this alert discovers? Does this problem affect the users of our system? Does this alert catch a real problem, or does it respond to some transient change in the system state? Review the signals used to identify the problem and identify the metrics best suited to discover it. A good practice is to use metrics that reflect the user experience of the system and not the system's resources. When unsure, consider using one of the monitoring golden signals: latency, traffic, errors, and saturation.
Another factor influencing the efficiency of when to raise an alert is using a threshold versus a tendency. Sometimes the speed of deterioration can be a much more useful signal for an alert than a simple threshold. We always want to condition raising an alert on observing the same condition over a predefined period of time. Consider triggering an alert when a request latency reaches 10 seconds versus triggering an alert when the average latency is above 10 seconds for the last two minutes. Okay, 10 seconds may be too much. The first condition will result in many accidental alerts when some one-time delays happen that don't actually affect any users, while the second condition will provide a clear signal that the system's responses are delayed.
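As an illustration, here are the two conditions expressed in PromQL (the Prometheus query language, which also appears later in the demo), assuming a hypothetical gauge metric named request_latency_seconds:

    # Naive: fires whenever the latest latency sample exceeds 10 seconds.
    request_latency_seconds > 10

    # Better: fires only when latency averaged over the last two minutes exceeds 10 seconds.
    avg_over_time(request_latency_seconds[2m]) > 10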
When implementing an alert, it is important to think about
the outcome. We want the response to the alert to be efficient too. In other words, the alert
should contain enough information to ensure successful resolution
in a reasonably short time. For example,
an alert saying that CPU usage is above 90% for the last five minutes is really hard to handle. However, if the alert comes with a link to a dashboard capturing all processes consuming that CPU for the last five minutes, the mitigation of the problem will be much faster and easier. We can't always know what type of information will be useful when responding to the alert. As a rule of thumb,
all relevant information about the system that we can collect
is useful. We often reference this information
as context or metadata.
Additional materials like playbooks, team chats, and other supporting information, when included as part of the alert data, can increase alert efficiency. When an alert notification provides enough information for an efficient response, we can call such an alert actionable. Remember that information that is easy to obtain at the moment the alert fires is often very hard or even impossible to discover later, when it is time to respond to the notification.
How to alert also plays an important role in enabling actionable alerts. The "how" can enable the response to be automated and control the possible outcome of the alert. The selected method should be resilient in delivering the alert notification, without unexpected delivery delays. The method also has to support an escalation path in the event of a notification failure, or if a response doesn't happen within the expected time interval. Needless to say, an efficient notification method should be able to deliver all the alert information as well. Since many alert responses are subject to post-mortem analysis, the "how" should include an auditing feature as well. Actionable alerts can leverage multiple notification methods to increase efficiency: alerts can be sent via several channels while supporting distinct formatting of the alert information for each of them.
As I previously mentioned, when talking about alerts,
we mainly mean system-to-human communication, where engineers
manually identify problems behind the alert conditions
and remediate them. What about automating the
response to the alert notification? Think about
it. Automation is already a big part of alerts.
The process of raising an alert is automated.
Instead of having a DevOps engineer looking into histograms and bar charts on a dashboard, tracking gauges, curves, and numbers in order to start responding if something goes wrong, we have software that does it all for us. What is left to the engineers is to respond, and even that part doesn't always happen. What about notifications that open a bug or a ticket? What is that, if not an automated response to the alert? Let's look at the classic example of an alert: an on-call engineer receives a page or another type of notification that requires them to address the problem immediately.
A simplified algorithm that they follow looks like this: first, identify the type of the problem or its cause; second, find the playbook with instructions for this problem type; and third, execute the instructions.
Note that with efficient alerts, the first two steps are optimized into a single step, because the playbook is already part of the information about the alert. So the difference between a human engineer and an automated response is what is written in the playbook, or whether there is a playbook in the first place. One of the oldest playbooks system engineers know is a simple restart of the problematic instance. Do we really need a human to do this today? Obviously not. But can any alert response be automated? There are distinctive signs that an alert response can be automated:
the alert is actionable, meaning the three components of the alert, when, what, and how, follow the efficient alert principles; the cause of the alert is deterministic and can be derived from the information captured with the alert; and the response does not require human decisions or interactions. And vice versa: signs that an alert cannot be automated include an ambiguous cause, despite the alert being actionable, or a response that expects some human interaction or decision-making. Arguments such as "alerts are not intended for automation", "high priority alerts require human attention", or "alerts with potentially high business impact should be handled by humans" should not influence the decision about automating alert responses. Of course, automated responses are not guaranteed to be free of errors. However, the majority of these errors are human errors.
Let's quickly review the three alert components for actionable alerts with automated responses. Alerts should be triggered on individual resources to narrow the problem area and simplify automation. Other metrics and time windows should serve to determine the exact cause of the alert. Further, all resource metadata and all alert conditions should be captured to provide detailed context that helps discover the cause. The notification channel should be resilient, asynchronous, and support formatting all the alert information in a parsable format such as JSON.
Now it is time to demonstrate how actionable alerts help with automating the responses. To those of you who are not familiar with Google Cloud, don't worry: most cloud providers have alerting solutions that let you automate responses. The alerting solution in Google Cloud is part of the Cloud Monitoring service. Cloud Monitoring implements alerting policies using the three-component model, when, what, and how, that we reviewed on the previous slides. It also supports multiple notification mechanisms, called notification channels, including sending emails, texting a phone number, asynchronous messaging using the Pub/Sub service, posting to a Slack channel or Google Chat, and integration
with PagerDuty. The application used for this demo has a single HTTP endpoint and is deployed to Cloud Functions. Cloud Functions is a serverless solution to run your code, similar to Lambda in AWS and serverless functions in Azure. You can see the code of the application, written in Go, on this slide.
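Since the slide itself isn't visible in the transcript, here is a minimal sketch of what such a handler could look like, assuming the Functions Framework for Go; the actual demo code on the slide may differ:

    package function

    import (
        "fmt"
        "net/http"

        "github.com/GoogleCloudPlatform/functions-framework-go/functions"
    )

    func init() {
        // Register the HTTP function with the Functions Framework.
        functions.HTTP("Ping", ping)
    }

    // ping returns "ping" with a 200 status for the /ping path
    // and an empty 204 response for any other path.
    func ping(w http.ResponseWriter, r *http.Request) {
        if r.URL.Path == "/ping" {
            w.WriteHeader(http.StatusOK)
            fmt.Fprint(w, "ping")
            return
        }
        w.WriteHeader(http.StatusNoContent)
    }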
Besides initialization code, it has a single function that returns the string "ping" along with a 200 status for requests coming to the path /ping, and it responds with a 204 status to any other path. The alerting policy condition can be implemented using the proprietary Cloud Monitoring query language. As you can see, it requires some deciphering. To make it easier, let's look at the same condition written using the Prometheus query language: Cloud Monitoring supports querying metrics and defining alert conditions using PromQL. If you are not familiar with either of these languages, you will have to take my word for what this query does.
The condition uses the predefined Cloud Functions metric that counts function executions. This metric has a label, status, which gets the value "ok" each time a function execution completes with an HTTP status in the range 200 to 299. Following the best practices that were previously explained, this condition tracks the ratio between error responses and all responses of the function executions. When this ratio is above 20%, an alert is raised. The alert policy uses an email notification channel to send information about the alert in a human-readable format. The email is sent once the condition is fulfilled.
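For reference, an error-ratio condition like the one described might look roughly like this in PromQL; the metric name follows Cloud Monitoring's naming convention for the Cloud Functions execution count, and the five-minute window is illustrative rather than taken from the demo:

    # Ratio of non-OK executions to all executions over the last five minutes;
    # the alert fires when it exceeds 20%.
    sum(rate(cloudfunctions_googleapis_com:function_execution_count{status!="ok"}[5m]))
      / sum(rate(cloudfunctions_googleapis_com:function_execution_count[5m])) > 0.2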
So what would be the best course of action to mitigate such an alert when the email is received? There are several scenarios. In our case, we want to review how to automate the response, so let's simplify the demo and assume that we enable this alert each time we deploy a new version of the Cloud Function. Such a step can easily be incorporated into the deployment pipeline. In this case, the fastest mitigation would be rolling back to the previous version of the function. Let's now check if we can automate the response to this alert following our decision tree. The alert is actionable.
It monitors the resource, because luckily for us, our application and the resource are essentially the same. It captures the resource metadata and the alert context as an out-of-the-box feature of the cloud alerting solution. It has a notification channel that supports sending asynchronous messages reliably, the Pub/Sub service, which can also send all the data about the alert in JSON form. The conditions of the alert let us deterministically pin the cause of the alert to the fact that the current version of the Cloud Function generates too many errors. That makes the cause of the alert 100% predictable.
Mind that we don't need to find out the root cause of the problem. The cause of the alert we are interested in should be sufficient to effectively mitigate the problem as fast as possible. And at last, we can roll back to the previous version of the Cloud Function without human intervention using the cloud API. In this demo, the rollback is implemented using a Cloud Build job. The Cloud Build job can be triggered by receiving a Pub/Sub message sent via the alert notification channel. The script of the job is not important for this demo; if you are interested in the implementation, you can find the script in my post on Medium using the link shown in the wrap-up slide.
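For orientation, the Pub/Sub message such a job receives carries the alert data as JSON. A Cloud Monitoring alert notification payload looks roughly like the following; the field names reflect the documented format, but the values here are illustrative and abridged, not taken from the demo:

    {
      "version": "1.2",
      "incident": {
        "incident_id": "0.abcdef123456",
        "policy_name": "Cloud Function error ratio",
        "condition_name": "Error ratio above 20%",
        "state": "open",
        "started_at": 1700000000,
        "resource": {
          "type": "cloud_function",
          "labels": { "function_name": "ping" }
        },
        "summary": "Error ratio for the function is above the threshold."
      }
    }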
What you see right now is a Google Cloud console window. The first thing we need to do is to modify the alert policy and add a notification channel that sends notifications to Cloud Pub/Sub. For our convenience, I've already created and defined the Pub/Sub topic that will be receiving the notifications. It is straightforward. We save the changes. Next we go to Cloud Build and create a new trigger that implements the automation. The trigger works by receiving notifications from the Pub/Sub topic, and the notifications it handles use the same topic that was used for the notification channel. Now we simulate a change in the source code of the Cloud Function. The current function responds with 200 statuses; we change it to respond to any request with a 400 status. It will take some time to deploy. Once it deploys, we will be able to see the changes in the alert monitoring.
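The simulated regression is essentially a one-line change to the handler; a sketch of what it could look like (not the exact diff from the demo):

    // ping now fails every request with HTTP 400 to simulate a bad release.
    func ping(w http.ResponseWriter, r *http.Request) {
        http.Error(w, "simulated failure", http.StatusBadRequest)
    }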
When we look into the alert policy, we can see a dashboard that tracks the changes. Let's check whether the Cloud Function deployed successfully. Not yet. Okay, the Cloud Function is deployed, and it starts responding with errors. Now we can return to the alert dashboard. Let's fast forward a little so we'll be able to see what is going on. The responses are not immediate, because they are aggregated by the infrastructure, and now you can see the incident was reported. If we go to Cloud Build, we can see the triggered job's script executing right now. What the job does is simply roll back to the previous version of the Cloud Function. As you can see, the version is already in the process of deploying, and in a moment we will fast forward to see that the original version that responds with 200 statuses is there. Returning to the alerting window, we can see that after the metric returned to normal, below the threshold, the alerting solution automatically marks the incident as closed. Let me summarize.
Efficient alerts are the key to quick detection of and response to discovered threats. Not every alert is efficient. Validate your alert configurations to make sure they respond to actual problems that affect your users, not just your systems. Narrow your conditions to maximize the chances of detecting the alert cause. Capture service and environment context with the alert, and enrich the alert information with additional materials that will be helpful in responding to the alert notifications. Use auditable notification methods that are able to reliably deliver your alert information for a timely response. You have to make your alerts actionable in order to implement automatic responses, but not every actionable alert can be automated. Look to automate when the cause of the alert is deterministic and the response does not require human intervention. Be aware of your service provider's capabilities to optimize your alert implementation. If you have any questions or comments about this talk, other feedback, or you just want to get in touch, scan this QR code and write your feedback in the anonymous Google form.
Thank you.