Transcript
This transcript was autogenerated.
Hello. This talk is about actionable alerts
and how we can use them to automate responses
to incidents. Alerting is usually perceived as a system-to-human notification solution. However, alerts are not only about messaging. This talk looks into maximizing the efficiency of alerts through better targeting and reduced response times. My name is Leonid. I'm a senior developer relations engineer at Google Cloud. In my work I help improve workload observability using cloud infrastructure, telemetry, alerting, and other services and tools. You can reach me via LinkedIn, through my posts on Medium, or using the link to the feedback form on the last slide of this presentation.
Being actionable is one of the properties of efficient alerts. So first we will talk about what efficient alerting is and how we can make our alerts efficient. Then we will look into when an alert response can be automated and how to use an alerting solution for automating responses. At the end, I will demonstrate how to implement automated responses using the alerting solution in Google Cloud. Let's begin.
Efficient means achieving maximum productivity with minimum wasted effort or expense. We can ask: isn't every alert like that? The truth is that many companies report that their alerts overload support systems. Processing a large number of alerts leads to extended response times and churn in DevOps and SRE teams. Companies end up changing the priorities of their alerts, which in turn results in opening tickets or bugs as an automatic response to new alerts. Many such bugs end up abandoned, increasing the technical debt of engineering teams. Alerting should help us address urgent problems and fix the problems we find quickly.
So what does it mean to have an efficient alert? Let's look into the three components of the alert. "When" focuses on the alerting condition; it is responsible for controlling the frequency and relevancy of created alerts. "What" is responsible for the information related to the alert; the right information can increase the efficiency of the alert's outcome and improve the time-to-resolve metric of alerts. The last one, "how", covers the messaging part of the alert; notifying about the alert can take different forms and formats. Let's look closer into each of these components of the efficient alert.
One of the methods to make alerts efficient is not to raise them too frequently. Frequently raised alerts are like the story about the shepherd boy who cried "Wolf! Wolf!" all the time. When to raise an alert depends on several factors. A good practice is to start from the end: think about the relevance of the alert. What is the problem this alert discovers? Does this problem affect the users of our system? Does this alert catch a real problem, or does it respond to some transient change in the system state? Review the signals used to identify the problem and identify the metrics best suited to discover it. A good practice is to use metrics that reflect the user experience of the system and not the system's resources. When unsure, consider using one of the monitoring golden signals: latency, traffic, errors, and saturation.
Another factor influencing the efficiency of when to raise an alert is using a threshold versus a tendency. Sometimes the speed of deterioration can be a much more useful signal for an alert than a simple threshold. We always want to condition raising an alert on observing the same condition over a predefined period of time. Consider triggering an alert when a request latency reaches 10 seconds versus triggering an alert when the average latency is above 10 seconds for the last two minutes. Okay, 10 seconds may be too much. The first condition will result in many accidental alerts when some one-time delays happen that don't actually affect any users, while the second condition will provide a clear signal that the system's responses are delayed.
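As an illustration, here are the two conditions expressed in PromQL (the Prometheus query language, which also appears later in the demo), assuming a hypothetical gauge metric named request_latency_seconds:

    # Naive: fires whenever the latest latency sample exceeds 10 seconds.
    request_latency_seconds > 10

    # Better: fires only when latency averaged over the last two minutes exceeds 10 seconds.
    avg_over_time(request_latency_seconds[2m]) > 10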
When implementing an alert, it is important to think about
the outcome. We want the response to the alert to be efficient too. In other words, the alert
should contain enough information to ensure successful resolution
in a reasonably short time. For example,
an alert saying that CPU usage is above 90% for the last five minutes is really hard to handle. However, if the alert comes with a link to a dashboard capturing all processes consuming that CPU for the last five minutes, the mitigation of the problem will be much faster and easier. We can't always know what type of information will be useful when responding to the alert. As a rule of thumb,
all relevant information about the system that we can collect
is useful. We often reference this information
as context or metadata.
Additional materials like playbooks, team chats, and other supporting information, when included as part of the alert data, can increase alert efficiency. When an alert notification provides enough information for an efficient response, we can call such an alert actionable. Remember that information that is easy to obtain at the moment the alert fires is often very hard or even impossible to discover later, when it is time to respond to the notification.
How to alert also plays an important role in enabling actionable alerts. The "how" can enable the response to be automated and control the possible outcome of the alert. The selected method should be resilient in delivering the alert notification, without unexpected delivery delays. The method also has to support an escalation path in the event of a notification failure, or if a response doesn't happen within the expected time interval. Needless to say, an efficient notification method should be able to deliver all the alert information as well. Since many alert responses are subject to post-mortem analysis, the "how" should include an auditing feature as well. Actionable alerts can leverage multiple notification methods to increase efficiency: alerts can be sent via several channels while supporting distinct formatting of the alert information for each of them.
As I previously mentioned, when talking about alerts,
we mainly mean system-to-human communication, where engineers
manually identify problems behind the alert conditions
and remediate them. What about automating the
response to the alert notification? Think about
it. Automation is already a big part of alerts.
The process of raising an alert is automated.
Instead of having a DevOps engineer looking into histograms and bar charts on a dashboard, tracking gauges, curves, and numbers in order to start responding if something goes wrong, we have software that does it all for us. What is left to the engineers is to respond, and even that part doesn't always happen. What about notifications that open a bug or a ticket? What is that, if not an automated response to the alert? Let's look at the classic example of an alert: an on-call engineer receives a page or another type of notification that requires them to address the problem immediately.
A simplified algorithm that they follow looks like this: first, identify the type of the problem or its cause; second, find the playbook with instructions for this problem type; and third, execute the instructions.
Note that with efficient alerts, the first two steps are optimized into a single step, because the playbook is already part of the information about the alert. So the difference between a human engineer and an automated response is what is written in the playbook, or whether there is a playbook in the first place. One of the oldest playbooks system engineers know is a simple restart of the problematic instance. Do we really need a human to do this today? Obviously not. But can any alert response be automated? There are distinctive signs that an alert response can be automated:
the alert is actionable, meaning the three components of the alert, when, what, and how, follow the efficient alert principles; the cause of the alert is deterministic and can be derived from the information captured with the alert; and the response does not require human decisions or interactions. And vice versa: signs that an alert cannot be automated include an ambiguous cause, despite the alert being actionable, or a response that expects some human interaction or decision-making. Arguments such as "alerts are not intended for automation", "high priority alerts require human attention", or "alerts with potentially high business impact should be handled by humans" should not influence the decision about automating alert responses. Of course, automated responses are not guaranteed to be free of errors. However, the majority of these errors are human errors.
Let's quickly review the three alert components for actionable alerts with automated responses. Alerts should be triggered on individual resources to narrow the problem area and simplify automation. Other metrics and time windows should serve to determine the exact cause of the alert. Further, all resource metadata and all alert conditions should be captured to provide detailed context that helps discover the cause. The notification channel should be resilient, asynchronous, and support formatting all the alert information in a parsable format such as JSON.
Now it is time to demonstrate how actionable alerts help with automating the responses. To those of you who are not familiar with Google Cloud, don't worry: most cloud providers have alerting solutions that let you automate responses. The alerting solution in Google Cloud is part of the Cloud Monitoring service. Cloud Monitoring implements alerting policies using the three-component model, when, what, and how, that we reviewed on the previous slides. It also supports multiple notification mechanisms, called notification channels, including sending emails, texting a phone number, asynchronous messaging using the Pub/Sub service, posting to a Slack channel or Google Chat, and integration
with PagerDuty. The application used for this demo has a single HTTP endpoint and is deployed to Cloud Functions. Cloud Functions is a serverless solution to run your code, similar to Lambda in AWS and serverless functions in Azure. You can see the code of the application, written in Go, on this slide.
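Since the slide itself isn't visible in the transcript, here is a minimal sketch of what such a handler could look like, assuming the Functions Framework for Go; the actual demo code on the slide may differ:

    package function

    import (
        "fmt"
        "net/http"

        "github.com/GoogleCloudPlatform/functions-framework-go/functions"
    )

    func init() {
        // Register the HTTP function with the Functions Framework.
        functions.HTTP("Ping", ping)
    }

    // ping returns "ping" with a 200 status for the /ping path
    // and an empty 204 response for any other path.
    func ping(w http.ResponseWriter, r *http.Request) {
        if r.URL.Path == "/ping" {
            w.WriteHeader(http.StatusOK)
            fmt.Fprint(w, "ping")
            return
        }
        w.WriteHeader(http.StatusNoContent)
    }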
Besides initialization code, it has a single function that returns the string "ping" along with a 200 status for requests coming to the path /ping, and it responds with a 204 status to any other path. The alerting policy condition can be implemented using the proprietary Cloud Monitoring query language. As you can see, it requires some deciphering. To make it easier, let's look at the same condition written using the Prometheus query language: Cloud Monitoring supports querying metrics and defining alert conditions using PromQL. If you are not familiar with either of these languages, you will have to take my word for what this query does.
The condition uses the predefined Cloud Functions metric that counts function executions. This metric has a label, status, which gets the value "ok" each time a function execution completes with an HTTP status in the range 200 to 299. Following the best practices that were previously explained, this condition tracks the ratio between error responses and all responses of the function executions. When this ratio is above 20%, an alert is raised. The alert policy uses an email notification channel to send information about the alert in a human-readable format. The email is sent once the condition is fulfilled.
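For reference, an error-ratio condition like the one described might look roughly like this in PromQL; the metric name follows Cloud Monitoring's naming convention for the Cloud Functions execution count, and the five-minute window is illustrative rather than taken from the demo:

    # Ratio of non-OK executions to all executions over the last five minutes;
    # the alert fires when it exceeds 20%.
    sum(rate(cloudfunctions_googleapis_com:function_execution_count{status!="ok"}[5m]))
      / sum(rate(cloudfunctions_googleapis_com:function_execution_count[5m])) > 0.2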
So what would be the best course of action to mitigate such an alert when the email is received? There are several scenarios. In our case, we want to review how to automate the response, so let's simplify the demo and assume that we enable this alert each time we deploy a new version of the Cloud Function. Such a step can easily be incorporated into the deployment pipeline. In this case, the fastest mitigation would be rolling back to the previous version of the function. Let's now check if we can automate the response to this alert following our decision tree. The alert is actionable.
It monitors the resource, because luckily for us, our application and the resource are essentially the same. It captures the resource metadata and the alert context as an out-of-the-box feature of the cloud alerting solution. It has a notification channel that supports sending asynchronous messages reliably, the Pub/Sub service, which can also send all the data about the alert in JSON form. The conditions of the alert let us deterministically pin the cause of the alert to the fact that the current version of the Cloud Function generates too many errors. That makes the cause of the alert 100% predictable.
Mind that we don't need to find out the root cause of the problem. The cause of the alert we are interested in should be sufficient to effectively mitigate the problem as fast as possible. And at last, we can roll back to the previous version of the Cloud Function without human intervention using the cloud API. In this demo, the rollback is implemented using a Cloud Build job. The Cloud Build job can be triggered by receiving a Pub/Sub message sent via the alert notification channel. The script of the job is not important for this demo; if you are interested in the implementation, you can find the script in my post on Medium using the link shown in the wrap-up slide.
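For orientation, the Pub/Sub message such a job receives carries the alert data as JSON. A Cloud Monitoring alert notification payload looks roughly like the following; the field names reflect the documented format, but the values here are illustrative and abridged, not taken from the demo:

    {
      "version": "1.2",
      "incident": {
        "incident_id": "0.abcdef123456",
        "policy_name": "Cloud Function error ratio",
        "condition_name": "Error ratio above 20%",
        "state": "open",
        "started_at": 1700000000,
        "resource": {
          "type": "cloud_function",
          "labels": { "function_name": "ping" }
        },
        "summary": "Error ratio for the function is above the threshold."
      }
    }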
What you see right now is a Google Cloud console window. The first thing we need to do is to modify the alert policy and add a notification channel that sends notifications to Cloud Pub/Sub. For our convenience, I've already created and defined the Pub/Sub topic that will be receiving the notifications. It is straightforward. We save the changes. Next we go to Cloud Build and create a new trigger that implements the automation. The trigger works by receiving notifications from the Pub/Sub topic, and the notifications it handles use the same topic that was used for the notification channel. Now we simulate a change in the source code of the Cloud Function. The current function responds with 200 statuses; we change it to respond to any request with a 400 status. It will take some time to deploy. Once it deploys, we will be able to see the changes in the alert monitoring.
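The simulated regression is essentially a one-line change to the handler; a sketch of what it could look like (not the exact diff from the demo):

    // ping now fails every request with HTTP 400 to simulate a bad release.
    func ping(w http.ResponseWriter, r *http.Request) {
        http.Error(w, "simulated failure", http.StatusBadRequest)
    }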
When we look into the alert policy, we can see a dashboard that tracks the changes. Let's check whether the Cloud Function deployed successfully. Not yet. Okay, the Cloud Function is deployed, and it starts responding with errors. Now we can return to the alert dashboard. Let's fast forward a little so we'll be able to see what is going on. The responses are not immediate, because they are aggregated by the infrastructure, and now you can see the incident was reported. If we go to Cloud Build, we can see the triggered job's script executing right now. What the job does is simply roll back to the previous version of the Cloud Function. As you can see, the version is already in the process of deploying, and in a moment we will fast forward to see that the original version that responds with 200 statuses is there. Returning to the alerting window, we can see that after the metric returned to normal, below the threshold, the alerting solution automatically marks the incident as closed. Let me summarize.
Efficient alerts are the key to quick detection of and response to discovered threats. Not every alert is efficient. Validate your alert configurations to make sure they respond to actual problems that affect your users, not just your systems. Narrow your conditions to maximize the chances of detecting the alert cause. Capture service and environment context with the alert, and enrich the alert information with additional materials that will be helpful in responding to the alert notifications. Use auditable notification methods that are able to reliably deliver your alert information for a timely response. You have to make your alerts actionable in order to implement automatic responses, but not every actionable alert can be automated. Look to automate when the cause of the alert is deterministic and the response does not require human intervention. Be aware of your service provider's capabilities to optimize your alert implementation. If you have any questions or comments about this talk, other feedback, or you just want to get in touch, scan this QR code and write your feedback in the anonymous Google form.
Thank you.