Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with chaos native. Create your free account at Chaos Native Litmus Cloud.
Hello and welcome to the SRE conference. In my session I'll be talking about improving observability when you're running your workloads on AWS. As part of this session, we will talk about the best practices that you can follow when your workloads are running on AWS by leveraging the native AWS services, and also touch upon some of the SRE practices like SLIs, SLOs and SLAs, and what kind of best practices you can follow when you are setting up these KPIs and SLAs for your workloads.
With that being said, let's have a look at the agenda we will be following today. We'll start with the introduction, where we talk about observability and the key signals that you'll be looking out for whenever you're implementing or improving observability for your workloads. Then we'll see the different features available in AWS as part of Amazon CloudWatch and AWS X-Ray to understand the state and behavior of your application. Finally, we will touch upon the importance of SLIs, SLOs and SLAs, and then we'll close the session by summarizing what we have learned over these 30 minutes.
So let's start with the first part. What is observability? As we develop more and more applications in a distributed ecosystem, observability has suddenly become a very important aspect of all these applications, especially when it comes to defining it. There are many definitions that you will find across different websites and resources. In the simplest form, observability helps you understand what is happening in your system. As a counterpoint, you may ask: well, I have logs, or I have metrics, or I have tracing. I have implemented one of these three aspects in my environment. Doesn't that give me observability? Well, look at observability as a correlation of all these different signals that you would implement if you are running applications in a distributed environment.
Some of the questions that you would want to ask yourself are: is my system up or is it down? Is it fast or slow, based on the experience of the end users? For any KPIs and SLAs which I am establishing, how would I know that I am meeting them? These are some of the questions that observability can help you answer, especially when you are running your applications at scale and in the cloud. You cannot afford to be blind to all the different aspects of running applications in a distributed environment. You need to be able to answer a wide range of operational and business related questions. You should also be able to spot problems, ideally before they disrupt your operations, and you should be able to respond quickly to any issues raised by a customer. To achieve all this insight, you need your systems to be highly observable.
Now, obviously there are many different ways of improving your system's observability. To begin with, we will be focusing on the three primary signals, which are logs, metrics and traces. Most of you would be implementing at least one of these in your system, or maybe all three of them if you have a very mature observability practice in your organization or in your project. There is also a white paper from the CNCF which describes observability. It's still a work in progress, but you can surely have a look at it on their GitHub repo to learn more about how the Cloud Native Computing Foundation is thinking about observability and what kind of recommendations they are giving.
Now, one question you might ask is: is observability new for software systems? I would say no, because the three signals that I spoke about in the previous slide, which are logs, metrics and traces, have been around for a long time. Logs have always been used for most of your debugging and for identifying the root cause of your issues. Take the example of anything that went wrong in your application even before you were running it in the cloud: there was always a log sitting somewhere on your virtual machine which could be used to identify the root cause of whatever went wrong. Then you have metrics.
Metrics have always been used when your applications have been running for a long time, or when you are trying to understand infrastructure related issues. Let's say something went wrong, the CPU is too high, the memory is too high, or during performance testing you notice that certain aspects of the application are not behaving the way they were intended to; you would be relying on metrics. And finally, traces. Traces help you understand how a request traverses from point A to point B and which downstream services it touches along the way. So all in all, these signals, together or individually, have always been used by different software development teams during their development, testing, operations, or even maintenance of software systems. So observability is not leveraging concepts you are not aware of; rather, it correlates all these different signals which you have always been seeing about the application and gives you a consolidated picture of the state of your application.
So let's go a little bit in depth into each of these. What are logs? The way I look at it, logs can be split into four different categories: application logs, system logs, audit logs, and infrastructure logs. You would see the same segregation in the CNCF paper as well. Application logs are the normal logs that you would see as part of your Java application or your Python application; any kind of log appending that you're doing becomes your application logs. What about system logs? These would come from your operating systems. Let's say you have some kind of AMI which is running and that AMI is having errors; those would be the logs you'll be capturing here. Audit logs are things like a record of what action has been performed by which user, or the entire trace of a specific business outcome carried out by a user, or even CloudTrail. If you have enabled CloudTrail in AWS, that's again audit logging. And finally, infrastructure logs. Whenever you are building out your infrastructure, maybe by using the cloud native services or by using, let's say, Chef or Terraform or CloudFormation, any logs generated from there are infrastructure logs. Once you have segregated these logs into four different categories, you can look at the logging levels that you are using. This applies predominantly on the application side, where you have TRACE, DEBUG, INFO, WARN and ERROR. These are the different log levels. Depending on the environment that you are using, you should keep changing these log levels. Try to have only the error logs in the production environment and use the rest of the logging levels in the lower environments. Sometimes too much logging can also result in performance issues because of all the file I/O happening behind the scenes. One more point to be careful about: avoid printing sensitive information in the logs. Each business will define for itself what counts as sensitive information, but you should have mechanisms in place in your project or your organization which look at any data that is printed into the logs and determine whether it is sensitive or not.
The next point would be defining a consistent logging pattern. Quite often it's very difficult to maintain a schema which you will follow for a long duration, especially when you are writing logs. So you can use a log format, maybe by using Log4j or Logback or any of the logging frameworks that you have, and define a consistent logging format. The advantage is that it helps you extract information from these logs during analytics. You may have a centralized logging location where all these logs are being read by some team trying to find data in them, or to do some analytics on the logs. A consistent logging format will ease that entire process of extracting information from the logs.
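As an illustration (not from the talk), here is a minimal sketch of one way to get a consistent, machine-parsable log format with Python's standard logging module. The field names and the LOG_LEVEL environment variable are assumptions made for the example.

```python
import json
import logging
import os
import sys

class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON line with a fixed set of fields."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

# Writing each record as one JSON line to STDOUT keeps the format easy to parse
# later; the level comes from an (assumed) LOG_LEVEL variable so production can
# run at ERROR while lower environments keep DEBUG or INFO.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=os.getenv("LOG_LEVEL", "ERROR"), handlers=[handler])

logging.getLogger("payments").error("payment failed for order %s", "hypothetical-123")
```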
Finally, when you are running container based workloads, it is preferred to write your logs to standard output, or STDOUT. When you write to STDOUT, all these logs can be aggregated. Let's say you're running your application in a Kubernetes cluster. You can have a DaemonSet running behind the scenes to aggregate all these logs, possibly using Fluent Bit, and send them across to the destination of your choice, which can be CloudWatch Logs in the case of AWS, or you can even send the logs to an Elasticsearch instance which may be running in the cloud or on premises, depending on your choice of architecture. With that information on logs, let's have a look at metrics.
Metrics are often described as the performance of your system. So when we say metrics, it can be the CPU information that you have. Let's say one of my CPUs is going beyond the 70% threshold. That's a metric which says my VM has too much load on it, or maybe an out of memory condition is about to happen. So there is a threshold, say 70 or 80%, whatever threshold you have set up, that is being exceeded. That's the metric for you. You can aggregate different types of data as part of a metric: it can be a numeric representation, it can be a point in time observation of a system, and it can have different cardinality associated with it. Metrics essentially serve two different purposes: one is real time monitoring and alerting, and the second is trend analysis and long term planning. As the name suggests, real time monitoring and alerting will immediately tell you if something is wrong with your application. If you have the right set of alarms, notifications and other bits in place, you would immediately know that a specific application is not showing the right metrics, or maybe having too many 404 errors or too many 500 errors. Those are again metrics. They give you a count of what is happening with your application.
Trend analysis over time helps you do the right sizing. For example, if you are running a new application and you know that the application is just getting started, you would possibly use a lower configuration of a VM, let's say two virtual CPUs and four GB of RAM. Over time, as your application scales, you would want to increase the CPU or maybe scale out. That kind of long term planning can again be done based on the trend of the metrics coming in and the observations from the historical data. So metrics help guide the future of what needs to be done for your application, and also give a snapshot of what is currently happening with your application.
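As an aside, if you want to publish your own application metric alongside the built-in ones, a minimal boto3 sketch could look like the following. The namespace, metric name and dimension are illustrative, not anything mentioned in the talk.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish one data point for a custom metric; "MyApp" and "OrderLatency"
# are placeholder names chosen for this example.
cloudwatch.put_metric_data(
    Namespace="MyApp",
    MetricData=[{
        "MetricName": "OrderLatency",
        "Dimensions": [{"Name": "Service", "Value": "checkout"}],
        "Value": 412.0,            # latency of one request, in milliseconds
        "Unit": "Milliseconds",
    }],
)
```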
The third signal that we spoke about in the previous slide is tracing. Distributed tracing has become increasingly important when you are running your system in a distributed environment. For example, for a request initiated by your end user or customer going from point A to point B, having the right tracing in place shows the effect across all the downstream services involved. Quite often, when you're running a microservices architecture, it's not just one microservice, it's a collection, so you might have three or four microservices interacting with each other behind the scenes. If a request goes from microservice A to D, tracing will show you what is happening as the request goes from A to B, B to C, and C to D. It gives you information on the lag, or latency, that you are seeing at different levels of your application, and that way it gives you a holistic picture of a request, or the journey of a request.
Okay, moving on. Now that we have established what observability means and what the different signals are that we will be talking about, we will discuss the ways in which you can understand the state and behavior of the application. For this purpose, I will be focusing on two main services in AWS. There are different observability services available, both in open source as well as in AWS. For this particular presentation, I'll be focusing on Amazon CloudWatch and AWS X-Ray. Amazon CloudWatch will take care of your metrics and your logs, and AWS X-Ray provides you the tracing capability.
So let's consider a use case. Let's say that I want to monitor a microservices architecture, and this architecture consists of different distributed parts which need to be monitored. For the logging aspects, you can make use of Amazon CloudWatch, and it can also be used for collecting metrics, setting up alarms and reacting to certain changes which are happening in your AWS environment. Also, with many different microservices working together, you want to know what the chain of invocation for these microservices is. That's where the idea of X-Ray comes in, which uses correlation IDs: unique identifiers attached to all the requests and messages related to a specific event chain. Let's take an example with services A, B and C. If I'm trying to do a GET operation on service A, behind the scenes it may be fetching data from services B and C, consolidating it and giving it back to me. That's where X-Ray will help relate all these different invocations by using the correlation ID, and you can get a consolidated view of how your application is behaving, with the different inter-service communication and also the overall response when it is returned to the user.
So as I mentioned earlier, there are two services we'll be focusing on for this presentation: Amazon CloudWatch and X-Ray. CloudWatch is a monitoring and management service that provides data and insights into AWS. With the help of CloudWatch, you will be able to get all your logs consolidated into CloudWatch Logs. You can use Container Insights to get metrics on containers which may be running on your ECS or EKS orchestrators. You can make use of ServiceLens to see how the different services relate to each other. You can make use of metrics and alarms: the metrics will help you get data on the number of 400 errors or 500 errors you are seeing for an application load balancer, and alarms will help you set up thresholds. For example, if I have a CPU which is going beyond 70%, I can have an alarm go off and tell me that something is consistently pushing the CPU beyond the 70% threshold. You can also have anomaly detection, and that's where the earlier point about metrics helping you with long term trend analytics comes into the picture, because at any point in time, if there is an anomaly in your overall operation, the metrics will help you catch it.
Talking about AWS X-Ray, that's basically for tracing and analytics. It also gives you a service map, and it helps developers analyze and debug their production workloads, especially in a distributed environment. It helps you understand how your application and the underlying services are performing, and you can use X-Ray to analyze applications in production as well as in development. If there are different environments for each of your customers, you can certainly have X-Ray do all of that in each of those environments.
Now let's look at the different components that you have with Amazon CloudWatch. I gave a brief overview of Amazon CloudWatch in the previous slide. Let's take an example of one of the resources that we have here. Let's say it's an ALB, or maybe you have some kind of custom data which you need to send. On the left, you would be sending this custom data into CloudWatch, and CloudWatch will be responsible for putting all these metrics together. You can see an example here of the CPU percentage, hours per week, and again CPU percentage, et cetera. So all these metrics are being put together by CloudWatch. At any point in time, if a metric exceeds a specific threshold that you have set, you can have a CloudWatch alarm go off, and that alarm can integrate with an SNS email notification which will inform your team that there is something wrong with your application and that they should possibly go and have a look. Instead of a notification, you can also have auto scaling invoked. Let's take an example: I have a single EC2 instance and the alarm says that the CPU percentage has gone beyond 70%, and I would like auto scaling to kick in at that point. That's where I can use Amazon CloudWatch alarms to trigger auto scaling, so that my instance count, the EC2 count for that matter, can go from a single instance to a fleet of three or four instances, and then it can scale in again once the metric comes back below the threshold. And as a consumer, you can look at all these statistics and plot the graphs, which we will have a look at in the next slide as well.
Now let's look at Logs Insights. I mentioned that the logs are getting pushed to CloudWatch Logs, and once the logs are available in CloudWatch Logs, you can execute a query. This is a query which is executed on CloudWatch Logs itself, which helps you parse certain parts of that particular log format and filter all the logs which have a logging type of error. This is really helpful when you have all these logs consolidated into log groups within Amazon CloudWatch. After that, you can see in this particular screenshot how you can show the distribution of the log events over time; even custom log events can be seen here. So Logs Insights helps you look at what has happened, say from 01:00 a.m. all the way to 02:00 a.m. It will consolidate all that information, show you the distribution of the events which have happened, and you can also find specific events which you would otherwise have missed if the logs were being looked at manually. That's the advantage of having Logs Insights. The sample query here fetches the timestamp and the message fields, and orders by the timestamp in descending order.
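To make that concrete, here is a hedged sketch of running a query like the one described (timestamp and message, filtered for errors, sorted descending) through the Logs Insights API with boto3. The log group name and the error filter pattern are assumptions for the example.

```python
import time
import boto3

logs = boto3.client("logs")

# Start a Logs Insights query over the last hour; "/my-app/application"
# is a placeholder log group name.
query = logs.start_query(
    logGroupName="/my-app/application",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=(
        "fields @timestamp, @message "
        "| filter @message like /ERROR/ "
        "| sort @timestamp desc "
        "| limit 50"
    ),
)

# Poll until the query finishes, then print the matching events.
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in results.get("results", []):
    print({field["field"]: field["value"] for field in row})
```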
So with the logs in place, let's talk about the metrics. We saw in the earlier example that metrics are exposed for different AWS services. These metrics are grouped by namespace and then by the various dimensions associated with them. For example, in this screenshot you can see the custom namespaces, like Container Insights, the Prometheus related metrics, or the ECS Container Insights; all of these are custom metrics which have been added into custom namespaces. Then you have the AWS namespaces like API Gateway, Application Load Balancer, DynamoDB and EBS. These are the default namespaces from the different AWS services which are used in your account. So this way the metrics section helps you find all the related metrics for a specific service, or for the custom events which are being consolidated by your team.
Next we have the graphed metrics. It's closely related to the metrics that you saw in the earlier slide. It basically helps you run statistics on your metrics, which can be average, minimum, maximum, sum, et cetera. And if you look at the right side, at what is circled in red, it also helps you add an anomaly detection band, where you're saying this is how my application normally behaves, and at certain points it helps you identify: okay, this is an anomaly. So possibly something went wrong, or maybe you got a very high spike of incoming traffic at that point in time. So it helps you identify such anomalies in your regular flow of traffic and, obviously, in your metrics.
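For illustration, the same kind of statistics (average, maximum) can also be pulled programmatically; here is a minimal boto3 sketch, with the instance ID as a placeholder.

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

# Fetch average and maximum CPU utilization for one instance over the last day,
# in 5-minute buckets. The instance ID is a placeholder.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average", "Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```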
We have been talking about alarms for quite a while. There are basically two types of alarms: metric alarms and composite alarms. We saw that there are metrics, so if I am using just one metric and I use it to create an alarm, that would be a metric alarm. A composite alarm includes a rule expression and takes into account the states of the different underlying alarms. Then there is the threshold of the alarm. For example, whenever you are setting up an EC2 Auto Scaling event, you might say that it needs to scale out whenever CPU utilization is 90% and above, and scale in whenever CPU utilization is around 50%. That's a threshold that you would set up for an alarm. You can see here that the blue line visible in the chart is the threshold that you have set up, and the value is the red one. The statistic that you are measuring determines whether your alarm is in the OK state or the in-alarm state. Here the alarm fires after three periods over the threshold: you can say that you are not going to consider the metric to be in alarm unless it breaches the threshold for three evaluation periods. That's what you're seeing here, where the value has gone over the threshold three times, and that's when the alarm goes off. Or you can configure it so that the moment the value goes above the threshold even once, the alarm is considered to be in action. So depending on your business case and your use case, you can pick and choose whether you want the value to be above the threshold for consecutive periods or just one period. That's how alarms can be put to work together with the metrics that you are collecting, to give you this experience of either sending out an SNS notification, doing automated remediation, or triggering auto scaling, which is itself a kind of automated remediation.
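As a sketch of the "three periods over the threshold" idea, an alarm like that could be created with boto3 roughly as follows. The instance ID and SNS topic ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm only after three consecutive 5-minute periods above 70% average CPU,
# then notify an SNS topic (the topic ARN below is a placeholder).
cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-three-periods",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```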
In this example I wanted to show automated remediation, where you are leveraging AWS Lambda. You can see here, and we have a blog about this, about incident management and remediation in the cloud. In this example, you're monitoring a microservice API that sits behind an application load balancer. The traffic can't reach the microservice, so it times out. So you could have an alarm that is triggered to send an SNS notification to a topic, with a Lambda function subscribed to it. The moment the alarm goes off, the SNS topic sends the notification to Lambda. Once the notification is received by the Lambda, it knows that there is something wrong with the security group. So what it does is go back and fix that security group, possibly editing the inbound or outbound rules. That way you allow the traffic to flow through again. That's one way of doing automated remediation. You can also make use of ChatOps, where you basically integrate with something like Slack and instead post a message to your Slack channel, so someone is made aware: okay, here is something which has gone wrong. So there are multiple ways in which you can leverage Lambda and CloudWatch alarms together to have this automated remediation in place.
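As a hedged illustration of the kind of Lambda handler described above (subscribed to the SNS topic and restoring an ingress rule on the broken security group), a sketch might look like this. The security group ID, port and CIDR range are assumptions made for the example, and a real function would also handle the case where the rule already exists.

```python
import boto3

ec2 = boto3.client("ec2")

# Illustrative value; in a real setup this would come from configuration
# or from the alarm/notification payload itself.
SECURITY_GROUP_ID = "sg-0123456789abcdef0"

def handler(event, context):
    """Invoked by SNS when the CloudWatch alarm fires; restores the ingress rule
    so traffic can reach the microservice behind the load balancer again."""
    for record in event.get("Records", []):
        print("Alarm notification:", record["Sns"]["Message"])

    # Re-allow HTTPS traffic from the load balancer subnet (placeholder range).
    ec2.authorize_security_group_ingress(
        GroupId=SECURITY_GROUP_ID,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": "10.0.0.0/16"}],
        }],
    )
    return {"remediated": True}
```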
If you have a look at the EKS workshop published by AWS, you will see that there is a specific chapter which talks about service mesh integration and how you can use Container Insights. This image is from the Container Insights setup. As you can see on the left panel, there is a section called Container Insights, and there you can select your EKS cluster, the AWS managed Kubernetes cluster, and get different views on CPU utilization, memory utilization, network, number of nodes, disk and cluster failures. This is the advantage of Container Insights: it gives you a one shot view of the different clusters which are there. It also gives you a view of the different services and resources if you're using ECS. So that way you have all the metrics which are needed for your container workloads together in CloudWatch. Now, CloudWatch isn't just for container workloads; you can also use it for EC2 instances, which would be just your normal virtual machines. You'll have to export all the logs which are generated there by using the CloudWatch agent, but you can have all those logs come into CloudWatch Logs and do the exact same operations that we have been talking about in the earlier slides, even with EC2.
Next, let's have a look at anomaly detection. When you enable anomaly detection for a particular metric, it applies statistical and machine learning models, algorithms which are already in place. You can see from the graph that this grayed out area is the anomaly detection band around the metric which has been set up. In this case the band expression is ANOMALY_DETECTION_BAND(m1, 2), which indicates that anomaly detection has been enabled for the metric with ID m1, with a band of two standard deviations as the default. The moment a data point falls outside that band, you have an anomaly being detected. So when you are viewing a graph of metric data, you can overlay the expected values on top of the graph: I know this is where the normal execution of the graph is, and the moment you see these red areas, some anomaly has been detected. That's generally a pattern which can be used for your long term trend analytics, or even for your long term planning, to say what the usage is going to be for each of your services and what percentage of it will be used. Depending on whichever metric you're using, in this case CPU utilization, you can determine the band within which it will be operating, and it helps you detect any anomalies which come up in your operation. The advantage of having this sort of setup is that the moment there is a deviation from the normal baselines, your ops team is made aware of it, and you are not caught off guard when such systems are running at scale in a distributed environment.
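For illustration, an alarm on that kind of anomaly detection band can be sketched with boto3 roughly as below, using the ANOMALY_DETECTION_BAND(m1, 2) expression from the slide. The instance ID is a placeholder and the exact parameters here are an assumption, not anything shown in the talk.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when CPU utilization goes above the model's expected band
# (two standard deviations wide). Instance ID is a placeholder.
cloudwatch.put_metric_alarm(
    AlarmName="cpu-anomaly",
    EvaluationPeriods=3,
    ComparisonOperator="GreaterThanUpperThreshold",
    ThresholdMetricId="ad1",
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
                },
                "Period": 300,
                "Stat": "Average",
            },
            "ReturnData": True,
        },
        {"Id": "ad1", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)", "ReturnData": True},
    ],
)
```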
This is an example of something I mentioned earlier, where you can consolidate all your logs by running a CloudWatch agent, and then have the metrics and logs shipped to CloudWatch Logs by using a Fluent Bit sidecar pattern. This pattern is very common in the container space, and Container Insights also collects performance logs using something called the embedded metric format. These performance log events are basically structured JSON following a schema, and it enables you to send high cardinality data which can be ingested and stored at scale using Amazon CloudWatch.
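To give a feel for what such an embedded metric format (EMF) event looks like, here is a minimal sketch printed from Python. The namespace, dimension and metric names are illustrative assumptions, and the schema shown follows my understanding of EMF rather than anything from the talk.

```python
import json
import time

# A minimal EMF event: CloudWatch extracts "Latency" as a metric in the
# "MyApp" namespace, while the rest of the JSON stays queryable as a
# high-cardinality structured log. Names are illustrative.
event = {
    "_aws": {
        "Timestamp": int(time.time() * 1000),     # milliseconds since epoch
        "CloudWatchMetrics": [{
            "Namespace": "MyApp",
            "Dimensions": [["ServiceName"]],
            "Metrics": [{"Name": "Latency", "Unit": "Milliseconds"}],
        }],
    },
    "ServiceName": "checkout",
    "Latency": 118,
    "RequestId": "hypothetical-request-id",
}

# Writing the event to STDOUT lets Fluent Bit or the CloudWatch agent ship it.
print(json.dumps(event))
```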
So just to summarize here, we have seen that Amazon CloudWatch can help you take care of your logging as well as your metric needs when it comes to observability.
And now let's have a look at AWS X-Ray. With AWS X-Ray, going back to the first point I mentioned, it's basically about correlation IDs. This is an example where you can see that the Scorekeep service is, behind the scenes, getting data from DynamoDB: it is keeping the session, updating the item, and ultimately putting the data in there. This step by step breakdown, along with the time taken at each stage, is what X-Ray gives you. The trace ID, which is the correlation ID, is added to the HTTP request in a specific header named X-Amzn-Trace-Id, and you can have this integrated with Amazon API Gateway, the AWS Application Load Balancer, and so on.
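For illustration, instrumenting a Python service so that its downstream calls show up as subsegments could look roughly like this with the aws-xray-sdk. It assumes a running X-Ray daemon (or the Lambda/ECS integration), and the service and table names are placeholders inspired by the Scorekeep example.

```python
# Requires the aws-xray-sdk package and a running X-Ray daemon to ship traces.
import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

xray_recorder.configure(service="scorekeep")
patch_all()  # instruments supported libraries such as boto3, so DynamoDB calls become subsegments

dynamodb = boto3.resource("dynamodb")

@xray_recorder.capture("update_score")   # custom subsegment around our own logic
def update_score(session_id, score):
    table = dynamodb.Table("scorekeep-sessions")   # placeholder table name
    table.update_item(
        Key={"id": session_id},
        UpdateExpression="SET score = :s",
        ExpressionAttributeValues={":s": score},
    )

# Outside a web framework or Lambda, open and close a segment explicitly.
xray_recorder.begin_segment("scorekeep-batch")
update_score("session-123", 42)
xray_recorder.end_segment()
```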
In this section we have a service map, and X-Ray uses all the data from your application to generate it. Anything you see in green is up and running, and anything in red means there is something wrong with that particular service. This sort of consolidated view, for a client, let's say a user invoking something, helps you determine how your service is operating and also gives you a one shot view of how different services are integrated, especially in a large distributed system.
So that covers all the topics around Amazon CloudWatch and AWS X-Ray. We touched upon different aspects of metrics, how you can put the graphs together, how you can use the logging aspects of Amazon CloudWatch to consolidate all your logging, and ultimately use X-Ray for the tracing bit as well. Now I would like to move a little bit outside of the AWS services which we have been talking about for the last 15 to 20 minutes, and talk about the importance of service level indicators, objectives and agreements. Quite often, whenever you are defining your SRE practice, it's quite important to have these three definitions well sorted out.
So what is a service level indicator? A service level indicator is basically a number: a quantitative measure of some aspect of the service being provided. If I have a service, say a REST API that I am offering, and I'm saying that it's going to be up 99% of the time or something like that, then I know that this is the measure I have to meet, and that's the SLA part of it. That's the measure which I have to meet whenever the service is running for a long duration, and that's the reliability part of it. And every time you are setting up SLAs, SLOs or SLIs, make sure these are incorporated in your SRE practice, because ultimately meeting or not meeting a particular SLA determines the success or failure of a service. And most often, SLAs are easily recognized by the financial penalty or some kind of rebate associated with them.
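To make the indicator part concrete, here is a tiny sketch of turning raw request counts into an availability SLI and comparing it to an SLO target; the numbers are made up for illustration.

```python
# Turning raw request counts into an availability SLI and checking it
# against an SLO target. The counts here are made-up illustration values.
total_requests = 1_000_000
successful_requests = 999_240        # e.g. responses that were not 5xx errors

sli = successful_requests / total_requests       # the indicator: a plain number
slo_target = 0.999                               # the objective agreed with stakeholders

print(f"Availability SLI: {sli:.4%}")
print("SLO met" if sli >= slo_target else "SLO missed, error budget being burned")
```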
Here is some guidance which you can follow, especially when you are setting up the SRE practice and for product development. These are things which you will obviously find in different resources online. Do have a look at Google's SRE book, which talks about the best practices and how you can implement an SRE practice in your organization. Also have a look at the resiliency guidance and the Well-Architected Framework provided by AWS, which talks about the five pillars you would need to address if you want to ensure that your architecture is well architected for high availability, resiliency, operational excellence, security, and so on. The first thing is that you must not use each and every metric in whatever tracking or monitoring system you have. It's always a concern when you want to capture everything that your system is emitting; that is not really good in the long term, because you would not be differentiating between what you want to know about the system and what you don't.
The second is to have as few SLOs as possible and get agreement from all the stakeholders, for clarity on how these should be measured and the conditions under which these SLOs are valid; even that has to be clarified with the stakeholders. And finally, make sure that you have an error budget, which provides the objective metric that determines how unreliable a service is allowed to be. Let's take an example. If you are saying that your service is going to have 99.9% availability, that's roughly nine hours of downtime in a year (at 99.95%, it's approximately four and a half hours). Now, are you ready to have that kind of a setup, and are you ready to have that kind of an SLA for your application? That is something you have to discuss with your stakeholders, because the more available you want your application to be, the higher the cost associated with it can be. So you need to take care of what the DR for your application is, and how the application is going to behave if it is not going to meet 99.9% or 99.95% availability.
How is the run team going to make sure that it has the right resources, like the runbook and other details about the application? How is the handover going to happen from the application team which has been building the service to the run team? What kind of DevOps methodologies are we following? Does the team have the autonomy to build and deploy certain patches or fixes for such an application? All those things have to be taken into account when you're defining these KPIs for your application.
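To make the availability arithmetic from the 99.9% example concrete, here is a quick calculation sketch of how a target translates into an annual downtime budget.

```python
# Converting an availability target into an annual downtime (error) budget,
# as in the 99.9% example above.
HOURS_PER_YEAR = 365 * 24

for target in (0.999, 0.9995):
    budget_hours = (1 - target) * HOURS_PER_YEAR
    print(f"{target:.2%} availability -> about {budget_hours:.1f} hours of downtime per year")
```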
Okay, so with that being said, let's summarize. What did we learn here? We spoke about logs, metrics and traces, and how all of them correlate to give you a better experience in the overall observability of your system. Make sure you implement the SLIs, SLOs and SLAs for your application in consultation with the stakeholders, and evaluate how they behave under different circumstances. If you're running your services on AWS, try and leverage the native tooling that is already available: Amazon CloudWatch for logging, alarms and dashboards, and AWS X-Ray for tracing. For a deeper dive into what is offered as part of observability from AWS, hop over to the AWS observability workshop, where you can get hands on experience with all the features which are there. It's approximately a three to four hour lab, of the self paced kind, and you can run it in your AWS environment with all the different templates which are already available. So with that being said, thank you so much for your time, and I hope you have a good day.