Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, welcome to Conf42 Cloud Native 2024.
I'm very happy to be part of this year's cloud native conference, and I'll be walking you through how you can leverage AWS to build a comprehensive observability maturity model that takes your observability from reactive to autonomous. You have to ask this question: are you working for machines, or are machines working for you? If it's the middle of the night, you get a call-out, and you have to open your laptop and start working, I'm afraid that likely means you are working for the machines. So what you have to understand is how you can move to a more autonomous way of working, so that the machines start working for you.
During my presentation I'll walk you through why observability is important, especially in the cloud native era, and why you need to focus on observability maturity. The topic of my presentation is mainly the maturity model I have come up with and the pillars around it. I'll go into more detail on the maturity model, which has four states, and how you can take it from reactive to autonomous, and then we will talk about some implementation guidelines you can leverage when you implement this and start your own AWS observability journey. When you are implementing a maturity model, it's very important to ensure that you are measuring business outcomes. Every step of the way, try to see what value you are generating for your business. Unless you do that, it will be just another approach and your business partners will not see the expected benefits. So it's very important that we have the ability to measure everything and see how it's impacting the overall business goals. Then we'll wrap up by going through some of the best practices and pitfalls you have to avoid, and I'll also briefly talk about my predictions for the future of cloud native observability.
So moving on. As you might already be aware, cloud native is not just a buzzword now; almost everyone is in the cloud, or has at least partially moved into the cloud. Even though cloud native simplifies a lot of things, and moving from your on-premises data centers to the cloud cuts out a lot of overhead, it has its own complexities as well. One of the key complexities it brings is the distributed nature of everything: applications nowadays are heavily dependent on microservices architectures, with a lot of upstream and downstream dependencies, and this naturally results in systems that are highly distributed, very complex, and hard to track. Most of these systems are also dynamic in nature: there is auto scaling happening, there is dynamic elasticity. That requires a new way of doing observability, because the traditional way of doing monitoring, management, and operations will not work. And of course we have containers and continuous integration and deployment, which have increased production velocity. All of this results in a lot of complexity in our cloud native solutions, and that is a recipe for disaster unless you plan properly. So what we are suggesting is that traditional monitoring will fall apart when you are cloud native. You have to look at observability, and at the ways you can get the benefit of the cloud as well. In a nutshell, observability is a key part of your cloud native journey. Without observability you will definitely fall apart, and you will not achieve your end objectives.
So moving on: why do you think we need a maturity model? There are a lot of reasons. One, which is not a technical reason, is that you need a north star. When you start your observability journey, you'll probably start in some place, but you want to know where you are heading, and you want to have decent objectives on a particular timeline so that you can work with your resources and move in that direction. Another main thing is that if you don't have a maturity model, you don't know how to measure quality, and you don't know where you stand compared to the rest of the industry. So it's very important that you have an observability maturity model. That doesn't mean you have to stick to what I am presenting today; you can take it, customize it a little to suit your needs, and probably make it a blueprint you can refer to as you start your observability journey. As I said earlier, the maturity model is very important because it keeps us from getting stuck.
So when we are building AWS-based observability, what are the key pillars? Obviously there are a lot of pillars, and I'm going to touch upon a few. One of the main ones is logs. As you might know, logs are the most ancient type of observability element; they have been around since distributed systems, or computer systems in general, started. Syslog is probably the oldest log format we know, and logs have long been used for auditing and even troubleshooting purposes. Then come metrics. A metric is usually a number that gives an indication of how something is working. Metrics are generally used to trigger alerts, because with metrics it's easy for us to set a threshold or do profile-based alerting, so we get the benefit of using metrics to alert us. Metrics are a very important aspect of observability, because they allow us to understand some of the internal state of our systems.
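To make that concrete, here is a minimal sketch of threshold-based alerting with boto3; the namespace, metric name, and SNS topic ARN are hypothetical placeholders, not anything from the talk:

```python
# Minimal sketch: a threshold-based CloudWatch alarm via boto3.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="checkout-api-high-p99-latency",
    Namespace="MyApp/Checkout",      # hypothetical custom namespace
    MetricName="LatencyMs",          # hypothetical application metric
    ExtendedStatistic="p99",         # alert on tail latency rather than averages
    Period=60,                       # evaluate over 1-minute windows
    EvaluationPeriods=5,             # require 5 consecutive breaches before alarming
    Threshold=750.0,                 # milliseconds; tune to your own targets
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",    # silence from the service is itself a signal
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-topic"],
)
```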
Then tracing. Tracing is probably the newest kid on the block, and it's about trying to understand how your code is doing. We are good at looking at the logs and going through what is happening, but you have to understand that logs are a little limited and sometimes might not provide the exact details you are looking for. Traces, on the other hand, give you the exact unit of work: what your code is doing. You can trace back to the method level and even down to the database query level. So traces are a very powerful thing that especially helps you in troubleshooting issues. And then alarms; I don't have to spend much time here. You have to have the right alarms in place so that you get automated call-outs and you are informed.
But what we have to understand is that an automated call-out is not the end goal. You have to ask: is there a way I can automate this? Can I get systems to resolve issues themselves, the self-healing capabilities, the autonomous work which I'm going to talk about? So alarms are early, primary things, but you have to have some alarms in case your autonomous mechanisms are not working. And then of course you'll have to have dashboards. Canaries are nothing but synthetic testing of your application. It's good that we are looking at our end users' behaviors, what they are doing, all the service calls, and that we have full-stack observability of our application. But what if a customer has a network issue, or some other issue outside our control? Having a synthetic monitor that tries to mimic actual end user behavior will help us here, because with it we are able to mimic end user actions and behaviors and get an alert when
our synthetic monitor hits issues. So this is a good way of keeping track of things and staying on top of our systems. Real user monitoring is also very important: it's about front-end monitoring, about trying to understand the exact customer experience your end users are getting.
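As a sketch of how a synthetic monitor could be wired up programmatically, this registers a CloudWatch Synthetics canary with boto3; the bucket, role ARN, and runtime version string are assumptions you would replace with your own values (runtime names change over time, so check the currently supported list):

```python
# Hedged sketch: registering and starting a CloudWatch Synthetics canary.
import boto3

synthetics = boto3.client("synthetics")

synthetics.create_canary(
    Name="checkout-journey",
    Code={  # canary script zipped and uploaded to S3 beforehand
        "S3Bucket": "my-canary-scripts",       # hypothetical bucket
        "S3Key": "checkout/canary.zip",
        "Handler": "canary.handler",
    },
    ArtifactS3Location="s3://my-canary-artifacts/checkout/",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/CanaryExecutionRole",
    Schedule={"Expression": "rate(5 minutes)"},  # replay the user journey every 5 minutes
    RuntimeVersion="syn-python-selenium-3.0",    # verify against the current runtime list
)
synthetics.start_canary(Name="checkout-journey")
```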
And of course you'll have to do your infrastructure monitoring, network monitoring, and security monitoring, and finally be mindful of cost as well. Especially in AWS, you have to decide when to enable detailed logging, even though it might be costly, and when to enable anomaly detection and the other great features AWS provides; these will obviously have some cost associated with them. So you will have to balance your needs against the cost. At a high level, these are the key pillars of observability. Observability, just to remind you, is an approach that uses the telemetry data an application emits to understand the system's internal state. The more we understand the system's internal state, the more control we have over ensuring that the system is working fine. And if we identify that the internal state is deteriorating, we are able to take action in advance so that the end user customer experience is not impacted. And what is our mission here? Our mission is to reduce or completely eliminate anything that can impact our end users' customer experience, because we are trying to ensure that our systems are reliable and available. So the key pillars of observability are generally logs, metrics, and traces, but you have to add the other elements as well; they will complement your observability journey.
So moving on: what are the levels we are referring to when we say observability maturity model? These are the four levels. The first level is reactive; you can call it keeping your lights on. It's about doing just the basics to ensure you get alerts when systems go down. The next level is being proactive: doing a little more and being a little more advanced in detecting and fixing the issues that can impact the end user customer experience. Then predictive, the third level, is the way to go, because it allows you to predict things. Being proactive is good, but if an issue still has an impact on the end user experience, that is not so great. What we want is to predict early, to identify symptoms early, so that we can fix issues even before they materialize and impact the end user experience. And the final level, the nirvana state, is the autonomous level, where our systems are able to look at all the telemetry data and with that make a judgment about their internal state. If a system sees that things are not going in the right direction, it is able to do some self-healing remediation on its own so that it can maintain that internal state. The key thing to note is that observability is about understanding the internal state, and the more data we have, the more we can let our systems understand their own internal state and take precautionary measures, even without us involved.
So that is what we call autonomous. If I come back again to reactive: we will have some logs and probably some metrics as well, but the metrics are probably limited and we might not have traces at all. We will be able to know if our application goes down; we might get some process alerts, infra-heavy alerts, and those will do some of the work. But that is just keeping the lights on; it's not necessarily a great customer experience. Being proactive means you have access to the logs, metrics, and traces, and with those you are able to proactively identify issues. Probably you identify some issue that might still have an impact on customers, but you can speed up the resolution. You can even get to know about issues before your end users do, which sometimes feels pretty great: you don't have to wait until your customers report that something is down or there's an issue someplace, you get to know it first. You can probably send some comms and be on top of the entire incident window. That is again not a great place to be, but it's still better than just keeping the lights on, or being reactive.
Predictive means using these metrics, logs, and traces and being on top of the game: looking at anomalies, forecasting, looking at what is happening outside our BAU operations, and with that coming up with some intelligent predictions, then taking action based on those so that we can actually eliminate the issues that would have a bad impact on the end user customer experience. And then finally autonomous, which I have touched upon while going through these four levels: it's about looking at things and trying to understand when and where the system's internal state is changing, and then what actions the system itself can take to bring its internal state back to normal, where it can self-heal, remediate, and do all those things. That puts it even far above the predictive level. One of the questions you might ask is: do all your systems, or the clients you are working with, need to be at the autonomous level? The answer is no. It depends on the complexity, on how mission-critical the system is; it depends on a lot of factors. So we are not advising everyone to be at the autonomous level, but obviously no one should stay at the reactive level either. For most systems, somewhere between proactive and predictive will do the job. It will balance your operations with the cost, and also provide great benefits to the end users and the business.
But if you have a mission-critical system, where any outage or bad customer experience will cost you money, and it's important for you to keep track of what the end users are feeling, then obviously you will have to go to the predictive and autonomous levels, where you can leverage these capabilities to ensure that you are on top of your operations. So now let's go through the key pillars we discussed earlier against the four levels of maturity and try to understand what the maturity levels mean for each pillar. When it comes to logs, with a reactive approach you are simply using logs for troubleshooting: customers report some issue, you acknowledge it, and you simply refer to the logs and start troubleshooting, trying to find answers.
Being proactive means the exceptions become visible in the logs, and you are probably getting some alerts out of them, so you can be a little proactive in identifying issues. Obviously this will improve your mean time to detection, and with this you are able to stay a little more on top of things. Predictive means you are looking at all the logs and probably doing advanced anomaly detection, so you see the anomalies in advance. The moment something goes outside your BAU, or the normal internal state, you identify it, and you can do a lot of prediction on top of that. Autonomous means looking at those anomalies, doing correlation, and having the ability to trigger other workflows that can actually perform autonomous operations or self-healing. Metrics: again, at the reactive level you will have basic metrics, and at the proactive level you will have some threshold-based alerting. When it comes to predictive, you will use a lot of anomaly detection capabilities, and these will help you predict issues in advance and also then build your autonomous capabilities.
Tracing: at the keeping-the-lights-on, reactive level you will usually not see tracing. At the proactive level you will have some basic tracing, but when it comes to predictive, you will probably have tracing that is time-driven and topology-based as well, so it's more of a distributed tracing setup where you are able to propagate your traces and their context, make correlations between different systems, and identify a lot of issues. In a nutshell, this is more of a full-stack observability level, and it will give you great benefits when it comes to being predictive. You can definitely use this, because with traces you are able to identify actual root causes, and from those you can trigger your autonomous workflows.
Canaries are the synthetic monitoring, and as you progress you will go from having no canaries to having all your key journeys monitored by synthetic monitors. Real user monitoring is a very important one: you will start it at the proactive level and improve it at the predictive level, and when it comes to autonomous you will obviously use AI and ML to improve the capabilities and to drive the autonomous capabilities as well. For infrastructure monitoring, again, when it comes to predictive and autonomous it's about bringing in AI and ML and being on top of your operations, and it's the same for network and security as well. Those will allow you to keep walking this journey and achieving your observability objectives. And cost is a very important factor. From reactive to autonomous, even though the cost increases, with the autonomous or predictive nature you are able to bring down a lot of human involvement, a lot of human effort, and this will result in more gains for you as well. So you might start out a little expensive when you begin your observability journey, but you can definitely reduce the cost by the time you reach the end of it.
So now let's look at how we can implement a comprehensive observability setup using AWS. In this example, we have an application hosted in AWS with a database, microservices, and front-end code, plus upstreams, downstreams, and end users. At a high level, when we are trying to implement a comprehensive observability setup, what we need first is RUM, real user monitoring. That is usually where I start, because I would like to know how my end users are feeling. RUM is all about understanding front-end performance, and front-end performance is the most important thing because that's what our end users will see. Next we'll implement APM, application performance monitoring, with distributed tracing, so that we know end to end how things are happening and we also have full control of our code. What's important is to understand how our code is behaving; with that we can understand the bottlenecks and other issues and rectify them. So enabling application performance monitoring and distributed tracing, in other words full-stack observability, is very important. With distributed tracing, when a user request comes in, we are able to correlate it with what's happening in the front end, in the microservice layer, and in the database as well. Traditionally, if
you see a database query that is taking time, what happens is that your DBAs or SMEs identify it and reach out to the DevOps and SRE teams, who will sometimes struggle unless they have access to the code. Even when they have access to the code, they will search but might have trouble identifying which module, or sometimes which journey, triggers the query. And even if they are good with the code and identify the module and everything, it will be next to impossible for them to isolate which user, or which kind of user profile, invoked it, because there's no connection. But with distributed tracing we are able to propagate the trace context from the front end through the microservices to our database layers as well, and with that trace propagation we are able to understand which queries got triggered by a given end user request. This is very powerful. It enables us to go through and identify the bottlenecks, the issues in our code, and the other errors, and everything related to customer experience can be directly correlated with our code.
And obviously we'll have to look at the logs and events, and we will look at the metrics. You'll also have to ensure that, as part of site reliability engineering, you define your SLIs, SLOs, and error budget, which again complement all your observability goals. And then finally, obviously, you will have to do your infrastructure monitoring as well, and that will ensure that you are on top of your estate.
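As a quick worked example of the error budget idea mentioned above: a 99.9% availability SLO over a 30-day window leaves about 43 minutes of allowable downtime.

```python
# Worked example: turning an availability SLO into an error budget.
# The target and window are illustrative; plug in your own numbers.
SLO_TARGET = 0.999      # 99.9% availability
WINDOW_DAYS = 30

window_minutes = WINDOW_DAYS * 24 * 60               # 43,200 minutes in the window
error_budget_minutes = window_minutes * (1 - SLO_TARGET)

print(f"Error budget: {error_budget_minutes:.1f} minutes per {WINDOW_DAYS} days")
# -> Error budget: 43.2 minutes per 30 days
```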
With this, let's look at some of the key implementation areas so that you get some idea about the implementation. If you are using AWS, you can go into real user monitoring and configure your application there. It will give you a code snippet which you have to embed in all your front-end code, and with that you can enable real user monitoring. It will allow you to see the page response times, the page errors, and the other front-end performance metrics.
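For reference, here is a hedged sketch of doing that setup through the API instead of the console, using the boto3 CloudWatch RUM client; the domain, sample rate, and telemetry list are illustrative assumptions, so verify the parameter names against your SDK version:

```python
# Hedged sketch: creating a CloudWatch RUM app monitor, which in turn
# provides the JavaScript snippet you embed in your front-end pages.
import boto3

rum = boto3.client("rum")

rum.create_app_monitor(
    Name="storefront-web",
    Domain="www.example.com",        # hypothetical front-end domain
    CwLogEnabled=True,               # also copy RUM events into CloudWatch Logs
    AppMonitorConfiguration={
        "SessionSampleRate": 0.25,   # sample 25% of sessions to control cost
        "Telemetries": ["errors", "performance", "http"],
        "EnableXRay": True,          # link RUM sessions to X-Ray traces
    },
)
```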
Then you'll have to configure the CloudWatch agent and the relevant configuration files to ensure that you have added all your log files and your logs are getting fed into CloudWatch.
And once that has happened, ensure that you enable log anomaly detection, because it's very important. As I said earlier, what usually happens at the reactive level is that when your end users complain that something is not working, you trail your logs, you identify the exceptions, you understand the issue, and then you try to come up with fixes. But once you identify the issue, probably someone will ask: can you go back and check when this issue started? And you will see it started a couple of hours earlier, or sometimes even a couple of days earlier. How great would it be if we could identify these issues the moment they appear? One of the challenges is how to do that, because it can be something unknown, something the development team is not even aware of, or something that has been difficult to capture. With log anomaly detection, the AI is able to understand the baseline: what the existing errors are and what is currently happening. It baselines your state, and after that, when new errors, new issues, or new behavior changes happen on top of it, it is able to alert you. So log anomaly detection is a very powerful concept which you should definitely enable, and it will provide you value as you go through your observability journey.
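A minimal sketch of enabling it programmatically, assuming a recent boto3 that includes the CloudWatch Logs anomaly detector API (the detector name and log group ARN are placeholders):

```python
# Hedged sketch: a log anomaly detector over an existing log group.
import boto3

logs = boto3.client("logs")

logs.create_log_anomaly_detector(
    detectorName="orders-service-anomalies",    # hypothetical name
    logGroupArnList=[
        "arn:aws:logs:us-east-1:123456789012:log-group:/myapp/orders-service"
    ],
    evaluationFrequency="FIVE_MIN",  # how often new events are compared to the baseline
)
```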
And once you enable your traces, you will start seeing the service map. You can do this with OpenTelemetry, and once you do, one great feature is that this map will allow you to see how the requests are getting served, and it will show any bottlenecks or other problems. As I said, traces are great because they allow you to track your request from the browser level, through the API gateways, and down to the SQL server. So, for example, you can see how much time the front end is taking, how much time the microservices are taking, and you can see some of the SQL running as well.
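To illustrate the kind of instrumentation behind such a service map, here is a small OpenTelemetry Python sketch; the service and span names are invented, and in AWS you would typically export spans via OTLP to the ADOT collector or X-Ray rather than to the console as done here:

```python
# Sketch: parent/child spans so a slow DB call can be traced back to a request.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "orders-service"}))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_order(order_id: str) -> None:
    # The parent span covers the whole request; the child span isolates the
    # database call, which is what links a slow SQL query to a user action.
    with tracer.start_as_current_span("handle-order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("db.query"):
            pass  # run the SQL query here

handle_order("order-123")
```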
Enabling full-stack observability is very important because it gives you full control of your estate. You have the ability to see your whole system's internal state, and especially the code: what is your code doing? Usually, at the reactive level you are more infra-heavy; you'll see all the infrastructure and other things. But one thing to note is that it's the code that is serving your customer requests, the code that is doing the processing and everything else. So you have to enable traces and ensure that you open the full doors to your system, where you have full visibility. And alongside that, you will have your metrics enabled as well. These can be infra-level metrics, application metrics, performance metrics, and custom metrics too; if you are using Lambda, you can have serverless metrics, and database metrics as well. Metrics are generally the numbers, and based on those numbers you can make a lot of decisions, see the performance, and especially configure a lot of alerting. So metrics will give you those triggers.
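For the custom metrics mentioned above, a minimal boto3 sketch (the namespace, metric, and dimension names are hypothetical):

```python
# Minimal sketch: publishing a custom application metric to CloudWatch.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_data(
    Namespace="MyApp/Checkout",      # hypothetical custom namespace
    MetricData=[
        {
            "MetricName": "CheckoutLatencyMs",
            "Dimensions": [{"Name": "Service", "Value": "orders-service"}],
            "Value": 182.0,          # one observation, in milliseconds
            "Unit": "Milliseconds",
        }
    ],
)
```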
And with CloudWatch you are able to enable metric anomaly detection. With that you have the power of not only having the metrics, and not only going with threshold-based alerting, which is a very legacy, old way of doing things, but also of enabling anomaly detection. What CloudWatch does is start profiling the metric, how it trends and how it changes, and with that it creates an upper bound and a lower bound as guidelines. Based on those, it will start sending you alerts if it sees anomalies happening.
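A hedged sketch of such an anomaly detection alarm: instead of a fixed number, the alarm compares the metric to the band CloudWatch has learned (the names below are placeholders):

```python
# Sketch: alarm on the learned anomaly detection band, not a static threshold.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="orders-latency-anomaly",
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="band",        # alarm against the band instead of a Threshold
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "MyApp/Checkout",       # hypothetical metric
                    "MetricName": "CheckoutLatencyMs",
                },
                "Period": 300,
                "Stat": "Average",
            },
        },
        {
            "Id": "band",
            # 2 = band width in standard deviations; widen it to reduce noise
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
        },
    ],
)
```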
Also in AWS we have CodeGuru. I recommend you enable CodeGuru for application profiling, because it will allow you to understand code performance and correlate it with multiple other factors. Also enable AWS DevOps Guru, which applies a lot of AI and ML across your entire account. It's a very powerful tool, so you will have a full holistic view of your entire estate and the ability to identify anomalies across the board.
So that's roughly what is required to implement a comprehensive observability solution in AWS. Now, let's discuss one of the key things: why are we doing this? We are doing this because we want to ensure our customers are getting the world-class customer experience our application is designed for. So while you are building out your observability journey, it's very important that you clearly define your goals and how to measure your customer experience, and with that try to understand whether your observability methodology and framework allow you to achieve your customer targets, and how they help you correlate and identify when things are going wrong, so you can quickly identify and fix issues.
I am not going to go into much detail here, but one thing I keep reiterating, because it is important, is to ensure that you understand your business objectives and have a way of measuring them while you are traveling from level one to level four observability. At each level you will see the benefits, and it's good if, even before you start, you are able to identify what the benefits are and then decide whether to keep or add targets for your journey. If you want a few KPIs, work on your mean time to detection, mean time to resolution, and mean time between failures, and track your service level objective achievement, because the further you go from reactive to autonomous, the better you should be able to achieve your service level objectives.
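As a small illustration of how those KPIs fall out of per-incident timestamps (the incident data below is fabricated purely to show the arithmetic):

```python
# Illustrative sketch: MTTD and MTTR from incident records.
from datetime import datetime

incidents = [
    # (started, detected, resolved)
    (datetime(2024, 3, 1, 2, 0), datetime(2024, 3, 1, 2, 12), datetime(2024, 3, 1, 3, 0)),
    (datetime(2024, 3, 9, 14, 5), datetime(2024, 3, 9, 14, 9), datetime(2024, 3, 9, 14, 40)),
]

mttd = sum((d - s).total_seconds() for s, d, _ in incidents) / len(incidents) / 60
mttr = sum((r - d).total_seconds() for _, d, r in incidents) / len(incidents) / 60
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
# Both numbers should fall as you move from reactive toward autonomous.
```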
And that's a must: unless we achieve that, the purpose of doing observability is lost. And of course, as with anything, enabling observability in AWS is pretty easy; that's what the cloud provides you. But you
have to ensure you follow some of the best practices. Observability, as I said, is about looking at the internal state of your system. So you must enable your logs, traces, and metrics, and you should ensure that in AWS, wherever needed, you enable detailed monitoring, because that will really help you as well. And don't forget your traces, because traces show what's really happening at your code level, and that is the most important thing: when you are cloud native, it's probably a safe guess that your infrastructure is pretty stable. Then ensure that you send almost everything into CloudWatch, and that you have proper dashboards as well, so that you can take a look and get a big, holistic picture of your entire estate. As for what you should
avoid: definitely ensure that when you are shipping your logs to CloudWatch you are mindful of retention as well. You don't want to lose your logs and then, when you want to troubleshoot an issue a week later, find that they are gone; that's a problem. Also ensure that you have granular-level metrics and traces. That's very important, because you shouldn't try to stay at a very high level; sometimes what you need is the ground-level details.
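For the retention point, a minimal boto3 sketch; the log group name and the 90-day value are illustrative choices:

```python
# Minimal sketch: explicit retention so logs are neither lost too early
# nor kept (and billed) forever.
import boto3

logs = boto3.client("logs")

logs.put_retention_policy(
    logGroupName="/myapp/orders-service",   # hypothetical log group
    retentionInDays=90,                     # must be a value CloudWatch Logs supports
)
```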
And when we are working with very vast, complex systems, it's very easy to forget about some of the critical systems, by virtue of them not feeling critical. So ensure that you have a proper way of identifying your critical systems and making sure those are being monitored and observed along with everything else you have done. And finally, it's very easy to end up with technology- or data-siloed observability. Ensure that all your observability telemetry data is centralized and gives you the big picture.
And finally: where is cloud native observability heading? That's a very good question to ask when you are coming up with your own observability maturity framework. In the immediate term, what I am seeing is that a lot of clients are adopting OpenTelemetry, because they are aware of the need for traces, for the distributed traces that enable full-stack observability; that will be the first requirement for our customers. In the midterm, I think a lot of people will really start moving into AI and ML, because the observability tools now have built-in anomaly detection, forecasting, and prediction capabilities, and people will start using those capabilities very quickly. And the long-term vision is about where I started: do you want to work for a machine, or do you want machines to work for you? The ultimate objective of observability is to ensure that you identify your system's internal state, and whenever it changes even slightly, without a human involved, the system tries to fix it, to self-heal. That is the autonomous nature I was discussing.
And finally, my prediction for this year. If you look at Gartner's Magic Quadrant, you will have seen the leaders: Dynatrace, Datadog, and New Relic are the top three, and Amazon Web Services is a leading contender in the challengers category. I feel that with all the new advancements, announcements, and capabilities unleashed as part of last year's AWS re:Invent, such as Application Signals, log anomaly detection, and the other improvements and advancements to anomaly detection and the AI-based changes related to CloudWatch, I'm pretty sure Amazon Web Services will be in the leaders category the next time Gartner releases this Magic Quadrant. So keep your fingers crossed; I'm pretty sure this will happen, probably this year, and if not, next year for sure.
So with that, I hope you enjoyed my presentation. I wanted to ensure that you have an understanding of observability and know how to use observability in your AWS environment. Observability is a journey: it starts from probably just keeping the lights on, then moves to the proactive level, then to the predictive level, and finally ends up with autonomous operations. Thank you very much for listening. If you have any questions, you can find me on LinkedIn, and you can also leave comments on this video, which I would very much appreciate. There's a great line-up of speakers who are going to speak as part of Cloud Native 2024, so please join. I'm really very happy about, and appreciate, the time you have spent. Take care.
Bye.