Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. The State of Digital Operations survey this year found that
there's a 13 percent year over year increase in customer facing incidents, and
that's a little puzzling, because when you compare it to all the advancements
happening in the tech industry, you would think this should be coming down.
Also, a survey done by OpsRamp, the OpsRamp AIOps study, found that 58 percent
of enterprises think anomaly detection is one of the key AIOps use cases.
I think this tells a good story.
The idea is that a lot of organizations that depend on traditional monitoring,
or that have enabled observability but in the traditional way of looking at
things, have understood that we have to move on.
Thresholds are never static. They keep varying, and you have
to understand the baselines of your applications. It's all about baselining
something and getting alerted when that baseline is breached.
So that means a lot of enterprises think that
if you want to deliver a great customer experience or reduce service impacting
issues, you have to embrace AIOps.
Hi, everyone.
Welcome to Incident Management 2024, organized by Conf42.
My name is Indika Wimalasuriya.
Over probably the next 20 to 30 minutes, I'll walk you through AIOps in AWS.
I'll walk you through how AIOps works, and how
you can implement a comprehensive, robust AIOps framework in AWS to support
the applications you have hosted in AWS.
My presentation will start with a quick introduction to AIOps and
the concept of predictive maintenance.
And then we'll jump directly into three implementation approaches.
Number one is the observability based approach, where we will look at how we
can leverage CloudWatch to implement an AIOps framework.
Then we will look at the data lake based approach, where we will use quite a lot of
AWS services to build a data lake and, on top of that, develop AIOps capabilities.
And finally, we will look at one of the AIOps offerings provided by
AWS, and how you can leverage that to enable AIOps.
And we'll wrap it up with some effective strategies and best practices.
Quick intro about myself.
I'm based out of Colombo, Sri Lanka.
I am experienced in SRE, observability, AIOps, and Gen AI.
I'm a passionate technical trainer and a technical blogger.
I'm very proud to be an AWS Community Builder under the cloud operations
category, and also an ambassador at DevOps Institute.
With that over, let's dive into our topic.
We'll look at why AIOps is needed and why predictive maintenance is important
for enterprise applications.
As you have seen, the industry keeps on
evolving and things are getting better.
We used to have monolithic systems, and now we are moving to microservices.
We used to be on prem, and now we have moved into the cloud, where
provisioning a server is a matter of seconds, or probably minutes.
And we have already moved from server based to serverless as well.
This is all good, and this has improved things significantly.
But we have also seen that this advancement has resulted in us
developing more and more complex systems leveraging these capabilities.
And that has resulted in an expansion of data sources, a surge in data volumes,
and, more importantly, an increase in the number of failure scenarios.
Because where it used to be one gigantic monolithic application, you might
now have, I would say, 15 to 20 or even 40 microservices.
Now, the microservice architecture has made things simple, so
your developers can focus on a small functionality and work
on that, but it has also created a lot of dependencies,
and managing these dependencies
is one of the biggest pain points.
And these can result in a lot of failure scenarios in your estate as well.
Again, in the OpsRamp AIOps study, they asked this question of
the top enterprise customers:
what are the important aspects for them when it comes to AIOps?
And what this study shows is that 60 percent of the enterprise customers
want to reduce the noise.
They want more accuracy when it comes to data.
Around 51 percent of enterprise customers want to have
better root cause analysis.
Around the same percentage want to understand dependencies better.
And half, 50 percent of the customers, again mentioned that
they want to reduce their MTTR.
These are some very genuine problems anyone faces
when managing a relatively complex system in production.
Because data is very important.
And as I mentioned, now we have so much data, but more data with less
care will not provide you much information or many meaningful insights.
So that's why the accuracy of data is very important.
And because we now have so much data coming in, in case of an actual
issue, identifying the root cause, isolating things, and
narrowing down your root cause investigation is really challenging.
Everyone needs some help.
And as I touched upon with microservices, we have this problem of
understanding dependencies.
So those are some of the challenges
enterprise customers are having these days, and challenges where AIOps
can definitely provide you a solution.
So if you look at a typical AIOps implementation,
you have your customers, and they will start accessing
your system and try to get the benefit of it.
And on top of your distributed system, you probably
have some self healing bots, so that whenever issues happen,
you can self heal your system.
You will have intelligent compliance, so that
your systems are compliant, and you will build a lot of intelligent
capacity methodologies as well, auto scaling being one example.
And then you will look at predictive maintenance and
intelligent threat detection.
So you have a lot of things coming out of your observability layer.
On top of the things you have enabled, what you want to do is
move away from the traditional way of
looking at things, which is threshold based, a static approach
with very static thresholds.
We want to come out of that, and we want to do things
in a more intelligent way.
And how do we do that?
We want to do anomaly detection, which is about understanding a
baseline and getting alerted when this baseline is breached.
And of course you want to do some noise reduction as well.
And this can be done.
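As a rough illustration of the difference between a static threshold and a
baseline, here is a minimal Python sketch. It assumes a simple rolling mean
and standard deviation as the baseline, which is a stand-in chosen for
illustration and not the algorithm any AWS service uses. The function name,
window size, and sample values are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(values, window=12, band_width=3.0):
    """Return indexes of points that fall outside a rolling baseline band."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]      # trailing window acts as the baseline
        center = mean(history)
        spread = stdev(history)
        # Band is mean +/- band_width standard deviations; the small floor
        # avoids a zero-width band when the history is perfectly flat.
        margin = band_width * max(spread, 1e-9)
        if abs(values[i] - center) > margin:
            anomalies.append(i)
    return anomalies


# Hypothetical latency samples in ms: steady around 100, then one sudden jump.
latency_ms = [100, 102, 99, 101, 98, 100, 103, 97, 101, 100, 99, 102, 101, 180, 100]
print(find_anomalies(latency_ms))
```

The band moves with the data, so the same code works for a service that
normally runs at 100 ms and one that normally runs at 900 ms, where a single
static threshold would not.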
And while we do this, we can enable some intelligent business impact analysis
so that whenever live issues are happening, we understand what the impact is,
we have intelligent knowledge bases, and we have the ability to extract
the root causes intelligently.
And when customers try to engage with the operational teams, we
can provide chatbots using AI.
These chatbots can be integrated with your observability and your entire
platform, so they can detect things, learn, provide some intelligent answers to
your customers, and also help them with their day to day operations support.
And we can obviously bring intelligence into our pipelines,
especially the guardrails, and this will help us develop some
really good AIOps
use cases that provide value to enterprise customers.
So in summary, what we are trying to do is bring
in intelligent alerting with the support of anomaly detection or forecasting.
We also want to cut down the noise, and we want to automate
some of the resolutions as well.
And using some of these other AIOps use cases, we can definitely bring a
lot of value to your enterprise customers.
Moving on, let's also be very clear about what we are trying to achieve.
So what do we want from AIOps, or artificial intelligence for IT
operations?
For example, look at this as an
application or enterprise system which is green, and suddenly you see
something going wrong, and then it gets fixed and is back to normal.
This is the time when you identify that there is a service impacting issue.
What we want is for AIOps to understand these things in advance so that we
can actually eliminate these issues.
We call it predictive maintenance when it eliminates potential issues which
could have impacted the end users.
There's nothing great about picking something up after it has happened.
So what we want AIOps to do is to be intelligent and understand the
small, relatively non impactful behavior changes in our systems,
which can help us identify these issues in advance.
Because otherwise, these small issues might
snowball and then have a bigger impact.
So now we have predictive maintenance, which is about eliminating
incidents and what is needed to do that.
The next part is, we also have to be realistic.
There are chances that sometimes we might not be able to eliminate things,
even though we would like to.
In that case, what we want is to detect things a little faster and
then fix things much faster.
So those are the three key aspects.
When it comes to AIOps, we first want to eliminate incidents, and
whatever tasks we do as part of that, we call predictive maintenance.
And then we are being realistic.
We understand that there's a percentage of issues we might struggle
to detect early enough to eliminate, but we nevertheless want to detect
them as quickly as possible so that we can fix them as quickly as possible.
So in this case, instead of a long outage,
we probably could have fixed it within a much shorter outage window, which is a win-win.
So how do we do that? As I mentioned, there are three approaches I'm
going to discuss with you.
Approach number one is observability based, and it's predominantly about
enabling observability and then developing AIOps capabilities on top of that.
AWS offers a very wide range of observability related capabilities.
CloudWatch sits on top of everything.
What we generally do is instrument the system, probably
with the CloudWatch agent, or enable traces with OpenTelemetry.
And this will enable our foundations, which are the metrics, logs, and traces.
And then on top of this, we can build dashboards, explore
metrics, or define our service level objectives as well.
There are a lot of insights CloudWatch
provides, like Container Insights, Lambda Insights, Logs Insights, Application
Signals, EC2 health, and Live Tail.
On top of that, we can build digital experience monitoring
using Real User Monitoring, or RUM as we call it, or by developing
synthetic tests, which are about
mimicking actual end user behaviors, and finally
using Application Signals.
And on top of all of this, the capabilities CloudWatch provides
include metric anomaly detection, which is by far
one of the most important things.
As I mentioned, that is what enterprise customers need as well.
The majority of our alerting
is based on metrics.
And of course, we look at logs as well,
and there too, what you want is anomaly detection.
So if you can implement metric anomaly detection and log anomaly detection,
those are two very strong AIOps use cases.
And that will enable you to be on top of your operations.
CloudWatch also provides AI driven natural language query generation
and intelligent insights related to your containers,
Lambda, and other things.
So these are key capabilities which CloudWatch provides if you
build observability correctly, and which you can easily leverage to
enable a solid AIOps platform.
So in summary, the key capabilities CloudWatch, or AWS, offers
as part of its observability capabilities to support AIOps start with
anomaly detection.
It detects unusual patterns and informs you.
It checks whether you are inside your baseline or
outside it, and if you are outside, you will get an alert.
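One way to express "alert me when I leave the baseline" in CloudWatch is an
alarm on an anomaly detection band. The sketch below only builds the request
parameters for the PutMetricAlarm API, so it runs without AWS credentials.
The alarm name, period, evaluation periods, and the example load balancer
dimension are assumptions made for illustration.

```python
def anomaly_alarm_params(namespace, metric_name, dimensions, band_width=2):
    """Build keyword arguments for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": f"{metric_name}-anomaly",   # hypothetical naming scheme
        # Fire when the metric goes above or below the expected band.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",             # points at the band expression
        "TreatMissingData": "missing",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": dimensions,
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                # band_width is the number of standard deviations for the band.
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected range",
                "ReturnData": True,
            },
        ],
    }


params = anomaly_alarm_params(
    "AWS/ApplicationELB",
    "TargetResponseTime",
    [{"Name": "LoadBalancer", "Value": "app/demo/0123456789abcdef"}],  # made-up value
)
# With credentials configured, this could be sent as:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

A wider band (a larger `band_width`) means fewer alerts but slower
detection, so it is worth tuning per metric.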
Application Signals will use ML to understand different
application issues and behavior changes.
Likewise, Container Insights will analyze the containers and understand
whenever there are changes
relative to the baseline
that are causing instability in the platform.
And one of the other most important things is log anomaly detection.
It's about analyzing logs to find unusual events, new errors, and spikes,
so that AI can do it intelligently.
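To give a feel for what "new errors" means in log anomaly detection, here is
a minimal Python sketch. It assumes a very simple approach: mask the
variable parts of each log line to get a pattern, then report patterns that
never appeared in a baseline period. This is an illustration only, not how
CloudWatch implements it, and the log lines are invented.

```python
import re
from collections import Counter


def to_pattern(line):
    """Collapse variable tokens so similar log lines share one pattern."""
    line = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", line)   # long hex identifiers
    line = re.sub(r"\d+", "<n>", line)                 # any remaining numbers
    return line


def new_patterns(baseline_lines, current_lines):
    """Return patterns seen now but never in the baseline, with counts."""
    known = {to_pattern(line) for line in baseline_lines}
    current = Counter(to_pattern(line) for line in current_lines)
    return {p: c for p, c in current.items() if p not in known}


baseline = ["GET /orders/101 200 12ms", "GET /orders/102 200 15ms"]
current = [
    "GET /orders/205 200 11ms",
    "ERROR timeout calling payment after 3000ms",
    "ERROR timeout calling payment after 3100ms",
]
print(new_patterns(baseline, current))
```

The request line matches a known pattern and is ignored, while the timeout
error is reported once with a count of two instead of as two separate alerts.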
So the next option is a data lake based approach.
Here, we'll try to get all our telemetry data
into a particular storage, where we can go with S3.
The data is your logs, metrics, traces, and
events, and if you want, you can connect your ITSM tools and other things.
This will help you
consolidate all your telemetry data in one place.
Then we can use Kinesis Data Firehose if you want
to stream some of this data, or if you want to go with an ETL based
approach, you can use AWS Glue as well.
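For the streaming path, telemetry events have to be serialized and sent in
batches. The Python sketch below prepares records in the shape the Firehose
PutRecordBatch API expects and respects its limit of 500 records per call.
It does not call AWS, and the event fields and stream name are hypothetical.

```python
import json


def to_firehose_batches(events, batch_size=500):
    """Yield lists of Firehose records, each record being one JSON line."""
    batch = []
    for event in events:
        # A trailing newline keeps objects separable once Firehose
        # concatenates many records into a single S3 object.
        data = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
        batch.append({"Data": data})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                      # flush the final partial batch
        yield batch


# Made-up telemetry events for illustration.
events = [{"metric": "cpu", "value": i} for i in range(1200)]
batches = list(to_firehose_batches(events))
# With credentials configured, each batch could be sent as:
#   boto3.client("firehose").put_record_batch(
#       DeliveryStreamName="telemetry-stream", Records=batch)
```

PutRecordBatch also has a total request size limit, so with large events a
real implementation would need to cap batches by bytes as well as by count.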
And then we will create our data lake, which is about enabling
data governance, and we can use AWS Lake Formation.
When it comes to data processing and analysis,
we can use EMR or AWS Glue.
And then, to get access to machine learning models that enable
our anomaly detection, forecasting, and other key capabilities,
we can use Amazon SageMaker.
So we send all the processed data to SageMaker,
and on top of that, we can build our models.
If required, you can use some of the AWS offerings like Amazon
Lookout for Metrics and Amazon's metric forecasting as well.
That will also provide you more.
And then you can send this to QuickSight, and we can
connect with CloudWatch again.
So here, we are able to develop a data lake and, on
top of that, run machine learning models developed using SageMaker to
enable these AIOps capabilities.
So this is a little harder and more complex compared to the observability
based approach. With the observability based approach, you enable
observability, leverage CloudWatch, and pretty much use
the native capabilities coming from CloudWatch to enable AIOps. But
here you have more control, so you can ship as much data as you want.
That's one thing. You can build a lot of good data governance and you can
be on top of the data processing as well.
And you have the luxury of bringing a lot of models into SageMaker,
which can sometimes be customized to your application, the business, or
the domain in which you are working.
So this will provide you more flexibility and more
opportunities when you are developing an AIOps framework.
But the other side is that this takes time and you need SMEs.
So if you look at the other approach, AWS provides a
tool called DevOps Guru, which is a machine learning powered cloud
operations capability that provides more visibility into your application.
You can select what sort of coverage
DevOps Guru should have in your system, and you can add your data
sources, which can be CloudWatch, AWS Config, X-Ray, and other things.
Then DevOps Guru will continuously analyze and stream this
data, monitor the relevant metrics, and try to establish normal application
patterns and behavior, and whenever there's a change, it will start informing you.
So when it comes to DevOps Guru,
it's able to work at the account level, or, once you select the coverage,
it can look at your metrics, it can look at the relevant events, and it can
come up with recommendations as well.
So this is a very powerful tool, because all you have to do is
just enable it, and then DevOps Guru will take care of most of the heavy lifting.
And if you look at some of the capabilities DevOps Guru provides,
it's able to do anomaly detection, so it can automatically detect unusual
patterns in metrics and logs and alert you. It can do intelligent root cause
analysis, identifying the root cause of operational issues and trying to
correlate them with other problems so that it's able to reduce noise.
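The idea of correlating related problems to reduce noise can be sketched in
a few lines of Python. This example assumes the simplest possible rule,
grouping alerts that arrive close together in time into one incident. It is
an illustration of the concept only and not how DevOps Guru works
internally. The field names and timestamps are hypothetical.

```python
def group_alerts(alerts, window_seconds=300):
    """Group alerts that arrive within a window of the previous alert."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        if incidents and alert["time"] - incidents[-1][-1]["time"] <= window_seconds:
            incidents[-1].append(alert)     # close in time: same incident
        else:
            incidents.append([alert])       # gap too large: start a new incident
    return incidents


# Made-up alerts; "time" is seconds since some reference point.
alerts = [
    {"time": 0, "source": "database"},
    {"time": 60, "source": "api"},
    {"time": 120, "source": "checkout"},
    {"time": 2000, "source": "cache"},
]
incidents = group_alerts(alerts)
print([[a["source"] for a in group] for group in incidents])
```

Four alerts become two incidents, so the on-call engineer gets two
notifications instead of four. A real correlation engine would also use
service dependencies, not just time.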
It's also able to provide a lot of proactive insights when it comes to your
systems and the rest of your tech stack.
It's able to optimize resources by providing suggestions.
It is able to monitor your databases and provide more intelligent
information, and to support you when it comes to capacity management,
especially capacity forecasting.
It also provides cross service correlation, which is about
analyzing relationships between AWS services for holistic insights.
It's also integrated with serverless, so you can get the Lambda insights as well.
It can support you when it comes to security and compliance, and
we can obviously build remediation capabilities on top of it as well.
So we have looked at three approaches.
One is based on observability, where we leverage CloudWatch and
enable some of these high profile, highly requested AIOps capabilities.
Our second option is to build a data lake solution and then use Amazon
SageMaker to build machine learning models on top of
the data lake, giving us an AIOps framework which is closely aligned to our needs.
If not, you can use DevOps Guru, which is again a very powerful thing.
You can select the coverage, and based on that, AWS will start
baselining things and alert you whenever deviations happen.
So finally, before we wrap up, here are some of the strategies you have to be
mindful of. When it comes to AIOps, having a clear goal is very important.
It's sometimes one of the key things which is missing, and it
will make a difference in implementing a solid AIOps platform for your customers.
Next, we'll look at data.
Data is very important.
You can integrate all of your observability data, ITSM related
data, and your other knowledge artifacts, such as wiki and Confluence pages,
so that there's solid data.
You also have to be mindful of team collaboration, because there
will be various teams, like the operations teams and the data center teams.
So you have to get everyone to the table and then come up with your approach.
A key thing is you'll have to enable real time monitoring and
detect things much faster.
Automation is key.
Even though we have not discussed it much in this presentation, whenever you
have intelligent alerting based on anomaly detection or forecasting, you have
to have the resulting action automated as well.
And we can do a lot of tool integration.
You can use the AIOps implementation to support this.
And also, we have to look at aspects like training, because
AIOps engagements have to gel with your team, and that is one of the easiest
ways to get the best benefit.
One of the other most important things is making sure that you understand
what your success criteria are when it comes to an AIOps implementation.
Understanding customer experience is key, so you look at NPS, or you look at
system availability and reliability,
and the improvements related to mean time to detect, mean time to
resolution, and mean time between failures.
And if possible, look at the percentage of incidents that self heal. Overall,
what this should lead to is the ability for us to deploy more frequent changes
with high velocity, but still with high reliability.
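These success measures are straightforward to compute once incident records
are available. The Python sketch below assumes each incident has start,
detection, and resolution timestamps in minutes plus a self heal flag. The
field names and sample incidents are hypothetical, and it measures time to
resolution from the start of the incident, which is one of several common
definitions.

```python
from statistics import mean


def incident_metrics(incidents):
    """Compute MTTD, MTTR, MTBF (all in minutes) and the self heal rate."""
    ordered = sorted(incidents, key=lambda i: i["started"])
    mttd = mean(i["detected"] - i["started"] for i in ordered)
    mttr = mean(i["resolved"] - i["started"] for i in ordered)
    # MTBF here: average gap from one incident's resolution to the next start.
    gaps = [b["started"] - a["resolved"] for a, b in zip(ordered, ordered[1:])]
    mtbf = mean(gaps) if gaps else None
    healed = sum(1 for i in ordered if i["self_healed"])
    return {
        "mttd": mttd,
        "mttr": mttr,
        "mtbf": mtbf,
        "self_heal_pct": 100 * healed / len(ordered),
    }


# Made-up incidents; times are minutes since the start of the period.
sample_incidents = [
    {"started": 0, "detected": 10, "resolved": 40, "self_healed": False},
    {"started": 1000, "detected": 1004, "resolved": 1020, "self_healed": True},
]
print(incident_metrics(sample_incidents))
```

Tracking these numbers before and after an AIOps rollout gives a simple way
to check whether detection and resolution are actually getting faster.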
This should also allow us to ensure that our development teams
and operations teams are not
spending unnecessary time on firefighting, but are able
to spend some quality time on what is actually needed,
and then probably spend time on technical eliminations and those things.
So that finally the teams are enabled to do things much faster.
So you will see a positive impact when it comes to lead time for change.
And also with all of this, we should be able to reduce the
change failure rate as well.
So at a very high level, this is about understanding your
service level objectives,
understanding what you have signed up for with your customers, and using
AIOps to deliver on that promise.
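One common way to turn a service level objective into a number a team can
act on is an error budget, which is the amount of failure the objective
still allows. The Python sketch below is an illustration under assumed
figures. The 99.9 percent target and the request counts are hypothetical.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return allowed failures, failures used, and the fraction of budget left."""
    allowed = (1 - slo_target) * total_requests
    remaining = 1 - failed_requests / allowed if allowed else 0.0
    return {"allowed": allowed, "used": failed_requests, "remaining": remaining}


# Assumed example: a 99.9% availability target over one million requests.
budget = error_budget(0.999, 1_000_000, 250)
print(f"{budget['remaining']:.0%} of the error budget is left")
```

When the remaining budget is shrinking quickly, that is a signal to slow
down changes and focus on reliability before the promise to the customer
is broken.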
With that, I hope you enjoyed this session on how we can
develop AIOps capabilities with AWS.
Thank you for taking the time.
There are a lot of other interesting presenters doing presentations
as part of Incident Management 2024.
I request you to go around and watch these sessions and get your ideas.