Conf42 Incident Management 2024 - Online

- premiere 5PM GMT

AIOps in AWS: Choosing the Best Approach for Predictive Maintenance

Abstract

Transform your IT operations with AIOps in AWS! Discover how leveraging observability tools, data lakes, and AI-driven automation can revolutionize predictive maintenance. Learn to optimize performance, reduce downtime, and align IT with business goals for seamless, scalable operations.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. This year's State of Digital Operations survey found a 13 percent year-over-year increase in customer-facing incidents. That is a little puzzling, because when you compare it with all the advancements happening in the tech industry, you would think this number should be coming down. Also, the OpsRamp AIOps study found that 58 percent of enterprises think anomaly detection is one of the key AIOps use cases. I think this tells a good story. A lot of organizations that depend on traditional monitoring, or that have enabled observability but still look at things the traditional way, have understood that they have to move on. Thresholds are never static; they keep varying. You have to understand the baselines of your applications, and it is all about baselining something and getting alerted when that baseline is breached. That means a lot of enterprises think that if you want a great customer experience, or want to reduce service-impacting issues, you have to embrace AIOps.

Hi, everyone. Welcome to Incident Management 2024, organized by Conf42. My name is Indika Wimalasuriya. During the next 20 to 30 minutes, I'll walk you through AIOps in AWS: how AIOps works, and how you can implement a comprehensive, robust AIOps framework in AWS to support the applications you host there. My presentation starts with a quick introduction to AIOps and the concept of predictive maintenance. Then we'll jump directly into three implementation approaches. Number one is the observability-based approach, where we look at how we can leverage CloudWatch to implement the AIOps framework. Then we will look at the data lake based approach, where we use quite a lot of AWS services to build a data lake and develop AIOps capabilities on top of it. And finally, we will look at one of the AIOps offerings provided by
AWS, and how you can leverage that to enable AIOps. We'll wrap up with some effective strategies and best practices.

A quick intro about myself. I'm based out of Colombo, Sri Lanka. I am experienced in SRE, observability, AIOps, and generative AI. I'm a passionate technical trainer and a technical blogger. I'm very proud to be an AWS Community Builder in the cloud operations category, and also an ambassador at the DevOps Institute.

With that over, let's dive into our topic. We'll look at why AIOps is needed and why predictive maintenance is important for enterprise applications. As you have seen, the industry keeps evolving and things are getting better. We used to have monolithic systems, and now we are moving to microservices. We used to be on-prem, and now we have moved to the cloud, where provisioning servers is a matter of seconds, or perhaps minutes. And we have already moved from server-based to serverless as well. This is all good, and it has improved things significantly. But these advancements have also led us to build more and more complex systems that leverage these capabilities. That has resulted in an expansion of data sources, a surge in data volumes, and, more importantly, a growing number of failure scenarios. Where there used to be one gigantic monolithic application, you might now have 9 microservices, or, I would say, 15 to 20, or even 40. The microservice architecture has made things simpler in one way, because your developers can focus on a small piece of functionality and work on that. But it has also created a lot of dependencies, and managing these dependencies is one of the biggest pain points. They can result in a lot of failure scenarios in your estate as well.

Again in the OpsRamp AIOps study, they asked top enterprise customers this question.
What aspects are most important to them when it comes to AIOps? The study shows that 60 percent of enterprise customers want to reduce noise; they want more accuracy when it comes to data. Around 51 percent want better root cause analysis, and around the same percentage want to understand dependencies better. And half, 50 percent, of the customers mentioned that they want to reduce their MTTR. These are very genuine problems that anyone faces when managing a relatively complex system in production. Data is very important, and as I mentioned, we now have so much of it. But more data with less care will not give you much information or many meaningful insights, which is why the accuracy of data is so important. And because so much data is coming in, when there is an actual issue, identifying the root cause, isolating things, and narrowing down your root cause investigation is really challenging. Everyone needs some help. And as I touched upon with microservices, we have this problem of understanding dependencies. So those are some of the challenges enterprise customers are facing these days, and they are challenges where AIOps can definitely provide a solution.

So look at a typical AIOps implementation. You have your customers, who access your system and try to get the benefit of it. On top of your distributed system, you probably have some self-healing bots, so that whenever issues happen, the system can heal itself. You will have intelligent compliance, so that your systems stay compliant, and you will build intelligent capacity management as well, auto scaling being one example. And then you will look at predictive maintenance and intelligent threat detection.
So you get a lot of things out of your observability layer. On top of what you have enabled, you want to move away from the traditional way of looking at things, which is based on static thresholds, and do things in a more intelligent way. How do we do that? We want to do anomaly detection, which is about understanding a baseline and getting alerted when this baseline is breached. And of course you want to do some noise reduction as well. While we do this, we can enable intelligent business impact assessment, so that whenever live issues happen, we understand what the impact is, we have intelligent knowledge bases, and we have the ability to extract the root causes intelligently. When customers try to engage with the operational teams, we can provide AI-based chatbots. These chatbots can be integrated with your observability and your entire platform, so they can detect things, learn, provide intelligent answers to your customers, and help them with their day-to-day operational support. We can obviously bring intelligence to our pipelines as well, especially the guardrails, and this will help us develop some really good AIOps use cases that provide value to enterprise customers.

So in summary, we are trying to bring in intelligent alerting supported by anomaly detection or forecasting. We also want to cut down the noise, and we want to automate some of the resolutions as well. And using some of these other AIOps use cases, we can definitely bring a lot of value to enterprise customers.

Moving on, let's also be very clear about what we are trying to achieve. So what do we want from AIOps, or artificial intelligence for IT operations? This is number one.
For example, look at this as an application or enterprise system. It is green, then suddenly you see something going wrong, and then it gets fixed and goes back to normal. This is the point in time where you identify that there is a service-impacting issue. What we want is for AIOps to understand these things in advance, so that we can actually eliminate these issues. We call it predictive maintenance when it eliminates potential issues that could have impacted the end users. There's nothing great about picking something up after it has happened. So we want AIOps to be intelligent and understand the small, relatively non-impacting behavior changes in our systems, which can help us identify these issues in advance. Otherwise, these small issues might snowball and have a bigger impact. That is predictive maintenance: eliminating incidents, and what is needed to do that.

Next, we also have to be realistic. There is a chance that sometimes we will not be able to eliminate things, even though we would like to. In that case, we want to detect things a little faster and then fix them much faster. So those are the three key aspects of AIOps. First, we want to eliminate incidents, and whatever tasks we do as part of that we call predictive maintenance. Then, being realistic, we understand that there is a percentage of issues we will struggle to detect early enough to eliminate, but we still want to detect them as quickly as possible, so that we can fix them as quickly as possible. In this case, instead of a long outage, we could probably have fixed things within a much shorter outage window, which is a win-win.

So how do we do that? As I mentioned, there are three approaches I'm going to discuss with you.
Number one is the observability-based approach, which is predominantly about enabling observability and then developing AIOps capabilities on top of that. AWS offers a very wide range of observability-related capabilities, and CloudWatch sits on top of everything. What we generally do is instrument the system, probably with the CloudWatch agent, or enable traces with OpenTelemetry. This gives us our foundations, which are metrics, logs, and traces. On top of this, we can build dashboards, explore metrics, or define our service level objectives. CloudWatch also provides a lot of insights, like Container Insights, Lambda Insights, Logs Insights, Application Signals, EC2 health, and Live Tail. On top of that, we can build digital experience monitoring using Real User Monitoring, which we call RUM, by developing synthetic tests, which mimic actual end-user behavior, and finally by using Application Signals.

On top of all this, CloudWatch lets us do metric anomaly detection, which is by far one of the most important capabilities. As I mentioned, that is what enterprise customers need as well. The majority of our alerting is based on metrics. Of course, we look at logs as well, and there too what you want is anomaly detection. So if you can implement metric anomaly detection and log anomaly detection, those are two very strong AIOps use cases, and they will enable you to be on top of your operations. CloudWatch also provides AI-driven natural language query generation and intelligent insights related to your containers, Lambda, and the like. So these are the key capabilities that CloudWatch provides if you build observability correctly, and you can easily leverage them to enable a solid AIOps platform.
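As a rough illustration of the metric anomaly detection described above, the sketch below builds the request parameters for a CloudWatch alarm that fires when a metric leaves its anomaly detection band. The helper name `anomaly_alarm_params`, the alarm naming scheme, and the example instance ID are assumptions made for illustration, not something from the talk. The block only builds a dictionary; the commented-out line shows how it might be passed to boto3's `put_metric_alarm`, and the field values should be checked against the CloudWatch documentation before use.

```python
def anomaly_alarm_params(namespace, metric_name, dimensions, band_width=2, period=300):
    """Build keyword arguments for a CloudWatch anomaly detection alarm.

    band_width is the number of standard deviations used for the expected band.
    """
    return {
        "AlarmName": f"{metric_name}-anomaly",  # hypothetical naming scheme
        # Alarm when the metric goes above or below the expected band
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        # Points the alarm at the band expression instead of a static threshold
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": dimensions,
                    },
                    "Period": period,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "expected range",
                "ReturnData": True,
            },
        ],
    }


params = anomaly_alarm_params(
    "AWS/EC2",
    "CPUUtilization",
    [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder ID
)
# With credentials configured, this could be sent with:
# boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["Metrics"][1]["Expression"])
```

The point of the design is that no static threshold appears anywhere in the request: the alarm compares the metric with a band that CloudWatch learns from the metric's own history.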
So in summary, here are the key capabilities that CloudWatch, or AWS, offers as part of its observability capabilities to support AIOps. Anomaly detection detects unusual patterns and informs you. It checks whether you are inside your baseline or outside it, and if you are outside, you get an alert. Application Signals uses ML to understand different application issues and behavior changes. Likewise, Container Insights analyzes your containers and understands whenever there are changes relative to the baseline, with continuous AI analysis of what is causing instability in the platform. And one of the other most important capabilities is log anomaly detection, which is about analyzing logs to understand unusual events, new errors, and spikes, so that AI can do it intelligently.

The next option is the data lake based approach. Here, we focus on getting all our telemetry data into a particular storage layer, where we can go with S3. That data is your logs, metrics, traces, and events, and if you want, you can connect your ITSM tools and other sources too. This helps you consolidate all your telemetry data in one place. Then we can use Kinesis Data Firehose if we want to stream some of this data, or AWS Glue if we want to go with an ETL-based approach. Then we create our data lake, which is about enabling data governance, and for that we can use AWS Lake Formation. When it comes to data processing and analysis, we can use EMR or AWS Glue. And then, to get access to machine learning models that enable our anomaly detection, forecasting, and other key capabilities, we can use Amazon SageMaker. We send all the processed data to SageMaker, and on top of that, we can build our models.
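The models in this approach would normally be trained in Amazon SageMaker. As a much simpler stand-in, the sketch below shows the baselining idea itself on a list of metric values: each point is compared with a band built from the mean and standard deviation of the preceding window. The function name `find_anomalies`, the window size, the multiplier `k`, and the sample latencies are all illustrative assumptions, and a rolling z-score is a swapped-in technique, not the method used by CloudWatch or SageMaker.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, k=3.0):
    """Return indexes of points outside mean +/- k * stdev of the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # With a perfectly flat baseline sigma is 0, so any deviation is flagged
        band = k * sigma
        if abs(values[i] - mu) > band:
            anomalies.append(i)
    return anomalies


# Hypothetical latency samples in milliseconds, with one obvious spike
latencies = [100, 102, 98, 101, 99, 100, 250, 101]
print(find_anomalies(latencies))  # -> [6]
```

One limitation worth noting: once a spike enters the window it inflates the baseline, so a production model would typically exclude flagged points or account for seasonality.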
And if required, you can use some AWS offerings like Amazon Lookout for Metrics and Amazon Forecast as well, which will give you more. Then you can send the results to QuickSight, and connect back to CloudWatch again. So here, we develop a data lake and, on top of it, run machine learning models built with SageMaker to enable these AIOps capabilities.

This is a little harder and more complex compared with the observability-based approach. With the observability-based approach, you enable observability, leverage CloudWatch, and pretty much use the native capabilities that come from CloudWatch to enable AIOps. Here you have more control: you can ship as much data as you want, you can build a lot of good data governance, and you can be on top of the data processing as well. You also have the luxury of bringing a lot of models into SageMaker, and these can be customized to your application, your business, or the domain you work in. So this gives you more flexibility and more opportunities when you are developing an AIOps framework. The other side is that this takes time, and you need SMEs.

Now for the other approach. AWS provides a tool called DevOps Guru, which is a machine learning powered cloud operations capability that gives you more visibility into your application. You can select the coverage DevOps Guru should have in your system, and you can add your data sources, which can be CloudWatch, AWS Config, X-Ray, and others.
DevOps Guru will then continuously stream and analyze this data and monitor relevant metrics. It tries to establish normal application patterns and behavior, and whenever there is a change, it starts informing you. DevOps Guru can work at the account level, or, once you select the coverage, it can look at your metrics and the relevant events and come up with recommendations as well. This is a very powerful tool, because all you have to do is enable it, and DevOps Guru takes care of most of the heavy lifting.

Looking at some of the capabilities DevOps Guru provides: it can do anomaly detection, so it automatically detects unusual patterns in metrics and logs and alerts you. It can do intelligent root cause analysis, identifying the root cause of operational issues and correlating them with other problems so that it reduces noise. It also provides a lot of proactive insights about your systems and the rest of your tech stack. It can optimize resources by providing suggestions. It can monitor your databases and provide more intelligent information, and it can support you with capacity management, especially capacity forecasting. It also provides cross-service correlation, which is about analyzing relationships between AWS services for holistic insights. It is integrated with serverless too, so you can get Lambda insights. It can support you when it comes to security and compliance, and we can obviously build remediation capabilities on top of it as well.

So we have looked at three approaches. The first is based on observability, where we leverage CloudWatch and enable some of these high-profile, highly requested AIOps capabilities.
Our second option is to build a data lake solution and then use Amazon SageMaker to build machine learning models on top of the data lake, giving us an AIOps framework that is closely aligned to our needs. Otherwise, you can use DevOps Guru, which is again very powerful. You select the coverage, and based on that AWS starts baselining things and alerts you whenever deviations happen.

Finally, before we wrap up, here are some of the strategies you have to be mindful of. When it comes to AIOps, having a clear goal is very important. It is often one of the key things that is missing, and it makes a difference in implementing a solid AIOps platform for your customers. Next, look at data, which is very important. You can integrate all of your observability data, ITSM-related data, and your wiki, Confluence pages, and other knowledge artifacts, so that you have solid data. You also have to be mindful of team collaboration, because various teams will be involved, like the operations teams and the data center teams. You have to get everyone to the table and then come up with your approach. A key thing is to enable real-time monitoring and detect things much faster. Automation is key. Even though we have not discussed it much in this presentation, whenever you have intelligent alerting based on anomaly detection or forecasting, you have to automate the resulting action as well. We can also do a lot of tool integration, and your AIOps implementation can support this. We also have to look at aspects like training, because AIOps engagements have to gel with your team, and that is one of the easiest ways to get the most benefit. And one of the other most important things is making sure you understand what your success criteria are when it comes to an AIOps implementation.
If customer experience is key, then you look at NPS, or at system availability and reliability, and at improvements in mean time to detect, mean time to resolve, and mean time between failures. If possible, look at the percentage of incidents that self-heal. Overall, what this should lead to is the ability to deploy changes more frequently, with high velocity but still high reliability. It should also ensure that our development and operations teams are not spending unnecessary time on firefighting, but are able to spend quality time on what is actually needed, and perhaps on eliminating technical debt and similar work, so that the teams are ultimately enabled to do things much faster. You will see a positive impact on lead time for changes, and with all of this, we should be able to reduce the change failure rate as well. So at a very high level, this is about understanding your service level objectives, understanding what you have signed up to with your customers, and using AIOps to deliver on that promise.

With that, I hope you enjoyed this session on how we can develop AIOps capabilities with AWS. Thank you for taking the time. There are a lot of other interesting presenters giving presentations as part of Incident Management 2024. I encourage you to go around, watch those sessions, and get ideas. Transcripts provided by Transcription Outsourcing, LLC.
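The success measures named in the talk (mean time to detect, mean time to resolve, and mean time between failures) reduce to simple arithmetic over incident records. The sketch below is one possible way to compute them. The record fields `started`, `detected`, and `resolved`, the function name `incident_metrics`, and the sample incidents are assumptions made for illustration. Definitions also vary between teams: here MTTR is measured from the start of impact to resolution, and MTBF from the resolution of one incident to the start of the next.

```python
from datetime import datetime, timedelta


def mean_delta(deltas):
    """Average a non-empty list of timedelta values."""
    return sum(deltas, timedelta()) / len(deltas)


def incident_metrics(incidents):
    """Compute MTTD, MTTR, and MTBF from a list of incident records."""
    ordered = sorted(incidents, key=lambda inc: inc["started"])
    mttd = mean_delta([inc["detected"] - inc["started"] for inc in ordered])
    mttr = mean_delta([inc["resolved"] - inc["started"] for inc in ordered])
    # Gap between one incident being resolved and the next one starting
    gaps = [nxt["started"] - cur["resolved"] for cur, nxt in zip(ordered, ordered[1:])]
    mtbf = mean_delta(gaps) if gaps else None
    return {"mttd": mttd, "mttr": mttr, "mtbf": mtbf}


# Two hypothetical incidents
incidents = [
    {
        "started": datetime(2024, 1, 1, 10, 0),
        "detected": datetime(2024, 1, 1, 10, 10),
        "resolved": datetime(2024, 1, 1, 11, 0),
    },
    {
        "started": datetime(2024, 1, 3, 10, 0),
        "detected": datetime(2024, 1, 3, 10, 20),
        "resolved": datetime(2024, 1, 3, 10, 30),
    },
]
metrics = incident_metrics(incidents)
print(metrics["mttd"], metrics["mttr"], metrics["mtbf"])
```

Tracking these values before and after an AIOps rollout gives a concrete way to check whether detection and resolution are actually getting faster.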
...

Indika Wimalasuriya

Associate Director / Senior Systems Engineering Manager @ Virtusa



