Securing the Future: How AIOps Drives Operational Resilience on AWS

Abstract

Unlock the power of AIOps to enhance security in complex systems. Explore how AI and Generative AI automate real-time responses to security threats, transforming operational resilience. Discover strategies and real-world implementations that drive measurable impact across your organization.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

There are some very interesting findings in the Cost of a Data Breach report. It puts the average cost of a data breach at close to 4.4 million, which is really expensive. It also found that about 46 percent of data breaches involve personal data. That's a dangerous precedent, because it means that nearly 50 percent of the time it's the customer data. So these are some of the challenges the numbers expose.

Hi everyone. My name is Indika Wimalasuriya. Welcome to DevSecOps 2024. Today we'll discuss how you can build a secure future in a complex landscape, and how we can leverage AIOps to drive operational resilience on AWS. I'll walk you through some of the challenges in distributed systems and why they are hard. Then we'll get into the detailed aspects of AIOps and the security offerings in AWS, and go into detailed AWS implementation use cases, very focused on the real world and on security. We'll wrap up with some of the strategies and metrics we should focus on when we are going through this journey.

A little bit about myself. I'm based out of Colombo, Sri Lanka. I'm a reliability engineering advocate and a solution architect. I specialize in site reliability engineering, observability, AIOps and generative AI. I'm a passionate trainer and an energetic technical blogger. I'm also very proud to hold the AWS Community Builder title under the cloud operations category, and to be an ambassador at the DevOps Institute.

So let's move into the detail. What are the challenges we face in today's distributed systems? We see a lot of challenges when we are working in the cloud, and they touch multiple life cycle phases. We'll go through them and see why we have to be vigilant.

Identifying something in our ecosystem is always a challenge because we have limited visibility across resources.
There are challenges related to inconsistent telemetry data coming from multi-cloud or poly-cloud environments, and to identifying assets and dependencies given the dynamic nature of things. Those are very challenging things to manage.

Then protect: how are we going to protect our estate? Here the challenge is preventive controls, because enforcing consistency across distributed systems is very hard. Configurations in the cloud, especially in cloud-native environments, and the vulnerabilities that can arise from them, are also very challenging.

Then detect: how quickly can you detect a threat? I have to agree that we have made significant progress in this regard. There are a lot of things in place, but that has also resulted in a lot of noise in our systems. False positives and noise are one of the big problems when you are trying to identify threats, and finding insider threats within a distributed access model is also very challenging.

Then respond: how are we going to respond? Fragmented incident response is one key challenge area. The lack of integrated tools for cross-cloud response, and delays in automating playbooks across regions and different parts of our infrastructure, are key challenges.

Then recovery. Generally we take a considerable amount of time to recover. Insufficient disaster recovery planning in cloud-native environments, and restoring the pre-incident state given the complex nature of our systems, are always a big challenge.

We also have a challenge in governance, because of the compliance overhead. Keeping pace with continuously changing regulatory requirements across the globe is very challenging, and so is ensuring our estate is always audit ready. Our estate is built on a dynamic, dynamically scaling foundation.
On top of that, ensuring things are audit ready is one of the big challenges. So across the different phases of our security operations life cycle (identifying, protecting, detecting, responding, recovering, and overall governance), we have seen challenges in each and every area. These challenges exist across cloud systems, and probably on premises as well. They have been amplified because we have been moving from on-prem to cloud, where the dynamic nature and elasticity create some of the challenges. We have also moved our monolithic applications to microservices with dependencies, which again creates a lot of challenges. And finally we have our telemetry data: the logs, metrics and traces. This generates a high volume of data, and that leads to a data paradox: we have more data, but it does not give us much more information. So that is again a challenge.

Moving on, let's get into some of the offerings from AWS and how AWS is focusing on meeting this challenge. Before we go any further, we should all be familiar with the AWS Shared Responsibility Model when it comes to the security pillar. AWS is responsible for security of the cloud. That covers the global infrastructure: availability zones, regions, edge locations and data centers. AWS also ensures the security of the foundation services: compute, storage, database and networking. So in summary, AWS is responsible for the security of the cloud. We, the customers, are responsible for security and compliance in the cloud, because we are bringing in the customer content.
We are developing our platforms and applications, and we are responsible for identity and access management. We are responsible for maintaining operating systems, network firewalls and the other configurations we build, as well as client-side data encryption, server-side data encryption and network traffic protection. So there is a very clear distinction here: AWS takes care of the security of the cloud, and we, the customers, have to secure what we put in the cloud. This is one of the key aspects to be aware of when we are going through this journey.

AWS also offers a wide range of services for identifying security problems, protecting our infrastructure and application domain, detecting things, investigating much faster, automating the response, and recovering. These services allow us to streamline the entire life cycle: identify, protect, detect, investigate, automate the response, and recover. It's very important that whoever is going to leverage AWS, or any cloud provider, understands these services. You don't have to reinvent the wheel; you just have to use these services to develop good practice when deploying your data in the cloud and ensuring that what's in the cloud is secure.

So AWS offers a lot of services. When it comes to identify, you have AWS Security Hub, AWS Organizations, AWS Control Tower and Trusted Advisor, which provide a lot of good capabilities. We also have AWS Service Catalog, AWS Config, the AWS Well-Architected Tool and Systems Manager, which provide a lot of capabilities here as well.
Under protect, we have AWS Transit Gateway, foundational services like Amazon VPC, and AWS Shield, IAM policies, Amazon Cognito, WAF and Firewall Manager. So there is a lot there. For detect, we have Amazon GuardDuty, Amazon Inspector and a lot of other services. When it comes to automation, we can use CloudWatch, the observability tool AWS has, to build a lot of these integrations and triggers. With it we can observe what's happening inside this security bubble, and whenever something triggers, we can invoke AWS Step Functions, Systems Manager or Lambda. With that we can build a very comprehensive workflow and recover much faster. There are a lot of other services as well, but at a high level, AWS offers many services that allow us to identify things, protect things proactively, and detect things, especially using AI: Amazon Macie, Inspector, GuardDuty, and DevOps Guru, which we are going to discuss in a little more detail later in the presentation. Collectively, these help us build a comprehensive, security-driven incident workflow.

As I said, our applications are growing and are very complex, so observability is also a key part. We have Amazon CloudWatch, which gives us the ability to instrument our code and collect the data, and we can use CloudWatch together with OpenTelemetry. That gives us metrics, logs and traces, the key pillars of observability. On top of that we can build visualizations, insights and analytics, and we have digital experience monitoring as well. On top of those pillars come the key AI features CloudWatch provides. We can look at metric anomaly detection.
We can have log anomaly detection. We can have AI-driven natural language query generation and log analytics, and there are intelligent insights as well, available for containers, Lambda, applications and EC2. All of these are very powerful capabilities. They leverage AI and give us advanced options for staying on top of system issues. So when it comes to security issues, we are not going the traditional way. We can use these AIOps capabilities to identify things much faster, understand the root cause much faster, prevent issues, and even eliminate them permanently.

So what are we looking for from AIOps in security? What generally happens is that we see a security-related incident, and we want to identify it and fix it. With AIOps we want to make that really fast. The first task of AIOps is to ask: even before a service-impacting issue happens, can we detect it? Can we detect that security issue and eliminate it? That is the predictive side. If not, can we detect the security incident much faster, so that we reduce our mean time to detection, and then fix it much faster, so that we reduce the cycle time? Finally, can we self-heal? We discussed the automation workflows AWS offers; by leveraging them for self-healing, we detect things and then resolve them much faster. That is the whole advantage AIOps brings to the table when it comes to managing high-impact security incidents.

One of the other very cool services AWS has developed, and that we have seen evolving, is DevOps Guru.
It analyzes your AWS resources, whether scoped by account, by tags, or across all your services. It looks at metric anomalies and related events, and it is able to provide recommendations. The purpose of Amazon DevOps Guru is to continuously analyze streams of data coming from multiple channels. It has the ability to monitor things, identify the relevant metrics, baseline the normal behavior, and on top of that understand the deviations, the outliers and the anomaly patterns. This happens across our entire AWS account, so the entire thing is covered.

Some of the key capabilities of DevOps Guru: it can do very powerful anomaly detection across logs, metrics and other data. It has root cause analysis capability as well; by leveraging AI and correlating these data sources, it can do some smart root cause analysis and guide you in the right direction. It can provide a lot of proactive insights and recommendations to prevent security incidents, and it can also help optimize resources and identify some weaknesses in advance. It has a lot of good capabilities for database monitoring as well, so we can mitigate SQL injection and other database-related vulnerabilities. It can do capacity planning assessments and bring more intelligence to that area too. It also supports cross-service correlation, and it is integrated with most AWS services. It keeps track of security and other compliance items, and most importantly, it allows us to do automated remediation and build self-healing capabilities.
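To make the idea of baselining normal behavior and flagging deviations more concrete, here is a minimal Python sketch. It is not how DevOps Guru or CloudWatch work internally; it assumes a simple trailing-window z-score, and the function name `find_anomalies`, the window size, the threshold and the sample data are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(values, window=12, threshold=3.0):
    """Return indices of points that deviate strongly from a trailing baseline.

    Assumption: "normal" is the mean of the previous `window` points, and a
    point is anomalous when it sits more than `threshold` standard
    deviations away from that mean.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical API calls per minute, with one sudden spike.
api_calls = [100, 102, 98, 101, 99, 103, 97, 100, 102, 99, 101, 100, 98, 950, 101]
print(find_anomalies(api_calls))  # index of the spike
```

A managed service would use far richer models (seasonality, multiple correlated metrics), but the shape is the same: learn a baseline, then measure how far new data points fall from it.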
So it's one of my favorite, very useful services AWS offers. With it we can keep our entire estate in check all the time.

So let's look at the real-world applications of AIOps in the world of security. You have already seen the key services AWS offers, how we can leverage them, and the offerings that come as part of observability. With this, let's deep dive into some use cases and try to understand how we can prevent security incidents by leveraging these AIOps capabilities.

The first use case I would like to walk you through is anomaly detection in logs and metrics. Anomaly detection is a very powerful capability. What we are trying to do is baseline the behavior of logs and security-related metrics, and especially logs, such as access logs. This way we can identify unusual behaviors in our systems. We can find unauthorized access attempts, identify suspicious configurations, or spot lateral movement patterns. Some examples: we can monitor spikes in API call volume tied to a potential DDoS attack. The next time we see a spike in API calls, we can correlate it with what's happening and decide whether it is just something unusual or a cybersecurity problem. We can detect anomalous API calls from unknown IPs, especially when those IPs are behaving in a coordinated manner, and identify failed logins and bursts of attempts as well. So essentially we define a baseline for logs and metrics, and we try to understand the outliers and anomalies.
That helps us greatly in preventing some of these security problems.

The second one: as I said, there is so much data here. We have logs, metrics and traces, we have the observability data, and we have all the data coming from the security services. So correlating these things is very important. The ability to aggregate and correlate data from various sources is a very powerful capability, especially when you are investigating, so that we can cut through the noise and get to the root causes. It allows us to map relationships between suspicious events, because nowadays a lot of attacks are coordinated. When that happens, event correlation and the service data help us understand the coordination, and they also help us build unified timelines and unified response approaches. Some examples: correlating failed logins with outbound traffic, linking an S3 bucket access to an unusual IAM role, or flagging simultaneous logins from distant regions. This is again a very powerful use case, because noise is one of the greatest challenges we face these days. We are doing a lot of good work when it comes to detecting things, but with the noise it takes a lot of time to cut through to the actual problem. So event correlation is a powerful capability that we can leverage to reduce the MTTR of major security incidents.

That goes hand in hand with noise reduction. This is about filtering irrelevant alerts and ensuring there is no alert fatigue, so that teams are always looking at the high-priority incidents. Examples are reducing false positives in GuardDuty, suppressing duplicate alerts, or prioritizing high-risk vulnerabilities. If you are exposed to thousands of vulnerabilities, you have to reduce the noise and understand the essential, high-priority things.
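The noise reduction described above (suppressing duplicates and surfacing only high-priority items) can be sketched in a few lines of Python. This is an illustrative sketch only: the alert fields (`resource`, `type`, `severity`), the severity ranking and the `reduce_noise` function are assumptions for the example and do not mirror the GuardDuty or Security Hub finding formats.

```python
from collections import defaultdict

# Assumed severity scale for this example only.
SEVERITY_RANK = {"CRITICAL": 4, "HIGH": 3, "MEDIUM": 2, "LOW": 1}


def reduce_noise(alerts, min_severity="HIGH"):
    """Collapse duplicate alerts and keep only those at or above min_severity.

    Alerts are treated as duplicates when they share the same resource and
    type. Each group is reported once, with its highest severity and a count.
    """
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[(alert["resource"], alert["type"])].append(alert)

    floor = SEVERITY_RANK[min_severity]
    summary = []
    for (resource, alert_type), group in grouped.items():
        top = max(group, key=lambda a: SEVERITY_RANK[a["severity"]])
        if SEVERITY_RANK[top["severity"]] >= floor:
            summary.append({
                "resource": resource,
                "type": alert_type,
                "severity": top["severity"],
                "count": len(group),
            })
    # Most severe first; within a severity, the noisiest group first.
    summary.sort(key=lambda s: (-SEVERITY_RANK[s["severity"]], -s["count"]))
    return summary


# Hypothetical raw alert stream.
alerts = [
    {"resource": "i-0abc", "type": "PortScan", "severity": "LOW"},
    {"resource": "i-0abc", "type": "PortScan", "severity": "LOW"},
    {"resource": "role/admin", "type": "UnusualApiCall", "severity": "HIGH"},
    {"resource": "role/admin", "type": "UnusualApiCall", "severity": "HIGH"},
    {"resource": "role/admin", "type": "UnusualApiCall", "severity": "CRITICAL"},
    {"resource": "bucket/logs", "type": "PublicAccess", "severity": "HIGH"},
]

for item in reduce_noise(alerts):
    print(item)
```

Six raw alerts collapse into two actionable items here, which is the effect alert suppression and prioritization aim for at much larger scale.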
So we can use AI to really help us in this regard, narrowing our focus so that we get things done instead of drowning in noise.

Forecasting is a very powerful capability. We can predict potential threats based on historical data. We can look at things in a holistic way and identify risks even before they happen; that's one of the key objectives of AIOps. And we can allocate resources proactively to prevent some of these vulnerabilities as well. Some key examples: predicting potential DDoS attacks from traffic patterns. You don't have to wait until it happens; if the traffic is mimicking the classic pattern of a DDoS attack, you can be ready. Or we apply one of the other use cases to understand anomalous IPs acting in a coordinated fashion, we do the correlation, and then we are able to respond much faster. We can also anticipate IAM role misuse based on behavior, and forecast future patching needs. So forecasting is one of the key high-profile use cases AIOps offers.

And finally, automated incident response. We have to identify things much faster, but providing resolution is the key. This cuts across several of these use cases: anomaly detection, forecasting and correlation. Once you have those, it is easier to automate the response, and we can limit the blast radius of these threats. We can auto-isolate compromised instances, block malicious IPs automatically, or revoke compromised credentials for hosts or services. There are a lot of automation workflows we can develop using services offered by AWS.

Threat intelligence is also a key part. We can enhance some of these models to look at the way things are coming in and then take further action.
Threat intelligence can be tailored to understand evolving threats. Some examples are automatically blacklisting traffic from suspicious IPs, looking for phishing attacks, and enriching logs with threat intelligence. Those are some of the key things we can do as part of enhanced, intelligent threat detection.

We can also leverage AI to do a lot of good behavior analytics. Behavior analytics is about understanding what users are actually doing. It means monitoring typical behaviors in order to detect insider threats and ensure that we are compliant as well. Some examples are spotting access attempts outside work hours, detecting unusual data transfers by specific users, or identifying unusual configurations. Behavior analytics is a very powerful use case. You can look at access logs, audit trails, application logs, or key business-centric metrics to do this.

Now that we understand some of the high-profile use cases, let's look at the strategies for building an effective AIOps framework to mitigate cybersecurity risk. First and foremost, it's very important to set the goal. We have to have a good understanding of our purpose: what the threats are, what the blast radius is and how we can reduce it, and which improvements we are targeting when it comes to eliminating incidents, MTTD or MTTR. Those are very important. We should also focus on where our AIOps is going to integrate and which services it is going to monitor. It's very important to look at things in a holistic way: look at the logs, metrics and traces.
Look at the observability data coming from CloudTrail or GuardDuty, and use capabilities like DevOps Guru to have AI take care of the entire ecosystem. We also have to be very collaborative. We have to enable cross-team synergy between all the teams in the ecosystem, so that this goal is fulfilled by everyone. We obviously have to use CloudWatch. CloudWatch is the real-time observability tool offered by AWS, and it provides amazing capabilities like anomaly detection and forecasting. So we definitely have to plan to use these and build sophisticated, advanced capabilities around them. And obviously we will have to go with the automation provided by Systems Manager or Lambda: you can take the trigger from CloudWatch and then run the automation. AWS offers loads of tools. That doesn't mean you have to use everything, but you have to look at your needs and pick the tools that are useful for you, and with those you can build a very good AIOps framework to support your cybersecurity needs. You can look at the security and compliance capabilities as well. AWS provides a lot of security and compliance checks and best practices; definitely leverage them.

Then finally it's about KPIs. I always believe that, whether it's a security incident or anything else, it's about the customer experience. We have to ensure we maintain that customer experience and that our systems are available and reliable, especially the data: the data is available, the data is consistent, and data integrity is not compromised. Those are some of the key things we have to track. Then, for all security incidents, we have to look at mean time to detect, mean time to recover and mean time between failures. And because we are bringing in a lot of eliminations, we should also track how many security incidents have been self-healed or eliminated, and whether all of this is reducing volatility in our systems.

To summarize: AWS offers lots of services, and it has a very good shared responsibility model, where AWS is responsible for the infrastructure side of the cloud and we, the customers, are responsible for what we put in the cloud. AWS offers lots of services so that we can build comprehensive identification and protection, automate things, and recover much faster. We also have observability based on CloudWatch, which provides logs, metrics and traces, and which you can apply to all the places where you get security telemetry data. On top of that, we can do log and metric anomaly detection, forecasting, correlation and noise reduction. All of this helps us achieve these targets, but it's very important that you baseline these things so that you know where you are going and you are continuously improving these metrics.

So thank you everyone. Thank you for taking the time, and I hope this was an interesting session. If you have any questions, feel free to comment and I'll pick them up and reach out. Otherwise you can search for me on LinkedIn and connect with me there so we can have a discussion as well. There are lots of other good tech talks happening at DevSecOps 2024, organized by Conf42, so I'd ask you to join those sessions and learn about DevSecOps, new trends and what's happening. Thank you everyone. Thank you.
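The KPIs mentioned in the talk (mean time to detect, mean time to recover, and the share of incidents that were self-healed) are straightforward to baseline once incident timestamps are recorded. The sketch below is a hypothetical illustration: the record fields (`started`, `detected`, `resolved`, `auto_remediated`) are invented for the example, and it assumes MTTR is measured from detection to resolution, which is only one of several common definitions.

```python
from datetime import datetime, timedelta


def mean_delta(deltas):
    """Average a non-empty list of timedelta values."""
    return sum(deltas, timedelta()) / len(deltas)


def incident_kpis(incidents):
    """Compute MTTD, MTTR and the self-healed ratio for a list of incidents.

    Assumptions: MTTD runs from incident start to detection, and MTTR runs
    from detection to resolution.
    """
    mttd = mean_delta([i["detected"] - i["started"] for i in incidents])
    mttr = mean_delta([i["resolved"] - i["detected"] for i in incidents])
    healed = sum(1 for i in incidents if i["auto_remediated"])
    return {
        "mttd": mttd,
        "mttr": mttr,
        "self_healed_ratio": healed / len(incidents),
    }


# Hypothetical incident records.
incidents = [
    {
        "started": datetime(2024, 12, 5, 10, 0),
        "detected": datetime(2024, 12, 5, 10, 10),
        "resolved": datetime(2024, 12, 5, 10, 40),
        "auto_remediated": True,
    },
    {
        "started": datetime(2024, 12, 5, 12, 0),
        "detected": datetime(2024, 12, 5, 12, 20),
        "resolved": datetime(2024, 12, 5, 13, 20),
        "auto_remediated": False,
    },
]

kpis = incident_kpis(incidents)
print(kpis["mttd"], kpis["mttr"], kpis["self_healed_ratio"])
```

Recomputing these numbers on a regular schedule gives the baseline the talk recommends, so that improvements from AIOps adoption can be measured rather than assumed.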

Conf42 DevSecOps 2024 - Online

December 05 2024 - premiere 5PM GMT

Indika Wimalasuriya

Associate Director / Senior Systems Engineering Manager @ Virtusa
