Transcript
This transcript was autogenerated. To make changes, submit a PR.
There are very interesting findings identified as part of
this cost of a data breach report.
So they identified that close to
4.4 million is the average cost of a data breach, which is really expensive.
And they also identify close to more than 46 60 percent of data
breaches involve personal data.
That's, I would say, a very dangerous precedent, because it's like 50 percent
of the time it's the customer data.
So these are some of the challenges being exposed by these numbers.
Hi everyone.
My name is Indika Vimalasuriya.
Welcome to DevSecOps 2024.
Today we'll discuss how you can build a secure future
in a complex landscape and how we can leverage AIOps to drive
operational resilience on AWS.
I'll walk you through some of the challenges in distributed systems
and why it's hard, and then we'll get into some of the detailed aspects
of AIOps and the security offerings in AWS, and we will go into very
detailed AWS implementation use cases,
very focused on the real world and security.
And we'll wrap it up with some of the strategies and some of the metrics
we should be focusing on when we are going through this journey.
So a little bit about myself.
I'm based out of Colombo, Sri Lanka.
I'm a reliability engineering advocate and a solution architect.
I specialize in site reliability engineering, observability,
AIOps and generative AI.
I'm a passionate trainer.
I'm an energetic technical blogger, and I'm also very proud to hold the AWS
Community Builder title from AWS under the cloud operations category, and also
very proud to be an ambassador at DevOps Institute. So let's move into the detail.
So what are the challenges we face in
today's distributed systems?
We see a lot of challenges when we are working on cloud, and
these challenges touch upon multiple life cycle phases.
We'll go through and see what these challenges are and why
we have to be vigilant about them.
So identifying something in our ecosystem is always a challenge because we have
limited visibility across resources.
There are challenges related to system telemetry data coming from
multi-cloud or poly-cloud environments, identifying assets and dependencies,
and the dynamic nature of things.
So those are very challenging things to manage.
Then when it comes to protect: how are we going to protect our estate?
Here we have a challenge of preventive controls,
because enforcing consistency across distributed systems is very hard.
And also configurations in cloud, especially in cloud-native
environments, and some of the vulnerabilities that can arise are very challenging.
So when it comes to detect, how quickly you can detect a threat,
that's the challenge.
And I have to agree that we have made significant progress in this regard.
There are a lot of
things in place, but that has also resulted in
a lot of noise in our systems.
So false positives and noise are one of the big problems when you are trying
to identify threats, and finding insider threats
within the distributed access model is also a very challenging aspect.
And then respond.
How are we going to respond?
Fragmented incident response is one key challenge area: a lack of
integrated tools for cross-cloud responses, and delays in
automating playbooks across regions and different parts of our infrastructure,
are a key challenge.
And then recovery: generally we take a considerable amount of
time to recover.
So insufficient disaster recovery planning in cloud-native environments
and restoration to the pre-incident state, given the complex nature of
our systems, are always a big challenge.
And we also have a challenge in governance, because of the
compliance overhead. Keeping pace with the continuously changing
regulatory requirements across the globe is a very challenging thing,
as is ensuring our estate is always audit ready.
Our estate is built on dynamic scaling,
and on top of that, we need to ensure things are audit ready,
which is one of the big challenges.
So you can see that across the different aspects of our security operations
life cycle, identifying things,
protecting things, detecting, responding, recovering, and overall
governance, in each and every area we have seen some challenges.
These challenges are across cloud systems, and probably on premises as well.
And they have been amplified because we have been moving from on
prem to cloud, and with cloud it's the dynamic nature and elasticity that
have created some of the challenges. And our monolith applications
have moved into microservices with dependencies, which is again creating
a lot of challenges as well. And finally we have our telemetry data: we
have the logs, metrics, traces, and these things, and this is generating
a high volume of data. And because we have a high volume of data,
it's a data paradox situation:
we have more data with less clarity, and it does not provide much information.
So it's again a challenge.
Moving on, let's get into some of the offerings AWS provides and how AWS
is focusing on meeting this challenge.
Before we go any further, I think we all have to be familiar with,
or let's at least cover, the AWS Shared Responsibility Model when
it comes to the security pillar.
AWS is responsible for security of the cloud, and that's about the global
infrastructure: availability zones, regions, edge locations, data centers.
So that's AWS's responsibility.
And AWS will also ensure the security of the foundation services
is being managed as well.
That's about compute, storage, database, and networking.
So AWS is,
in summary, responsible for
the security of the cloud.
And when it comes to us, the customers, we are responsible for security
and compliance in the cloud, because we are bringing in the customer content.
We are developing our platforms and applications, and we are responsible
for tracking, maintaining, and managing identity and access management.
And we are responsible for
the maintenance of operating systems, network firewalls, and other configurations
we are building: client-side data encryption, server-side data encryption,
and network traffic protection.
So there's a very clear distinction here:
AWS will take care of the security of the cloud, and we, the customers,
want to ensure we secure what we're going to put in the cloud. So this is
one of the key aspects we have to be aware of when we are going through this journey.
AWS is also offering a wide range of services when it
comes to identifying some of these security problems, protecting our
ecosystems, the infrastructure, or the application domain, detecting things,
investigating things much faster, automating the
response, and recovering.
So these are some of the services AWS is providing, which allow us to
streamline this entire life cycle: identify, protect, detect, investigate,
automate the response, and recover.
So it's very important that
whoever is going to leverage AWS, or any cloud provider, has an
understanding of these services, because you don't really have to reinvent the
wheel; you just have to ensure you use these services so that you can develop a
good practice when it comes to deploying your data in the cloud and ensuring
that what's in the cloud is secure.
So there are a lot of services AWS is offering.
When it comes to identify, you have AWS Security Hub, AWS
Organizations, AWS Control Tower, and Trusted Advisor, which
provide a lot of good capabilities.
We also have AWS Service Catalog, AWS Config, the AWS Well-Architected Tool, and
Systems Manager, which also provide a lot of capabilities here.
Under protecting things, we have AWS Transit Gateway, all the foundational
services like Amazon VPC, AWS Shield, IAM policies, Amazon Cognito, and WAF.
There's Firewall Manager, a lot of stuff, right?
And for detecting, we have Amazon GuardDuty, Amazon Inspector,
and a lot of other services.
When it comes to automation, we can use CloudWatch to build a lot of these
integrations or triggers. CloudWatch is
the observability tool AWS has, so with that we can observe what's
happening inside this security bubble, and whenever there are triggers,
we are able to invoke AWS Step Functions, Systems Manager, or Lambda.
With that we can build a very comprehensive workflow, and
we can recover things much faster.
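
The trigger-to-automation flow described here can be sketched as a small Lambda-style handler. This is only an illustrative sketch: it assumes the function receives a CloudWatch alarm state-change event (with `detail.alarmName` and `detail.state.value`), and the alarm name prefix, the `PLAYBOOKS` table, and both remediation functions are hypothetical placeholders that return strings instead of calling any AWS API.

```python
from typing import Any, Callable, Dict

# Hypothetical remediation actions. In a real setup each one would call an
# AWS API or start a Step Functions workflow; here they only describe the step.
def isolate_instance(alarm: str) -> str:
    return f"isolate instance for {alarm}"

def notify_security_team(alarm: str) -> str:
    return f"notify security team about {alarm}"

# Hypothetical mapping from alarm-name prefix to remediation action.
PLAYBOOKS: Dict[str, Callable[[str], str]] = {
    "sec-unauthorized-api": isolate_instance,
}

def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, str]:
    """Route an alarm state-change event to a remediation action."""
    detail = event.get("detail", {})
    alarm_name = detail.get("alarmName", "")
    state = detail.get("state", {}).get("value", "")

    # Only act when the alarm has actually fired.
    if state != "ALARM":
        return {"alarm": alarm_name, "action": "none"}

    # Use the first playbook whose prefix matches, else fall back to a notification.
    for prefix, action in PLAYBOOKS.items():
        if alarm_name.startswith(prefix):
            return {"alarm": alarm_name, "action": action(alarm_name)}
    return {"alarm": alarm_name, "action": notify_security_team(alarm_name)}

sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "sec-unauthorized-api-calls", "state": {"value": "ALARM"}},
}
result = handler(sample_event)
```

Keeping the routing table separate from the handler means new playbooks can be added without touching the dispatch logic.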
So there are a lot of other services as well, but at a high level, yes, AWS is
offering a lot of services which allow us to identify things, protect things
proactively, and detect things, especially using AI, with services like Amazon Macie,
Amazon Inspector, GuardDuty, and DevOps Guru, which we're going to discuss in a
little more detail in the subsequent slides of the presentation.
Those are key:
collectively they help us
to build a comprehensive, security-driven incident workflow. Also, as I
said, all our applications are growing.
It's very complex.
So observability is also a key part. We have Amazon CloudWatch, which provides
the ability to instrument our code and collect the data, and we can leverage
CloudWatch and probably OpenTelemetry.
That gives us metrics, logs, and traces, the key pillars
of observability. On top of that we can
build visualizations, we can build a lot of insights and analytics, and we have
digital experience monitoring as well.
So on top of the key pillars, the metrics, logs, and traces, there are the
key AI features CloudWatch
is providing: we can look at metric anomaly detection,
we can have log anomaly detection, we can have AI-driven natural language
query generation and log analytics, and there are intelligent insights as well.
The insights are coming from containers, Lambda,
Application Insights, and EC2.
So all these are very powerful capabilities. All of these leverage
AI, and they provide a lot of advanced options for us
to be on top of system issues.
So especially when it comes to security issues, we are not going
the traditional way; we can use some of these AIOps capabilities to
identify things much faster, understand the root cause much faster, prevent
it, and not only that, but actually go and eliminate these things permanently.
So generally, what are we looking at with AIOps for security?
What generally happens is we see a security-related incident,
and we want to identify it and fix it.
With AIOps, we want to make that really fast, right?
So if you look at it, the first task of AIOps is: even before a service-impacting
issue happens, can we detect it?
Can we detect that security issue and can we eliminate it?
Can we do more predictive maintenance so that we can eliminate it? If not, can we
detect the security incident much faster,
so that we can reduce our mean time to detection, and then fix it
much faster, so that we can reduce the cycle time?
The final one is: can we self-heal things? We discussed the
automation workflows AWS is offering, and by leveraging these automation
workflows we get self-healing capabilities:
we detect things and then we try to resolve things much faster.
So that is
the whole advantage the AIOps offerings bring to the table when it
comes to managing high-impact security incidents. One of the other very
cool services AWS has developed, and which we have seen evolving, is DevOps Guru.
It's about analyzing your entire set of AWS services:
your account, the tags, and all the services. It's able to look at
metric anomalies and related events, and it is able to provide recommendations.
So it will be continuously analyzing.
The purpose of Amazon DevOps Guru is to continuously analyze streams of
data coming from multiple channels.
It has the ability to monitor things and understand
the relevant metrics in order to identify
and baseline the normal behavior, and on top of that understand
the deviations, the outliers, and the anomaly patterns.
And this is happening across our entire AWS account.
So that way the entire thing is being covered.
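
The idea of baselining normal behavior and then flagging deviations can be illustrated with a very small sketch. This is not how DevOps Guru or CloudWatch work internally (their models are not described in this talk); it is a plain trailing-window z-score, and the function name, window size, threshold, and sample numbers are all assumptions made for the example.

```python
from statistics import mean, stdev
from typing import List

def find_anomalies(values: List[float], window: int = 5, threshold: float = 3.0) -> List[int]:
    """Return indexes whose value deviates from the trailing-window baseline
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Guard against a perfectly flat baseline (sigma == 0).
        spread = sigma if sigma > 0 else 1e-9
        if abs(values[i] - mu) / spread > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical per-minute API call counts with one obvious spike at index 7.
api_calls_per_minute = [100, 104, 98, 101, 99, 102, 100, 950, 103, 97]
spikes = find_anomalies(api_calls_per_minute)
```

One limitation of this simple approach is that a spike stays inside the trailing window for a few steps and inflates the baseline, so a production detector would usually exclude or down-weight flagged points.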
So some of the key capabilities of DevOps Guru: it can do anomaly detection, a
very powerful anomaly detection, across the logs, metrics, and other things.
It has the capability of root cause analysis as well: by leveraging
AI and correlating some of these data sources, it's able to do some smart root
cause analysis and guide you in the right direction. It can provide a lot of
proactive insights and recommendations to prevent security incidents,
and it's also able to optimize some of these resources and probably identify
some of the weaknesses in advance.
It has a lot of good capabilities when it comes to
database monitoring as well,
so some of these SQL injections and other database-related vulnerabilities
we can mitigate. It's able to do capacity planning assessment and be more
intelligent in that area as well. It's also able to provide support when
it comes to cross-service correlation, and it's pretty much integrated with
most of the AWS services. It will keep track of the security
and other compliance things, and the most important thing is it allows
us to do automated remediations and build self-healing capabilities.
So it's one of my favorite, very useful services AWS is
offering; with that we can keep our entire estate in check all the time.
So let's go and check what the
real-world applications of AIOps are in the world of security.
You have already seen what key services AWS
is offering, how we can leverage these offerings, and also the offerings
that come as part of observability.
With this, let's deep dive into some of the use cases and try to
understand how we can prevent some of these security incidents by leveraging
the AIOps capabilities being offered.
Let's go through them.
The first use case I would like to walk you through is anomaly
detection in logs and metrics.
Anomaly detection is a very powerful capability.
What we are trying to do is baseline the performance when it
comes to logs and some of the security-related metrics, and especially logs,
because of the access logs and other
things. This way we can identify some of the unusual behaviors in our systems.
We can find some of the unauthorized access attempts,
identify very suspicious configurations, or spot lateral movement
patterns. Some examples: we are able to monitor spikes in API
call volume tied to a potential DDoS attack. So the next time
we see a spike in API calls, we can go and correlate what's happening
and whether it's something unusual or whether it's something to do with a
cybersecurity problem. We can detect anomalous API calls from unknown
IPs, especially whether these IPs are behaving in a
coordinated manner, and identify failed logins and bursts of attempts as well.
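
Two of the examples above, bursts of failed logins and calls from unknown IPs, can be sketched with a sliding window over log events. This is a minimal illustration under stated assumptions: the event tuple layout, the `FAIL` outcome label, the thresholds, and the sample IP addresses are all made up for the example and do not correspond to any particular AWS log format.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List, Set, Tuple

Event = Tuple[float, str, str]  # (timestamp in seconds, source IP, outcome)

def detect_failed_login_bursts(events: Iterable[Event], max_failures: int = 3,
                               window_seconds: float = 60.0) -> Set[str]:
    """Return source IPs with more than `max_failures` failed logins
    inside any sliding window of `window_seconds`."""
    recent: Dict[str, Deque[float]] = defaultdict(deque)
    flagged: Set[str] = set()
    for ts, ip, outcome in sorted(events):
        if outcome != "FAIL":
            continue
        failures = recent[ip]
        failures.append(ts)
        # Drop failures that have fallen out of the window.
        while failures and ts - failures[0] > window_seconds:
            failures.popleft()
        if len(failures) > max_failures:
            flagged.add(ip)
    return flagged

def unknown_sources(events: Iterable[Event], known_ips: Set[str]) -> List[str]:
    """Return the sorted source IPs that are not in the known set."""
    return sorted({ip for _, ip, _ in events} - set(known_ips))

events = [
    (0, "10.0.0.5", "OK"),
    (5, "203.0.113.9", "FAIL"),
    (10, "203.0.113.9", "FAIL"),
    (15, "203.0.113.9", "FAIL"),
    (20, "203.0.113.9", "FAIL"),
    (300, "10.0.0.5", "FAIL"),
]
bursts = detect_failed_login_bursts(events)
strangers = unknown_sources(events, {"10.0.0.5"})
```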
So it's pretty much that we define a baseline for the logs and the
metrics, and we try to understand the outliers and the anomalies,
which will help us greatly in preventing some of these security problems. The second
use case: as I said, there's so much data here. We have logs, metrics, traces.
We have the observability data.
We have all the data coming from the security services, so correlating
these things is very important.
The ability to aggregate and correlate data from various sources is a very
powerful capability, especially when you are investigating things, so that we can
cut the noise and go to the root causes.
So this allows us to map relationships between suspicious
attacks, because nowadays a lot of attacks are coordinated.
When that happens, event correlation and the service data help us
to understand this coordination, and it also helps us to build unified timelines
and unified response approaches as well.
So some of the examples:
correlating failed logins with outbound traffic, linking an S3 bucket
access to an unusual IAM role, or flagging simultaneous logins from distant regions.
So this is again a very powerful
use case, because noise is sometimes one of the greatest
challenges we face these days.
We are doing a lot of good work when it comes to detecting things,
but with the noise, it takes us a lot of time to cut across these
things to get to the actual problem.
So event correlation is a powerful capability which we can leverage
to reduce the MTTR of dealing with these major security incidents.
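
Two of the correlation examples mentioned above can be sketched in a few lines: logins by the same user from different regions close together in time, and failed logins followed by an outbound traffic spike on the same host. This is a hedged illustration only; the tuple layouts, time windows, user names, host names, and region labels are assumptions for the example, and a real system would correlate far richer event data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def simultaneous_logins(logins: List[Tuple[float, str, str]],
                        window_seconds: float = 600.0) -> List[str]:
    """Return users with two consecutive logins from different regions
    within `window_seconds`. Each login is (timestamp, user, region)."""
    by_user: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for ts, user, region in sorted(logins):
        by_user[user].append((ts, region))
    flagged = []
    for user, entries in by_user.items():
        for (t1, r1), (t2, r2) in zip(entries, entries[1:]):
            if r1 != r2 and t2 - t1 <= window_seconds:
                flagged.append(user)
                break
    return sorted(flagged)

def correlate(failed_logins: List[Tuple[float, str]],
              outbound_spikes: List[Tuple[float, str]],
              window_seconds: float = 300.0) -> List[Tuple[str, float, float]]:
    """Pair each failed login with outbound-traffic spikes on the same host
    that follow it within `window_seconds`. Events are (timestamp, host)."""
    pairs = []
    for t_login, host in failed_logins:
        for t_spike, spike_host in outbound_spikes:
            if host == spike_host and 0 <= t_spike - t_login <= window_seconds:
                pairs.append((host, t_login, t_spike))
    return pairs

logins = [
    (0, "alice", "us-east-1"),
    (120, "alice", "ap-southeast-1"),
    (50, "bob", "eu-west-1"),
    (5000, "bob", "us-east-1"),
]
suspicious_users = simultaneous_logins(logins)
linked = correlate([(100, "web-1"), (100, "web-2")], [(220, "web-1"), (900, "web-2")])
```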
And that also goes hand in hand with noise reduction, which comes with
event correlation. This is about filtering irrelevant alerts
and ensuring that there's no alert fatigue, so that the teams are always
looking at the high-priority incidents.
So: reducing false positives in GuardDuty, suppressing duplicate
alerts, or prioritizing high-risk vulnerabilities. If you are exposed to
thousands of vulnerabilities, you have to reduce the noise and understand the
key, essential, high-priority things.
We can use AI to really help us in this regard, narrowing down our
focus so that we can get things done instead of getting drowned in alerts.
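
The noise-reduction steps named above (suppress duplicates, keep the high-priority items) can be sketched without any AI at all, which is a useful baseline before adding smarter ranking. The alert fields, severity labels, and sample alert types below are hypothetical and are not GuardDuty finding types.

```python
from typing import Dict, List

SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def reduce_noise(alerts: List[Dict[str, str]], min_severity: str = "HIGH") -> List[Dict[str, str]]:
    """Drop duplicate alerts and alerts below `min_severity`,
    then return the rest with the most severe first."""
    cutoff = SEVERITY_RANK[min_severity]
    seen = set()
    kept = []
    for alert in alerts:
        # Alerts with the same type on the same resource count as duplicates.
        key = (alert["type"], alert["resource"])
        if key in seen:
            continue
        seen.add(key)
        if SEVERITY_RANK[alert["severity"]] <= cutoff:
            kept.append(alert)
    return sorted(kept, key=lambda a: SEVERITY_RANK[a["severity"]])

alerts = [
    {"type": "PortScan", "resource": "i-01", "severity": "LOW"},
    {"type": "OpenBucket", "resource": "bucket-a", "severity": "HIGH"},
    {"type": "CredentialExfiltration", "resource": "role/app", "severity": "CRITICAL"},
    {"type": "CredentialExfiltration", "resource": "role/app", "severity": "CRITICAL"},
]
prioritized = reduce_noise(alerts)
```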
Forecasting is also a very powerful capability.
We can predict some of the potential threats based on historical data.
So we can look at things in a holistic way and identify risks
even before they happen, right?
That's one of the key objectives of AIOps.
And we can allocate resources proactively, in advance, to prevent
some of these vulnerabilities as well.
So one of the key examples: predicting potential DDoS
attacks from traffic patterns.
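
As a rough illustration of forecasting from traffic patterns, the sketch below fits a straight line to recent request counts and estimates how many steps remain before the trend reaches a threshold. A linear trend is a deliberately simple stand-in chosen for this example, not the forecasting model any AWS service uses; the function names, sample traffic, and threshold are assumptions.

```python
import math
from typing import List, Optional, Tuple

def fit_trend(values: List[float]) -> Tuple[float, float]:
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def steps_until_threshold(values: List[float], threshold: float) -> Optional[int]:
    """Forecast how many future steps until the trend line reaches `threshold`.
    Returns 0 if the latest value is already there, None if the trend is flat or falling."""
    if values[-1] >= threshold:
        return 0
    slope, intercept = fit_trend(values)
    if slope <= 0:
        return None
    crossing_index = (threshold - intercept) / slope
    return max(1, math.ceil(crossing_index - (len(values) - 1)))

# Hypothetical requests per minute, growing steadily.
requests_per_minute = [100, 120, 140, 160, 180]
eta = steps_until_threshold(requests_per_minute, threshold=300)
```

The returned step count gives a lead time that could be used to scale capacity or tighten rate limits before the threshold is reached.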
So you don't really have to wait until it happens; if the way traffic is coming in
is mimicking a classic traffic pattern of a DDoS attack, you can
be ready. Or we apply some of the other use cases to understand
these anomalous IPs acting in a coordinated fashion, we do correlation, and then
we are able to respond much faster. We can also look at anticipating
IAM role misuse based on behavior, and predicting
future patching needs. So forecasting is one of the key high-profile
use cases AIOps is offering. And finally, the automated incident response.
We have to do that.
We have to identify things much faster, but providing resolution is the key.
This might cut across multiple of these use cases: the anomaly detection,
the forecasting, correlating things.
Once you do that, it's probably easier for us to automate the response,
and we can of course limit the blast radius of these threats. We can do
auto-isolation of compromised instances, block malicious IPs automatically, or
revoke compromised credentials for the hosts or the services. So there
are a lot of automation workflows
we can develop using services offered by AWS. And threat intelligence:
it's a key part.
We can enhance some of these models to look at the way things are coming in
and then take more actions.
So this can be more tailored to understanding evolving threats.
Some of the examples are automatically blacklisting traffic from
suspicious IPs, looking at phishing attacks, and enhancing
logs with threat intelligence.
So those are some of the key things which we can do as part of enhanced,
intelligent threat detection.
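
The automated response and threat-intelligence ideas above can be sketched together: enrich a log record with a threat-feed verdict, then derive the containment steps from it. Everything here is a hypothetical placeholder, including the feed contents, the record fields, and the step strings; a real playbook would call the relevant AWS APIs instead of returning text.

```python
from typing import Dict, List, Set

# Hypothetical threat-intelligence feed: IPs considered malicious.
THREAT_FEED: Set[str] = {"198.51.100.7", "203.0.113.9"}

def enrich(log: Dict[str, str]) -> Dict[str, object]:
    """Attach a threat-intelligence verdict to a log record."""
    return {**log, "known_malicious": log["source_ip"] in THREAT_FEED}

def plan_response(log: Dict[str, object]) -> List[str]:
    """Return the containment steps for one enriched log record.
    Steps are plain strings here to keep the sketch self-contained."""
    steps: List[str] = []
    if log["known_malicious"]:
        steps.append(f"block ip {log['source_ip']}")
        if log.get("instance_id"):
            steps.append(f"isolate instance {log['instance_id']}")
        if log.get("access_key_id"):
            steps.append(f"revoke credentials {log['access_key_id']}")
    return steps

record = {"source_ip": "203.0.113.9", "instance_id": "i-0abc", "access_key_id": "AKIAEXAMPLE"}
response = plan_response(enrich(record))
benign = plan_response(enrich({"source_ip": "10.0.0.5"}))
```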
And we can definitely leverage AI to do a lot of good behavior analytics as well.
Behavior analytics is about understanding user behaviors and trying to
see what users are actually doing.
So this is about monitoring typical behaviors to
detect insider threats and ensure that we are compliant as well.
Some of the examples are spotting access attempts outside work
hours, detecting unusual data transfers by specific users, or even
identifying unusual configurations.
So behavior analytics is a very powerful use case:
we can look at the access logs,
the audit trails,
application logs, or some of the key
business-centric metrics to do this.
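
Two of the behavior-analytics examples above, access outside work hours and unusual data transfers by a specific user, can be sketched with simple per-user rules. The working-hours range, the median-based rule, the multiplier, and the sample users are assumptions made for illustration; real behavior analytics would learn these baselines from history.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median
from typing import Dict, List, Tuple

def outside_work_hours(ts: datetime, start_hour: int = 8, end_hour: int = 18) -> bool:
    """True for weekend access or access outside [start_hour, end_hour)."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)

def unusual_transfers(transfers: List[Tuple[str, float]],
                      factor: float = 10.0) -> List[Tuple[str, float]]:
    """Flag (user, size) transfers larger than `factor` times that user's median size."""
    per_user: Dict[str, List[float]] = defaultdict(list)
    for user, size in transfers:
        per_user[user].append(size)
    return [(user, size) for user, size in transfers
            if size > factor * median(per_user[user])]

# Hypothetical transfer sizes in megabytes.
transfers = [
    ("alice", 10), ("alice", 12), ("alice", 11), ("alice", 900),
    ("bob", 500), ("bob", 520),
]
flagged_transfers = unusual_transfers(transfers)
night_access = outside_work_hours(datetime(2024, 6, 3, 2, 30))
```

Comparing each user against their own median, rather than a global threshold, keeps a user who routinely moves large files from being flagged every time.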
So now that we have an understanding of some of the high-profile use cases,
let's look at the strategies for building an effective
AIOps framework to mitigate cybersecurity risk.
First and foremost, it's very important to set the goal, right?
We have to have a good understanding of what our purpose is,
what the threats are, what the blast radius is, and
how we can reduce this blast radius,
and the improvements we are targeting when it comes to
eliminating incidents, or MTTD and MTTR.
So those are very important.
We should also focus on where our AIOps is going to integrate and what
services it's going to monitor.
It's very important you look at things in a
holistic way: look at the logs, metrics, and traces,
look at the data coming from CloudTrail or
GuardDuty, and use capabilities like DevOps Guru to have
AI taking care of the entire ecosystem.
We also have to be very collaborative.
We have to enable cross-team synergy between all the teams in
the ecosystem so that this goal of ours is fulfilled by everyone.
So we obviously have to use CloudWatch.
CloudWatch is the real-time observability tool offered by AWS, and it
provides amazing capabilities like anomaly detection, forecasting,
and other things. So we definitely have to plan to use these and build
sophisticated, advanced capabilities around them. And obviously we'll have
to go with the automation provided by Systems Manager or Lambda: you
can get the trigger from CloudWatch and then do the automation as well. AWS
is offering loads of tools; that doesn't mean that you have to use everything,
but you have to look at your need, and based on that you can pick the
tools useful for you, and with that you
can build a very good AIOps framework
to support your cybersecurity needs.
And you can look at the security and compliance capabilities as well.
There are a lot of security compliance tools,
checks, and best practices AWS is providing; definitely leverage them. And
then finally it's about KPIs. I always believe that, regardless of whether
it's a security incident or anything else,
it's about the customer experience. So we have to ensure we maintain
the customer experience and then ensure our systems are available and reliable
as well, especially the data:
the data is available, the data is consistent, and the data
integrity is not being compromised.
So those are some of the key things we have to track.
And then for all the security incidents, we have to
look at mean time to detect,
mean time to recover, and mean time between failures. Also, because
there's a lot of elimination we are bringing in: how many
security incidents have been self-healed or eliminated, and can all this lead
to us reducing volatility in our systems as well? So the whole idea, if
I take a summary: AWS is offering lots of services, and they have a very good
shared responsibility model, where AWS is responsible for
the infrastructure side of the cloud, and
we, the customers, are responsible for what we are going to put in the cloud, right?
AWS is offering lots of services to ensure that we can build
very comprehensive identification and
protection, and the ability to automate
things and recover things much faster.
And we also have observability based on CloudWatch, which provides your
logs, metrics, and traces, which you can apply
to all the places where you get the
security telemetry data as well.
And then on top of that, we can do log anomaly detection,
metric anomaly detection, forecasting, correlation, noise
reduction, and all those things.
All of this helps us in achieving these targets, but it's very important
that you baseline these things so that you know where you are going and you are
continuously improving these metrics.
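
The KPIs mentioned in this section (mean time to detect, mean time to recover, mean time between failures) can be baselined from incident timestamps. The sketch below is one possible way to compute them; definitions vary between teams, so it assumes MTTR is measured from detection to resolution and MTBF from one incident's resolution to the next incident's start. The field names and sample values are hypothetical.

```python
from typing import Dict, List

Incident = Dict[str, float]  # times in minutes: started, detected, resolved

def mean_time_to_detect(incidents: List[Incident]) -> float:
    """Average gap between an incident starting and being detected."""
    return sum(i["detected"] - i["started"] for i in incidents) / len(incidents)

def mean_time_to_recover(incidents: List[Incident]) -> float:
    """Average gap between detection and resolution (assumed definition)."""
    return sum(i["resolved"] - i["detected"] for i in incidents) / len(incidents)

def mean_time_between_failures(incidents: List[Incident]) -> float:
    """Average gap between one incident's resolution and the next one's start."""
    ordered = sorted(incidents, key=lambda i: i["started"])
    gaps = [nxt["started"] - prev["resolved"] for prev, nxt in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

incidents = [
    {"started": 0, "detected": 10, "resolved": 40},
    {"started": 1000, "detected": 1004, "resolved": 1024},
    {"started": 3000, "detected": 3016, "resolved": 3056},
]
mttd = mean_time_to_detect(incidents)
mttr = mean_time_to_recover(incidents)
mtbf = mean_time_between_failures(incidents)
```

Recording these values before introducing AIOps tooling gives the baseline against which later improvement can be measured.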
So thank you everyone.
Thank you for taking the time.
I hope this was an interesting session.
If you have any questions, feel free to comment and I'll pick them
up and reach out to you. Otherwise you can search for me on LinkedIn,
and you can connect with me on LinkedIn so we can have a discussion as
well. There are lots of other good tech talks happening at DevSecOps
2024, organized by Conf42, so I request you to join those sessions and learn
about DevSecOps, new trends, and what's happening. Thank you everyone.
Thank you.