Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to Conf42 Chaos Engineering. I'm Aaron Reinhardt, CTO and co-founder of Verica, and my co-presenter is David Lavezzo, Director of Cyber Chaos Engineering at Capital One.
Today we're going to talk about cyber chaos engineering and security chaos engineering. Some of the things we'll be talking about are: what is chaos engineering, a very brief introduction to it; how it applies to cybersecurity; and how you can get started. David's also going to share his journey as Director of Cyber Chaos Engineering at Capital One. A little background about me: like I said before, I'm the CTO and co-founder of Verica.
I'm also the former chief security architect
of UnitedHealth Group. That's where I wrote the first open source tool that applied
Netflix's chaos engineering to cybersecurity. I'm also an O'Reilly author on the Chaos Engineering book as well as on Security Chaos Engineering, and I'm the author of the O'Reilly Security Chaos Engineering report.
I'm David, from Capital One. I'm the director of security chaos engineering, formerly from Zappos and Amazon, and also a contributing O'Reilly author for Security Chaos Engineering. I've released multiple metal albums and surf albums, and the picture that you see right there is actually my best photograph. And most notably,
I did indeed start an indoor snowball fight in Las Vegas.
You're much more interesting than I am, David. So let's
get started. So what is the base of the problem that we're trying to solve?
Well, in cybersecurity, these headlines are just examples: the problem is becoming more and more frequent in terms of breaches and outages, and the impact is getting more magnified. But why do they seem to be happening more often? That's part of the theme we're going to talk about: why we believe that's happening. So to start,
part of the problem is our systems have fundamentally evolved beyond our
human ability to mentally model their behavior. What you see in front of you is something called a Death Star diagram. Each dot represents a microservice. This used to be kind of a reflection of how the FAANG companies, Facebook, Apple, Amazon, Netflix, and Google, built their systems. So imagine if every dot is a microservice. That's very complex; that's a lot of things for you to keep track of in your head. Well, now it's not just Google, Facebook, and Netflix building like this. It's everyone; everyone's now building this.
The problem is that the complexity and the scale of these systems are very difficult to keep track of in our heads. So here's an example from Netflix; it's actually their edge streaming service.
The point of this slide is to demonstrate that.
So each large circle is actually a microservice, and each dot represents a
request. It's very difficult at any given point in time to keep
track of what the heck is even going on here.
And this is part of the problem that we're trying to
articulate. So where does this complexity come from?
So some of the areas that introduce complexity into how
we build software today are things like continuous delivery,
cloud computing, auto canaries, service meshes,
microservices, distributed systems in general, immutable infrastructure.
But these are the mechanisms that are enabling us to deliver value to customers faster than we've ever done before. The problem is
they also add a layer of complexity that makes it difficult for us humans to
keep track of what's going on. Where is the state of security? Well, in security, our solutions are still mostly monolithic. They're still expert in nature, meaning they require domain knowledge to operate; it's hard to just pick up a Palo Alto firewall. A lot of our security is very tools-rich, but it's very dependent upon commercial tool sets. You can't really just go to GitHub, download a tool, and get working with it. It's not exactly the same as it is in the world of software engineering. We are getting better at it by leaps and bounds, but still, our solutions are mostly stale in nature. So what is the answer to the complexity problem?
Well, the most logical answer would be to simplify. But it's important to recognize that software has now officially taken over. What you see on the right-hand side here is the new OSI model: it's now software, software, software, software, software. Software has taken over the entire stack. But also,
coming back to fundamentals, it's important to recognize that because of the ease and flexibility of change, software only ever increases in complexity. It never decreases, right? There's an old saying that there's no problem in software that another layer of abstraction can't solve. But when do those layers of abstraction start to become a problem? Well, we're starting to see that at speed and scale, in terms of complexity. So what does all this
have to do with my systems? What does that have to do with the systems
we build every day? We're getting to it. So the question remains: given all these things that we have to keep track of in our heads, and all the changes, how well do you really understand how your system works? I mean, how well do you really understand
it? Well, one thing, and this has been really a
transformational thing for me. It sounds really simple, but we love to forget
that systems engineering is a very messy exercise. So in the beginning, we love to look at our systems like this, right?
We have a nice plan, we have time, we have the resources, we've got our
pipeline figured out, the code, the base image for our docker
container, the configs. We know what services we need to deliver.
We've got a nice, beautiful 3D diagram of the AWS
instance that we're going to build in. It's clear what the
system is, right? In reality, our system never
looks like this. It's never this clear. It never really
achieves anything that looks like this.
What happens is, after a few weeks, a few months,
our system slowly drifts into a state where we no longer recognize it. And here's how that manifests. After a week, what happens is there's an outage on the payments API. You have
to hard code a token or there's a DNS error that you have to
get on the war room and fix. Marketing comes down and
says you have to refactor the pricing module or service because they
got it wrong. Or Google hires away your best, your elite software engineer. What happens is we only learn about
what we didn't really know about our systems through a series of surprises,
right, unforeseen events. But these are also the
mechanisms by which we learn about how our system really functions.
We learn through these surprises. But as a result of these surprises
comes pain. And the problem just magnifies over time: our system slowly drifts into a state that we really don't recognize anymore, through these unforeseen events. But these unforeseen events don't have to be
unforeseen. We can surface these problems proactively using chaos engineering, because it's proactive and managed; it's a managed way of doing that. We're not constantly reacting to outages or incidents that result from not understanding how our system functions. And we're going to get more into
how that works in the end.
The summation of that is that our systems become more complex
and messier than we initially remember them being.
So what does all this stuff have to do with security? Well, we're getting to
it. We're building towards it. So it's important to recognize
that the normal function of
your systems is to fail. It's to fail.
It's humans who keep them from failing. And as humans, we need failure. It's a fundamental component of growth. We need it to learn and to grow and to learn new things. We learn how not to do something in order to learn how to do it right. But the things we build are
really no different. They require the same inputs.
So how do we typically discover when our security measures fail?
Well, usually it's not until some sort of incident is launched: some human reports it, some tool throws an alert, or something tells us we're missing information. But by the time a security incident is launched,
it's too late. The problem is already probably out
of hand, and we need to be more proactive
in identifying when things aren't working the way we intend them to
work, so we don't incur the pain of incidents and
breaches or outages, for that matter.
So no system, like I said before, no system is inherently secure.
By default, it is humans that make them that way. So security, reliability, quality, resilience, these things are human constructs. It requires a human to manifest them, to create them. So pointing a finger at a human for causing a security problem is kind of not productive, right? It required humans to even create this thing we call security in the first place.
So, cognitively, we need to
also remember that people operate fundamentally
differently when they expect things to fail.
What do I mean? So what I mean by this is when there's an incident
or an outage, especially one related to security, people freak out, right? Because they're worrying about being named, blamed, and shamed. I shouldn't have made that change late at night, or I should have been more careful. Shoulda, coulda, woulda.
So what happens when there's an incident or an outage is
that within 15 minutes, some executive
gets on the phone or gets involved and says, hey, I don't care what you
have to do. Get that thing back up and running. The company is losing
money, right? So it becomes more about getting that thing back up and running, not about what really caused the problem to occur.
And so this environment is
not a good environment to learn. This is not how we should be learning about
how our systems actually function. With chaos engineering, we do it when there is no active war room. Nobody's freaking out. Nobody's worrying about being named, blamed, or shamed for causing an incident.
We're able to proactively surface these inherent
failures that are already in the system, the inherent chaos. We're trying to
make order of the inherent chaos that's already in the system, and we're
able to not be worried about the company
having an outage, losing money, customers calling in. We're able to do this eyes wide
open, understand the failure, proactively address it so we
don't incur customer pain. So, chaos engineering. This is where we get into chaos engineering.
The definition that Netflix wrote is the discipline of experimentation
on distributed systems in order to build confidence
in the system's ability to withstand turbulent conditions. So it's
about establishing, building confidence in how the systems actually
function. Another way of understanding it is it's a technique
of proactively introducing turbulent conditions into a system
or service to try to determine the conditions by which the system or service will fail, before it actually fails.
That's the key aspect. So, chaos engineering is about establishing order from chaos. It's not about just breaking things. If you're
just breaking things, you're not doing chaos engineering. Things are already
breaking. What we're trying to do is proactively understand why that
is, fix it before it manifests and breaks and causes
company problems. So, no story about chaos engineering would be complete without explaining the Chaos Monkey story. In late 2008, Netflix was changing over from shipping DVDs to building their streaming service on Amazon Web Services.
This is important to recognize. A lot of people say, oh, we can't do chaos engineering, we can barely do DevOps, we're barely in the cloud. Right? Well, Netflix had the need for
chaos engineering during their cloud transformation. It was a
key component of Netflix transforming to the cloud.
So if you're just transforming to the cloud, what chaos engineering ends up being is a feedback mechanism to help you understand: are the things that I'm building in this new world actually functional? Are they doing the things that I built them to do over time? You can almost think of it as a post-deployment regression test for the things you build. And so
that's important to recognize. So this isn't some crazy thing; the idea that you have to be super advanced to do it is a false interpretation of chaos engineering. Also, consider what chaos engineering really represented. So, what Chaos Monkey did: at the time, Netflix was building out their services in AWS, and what was happening was that AMIs, Amazon Machine Images, were, poof, just disappearing. At the time, that was a feature of AWS.
So service owners had a difficult time when that happened. So what
they needed was a way to test that if they built their service to be resilient to AMIs disappearing, it actually would be resilient to that problem. So it put a well-defined problem in front of engineers. Basically, Chaos Monkey would, during business hours, pseudo-randomly bring down an AMI on a random service. What it did was put the problem of an AMI outage or disappearance in front of the engineers, so they clearly knew what they had to build resilience into. And that's
really what chaos engineering is about. It's about building with better context and better understanding. So who's doing chaos engineering? I've lost track at this point. There are roughly 2,000 companies or so doing it in some form. The CNCF has declared chaos engineering one of the top five practices being adopted in 2021. Just about every major company is now looking into it or adopting it at some level of maturity or fashion. This is no longer just a fad.
So there are also three books out there. As David said, David and I were involved in writing the Security Chaos Engineering report. Actually, if you stay tuned toward the end of the presentation, we'll give a link to download that for free, as well as a link to download the Chaos Engineering O'Reilly book written by my co-founder, Casey Rosenthal, the creator of chaos engineering at Netflix.
And there's also the original report that Netflix wrote.
So, instrumenting chaos, so loosely
defined, where does chaos engineering fit in terms of instrumentation?
So Casey always likes to frame it as
testing versus experimentation. So testing is a verification or validation of something
you know to be true or false. You know the information you're looking
for before you go looking for it. In our world, whether it's a CVE or an attack, we kind of know what we're looking for,
whereas with experimentation, we're trying to derive new understanding and new information
that we did not have before. And we do that through the scientific method.
We do that with hypotheses in the form of: I believe that when condition X occurs in my system, Y should be the response, right? And we actually introduce X into the system and observe whether or not Y was the response,
and we learn about what really happens. And it's quite interesting, because I've
never seen a chaos engineering experiment succeed the first time
we run it. What does that mean? What that means is our understanding of
our systems is almost always wrong. And it's because the speed and the scale and the complexity of software are hard for humans to keep track of.
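To make that hypothesis form a little more concrete, here is a minimal sketch of what such an experiment can look like in code. This isn't from any particular tool; the helper functions and the example condition are hypothetical placeholders that only illustrate the "when X occurs, Y should be the response" structure.

```python
# Minimal sketch of a chaos experiment expressed as a hypothesis.
# The helpers below are hypothetical placeholders, not a real framework.
import datetime


def inject_condition(condition: str) -> None:
    """Introduce condition X into the system (stubbed for illustration)."""
    print(f"[{datetime.datetime.utcnow().isoformat()}] injecting: {condition}")


def observe_response(expected: str, timeout_s: int = 300) -> bool:
    """Watch telemetry for expected response Y (stubbed for illustration)."""
    print(f"waiting up to {timeout_s}s for: {expected}")
    return False  # a real experiment would poll logs, alerts, or metrics here


def run_experiment() -> None:
    # Hypothesis: "I believe that when X occurs, Y should be the response."
    condition = "unauthorized security group change"     # X
    expected = "firewall blocks change and alert fires"  # Y

    inject_condition(condition)
    hypothesis_held = observe_response(expected)

    # Either outcome is a learning: confirmation builds confidence,
    # refutation surfaces a gap before an incident does.
    print("hypothesis held" if hypothesis_held else "hypothesis refuted: investigate")


if __name__ == "__main__":
    run_experiment()
```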
In the world of applied sciences, we refer to these as complex adaptive systems. Casey and Nora Jones also like to talk about how it's not about breaking things on purpose or breaking things in production; it's about fixing them proactively, on purpose. And Casey likes to point out: I'm pretty sure that if I just went around breaking things, I wouldn't have a job very long. And that's important to recognize.
So, security chaos engineering. All right, what is the application of this to security? Well, what we're really trying to address is that hope is not an effective strategy. As an engineer, I've been a builder most of my career, and as engineers, we don't believe in two things: we don't believe in hope, and we don't believe in luck. We believe in instrumentation and empirical data that tells us whether it's working or not, so we can have context to build from and improve what we're building. Hope and luck worked in Star Wars as an effective strategy, but they don't work in engineering.
So what we're trying to do is understand our systems and their security gaps before an adversary does. Here's the problem that we're dealing with today: our systems need to be open and available so that a builder has the flexibility to change what they need to in order to make the system function, or to make the product operational for a customer. There's this need to change, and change quickly, even if you're changing a small thing. But there's also the need for the security mechanisms to reflect that change. What's happening is that the speed and flexibility of our ability to change is coming at the cost of our ability to understand and rapidly meet that change from a security perspective, and you magnify that problem with speed and scale.
What's happening is things are slipping out of our ability to understand what happened. So adversaries are able to take advantage of our lack of visibility into, and understanding of, how effective our controls are or are not. We have too much assumption and too much hope in how we build. We need more instrumentation post-deployment to tell us: hey, when a misconfigured port happens, we have, like, ten controls to catch that; or when we accidentally turn off or downgrade encryption on an S3 bucket, we should be able to detect that. We have all these things in place, and what we're actually doing is introducing test conditions to ensure that they are constantly working. So it's about
continuously verifying that security works the way we think it does. We introduce the condition to ensure that when a misconfiguration happens, the control catches it, stops it, or blocks it, or reports log data, or the alerts fire. I think David's going to talk a lot about how Capital One does a lot of that, but that's really what we're trying to achieve here. So it's about reducing uncertainty by building confidence. So really, you're seeing a theme here: the application of chaos engineering to security is no different.
It's just that we're looking at it more from the security problem set. We're trying to reconcile how we believe the system works with how it actually works in reality; we're bringing those two together. We build confidence as we verify that our understanding is actually in line with reality. So, use cases. I think David's
going to expand on some of these in a minute, but some rough use cases
you can apply chaos engineering to are incident response, security control validation, and security observability, and every chaos experiment also has compliance value, because you're basically proving whether the technology worked the way you thought it did or not. Each one of these use cases is expanded on in the O'Reilly report by people who have applied them in these particular areas. I'm going to talk a little bit about incident response in security chaos engineering. The problem with incident response is that the response is actually the problem.
So security incidents, by nature, are somewhat subjective. No matter
how much money we have, how many people we hire, how many fancy controls
we put in place, we still don't know a lot. It's very subjective. We don't
know where it's going to happen, why it's happening, who's trying to get in,
how they're going to get in. And so all the things we put in place,
we don't know if they're effective until the situation occurs.
Right? With security chaos engineering,
we're actually introducing that X condition, the condition that we spent all that time, effort, and money preparing to detect and block. Because we're no longer waiting for something to happen to find out whether it was effective or not, we're purposefully introducing a signal into the system, trying to understand: do we have enough people on call? Did they have the right skills? Did the security controls throw good log data? Were we able to understand that log data? Did it throw an alert? Did the SOC, the security operations center, process the alert?
All these things are difficult to assess confidence in while an incident is actually happening. We're doing this proactively, and we can now understand, measure, and manage things we couldn't manage and measure before. So it's about flipping the model: instead of the postmortem exercise being after the fact, you can think of the postmortem as now being a preparation exercise, determining how effective we are at what we've been preparing for. So, enter ChaoSlingr. This is a tool I talked about earlier, from when I was at UnitedHealth Group.
We started experimenting with Netflix's chaos engineering and applying it to some security use cases around incident response and security control validation, and we got some amazing results. The tool we wrote was called ChaoSlingr; there's an 'r' at the end there that's not showing up well on the slide. It's an open source tool, it's out there on GitHub, you can check it out. But really, ChaoSlingr represents a framework for how to write experiments. There are three major functions: the generator, the slinger, and the tracker. The generator finds the target to introduce the failure condition into, the slinger actually makes the change, and the tracker tracks the change and reports it into Slack. So here's an example. ChaoSlingr had
a primary example. We open sourced it. We needed something
to share with the wider community so they would know kind of what
we're trying to achieve. So we picked this example called PortSlingr, because everyone kind of knows this problem; we've been solving for misconfigured ports and firewalls
for, like, 30 years, right? Whether you're a software engineer, a network engineer,
a system engineer, executive, everybody kind of knows what a firewall does.
Okay, so what PortSlingr did was proactively introduce a misconfigured or unauthorized port change into an AWS EC2 security group.
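As a rough illustration of that kind of experiment (this is not the actual ChaoSlingr code), here is a minimal boto3 sketch. The security group ID, the port, and the detection check are placeholder assumptions; the shape of the experiment is inject, wait for the expected control to react, then clean up.

```python
# Sketch of a PortSlingr-style experiment: introduce an unauthorized
# ingress rule into an EC2 security group, then check whether the
# expected control caught it. The group ID, port, and detection check
# are placeholders, not real values.
import time
import boto3

ec2 = boto3.client("ec2")

GROUP_ID = "sg-0123456789abcdef0"  # a sacrificial, non-production group
UNAUTHORIZED_PORT = 2323           # something clearly outside policy

rule = {
    "IpProtocol": "tcp",
    "FromPort": UNAUTHORIZED_PORT,
    "ToPort": UNAUTHORIZED_PORT,
    "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "chaos experiment"}],
}


def detection_fired() -> bool:
    """Placeholder: poll your firewall, config management, or SIEM for the event."""
    return False


# Introduce the condition.
ec2.authorize_security_group_ingress(GroupId=GROUP_ID, IpPermissions=[rule])
start = time.time()

try:
    # Give the controls a window to detect, block, or alert.
    caught = False
    while time.time() - start < 600 and not caught:
        caught = detection_fired()
        time.sleep(30)
    print("control caught it" if caught else "gap found: nothing fired in 10 minutes")
finally:
    # Always clean up the injected condition.
    ec2.revoke_security_group_ingress(GroupId=GROUP_ID, IpPermissions=[rule])
```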
And so we started doing this throughout UnitedHealth Group at the time. UnitedHealth Group had a non-commercial and a commercial software division, so we started introducing the change into those security groups and started finding out how the firewalls behaved.
So our expectation was that our firewalls would immediately detect and block it. It would be a non-issue; this is something we've been solving for, remember, for 30 years. We should be able to detect it. That only happened about 60% of the time. The problem was a drift issue between our commercial and our non-commercial environments. That was the first thing we learned, and remember, there was no active incident. There was no war room. We did this proactively to learn whether the system was working the way it was supposed to. The second thing we learned was that the cloud-native configuration management tool caught it and blocked it almost every time. That was amazing: something we were barely
thing we expected to happen was that both
tools sent good log data to our security logging
tool. We didn't really use a SIEM at the time; we used our own homegrown solution. So I wasn't sure if an alert would be thrown, because we were new to the cloud, very new to AWS at the time, so we were learning all these things. And what happened is it actually threw an alert. So that was awesome. That third thing, which I kind of didn't expect to happen, happened. So the fourth thing is that the alert went to the SOC. But the SOC didn't know what to do with the alert, because they couldn't tell what account it came from,
what instance it came from. Now, as an engineer, you can say, well, Aaron, you can map back the IP address. Well, in truth, that could take 5, 15, 30 minutes to do, and if SNAT is in play, it could take two hours, because SNAT hides the real IP address.
But remember, if this were an incident or an outage, two hours is very expensive when you're the largest healthcare company in the world. But there was no incident, there was no outage. We proactively figured out: oh crap, this would have been bad had it actually happened. So we were able to add metadata pointers to the alert and then fix it, right? There was no real problem. But it could have been very expensive had we waited for a surprise to manifest on its own. These are the kinds of things you can uncover with security chaos engineering.
That's kind of a prime example. You can check out more examples, like I said earlier, in the book that David and I wrote, Security Chaos Engineering. So, why you should do this.
As for why, I don't really see a choice. Looking back over Aaron's slides, the relationship maps kind of look like coronavirus, a timely reference. Security is hard, and it's getting harder.
With all these vendors promising the world will be fixed by using their products,
it's even more difficult to make good decisions
on what the correct course of action is. Are the tools you have working? Do you need new tooling? Is the problem that you need to invest more time and effort? If it is, how do you know? People talk
about needing to address the basics before they talk about doing advanced things like security chaos engineering, but they're kind of missing the point around SCE in general. If you build this foundation into your engineering teams and the engineering functions, it becomes a basic capability. It becomes business as usual, and it removes the lift needed to get it implemented later on. The basic issues are often the ones you need to worry about before tackling an advanced adversary. I'm not saying the advanced adversary isn't important, because depending on your threat model, it can be, but you need to know what your capabilities are before they're taken advantage of by someone else. So I know this is going to be generalizing,
but most things that we're seeing in security are kind
of basic. We see password reuse, credentials in GitHub, cross-site scripting, firewall misconfigurations, the stuff that's been around for 30 years.
They're all basic things everyone knows, and these issues can just pop up
at any time. But it's not easy. Even though they're basic, they're definitely
not easy. You can look back at some of the recent hacks like SolarWinds and think, wow, how could they mess up something so easy? But if you're in a large environment, you kind of know the answer to that one: it's the complexity of things. Nothing is more permanent than a quick fix, something temporary. Nobody knows how all the systems interconnect, how one firewall rule can really affect another one, or even how service accounts can have undocumented services affected, not until you do it. And that's what needs to be observed and measured. You just do it, like in A Knight's Tale, that movie from like 100 years ago with Heath Ledger. It was a good movie, I think. So, baselining is where you want to get all of this to. Trying to see what actually happens is how you establish that baseline.
Do we really know that we can stop the malware, or that the vendor tool is doing what it promises to do? Does a proxy really stop a specific category? Well, you don't really know unless you try. And when you try it, you document it and treat anything that you find as that baseline, so you know where you've been, you know where you're going, and you've got something to compare against. And why you want to do that is to understand when and how things change, and to drive usable metrics that aren't just uptime but that tell you how your tooling performs.
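One rough way to picture that baselining idea in code: record what actually happened the first time you ran the test, then compare every later run against it. This is just a sketch; the file name and the result fields are arbitrary choices for illustration.

```python
# Sketch of baselining: store the first run's results, then diff every
# later run against them so drift is obvious.
import json
import pathlib

BASELINE = pathlib.Path("baseline.json")


def record_or_compare(run_results: dict) -> None:
    if not BASELINE.exists():
        # First run: whatever you observed becomes the baseline.
        BASELINE.write_text(json.dumps(run_results, indent=2))
        print("baseline established")
        return
    baseline = json.loads(BASELINE.read_text())
    for check, expected in baseline.items():
        observed = run_results.get(check)
        if observed != expected:
            print(f"drift on '{check}': baseline={expected!r}, now={observed!r}")


# Example run: did the proxy block the category, did the alert fire, etc.
record_or_compare({
    "proxy_blocks_category": True,
    "edr_alert_fired": True,
    "siem_received_logs": False,   # this is the kind of change you want to catch
})
```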
So when you want to prove the way that things work, you want to prove that your tools function. You want to know that it's more than just a blinking light that you're going to be using to comply with a policy somewhere. You don't want to wait until the ghost is already in your ballroom before you check whether your security capabilities function how you think they do. When you're asked, are we vulnerable to this thing, you'll have an answer. So instead of loosely waffling around
with, well, we've got rules in place that will
alert when it happens, you can prove that you have an answer by having the logs, having the data that shows the results or the triggered alerts. You can remove the guesswork and replace it with analysis and data. And as you tune your environment, you have a baseline to reference when detection or alerting fails.
You'll know if you fine-tuned a rule too much, because maybe a simple variable change has now bypassed your tooling. But in practice, it's not going to be that simple. Products are difficult. They're difficult to implement, they're difficult to configure and tune. And some of them may not even do what they're supposed to be doing outside of a few test cases that the vendor prepared, because they knew their tool would work on those. You can get a poorly designed UX, which mandates vendor-specific training and vendor time; you've got a deep investment in them, because the only way to know how their tools work is to get that deep training. When people move on, they take that training knowledge with them, and you've got to reinvest in new people.
As your defenses move through this testing, you continue to
update your expectations on what a system provides. You've got a
way to actually show improvement that isn't just, we saw this many attacks last
month, we blocked this many this month. You can show incremental improvement
as you go from one spot to another to another.
And as for why, maybe you're bored and you
think, you know what, I want to jump in the deepest rabbit
hole I can find. Let's see what I can find out. Because once you succeed
at this, you're going to be really popular. You're going to be so popular,
fetch might actually happen. I don't know.
It hasn't happened for me yet. But you know what they say about doing the same thing over and over again and expecting a different result. But for real, people get really excited. And when they get excited, they get ideas, and sometimes they're dangerous ideas. Dangerous meaning increased asks, expectations, and wanting to get even more of the data that you're generating. And it's a good thing, but you don't want to be caught unprepared.
So how do you get started with this? Think about how people and teams are being rewarded: strong defenses, alerts going off, uptime, and visibility. When you're doing this, you want to give them support. And when you're supporting people, they don't want to feel like they're being attacked. When people feel attacked, they kind of buckle down and they kind of resist any of the efforts that you're making. So you want to take the relationships you have, take what you know, and just sort of nurture them. So start with
something that will show immediate impact, something quick, something easy.
Your ultimate goal at this stage is to show the value of what you're doing
and do it in a way that shows support to the team that is tasked with owning and operating the platforms that you're experimenting on and with. Let's say you have a SOC. Take some of the critical alerts they've got and test them. If they haven't gone off in a while, it's even better, because you don't really know whether alerts work until they go off. So you test it and measure it. If they work, awesome. We've got a success. Let's celebrate the success. Tell everyone in a way that gets these teams rewarded, because too often people don't get rewarded for the things that they're doing well, simply because nobody knows about it.
You will get that data-driven metric to show that your tool is working, that your teams are working. So what I did starting out: I used a combination of endpoint alerting and endpoint protection. I took an attack TTP, documented the expected path that the attacker would take, what defenses were expected, what alerts would go off, and just let loose. Within roughly 15 minutes, I had my initial results and data to bring to my new best friends, and I brought the data with questions like: is this what you expected to happen? Are the logs what you expected? Did the alert go off when you expected it? If it did, awesome. So take that and repeat it every day, and help generate metrics with that data.
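As a loose sketch of what documenting that expected path and checking it can look like, here is a small example; the step names and alert names are made up for illustration, not anyone's real detections.

```python
# Sketch of documenting an expected attack path and checking it against
# what actually fired. Step names and alert names are invented.
expected_path = [
    {"step": "payload download", "expected_alert": "proxy-block-malware-category"},
    {"step": "powershell execution", "expected_alert": "edr-suspicious-powershell"},
    {"step": "credential access attempt", "expected_alert": "siem-lsass-access"},
]

# In practice you would pull this set from your alerting tool after the run.
observed_alerts = {"edr-suspicious-powershell"}

for step in expected_path:
    fired = step["expected_alert"] in observed_alerts
    status = "OK" if fired else "MISSING"
    print(f"{status:7} {step['step']}: {step['expected_alert']}")
```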
So, I mentioned what I like to do. I really wasn't settled on this GIF, but it worked for me.
Getting started back then, it was still endpoints, and it was something that I could test myself, something where I didn't need to get other people involved. I could say, all right, let me just do this and look. Because anytime you have to leverage other people, it's taking away from their day job, from the work that they need to keep doing. So I picked something that we cared about, critical alerts, and then picked the ones that we hadn't seen fire in a while.
And as for why: alerts go off when you're attacked. If you don't see alerts, is it because you're not being attacked? Maybe your alerts don't work. Maybe the thing doesn't log. Maybe the tools don't actually stop what they're supposed to stop.
And as for how I started it: I took something nondestructive. So LOLBins, living-off-the-land binaries, PowerShell, things that wouldn't actually affect the state of the machine. And I checked; I didn't want to do anything destructive, because if you leave the machine in a state worse than when you found it, that could be something very bad. You could put, like, a trademark on that: Very Bad. So when I did that, did I see
what I expected? Did it respond how I thought it would? If so, cool. Now figure out how to automate it and run it every day. It's a small thing, but you're starting to establish that baseline that you can use to know your basic level of protection, and you have a spot to compare against as your environment gets tuned, gets updated, gets fixed.
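Here is a minimal sketch of that kind of nondestructive endpoint test with a record you can automate daily; the benign command and the journal file name are illustrative assumptions, not a recommendation of specific tooling.

```python
# Sketch of a nondestructive endpoint test: run a benign command, record
# exactly when it ran, and leave the machine as you found it, so you can
# later correlate with what your endpoint tooling observed.
import datetime
import json
import subprocess


def run_benign_test() -> dict:
    # Harmless command; the point is only to generate activity your
    # endpoint tooling should observe and log.
    completed = subprocess.run(
        ["whoami"], capture_output=True, text=True, check=False
    )
    return {
        "command": "whoami",
        "exit_code": completed.returncode,
        "ran_at": datetime.datetime.utcnow().isoformat(),
    }


result = run_benign_test()
# Append to a local journal so you can later compare against EDR/SIEM data.
with open("sce_endpoint_tests.jsonl", "a") as fh:
    fh.write(json.dumps(result) + "\n")
print(result)
```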
Then, when you're getting started, you need to find out what your target cares about. What their goal is changes the approach that should be taken; validating something like malware detection isn't necessarily going to take the same approach you would use for validating that alerts trigger on things like lateral movement or network-based attacks.
And again, report on what you're doing. You want to be
like Malcolm up there. You want to just observe, report, show the
changes, and show teamwork to make things better. So getting
back into the use cases that Aaron was talking about.
So do critical alerts work?
It's always going to be a rhetorical question, do they work?
We all know alerts work until they don't. And sometimes you find out at
the worst possible time that something stops working. But there is a way to prevent
this: you test them. Always be running the attacks. It's not as cool as ABC, always be closing, but it's close enough. Always run attacks: schedule them, run them at a consistent time so you can identify outliers.
If you're running it at 8:00 a.m. every single day and you suddenly see it at 8:15 or 8:30, it's like, okay, well, what's going on that day? Was there an issue with logging that we didn't know about? When you do that, you're going to have historical trends of alert uptime, ensuring the alerts trigger as expected even though tuning processes in the future may inadvertently change how systems behave. You'll know what's working and what is not working.
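A small sketch of what "run it at a consistent time and keep the history" might look like; the check_alert_fired() call is a placeholder for whatever your alerting pipeline actually exposes, and the CSV file is just one convenient way to keep the trend.

```python
# Sketch of recording a daily, consistently-timed test so late or missing
# alerts stand out as outliers in the history.
import csv
import datetime


def check_alert_fired() -> tuple[bool, float]:
    """Placeholder: return (fired, seconds_from_injection_to_alert)."""
    return True, 412.0


def record_daily_run(history_file: str = "alert_uptime.csv") -> None:
    fired, latency_s = check_alert_fired()
    with open(history_file, "a", newline="") as fh:
        csv.writer(fh).writerow(
            [datetime.date.today().isoformat(), fired, round(latency_s, 1)]
        )


# Schedule this with cron (e.g. "0 8 * * *") so it runs at 8:00 a.m. daily;
# a 15- or 30-minute drift in the alert's arrival is then an obvious outlier.
record_daily_run()
```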
So when do the typical alerts become false negatives? If you're like any company ever, you hate false positives. Unlimited resources
aren't a thing, and teams have to prioritize what they spend their time on.
What do you do with false positives? In many cases, you just
spend the time and tune out the excessive noise. But like anything, that can cause its own issues. Has false positive reduction made your alerts so brittle that you're now missing attacks that you used to catch? Are our alerts hyper-focused on a single technique in a way that introduces false negatives? Have alerts been so finely tuned that a slight variable tweak now no longer triggers any logging or alerts? If so, well, at least you know now, because then you can feed that into your other use cases, such as engineering, hunting, and security architecture. So when the alerts fire,
how long does it take for the alert to fire? Does it fire when you
think it's going to happen? Because sometimes, in reference to the tuning, when you tune
something, you can still affect time to alert, which is going to also affect the
time to respond. And do the results show what's expected? If you're expecting the results within ten minutes, do you get them within ten minutes? If so, cool. If not, that's when most of the work starts, because now you've got to start tracing back: where is everything going? What does the ingestion path look like? What do we need to change?
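To illustrate that tracing-back step, here is a tiny sketch that compares timestamps at each stage of a hypothetical ingestion path so the slow hop stands out; the stage names and times are made up, and in practice you would pull them from each system's own records.

```python
# Sketch of locating alert latency by diffing timestamps across the
# ingestion path. Stage names and times are illustrative only.
from datetime import datetime

stages = {
    "condition_injected": "2021-05-10T08:00:00",
    "endpoint_logged":    "2021-05-10T08:00:12",
    "siem_ingested":      "2021-05-10T08:09:45",   # suspicious gap
    "alert_fired":        "2021-05-10T08:10:02",
}

times = {name: datetime.fromisoformat(ts) for name, ts in stages.items()}
names = list(times)
for prev, cur in zip(names, names[1:]):
    delta = (times[cur] - times[prev]).total_seconds()
    print(f"{prev} -> {cur}: {delta:.0f}s")
# The output makes it obvious the delay sits between the endpoint and the
# SIEM, which is where you start tracing the ingestion path.
```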
And finally, there's helping to train the new people as they're coming up to speed. Standard training exercises can be boring, so what we do is try to spice it up. Since it's already in production, there's not really a better way to get people to practice investigations and doing real work. How are they actually responding to security events? Do they adhere to playbooks? Are the playbooks right? Are there critical steps that people are keeping in their heads that never actually made it to the electronic paper that they're writing on?
So testing removes guessing, and when you remove that guessing,
you're removing mistakes. Like, everyone will make mistakes. It's part of everything that we
do. We make mistakes all the time; it's just how we recover from them, how we respond to them. And when you're doing this, just make sure you have permission and you've got approval to do it, because otherwise you
can also have a pretty bad day. So we get to another use case: gap hunting. Gap hunting relates to all of this because as you're finding out whether your tools are running and what you're missing, well, you are in a really good position to identify what you don't see. You're able to know what you don't detect. There's a lot of overlap here with something like detection engineering, but there's still a difference. We're not really worried about what we're detecting yet, because the primary goal is to find out how systems are responding, how they work, and to establish a baseline of behavior that we know can be detected, as well as to understand what our security stack sees and what it doesn't see.
And that behavior detection leads to discovering gaps in understanding, like what was missed. Did we expect to actually see it? Is there a configuration issue? Has something changed that we never thought about when we did the initial deployment? Does deviating from an expected attack technique result in reduced visibility? Does changing a delivery technique from one method to another, let's say PowerShell versus Python, produce false negatives? Is it something that is detected only? Is it something that we can change in the future as we improve the environment? Can we go from detection-only to prevention too? If so, that's another huge win that can also lead into the next use case, engineering.
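One way to picture gap-hunting output is a simple coverage matrix per technique and delivery variant; the techniques and outcomes below are invented purely for illustration.

```python
# Sketch of a gap-hunting coverage matrix: for each technique variant you
# exercised, record whether it was prevented, detected, or missed.
coverage = {
    ("script execution", "powershell"): "prevented",
    ("script execution", "python"):     "missed",     # delivery change -> false negative
    ("lateral movement", "smb"):        "detected",
    ("lateral movement", "winrm"):      "detected",
}

for (technique, variant), outcome in sorted(coverage.items()):
    print(f"{technique:18} via {variant:10} -> {outcome}")

gaps = [k for k, v in coverage.items() if v == "missed"]
print(f"\n{len(gaps)} gap(s) to feed into detection engineering:", gaps)
```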
I apologize. I ran out of gifts. I think the
child was maybe four years old when he drew a cat, and it
was really one of the most frightening things I've ever seen.
Imagine if that came to life.
I don't know. I don't know what to think about this. But all
the other cases can lead back into something for engineering, like the return on an
investment. Tooling and staffing isn't cheap. None of
this is cheap. So it's great to know: are the things that we're investing in, the tools, the people, paying off? Is everything working the way that we want it to? Are the tools optimized? Do we have duplicative tooling? Maybe we've got five or six endpoint agents that all do the exact same thing. Do you need that many? Do they have any extra capabilities the others don't have? If not,
do you need them all? Can you reduce complexity in the environment without reducing
your security capabilities? How is your defense in depth?
How many layers can fail before you're in trouble? Let's say you've got ten different layers of security that you're expecting, and you know that you can handle up to three of them failing. That's good to know, so you can determine: do we want to get 100% coverage in every single tool? Can we live without some? Can we be 80% effective, like the bugs in Starship Troopers when Doogie Howser was showing combat effectiveness? Can you do that?
And that also leads into measuring your environment.
Without measurement, you can't really show progress. It's like the training montage from Footloose: the guy on the left goes from not being able to snap to getting a standing ovation from Kevin Bacon. I don't remember his character's name, but I guess just Kevin Bacon is going to be enough for me. Measurement is going to help when you stand up and say: here's where we were last quarter, here's where we are now, here's where we've improved, here's what I think should be targeted, and this is where we hope to get. So you can start planning out your roadmap: where you were, where you are, where you're going. Audits become a lot easier when
metrics are constantly being generated. Instead of saying, we need to show this one point in time when something worked, you can say: okay, well, here's how you get the data. We've been running this for this long, and we have consistent reporting all the way through, every week. So instead of being a snapshot in time, showing compliance that one time way back when, you are always showing compliance, because you are proving that you are secure. All the data can be used to support the people you work with in improving the entire security program, and it helps make data-driven decisions on what to focus on and what needs help, and provides evidence that what used to work still works.
There are some risks. Sometimes people don't care; it could be the way it's presented. But sometimes there are other problems. If you don't do it the right way, telling people about problems can make them feel bad, and then you get the pushback. So what I use is more of a framework for the issues: here's what I see; I'm saying maybe it's incomplete metrics, maybe there are alerts without context. Here's my interpretation. And then ask: what am I missing? Here's what I think that means. What do you think? It could be something that people already know about, something that's a work in progress.
It's important not to go in there head
on fire, pants on your head. Go in there
with a collaborative effort, a collaborative spirit,
and then deliver one message that you want them to walk away
with. Why should they believe this message? What confidence do you have? Just provide all
of that and it makes it a lot easier to have a good working relationship
and move the needle, as they say.
And my final message is: don't worry, because if I can do it, you can do it. Just like that. So here's the link to the Security Chaos Engineering book. You can download it for free; a free copy, compliments of Verica.
And if you're also interested in chaos
engineering, here's a link to both books. Great.
Thank you very much.