Transcript
This transcript was autogenerated. To make changes, submit a PR.
Gain real
time feedback into the behavior of your distributed systems.
Observing changes, exceptions, and errors in
real time allows you to not only experiment with confidence,
but respond instantly to get things working again.
Welcome to my talk. I would like to thank the organizers
for giving me an opportunity to speak at this conference.
I will be speaking about defining steady states,
developing hypotheses, and security chaos engineering.
So the primary focus will be to discuss challenges
in moving chaos experiments beyond regular
testing of downtime and resiliency.
Finally, I'll also be exploring how these challenges can
be addressed by utilizing existing security benchmarks and
frameworks. What I
hope is that by the end of this talk, we will have explored a concept that
would help the adoption of security chaos engineering into broader cybersecurity
practices. So, a little introduction about myself: I
am Sakshyam Shah, and I currently work as a developer relations engineer
at Teleport. For those of you who don't know,
Teleport provides passwordless access
to infrastructure. It's one of the easiest yet most
secure ways to access SSH servers,
Windows servers, Kubernetes clusters, databases, and web applications across
all environments. Before Teleport, I had eight years of experience
exclusively in cybersecurity, doing both offensive
and defensive work. Besides cybersecurity, I love talking about new technologies and startups in
general. So if you have any questions or you want to
connect with me, feel free to ping me on either Twitter
or LinkedIn. Happy to chat with you.
So let's begin by explaining
what security chaos engineering is. Please bear with me on this.
This is a chaos engineering conference, and I can bet that there
are speakers who can explain
security chaos engineering more appropriately than me. But then again,
let me set the stage for the topic. Security
chaos engineering is chaos engineering applied to validate
security implementations and test for resiliency.
To understand security chaos engineering,
you first have to understand what led to chaos
engineering itself. In a typical software development
and delivery process, developers write unit tests,
integration tests, and end-to-end tests,
and the applications are deployed in production.
But despite all those tests, despite 100%
test coverage, applications are bound to fail, crash,
or face downtime. The reason for downtime can
be related directly to the application itself, or it can be caused
by many other dependencies that go into production. So that's
where chaos engineering comes in and says:
okay, despite the fact that you tested all that stuff,
the application services are still crashing in production.
Let's try to find those
unknowns well before they
become an incident in production. So chaos engineering
uses experimentation to validate
the assumption that the tests you wrote are actually working as expected.
Security chaos engineering works the same way.
Organizations have been practicing security
for a very long time: they are buying
security products, next-gen firewalls
and whatnot, and they have been implementing both reactive
and proactive security practices. They have been doing vulnerability assessments,
pen testing, and red teaming. They are training their
end users, internal employees,
and even developers in secure software development practices and security
awareness. Yet, despite all these efforts,
data breaches are not going to end anytime soon.
Security chaos engineering comes in and says:
okay, despite all the effort that we put into
security,
the rate of compromise of organizations is just growing,
and it's not stopping anytime soon. So let's try to find
those unknowns that can
escalate into a data breach. Let's try to find those things
as early as possible so that they can be prevented
in the future. Security chaos engineering involves experimentation
to validate assumptions and find the unknowns, basically testing
the effectiveness of all the controls that have already been
implemented. So the way I see it, in short, security
chaos engineering is a set of tests to validate all these assumptions.
Now, let's look at how a chaos experiment is conducted.
A typical chaos experiment starts
by defining a steady state, measuring the normal behavior:
how the application behaves in the normal case.
Then a hypothesis is developed
with a what-if scenario to create an experimental process,
and some chaotic variables are introduced
which try to affect the steady state of the application.
And at the end we measure the
results and validate:
Are the assumptions still correct or not? For example,
let's say we have an application service responsible for
handling 1 million requests per second. That's the steady state of that
application. A chaos engineer comes in and asks:
okay, what if the caching proxy in front of this
application service is down for ten minutes? In that case,
would our application service still be able
to handle 1 million requests per second? The
chaos engineer then deliberately takes down the
caching proxy, takes
some measurements, and checks the result:
in that case, can the application service really
handle 1 million requests per second?
And if it can't,
then the chaos engineer has found an unknown that was thought to have
already been addressed by the developers or reliability
engineers. Despite all those efforts, we can verify
that the application service will in fact
not be able to perform as it is supposed
to. So that's a typical example of a chaos experiment.
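To make that flow concrete, here is a minimal sketch of the caching-proxy experiment, under assumptions of my own: the steady state is approximated as successful probes per second against a hypothetical /healthz endpoint, and take_down_proxy / restore_proxy are placeholders for whatever fault injection your environment actually provides.

```python
# Minimal chaos-experiment skeleton for the caching-proxy example above.
# SERVICE_URL, take_down_proxy() and restore_proxy() are hypothetical
# placeholders; wire them to your own environment before running.
import time
import urllib.request

SERVICE_URL = "http://service.internal/healthz"  # hypothetical endpoint


def measure_success_rate(duration_s: int = 10) -> float:
    """Approximate the steady state as successful probes per second."""
    ok, start = 0, time.time()
    while time.time() - start < duration_s:
        try:
            with urllib.request.urlopen(SERVICE_URL, timeout=2) as resp:
                if resp.status == 200:
                    ok += 1
        except OSError:
            pass  # a failed probe does not count toward the steady state
    return ok / duration_s


def take_down_proxy() -> None:
    print("TODO: disable the caching proxy (orchestrator API, CLI, ...)")


def restore_proxy() -> None:
    print("TODO: bring the caching proxy back")


if __name__ == "__main__":
    steady_state = measure_success_rate()   # 1. define the steady state
    take_down_proxy()                        # 2. introduce the chaotic variable
    try:
        degraded = measure_success_rate()    # 3. measure under failure
    finally:
        restore_proxy()                      # 4. always roll the experiment back
    # Validate the hypothesis: the service keeps at least 90% of its throughput.
    print(f"steady {steady_state:.1f}/s vs degraded {degraded:.1f}/s")
    assert degraded >= 0.9 * steady_state, "hypothesis rejected: unknown found"
```

The structure is the important part: define the steady state, introduce the variable, measure again, and always roll back, regardless of the outcome.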
So how can this be related to security?
The first thing is, of course, the obvious one: testing
for resiliency. Testing for resiliency
is foundational; it's not related only to security
chaos engineering, but to regular chaos engineering in general as well.
In testing for resiliency, we have lots of expectations:
as developers and
site reliability engineers, we always think that
the application or service will work as expected
in production.
If you take this topic of
resiliency and relate it to modern
infrastructure and the application development and deployment process,
which is cloud native
and built on many microservices, the typical
resiliency dependencies of an application have grown a lot
bigger. For example, I have a picture on this
slide which shows the resiliency dependencies of an infrastructure
access solution, right? An access control, infrastructure access
solution. We take these things for granted and
think, okay, what could there be? Maybe just a bastion
host, a VPN server, or a modern access
proxy that allows the access. But you
can see that a typical access control solution in modern infrastructure has
many, many resiliency dependencies. For example, it depends
on identity providers, access providers, certificate authorities,
multi-factor authentication providers, hardware security modules,
approval systems, routers, firewalls, and switches. And these are just the main
dependencies I've listed here. So there are many
ways an application can face disruption
due to the failure of the dependencies
it needs to operate its whole feature set.
So the assumptions can be like: okay, the service can withstand downtime of this dependency.
A developer or DevOps
engineer comes in and says, okay, a whole new infrastructure service can
be up and running in under five minutes, and we have already tested for it.
Security engineers tell you that in case
we find a vulnerability, security patches can be applied as
soon as a patch is available. These are the general assumptions
that exist in the infrastructure
operations of every organization. So with security
chaos engineering, we try to validate
these assumptions. For example, for testing
resiliency, one assumption
is surviving downtime due to an outage of a service provider.
When a DevOps engineer
tells you that the service can withstand downtime of
a certain dependency, chaos engineers can come in and say,
okay, let's check if that assumption
is correct. They can introduce chaotic variables, which include
shutting down containers, virtual machines, and servers,
or disconnecting network interfaces. They can also mock a dependency
downtime to validate the assumptions, as in the sketch below.
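As a rough, hedged illustration of the container case, here is a sketch using the Docker SDK for Python (pip install docker); the container name "cache-proxy" and the health URL are hypothetical stand-ins for your own dependency and service.

```python
# One chaotic variable from the list above: stop a dependency's container and
# check whether the primary service survives. Requires the Docker SDK for
# Python (pip install docker); the container name and URL are hypothetical.
import time
import urllib.request

import docker

SERVICE_HEALTH_URL = "http://localhost:8080/healthz"   # hypothetical
DEPENDENCY_CONTAINER = "cache-proxy"                   # hypothetical


def service_is_healthy() -> bool:
    try:
        with urllib.request.urlopen(SERVICE_HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False


client = docker.from_env()
dependency = client.containers.get(DEPENDENCY_CONTAINER)

dependency.stop()                     # introduce the chaotic variable
try:
    time.sleep(30)                    # let the failure propagate
    survived = service_is_healthy()   # did the assumption hold?
finally:
    dependency.start()                # always restore the dependency

print("assumption held" if survived else "unknown found: service did not survive")
```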
Let's say there is an assumption that security patches can
be applied as soon as a patch is available. Chaos engineering
can come in and check:
okay, is that assumption true in any regard?
How much time does it actually take for upgrades
and patches? Do we even have the infrastructure ready for
rolling out certain kinds of patches to
our fleet of servers? What happens when new
pipelines and workflows are introduced? We talk
about security, but are we in a position
to smoothly roll out credential rotation
operations, like rotation of passwords, API keys, and certificate
authorities, without downtime? One way to probe that last assumption is sketched below.
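This is only a sketch under assumptions I am making up here: a hypothetical rotate-credentials.sh script stands in for your rotation tooling, and a hypothetical health endpoint is polled while it runs.

```python
# Probe the "credential rotation without downtime" assumption: keep polling
# the service while the rotation runs and record any failed probes. The
# rotation script and the URL are hypothetical; substitute your own tooling.
import subprocess
import time
import urllib.request

SERVICE_URL = "http://service.internal/healthz"   # hypothetical endpoint
ROTATE_CMD = ["./rotate-credentials.sh"]          # hypothetical rotation script

proc = subprocess.Popen(ROTATE_CMD)               # start the rotation
failures = 0
while proc.poll() is None:                        # probe until the rotation ends
    try:
        urllib.request.urlopen(SERVICE_URL, timeout=2)
    except OSError:
        failures += 1                             # a failed probe = observed downtime
    time.sleep(1)

# Hypothesis: rotation causes zero downtime; any failed probe rejects it.
print(f"{failures} failed probes during rotation")
```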
Obviously, these things are already
thought out when the systems are designed. But then again,
given day-to-day operations inside any organization,
changes are bound to happen. There will be drift in
configurations. The introduction of new workflows will
have side effects on existing procedures, and they can
affect the resiliency assumptions we have
already thought out or tested. So security chaos engineering can
come in and test those assumptions. Okay, now we are venturing more
into core cybersecurity topics.
So, testing the effectiveness of security
controls. For example, a security administrator
says that the web application
firewall policy will detect cross-site scripting attacks.
A SOC engineer says that our SIEM policies are
configured to detect credential compromise and lateral movement
adversarial tactics. A security engineer for
a web application says that in case application X
is compromised, it will still only affect
database Y and the attacker will not be able to pivot beyond that
database. These are the generic types of
assumptions we have inside our security
operations. The problem with testing these types of
assumptions, when it comes to security chaos engineering,
is: how can you define and measure a steady
state appropriately? For example, when an administrator
says that the WAF policy will detect cross-site scripting attacks,
it's really hard to take
metrics and measurements in terms of security.
For example, in regular chaos testing, you can take a measurement for
the example I gave earlier, handling 1 million
requests per second. You can collect bandwidth
metrics, you can collect CPU and memory metrics,
right? You can take
metrics related to requests handled per second,
send that data to monitoring
dashboards like Prometheus and Grafana, and
collect measurements there, as sketched below.
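For instance, a steady-state measurement can be a single PromQL query against Prometheus' HTTP API; the address and the metric name below are only examples of what your service might export.

```python
# Example steady-state measurement via Prometheus' HTTP API. The Prometheus
# address and the metric name are only examples; use what your service exports.
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus.internal:9090"        # example address
QUERY = 'sum(rate(http_requests_total{job="app"}[5m]))'   # example PromQL


def current_request_rate() -> float:
    params = urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(f"{PROMETHEUS_URL}/api/v1/query?{params}",
                                timeout=5) as resp:
        body = json.load(resp)
    # An instant query returns vectors of [timestamp, "value"] pairs.
    return float(body["data"]["result"][0]["value"][1])


print(f"steady state: {current_request_rate():.0f} requests/second")
```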
But it's a lot harder to apply the same concept to security, because you can't
quantify security by saying,
okay, the security we have applied is 100% secure or
90% secure, and now we want to test
a hypothesis and check whether it will reduce the security to
80% or 90%. That's not justifiable
when we speak about security. So when an administrator
says a policy will detect cross-site scripting attacks,
there are many, many ways to bypass that detection.
If you go and check the existing cross-site
scripting bypass cheat sheets, there are many.
And given the pace at which the
front-end development ecosystem is moving forward,
chances are that more bypass techniques will be
found in the future. That is just never going to stop.
So it's hard to claim that the steady
state of the application we are testing against is already
the best possible state. Similarly,
what should you test? You can test a hypothesis,
but again, it's hard to quantify whether that hypothesis
will decrease the steady state by some number
we can measure in terms of security chaos experiments,
and exactly what variables should we introduce?
Will they be enough? These are the generic problems
you will face when trying to test the effectiveness
of security controls.
Continuing on that: when chaos experiments
start to directly tackle
the existing security processes that are already in
place in your organization, the biggest question
your team will have is how to align these
experiments with those existing security processes.
Where does security chaos testing fit? Can we adapt
security chaos testing to speed up the compliance process?
When do you even start security chaos testing?
These are the questions that
aspiring or practicing security chaos engineers
will face when they try to venture further
into testing existing security processes.
This is where the core
topic of my talk comes in: how basing security
chaos engineering on security benchmarks, frameworks, and
best practices allows us to
align it with the existing
security processes in your organization.
It's a way to help drive
broader adoption of chaos engineering, because chaos engineering
is a good concept and it should be practiced, but unless it is
aligned with the existing security processes in your organization,
it won't go much further.
Okay, so what do I mean by security
benchmarks and best practices? There are security baselines
such as the CIS Benchmarks, or vendor-specific
best practices like the AWS,
Azure, and Google Cloud best
practices for identity and access management, for example.
You have compliance-specific controls such
as PCI DSS, HIPAA, or SOC 2, whatever
compliance regime your organization is following. Then you have
security frameworks such as MITRE ATT&CK,
the Cyber Kill Chain, et cetera. These are examples of
security baselines, benchmarks, and frameworks that,
one way or another, your organization is already practicing
or has implemented at least parts of. So how
can you introduce chaos testing in those parts? For example,
here is testing the effective implementation of a CIS Benchmark.
As an example, I have taken the controls
related to access control from CIS v8.
In that benchmark,
you have a list of things the benchmark tells you to do.
And if you
pass a security audit, chances are that you will have a green light on
all of these implementations. But the purpose
of security chaos testing is to come in and check: okay,
although we have implemented these benchmarks,
have we implemented them correctly? A security auditor will not go that
far to try to validate all these assumptions.
The compliance auditor will most probably just
check whether you have implemented them or not. This
is where security chaos engineering can come in and say:
okay, we are going to validate these assumptions
through experiments. For example, you might have already implemented
an access granting process and an access revoking process,
required multi-factor authentication for externally exposed
applications, remote network access, and administrative access,
and defined and maintained role-based access control, right?
Security chaos engineering comes in and says:
okay, let's try to validate the assumption that
we have implemented the requirement for multi-factor authentication for administrative
access. The test cases can be: can some other
special policy override this requirement? Or, if
we deploy a new application that doesn't require multi-factor authentication
for administrative access, will this be detected? These are
the types of test cases that can be introduced; a sketch of the second one follows.
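As a hedged sketch of that second test case, imagine the identity provider can export an application inventory; the JSON file name and its fields here are hypothetical, but the check itself is just a scan for admin-capable applications that do not enforce MFA.

```python
# Hedged sketch: flag applications that allow administrative access without
# MFA. "app_inventory.json" and its fields are hypothetical; in practice you
# would pull this inventory from your identity provider's API or export.
import json

with open("app_inventory.json") as f:
    applications = json.load(f)

violations = [
    app["name"]
    for app in applications
    if app.get("admin_access") and not app.get("mfa_required")
]

if violations:
    # The experiment found an unknown: the control is not applied everywhere.
    print("admin access without MFA:", ", ".join(violations))
else:
    print("assumption held: every admin-capable application requires MFA")
```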
And these
types of changes will happen sooner
or later,
because the idea of a static state for any organization
is never true. Changes
happen as the team grows and as the requirements grow.
There are many teams responsible for managing
infrastructure, and there will be many new teams
responsible for developing new applications,
whether customer-facing or internal.
That will affect the current state of the infrastructure,
and again, there will be changes and drift in
configurations. So these experiments come in
and try to validate the same assumptions again. It might be
that at the current point in time,
the security policies will detect a new application that
has been deployed without requiring multi-factor authentication.
But at a later stage of your organization,
as the team grows and the requirements grow,
some policies or new workflows might have already disturbed that
security policy. So chaos engineering comes in and can
help you validate those assumptions.
The second thing is that we can also use chaos
testing to test against adversarial tactics,
which helps test for breach readiness, the ability to withstand
adversarial tactics, threat containment, and
validation of the blast radius. These tactics
are well cataloged
in frameworks such as MITRE ATT&CK,
the Cyber Kill Chain, Gartner's cyberattack
model, and the NIST Cybersecurity Framework.
These are the primary ones that are popular in the security
industry. So how can we introduce chaos experiments
within these frameworks? For example, I have taken a
sample of adversarial tactics
from the Enterprise matrix of MITRE ATT&CK.
For the purposes of this
talk, I've taken four primary tactics:
initial access,
credential access, privilege escalation, and lateral movement.
These are common tactics presented in the Enterprise matrix.
We can take individual techniques.
For example, under credential access we have Modify
Authentication Process. In a typical,
security-mature organization, detections for these things would
already be in place. So as security chaos engineers,
our purpose is to validate the assumption that
our system really is ready to detect this stuff.
We come in with the question: okay, what if in certain cases we
might miss this detection?
For example, I have taken here a sample technique,
Modify Authentication Process. We can take that
as the security chaos experiment and build
our hypothesis: the SOC team should
be alerted. That's the assumption of the
security team: the SOC team will be alerted if there is a modification in the authentication
process. Now, as security chaos engineers,
we go and deliberately change the authentication process from
SAML to OAuth, disable the log server
for 20 minutes, and then resume it.
The hypothesis here is: okay,
the policy is defined to detect a change in the authentication process,
but what happens when the log server is down, or
cannot handle requests, for those 20 minutes?
So we do that as security chaos engineers.
And in the final step,
we check with the SOC team whether it was
detected and alerted on or not. If not, it may mean the process
responsible for shipping logs is missing
a retry mechanism. Maybe that was implemented way back,
maybe not, maybe it was never implemented, but that was an
unknown. We have found that if a log server
is unable to handle requests for 20 minutes, we will
miss the alerts related
to the authentication change, and we might miss the whole
alert for an adversarial tactic that
is already playing out in our network. A rough outline of this experiment is sketched below.
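Here is that outline as code, purely illustrative: every function below is a placeholder for your own IdP, log pipeline, and SIEM, and none of the calls correspond to a specific product API.

```python
# Outline of the "modify authentication process" experiment described above.
# Every function below is a placeholder to wire to your own IdP, log
# pipeline, and SIEM; none of these calls map to a specific product API.
import time


def switch_auth_method(method: str) -> None:
    print(f"TODO: change the authentication process to {method} via your IdP")


def set_log_server(enabled: bool) -> None:
    print(f"TODO: {'resume' if enabled else 'pause'} log shipping / the log server")


def siem_alert_fired(rule_name: str) -> bool:
    print(f"TODO: query the SIEM for alerts matching '{rule_name}'")
    return False  # placeholder result


# 1. Chaotic variables: change the auth process while logs cannot be shipped.
set_log_server(False)
switch_auth_method("oauth")
time.sleep(20 * 60)            # log server stays unavailable for 20 minutes
set_log_server(True)
switch_auth_method("saml")     # roll the change back

# 2. Validate the hypothesis with the SOC team: did the alert still fire?
time.sleep(5 * 60)             # give the pipeline time to catch up and retry
if siem_alert_fired("authentication-process-modified"):
    print("assumption held: log shipping retried and the alert was raised")
else:
    print("unknown found: a 20-minute log outage makes us miss this tactic")
```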
So these are examples of how chaos experiments can
be performed with respect to existing
security frameworks. But again, why should we
do that? To summarize:
first, it will help you catch misconfigurations or logical flaws that are
introduced over time. As I said, drift is bound to
happen, and changes in workflows will introduce new side
effects in your existing security policies.
Second, it also helps to automate and close the gap
between vulnerability assessments, penetration testing, and incident response
drills. The third point is that it will help to validate
the effectiveness of existing security policies and security controls.
Even if you don't find
any flaws in the assumptions, it's a
scientific way of saying: rather than
just ticking a box that we have implemented something or not, we have actually
carried out an exercise that
allows us to scientifically or mathematically justify
that it is indeed implemented correctly,
right? Then again, engaging
security chaos experiments with existing security benchmarks and frameworks
will also align security chaos testing with existing security
initiatives in your organization. That means it will be helpful for
executive buy-in, or buy-in from the security team:
okay, we should do chaos testing, and it will help
us increase our effectiveness rather than being
just another security concept.
Okay, I mentioned earlier that aligning chaos experiments
with existing security benchmarks and frameworks would also
allow us to close gaps and align with existing security
practices. Any typical organization will already
be practicing proactive security measures such as vulnerability
assessments, pen testing, and incident response drills. So how
do chaos experiments compare with these existing proactive
security practices? For example, in this scenario,
I have taken the case of a just-in-time
(JIT) access request system.
In vulnerability scanning,
you are looking for known vulnerabilities in the JIT
access granting system itself. In vulnerability research,
you are looking for and identifying previously unknown
vulnerabilities that might affect the JIT system.
In a penetration test,
you try to find a way to bypass the JIT access
granting process by exploiting a known vulnerability, by developing
a novel logical-flaw exploit, or via
social engineering. And in security
chaos testing, you test:
what if the JIT system crashed? Will a downgraded
and insecure access request system be activated and misused to
bypass the JIT policies? These are the typical differences
between each of the proactive security practices; a sketch of that JIT chaos test follows.
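A minimal sketch of that JIT chaos test might look like the following, with a hypothetical access-request endpoint and placeholder helpers for stopping and restoring the JIT approval service; the hypothesis is that the system fails closed.

```python
# Security chaos test sketch for a just-in-time (JIT) access system:
# with the approval service down, an access request should fail closed.
# The URL, payload, and the stop/start helpers are hypothetical placeholders.
import urllib.request

ACCESS_REQUEST_URL = "http://access.internal/api/request"  # hypothetical


def stop_jit_approval_service() -> None:
    print("TODO: stop the JIT approval service (container, VM, process)")


def start_jit_approval_service() -> None:
    print("TODO: restore the JIT approval service")


stop_jit_approval_service()
try:
    req = urllib.request.Request(
        ACCESS_REQUEST_URL,
        data=b'{"role": "admin", "reason": "chaos experiment"}',
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            granted = resp.status == 200   # a 2xx here means access was granted
    except OSError:
        granted = False                    # rejected or unreachable: failed closed
finally:
    start_jit_approval_service()

# Hypothesis: with the JIT system down, no downgraded path grants access.
print("unknown found: access granted without JIT" if granted else "assumption held")
```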
So security chaos engineering brings a unique
point of view to validate the assumption:
despite the fact that policies have been implemented and deployed,
will a change, a downtime, or
any effect on a dependency
lead to a downgrade of security that would let the
security control be bypassed? So, when should you
introduce security chaos testing?
Here I've taken a sample of the security and
privacy capability maturity model,
also known as SP-CMM for short.
It
shows how you can measure the
maturity of the security and privacy controls implemented
in your organization. It's just one such framework;
depending on the industry your organization is in,
you might be following another security maturity model.
Again, this is just an example. So when should you introduce
security chaos testing in your organization? Personally,
I believe that security chaos experiments work best when
they are introduced later in your security process. For example,
if you haven't implemented any cybersecurity
baselines or benchmarks, if there is no team
practicing security, there's no point in conducting
experiments and trying to find the unknowns, right? First,
the basics: you have to go and implement the basic
controls. Ensure that the basic hardening
has been done, and that the
basic benchmarks or frameworks have been followed and implemented to
tighten and enhance security. Security chaos experimentation
is a way to validate the assumptions you
have once you have implemented all those security practices.
Before that, I think it can still be helpful
to use security chaos experiments just to identify gaps
and decide where you should focus on implementing
or prioritizing security. But again,
security chaos experiments are most effective
when you are later in the security maturity model.
So that's about it for my presentation
today. Concluding everything
I've said: security chaos experiments
are a novel way to find unknowns in security.
But then again, the experiments should be aligned with existing security
processes to gain adoption; otherwise, even though it's a good concept,
it will have challenges growing beyond just
a concept. For defining steady
states and developing hypotheses, security baselines, benchmarks, and
frameworks can help connect security chaos engineering with existing
security processes. So the challenge is
to align it with the existing security process, and to address that
challenge, we can bring in
security chaos experiments to validate all the security controls
that we have already put in place in our organization.
Security chaos testing should close the
gaps not addressed by penetration tests and incident response drills.
Any mature security organization will already be
practicing these proactive security measures, including penetration
testing, incident response drills, and vulnerability scanning.
Security chaos experimentation is not about replacing them;
it's about closing the gaps that are left by these
tests, right? If you think of it that way, there's a
spot for security chaos engineering. If your team
wants to replace the existing processes, it will be hard
to change what is already in place and followed
by many, many security teams around the world.
Finally, security chaos testing is more effective as organizations
move closer to the highest level of the security maturity model.
Security chaos experiments work best
at finding the unknowns when the knowns have been implemented
correctly; without the known knowns in place,
you'll just be firing experiments all over the place without
any good results and without validating any previously
implemented security assumptions. Okay, that's it for my talk today.
I hope it was helpful for those of you who are
planning to start venturing into security chaos experiments.
I would like to thank the organizers again for giving me the
opportunity to speak at this conference.
If you have any questions, feel free to ping me; I have
provided my social media handles in
the earlier slides of this presentation. Okay, thank you so much. Have a
great day.