Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello. Good day, good morning, good afternoon, good evening, wherever you are today. Thank you so much for staying up or for coming to my talk today. I'm really excited to share some knowledge about security chaos engineering. Today I'm going to be talking on the subject of risk-driven fault injection: security chaos engineering for the fast and furious. My name is Kennedy Torkura and I am a cloud security engineer at Mattermost. I'm also a PhD student, completing my PhD at the Hasso Plattner Institute, and my PhD is on cloud security. So I'm going to be sharing part of the things that I researched and some of the concepts that I proposed and evaluated as part of my doctoral thesis.
So, let us start off with the definition of security chaos engineering. We're going to borrow the definition proposed by Aaron Rinehart, who is the creator of security chaos engineering. He defines it as the identification of security control failures through proactive experimentation to build confidence in the system's ability to defend against malicious conditions in production. There are some similarities between the definition of chaos engineering and that of security chaos engineering, but the key differences here are that we are trying to identify security control failures and to defend against malicious conditions.
So, some differences. Basically, chaos engineering tries to address availability problems, and that is done by employing resiliency patterns. Resiliency patterns, strategies like timeouts, bulkheads and circuit breakers, are used to inject failures and to identify problems that might affect the availability of services. On the other hand, security chaos engineering also addresses availability, but a slightly different kind of availability this time around: availability problems that might be caused by malicious actions, for example denial-of-service attacks. Security chaos engineering also looks at integrity and confidentiality, at whatever might impact integrity, confidentiality or availability, which are the three principles of security, usually called the CIA triad. And how is that done?
By employing the existing controls that we have been using in cybersecurity: preventive controls, for example mechanisms like firewalls; detective controls, like intrusion detection systems; and corrective controls, for example incident response systems. Security chaos engineering tries to verify that these controls are working the way they are supposed to work in an environment, and if they are not working that way, that gets identified. The big picture here is being able to detect security blind spots: spots that these systems are not able to see and not able to identify.
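To make that verification idea concrete, a minimal sketch could look like the following, assuming boto3 and sandbox AWS credentials are available; the trail name is hypothetical. It checks that a detective control, CloudTrail in this case, actually exists and is delivering logs.

```python
# Minimal sketch: verify a detective control (CloudTrail) is actually logging.
# Assumes boto3 and AWS credentials; the trail name "main-trail" is hypothetical.
import boto3

def verify_cloudtrail_logging(trail_name: str = "main-trail") -> bool:
    cloudtrail = boto3.client("cloudtrail")
    # Confirm the trail exists at all.
    trails = cloudtrail.describe_trails()["trailList"]
    if not any(t["Name"] == trail_name for t in trails):
        print(f"Detective control missing: trail {trail_name} not found")
        return False
    # Confirm the trail is currently delivering logs.
    status = cloudtrail.get_trail_status(Name=trail_name)
    if not status["IsLogging"]:
        print(f"Detective control degraded: trail {trail_name} is not logging")
        return False
    return True

if __name__ == "__main__":
    ok = verify_cloudtrail_logging()
    print("CloudTrail check passed" if ok else "CloudTrail check failed")
```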
So what's the importance of applying security chaos engineering in the current dispensation? Today we live in a cloud native world, and our systems are getting more and more complex. According to Bruce Schneier, in an essay he wrote, "A Plea for Simplicity", the core message was that complexity is the worst enemy of security. And essentially, that is what we are seeing today. Our systems are becoming more and more complex, and this makes it harder for security to even be effective, because security professionals can only defend systems that they are able to understand. The more they understand a system, the better chance they have of defending it, or of identifying when there are malicious or insecure events in that system.
Also, another problem we see is an increase in attacks against cloud infrastructure. A few years back there were already attacks: we saw hackers penetrate the Amazon Web Services account of Tesla to spawn virtual machines and mine cryptocurrency, and in doing so they were able to hide their tracks so that their infiltration was not even noticed. Later on, we also saw other kinds of attacks, for example the exploitation of S3 buckets, where attackers were able to gain unauthorized access to S3 buckets and exfiltrate very sensitive information.
However, things are actually getting worse. Attackers are becoming much more organized. In a recent report, the Cloud Native Threat Report released by the Aqua Security team, they showed that attacks against cloud native infrastructure are getting more and more sophisticated. They deployed a set of honeypots in the wild and, based on this, were able to gather attacks as they happened and analyze them. So there are more and more attacks coming up against cloud native infrastructure.
Another problem we see is new kinds of attacks, or let's say new security problems. One of the most common ones is misconfiguration. According to Gartner, from now up to four years ahead, a lot of problems are going to be caused by misconfiguration. As you know, we've heard a lot about this; I just mentioned S3 buckets, and the major reason why these buckets are being attacked is misconfiguration. And 99% of attacks against cloud infrastructure are going to be caused by user faults, mainly the inability to configure or to deploy cloud assets in a way that makes them secure. These things are caused by what I put into two main reasons: there is a knowledge gap as regards what is expected from people who are using the cloud, and there is insufficient tooling support to help them deploy this infrastructure properly.
And we can easily see this. On the screen here we have two access control policies for Amazon Web Services IAM. Right here is a policy that is quite large, and most of the time people are expected to pick up these policies and manage and edit them by themselves, by hand, doing it manually. This is a very tangible example of insufficient tooling support for security.
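As a small, hypothetical illustration of the kind of hand-edited policy this refers to, the first statement below is dangerously broad, which is a very common misconfiguration, while the second is the scoped-down version a tool or reviewer would ideally enforce; the bucket name is made up.

```python
# Hypothetical illustration of an IAM policy misconfiguration versus a
# least-privilege version. The bucket name is made up.
import json

# Overly broad: grants every S3 action on every bucket.
too_permissive = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}

# Scoped down: read-only access to a single, named bucket.
least_privilege = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::example-data-bucket",
            "arn:aws:s3:::example-data-bucket/*",
        ],
    }],
}

print(json.dumps(least_privilege, indent=2))
```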
Also, as we observe, there are a lot of new trends coming up that are aligned with the digital transformation agenda. We have DevOps, we have CI/CD pipelines, and a lot of people are shifting their workloads to the left. They want to be fast, they want to be agile, they want to make use of the new trends and new technologies and turn them into an advantage for themselves. Unfortunately, this is not easy for security to handle, because the traditional model of security is designed to take care of infrastructure that is more or less static and doesn't really change. This is how security has been for the last two or even three decades: most of our security systems are designed to protect systems that are static. So the new trends have become a problem for security; security is quite confused these days. The traditional security mechanisms are basically struggling to catch up with these recent trends. And so there is a new kid on the block, a new concept called cloud native security.
Essentially, cloud native security is about securing cloud native infrastructure, which kind of summarizes the new trends we are seeing these days, as I mentioned on the last slide. In order to define what cloud native security is, which essentially boils down to defense in depth, the Kubernetes security team issued an article, and the link is down below there. Essentially, security in a cloud native world has to be set up at every layer of the cloud native infrastructure. Starting from the inner layer, we have the code layer, which is the most familiar one to us, because we have been writing software for many years, many decades. Security has to be embedded in our code using things like static code analysis or dynamic code analysis. The next layer is the container: we have to be able to scan our containers to detect dependencies that have malicious components and things like that. The next outer layer is the cluster. Now we are talking about orchestrators like Kubernetes, and there is a whole new class of problems emanating from Kubernetes. We have to take care of things using mechanisms like network policies, or be able to detect when there are processes within the containers that are malicious or suspicious. The final layer is the cloud infrastructure, which is the very platform that all the other layers rely upon. To take care of this cloud infrastructure, we should look at things like the shared responsibility model, have a good understanding of how it works, a good understanding of our responsibilities, and understand the kind of security effort that is expected from us.
So this is a summary of what cloud native security is. But how does an attacker look at it? We've just talked about four layers of infrastructure; unfortunately, attackers still look at this as one single target. So, inasmuch as they need the skills necessary to conduct attacks, they probably need just one toolkit to successfully attack a cloud native infrastructure. And the attack surface is so wide that the possibilities are endless. As you see here, the attacker can start from virtually any part, either from the code, or from the Docker layer, or from the Kubernetes layer of the cluster, or even from the cloud layer, and literally move across the other layers. And we have seen these kinds of attacks.
What we still see, though, is that our cloud native security platforms are designed to take care of these layers one after the other. We have tooling support today for the code layer, we have tooling support for the container layer, and a lot of security systems are designed to do that these days. We have cluster security platforms and, of course, cloud security platforms. The challenge here is that most of this tooling support, these platforms, these security systems, do not talk to each other. They work independently, and there is really no cross-coordination or cross-understanding. So eventually, human operators are still expected to come into the loop and try to make sense of the output, the results, the analysis that these individual components produce. Essentially, what is missing is a unifying layer, a unifying strategy that stitches these various components together and makes sense of it all. That is where security chaos engineering comes in, and in the next slides I will try to explain how that works.
So basically, security chaos engineering, as far as I see it, is going to be a new way for us to put these various cloud native security platforms together. In this diagram I have put the major categories of cloud native security. First, we have cloud security posture management, which looks at the control plane of cloud infrastructure to detect malicious actions and misconfigurations and things like that. We have cloud workload protection platforms, which essentially look at workloads, for example on Kubernetes, doing vulnerability scanning and things like that. And we also have cloud access security brokers, another kind of security system that tries to understand the interactions between the on-premises infrastructure owned by organizations and the cloud platform, and tries to make sure that sensitive data is not handled in ways that expose it, among quite a number of other things. So essentially, security chaos engineering, as far as I see it, is going to be that unifying mechanism that brings together these various security components to make sense out of them.
So let us talk a little bit about risk-driven fault injection. Essentially, risk-driven fault injection is about employing security chaos engineering from a risk perspective. Why is that important? Firstly, we know that 100% security is a dream; there is no security system that is 100% secure. Problems emanate from various directions, from our employees, who may make mistakes, to attackers who evolve new techniques, to things like zero-day vulnerabilities that might be exploited by some attackers. I have also spoken with a couple of people who are trying to convey chaos engineering to their security teams, and what I sense is that it's a bit difficult, because security is a hard language to explain: it is hard to measure, it is largely abstract. So we can use risk as a method to communicate, and to drive chaos engineering to our security engineers and into our culture. We have various kinds of methods for looking at risk, and quantitative risk assessments are more attractive because we get to look at data-driven strategies. Risk helps us to measure security, so that we can communicate whatever strategy we are trying to propose in a clearer and more sensible way to management as well as to other teams in a company.
Now I'm going to walk you through what we refer to as the security chaos engineering feedback loop, which is a method that we think will drive the implementation of security chaos engineering much better and more constructively in an organization. It consists of five parts, and essentially it is a feedback loop: an adaptation of the MAPE-K feedback loop that has been used in autonomic computing systems. The idea here is how we can take security chaos engineering and push it towards becoming an automated system that works behind the scenes, together with other security systems, in a way that hardens security and makes it much, much better.
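As a rough illustration of that idea, here is a minimal Python skeleton of how such a loop could be wired together; it is entirely a sketch, and the stage functions are hypothetical placeholders for the five stages described next.

```python
# Minimal sketch of a MAPE-K-style security chaos engineering loop.
# The stage functions are hypothetical placeholders; a real implementation
# would plug in fault injection, monitoring and analysis tooling.
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    reports: list = field(default_factory=list)

def execute(hypothesis: str) -> dict:
    # Inject the security fault(s) that test the hypothesis.
    return {"hypothesis": hypothesis, "injected": True}

def monitor(experiment: dict) -> dict:
    # Observe logs, alerts and traces while the experiment runs.
    return {**experiment, "alerts_observed": []}

def analyze(observation: dict) -> dict:
    # Compare what was detected against what the hypothesis expected.
    return {**observation, "detected": bool(observation["alerts_observed"])}

def plan(result: dict) -> str:
    # Derive the next hypothesis (or a fix backlog) from the result.
    return "re-test after fix" if not result["detected"] else "raise intensity"

def feedback_loop(initial_hypothesis: str, kb: KnowledgeBase, iterations: int = 3):
    hypothesis = initial_hypothesis
    for _ in range(iterations):
        result = analyze(monitor(execute(hypothesis)))
        kb.reports.append(result)  # knowledge base: every report is kept
        hypothesis = plan(result)

if __name__ == "__main__":
    kb = KnowledgeBase()
    feedback_loop("detective controls flag a newly created IAM user", kb)
    print(kb.reports)
```

The point of the sketch is only the shape of the loop: every iteration ends by feeding a report into a shared knowledge base and planning the next experiment.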
So the first part of this loop is execute. Here we have to talk about the aim of the experiment. If you want to conduct a security chaos engineering experiment, you want to clearly define the aim: what do you want to achieve? Based on that, you are going to craft a suitable hypothesis that you will be proving, and then you look at and define the scope and the intensity of the experiments you want to carry out. It is really important to carry out some sort of sanity checks. You are going to be coordinating with the responsible teams; you want to convey to them clearly what you aim to achieve. These are administrative aspects, and of course social aspects, and they are very important. There is a human side that is largely overlooked: you want to communicate with people, let them understand your mindset and aim, and get their buy-in. Also very important is recoverability. What I mean by recoverability is that if things go wrong, you want to be able to roll back to a state that is good enough. There are various ways of putting this in place. There are infrastructure-as-code strategies, where the infrastructure is already in Git using things like Terraform or AWS CloudFormation, and there is also state management. So if things go wrong or if you break things, you can recover and not bring too many problems to the system.
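One simple way to picture recoverability, aside from infrastructure as code, is to snapshot the state you are about to touch and restore it afterwards. A small sketch of that, assuming boto3 and a made-up bucket name:

```python
# Sketch: snapshot and restore an S3 bucket policy around an experiment,
# so a broken or malicious policy can be rolled back. Assumes boto3 and
# AWS credentials; the bucket name is hypothetical.
from typing import Optional
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def snapshot_bucket_policy(bucket: str) -> Optional[str]:
    """Return the current bucket policy document, or None if there is none."""
    try:
        return s3.get_bucket_policy(Bucket=bucket)["Policy"]
    except ClientError:
        return None

def restore_bucket_policy(bucket: str, policy: Optional[str]) -> None:
    """Put back the saved policy, or remove any policy the experiment added."""
    if policy is None:
        s3.delete_bucket_policy(Bucket=bucket)
    else:
        s3.put_bucket_policy(Bucket=bucket, Policy=policy)

if __name__ == "__main__":
    bucket = "example-experiment-bucket"  # hypothetical
    saved = snapshot_bucket_policy(bucket)
    try:
        pass  # ... run the fault injection experiment here ...
    finally:
        restore_bucket_policy(bucket, saved)
```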
So, I talked about having a scope for what you want to do. We created a tool called CloudStrike, and CloudStrike has different modes of operation. If you are going to inject security faults, they can have different magnitudes of intensity: 30%, 60%, 90%. You have to figure out how big the impact is going to be at different degrees, and decide which degree to use based on the maturity of the team or of the infrastructure.
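One simple way to interpret such an intensity knob, purely as a hypothetical sketch and not CloudStrike's actual implementation, is to sample that fraction of the candidate resources for fault injection:

```python
# Hypothetical sketch: pick a subset of target resources according to an
# experiment intensity (e.g. 0.3, 0.6, 0.9). Not CloudStrike's actual code.
import random

def select_targets(resources: list, intensity: float, seed: int = 42) -> list:
    """Return roughly intensity * len(resources) targets, at least one."""
    if not 0 < intensity <= 1:
        raise ValueError("intensity must be in (0, 1]")
    count = max(1, round(len(resources) * intensity))
    return random.Random(seed).sample(resources, count)

if __name__ == "__main__":
    buckets = ["bucket-a", "bucket-b", "bucket-c", "bucket-d", "bucket-e"]
    print(select_targets(buckets, intensity=0.6))  # roughly 60% of the buckets
```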
You can also have an attack scenario. We actually had this, and I am going to give an example on the next slide, where we could chain various attacks together to form a scenario. This is to simulate how attackers move in real life, because attackers launch a series of attacks to achieve their objective. So here we have a table of the different attacks we use in CloudStrike: the cloud resource we want to attack, the action we want to take, and a brief description. In the first line we have the user resource, and the action is to create a new random user. Like I said, you could have a scenario where you link three, four or more of these individual actions, and that forms an attack scenario.
And here is an example of an experiment we carried out. We start by creating a user called Bob. We get the buckets and select a random bucket from those we got from Amazon Web Services, we create a malicious policy, and we assign Bob access to the bucket using that policy. In this case, we want to see whether our security system, whatever security system we are using in the cloud, maybe a cloud security posture management tool or something as simple as CloudTrail, is able to detect these activities. When you created a new user, was it flagged? Did the cloud security mechanisms detect it? How long did it take for a notification to reach you, for example? Similarly, when you create a malicious policy, are you able to detect that a malicious policy was created?
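To give a feel for what such a chained scenario looks like in practice, here is a condensed sketch, assuming boto3 and a sandbox account; it is not taken verbatim from CloudStrike, and the user and policy names are placeholders. It creates the user, picks a random bucket and attaches an intentionally over-permissive policy.

```python
# Condensed sketch of the chained scenario described above (not CloudStrike's
# actual code): create a user, pick a random bucket, create an over-permissive
# policy and attach it. Assumes boto3 and AWS credentials; run only in a
# sandbox account, and roll everything back afterwards.
import json
import random
import boto3

iam = boto3.client("iam")
s3 = boto3.client("s3")

def inject_malicious_access(user_name: str = "bob") -> str:
    # Step 1: create the user.
    iam.create_user(UserName=user_name)

    # Step 2: select a random bucket as the target.
    buckets = [b["Name"] for b in s3.list_buckets()["Buckets"]]
    target = random.choice(buckets)

    # Step 3: create an intentionally over-permissive policy for that bucket.
    policy_doc = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{target}", f"arn:aws:s3:::{target}/*"],
        }],
    }
    policy_arn = iam.create_policy(
        PolicyName=f"sce-malicious-{user_name}",
        PolicyDocument=json.dumps(policy_doc),
    )["Policy"]["Arn"]

    # Step 4: attach the policy to the user.
    iam.attach_user_policy(UserName=user_name, PolicyArn=policy_arn)
    return policy_arn
```

The question the experiment asks is not whether these calls succeed, but whether your detective controls notice them, and how quickly.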
So the second stage of the feedback loop is monitor. Once you start injecting failures, you want to be able to monitor the progress. It is pretty important that you have either a logging system where you can see the logs in real time, or an observability system, and there are a lot of them coming up these days, or even tracing. Whatever you have, it should give you clear visibility into the progress of the attack, because essentially you want to be able to stop the experiment if things begin to go too badly, and you want to be able to recover. As I said, you want recoverability, which makes it possible for you to roll back to a good state.
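As an example of the kind of monitoring that helps here, the following sketch, my own and assuming boto3, polls CloudTrail for the injected CreateUser event and reports how long it took to show up; the event name and the generous timeout are assumptions, since CloudTrail delivery itself can lag by several minutes.

```python
# Sketch: watch CloudTrail for the injected CreateUser event and measure how
# long it takes to become visible. Assumes boto3 and AWS credentials.
import time
from datetime import datetime, timedelta, timezone
from typing import Optional
import boto3

def wait_for_event(event_name: str = "CreateUser", timeout_s: int = 900) -> Optional[float]:
    cloudtrail = boto3.client("cloudtrail")
    start = datetime.now(timezone.utc)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        events = cloudtrail.lookup_events(
            LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": event_name}],
            StartTime=start - timedelta(minutes=1),
        )["Events"]
        if events:
            return (datetime.now(timezone.utc) - start).total_seconds()
        time.sleep(30)  # poll every 30 seconds
    return None  # not observed within the timeout: a potential blind spot

if __name__ == "__main__":
    latency = wait_for_event()
    print(f"CreateUser observed after {latency:.0f}s" if latency is not None
          else "CreateUser never observed")
```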
The third stage is analyze. Even if the experiment failed, if you had to stop it, it is critically important to understand why it failed. You get some lessons from what happened: what went wrong, why did the experiment fail? Then you can have another trial. And if you succeed, you want to derive answers to the questions you posed at the beginning, in the planning stage. Essentially, from a security perspective, what we are talking about is something like this example of the OWASP risk rating methodology. Since we are proposing a risk-based methodology, it is pretty important for a good analysis to understand exactly the results you got from the experiment. You want to understand the kind of threat agents that might exploit this attack. You want to look at the attack vectors, the vehicles they are going to use to conduct such an attack. You want to understand exactly the problem, the vulnerability that was detected, because eventually you are going to have to fix it. You want to understand the security controls that were compromised, and other important things: what is the technical impact of that attack and, of course, the business impact. We think that if you are able to produce this kind of clear analysis of experimental results, it becomes much easier to convey them and get buy-in from management, for example.
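As a worked example of how that methodology turns an experiment result into a number, here is a small sketch following the OWASP risk rating scheme; the factor scores are made-up values for the Bob scenario above.

```python
# Sketch of the OWASP Risk Rating arithmetic: likelihood and impact are each
# the average of factors scored 0-9, then bucketed into LOW/MEDIUM/HIGH.
# The factor scores below are made-up values for the example scenario.

def level(score: float) -> str:
    return "LOW" if score < 3 else "MEDIUM" if score < 6 else "HIGH"

# Likelihood factors: threat agent factors plus vulnerability factors.
likelihood_factors = {"skill_required": 5, "motive": 6, "opportunity": 7,
                      "ease_of_discovery": 6, "ease_of_exploit": 7, "awareness": 4}
# Impact factors: technical impact plus business impact.
impact_factors = {"loss_of_confidentiality": 7, "loss_of_integrity": 5,
                  "financial_damage": 6, "reputation_damage": 7}

likelihood = sum(likelihood_factors.values()) / len(likelihood_factors)
impact = sum(impact_factors.values()) / len(impact_factors)

print(f"Likelihood {likelihood:.1f} ({level(likelihood)}), "
      f"impact {impact:.1f} ({level(impact)}); overall severity is then read "
      f"off the OWASP likelihood x impact matrix")
```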
So the fourth stage of the security chaos engineering feedback loop is plan. You want to plan for the next iteration of your experiments, because the idea is to have a continuous system. In this case you are going to create things like backlogs for vulnerability management, for whatever teams are responsible for fixing the security problems that were detected. This might mean reaching out to the security operations teams and the development teams, and also doing threat modeling: I think the knowledge gained from security chaos engineering can be used for threat modeling, or for things like security awareness training for teams. You must understand that what you have at the end of a security chaos engineering experiment is an understanding of the problems in the system, meaning that you have knowledge about what might happen in the future. That is different from what you get from traditional systems, which try to explain what has happened; here you are trying to explain what might happen in the future. So it is really, really proactive. You want to fix, as I said, the issues you saw, and also construct hypotheses for the next iteration of experiments.
So this is the last part, which is a very critical part when we talk about automation: we want to have a knowledge base. For every result you get from the security chaos engineering experiments, imagine you are able to construct reports. For us, in our tool CloudStrike, every security chaos engineering experiment had a report, and that report was put into a sort of knowledge base, which might be just some database where you put in your reports. This gives you access to greater possibilities. For example, you can create CloudWatch rules to trigger alarms for specific events. You could create rules for your cloud security posture management system. You could create rules for the IAM Access Analyzer. You could do a lot of things.
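For example, a finding like "policy changes were not flagged" could feed straight back into a detection rule. The sketch below, assuming boto3 and an existing SNS topic whose ARN is only a placeholder, creates a CloudWatch Events / EventBridge rule that alerts on IAM policy attachments:

```python
# Sketch: turn an experiment finding into a detection rule. Creates a
# CloudWatch Events / EventBridge rule that fires on IAM policy attachments
# and sends them to an SNS topic. Assumes boto3; the topic ARN is a placeholder.
import json
import boto3

events = boto3.client("events")

RULE_NAME = "sce-iam-policy-attachment"
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:security-alerts"  # placeholder

# Match the API calls the experiment showed were going unnoticed.
pattern = {
    "source": ["aws.iam"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {"eventName": ["AttachUserPolicy", "PutUserPolicy", "CreatePolicy"]},
}

events.put_rule(Name=RULE_NAME, EventPattern=json.dumps(pattern), State="ENABLED")
events.put_targets(Rule=RULE_NAME, Targets=[{"Id": "alert", "Arn": SNS_TOPIC_ARN}])
```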
What we also see nowadays is that the concept of the SIEM is getting obsolete, because SIEM systems are beginning to struggle to manage data and to analyze security information properly. And we see that it is possible to feed security chaos engineering into the so-called security data lake, which is more and more becoming the preferred way to put security information together so that you can get some sort of intelligence from it. So the reports you get from security chaos engineering can be put into a security data lake, where you also have other sources of information: threat intelligence feeds that tell you about things like malicious IP addresses, the ETL pipelines, the log analytics systems; they all push the output of their analysis to this central data lake. We see that security chaos engineering results can also eventually live in such a security data lake and give users much better and much more contextual information to use to harden their security systems.
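In the simplest case, putting a report into the data lake can be as small as writing the report JSON to object storage. A rough sketch, with a made-up bucket name and key layout:

```python
# Sketch: push an experiment report into an S3-backed security data lake.
# Assumes boto3; the bucket name and key layout are made up.
import json
from datetime import datetime, timezone
import boto3

def publish_report(report: dict, bucket: str = "example-security-data-lake") -> str:
    key = f"sce-reports/{datetime.now(timezone.utc):%Y/%m/%d}/{report['experiment']}.json"
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=json.dumps(report).encode())
    return key

if __name__ == "__main__":
    print(publish_report({"experiment": "bob-malicious-policy", "detected": False}))
```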
We would also like to point you to some of the papers we wrote. In the first set, two papers were written where we used security chaos engineering methods, firstly to evaluate a cloud security posture management system, to see if it functions as expected. The other paper focused on incident response, where we were trying to see how an incident response system works: whether it works as fast as it should, or whether it is slow. We think these are also very good use cases. There are also two papers we wrote that focus squarely on security chaos engineering. We took a deep dive into this subject and tried to understand, from an academic as well as a practical perspective, what the connections are with the existing literature related to the field of fault injection. And we saw that there is quite some existing work on security fault injection, mostly under the canopy of dependability. We think it is quite exciting to explore these related works to get a better understanding of security chaos engineering.
And lastly, I want to point out the Security Chaos Engineering book that was released last year. We had a very good opportunity to contribute to this book, and if you are really interested in understanding security chaos engineering, I would really recommend this book to you. You can also have a look at our publications, and you will get a much better understanding of this field. So this brings me to the end of my talk. Thank you so much for staying along, and feel free to shoot me a mail or to reach out to me in case you want to learn more about what I am doing. Thank you very much.