Transcript
Hi everybody. Thank you very much for being here. The title
of my talk is Security Chaos Engineering: some considerations
for game days when the experiments are cyberattacks.
Nice to meet you. I am Yury Nino. I work as a site reliability
engineer at ADL Digital Labs, a company in
Colombia that provides technology and innovation services for a banking group.
Also, I am a chaos engineering advocate in the Spanish-speaking community.
Lastly, I am from Garagoa. Garagoa is a beautiful town
located in Colombia.
Although I've never been to Japan, when I think of
resilience, the Japanese immediately come to my mind. Regularly
pummeled by natural and man-made disasters, the
Japanese have always recovered from adversity. In the
last 100 years, Japan has faced tragedies such as
the Great Kanto earthquake in 1923, two nuclear
bombs over Hiroshima and Nagasaki in 1945,
and a tsunami in Tohoku in 2011.
These adversities have produced a culture based on resilience.
People say that it is a consequence of a disciplined,
positive, polite and resilient culture, a culture inspired,
for example, by the samurai. The samurai cultivated
the bushido code of martial virtues, indifference to
pain, and loyalty, engaging in many battles during
the twelfth century. Here are the most famous
samurai in history. My apologies for my pronunciation
here. Minamoto Miyamoto Tojo
Tomi Honda Antakera. If
I have to choose one, I would say my favorite is Miyamoto
Musashi. His instinct allowed him to improvise
with absolute efficiency in any situation of peril,
but not without first reflecting on all the variables he
faced. For Musashi, no weapon is better
than any other. It is important for the warrior to evaluate
which is the most appropriate according to the circumstances.
Towards the end of his life, he dedicated
his time to studying and teaching. He wrote The Book of Five
Rings, an essential resource for martial arts.
You are probably asking why I am speaking about the Japanese and
samurai at a technical conference. The reason is that
if we want to face cyber attacks,
we should develop skills such as discipline,
precision, attention to detail and resilience, like the Japanese.
Inspired by Miyamoto's story, I have designed this
agenda to address security chaos engineering. We are going to talk
about some famous cyber attacks and how to use incident
management theories to mitigate them. I am going to introduce
a novel discipline known as chaos engineering, and security chaos engineering.
With this context, I am going to talk about security chaos game days.
That is the main topic. I am going to present a framework that we developed
to design and practice security chaos game days.
And finally, I am going to share some
learnings and challenges.
Recently, I read on Twitter: cyber war is
everywhere, in the media, in the military, among politicians
and in academia. Although, and this is a
personal opinion, I think it is not a cyber war.
Cyber attacks could be used in a war; that is a fact.
But I am going to show how you can use chaos engineering
to mitigate the risk and train engineering teams
to be prepared for a disaster like this.
Cyberattacks can be defined as unanticipated and
catastrophic incidents. They happen in production and can take
the system down. Cyberattacks are hard, or
most likely impossible, to predict, have a severe
impact on the availability of a software system,
and may generate multiple hours of downtime.
Some of the famous cyberattacks include the NotPetya
cyberattack. In 2018, this cyberattack affected
all business units at Maersk Line. Maersk is a famous
shipping company. Then there is the Equifax data breach.
This happened between May and July 2017
at the American credit bureau Equifax.
In this cyberattack, the credit data of
millions of American, Canadian and British citizens
was compromised. And recently, there was the
cyberattack on Twitter. At this moment, Twitter believes that
attackers used a social engineering
scheme to manipulate a number of employees and
use their credentials to access Twitter's internal systems.
As you can see, cyberattacks are a reality and
they are hard to predict. However,
we can prepare to respond to these issues before
they impact our systems, and in this sense,
security incident management could be
useful here.
An incident in the context of information technology is
an event that is not part of normal
operations. Incident management is the practice
of recording, triaging, tracking and assigning
business value to problems that impact our critical systems.
SEV is a term used to refer
to an incident, and it is derived from severity.
Some examples of this kind of incident
include availability drops, product issues,
broken features, data losses,
and security risks. Here are some resources from
Gremlin. Gremlin is a company specializing in SRE and
chaos engineering, and here
are more resources about SEVs. Gremlin has
provided a complete guide to analyze the
impact of an incident on our SLAs and
SLOs. Here, it's very important to consider the level
associated with the incident. Finally, they explain
a formula to calculate this impact.
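The formula from the guide is not reproduced in the talk. As a stand-in only, the sketch below uses the generic error-budget arithmetic commonly applied to availability SLOs: how many minutes of downtime a target tolerates in a window, and what fraction of that budget one incident consumes. The function names, the 30-day window and the example numbers are assumptions for illustration, not taken from the guide.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO tolerates over the window.

    slo_target is a fraction such as 0.999 (assumed example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_consumed(incident_minutes: float, slo_target: float,
                    window_days: int = 30) -> float:
    """Fraction of the error budget used by one incident (may exceed 1.0)."""
    return incident_minutes / error_budget_minutes(slo_target, window_days)


if __name__ == "__main__":
    # Hypothetical incident: 2 hours of downtime against a 99.9% monthly SLO.
    print(f"budget: {error_budget_minutes(0.999):.1f} minutes")
    print(f"consumed: {budget_consumed(120, 0.999):.2f}x of the budget")
```

A value above 1.0 means a single incident has already used more downtime than the SLO allows for the whole window, which is the situation a multi-hour cyberattack tends to create.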
Since security teams focus on
confidentiality and reliability,
I ask why the word security is missing
in this literature. According to
Google, both SRE and security have
strong dependencies on classic software engineering
teams. So it is really strange that the
word security is missing here.
So, for example, take this service
level agreement document.
It is from a company for which I worked in
the past. I remember that we included a disclaimer
for when the incident was related to security: no matter
whether we were talking about incident levels or environments,
we couldn't commit to times or
solutions when we were speaking about
cyber attacks. The reason was described by Laura
Nolan in her USENIX presentations: cyber attacks
are black swans, so they
are hard, or most likely impossible, to
predict, and they have a severe impact
on the availability of a software system.
As I mentioned,
according to the third book about SRE and
security from Google, security and reliability
are missed in the customer's operations. The reason
is that if the system is working well, the customers
don't notice them. However, I believe that
security and reliability should be the top priorities
for any organization because they have a lot
in common, such as invisibility,
assessment, simplicity, evolution,
resilience, investigations and recovery. These are
explained in chapter one of this book.
So in this sense, I asked Kolton Andrus
about this in a public question and answer
session, and my two questions were related
to this. My first question was:
is there a list of common attacks when you are
considering experimenting with security on a system?
And my second question was: should we have
special considerations when the attacks are involved
with security instead of infrastructure, for example?
He answered that reliability is a
core pillar of security testing for an offensive
security tester. Although penetration testing and chaos
experiments share some parallels,
they have different goals. However, chaos engineering
is focused on making systems more reliable
and secure in any situation. As you can see, reliability,
security and our
next topic, chaos engineering, are related
topics. So for me,
it is about culture. We should build
a culture based on security and reliability, and in
this sense we should train our security teams based
on this culture.
According to the last book related to chaos engineering, published in
April, and a culture based on reliability
and, for me, on security, our incident response
teams should have these roles:
a designer or facilitator,
who is the person leading the discussion; a commander,
who is the person executing the commands;
a scribe, who
takes notes in a communication tool such as Slack
on what is occurring in the room; an observer,
who looks at and shares relevant
graphs with the rest of the group; and finally
the correspondent, who keeps an eye on Slack,
for example.
Related to security, we have some exercises that can
be useful for training our security teams, although, considering
the answer from Kolton Andrus, they are different from
chaos experiments. Red team exercises originated
with the US armed forces. In this
case, it is an adversarial approach that imitates
the behaviors and techniques of attackers in the most realistic
way possible. Two common examples include ethical hacking
and penetration testing. On the other side,
blue teams are the defensive counterparts to the red
teams in these exercises.
Purple team exercises are an evolution
of red team exercises, since
they deliver a more cohesive experience between
the offensive and defensive teams. The goal
of this exercise is the collaboration of
offensive and defensive tactics to
improve the effectiveness of both
groups in the event of an attempted compromise.
The intention is to increase transparency
and enable learning.
However, our point here is that
penetration testing and red
team and blue team exercises
are not enough for mitigating attacks such as NotPetya
or the recent attack on Twitter. We need a new
approach, one that keeps pace
with the evolving world of attacks and, of course,
software engineering. So it
is the moment to introduce a novel technique that can be useful for
solving these challenges: chaos engineering.
Chaos engineering is the discipline of experimenting with
failures in production, or on production systems, in order to reveal
their weaknesses and to build confidence in their resilience capability.
This definition was taken from the
Principles of Chaos site, which contains a manifesto for
chaos engineering. I have highlighted
experiments, production, reveal, build,
confidence and resilience because these words are really important
in this definition. Here there
is a list of the common attacks practiced by engineering
teams using chaos engineering. We have a
list of technical issues, and some examples
include dependency failures, region and zone failures,
provider failures, for example cloud provider failures,
network upgrades and failures
in a rack, for example. And the
most important point is that there are cultural issues
related to this that allow
these technical issues to happen. Some examples
include lack of chaos engineering, lack
of knowledge sharing, or lack of on-call
training. That is a practice described
in the blue books from Google, for example.
So if you notice, in the previous slides,
the word security is missing again.
Why? Here is a friendly reminder:
even if security teams focus on confidentiality
and reliability, when the issue is a cyber attack,
we don't commit to this. We don't commit to
times or solutions or
SLAs, SLOs or SLIs.
So what is security chaos engineering? Security chaos engineering
is the identification of security control failures through proactive
experimentation to build confidence in the system's ability
to defend against malicious conditions in production.
I highlighted six words
that are valuable in this definition, provided by Aaron in
the last book about chaos engineering, published in April
of this year: security failures,
experimentation, confidence, defense and
production. They are super important because
this discipline is based on the scientific method, and
confidence and defense
are words related to resilience and reliability.
And of course, the experiments should be run
in production.
Security chaos engineering
addresses a number of gaps in contemporary
security methodologies such as red and purple
team exercises. It is really
important to be clear that it is not the intention to
overlook the value of red and purple team exercises or
other security testing methods. These techniques remain valuable
but differ in terms of goals and techniques,
as Kolton Andrus mentioned in his answer.
Combined with security chaos engineering, they provide
more objective and proactive feedback
mechanisms to prepare a system for adverse
events than when implemented alone.
And before passing to chaos game days,
let me remind you of this phrase from Werner
Vogels: everything fails all the time. So this
is the moment to introduce chaos game days.
Chaos game days are
based on game days. Here is a definition from AWS
for game days: they are an
interactive, team-based learning exercise designed
to give players a chance to put their skills to
the test in a real-world, gamified, risk-free
environment. As
a form of game day, we have chaos game days.
In this case, it is a practice event that can take
a whole day, but it usually requires only a few hours.
The goal of a game day or a chaos game day
is to practice how your team or your supporting
systems deal with real-world
turbulent conditions. The difference between them
is the technology and the objective,
because in the case of AWS,
we are experimenting with AWS
technologies. In the case of chaos game days,
we are experimenting with our system, with our technologies,
and trying to mitigate real
turbulent conditions, as Russ Miles
mentioned in his book. In
the framework provided by Russ Miles in
that book, Learning Chaos Engineering,
there are three phases: before,
during and after. During the before
stage we should focus on these steps: pick a hypothesis,
pick a style, decide who, decide where,
decide when, document and get approval. That is really important.
And this phase is very expensive because
we have to plan a lot
of things. The second phase is during.
In this phase, the idea is to run the exercises and
validate or refute the hypothesis
using observability tools and our skills for solving issues.
In this phase, the engineers, for example, detect the
situation, communicate, visit the dashboards,
analyze data and propose solutions.
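The checklist for the before phase (hypothesis, style, who, where, when, documentation, approval) can be sketched as a small record that reports what is still missing before the game day can run. This is only an illustration under assumptions: the class name, field names and example values are hypothetical and do not come from the book or from the speaker's framework.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GameDayPlan:
    """Hypothetical record of the 'before' phase of a chaos game day."""
    hypothesis: str = ""
    style: str = ""
    participants: List[str] = field(default_factory=list)  # decide who
    where: str = ""                                         # decide where
    when: str = ""                                          # decide when
    documented: bool = False
    approved: bool = False

    def missing_items(self) -> List[str]:
        """Names of the checklist steps that are not completed yet."""
        checks = {
            "hypothesis": bool(self.hypothesis),
            "style": bool(self.style),
            "participants": bool(self.participants),
            "where": bool(self.where),
            "when": bool(self.when),
            "documented": self.documented,
            "approved": self.approved,
        }
        return [name for name, done in checks.items() if not done]

    def ready(self) -> bool:
        """True only when every step of the checklist is done."""
        return not self.missing_items()


if __name__ == "__main__":
    plan = GameDayPlan(hypothesis="Losing one dependency does not break login",
                       style="adversarial teams")
    print(plan.ready())          # False: planning is not finished
    print(plan.missing_items())  # the steps still to be decided
```

Keeping approval as an explicit field reflects the point made in the talk that getting approval is a required step, not an afterthought.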
In this case, we are trying to apply this
framework with these activities. Finally,
the last part is for writing the
postmortem with these parts:
what happened, impact, duration,
resolution, resolution time,
timeline and action items.
There are several templates for
this in the literature.
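The postmortem parts listed above can be sketched as a plain-text template. The structure below is a minimal illustration, assuming one field per part named in the talk; the class name, field names and rendering format are hypothetical and do not follow any specific published template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Postmortem:
    """Hypothetical postmortem record with the parts named in the talk."""
    what_happened: str
    impact: str
    duration: str
    resolution: str
    resolution_time: str
    timeline: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the postmortem as plain text, one part per line."""
        lines = [
            f"What happened: {self.what_happened}",
            f"Impact: {self.impact}",
            f"Duration: {self.duration}",
            f"Resolution: {self.resolution}",
            f"Resolution time: {self.resolution_time}",
            "Timeline:",
        ]
        lines += [f"  - {entry}" for entry in self.timeline]
        lines.append("Action items:")
        lines += [f"  - {item}" for item in self.action_items]
        return "\n".join(lines)


if __name__ == "__main__":
    # All values here are invented for the example.
    report = Postmortem(
        what_happened="Event logging was disabled during the game day",
        impact="No audit trail for the affected service",
        duration="40 minutes",
        resolution="Logging re-enabled and alert added",
        resolution_time="25 minutes after detection",
        timeline=["10:00 experiment started", "10:15 gap detected"],
        action_items=["Alert when logging is disabled"],
    )
    print(report.render())
```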
Before passing to our framework
and our focus on security, I would like to share
this phrase from Vincom:
human factors in cybersecurity are perhaps the
biggest challenge when building an effective threat prevention
strategy.
So consider the previous framework provided by Russ Miles
and the context. If you remember, the word security is
missing in the literature, in the list of attacks and
in the exercises. I would like to share a framework that we
are using to practice security chaos engineering in our
study group at ADL, a study group related
to chaos engineering.
In our framework we consider the
three phases that Russ Miles describes
in his book, but we are including a new
stage for evolution.
In this stage, we dedicate time after each
game day to improve the vulnerability database,
refine the process, adjust metrics, validate the
chaos maturity position after the exercise and
adapt the next game day, because it
is really important to use the feedback
generated in a previous chaos game day,
in this case a security chaos game day, in order to
improve the next one.
Also, we have to consider some things in the phases provided
by Russ Miles. So in addition to
including a new phase or stage,
we are providing some considerations for the
three classical stages. In this case,
our recommendation, or better, our consideration
for the before phase is this: when you are picking the hypothesis, it's really important
to understand the adversary, because the
motivations, profiles and methods are super important
here. Remember, human factors are a
main consideration when you are trying to design
experiments related to security. The
second one is to decide or pick
the style. In this case,
if you are going to choose a style, choose one
with adversaries. It is preferred over the classical mode
in which we have a master of disaster who attacks the system.
The idea in this case is to take inspiration from red and
blue team exercises, but using different
teams attacking each other. Finally,
reconsider the roles. If you remember, we reviewed
the roles described in
the last chaos engineering book, published
in April. But consider,
for example, whether you need a consultant or an expert
with knowledge of the latest attacks
used by attackers in the market. These
are our considerations for this phase. If
you want to practice an exercise,
this book, Building Secure and Reliable Systems, that
we used in a previous slide, provides some examples
of attacks in order to
practice with security in a chaos experiment. For example, in this case they
are describing an attack in which the
attacker uses a search engine to find the email addresses
of employees at a target organization.
The attacker sends phishing emails to the employees, and then
the attacker remotely logs in or gets remote access
to the system using these credentials.
So in this case, during
the exercise, some examples that we are trying to
practice in our study group are these: introduce latency
on security controls, for example, which is a classic
example for these experiments; drop a
folder like a script would in production; software
secret clear text disclosure; permission collision in
a shared IAM policy;
disable service event logging, which is really
critical; API gateway shutdown; access
related to S3 buckets, for example, if you are
using a technology such as AWS;
or disable multifactor authentication.
So here is an example of a thought experiment
using security chaos engineering. In this case, our hypothesis
was: after the owner of the root account in AWS
leaves the company, we can still use our cloud in a normal
way. We were thinking about this as
an experiment, and the result was hypothesis disproved,
because in this experiment the access to AWS was connected
with Active Directory. So when the employee
left the company, his account was dropped and we lost the access
to AWS. That is really critical.
In this case we were only
practicing a mental exercise. But as you can see,
we had the opportunity to identify vulnerabilities
and controls we should review
in order to guarantee
that our system is secure in production. As a side
effect of this experiment, thinking about this scenario
allows us to consider other applications connected
to Active Directory.
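The thought experiment above can be replayed as a toy simulation: state the steady state, inject the event (the root account owner leaves), and check whether the hypothesis still holds. Everything here is a hypothetical model for illustration. The classes only imitate the dependency described in the talk, and no real AWS or Active Directory API is called.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Directory:
    """Toy stand-in for a corporate directory such as Active Directory."""
    active_users: Set[str] = field(default_factory=set)

    def offboard(self, user: str) -> None:
        """Drop the account of an employee who leaves the company."""
        self.active_users.discard(user)


@dataclass
class CloudAccount:
    """Toy cloud account whose root login depends on the directory."""
    root_owner: str
    directory: Directory

    def root_access_available(self) -> bool:
        # Access works only while the owner still exists in the directory.
        return self.root_owner in self.directory.active_users


def run_experiment(account: CloudAccount) -> str:
    """Hypothesis: the cloud stays usable after the root owner leaves."""
    # Steady state: root access works before the event is injected.
    if not account.root_access_available():
        raise RuntimeError("steady state not met, aborting experiment")
    # Injected event: the root owner is offboarded.
    account.directory.offboard(account.root_owner)
    if account.root_access_available():
        return "hypothesis holds"
    return "hypothesis disproved"


if __name__ == "__main__":
    directory = Directory(active_users={"alice", "bob"})
    account = CloudAccount(root_owner="alice", directory=directory)
    print(run_experiment(account))
```

In this model the result is "hypothesis disproved", matching the outcome described in the talk: the single coupling between the root account and one employee's directory entry is what the experiment surfaces.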
Finally,
in the third phase, it is really important to consider
that a security postmortem should
be different from a classic postmortem,
because it should cover the technology issues
that the attacker exploited and also recognize opportunities
to improve incident handling. That is the reason for including
a fourth stage. For example, document the time frames and effort
associated with these action items and decide which
action items to take on.
So finally, in this last phase
that our framework introduces, I would like to focus on
continuous verification, because remember, it is a continuous
process in which we should use the previous feedback
in order to improve the next game day. According to
Verica, a company specializing in
security chaos engineering, continuous verification encourages both
of these requirements in a way that proactively educates
engineers about the systems they operate, and it is emerging
as a crucial practice for navigating complex software systems.
One more definition provided by them:
continuous verification is a game changer for complex
software system management and
complex attacks such as security attacks. In the future,
it will fundamentally change the scale and types of systems
that we even consider building.
Lastly, I would like to share some learnings and challenges
from trying to practice security chaos game days,
because remember, it is a fact that the future can only
be improved if something is learned from the past.
This idea was provided by David Woods
in the resilience engineering book.
Our learnings include these. The adoption of security
chaos engineering faces challenges. In
this adoption, human factors and considering the view
of an attacker are very, very important in order to
provide the proper experiments and the proper hypothesis
in an experiment. Reducing potential
damage and blast radius is super important in security.
It is important in a common
chaos experiment, but in security it is important too.
Communication, or better, skills related to communication and
observability, can guarantee the success of an experiment.
So it is really important to
have the proper skills for communication and observability
and to use tools such as Datadog
or New Relic for this. Requirements can
collide with experimentation in security, so it is
really important to check that our requirements
are aligned with
the experimentation in this case. Finally, you don't need
to be a security expert in
order to start with security chaos engineering. My favorite
phrase here is: just start with the
experiments, and along the path you will
probably have the opportunity to learn the things that you need
for practicing this discipline.
These
are the challenges that we identified
after trying to apply
this framework in our study
group. The adoption of security
chaos engineering principles across organizations remains
an open challenge. It is our mission to
provide more research, more techniques and other
frameworks that can be useful
for trying to mitigate cyberattacks.
As I mentioned, security is missing in several resources
for chaos engineering, and the
chaos maturity model is not an exception. Security
may be included in this chaos maturity model, since combining
the chaos maturity model and security chaos game days helps
newcomers to start their chaos engineering
efforts and allows them to build resilience
and security. And finally, it is a challenge
for us. It is an exciting time to be working
in this space. We
are at a moment to try the experiments.
Now I would like to share a
phrase from Aaron, who is an expert
in security chaos engineering. He says that humans
operate differently when they expect
things to fail. So it's really important to consider
the human factor when you are trying to experiment with
security in your systems.
Finally, here are
some books that can be useful if you want to start with security chaos engineering.
They provide the fundamentals of chaos engineering,
chaos experiments and the scientific method. For example,
this book by Russ Miles is a
really good source if you want to study the scientific
method. And there is the last book, by Nora
Jones and Casey Rosenthal. This book provides
a chapter dedicated to security chaos engineering.
So the idea is to
consider that security should
be present in our definitions and in our experiments
related to chaos engineering.
And to close, two phrases that
I consider fitting here. Don't fear failure: in
great attempts, it is glorious even to fail.
And one single vulnerability is all an attacker
needs. So thank you. Thank you very much
for attending this talk.
I would like to share my contact data:
my username is Yury Nino on LinkedIn, Medium,
Twitter and others.
Thank you. Thank you very much for attending.