Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, super happy to be here with
you today to discuss how we can build
things, build software, services and
applications faster, while keeping security
and safety top of mind. First of
all, thanks to the Conf42 Cloud Native conference
for allowing me to be here with you today. A quick
intro: I am Mohammed, I'm a backend developer
at Spotify and I'm also a Google Developer Expert
in cloud technologies. And let
the fun begin. So we all know that
the demand for speed and innovation is basically everywhere.
Users nowadays no longer want to wait for weeks
or even days to get their feature
shipped, to get their bug fixed, to get their new
improvements released. However,
that speed, that innovation, that enthusiasm to
release faster can sometimes create conflict when it butts up
against security, especially since security sometimes tends
to value the status quo. And in some
cases the push to increase velocity can
introduce new challenges and risks such as skipped
processes, insufficient testing, insufficient security
testing and so on and so forth.
And when you skip testing,
you are soon going to hit a roadblock. We all know
about a company that was bragging about how fast they were releasing,
how fast they were innovating. They even made it one of their mottos,
right? However, soon enough they hit a roadblock.
And that's to say that security is also important.
Speed and velocity are definitely
required and desired. DevOps enables
us to release faster and ship faster,
making sure that both developers and ops are aligned
so that once the software is ready, we ship it to our users. However,
security is also important because once
you break that confidence that you have,
or that moral relationship with
your customers or clients or
users, once that's broken, it's hard to get back. So that's
why it's also important to keep the data
of your users safe and secure. And that's what DevSecOps tries
to bring to the table, basically. DevSecOps tries
to make sure that the developers are doing what they do best:
release software, introduce new features, fix bugs,
improve reliability and
resiliency, and improve the
user experience as a whole. It enables operators,
once an improvement
or a feature is ready to be deployed, to
deploy it as fast as possible to the users.
And this is what makes DevSecOps really important.
What makes it really particular and important is that it tries to save
a seat at the table for security folks to ensure
that the things that we ship and the software that
we release are also secure, to ensure the safety of our
users and the data of our users. So you have probably figured
out what the remainder of the talk
is going to be. Obviously, we're going to talk about Formula One.
Some quick context.
During the pandemic, I was sitting at home with nothing much
to do, trying to kill some
hours. Usual pandemic thing.
And then I started watching Drive to Survive. It's basically
something between a documentary and a reality TV show. It's not really
a documentary and not really a reality TV show, but it tells the behind the
scenes of a Formula One weekend, of a Formula One race, in an
entertaining manner, the Netflix way.
And in one of the episodes, this actually happened.
So, during the Bahrain Grand Prix, Romain, who is a
Haas driver (Haas is one of the Formula
One teams), was driving super fast,
160 mph, that's more than 250
km/h, he was really ripping along.
Bad maneuver, probably. He hit another driver and
then hit the barrier and the car turned
instantly into a fireball.
Now, Romain remained there
in the extreme heat, in that fireball, for a couple of minutes.
And during that time,
it was interesting to hear the thoughts of the Formula One people.
Because by any human standard, this is actually a death
sentence and a severe accident.
And the Formula One folks were also worried.
A lot of them thought that this was
the end of Romain's racing career, or that at
least he was going to have severe
injuries. However, the miracle actually
happened. Romain managed to get out of the car,
out of the fireball, out of the extreme heat,
on his own feet, sane and safe. This is
a miracle. But it got me thinking about how
the FIA, which is the entity that oversees the
Formula One industry, or races,
or sport, and the teams
in Formula One
value the safety of the drivers. And it got
me thinking: what would happen if we designed our systems,
our architecture, our applications in the same
way, with the same level of devotion with which the Formula
One folks design the cars and make regulations and
architecture to make sure that the safety of the driver
is ensured at any moment, even in an extreme
accident? Now, throughout the documentary,
they were talking about safety, with little to no
mention of security. I am not
a native English speaker, so I use them interchangeably. For me,
they are more or less the same. So I opened the
dictionary out of curiosity and found that security actually
means the state of being free from danger or threat.
And that, from a software engineering point of view, is not
what we're trying to achieve, right? We know that once
we put a system facing the Internet, there are a handful of things
that are considered a danger and that try
to break our application and bring it to its knees.
However, safety means the condition of being protected
from danger, risk or injury. And this is what we are trying to
achieve, right? We know that the production system,
the Internet-facing application, is not safe. However, we try
to design our system in a safe manner,
in a secure manner, build the whole design and architecture,
buy other pieces of software like firewalls
and everything, to make sure that this
risky and unsafe environment is actually safe,
and to ensure the safety of the data of our users,
our services and our servers.
And that got me thinking that good
architecture should actually allow people to move
as fast as they can in a safe
manner. So I started a journey to
understand the safety regulations, the safety
rules that the Formula One folks follow to ensure the safety
of the drivers, and to check if there are any similarities, any lessons
that we can take and apply
in software engineering. And this is
what I ended up with. There are more safety
measures that the Formula One people apply, but for
the sake of time I will limit myself to
ten. I hope that I can make it. It's going to be challenging, but nevertheless,
there are five pre-crash measures and five
post-crash measures. And the data never
lies, as the saying goes, right?
You can see here that as the FIA introduced
more and more regulations to ensure the safety of the drivers throughout the years,
the number of deaths has drastically
decreased. And if we compare the
number of days between fatal incidents, you can see that as more and
more safety measures were introduced, racing
in Formula One became safer and safer,
even if it's still a dangerous sport. The
regulations actually worked and helped to make Formula One a
safe sport, even if they are driving at
super fast speeds. Anyway,
let the fun begin for the second time.
So here, in the first part, we're going to discuss five
pre-crash measures, which are basically the things that they
do and ensure before the crash, before the driver actually
even gets into the car, to make sure
that the driver's safety is ensured.
The first one is the seatbelt that the Formula One drivers
wear, consisting of a
six-point harness, which can be released by the driver with
a single hand movement. This is what the seatbelt of
a Formula One car looks like. It has
six straps and the driver cannot put it on
by himself, he needs help. However,
it can be released by a single push, a single click.
And this is, from a software engineering perspective, similar to the
push-to-deploy analogy, which
makes us think of automation. Automation is really important
in ensuring speed and safety as well.
Automation can enable teams to
deliver greater value by eliminating manual effort.
Tasks like checking for software vulnerabilities
can be handled automatically by
a bot, by a piece of software, by a job, by a pipeline.
And that would free up teams
to pursue more productive
and innovative tasks. Obviously,
automation can also decrease the human errors
that occur with manual approaches.
As we all know, humans do
not excel at manual work. Bots and machines excel
at that. So that can help us to reduce bugs and
reduce errors. So automation
is actually a fundamental part of the DevSecOps movement and the DevSecOps
mindset. It enables us to release software faster,
but also to make sure that the software is actually secure.
The second thing I want to talk about is the stringent dynamic, static
and load tests to ensure the safety of the drivers.
That means that they are testing the car in different conditions.
When the weather is hot, when
it is raining, when there is
wind, when the track is hot, when the track is a
little bit cold. They are testing different
things, the engine, the aerodynamics, a lot of
things,
to make sure that they have data about the car and how
it actually reacts in different conditions,
so that on race day they know
a lot of things about the car that enable them to react
if a crash happens, or even before the crash happens.
And for software engineering,
that
is actually the equivalent of having a trusted,
repeatable and, most importantly, adversarial CI/CD
pipeline that enables us to effectively and repeatedly
test any change to our application at any moment
in time, making sure that our application goes through a
stringent process to understand how it will react
when facing different conditions. And those conditions need to
be the same conditions that our application would
face in production, because there is no
place like production. It's the perfect place in the world,
and it is where the fun happens.
And that will not only engage security
throughout the development and operation processes,
but more specifically will ensure its involvement as
we design this process. And this is actually difficult
at scale and difficult at speed.
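
One way to test a change against real production conditions is the canary rollout that comes up next in the talk. Its promote-or-roll-back decision can be sketched as a simple threshold check. The threshold values and the names `Thresholds` and `canary_decision` are assumptions made for illustration, not anything stated in the talk.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    max_error_rate: float = 0.01       # assumed: at most 1% failed requests
    max_p95_latency_ms: float = 300.0  # assumed latency budget


def error_rate(errors, requests):
    """Fraction of failed requests; 0.0 when no traffic was seen."""
    return errors / requests if requests else 0.0


def canary_decision(errors, requests, p95_latency_ms, thresholds=Thresholds()):
    """Return "rollback" if the canary crossed a threshold, else "promote"."""
    if error_rate(errors, requests) > thresholds.max_error_rate:
        return "rollback"
    if p95_latency_ms > thresholds.max_p95_latency_ms:
        return "rollback"
    return "promote"
```

The metrics would be collected only from the small subset of servers running the new version, so a bad release is caught while it affects few users.
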
Another thing we can do to ensure that we are testing our application
in the same conditions that it is going to face in production is canary deployments.
A canary deployment is a pattern for rolling out releases to a subset
of users. So the idea here
is basically to deploy the change to a small subset of servers,
test it, and have thresholds on errors,
latency, business errors, if you want to.
If we hit those thresholds and pass over them,
we basically roll back, because this is not a safe release.
Otherwise we release it to all the
users. The canary deployment serves
basically as an early warning indicator for
us, with less impact or downtime,
which enables us, as I mentioned, to test our application
against actual production traffic, where the fun is actually
happening. The third thing they have
is that the cockpit is surrounded by deformable crash-protection structures, and
this is what the cockpit looks like. The thing that
strikes me is that they designed the car around this structure,
around this piece of structure
whose sole purpose in life,
or in existence, is actually to ensure
the safety of the driver at all times.
And from a software engineering perspective, this is similar
to designing for failure.
We as developers or IT folks
need to think about failure
tolerance when designing our application, in basically all
layers. No longer can application developers
confine themselves to thinking about functionality only.
We must also consider
how to provide functionality in the
face of database outages,
slow networks or even outages as a whole.
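
To make "providing functionality in the face of a database outage" a bit more concrete, here is a small sketch that serves the last known value, marked as stale, when the database call fails. `ProfileService`, `DatabaseUnavailable` and the shape of the profile record are hypothetical names chosen for this example.

```python
class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) data layer when the database is down."""


class ProfileService:
    def __init__(self, fetch_from_db):
        self._fetch_from_db = fetch_from_db  # callable: user_id -> profile dict
        self._last_known = {}                # last good answer per user

    def get_profile(self, user_id):
        try:
            profile = self._fetch_from_db(user_id)
        except DatabaseUnavailable:
            # Degrade instead of failing: serve the last known value if any.
            if user_id in self._last_known:
                return {**self._last_known[user_id], "stale": True}
            return {"user_id": user_id, "name": "unknown", "stale": True}
        self._last_known[user_id] = profile
        return {**profile, "stale": False}
```

Whether a stale answer is acceptable depends on the feature; the point of the sketch is only that the failure path is designed up front rather than left to an unhandled exception.
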
Something that we can do to ensure that is basically to enable
mTLS, for example. mTLS is mutual TLS,
which basically ensures that both the client and the server are
mutually authenticated and the traffic between them is encrypted.
And this is important in a microservices architecture,
where services are interchangeably
acting as server and client, making sure that
the traffic circulating inside your cluster is encrypted.
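
For reference, the server side of mutual TLS can be sketched with Python's standard `ssl` module. The file paths passed to `build_server_context` are placeholders, and in many clusters a service mesh or sidecar handles this instead of application code, so treat it as an illustration of the idea only.

```python
import ssl


def require_client_certs(context):
    """Turn a server-side TLS context into a mutual-TLS one."""
    context.verify_mode = ssl.CERT_REQUIRED  # reject clients without a valid cert
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def build_server_context(cert_file, key_file, ca_file):
    """Server context that presents its own cert and verifies the client's."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    context.load_verify_locations(cafile=ca_file)  # CA that signs client certs
    return require_client_certs(context)
```

The key line is `verify_mode = ssl.CERT_REQUIRED`: with plain TLS only the server proves its identity, while here the handshake fails unless the client also presents a certificate signed by the trusted CA.
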
So even if an attacker manages to gain access to your cluster, he has
no clue what is happening, and he would need to do additional work to
gain access. Another thing we can do is
design our application in a micro-segmented way. That means
that we are putting the data
as far away from the Internet as possible, and we're
ensuring that we are never exposing
sensitive system data. So even if an attacker
compromises the Internet-facing system, this structure deforms
itself and contains itself, and the attacker
will not have the final data. They will
only have access to the Internet-facing services, which are
not actually sensitive. And we will
make it hard work for them to gain access to the sensitive
data. Moving on. Before they race,
drivers must demonstrate that they can get out of the car within 5
seconds. Any driver who cannot get out of the
car in under 5 seconds before the race:
goodbye. They cannot race for the day.
So if you want to race, if you want to go racing, show me
that you can get out of the car in under 5 seconds, similar to what would happen
if a crash happened. I want you to get out of that car very
fast. And from a software engineering perspective,
this is not only designing for failures,
this is actually testing for failures, right?
A similar discipline in software
engineering is chaos engineering. Chaos engineering,
very briefly, is an approach to identify
failures before they become outages.
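
A very small experiment of this kind could inject a bit of latency into one call and record whether the call still fits its latency budget. The sketch below is only an illustration; `with_injected_latency`, `run_experiment` and the budget values are hypothetical names and numbers, and real chaos tooling works at the infrastructure level rather than inside a function wrapper.

```python
import time


def with_injected_latency(func, delay_seconds, sleep=time.sleep):
    """Wrap func so every call is slowed down by delay_seconds."""
    def wrapper(*args, **kwargs):
        sleep(delay_seconds)  # the injected fault
        return func(*args, **kwargs)
    return wrapper


def run_experiment(func, delay_seconds, budget_seconds):
    """Inject latency into func and record whether it still meets the budget."""
    slowed = with_injected_latency(func, delay_seconds)
    start = time.monotonic()
    result = slowed()
    elapsed = time.monotonic() - start
    return {"result": result, "within_budget": elapsed <= budget_seconds}
```

Running the experiment with growing delays shows at which point the dependency becomes the bottleneck, which is the kind of note-taking the talk goes on to describe.
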
And we do that by proactively testing how
our system responds under stress, under errors and
under failures. And while doing that,
we can identify failures before they end up in the news,
basically. And nothing
stops you. I mean, you can start small, right?
You can design the smallest experiment possible. You can
just add some slowness to your
server, add a few seconds to check how your system reacts,
and then take notes about what
you knew and what you didn't know, and then make
fixes. And then you can grow gradually,
start adding errors, such as taking down a database, a whole
instance, a whole region. And then you gain confidence in the
releases of your application. And then
the last part of the pre-crash measures is that they
have constant monitoring and replacement of the tires.
Look at that. This is the
Red Bull team. They held
the world record for the fastest replacement of the tires.
And look how proud they are,
how happy they are. Not for fixing
the tires, not for trying to patch the tires,
but actually for changing the tires. Why in
the hell do we as software engineers brag about how
long our servers and containers and instances have
been running? What would happen if we started to challenge that mindset and
started adopting the same attitude as
the pit crews, changing the
tires as fast as possible? We all know the pets versus
cattle analogy, right? It basically means that we should not
treat our servers as pets, but more as
cattle. Resources should be started,
removed, killed and switched off
whenever necessary, of course. But what will happen if we push this
zoological analogy a little bit further
and introduce the chicken analogy?
So the amount of food
and the amount of resources a cow needs
to reach adulthood
is much more compared to the
chicken, which needs fewer resources, obviously.
And the time a cow needs to reach adulthood
is two years, almost 24 months,
compared to the chicken, which needs a couple of weeks. So we are
optimizing if we adopt the chicken analogy.
And the chicken analogy is similar to the container analogy, to be honest,
or to be transparent. But there are other metrics
that we can adopt to ensure
the security of our system. And for that,
I want to introduce an example. Imagine that you have configured and
secured a production cluster. On that cluster,
you are running a few mission-critical applications of
your company. Now, a hacker manages to get access to
your system and
gains root access to one of the nodes.
That alone is already bad. But the attacker
starts to use your node as a base to
attack other nodes on your system. So he manages to hide himself
and, without your knowledge, he starts to attack other nodes
and other servers on your cluster.
So imagine now what would happen if this
node is replaced. The attacker will
basically need to do the same process over and over again
to keep his base.
Now imagine that we are repeatedly killing this node
after a certain period of time, let's say one day. The attacker
will need to repeat the same process each and every
day. Once
the node is actually removed, the hacker
has lost their base and they need to do the same process over again.
That's to say that you cannot backdoor a system
that is constantly being reimaged.
So from a security point of view, this is actually
good. We have an uptime,
a certain period of time that we define, and
after that we basically destroy
that node and replace it with a freshly provisioned node.
And that's what we call the reverse uptime. Now combine that with the freshness
of your base image, which is the image,
say the OS image or the
container image, that we use to build and deploy all your other images.
From a security point of view, even with a zero-day vulnerability, you know that the
maximum amount of time needed to roll out the
fix for your vulnerabilities is basically the reverse uptime.
So the attacker who managed to gain access no longer has that access after
the reverse uptime period,
since we fixed the base image and the vulnerability that
he was using. So those are other metrics that we can keep in mind to
ensure the safety and the security of our system.
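
The reverse uptime idea can be sketched as a periodic job that lists every node older than the allowed lifetime so it can be destroyed and re-provisioned from the current base image. The one-day value mirrors the example in the talk; the data shape and function name are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

REVERSE_UPTIME = timedelta(days=1)  # assumed maximum lifetime of a node


def nodes_to_replace(nodes, now, max_age=REVERSE_UPTIME):
    """Return names of nodes that have lived longer than max_age.

    `nodes` maps a node name to the datetime it was provisioned.
    """
    return sorted(
        name for name, created in nodes.items() if now - created >= max_age
    )
```

For example, calling `nodes_to_replace(nodes, datetime.now(timezone.utc))` from a scheduled job gives the list to hand over to whatever provisioning tool the cluster uses.
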
I'm running a bit behind time, so I'll be a bit faster
for the five post-crash measures, the things that they do and ensure after
the crash has happened. The first one is that the driver can be
extricated from the car by lifting out the entire seat. So if
the crash happens, we actually lift out the entire seat.
If we cannot get the driver out, obviously, and he cannot get out by
himself, we lift out the entire seat and provide him with
any measures necessary to save
his life. Take a look here and
notice that designing the car in a modular way enables them, in case
of crisis, to lift out the seat, the entire
seat, from the car. This is similar to designing our
system in a modular and loosely coupled way, which
enables you to react in case of crisis and in case of
incidents. There are a lot of benefits
to having a loosely coupled system. There are fewer dependencies that you
need to manage, version and keep up to date
and free of vulnerabilities.
There is failure isolation: if one
node or one service fails, you have the resiliency to
not fail your whole system, and you keep the other
parts of the system up and running. An evolvable architecture is a great win
as well: you can evolve your system and architecture independently,
reusing other services with less
coupling in your system. The second
thing they have in the post-crash measures is the HANS device.
HANS stands for head and neck support, a system that absorbs
and redistributes forces that would otherwise hit the driver's skull and
neck muscles. And this is what a HANS looks like. Basically, when there is
an incident, the forces generated by the accident are
gigantic, since they are moving super fast. So the HANS
makes sure to redistribute those forces throughout the body
of the driver, forces that would otherwise basically break his neck.
And this is similar in software engineering to designing our
system in an elastic way, making sure that we have a set of patterns
to build our architecture in an elastic manner,
such as load balancers. If the traffic
increases and our nodes are not
keeping up, we can have auto scaling to
introduce more nodes. We can have request timeout thresholds if some
of the services are starting to get slow. In some
cases having degraded performance is better
than having an outage or nothing at all.
And also adopting some anti-overload patterns such as circuit
breaking and exponential backoff helps to alleviate
some of the load issues.
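
Of the anti-overload patterns just listed, exponential backoff is the easiest to show in a few lines. The sketch below retries a failing call with a delay bound that doubles each time, plus random jitter so that many clients do not retry at the same instant. The default values are arbitrary assumptions, and circuit breaking is not shown.

```python
import random
import time


def backoff_delays(retries, base=0.1, cap=5.0):
    """Upper bounds for each retry delay: base * 2**attempt, capped."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def call_with_backoff(func, retries=4, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call func, retrying on exception with exponential backoff and jitter."""
    for delay in backoff_delays(retries, base, cap):
        try:
            return func()
        except Exception:
            # Full jitter: wait a random fraction of the current bound.
            sleep(delay * rng())
    return func()  # last attempt; let any exception propagate
```

The `sleep` and `rng` parameters exist only so the behaviour can be checked without real waiting; production code would normally leave them at their defaults.
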
Another thing they have is
driver suits that are fire resistant, and they
can keep the driver's body under 41
degrees for at least 11 seconds. And this is how our
dear friend Romain managed to stay sane and safe,
because his suit could keep his body under
41 degrees, even in
a fireball, in the extreme heat. And from a
security point of view, this is similar to containing the damage and
containing your attacker.
And there are a couple of security principles that can be followed to
achieve that. Adopting the least privilege principle:
making sure that our applications and services are running with the
minimum set of access rights and resources that enable them to perform their function.
Defense in depth: having layers, basically, so even
if an attacker manages to breach one layer, we have an additional layer, a
second layer, many layers afterward, that can make
his life harder and slow him down until we figure out what's happening
and fix the vulnerability. Having a zero trust policy:
zero trust means eliminating implicit trust and continuously
validating every stage of digital
interaction and communication between your services. It's really important
to have no implicit trust,
but to request authentication
and authorization every time from the calling services or
servers or whatever.
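
As one possible reading of "no implicit trust", every request between services can carry a signature that the receiver verifies on each call, regardless of where the request came from. The sketch below uses an HMAC with a per-caller key; the key table and the message format are assumptions for illustration, and real deployments usually rely on mTLS identities or signed tokens instead.

```python
import hashlib
import hmac


def sign(caller, payload, key):
    """Signature a caller attaches to prove who it is (key is per caller)."""
    message = f"{caller}:{payload}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def is_authentic(caller, payload, signature, keys):
    """Check the signature on every request; unknown callers are rejected."""
    key = keys.get(caller)
    if key is None:
        return False
    expected = sign(caller, payload, key)
    return hmac.compare_digest(expected, signature)
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not leak how many characters of a forged signature were correct.
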
Another thing we can do is basically having
hardware security modules. An
example would help here to understand what I'm trying to
say. So imagine an attacker managed to get access to
your database or to your server and extract
the data from your database that contains,
let's say, the passwords of your users,
without your knowledge. He can take that data,
brute force it, manage to break
it, and he would gain access to
the plain text of your users' passwords.
And you will not know until you find that data
on the market, on the dark web. You were basically using
a hash function to hash your users'
passwords and he managed to break it somehow. Imagine now having
a dedicated entity or a dedicated server that gives you
a key with which you hash
or encrypt the password,
so that you cannot decrypt it or
compare it unless you have that key.
Now even if an attacker gets this data out of your cluster,
he would be severely limited in the impact he can make,
since he would need to stay within your cluster to get that key in order
to be able to break the data and recover the
passwords. So we are keeping him in and containing him until
you figure out what's happening. And hopefully that can help you gain some precious
minutes until you figure out what's happening.
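
The keyed hashing described above can be approximated in plain Python to show the principle. In this sketch `pepper_key` stands in for the key held by the dedicated server or hardware security module; with a real HSM the key would never leave the device and the keyed step would run inside it. The function names and the round count are assumptions.

```python
import hashlib
import hmac


def hash_password(password, salt, pepper_key, rounds=100_000):
    """Hash a password with a per-user salt and a secret key kept elsewhere."""
    # Keyed step: an attacker holding only the database dump lacks pepper_key.
    keyed = hmac.new(pepper_key, password.encode(), hashlib.sha256).digest()
    return hashlib.pbkdf2_hmac("sha256", keyed, salt, rounds)


def verify_password(password, salt, pepper_key, stored_hash):
    """Recompute the hash with the same salt and key, compare in constant time."""
    candidate = hash_password(password, salt, pepper_key)
    return hmac.compare_digest(candidate, stored_hash)
```

A salt would typically come from `os.urandom(16)` and be stored next to the hash, while the key stays out of the database, which is what forces the attacker to remain inside the cluster.
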
Another thing they have is a fire suppression system that can
be activated by the driver or externally by
the race marshals. So in case of a fire, the driver can
trigger the system by himself.
Otherwise a race marshal can trigger it.
And this is the equivalent of having access
policies for your system. There is no right or wrong way:
define the access policies that you want, starting with
communication policies between your services. If a service doesn't need access
to another service, don't give it that access, block it.
Basically each service needs to be allow-listed before calling
another service.
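
A default-deny allow list between services can be as small as the following sketch. The service names and the `ALLOWED_CALLS` table are made up for illustration; in practice such a policy usually lives in network policies or service mesh configuration rather than in application code.

```python
# Hypothetical policy: caller service -> services it is allowed to call.
ALLOWED_CALLS = {
    "frontend": {"orders"},
    "orders": {"payments", "inventory"},
}


def is_call_allowed(caller, callee, policy=ALLOWED_CALLS):
    """Default deny: a call passes only if it is explicitly allow-listed."""
    return callee in policy.get(caller, set())
```

Because unknown callers map to an empty set, anything that is not listed is blocked without needing an explicit deny rule.
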
Have RBAC and ACL policies for your users as well. Don't be overly restrictive,
but don't be too optimistic either and
leave things without any
protection in place. And finally, they have a
data recorder that keeps data about everything.
I've read that in each race
each Formula One car can generate more than
1 [unit unclear] of data, which is a lot of
data for each race. And that goes to say
that no system should go to production without having monitoring and
logging tools in place, especially security
ones that help you to detect unusual behavior, trigger alarms
and allow you to react in case of issues.
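
As a toy example of detecting unusual behavior from logs, the sketch below flags sources with too many failed logins. The event shape and the threshold are assumptions; real monitoring stacks do this with time windows, alerting rules and much richer signals.

```python
from collections import Counter


def suspicious_sources(events, max_failures=5):
    """Return sources with more failed logins than max_failures.

    `events` is an iterable of (source, outcome) tuples, where outcome is
    "ok" or "failed". The shape of the log records is an assumption.
    """
    failures = Counter(src for src, outcome in events if outcome == "failed")
    return sorted(src for src, count in failures.items() if count > max_failures)
```

The returned list is what would feed an alarm, so that someone can react while the attempt is still in progress.
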
I want to end this presentation by highlighting how comfortable
driving a Formula One car is. Look how the driver
sits in a Formula One car. It's actually a trade-off
between speed and safety and extracting
the maximum from the car. And the same thing applies to software
engineering and the way we design our systems.
It's basically a trade-off
that we need to make between security,
speed and velocity.
And I want to end this presentation with a scary number from the Cost
of a Data Breach report by the IBM Security folks.
They ran an analysis
and found that, on average,
the cost of a cyber attack is more than 4 million,
and it took more than 280 days to detect the breach,
which is a long period of time. So I hope that throughout this
presentation you discovered some patterns
that will help you to ensure the safety of your systems, applications,
servers and services while increasing
your velocity and moving fast. Thank you
very much and thank
you to Conf42 Cloud. I'm a little bit over time,
so sorry for that. Happy to take your questions if you have any,
and I'm pretty available on Slack.
Those are the resources that helped me to build this presentation.
And yeah, that's pretty much it. Thank you very much.
I hope you enjoyed the talk and I wish you a
great conference.