Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, and welcome to this talk about infrastructure
drift, infrastructure as code, DevSecOps and, well,
driftctl. All right, the agenda.
So in this talk I will briefly define what is
infrastructure drift, what we learned the hard way, and then it
will be fun. Live demos and stories from the trenches.
This will be a very hands on talk. Thank you for attending.
So basically last year we were working on
a GitOps platform to support infrastructure as code and
all the tooling, the best practices, et cetera. And at some point
we looked for feedback to build the next features and
we interviewed our users and other DevOps,
SREs, CTOs, et cetera. And the most important pain those people
reported was around infrastructure drift, the
visibility of what's living outside of your Terraform
configuration. So if you have only one thing to remember from
this talk, it's that almost everyone experienced infrastructure
drift recently, and that drift can have major consequences,
including for security. In the end, that's the reason
why we built this open source tool called driftctl.
"Drift control", "drift cuttle", "drift C-T-L": it's up to you, choose your poison.
So I'm Stephane Jourdan, sjourdan pretty much everywhere
on the Internet, I've been around infrastructures for
the past 20 years. I cofounded a bunch of companies in
Canada and Europe, and well, many years ago I
coauthored this book, Infrastructure as Code Cookbook,
and more recently I'm the happy cofounder of the open source tool
driftctl with a wonderful team. So after talking to hundreds of
people, we had to come up with a common definition of infrastructure
drift. And basically it's what happens when the reality
and the expectations don't match. It's also known as "oh my
God", when things go wrong while they really should not.
So according to this definition, your AWS account has
a ton of actual resources, that's on the right, and maybe
only part of it is correctly managed by Terraform, that's on
the left. And the diff is the drift.
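A minimal sketch of that definition, assuming each resource can be reduced to a "type.id" string. Every identifier below is made up for illustration; the point is only that drift is the set difference between what exists and what Terraform tracks, and that coverage is the share of existing resources that are tracked.

```python
# Toy inventories: all identifiers are hypothetical.
cloud_resources = {
    "aws_iam_user.microservice",
    "aws_iam_access_key.key-1",
    "aws_iam_access_key.key-2",        # created by hand on the console
    "aws_security_group.super-secure",
}
terraform_state = {
    "aws_iam_user.microservice",
    "aws_iam_access_key.key-1",
    "aws_security_group.super-secure",
}


def drift_report(actual, managed):
    """Compare what exists (actual) with what is tracked (managed)."""
    unmanaged = actual - managed   # exists in the cloud, unknown to the state
    missing = managed - actual     # tracked in the state, gone from the cloud
    covered = actual & managed
    coverage = 100 * len(covered) // len(actual) if actual else 100
    return {
        "unmanaged": sorted(unmanaged),
        "missing": sorted(missing),
        "coverage": coverage,
    }


report = drift_report(cloud_resources, terraform_state)
print(report)
```

With these toy sets, the only unmanaged resource is the second access key, and three of the four existing resources are covered.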
So what are the causes of the drift? There are many,
but the major ones are the manual changes made directly
on the AWS console, the authenticated Lambdas or microservices,
you know, the scripts that you run every night, I don't know,
to snapshot those disks on AWS, or scan for certain
tags, et cetera. It can also be this other team
working with this other tool, or that AWS service
not yet available in the Terraform provider version
that you are using. Maybe it's just that new project starting on
this new cloud provider. It's different from the current one and there is no time
to automate it properly. All those activities
and others are authenticated and are sources
of drift. So it could be no big deal and
maybe we could all live with those drifts.
The problem is, misconfiguration can lead to security issues you
don't even know about. And we see here on this picture that
the NSA recently labeled poor access
control and misconfiguration as top issues.
And definitely you want to know about the unknown stuff running in your
cloud provider accounts. As we just said, we definitely need
to shed some light on the unknown resources of your accounts,
those that are not under control with terraform.
And only then you'll know about misconfiguration.
Only then you will be able to address or fix the issues.
Now it's time for the stories, for the fun part, the live part.
I will share a few stories. They are simplified for the
talk. They are from our closed environment.
And I'll show as well how our open
source answer, driftctl, helps in these
cases to find vulnerabilities in a configuration.
So the first story is about this authenticated
microservice, or Lambda actually. And basically
during the weekend there was a problem, it was just
deployed, things like that. And basically, let me switch.
Here we are. And basically what the on-call guys
did that weekend was go to the microservice
IAM user, and basically they said, oh,
maybe it's a problem with the credentials, so let's
deactivate the first one and create a new access key.
So at least we will try that. Unfortunately it
did not help. So they said okay, maybe it's this
policy, read only, is not enough. Let's try again and attach
a larger policy. So it was administrator
access actually, but it's quicker.
Imagine what can go wrong. And basically we
don't know what happened, but a few minutes later it
started to work again and they said okay, we're good
with that. We will report it back on Monday to the
team and we'll be good. So what happened
is that they forgot to report it back on Monday, as you
can imagine. And it stayed that way for a long time.
But it was cool. They had everything in Terraform. So as
you can see here in
the code, it's working. So this is basic
Terraform code. If you're not used to it, it basically means that
you create three resources. One is a user, one is an access key,
and the other one is an attachment for a policy, a managed,
read-only one. So basically it will create a
user, a microservice user, an IAM user, with a dedicated access
key for this specific user, and a policy attachment,
which is the read-only access you wanted. And basically,
this configuration was working perfectly well
with Terraform. And if you applied Terraform
here, terraform apply,
well, it didn't see the change, because Terraform is meant
to see only changes within its
scope. And basically, a new attachment in
this case is seen as a new resource, as a new object.
It cannot see that there is a new attachment. It's not meant
for that. So had our
team used driftctl here, it would have
given the following result. It's scanning
the AWS account and
comparing the results with the Terraform state.
So in this case, for the three stories I'm going to tell
you, we have three states, for IAM users,
VPC security groups and S3 buckets. We can see here
that we scanned 59 resources on
the account and that driftctl found resources
not covered by Terraform, like this access key that
we just created. And it belongs to
the IAM user, microservice LHz
something. We also display some information,
like the number of resources, the percentage of coverage of your
Terraform code compared to what's on your AWS account,
et cetera. So that's the kind of thing Terraform can
miss by design on CI.
And that's the kind of thing driftctl can show you live
if you use it. So this is the first story for
this example. The second one is about security groups.
At the beginning of the pandemic, people started working from home.
And in this team, a couple of people couldn't connect
to the company network and they didn't want to wait for IT
to solve the issue. They asked some managers to do something.
And one of them had enough credentials to open up everything
to everyone. And basically it solved the issue
and you'll see. So what does it look like? VPC...
no, here we are,
main. So this is a security group. It's called super secure
because it only allows port 22, SSH, from the
local network on AWS. So basically it's completely locked down.
So if
you go there,
definitely we are good.
So what did the team do in this case?
The guy went to the security group in the console,
he edited inbound rules.
And basically what he did was adding a rule,
all traffic from anywhere
on IPv6 and IPv4, and called it
temp, so that he was sure he was going to remember it.
All right, so now you have all your rules here.
And the terraform was on
CI. It was running every hour or so, and at the next terraform
apply maybe 2 hours later. What happened?
Do you think it would be reported as a
new rule in the managed security group?
Unfortunately not, because, same story,
it's not the same thing as what you see on AWS.
On AWS, the rules look like two objects,
or two elements of a single table or array, but that's
not the case at all in Terraform. The way it
was written here in Terraform, basically,
if we look at this again
here, it doesn't mean that adding an element to the array
would be seen by Terraform. It's a different object once again.
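To make the "different object" point concrete, here is a deliberately simplified model, not Terraform's real refresh logic. It assumes each rule is its own tracked object with its own identifier (all identifiers and values are hypothetical). A comparison that only re-reads objects already in the state has nothing to say about a rule it never tracked, while a scan of everything on the account finds it.

```python
# Hypothetical rule identifiers and attributes, for illustration only.
remote = {
    "sg-rule-ssh": {"port": 22, "cidr": "10.0.0.0/16"},
    "sg-rule-temp": {"port": "all", "cidr": "0.0.0.0/0"},  # added by hand
}
state = {
    "sg-rule-ssh": {"port": 22, "cidr": "10.0.0.0/16"},
}


def changed_in_scope(state, remote):
    """Re-read only the objects already tracked in the state."""
    return {rid: attrs for rid, attrs in state.items() if remote.get(rid) != attrs}


def unmanaged(state, remote):
    """List every remote object the state has never heard of."""
    return sorted(set(remote) - set(state))


print(changed_in_scope(state, remote))  # {}
print(unmanaged(state, remote))         # ['sg-rule-temp']
```

The tracked SSH rule is unchanged, so the in-scope comparison is empty; only the account-wide scan surfaces the temporary rule.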
So basically we needed a tool like driftctl
to show this team the
drift they had on their account. So here,
now, this team would have a security
group rule discovered as unmanaged, not covered
by Terraform, and they would have had the whole information, like the
IPv6 and IPv4 source addresses,
the name, the ID of the security group, the type.
And they would have seen as well that their coverage went down. It went
from, how many was it,
how much was it, it was 85% and now it's 66%.
So it went down heavily. So basically that's about
it for the security group story. If you add rules
manually on the AWS web console,
and you wrote your rules that
way in Terraform, you will never
see the added rules using Terraform. You need to use
a specific tool for that. So unfortunately for them,
nobody reported the change in the weeks after that.
And the security group stayed open for a
very long time. So the third and final story
for this talk is the terrible AWS S3
story that a team in
the UK, in London, reported to me a few months ago.
They were building a plugin, there were like 100 developers,
and every hour or so they were uploading to S3
a temporary build of this plugin.
And it was then processed by QA, tests,
et cetera, and then dropped because it was useless
at the time, just hourly testing of everything. So it was
hundreds of gigabytes every day, uploaded and processed and dropped.
At some point there was another team in another country.
They wanted to improve security and compliance in
the company. And they went to see them and told them,
well, we will enable redundancy, versioning,
sorry, in S3, so we will do it on
your bucket as well. So the team I was talking to
told them, well, it's perfectly useless in our case. It's cool
for the others, but it's useless in our case because it will cost a lot,
and we definitely don't care about what we upload to S3.
If it burns we don't care. We will have lost 1 hour of
work, it's not an issue. So the other team said okay, cool,
you mentioned that. So we will not enable S3 versioning
on your bucket. They went home, everybody was happy because
the first team thought, well, anyway, if they make
a mistake we have everything in Terraform. And if they change
a setting we will know about it, because as you can see
here, the resource is pretty simple and the bucket is declared, so we
will know if they change something. So no problem with that. And the
other team was happy as well because they wrote a wonderful script with
a lot of exceptions for all the buckets, et cetera. And if
it failed they would know about it as well. They executed the
script, it enabled versioning everywhere, and
absolutely everywhere, because their exception didn't work, but they didn't know
about it and they went on with their lives. And guess
what happens in the first week of every month?
It's the AWS bill. So what happened in their case?
So this is the bucket. And basically the
script did something like enable
versioning here. So it was not on the web console, but it's the
same idea. So they enabled versioning,
and now driftctl
would see
here a changed resource. So the
aws_s3_bucket conf42 demo, whatever,
has versioning enabled going from false
to true. If we go to the S3 folder, as you
can see here, and we terraform apply,
it refreshes from AWS, and it will
not report the change. And even worse,
well, in some way it's not worse, it's meant for
that, but if we look for the
S3 bucket here, we will see
that versioning is enabled
directly in the state. So that's something that was written
to the state while not,
while not expressed directly on the resource.
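The behavior described in this story can be modeled roughly as follows. This is a toy model under stated assumptions, not how Terraform is implemented: it assumes the plan only compares attributes that the configuration declares, and that a refresh copies remote values into the state. The bucket name and attribute names are hypothetical.

```python
# Toy model: names and attributes are hypothetical.
config = {"bucket": "demo-bucket"}                        # what the .tf file declares
state_before = {"bucket": "demo-bucket", "versioning": False}
remote = {"bucket": "demo-bucket", "versioning": True}    # changed by the script


def plan(config, state):
    """Compare only the attributes the configuration declares."""
    return {k: (state.get(k), want) for k, want in config.items() if state.get(k) != want}


def refresh(state, remote):
    """Copy the remote value of every tracked attribute into the state."""
    return {k: remote.get(k, v) for k, v in state.items()}


def attribute_drift(state, remote):
    """Compare the recorded state with the remote, attribute by attribute."""
    return {k: (v, remote[k]) for k, v in state.items() if k in remote and remote[k] != v}


drift = attribute_drift(state_before, remote)
state_after = refresh(state_before, remote)

print(drift)                      # {'versioning': (False, True)}
print(plan(config, state_after))  # {}
print(state_after["versioning"])  # True
```

In this model the change is visible only if the old state is compared with the remote before the refresh; after the refresh the state already says versioning is enabled, and the plan has nothing to report because the configuration never mentioned versioning.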
Basically this kind of mistake can cost a lot
as well. It's not a security issue in this case, but it can cost a
lot to the company, because it was definitely not meant to
be enabled for weeks. These are a
few examples from real life, but real life also includes
different ways of managing that kind of drift.
And if you want to integrate the data driftctl
reports with another tool, maybe you
want to store the data in a database for dashboards,
for tracking, for integration in reports,
et cetera. So you can use the JSON output
from driftctl. Let me show that
to you. You can just add the output here,
JSON, output.json, sorry.
And basically, oh yeah,
I'm not on the right folder and
here we are.
Okay, same output. And now if we take a look at
output.json, we have
the same output that we had previously,
but now as JSON. What's the use
of that? Maybe you want to keep only the
coverage output so you can just graph the percentage
directly on your, I don't know, on a dashboard or something,
maybe on a pull request, that kind of thing. So you can
use jq to parse the output and, well,
do whatever you want with it after that. That's it for the
JSON output. There are a lot of filters. If you want to report
only on a certain type of resources, only a certain tag,
anything specific to your case, you can do it with driftctl
filters. They use the JMESPath query
language. It's very useful to analyze and report on
specific cases. This example here is about
security group rules. It's not bad.
It's a good example in our case. So it's
filter, type equals aws_security_group_rule,
and
let's go. So the tool will report
only on AWS security group rules and not
the other resources. You can see here that the coverage
dropped dramatically, because we have so much drift
in security group rules that the coverage is now very, very low.
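As a rough illustration of both the JSON output and the filter, here is a sketch that reads a report, pulls out the coverage figure, and recomputes coverage for a single resource type. The sample document is hand-written for this example: the field names ("coverage", "managed", "unmanaged", "id", "type") and all identifiers are assumptions, and the plain Python comparison only stands in for a real filter expression.

```python
import json

# Hand-written sample report; field names and identifiers are assumptions.
sample = """
{
  "coverage": 66,
  "managed": [
    {"id": "microservice", "type": "aws_iam_user"},
    {"id": "key-1", "type": "aws_iam_access_key"},
    {"id": "sg-1", "type": "aws_security_group"},
    {"id": "rule-ssh", "type": "aws_security_group_rule"}
  ],
  "unmanaged": [
    {"id": "rule-temp-v4", "type": "aws_security_group_rule"},
    {"id": "rule-temp-v6", "type": "aws_security_group_rule"}
  ]
}
"""


def coverage(report, resource_type=None):
    """Share of existing resources that are managed, optionally for one type."""
    def keep(resource):
        return resource_type is None or resource["type"] == resource_type

    managed = [r for r in report["managed"] if keep(r)]
    unmanaged = [r for r in report["unmanaged"] if keep(r)]
    total = len(managed) + len(unmanaged)
    return 100 * len(managed) // total if total else 100


report = json.loads(sample)
print(report["coverage"])                           # 66
print(coverage(report))                             # 66
print(coverage(report, "aws_security_group_rule"))  # 33
```

Restricting the report to security group rules drops the figure from 66 to 33 in this sample, because both unmanaged resources are of that type.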
So yeah, filters are really, really useful in
driftctl to report on specific areas,
even for enumeration or regular audits as
well. So there is also this
.driftignore file, because you will probably start with
a lot of unmanaged resources in your AWS account.
I have a lot of mess, a lot of test files, and you
probably do as well. Even just testing things
on AWS creates a lot of sub-resources that
are not cleaned up afterwards. So driftctl has
support for this file, .driftignore.
It's pretty much like the .gitignore file that you already know.
It takes a resource type, like in this case
aws_iam_user, and a name or an ID. So it's
terraform in this case, because it's the first user that I created manually
on the AWS console. So it's aws_iam_user.terraform,
but it can be an ID as well, for a security group, et cetera.
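The "type, dot, name or ID" format just described can be sketched like this. It is an illustration of the idea only, not the tool's own parser: treating "#" lines as comments is an assumption of this sketch, and the security group ID is invented.

```python
# Sample ignore list; the second entry uses a made-up security group ID.
ignore_file = """
# first user, created by hand on the console
aws_iam_user.terraform
aws_security_group.sg-0123456789abcdef0
"""


def parse_ignore(text):
    """Collect non-empty, non-comment lines as 'type.id' rules."""
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            rules.add(line)
    return rules


def is_ignored(resource_type, resource_id, rules):
    """A resource is ignored when its 'type.id' pair is listed."""
    return f"{resource_type}.{resource_id}" in rules


rules = parse_ignore(ignore_file)
print(is_ignored("aws_iam_user", "terraform", rules))     # True
print(is_ignored("aws_iam_user", "microservice", rules))  # False
```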
So you can ignore specific resources while
you work on them, while you, I don't know,
import them into your Terraform code,
or while you maybe remove them from
your account. At least you can start fresh, or you can
work on it at your own pace. So when it's included
in your .driftignore file, it will be ignored. That's the name
of the file, and you can also add this file to
your own Docker image. So if you work with
your own Docker image, we have a cloudskiff/driftctl
image available, the official image, and you can start
from there to add your own .driftignore, so you are perfectly
ready to launch driftctl with the ignore file,
I don't know, as a scheduled job on
Kubernetes for example. Finally, once you
have your initial reports cleaned, you can deploy driftctl
in your CI pipeline, or at least as
a recurring task, cron job, et cetera,
like, well, every hour or so if your infrastructure
is not too big, so you can get alerts
when something goes wrong within the hour.
I've seen different use cases for driftctl.
So one of them is that after terraform
apply is executed in the pipeline,
a lot of other scripts are executed, stuff is done
after terraform apply in the pipeline. And to ensure those scripts
and stuff don't create a mess, people execute driftctl
after terraform apply and
the rest of the pipeline, at the end. That's one use case. Another
use case is the exact opposite, the other way around. People want
to ensure that the situation is perfectly under control before doing
anything with Terraform involving apply,
like using driftctl as a safety check before terraform
apply can change anything in the state. Remember the
TF state a few minutes ago that was modified
because of the terraform refresh. It's the job of terraform
refresh to do this. But if you do not want this to happen
in your case, well, put driftctl first
in your pipeline. So you can put it at the end if
you want to ensure things are okay.
Or at the beginning if you want to ensure things
are under control. Before executing terraform apply, you can use it
in parallel with terraform plan as well because they do different
things, one from the terraform scope and the other from
outside the scope of terraform.
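Whichever position is chosen, the pipeline step boils down to turning a drift summary into a pass or fail decision. A minimal sketch of such a gate, assuming the summary has already been extracted from a scan report; the key names "unmanaged" and "changed" and the thresholds are inventions of this example.

```python
# Hypothetical drift summaries; the key names are assumptions of this sketch.
def drift_gate(summary, allow_unmanaged=0, allow_changed=0):
    """Return 0 to let the pipeline continue, 1 to stop it."""
    too_many_unmanaged = summary.get("unmanaged", 0) > allow_unmanaged
    too_many_changed = summary.get("changed", 0) > allow_changed
    return 1 if (too_many_unmanaged or too_many_changed) else 0


before_apply = {"unmanaged": 1, "changed": 0}  # someone touched the console
after_apply = {"unmanaged": 0, "changed": 0}   # post-apply scripts left no mess

print(drift_gate(before_apply))                     # 1
print(drift_gate(after_apply))                      # 0
print(drift_gate(before_apply, allow_unmanaged=1))  # 0
```

Used before terraform apply, a non-zero result would stop the run before the state is refreshed; used at the end, it would flag whatever the post-apply scripts created.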
In any case, I strongly advise you to
deploy driftctl like an hourly check so you can
get reports if anything changes.
So yeah, basically that's about it.
Driftctl is a brand new
tool. It was released last December, early January. It's written
in Go, the license is Apache 2, it has support for AWS
and GitHub. More is to come. You can go to GitHub Discussions
and upvote the next cloud provider, because we are really community driven.
So if you want Azure or DigitalOcean or GCP
or I don't know what cloud provider, go vote for it, because
that's where we're going to decide what's the next cloud
provider to support. We support today the Terraform
state locally, S3, HTTP; Terraform
Cloud is coming very soon. Basically same way,
same situation: if you want to vote for other features, like
reading the Terraform states, go vote on GitHub
Discussions, there are a lot of discussions about that. Filtering, ignore, I showed them
in the slides. We are also very present,
very active on Discord. You can go to drift control driftl.com
d. It's a shortcut that redirects you
to our Discord community. We are there. We can answer all your
questions live. We can help you live there as well.
So if you joined during the talk,
what to remember is that almost everyone is experiencing
infrastructure drift and that we basically built
driftctl as an open source tool to
increase the knowledge and security of
automated cloud infrastructure drift. Yeah, thanks for watching
and see you soon on Discord and on GitHub. Thank you.
Bye.