Conf42 Cloud Native 2021 - Online

Why you should take care of infrastructure drift

Abstract

As infrastructure as code (IaC) becomes widely adopted by users with heterogeneous skill sets, and as IaC codebases grow larger and larger, it becomes harder to track drift. Drift is a deviation between the actual infrastructure state and the IaC codebase. It causes issues for security posture management, collaborative work, and maintenance. It's hard to improve what you can't measure! Can we define good metrics for drift?

Developers track unit test coverage to see how well unit tests match application code over time. Can we use an analogy and define infrastructure code coverage to track how well IaC matches the actual infrastructure state?

In this talk, we will show how minor infrastructure drift can cause issues. We will then introduce various ways to track IaC coverage, and how we can use them to bring visibility into the state of infrastructure and anticipate common drift issues. We will also show how measuring the IaC codebase benefits IaC adoption.

Summary

  • We were working on a platform to support infrastructure as code. Almost everyone has experienced infrastructure drift recently, and drift can have major consequences, including for security. That's the reason why we built the open source tool driftctl. Live demos and stories from the trenches.
  • Stephane Jourdan: Drift is what happens when the reality and the expectations don't match. Misconfiguration can lead to security issues you don't even know about. The major causes of drift are manual changes on the AWS console and authenticated lambdas. Only once you know about the unknown resources can you fix the issues.
  • Now it's time for the stories, for the fun part, the live part. I will share a few stories, simplified for the talk, and show how our open source answer, driftctl, helps in these cases to find vulnerabilities in a configuration.
  • The third and final story for this talk is the terrible AWS S3 story, which a team in London, in the UK, reported to me a few months ago. This kind of mistake can cost a lot as well. Real life also includes different ways of managing that kind of drift.
  • driftctl has support for the .driftignore file. It's pretty much like the .gitignore file that you already know. Once you have your initial reports cleaned, you can deploy driftctl in your CI pipeline. More is to come.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, and welcome to this talk about infrastructure drift, infrastructure as code, DevSecOps and, well, driftctl. All right, the agenda. In this talk I will briefly define what infrastructure drift is and what we learned the hard way, and then it will be fun: live demos and stories from the trenches. This will be a very hands-on talk. Thank you for attending.
So basically, last year we were working on a platform to support infrastructure as code, with all the tooling, the best practices, et cetera. At some point we looked for feedback to build the next features, and we interviewed our users and other DevOps, SREs, CTOs, et cetera. The most important pain those people reported was around infrastructure drift: visibility of what's living outside of your Terraform configuration. So if you have only one thing to remember from this talk, it's that almost everyone experienced infrastructure drift recently, and that drift can have major consequences, including for security. In the end, that's the reason why we built this open source tool called driftctl. "Drift control", "drift cuttle", "drift C-T-L": it's up to you, choose your poison.
I'm Stephane Jourdan, sjourdan pretty much everywhere on the Internet. I've been around infrastructures for the past 20 years. I cofounded a bunch of companies in Canada and Europe, and many years ago I co-authored this book, Infrastructure as Code Cookbook. More recently, I'm the happy cofounder of the open source tool driftctl, with a wonderful team.
So after talking to hundreds of people, we had to come up with a common definition of infrastructure drift. Basically, it's what happens when the reality and the expectations don't match. It's also known as "oh my God": when things go wrong while they really should not. According to this definition, your AWS account has a ton of actual resources, that's on the right, and maybe only part of it is correctly managed by Terraform, that's on the left. And the diff is the drift.
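
The definition above ("the diff is the drift") can be pictured as a set difference. The following is only a minimal sketch under stated assumptions: the resource names are hypothetical, resources are reduced to (type, id) pairs, and the coverage formula is an illustrative guess, not necessarily how driftctl computes its own percentage.

```python
def compute_drift(actual, managed):
    """Compare what exists in the account with what the IaC code manages.

    Both arguments are iterables of (resource_type, resource_id) pairs.
    Returns (unmanaged, missing, coverage_percent).
    """
    actual, managed = set(actual), set(managed)
    unmanaged = actual - managed   # exists in the account, unknown to the IaC code
    missing = managed - actual     # declared in the IaC code, absent from the account
    covered = actual & managed     # exists and is managed
    # Assumed formula: share of actual resources that are managed, rounded down.
    coverage = 100 * len(covered) // len(actual) if actual else 100
    return unmanaged, missing, coverage


# Hypothetical account content, for illustration only.
actual = {
    ("aws_iam_user", "microservice"),
    ("aws_iam_access_key", "key-old"),
    ("aws_iam_access_key", "key-new"),  # created by hand on the console
}
managed = {
    ("aws_iam_user", "microservice"),
    ("aws_iam_access_key", "key-old"),
}

unmanaged, missing, coverage = compute_drift(actual, managed)
print(sorted(unmanaged), sorted(missing), coverage)
```

In this toy case the hand-created access key is the only unmanaged resource, which is the kind of finding the stories in this talk revolve around.
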
So what are the causes of drift? There are many, but the major ones are manual changes made directly on the AWS console, and authenticated lambdas or microservices. You know, the scripts that you run every night, I don't know, to snapshot those disks on AWS, or scan for certain tags, et cetera. It can also be this other team working with this other tool, or that AWS service not yet available in the Terraform provider version that you are using. Maybe it's just that new project starting on a new cloud provider, different from the current one, with no time to automate it properly. All those activities, and others, are authenticated and are sources of drift.
So it could be no big deal, and maybe we could all live with those drifts. The problem is, misconfiguration can lead to security issues you don't even know about. We see here on this picture that the NSA recently labeled poor access control and misconfiguration as top issues. And you definitely want to know about the unknown stuff running in your cloud provider accounts. As we just said, we need to shed some light on the unknown resources in your accounts, those that are not under control with Terraform. Only then will you know about misconfiguration. Only then will you be able to address or fix the issues.
Now it's time for the stories, for the fun part, the live part. I will share a few stories. They are simplified for the talk. They are from our close environment. And I'll show as well how our open source answer, driftctl, helps in these cases to find vulnerabilities in a configuration.
So the first story is about this authenticated microservice, or lambda actually. During the weekend there was a problem; it was just deployed, things like that. And basically, let me switch. Here we are.
What the on-call guys did that weekend was go to the microservice IAM user, and they said, oh, maybe it's a problem with the credentials, so let's deactivate the first access key and create a new one. At least we will try that. Unfortunately it did not help. So they said, okay, maybe this read-only policy is not enough. Let's try again and attach a larger policy. It was administrator access, actually, but it's quicker. Imagine what can go wrong. We don't know what happened, but a few minutes later it started to work again, and they said, okay, we're good with that. We will report it back on Monday to the team and we'll be good. What happened is that they forgot to report it back on Monday, as you can imagine, and it stayed that way for a long time.
But it was cool: they had everything on Terraform. As you can see here in the code, it's working. This is basic Terraform code. If you're not used to it, it basically means that you create three resources: one is a user, one is an access key, and the other one is an attachment for a policy, a managed policy, the read-only one. So it will create an IAM user for the microservice, with a dedicated access key for this specific user, and a policy attachment, which is the read-only access you wanted. This configuration was working perfectly well on Terraform. And if you ran terraform apply here, well, it didn't see the change, because Terraform is meant to see only changes within its scope. A new attachment in this case is seen as a new resource, a new object. Terraform cannot see that there is a new attachment. It's not meant for that.
So had our team used driftctl here, it would have given the following result. It's scanning the AWS account and comparing the results with the Terraform state. So in this case, we have...
For the three stories I'm going to tell you, we have three states, for IAM users, VPC security groups and S3 buckets. We can see here that we scanned 59 resources on the account and that driftctl found resources not covered by Terraform, like this access key that we just created. And it belongs to the IAM user, microservice LHz something. We also display some information, like the number of resources, the percentage of coverage of your Terraform code compared to what's on your AWS account, et cetera. So that's the kind of thing Terraform can miss by design in CI. And that's the kind of thing driftctl can show you live if you use it. That's the first story.
The second one is about security groups. At the beginning of the pandemic, people started working from home. In this team, a couple of people couldn't connect to the company network, and they didn't want to wait for IT to solve the issue. They asked some managers to do something, and one of them had enough credentials to open up everything to everyone. And basically it solved the issue, as you'll see.
So what does it look like? VPC... no, here we are, main. So this is a security group. It's called super secure because it only allows port 22, SSH, for the local network on AWS. So basically it's completely locked down. If you go there, we are definitely good. So what did the team do in this case? The guy went to the security group in the console and edited the inbound rules. What he did was add a rule, all traffic from anywhere on IPv6 and IPv4, and call it temp, so that he was sure to remember it. All right, so now you have all your rules here. Terraform was on CI, running every hour or so. At the next terraform apply, maybe 2 hours later, what happened? Do you think it would be reported as a new rule in the managed security group? Unfortunately not. Same story: it's not the same thing as what is seen on AWS.
The rules seem to be two objects, two elements of a single table or array, but that's not the case at all in Terraform. The way it was written here on Terraform, if we look at this again, adding an element to the array would not be seen by Terraform. It's a different object once again. So basically we needed a tool like driftctl to show this team the drifts they had on their account. Here, this team would have a security group rule discovered as unmanaged, not covered by Terraform, and they would have had the whole information, like the IPv6 and IPv4 source addresses, the name, the ID of the security group, the type. And they would have seen as well that their coverage went down. It went from, how much was it, 85%, and now it's 66%. So it went down heavily. That's about it for the security group story. If you add rules manually on the AWS web console, and you wrote your rules that way on Terraform, you will never see added rules using Terraform. You need to use a specific tool for that. Unfortunately for them, nobody reported the change in the weeks after that, and the security group stayed open for a very long time.
So the third and final story for this talk is the terrible AWS S3 story that a team in London, in the UK, reported to me a few months ago. They were building a plugin, there were like 100 developers, and every hour or so they were uploading to S3 a temporary build of this plugin. It was then processed by QA, tests, et cetera, and then dropped, because it was useless by then; it was solely for testing. So it was hundreds of gigabytes every day, uploaded, processed and dropped. At some point there was another team in another country that wanted to improve security and compliance in the company.
They went to see them and told them, well, we will enable redundancy and versioning, sorry, in S3, so we will do it on your bucket as well. The team I was talking to told them, well, it's perfectly useless in our case. It's cool for the others, but it's useless in our case, because it will cost a lot and we definitely don't care about what we upload to S3. If it burns, we don't care. We will have lost 1 hour of work; it's not an issue. So the other team said, okay, cool that you mentioned that. We will not enable S3 versioning on your bucket. They went home, and everybody was happy. The first team thought, well, anyway, if they make a mistake, we have everything on Terraform, and if they change a setting we will know about it, because as you can see here, the resource is pretty simple and the bucket is declared, so we will know if they change something. No problem with that. And the other team was happy as well, because they wrote a wonderful script with a lot of exceptions for all the buckets, et cetera, and if it failed they would know about it as well. They executed the script. It enabled versioning everywhere, absolutely everywhere, because their exception didn't work, but they didn't know about it and they went on with their lives. And guess what happens in the first week of every month? It's the AWS bill.
So what happened in their case? This is the bucket. The script did something like enable versioning here. It was not on the web console, but it's the same idea. So they enabled versioning, and now driftctl would see here a changed resource: the AWS S3 bucket conf 42 demo, whatever, has versioning enabled, from false to true. If we go to the S3 folder, as you can see here, and we run terraform apply, it refreshes from AWS, but it will not report the change. And even worse, well, in some way it's not worse, it's meant for that.
But if we look for the S3 bucket here, we will see that versioning is enabled directly in the state. So that's something that was written to the state while not expressed directly on the resource. This kind of mistake can cost a lot as well. It's not a security issue in this case, but it can cost a lot to the company, because versioning was definitely not meant to be enabled for weeks.
These are a few examples from real life, but real life also includes different ways of managing that kind of drift. Maybe you want to integrate the data driftctl reports with another tool, or store the data in a database for dashboards, for tracking, for integration in reports, et cetera. For that you can use the JSON output from driftctl. Let me show that to you. You can just add the output here: JSON, output JSON, sorry. And basically, oh yeah, I'm not in the right folder. Here we are. Okay, same output. And now if we take a look at the output JSON, we have the same output that we had previously, but now as JSON. What's the use for that? Maybe you want to keep only the coverage output so you can graph the percentage directly, I don't know, on a dashboard or something, or maybe on a pull request, that kind of thing. So you can use jq to parse the output and, well, do whatever you want with it after that. That's it for the JSON output.
There are a lot of filters. If you want to report only on a certain type of resource, only on a certain tag, anything specific to your case, you can do it with driftctl filters. It's using the JMESPath query language. It's very useful to analyze and report on specific cases. This example here is about security group rules. It's a good example in our case. So it's filter, type equals AWS security group rule, and let's go. The tool will report only on AWS security group rules and not on the other resources.
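
As an alternative to jq, the JSON report mentioned above could be post-processed in a few lines of Python. This is a hedged sketch only: the sample document is hypothetical, and the field names used here (`coverage`, `unmanaged`, `type`, `id`) are assumptions that should be checked against the actual file produced by your driftctl version.

```python
import json

# Hypothetical excerpt of a JSON report; field names are assumptions.
SAMPLE_REPORT = """
{
  "unmanaged": [
    {"id": "key-new", "type": "aws_iam_access_key"},
    {"id": "rule-temp", "type": "aws_security_group_rule"},
    {"id": "rule-temp-v6", "type": "aws_security_group_rule"}
  ],
  "coverage": 66
}
"""


def coverage_from_report(text):
    """Return the coverage percentage stored in the report."""
    return json.loads(text)["coverage"]


def unmanaged_by_type(text):
    """Count unmanaged resources per resource type, e.g. for a dashboard."""
    counts = {}
    for resource in json.loads(text).get("unmanaged") or []:
        counts[resource["type"]] = counts.get(resource["type"], 0) + 1
    return counts


print(coverage_from_report(SAMPLE_REPORT))
print(unmanaged_by_type(SAMPLE_REPORT))
```

The single coverage number is the value you would graph over time or post on a pull request; the per-type counts give a rough view similar to what a type filter focuses on.
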
You can see here that the coverage dropped dramatically, because we have so many drifts in security group rules that the coverage is now very, very low. So yeah, filters are really useful in driftctl to report on specific areas, even for enumeration or regular audits.
There is also this .driftignore file, because you will probably start with a lot of unmanaged resources in your AWS account. I have a lot of mess, a lot of test files, and you probably do as well. Even just testing things on AWS creates a lot of sub-resources that are not cleaned up afterwards. So driftctl has support for this file, .driftignore. It's pretty much like the .gitignore file that you already know. It takes a resource type, in this case aws_iam_user, and a name or an ID. It's terraform in this case, because that's the first user that I created manually on the AWS console. So it's aws_iam_user.terraform, but it can be an ID as well, for a security group, et cetera. So you can ignore specific resources while you work on them, while you, I don't know, import them into your Terraform code, or while you maybe remove them from your account. At least you can start fresh, or you can work on it at your own pace. When a resource is included in your .driftignore file, it will be ignored. That's the name of the file, and you can also add this file to your own Docker image. If you work with your own Docker image, we have an official cloudskiff/driftctl image available, and you can start from there to add your own .driftignore, so you are perfectly ready to launch driftctl with the ignore file, I don't know, as a scheduled job on Kubernetes for example.
Finally, once you have your initial reports cleaned, you can deploy driftctl in your CI pipeline, at least as a recurring task, a cron job, et cetera, like every hour or so if your infrastructure is not too big, so you can get alerts when something goes wrong within the hour.
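
To make the idea of "clean the report with an ignore file, then alert from CI" concrete, here is a small illustrative sketch. It is not driftctl's own implementation: the helper names are hypothetical, the ignore entries are assumed to be plain `type.id` lines matched exactly, and the unmanaged resources are assumed to be dictionaries with `type` and `id` keys.

```python
def parse_ignore_file(text):
    """Return the set of 'type.id' entries listed in an ignore file, one per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def remaining_drift(unmanaged, ignored):
    """Keep only the unmanaged resources that are not explicitly ignored."""
    return [r for r in unmanaged if f"{r['type']}.{r['id']}" not in ignored]


def exit_code(unmanaged, ignored):
    """Return 1 (fail the CI job, raise an alert) if any drift remains, else 0."""
    return 1 if remaining_drift(unmanaged, ignored) else 0


# Hypothetical ignore file and scan result, for illustration only.
ignore_text = """
aws_iam_user.terraform
"""
unmanaged = [
    {"type": "aws_iam_user", "id": "terraform"},             # known, ignored for now
    {"type": "aws_security_group_rule", "id": "rule-temp"},  # real drift
]

ignored = parse_ignore_file(ignore_text)
print(remaining_drift(unmanaged, ignored))
print(exit_code(unmanaged, ignored))
```

A scheduled job could call something like `exit_code` after each scan, so that only drift nobody has consciously set aside triggers an alert.
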
I've seen different use cases for driftctl. One of them is that after terraform apply is executed in the pipeline, a lot of other scripts are executed and other stuff is done. To ensure those scripts and stuff don't create a mess, people execute driftctl after terraform apply and the rest of the pipeline, at the end. That's one use case. Another use case is the exact opposite, the other way around. People want to ensure that the situation is perfectly under control before doing anything with Terraform involving apply, using driftctl like a safety check before terraform apply can change anything in the state. Remember the tfstate a few minutes ago that was modified because of the terraform refresh. It's the job of terraform refresh to do this, but if you do not want this to happen in your case, well, put driftctl first in your pipeline. So you can put it at the end if you want to ensure things are okay, or at the beginning if you want to ensure things are under control before executing terraform apply. You can use it in parallel with terraform plan as well, because they do different things, one from within the Terraform scope and the other from outside the scope of Terraform. In any case, I strongly advise you to deploy driftctl like a health check, so you can get reports if anything changes.
So yeah, basically that's about it. driftctl is a brand new tool. It was released last December, early January. It's written in Go, the license is Apache 2, and it has support for AWS and GitHub. More is to come. You can go to GitHub Discussions and upvote the next cloud provider, because we are really community driven. So if you want Azure or DigitalOcean or GCP or I don't know what cloud provider, go vote for it, because that's where we're going to decide what's the next cloud provider to support.
Today we support reading the Terraform state locally, from S3 and over HTTP. Terraform Cloud is coming very soon. Same situation: if you want to vote for other features, like other ways of reading Terraform states, go vote on GitHub Discussions. There are a lot of discussions about that. Filtering and ignore, I showed them in the slides.
We are also very present, very active on Discord. You can go to driftctl.com/d. It's a shortcut that redirects you to our Discord community. We are there. We can answer all your questions live. We can help you live there as well.
So if you joined during the talk, what to remember is that almost everyone is experiencing infrastructure drift, and that we built driftctl as an open source tool to increase the knowledge and security of automated cloud infrastructure. Yeah, thanks for watching, and see you soon on Discord and on GitHub. Thank you. Bye.

Stephane Jourdan

CTO @ CloudSkiff
