Conf42 DevSecOps 2021 - Online

Security testing for Terraform templates

Abstract

“I hate DevOps, you are not thinking about security!” This is a real quote. Let’s try to prove it is not the truth!

Infrastructure as Code is one of the pillars of “best practices” in the DevOps world now. It is more and more popular to use it through CI/CD pipelines, but… what about security? Do we really care about it?

In this talk we will explore a few Terraform scanners and try to answer one question: are these tools good enough to be security scanners?

Summary

  • Pawel Piwosz is a Lead Systems Engineer at EPAM Systems, Poland, a DevOps Institute ambassador and an AWS Community Builder. The talk is not about infrastructure as code itself, but about whether it is secure and, if not, what we can do to make it a little more secure.
  • Terraform is an infrastructure as code tool. Infrastructure as code is a fast, dynamic, programmable way to deploy hidden misconfigurations everywhere. One fifth of all breaches are caused by cloud misconfiguration. The average time to identify and contain a breach is 280 days.
  • In 2019, Imperva had a breach in which it lost customer records such as API keys and TLS certificates. Another example of misconfiguration is Capital One. The second type of misconfiguration, very important and very tricky, is configuration drift.
  • Drift is an unmonitored, undocumented change in the configuration, in most cases done manually. 90% of organizations allow users to make changes without a proper process. A proper process can go as far as prohibiting users' access to business accounts.
  • 36% of professionals suffered a serious breach because of cloud misconfiguration. Almost 50% of teams had more than 50 misconfigurations per day. The main goal should be to grant as few permissions as possible.
  • Some teams spend 100% of their time working on infrastructure as code security issues. From the security perspective, the dev environment is exactly as important as production. We should control everything before we deploy and use dedicated tools to prevent deployments with misconfigured resources.
  • I have a very fresh installation of a Vagrant machine. We will go to EC2. Let's use Amazon Linux for it, with the default settings. Now what I want to do is install Terrascan. Something wrong happened here with this Ubuntu. Please work for me.
  • tfsec is the last tool to be installed. How to execute Checkov? It's very simple. I will clone the Accurics test templates and we will see how the tools work on them. The results said that I should have all three of them in my pipeline to be about 90% sure of what I'm doing.
  • All three tools allow us to use JUnit for proper reporting. All three tools can fail your pipeline. tfsec is the only tool that is much happier if Terraform is installed. It can work without it, but it is definitely happier with it.
  • Drift has to be controlled continuously, and there is always someone who has elevated privileges. Tools for drift detection include driftctl (recently acquired by Snyk), kubediff for Kubernetes, and SaaS offerings such as Bridgecrew and Accurics. We should go to a new level, beyond just scanners: we should have everything as code.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
All right, another session. This time we will talk a little bit about infrastructure as code, though not about infrastructure as code itself, but about whether it is secure and, if not, what we can do to make it a little bit more secure, which is quite important, I believe. Okay, so let's get started. My name is Pawel Piwosz and I am a Lead Systems Engineer at EPAM Poland. I have worked here for the last two and a half years, almost three. Also, I am a DevOps Institute ambassador and an AWS Community Builder. So we will have a little presentation today. Part of it will be in AWS because, well, you will see why.
All right, so first of all, we will talk about security in Terraform templates. So what is Terraform? I strongly believe that all of you know already, but just to be sure, it is an infrastructure as code tool, right? Very basic information. So what is infrastructure as code, then? Infrastructure as code is a fast, dynamic, programmable way to deploy hidden misconfigurations everywhere. Yep. And when we start to think about infrastructure as code in this way, a new world opens, because very often we use infrastructure as code just like this, right? We are writing something, deploy, rewriting, deploy, rewriting, deploy, rewriting, deploy. And after two days, finally our template is ready. I've been there. I know that it's quite, let's say, not unusual to see something like that, but we have possibilities to work with it a little better.
So, in terms of security, what do we deal with? For sure you remember, almost two years ago, SolarWinds, right? To go through it a little, to give you a little bit more insight into what happened there: the breach hit a company which has about 300,000 customers, right? And 30,000 used Orion, the application which, let's say, was infected. This infected version was downloaded 18,000 times. 18,000. More than ten government institutions in the US were infected, or affected, by this version.
And to say even more: Microsoft, Nvidia, Palo Alto, which is interesting because they also work in security, at a different level, but yes, and VMware, just to list a few of them. And Orion was the first; about half a year or one year later, they had another issue. Now you can say: all right, but it was not about infrastructure as code. Okay, I agree. But this is what we deal with, and this is one of the elements of the puzzle.
So let's look at a few numbers first. These numbers are collected from the IBM and Ponemon Institute report from, well, almost two years ago right now, so it's for 2020. The average cost of each breach is almost 4 million. Quite a lot, right? All right, IBM and Ponemon work with big customers, big organizations. But anyway, this will be interesting: the average time to identify and contain a breach is 280 days, of which 207 days on average are needed to identify the problem and then more than 70 to contain it, so to fix the problem. IBM and Ponemon break down the percentage of breaches by different areas. And with cloud misconfigurations, we are almost at infrastructure as code. Breaches caused by cloud misconfiguration, in their report, were 19%. So one fifth of all breaches happen because of cloud misconfiguration. It starts to be quite scary, right?
So what kind of cloud misconfigurations did we have, and breaches because of them? In 2019, Imperva had a breach where they lost customer records like API keys, TLS certificates, et cetera, because of network misconfiguration, hard-coded API keys and unencrypted records in a database. Imperva identified this issue after ten months. Another example of misconfiguration: Capital One, also the same year. More than 100 million records exposed: bank accounts, social security numbers of their customers, et cetera. Why did it happen? Because there were misconfigurations in IAM policies and unencrypted storage.
Those reports are obviously available on the Internet, so you can search for them. There are a lot more of them, to give you an impression of what we have. So what kinds of cloud misconfiguration can we have? Really, two types. One is cloud misconfiguration itself, when we are creating some resources or updating resources, et cetera. And the second, very important and really very tricky, is configuration drift. I will talk a little bit more about drift in some time, but right now we need to know that drift is definitely more dangerous, because it is not monitored.
Okay, so what is drift, in fact? Drift is, as I said, an unmonitored, undocumented change in the configuration, done manually in most cases. Because if we are talking about a change done through infrastructure as code templates or CDK or whatever, it is monitored, documented somehow, right? But this change is done manually by someone; no one knows when, where, why. And a very sad fact: 90% of organizations allow users to make changes without a proper process. And by a proper process I mean even going so far that you disallow, or let's use a stronger word, prohibit access to business accounts for users. I mean access to the console, to play with resources on the account. This is according to the State of DevSecOps report by Accurics. Formerly Accurics, because they were acquired by Tenable. It's also for 2020. So this fact is very sad and quite unbelievable, right? We are talking here also about financial companies, about healthcare companies, where this information should be very secure.
So, if you are scared enough: we haven't finished yet. Now we are going to a new report from this year, and it is interesting because Sonatype did a report where they talked with 300 professionals, and they have very interesting information there. 36% of professionals suffered a serious breach because of cloud misconfiguration.
Of course, here we are talking about a different group of people, that's obvious, so that's why we have different numbers. I hoped it would be lower, but it's not, right? Very interesting is that almost 50% of teams had more than 50 misconfigurations per day. More than 50 misconfigurations per day. I say wow, right? It's really hard to imagine.
So what kind of misconfigurations do they find? The most popular one is misconfiguration of IAM. In all clouds, IAM is the core of everything, and it's not very clear, not very well understood by people, how to work with it. But the main goal which we should look toward and try to achieve is to have as few permissions as possible, only as many as required, and to go with policies one to one. So one resource has access to one other resource, and if this resource needs access to a similar resource, but another one, it means another policy, right? So we are decoupling the policies here. Yes, at the end of the day we will have a lot of policies. There will be a total mess, total chaos. But first of all, we are working with them through infrastructure as code, right? And we have proper naming and proper tagging. This is also important. And the second biggest percentage, obviously, is security groups. It's very tempting, very easy and very common: all right, I need to test something, I will just add my IP for SSH. Well, I don't know what my IP is, so I will just open it to everyone. Done. Right? So these elements are quite important.
Now, how are those teams, those people, catching the issues? This is also interesting, because 33% of them do manual checks before deployments. And by manual I mean they are looking at the templates they have, for example in, I don't know, code review or whatever; they are reading the templates, and they can miss stuff, obviously. 27% of them do post-deployment checks, which is really risky because this is already deployed.
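To make the security group example concrete, here is a minimal sketch of a pre-deployment check that flags an admin port opened to everyone. It assumes that "open to everyone" means the CIDR block 0.0.0.0/0 (or ::/0 for IPv6); the rule format, the rule names and the SENSITIVE_PORTS set are hypothetical and only illustrate the idea. This is not how any of the scanners in the talk is implemented.

```python
from ipaddress import ip_network

# Hypothetical, simplified ingress rules. The field names are invented for
# this sketch and do not mirror Terraform, AWS or any scanner format.
RULES = [
    {"name": "ssh-from-office", "port": 22, "cidr": "203.0.113.10/32"},
    {"name": "ssh-from-anywhere", "port": 22, "cidr": "0.0.0.0/0"},
    {"name": "https-public", "port": 443, "cidr": "0.0.0.0/0"},
]

# Assumption: SSH and RDP are the ports that should never be world-open.
SENSITIVE_PORTS = {22, 3389}


def is_world_open(cidr: str) -> bool:
    """Return True when the CIDR block covers every address (prefix length 0)."""
    return ip_network(cidr, strict=False).prefixlen == 0


def find_open_admin_rules(rules):
    """Return the names of rules exposing a sensitive port to every address."""
    return [
        rule["name"]
        for rule in rules
        if rule["port"] in SENSITIVE_PORTS and is_world_open(rule["cidr"])
    ]


if __name__ == "__main__":
    for name in find_open_admin_rules(RULES):
        print(f"FAIL: {name} opens an admin port to everyone")
```

The public HTTPS rule is deliberately not reported: being world-open is only treated as a problem for the ports listed as sensitive.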
All right, so we know what kind of issues people have, and we know how they catch and try to fix the issues. So how much time do they spend on it? This is also quite interesting, and I'd like to focus your attention here. Four percent of these 300 people said that they spend more than 500 hours per week finding the issues and fixing them. It means that twelve people from these 300 say that. So their teams spend 100% of their time working on infrastructure as code security issues. And it's work for twelve people. Work for twelve people, only for this.
It sounds scary, but we are lucky here, and we have helping hands. By helping hands I mean the processes, the approaches, like for example shift left. What does it mean? This is like a paradigm, not only for software, which is obvious, right? Everyone is saying shift left in software creation, et cetera. But it works also for infrastructure. Thanks to that, we have very fast feedback that something is wrong. Tools like SAST can also be relevant for infrastructure, so static analysis tools. And again, like with proper development, everyone is responsible for security, right? We can implement Open Policy Agent, or any software which uses this approach. This is a project under the CNCF, so quite mature and quite stable. I mean stable from the perspective of a big organization: it's not something ephemeral, which is here right now and tomorrow will be gone. This is a policy engine that automates and unifies the implementation of policies across environments and allows us to enforce, monitor and remediate policies across all environments and resources, right? And this is the one link which I want you to remember.
All right, what should we do? We should control everything before we deploy. And here I'm saying that if you have infrastructure as code and you deploy it first on dev, then QA, then preprod, then RC, then prod, this dev environment is the first in the chain.
The importance of this environment is, from the security perspective, exactly the same as production. I know how this can sound. But let me ask you something: would you like to open an account in a bank which recently had its dev and QA environments exposed? Because I would not. So if a template is deployed, it is already too late. Let's put it in our heads like the daily scrum: we always have the daily scrum at 10:00 a.m. in our team. In the same way, if we deploy templates, it is already too late.
All right, we should use dedicated tools to prevent deployments with misconfigured resources, and use CI/CD pipelines for it. Why is this important? Because a CI/CD pipeline can, first of all, have those tools implemented, and then stop, fail, break the pipeline if something is wrong. We have a lot of tools around, to name a few: Checkov, the only one here which is written in Python; Terrascan and tfsec, both written in Golang; another one, cfn_nag, I think this is in Python as well, but I don't remember at this moment. cfn_nag is specifically for CloudFormation, right? And Snyk, for example. All of them except cfn_nag work with Terraform; they also work with Kubernetes and with CloudFormation, except Terrascan and tfsec, right? Obviously, because Terraform is even part of the name. Also we have tools from Accurics, which are quite a bit more advanced; right now they are acquired by Tenable.
So I'd like to demo a few of them, three of them really. It will be Terrascan by Accurics, Checkov by Bridgecrew and tfsec, by Aqua at this moment. This is the GitHub page for Terrascan, for Checkov and for tfsec, and you can implement them in your own pipelines. So let's see what they look like. I have a very fresh installation of, sorry, a very fresh Vagrant machine, because I don't need anything more here. So first, let's install those tools, starting with Checkov. It is quite easy, right?
So what we need to do is pip install checkov. Probably I don't have pip here. Yes, exactly. First I need to do apt-get update. So we will spend a few seconds on this, or even more, let's see. So as I said, Checkov is only one tool which we will test here, which is interesting. All right, this is getting quite interesting at this moment, but don't worry, I have a backup in mind. Okay, let's try, maybe I'll stop it and... Oh, all right. Yep. So I have an issue here. No worries, let's do it differently.
Right here I have my AWS console. We will go to EC2. I will create an instance very quickly. Let's use Amazon Linux for it. I need just the default settings; I don't need anything else. All right. Okay, this is all right. For the security group, I will say this: I don't care at this moment. Please do not do that. I'm doing this right now only for presentation purposes. Okay, create a new key pair. Let's call it presentation. Download. All right, it is downloaded. So my instance is now launching. Something wrong happened here with this Ubuntu, but we don't care about that. Let's try to connect. Okay, it was presentation.pem, right? And now what I need to have is the IP of this machine, and it should work. Please work for me. Yeah, perfect.
Okay, so we are going to become root and try to install Checkov. Okay, pip is not found. All right, so what I need to do is install pip. I don't remember right now what it is called here, in this Red Hat style Linux. So, I know already. Okay. All right, got it. Default problem. Yeah, of course I should install it as a user, but here, only for this presentation, it's okay to have this. Okay, so now I should have Checkov installed. Let's try. Maybe I need to do it simply, like that. Looks like not. All right, checkov not found. So let me just check where it can be, because it should really be installed in our... All right, let's come back to this a little bit later.
Right now we will install... how did I do it? Okay, pip install checkov. So it should be here. Let me just repeat it, and now let's try to do maybe the same thing here, just to speed things up. Oh yeah, I have it. Right, so Checkov is installed.
So now what I want to do is install Terrascan. Okay, so this is the tarball, which I need to download first, obviously. Right, I have it. So now let me unpack it. Okay, what happened here? Let's try, maybe, with the name. Okay, I have some issue here again. For some reason things do not want to work for me today. So let me try to do this again. All right, let's check. I have it. All right, I have it. Yeah, probably some typo or something like that. All right, let's remove the bundle. Okay, now what I need to do is install terrascan into /usr/local/bin. Install, of course. All right, and now I can remove it from here. So now let's check if I have it. Yes, version, too, without dashes, just version. Okay, version 1.12 is installed.
All right, so now the last tool, tfsec. Let me copy it. Okay, I have this one. Yes. And as you can see, this one is downloaded not as a package but just as an executable, or almost an executable. So first let me just rename it to tfsec, to have a clearer name. We need to do chmod for it, and obviously we need to install tfsec into /usr/local/bin and remove it from here. Let's check. Yeah, we have it, with a version.
Okay, so what will we do now? Let me check if I have git. I do not have it, so let me just quickly install git. What I will do: I will clone the Accurics test templates and we will see how the tools work on them. Okay, why these templates? Because they prepared quite interesting and problematic templates. All right, so let's go to the terraform directory. And we are here. So how to execute Checkov? It's very simple. We need to type checkov -d, which means directory, obviously, and let it be our current directory. Bam.
We need to wait a second or two, and this is the effect. All right, so as you can see, a lot of fails, and the organization is clear. You have all the information here: what failed, what resource, what file, in what lines, from where this file was called, what the code is here, and also the guide, right? This guide means some information on how you can fix it. And on the top, there's a lot of it, as you can see. Come on. On the top there is a summary. So the summary should be at the end; let me do this differently. All right, so as you can see here, passed checks is 44, failed checks is 58, and nothing was skipped. Skipped means that you can say that specific checks need to be skipped. You don't want to run them because, for example, Checkov has a check which verifies the description of your security group. It may be an issue for someone that this is not described properly, I mean in AWS, but for sure it's not a security issue, right? So here, for example, where I have my security group for this instance, let me show it to you. There, you see this description field. Generally, we can state there what the reason for having this specific record is, right? So this is what Checkov flags as a problem. Okay, so this is Checkov. Quite nice.
Let's try with TF scan, I mean tfsec. Here the syntax is a little bit different, and you even have information about disk I/O use, about the time taken for evaluation, running, et cetera, how many files were checked, loaded modules, and the results. So we found ten criticals, ten high, medium, low and ignored. As you can see, there are fewer, so tfsec found fewer issues, but it doesn't mean it is a worse tool than Checkov. Checkov is finding different issues, right? There is a common part of issues that all three tools are catching. But I did a quite extensive test of all of them, and the results said that I should have all three of them in my pipeline to be like 90% sure of what I'm doing, right?
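The point that the tools share a common core but each also finds different issues can be illustrated with plain set operations. The sketch below is only an illustration: the finding identifiers are hypothetical and already normalized, whereas real scanners report their own rule IDs, so a mapping step would be needed before any such comparison. The numbers it prints say nothing about the real tools.

```python
# Hypothetical, already-normalized finding identifiers per tool. Real scanners
# use their own rule IDs, so these names are assumptions made for this sketch.
FINDINGS = {
    "checkov": {"s3-unencrypted", "sg-open-ssh", "sg-no-description", "iam-wildcard"},
    "tfsec": {"s3-unencrypted", "sg-open-ssh", "rds-public"},
    "terrascan": {"s3-unencrypted", "iam-wildcard", "ebs-unencrypted"},
}


def compare_tools(findings):
    """Return (issues every tool found, all issues, issues only one tool found)."""
    everything = set().union(*findings.values())
    common = set.intersection(*findings.values())
    only = {}
    for tool, found in findings.items():
        # Everything reported by the other tools, to isolate unique findings.
        others = set().union(*(f for t, f in findings.items() if t != tool))
        only[tool] = found - others
    return common, everything, only


if __name__ == "__main__":
    common, everything, only = compare_tools(FINDINGS)
    print("caught by all tools:", sorted(common))
    for tool, found in FINDINGS.items():
        share = len(found) / len(everything)
        print(f"{tool}: {len(found)}/{len(everything)} issues ({share:.0%}), "
              f"unique: {sorted(only[tool])}")
```

In this made-up data set no single tool reaches full coverage, which is the same kind of reasoning that leads to running several scanners side by side.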
So, not quite a perfect situation. All right, so this was tfsec. How to do it with Terrascan? Here we type something else: terrascan scan. And we are scanning, and Terrascan takes the most resources, takes the most time. So it's the longest run, even if it wasn't shown here; maybe they improved something, but like one year ago this time was definitely the longest. Terrascan has an API implemented inside, so you are able to run Terrascan as a separate tool which is available all the time, like SonarQube, and just execute an API call from the pipeline, which is very useful because you don't need either to prepare a proper image for your pipelines or to download the tool every time. As you can see, we have 20 high, 13 medium and five low: 38 policies are violated. So this is how they work with the most default options. I don't want to show you more complicated options, for a very simple reason: I strongly believe that beauty lies in simplicity. So if something works by default, out of the box, that is a perfect tool for me. So this is how I run them, just on my Linux machine.
So how does it look in the... Maybe first I will show you the template of the pipeline which I created, because I've created my pipeline using CloudFormation. So this is the template, very simple. I didn't pay too much attention to the roles, but I need to have a few roles in order to have my pipeline in AWS CodePipeline, and I have four blocks inside. One is mandatory, to pull the source. I'm doing this using my own repository created in CodeCommit. And then I have my tests, where one block is for Checkov, the second here is for tfsec and the third is for Terrascan. And the configuration here follows a general approach to doing things, right? So I have my phases: during the install phase I'm installing Checkov and creating the directory for reports. Then during the build I do the proper test and, as you can see here, I stream the output into a file.
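The build phase described above (run a scanner, stream its output into a report file, and let a failing scan break the pipeline) can be sketched as a small wrapper. This is a hedged illustration, not the pipeline from the talk: the function names, the report directory and the `.txt` report naming are assumptions, and the only scanner command taken from the talk is `checkov -d .`.

```python
import subprocess
import sys
from pathlib import Path


def run_scanner(command, report_path):
    """Run one scanner, stream its output into report_path, return its exit code."""
    report = Path(report_path)
    report.parent.mkdir(parents=True, exist_ok=True)
    with report.open("w") as out:
        # check=False: a non-zero exit code is an expected result, not an exception.
        completed = subprocess.run(
            command, stdout=out, stderr=subprocess.STDOUT, check=False
        )
    return completed.returncode


def gate(commands, report_dir="reports"):
    """Run every scanner; return 1 if any of them exited non-zero, else 0."""
    exit_codes = {
        name: run_scanner(command, Path(report_dir) / f"{name}.txt")
        for name, command in commands.items()
    }
    for name, code in exit_codes.items():
        print(f"{name}: exit code {code}")
    return 1 if any(code != 0 for code in exit_codes.values()) else 0
```

A pipeline step could then end with something like `sys.exit(gate({"checkov": ["checkov", "-d", "."]}))`. Every scanner is run before the gate decides, so one failing tool does not hide the reports of the others. Note that `subprocess.run` raises `FileNotFoundError` when the tool is not installed, so the sketch assumes the install phase has already succeeded.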
Why, you will see in a moment. The very same situation, though of course the installation process is different, is for tfsec, and I also have the possibility to send the output in JUnit format into a file. And the same for Terrascan: I'm streaming it to the Terrascan XML file. All three tools allow us to use JUnit to have proper reporting, which is really perfect, because when we go to CodeBuild, or I mean more like CodePipeline, we have our pipeline here, which has failed, and it is on purpose. So let me just release this pipeline again right now. The mandatory block will pull the code. Come on. Yes, it did it. So let me open all of those scanners. As you can see, I'm executing them at the same time. In fact it's not always exactly the same time, but generally the idea is to run them at the same time, and then of course I can have further steps. But for this presentation this is all I really need. So as you can see, there is a lot; generally, the same happened as in the normal execution, right? As you can see right now, this command was executed and of course it failed. And this is also very important: all three tools can fail your pipeline. So this is very good, let's say.
I told you that I send this to this file, to this report, and I also publish this report. So here we can see what the report from Checkov looks like in CodePipeline or CodeBuild. We see that 20 checks failed and 84 were positive. This is for a different repository, not the one from Accurics, another one, and we have the full information on what really failed. Okay, so this is Checkov. Let's see what we have for tfsec. Download, preparation, and then we have the execution. tfsec is the only tool which is much happier if Terraform is installed. It can work without it, but it's definitely happier. So again we have the reporting; let's see what we have in the report. All right. Eleven tests failed; we don't know how many passed, unfortunately.
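Since the reports are written in JUnit format, the passed and failed counts shown in the build console can also be computed from the XML directly, for example to build metrics over time. The sketch below is a minimal illustration using Python's standard library; the sample report is hypothetical and hand-written, and the exact XML each scanner emits may differ in attributes and nesting.

```python
import xml.etree.ElementTree as ET

# Hypothetical JUnit-style report, written by hand for this sketch.
# It is not the actual output of Checkov, tfsec or Terrascan.
SAMPLE_REPORT = """
<testsuites>
  <testsuite name="hypothetical scanner" tests="3" failures="2">
    <testcase classname="main.tf" name="S3 bucket is encrypted"/>
    <testcase classname="main.tf" name="SSH is not open to the world">
      <failure message="port 22 is open to everyone"/>
    </testcase>
    <testcase classname="iam.tf" name="No wildcard actions in IAM policy">
      <failure message="policy allows every action"/>
    </testcase>
  </testsuite>
</testsuites>
"""


def summarize(xml_text):
    """Count total, failed and passed test cases in a JUnit-style XML report."""
    root = ET.fromstring(xml_text.strip())
    cases = list(root.iter("testcase"))
    # A test case counts as failed when it has a <failure> or <error> child.
    failed = [
        case for case in cases
        if case.find("failure") is not None or case.find("error") is not None
    ]
    return {
        "total": len(cases),
        "failed": len(failed),
        "passed": len(cases) - len(failed),
        "failed_names": [case.get("name") for case in failed],
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_REPORT))
```

Counting the `testcase` elements rather than trusting the `tests` and `failures` attributes means the numbers stay consistent even if a tool leaves those attributes out. Checks that a tool never writes to the report at all, such as passed checks in the tfsec report mentioned above, cannot be recovered this way.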
And again we have the information there about everything. What about Terrascan? Again, download, installation and, as you can expect, it failed. Perfect. What do we have in the report? 14 checks failed, 16 passed. So first of all, only tfsec doesn't show how many checks passed, right? It can be useful information: if you build metrics on this report, it may be interesting how many checks you have over time, how many more checks over time, and how many of them failed. So I prefer this option here, not this one. And this one is also perfect. So as you can see: 20 and 84, 11, 14 and 16. Different policies, different checks, different findings. Which one will be best for you? It's really up to you. You should go through all of them, check them, look at how they work, right? And then decide what to use.
So this was the presentation, but we didn't talk much about one important element: drift. So what about drift? The challenge here is that drift has to be controlled continuously, and there is always someone who has elevated privileges, right? So even if I say that we should prohibit it, there is always the root who can do things, and we need to remember that. And the available tools do not always cover all possible changes. The best example here is the drift detection from CloudFormation, which catches a lot of stuff, but definitely not everything. Even for items from that list, you remember the percentages from that list, this detection has problems. And, important and very unpopular: management must understand that if they say to you it must be done now, it means a lot of risk. What tools can we use for drift detection? driftctl, acquired by Snyk recently; kubediff for Kubernetes; SaaS offerings like Bridgecrew and Accurics. They have offerings, paid offerings of course, for continuous monitoring of your templates, your misconfigurations and your drift.
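At its core, drift detection compares what the templates declare with what actually exists in the account. The sketch below shows that comparison on plain dictionaries. It is a simplified illustration under stated assumptions: the resource names, attributes and status labels are hypothetical, and real tools such as driftctl have to read the Terraform state and query the cloud provider, which is not shown here.

```python
def detect_drift(declared, actual):
    """Compare declared resources (from code) with actual ones (from the cloud).

    Both arguments map a resource name to a dict of attributes. The result maps
    each drifted resource to a status: "missing", "unmanaged" or "changed".
    """
    drift = {}
    for name in sorted(declared.keys() | actual.keys()):
        if name not in actual:
            # Declared in the templates but absent from the account.
            drift[name] = {"status": "missing"}
        elif name not in declared:
            # Exists in the account but was created outside the templates.
            drift[name] = {"status": "unmanaged"}
        else:
            want, have = declared[name], actual[name]
            changed = {
                key: {"declared": want.get(key), "actual": have.get(key)}
                for key in sorted(want.keys() | have.keys())
                if want.get(key) != have.get(key)
            }
            if changed:
                drift[name] = {"status": "changed", "attributes": changed}
    return drift


# Hypothetical example data: someone opened SSH manually and added a new group.
DECLARED = {
    "sg-web": {"ssh_cidr": "203.0.113.10/32"},
    "bucket-logs": {"encrypted": True},
}
ACTUAL = {
    "sg-web": {"ssh_cidr": "0.0.0.0/0"},
    "bucket-logs": {"encrypted": True},
    "sg-debug": {"ssh_cidr": "0.0.0.0/0"},
}

if __name__ == "__main__":
    for name, info in detect_drift(DECLARED, ACTUAL).items():
        print(name, info)
```

Because drift has to be controlled continuously, a comparison like this only helps when it runs on a schedule, not just at deployment time.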
So we should go to a new level, even beyond just scanners: we should have everything as code. And it means that with those newest offerings we are able to create automated control and remediation processes. Security must be implemented at the earliest possible stage, for infrastructure, for everything, right? And everything as code means here policy as code, security as code, drift as code, with those offerings, right, with those paid offerings, and remediation as code.
And what I'd like you to have as a takeaway from this session is this: please do not pretend that you have security, because every possible misconfiguration will be exploited. Now, tomorrow, in one week, but it will be. A few years ago, for a wrongly configured Elasticsearch, it was like a couple of minutes needed to have Elasticsearch hacked. A couple of minutes. And security is very complex and expensive. Thank you very much. I hope you enjoyed the session, and I hope you enjoy all the sessions from this conference. Thank you.

Pawel Piwosz

Lead Systems Engineer @ EPAM Systems
