Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey there, I'm Tim Davis, DevOps advocate with env0, and today we are going to
be talking about the pitfalls of infrastructure as code and
how to avoid them. But first I want to thank the sponsors and
the conference folks and everybody for putting on the event,
for accepting me to talk, and hopefully this is helpful for
you. And all of the other speakers that have been speaking today are great.
So it's a pleasure just to be involved. A little bit
about me: I did mention I'm currently the DevOps advocate with env0. We're an infrastructure
as code automation startup. Previous to that I was with VMware
and I was part of a small group of people that created the cloud and
developer advocacy team there. Before that I was in the networking and security business
unit. Previous to that I just had a string of roles:
enterprise architect, virtualization engineer, systems administrator,
infrastructure operations. Long story short,
my background is in infrastructure,
but really I've always known that there's no point in delivering
infrastructure just for the sake of delivering it. You're delivering it for the sake of
running the application that runs the business. So I
have always been focused on the application specifically and not
just throwing a bunch of servers and whatnot out there for no reason.
So what is infrastructure as code?
It sounds pretty self explanatory. I mean, it is delivering
and deploying infrastructure using code, but how did we get
there? Folks like myself for years have just
been building physical servers, installing operating systems,
configuring applications. With the advent of public cloud
and automation tools, it kind of changed from physically building stuff
to going into a UI, clicking around and building stuff.
Well, with DevOps and all sorts
of things that are going on these days with application development and
releases, with small startups, smaller companies and even
some bigger companies, developers are actually deploying
more and more infrastructure themselves, some of which is done
through portals that the infrastructure teams have set
up. Some of them are through systems like
OpenStack back in the day, where you could package an API call and throw it
out to figure out what you wanted to do.
Developers though, they don't want to jump into a UI
and click around to get what they want. They want to write
code and do exactly what they do for their normal job. And that's really what
brought around the infrastructure as code model,
and that's just kind of morphed into what we know today
with infrastructure as code and Gitops workflows and all of
these complex tasks. It just kind of came about from
developers wanting to do what developers do, and that's write code.
So what types of pitfalls can come up with
infrastructure as code? I've got good news
and I've got bad news for you. The bad news is it's a lot.
It's both. It's most of the pitfalls that you can think of coming around
with infrastructure, and it's most of the pitfalls that come along with
writing and designing and deploying code. There's a lot going
on here, but really, a lot of the issues that come along with
infrastructure as code can be mitigated simply by bringing together
the development team and the infrastructure, or operations, team. I mean,
it's development and it's ops. It's DevOps.
So really it's one of those things where a lot of people
think that DevOps is just throwing tools like CI/CD pipelining tools
and stuff like that at a problem. And that's really not it. The tooling is
only a small part of DevOps and being successful.
A lot of that is people and process. So if
you do have the communication, kind of break down those old
silos, you will have a lot better time. And one of
the reasons for this is because infrastructure folks have the background,
they have the knowledge. They know exactly what kinds of issues may
come up, how to solve it, what things you may need to think of.
Development folks have been writing code, they've been storing code, they've been deploying
code, editing it. They know exactly what kind of goes into
that process, all the different methodologies. So why not utilize
everybody's experience across the board in this new
venture that brings everybody together to be successful?
So let's kind of break this down a little bit. We'll start with some infrastructure
pitfalls, then we'll move into some code pitfalls,
and we'll just kind of give you a lot of questions to
ask and hopefully set you up for success along the way.
One of the biggest things when it comes to infrastructure as code is
what kind of framework are you going to use?
Now, this could technically also be a code pitfall,
because there are languages that you may need to learn.
Some of them don't use regular languages. Some of them use their own. Like,
for instance, Terraform uses HCL. Pulumi, though, will allow you to use
the languages that you're already writing in. So if you're using Go or
Node.js, you can write your infrastructure as code files with that. But from an
infrastructure perspective, you need to make some choices.
You know, what is your cloud going to look like in the
future? Now, for some folks, you may be in AWS right
now, you may be in Azure, you may be in VMware, you may be in
whatever, and as far as you're concerned, you're never going to move.
You're all in on that. You're always going to be in AWS.
So CloudFormation is definitely the right choice for you.
Well, that's great, but what happens if
your company is acquired twelve months from now and they're
a 100% Azure shop or they use IBM Cloud?
You're not going to be able to take those AWS CloudFormation scripts with you.
You're going to need to rewrite them or convert them somehow.
There are frameworks though, that allow you to work a little
more easily between clouds. Your Terraforms, your Pulumis, your Terragrunts,
these are not cloud-specific frameworks, but cloud-agnostic or
multi-cloud frameworks, if you will.
And these give you the ability to kind of move around clouds.
Now, of course it's not as easy as just, say, taking your AWS Terraform
and deploying it in Azure. You're going to have to change the provider, you're going
to have to change up a lot of stuff, but it gives you a
better starting point, because you already know the language,
you already know the structure, you already know your deployment methodologies,
all of this kind of stuff, and you don't really have to change that up.
Also, with Terraform, it's not just
the big three public clouds. There are Terraform
providers for every major public cloud, a lot of minor
public clouds, some private cloud software, some SaaS
tooling. And at this point, I believe F5 has
a Terraform provider that will allow you to interact with hardware
that they have. So really there's a lot of different things you can
do with Terraform or Pulumi or Terragrunt that
allow you to kind of stay a little bit agnostic
and give you a little more freedom down the road just in
case something happens. Also, if you choose a bigger, more popular
framework (and at this point in time, Terraform is the industry standard for
infrastructure as code, though you're seeing a lot more Pulumi and things like that these days),
you're also going to get a lot more providers, a lot more community support.
It's open source. So if you feel that you need to change something
or you want to suggest an update, or you want to open up an issue,
you can always just jump into GitHub and do that. Now of course, CloudFormation
has AWS support and stuff like that, but it just kind of closes you
in a little bit. Now of course, if you are going to be 100%
in AWS and you know that and you want to take that risk and you
want to go for it, great. CloudFormation is a great infrastructure as code framework.
There are a ton of different tools built into AWS to
help you with that. Same with ARM templates inside of Azure.
So there's lots of different things you can do, there's lots of different variables that
kind of go into it. But really the very first question
you need to start asking yourself when you're looking at infrastructure as code is:
what framework are we going to use, what direction are we going to go off
in? Security is something really big.
It shouldn't be an afterthought, and in the past it always has been.
In my experience, I worked in so many shops where developers just had
their sandbox. They did their own thing throughout the day. All of
their development processes and everything got pushed through and
then security wasn't really brought in until it was time to
push this new thing out to production. Then it's: here's a bunch of firewall requests,
here's all the stuff we need to push things into
production. When it comes to infrastructure as
code and speeding up the developer lifecycle and all this kind
of thing, you're going to have a lot better time
if you bring security closer in and you make it
more a part of that lifecycle. It's a little bit like what we talked about with
bringing developers and infrastructure together to help be successful.
If you bring all the stakeholders in, your security folks, your networking folks and
everybody into this design and deploy process,
you're going to have a lot better time. Now there's a term you may or
may not have heard before called shift left. And if you're not familiar with this,
if you think of the development and deployment lifecycle as kind of a timeline,
you're building code, you're testing code, you're deploying code as
just a giant line. Well, if you take security and you're just kind
of doing the firewall requests, you're making your IAM policy and stuff like
that right when you're pushing it out there at the end, and you have
a problem with that, you may have to go and take five steps back in
order to fix that. But if you shift security left,
every single time you build and test your code, you're building and
testing your security policy as well. If you ever have a problem with
that, you just take one little step back. Or you can even take it a
step further. With tools like Terrascan by Accurics or Checkov by Bridgecrew,
you can actually set it up in your automation so that you
can't deploy if you have something insecure. Say you're deploying an S3
bucket out to AWS and you may have some
sensitive data in there. Well, you don't want to deploy it with a default
open policy where anybody can read.
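To make that gate a little more concrete, here is a minimal Python sketch of the idea. It does not use Terrascan or Checkov; the resource dictionaries, the ACL values checked, and the `find_insecure_buckets` and `gate_deployment` names are all hypothetical stand-ins for whatever your scanner and pipeline actually provide.

```python
from typing import Dict, List

# Hypothetical, simplified view of planned resources. A real scanner would
# parse the actual infrastructure as code files or plan output instead.
PUBLIC_ACLS = {"public-read", "public-read-write"}


def find_insecure_buckets(resources: List[Dict]) -> List[str]:
    """Return the names of S3 buckets whose ACL lets anybody read them."""
    return [
        r["name"]
        for r in resources
        if r.get("type") == "s3_bucket" and r.get("acl") in PUBLIC_ACLS
    ]


def gate_deployment(resources: List[Dict]) -> bool:
    """Return True if the deployment may proceed, False if it must be cancelled."""
    offenders = find_insecure_buckets(resources)
    for name in offenders:
        print(f"blocked: bucket '{name}' is publicly readable")
    return not offenders


planned = [
    {"type": "s3_bucket", "name": "customer-data", "acl": "public-read"},
    {"type": "s3_bucket", "name": "build-logs", "acl": "private"},
]

if gate_deployment(planned):
    print("deploying")
else:
    print("deployment cancelled; fix the findings and rerun")
```

In a pipeline, a check like this would run on every build and test, so a finding costs one small step back rather than five.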
If you set up security as code during
that build, test or deployment process, if you see an issue like this, you can
just cancel that deployment, make the quick fix, and then actually go
through and finalize the deployment. So there's lots of different tools out there
that will help you to design this security policy and
to help push that out. So we talked a little bit about choosing the right
framework, bringing infrastructure and developers together,
having those conversations, making sure everybody's kind of involved so you can be successful.
A lot of that kind of focuses around the infrastructure side. But what about the
code side? What are some issues that you may run into and some
of the mitigations that you may have for that? Default values
is a big one. For any infrastructure person out there,
or even some developers, if you've used AWS and you've
jumped into the UI and you said, I need to deploy an EC2
instance, well, it's going to open up a little wizard. It's going to have
a bunch of little boxes. And every time you click next, there's a bunch more
boxes that you need to fill out or questions to answer so that
you can configure that instance to be deployed. When you're using infrastructure
as code and you're deploying an EC2 instance,
you don't necessarily have to account for every single one of
those little boxes. Some of them, if you leave them blank
and you don't specify, well, it may fail the deployment.
With others, it may have a default. So if you
don't specify something, it may just pick for you. Well,
what happens if it's the IAM policy or something like that for security
where it chooses a default policy, but that may not be as secure as
you need for that resource? Or what if you design your code and you try
to deploy it and it fails because it needs something? It's about making sure
that you know exactly which resources need
what kind of specifications. Do you have that set up as a
variable? Are you hard coding that value? You just need to
make sure that you are answering that question so
that you don't accidentally deploy something that's insecure
or incorrect and that you have all those questions answered so
that you just don't have a bunch of failures and have to try and start
over. So how do we mitigate that? Open Policy
Agent is great. This can help you out. You saw this on the security
slide a little bit back and you absolutely can use it for some security,
but also for some compliance, making sure that the
deployments that you have going out stay within a guardrail
of: hey, this can only be within a certain region,
or I can only use a certain type of instance, or I can only
have a certain number of instances. Also,
hey, I cannot deploy an S3 bucket that has
an open policy to the Internet.
There's so many different things and it's totally customizable, and
it's another one of those blank-as-code type things. We have infrastructure
as code. We've talked about security as code and now we have policy as
code with Open Policy Agent. So you can actually go through,
work with your security team, work with your compliance team, work with your
infrastructure and developers to build a policy of what
your deployments should look like, or maybe the other way
of are there just specific things that you will not allow and everything else
is okay? Open Policy Agent allows you to write that
and then enforce that policy during the deployment time.
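Open Policy Agent policies are written in its own language, Rego, so the following is not OPA code. It is a rough Python sketch of the same guardrail logic, assuming a hypothetical `Deployment` record and made-up limits for region, instance type, and instance count.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical guardrails agreed on with security and compliance.
ALLOWED_REGIONS = {"us-east-1", "us-west-2"}
ALLOWED_INSTANCE_TYPES = {"t3.micro", "t3.small"}
MAX_INSTANCES = 5


@dataclass
class Deployment:
    region: str
    instance_type: str
    instance_count: int


def violations(d: Deployment) -> List[str]:
    """Collect every guardrail the requested deployment breaks."""
    found = []
    if d.region not in ALLOWED_REGIONS:
        found.append(f"region {d.region} is not allowed")
    if d.instance_type not in ALLOWED_INSTANCE_TYPES:
        found.append(f"instance type {d.instance_type} is not allowed")
    if d.instance_count > MAX_INSTANCES:
        found.append(f"{d.instance_count} instances exceeds the limit of {MAX_INSTANCES}")
    return found


request = Deployment(region="eu-west-1", instance_type="t3.micro", instance_count=8)
problems = violations(request)
if problems:
    print("deployment denied:")
    for p in problems:
        print(" -", p)
else:
    print("deployment allowed")
```

This sketch takes the allow-list approach for regions and instance types. The other way around, denying only specific things and allowing everything else, would simply swap the membership tests.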
Don't repeat yourself. Don't repeat yourself.
This is a developer methodology that
aims at simplifying writing
code. For instance, say you are using infrastructure
as code and you pick Terraform and you want to deploy out to AWS,
but you also have a few different environments. You have a
testing environment, you have a development sandbox, you have production.
And all of these use similar versions of the same app.
They just may have different values for things like IP addresses,
which VPCs they're connected to, et cetera. Does it make more
sense to have all of these different versions of the same
code or write your code in such a way
that you maybe only have to have one piece of code that
you change things whenever you deploy it for a different purpose?
How do we mitigate this? There's lots of different ways we can do this.
You could just take regular code and make everything a variable,
and then during the deployment process, with whatever automation you choose just inject
the variables that you need, but there are also things that have been purpose
built to work and help you make drier, more manageable
code. Things like Terraform modules.
If you have at least a VPC or an
RDS instance, or some EC2 in all of your different environments,
why not just write that code one time, make it a
reusable module, and then just call and customize
that module every time you want to deploy it? Or say you have the
same set of code that you utilize for production
and staging and development and everything like this, but you
just want to make a few configuration changes each time.
That may be something that Terragrunt works for, so that you can deploy
all this stuff. That way you only have to update one set of code,
but then it can go through and redeploy to all of your different environments.
It just allows you to not have to manage so
much of the same code and just write a little bit less
code that's a little more manageable and a little more reusable.
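As a loose illustration of the module idea, here is a small Python sketch rather than real Terraform or Terragrunt. The `app_stack` function plays the role of a reusable module, and the environment names, CIDR ranges, and instance counts are invented for the example.

```python
from typing import Dict, List


def app_stack(env: str, vpc_cidr: str, instance_count: int) -> List[Dict]:
    """One reusable definition of the stack, playing the role of a module."""
    vpc_name = f"{env}-vpc"
    resources = [{"type": "vpc", "name": vpc_name, "cidr": vpc_cidr}]
    for i in range(instance_count):
        resources.append(
            {"type": "instance", "name": f"{env}-web-{i}", "vpc": vpc_name}
        )
    return resources


# Only the values that differ per environment live here (all hypothetical).
ENVIRONMENTS = {
    "dev": {"vpc_cidr": "10.0.0.0/16", "instance_count": 1},
    "test": {"vpc_cidr": "10.1.0.0/16", "instance_count": 2},
    "prod": {"vpc_cidr": "10.2.0.0/16", "instance_count": 3},
}

stacks = {env: app_stack(env, **settings) for env, settings in ENVIRONMENTS.items()}

for env, resources in stacks.items():
    print(env, [r["name"] for r in resources])
```

A change to `app_stack` is made once and shows up in every environment, which is the point of keeping the code DRY.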
Designing for state size, this is a good one.
When you deploy things with infrastructure as code,
you end up with a state file and that is simply just a
text snapshot of the current state of the deployment.
So if you have Terraform that's deployed out to the cloud and you
have three EC2 instances, you have an
elastic load balancer, and those are connected to a VPC that you've
got an RDS instance tagged to as well, all of that stuff, you need
to kind of know what's out there at any given time.
Now one of the reasons that it does this is because a lot of infrastructure
as code frameworks have the smarts to be able to say,
well, I've already deployed three instances
out, they changed the code to say that I need four,
well, I'm not going to go through and deploy four more on top
of that three, or I'm not going to delete those three and deploy four.
It just says I already have three, I'm going to add one and then I've
got four. That's one of the cool things about declarative infrastructure
is you just tell it what you want it to look like at the end
and it will decide for you how it does that.
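The three-to-four example can be sketched in a few lines of Python. This is only a toy model of declarative reconciliation, not how any particular framework implements it. The `plan` function and the resource-count dictionaries are assumptions made for illustration.

```python
from typing import Dict, List


def plan(current: Dict[str, int], desired: Dict[str, int]) -> List[str]:
    """Compare recorded state with desired state and list only the changes needed."""
    actions = []
    for kind in sorted(set(current) | set(desired)):
        diff = desired.get(kind, 0) - current.get(kind, 0)
        if diff > 0:
            actions.append(f"create {diff} {kind}")
        elif diff < 0:
            actions.append(f"destroy {-diff} {kind}")
    return actions


state = {"ec2_instance": 3, "load_balancer": 1}   # what the state file says exists
code = {"ec2_instance": 4, "load_balancer": 1}    # what the code now asks for

print(plan(state, code))  # ['create 1 ec2_instance']
```

Nothing is deleted and nothing is deployed twice. The only action is the one that closes the gap between what exists and what was declared.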
Now if you have all of your different resources,
like your VPCs, your VNets or your instances,
your databases, and all of this kind of stuff mashed into one
infrastructure as code file, you're going to end up with one giant state
file. Now this is all fine and good if it's a very static environment,
but if it's a very dynamic environment and it's always changing and
it's always updating, then every single time you need to
make an update to that infrastructure, the framework is
going to have to go through and check that entire state file. It's going to
have to make sure and kind of sanity check what does my state
file say I have and what's actually out there so that it
makes sure that it's deploying what it needs to do and that everything looks the
way it should in the end? Well, that can sometimes take
a really long time. I've had customers that come and say that
their deployment times are upwards of one to two hours because
they have giant Terraform workspaces with one giant
state file that takes forever to run against every single time.
So how do we mitigate state size issues?
The good news is we just learned that and talked about it.
Don't repeat yourself. This methodology that we talked about
of designing smaller, more repeatable
code can help you design a smaller state file,
thus improving your deployment performance down the road.
So if you are designing your code utilizing things
like Terraform modules or Terragrunt code,
it's going to be broken up into much more manageable pieces.
So you have your VPC module, you have your EC2 module,
you have your RDS module, and every time you need to make
an update to that, your deployment process is
going to run against that module and it's going to say: oh great, here's my
VPC, I know what I have. I'm going to deploy this out and I'm going
to be done. It doesn't have to go through and check all of the
EC2 instances, all of the load balancers, all of
the databases. It's only checking against that specific module.
Now writing your code in a module doesn't just fix that problem.
You have to make sure that your automation allows you to do this as well.
But really starting with the DRY methodology
when you're designing this code and making sure that it is smaller, more repeatable
chunks will absolutely help you mitigate any state file
performance issues down the line.
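To put a rough number on that, here is a toy Python comparison of one big state against several smaller ones. The state names, the resource lists, and the `resources_to_refresh` helper are hypothetical. The sketch also assumes each piece is deployed with its own separate state, which is the automation caveat mentioned above.

```python
from typing import Dict, List

# One giant state file holding everything (hypothetical resource names).
MONOLITH: Dict[str, List[str]] = {
    "everything": ["vpc", "ec2-1", "ec2-2", "ec2-3", "elb", "rds"],
}

# The same resources split into smaller, separately deployed states.
SPLIT: Dict[str, List[str]] = {
    "network": ["vpc"],
    "compute": ["ec2-1", "ec2-2", "ec2-3", "elb"],
    "database": ["rds"],
}


def resources_to_refresh(states: Dict[str, List[str]], changed: str) -> List[str]:
    """Return every resource that must be checked to apply one change."""
    for resources in states.values():
        if changed in resources:
            return resources
    raise KeyError(changed)


print(len(resources_to_refresh(MONOLITH, "vpc")))  # 6
print(len(resources_to_refresh(SPLIT, "vpc")))     # 1
```

With six resources the difference is trivial, but the same ratio applies when the single state holds thousands of resources and every run has to check all of them.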
If you want to learn more, there are so many awesome resources
out there on infrastructure as code. Today at this
conference I'm surrounded by so many awesome other speakers that hopefully
you've learned a bunch from as well. If you have questions for me,
always please feel free to reach out on Twitter. That is by far the best
place to get me. I put out videos and stuff on
infrastructure as code tools once a week on Mondays
at 8:00 Central time, so feel free to check those out as well.
Thank you so much for checking out my session. Hopefully this was helpful
for you and have a great rest of your day. Hopefully next time we'll see
each other in person.