Transcript
This transcript was autogenerated.
Hi everybody, my name is Ran Isenberg, and today we're going to talk about AWS CDK best practices. AWS CDK lets you write infrastructure as code: you describe your cloud resources with actual code, not JSON or YAML files. And to me, as a developer, writing code feels right at home.
Now, I had the pleasure of speaking at the first AWS CDK Day about three years ago. I came there from the perspective of a CDK newbie, and I wanted to share my experience working with CDK and how it helped accelerate development at CyberArk.
In that use case, I described an audit service POC. You can think of an audit service as a type of ETL log service: you take a log, transform it, enrich it with more information, and save it into a bucket. As you can see here, we have lots of serverless services: IoT, Kinesis, Lambda functions, API Gateway, Elasticsearch (which is now OpenSearch), and S3 buckets.
We didn't have much knowledge of or experience with these services back then. But since there isn't a lot of business domain logic here, it's mostly a matter of connecting the Lego pieces and configuring the event-driven architecture. With CDK, it took us, me and two other engineers, just three days to get all of this working, which to me was mind-blowing. It really shows how much CDK can accelerate your development. It's a very powerful tool.
And as Uncle Ben says, with great power comes great responsibility. Because CDK is such a powerful and flexible tool, it's really easy to make mistakes when you're writing code. I'm using the Python flavor of CDK, and you can basically do whatever you want with it. So it's really easy to make mistakes, and these mistakes can be quite costly.
That brings us to the agenda of this talk: best practices, so you don't make those mistakes. We're going to cover CDK app guidelines, construct guidelines, CI/CD guidelines, security and resilience, and general development tips, and then we'll summarize it all. So let's start.
A little bit about myself: my name is Ran Isenberg. I'm a principal software architect at CyberArk, in the platform engineering group, and I'm an AWS Community Builder. I run the website ranthebuilder.cloud, where I share my serverless knowledge. You can see its QR code here, and you're going to see QR codes like this throughout the presentation.
Okay, so let's talk about CDK app guidelines. Usually, when you start your first service, you start with a CDK application. It should be built around one business domain, and it should have one stack, not several. I'm going to explain why.
The stack is going to have at least one construct. Each construct contains multiple resources, with configuration and relationships between them, and sometimes resources in different constructs are related too, for example through an event-driven mechanism. I usually view a construct as a micro- or nanoservice within the application itself.
So I think the best practice is one stack and one business domain, maintained by one team so you don't have conflicts, with one CI/CD pipeline. Here's why I said one stack. Assume you have several stacks in this application: you deploy the first stack and then the second. What happens if you have a serious issue that you want to fix in the second stack, but for some reason the first stack fails deployment? Now you're stuck. You have increased your blast radius: you have a bug in your first stack, and you must solve it before you can fix the really critical issue in your second stack. So I think it's much better to have just one stack in your application and a smaller blast radius.
However, in some cases you do need to split part of the application into another application, with its own repository and its own stack. I can think of two use cases: a different team will maintain the new application, or it's a different business domain. For example, it started as a small part of your service, and over time you realize it's going to be its own domain, so you move it to its own repository and maintain it there.
Keep in mind that you shouldn't over-split. Balance is key here, because multiple repositories add complexity: sometimes services depend on each other. Sometimes you need to develop features that span repositories and applications, so you have to develop them behind feature flags and coordinate enabling those flags. Sometimes there is a deployment-time dependency; maybe one application builds an HTTP API Gateway and the other one needs to know about it. Then you can use something like SSM Parameter Store or AWS Cloud Map, where one stack publishes the value and the other reads it at deployment time. So it gets more complicated than just having everything in the same repository.
Now let's talk about project structure. I believe you should have three folders: cdk, service, and tests. The CDK app entry point sits in the root folder, and the cdk folder contains the stack and all the constructs. The service folder holds the business domain logic: all your Lambda function code and things like that. And then you have tests: unit, integration, end-to-end, and also security and CDK infrastructure tests, which we're going to cover later on. As you can see, I'm a true believer in the DevOps mentality: infrastructure as code and business domain code should reside together, because the developer should own and understand everything, from development through production and monitoring.
Okay, so let's say you've created your amazing application: it has all the best practices, an amazing CI/CD pipeline, a great project structure, and it works. Now you want to create a second service. What do you do? Do you copy all the code, duplicating the first repository and changing it by hand? No, that's a lot of work. Like I said, I'm part of the platform engineering group, and we found that by creating a CDK template project, a self-service project, teams can create their service really fast. They get started with something that works and already has all the best practices, the project structure, and the CI/CD pipeline, everything just as it should be, so they can focus on writing the business domain logic. We also provide internal training, where we tell them about the internal SDKs we use, the best practices for writing Lambda functions, how the CI/CD pipeline works, and things like that. We've seen that this really reduces the developers' cognitive load and accelerates development.
Once I saw that it really works for us at CyberArk, I decided: why not make it open source? So I created the AWS Lambda Handler Cookbook project, which you can find via the QR code on the top right. It's basically a serverless service with an API Gateway and a Lambda function that writes to a DynamoDB table and uses feature flags based on AWS AppConfig configuration. It has a CI/CD pipeline, observability, all the best practices for writing Lambda functions, tests, and everything else. So you should check it out.
Now let's talk about construct guidelines. In the same way that you don't write all your code in one function, or in one file with 10,000 lines, you shouldn't define all of your resources directly in your stack. You should use constructs. Constructs are also really easy to share between teams: you can build a best-practice construct once, share it, and save everyone time. So use constructs. I usually see them as a microservice or a nanoservice. One exception to the rule of no resources directly on the stack is the Lambda layer: if a layer is used by multiple constructs, I think it's okay to define it at the stack level. But usually you should have separate constructs that define the micro- or nanoservices of your application.
So why did I mention that constructs are great to share? Usually, platform engineers create and maintain the shareable constructs. You can think of them as organization-approved, security-approved constructs or patterns that you can use across the organization without reinventing the wheel. You can package them as a library; in my case it's a Python library of CDK constructs, because we use Python. Developers import it into their CDK code and use it as a black box, so to speak, which saves them time. However, since it's a library, it has versions, and you might need to upgrade. Whoever maintains it needs to be careful not to change the logical IDs of stateful resources, so that nobody gets their database deleted after an upgrade. I'm going to explain this in more detail later on.
Some examples of shareable constructs: WAF rules for your API Gateway or CloudFront distributions; an SNS-to-SQS subscription pattern with encryption at rest, which is not enabled by default; an AWS AppConfig dynamic configuration construct; a Datadog log shipper; or PII sanitizers. You can find more examples at the following links: Construct Hub (constructs.dev), Serverless Land, CDK Patterns, and AWS Solutions Constructs.
Now that we understand we need to write constructs, how do we take an application and split it into constructs? I think the split should be driven by business domain. Let's take a look at the following service. We have a CRUD API: an API Gateway that invokes two Lambda functions that write to and read from an Aurora Serverless database, which has its own VPC networking and all the fun stuff. There's an Aurora stream that triggers a Lambda function that sends a message via SNS. Incoming SNS messages go to an SQS queue, which triggers a Lambda function that again reads from the Aurora database.
So how do you split this into constructs? Like I said, I think it should be business domain driven. We have the CRUD part and the messaging part. The database, even though it's defined in the CRUD part, is an internal construct of it, because the CRUD Lambda functions are the only ones that write to the database. So I think they own the database, so to speak. You should still create Aurora as its own construct, because it's a very complicated construct and it's really easy to share across the organization: you create it once, and then everyone can use a best-practice, secured Aurora Serverless database. On the other side, you have the messaging, the asynchronous part: SNS, the queue, and the two Lambda functions. And again, there are connections between the two constructs; as you can see, the messaging Lambda functions need to access the Aurora database.
It's important to understand that there is no right or wrong here, because everything is defined under the same stack and gets deployed the same way either way. However, I think this split makes more sense, because it makes the code and the resources easier to find in the project, and it improves maintainability and readability. You can choose whatever construct split you want, but I think this is a sensible example. Okay, let's talk about CI/CD guidelines.
Usually you'd like to model your CI/CD stages in code. Different environments have different configurations, and that's okay, but CDK needs to know how to apply those differences to each environment. In my case, we use Jenkins, which sets environment variables and injects them into the CDK application. We call it a profile: it can be dev, test, production, whatever. The CDK code then reads this parameter and applies the matching configuration.
You can also see in this example that I'm using separate accounts for dev, test, and production. One reason is to keep the blast radius small in case of a breach: if somebody hacks into your dev account, you don't want your production account jeopardized. Another reason is AWS resource quota limits: you don't want to hit them, and with separate accounts you probably won't.
Let's see an example of how this works in CDK. Here I'm defining a DynamoDB table, and I read the profile environment variable that Jenkins sets. Look at the point-in-time recovery argument. In the dev environment, I don't want to enable it: I don't care about this database, because it's ephemeral. Developers use it just for branch and feature development, so I don't want to back it up. In production, however, I do want backups, so I can restore the table to a point in time in case of a crisis or a disaster.
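A minimal plain-Python sketch of the profile switch just described, assuming the CI/CD job exports the profile in an environment variable named PROFILE (a hypothetical name) and that only the production profile gets point-in-time recovery. In a real stack, the resulting boolean would feed the DynamoDB table's point-in-time recovery property; the sketch stays in the standard library so it runs without aws-cdk-lib installed.

```python
import os
from typing import Mapping

# Hypothetical profile names; match whatever your CI/CD system injects.
KNOWN_PROFILES = ("dev", "test", "production")


def get_profile(environ: Mapping[str, str] = os.environ) -> str:
    """Read the deployment profile injected by CI/CD, defaulting to dev for local synth."""
    profile = environ.get("PROFILE", "dev")
    if profile not in KNOWN_PROFILES:
        # Fail fast at synth time instead of deploying with the wrong settings.
        raise ValueError(f"unknown profile {profile!r}; expected one of {KNOWN_PROFILES}")
    return profile


def point_in_time_recovery_enabled(profile: str) -> bool:
    """Dev tables are ephemeral branch environments; only production needs backups."""
    return profile == "production"
```

In the stack, the table definition would then pass `point_in_time_recovery_enabled(get_profile())` to the table's point-in-time recovery setting; check the exact property name in the aws-cdk-lib docs for your version.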
The same goes for the removal policy. In a development environment, I want the database removed when I'm done with the stack. But in production, if the stack is deleted by mistake for some reason, I want to keep my database and its data. So this is an example of how you can apply different configurations per environment in your CDK code. Now let's talk about security guidelines.
First, secrets. Never, ever hard-code secrets in plain text in CDK code or config files. Store them in GitHub, Jenkins, or whatever internal secret store you're using. Then inject them into CDK as an environment variable or as a parameter to the stack's constructor. CDK uses this parameter to deploy the secret into Secrets Manager, or into SSM Parameter Store as an encrypted string, and the Lambda function consumes it from there. The function gets an environment variable that tells it the secret's name, and of course the correct permissions to read the secret. That's how you should do it. Don't store the secret itself in the Lambda function's environment variables. Don't do that.
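A minimal sketch of the consumer side of that flow, assuming a hypothetical environment variable API_KEY_SECRET_NAME that holds only the secret's name. The fetcher is passed in so the sketch runs without AWS access; `boto3_fetcher` shows what a real one could look like using boto3's Secrets Manager `get_secret_value` call, and imports boto3 lazily so the module loads even where boto3 is not installed.

```python
import os
from typing import Callable, Mapping

# A fetcher takes a secret name and returns the secret value.
SecretFetcher = Callable[[str], str]


def load_api_key(fetch_secret: SecretFetcher, environ: Mapping[str, str] = os.environ) -> str:
    """Resolve the secret at runtime; the environment only carries its name."""
    secret_name = environ["API_KEY_SECRET_NAME"]  # hypothetical variable name
    return fetch_secret(secret_name)


def boto3_fetcher(secret_name: str) -> str:
    """Real fetcher for AWS; needs boto3 and secretsmanager:GetSecretValue on the secret."""
    import boto3  # imported lazily so the sketch runs without boto3 installed

    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_name)["SecretString"]
```

Inside a handler you would call `load_api_key(boto3_fetcher)`; in tests, a plain dict lookup stands in for Secrets Manager, so the secret value never appears in the function's configuration.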
Okay, let's talk about resource security configurations. AWS really does think about security: since January, new S3 objects are encrypted by default, and DynamoDB has supported encryption at rest for quite a while now. But it's not the same for all resources. What about SNS encryption at rest? It's disabled by default, so you need to know that and enable it yourself in your CDK code. Security defaults differ from service to service. AWS keeps getting better at it, but in the end it's your responsibility. Under the shared responsibility model, AWS handles security of the cloud, but you need to make sure that security in the cloud is properly defined, because these are your resources: you own them and you need to secure them. Nobody else is going to do that for you.
So make sure all your configurations follow security best practices. You should have security reviews, you should schedule penetration tests from time to time, and you should also use CDK security tests, which is what I'm going to show you now.
We're going to use a tool called cdk-nag. These are tests that you run before deployment, so you never deploy a stack with a security misconfiguration and expose yourself to a security hazard. The tests synthesize the CloudFormation template of your stack and then run a set of security checks and assertions against it. In this case we have two tests: the first checks the AWS Solutions rules, which encode AWS security best practices, and the second checks the HIPAA security rules. If you did something wrong, like an overly privileged role or a bucket that is open to the world, the test fails, and you won't push and deploy something risky. That's very important.
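cdk-nag itself is applied to the app as a CDK aspect (for example `Aspects.of(app).add(AwsSolutionsChecks())`, with `HIPAASecurityChecks` as the HIPAA rule pack). To show the idea behind such rules without depending on it, here is a toy checker, not cdk-nag's API, that scans a synthesized CloudFormation template (as a dict) for two misconfigurations mentioned above: an SNS topic without a KMS key, and an S3 bucket that does not explicitly block all public access. It is deliberately strict and only illustrative.

```python
from typing import Any

Template = dict[str, Any]

PUBLIC_ACCESS_FLAGS = (
    "BlockPublicAcls",
    "BlockPublicPolicy",
    "IgnorePublicAcls",
    "RestrictPublicBuckets",
)


def find_security_issues(template: Template) -> list[str]:
    """Return one finding per misconfigured resource in a synthesized template."""
    findings = []
    for logical_id, resource in template.get("Resources", {}).items():
        rtype = resource.get("Type")
        props = resource.get("Properties", {})
        if rtype == "AWS::SNS::Topic" and "KmsMasterKeyId" not in props:
            findings.append(f"{logical_id}: SNS topic is not encrypted at rest")
        if rtype == "AWS::S3::Bucket":
            block = props.get("PublicAccessBlockConfiguration", {})
            if not all(block.get(flag) is True for flag in PUBLIC_ACCESS_FLAGS):
                findings.append(f"{logical_id}: S3 bucket does not block all public access")
    return findings


def assert_secure(template: Template) -> None:
    """Fail the pipeline step, as a cdk-nag error would, when any finding exists."""
    findings = find_security_issues(template)
    if findings:
        raise AssertionError("security checks failed:\n" + "\n".join(findings))
```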
Another thing I think is very important is writing your own IAM policies. In this example, I define a DynamoDB table and want to give a Lambda function's role permissions to get an item from and put an item into the table. In many cases, people tell you to just call the table's grant_read_write_data method on your role. It's really easy, very readable, and it works. But I wanted to add two permissions to my Lambda function, and by using this method I actually granted something like eight or ten permissions that I don't need. So my role is not least privileged: if somebody gains access to this role, they can do a lot more damage to my DynamoDB table than the access we intended. What you should do instead is use CDK to write your own inline policies. You write a policy document with a policy statement that allows only PutItem and GetItem, and only on a specific resource: my table's ARN, not an asterisk. This way we have exactly the permissions we wanted, and I think it makes you a better developer, since you understand IAM policies better.
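A sketch of the resulting least-privilege policy as plain IAM JSON, built in Python so it can be checked in a test. In CDK, the equivalent would be an `iam.PolicyStatement(actions=[...], resources=[table.table_arn])` inside an `iam.PolicyDocument` attached to the Lambda role as an inline policy; the table ARN used in the usage line is a made-up example.

```python
from typing import Any

ALLOWED_ACTIONS = ["dynamodb:GetItem", "dynamodb:PutItem"]


def dynamodb_item_policy(table_arn: str) -> dict[str, Any]:
    """Inline policy granting only GetItem and PutItem on a single table."""
    if "*" in table_arn:
        # Refuse wildcards: the whole point is scoping access to one table.
        raise ValueError("use a concrete table ARN, not a wildcard")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(ALLOWED_ACTIONS),
                "Resource": [table_arn],
            }
        ],
    }
```

For example, `dynamodb_item_policy("arn:aws:dynamodb:us-east-1:111111111111:table/orders")` yields a single statement with exactly two actions on exactly one table.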
Let's talk about resilience.
Sometimes in CDK, people refactor the code and move resources from one construct to another, or maybe rename a construct, without realizing that this changes the resource's logical ID. That means CloudFormation is going to replace the resource: delete it and create a new one with the new logical ID. That can be a serious issue for stateful resources, such as tables with actual production data, or a cross-account trust role whose ARN changes, so you suddenly lose access to the other, production account. So something that seems simple and naive can cause serious issues. Another issue I've encountered, only once to be honest: if your CDK code raises an exception that somehow doesn't fail the entire deployment process, you can end up with entire resources deleted from your stack.
You can basically deploy and remove an API Gateway or a bucket, which is not great. One way to avoid that is to write CDK infrastructure unit tests. Let's see how. Again, these run before deployment, so you know your code is safe. Here we synthesize the CloudFormation template again and run some checks: we make sure our critical resources, the API Gateway REST API and the DynamoDB table, are still there. We can also check that their logical IDs exist and haven't changed. If a logical ID changes, we know we're about to create a new table with zero data in it, which is not great. So this can be a nice safeguard.
Another cool utility you can use is CDK diff. It's open source, and you can add it to your pipeline. It visualizes new resources and changes to your stack: added resources are shown in green and deleted resources in red. That makes it easier to understand whether there is a critical change, or whether somebody is changing critical resources they shouldn't be touching. It simply gives you better visibility.
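A minimal sketch of both safeguards over synthesized templates represented as dicts; in a CDK test, `aws_cdk.assertions.Template.from_stack(stack).to_json()` would provide such a dict. The first function pins the logical IDs of critical resources (the IDs shown are hypothetical). The second roughly imitates what a diff view highlights, listing added and removed logical IDs and flagging removed stateful resources; it is not the CDK diff tool itself.

```python
from typing import Any

Template = dict[str, Any]

# Hypothetical logical IDs copied from a known-good synth of the stack.
CRITICAL_RESOURCES = {
    "CrudApiOrdersTableA1B2C3D4": "AWS::DynamoDB::Table",
    "CrudApiRestApiE5F6A7B8": "AWS::ApiGateway::RestApi",
}

STATEFUL_TYPES = {"AWS::DynamoDB::Table", "AWS::RDS::DBCluster", "AWS::S3::Bucket"}


def check_critical_resources(template: Template) -> list[str]:
    """Report critical resources whose logical ID vanished or whose type changed."""
    resources = template.get("Resources", {})
    problems = []
    for logical_id, expected_type in CRITICAL_RESOURCES.items():
        resource = resources.get(logical_id)
        if resource is None:
            problems.append(f"{logical_id} missing: renamed, moved, or deleted")
        elif resource.get("Type") != expected_type:
            problems.append(f"{logical_id} is {resource.get('Type')}, expected {expected_type}")
    return problems


def diff_resources(old: Template, new: Template) -> dict[str, list[str]]:
    """List added and removed logical IDs, and flag removed stateful resources."""
    old_res = old.get("Resources", {})
    new_res = new.get("Resources", {})
    added = sorted(new_res.keys() - old_res.keys())
    removed = sorted(old_res.keys() - new_res.keys())
    risky = [lid for lid in removed if old_res[lid].get("Type") in STATEFUL_TYPES]
    return {"added": added, "removed": removed, "risky": risky}
```

In a pytest suite, `assert check_critical_resources(template) == []` fails the build before CloudFormation gets a chance to replace the table.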
Backups: you should use retain policies, like the ones we saw earlier. Better safe than sorry. If you retain the database, you can restore its data into the new table in case a new table was created instead. And you should always back up your resources: DynamoDB has point-in-time recovery, and so do Aurora databases. You can also use AWS Backup for your resources, so you can recover lost data in case of a disaster.
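CDK renders `RemovalPolicy.RETAIN` in the synthesized template as a `DeletionPolicy` of `Retain` on the resource. A small check along these lines, assuming the listed types are the stateful resources you care about, can catch a stateful resource that would be deleted together with the stack:

```python
from typing import Any

# Assumed set of stateful types; extend it for your own stack.
STATEFUL_TYPES = {"AWS::DynamoDB::Table", "AWS::RDS::DBCluster", "AWS::S3::Bucket"}
# "Snapshot" is only valid on some resource types, such as RDS clusters.
SAFE_DELETION_POLICIES = {"Retain", "Snapshot"}


def unprotected_stateful_resources(template: dict[str, Any]) -> list[str]:
    """Logical IDs of stateful resources CloudFormation would delete with the stack."""
    return sorted(
        logical_id
        for logical_id, resource in template.get("Resources", {}).items()
        if resource.get("Type") in STATEFUL_TYPES
        and resource.get("DeletionPolicy") not in SAFE_DELETION_POLICIES
    )
```

Run it against the production synth only, since dev stacks intentionally use a destroy policy.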
Let's talk about some general tips and guidelines. When I'm using a new service in CDK and I'm not really sure how to define it, I go into the AWS console and play around with it, just to understand how the resources and entities work together and what the relationships between them are. Then it's easier for me to write the CDK code, because I understand the service much better.
The second tip: sometimes the higher-level constructs, the abstractions that CDK provides, don't expose all the configuration you might need. Then you need to use the lower-level abstraction, the Cfn (L1) resources. They're less easy and less fun to use, but they expose all the CloudFormation properties, and you can use them to define pretty much whatever you want.
The third tip is tags. Tags are super important because you can add them at the stack level and they're applied to all the resources. That makes it really easy to understand, for every resource you see in AWS, who created it, when, and which service it belongs to. It's a great way to manage your resources, or even to figure out where orphan resources came from, because they carry tags.
And lastly, I think the most important tip: we're developers, and we like cool abstractions and clever factory methods. My tip for you is: don't do it. This is infrastructure code. It should be as simple and readable as possible, and you shouldn't make it too complicated. I'm okay with more code duplication if it makes the code easier to read.
So let's summarize. Like we said, CDK is very powerful, but you need to use it responsibly. We covered best practices for CDK apps, stacks, and constructs, and how to share constructs. We talked about the CDK template project and the self-service mechanism, and about security and resilience. That's it. I hope you found it interesting and helpful. You can follow me on Twitter, LinkedIn, and my website, ranthebuilder.cloud. Thank you very much.