Transcript
This transcript was autogenerated. To make changes, submit a PR.
Gaining real-time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time, allows you to not only experiment with confidence, but respond instantly to get things working again.
I'm Lerna, a senior solutions architect with AWS. For the past 15 years I've been in financial services infrastructure: web infrastructure, authentication systems, and distributed caching, to name a few. And for the last three years I've been working with cloud, infrastructure as code, and CI/CD, and I fell in love with these concepts. So that's me.
Hi everyone, my name is Jack Yu. I'm a principal solutions architect at AWS. I've been at AWS about three years, helping financial services customers adopt and optimize on AWS. I have a mix of software engineering and cloud infrastructure background, and I've helped build complex distributed systems. I help highly regulated financial services customers architect multi-region transactional workloads, and I also help them build their DevOps pipelines and deploy those applications.
So we'll talk about multi-region deployments with Terraform-built CI/CD on AWS. Everything you see today is built using Terraform, and we'll talk about multi-region deployments. So Jack, why multi-region deployments?
So first we want to understand what AWS Regions are. AWS has a concept of Regions, which are physical locations around the world where we cluster data centers, and we call each group of logical data centers an Availability Zone. The AWS cloud spans 84 Availability Zones within 26 geographic Regions worldwide. Each Availability Zone has one or more discrete data centers with redundant power, networking, and connectivity, and they are housed in separate facilities. For the vast majority of customer workloads, a highly resilient multi-AZ deployment is actually the best way to host applications. This is how Amazon.com has run and still runs today. With that said, let's talk about why customers are looking into multi-region deployments.
We've actually seen multiple use cases from customers. First, we see customers who want to get their compute resources closer to their end users, so requests don't need to traverse all the way across the world to reach the compute resources they need. These customers typically have a global expansion plan, and they want to leverage multi-region deployments for that purpose. Second, it's about supporting mission-critical applications: these customers want to achieve extreme resiliency for their workloads. And third, it's related to business continuity and regulatory requirements, that is, building a multi-region architecture to satisfy disaster recovery strategies and business continuity requirements.
Multi-region comes with a cost, though, especially when it comes to infrastructure deployment. First, it's actually very difficult to make sure you deploy the same infrastructure across multiple regions. It's especially hard for customers who use the AWS console to deploy their infrastructure, and customers have actually figured this out already: infrastructure as code is the way to manage consistency between different stacks and different environments. That's not very challenging with a single-region deployment; with IaC it's actually very simple: you compose multiple modules together, you click a button to deploy, and management is extremely simple. But it gets very complicated when you have multiple business units, multiple SDLC environments, and multi-region deployments. And finally, organizations need to think about their deployment strategy and how they leverage tooling to enable their continuous delivery strategy. So in this talk, Lerna and I are going to give you some prescriptive guidance on how to manage your multi-region Terraform code and deploy it at scale. So, Lerna, why don't you take us through how to manage this complexity?
Absolutely. So let's talk about the Terraform deployment workflow. We have four simple steps in our deployment workflow. First, we'll talk about infrastructure as code and git tags. Here you see two pink boxes; these are the accounts we're using in this solution. The larger box is the central tooling account. It contains all of the CI/CD resources, as well as a git-compatible repository in which we store our infrastructure as code: the Terraform code for the sample infrastructure workload that gets deployed into the target workload account, which is the smaller pink box. You can imagine the target workload account as belonging to a business unit, a line of business, and it is also per environment. So we're imagining a Research business that has a dev account, a QA account, a staging account. Similarly, another business with different requirements for security and access, like Risk, has its own accounts. DevOps engineers work against the Terraform repository, and they follow the branching strategy of their choice; in this solution we're imagining a trunk-based branching strategy.
So let me walk you through the tagging and how it looks. First of all, DevOps engineers work in short-lived branches. They make their changes, and when they're ready, they submit their changes through a merge request for a teammate to review and approve. Then the code gets merged into the main branch in the repo, and they can tag a release from the main branch. The tags follow a convention. First, there's an environment like dev, QA, staging, or prod; here I'm showing you the dev tags. Next there's a deployment scope, like a region name, or "global" for global resource deployments, and we'll talk about that later. Next is the team name, that's the business unit name, and then a version number. The version number is important because you want to know, at any given point, what version of the resources is deployed in your account.
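As a sketch, the tag convention described here can be expressed in HCL. In the actual solution the tag is parsed by the pipeline tooling rather than by Terraform, and the exact separators are an assumption, so treat this as illustrative only:

```hcl
# Illustrative only: decompose a release tag such as
#   dev_eu-central-1_research_1.0.0
# into environment, deployment scope, team, and version.
locals {
  release_tag = "dev_eu-central-1_research_1.0.0"

  # <environment>_<region-or-global>_<team>_<version>
  tag_parts = regex(
    "^(?P<env>[a-z]+)_(?P<scope>[a-z0-9-]+)_(?P<team>[a-z]+)_(?P<version>.+)$",
    local.release_tag
  )

  environment      = local.tag_parts.env     # "dev"
  deployment_scope = local.tag_parts.scope   # "eu-central-1" or "global"
  team             = local.tag_parts.team    # "research"
  release_version  = local.tag_parts.version # "1.0.0"
}
```

The four fields are exactly what the pipeline needs: which environment and team (and therefore which account) to target, which region or global scope to deploy, and which version is being released.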
So here, as soon as a DevOps engineer git tags, using for example the tag dev, eu-central-1, research, plus a version number, our pipeline parses the tag and knows from it which target workload account to deploy the infrastructure resources into, as well as what type of resources it's deploying and what the scope is. So here we're telling the pipeline that we intend to deploy into the Research dev account, in the eu-central-1 region. Next, let's take a look at the triggering of the pipeline. The pipeline gets triggered as soon as the DevOps engineer git tags against the repository. There is one more step in between, which Jack is going to go into, and of course this pipeline deploys into the target account. As the pipeline runs, it goes through a number of stages; Jack is going to go into the details of those, and at the end it deploys those Terraform infrastructure resources into the target account. So Jack, what about the core infrastructure pipeline?
So let's zoom into the different pipeline stages. First we want to look at the source stage. In the source stage we have a versioned S3 bucket that is the source of the pipeline. It contains two pieces of information. One is the Terraform code that describes the target state of the infrastructure; on the right, it contains the VPCs and the application load balancers you see. The second piece of information is the git tag data Lerna talked about, so that it can be passed on to later stages of the pipeline and influence the pipeline deployments: different regions, different accounts, different environments, and so forth. Let's take a look at how we capture that information.
So the magic here is in AWS CodeCommit and CodeBuild. When the DevOps engineer tags the git repo, that action triggers a CodeBuild job, which in turn grabs the Terraform code from the CodeCommit repo, gets the git tag information, bundles the two together, and puts the bundle into the Amazon S3 bucket. That in turn triggers the pipeline to run. Next, we want to look into the infrastructure as code linting stage.
As a best practice, we want to lint the Terraform code very early in the pipeline. In fact, it's the first step in the pipeline, and it ensures that the code we want to deploy adheres to best practices. In this case we are using TFLint, an open source tool. It performs a couple of things: it does static analysis of the code to find possible errors, for example an invalid instance type; it warns about deprecated syntax and unused declarations; and it enforces best practices. If any violation is detected, the pipeline stops there and notifies the DevOps engineer to fix the Terraform code.
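For reference, a minimal TFLint configuration along these lines might look like the following; the plugin version and the particular rule selection are illustrative, not taken from the talk:

```hcl
# .tflint.hcl -- minimal example configuration
# The AWS ruleset catches provider-specific errors such as
# invalid EC2 instance types.
plugin "aws" {
  enabled = true
  version = "0.30.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

# Warn about deprecated interpolation syntax like "${var.foo}".
rule "terraform_deprecated_interpolation" {
  enabled = true
}

# Flag declared-but-unused variables, locals, and data sources.
rule "terraform_unused_declarations" {
  enabled = true
}
```

Running `tflint` with this file in the repository root as the first pipeline step gives the fast failure behavior described above.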
So let's take a look at the next stage, which is a very important one: the infrastructure as code security scan. It's actually crucial to have this security scan at such an early stage. For illustration purposes, we use an open source tool called Checkov. Checkov is an open source tool that contains thousands of policies and is ready to be used out of the box, and of course you can use any tool of your choice for the security scan stage. Essentially, it scans your Terraform code and generates a JUnit XML report, so that you can approve or reject in the pipeline run. It has a very robust set of security checks: for example, if there's an S3 bucket that's open to the world, or any security group with a quad-zero rule (0.0.0.0/0) open to the world, Checkov will find that vulnerability and stop the pipeline from proceeding any further. Next, we want to take a look at how the Terraform code actually gets deployed to a target environment.
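To make the security scan concrete, here is the kind of misconfiguration described above, written as a small, hypothetical Terraform snippet that a scanner like Checkov would flag (resource and bucket names are made up for illustration):

```hcl
# Hypothetical examples of findings a security scan should block.

# A security group with a quad-zero (0.0.0.0/0) ingress rule,
# opening SSH to the entire internet.
resource "aws_security_group" "open_ssh" {
  name = "open-ssh-example"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # flagged: open to the world
  }
}

# An S3 bucket ACL granting public read access.
resource "aws_s3_bucket_acl" "public_example" {
  bucket = "example-public-bucket"
  acl    = "public-read" # flagged: bucket open to the world
}
```

With the scan running as an early pipeline stage, neither of these changes would ever reach the target workload account.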
You probably guessed it: it's terraform plan and terraform apply. terraform plan generates a plan file that describes the changes that are going to happen in the target workload environment. That plan can be reviewed by a DevOps lead, and once they're okay with the changes, they give their manual approval in the pipeline so that it can go on to the terraform apply stage, where terraform apply actually makes the changes in your target workload.

So Lerna, you put a lot of thought into this architecture. Can you highlight some of the key considerations here?

So we
are following the multi-account best practice: we have a central tooling account that contains the CI/CD resources, and we have target workload accounts that contain the sample workload. You can imagine that the central tooling account has the git repository and the pipelines, and it needs to be accessed by the DevOps engineers to trigger deployments into the business unit accounts, which are the target workload accounts. So it's pretty sensitive in terms of the resources it contains. Similarly for the target workload accounts for the business units: for example, a business unit like Research that publishes research documents to its external clients, or a Risk business unit that has internal-use-only data for its internal clients, internal company workers. Each of these has a different security profile and a different access profile, so accordingly each needs an account of its own. Here we are following a cell architecture principle: the idea is that we categorize these different use cases based on their security and access profiles, and we minimize the blast radius by containing together the resources that belong together. The idea is similar to bulkheads, the vertical partitions used in shipbuilding all the way back to the twelfth century. This idea of bulkheads is also used in space, for example in the ISS, the International Space Station. So we are applying that same principle of containing resources and minimizing the blast radius in cloud deployments.
What about multi-region pipelines? Currently our pipeline is in a single region, and from that single-region pipeline we deploy against workload accounts across multiple regions. What if we extend our pipeline and deploy the pipeline resources into a second region? We can do this because we are using Terraform for our pipeline deployment as well. The idea here is to use the region boundary as the cell, and to make sure we contain any issues within that boundary, within the region. So what we do in this case is have each region's pipeline be in charge of its own region's deployments. Meaning, if there is an issue in region one, we want the pipelines in regions two through n to continue to be able to deploy to their respective regions. This way we contain the issue in a given region and ensure there is no impact across all of our regions.

Again, we're using the cell architecture principle for pipelines per environment: we actually have a pipeline per environment. Here we are showing the dev pipeline, which is triggered by git tags prefixed with dev, and these dev pipelines use S3 buckets dedicated to the dev resources and target the workloads in the dev environment. So again, the cell architecture principle applied to environments: if there's an issue in the dev environment with the dev pipelines, for example, we're not going to impact the higher, more sensitive environments like production.
So I find it fascinating that we're taking this idea of bulkheads, from twelfth-century shipbuilding and the International Space Station, and using it in cloud deployments for multi-region. It's pretty fascinating, but I'm really curious about the security considerations for our solution. Jack, can you walk us through those?
Yeah, sure. It's very interesting that you make that analogy of what the International Space Station and the bulkheads of twelfth-century shipbuilding have in common with multi-region deployment. Very fascinating.
So let's take a look at the security aspect. We wanted to integrate the security scanning at a very early stage in the pipeline. Imagine a developer goes in and tries to make an S3 bucket public. If that happened, it would get caught at a very early stage: in the third stage here, in the security scanning, it would get caught, and the pipeline would not proceed and deploy that open S3 bucket change into the target workload accounts. If you look at how the traditional enterprise software development process goes, they have a security scanning stage at a very late point in the development lifecycle, typically right before production deployment. Imagine a finding at that point: they pretty much have to go back through the development cycle again to fix the issue and then redeploy, and that's highly inefficient. With this shift-left approach, the iterations of the development lifecycle can happen much faster, because you provide a tight feedback loop back to the developers in case they make any mistakes from a security vulnerability perspective. The last security principle we're looking into is to make the pipeline the authoritative source for deployment. What that means is that all changes to the target workload should be made from the pipelines. That concept ensures that the environments stay consistent.
So how do we do that? One way is to prohibit developers from going into the target environment to make changes, and to enable only the pipelines to make changes there. From an implementation perspective, it's a combination of IAM roles and IAM policies that makes sure the pipeline has the permissions to deploy to the target environment, while prohibiting any individuals, personnel, or developers from going into the target environment and making changes.

So Lerna, let's dive into the code deployment part.
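The IAM setup Jack describes could be sketched in Terraform roughly as follows. The role names, the placeholder account ID, and the attached managed policy are all assumptions for illustration; the key idea is that the deployment role in the workload account trusts only the pipeline's role, so individual developers cannot assume it:

```hcl
# Hypothetical sketch: a deployment role in the target workload account
# that can only be assumed by the pipeline's role in the tooling account.
resource "aws_iam_role" "cross_account_deploy" {
  name = "cross-account-deploy" # hypothetical name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = "sts:AssumeRole"
      Principal = {
        # Only the CI/CD pipeline role in the central tooling account
        # (111111111111 is a placeholder account ID).
        AWS = "arn:aws:iam::111111111111:role/pipeline-role"
      }
    }]
  })
}

# Grant the role the permissions the pipeline needs to manage
# the workload's resources (scope this down in practice).
resource "aws_iam_role_policy_attachment" "deploy_permissions" {
  role       = aws_iam_role.cross_account_deploy.name
  policy_arn = "arn:aws:iam::aws:policy/PowerUserAccess"
}
```

Because no developer principal appears in the trust policy, all changes are forced through the pipeline, which is exactly the "pipeline as the authority" principle above.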
So we are storing our Terraform state files in Amazon S3 buckets, as per HashiCorp best practices for Terraform remote state management on AWS. And we also use DynamoDB tables for the Terraform state locks, so that any concurrent write attempts against the state files are prevented using the locks, as per best practices. So that's pretty standard. But if you look at our buckets and DynamoDB tables, we actually have one set of these per environment. Again, we are following the cell architecture principle, ensuring that we isolate and minimize the impact scope per environment.
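In Terraform, that standard setup is declared with an S3 backend plus a DynamoDB lock table. A per-environment sketch might look like this, with the bucket, table, and key names being hypothetical (the key structure follows the tag-derived layout discussed next):

```hcl
terraform {
  backend "s3" {
    # One state bucket and one lock table per environment
    # (cell architecture principle).
    bucket         = "tf-state-dev"      # hypothetical bucket name
    key            = "research/dev/eu-central-1/terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "tf-state-lock-dev" # hypothetical table name
    encrypt        = true                # encrypt state at rest
  }
}
```

The `dynamodb_table` argument is what gives the concurrent-write protection: Terraform acquires a lock item in the table before touching the state file.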
Now if you look at the S3 keys for the Terraform state files themselves, you'll see some familiar things in there. They actually come from the git tag. So I want to dive into this git tag example on the left: a git tag that looks like dev, eu-central-1, research, and a version number. When the DevOps engineer tags the repo with this, our pipeline parses the git tag, as we mentioned. It takes the environment and the team name, that's dev and research, and the deployment scope, eu-central-1 in the first example there, and uses them to construct the S3 key for the Terraform state file inside the Amazon S3 bucket. And if you look at the different Terraform state files, there are multiple reasons why we use this structure. First, suppose the Research dev account has an issue in one of its existing regions. Say Research dev is currently deployed in eu-central-1 and us-east-1, and it's experiencing an issue in eu-central-1, and let's say that's an issue with the Terraform state, so maybe we need to do some Terraform state surgery in eu-central-1 for Research dev. We know we will not have any impact on the existing deployments of Research dev in any other region; Research dev in us-east-1 is not going to be impacted, because that's in a completely different Terraform state file. Similarly, if we want to expand the Research dev account to additional regions, we know our existing deployments in regions like eu-central-1 and us-east-1 won't be impacted, because the Terraform state file for the additional region will be in a completely different place: the Research dev ap-southeast-1 terraform.tfstate, for example.
Next, let's take a look at a pipeline demo and a walkthrough of the Terraform code structure.
Let's take a look at the Terraform code structure for our sample infrastructure code. First we have our account configuration, which lives under the environments folder. We have the different environments that we configured, and under each one we have the team names; these map to the business units. Inside each of those we have a variables tfvars file. This is where we pass the account number to the pipeline. Remember that the git tag has information about the environment name, like dev, and the team name, research. So the pipeline parses the git tag and is able to get the value of the account number to target.
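So the layout described here might look roughly like the following; the exact path and variable name are assumptions, since the talk only describes the structure verbally:

```hcl
# environments/dev/research/variables.tfvars (hypothetical path)
# Account number for the Research dev target workload account;
# the pipeline selects this file from the parsed git tag
# (environment "dev" + team "research") and passes the value on.
account_id = "222222222222" # placeholder account number
```

One file per environment-and-team pair means adding a new business unit or environment is just adding one small tfvars file, with no pipeline code changes.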
Next, let's take a look at the provider configuration. Our provider is parameterized by region. As you remember, the git tag also carries the deployment scope, and that is used as the region for the provider to target. The provider will also assume an IAM role inside the target workload account. So here you're seeing that when the target workload account is created, this role should also be provisioned inside the account initially. And here you also see the account number; we just talked about how we feed that into the pipelines.
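A sketch of such a provider block is shown below; the variable names and the role name are assumptions, but the shape (region parameter plus a cross-account `assume_role`) is what the talk describes:

```hcl
variable "region" {
  description = "Deployment scope parsed from the git tag"
  type        = string
}

variable "account_id" {
  description = "Target workload account, from the environments config"
  type        = string
}

provider "aws" {
  region = var.region

  # Assume the deployment role that is pre-provisioned in the
  # target workload account when that account is created.
  assume_role {
    role_arn = "arn:aws:iam::${var.account_id}:role/cross-account-deploy"
  }
}
```

Because both the region and the account are variables, the exact same provider configuration serves every environment, team, and region combination.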
Next, in main.tf, we have created two Terraform namespaces. This helps us ensure consistency of what's deployed inside each region. Remember that regional resources are scoped at the account and region level, and global resources are scoped at the account level: global resources are managed for the whole account, and regional resources are managed per account and region. Therefore we have created different Terraform namespaces, and in the pipeline we target these namespaces depending on the tag. In the tag we have the deployment scope: if it's a global deployment, the prefix is, for example, dev_global; if it's a regional deployment, then it's dev_ followed by a region name. So depending on the tag, we will deploy either the global resources module or the regional resources module, and these map to the folders inside our repository.
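A sketch of that main.tf layout follows. The module and folder names track the talk's description; the `-target` usage in the note afterwards is an assumption about how the pipeline selects a namespace:

```hcl
# Global resources: scoped per account (e.g. IAM roles).
module "global_resources" {
  source     = "./global"
  account_id = var.account_id
}

# Regional resources: scoped per account and region (e.g. VPC, ALB).
module "regional_resources" {
  source     = "./regional"
  account_id = var.account_id
  region     = var.region
}
```

Depending on the tag prefix (dev_global versus dev_ plus a region name), the pipeline could then run, for example, `terraform plan -target=module.global_resources` or `-target=module.regional_resources` against the appropriate state file.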
Inside these folders, for example, under global for our sample infrastructure workload we have an IAM role, and under regional we have a VPC. We're using versioned Terraform modules, so we try to reuse as much existing, well-tested code as possible, all the way up to an external-facing application load balancer. So that's our sample workload, and this concludes the code structure walkthrough for our sample infrastructure workload in Terraform.
All right, so let's take a look at the resources in our accounts. On the left-hand side you're seeing our central tooling account, and I am logged in as cloud apps, so I have least-privilege permissions for cloud apps to be able to access the CI/CD resources. On the right-hand side you're seeing the workload account, the target workload account, and here, in the purple browser, I have read-only access. Okay, so let's first start with the central tooling account. Here you're seeing the code repository. This is the Terraform infrastructure sample workload repository, the one we just did the code walkthrough on. You can see the project here with the environments configuration and all of the folders we discussed earlier. And if you look at the git tags, you'll see all the previous git tags that triggered pipelines for deployments. I have triggered this dev global research 2.6, and also dev eu-central-1 research 2.6 for an eu-central-1 deployment of regional resources; the previous one is a global resource deployment. And if we
look inside the build projects, what we'll see is that we have a number of build projects defined. Some of these are the ones we use inside the pipeline stage actions: the ones at the bottom here, for example, are TFLint for linting the Terraform code, Checkov for security scanning, and terraform plan and apply. The ones on top are the ones we use per environment: when you git tag as a DevOps engineer on the repo, what these projects do is take the git tag and a full git clone of the repository and push that Terraform artifact into a versioned S3 bucket, with a separate S3 bucket per environment. That S3 bucket then becomes the source stage of the pipeline. So let's take a look at the pipeline itself and how
it looks. I've been working with the dev pipeline; that's why you're seeing this one used and the other ones untouched. So we'll look inside the dev pipeline, and if you look there, you will see that there is a source stage. This is the stage we just mentioned, with the versioned S3 bucket. Every time there is a data change in the versioned S3 bucket, which contains the Terraform infrastructure code for our sample workload as well as the git tag the DevOps engineer just created, every time there's a data event, this pipeline gets triggered. The next stage is a TFLint check of the Terraform code. Let's take a look at the output of that: in our case, tflint ran and it was successful.
In the next stage we have the Checkov security scanning of our Terraform code. This stage runs and generates a JUnit XML report of Checkov's successful rules as well as the failures in our Terraform code. That report is presented in a manual approval action to the person reviewing it, who decides whether or not it's okay to move forward given the Checkov output, or whether we should reject because we saw security scan failures that we decided we need to fix. So here you're seeing the Checkov output. Here on top is the actual output in the logs, but it also generates a JUnit XML file that we can view in the dashboard after this step. The next step is the Terraform build stage.
Inside the build stage we have a terraform plan action that generates the plan. Here we see that plan generated in the output of that job. And if we look at the stage itself, we'll see a manual approval action that presents the terraform plan to the reviewer; the reviewer can check the plan and ensure that it performs all the changes as intended, as per our Terraform code. If they're satisfied with the terraform plan, they'll approve, or they can reject it if they see something funny in there. Now, yesterday I approved this terraform plan and it went into terraform apply. If you look at the apply output, it's just a typical terraform apply; as you can see, we are applying what was in our terraform plan. In this case, I was deploying these resources into the eu-central-1 region, and the apply completed successfully. As you can see here, it's all green, and on the right-hand side, in this region, eu-central-1, I have the resources deployed, including my load balancer, and it's working as well. So here you can see that I am able to see the output from my load balancer, and that concludes the pipeline walkthrough.
Thank you very much, Lerna, for taking us through the demo and code structure. That's really helpful.
So let's summarize what we learned in this talk. If you have one slide to take away from this talk, it's this one. It summarizes what we learned so far: you have a DevOps engineer committing the Terraform code and tagging it, which in turn triggers the pipelines through CodeBuild. The pipeline goes through the different stages we spoke about, and it deploys the Terraform artifacts into the target workload accounts. It also illustrates the concept of the multi-account structure we learned about: having a different pipeline per environment, and having different accounts for different workload deployments as well. And I think the best thing is that we built every single component here using Terraform, which is really cool. So Lerna, what are some of the key takeaways?
Absolutely. So the most important takeaway about multi-region deployments is that we should make sure we are consistent with our deployments. It helps to leverage the pipeline as our single source of truth for deployments. It helps to structure our infrastructure as code and architect our pipelines such that we maximize code reuse, so that we can be consistent in the resources we deploy into those accounts. And it's also important to shift security left in our pipeline, so that we can catch security issues early in the SDLC, as Jack mentioned.
Thanks for joining us. Thank you for joining us.