Transcript
Thank you for joining this session. I am pleased to be here, and today we are going to look at building secure and flexible multi-cloud images with a comprehensive CI/CD approach. But before starting, I'd like to introduce myself. I am a Brazilian expat living in Poland, I have about 17 years of experience in IT, and my main experience is as a developer specialist with a focus on automation, infrastructure, and cloud.
So, in this session we will see hints of the code, why we chose this approach and this solution, an overview of the CI/CD, and the reasoning behind all of it. So, the CI/CD overview.
First, what does CI/CD stand for? For those who don't know, CI stands for continuous integration and CD stands for continuous delivery or deployment, depending on what kind of product you have. CI is usually when you have automated builds, automated testing, early detection of problems, and code quality checks. Continuous delivery is when you have automated deployment and you want to reduce the time to market: you build and promote the result to a marketplace, for example. Continuous deployment goes further: you commit, you push, you build, and you deploy to production services, so you have an efficient feedback loop. In our case, in our approach, we just need CI plus the CD that means delivery, because we are not deploying anything to the customer: the Resource Connector is delivered.
As for user expectations: the user expects a reliable and available system, where you can have frequent updates, it is easier to fix bugs, and bugs are fixed very fast. They expect consistency in the user experience, security and privacy, and transparency: you can see the logs, you can see what is happening in the jobs, and it has nice features. That is what users usually expect from a CI/CD system.
Our context is the Resource Connector product. The Resource Connector is a piece inside the Zero Trust Network Access suite. Cisco has this product to help customers guarantee zero trust in the network, adding security to the access to internal websites, systems, or any endpoint. The Resource Connector works as a kind of tunnel or proxy VPN and is deployed in the same subnet as the private app the customer wants to give access to, for users on their smartphones or computers. The access is not direct, but goes via the Zero Trust suite, using the Resource Connector to reach the app.
Because customers can be on different clouds and in different data centers, and for our own reasons as well, we want to support different clouds. That is why we want to support Azure, Google, AWS, and so on, plus VMware and OpenStack. Wherever the customer is, we want to be there, because their interest will be there. That is why we want a product that is multi-cloud. And because the clouds may support TPM and may support Secure Boot, we want those safety features available in each cloud, so we want different boot modes. But internally we also need to be able to disable Secure Boot for internal reasons, troubleshooting, and debugging. That is why we need to build different boot modes.
We have a strategy of building in different stages. First we create a hardened image: from the supplier, Canonical, we get Ubuntu, and we harden it, installing and configuring the OS with security checks, mandatory regulations, and anything else necessary to create a secure and hardened OS. We use that OS in the product and in our own internal builds, runners, internal services, anything we need. Then, with the hardened image, we create a product base image, where we set up the partitions, run OS upgrades, update some software internally, and get the latest versions of everything with known CVEs addressed, and so on. That is the base image. Then we use it as an input to create the boot modes: it takes the base image, and then we install the Cisco proprietary code, for example AnyConnect and the other software that makes the product what the product is, and we enable or disable boot modes, selecting which boot modes to enable in that VM image.
We also have code overlap. We don't want to copy and paste code all over the place; we want to share code. For example, the code to create the base image is similar to the code to build the product. So instead of having one file for Azure, one file for VMware, one file for AWS, and so on, we have a build script that does roughly the same thing across all the clouds, which is easier to maintain. On the other hand, this is tricky: if you change one line, you can break something you are not even thinking about. You change something intended only for AWS and then you find out you broke something for Azure. So we need to double-check everything we support, and every time we support a new boot mode or a new cloud, the pain increases. The more we support, the more the pain grows. This is very critical for our developers to support, and very painful as well.
Because of that, we came up with an approach of supporting the different stages in different repos. Every stage is a repo: the hardened image is a repo, the base image is a repo, and the product image is another repo. And we use GitHub Actions, because we are using GitHub, so we place the logic in GitHub Actions and use the actions as functions. With arguments and values we have a contract where I say, okay, for a build, these are the inputs it receives and these are the outputs I expect. What it does internally, I don't care, because that concern is self-contained inside that component.
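To make that "action as a function" idea a bit more concrete, here is a rough sketch of a workflow job calling a composite action with an input/output contract. This is not our real code: the action path, input names, and output name are made up for illustration.

```yaml
# Hypothetical example: a workflow job treating a composite action as a
# "function" with an input/output contract. Path and names are illustrative.
jobs:
  build-hardened:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Call the composite action: pass inputs, receive outputs.
      - id: build
        uses: ./.github/actions/build-image
        with:
          cloud: aws
          boot_mode: secure-boot
          base_version: "1.2.3"

      # Consume the output defined by the action's contract.
      - run: echo "Built image ${{ steps.build.outputs.image_id }}"
```

The caller only knows the contract; how the image is actually built stays inside the action.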
We use Packer to build the images. Packer is infrastructure as code and is nice to use for building images for different cloud providers. We use Terraform to deploy, because we need to build and test: we need to deploy the machine and everything around it, the environment necessary to test the VM. We also use Ansible, both to build, running command lines during the build, and to test, to check that things are working properly inside the machines. So we use Ansible to test, and we need a regression test strategy.
Every time we change a critical part of the infra code, in any stage, we need to run a full regression test. If we change one line of a script that is shared in the hardened image, we need to test the hardened build, then the base image build, then the product build, then the deployment and everything else. If everything is fine, then we can say it is safe, our build is okay, and we can merge that code without any problem.
That is how we structured the project. We have four different repos: one for each build stage, so the hardened image, the product base, and the product itself, plus a tester repo where we have sanity tests, integration tests, performance tests, vulnerability scans, anything that concerns the different stages of testing. We place all of that in the tester repo and call it from the other build repos in a cross-pipeline strategy.
The builder suite has a similar folder structure across all repos, so it is easier to maintain. When a developer reads one repo, of course some details are different because the context is different, building the hardened image is different from building the product, but the structure is similar. So it is easier to maintain: if you know how to maintain one repo, you know how to maintain the others. We use reusable workflows from GitHub, and then composite actions, which are like components, functions in high-level terms, that let us use inputs and outputs and keep self-contained logic for each cloud. If a cloud needs some secret or some specific configuration, we don't care, because it is self-contained; we just pass what we need, like build with this version, with these parameters, and give me an image with this name, and we get that back, calling it via the reusable workflows.
Ansible files are used to build: we use playbooks with some roles and tasks that give us the freedom to install and configure the VM as we want. With those Ansible tasks it is very easy to provision the deployment. And we use Packer scripts to build the cloud images, which is multi-cloud supported; infrastructure as code really is the stack we use. The dependency file is how we manage pinned versions.
We use pinned versions across all the repos, so that when we change something, it is committed and reviewed in the code review on the PR. We know exactly what we changed, and then we can track back by the commit hash when a change was made and how some bug was introduced. So we can trace back, find the bug, and fix it very fast.
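As a rough sketch of what such a dependency file could look like — the file name, keys, and version numbers here are all hypothetical, not our real ones:

```yaml
# Hypothetical versions.yml: every input of a stage is pinned to an exact
# version, so rebuilding the same commit produces the same image.
base_image:
  name: base-ubuntu-22.04
  version: "2024.05.1"        # exact build of the previous stage
packer_plugins:
  amazon: "1.3.1"
  azure: "2.1.0"
proprietary_components:
  connector_agent: "4.10.2"   # illustrative component name
ansible_collections:
  community.general: "8.5.0"
```

Bumping any of these is a one-line diff in a PR, which is what makes the commit history traceable.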
The test suite has a structure similar to the build suite, but instead of Packer we use Terraform. We use Terraform to deploy the VM and the environment around it, any resource necessary for the deployment in that cloud. We also use common Ansible files; in this case not to provision the VM, but to run tests. Each Ansible role is one type of test. For example, let's assume our product needs to create some files and folders, and we need to guarantee that those files and folders are in place inside the machine, because they are critical for the running software. One type of test could be a test that SSHes into the machine, checks that the file and the folder exist, and checks that the file has the format and the content expected for it to execute with success. If everything is fine, the Ansible tasks run with success. If something is wrong, we ignore it: we let the errors be ignored, and later, when everything has executed, we summarize the successful, failed, and ignored tasks. If there are any errors, it means we ran all the tests but we mark the run as a failure, because we should not really be ignoring anything; it means we have a failed case. That is how we use this strategy to build and test.
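A minimal sketch of that "run everything, summarize at the end" pattern in Ansible might look like this. The file paths, variable names, and checks are made up; the point is the ignore-and-summarize flow.

```yaml
# Hypothetical test tasks: each check may continue on failure; a final task
# fails the play if anything failed. Paths and names are illustrative.
- name: Check that the required file exists
  ansible.builtin.stat:
    path: /opt/product/config.yaml
  register: config_stat

- name: Validate the file content
  ansible.builtin.command: grep -q "expected_key" /opt/product/config.yaml
  register: content_check
  ignore_errors: true
  changed_when: false

- name: Record test results
  ansible.builtin.set_fact:
    test_failures: >-
      {{ (test_failures | default([]))
         + (['config file missing'] if not config_stat.stat.exists else [])
         + (['config content wrong'] if content_check is failed else []) }}

- name: Summarize and fail if any check failed
  ansible.builtin.assert:
    that: test_failures | default([]) | length == 0
    fail_msg: "Failed checks: {{ test_failures }}"
```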
But why this stack? I already gave a hint, but let's go deeper and look at some of the nice features, starting with GitHub Actions. Because we use GitHub for our repos, it was very natural to use GitHub Actions: it has seamless integration, and you don't need an external CI/CD like Jenkins or another tool commonly used in the community. We chose GitHub Actions because it gives us efficient CI/CD pipelines. You can customize it because it is written in YAML files, which are human-readable, and developers can easily extend it or create nice features. You can create the whole end-to-end automation: you can build, test, and deploy everything in one place, instead of building in one place, testing in another, and deploying in a third. That is very handy and covers many aspects of the software delivery and development lifecycle. That is why we chose GitHub Actions.
And why Terraform? We chose Terraform because of infrastructure as code: we can define the infra with code in a very consistent way. If we need to change something in the infra we deploy, we can use it to always reach the desired state in that cloud. Because we need to support different clouds, on-prem and also public clouds, Terraform supports that; you can even create your own provider if your environment is not officially supported by the community or by HashiCorp. It also has resource graphs, which means you can create resources in the order that makes sense and that makes your environment work. If you need to deploy the VM and then, after the VM, run some scripts that call some APIs, you can decide that order, using depends_on properties that say, okay, run one after another, and guarantee everything works as it should. You can reuse components: for example, if you need to register something in our builds for all clouds, you can create a registration module and reuse it across the different clouds, so you reduce copy-pasted scripts. And you have state management.
We build in one job and deploy in another job, and that deployment can run for a few minutes, a few hours, or we may even want to leave the machine up for a few days for longevity tests. So we want the build, the test, and the teardown in different jobs, and they are individually independent. But when we need to destroy, we need to know what we are destroying, so we need a state to know what to destroy. That is why we use Terraform: it is easier, we just pass, okay, here is the state file, please do a teardown of it, and it has everything you need to know about what was deployed in AWS, Azure, or any other cloud. And let's assume the deployment did not succeed: something was deployed and something failed to apply, so you need to do a cleanup. Because Terraform has the state of what was and was not deployed, you can say, okay, run a cleanup that destroys everything that was created, and you have a clean lab. You are not leaving anything behind, which is good for reducing trash, anything that should be deleted. And Terraform has many integrations with the community ecosystem.
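Here is a hedged sketch of that split into separate deploy, test, and teardown jobs, with the Terraform state handed between jobs so the teardown knows what to destroy. It is illustrative only: the paths, inventory, and playbook names are made up, and a real setup might use a remote state backend instead of an artifact.

```yaml
# Hypothetical workflow: deploy, test and teardown are separate jobs.
# The Terraform state is passed as an artifact so teardown knows what to destroy.
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform -chdir=deploy init && terraform -chdir=deploy apply -auto-approve
      - uses: actions/upload-artifact@v4
        with:
          name: tfstate
          path: deploy/terraform.tfstate

  test:
    needs: deploy
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ansible-playbook tests/sanity.yml -i inventory/aws.ini

  teardown:
    needs: [deploy, test]
    if: always()              # clean up even when the tests fail
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: tfstate
          path: deploy
      - run: terraform -chdir=deploy init && terraform -chdir=deploy destroy -auto-approve
```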
Why did we choose Packer? Packer is very quick at building images, with parallelization, which is what we need: it builds very fast, and we want to build for different clouds and different boot types in parallel. We want to support different platforms, as we already saw; we want to be where the customer is, so we need to build for different platforms. It also has seamless integration, so we can integrate it with all our tools. We want security, so we want to use trusted sources, which Packer supports as well, and we use different plugins, and those plugins are widely supported by the community. That is also a nice feature.
And, like Terraform, Packer is infra as code. Why does that matter for building images? At least for the product, we have a mandatory reason: from the same inputs, the same files, we need to generate the same output. That is why we use pinned versions and infrastructure as code, because we can guarantee that the same commit generates the same result over and over. This gives us trust for an audit or anything else where we need to prove: yes, this binary, this image, is this code, and you can guarantee that because if you rebuild, you get the same output.
Why do we use Ansible? Because it is agentless: we don't need to install anything in the products to be able to test or build. We just work over SSH, very simply, so we can run commands as if we were doing SSH on the command line manually, but in an automated and programmatic way. It uses YAML files as well, and like GitHub Actions it is human-readable, so it is easier to understand and easier to change. The documentation is good, and it has many plugins; for example, if you want to use AWS not through the built-in modules but by calling the AWS CLI instead, there is a plugin for that. So it has many different plugins, the automation is predictable because it is idempotent, and the community is very wide, with many plugins and collections you can install from the Galaxy repository.
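For illustration, a minimal agentless playbook of the kind described above could look like this. The host group, packages, and paths are placeholders, not our real configuration.

```yaml
# Hypothetical playbook: Ansible connects to the deployed VM over SSH
# (no agent on the target) and runs idempotent configuration tasks.
- hosts: connector_vms
  become: true
  tasks:
    - name: Ensure required packages are present
      ansible.builtin.apt:
        name:
          - curl
          - chrony
        state: present
        update_cache: true

    - name: Ensure the product directory exists
      ansible.builtin.file:
        path: /opt/product
        state: directory
        mode: "0755"
```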
Now let's talk about the workflows. Here we have the build pipeline workflow: when we change the code, we create the hardened image, then we use it as input to create the product base, and the product base is used to create the product image, with Secure Boot or without Secure Boot, with BIOS, in this example. Now imagine we need to support Azure and other clouds as well, and the number keeps increasing; that is our case, right? And then for each one we need to run tests: deploy the infra, run the tests, and tear the infra down again. So here is what the full pipeline looks like. There is one orchestrator that knows what to trigger: it builds the hardened image, then for each hardened image in each cloud we create a product base image, then we create the product image with the proper boot mode, and for each product image that was built we trigger a deploy, a sanity test, and then a teardown of the machine.
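To make the orchestration idea more concrete, here is a rough sketch of stages chained with needs: between reusable workflows. It is not our real file; the workflow names, inputs, and outputs are made up.

```yaml
# Hypothetical orchestrator: each stage is a reusable workflow, chained so the
# hardened image feeds the base image, which feeds the product image, which is
# then deployed, tested and torn down. All names are illustrative.
jobs:
  build-hardened:
    uses: ./.github/workflows/build-hardened.yml

  build-base-aws:
    needs: build-hardened
    uses: ./.github/workflows/build-base.yml
    with:
      cloud: aws
      hardened_image: ${{ needs.build-hardened.outputs.image_id }}

  build-product-aws-secureboot:
    needs: build-base-aws
    uses: ./.github/workflows/build-product.yml
    with:
      cloud: aws
      boot_mode: secure-boot
      base_image: ${{ needs.build-base-aws.outputs.image_id }}

  test-aws-secureboot:
    needs: build-product-aws-secureboot
    uses: ./.github/workflows/deploy-test-teardown.yml
    with:
      image: ${{ needs.build-product-aws-secureboot.outputs.image_id }}
```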
So how can we guarantee regression tests in this scenario? I will give you an example: a user changes a shared Ansible task in the hardened suite via a PR, say adding a new security check or security configuration to the hardened image. This may affect the whole chain, right? You are adding something that changes the OS, which may affect the base image, which may affect the product, and that affects the product itself and the tests. So we need to guarantee that any change in any stage is fully tested. In this case we build a temporary hardened image; then we use that temporary hardened image as input to create a temporary product base; then we use the result of that to create a temporary product; then, for each boot mode, we deploy that temporary product, run the tests, and tear it down. If everything passes on all the clouds and all the boot modes affected by this change, then we are fine to merge, and we merge.
Once it is merged, we build the final hardened image, and that automatically creates a PR with the new latest hardened build against the base OS repo. Then, the same thing: because you are creating a new version, you need to check that everything is still working, because in the base image not everything is pinned. We run OS upgrades, we need to update Python, we need to run Ubuntu updates and things like that, which are not possible to pin unless we have a cache. So every time we build a final hardened image, we create a PR, the PR runs the regression tests to check that everything is working, and when the regression tests pass, we merge. Then we create a final base image, and it is the same thing: we create a PR against the product repo and verify that everything is fine. But for the product image we are not running anything like an OS upgrade; we pin everything: the base image, the Cisco proprietary code, and all the versions are pinned, because we need to reproduce the same build over and over. So we don't need to run regression tests on that PR; when we create the PR with the new versions, we are just creating a build. We will see a bit more on that later.
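The automatic "open a PR in the downstream repo with the new version" step could be done roughly like this. This is just one possible sketch using the gh CLI; the repo names, file paths, secret names, and the sed expression are all hypothetical.

```yaml
# Hypothetical promotion job: after a final hardened image is published,
# open a PR on the downstream base-image repo that bumps the pinned version.
jobs:
  open-downstream-pr:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: example-org/base-image      # hypothetical downstream repo
          token: ${{ secrets.DOWNSTREAM_PR_TOKEN }}

      - name: Bump the pinned hardened image version
        env:
          NEW_VERSION: ${{ github.ref_name }}
        run: |
          git checkout -b bump-hardened-${{ github.sha }}
          sed -i "s/^  version: .*/  version: \"${NEW_VERSION}\"/" versions.yml
          git config user.name "ci-bot"
          git config user.email "ci-bot@example.com"
          git commit -am "Bump hardened image to ${NEW_VERSION}"
          git push origin HEAD

      - name: Open the PR
        env:
          GH_TOKEN: ${{ secrets.DOWNSTREAM_PR_TOKEN }}
        run: gh pr create --base main --title "Bump hardened image" --body "Automated version bump"
```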
About the promotion testing I was describing: we create a PR that changes the version, the version of the base OS, or the version of AnyConnect, or anything else in the Cisco proprietary code, where we have different repos producing binaries and other versioned artifacts. If this file changes, we check: does this change affect all clouds and all boot modes, or only one specific boot mode on one specific cloud? Based on the permutation of the change that was made, the pipeline runs a script that decides which pipelines should execute: all pipelines for all clouds, only one specific boot mode on one specific cloud, a subset of clouds, and so on. This creates real product images. Then, with those images, we deploy, run the tests, and tear down, so we know it is fine, it is working. Once we merge, the image created in the PR is used for further testing: we scan it for vulnerabilities, CVEs, and other issues, and we run performance and longevity tests. Then we place it in staging for a few days, where we run all kinds of manual tests and other checks to make sure everything is fine, and we also run the cloud providers' scans, vulnerability checks, and other assessments. And then, when we are ready, we are ready for GA. That does not necessarily mean we are going to deliver it right away, but it is ready: if we want to deliver a new version, we just have to write the change log for the customer, saying what we are changing and why, and then it is ready to go to the marketplaces across the clouds. That is how we can go end to end, from one image change to the marketplace. Now, on-demand developer builds.
We also need to guarantee productivity and make developers' lives easier, so they don't have a pile of toil just to get some builds and some machines. In one case, for example, a user wants to pick some permutation of real versions, or dev versions of everything, and create an image using that as input. Then the developer can run some tests, or whatever is needed earlier in the cycle, before the code lands in the repository. You are still running code that is not mature enough to be merged, but you need some result, some artifact that you can test, that you can deploy, and that you can use for troubleshooting or analysis on the dev side.
Another case is when you have an image, either the image we built in the previous step or a real image, and you want to run manual tests. For example, you are changing the Ansible tests, adding new cases to get more coverage. That changes the code, so if you commit and push, it triggers the regression tests for all clouds. But that is not very effective, because if you have a bug you have to push over and over, and each push will create and destroy, create and destroy, which takes time; the feedback loop is not very fast, so it becomes time-consuming. One option is a debug deployment: you choose exactly which version you want to deploy, and you deploy it. We create the environment and you get ephemeral SSH access for the time you decide: I need a one-hour deployment, I need one day; the developer decides how much time they need, and after that the teardown happens automatically. During this access you can troubleshoot, or you can run an Ansible playbook: you change the playbook locally, you don't need to push or commit, you just run it and see the results. Okay, it is working, now I can push. No, it is still not working, I have a typo, I have a bug: you just fix the code and run the playbook again, fix and run again. It is a very fast feedback loop, very handy for developers. These are features we have in our CI.
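An on-demand developer build like the one described above could be triggered manually with workflow_dispatch inputs. Again, a hedged sketch: the input names and the reusable workflow it calls are hypothetical.

```yaml
# Hypothetical on-demand build: the developer picks the versions by hand and
# triggers the workflow manually to get a throwaway image to test against.
on:
  workflow_dispatch:
    inputs:
      cloud:
        type: choice
        options: [aws, azure, gcp]
        default: aws
      boot_mode:
        type: choice
        options: [secure-boot, bios]
        default: secure-boot
      base_image_version:
        description: Base image version to build on top of
        required: true

jobs:
  dev-build:
    uses: ./.github/workflows/build-product.yml   # hypothetical reusable workflow
    with:
      cloud: ${{ inputs.cloud }}
      boot_mode: ${{ inputs.boot_mode }}
      base_image_version: ${{ inputs.base_image_version }}
```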
About the code samples: I have some screenshots here of how we organize the repos. In the first column is the build repo. This is how we lay out the structure: the .github folder with the actions and the workflows, the Ansible files, and the Packer files, with a Packer script per boot mode. That is how we know, okay, for Secure Boot we code it one way, for BIOS another way, for TPM another way. This is just a sample of how we can build for different cloud providers using Packer.
For the testing repo we have a similar structure, as you can see: actions, and inside the actions we have the Ansible test, the setup, and the teardown actions. Why three? Because we want separate components that we can use at different times: I want to run the setup at one point, run the Ansible testing at another time, and destroy things at another time. That is why there are three different components. And why is it split per cloud? When we deploy to AWS, we need to know the specifics: the subnet, the VPC, and other AWS-specific settings for access and so on. For Azure we need the subscription, the resource group, and other details; for VMware we need to know where the lab is and the vSphere information. This is self-contained, so those concerns stay inside the component, and that is why we have these self-contained actions. Then we have the workflows, same idea: these workflows call the composite actions based on what we are building. And as you can see, we also have a callback plugin; this is how we do the testing I mentioned before, where we collect all the stats from Ansible, check what passed and what did not, and then break the pipeline or not based on how many tests passed.
Now a bit about the composite actions. For those who don't know them, a composite action looks more or less like this: you have a name, a description, inputs, and outputs. The inputs are what you pass when you call it, and the outputs are what you expect back after the call; using "composite" as the runs type is the syntax. If you want to know more about the syntax, I recommend the GitHub documentation: it is well documented, very simple to implement, with some very nice examples, and you can try it yourself on the GitHub SaaS. In the second column we have the steps. The composite is a grouping of steps, and a step is basically one coherent group of commands that you run inside your job. In this case, for Packer, we are initializing Packer, validating, and then building the machine for AWS; that is how we build an image. Here, instead of hard-coding the AMI name, we receive the Secure Boot mode as an input and make it customizable, so the AMI name includes the Secure Boot mode, unlike the base or the seed image, which is why I chose this example.
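Since the actual screenshot is not reproduced here, a minimal sketch of such a composite action could look like this. The input names, Packer template paths, and the way the image id is extracted are assumptions for illustration only.

```yaml
# Hypothetical composite action (action.yml): the inputs/outputs are the
# contract; internally it runs Packer for the requested cloud and boot mode.
name: build-image
description: Build a cloud image with Packer
inputs:
  cloud:
    description: Target cloud (aws, azure, ...)
    required: true
  boot_mode:
    description: Boot mode to enable (secure-boot, bios, ...)
    required: true
  base_version:
    description: Pinned version of the upstream image
    required: true
outputs:
  image_id:
    description: Identifier of the built image
    value: ${{ steps.build.outputs.image_id }}
runs:
  using: composite
  steps:
    - name: Initialize and validate the Packer template
      shell: bash
      run: |
        packer init packer/${{ inputs.cloud }}
        packer validate -var "boot_mode=${{ inputs.boot_mode }}" \
          -var "base_version=${{ inputs.base_version }}" packer/${{ inputs.cloud }}

    - name: Build the image
      id: build
      shell: bash
      run: |
        packer build -var "boot_mode=${{ inputs.boot_mode }}" \
          -var "base_version=${{ inputs.base_version }}" packer/${{ inputs.cloud }}
        # How the image id is extracted depends on the builder; this line is a placeholder.
        echo "image_id=$(jq -r '.builds[-1].artifact_id' packer/${{ inputs.cloud }}/manifest.json)" >> "$GITHUB_OUTPUT"
```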
Now the reusable workflow. A workflow is basically a script where you have jobs, and each job has steps or groups of steps; you can use several composite actions or specify the steps directly in the jobs, and you can have many jobs in the same workflow. In this case we have a reusable build pipeline with one job, the build, that assembles the contract to call the composite action. As you can see in the last lines of the second column, we say: this is the platform, this is the tag, the version of the code I want to use, and here are my inputs, this version is this. With those inputs, the composite knows how to handle the information and build a base image from the seed and the versions we want, including the version we want to use as the version suffix.
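Without the screenshot, here is roughly what such a reusable workflow can look like, exposing its own contract via workflow_call and delegating to the composite action sketched earlier. Input names and values are hypothetical.

```yaml
# Hypothetical reusable workflow (build-base.yml): it exposes inputs/outputs
# via workflow_call and delegates the build to the composite action.
on:
  workflow_call:
    inputs:
      cloud:
        type: string
        required: true
      seed_version:
        type: string
        required: true
    outputs:
      image_id:
        value: ${{ jobs.build.outputs.image_id }}

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image_id: ${{ steps.build.outputs.image_id }}
    steps:
      - uses: actions/checkout@v4
      - id: build
        uses: ./.github/actions/build-image
        with:
          cloud: ${{ inputs.cloud }}
          boot_mode: bios                 # the base image does not select a boot mode yet
          base_version: ${{ inputs.seed_version }}
```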
So how do we call this reusable workflow? We have another workflow that is executed on PR changes, and this PR workflow has different jobs. It has a job that generates the pipelines, to know which ones to run: we say, okay, I changed the spec file, compare it with the spec file on main and give me the diff, and based on that we create a matrix of pipelines, and based on that matrix we call the jobs. So the workflow has these jobs: pipeline AWS BIOS, pipeline AWS with Secure Boot, and so on, as jobs in this workflow. We could use a matrix to generate this automatically, but that does not work well, at least today, with chains of pipelines, a build that calls another build: the graph does not render well, it becomes one unique block and you cannot see the flow, it is a bit hard to read. So it could be done automatically, but we decided to hard-code this block, this small piece of configuration, to guarantee that we can see everything in the graph. We will see that in another screenshot.
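One possible shape of that change-detection job plus hard-coded pipeline jobs is sketched below. The file name, grep conditions, and output names are made up; the real logic that decides which permutations are affected would be more involved.

```yaml
# Hypothetical change-detection job: diff the pinned-versions file against
# main and expose flags that the hard-coded pipeline jobs use in their `if:`.
jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      aws_changed: ${{ steps.diff.outputs.aws_changed }}
      azure_changed: ${{ steps.diff.outputs.azure_changed }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - id: diff
        run: |
          CHANGED=$(git diff origin/main...HEAD -- versions.yml)
          echo "aws_changed=$(echo "$CHANGED" | grep -q 'aws' && echo true || echo false)" >> "$GITHUB_OUTPUT"
          echo "azure_changed=$(echo "$CHANGED" | grep -q 'azure' && echo true || echo false)" >> "$GITHUB_OUTPUT"

  pipeline-aws-secure-boot:
    needs: detect-changes
    if: needs.detect-changes.outputs.aws_changed == 'true'
    uses: ./.github/workflows/build-product.yml
    with:
      cloud: aws
      boot_mode: secure-boot
```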
In this slide you can see that we have a job that generates the pipelines. In this case we are generating for everything: we generate the base image for AWS, for Azure, and for VMware, and then the base image for AWS generates builds for BIOS and for Secure Boot, and each one generates a setup of the environment, then the tests, then the teardown. Azure, in this simple case, only triggers BIOS, and we only support VMware BIOS here, just to show that we can have a different number of boot modes per cloud, and each one, after the base image is built, triggers the following stage.
In the debug VM pipeline we have an example of how a developer can set up the machine and run it for one hour: because we are using the environment rules from GitHub, it automatically waits for one hour and then triggers the teardown. So you have a one-hour window to do whatever you want inside that machine; after that you lose access and the machine is destroyed, together with everything that was deployed in the first job.
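A rough sketch of that pattern, using a GitHub environment with a wait-timer protection rule to hold the teardown job for the chosen window. The environment name, paths, and the artifact hand-off are assumptions; the wait timer itself is configured on the environment, not in the YAML.

```yaml
# Hypothetical debug pipeline: the teardown job targets an environment
# configured with a wait timer (e.g. 60 minutes), so it runs after the window
# and destroys everything the deploy job created.
jobs:
  deploy-debug-vm:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform -chdir=deploy init && terraform -chdir=deploy apply -auto-approve
      - uses: actions/upload-artifact@v4
        with:
          name: tfstate
          path: deploy/terraform.tfstate

  teardown-debug-vm:
    needs: deploy-debug-vm
    if: always()
    environment: debug-1h        # environment with a 60-minute wait timer
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: tfstate
          path: deploy
      - run: terraform -chdir=deploy init && terraform -chdir=deploy destroy -auto-approve
```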
Some crucial takeaways. We learned that with this approach the developers don't need to deploy the ephemeral infra themselves: if they are running integration tests, it is done by the pipeline, and if they need a machine for manual or dev reasons, they can deploy one just by passing some parameters, and the pipeline applies it for them. The worst-case scenario we identified was three hours: we change the hardened image, build everything, and test, and it takes three hours. So if we find a CVE or a security issue in the hardened image, we fix it, the PRs are created, we merge, and within three hours everything is tested and working, ready to be pushed to the marketplace if we need to. We can easily add more providers without affecting the build and testing time, because they run in parallel and do not depend on each other. And because we are using infrastructure-as-code principles along with pinned versions, we can find bugs faster: if we change something, one pinned version for example, and something starts to break, we know, okay, it was working with that version and the new version is not working, so we can identify the reason for the break. And the pipeline is modular: the differences between the cloud providers are self-contained, so they don't affect the rest of the pipeline or anything outside their component.
Here I have placed some links where you can learn how to use this stack. I think this stack is pretty nice, and these are very good getting-started resources, so if you want to build a similar implementation in your pipeline, you can start with these links. And here are my social networks: you can find me on LinkedIn, you can send me a DM, and I will be happy to answer any questions you have about this talk, or to give more details about the implementation or any other subject. If you want to follow me on GitHub, here is my profile as well. I appreciate your time listening to my presentation, I hope you have enjoyed this session, and have a good conference as well.