Abstract
DevOps is all about orchestrating the development and deployment pipeline, and in order to orchestrate your pipelines you need to secure your agent. Which option should you choose, V-Net or Public Network? V-Net takes more upfront work but offers customization. Public Network takes less effort to implement, but has limits when applied to individual projects. In this talk, Hong Bu will discuss the pros and cons of securing an agent with V-Net or a Public Network, and when you might want to go with one over the other. Every security agent has limitations, but with this guide you will be able to choose what is best for you.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Now let's see how we can secure our
DevOps agent. And before that we also
need to know why we need to secure the DevOps agent.
What issues might happen if I didn't do
it? So my team inside of Microsoft does
a lot of code-with engagements with strategic
or big customers to help launch their solutions on
our cloud. So as a program
manager of the projects, it's my biggest goal
to ensure that all the projects are delivered on time with
high quality. And that's where understanding
the time spent on security design,
implementation and planning comes in.
And all my sharing, artifacts
and learnings here come from my teammates and
myself from a previous customer engagement.
So what happened inside of that project?
Well, in that customer project we were using
Azure DevOps, or ADO,
to orchestrate the automated development
and deployment, or CI/CD. So this
is quite a common practice and ADO is widely used
by a lot of our customers.
And here we were using a Microsoft-hosted
agent, which runs in the Microsoft
public network, and at the same time
all the customer resources were located inside
their virtual network, or VNet, which is under protection.
So they were using an Azure web application in
the front to ingest and
accept all the requests from Internet users
and then route these back to the VNet
resources to respond to their end-user requests.
And at the same time they had
access control on the web application's SCM
site. For people who may not
be aware of an SCM site, well, it's
actually the engine behind any
Azure web application for deployment.
So that means if you want to deploy
any upgraded or new version to the web application,
you need to deploy that first to the SCM
site of the web application and then make
the upgrade to the app. So you can imagine
it's quite crucial to protect and secure
access to the SCM site.
And this is the same for our customer.
So they set a restriction on the SCM
site of their web application so that only
requests from their virtual network are allowed
to access that SCM site.
However, our Microsoft-hosted agent,
which runs all the pipelines for the automation orchestration,
is located in the public network, and this is
by design. We couldn't change it.
So how can we solve this problem?
Before that, you may have a question in your mind: why
were you using a Microsoft-hosted agent, which runs
on the public network and caused this challenge?
Well, the reason is it's so easy,
it's so easy to start, because the Microsoft-hosted
agent provides so many virtual machine images that you can choose from.
So you don't need to worry about how to build the different
images from scratch but can quickly start your deployment
and development work. And also, a
lot of the Microsoft-hosted agent images have
prebuilt tools, runtime dependencies and libraries,
so you don't need to worry about installing
these tools and can focus on your own pipeline
and build orchestration.
And what's more, the Microsoft-hosted agent is a managed
service on Azure. So that means it's totally and
fully maintained and also protected by Microsoft's Defender
systems and security
mechanisms, and Microsoft also keeps upgrading and
updating the software, so you don't
need to worry about any of the maintenance. That makes
everything easier in your project, so you can save a
lot of time and effort to start your work.
So we chose the Microsoft-hosted agent for good reasons,
but we also encountered the challenge that this agent
is not able to deploy anything
to the SCM site of the web application because of the
customer's security rules. So we came
up with our first solution, which seemed to be
a very easy one.
So we know that every Microsoft-hosted
agent, when it is running, is allocated an
IP address. If we add that IP address
at runtime to the allow list of the web application
in its access control configuration, then
we can temporarily get a pass to the SCM
site and then make our deployment. After the deployment
we can remove the IP address from the allow list,
so that the SCM site will be protected from
then on. We only need that window of time of allowance.
So you may ask, did this work?
Yes it did, but something unexpected
happened during the middle of the project.
So we were in the middle phase of the engagement
and I received an urgent call from our customer's project
manager. He told me that his company's
security team had raised a high-severity ticket against our project
and said they had detected that the web application
we were running was
temporarily using an allow policy that
permitted access from
an IP address coming from the public network.
Even though this was added temporarily, it was
regarded as a big security leak under their security
policy, and this operation needed to be
abandoned and stopped immediately and never happen again.
Well I was shocked.
The CI/CD pipeline was
running all the time, and
my team was automating the deployment
using the Microsoft-hosted agent and these pipelines to
make our upgrades all the time.
If the allow list is forbidden, that means we cannot make any further
deployment to the web application and our whole work will
be stopped and suspended.
But we were in such a critical phase of this project and
we could not afford any delay. What could I
do? Well, I gave the customer's
security team a quick call to try
to understand the reason. It turned out that the
security team ran a regular scan of
all the services running on their cloud network
and they had found this security leak, as
they mentioned before. And I explained to
them that, you see, the allow-list entry
was added only temporarily during the deployment window,
and after the deployment it would be removed, and
it was only for our testing and development purposes.
And also, the Microsoft-hosted agent is maintained, backed
and supported by all the Microsoft security tools
and policies, et cetera, so you don't need to worry about
it. But the customer's security
team was also very firm. They said that even though
we were using this only temporarily for the deployment
window, the Microsoft-hosted agent is
running in the public network. No one
can guarantee that it won't be attacked by a hacker
during that deployment window. And in
that case it means their enterprise
network would be at security risk
at that moment, and that is not allowed
under their zero trust policy.
Well, I knew this was something that I could not further
negotiate. What I had to do was immediately find
a workaround or a solution to solve this issue as soon
as possible. So I immediately
held an urgent meeting with our engineering team and
we came up with another solution. It's also straightforward.
So instead of using the Microsoft-hosted agent, which can
only run in the public network, we would use
a self-hosted agent pool or a virtual
machine scale set built by ourselves,
which would run inside the same virtual network
as the customer's other resources. In that case
it would be in the allow list and would be permitted
to make the deployment to the SCM site of
the web application by default.
But the self-hosted agent needs
all of our effort, starting from zero:
building the customized image by ourselves
and also installing the tools, dependencies, runtimes and libraries,
all these things by ourselves. And we need to run
it and test it to ensure it works. And the worst thing
is sometimes you don't know what is missing, what library is
missing and what tools need to be installed until
the pipeline fails on the
agent. And it will be very time consuming.
But did we have another choice? Not likely.
And then the next week was a
very critical phase of our project. We divided the
project team into two. One was doing the
continuing development in our own subscription
so that we wouldn't lag too much in progress.
At the same time, the other team was building the self-hosted
agent by ourselves and doing all the testing
to ensure that the pipelines worked without issue.
So there was a lot of back and forth. As
I mentioned, you don't know what is missing and what is wrong until
the pipeline fails on the newly built agent, and the
troubleshooting really took time.
It was a painful journey. But after
one week we finally had a stable agent on which
all of our CI/CD pipelines could run successfully
without any issue. And then
we immediately moved all of our work
onto this newly built agent and started to orchestrate
our pipelines from there and make the continuous deployment
to the SCM site of the application, and
that resumed all of our work.
Luckily it didn't cause a big impact on
our project and we finally met our timeline.
So, a good ending.
In order to illustrate this procedure I put
together a demo beforehand and also made a
recording of that procedure. So now let me
play the video and also explain the
procedure step by step at the same time. So this is a very
simple application I made. When I
type in my name it will send back a
greeting to me.
So as I introduced, I used
access control to protect my application.
So if you look here, this is my web application.
So inside the network settings there are
access control policy settings.
All right, so this is the detailed setting. If you look
here, there are two parts. One is about
the main site. So this is about the
access control of Internet
requests to this application. So by
default it allows all requests from the Internet.
And on the other hand there is the advanced
tool site. So this part,
as explained before, is about
the definition of access to the SCM site, or
which requests will be allowed
for deployment on the web application.
So by default, for any unmatched
rule, access will be denied, and
only access from the VNet will be allowed to
reach the SCM site of the web application.
So let's see how my pipeline
will work.
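
The rule behaviour described here can be modelled with a small sketch. It is a simplified illustration, not how Azure implements access restrictions: the `Rule` type, the `evaluate` function, the priorities and the `10.0.0.0/16` address space are all hypothetical, and the real service may match VNet traffic by subnet rather than by address range.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int   # lower number is evaluated first (assumption of this sketch)
    action: str     # "Allow" or "Deny"
    cidr: str       # address range the rule matches


def evaluate(rules, source_ip, unmatched_action="Deny"):
    """Return the action of the first matching rule, in priority order."""
    addr = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        if addr in ipaddress.ip_network(rule.cidr):
            return rule.action
    # No rule matched: fall back to the site's unmatched-rule action.
    return unmatched_action


# Hypothetical SCM-site rules: only the VNet address space is allowed.
scm_rules = [Rule("allow-vnet", 100, "Allow", "10.0.0.0/16")]

print(evaluate(scm_rules, "10.0.1.4"))     # an agent inside the VNet
print(evaluate(scm_rules, "20.42.10.7"))   # an agent on the public network
```

With these assumed rules, a request from inside the VNet range is allowed, while a request from a public address falls through to the unmatched-rule action and is denied, which is the situation the pipeline runs into next.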
And now these are my CI/CD pipelines.
This is the YAML file I wrote to run my
automatic deployment. Here I'm using
the Microsoft-hosted agent, which is
configured by default.
So let's see what the pipeline run result
will be.
All right, so here is the result.
If I look further,
you see the failure reason is IP forbidden,
and the URL it was trying to access is
the SCM site of the Azure web application.
So as I explained, this is because the
SCM site only allows requests from the virtual
network. So the IP address
of the Microsoft-hosted agent running in
the public network is not allowed to make the deployment.
So let's see how I go on with
my first solution, which is to add
the IP address to the access
control list of the web application.
So here, in the same place,
the advanced tool site of the web application,
I started to add a new rule to
allow the IP address of my Microsoft-hosted
agent, which I detected beforehand.
So here I know the IP
address the agent is running on.
So I directly added this IP address
to the rule list.
So here you see I entered all the settings
and added this rule.
Now let's see how I rerun the failed pipeline.
Okay, I made the deployment again to the web application.
All right, we see that the pipeline run is successful,
so the deployment is successful as well.
Just a tip: in reality it's
not feasible to add the IP address in
a static way. So in practice
I used a PowerShell script like this
instead. I'm using this script to fetch the running
IP of my Microsoft-hosted agent, and
then I add this rule directly using
this PowerShell script as well. So it
will be much easier. And after running the
deployment this IP address will be removed
from the access restriction rules of the web application.
So this is just one small tip for running
this in reality. But as you know, this
solution is forbidden by the customer anyway, so I
won't use this solution, and now I will remove this IP
address from the allow list again.
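
The PowerShell script itself is not shown in this transcript. As a rough sketch of the same add, deploy, remove pattern, the following Python wraps the Azure CLI `az webapp config access-restriction` commands in a context manager. It assumes the Azure CLI is installed and logged in; the function names, the rule name `temp-agent`, the priority and the `run` parameter are hypothetical choices for illustration, and how the agent's IP address is discovered is left out and passed in as `agent_ip`.

```python
import subprocess
from contextlib import contextmanager


def az_rule_cmd(verb, resource_group, app_name, rule_name, extra=()):
    """Build an `az webapp config access-restriction` command for the SCM site."""
    return [
        "az", "webapp", "config", "access-restriction", verb,
        "--resource-group", resource_group,
        "--name", app_name,
        "--rule-name", rule_name,
        "--scm-site", "true",
        *extra,
    ]


@contextmanager
def temporary_scm_allow(resource_group, app_name, agent_ip,
                        rule_name="temp-agent", priority=200,
                        run=subprocess.check_call):
    """Allow agent_ip on the SCM site only while the with-block runs."""
    add = az_rule_cmd("add", resource_group, app_name, rule_name,
                      ["--action", "Allow",
                       "--ip-address", f"{agent_ip}/32",
                       "--priority", str(priority)])
    remove = az_rule_cmd("remove", resource_group, app_name, rule_name)
    run(add)
    try:
        yield
    finally:
        # Always close the window again, even if the deployment fails.
        run(remove)
```

A pipeline step could then wrap the deployment in `with temporary_scm_allow("my-rg", "my-app", agent_ip): ...`, where the resource group and app names are placeholders. The `finally` clause is the point of the design: the allow rule is removed even when the deployment raises an error, so the window stays as short as possible.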
Now let's see how I move on with the second solution.
I started to build my self-hosted
agent by creating the image.
All right, so now you can see here, this is the
image I created, a customized image to
satisfy my pipeline's requirements. So here,
this is the image I created. It's quite a large one,
about 8 GB, and this is
the JSON file that describes this image.
So you can find a lot of documents and guidance
that tell you how to create your
own customized image. What I recommend is this
GitHub repository, which I found very helpful.
It provides detailed step-by-step guidance and instructions to
create your customized image. I followed this and
successfully created my own image like this.
So I recommend you use this as well.
And next, what I'm going to do is to use this
created image to build my virtual machine or
virtual machine scale set. But before that there's
a very critical step: in order to
build this virtual machine on Azure I need
to publish the created image to Azure.
So that means I need to publish this
to a place. So where is the place?
Here it is. So this is the self-hosted agent
gallery. You can think of the agent
gallery as the place
that hosts your image.
This is the image I just published to this agent
gallery. And what I can do is
to use the published image to create a virtual machine
or create a virtual machine scale set. And what
I'm going to do is to use this image to create my virtual machine
scale set. A virtual machine scale set is a pool
with a flexible number of virtual
machines as you define, so that it can
provide the convenience and the flexibility to
scale in or scale out your virtual
machines based on the workload of the running pipelines.
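
To make the scale-in and scale-out idea concrete, here is a toy calculation of how many virtual machines a pool might want for a given pipeline workload. It is only an illustration of the concept, not the algorithm Azure or Azure DevOps actually uses; the function name and the `standby` and `max_instances` parameters are assumptions.

```python
def desired_instances(active_jobs, queued_jobs, standby=1, max_instances=4):
    """Return how many VMs the pool should run for the current workload.

    One VM per running or waiting job, plus a few idle standby VMs,
    capped at the maximum size configured for the scale set.
    """
    needed = active_jobs + queued_jobs + standby
    return max(0, min(needed, max_instances))


print(desired_instances(0, 0))   # idle pool: only the standby VM
print(desired_instances(2, 3))   # busy pool: capped at max_instances
```

When no pipeline is running the pool shrinks to the standby count, and when jobs queue up it grows until it reaches the cap.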
All right, so this is the virtual machine scale set I created.
If you look here, this is the virtual machine scale set, and
the image I used is the one I just
published in the agent gallery. And if we
look at the settings here, it's within the
same virtual network that is in the allow list of
the web application's SCM site configuration.
So with these settings, I'm now going to
move back to my Azure DevOps. What I'm going to
do is to create a new agent pool.
Instead of using the default Microsoft-hosted agent,
I'm going to create a new agent pool and link it
to the virtual machine scale set I just created.
So inside the Azure DevOps organization
and project settings, there's a menu
on the left-hand side called Agent pools.
So starting from here, you can configure your
existing agent pool or add a new agent pool,
as I'm going to describe. So what I'm going to do is
add a new agent pool. So now here you
see I'm going to select my subscription and then
link this newly created agent pool to the virtual machine
scale set I just created.
Now I select the virtual machine scale set I created. That means I
will bind this new agent pool to this virtual
machine scale set. I name it ADO Pool, and
the other settings are done. All right, so you
will see there is a new pool
called ADO Pool. Let's click on it.
If we look at the agents, it lists all
the virtual machines that are in running
status, but right now there's no agent.
Why? Because the virtual machines I created
in the virtual machine scale set were not yet
in running status at that moment. So I moved back to
my virtual machine scale set and checked all the virtual
machines' status to ensure that they're in running status.
So that means my ADO pool is ready
to use.
Okay, now I move back to my ADO pool
and check the agent status again.
Yeah, let's see.
The two virtual machines listed under the agent pool are
exactly the ones you just saw in the virtual machine
scale set, which are in running status. Why are these
two agents in idle status? Because there's no pipeline
running, so there's no job on them.
But this is a very good signal. It means our ADO pool is
ready to work and our pipeline is ready to
run again. Let's see what the result
will be. And before running
the pipelines,
remember that now the new agent is the one
taking on the deployment work,
which is not the default setting, so
we need to change it. So in the YAML file,
look at the deployment job description, and in the
pool settings change the name of
the pool from the default Microsoft-hosted
agent to the new one I just created.
Okay, so this is very important, and then let's
rerun the pipeline.
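
The pool change described above comes down to one small section of the pipeline YAML. The sketch below prints the two variants side by side: a Microsoft-hosted pool is selected with `vmImage`, a self-hosted pool by `name`. The helper `pool_block`, the image `ubuntu-latest` and the pool name `ado-pool` are placeholders for illustration; the YAML file from the demo is not reproduced in this transcript.

```python
def pool_block(self_hosted_pool=None, hosted_image="ubuntu-latest"):
    """Return the `pool:` section of an Azure Pipelines YAML file as text."""
    if self_hosted_pool:
        # Self-hosted (for example scale-set) pools are referenced by name.
        return f"pool:\n  name: {self_hosted_pool}\n"
    # Microsoft-hosted pools are selected by VM image.
    return f"pool:\n  vmImage: {hosted_image}\n"


print(pool_block())             # before: Microsoft-hosted agent
print(pool_block("ado-pool"))   # after: the self-hosted agent pool
```

Only this section changes; the rest of the deployment job can stay as it was.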
All right, now you see the result. The deployment
to the Azure web application has finished successfully.
Okay, so to wrap up the procedures of
my demo: I created a web application
and also configured
the access rule so that only access
or deployment requests from the VNet are allowed
to the SCM site of the web application.
In order to meet that requirement I started to
build my own customized image and published
that to the Azure
agent gallery as a published image. And starting
from there I created a virtual machine scale set
using that customized image. And then I created
an ADO pool inside my project and
linked my ADO pool with the virtual machine scale set
I created, so that I can use the virtual machines running
inside the virtual network with my customized image.
Then I updated
my pipelines to use this newly
created agent pool and
redeployed my application, and everything works.
Now the
last part is about my takeaways from the whole engagement.
So the first thing is that security
is never the last thing, but always the first thing
we should consider in any project.
The reason is that cloud security is definitely a shared
responsibility. So we definitely
need to work closely: if we're working with a customer,
a big enterprise team, we need to work with their
security team or experts to understand
and define the requirements from their side, because
every organization may have specific requirements, and you can
imagine how busy this type of team will be.
We were lucky in our case. So try to start the conversation and
dialogue with them as early as possible to avoid any
unexpected issues, risks
or violations in the
later part of a project that give you a big shock. And
last but not least, you can see
just from my demo that building a self-hosted
agent is definitely more complicated than using the Microsoft-hosted
one, and also bringing security
and other security factors
into our design or implementation will
add more complexity and even more challenges.
But thinking about the long run, we can avoid
big asset loss because everything is protected, everything is secured.
It may take extra effort, it may give you some inconvenience,
but it will ensure that the customer project will be successful
and that all your resources and your assets will also be
protected, so that will avoid
big asset loss in the long run. No pain,
no gain. So we put in the effort to build a
secure workload, to build a secure network, and then
in the long run we get a successful project, so
all the pain pays off. And that's all my sharing
for today. I hope this is helpful, and
thank you for listening.