Transcript
Hi everybody, thank you so much for coming today. I'm super, super excited
to be here. Today we are going to talk about one of my favorite topics
which is how to build centralized policy management at
scale. But I believe that at least for
some of you this is the first time that we meet. So hello,
my name is Noaa Barki. I've been a developer advocate at Datree and a full stack developer for about seven years. Six, seven years. Yeah. I'm also a tech writer and one of the leaders of the GitHub Israel community, which is the largest GitHub community in the whole universe. And I work at an amazing company called Datree, where we help developers and DevOps
engineers to prevent misconfigurations, Kubernetes misconfigurations
from reaching production. But enough about me,
because today we are going to talk about how to build centralized policy
management at scale. We are going to specifically talk about
policies. What are they? Why do we even care about policy enforcement? We are going to see some very cool tools that I personally like, like OPA, Gatekeeper, conftest, Argo CD and my very own Datree open source project.
So without further ado, let's just get it started.
Ladies and gentlemen, close your eyes.
Picture this. You had a long week
and it's Friday now and you're in your bed dreaming peacefully
about a warm wonderful comfortable weekend.
Unfortunately your weekend came a
little sooner than expected when you woke up
from the sound of your phone, and you had 15 missed calls from work. Oh no. You wake up, you rub your eyes, you go to the Slack channel and, oh, apparently somebody forgot to add a memory limit in one of the deployments, which caused a memory leak in one of the containers, which caused the whole Kubernetes node to run out of memory.
Wait a second.
Oh did I?
Oh no,
I think, I think, I think that somebody,
that somebody is you.
Now of course I'm kidding, of course I'm kidding. This is not
you. I'm sure it will never happen to you. Because first of all, you're very responsible developers. I always forget that I add this slide anyway. I'm sure it will never happen to you. You're very responsible developers, you're here with me, that's for sure. And I know that you know how to use Kubernetes; you definitely know to never forget those memory limits.
So let me ask you a different question. Who here is ready to play a game? So the game goes like this: I'm going to show you two Kubernetes manifests, and each time I'm going to point to a specific key which is configured differently in each manifest. You will have to look very carefully and tell me which one you would deploy, left or right.
Let's get it started. Okay, so this is a
cron job configuration. Pay attention to the concurrency
policy. Which one would you deploy, left or right? I'll give you 10 seconds.
Ten, nine. Okay, I'll stop.
And the right answer will be: right. You see, we always want to make sure that we set the concurrency policy to either Forbid or Replace, never to Allow. The reason why is that when a cron job keeps failing and the concurrency policy is set to Allow, a new run never replaces the previous one, so you end up with a lot of pods that just pile up in your cluster. And this is actually what happened to Target. They had one failing cron job that created thousands of pods that were constantly restarting. And not only did it take their cluster down, it also cost them a lot of money, because their cluster accumulated a few hundred CPUs during that time. Very sad story.
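To make that concrete, here is a minimal sketch of the safer configuration; the name, schedule and image are only examples, the point is the concurrencyPolicy field at the CronJob spec level:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: report-job                 # example name
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid        # or Replace; never Allow for jobs that may hang or fail
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: registry.example.com/report:1.0   # example image
```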
Let's move forward to the next question.
This is another cron job configuration. And once again, pay attention to the concurrency policy. Which one would you deploy, left or right? Dun dun dun dun dun. And the right answer will be right
again. You see, here on the left side, the concurrency policy isn't part of the cron job spec, so we end up with a cron job without any limits. And this is actually what happened to Zalando, which is a leading online fashion company with over 6,000 employees. It's a big company, guys. They actually used the correct configuration; however, they placed it incorrectly in their YAML, which is very sad. And this immediately took their API server down.
So let's move forward to the next question.
This is pretty simple. A pod. Which one would you deploy, left or right? All right, I'm sure that you know this one, and this is going to be short. The right answer is of course right again: we want to make sure that we never forget the memory limit.
And this is actually what happened to Blue Matador. Back then they were a small startup company with their own monitoring software. Their pods hosted a Sumo Logic third-party application whose containers were memory hogs. And because they forgot to set a memory limit, nothing stopped those pods from taking up all the memory on the node, which eventually caused out-of-memory issues.
Very sad.
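As a quick illustration, a minimal sketch of a pod that sets a memory request and limit looks something like this (the names and values are only examples):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: logging-agent              # example name
spec:
  containers:
    - name: agent
      image: registry.example.com/agent:1.0   # example image
      resources:
        requests:
          memory: "128Mi"
        limits:
          memory: "256Mi"          # without this, one leaking container can exhaust the node
```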
But you see, Target, Zalando, Blue Matador, they aren't the only ones who suffered from these pretty innocent mistakes. I'm talking about big companies. I'm talking about Google, Spotify, Airbnb, Datadog, Toyota, Tesla, you name it. I'm talking about a lot of other companies who shared their own Kubernetes failure stories. Trust me,
nobody is immune to Kubernetes misconfigurations. So first of all, I highly recommend everybody to read other companies' failure stories. Not only will you learn so much about how Kubernetes works, what to do, what not to do and what the best practices are, it will also make you ask the ultimate question, which is: how can I make sure it won't happen to me? How can I make sure that I won't become one of those failure stories?
And this question is very important, because it forces you to think about what the workload requirements in your organization are, what stability and security you want to achieve for your cluster, and how you can make sure that it won't happen to you. And the answer will be policy enforcement. I know this because before we launched Datree for the first time, we wanted to learn as much as possible about the common misconfigurations and the most common pitfalls in the Kubernetes ecosystem. And what we did was read more than 100 Kubernetes failure stories. Not only did we learn that policy enforcement is the solution to prevent misconfigurations from reaching production, it also turned out to be a key solution to improve the DevOps culture in your organization.
So great. So how do we start? First of all, we want to define the policies and the rules that we want to enforce in our organization. Maybe we want to make sure that every container has a memory limit. Maybe we want to make sure that all the containers have a readiness or liveness probe configured. Maybe it's about cron jobs, that every cron job has a deadline or that the schedule is valid. It doesn't really matter; the policies that you define really depend on your workload requirements. And once you have a set of policies that you want to enforce, the real question is how you will integrate and distribute those policies in your organization, in your pipeline. How will you make sure that you and your teammates follow these policies?
So the way I see it, you know what, let me say it in a different way. I believe in two things: I believe in shift left and I believe in GitOps. I believe that the sooner you find a mistake, the less likely it is to take your production down. And I believe that every Kubernetes resource, every config file, should be handled exactly the same as your source code in the CI, exactly like your source code. So with this mindset, the way I see it, we should automatically validate our resources on every code change in the CI.
Furthermore, integrating and validating your resources in the CI using tools that can also be used as a local testing library, and I'll pause here, as a local testing library, can extremely help you nurture the DevOps culture in your organization. And the reason why is that a local testing library is actually part of the developers' own policies. Developers know how to use a local testing library; they are used to it. They are used to writing code, testing it locally on their local machine and then submitting a pull request, and they expect those tests, at least those tests, to be run again in the CI. This is actually part of the developers' policies, and allowing the developers to do the same with infrastructure as code, with Kubernetes resources, will allow the DevOps to delegate more responsibilities to the developers, and therefore to liberate the DevOps from the constant need to fence every Kubernetes resource from every possible misconfiguration.
But I hear you. I hear you back there. I can totally hear you. Here in Israel we have a thing, it's called "let's talk Dugri", which means show me the real business. Come on, Noaa. So let me show you the real business. Let's talk Dugri. Let's talk about how you can start today to use policy enforcement in your organization.
And the first tool that I want to talk about is OPA, or Open Policy Agent; I'm going to call it OPA. So OPA is a general-purpose policy engine. You can write all your policies in it and execute them against a specific input, to check if it violates any one of those policies. You can practically think about OPA as a super policy engine. I like to think about it this way.
Now, the main idea behind OPA is actually to decouple all the policy decision-making logic from the policy enforcement usage. You see, suppose you have a microservices architecture and one of the microservices receives an API request. You probably need to make some decisions in order to allow or disallow this request, right? Now, these decisions are based on rules, on criteria that you want these requests to meet. These rules are called policies. And what OPA gives you is the ability to decouple and offload all those policies into one dedicated, agnostic service, and therefore to give your administrators and your ops team in your organization better control over policy enforcement in the service at runtime.
So let's talk about how you can use OPA. There are actually two ways to use OPA. If your services are written in Go, you can use OPA as an internal package and embed it within your code. The second option is to use OPA as a host-level daemon and query it with an HTTP request: you send the input as JSON, and OPA evaluates it and sends you the response. Now, this is important: by definition, all the policies should be written in a special language called Rego, which is the official policy language of OPA. It's very easy to learn and, fun fact, it's actually inspired by Datalog. And basically that's it. From this moment on, OPA is your centralized policy service. You write all the policies in OPA, and all your microservices can query OPA to ask for a policy decision. OPA evaluates the input and returns a response, valid or not.
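Just to give a feel for it, here is a minimal Rego sketch of that kind of allow/deny decision for an API request; the package name and the input fields are purely illustrative, not something shown in the talk:

```rego
package httpapi.authz

# Deny by default
default allow = false

# Viewers may only read
allow {
    input.method == "GET"
    input.role == "viewer"
}

# Admins may do anything
allow {
    input.role == "admin"
}
```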
Now, OPA is great. OPA is wonderful. I really love OPA, but when it comes to Kubernetes, not so much, because it still requires a lot of heavy lifting. And I do enjoy CrossFit training and I do enjoy heavy lifting, but not when it comes to my Kubernetes cluster. This is where conftest comes into the picture. Conftest is an open-source utility that allows us to write tests against any structured file. And this is rule number one about me: when I say any, I mean any. I'm talking about JSON, XML, Dockerfiles and of course YAML. I'm talking about our Kubernetes resources. Conftest allows us to write tests for our Kubernetes resources with OPA as the policy engine under the hood. Conftest is specifically designed to be run in the CI or, just the way I like it, as a local testing library.
Now let's talk about how you can use conftest. First of all, you need to install conftest on your local machine. Then you need to write all your policies in Rego. So here, for example, you can see two policies: one that verifies that I don't run with root privileges, and a second that makes sure I always use the app label for my deployments.
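A rough sketch of those two policies as conftest deny rules could look like this; by default conftest evaluates rules in the main package, and the field paths here assume a Deployment-style manifest:

```rego
package main

# Fail if a container is allowed to run as root
deny[msg] {
    container := input.spec.template.spec.containers[_]
    not container.securityContext.runAsNonRoot
    msg := sprintf("container %s must set securityContext.runAsNonRoot: true", [container.name])
}

# Fail if the Deployment is missing the app label
deny[msg] {
    input.kind == "Deployment"
    not input.metadata.labels.app
    msg := "deployments must carry an app label"
}
```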
And after I wrote all the policies, I put them in a folder; by default the folder name should be policy. Then I simply execute conftest test with the path of all the files that I want conftest to test. Conftest will take all the policies, take all the files that exist in that path, send everything to OPA and output the result.
Now, I said that you can use conftest in the CI. So here, for example, I used GitHub Actions, and what I want you to pay attention to is actually the green part over here. As you can see, I pull my policies with conftest, and this is actually an awesome feature of conftest: it allows us to push and pull our policies to a Docker registry, which is kind of nice. So I pull all my policies and simply run conftest test with the path of all my resources. Very easy, very fun. I love conftest, it's super friendly, it makes life so much easier. I highly recommend everybody at least try it.
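As a rough sketch, a GitHub Actions job along those lines could look like this; the install step, the registry path and the manifest directory are placeholders you would replace with your own setup:

```yaml
name: validate-kubernetes-manifests
on: [pull_request]

jobs:
  conftest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install conftest
        run: |
          # Download and install a pinned conftest release (the URL is a placeholder)
          curl -sSL -o conftest.tar.gz "https://example.com/conftest_linux_amd64.tar.gz"
          tar xzf conftest.tar.gz
          sudo mv conftest /usr/local/bin/conftest
      - name: Pull shared policies from a registry
        # Hypothetical registry path; conftest can push and pull policy bundles to a registry
        run: conftest pull ghcr.io/my-org/policies:latest
      - name: Test the manifests against the policies
        run: conftest test manifests/
```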
So with conftest in the CI, we can safely go back to sleep now, because we validate our resources on every code change. Let me just put myself here. Oh, this is more comfortable. Here, we can safely go back to sleep, because we validate our resources on every code change. And this means that our cluster is truly protected, right? Wrong. This is not true, because what about those criminal users who can, any time they want during the day, simply type kubectl apply and do whatever they want with our clusters?
What about those criminal users? And in the year 2020, Google, Styra, Microsoft, Red Hat, they asked themselves that exact question, and their answer was: this is not enough. So this is the story of how they created Gatekeeper. Gatekeeper is actually the bridge between your Kubernetes API server and OPA, and it allows us to validate our resources, our Kubernetes resources, natively in our
cluster. But before we talk about what exactly Gatekeeper is and what it does under the hood, I want us to talk about Kubernetes admission controllers, and specifically Kubernetes admission webhooks. You see, when an API request comes into the Kubernetes API server, it passes through a series of steps. First of all, it is authenticated and authorized. Then it passes into the admission controllers, which basically trigger a list of webhooks that can mutate, validate, and that's it, mutate and validate your request. And then, and only then, when your request is valid, does it move forward to be persisted to etcd and executed. Now Gatekeeper, Gatekeeper is a customizable admission webhook. So whenever a resource in the cluster is created, updated or deleted, the API server calls the admission controllers, which trigger the Gatekeeper webhook. Gatekeeper takes the request along with the resource and the predefined policies and sends everything to OPA. OPA evaluates everything, and if OPA finds any misconfiguration, any violation, then Gatekeeper will reject the request and the user will receive an error.
Now let's talk about how you can use Gatekeeper. First of all, you need to install Gatekeeper on your cluster. Then you need to write all your policies. But this time, since Gatekeeper is installed on the cluster, you don't write your policies in Rego files; you write them in a ConstraintTemplate CRD. So as you can see here, the constraint template is basically the policy that you want to enforce; you can see the actual Rego in the green part over here. And this is an example of a policy that receives a required label and checks whether a resource actually includes that label. Then, after you write all your policies, you still need to tell Gatekeeper how to use that policy and which kinds you want the policy to be applied to, and you do that with the constraint. So as you can see here, I created a constraint that takes K8sRequiredLabels, which is the policy that we just wrote, and applies it to every namespace, with owner as the required label. So in conclusion, this policy ensures that every namespace in the cluster has the owner label.
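For reference, a sketch of that template-plus-constraint pair, modeled on the upstream Gatekeeper required-labels example, looks roughly like this:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg}] {
          required := input.parameters.labels[_]
          not input.review.object.metadata.labels[required]
          msg := sprintf("you must provide the label: %v", [required])
        }
---
# The constraint applies the policy above to namespaces and requires the owner label
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-owner
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["owner"]
```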
So with conftest in the CI and Gatekeeper in the cluster, now we can go back to sleep safely, because we've gained some very powerful policy enforcement for our Kubernetes resources. But it does come with a price. There are a couple of challenges that I want us to talk about. First of all, as your organization grows, you will probably want to modify or change some of your policies. You will probably want to add a new policy, delete some of the policies, change one of the policies. And this can be a very challenging task when your policies are written both as Rego files and as constraint templates for Gatekeeper, especially if you have multiple, or a lot of, Git repositories.
Now, there are a couple of ways to face this challenge. You can use a dynamic YAML file that you download from S3 with all your configuration. You can use etcd as your centralized state management and pull your configuration from it. There are many ways to face this challenge, but it is a challenge that you should take into consideration.
Another challenge that I wanted to talk about is actually the Rego language itself. And I know this because I tried to implement some of the Datree policies in Rego. So if your policies require some level of complexity, let's reverse it: if your policies don't require any level of complexity and they are pretty straightforward, you shouldn't worry at all about Rego. But if your policies require some complexity, such as, I don't know, making sure that one of the keys is within a range of numbers, making sure that one of the keys uses a specific kind of unit, or comparing one of the keys to another key, maybe in a different file, then Rego might be a little difficult to use and a little bit difficult to implement. So my fair advice to you is to look for tools that already come with built-in policies, or at least look for policies that are already written in Rego that you can use and take inspiration from.
But another way to face this challenge, if you think about it, would be not to use Gatekeeper at all, and instead to guarantee that your Git repository is your single source of truth, and by that eliminate the need to fence or guard your cluster, because users won't be able to kubectl apply whatever they want and destroy our production. Some people might call it GitOps.
Now, I know that there are many ways to practice GitOps, and I know that there are many tools out there, but the tool that I want to talk about today is Argo CD, because I really love Argo CD. It makes life so much easier, and it's specifically designed to make continuous deployment in Kubernetes more efficient. But before we talk about Argo CD and why it's so magical, I want us to talk about what a CD workflow without Argo CD actually looks like. So let's say that I have a Kubernetes cluster in production and I use Jenkins for CI/CD. So you know the drill. Usually we have
a developer that submits code to a repository upstream, maybe a new feature, maybe a hotfix, and it triggers a CI pipeline, which builds the code, tests the code, builds a new image of our application, pushes that image into a Docker registry, and then in CD we update the Kubernetes deployment resource, maybe some more resources, to use that new image of our application, and we apply those resources to our cluster. Now, this is all nice and easy, but there are a couple of challenges. First of all, in order for kubectl to fully function, we first need to configure some access to our Kubernetes cluster.
Additionally, we usually use, I don't know, AWS, Microsoft or Google. So for example, I use EKS, so I also need to configure my credentials for the AWS account. Not only is this a configuration challenge, it is also a security challenge. Another challenge that I want us to talk about is that once Jenkins deploys that deployment, we don't have any visibility into the deployment status. We don't know if the deployment actually failed or not, and we need to manually check the logs in our cluster, which can be very difficult or very uncomfortable sometimes.
And Argo CD is made just for that: to make continuous deployment in Kubernetes more efficient. And it does it by simply reversing the flow: instead of pushing the changes to the cluster, Argo pulls the changes from a Git repository. And the real magic about Argo CD is that it is actually part of the Kubernetes cluster itself; it's an extension of the Kubernetes cluster. So you don't need to provide any secrets or configure any credentials for Argo CD, which is really awesome.
But let's talk about it in more detail. So first of all, to use Argo you need to install it on your cluster, and then you need to configure it with a repository, so that Argo will monitor that repository, by default every three seconds. If Argo detects that new changes were submitted, i.e. the last commit hash is different now, then Argo will clone the repository and pull those changes to the cluster. Now, you're probably wondering what will happen if we run kubectl apply manually against our cluster. Since Argo CD is installed on the cluster, in this case the Argo controller will detect that the cluster is out of sync, and Argo will override those changes with the state that exists in the repository, which makes the repository our single source of truth.
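To make that concrete, here is a minimal sketch of an Argo CD Application that watches a repository and keeps the cluster in sync with it; the repository URL, path and names are only examples:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/my-app-gitops.git   # example GitOps repository
    targetRevision: main
    path: manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true      # remove resources that were deleted from Git
      selfHeal: true   # revert manual kubectl changes, keeping Git the source of truth
```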
Now, a quick note about repositories, because it's very important and I've mentioned it a lot. It's been established as a best practice in GitOps, and specifically with Argo CD, to separate the application source code and the application configuration repositories, and to have one repository for the application source code and a different repository for the application configuration. The reason why is that you usually have more than one deployment file. You usually have services and ingress files and deployments and secrets and config maps, and you have a lot of other config file types. You usually manage them across different environments with tools like Helm or Kustomize. So you have a lot of config files to manage. And once you change one of these files, you don't want to trigger the application CI pipeline, because nothing really changed in the application source code. And this is why it's very much recommended to separate the two and to have one repository for the application source code and another repository for the application configuration, which is usually called the GitOps repository.
So now let's talk about what a CD workflow looks like with Argo CD and a separate GitOps repository. So once again, we have a developer that submits new code to the repository upstream, which triggers the CI pipeline, which builds the code, tests the code, creates a new image of the application, pushes that image into a Docker registry, and then the CI updates a deployment to use that new image of the application, but in a separate repository, a GitOps repository that Argo monitors. And when Argo detects that those new changes were made, Argo will pull those changes and apply them to the cluster.
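To make that last CI step concrete, here is a minimal sketch of what the update could look like, assuming a GitHub Actions job, a Kustomize-managed GitOps repository and illustrative names; your own pipeline will differ:

```yaml
# Hypothetical CI step in the application pipeline: push the new image tag
# to the GitOps repository that Argo CD watches. Repo, path and secret names are examples.
- name: Update image tag in the GitOps repo
  env:
    GITOPS_TOKEN: ${{ secrets.GITOPS_TOKEN }}   # token with push access to the GitOps repo (assumption)
  run: |
    git clone "https://x-access-token:${GITOPS_TOKEN}@github.com/my-org/my-app-gitops.git"
    cd my-app-gitops/overlays/production
    kustomize edit set image my-app="registry.example.com/my-app:${GITHUB_SHA}"
    git config user.name "ci-bot" && git config user.email "ci-bot@example.com"
    git commit -am "deploy my-app:${GITHUB_SHA}"
    git push
```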
So if you think about it, using conftest, for instance, in the CI of your GitOps repository will provide you very powerful policy enforcement, because now you can guarantee that your GitOps repository is your single source of truth.
Now, the good news is that you have very powerful policy enforcement. The bad news is that this is only the beginning, because once you have policy enforcement, your organization is a continuously living, breathing cell, and policy management and policy enforcement should continuously evolve and be continuously updated according to your organization's needs. And this is where centralized policy management comes into the picture. So to build centralized policy management, first of all you need to make sure that you have the right environment to dynamically adjust your policies. Not only implement the policies, you also need to make sure that you can update your policies and reconfigure your policies and control your policies.
And Git? Git isn't the best solution, because Git won't provide you anything that you need here. Let's take permissions, for instance: how will you use Git to control who can delete or create a new policy? Another thing that I wanted to talk about is that you also need to make sure that your policies are actually effective. And the thing about policies in Kubernetes is that they are sort of like a contract between application owners, cluster admins and security stakeholders. So in order for the policies to be truly effective, they not only need to work and to really enforce what you want them to enforce, they also need to be communicated properly between everybody in the organization. And to make sure that your policies are actually being communicated properly, you need to ensure that people actually know what to do when one of the policies fails. You need to know which policy fails the most. You need to ask yourself: how will I delegate the knowledge? How will I provide guidelines to my people so that they can actually fix the policies when they fail?
You can use email for that, and we've seen a lot of companies do it, but obviously this doesn't scale, because I'm a developer, I code for a living, I'm a feature machine, I don't know infrastructure, I don't know Kubernetes, and honestly I can't remember to put the memory limit or to use this version instead of that version. Misconfigurations can easily happen here. So you need to be able to constantly and continuously review, monitor and control your policies. Which policies fail the most? Which policies are actually being used in practice? How can you make sure that you make progress and that you get the improvement that you want with your policies? And the last
with your policies? And the last
tool that I wanted to talk about is actually my very own the Datree
Cli open source,
which actually combines everything that we just talked about.
So the datree, much like conftest, allows us to validate and
scan all our Kubernetes resources. But unlike
conftest, the tree already comes with built in rules
and policies for Kubernetes and Argo CD.
Now, Datree is specifically designed to be run in the CI, or as a local testing library, or as a pre-commit hook. And the way the CLI works is that for every resource and for every file that exists on a given path, Datree runs automatic checks to see if it violates any one of the policies and rules, and for every violation that it finds, for every potential misconfiguration, Datree displays a full, detailed output of the violation and guidelines on how to actually solve this failure. Now, under the hood,
every automatic check, every one of these automatic checks, includes three steps. The first is to ensure that the file is actually a valid YAML file. The second step is to ensure that the file is actually a valid Kubernetes or Argo CD resource. And the last step is the policy check, to verify that the resource follows the best practices: the rules that you customized and wrote for your organization, and the ones that Datree provides out of the box.
Now, to use Datree, first of all you need to install it on your local machine, and then simply run datree test with the path of all the files that you want to test.
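For example, assuming the CLI is already installed, the invocation is simply this (the path pattern is just an example):

```
datree test ./kubernetes/*.yaml
```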
And that's it. It's free, it's open source, and I highly recommend everybody to submit code, review the code, submit a pull request. I will be there, can't wait to meet you. And last
but not least, I really hope that this session inspired you to start thinking about what the policies in your organization are, what would be the best way for you to enforce them, and how you will start to build your own centralized policy management solution. Thank you very much.