Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE?
A developer? A quality
engineer who wants to tackle the challenge of improving reliability in
your DevOps? You can enable your DevOps for reliability with
Chaos Native. Create your free account at
Chaos Native Litmus Cloud. Hello.
Hi, Conf42. Hi everyone. Hi Noaa.
How are you? I'm very good. How are you, Shimon? Very good.
Very excited. You know, there aren't so many SRE conferences,
and really, it's a question by itself: what is an
SRE? What is a DevOps? You can have an entire talk
about the difference and who calls it what,
but yeah. Cool. So today we're
going to talk to you about what we've learned from working with
companies and experiencing and reading
a lot of Kubernetes postmortems. We're going to go over some
of the postmortems themselves and also talk about what
are the best practices in order to avoid taking
your cluster down or introducing security hazards and so on.
So we're going to show you practical tools of how to do it and
also the methodology of how we believe is
a good way to avoid it.
Cool. Spoilers. Spoilers,
no?
Nice ship. Yeah.
Cool. So before we begin, maybe we should introduce ourselves.
Yes. Hi everyone. My name is Noaa Barki.
I've been a full stack developer for over five years, and
I'm also a tech writer and one of the leaders of
the GitHub Israel community, which is the largest GitHub community
in the world. And I'm very,
very excited to be here. Thank you all for coming.
Thank you very much. And my name is Shimon. I'm one of the co-founders
and the CEO of Datree. I'm also very much involved
in communities. I'm an AWS
Community Hero, I run an 8,000-member meetup
group of cloud enthusiasts,
mainly around AWS, and I also run the local CNCF
chapter in Tel Aviv. And my background is in DevOps
and backend software engineering.
And let's dive deep in.
Let me tell you a little bit more about how we got here. So in
my previous role, I was the general manager for a software
engineering department which was responsible for all of the infrastructure
engineering, and we had 400 engineers
in the company. And at the
time we were already running on the cloud and we had a lot
of services and hundreds of git repositories.
And one day a developer made a misconfiguration which
reached production, which happens. I make mistakes all the time.
What can you do? But at this point
I was faced with the question like how do I propagate this development
standard and make sure that people in my organization
don't make the same mistakes again. And this
is, fast forward to today: we created Datree,
which is a company that aims to
prevent misconfigurations from reaching production.
And the way we do it is we have an open source CLI.
I invite you to go and check it out. It's written in Go.
And what it does is it scans the Kubernetes
manifests and Helm charts. And every time
that a developer or anyone makes a code change to
a Kubernetes related manifest, we scan it,
we provide predefined rules out of the box and we prevent
misconfigurations that we identify.
And usually people run it in the CI/CD and/or
on their machine. So probably,
if you read between the lines here, and I'm sure that you do:
at Datree, because we integrate with the
code that you write, your Kubernetes resources,
and we check them
on every change, policies are what we do for a living.
This is what we do every day, every single day, all day.
And part of my job as a developer at Datree
was not only to understand how Kubernetes works and
what the main components in Kubernetes are, but also
how you can blow up your own cluster.
And this is practically what we are going to talk
about today, the 100 plus postmortems.
So today, Shimon, I know that we usually
do it in a postmortem way, so usually we would talk
about the event and see what are the lessons that we've
learned from that event. But especially today, are you ready?
We are going to do it in my very own "What's the
Mistake?" private game show.
Wait, what are the prizes? What are the prizes?
What can I win? It's a million dollars. What is it?
Okay, I'm sorry I didn't think about it, but I will.
I promise I will. Okay. If you're curious about it,
just contact me later. I'll tell you what you won or did
not win. Maybe I won't win. I believe in you. Thank you.
So what we are going to do is that I'll show you a
Kubernetes resource and you will have to find
or guess what's the mistake? Are you ready?
I'm ready. Let's do it. Hit me. Hit me.
This is a default cron job configuration.
Yeah. So there isn't too much configuration here.
With Cron job you always mix up the scheduler
and the configuration. So I'll go and say that there
is a problem with the scheduling here. The scheduling
I guess. Is it the scheduling? Maybe not.
No,
it's not the scheduling. Low tech
effects. Okay. It's the concurrency
policy. We always want to make sure that we set the concurrency
policy to either forbid or to replace.
Do you know why? Tell me more.
I'll tell you. It's because when we set it to
allow, I'm sorry, when we set it to.
No, I'm not sorry: when we set it to Allow, whenever a
cron job fails, it will not replace the previous
one, it will always create a new one. Oh, and if I'm
not mistaken, the cron job here spawns like
a task every minute. Yes. So I guess they wanted to have like an almost
long lived one. And if it won't finish the previous
one it will just spawn more. Yes, every minute.
So concurrency policy as it sounds,
controls the concurrency. Do you want to know how much more?
I'll tell you anyway. If you don't want, that's your problem.
But if you want to know how much more: it's 4,300-plus
pods that were
constantly restarting. And this is actually what happened to Target.
They had one failing cron job that created thousands of pods.
And not only did it immediately take
their cluster down, it also cost them a lot of money,
because those restarting pods accumulated thousands of
CPUs. So the lesson
here that we learned from Target is to never ever
trust the default configuration. Just because Kubernetes allows you to
deploy a specific resource specification which are
the default, it doesn't mean that you should do it, it doesn't mean that
this is the configuration that you should use even though it's default.
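To make that lesson concrete, here is a hedged sketch of a CronJob that sets the concurrency policy explicitly instead of relying on the default; the job name, image, and command are made up for illustration, not from the talk:

```yaml
# Illustrative CronJob: concurrencyPolicy is set explicitly instead of
# relying on the default (Allow), which lets failed runs pile up.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: report-job            # hypothetical name
spec:
  schedule: "*/1 * * * *"     # runs every minute, like the example above
  concurrencyPolicy: Forbid   # skip a new run while the previous one is still going
  jobTemplate:
    spec:
      backoffLimit: 2         # cap retries so a failing job can't restart forever
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: busybox:1.36   # hypothetical image
              command: ["sh", "-c", "echo generating report"]
```

Replace, the other non-default option, would instead kill the still-running job and start a fresh one; which of the two fits depends on your workload, which is exactly the "know why" point.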
And you always say to yourself that's the default. I see
how it can harm me I guess, because maybe
the classical case is that maybe you spawn a cron job once
a day or something. So I'm thinking as a Kubernetes developer
it makes sense to me why you would use concurrency policy allow,
but in this use case, when they're trying to spawn it every minute,
it doesn't make sense. So I guess even if you configure
something, you should check what the default configuration is.
Even if it's not stated in your manifest, you should check what is under the
hood and understand how it's going to behave. Yeah,
I know that whenever I see the default behavior
subconsciously I always say to myself well how
can it be dangerous? It's supposed to be safe.
But it's not. You should
think about it. But let's move forward to the next
mistake. This is another cron job
configuration. What's the mistake here?
Okay, so you've taught me about the concurrency policy, which is forbid,
which you taught me is good. Yes, it's good. By the way,
I think that maybe in some cases it's bad, right? It depends on your use
case. You are totally correct. And I want to say to you
audience, that whenever I say that
we should "always",
I have a disclaimer. My
main point is to know why you should
not use it and how it works under the
hood, so you will definitely know if that default
behavior is good for you. Sometimes you would want
to use Allow. Yeah, I agree there
for a reason. Yeah, that's for sure. Yeah.
So the concurrency policy is correct. I'll go with restart
policy. Restart policy,
yes. Let's find out. No,
it's not the restart policy,
it's an incorrect YAML structure. We always want
to make sure that, and this time, yes, I mean it,
we always want to make sure that our YAML structure in
our Kubernetes resources is valid. There is no way
that you wouldn't want to do it. And this is actually what happened
to Zalando. Zalando is a big online fashion company
with over 6,000 employees. I mention that
because I want you to understand the scale of that company.
What happened to that company is that their API server went down,
and it went down due to out-of-memory issues, because
they used the correct resource configuration
for the CronJob but placed it incorrectly in their
YAML. Instead of placing it here, they placed it
here. So it's like you think you configured it, but actually you didn't
because it's in the wrong place. So Kubernetes takes the
default configuration or other configuration that you have. So you actually didn't
configure it. Yes, very misleading. And it's just like tabs.
Yes, it was an honest mistake, an innocent mistake
that one of their developers made.
And as
you can see here, the concurrency policy wasn't part of the CronJob spec.
So they ended up with a cron job that created thousands of
pods that immediately took their API server down.
And Zalando
taught me the importance of having
clearly defined standards, policies and validation
standards in your organization, because this was actually a YAML structure
that was invalid. It's not the understanding
of Kubernetes that was wrong, it's not
a default behavior that you shouldn't have used.
It's a simple automated check that you could have done.
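To illustrate the kind of misplacement being described, here is a hedged sketch, not Zalando's actual manifest: both versions are valid YAML, but only one puts the key where the CronJob controller looks for it:

```yaml
# Misplaced (sketch): concurrencyPolicy is not a field of jobTemplate,
# so depending on validation it is rejected or silently ignored,
# and the default (Allow) applies.
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    concurrencyPolicy: Forbid   # wrong level
    spec:
      template: {}              # rest of the job omitted for brevity
---
# Correct: concurrencyPolicy sits directly under the CronJob spec.
spec:
  schedule: "*/1 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template: {}
```

This is exactly the class of mistake a schema-aware automated check catches instantly and a human reviewer easily misses.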
It's really hard, and it's really hard as a human to
see this because it's like it's the right thing, it's just placed in the wrong
location. It's kind of hard
to see it. As a developer,
the Zalando incident always reminded me of
the days before TypeScript, when you could forget,
I don't know, a semicolon or something, and just nothing
worked. And you don't know
why. Yeah. So this is what Zalando
told me. But let's move forward to the next question.
This is an ingress resource. What's the mistake?
I guess, you know, in this case it seems like the host is
star, which is kind of weird because usually you want to give like a URL
or where do you point this? Like which service you want
to listen to this ingress? So I'll go with the host
star. Are you sure it's not maybe the rules
maybe is this like who wants to be a millionaire? We're like,
ask me a thousand times. Yes, why not?
My final answer is host star. This is your final
answer? My final answer, yes.
And you are correct. The host.
We want to prevent users from specifying
the host as a bare wildcard. And the reason is
that when we put a star in
the ingress resource, Kubernetes will forward
all the traffic to that container. So you may
end up with one container getting all
the traffic of an entire cluster. And this is what happened to Target.
It's not that you never want to do this, because
sometimes you would want to use a wildcard.
But what happened to Target is that one of the developers set
a star in their ingress resource, and it
just immediately took their entire cluster down.
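As a hedged sketch of the fix (the hostname and service names here are made up, not from the incident), pinning the host instead of using a bare wildcard looks something like:

```yaml
# Illustrative Ingress: a specific host instead of a bare "*",
# so only traffic for this hostname reaches the backend service.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shop-ingress              # hypothetical
spec:
  rules:
    - host: "shop.example.com"    # a specific host, not "*"
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: shop-frontend   # hypothetical service
                port:
                  number: 80
```

In current networking.k8s.io/v1, a wildcard is only allowed as a leading label like `*.example.com`; a bare `*` host is exactly the kind of thing a policy check can flag before it ships.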
And the lesson that we learned here from Target, and this is
actually a fun fact, it was their first mistake when they started to use Kubernetes.
Anyway, the lesson that we learned from Target here is how important
it is to delegate the knowledge from the DevOps
team to the developer teams, or the SRE,
God forbid the SRE, to
all the teams in the organization.
It's very important. And I
think that even five years ago when
you worked at that company that you talked about, it was
still important back then. It's not something that is really
specific to Kubernetes; it's just that Kubernetes is such an amazing thing,
and it's across your organization whether you want it or not. It's across
the entire app. I think that once we switch from
an Ops team and a dev team, where you, as a developer, throw things over
the wall to the Ops team to do, and once
you delegate infrastructure as code and operations responsibilities
to developers, that's it. You got to educate them,
you got to give them the tools in order to do the right thing.
Because otherwise how are you supposed to do it? Like maybe I'm
the best Java payment engineer, but I'm not a Docker expert,
I'm not a Kubernetes expert.
I think it's more than that because I
always get angry when people say that DevOps and
developers should work together just like that.
It's not the same, it's not the same job, it's not
the same way of thinking. I am a developer
and I want to be the best feature machine ever.
And you as a DevOps engineer, you want to keep your
production safe; you're a production warrior.
That's the reason that you wake up every morning and go to your
work. And it's not only
that, it's also how we think as a developer.
I take, I don't know, let's imagine a microservice
and I take that microservice and I go to the controller and
then to the service and then to the bits and the bytes of every
variable. And you as a DevOps, you should think in
a zoomed-out way: okay, I'll take that
microservice and I'll put a load balancer in that region, and
whether it would go to that place or that place. But that's the thing
now, developers also need to think about it. Yes, and you need to give
them the tools. Exactly. This is the key
point, because you should understand that these are
different people; you should teach them.
I agree. But you know, today if I interview someone, like as a developer,
one of the questions that I ask them is like, how is your code making
it to production? If a developer tells me it's,
you know, like magic: I send
it, and then there's a build, and Harry Potter does the CI/CD...
You know, I think it's important. But let's move on to the next one.
Yes, you are correct. Okay, so this
is just a simple pod. Simple, innocent pod.
What's the mistake here? This doesn't seem like there's
a mistake here.
Front end. Yeah, I'll go with no mistake.
What could go wrong here? No mistake.
No mistake. Are you sure? I am positive. This is my
final answer: no mistake. No, you are not correct.
This pod doesn't have any limits. We always
want to make sure that we specify the resource requests and limits,
especially if we use third
party applications. And this is actually what happened to Blue Matador.
This is a small startup company; they had one pod that
served a third party application,
but it didn't have any limits.
So nothing stopped that pod from taking the
entire node. Yeah, like if you think about it, especially when you bring on
third party applications, you don't know what they're going to do.
And running things in containers inside Kubernetes
is similar, but different from running it in a VM,
right? Because let's say you have an EC2 instance or a VM or something:
you're forced to give it a memory limit
and a CPU limit, because you define how much memory and CPU
it's going to have. But here in Kubernetes it's interesting, you can
actually spawn it without specifying any limits. Now I remember
using for example, RabbitMQ, which is a very popular queuing service,
and the default configuration is just to accumulate as much
memory as possible in order to use it for queuing. So like the default behavior
of this third party application is like I'm going to hog your entire memory.
And if you don't know that, it's very important to
put boundaries and limits to all of your third party applications.
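Echoing the RabbitMQ example, here is a hedged sketch of what bounding a third party container might look like; the image tag and numbers are placeholders to be right-sized for your workload, not recommendations from the talk:

```yaml
# Illustrative pod with explicit resource boundaries, so a memory-hungry
# third party application can't consume the whole node.
apiVersion: v1
kind: Pod
metadata:
  name: rabbitmq               # hypothetical
spec:
  containers:
    - name: rabbitmq
      image: rabbitmq:3.12     # hypothetical tag
      resources:
        requests:              # what the scheduler reserves for this container
          cpu: "250m"
          memory: "256Mi"
        limits:                # hard ceiling; past this the container is
          cpu: "500m"          # throttled (CPU) or OOM-killed (memory)
          memory: "512Mi"
```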
Yes. But actually the lesson that Blue Matador
taught me, and I think it's the most important
lesson of all, is that it can really happen to anyone.
And when I say anyone, I mean anyone.
Google, Spotify, urban beat, Datadog,
Toyota, really anyone? The Internet is
full of stories about companies that blew
up their cluster,
but the real question is how we can prevent it in the future.
Yeah, because I think remembering, like, don't put a star here
and put the memory limit there, in a way it becomes overwhelming
and it's very hard. It's almost impossible. Like, how do you propagate
this to, I don't know, 400 developers, 100 developers?
And every day there are new features added to Kubernetes and every
day you need to remember if you want to use the default configuration, not use
the default configuration, and maybe you have new developers. Yeah,
you onboard new developers every day, and how are you supposed to tell them?
So I'm a big believer in automation and I think that the
way to go is actually to automate and have automated tests during your
development process, both on your computer and/or
in your CI/CD process. So I think
that number one, you need to define what are your
policies. And you
can use tools like conftest and Gatekeeper, or
Datree. The main difference between them is that with conftest,
for example, it's an amazing tool, but you need to script all your
rules by yourself. So you could go and look at different best practices.
And I think that the important thing is, number one, to define what are the
policies that you want to have and which ones you
want to enable, and also which
rules should run on which types of workloads. Because maybe a
front end server is different from a back end server, different from, like, a batch
processing job. Yeah, I think that if
we're saying policies, policies, policies, and you
don't know what your policies are, just ask yourself:
what do I need to do in order to feel confident with my code
to ship it to production? I know that as a developer, the
basic things that I would do are unit
testing, QA testing, integration testing, testing,
testing. I would also use clean code
conventions and coding standards that would make
me feel confident, and I would do
it through automated checks. And the
policies in the DevOps world would be to check your
Kubernetes resources. It's code like any
other code in your application. So the first thing that we
would want to do is to define the policies:
what we want to check on every code change.
Yeah. The next step would be to
integrate policies in your organization.
We would want to make sure that we validate
any code change through automated policy
checks on every change. So as we said before,
it can be through the CI CD, it can be as
a pre commit hook. I recommend using it also as
local testing, because it will make a
lot of sense to your developers; they're used to it. They're used to
unit testing. Every time they make a code change,
right before they push it to the repository,
it's always like: npm test. Good. I feel confident,
yeah, let's push it. And it would make a lot of sense to
them and it will help you create a
good communication between the developers and the DevOps.
It's like a gateway. Yes,
exactly. Yeah exactly. I think it brings confidence
to everyone. Yes of course.
But the third thing. Yeah, I think that one
of the challenges, so let's say, okay, we know which policies we want,
we integrate them, like, as a pre commit hook, as a VS
Code extension, in our CI, in our CD, and so on.
But then I think one of the challenges is like, okay,
an R&D organization is a living, breathing thing, right?
And you have new people coming in, people offboarding new services,
and then also you upgrade to new versions of
clusters. You have different clusters and different workloads. So how do
you dynamically adjust the policies that are needed in general,
the policies that you want to run on the specific workload
and how do you propagate the changes? So let's say I have 500 git
repositories and I enable those tests and now I upgrade,
I don't know, to Kubernetes 1.22, whatever.
So what now I need to make 500 pull requests to
change all my policies? I hope not.
Yeah, so you need some kind of like a centralized control,
centralized management that can pull and dynamically configure it.
Of course you can build it by yourself, maybe create, like, a Consul
or an etcd, or put it in S3 as
a JSON, I don't know what, but you need some kind of way to
centralize the management and control of the
policies, because it's just impossible to do it one by one. Unifying and
having a dedicated place is one of the biggest advantages
of using OPA, right? At Datree
this is something that we definitely wanted
to have. So now that we understand,
what are the steps that we need to do in order to start
using policies in your organization, let's talk about three main
tools that you can use in order to use it
in your organization, like how to implement it.
And we'll talk about three tools. We'll talk about conftest and Gatekeeper,
which are based on OPA under
the hood. And we'll talk about our very own Datree application,
which is also open source. And I really encourage
you to submit a pull request. This is my code,
mine and, obviously, my team's.
And the important thing that
I want you to think about when
we introduce these tools is
where would be the most suitable place for you to use them
in your pipeline. Because at the end of the day you
have a developer and you have a DevOps, you have development and
production, and the exact
point where you choose to use one of the tools might
affect your entire organization and how you work with Kubernetes.
If you use Gatekeeper, you will protect
the production itself, and not the pipeline of every
code change in the development phase,
as I like to say. But if you use Datree and conftest,
you will place it in the CI pipeline, like a shift left, before it even
reaches production. Yes. So let's go deeper into
it. So Gatekeeper is a
great project which is part of the open policy agent organization.
It runs inside your Kubernetes cluster, and
it's actually an admission webhook that is hooked
on top of your Kubernetes. And every time that someone does, like, a
kubectl apply command and schedules a resource
to be added or modified inside your Kubernetes,
it will run tests. You can write tests in Rego,
which is like a declarative policy language.
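For a flavor of what that Rego looks like inside Gatekeeper, here is a hedged sketch of a ConstraintTemplate that flags pods without memory limits; the template name and rule are illustrative, not from the talk:

```yaml
# Illustrative Gatekeeper ConstraintTemplate with an embedded Rego rule.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlimits        # hypothetical template name
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLimits
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlimits

        # Flag any container in the admitted object that lacks a memory limit.
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not container.resources.limits.memory
          msg := sprintf("container %v has no memory limit", [container.name])
        }
```

A matching K8sRequiredLimits constraint resource then selects which kinds and namespaces the rule applies to.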
And it will then allow or deny this change to
be applied to your cluster. This is great, but the
downside of it is that it's like it's in the end of the line.
And when I write the code and test it on my machine and
test it in my CI/CD, I won't see any problems. And only
when our continuous delivery / continuous deployment
actually applies it to production, only then will I see that there is a
problem. Yeah. So moving on to the next one is
conftest. Conftest. I really like
conftest. Conftest enables you to write tests on any
structured file. Any structured file:
it can be YAML, XML, Dockerfile, JSON,
anything. And it is especially designed to be used in the
CI or as local testing. You can practically think about it
as a unit testing library for your Kubernetes resources.
And under the hood, conftest is built on top of OPA, just like
Gatekeeper. So all the policies that you
write should also be written in Rego; there is
no other way. And something that is pretty cool about it is
that it also allows you to push and pull your
policies to a Docker registry. It's not about containers
anymore. And the way that conftest works is
very simple. You only need to install it,
write your policies, and manage them
of course, and run the command conftest test.
And it will simply output all the violations.
And that's it. That's mainly it about conftest.
It's really simple, and I
really like that it's in the CI pipeline.
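As a hedged sketch of that workflow, a conftest policy, conventionally placed under a policy/ directory in the default `main` package, that denies Deployments without memory limits might look like this; the rule itself is illustrative:

```rego
package main

# Deny any Deployment container that doesn't declare a memory limit.
# Illustrative rule, not one of the talk's predefined policies.
deny[msg] {
  input.kind == "Deployment"
  container := input.spec.template.spec.containers[_]
  not container.resources.limits.memory
  msg := sprintf("container %s must set a memory limit", [container.name])
}
```

Running `conftest test deployment.yaml` would then print each violation: the "npm test, good, let's push it" moment, but for manifests.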
But let's move forward to Datree.
Next slide. I don't see it. Ah, here, now I see it.
Okay, so yeah, the final one: we
have Datree. Again, it's open source, you can use it. Noaa wrote some code,
I wrote some code, feel free to check it out. The main difference
is that it's very similar to Gatekeeper and conftest,
but it has both capabilities, and it also comes predefined
with policies. So you don't need to actually write your
own policies; we have predefined rules and
rule packs that you can enable and disable. And you also have centralized
management. So in the next slide you will be able to see the
differences between the tools. So Gatekeeper,
it only runs in production, and you need to write the
tests manually. Conftest runs in
a shift-left way, and you can run it on your computer and/or in
your CI/CD, but again, you need to define the policies by yourself
and somehow build a centralized policy management.
And Datree is basically the same as conftest and Gatekeeper, only
you don't need to write your own rules; we have predefined rules. We also support
writing custom rules: you can write them in YAML
specs or in OPA. And then
finally, it has a centralized management mechanism. So let's
say you have 500 repos, you can dynamically say which rules
run on which resources, and so on.
And my main takeaways from this session,
from the 100-plus
postmortems: don't trust the default configuration;
define the policies in your organization and have clearly
defined validation standards
and coding standards, that's also something that I
encourage; delegate the knowledge to
the developers, put an effort into that; and
just remember that it can really happen to anyone.
And to sum up, I really want you to remember that Kubernetes
is not something that is done in a day, and DevOps
is a process, and you should really think about
how you want to do it and manage it in your organization. Because
it's a culture, it's something that takes time and
patience. And eventually Kubernetes will become your
production temple.
And I really hope that this will inspire you to start thinking about
your policies. What are your policies and how you
want to enforce those policies in your organization.
Great. So thank you very much, Noaa. It was great. Thank you very
much, Shimon. Thank you. And let us know if you have any questions.