Conf42 Site Reliability Engineering 2021 - Online

What we learned from reading 100+ Kubernetes Post-Mortems

Abstract

A smart person learns from their own mistakes, but a truly wise person learns from the mistakes of others.

When launching our product, we wanted to learn as much as possible about typical pains in our ecosystem, and did so by reviewing many post-mortems (100+!) to discover the recurring patterns, anti-patterns, and root causes of typical outages in Kubernetes-based systems.

In this talk we have aggregated for you the insights we gathered, and in particular will review the most obvious DON’Ts and some less obvious ones, that may help you prevent your next production outage by learning from others’ real world (horror) stories.

Summary

  • Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative.
  • Today we're going to talk to you about what we've learned from working with companies and from experiencing and reading a lot of Kubernetes postmortems. We'll also talk about the best practices for avoiding taking your cluster down or introducing security hazards.
  • Shimon Tolts is one of the co-founders and the CEO of Datree. The company aims to prevent misconfigurations from reaching production. It offers an open source CLI that scans Kubernetes manifests and Helm charts. Today they share lessons from 100+ postmortems.
  • Noaa: We are going to do it as my very own private game show. What are the prizes? What can I win? It's a million dollars. If you're curious about it, just contact me later. I'll tell you what you won or did not win.
  • The first example is a default CronJob configuration. With CronJobs you always mix up the schedule and the configuration. The problem is not the scheduling but the concurrency policy. Never ever trust the default configuration.
  • This is another CronJob configuration. What's the mistake here? It's an incorrect YAML structure. We always want to make sure that the YAML structure of our Kubernetes resources is valid. The problem was neither a misunderstanding of Kubernetes nor an unsuitable default; a simple automated check could have caught it.
  • One of the developers set a star in their ingress resource and immediately took their entire cluster down. The lesson that we learned from Target here is how important it is to delegate the knowledge from the DevOps team to the developer teams. You need to educate them and give them the tools in order to do the right thing.
  • Blue Matador took down their cluster with a pod that didn't have any limits. It can really happen to anyone. The way to go is to automate and have automated tests during your development process.
  • There are three main tools that you can use in order to apply policies in your organization. We'll talk about Conftest and Gatekeeper, which use OPA under the hood. And we'll also talk about our own Datree application, which is also open source. The important thing is to think about where the most suitable place would be to use these tools in your pipeline.
  • Conftest enables you to write tests on any structured file. A structured file can be YAML, XML, a Dockerfile, JSON, anything. It also allows you to push and pull your policies to a Docker registry. It's really simple and I really like that it's in the CI pipeline.
  • Datree is basically the same as Conftest and Gatekeeper, only it has predefined rules and rule packs that you can enable and disable. And finally it has a centralized management mechanism. I really hope that this will inspire you to start thinking about your policies.
  • Great. Thank you very much, Shimon. And let us know if you have any questions. Bye.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud. Hello. Hi, Conf42. Hi everyone. Hi Noaa. How are you? I'm very good. How are you, Shimon? Very good. Very excited. You know, there aren't so many SRE conferences. Really? Well, you know, it's a question by itself. What is an SRE? What is a DevOps? You can have an entire talk about the difference and who calls it what, but yeah. Cool. So today we're going to talk to you about what we've learned from working with companies and from experiencing and reading a lot of Kubernetes postmortems. We're going to go over some of the postmortems themselves and also talk about the best practices for avoiding taking your cluster down, introducing security hazards, and so on. So we're going to show you practical tools for doing it and also the methodology that we believe is a good way to avoid it. Cool. Spoilers. Spoilers, no? Nice ship. Yeah. Cool. So before we begin, maybe we should introduce ourselves. Yes. Hi everyone. My name is Noaa Barki. I've been a full stack developer for over five years, and I'm also a tech writer and one of the leaders of the GitHub Israel community, which is the largest GitHub community in the whole world. And I'm very, very excited to be here. Thank you all for coming. Thank you very much. And my name is Shimon. I'm one of the co-founders and the CEO of Datree. I'm also very much involved with communities. I'm an AWS Community Hero. I run an 8,000-member meetup group of cloud enthusiasts, mainly around AWS, and I also run the local CNCF chapter in Tel Aviv. My background is in DevOps and backend software engineering. And let's dive deep into, you know, let me tell you a little bit more about how we got here.
So in my previous role, I was the general manager of a software engineering department which was responsible for all of the infrastructure engineering, and we had 400 engineers in the company. At the time we were already running on the cloud, and we had a lot of services and hundreds of Git repositories. And one day a developer made a misconfiguration which reached production, which happens. I make mistakes all the time. What can you do? But at this point I was faced with the question: how do I propagate this development standard and make sure that people in my organization don't make the same mistakes again? And fast forward to today: we created Datree, which is a company that aims to prevent misconfigurations from reaching production. And the way we do it is we have an open source CLI. I invite you to go and check it out. It's written in Go. What it does is scan Kubernetes manifests and Helm charts. Every time that a developer or anyone makes a code change to a Kubernetes-related manifest, we scan it, we provide predefined rules out of the box, and we prevent the misconfigurations that we identify. And usually people run it in the CI/CD and/or on their machine. So probably, if you read between the lines here, and I'm sure that you do: at Datree, because we integrate with the code that you write, your Kubernetes resources, and we check them on every change, policies are what we do for a living. This is what we do every day, every single day, all day. And part of my job as a developer at Datree was not only to understand how Kubernetes works and what the main components in Kubernetes are, but also how you can blow up your own cluster. And this is practically what we are going to talk about today, the 100 plus postmortems. So today, Shimon, I know that we usually do it in a postmortem way, so usually we would talk about the event and see what lessons we've learned from that event.
But especially today. Are you ready? We are going to do it in my very own "What's the Mistake?" private game show. Wait, what are the prizes? What are the prizes? What can I win? It's a million dollars. What is it? Okay, I'm sorry, I didn't think about it, but I will. I promise I will. Okay. If you're curious about it, just contact me later. I'll tell you what you won or did not win. Maybe I won't win. I believe in you. Thank you. So what we are going to do is that I'll show you a Kubernetes resource and you will have to find or guess what the mistake is. Are you ready? I'm ready. Let's do it. Hit me. Hit me. This is a default CronJob configuration. Yeah. So there isn't too much configuration here. With CronJobs you always mix up the schedule and the configuration. So I'll go and say that there is a problem with the scheduling here. The scheduling, I guess. Is it the scheduling? Maybe not. No, it's not the scheduling. Low tech effects. Okay. It's the concurrency policy. We always want to make sure that we set the concurrency policy to either Forbid or Replace. Do you know why? Tell me more. I'll tell you. It's because when we set it to Allow, I'm sorry, when we set it to, no, I'm not sorry, when we set it to Allow, whenever a CronJob run fails, it will not replace the previous one, it will always create a new one. Oh, and if I'm not mistaken, the CronJob here spawns, like, a task every minute. Yes. So I guess they wanted to have, like, an almost long-lived one. And if it doesn't finish the previous one, it will just spawn more. Yes, every minute. So the concurrency policy, as it sounds, controls the concurrency. Do you want to know how much more? I'll tell you anyway. If you don't want to, that's your problem. But if you want to know how much more, it's 4,300 plus pods that were constantly restarting. And this is actually what happened to Target. They had one failing CronJob that created thousands of pods.
And not only did it immediately take their cluster down, it also cost them a lot of money, because they accumulated thousands of CPUs. So the lesson here that we learned from Target is to never ever trust the default configuration. Just because Kubernetes allows you to deploy a resource specification with the defaults, it doesn't mean that you should do it. It doesn't mean that this is the configuration that you should use, even though it's the default. And you always say to yourself, that's the default. I don't see how it can harm me, I guess, because maybe the classical case is that you spawn a CronJob once a day or something. So I'm thinking, as a Kubernetes developer, it makes sense to me why you would use the concurrency policy Allow, but in this use case, when they're trying to spawn it every minute, it doesn't make sense. So I guess even if you configure something, you should check what the default configuration is. Even if the defaults are not stated in your manifest, you should check what is under the hood and understand how it is going to behave. Yeah, I know that whenever I see the default behavior, subconsciously I always say to myself, well, how can it be dangerous? It's supposed to be safe. But it's not. You should think about it. But let's move forward to the next mistake. This is another CronJob configuration. What's the mistake here? Okay, so you've taught me about the concurrency policy, which here is Forbid, which you taught me is good. Yes, it's good. By the way, I think that maybe in some cases it's bad, right? It depends on your use case. You are totally correct. And I want to say to you, the audience, that whenever I say that we should always do something, I have a disclaimer. My main point is for you to know why you should not use it and how it works under the hood, so you will definitely know if that default behavior is good for you. Sometimes you would want to use Allow. Yeah, I agree, it's there for a reason. Yeah, that's for sure. Yeah.
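The concurrency-policy rule from this exchange can be turned into an automated check. What follows is only a minimal sketch, not the speakers' tooling: it assumes the manifest has already been parsed from YAML into a Python dict, and the function name `check_concurrency_policy` and the sample manifest are hypothetical.

```python
from typing import Any

SAFE_POLICIES = {"Forbid", "Replace"}


def check_concurrency_policy(manifest: dict[str, Any]) -> list[str]:
    """Return violations for a CronJob that is left at, or set to, Allow."""
    if manifest.get("kind") != "CronJob":
        return []
    name = manifest.get("metadata", {}).get("name", "<unnamed>")
    # Kubernetes treats a missing concurrencyPolicy as "Allow".
    policy = manifest.get("spec", {}).get("concurrencyPolicy", "Allow")
    if policy in SAFE_POLICIES:
        return []
    return [f"CronJob {name}: concurrencyPolicy is {policy!r}; set Forbid or Replace"]


# Hypothetical manifest: runs every minute and relies on the default policy.
every_minute = {
    "apiVersion": "batch/v1",
    "kind": "CronJob",
    "metadata": {"name": "every-minute"},
    "spec": {"schedule": "* * * * *", "jobTemplate": {"spec": {}}},
}

for violation in check_concurrency_policy(every_minute):
    print(violation)
```

As the speakers note, Allow is sometimes the right choice, so a real rule would likely need a way to opt out for specific workloads.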
So the concurrency policy is correct. I'll go with restart policy. Restart policy, yes. Let's find out. No, it's not the restart policy, it's an incorrect YAML structure. We always want to make sure that, and this time, yes, I mean it, we always want to make sure that the YAML structure of our Kubernetes resources is valid. There is no case where you wouldn't want to do it. And this is actually what happened to Zalando. Zalando is a big online fashion company with over 6,000 employees. I mention that because I want you to understand the scale of that company. What happened to that company is that their API server went down due to out-of-memory issues, because they used the correct resource configuration for the CronJob but they placed it incorrectly in their YAML. Instead of placing it here, they placed it here. So it's like you think you configured it, but actually you didn't, because it's in the wrong place. So Kubernetes takes the default configuration or other configuration that you have. So you actually didn't configure it. Yes, very misleading. And it's just, like, tabs. Yes, it was an honest mistake, it was an innocent mistake, and one of their developers did it by mistake. And as you can see here, the concurrency policy wasn't part of the CronJob spec. So they ended up with a CronJob that created thousands of pods that immediately took their API server down. And Zalando taught me the importance of having clearly defined standards, policies, and validation standards in your organization, because this is actually a YAML structure that was invalid. It's not the understanding of Kubernetes that was wrong, and it's not that the default behavior was not the behavior that you should have used. It's a simple automated check that you could have done. And it's really hard as a human to see this, because it's the right thing, it's just placed in the wrong location. It's kind of hard to see it.
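A misplaced field like the one described here is hard to spot by eye but easy to detect mechanically. The sketch below is an illustration under assumptions: the manifest is a parsed dict, the helper names are hypothetical, and the wrong location shown (under `jobTemplate.spec`) is an invented example, since the talk does not say where the field actually ended up.

```python
from typing import Any

EXPECTED_PATH = ("spec", "concurrencyPolicy")


def find_key_paths(node: Any, key: str, path: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
    """Collect every path in a nested dict/list structure that ends in `key`."""
    found: list[tuple[str, ...]] = []
    if isinstance(node, dict):
        for k, value in node.items():
            if k == key:
                found.append(path + (k,))
            found.extend(find_key_paths(value, key, path + (k,)))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(find_key_paths(item, key, path + (str(index),)))
    return found


def misplaced_concurrency_policy(manifest: dict[str, Any]) -> list[str]:
    """Return dotted paths where concurrencyPolicy appears outside the CronJob spec."""
    paths = find_key_paths(manifest, "concurrencyPolicy")
    return [".".join(p) for p in paths if p != EXPECTED_PATH]


# Hypothetical manifest: the right key, indented one level too deep.
misplaced = {
    "kind": "CronJob",
    "metadata": {"name": "report"},
    "spec": {
        "schedule": "* * * * *",
        "jobTemplate": {"spec": {"concurrencyPolicy": "Forbid"}},
    },
}

print(misplaced_concurrency_policy(misplaced))
```

Validating manifests against the Kubernetes resource schema would catch the same class of mistake more generally; this sketch only covers the single field discussed in the talk.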
As a developer, when I thought about the Zalando incident, it always reminded me of the days before TypeScript, when you could have forgotten, I don't know, a semicolon or something and just everything didn't work. And you don't know why. You don't know why. Yeah. So this is what Zalando taught me. But let's move forward to the next question. This is an ingress resource. What's the mistake? I guess, you know, in this case it seems like the host is star, which is kind of weird, because usually you want to give, like, a URL, or where do you point this? Like, which service do you want to listen to this ingress? So I'll go with the host star. Are you sure it's not maybe the rules? Is this like Who Wants to Be a Millionaire? You're going to, like, ask me a thousand times? Yes, why not? My final answer is host star. This is your final answer? My final answer, yes. And you are correct. The host. We want to prevent users from specifying the host as a wildcard. And the reason that we would want to do it is because when we put a star in the ingress resource, Kubernetes will immediately forward all the traffic to that container. So you may end up with one container receiving all the traffic of an entire cluster. And this is what happened to Target. This is not something that you never want to do, because sometimes you would want to use a wildcard. But what happened to Target is that one of the developers set a star in their ingress resource and just immediately took their entire cluster down. And the lesson that we learned here from Target, this is actually a fun fact: it was their first mistake when they started to use Kubernetes. Anyway, the lesson that we learned from Target here is how important it is to delegate the knowledge from the DevOps team to the developer teams, or from the SRE teams, God forbid the SRE, to all the teams in the organization. It's very important.
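The wildcard-host rule from the ingress question can be checked the same way. This is a sketch under assumptions, with a hypothetical function name and sample manifest. It also flags rules with no `host` at all, because in Kubernetes an Ingress rule without a host applies to all inbound traffic; that extension goes beyond what the speakers describe.

```python
from typing import Any


def check_ingress_hosts(manifest: dict[str, Any]) -> list[str]:
    """Flag Ingress rules whose host is '*' or missing, i.e. rules that match every host."""
    if manifest.get("kind") != "Ingress":
        return []
    name = manifest.get("metadata", {}).get("name", "<unnamed>")
    violations = []
    rules = manifest.get("spec", {}).get("rules") or []
    for index, rule in enumerate(rules):
        host = rule.get("host")
        if host is None or host.strip() in ("", "*"):
            violations.append(f"Ingress {name}: rules[{index}] matches every host (host={host!r})")
    return violations


# Hypothetical manifest with the mistake from the talk.
catch_all = {
    "kind": "Ingress",
    "metadata": {"name": "frontend"},
    "spec": {"rules": [{"host": "*", "http": {"paths": []}}]},
}

for violation in check_ingress_hosts(catch_all):
    print(violation)
```

Since a wildcard is sometimes intentional, a team might prefer to report this as a warning or allow an explicit exemption rather than always blocking the change.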
And I think that even five years ago, when you worked at that company that you talked about, it was still important back then. It's not something that is really related to Kubernetes, it's just that Kubernetes is such an amazing thing, and it's across your organization whether you want it or not. This is the entire app. I think that once we switch from an Ops team and a dev team, with you as a developer throwing things over the wall to the Ops team to do it, and once you delegate infrastructure as code and operations responsibilities to developers, that's it. You've got to educate them, you've got to give them the tools in order to do the right thing. Because otherwise how are you supposed to do it? Like, maybe I'm the best Java payment engineer, but I'm not a Docker expert, I'm not a Kubernetes expert. I think it's more than that, because I always get angry when people say that DevOps and developers should work together, just like that. It's not the same, it's not the same job, it's not the same way of thinking. I am a developer and I want to be the best feature machine ever. And you, as a DevOps engineer, would want to keep your production safe; you're the production warrior. That's the reason that you wake up every morning and go to work. And it's not only that, it's also how we think. As a developer, I take, I don't know, let's imagine a microservice, and I take that microservice and I go to the controller and then to the service and then to the bits and the bytes of every variable. And you, as a DevOps engineer, should think in a zoomed-out way. You should think, okay, I'll take that microservice and I'll put a load balancer and that region, and what if it goes to that place and that place. But that's the thing: now developers also need to think about it. Yes, and you need to give them the tools. Exactly. This is the key point, because you should understand that they're different people, and you should teach them. I agree.
But you know, today if I interview someone, like a developer, one of the questions that I ask them is, how is your code making it to production? If a developer tells me it's magic, like, I send it and then there's a build and Harry Potter does the CI/CD, you know. I think it's important. But let's move on to the next one. Yes, you are correct. Okay, so this is just a simple pod. A simple, innocent pod. What's the mistake here? It doesn't seem like there's a mistake here. Front end. Yeah, I'll go with no mistake. What could go wrong here? No mistake. No mistake. Are you sure? I am positive. This is my final answer. No mistake. No, you are not correct. This pod doesn't have any limits. We always want to make sure that we specify the resource limits, especially if we use third party applications. And this is actually what happened to Blue Matador. This is a small startup company. They had one pod that served a third party application, but it didn't have any limits. So nothing stopped that pod from taking the entire... Yeah, like if you think about it, especially when you bring in third party applications, you don't know what they're going to do. And running things in containers inside Kubernetes is similar to, but different from, running them in a VM, right? Because let's say you have an EC2 instance or a VM or something, you're forced to give it a memory limit and a CPU limit, because you define how much memory and CPU it's going to have. But here in Kubernetes it's interesting, you can actually spawn it without specifying any limits. Now, I remember using, for example, RabbitMQ, which is a very popular queuing service, and the default configuration is just to accumulate as much memory as possible in order to use it for queuing. So the default behavior of this third party application is, like, I'm going to hog your entire memory.
And if you don't know that... It's very important to put boundaries and limits on all of your third party applications. Yes, but actually the lesson that Blue Matador taught me, and I think it's the most important lesson of all, is that it can really happen to anyone. And when I say anyone, I mean anyone. Google, Spotify, Airbnb, Datadog, Toyota, really anyone. The Internet is full of stories about companies that blew up their cluster, you know, but the real question is how we can prevent it in the future. Yeah, because I think remembering, like, don't put a star here and put the memory there, in a way it's becoming redundant and it's very hard. It's almost impossible. Like, how do you propagate this to, I don't know, 400 developers, 100 developers? And every day there are new features added to Kubernetes, and every day you need to remember whether you want to use the default configuration or not use the default configuration, and maybe you have new developers. Yeah, you onboard new developers every day, and how are you supposed to tell them? So I'm a big believer in automation, and I think that the way to go is actually to automate and have automated tests during your development process, both on your computer and/or in your CI/CD process. So I think that, number one, you need to define what your policies are. And you can use tools like Conftest and Gatekeeper or Datree. The main difference between them is, like, with Conftest for example, it's an amazing tool, but you need to script all of your rules by yourself. So you could go and look at different best practices. And I think that the important thing is, number one, to define what policies you want to have and which ones you want to enable, and also which rules should run on which types of workloads. Because maybe a front end server is different than a back end server, which is different than, like, batch processing.
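The missing-limits mistake from the Blue Matador example is another candidate for an automated test. The sketch below assumes parsed manifests, and requiring both a CPU and a memory limit is an assumption of this example rather than something stated in the talk; `pod_spec_of` and `check_resource_limits` are hypothetical names.

```python
from typing import Any

# Assumption for this sketch: every container must declare both limits.
REQUIRED_LIMITS = ("cpu", "memory")


def pod_spec_of(manifest: dict[str, Any]) -> dict[str, Any]:
    """Return the pod spec of a Pod, or of a workload that nests it under spec.template.spec."""
    spec = manifest.get("spec", {})
    if manifest.get("kind") == "Pod":
        return spec
    return spec.get("template", {}).get("spec", {})


def check_resource_limits(manifest: dict[str, Any]) -> list[str]:
    """Flag containers that do not declare every limit in REQUIRED_LIMITS."""
    name = manifest.get("metadata", {}).get("name", "<unnamed>")
    violations = []
    for container in pod_spec_of(manifest).get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        missing = [resource for resource in REQUIRED_LIMITS if resource not in limits]
        if missing:
            violations.append(
                f"{name}: container {container.get('name', '<unnamed>')} "
                f"has no {', '.join(missing)} limit"
            )
    return violations


# Hypothetical "simple, innocent pod" with no resources section at all.
innocent_pod = {
    "kind": "Pod",
    "metadata": {"name": "frontend"},
    "spec": {"containers": [{"name": "web", "image": "nginx"}]},
}

for violation in check_resource_limits(innocent_pod):
    print(violation)
```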
Yeah, I think that we keep saying policies, policies, policies. If you don't know what your policies are, just ask yourself, what do I need to do in order to feel confident enough in my code to ship it to production? I know that as a developer the basic things that I would do are unit testing and QA testing, integration testing, testing, testing, testing. I would also use clean code and conventions and coding standards that would make me feel confident, and I would do it through automated checks. And the policies in the DevOps world would be to check your Kubernetes resources. It's code like every other code in your application. So the first thing that we would want to do is to define the policies: what we want to check every time, on every code change. Yeah. The next step would be to integrate policies in your organization. We would want to make sure that we validate any code change through automated policy checks on every change. So as we said before, it can be through the CI/CD, it can be as a pre-commit hook. I recommend using it also for local testing, because it will make a lot of sense to your developers, because they're used to it. They're used to unit testing. Every time that they make a code change, right before they push it to the repository, they always run, like, npm test. Good. I feel confident, yeah, let's push it. And it would make a lot of sense to them, and it will help you create good communication between the developers and the DevOps. It's like a gateway. Yes, exactly. Yeah, exactly. I think it brings confidence to everyone. Yes, of course. But the third thing. Yeah, I think that one of the challenges, so let's say, okay, we know which policies we want, we integrate them, like, as a pre-commit hook, as a VS Code extension, in our CI, in our CD, and so on. But then I think one of the challenges is, like, okay, an R&D organization is a living, breathing thing, right?
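Running policy checks locally, as a pre-commit hook, or in CI usually comes down to one convention: the check exits with a non-zero code when a violation is found. The sketch below shows that shape only; it is not how Datree, Conftest, or Gatekeeper are implemented, and the rule functions and `run_policies` are hypothetical.

```python
from typing import Any, Callable

Manifest = dict[str, Any]
Rule = Callable[[Manifest], list[str]]


def explicit_concurrency_policy(manifest: Manifest) -> list[str]:
    """Flag CronJobs that rely on the default (Allow) concurrency policy."""
    if manifest.get("kind") != "CronJob":
        return []
    if manifest.get("spec", {}).get("concurrencyPolicy") in ("Forbid", "Replace"):
        return []
    return ["CronJob should set concurrencyPolicy to Forbid or Replace"]


def no_wildcard_host(manifest: Manifest) -> list[str]:
    """Flag Ingress rules that use '*' as the host."""
    if manifest.get("kind") != "Ingress":
        return []
    rules = manifest.get("spec", {}).get("rules") or []
    return ["Ingress host must not be '*'" for rule in rules if rule.get("host") == "*"]


def run_policies(manifests: list[Manifest], rules: list[Rule]) -> int:
    """Print every violation and return a process exit code (0 = pass, 1 = fail)."""
    violations = [msg for manifest in manifests for rule in rules for msg in rule(manifest)]
    for msg in violations:
        print(f"FAIL: {msg}")
    return 1 if violations else 0


RULES: list[Rule] = [explicit_concurrency_policy, no_wildcard_host]

# Hypothetical set of manifests touched by a code change.
changed = [
    {"kind": "CronJob", "spec": {"schedule": "* * * * *"}},
    {"kind": "Ingress", "spec": {"rules": [{"host": "shop.example.com"}]}},
]
exit_code = run_policies(changed, RULES)
print("exit code:", exit_code)
```

A wrapper script would pass the returned value to `sys.exit`, so that the same command can block a commit locally and fail a pipeline step in CI.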
And you have new people coming in, people offboarding, new services, and then you also upgrade to new versions of clusters. You have different clusters and different workloads. So how do you dynamically adjust the policies that are needed in general and the policies that you want to run on a specific workload, and how do you propagate the changes? So let's say I have 500 Git repositories and I enable those tests, and now I upgrade, I don't know, to Kubernetes 1.22, whatever. So what, now I need to make 500 pull requests to change all my policies? I hope not. Yeah, so you need some kind of centralized control, centralized management, that can pull and dynamically configure it. Of course you can build it by yourself, maybe set up, like, a Consul or an etcd, or put it in S3 as a JSON, I don't know what, but you need some kind of way to centralize the management and control of the policies, because it's just impossible to do it one by one. Unifying and having a dedicated place is one of the biggest advantages of using OPA, right? At Datree this is something that we definitely want to have. So now that we understand what steps we need to take in order to start using policies in an organization, let's talk about three main tools that you can use to implement it. We'll talk about Conftest and Gatekeeper, which use OPA under the hood. And we'll talk about our very own Datree application, which is also open source. And I really encourage you to submit a pull request. This is my code, mine and obviously my team's. And the important thing that I want you to think about when we introduce those tools is where the most suitable place would be for you to use them in your pipeline.
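One way to picture the centralized management described here is a single policy document that every repository reads, instead of 500 copies. The sketch below is purely illustrative: the JSON layout, the `workload-type` label, the rule identifiers, and `rules_for` are all assumptions, and the document is inlined as a string where a real setup would fetch it from a shared store such as the ones mentioned in the talk.

```python
import json
from typing import Any

# Stand-in for a policy document fetched from one shared location.
CENTRAL_POLICY_JSON = """
{
  "default": ["resource-limits"],
  "by_workload": {
    "frontend": ["resource-limits", "no-wildcard-host"],
    "batch": ["resource-limits", "explicit-concurrency-policy"]
  }
}
"""


def rules_for(manifest: dict[str, Any], policy: dict[str, Any]) -> list[str]:
    """Pick the rule identifiers that apply to a manifest, based on a workload label."""
    labels = manifest.get("metadata", {}).get("labels") or {}
    workload = labels.get("workload-type")
    return policy["by_workload"].get(workload, policy["default"])


policy = json.loads(CENTRAL_POLICY_JSON)

nightly_report = {
    "kind": "CronJob",
    "metadata": {"name": "nightly-report", "labels": {"workload-type": "batch"}},
}
print(rules_for(nightly_report, policy))
```

With this shape, enabling or disabling a rule for every batch workload is one edit to the shared document rather than one pull request per repository.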
Because at the end of the day you have a developer and you have a DevOps engineer, and you have development and production, and the exact point where you choose to use one of the tools might affect your entire organization and how you work with Kubernetes. If you use Gatekeeper, you will protect the production itself and not the pipeline of code changes in the development phase, as I like to say. But if you use Datree or Conftest, you will place it in the CI pipeline, like a shift left, before it even reaches production. Yes. So let's go deeper into it. So Gatekeeper is a great project which is part of the Open Policy Agent organization. It runs inside your Kubernetes cluster, and it's actually an admission webhook that is hooked on top of your Kubernetes. And every time that someone runs, like, a kubectl apply command and schedules a resource to be added or modified inside your Kubernetes cluster, it will run tests. You can write tests in Rego, which is a declarative policy language. And it will then allow or deny this change to be applied to your cluster. This is great, but the downside of it is that it's at the end of the line. When I write the code and test it on my machine and test it in my CI/CD, I won't see any problems. Only when our continuous delivery, continuous deployment to production actually applies it, only then will I see that there is a problem. Yeah. So moving on, the next one is Conftest. Conftest. I really like Conftest. Conftest enables you to write tests on any structured file. A structured file can be YAML, XML, a Dockerfile, JSON, anything. And it is specially designed to be used in the CI or for local testing. You can practically think about it as a unit testing library for your Kubernetes resources. And under the hood Conftest is built on top of OPA, just like Gatekeeper. So all the policies that you write should also be written in Rego. There is no other way.
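The allow-or-deny step of an admission webhook, as described for Gatekeeper, can be sketched in a few lines. This is not Gatekeeper's implementation, which evaluates Rego policies; it only illustrates the request and response shape of a Kubernetes `AdmissionReview`, with a hypothetical `review` function and a single hypothetical rule.

```python
from typing import Any, Callable

Rule = Callable[[dict[str, Any]], list[str]]


def deny_wildcard_host(manifest: dict[str, Any]) -> list[str]:
    """Hypothetical rule: reject Ingress objects whose host is '*'."""
    if manifest.get("kind") != "Ingress":
        return []
    rules = manifest.get("spec", {}).get("rules") or []
    return ["Ingress host must not be '*'" for rule in rules if rule.get("host") == "*"]


def review(admission_review: dict[str, Any], rules: list[Rule]) -> dict[str, Any]:
    """Build an AdmissionReview response that allows or denies the submitted object."""
    request = admission_review["request"]
    manifest = request["object"]
    violations = [msg for rule in rules for msg in rule(manifest)]
    response: dict[str, Any] = {"uid": request["uid"], "allowed": not violations}
    if violations:
        # The message is what the user sees when the apply is rejected.
        response["status"] = {"code": 403, "message": "; ".join(violations)}
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview", "response": response}


# Hypothetical request, shaped like what the API server sends on `kubectl apply`.
incoming = {
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {
        "uid": "example-uid",
        "object": {"kind": "Ingress", "spec": {"rules": [{"host": "*"}]}},
    },
}
print(review(incoming, [deny_wildcard_host])["response"])
```

Because this decision happens only when the object reaches the cluster, it illustrates the speakers' point: the developer sees the rejection at deploy time, not while writing the manifest.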
And something that is pretty cool about it is that it also allows you to push and pull your policies to a Docker registry. It's not about containers anymore. And the way that Conftest works is very simple. You only need to install it, write your policies and manage them, of course, and run the command conftest test. And it will simply output all the violations. And that's it. And that's it. That's mainly it about Conftest. It's really simple, and I really like that it's in the CI pipeline. But let's move forward to Datree. Next slide. I don't see. Ah, here, now I see it. Okay, so yeah, the final one is Datree. Again, it's open source, you can use it. Noaa wrote some code, I wrote some code, feel free to check it out. The main difference is that it's very similar to Gatekeeper and Conftest, but it has both capabilities and it also comes with predefined policies. So you don't actually need to write your own policies. We have predefined rules and rule packs that you can enable and disable. And you also have centralized management. So in the next slide you will be able to see the difference between the tools. So Gatekeeper only runs on production, and you need to write the tests manually. Conftest runs in a shift-left way, and you can run it on your computer and/or in your CI/CD. But again, you need to define the policies by yourself and somehow build centralized policy management. And Datree is basically the same as Conftest and Gatekeeper, only you don't need to write your own rules. We have predefined rules. We also support writing custom rules. You can write them in YAML specs or in OPA. And then finally it has a centralized management mechanism. So let's say you have 500 repos, you can dynamically say which rules run on which resources, and so on.
And my main takeaways from this session, from these 100 postmortems, are: don't trust the default configuration, define the policies in your organization, and have clearly defined validation standards and coding standards. That's also something that I encourage. Delegate the knowledge to the developers, put effort into that, and just remember that it can really happen to anyone. And to sum up, I really want you to remember that Kubernetes is not something that is done in a day, and DevOps is a process, and you should really think about how you want to do it and manage it in your organization, because it's a culture, it's something that takes time and patience. And eventually Kubernetes will become your production temple. And I really hope that this will inspire you to start thinking about your policies: what your policies are and how you want to enforce those policies in your organization. Great. So thank you very much, Noaa. It was great. Thank you very much, Shimon. Thank you. And let us know if you have any questions.
...

Noaa Barki

Full-stack Developer @ Datree


Shimon Tolts

CEO @ Datree



