Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
              We have a really interesting topic here. Most of the day, we have been
            
            
            
              talking about chaos engineering in general. This talk is
            
            
            
              a little bit more about how to practice it, rather than
            
            
            
              preaching why it is needed and what it
            
            
            
              is.
            
            
            
              Before we start, about me, I go by Uma
            
            
            
              Mukkara. You can call me Uma. I'm co-founder and
            
            
            
              CEO of a company called MayaData.
            
            
            
              We do cloud native data management primarily
            
            
            
              for Kubernetes. So you can see
            
            
            
              me talking a lot about Kubernetes today.
            
            
            
              And I'm a co-creator of the following open source
            
            
            
              projects. The genesis of this entire
            
            
            
              cloud native chaos engineering is really
            
            
            
              about trying to do chaos engineering for
            
            
            
              an open source project called OpenEBS.
            
            
            
              And we ended up creating a
            
            
            
              cloud native chaos engineering infrastructure called Litmus



              while trying to do chaos engineering as a practice for



              OpenEBS. I'll talk a little bit more about that.
            
            
            
              So we've been talking about chaos engineering.
            
            
            
              It's a need everybody knows, right? So Russ Miles
            
            
            
              started with why chaos engineering is a need.
            
            
            
              And we've also been hearing it's a culture you need to
            
            
            
              practice. It's not a thing. Right,
            
            
            
              but I got all that. How do I actually get started?
            
            
            
              Right? This is the question that we got into.
            
            
            
              I'm in this Kubernetes world, and I'm
            
            
            
              developing an application, or I'm operating a big infrastructure
            
            
            
              or a deployment. I need chaos engineering. How do I



              do it? Right. As part of the operations work
            
            
            
              I do, I got into
            
            
            
              a situation where we needed to practice chaos engineering for
            
            
            
              a very different reason. We were building a data management
            
            
            
              project called OpenEBS, where the community
            
            
            
              has to use this software
            
            
            
              and rely on it for their data. Right.
            
            
            
              Storage is a very difficult subject. We all know that.
            
            
            
              So you don't want to lose your data at any point of time. So how
            
            
            
              do you convince your users that you have done
            
            
            
              enough testing or it has enough resilience?
            
            
            
              So the best way is for them to find out that,
            
            
            
              okay, I can actually break something and still see that it is good.
            
            
            
              Right? So we started working on that and
            
            
            
              then realized, okay, Kubernetes itself is
            
            
            
              new. Awareness of chaos engineering
            
            
            
              is growing, but we need to start creating some infrastructure
            
            
            
              for our own use. And then slowly we realized this
            
            
            
              can be useful for the community as
            
            
            
              a whole. And then we created Litmus. Right?
            
            
            
              So today I just want to touch upon the
            
            
            
              following topics. We all know what chaos



              engineering is, but what is cloud native chaos engineering, right? And then
            
            
            
              if we understand why cloud native chaos engineering is
            
            
            
              a different subject or a little bit of a deviation from
            
            
            
              what we've been talking about, what are those principles?



              We observed some common principles, so we can say that here
            
            
            
              are the cloud native chaos engineering principles.
            
            
            
              And then we created a project called Litmus, and we can talk about



              whether it really follows all these principles.
            
            
            
              And then, how many of you know Operator



              Hub in Kubernetes?



              It's very similar. Or Helm charts, right? Helm charts



              are a popular concept because you have a chart to



              pull and then bring an application to life.
            
            
            
              So the same concept applies to a chaos experiment
            
            
            
              as well. Right. Then we'll go to some examples and then we
            
            
            
              can talk about what



              the next steps are, and so forth. Right? So what is
            
            
            
              cloud native chaos engineering? Are there any
            
            
            
              cloud native developers here?
            
            
            
              Yeah, what is a cloud native developer?
            
            
            
              If you can docker pull as a developer,
            
            
            
              then you're a cloud native developer,
            
            
            
              right? So it's really about practicing
            
            
            
              chaos engineering for specific environments,
            
            
            
              which are called cloud native environments. We'll just see what
            
            
            
              these environments are. And you can also say that
            
            
            
              chaos engineering done in a cloud native way is called cloud
            
            
            
              native chaos engineering. To simplify even further,
            
            
            
              chaos engineering done in a Kubernetes native way is
            
            
            
              called cloud native chaos engineering. So it's all about this general concept
            
            
            
              of how Kubernetes is changing our lives as developers



              and SREs. You might have observed this



              many, many times: there is a particular way that



              DevOps is done now. The way we practice



              DevOps has changed.
            
            
            
              Adrian talked about infrastructure as code,



              which has taken hold as a primary component
            
            
            
              of Kubernetes deployments. Right.
            
            
            
              So I'll just touch upon
            
            
            
              what a cloud native environment looks like. I've taken this
            
            
            
              concept or picture from one
            
            
            
              of the GitLab Commit conferences, where Dan Kohn from CNCF
            
            
            
              was presenting about why a CI pipeline is
            
            
            
              useful in a cloud native environment. I'll just



              repurpose that for a different need here.
            
            
            
              Right, so you are writing a lot
            
            
            
              of code if you're a cloud native developer,
            
            
            
              and then after you develop, you run it through CI pipelines and then you put
            
            
            
              it into production, or you release it and somebody's going to pull
            
            
            
              that and then use it, right? So where are they going to deploy it?
            
            
            
              So let's assume that it's a Linux environment. That's a



              lot of code. And you will have Kubernetes



              running on it, right? That's much more code than Linux



              itself, right? 35 million source lines of code



              compared to Linux. And then you will have a lot of microservices
            
            
            
              infrastructure applications that will help you run your code
            
            
            
              and then there are a lot of small libraries or databases that
            
            
            
              run on top. And finally your code, right?
            
            
            
              So that's how small your code is, it's about 1%.
            
            
            
              Right? And the common factor is you need to think about how often
            
            
            
              your Linux upgrades happen,
            
            
            
              right? Maybe 18 months, twelve months.
            
            
            
              And can you guess how often Kubernetes upgrades happen?



              Before you have understood one version of Kubernetes, another version comes
            
            
            
              in, right? So it's very dynamic and
            
            
            
              that's the purpose of actually adopting Kubernetes, because it runs anywhere and



              then everybody runs one thing inside



              a container, and one thing only, right?
            
            
            
              So Kubernetes upgrades happen very frequently.
            
            
            
              And your apps, not your apps,
            
            
            
              but the applications or microservices surrounding your
            
            
            
              application, these also need to be updated very
            
            
            
              frequently and then you don't control all of it, right? And you are
            
            
            
              always thinking about my code. How do I make sure that the
            
            
            
              code that I bring into deployments is working fine,
            
            
            
              right? So the summary is that your environment, the cloud native environment,
            
            
            
              is very dynamic, and it requires continuous
            
            
            
              verification, right? I mean it's the same concept of
            
            
            
              don't assume but verify. Right? But what do you
            
            
            
              verify? Do you verify your 50,000 lines



              of code, or the remaining 99%? How do
            
            
            
              you verify? If I'm an SRE, I cannot go and say,



              "Everything was working fine. I verified everything.



              But it's a bug in the Kubernetes



              service that is provided by the cloud provider." Right.
            
            
            
              You cannot assume that the cloud provider has
            
            
            
              really hardened up the Kubernetes
            
            
            
              stack or any other microservices. So it's really up to the
            
            
            
              SRE, or whoever is running the ops, to verify
            
            
            
              that things are going to be fine. Right?
            
            
            
              So that's one thing. So how do you do this?
            
            
            
              Obviously it's chaos engineering. Chaos engineering,
            
            
            
              not just for your code, but for the entire
            
            
            
              stack, right? Because your application depends
            
            
            
              on the entire pyramid. So that's
            
            
            
              one concept. The other big difference is, okay, I need to
            
            
            
              do chaos engineering, but how? What's this cloud native
            
            
            
              environment? The big difference is everything
            
            
            
              is YAML. You need to practice this YAML
            
            
            
              engineering, right? You cannot write. I mean,
            
            
            
              I think Andrew has really demonstrated very well today about
            
            
            
              how to do chaos engineering for a MySQL server



              on Kubernetes, how we



              really killed some pods, or used KubeInvaders. That's good for demos.



              But how do you really practice it if you are an SRE in production,
            
            
            
              right? You have certain principles. If I am putting something
            
            
            
              into an infrastructure, it should be in YAML,



              and I need to put it into Git. And even if I'm



              doing some chaos experiment, it should be in Git, because
            
            
            
              I am making an infrastructure change, I'm killing a pod.
            
            
            
              Who gave you permission? Right? So it has to be
            
            
            
              recorded, and that's GitOps. Right?
            
            
            
              So that's what the cloud native environment



              dictates. If you are a cloud native engineer or a developer, you have



              to follow these infrastructure principles.
            
            
            
              So what you need is not just chaos engineering,
            
            
            
              but it is cloud native chaos engineering. That's the concept I
            
            
            
              want to drive here in
            
            
            
              a simple way. This is my definition.
            
            
            
              If you put chaos engineering as



              an intent into a YAML manifest, then you can
            
            
            
              say that you are practicing cloud native
            
            
            
              chaos engineering. So that's how we started, right? So we are going
            
            
            
              to give OpenEBS as an application to our
            
            
            
              users and they're going to take it, run it. How do they verify?
            
            
            
              The way they verify



              an application should be the same as deploying an application. How do
            
            
            
              you deploy an application? You put some resources



              into a YAML file and then you do a kubectl apply, and



              your application comes up. How do you kill it? It has
            
            
            
              to be the same way, right? So that's the primary difference that
            
            
            
              we found. And then we started building up these infrastructure
            
            
            
              pieces to do the same thing, right?
            
            
            
              So let's look at the Kubernetes resources that we have
            
            
            
              for developers.
            
            
            
              We have a lot of resources provided by Kubernetes as an
            
            
            
              infrastructure component, right? Pods, Deployments, PVCs, Services,



              StatefulSets. And you can write a lot of CRDs, right?
            
            
            
              So you use all those things and then define an application,
            
            
            
              you manage an application, and then this concept of CRDs



              has brought in operators. You can write your own operators to



              manage the lifecycle of the CRs



              themselves, right. That's good for development. If I'm
            
            
            
              practicing chaos engineering or chaos testing for
            
            
            
              Kubernetes, I need to have similar resources.
            
            
            
              So I need to be able to have some chaos resources.
            
            
            
              So what are those chaos resources that I would need?
            
            
            
              So you would need some chaos CRs in general,
            
            
            
              right? So just like an application is
            
            
            
              defined by a pod and a service and some other things, how
            
            
            
              do I define chaos engineering? You need to be able to visualize your
            
            
            
              entire practice through some CRs. And obviously



              you need a chaos operator to manage the CRs and to put things
            
            
            
              into perspective, upgrade each test,
            
            
            
              right? And it's all about observability.
            
            
            
              If you don't have observability, you can introduce chaos,
            
            
            
              and then you're not able to interpret what's going on.
            
            
            
              And then it's all about history as well, right? So metrics



              are very key in chaos engineering.
            
            
            
              So these are the chaos resources that we figured out in general.
            
            
            
              These three resources are needed. Right? So then
            
            
            
              looking at all those things in general, we can summarize the principles



              of cloud native chaos engineering as the following,
            
            
            
              right? So first of all, whatever you try to
            
            
            
              do, it has to be open source, because now we are trying to generalize.
            
            
            
              Right. I can just write some closed source code and say that this



              is it, but you're not going to accept it. Right.
            
            
            
              Kubernetes was adopted, and all the new



              stuff is being adopted so fast, because it's open
            
            
            
              source. So that's just a
            
            
            
              summary of it. And then you need to have acceptable
            
            
            
              APIs or CRs, just like Pod and Service, for
            
            
            
              chaos engineering. And then what's
            
            
            
              the actual logic? How do you kill something, right? So that,
            
            
            
              again, should not be predetermined. It should be pluggable. Right.
            
            
            
              Maybe my project does it in a particular way,
            
            
            
              but somebody else also has a particular way of killing. For example,
            
            
            
              here today, we talked about Pumba, right? How do



              you introduce network delays? It does it in a certain way.
            
            
            
              And then you should be able
            
            
            
              to use Pumba and plug it into this cloud native
            
            
            
              chaos engineering. And it should be usable and
            
            
            
              then it should be community driven. Right? So when will you
            
            
            
              do this chaos engineering as a practice? It's not just preaching
            
            
            
              about chaos engineering, that it is a culture. It's a good thing
            
            
            
              to do. What are the best practices? We all need to be able to build
            
            
            
              these chaos experiments together, right? And that's



              when you can call them principles. Right? So I've written a
            
            
            
              blog about whatever I just said and then published it on CNCF,
            
            
            
              same concepts. And then it's available on
            
            
            
              CNCF itself. Now what I want
            
            
            
              to do is with these principles, this is
            
            
            
              how we actually grew: we started practicing chaos



              engineering, and then we named it Litmus. And the Litmus
            
            
            
              project is exactly a manifestation of our effort to
            
            
            
              practice chaos engineering with Kubernetes. And we turned it into
            
            
            
              a project that is useful not only for our project:



              anybody who is practicing or



              developing on Kubernetes and wants to practice chaos engineering



              can use Litmus. Right? So this is just a brief
            
            
            
              introduction. It's fully Apache 2.0 licensed.
            
            
            
              Then there are some good contributions coming in.
            
            
            
              It recently went GA with 1.0 a couple of weeks ago.
            
            
            
              It really means that it's got all the tools
            
            
            
              or infrastructure pieces that are needed for you to
            
            
            
              start taking a look at it. And then it's
            
            
            
              open source, obviously. It's got some APIs or CRs.
            
            
            
              I will explain that in a bit. And then whatever
            
            
            
              the chaos logic that you're using, you can just wrap it up into a Docker



              container and put it into Litmus, right. You don't need to change anything.
            
            
            
              And then it is obviously community driven. I'll explain
            
            
            
              that in a bit. So let's see what CRs or CRDs



              Litmus has. Right? So the first thing is the chaos
            
            
            
              experiment. You want to do something. Take the
            
            
            
              minimal thing that you want to introduce as chaos and define that
            
            
            
              as a chaos experiment, right?
            
            
            
              Killing a particular application might



              include three or four different chaos experiments. But a



              chaos experiment is a minimalistic
            
            
            
              killing of something, right? And then
            
            
            
              you need to be able to drive this chaos experiment



              for an application. We call that the chaos engine. This is



              where you say: here are the chaos experiments, they belong to this application,



              here is your service account, who can kill, et cetera.



              So you define that. And then after you
            
            
            
              do, you need to be able to put your results into another
            
            
            
              CR called chaos result, so that Prometheus



              or some other metrics tool can come and pull it, and then somebody
            
            
            
              else is going to make sense or some tool is going to make sense of
            
            
            
              what exactly has happened, right? And then there will be multiple
            
            
            
              chaos experiments that you can keep adding into a chaos engine,
            
            
            
              right? So those are the CRs that
            
            
            
              you have. And then for pluggable chaos,
            
            
            
              for example, it already has Pumba and PowerfulSeal as



              built-in libraries. You can actually pull in your own library.
            
            
            
              So how do you pull your library into this infrastructure?
            
            
            
              For example, here I'm explaining PowerfulSeal.



              The way we did it, the way the community did it: all you need to do



              is put whatever killing you do into a



              Docker image, right? So if you just do a docker
            
            
            
              run of that image, and if it goes and kills something or introduces
            
            
            
              chaos, then it can be used with this infrastructure,
            
            
            
              right? So once you have the Docker image, you just create
            
            
            
              a new CR, new experiment.
            
            
            
              A new experiment really points to a
            
            
            
              new custom resource. And then inside that custom resource definition,
            
            
            
              you just say that here is your chaos library, right? So it's
            
            
            
              as simple as that, right? So the
            
            
            
              reason why I'm emphasizing this is: think of



              Litmus as a way to use your



              chaos experiments in a more acceptable



              GitOps way. You can just orchestrate



              all your experiments using Litmus. The way they are written,
            
            
            
              you just need to convert them into a Docker image, right?
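A minimal sketch of the "wrap it in a Docker image and point a new experiment CR at it" idea just described, written in Python so it can be checked locally. The field layout (`spec.definition.image`, `command`, `args`) is an assumption based on the `litmuschaos.io/v1alpha1` ChaosExperiment schema and may differ between Litmus versions; the experiment name, image, and script path are hypothetical placeholders.

```python
import json


def chaos_experiment(name, image, command, args):
    """Build a ChaosExperiment-style manifest that points at a chaos container image.

    The field names follow the assumed v1alpha1 layout; verify them against
    the CRD installed in your cluster before relying on this.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosExperiment",
        "metadata": {"name": name},
        "spec": {
            "definition": {
                "image": image,      # the image that carries your chaos logic
                "command": command,  # entrypoint inside that image
                "args": args,        # arguments passed to the entrypoint
            }
        },
    }


# Hypothetical values, used only for illustration.
manifest = chaos_experiment(
    name="my-pod-kill",
    image="example.org/chaos/my-pod-kill:latest",
    command=["/bin/sh"],
    args=["-c", "./run-chaos.sh"],
)

# kubectl accepts JSON as well as YAML, so this output could be saved to a
# file, committed to Git, and applied like any other manifest.
print(json.dumps(manifest, indent=2))
```

The point of the design is that the chaos logic itself stays untouched: only the image reference changes from one library to another.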
            
            
            
              So the Litmus chaos runner automatically picks up this Docker image,



              runs it, puts the outcome into a chaos result CR, and then you're



              good to go. And then the community is developing more tools
            
            
            
              to observe chaos. All that will work very naturally.
            
            
            
              Right? So it's community driven. What that really means is
            
            
            
              that we have something like Operator Hub, a chaos hub.
            
            
            
              It's got a lot of experiments already. And then you,
            
            
            
              as a developer, when you create a chaos experiment,
            
            
            
              if you want this chaos experiment to be used in production or
            
            
            
              in pre-production by your users, you're going to push it into your chaos
            
            
            
              hub. And your SREs, or whoever is practicing chaos
            
            
            
              engineering, are going to pull this experiment and then use it,
            
            
            
              right? So imagine if Andrew had published that experiment.
            
            
            
              You just need to create a YAML file and
            
            
            
              then put in some key-value pairs, and then it runs.
            
            
            
              So this is what the cloud native orchestrator architecture
            
            
            
              looks like. So you will have a lot of CRs
            
            
            
              defined for your users, chaos users.
            
            
            
              And you have a lot of experiments that are being used by
            
            
            
              these CRs, but you can develop more, right?
            
            
            
              So as more and more people develop this chaos
            
            
            
              for various applications, you will slowly see that the entire
            
            
            
              Kubernetes chaos engineering can be practiced just
            
            
            
              by installing litmus, right? So how
            
            
            
              do you get started? Just imagine that you have on the hub a
            
            
            
              lot of chaos experiments for various different applications
            
            
            
              and you are running your app in a container and
            
            
            
              then all you need to do is pull the Litmus
            
            
            
              Helm chart, or just install it. It runs in a container.
            
            
            
              The moment you install it, you get the chaos libraries and the operator on
            
            
            
              your Kubernetes cluster. And then you need to pull.
            
            
            
              The assumption here is that you have so many charts and you
            
            
            
              may not need all of them. So you pull whatever you need onto
            
            
            
              your Kubernetes cluster and then you just inject chaos
            
            
            
              by creating a CR, right, a ChaosEngine, which points
            
            
            
              to various different experiments and then
            
            
            
              it opens up, the operator goes and runs
            
            
            
              chaos against a given application and it creates a chaos result,
            
            
            
              right? It's as simple as that.
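The "create a CR that points to experiments" step can be sketched as follows. This is an illustration rather than the project's own tooling: the `appinfo`, `chaosServiceAccount`, and `experiments` fields follow the Litmus `v1alpha1` ChaosEngine schema as best recalled, and the engine name, label, service account, and tunable values are assumptions made up for the example.

```python
import json


def chaos_engine(name, namespace, app_label, app_kind, service_account, experiments):
    """Return a ChaosEngine manifest (as a dict).

    `experiments` maps an experiment name to a dict of tunables, which are
    passed to the experiment as environment variables (key-value pairs).
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Which application the chaos is run against.
            "appinfo": {"appns": namespace, "applabel": app_label, "appkind": app_kind},
            "chaosServiceAccount": service_account,
            "experiments": [
                {
                    "name": exp_name,
                    "spec": {"components": {"env": [
                        {"name": key, "value": str(value)}
                        for key, value in tunables.items()
                    ]}},
                }
                for exp_name, tunables in experiments.items()
            ],
        },
    }


# Hypothetical target application and tunables, for illustration only.
engine = chaos_engine(
    name="mysql-chaos",
    namespace="default",
    app_label="app=mysql",
    app_kind="deployment",
    service_account="chaos-sa",
    experiments={"pod-delete": {"TOTAL_CHAOS_DURATION": 30, "CHAOS_INTERVAL": 10}},
)
print(json.dumps(engine, indent=2))
```

The printed JSON could be applied with `kubectl apply -f`, after which the operator is expected to run the listed experiments and write a chaos result.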
            
            
            
              So let's see an example of how this changes
            
            
            
              a cloud native developer's everyday life,
            
            
            
              right? How does a developer
            
            
            
              create a resource? Right? You create a pod and then you
            
            
            
              create more resources for an application, for example a
            
            
            
              PV or a service, et cetera. And
            
            
            
              usually that's where it ends. Now you want to test something,
            
            
            
              right? So how do you do it? You inject chaos
            
            
            
              by simply creating one more CR, just like the ones you've
            
            
            
              been using Kubernetes for, right?
            
            
            
              So you can actually create a ChaosEngine and
            
            
            
              tell it what needs to be killed and where, and then it's all done,
            
            
            
              you get your results. So it's extending the concept
            
            
            
              of your experience with Kubernetes to
            
            
            
              do chaos engineering as well. So that's cloud native chaos engineering.
            
            
            
              Just to summarize. Right, so on the Chaos hub we
            
            
            
              generally have two types of experiments.
            
            
            
              One is generic, that is, chaos experiments for
            
            
            
              Kubernetes resources in general. And then there are application
            
            
            
              specific ones. So this is where it gets interesting.
            
            
            
              So, like we have seen, and
            
            
            
              sorry to take the same example again and again,
            
            
            
              Andrew showed how to do a container
            
            
            
              kill of a MySQL server pod.
            
            
            
              Right. So then you have to go and verify
            
            
            
              what exactly happened, right. We verified that the pods have come back.
            
            
            
              That's all we saw, right? In KubeInvaders:
            
            
            
              Hey, more pods are coming up. But how do we verify whether
            
            
            
              the application, the MySQL server, is
            
            
            
              really working well or not, automatically? So that's where
            
            
            
              you can write more logic into your application-specific chaos experiments
            
            
            
              and then use them in production. So some
            
            
            
              of the experiments are already available.
            
            
            
              You can just do a pod delete, container kill.
            
            
            
              I mean, a pod can have multiple containers, and you can kill just one
            
            
            
              of them. And then you can do a CPU hog on
            
            
            
              a pod. Network latency, network loss, network packet corruption:
            
            
            
              just introduce some corruption into your packets that
            
            
            
              are going into your pod and see what happens. Right? So that could be
            
            
            
              the granularity level that you want. Then there is some infrastructure
            
            
            
              chaos. Of course these are specific to the cloud providers.
            
            
            
              For example, how you take down a node on AWS
            
            
            
              is different from how you take down a node on Google.
            
            
            
              Right? So there are disk losses, which is
            
            
            
              a very common thing, right? So I suddenly
            
            
            
              lose a disk, what happens? And then disk fill is
            
            
            
              one more common thing. So these are the initial set of things that are already
            
            
            
              there, and you can start practicing. And then there are the application-specific things
            
            
            
              that are available here. An
            
            
            
              application really constitutes a pod
            
            
            
              or multiple pods, and then it's got some service, and then it's got some
            
            
            
              data. Right? So what do you mean by attacking
            
            
            
              an application? Right? So you need to define the logic
            
            
            
              of what it is that you're going to kill.
            
            
            
              Am I going to kill the MySQL server, or am
            
            
            
              I going to kill a part of it, et cetera, et cetera. That's the definition.
            
            
            
              And then, before killing, you verify whether everything is good or not,
            
            
            
              right? So that's the hypothesis. And then you use the generic
            
            
            
              experiments to actually do the chaos and then
            
            
            
              verify your post checks, right? So all this can be put
            
            
            
              into an experiment, and you don't need to run
            
            
            
              it again and again every time. Just put that into a YAML file and your application
            
            
            
              chaos happens, right? And the result CR gets
            
            
            
              created. So for example, I'll quickly take an example of OpenEBS,
            
            
            
              how it's done. OpenEBS has got multiple components. Now I'm verifying that
            
            
            
              OpenEBS, as an application, works well when I
            
            
            
              kill something, right? I can go and say, okay, OpenEBS
            
            
            
              is a cloud native app. That means it's microservices and it's got
            
            
            
              multiple pods. I can kill a
            
            
            
              container that belongs to OpenEBS and see what happens. That's one
            
            
            
              way of saying it. The other way of saying it is, I can kill a controller
            
            
            
              target of OpenEBS and see what happens,
            
            
            
              right? So you will end up having multiple different
            
            
            
              chaos experiments that are specific to that application, and then
            
            
            
              you can go start using them. For example, I can kill an iSCSI target of
            
            
            
              OpenEBS and then see what happens. You don't need to really know what
            
            
            
              should happen. It's all defined by the OpenEBS developer. As an
            
            
            
              OpenEBS user, you will be able to say,
            
            
            
              okay, my OpenEBS is functioning properly because I just killed
            
            
            
              a target and then it is behaving as expected.
            
            
            
              Or you can kill a replica and then see. So you don't
            
            
            
              need to learn the nitty-gritty or complexities of the application, or
            
            
            
              how the application should behave when something happens in production.
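The shape of an application-specific experiment described above (define the target, verify the hypothesis before killing, reuse a generic kill, then run the post checks) can be sketched in a few lines of Python. This is a toy model with an in-memory fake database rather than Litmus code. `FakeDatabase` and `run_experiment` are hypothetical names, and the "Pass"/"Fail" strings simply mirror the kind of verdict a chaos result would carry.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    name: str
    verdict: str      # "Pass" or "Fail"
    decided_at: str   # which step decided the verdict


def run_experiment(name: str,
                   pre_check: Callable[[], bool],
                   inject: Callable[[], None],
                   post_check: Callable[[], bool]) -> ExperimentResult:
    """Run pre-check, chaos injection, and post-check, in that order."""
    if not pre_check():
        # The hypothesis does not hold even before chaos, so nothing is injected.
        return ExperimentResult(name, "Fail", "pre-check")
    inject()
    if not post_check():
        return ExperimentResult(name, "Fail", "post-check")
    return ExperimentResult(name, "Pass", "post-check")


class FakeDatabase:
    """Stand-in for a replicated application; not a real client."""

    def __init__(self, replicas: int) -> None:
        self.replicas = replicas

    def can_serve_queries(self) -> bool:
        return self.replicas >= 1

    def kill_one_replica(self) -> None:
        self.replicas = max(0, self.replicas - 1)


db = FakeDatabase(replicas=3)
result = run_experiment("replica-kill", db.can_serve_queries,
                        db.kill_one_replica, db.can_serve_queries)
print(result)
```

The point of the structure is that the person running the experiment only reads the verdict; the knowledge of what "healthy" means lives in the checks written by the application's developers.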
            
            
            
              So all that is coded up by your developers and pushed up onto
            
            
            
              the chaos hub, and then you can just use it. Right? So that's the
            
            
            
              summary of how a cloud native chaos engineering
            
            
            
              framework can work, and Litmus is just one of them. And you can
            
            
            
              contribute. It's on the Kubernetes Slack itself. And then
            
            
            
              if you find some issues, you do that. But primarily take a look at,
            
            
            
              if you are practicing chaos engineering, take a look at chaos
            
            
            
              hub and then there are some things that are already there and it's
            
            
            
              just the beginning; we opened it up. And then
            
            
            
              hopefully more contributions will come from the community, from the
            
            
            
              CNCF itself.