Conf42 DevOps 2024 - Online

Shift Left Cloud Chaos Testing on Your Local Machine!


Abstract

Chaos testing with the aid of cloud emulators helps increase dev velocity while maintaining agility and cutting costs. This talk unveils shifting left in chaos testing to enable high-velocity, high-quality, nimble test-driven development, along with practical lessons to avoid production mishaps.

Summary

  • Today I will be presenting my talk on shift-left cloud chaos testing on your local machine. The focus of the talk will be around local cloud chaos testing and how you can leverage open source cloud emulators to make this happen.
  • The idea behind chaos engineering is all about how you can experiment on a system to uncover behavioral issues and make your application resilient to such conditions. Chaos testing is essentially a DevOps practice at the end. A strong DevOps pipeline should ensure that your application remains robust.
  • Using cloud emulators like LocalStack for testing RDS failovers. Everything runs inside a single isolated Docker container, allowing you to resume your database activities in the shortest amount of time. Cool demos.
  • You can use LocalStack with other services such as FIS for actually injecting faults into your application setup. Let us actually run this whole setup on our local machine and showcase how you can inject faults.
  • Before implementing any sort of chaos testing suite, always try to establish a system's steady state. Always try to design the experiments and put them in categories of knowns and unknowns. Use cloud emulators to cut down on the cost and the time and to build a better developer and testing experience.
  • So that was all for today. I do believe that you can look into shifting left your chaos testing suite and enabling your developers to leverage chaos as much as possible. Thanks to the Conf42 team for having me today, and I'll see you next time.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. I hope everyone is having a fun time at the conference. My name is Harsh and I'm currently working as an engineer at LocalStack. Today I will be presenting my talk on shift-left cloud chaos testing on your local machine. The focus of the talk will be around local cloud chaos testing and how you can leverage open source cloud emulators to make this happen. So without further ado, let's jump directly into the talk. Let's check out the agenda for today. We will start with understanding what we mean by shift-left chaos testing and how various open source tools can help you get started. With this, we will uncover how performing chaos testing on the public cloud can be risky, and how cloud emulators can help you in this regard. We will check out a couple of cool and interesting demonstrations that will explain how locally running emulators can help you build a strong test suite for maintaining conformity and latency, both on your local machine and in your continuous integration pipelines. We will end this session with a final concluding note covering some strategies around building a test suite for a local chaos testing setup and how developers can be empowered to build failover mechanisms right from the start. So let's jump right into it. What is shift-left chaos testing, and what do we exactly mean by that? Before that, let's do a quick recap of what chaos engineering exactly is and where we stand right now. The idea behind chaos engineering is all about how you can experiment on a system to uncover behavioral issues and make your whole application resilient to such conditions right from the start, before you land in production. This concept emerged around the early 2010s and was heralded, in fact, by outages that made organizations suffer significant costs. These incidents have always highlighted a long-standing issue.
Organizations of all sizes have experienced such disruptions, and some of the time these disruptions are beyond their control. Now, these outages not only affect maintenance costs and company revenue, but also have broader consequences for development and for the teams that are managing the whole user experience. This is where chaos testing comes into the picture. With chaos testing, you can establish a system's baseline or optimal operational state. Then you can identify some of the potential vulnerabilities, design scenarios, and further evaluate their potential impact. Now, for a lot of developers, injecting these faults or issues seems almost counterintuitive. Just imagine this: you already have unit tests, you already have some sort of integration test suite running with every single commit being merged. So why would I need another layer of testing just to ensure that my application works fine? This part is particularly important because of third-party services, particularly the public cloud, which has made chaos testing useful. You can cover some of the faulty configurations, behaviors, outages, and other interruptions that might not necessarily be due to your code, but just something going wrong with your provider. Standard testing can fix some of the usual issues that might cause a negative end-user reaction, but chaos testing helps you introduce issues into your system and build solutions around how you react to them. One of the core tenets I would like to highlight is that developers should value the chaos, and once they value that, they need to build solutions around it with a predefined plan that allows you to introduce errors and also helps you build a swift response to them.
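To make that "predefined plan with a swift response" concrete, here is a minimal, hypothetical sketch of one such response pattern: a retry with exponential backoff around a dependency that a chaos experiment makes fail. The `flaky_dependency` function and its failure behavior are invented for the example; real code would wrap an actual cloud SDK call.

```python
import time

def with_retries(fn, attempts=4, base_delay=0.01):
    """Call fn, retrying with exponential backoff on ConnectionError."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))

# A stand-in dependency that fails twice before recovering, the kind of
# behavior a chaos experiment would inject at the infrastructure level.
calls = {"count": 0}
def flaky_dependency():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("injected fault")
    return "ok"

print(with_retries(flaky_dependency))  # survives two injected faults
```

The point of chaos testing is exactly to verify that wrappers like this exist and actually fire before the failure ever happens in production.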
This allows you to handle any sort of unexpected behavior and also maintains the balance between thorough testing and stability for your application. And this, of course, would not be possible without a steady DevOps culture. Chaos testing is essentially a DevOps practice at the end, where you can define various testing scenarios, handle various kinds of executions, track the results and varying outcomes, and ensure that your end customers do not suffer any impact. Resilience is a pretty important aspect of any deployment process, and a strong DevOps pipeline should ensure that your application remains robust and reliable in a cloud environment where factors like scalability, availability, and fault tolerance are all critical aspects, some of which might not even be in your control. Now, there are various chaos testing tools, both open source and commercial in nature, and teams have been using them to build reliable solutions for quite some time. There was an interesting keynote session at the Conf42 Chaos Engineering conference by Pablo, I guess last year, and he talked about shifting left chaos testing with Grafana k6, so you can definitely check that out. We have some of the key innovators over here, like Chaos Monkey from Netflix, which started as a test tool to evaluate the resiliency of AWS resources and simulate failures for services running within an auto scaling group. LitmusChaos is another interesting project by CNCF that uses a chaos management tool called Chaos Center, where you can basically create, manage, and monitor your chaos using Kubernetes custom resources in a pretty much cloud-native manner. Other projects include ChaosBlade, Chaos Mesh, Chaos Toolkit, and more, which can provide you various features to enable your chaos testing workflows.
But one of the things that I would like to highlight, and I guess it's one of the persistent issues with chaos testing cloud components, is that they are often proprietary in nature, and the only way that you can test them is by running directly in production. Now you can imagine this: testing things such as missing error handlers, or missing health checks, or missing fallbacks cannot be done without actually deploying to a staging or a production environment. Though you will have the comfort of running everything directly on the cloud, you will miss out on two significant aspects. The first one is that since you're running on the cloud, it will cost you significantly, especially deploying all your resources, including databases, message queues, Kafka clusters, Kubernetes clusters, and more. Second, some of these tests can take you hours, if not days or weeks, to run reliably, and you will miss out on the agility aspect of the DevOps culture. So how do you shift left your chaos testing, early in the development process, as close to the developer machine as possible? A few years back this would have been implausible to think of, but now we have a certain solution, and the answer is cloud emulators. Cloud emulators are powerful developer tools because they can counter the friction between the cloud-native paradigm and the local development paradigm. Now, there are two notable aspects of cloud emulators. The first one is that they almost remove the need for provisioning a completely dedicated development environment, like a sandbox, which mirrors your production setup. And the second is that every change that you make to your application or to your code does not need to be packaged, uploaded to the cloud, and deployed so that you can run your tests against it. So with the help of these emulators, you now have a replica of a cloud service that's running fully locally on your machine or in an automated test environment.
And this makes chaos testing so easy and helpful. So how do you go about building a chaos test suite for your cloud application? The answer to this almost lies in the testability of your cloud application deployments. As you can see at the top of this pyramid, we have the classic strategy of using mock libraries. With mocking, you can scope the tested component in unit tests, and you can implement the same interface as the real class to allow predefined behavior to be invoked. Second, we have service emulation, which means that we now have local stripped-down versions of a managed service, where each individual service requires its own implementation. As a follow-up to that, we have cloud emulation, which enables a superset of cloud resources that can interact with each other and run on your local machine. And finally, we have a real staging environment which uses real cloud resources. An example for this that I would like to showcase is, of course, Moto. It's a pretty popular mocking library for AWS, and with Moto you can basically add decorators to your test cases and redirect all of the AWS API calls to a mock resource and not to the real AWS services. Now, mock libraries are excellent tools for local cloud development, but they are often limited. The very first and foremost reason for this is that mocking these services does not correctly replicate the behavior of remote cloud services that are often interacting with each other and with the application. You can definitely build a chaos testing suite around just plain mocking, but it might often lead to unsatisfactory results, because the behavior of your locally executed tests always diverges from the behavior that you see in production. Now, with the limitations of mocking, we have progressed further, and we have local development servers that provide an emulation layer.
The difference between mocking and emulation often lies in the degree to which the behavior of the service is reverse engineered. As an example here, if we pick up DynamoDB Local to create a new table, a mock will return us a parsable API response, but it does not necessarily carry any state, and it will just give you some random attributes that do not reflect the request context. In comparison, the emulator will correctly store and retrieve the state for you. However, there are always some issues with service emulators. The first one is that there is no sort of integration between different services, which means that you cannot hook your DynamoDB up with AppSync or with Lambdas or whatsoever. You cannot use them with a fully fledged IaC script, there are often API compatibility issues, and they are also not up to date with the latest API enhancements that might be happening in real time. So as the best example of a cloud emulator, we have LocalStack. LocalStack is an open source service emulator that runs in a single container on your laptop or in your continuous integration environment. For the end user, it means that it allows them to run these AWS services pretty much on their local machine. The entire purpose of a project like LocalStack is to basically drive local cloud development, and also to make sure that engineers can collaborate and do away with inefficient development and testing loops. But now the question is: how do you enable chaos testing with LocalStack? As a first step, you can basically use LocalStack as a drop-in replacement for AWS on your developer machine and in your CI environment, and you can use it with services such as Amazon Fault Injection Simulator, which is FIS, and you can perform resilience testing for your cloud apps running locally. Now the benefits of this are paramount.
This means that you can not just deploy your application locally, but also run experiments that inject errors at the infrastructure level, which was not possible to replicate beforehand; you could do this only once you deployed to production, which makes it even harder. But with a solution like LocalStack, you can basically shift left your chaos experiments and tests right onto the developer machine. And you can use this to tackle some of the low-hanging fruit, to redesign your cloud architecture so that you can trigger some failover mechanisms and just observe locally how your system starts responding to that. Now that we have done, I guess, enough talking, let's move on and explore some cool demos. I'm going to showcase two examples. The first one is to demonstrate the feasibility of using cloud emulators like LocalStack for testing RDS failovers. And the second one is to actually show you how you can inject chaos experiments into your locally running applications. So in this sample we will locally run and test a local RDS database failover. During this failover process, the standby RDS instance is basically promoted as the new primary instance, which allows you to resume your database activities in the shortest amount of time. Now, if you are trying to set up this whole thing on the real AWS, it takes more than an hour, which seems okay for a production setup, but for testing such a workflow it can often be time consuming. The best part about LocalStack is that you don't need to do a lot of manual configuration. Everything runs inside a single isolated Docker container, and it exposes a set of external network ports. So this means that if you're using any sort of SDK, you can just hook it up with LocalStack by specifying which endpoint URL it is running on, so that you just don't send any of the API requests to the remote cloud provider.
So let's check this out in action and see what we have got over here. In this setup we have a Python script that uses boto3, the AWS SDK for Python. Over here we are setting up some cluster IDs and some regions. Just notice that we have defined multiple clusters over here, showcasing pretty much a global setup, and we are creating a global cluster with primary and secondary clusters. This basically simulates a real-world scenario where multiple database instances are managed across different regions. We have also got run_global_cluster_failover, a function that triggers a failover and switches the primary database with a secondary one; this is pretty much necessary for handling unplanned outages or just for maintenance. So in this case we are going to run this script on our local machine just to make sure that this setup works fine. Let me just go ahead and start LocalStack on my machine. LocalStack is shipped as a binary or as just a pip package. So if you're a Python developer, you can just go ahead and say pip install localstack, and that's going to set up the whole thing on your machine. Then you can just say localstack start to start the LocalStack Docker container on your local machine. This basically means that now you can send all of your AWS API requests to this running Docker container, and over here we have specified this as localhost:4566. Once we have done that, I can just go to my other terminal and run this script right over here. My LocalStack container is ready, so as soon as we hit enter you can see that LocalStack starts creating the global cluster, the primary database cluster, and the rest of the things.
You can immediately go back to the logs and you can see that LocalStack is now installing PostgreSQL, which is the database engine behind RDS here, and making sure that everything is up to date and ready for this particular experiment. So now it's starting a global database cluster failover. And at the end of the code you will notice that we have set up a lot of assertions to make sure that the failover part is successful. As you can see, the test is done and all assertions have succeeded, and now we have a pretty good idea about how a cloud emulator like LocalStack can help you in this regard. In a real AWS environment, as I mentioned before, these operations might have taken over an hour, but using LocalStack you can perform these kinds of assertions in less than a minute or two. And the best part about this is that we have not created any real cloud resources. Everything is happening on your local machine, and as soon as I shut down my LocalStack container, all of the resources that I created before are gone. They are ephemeral in nature, so everything just vanishes with a poof. Cool. So that was a nice and steady experiment. Let's go ahead and see what else we have got. And yes, as I mentioned before, you can obviously use LocalStack with other services such as FIS for actually injecting faults into your application setup. Now, FIS is a managed tool by AWS, and again, it comes with certain limitations on the AWS side. But with LocalStack you can obviously go beyond that, and you can use a user interface or a CLI to inject chaos into your application setup. Now, FIS has a lot of use cases. You can use it to strain an application, you can define what kind of faults you want to introduce, you can specify the resource to be targeted, and you can also specify single-time events, or induce some API errors, and more.
So with FIS you can do a lot of these varying activities, and you can see how your CPU spike is happening or how memory usage is increasing, and how your system responds to this whole thing, with consistent monitoring and more. You can use FIS with LocalStack either through the traditional CLI experience that AWS itself provides, just pointed at LocalStack instead, or through a dashboard like this, to control all of your chaos engineering experiments happening right on your local machine. So in this case, I'm going to use this to run a simple experiment on one of the sample applications that I have. The application that I'm going to showcase to you is a serverless image resizer. It has a bunch of Lambda functions that allow you to upload an image, resize that image, and showcase it on a local web client. We have a simple website that runs inside an S3 bucket, we have some SSM parameters, we have some SNS topics, and more of these things. So let us actually run this whole setup on our local machine and showcase to you how you can inject faults on your local developer machine. Let me just quickly switch back to my VS Code, and we have the entire application cloned over here. You can grab this application using one of the GitHub URLs that I can share at the end, but let us go ahead and check out the application. Before that, let me start LocalStack. In this case I am starting my LocalStack instance, and I'm specifying an extra configuration flag to allow the CORS origins, which I guess is pretty much a pain for almost every developer out there. It will start the LocalStack container, and as soon as it is started I can run this one script that will create all of the local AWS resources for me. Basically, it will set up the whole application that I have, and I don't need to run a lot of commands to get through the whole application deployment.
So let me go ahead and run this deploy script, and this will set up the application in just a few seconds, which would otherwise have taken a few minutes if you were trying to do this whole thing on the real AWS. You can go back to the LocalStack logs and check out how LocalStack is creating the SSM parameters, the Lambda functions, the SNS topics, and more of these things. This should be up and ready in just a few seconds. I guess it's taking some time over here because we are installing Pillow, since we need to set up the whole resize Lambda function over there. But as you can see, once the whole application is deployed, you get the web client that we have. Over the web client you can upload an image, and you can just click one button and this will start the whole resize operation. So the web assets are being specified over here, and now we have this web app running on this particular URL. Let me just switch right back to it; if I go back to my other tab I can hit refresh, and here we are. We have the serverless image resizer application running pretty much on our local machine, and it is just consuming all of the emulated AWS resources that we created before. I'm going to click this button, list the few function URLs that are necessary over here, and hit apply. Once that is done I can go ahead, click on this, and specify one of the images that I have. In this case I'm just going to specify a pretty awesome picture of Hagia Sophia. Once you click upload, you can go back to VS Code right over here and you can actually see how LocalStack is executing the Lambdas on the back end. You can see the image has been resized and now it is pretty much ready. So let's go back to our VS Code. Yes, here we are.
So yes, the image resize is pretty much successful. You can just go and click refresh, and this will automatically list the resized image right over here. Here we have the original image, which was this many bytes, and finally we have the resized image, which is this many bytes. So what if we start injecting some faults into our application setup? It's pretty easy with this whole dashboard that we have right here. Let me just hit refresh to make sure that everything is ready, and as you can see, we have a few experiments that are already specified. This dashboard gives you a set of predefined templates that you can use and run. One of the first things that I want to showcase is how you can make a service unavailable on your local machine. In this case I'm just going to pick Lambda, specify the region that I want to run this experiment on, and click start experiment. As soon as the experiment is started, we can go back over here, I can choose another file, one of the other images that I took during my recent trip to Turkey, and hit upload. Something will just go wrong: the image is not being listed, and even though the original image is right over here, there is no mention of the resized image. If you want to see why this has occurred, we can go back to the VS Code that we have right here. Over here you can see that this whole Lambda invoke operation has failed as per the fault injection simulator configuration. So now we can see that because of the configuration that we specified before, we have injected a fault into our application infrastructure, and things are not working out, which I think is very well expected. Let's go back, stop this experiment, and maybe start another one.
This time I want to make an AWS region go unavailable, just to simulate an entire regional outage. In this case I can just specify us-east-1 and start the experiment. If I go back and hit reload (remember, this entire website is being served from an S3 bucket), and then go back to VS Code, or basically to the LocalStack container logs, you can see that there are exceptions happening, and this is because of the fault injection simulator configuration. So the whole application is set up, and we can immediately inject faults into the setup and see how our application starts responding to that. This is one of the demos that I wanted to showcase. And yes, if you go back to the website, you can see that everything is happening because of this FIS configuration. You can stop this experiment, hit refresh, and the whole application is back and ready. But this was just an initial preview. You can obviously introduce latency to every API call that you make on your local machine, you can inject some issues with DynamoDB or Kinesis, and basically you can use this experience as a way to test your local FIS configurations and also to validate whether your application infrastructure needs certain changes to accommodate these kinds of interruptions. Now that we have come this far, let's take a look at how we can move further ahead with such a setup and how you can use open source cloud emulators like LocalStack in such a scenario. The very first step that I always mention is that before implementing any sort of chaos testing suite, always try to establish a system's steady state. This can be measured based upon overall throughput, error rates, or latency, and it should always represent an acceptable, expected behavior of the system.
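As a small illustration of what "measuring a steady state" can mean in practice, here is a stdlib-only sketch that derives an error rate and a p95 latency from a batch of request samples; the sample data and the thresholds are invented for the example:

```python
import statistics

# (latency in ms, succeeded?) samples, e.g. scraped from access logs.
samples = [(120, True), (95, True), (180, True), (240, True),
           (110, True), (400, False), (130, True), (90, True),
           (105, True), (150, True)]

latencies = sorted(ms for ms, _ in samples)
error_rate = sum(1 for _, ok in samples if not ok) / len(samples)
p95 = statistics.quantiles(latencies, n=20)[-1]  # 95th percentile estimate

print(f"error rate: {error_rate:.1%}, p95 latency: {p95:.0f} ms")

# A chaos experiment should not push the system out of this envelope.
STEADY_STATE = error_rate <= 0.15 and p95 <= 500
print("within steady state:", STEADY_STATE)
```

In a real suite these numbers would come from monitoring, and the envelope check would run both before and during each experiment to detect deviation from the steady state.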
Second, always look into creating a hypothesis that aligns with the objective of your chaos testing setup, because the experiments should match real-world events and should not cause deviation from the system's steady state. Finally, always try to design the experiments and put them in categories of knowns and unknowns. Just look at this table that I have mentioned over here and try to initially target the knowns, because they are pretty much low-hanging fruit and should be easy to fix. But the ideal goal should be to discover and analyze whatever chaos your system might end up encountering, and thus you might just end up venturing into some of the unknowns over here. Once you have the setup, you can always conduct the experiments. You can discover some potential failure scenarios within your application infrastructure, and always try to assert on certain things: Is your application failing for a small percentage of production traffic? What would happen if a certain availability zone or region suddenly went down? What will happen under critical system load, like an extremely great amount of traffic, and how exactly does it affect system performance? Now, solutions like LocalStack, like k6, or some other cloud-native chaos engineering tools can certainly help you in that regard and assert how well your infrastructure will handle such an interruption or such chaos. And finally, yes, you can obviously test your cloud apps pretty much locally. You can always do away with staging if you want, but testing your cloud apps locally is definitely possible, and you can use cloud emulators not just to cut down on cost and time, but also to build an overall better developer and testing experience.
Cloud emulators in the context of chaos engineering can always help you with some of the low-impact risks, like a missing failover, a missing handler, or things that might just fall under your radar during your usual unit testing or integration testing setup. And you can always learn from this experience. You can fix and iterate quickly on your developer machine, and you don't have to wait and see your application blowing up in production to actually do that. The second thing is that you can always use these lessons to set up some sort of playbook, so that you can validate these incidents, work on them, and make sure that you have resilient solutions. And the best part is that you can actually exercise these failovers in your CI pipelines. This means that you can run these tests repeatedly on your CI pipeline, and you can make sure that none of the code or the infrastructure that you're adding to your application will negatively impact it in the long run. And finally, you will achieve the ability to handle some of the unplanned failovers or handlers that you might just end up missing, and you can certainly develop more resilient, better tested solutions at the end of the day. So this brings me to the final conclusion of this session. That was all for today. Thank you folks, and I hope you enjoyed the talk. I do believe that you can look into shifting left your chaos testing suite and enabling your developers to leverage chaos as much as possible. Thanks to the Conf42 team for having me today, and I'll see you next time.
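The CI angle above can be as simple as running LocalStack as a service container next to the test job. A hypothetical GitHub Actions sketch, assuming a `tests/chaos` directory (made up for the example) and a recent AWS SDK that honors the `AWS_ENDPOINT_URL` environment variable:

```yaml
name: chaos-tests
on: [push]

jobs:
  chaos:
    runs-on: ubuntu-latest
    services:
      localstack:
        image: localstack/localstack
        ports:
          - 4566:4566
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install boto3 pytest
      # Run the failover/chaos suite against the emulator, not real AWS.
      - run: pytest tests/chaos --maxfail=1
        env:
          AWS_ENDPOINT_URL: http://localhost:4566
          AWS_ACCESS_KEY_ID: test
          AWS_SECRET_ACCESS_KEY: test
          AWS_DEFAULT_REGION: us-east-1
```

Because the emulator is ephemeral, every pipeline run starts from a clean slate, which is exactly what repeatable chaos experiments need.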
...

Harsh Mishra

Software Engineer @ LocalStack



