Transcript
Hi everybody, my name is Chris Ashton and I'm here with Ravi and Rohith from Microsoft, and we're going to talk to you today about using chaos engineering practices to shift left and prevent outages.
I'm going to talk to you a little bit about chaos engineering in
general and give an introduction to Azure Chaos Studio. And then
Ravi and Rohith will take over and talk more in depth about practices they've
developed and are using in their teams.
Software as a service has revolutionized cloud application development.
It's now easier to get applications up
and running in a shorter period of time, and you have a variety of
microservices and other services available in a
toolbox to pull together and compose and build these applications.
While this has great benefits, with it comes
a new set of challenges.
Service outages and disruptions can happen at any time, and they impact your overall application availability and can therefore upset your customers. You have a lot less control over the environment in which your things are running: you could be subject to a power outage at a data center, a flood in a control room, the network going down, bugs being introduced by developers into different components, configuration changes coming at any time. Lots of variables make this a very unfriendly environment.
So now more than ever, applications must be
resilient. They must be designed to handle these failures.
Microsoft provides the Well-Architected Framework in the Azure Architecture Center to give application developers guidance on how to build resilient applications. But basically, resilience is a shared responsibility. Microsoft needs to ensure that the platform is resilient and build a lot of capabilities into the services that make up all of the components that you can use when you're composing your applications and the infrastructure that they run on. But then it's also the responsibility of the application developer to validate their resilience and make sure that they can handle the conditions that they've designed for and the unexpected things that come up. Basically, trust but verify.
Build your application using these guidelines and
then make sure that it operates in the way that you expect and understand.
These quality practices can be done on an
ad hoc basis. Some folks perform game days
and drill events and that's great. But more mature
organizations are building these quality practices right into their service development
lifecycle. Add chaos engineering to your CI/CD pipeline, test in a pre-production environment.
Shield your customers from these outages and issues and
shift left and do these things earlier so that you can catch the defects early
and prevent them from going to production.
This is where chaos engineering comes in. Chaos engineering is a practice that has evolved over the last decade: subjecting your application and services to the real-world outages and disruptions that they will face in production.
You've designed for failure, you've architected for resilience.
Make sure that you can stand
up to the disruptions that will be encountered.
With chaos engineering, you're not just validating that you've chosen the right architecture and that your code doesn't have any bugs. You're also validating that you've configured everything correctly, that you've got the correct observability and monitoring in place, and then you're also able to validate your process aspects. Does your DRI, your designated responsible individual, or your on-call engineer have the reporting and monitoring and metrics in place so that they can see what's going on with your service? Do they have the correct troubleshooting guides in place when something goes wrong and they need to investigate and see what's going on further? So all of these things are basically enabled by and validated through chaos engineering. But a big component of this is to do it systematically and in a controlled way.
There's a lot of value in doing this in production. There's a lot of value in sitting down and just injecting faults, and that can be quite fun, but really, it's very dangerous, especially doing it in production. You could impact your customers. You could accidentally cause an outage. It's much better to do this in an earlier stage, in a pre-production environment that's very production-like, that represents the infrastructure it will run on, all of the services that are running and how they're configured, with a production-like workload, usually synthetic, but still representing what your real workload would be. And then you're able to, in a much more systematic and scientific way, assess the resilience of your solutions.
So with chaos engineering, there's a little bit of a metaphor around following the scientific method: we have chaos experiments. But just like with the scientific method, you don't just sit down and start doing an experiment. It's good to sit down and think about what you want to do and form a hypothesis. Which resiliency measures have you taken advantage of, and what are your expectations of your application when it is faced with disruptions? Will you fail over to another region? What happens if an availability zone goes down? What happens if there's an AAD outage or a DNS outage? Formulate an experiment, go out and perform it. Monitor how your system works while it's being disrupted and affected. Then do some post-experiment analysis, make some improvements, and then rinse and repeat. This can become part of a very virtuous cycle, and for mature teams this fits very nicely into a typical product lifecycle and validation cycle. Adding chaos to a CI/CD pipeline is a great way to make this very methodical, systematic, and organized.
There are lots and lots of use cases for chaos engineering, and then of course Azure Chaos Studio. A developer can sit down in their office and just do ad hoc experimentation and disrupt dependencies systematically, one by one, hopefully in a dev or test environment, to validate their new code or validate some upcoming configuration changes and just see how the system will handle it. That's a great way to get started. Another thing lots of folks like to do is host drill events or game days. These are great ways to gather a group of people together in a designated period of time, validate one or more hypotheses, and see how the system stands up to these. Again, best to do in a safe environment, shifting left, but also often done in production. These are great ways to catch monitoring gaps and validate your livesite process, but of course also catch code defects and architectural issues. But as I've mentioned, we really believe that it is important to do this early and, if at all possible, do it in automation. Get it into your CI/CD pipeline or some other automated process that you have. You don't have to shift all the way left and do this in a dev stage with your acceptance tests, although some teams have been quite successful doing that. Most teams that are doing this are quite successful integrating in their integration stage, and they often augment existing stress or performance benchmarks or some other synthetic workload with chaos. They're able to run the existing tests and gauge their baseline, how they perform against known metrics and trends, as well as gauge the impact to those trends and baselines when chaos is applied, and apply that to overall availability metrics.
Digging a little bit further, we've talked about the chaos experiment. Formulate a hypothesis, think about the things that you want to validate, and in automation, what would this look like? Sometimes it's very systematic: disrupt each of your dependencies one by one. In other cases, it's crafting things around a scenario. You could systematically go through and say, I'm dependent on Cosmos DB, I'm dependent on SQL Server, and introduce different disruptions for each of these one by one. Or you could think about this from a scenario perspective and look at things like what happens if an availability zone goes down, or a DNS outage, and so on.
So then here's where Azure Chaos Studio comes in. Azure Chaos Studio is a tool that we developed for use inside Microsoft to validate the resilience of the components that make up Azure. These are things like our identity with Azure Active Directory, or the storage platform, or networking. These teams are using chaos to validate the resilience of the platform itself, but then also other solutions inside Microsoft that are building on top of that platform use chaos to validate their resilience: Microsoft Teams, Dynamics, Power Platform. But we didn't want this just to be an internal tool and to keep it to Microsoft. We wanted our customers to be able to do these same validations. So we're actually releasing Azure Chaos Studio as a product. We want our customers to be able to validate their resilience to things that can go wrong on the Azure platform, but also to validate the resilience of their own applications and solutions and the communications amongst their own microservices, et cetera. Azure Chaos Studio is currently in public preview, and we're currently planning on going GA, general availability, later this year.
But in general, Azure Chaos Studio is a fully managed service. We basically offer chaos as a service. We have deep Azure integration. We have an Azure portal user interface, and we have Azure Resource Manager (ARM) compliant APIs, so you can use Chaos Studio manually from the user interface, or you can use it programmatically through the APIs to validate the resilience of your solutions by introducing faults.
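To make the programmatic path concrete, here is a minimal sketch of kicking off an already-defined Chaos Studio experiment through the ARM REST surface from Python. It assumes the experiment resource already exists; the resource IDs are placeholders and the api-version is illustrative, so check the current Chaos Studio REST reference before relying on it.

```python
# Minimal sketch: start an existing Chaos Studio experiment via the ARM REST API.
# The experiment is assumed to already exist; IDs and api-version are illustrative.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "<resource-group>"     # placeholder
EXPERIMENT_NAME = "<experiment-name>"   # placeholder
API_VERSION = "2023-11-01"              # illustrative; check the current API version

def start_experiment() -> None:
    # Acquire an ARM token with whatever credential is available (CLI login, managed identity, ...).
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    url = (
        "https://management.azure.com"
        f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}"
        f"/providers/Microsoft.Chaos/experiments/{EXPERIMENT_NAME}/start"
        f"?api-version={API_VERSION}"
    )
    response = requests.post(url, headers={"Authorization": f"Bearer {token}"})
    response.raise_for_status()
    print(f"Experiment start accepted: HTTP {response.status_code}")

if __name__ == "__main__":
    start_experiment()
```

The same API surface also covers creating and updating experiment definitions, which is what makes the pipeline integration discussed later in this talk possible.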
A key component of Azure Chaos Studio is an expandable and growing fault library. This fault library has capabilities that allow you to introduce faults in a variety of ways, which I'll talk about a little bit more in a second, and that allow you to perturb your application in realistic ways. And we have workload execution and orchestration that allow you to execute chaos experiments, with fault actions and disruptions that can be applied sequentially and in parallel to make up these scenarios. Again, I'll talk about that a little bit more in a second. But we don't want you to accidentally cause an outage while you're doing this, and we don't want your testing activities to bleed over and affect production, so we're building a lot of safeguards into the product. We've got role-based access control for the Azure resources that can be used in experiments, and we have role-based access control for who can run the experiments, and even what faults can be used on individual resources. One other final thing, super important: we're building integration in so that we've got observability aspects to this. We want you to be able to correlate your chaos activities with the things that are going on in your service, so that you can see right in your own telemetry and monitoring what's happening, and enable evaluation of what's going on, as well as enable you to validate that you have the correct monitoring and observability components in place.
So, even more detail on Azure Chaos Studio. On the right here you'll see the faults and actions that make up our library. We have agent-based faults: we have a chaos agent that can be installed or deployed to your Windows or your Linux virtual machines, and through that we have a variety of faults that can affect things that are running right there on that virtual machine. Lots of resource pressure faults: CPU, memory, disk I/O. We can delay network calls, we can actually block access to IP addresses, we can change firewall rules on Windows systems, and a few other things, and we have more faults coming. On the other side of the house, we have what we call service-direct faults, no agent required. We can disrupt Azure services directly. This allows us to do things like shut down or kill, gracefully or abruptly. We can shut down virtual machines. We have a Cosmos DB failover fault. We can reboot Redis. So, several very specific faults that can be applied to services. Another group of faults that we have like this is for AKS: we leverage Chaos Mesh to be able to apply fault actions to AKS. And then we have some more generic fault activities that can be done. We have network security group rule faults that allow you to block access to IP addresses and ranges, so you could disrupt your own microservice communications. And then, through Azure service tags, we can also block access to other Azure resources. So we can take the Azure Cosmos DB service tag, put that in an NSG rule fault, and block access to Cosmos DB. A really cool sub-aspect of that is that Azure service tags support a regional component. So you could do things like block access to Azure Cosmos DB in East US and affect only the Cosmos DB databases that are running in East US, which would allow you to do things like a region failover or region isolation drill. And then what's on the left of the slide is a lot of the scenarios that we recommend you validate. It's good to think about the faults and the individual disruptions that you can apply to your dependencies, but it's much better to think about these things a lot more systematically, in terms of the incidents and outages that will affect you and your application or service.
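To illustrate how individual faults compose into a scenario, here is a rough sketch of an experiment definition expressed as a Python dict (the shape of the ARM request body). The fault URNs, durations, parameter names, and selector wiring are illustrative assumptions based on the fault library described above; take the exact names from the Chaos Studio fault reference.

```python
# Rough sketch of a Chaos Studio experiment body (as a Python dict) that combines an
# agent-based CPU pressure fault with an NSG rule fault blocking the regional Cosmos DB
# service tag. URNs, versions, and parameter names are illustrative assumptions.
experiment_properties = {
    "selectors": [
        {
            "type": "List",
            "id": "vm-fleet",
            # Targets are chaos-enabled resource IDs (placeholders here).
            "targets": [{"type": "ChaosTarget", "id": "<resource-id-of-agent-target>"}],
        },
        {
            "type": "List",
            "id": "app-nsg",
            "targets": [{"type": "ChaosTarget", "id": "<resource-id-of-nsg-target>"}],
        },
    ],
    "steps": [
        {
            "name": "dependency-and-pressure",
            "branches": [
                {
                    "name": "cpu-pressure",
                    "actions": [
                        {
                            "type": "continuous",
                            "name": "urn:csci:microsoft:agent:cpuPressure/1.0",  # illustrative URN
                            "duration": "PT10M",
                            "selectorId": "vm-fleet",
                            "parameters": [{"key": "pressureLevel", "value": "95"}],
                        }
                    ],
                },
                {
                    "name": "block-cosmos-db",
                    "actions": [
                        {
                            "type": "continuous",
                            "name": "urn:csci:microsoft:networkSecurityGroup:securityRule/1.0",  # illustrative URN
                            "duration": "PT10M",
                            "selectorId": "app-nsg",
                            "parameters": [
                                {"key": "destinationAddresses", "value": "AzureCosmosDB.EastUS"},
                                {"key": "access", "value": "Deny"},
                                {"key": "direction", "value": "Outbound"},
                            ],
                        }
                    ],
                },
            ],
        }
    ],
}
```

Because both branches sit under one step, the CPU pressure and the dependency block run in parallel; putting them in separate steps would run them sequentially, which is how the sequential-and-parallel orchestration mentioned earlier is expressed.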
So again, I've mentioned these several times, but what happens when an availability zone goes down? What happens when there's a DNS outage? Chaos Studio today has faults that help you validate these scenarios in fairly comprehensive but somewhat limited ways. Over time, and by the time we go general availability, we want to have much more robust capabilities to do a full availability zone down exercise. Today we can affect VMs, virtual machine scale sets, Service Fabric virtual machines, and AKS, and have it appear that zone two has gone down in the East US region, or something like that. But over time we'll be able to impact Cosmos DB, SQL Server, and other things in that same way. Same with all of the faults in the library: we're just going to keep working with individual Azure service teams to enable more fault capabilities and expose those in our library, so that you can then build up to more scenarios that you validate.
Now I'm going to hand off to Ravi who will talk to you more about
how they're using Chaos engineering
and Azure Chaos Studio in the cloud security organization to
shift left and add a lot of chaos validation
right in their development pipelines and in their development process to
catch defects early and to prevent outages
and disruptions. Thanks everybody for your time. Hello everyone,
I am Ravi Belam, part of the Microsoft Security organization that provides solutions to protect our customers' and Microsoft's workloads.
I'm leading the charge in delivering proactive security and reliability through platform, data, site reliability engineering, and machine learning solutions in the Microsoft Security Management Plane organization.
Thanks, Chris, for a great overview of chaos engineering practices and Azure Chaos Studio, which is a cutting-edge platform that helps organizations embrace uncertainty and drive innovation. In this segment, I will delve into how a big organization like ours scaled its chaos engineering adoption with a combination of process and technology innovation, followed by Rohith's presentation on his team's experience with adopting Azure Chaos Studio for their services. As Chris mentioned, chaos engineering is a discipline that focuses on creating and managing systems with high reliability by intentionally introducing controlled chaos to test their resilience. Our cloud ecosystem security organization is home to critical services, and it's crucial for us to implement chaos engineering practices to proactively identify and mitigate potential problems before they affect our customers. However, the successful adoption of chaos engineering can be challenging. In this talk, I would like to share the key steps that we have learned to scale chaos engineering adoption proactively.
The first step in scaling chaos engineering adoption is to set the vision by defining the problem and setting a clear goal. Chaos engineering practices are designed to improve the reliability and security of systems, and it's crucial to identify the areas where these improvements are needed the most. By defining the problem, we wanted to focus our effort on improving the areas that needed the most attention and set a clear goal for what we want to achieve. In a bit, I will go into the details of how we set the vision. The second step is to anticipate the common pitfalls and forecast potential problems that may arise during this process. For example, organizations may struggle with obtaining buy-in from stakeholders, ensuring that chaos experiments do not impact end users, measuring the success of chaos engineering experimentation, and the cost of onboarding and prioritization; these are just a few examples out of many.
By anticipating these potential problems, we have prepared and put measures in place to mitigate them.
Once the vision is set and potential problems are anticipated, we then set a strategic approach to scaling chaos engineering adoption. This approach included the following key elements: one, defining the scope of the program; two, establishing the governance of the program; and three, developing a plan for conducting and automating the chaos experiments. The final step in scaling the adoption, or onboarding, of chaos engineering practices is to measure the success and continuously improve the approach. We have achieved this by gathering feedback from the stakeholders, analyzing the data from chaos experiments, analyzing the data from the services' telemetry, and making changes to the approach based upon the insights gained. We have regularly reviewed and updated our program to ensure that it continues to meet the evolving needs of our customers and teams.
Let's look into how we can achieve the key steps that we have
discussed so far. In this section, I will dive deep into
our approach to setting a vision and aligning with certain keystones.
As a quick background, our distributed systems are exponentially
evolving, growing complex by the day. Our services are required
to be updated frequently while expected to deliver the highest level
of quality, performance, availability and reliability.
Service availability and performance could be impacted either by a direct or a dependency-related change. The primary causes of major incidents and outages are service degradation caused by unavailability, congestion, or heavy traffic load, and cascading failures due to the combination of degradation and load. There is also another interesting instance where heavy load crumbles a system that is recovering from an outage or incident, which we have seen in past major outages. Instead of following the status quo of reacting to outages, preventing one as early as possible from reaching production has a positive effect on customer experience, the overall well-being of the on-calls, developer productivity, and revenue. The objective of our chaos engineering effort is unambiguous: to improve the customer and our teams' experience.
We have centered the vision for implementing chaos engineering practices around a few core principles. First, we put a higher emphasis on the importance of a culture of continuous improvement, where our teams are empowered to implement and iterate on their processes to drive positive outcomes. Second, we wanted to proactively identify and mitigate potential failures before they occur, reducing the likelihood of outages and downtime and improving the overall customer experience.
Our data suggests the cost of a problem skyrockets as software engineering progresses. This is where chaos engineering practices shine, allowing us to simulate real-world failures or non-happy-path scenarios at the early stages of software development, saving valuable time and resources. This also improves the overall resilience, reliability, and stability of our systems and processes. The next core principle is avoiding duplication of effort, which is a costly and ineffective approach, as it results in redundant work and resources being spent on the same task. By centralizing our chaos engineering efforts through the use of Chaos Studio, we have been able to avoid duplication of work and instead focus on partnering with the Azure Chaos Studio team.
This collaboration has allowed us to directly implement the features our organization requires and has resulted in a more robust and useful product for both Microsoft and our customers. Our contributions, such as adding features for Cosmos DB failover, network security group, Azure Key Vault access denial, and Azure Key Vault certificate rotation, attribute change, and validity-related faults, have made Chaos Studio an essential tool in our journey towards enhancing our customer and team experiences. To summarize, our ultimate goal is to make sure our services deliver faster with high quality for our customers.
Having established the vision and core principles, we then took a proactive approach and considered the potential hurdles we might encounter during the onboarding process of hundreds of our services. Our analysis was based upon previous similar initiatives. Despite the many benefits, many organizations face challenges in adopting chaos engineering practices into software development. One of the big challenges in adopting chaos engineering is the lack of understanding of the concept and its benefits. Many people are not familiar with chaos engineering and may not see the value in introducing chaos into their systems. This lack of understanding can lead to resistance and reluctance to adopt the practices. Even for organizations that have a good understanding, a challenge in adopting chaos engineering is the technical aspect. Creating chaos scenarios and conducting experiments can be complex and require specialized skills and knowledge. This can be especially difficult for organizations that have limited technical resources and other priorities.
Additionally, the infrastructure required for chaos engineering can also be costly and time-consuming to implement. Integrating chaos engineering practices into existing workflows can also be challenging. Organizations need to figure out how to incorporate chaos engineering into the development and operational process without disrupting the work they are already doing. This requires careful planning and a deep understanding of the current workflows. Finally, measuring the success of chaos engineering can be difficult. There are many factors that contribute to the success of chaos engineering, and it can be challenging to quantify these factors and determine the overall impact on the system. This can make it difficult for organizations to see the value of the investment they are making in chaos engineering and to make decisions about continuing or expanding the adoption of the practices.
So far, we have covered the vision, key principles, and challenges we may encounter during the adoption process. With our goals and potential obstacles in mind, we devised the strategy for success and scalability that we will now delve into. As outlined previously, our objective is to design dependable and efficient services that deliver a smooth experience for our customers and our teams. To accomplish this, we need a comprehensive chaos engineering platform, in our case Azure Chaos Studio, and streamlined processes that are easy to use, even for those with limited knowledge. Apart from the previously discussed challenges, one other challenge of adopting chaos engineering is the complexity of individual services. With hundreds of services, it can be difficult to identify all the potential failure scenarios and to create meaningful chaos experiments. Another challenge is the risk of introducing failures into a live system. The consequence of failed experiments can be severe, causing downtime and loss of revenue. To overcome these challenges, it's important to develop a strategy that considers the system's complexity and the potential risks involved.
Our approach is to break down the scenarios into nominal and disaster scenarios. Nominal scenarios are those that can be quickly executed in the CI/CD pipelines as part of the build qualification process. They are typically less risky and focus on simple failures such as network connectivity or resource exhaustion. CPU pressure, memory pressure, disk I/O pressure, and dependency disruption are a few examples; a sketch of how such a scenario can gate a build follows below.
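As a rough illustration of what that build-qualification gate can look like, here is a minimal sketch. The helpers (start_experiment, wait_for_completion, query_error_rate) are hypothetical placeholders for whatever wrappers a team has around the Chaos Studio and monitoring APIs, and the SLO threshold is an assumption, not our actual number.

```python
# Minimal sketch of gating a build on a nominal chaos scenario in a pipeline step.
# The three helpers are hypothetical placeholders for a team's own wrappers around
# the Chaos Studio and monitoring APIs; the SLO threshold is illustrative.
import sys

ERROR_RATE_SLO = 0.001  # illustrative: allow at most 0.1% failed requests under the fault

def start_experiment(experiment_name: str) -> None:
    """Hypothetical wrapper: start the named Chaos Studio experiment (e.g. via its ARM start action)."""
    raise NotImplementedError

def wait_for_completion(experiment_name: str) -> None:
    """Hypothetical wrapper: block until the experiment's execution finishes."""
    raise NotImplementedError

def query_error_rate(window_minutes: int) -> float:
    """Hypothetical wrapper: return the service's failed-request ratio over the last N minutes."""
    raise NotImplementedError

def qualify_build(experiment_name: str) -> bool:
    baseline = query_error_rate(window_minutes=15)   # behavior before injecting the fault
    start_experiment(experiment_name)                # e.g. CPU pressure on the pre-production fleet
    wait_for_completion(experiment_name)
    during = query_error_rate(window_minutes=15)     # behavior while the fault was active

    # Pass only if the service stayed within its SLO and did not degrade badly vs. its own baseline.
    return during <= ERROR_RATE_SLO and during <= max(baseline * 2, ERROR_RATE_SLO)

if __name__ == "__main__":
    sys.exit(0 if qualify_build("cpu-pressure-nominal") else 1)  # non-zero exit fails the stage
```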
Disaster scenarios, on the other hand, focus on more complex failures. For our cases, these are things like AAD outages, DNS outages, load balancer based outages, availability zone outages, data center outages, and load-related large-scale cascading failures. We put together recommendations within our documented standards that these scenarios should be carefully planned and executed in a controlled environment, such as a staging environment or sandbox, and should not move to production unless the teams gain confidence with repeated experimentation and recovery validation. As discussed before, one of the core aspects of chaos adoption is the continuous measurement of the effectiveness of the scenario validation. In order to validate the effectiveness of the chaos scenario testing, we teamed up with Azure Chaos Studio to track the resiliency of our services before, during, and after each experiment using the telemetry data. This not only gave us a clear picture of the resiliency of our services, but also enabled us to set standards and guard against future deviations in our resiliency.
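One way to do that kind of before/during/after comparison is to query the service's telemetry around the experiment window. The sketch below uses the azure-monitor-query Python package against a Log Analytics workspace; the workspace ID, table, and column names are assumptions standing in for whatever telemetry a service actually emits.

```python
# Sketch: compare a service's failure rate before and during a chaos experiment window.
# Workspace ID, table name (AppRequests), and columns are illustrative assumptions.
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder

QUERY = """
AppRequests
| summarize failure_rate = countif(Success == false) * 1.0 / count()
"""

def failure_rate(client: LogsQueryClient, start: datetime, end: datetime) -> float:
    response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=(start, end))
    return float(response.tables[0].rows[0][0])

if __name__ == "__main__":
    client = LogsQueryClient(DefaultAzureCredential())
    experiment_end = datetime.now(timezone.utc)
    experiment_start = experiment_end - timedelta(minutes=30)  # illustrative experiment window

    before = failure_rate(client, experiment_start - timedelta(hours=1), experiment_start)
    during = failure_rate(client, experiment_start, experiment_end)
    print(f"failure rate before: {before:.4%}, during: {during:.4%}")
```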
With a solid understanding of the technology and process in place, we chose to pilot the implementation of chaos engineering within the critical services in our organization.
Our selection process was strategic, choosing services that provided a comprehensive representation of the compute, networking, storage, and authentication stacks. This allowed us to standardize nominal and disaster scenarios and gather valuable feedback from the pilot services' onboarding, which let us improve our process over time and raise the needed features with the Chaos Studio team, enabling other organizations and other teams within Microsoft to be successful as well.
Here is an updated overview of our organization's onboarding progress
made possible by the power of Azure Chaos Studio and our effective
onboarding strategy. To date, we have effectively onboarded
55 services with an additional 65 services in the
process of onboarding. Of the 55 onboarded services,
six are utilizing nominal scenarios for build qualification and
two are conducting monthly disaster scenario validation.
Our efforts have thus far uncovered 60 previously
undetected bugs and prevented 22 critical issues that could have negatively
impacted the customer experience. This is an ongoing effort
and we are continuously working with the Azure Chaos Studio team to enhance the features within Chaos Studio, which will eventually help us get better at adopting many more scenarios. To summarize our learnings: scaling chaos engineering practices for your organization requires a clear vision, a strategic approach, and continuous improvement. By following the steps we have discussed thus far, organizations can effectively implement chaos engineering practices and proactively achieve reliability and security.
By constantly measuring the success and making improvements, your organization can continuously improve its chaos engineering program and adoption and ensure that it stays aligned with your evolving needs. With that, I'm going to hand over to Rohith, who will go over his team's successful journey of using Azure Chaos Studio and the strategy we discussed so far. Thanks, everyone. Hello everyone.
I'm Rohit Gundadi, part of Microsoft Secrets Automation. I'm the
engineering manager for the key management platform that protects the data
of Microsoft and our customers. Thanks, Chris and Ravi, for the overview of chaos engineering and Azure Chaos Studio, and the detailed explanation of how automation can scale chaos engineering adoption. During this presentation, I will concentrate on how we boosted our confidence in our service resiliency and understood its bounds and limitations by incorporating Azure Chaos Studio. I'll also talk about the specifics of our method for creating scenarios, integrating with our CI/CD pipelines, and the situations we successfully prevented from affecting production. The expected outcomes of investing in chaos engineering are twofold. First, to enhance the customer experience by accelerating the delivery of top-quality features. Second, to avoid production problems, thereby improving customer satisfaction and mitigating engineering firefighting, thus enabling more time for future development, which supports the accomplishment of the first outcome. Aligned with the strategy
that Ravi mentioned, our strategy to achieve these outcomes revolves around five pillars. First, service level agreements establish the performance and reliability standards for a service. Rather than focusing solely on individual faults, our approach considers scenarios that impact our customer commitments and internal goals, such as service level objectives and service level indicators. This approach provides a comprehensive solution, offering both recommended standard scenarios commonly shared across teams and unique scenarios specific to our service. Second, it is crucial to continuously test chaos scenarios across the entire application stack, including compute, networking, storage, and auth, as real-world failures can arise from within a single stack or as a cascading effect across multiple stacks. Testing only one stack at a time may not fully reflect the complexity and interdependency of the entire system, leaving it vulnerable to unforeseen disruptions. By testing the full application stack, we want to better prepare for that unpredictability. Having determined
to adopt a top-down approach and test both within and across the application stack, the next step is to map out specific nominal and disaster scenarios for our service. Before inducing failures into our system, we documented various service-specific scenarios and their predicted outcomes, referred to as hypotheses, and then tested them using Chaos Studio simulations. This allowed us to concentrate our chaos engineering efforts and create targeted experiments. During this stage, we also had the opportunity to request desired features from the Azure Chaos Studio and Azure Load Testing teams.
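For illustration only, a documented scenario-plus-hypothesis entry can be as simple as a small structured record like the sketch below; the scenario names, faults, and expected outcomes here are made-up examples, not our actual catalog.

```python
# Illustrative only: a lightweight way to document scenarios and their hypotheses
# before running them. The entries below are made-up examples, not a real catalog.
from dataclasses import dataclass

@dataclass
class ChaosScenario:
    name: str          # short scenario identifier
    stack: str         # compute / networking / storage / auth
    fault: str         # the fault(s) to inject
    hypothesis: str    # the predicted outcome to validate

SCENARIO_CATALOG = [
    ChaosScenario(
        name="cpu-pressure-nominal",
        stack="compute",
        fault="95% CPU pressure on the data-plane fleet for 10 minutes",
        hypothesis="Autoscaling adds capacity; latency rises but no operations fail.",
    ),
    ChaosScenario(
        name="storage-failover-disaster",
        stack="storage",
        fault="Fail over the primary data store to the secondary",
        hypothesis="Writes pause briefly, then resume with no data loss and no manual intervention.",
    ),
]
```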
Next, we began conducting chaos scenarios and monitored the service behavior before, during, and after the experiments. This allowed us to experience the robust features of Azure Chaos Studio, including its ability to introduce faults, versatile experiment creation, automation-friendly design, a portal where we can review past runs, and, most importantly, its ability to undo the simulation. Post-experiment, we also provided feedback to Chaos Studio on how to enhance the customer onboarding experience and contributed to features that can help us monitor service health and automate the end-to-end process. We conducted dry runs of these scenarios to validate our hypotheses. After several dry runs, we compiled a list of validated scenarios, or experiments. We incorporated these chaos engineering scenarios into our CI/CD pipelines to automate the chaos engineering process and ensure that our systems are constantly tested and validated as changes to our code base are made.
By following these steps, we were able to implement chaos engineering
in a methodical and controlled way, enhancing the reliability and durability
of our service and reducing the likelihood of failures in production.
Here is a high-level overview of our service architecture. Our service architecture consists of three major components: the data plane, control plane, and management plane. To meet our service needs, we utilize infrastructure-as-a-service virtual machines for the data plane, and Service Fabric-backed virtual machine scale sets for the rest. We store our data using Azure Cosmos DB and Azure Storage accounts. Our customers access our endpoints through a load balancer which is managed by DNS. Our applications are secured within a virtual network with a network security group. Our authentication stack is based on Azure Active Directory, and we use an internal certificate authority (PKI) service and a Key Vault equivalent for certificate and secret management.
Next, let's go over the details of our service scenarios and the automation we were able to achieve. To meet our adoption needs, we have developed some nominal and disaster scenario standards. Although these standards are stack-specific, we have tested combinations of these stacks. For instance, we load each stack while running both nominal and disaster scenarios, as without specific load the non-typical scenarios will not be triggered. The same applies to combining network and compute scenarios, et cetera. What we see here is a small slice of the standards we were able to come up with. It is essential to test for resource exhaustion in our compute stack, whether caused by bugs or resource leaks. Our initial runs aim to establish the boundaries of our service, evaluate its ability to self-recover using autoscaling, and set benchmarks. This helps us ensure that any future changes to the service do not compromise our ability to handle resource exhaustion. We also wanted to determine if we are over-provisioning, and if reducing the capacity while increasing the load, through a combination of load and fault injection, could lead to cost savings.
In terms of compute disaster, we wanted to assess the preparedness of our service, monitoring systems, and recovery procedures in the event of a partial or full outage of an availability zone, availability set, rack, or data center. In regards to storage, we tested the overall service resiliency during a failover from the primary to the secondary. We also evaluated network degradation between compute and storage, resulting in valuable insights. Further, for the network, we continuously validate important scenarios such as inter-dependency communication outages, implemented using network security group faults. This helped us test cross-stack resilience and identify some interesting design flaws and bugs.
Here are a few outcomes of real Chaos Studio experimentation, where we were able to successfully validate the hypothesis in certain cases and saw it fail in others. For example, we were able to catch a bug in pre-production by replicating 99% CPU pressure on our fleet. While a latency increase was expected because hardware limitations prevented our service from autoscaling, we also saw operation failures, which were not expected; we have since fixed the issue. As you can see, in the other scenario, the service crashed and failed to auto-recover when we replicated slow network packet transmission from compute to storage under heavy traffic. These types of bugs are difficult to detect unless a real-world event occurs. With the help of Chaos Studio, we were not only able to replicate the issue, but continuously validate it as part of our CI/CD process. Our load testing, which we currently conduct using Azure Load Testing, helped prevent performance-related bugs caused by heavy load from reaching production. We expect to achieve the same results with Azure Chaos Studio in the near future.
As previously mentioned, it is essential to shift left in the development cycle and continuously measure the resiliency of the service by eliminating manual touches to chaos experimentation. To accomplish this, we have developed and internally released an Azure DevOps extension for Azure Chaos Studio, which can be used in release pipelines and helps teams automate nominal and disaster scenarios in their pipelines. For our service, we benchmark official builds in pre-production by running load, performance, and chaos engineering nominal scenarios before allowing the build to reach production. We also run automatic bi-weekly disaster scenarios that give us confidence in our service resiliency.
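As a rough sketch of what an automated disaster drill can look like outside the portal, the snippet below starts a pre-defined zone-down style experiment and then measures how long the service takes to report healthy again. The health endpoint, experiment resource ID, api-version, and timing thresholds are illustrative assumptions, not our actual pipeline.

```python
# Rough sketch: run a scheduled disaster drill and measure time-to-recovery.
# The health URL, experiment resource ID, api-version, and thresholds are illustrative.
import time
import requests
from azure.identity import DefaultAzureCredential

EXPERIMENT_ID = (
    "/subscriptions/<sub>/resourceGroups/<rg>"
    "/providers/Microsoft.Chaos/experiments/zone-down-drill"   # placeholder resource ID
)
HEALTH_URL = "https://preprod.example.com/healthz"             # hypothetical health probe
RECOVERY_SLO_SECONDS = 600                                     # illustrative recovery objective

def arm_headers() -> dict:
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    return {"Authorization": f"Bearer {token}"}

def run_drill() -> float:
    """Start the drill; return seconds from the first failed probe to the first healthy probe."""
    requests.post(
        f"https://management.azure.com{EXPERIMENT_ID}/start?api-version=2023-11-01",
        headers=arm_headers(),
    ).raise_for_status()

    first_unhealthy = None
    deadline = time.monotonic() + 2 * RECOVERY_SLO_SECONDS
    while time.monotonic() < deadline:
        try:
            healthy = requests.get(HEALTH_URL, timeout=5).status_code == 200
        except requests.RequestException:
            healthy = False
        if not healthy and first_unhealthy is None:
            first_unhealthy = time.monotonic()                 # drill impact first observed
        if healthy and first_unhealthy is not None:
            return time.monotonic() - first_unhealthy          # recovered
        time.sleep(15)
    if first_unhealthy is None:
        return 0.0  # no customer-visible impact observed during the drill window
    raise TimeoutError("service did not recover within twice the recovery SLO")

if __name__ == "__main__":
    recovery_seconds = run_drill()
    print(f"time to healthy: {recovery_seconds:.0f}s (SLO {RECOVERY_SLO_SECONDS}s)")
```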
In conclusion, by utilizing Azure Chaos Studio and incorporating technology and processes with standards for nominal and disaster scenarios, we have initiated a journey of continuous verification of our service's security, reliability, and resiliency. This approach has already begun to yield benefits, enhancing the customer and engineering experience. In today's dynamic and complex technology landscape, it's more important than ever to ensure that our services are secure, reliable, and resilient. This is where chaos engineering and Azure Chaos Studio come in, providing a proactive approach to identify and mitigate potential issues before they impact real-world scenarios.
Automation using CI/CD helps to streamline and speed up the
process, allowing for quicker and more frequent testing and deployment.
By standardizing nominal and disaster scenarios and scaling
adoption within large organizations, we're able to
create a consistent and systemic approach to
continuous verification. This not only benefits our customers by
providing a more stable and dependable service, resulting in better customer
experience, but also our engineers by reducing fatigue and
allowing them to focus on innovation and growth.
Overall, our presentation on chaos engineering, Azure Chaos Studio, and scaling adoption within large organizations by standardizing nominal and disaster scenarios is a game changer in the world of technology, and we believe it will help take your organization to new heights of success. Thank you for watching. We hope you have a successful journey in implementing chaos engineering in a controlled and secure manner.