Transcript
Hi everybody, my name is Chris Ashton and I'm here with Ravi and Rohith from Microsoft, and we're going to talk to you today about using chaos engineering practices to shift left and prevent outages.
I'm going to talk to you a little bit about chaos engineering in
general and give an introduction to Azure Chaos Studio. And then
Ravi and Rohith will take over and talk more in depth about practices they've
developed and are using in their teams.
Software as a service has revolutionized cloud application development.
It's now easier to get applications up
and running in a shorter period of time, and you have a variety of
microservices and other services available in a
toolbox to pull together and compose and build these applications.
While this has great benefits, with it comes
a new set of challenges.
Service outages and disruptions can happen at any time, and they impact your overall application availability and can therefore upset your customers. You have a lot less control over the environment in which your things are running: you could be subject to a power outage at a data center, a flood in a control room, the network going down, bugs being introduced by developers into different components, configuration changes coming at any time. Lots of variables make this a very unfriendly environment.
So now more than ever, applications must be
resilient. They must be designed to handle these failures.
Microsoft provides the Well-Architected Framework in the Azure Architecture Center to give application developers guidance on how to build resilient applications. But basically, resilience is a shared responsibility. Microsoft needs to ensure that the platform is resilient and build a lot of capabilities into the services that make up all of the components that you can use when you're composing your applications and the infrastructure that they run on. But then it's also the responsibility of the application developer to validate their resilience and make sure that they can handle the conditions that they've designed for and the unexpected things that come up. Basically, trust but verify.
Build your application using these guidelines and
then make sure that it operates in the way that you expect and understand.
These quality practices can be done on an
ad hoc basis. Some folks perform game days
and drill events and that's great. But more mature
organizations are building these quality practices right into their service development
lifecycle. Add chaos engineering to your CI/CD pipeline, test in a pre-production environment.
Shield your customers from these outages and issues and
shift left and do these things earlier so that you can catch the defects early
and prevent them from going to production.
This is where chaos engineering comes in. Chaos engineering is a practice that has evolved over the last decade: subjecting your application and services to the real-world outages and disruptions that they will face in production.
You've designed for failure, you've architected for resilience.
Make sure that you can stand
up to the disruptions that will be encountered.
With chaos engineering, you're not just validating that you've chosen the right architecture and that your code doesn't have any bugs. You're also validating that you've configured everything correctly, that you've got the correct observability and monitoring in place, and then you're also able to validate your process aspects. Does your DRI, your designated responsible individual, or your on-call engineer have the reporting and monitoring and metrics in place so that they can see what's going on with your service? Do they have the correct troubleshooting guides in place when something goes wrong and they need to investigate and see what's going on further? So all of these things are basically enabled by and validated through chaos engineering. But a big component of this is to do it systematically and in a controlled way.
There's a lot of value in doing this in production. There's a lot of value in sitting down and just injecting faults, and that can be quite fun, but really, it's very dangerous, especially doing it in production. You could impact your customers. You could accidentally cause an outage. It's much better to do this in an earlier stage, in a pre-production environment that's very production-like, that represents the infrastructure it will run on, all of the services that are running and how they're configured, with a production-like workload, usually synthetic, but still representing what your real workload would be. And then you're able to, in a much more systematic and scientific way, assess the resilience of your solutions.
So with chaos engineering, there's a little bit of a metaphor around following the scientific method: we have chaos experiments. But just like with the scientific method, you don't just sit down and start doing an experiment. It's good to sit down and think about what you want to do and form a hypothesis. Which resiliency measures have you taken advantage of, and what are your expectations of your application when it is faced with disruptions? Will you fail over to another region? What happens if an availability zone goes down? What happens if there's an AAD outage or a DNS outage? Formulate an experiment, go out and perform it. Monitor how your system works while it's being disrupted and affected. Then do some post-experiment analysis, make some improvements, and then rinse and repeat. This can become part of a very virtuous cycle, and for mature teams this fits very nicely into a typical product lifecycle and validation cycle. Adding chaos to a CI/CD pipeline is a great way to make this very methodical, systematic, and organized.
There are lots and lots of use cases for chaos engineering, and then of course Azure Chaos Studio. A developer can sit down in their office and just do ad hoc experimentation and disrupt dependencies systematically, one by one, hopefully in a dev or test environment, to validate their new code or validate some upcoming configuration changes and just see how the system will handle it. That's a great way to get started. Another thing lots of folks like to do is host drill events or game days. These are great ways to gather a group of people together in a designated period of time, validate one or more hypotheses, and see how the system stands up to these. Again, best to do in a safe environment, shifting left, but also often done in production. These are great ways to catch monitoring gaps and validate your livesite process, but of course also catch code defects and architectural issues. But as I've mentioned, we really believe that it is important to do this early and, if at all possible, do it in automation. Get it into your CI/CD pipeline or some other automated process that you have. You don't have to shift all the way left and do this in a dev stage with your acceptance tests, although some teams have been quite successful doing that. Most teams that are doing this are quite successful integrating in their integration stage, and they often augment existing stress or performance benchmarks or some other synthetic workload with chaos. They're able to run the existing tests and gauge their baseline, how they perform against known metrics and trends, as well as gauge the impact to those trends and baselines when chaos is applied, and apply that to overall availability metrics.
Digging a little bit further, we've talked about the chaos experiment. Formulate a hypothesis, think about the things that you want to validate, and in automation, what would this look like? Sometimes it's very systematic: disrupt each of your dependencies one by one. In other cases, it's crafting things around a scenario. You could systematically go through and say, I'm dependent on Cosmos DB, I'm dependent on SQL Server, and introduce different disruptions for each of these one by one. Or you could think about this from a scenario perspective and look at things like what happens if an availability zone goes down, or a DNS outage, and so on.
So then here's where Azure Chaos Studio comes in. Azure Chaos Studio is a tool that we developed for use inside Microsoft to validate the resilience of the components that make up Azure. These are things like our identity with Azure Active Directory, or the storage platform, or networking. These teams are using chaos to validate the resilience of the platform itself, but then also other solutions inside Microsoft that are building on top of that platform use chaos to validate their resilience: Microsoft Teams, Dynamics, Power Platform. But we didn't want this just to be an internal tool and to keep it to Microsoft. We wanted our customers to be able to do these same validations. So we're actually releasing Azure Chaos Studio as a product. We want our customers to be able to validate their resilience to things that can go wrong on the Azure platform, but also to validate the resilience of their own applications and solutions and the communications amongst their own microservices, et cetera. Azure Chaos Studio is currently in public preview, and we're currently planning on going GA, general availability, later this year.
But in general, Azure Chaos Studio is a fully managed service. We basically offer chaos as a service. We have deep Azure integration. We have an Azure portal user interface, and we have Azure Resource Manager (ARM) compliant APIs, so you can use Chaos Studio manually from the user interface, or you can use it programmatically through the APIs to validate the resilience of your solutions by introducing faults.
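To make the programmatic path concrete, here is a minimal sketch of kicking off an already-defined Chaos Studio experiment through the ARM REST surface from Python. It assumes the experiment resource already exists; the resource IDs are placeholders and the api-version is illustrative, so check the current Chaos Studio REST reference before relying on it.

```python
# Minimal sketch: start an existing Chaos Studio experiment via the ARM REST API.
# The experiment is assumed to already exist; IDs and api-version are illustrative.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "<resource-group>"     # placeholder
EXPERIMENT_NAME = "<experiment-name>"   # placeholder
API_VERSION = "2023-11-01"              # illustrative; check the current API version

def start_experiment() -> None:
    # Acquire an ARM token with whatever credential is available (CLI login, managed identity, ...).
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    url = (
        "https://management.azure.com"
        f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}"
        f"/providers/Microsoft.Chaos/experiments/{EXPERIMENT_NAME}/start"
        f"?api-version={API_VERSION}"
    )
    response = requests.post(url, headers={"Authorization": f"Bearer {token}"})
    response.raise_for_status()
    print(f"Experiment start accepted: HTTP {response.status_code}")

if __name__ == "__main__":
    start_experiment()
```

The same API surface also covers creating and updating experiment definitions, which is what makes the pipeline integration discussed later in this talk possible.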
A key component of Azure Chaos Studio is an expandable and growing fault library. This fault library has capabilities that allow you to introduce faults in a variety of ways, which I'll talk about a little bit more in a second, and that allow you to perturb your application in realistic ways. And we have workload execution and orchestration that allow you to execute chaos experiments, with fault actions and disruptions that can be applied sequentially and in parallel to make up these scenarios. Again, I'll talk about that a little bit more in a second. But we don't want you to accidentally cause an outage while you're doing this, and we don't want your testing activities to bleed over and affect production, so we're building a lot of safeguards into the product. We've got role-based access control for the Azure resources that can be used in experiments, and we have role-based access control for who can run the experiments, and even what faults can be used on individual resources. One other final thing, super important: we're building integration in so that we've got observability aspects to this. We want you to be able to correlate your chaos activities with the things that are going on in your service, so that you can see right in your own telemetry and monitoring what's happening, and enable evaluation of what's going on, as well as enable you to validate that you have the correct monitoring and observability components in place.
So, even more detail on Azure Chaos Studio. On the right here you'll see the faults and actions that make up our library. We have agent-based faults: we have a chaos agent that can be installed or deployed to your Windows or your Linux virtual machines, and through that we have a variety of faults that can affect things that are running right there on that virtual machine. Lots of resource pressure faults: CPU, memory, disk I/O. We can delay network calls, we can actually block access to IP addresses, we can change firewall rules on Windows systems, and a few other things, and we have more faults coming. On the other side of the house, we have what we call service-direct faults, no agent required. We can disrupt Azure services directly. This allows us to do things like shut down or kill, gracefully or abruptly. We can shut down virtual machines. We have a Cosmos DB failover fault. We can reboot Redis. So, several very specific faults that can be applied to services. Another group of faults that we have like this is for AKS: we leverage Chaos Mesh to be able to apply fault actions to AKS. And then we have some more generic fault activities that can be done. We have network security group rule faults that allow you to block access to IP addresses and ranges, so you could disrupt your own microservice communications. And then, through Azure service tags, we can also block access to other Azure resources. So we can take the Azure Cosmos DB service tag, put that in an NSG rule fault, and block access to Cosmos DB. A really cool sub-aspect of that is that Azure service tags support a regional component. So you could do things like block access to Azure Cosmos DB in East US and affect only the Cosmos DB databases that are running in East US, which would allow you to do things like a region failover or region isolation drill. And then what's on the left of the slide is a lot of the scenarios that we recommend you validate. It's good to think about the faults and the individual disruptions that you can apply to your dependencies, but it's much better to think about these things a lot more systematically, in terms of the incidents and outages that will affect you and your application or service.
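To illustrate how individual faults compose into a scenario, here is a rough sketch of an experiment definition expressed as a Python dict (the shape of the ARM request body). The fault URNs, durations, parameter names, and selector wiring are illustrative assumptions based on the fault library described above; take the exact names from the Chaos Studio fault reference.

```python
# Rough sketch of a Chaos Studio experiment body (as a Python dict) that combines an
# agent-based CPU pressure fault with an NSG rule fault blocking the regional Cosmos DB
# service tag. URNs, versions, and parameter names are illustrative assumptions.
experiment_properties = {
    "selectors": [
        {
            "type": "List",
            "id": "vm-fleet",
            # Targets are chaos-enabled resource IDs (placeholders here).
            "targets": [{"type": "ChaosTarget", "id": "<resource-id-of-agent-target>"}],
        },
        {
            "type": "List",
            "id": "app-nsg",
            "targets": [{"type": "ChaosTarget", "id": "<resource-id-of-nsg-target>"}],
        },
    ],
    "steps": [
        {
            "name": "dependency-and-pressure",
            "branches": [
                {
                    "name": "cpu-pressure",
                    "actions": [
                        {
                            "type": "continuous",
                            "name": "urn:csci:microsoft:agent:cpuPressure/1.0",  # illustrative URN
                            "duration": "PT10M",
                            "selectorId": "vm-fleet",
                            "parameters": [{"key": "pressureLevel", "value": "95"}],
                        }
                    ],
                },
                {
                    "name": "block-cosmos-db",
                    "actions": [
                        {
                            "type": "continuous",
                            "name": "urn:csci:microsoft:networkSecurityGroup:securityRule/1.0",  # illustrative URN
                            "duration": "PT10M",
                            "selectorId": "app-nsg",
                            "parameters": [
                                {"key": "destinationAddresses", "value": "AzureCosmosDB.EastUS"},
                                {"key": "access", "value": "Deny"},
                                {"key": "direction", "value": "Outbound"},
                            ],
                        }
                    ],
                },
            ],
        }
    ],
}
```

Because both branches sit under one step, the CPU pressure and the dependency block run in parallel; putting them in separate steps would run them sequentially, which is how the sequential-and-parallel orchestration mentioned earlier is expressed.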
So again, I've mentioned these several times, but what happens when an availability zone goes down? What happens when there's a DNS outage? Chaos Studio today has faults that help you validate these scenarios in fairly comprehensive but somewhat limited ways. Over time, and by the time we go general availability, we want to have much more robust capabilities to do a full availability zone down exercise. Today we can affect VMs, virtual machine scale sets, Service Fabric virtual machines, and AKS, and have it appear that zone two has gone down in the East US region, or something like that. But over time we'll be able to impact Cosmos DB, SQL Server, and other things in that same way. Same with all of the faults in the library: we're just going to keep working with individual Azure service teams to enable more fault capabilities and expose those in our library, so that you can then build up to more scenarios that you validate.
Now I'm going to hand off to Ravi who will talk to you more about
how they're using Chaos engineering
and Azure Chaos Studio in the cloud security organization to
shift left and add a lot of chaos validation
right in their development pipelines and in their development process to
catch defects early and to prevent outages
and disruptions. Thanks everybody for your time. Hello everyone,
I am Ravi Belam, part of the Microsoft Security organization that provides solutions to protect our customers' and Microsoft's workloads.
I'm leading the charge in delivering proactive security and reliability through platform, data, site reliability engineering, and machine learning solutions in the Microsoft Security Management Plane organization.
Thanks, Chris, for a great overview of chaos engineering practices and Azure Chaos Studio, which is a cutting-edge platform that helps organizations embrace uncertainty and drive innovation. In this segment, I will delve into how a big organization like ours scaled its chaos engineering adoption with a combination of process and technology innovation, followed by Rohith's presentation on his team's experience with adopting Azure Chaos Studio for their services. As Chris mentioned, chaos engineering is a discipline that focuses on creating and managing systems with high reliability by intentionally introducing controlled chaos to test their resilience. Our cloud ecosystem security organization is home to critical services, and it's crucial for us to implement chaos engineering practices to proactively identify and mitigate potential problems before they affect our customers. However, the successful adoption of chaos engineering can be challenging. In this talk, I would like to share the key steps that we have learned to scale chaos engineering adoption proactively.
The first step in scaling chaos engineering adoption is to set the vision by defining the problem and setting a clear goal. Chaos engineering practices are designed to improve the reliability and security of systems, and it's crucial to identify the areas where these improvements are needed the most. By defining the problem, we wanted to focus our effort on improving the areas that needed the most attention and set a clear goal for what we want to achieve. In a bit, I will go into the details of how we set the vision. The second step is to anticipate the common pitfalls and forecast potential problems that may arise during this process. For example, organizations may struggle with obtaining buy-in from stakeholders, ensuring that chaos experiments do not impact end users, measuring the success of chaos engineering experimentation, and the cost of onboarding and prioritization; these are just a few examples out of many.
By anticipating these potential problems, we have prepared and put measures in place to mitigate them.
Once the vision is set and potential problems are anticipated, we then set a strategic approach to scaling chaos engineering adoption. This approach included the following key elements: one, defining the scope of the program; two, establishing the governance of the program; and three, developing a plan for conducting and automating the chaos experiments. The final step in scaling the adoption, or onboarding, of chaos engineering practices is to measure the success and continuously improve the approach. We have achieved this by gathering feedback from the stakeholders, analyzing the data from chaos experiments, analyzing the data from the services' telemetry, and making changes to the approach based upon the insights gained. We have regularly reviewed and updated our program to ensure that it continues to meet the evolving needs of our customers and teams.
Let's look into how we can achieve the key steps that we have
discussed so far. In this section, I will dive deep into
our approach to setting a vision and aligning with certain keystones.
As a quick background, our distributed systems are exponentially
evolving, growing complex by the day. Our services are required
to be updated frequently while expected to deliver the highest level
of quality, performance, availability and reliability.
Service availability and performance could be impacted either by a direct or a dependency-related change. The primary causes of major incidents and outages are service degradation caused by unavailability, congestion, or heavy traffic load, and cascading failures due to the combination of degradation and load. There is also another interesting instance where heavy load crumbles a system that is recovering from an outage or incident, which we have seen in past major outages. Instead of following the status quo of reacting to outages, preventing one as early as possible from reaching production has a positive effect on customer experience, the overall well-being of the on-calls, developer productivity, and revenue. The objective of our chaos engineering effort is unambiguous: to improve the customer and our teams' experience.
We have centered the vision for implementing chaos engineering practices around a few core principles. First, we put a higher emphasis on the importance of a culture of continuous improvement, where our teams are empowered to implement and iterate on their processes to drive positive outcomes. Second, we wanted to proactively identify and mitigate potential failures before they occur, reducing the likelihood of outages and downtime and improving the overall customer experience.
Our data suggests the cost of a problem skyrockets as software engineering progresses. This is where chaos engineering practices shine, allowing us to simulate real-world failures or non-happy-path scenarios at the early stages of software development, saving valuable time and resources. This also improves the overall resilience, reliability, and stability of our systems and processes. The next core principle is avoiding duplication of effort, which is a costly and ineffective approach, as it results in redundant work and resources being spent on the same task. By centralizing our chaos engineering efforts through the use of Chaos Studio, we have been able to avoid duplication of work and instead focus on partnering with the Azure Chaos Studio team.
This collaboration has allowed us to directly implement the features our organization requires and has resulted in a more robust and useful product for both Microsoft and our customers. Our contributions, such as adding features for Cosmos DB failover, network security group, Azure Key Vault access denial, and Azure Key Vault certificate rotation, attribute change, and validity-related faults, have made Chaos Studio an essential tool in our journey towards enhancing our customer and team experiences. To summarize, our ultimate goal is to make sure our services deliver faster with high quality for our customers.
Having established the vision and core principles, we then took a proactive approach and considered the potential hurdles we might encounter during the onboarding process of hundreds of our services. Our analysis was based upon previous similar initiatives. Despite the many benefits, many organizations face challenges in adopting chaos engineering practices into software development. One of the big challenges in adopting chaos engineering is the lack of understanding of the concept and its benefits. Many people are not familiar with chaos engineering and may not see the value in introducing chaos into their systems. This lack of understanding can lead to resistance and reluctance to adopt the practices. Even for organizations that have a good understanding, a challenge in adopting chaos engineering is the technical aspect. Creating chaos scenarios and conducting experiments can be complex and require specialized skills and knowledge. This can be especially difficult for organizations that have limited technical resources and other priorities.
Additionally, the infrastructure required for chaos engineering can also be costly and time-consuming to implement. Integrating chaos engineering practices into existing workflows can also be challenging. Organizations need to figure out how to incorporate chaos engineering into the development and operational process without disrupting the work they are already doing. This requires careful planning and a deep understanding of the current workflows. Finally, measuring the success of chaos engineering can be difficult. There are many factors that contribute to the success of chaos engineering, and it can be challenging to quantify these factors and determine the overall impact on the system. This can make it difficult for organizations to see the value of the investment they are making in chaos engineering and to make decisions about continuing or expanding the adoption of the practices.
So far, we have covered the vision, key principles, and challenges we may encounter during the adoption process. With our goals and potential obstacles in mind, we devised the strategy for success and scalability that we will now delve into. As outlined previously, our objective is to design dependable and efficient services that deliver a smooth experience for our customers and our teams. To accomplish this, we need a comprehensive chaos engineering platform, in our case Azure Chaos Studio, and streamlined processes that are easy to use, even for those with limited knowledge. Apart from the previously discussed challenges, one other challenge of adopting chaos engineering is the complexity of individual services. With hundreds of services, it can be difficult to identify all the potential failure scenarios and to create meaningful chaos experiments. Another challenge is the risk of introducing failures into a live system. The consequence of failed experiments can be severe, causing downtime and loss of revenue. To overcome these challenges, it's important to develop a strategy that considers the system's complexity and the potential risks involved.
Our approach is to break down the scenarios into nominal and disaster scenarios. Nominal scenarios are those that can be quickly executed in the CI/CD pipelines as part of the build qualification process. They are typically less risky and focus on simple failures such as network connectivity or resource exhaustion. CPU pressure, memory pressure, disk I/O pressure, and dependency disruption are a few examples; a sketch of how such a scenario can gate a build follows below.
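As a rough illustration of what that build-qualification gate can look like, here is a minimal sketch. The helpers (start_experiment, wait_for_completion, query_error_rate) are hypothetical placeholders for whatever wrappers a team has around the Chaos Studio and monitoring APIs, and the SLO threshold is an assumption, not our actual number.

```python
# Minimal sketch of gating a build on a nominal chaos scenario in a pipeline step.
# The three helpers are hypothetical placeholders for a team's own wrappers around
# the Chaos Studio and monitoring APIs; the SLO threshold is illustrative.
import sys

ERROR_RATE_SLO = 0.001  # illustrative: allow at most 0.1% failed requests under the fault

def start_experiment(experiment_name: str) -> None:
    """Hypothetical wrapper: start the named Chaos Studio experiment (e.g. via its ARM start action)."""
    raise NotImplementedError

def wait_for_completion(experiment_name: str) -> None:
    """Hypothetical wrapper: block until the experiment's execution finishes."""
    raise NotImplementedError

def query_error_rate(window_minutes: int) -> float:
    """Hypothetical wrapper: return the service's failed-request ratio over the last N minutes."""
    raise NotImplementedError

def qualify_build(experiment_name: str) -> bool:
    baseline = query_error_rate(window_minutes=15)   # behavior before injecting the fault
    start_experiment(experiment_name)                # e.g. CPU pressure on the pre-production fleet
    wait_for_completion(experiment_name)
    during = query_error_rate(window_minutes=15)     # behavior while the fault was active

    # Pass only if the service stayed within its SLO and did not degrade badly vs. its own baseline.
    return during <= ERROR_RATE_SLO and during <= max(baseline * 2, ERROR_RATE_SLO)

if __name__ == "__main__":
    sys.exit(0 if qualify_build("cpu-pressure-nominal") else 1)  # non-zero exit fails the stage
```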
Disaster scenarios, on the other hand, focus on more complex failures. For our cases, these are things like AAD outages, DNS outages, load balancer based outages, availability zone outages, data center outages, and load-related large-scale cascading failures. We put together recommendations within our documented standards that these scenarios should be carefully planned and executed in a controlled environment, such as a staging environment or sandbox, and should not move to production unless the teams gain confidence with repeated experimentation and recovery validation. As discussed before, one of the core aspects of chaos adoption is the continuous measurement of the effectiveness of the scenario validation. In order to validate the effectiveness of the chaos scenario testing, we teamed up with Azure Chaos Studio to track the resiliency of our services before, during, and after each experiment using the telemetry data. This not only gave us a clear picture of the resiliency of our services, but also enabled us to set standards and guard against future deviations in our resiliency.
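One way to do that kind of before/during/after comparison is to query the service's telemetry around the experiment window. The sketch below uses the azure-monitor-query Python package against a Log Analytics workspace; the workspace ID, table, and column names are assumptions standing in for whatever telemetry a service actually emits.

```python
# Sketch: compare a service's failure rate before and during a chaos experiment window.
# Workspace ID, table name (AppRequests), and columns are illustrative assumptions.
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder

QUERY = """
AppRequests
| summarize failure_rate = countif(Success == false) * 1.0 / count()
"""

def failure_rate(client: LogsQueryClient, start: datetime, end: datetime) -> float:
    response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=(start, end))
    return float(response.tables[0].rows[0][0])

if __name__ == "__main__":
    client = LogsQueryClient(DefaultAzureCredential())
    experiment_end = datetime.now(timezone.utc)
    experiment_start = experiment_end - timedelta(minutes=30)  # illustrative experiment window

    before = failure_rate(client, experiment_start - timedelta(hours=1), experiment_start)
    during = failure_rate(client, experiment_start, experiment_end)
    print(f"failure rate before: {before:.4%}, during: {during:.4%}")
```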
With a solid understanding of the technology and process in place, we chose to pilot the implementation of chaos engineering within the critical services in our organization.
Our selection process was strategic, choosing services that provided a comprehensive representation of the compute, networking, storage, and authentication stacks. This allowed us to standardize nominal and disaster scenarios and gather valuable feedback from the pilot services' onboarding, which let us improve our process over time and raise the needed features with the Chaos Studio team, enabling other organizations and other teams within Microsoft to be successful as well.
Here is an updated overview of our organization's onboarding progress
made possible by the power of Azure Chaos Studio and our effective
onboarding strategy. To date, we have effectively onboarded
55 services with an additional 65 services in the
process of onboarding. Of the 55 onboarded services,
six are utilizing nominal scenarios for build qualification and
two are conducting monthly disaster scenario validation.
Our efforts have thus far uncovered 60 previously
undetected bugs and prevented 22 critical issues that could have negatively
impacted the customer experience. This is an ongoing effort
and we are continuously working with the Azure Chaos Studio team to enhance the features within Chaos Studio, which will eventually help us get better at adopting many more scenarios. To summarize our learnings: scaling chaos engineering practices for your organization requires a clear vision, a strategic approach, and continuous improvement. By following the steps we have discussed thus far, organizations can effectively implement chaos engineering practices and proactively achieve reliability and security.
By constantly measuring the success and making improvements, your organization can continuously improve its chaos engineering program and adoption and ensure that it stays aligned with your evolving needs. With that, I'm going to hand over to Rohith, who will go over his team's successful journey of using Azure Chaos Studio and the strategy we discussed so far. Thanks, everyone. Hello everyone.
I'm Rohit Gundadi, part of Microsoft Secrets Automation. I'm the
engineering manager for the key management platform that protects the data
of Microsoft and our customers. Thanks, Chris and Ravi, for the overview of chaos engineering and Azure Chaos Studio, and the detailed explanation of how automation can scale chaos engineering adoption. During this presentation, I will concentrate on how we boosted our confidence in our service resiliency and understood its bounds and limitations by incorporating Azure Chaos Studio. I'll also talk about the specifics of our method for creating scenarios, integrating with our CI/CD pipelines, and the situations we successfully prevented from affecting production. The expected outcomes of investing in chaos engineering are twofold. First, to enhance the customer experience by accelerating the delivery of top-quality features. Second, to avoid production problems, thereby improving customer satisfaction and mitigating engineering firefighting, thus enabling more time for future development, which supports the accomplishment of the first outcome. Aligned with the strategy
that Ravi mentioned, our strategy to achieve these outcomes revolves around five pillars. First, service level agreements establish the performance and reliability standards for a service. Rather than focusing solely on individual faults, our approach considers scenarios that impact our customer commitments and internal goals, such as service level objectives and service level indicators. This approach provides a comprehensive solution, offering both recommended standard scenarios commonly shared across teams and unique scenarios specific to our service. Second, it is crucial to continuously test chaos scenarios across the entire application stack, including compute, networking, storage, and auth, as real-world failures can arise from within a single stack or as a cascading effect across multiple stacks. Testing only one stack at a time may not fully reflect the complexity and interdependency of the entire system, leaving it vulnerable to unforeseen disruptions. By testing the full application stack, we want to better prepare for that unpredictability. Having determined
to adopt a top-down approach and test both within and across the application stack, the next step is to map out specific nominal and disaster scenarios for our service. Before inducing failures into our system, we documented various service-specific scenarios and their predicted outcomes, referred to as hypotheses, and then tested them using Chaos Studio simulations. This allowed us to concentrate our chaos engineering efforts and create targeted experiments. During this stage, we also had the opportunity to request desired features from the Azure Chaos Studio and Azure Load Testing teams.
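For illustration only, a documented scenario-plus-hypothesis entry can be as simple as a small structured record like the sketch below; the scenario names, faults, and expected outcomes here are made-up examples, not our actual catalog.

```python
# Illustrative only: a lightweight way to document scenarios and their hypotheses
# before running them. The entries below are made-up examples, not a real catalog.
from dataclasses import dataclass

@dataclass
class ChaosScenario:
    name: str          # short scenario identifier
    stack: str         # compute / networking / storage / auth
    fault: str         # the fault(s) to inject
    hypothesis: str    # the predicted outcome to validate

SCENARIO_CATALOG = [
    ChaosScenario(
        name="cpu-pressure-nominal",
        stack="compute",
        fault="95% CPU pressure on the data-plane fleet for 10 minutes",
        hypothesis="Autoscaling adds capacity; latency rises but no operations fail.",
    ),
    ChaosScenario(
        name="storage-failover-disaster",
        stack="storage",
        fault="Fail over the primary data store to the secondary",
        hypothesis="Writes pause briefly, then resume with no data loss and no manual intervention.",
    ),
]
```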
Next, we began conducting chaos scenarios and monitored the service behavior before, during, and after the experiments. This allowed us to experience the robust features of Azure Chaos Studio, including its ability to introduce faults, versatile experiment creation, automation-friendly design, a portal where we can review past runs, and, most importantly, its ability to undo the simulation. Post-experiment, we also provided feedback to Chaos Studio on how to enhance the customer onboarding experience and contributed to features that can help us monitor service health and automate the end-to-end process. We conducted dry runs of these scenarios to validate our hypotheses. After several dry runs, we compiled a list of validated scenarios, or experiments. We incorporated these chaos engineering scenarios into our CI/CD pipelines to automate the chaos engineering process and ensure that our systems are constantly tested and validated as changes to our code base are made.
By following these steps, we were able to implement chaos engineering
in a methodical and controlled way, enhancing the reliability and durability
of our service and reducing the likelihood of failures in production.
Here is a high-level overview of our service architecture. Our service architecture consists of three major components: the data plane, control plane, and management plane. To meet our service needs, we utilize infrastructure-as-a-service virtual machines for the data plane, and Service Fabric-backed virtual machine scale sets for the rest. We store our data using Azure Cosmos DB and Azure Storage accounts. Our customers access our endpoints through a load balancer which is managed by DNS. Our applications are secured within a virtual network with a network security group. Our authentication stack is based on Azure Active Directory, and we use an internal certificate authority (PKI) service and a Key Vault equivalent for certificate and secret management.
Next, let's go over the details of our service scenarios and the automation we were able to achieve. To meet our adoption needs, we have developed some nominal and disaster scenario standards. Although these standards are stack-specific, we have tested combinations of these stacks. For instance, we load each stack while running both nominal and disaster scenarios, as without specific load the non-typical scenarios will not be triggered. The same applies to combining network and compute scenarios, et cetera. What we see here is a small slice of the standards we were able to come up with. It is essential to test for resource exhaustion in our compute stack, whether caused by bugs or resource leaks. Our initial runs aim to establish the boundaries of our service, evaluate its ability to self-recover using autoscaling, and set benchmarks. This helps us ensure that any future changes to the service do not compromise our ability to handle resource exhaustion. We also wanted to determine if we are over-provisioning, and if reducing the capacity while increasing the load, through a combination of load and fault injection, could lead to cost savings.
In terms of compute disaster, we wanted to assess the preparedness of our service, monitoring systems, and recovery procedures in the event of a partial or full outage of an availability zone, availability set, rack, or data center. In regards to storage, we tested the overall service resiliency during a failover from the primary to the secondary. We also evaluated network degradation between compute and storage, resulting in valuable insights. Further, for the network, we continuously validate important scenarios such as inter-dependency communication outages, implemented using network security group faults. This helped us test cross-stack resilience and identify some interesting design flaws and bugs.
Here are a few outcomes of real Chaos Studio experimentation, where we were able to successfully validate the hypothesis in certain cases and saw it fail in others. For example, we were able to catch a bug in pre-production by replicating 99% CPU pressure on our fleet. While a latency increase was expected because hardware limitations prevented our service from autoscaling, we also saw operation failures, which were not expected; we have since fixed the issue. As you can see, in the other scenario, the service crashed and failed to auto-recover when we replicated slow network packet transmission from compute to storage under heavy traffic. These types of bugs are difficult to detect unless a real-world event occurs. With the help of Chaos Studio, we were not only able to replicate the issue, but continuously validate it as part of our CI/CD process. Our load testing, which we currently conduct using Azure Load Testing, helped prevent performance-related bugs caused by heavy load from reaching production. We expect to achieve the same results with Azure Chaos Studio in the near future.
As previously mentioned, it is essential to shift left in the development cycle and continuously measure the resiliency of the service by eliminating manual touches to chaos experimentation. To accomplish this, we have developed and internally released an Azure DevOps extension for Azure Chaos Studio, which can be used in release pipelines and helps teams automate nominal and disaster scenarios in their pipelines. For our service, we benchmark official builds in pre-production by running load, performance, and chaos engineering nominal scenarios before allowing the build to reach production. We also run automatic bi-weekly disaster scenarios that give us confidence in our service resiliency.
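As a rough sketch of what an automated disaster drill can look like outside the portal, the snippet below starts a pre-defined zone-down style experiment and then measures how long the service takes to report healthy again. The health endpoint, experiment resource ID, api-version, and timing thresholds are illustrative assumptions, not our actual pipeline.

```python
# Rough sketch: run a scheduled disaster drill and measure time-to-recovery.
# The health URL, experiment resource ID, api-version, and thresholds are illustrative.
import time
import requests
from azure.identity import DefaultAzureCredential

EXPERIMENT_ID = (
    "/subscriptions/<sub>/resourceGroups/<rg>"
    "/providers/Microsoft.Chaos/experiments/zone-down-drill"   # placeholder resource ID
)
HEALTH_URL = "https://preprod.example.com/healthz"             # hypothetical health probe
RECOVERY_SLO_SECONDS = 600                                     # illustrative recovery objective

def arm_headers() -> dict:
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    return {"Authorization": f"Bearer {token}"}

def run_drill() -> float:
    """Start the drill; return seconds from the first failed probe to the first healthy probe."""
    requests.post(
        f"https://management.azure.com{EXPERIMENT_ID}/start?api-version=2023-11-01",
        headers=arm_headers(),
    ).raise_for_status()

    first_unhealthy = None
    deadline = time.monotonic() + 2 * RECOVERY_SLO_SECONDS
    while time.monotonic() < deadline:
        try:
            healthy = requests.get(HEALTH_URL, timeout=5).status_code == 200
        except requests.RequestException:
            healthy = False
        if not healthy and first_unhealthy is None:
            first_unhealthy = time.monotonic()                 # drill impact first observed
        if healthy and first_unhealthy is not None:
            return time.monotonic() - first_unhealthy          # recovered
        time.sleep(15)
    if first_unhealthy is None:
        return 0.0  # no customer-visible impact observed during the drill window
    raise TimeoutError("service did not recover within twice the recovery SLO")

if __name__ == "__main__":
    recovery_seconds = run_drill()
    print(f"time to healthy: {recovery_seconds:.0f}s (SLO {RECOVERY_SLO_SECONDS}s)")
```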
In conclusion, by utilizing Azure Chaos Studio and incorporating technology and processes with standards for nominal and disaster scenarios, we have initiated a journey of continuous verification of our service's security, reliability, and resiliency. This approach has already begun to yield benefits, enhancing the customer and engineering experience. In today's dynamic and complex technology landscape, it's more important than ever to ensure that our services are secure, reliable, and resilient. This is where chaos engineering and Azure Chaos Studio come in, providing a proactive approach to identify and mitigate potential issues before they impact real-world scenarios.
Automation using CI/CD helps to streamline and speed up the
process, allowing for quicker and more frequent testing and deployment.
By standardizing nominal and disaster scenarios and scaling
adoption within large organizations, we're able to
create a consistent and systemic approach to
continuous verification. This not only benefits our customers by
providing a more stable and dependable service, resulting in better customer
experience, but also our engineers by reducing fatigue and
allowing them to focus on innovation and growth.
Overall, our presentation on chaos engineering, Azure Chaos Studio, and scaling adoption within large organizations by standardizing nominal and disaster scenarios is a game changer in the world of technology, and we believe it will help take your organization to new heights of success. Thank you for watching. We hope you have a successful journey in implementing chaos engineering in a controlled and secure manner.