Transcript
Getting real time feedback into the behavior of your distributed systems and observing changes, exceptions, and errors in real time
allows you to not only experiment with confidence, but respond
instantly to get things working again.
Hi, my name is John Engelkemnetz and I'm a principal program manager
on the Azure Chaos Studio team here at Microsoft.
I'm here to talk to you a little bit about how we do chaos engineering at Microsoft
using our new service, Azure Chaos Studio,
and to tell you a little bit about some of our learnings in doing chaos
engineering. So to start us off, I'd
like to talk a little bit about this concept of resilience,
as well as what it means to establish and maintain quality
in the cloud. So we know that resilience is
the capability of a system to handle and recover from disruptions.
And a disruption can be anything from a major outage
that drops availability to 0% for
a long time window to something much more
minor, say a deviation in
the availability that is only slight, a sudden high
amount of stress, higher latency, et cetera. All of these are
examples of more brownout type cases where
there is still some disruption to the service, even though
it is not a complete and utter unavailability of that
service. Now, regardless of whether it is a
brownout or a true blackout major outage event,
any sort of impact that is disruptive to
the availability and performance of your service
is going to impact customer experience. And we know
that when there are outages that impact availability,
there is business impact. You can have upset customers,
you can lose revenue. And a key thing
with chaos engineering is being able to measure the impact to
your business when there is an outage, in terms of that
cost to the business, whether it be lost
revenue or lost
sales or anything that might fit into that category.
But beyond the simple business impact of an
outage, what we found running a major
cloud provider is that our customers are running mission critical
apps on Azure, and that means that beyond
lost revenue, there can be major legal
consequences of an outage, and even in some cases,
life or death consequences. So in a legal
example, many financial institutions need to provide
audit evidence that they can successfully recover from
a disaster. If that does not
remain true, there can be legal consequences
from a government. Another example in the life and safety
area is emergency services. Increasingly,
emergency services operate on top of cloud providers,
and an outage in an emergency service might be the difference
between an ambulance getting to where it needs to go on time
or that ambulance not being able to respond
in an emergency situation as appropriately.
So we take this really seriously, knowing that business impact,
stock market and financial impact, as
well as life and death scenarios and legal
consequences can happen when there's an availability loss due
to a service outage. Now at Microsoft, we think
that building quality into the entire service
development and operation lifecycle is the right way
to tackle this challenge. And when we talk about
building that quality into the entire service
development and operation lifecycle, we really mean two things.
The first is thinking about quality from the beginning
of the ideation of a new service through the development
of that service, and through the deployment and
operation of the service. Now that continues
through to the continuous deployment and development of that service,
and even maintaining quality through deprecation
of a legacy service. The other thing that this means
to us at Microsoft is that beyond
simply making quality something that our
site reliability engineers and DevOps engineers think about,
quality has to be something that is a part of the culture
of the entire company. And that means including
leaders, managers, as well as other folks
involved in the building and development of applications.
So product managers,
testers, folks who are doing marketing,
even getting them involved in thinking about quality of
the services from a business perspective,
helps to reinforce the importance of quality.
As a product manager, my accountability is not just
to additional users or additional revenue,
it is also to having a quality service.
Quality is a customer requirement,
and that means that both me as an individual contributor
as well as my management chain all have to be thinking about
quality and prioritizing it as a fundamental,
similar to security that everyone takes seriously
and contributes to as a new service is being
built or while operating an existing service.
At Microsoft, one initiative we're doing to tackle this
is making sure that as part of the core priorities for
every employee who works in our Azure division,
we're including quality as
one of those core priorities, and then we're measuring
ourselves so that everyone is accountable for contributing
to improvements in quality in some way,
shape or form. Now, all of
this becomes particularly important when
we're talking about cloud applications.
There are two interesting aspects of cloud applications that
make building resilience a little bit more challenging than it may
have been on premises. The first aspect of
this is that the architecture types for
cloud applications tend to be highly distributed,
highly complex, and oftentimes less
familiar to folks who are using those.
So while there are enormous benefits in leveraging
services like Azure Kubernetes Service or Azure App
Service, there is a slight drawback
in that the patterns for building resilience
of those services may be a little bit less mature.
And certainly knowledge of how to leverage those patterns
within any given organization might be lower.
So using cloud native applications
can increase resilience by virtue of the fact that these are
built to be resilient to failure. But there
is this consequence of potentially lower knowledge and
lower ability to execute on having those best
practices built in when developing a cloud native application.
The other part about migration to the cloud that can
be challenging is the sudden increased
difference or distance between the
cloud consumer and the application that you've written
and the underlying compute that's going to run
that code. So depending on the service type
you choose, let's say you're developing a serverless application,
there may be three, four, five layers of compute
between the code you've written and the actual code
that is running in our physical data center.
Now, that benefits you as a cloud consumer
because you benefit from the scale and cost
efficiency of the cloud. You also benefit from the
resilience that can get built at scale by a large
scale cloud provider like Azure. However,
it does mean that there are these abstractions,
which can mean sometimes you're at the mercy of the cloud provider
when it comes to resilience. And while there's plenty that can
be done to defend against a failure in your underlying cloud
provider, sometimes you're really just
sort of hoping that the platform is
stable, because if there is an issue in the underlying
platform, there's not much you can do to avoid
that becoming an issue for your upstream service.
And this is why we believe that much like the
security pillar of cloud development,
with resilience there is a shared responsibility between
the cloud provider and the cloud consumer. And when we
use this term shared responsibility, what we mean is that we both
have a shared accountability for ensuring that our
applications are robust, redundant,
reliable, so that your
downstream customers, the consumers of your applications,
don't see downtime. Now, if the cloud provider were
to have 100% availability, but the cloud consumer
were to have not implemented best practices in terms
of resilience, or in the alternate case where a
cloud consumer has implemented every best practice
available to defend against any sort of failure, but the cloud provider
is just simply having horrible SLA
attainment and constant outages, in either of those scenarios
there will still be downstream impact to a customer. And that's why
we believe that as the cloud provider, we need a solution
that helps our customers to become resilient and to
defend against an issue that can happen either
within their own service, or an issue that could happen in
the underlying platform that impacts a service depending
on that platform. So we believe that we need to provide that sort
of solution, as well as continue to meet our responsibility
in our shared responsibility for continuing to
up our availability and our resilience of the
services that you depend on. So all of
this is highly relevant to Azure Chaos Studio. Azure Chaos
Studio, the exact same product that we make available to
customers that run on Azure, is what we're using within Azure
across Microsoft cloud service teams to
enable us to improve our availability by
doing chaos engineering. So let's talk a little
bit about this concept of chaos engineering.
Microsoft's approach to chaos engineering fits very well
with the models and approach that are out in the industry and
that many of the experts in site reliability
engineering have developed over time. One thing we do like to
emphasize is the importance of leveraging the scientific
method when going to do chaos engineering
so that your chaos isn't simply chaos for chaos's
sake, but rather is controlled chaos,
structured chaos, chaos that has a definitive
outcome and results in some sort of tangible improvement.
So if you're familiar with chaos engineering, you're likely
familiar with the idea of starting with an understanding of your
steady state, having appropriate observability and
health modeling such that you can identify an SLI,
a service level indicator, and a service level objective that
are going to kind of be your bar for availability,
and then leverage that steady state to
formulate a hypothesis about your system where we say we believe
that we won't deviate from the steady state more than
a certain percentage given some particular
failure scenario happening within our application.
Now with that hypothesis, you can then go and create
a chaos experiment and run that experiment to
assess whether your hypothesis was valid or invalid, and that
allows you to do some analysis to understand were we resilient
to that failure. If not, we have some work to do to
dig deeper into the logs, the traces, to understand exactly
why our hypothesis was invalid, and that
inevitably leads to some sort of improvement in the service.
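To make that hypothesis step a little more concrete, here is a minimal sketch in Python of checking an availability SLI against the steady state during an experiment window. The query function is a hypothetical placeholder for whatever observability backend you use, and the numbers are purely illustrative.

```python
# Minimal sketch of validating a chaos hypothesis against a steady state.
# `query_availability` is a hypothetical placeholder for your own
# observability query (Azure Monitor, Prometheus, etc.).
from datetime import datetime, timedelta, timezone


def query_availability(start: datetime, end: datetime) -> float:
    """Return the fraction of successful requests in the window.

    Stand-in value for illustration; wire this to your monitoring backend.
    """
    return 0.9987


def hypothesis_holds(start: datetime, end: datetime,
                     steady_state_slo: float = 0.999,
                     allowed_deviation: float = 0.001) -> bool:
    """Hypothesis: availability during the injected fault stays within
    `allowed_deviation` of the steady-state SLO."""
    observed = query_availability(start, end)
    floor = steady_state_slo - allowed_deviation
    print(f"observed={observed:.4f}, tolerated floor={floor:.4f}")
    return observed >= floor


if __name__ == "__main__":
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=1)  # window in which the fault ran
    print("hypothesis held" if hypothesis_holds(start, end) else "hypothesis invalid")
```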
And this is cyclical, both because you're going to continuously
want to up the bar in terms of the quality of your service,
but also because services are going to continue to change,
whether it is a service that is growing and there's continuous
development happening on that service, or if it's simply the
fact that over time the platform that
your service depends on has mandatory upgrades, say upgrades
to your version of Kubernetes, upgrades to your operating system
version, upgrades to your version of .NET or
Python or whatever libraries you depend
on, some of those will be forced on your service. And so
that means that maintaining resilience against certain
scenarios requires that you're thinking about this cyclically,
and not just as a one time activity.
Now at Microsoft, we also believe that chaos engineering can
be used in a wide variety of scenarios from those
that we hear of as shift right scenarios, where
we're running game days and business continuity and disaster
recovery drills, and ensuring that
our live site tooling and observability data covers all
of our key scenarios. But also we believe strongly
in pulling those quality learnings earlier
into the cycle, so we prevent any regressions in what
we've done in shift right quality validation.
Now, when to use shift right quality validation
versus pulling something into your CI/CD pipeline
and leveraging that as a gate to a deployment
being able to be flighted outwards? Well, I think a
major factor here involves whether or not you
need real customer traffic or really well simulated
customer traffic. If there is a certain scenario
where you can generate load on demand and you
only need load for a specific amount of time, and that load
doesn't have to be as random or fit the exact
patterns or scale of true production customer
traffic, well, that's something that you can generate via a
load test and then perform chaos engineering in your
CI/CD pipeline. But there are plenty of cases where
shifting right means having some percentage of
real customer traffic or really well simulated
synthetic traffic that would make your
service appear to be undergoing real stress from users.
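As a sketch of the pipeline-gate flavor described above, the script below generates synthetic load, injects a fault, and fails the build if the error rate crosses a threshold. The load, fault, and measurement helpers are hypothetical placeholders for whatever load and fault tooling you use, not Chaos Studio APIs.

```python
# Hedged sketch of a chaos gate in a CI/CD pipeline: generate load, inject a
# fault, and block the rollout if the error-rate SLO is violated.
# `start_load`, `inject_fault`, and `measure_error_rate` are hypothetical
# placeholders for your own load generator and fault tooling.
import sys
import time

MAX_ERROR_RATE = 0.01  # the gate: more than 1% errors blocks the deployment


def start_load(rps: int) -> None:
    print(f"starting synthetic load at {rps} requests/sec (placeholder)")


def inject_fault(name: str, duration_s: int) -> None:
    print(f"injecting fault '{name}' for {duration_s}s (placeholder)")
    time.sleep(duration_s)


def measure_error_rate() -> float:
    return 0.004  # stand-in value; read this from your test results


def main() -> int:
    start_load(rps=200)
    inject_fault("dependency-latency", duration_s=5)
    error_rate = measure_error_rate()
    print(f"observed error rate: {error_rate:.2%}")
    if error_rate > MAX_ERROR_RATE:
        print("chaos gate failed: blocking deployment")
        return 1
    print("chaos gate passed")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```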
Now, on shift right, an interesting thing we've found since
we introduced Azure Chaos Studio is that the sort of
colloquial wisdom that chaos engineering
needs to happen in production
may not apply very well to the mission critical services
that a number of our customers are
building and running on the Azure cloud. So when
it comes to a shift right scenario being done in
production versus pre production, we believe that there
is a very useful and valid case for when you
should be doing chaos in production. But there's also plenty
of cases where chaos should really be done or start
in pre production. The first case we hear from customers
is simply, hey, we have not built up that confidence
yet in a particular failure scenario to go cause
that failure in production. So if it's the beginning of your journey
with availability zones, or you're just beginning to
stress test a new application, chances are you're going to
want to start in pre production where there's less risk before
moving that sort of test out into the production environment.
And the production test becomes more of a final checkbox
validation that everything's working as expected.
The other thing that comes up with shift right being in
production versus pre production is risk tolerance
when it comes to a particular failure. If you are a
mission critical application, if you are that healthcare provider
that is determining whether prescriptions are
issued for emergency medical needs,
chances are you may say that the risk of
an injected fault in production
causing an outage is too great and that
production simply is not a suitable environment to really be
doing chaos engineering. So keeping those factors in mind
can help you determine when and where you might do chaos engineering.
Now, a brief word about Azure Chaos Studio. Chaos Studio
is our new product, available as part of the Azure platform, that
enables you to do chaos engineering natively
within Azure. It's a fully managed service, which means that
there is no need to install any utility, make updates,
maintain a platform. Those can be expensive and they can be
a challenge for any service team to have to go and
operate, maintain and secure those tools. So having this be
fully managed means you can focus on the outcomes
rather than the implementation. We're well integrated
with Azure's management tools, including Azure Resource
Manager, Azure Policy, Project Bicep,
and several of the other aspects of Azure so that things
fit very naturally in your ecosystem. The way you deploy
your infrastructure is how you can deploy your chaos experiments,
and you can manage and secure access
to your chaos experiments exactly as you're doing
with any other part of your infrastructure estate that
exists in Azure. We have integration with
observability tooling to ensure you can do that analysis
when a chaos experiment happens. And we have an expanding
library of faults that covers a lot of the common Azure service
issues. One of our aspirational goals is to provide
experiment templates for the most severe Azure
outages that happen on the platform. And that's something we pay
a lot of attention to: when there is a high severity
Azure outage that impacts a customer, how can
we transform that into a chaos experiment template
that would allow a customer, a cloud consumer, to go and replicate
that failure to ensure that they are well defended
against having an impact to their availability should
any similar sort of outage occur. And the final
thing I'll mention about Chaos Studio is that safety is very important to us.
We're not a simulator, we're not simulating faults,
the faults are really getting injected. And that means
that when we shut down a virtual machine, the virtual machine
is getting shut down. When we apply CPU pressure,
that CPU pressure on an AKS cluster is really happening.
What this means is that whether it be unintentional,
accidental fault injection, or something a little bit
more malicious, we want to help make sure that you can defend against
those by having appropriate permissions built into the system,
restrictions and administrative operations on what resources
can be targeted for fault injection and what
faults can be used on a particular resource,
as well as permissions for the experiment to access those
resources. So there's plenty of safety built into
the mechanisms in Chaos Studio.
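Because experiments are ordinary Azure resources managed through Resource Manager, kicking one off from a script or a pipeline is just a management-plane call. Here is a hedged sketch using azure-identity plus the REST endpoint; the subscription, resource group, and experiment name are placeholders, and the api-version is an assumption to confirm against the current Chaos Studio documentation.

```python
# Hedged sketch: start an existing Chaos Studio experiment via the Azure
# management plane (pip install azure-identity requests). The subscription,
# resource group, and experiment name are placeholders, and the api-version
# is an assumption -- confirm it against the current documentation.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION = "00000000-0000-0000-0000-000000000000"  # placeholder
RESOURCE_GROUP = "my-resource-group"                   # placeholder
EXPERIMENT = "zone-down-experiment"                    # placeholder
API_VERSION = "2023-11-01"                             # assumption; check docs


def start_experiment() -> None:
    credential = DefaultAzureCredential()
    token = credential.get_token("https://management.azure.com/.default").token
    url = (
        "https://management.azure.com"
        f"/subscriptions/{SUBSCRIPTION}/resourceGroups/{RESOURCE_GROUP}"
        f"/providers/Microsoft.Chaos/experiments/{EXPERIMENT}/start"
        f"?api-version={API_VERSION}"
    )
    response = requests.post(url, headers={"Authorization": f"Bearer {token}"})
    response.raise_for_status()
    print(f"start request accepted: HTTP {response.status_code}")


if __name__ == "__main__":
    start_experiment()
```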
Now let's talk a little bit about chaos engineering at Microsoft. At Microsoft,
we've been using chaos engineering for several years to improve the
resilience of our own cloud services, and the
majority of those learnings have contributed to us building
Chaos Studio, both as a central service
that all of our cloud service teams can use within Microsoft,
as well as an offering that we can make available to our customers.
And it's currently in public preview. There are over 50 teams
at Microsoft, 50 cloud services, that
are using Chaos Studio today across a range
of Microsoft products, from the Power Platform to
the Office suite to the Azure cloud services,
we believe. And there are two areas of particular
focus at Microsoft right now when it comes to chaos
engineering. The first is investing heavily in failure
scenarios over adding specific faults.
So we've learned in analyzing our incidents and in
looking at past resilience challenges that oftentimes
it's a broader scenario, say a region
failure or an inability to scale with load
or a network configuration change, that is the
real scenario you want to be able to replicate
when doing chaos engineering, and oftentimes you're
leveraging a set of faults to recreate that failure scenario.
But at the end of the day, it's the failure scenario that matters,
not the individual faults that contribute up to that.
So rather than focusing on delivering faults
for every single option, we like to deliver faults and
encourage our teams to build experiments around those scenarios
and light up the correct faults for those major scenarios.
Take an availability zone going down, an Azure Active
Directory outage, or a DNS outage, and focus on those.
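To illustrate the scenario-over-faults idea, here is a rough sketch of how a zone-down style scenario might be composed from several fault actions running in parallel branches. The steps/branches/actions shape loosely mirrors a Chaos Studio experiment, but treat the property names, fault URNs, and selector IDs as illustrative assumptions to verify against the documentation.

```python
# Rough sketch of composing a "zone down" scenario from multiple faults.
# The steps/branches/actions shape loosely follows a Chaos Studio experiment,
# but property names, fault URNs, and selector IDs here are illustrative
# assumptions -- verify against the current documentation before use.
zone_down_scenario = {
    "steps": [
        {
            "name": "take-out-zone-1",
            "branches": [
                {
                    "name": "shut-down-zone-compute",
                    "actions": [
                        {
                            "type": "continuous",
                            "name": "urn:csci:microsoft:virtualMachineScaleSet:shutdown/2.0",
                            "duration": "PT30M",
                            "selectorId": "zone1-vmss",  # placeholder selector
                        }
                    ],
                },
                {
                    "name": "add-network-latency",
                    "actions": [
                        {
                            "type": "continuous",
                            "name": "urn:csci:microsoft:agent:networkLatency/1.1",
                            "duration": "PT30M",
                            "selectorId": "zone1-agents",  # placeholder selector
                        }
                    ],
                },
            ],
        }
    ]
}

# Branches within a step run in parallel, so the compute shutdown and the
# latency fault land together, approximating the failure scenario rather
# than exercising each fault in isolation.
print(f"parallel branches in step 1: {len(zone_down_scenario['steps'][0]['branches'])}")
```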
The other thing that we've been really investing heavily in
is shifting this left,
particularly when it comes to high blast radius outages.
In the past we've known that there are a couple of places,
things like DNS or Azure Active Directory,
where any sort of outage in those services can have impact
on the majority of Microsoft cloud services.
And so while we've done a lot to defend against those
dependencies having impact on every other cloud service
when they do see impact, we now want to pull the validation
we've done for services that are going out into production
and shift that left into preventing any regression
in our ability to be resilient against those particular types
of failures. And in fact, one thing we're looking forward to doing at
Microsoft is ensuring that at least for the Azure division,
every single new deployment of every service has
specific failure scenarios validated as part of
pre production, as a pre production gate before that
build is suitable to go out to production to ensure that
we're never regressing a scenario and
our resilience to high blast radius outages.
Now, two great examples of using chaos engineering within Microsoft.
The first is the Microsoft Power Platform team. They've been
doing region failure experiments with Chaos
Studio and have identified several opportunities.
When a data center went down, they were able
to recover from the failure, but they said, hey, we also want to be able
to fail over to a secondary region, so that when there's a failure in a region,
we also have that failover. They discovered this by leveraging Chaos
Studio to shut down all of their compute and all of their services in a
region to validate that the backup would come up,
and when it didn't, they were able to go and identify an issue causing
that backup not to occur. Another example from that team
was simply during an outage event, acknowledging that
they didn't have the appropriate observability to detect
the outage early and respond quickly. Now with Chaos
Studio they were able to recreate the conditions and find
new areas where they needed to instrument further in
their monitoring so that they could identify and mitigate those
failures and automate responses to them
quickly. Another great example is our Azure Key Vault team.
That team has been doing several availability zone down
outage experiments, as well as experiments around scaling up the service.
And a great learning from that team was that while
collaboration is important and validating configuration
is great, small teams and changes over time
might mean that an original configuration in
an autoscale rule might not have the same effects
over time that it originally did. So in this case,
in a pre production environment, they were doing some
chaos engineering and discovered that for a pre production
service, the autoscale rule was misconfigured such
that when stress was applied, the virtual machine scale
set they ran on was not scaling up further. And so they were actually able
to identify that in pre production and mitigate it before it
ever became an issue in production.
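A sketch of the kind of check that catches this class of issue: apply stress, then assert that the instance count actually grows within the window the autoscale rule should have reacted in. Both helpers are hypothetical placeholders for your fault tooling and scale set query of choice.

```python
# Hedged sketch of validating that an autoscale rule actually fires under
# stress. `apply_cpu_pressure` and `get_instance_count` are hypothetical
# placeholders for your fault tooling and your scale set query.
import time


def apply_cpu_pressure(percent: int, duration_s: int) -> None:
    print(f"applying {percent}% CPU pressure for {duration_s}s (placeholder)")


def get_instance_count() -> int:
    return 3  # stand-in value; query your virtual machine scale set here


def scale_out_observed(window_s: int = 600, poll_s: int = 30) -> bool:
    baseline = get_instance_count()
    apply_cpu_pressure(percent=90, duration_s=window_s)
    deadline = time.time() + window_s
    while time.time() < deadline:
        if get_instance_count() > baseline:
            print("autoscale fired: scale-out observed")
            return True
        time.sleep(poll_s)
    print("autoscale did NOT fire within the expected window")
    return False


if __name__ == "__main__":
    scale_out_observed(window_s=60, poll_s=10)  # short window for illustration
```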
Now, from what we've learned doing chaos engineering
within Microsoft, as well as partnering with some of
our big customers who are leveraging
Chaos Studio to do chaos engineering in their environments,
there are a couple of insights that I'd like to share.
The first is that chaos engineering really
needs to start with maturing your tooling and processes
before you go to introduce any amount of chaos. So ensuring
that you have great robust observability,
that you've already built backup mechanisms and
you've made your service respond correctly
to outages, making sure that you have a great live site
process in place, and that you have troubleshooting guides and
automatic mitigations in place. Those have to be there
before you start to do chaos engineering, because chaos
engineering is not suitable for a case where you're learning something you
already knew. Chaos engineering should reveal something new
about your service, something unexpected.
That's when chaos engineering is best. The second
insight we've had is that a great way
to understand where to apply chaos engineering is to
quantify and analyze past outages.
Now, the quantification is something I talked about a little bit
earlier, where being able to put a monetary amount on an outage
can help create that visibility across a
larger company and across different
sets of stakeholders to make them more invested
in the importance of reliability. Putting a dollar amount
or a rupee amount, any sort of currency
amount, on your service when there is an outage, and what that
dollar amount was for the outage, is a wonderful
way of keeping everyone's head centered
around the importance of resilience.
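As a trivially simple way to put a number on an incident, a back-of-the-envelope calculation like the one below is often enough to make the impact legible to leaders and other stakeholders; the figures here are purely illustrative.

```python
# Back-of-the-envelope outage cost estimate. All figures are illustrative
# placeholders -- substitute your own business numbers.
def estimate_outage_cost(downtime_minutes: float,
                         revenue_per_minute: float,
                         fraction_of_traffic_impacted: float,
                         contractual_penalty: float = 0.0) -> float:
    lost_revenue = downtime_minutes * revenue_per_minute * fraction_of_traffic_impacted
    return lost_revenue + contractual_penalty


# Example: a 45-minute brownout hitting 30% of traffic, plus an SLA credit.
cost = estimate_outage_cost(downtime_minutes=45,
                            revenue_per_minute=1200.0,
                            fraction_of_traffic_impacted=0.3,
                            contractual_penalty=5000.0)
print(f"estimated business impact: ${cost:,.0f}")
```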
And then once you've built that, going back and analyzing past outages
to identify when you've had high blast radius
outages and or high frequency
outages. Those are two great places to start
by looking back at your previous incidents and
your responses to incidents, and then deciding from there
which chaos experiments you might want to start with. A third
insight for us has been it is important to build
confidence in pre production before heading out into production.
Now this requires that you've built a great
pre production environment. Within Azure, we have
the concept of our canary regions. These are two dedicated
regions within Resource Manager that are unavailable
to our customers. But almost every
Azure service has a stamp in those clusters or in those
regions, and services have to go
in and bake for a certain amount of time in those regions before
they can move into production. Now, the fact that the services
being deployed in our canary regions are dependent on other
services in canary regions helps us to proactively
identify any dependency issues,
any failures, and mitigate those before anything hits
production. In fact, in certain services like Microsoft Teams,
we're the dogfooders: our own Microsoft Teams
traffic for our
company goes through the Teams dogfood environment
in a canary environment. And that helps us
to make sure that we are building quality in those environments
before something goes out to the general public who
rely on Microsoft Teams. And a final insight for
us is just that quality can't happen if it's only
on one person's back or only on one team's
back. For a large scale organization, you really have
to create a culture of quality where everyone believes that this
is important. And we talked about this a little bit earlier in the presentation,
but it is critical to remind us that
that culture of quality has to come before any investment
in chaos engineering. So with
that, I'd like to say thank you very much for your time, and I hope
you enjoy the rest of the conference. If you'd like to learn more about Azure
Chaos Studio, you can go to
aka.ms/AzureChaosStudio. You'll be able to learn more about
our service, get started, read our documentation,
as well as see some of our user studies or
our customer studies. So enjoy the
rest of the conference, and thanks very much.