Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, welcome to my session on stress testing Azure resources
using Azure Chaos Studio.
To share a bit about myself, I'm Peter De Tender, originally from Belgium but moved to Redmond, Washington about two years ago. I'm a Microsoft Technical Trainer at Microsoft, providing technical training (how hard can it be to come up with a job description, right?) to our top customers and partners across the globe. In the bit of free time that I have, I still like to go back to Azure, sharing knowledge by presenting at virtual conferences like this one, or in person, on any topic that's Azure related: Azure DevOps, site reliability engineering, or app modernization. I also like to write articles on my blog, 007ffflearning.com, or publish books; the latest one, a bit more than two years ago, was on the art of site reliability engineering. Feel free to reach out on Twitter, by email, or on LinkedIn. Now, with the personal marketing out of the way, let's jump
straight into the technical piece of the session and starting with describing what
site reliability engineering is about. Now, in short, SRE stands for site reliability engineer, or engineering. It initially comes from Google, where "site" actually pointed to running the main application, the www.google.com search website, which should have been available at all times. When the practice moved out of Google and became a public practice, we started referring to "site" as any possible workload that is business critical and has to run 24/7. The other part is the reliability piece, where reliability means that you want to guarantee, as a team, that any running application you need to support is available no matter what happens, or maybe even better, available according to business requirements. And the engineering piece is applying the principles of computer science and using engineering concepts to build and maintain your systems and applications, all the way from development into monitoring. Now,
drilling down a bit more on the specifics, I think covering everything would probably take me, I don't know, two or three days, maybe more. But you could simplify it a little bit into these core responsibilities. First of all, when you're wearing your developer hat, it means that you're working on writing software for typically larger-scale workloads. Sometimes you also take responsibility for side pieces of running your application, like backup, monitoring, and load balancing, even moving into operations if you like. And last, it could also mean figuring out how to apply existing solutions to new problems. Good.
Now with that, I need to move a little bit more away from site reliability
engineering into chaos engineering.
Now, what is chaos engineering? More specifically, I could summarize it as the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. This is the official definition coming from the Principles of Chaos, which, if you ask me, could be the name of a rock band. Now there's
three core key words I want to emphasize.
First of all, it's experimenting, which, if you know a bit about DevOps, also means failing fast, because the faster you fail, the faster you're forced to recover, and you're going to learn how to make your systems more reliable, more resilient. So I compare experimenting to licking a fuse as a kid, right, where your hair would spike. Or maybe, again as a kid (don't ask me how I know), going downhill with your bike when you're not super experienced yet: you go super downhill, super fast, and maybe you fall and break your arm, and then you go, oh my God, this was so cool, I'm gonna do this again. Now, the more loopholes we can identify up front, the more confidence (the next part in the definition) we can have in the system's reliability. By introducing a series of event simulations, based on real incidents or based on imaginary outages that could happen, you can target your workloads and learn from the impact. And then the last piece is overall withstanding any possible turbulent conditions. Think of CPU pressure, unplanned load, or maybe an unplanned outage; those could all qualify as chaos engineering issues. Now, one example I would like to use here
to start is what I call the curious case of CPU pressure. What does that mean? Imagine you have a workload, could be anything, could run in the cloud, could run on-prem, could be hybrid, and it's been running fine for months, with a known average CPU load. Now, why do you know that? Because if it's running in Azure, you're going to use Azure monitoring; if it's running on-prem, you're going to use on-prem monitoring tools. In the end it's not too important, as long as you integrate monitoring. But then suddenly there is a spike, and eventually, when you hammer your system, it probably goes down or it crashes, right? Now it's stopping the application, the database goes down, the web app is no longer available, and so on. Apart from troubleshooting the data piece, this also goes hand in hand with testing your engineering team: how can we rebuild the system, how can we get it up and running again as fast as possible? Now, it might also be that you don't even know the reason why, and that's why you want to use chaos engineering, because what you're going to do is integrate functional testing to make sure that any possible outage, planned or unplanned, is not going to happen anymore. Or at least, I would say, to minimize the risk. That's the main thing. You can see here that I'm using a couple of examples, like a virtual machine, a Kubernetes cluster, Key Vault, network security groups. Why? Because all of these are supported in the Azure Chaos Studio service that I'll talk about later.
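To make that "curious case of CPU pressure" tangible, here is a minimal, illustrative sketch (my own stand-in, not part of the session and not Chaos Studio code) of how a CPU pressure fault works under the hood: it alternates busy-looping and sleeping to hold one core at an approximate target load for a fixed duration.

```python
import time

def cpu_pressure(duration_s: float, load: float = 0.99) -> None:
    """Approximate a target CPU load by alternating busy-work and sleep.

    duration_s: how long to apply pressure, in seconds.
    load: fraction of each 10 ms slice spent busy-looping (0.0 - 1.0).
    """
    slice_s = 0.010  # 10 ms scheduling slice
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        busy_until = time.monotonic() + slice_s * load
        while time.monotonic() < busy_until:
            pass                                 # burn CPU
        time.sleep(slice_s * (1.0 - load))       # yield the remainder

# e.g. cpu_pressure(600, 0.99) would hold one core near 99% for ten minutes
```

A real fault injector runs one such worker per core; the Chaos Studio agent does that for you, so this is only meant to build intuition about what the experiment injects.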
The last part I included here is the DevOps engineer. Why? Because human beings are still important, right? There's still a huge amount of issues, unfortunately, when running environments, because of human interaction. And don't forget, we're mainly talking about production environments here. Now you might go, wait a minute, Peter, why are you not that happy with human beings? Don't you like DevOps engineers? Well, I provide training, and DevOps is one of the main technologies I provide training on. Now, why is it so important to bring this DevOps engineer into the curious case of CPU pressure? Because we all know what happens. We publish applications, maybe on a Friday afternoon. Why? Because we have the whole weekend to recover in case something goes down. But then, going back to production environments, it also means that we need to make sure that everything keeps up and running. And if it's not because of the platform, if it's not because of the load on the platform, in some cases, unfortunately, it's still the human being. Now, we're still in the curious
case of CPU pressure. What's important here is that we're actually trying to step away from one individual component. Why is that? Because if you think about virtual machines or Kubernetes clusters, then yes, we look at CPU, but while we typically use CPU as the example, it's not the main root cause. Why not? Because there's a lot more going on in keeping your systems up and running besides monitoring CPU load. So it might be that CPU is spiking because of latency in your database operations, running, I don't know, some complex calculation or a database update. Or it might be that there is a network connectivity issue by which the operation cannot write to the database, and the fact that it cannot write to the database is actually what's causing the CPU pressure. So my analogy here is explaining that systems are complex: virtual machine scale sets, yes, they run virtual machines, but quite a few more than just a single one; or a more complex architecture like Kubernetes.
Overall, you're validating your IaaS, PaaS, and serverless workloads, like Azure Functions, or maybe even the latest one, Service Bus, as part of your architecture, or Cosmos DB, and so many other examples. And then, to add even more complexity, think of all of those in one single scenario, where you're running virtual machines for part of the workload. Next to that, you're running short-running container tasks inside Kubernetes clusters, maybe Kubernetes clusters across a hybrid scenario, partly Azure, AWS, Google Cloud, and why not on-prem? You bring all that together and still have the human being as a potential weak spot. Imagine you need to manage your DevOps teams and they're active all over the globe, different time zones, managing the permissions, and so on. Now, you might think that
chaos engineering is the next big thing, and maybe it's even following site reliability engineering, which you could say was following DevOps, but that it's maybe too revolutionary for your cloud environments. I think nothing is more wrong. In fact, chaos engineering has been around for more than ten years now, initially developed by software engineers at Netflix back in 2008, when they started migrating from on-prem data centers into public cloud data centers. While there are a lot of similarities between managing your own data center and using public cloud, there are also quite some big differences. And it was mainly those differences that forced the Netflix engineers to create service architectures with higher reliability. Now, to be clear, I think that chaos engineering is not DevOps 3.0, but it definitely should be part of a DevOps team's arsenal of tools to meet your business requirements and to validate how your applications are running. So with
that, I would say let's make it a little bit more technology focused
on one specific service called Azure Chaos Studio.
Now, Azure Chaos Studio is, as you can probably figure out, an Azure service offering chaos engineering as a service, which means that you can inject faults into your Azure workloads. Thinking back to the definition, preferably you're going to use chaos engineering against your production environment. But honestly, trust me, you can do this against test and development environments as well. Whether you're testing how applications will run in Azure, or you're migrating applications to Azure, or maybe you're already running production workloads in Azure, Chaos Studio allows you to bring a full set of faults into your scenario, ranging from virtual machines, which we call agent-based chaos testing, to service-direct faults targeting Kubernetes clusters, Azure Key Vault, or network security groups; and one of the latest services we actually added is Service Bus. Now, the core of Chaos Studio is chaos experiments. A chaos experiment is an Azure resource that describes the faults that should be run and the resources those faults should be run against. Faults can be organized to run in parallel or in sequence, depending on the needs, and I'll show you that in an upcoming demo. Chaos Studio supports two types of faults. I already talked about service-direct, which means you're going to target a service that doesn't require an agent. Next to that, you've got agent-based faults, which means you're going to target a virtual machine workload, which could be Windows or Linux, and Kubernetes clusters as well.
Now, the core is the chaos experiment. When you build a chaos experiment, what you're doing is defining one or more steps that execute sequentially, each step containing one or more branches, as we call them, that run in parallel within the step, and each branch containing one or more actions, such as injecting a fault, waiting for a certain duration, or anything else you could come up with. Finally, you organize the resources, which we call targets, that each fault will be run against. You can move them into a grouping called a selector, so that you can easily reference a group of resources.
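To make that hierarchy concrete, the steps, branches, actions, and selectors described here map roughly to the experiment's JSON body. This is a simplified sketch based on the public docs, not something shown in the session; the names, the delay urn, and the target id are placeholders:

```json
{
  "steps": [
    {
      "name": "Step 1",
      "branches": [
        {
          "name": "Branch 1",
          "actions": [
            {
              "type": "delay",
              "name": "urn:csci:microsoft:chaosStudio:timedDelay/1.0",
              "duration": "PT2M"
            },
            {
              "type": "continuous",
              "name": "<fault urn from the fault library>",
              "duration": "PT10M",
              "selectorId": "Selector1",
              "parameters": []
            }
          ]
        }
      ]
    }
  ],
  "selectors": [
    {
      "id": "Selector1",
      "type": "List",
      "targets": [ { "type": "ChaosTarget", "id": "<target-resource-id>" } ]
    }
  ]
}
```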
So, in short, you would start with experimenting: you create an experiment and define the step-by-step process. For example, hammering CPU load; next to that, simulating latency; next to that, firing off a crash of a web server, or maybe running some heavily loaded database task, or anything, again, that's running inside a virtual machine or inside a Kubernetes cluster, or targeting a network security group, or simulating an outage, or not having the correct permissions to connect to Key Vault, and App Services and Functions and so many other examples. And then, in the next step, you're going to define the actual actions. So, reflecting on this, here's what I want you to do. I want you to simulate an action called CPU pressure. Think back to one of the examples I shared before. Within the CPU pressure I want to run a 99% CPU load, or maybe 20, maybe 50, whatever number works and is relevant for your outage testing. And then you define the amount of time: I want to run this for five minutes, ten minutes, and I want you to repeat it for the next hour, although that would actually be a pretty long test. Most probably you don't need the full three hours, or one hour, or even more than a couple of minutes, to validate and figure out whether your virtual machine, or any of the other services I already mentioned, can handle the load.
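That CPU pressure action, with the knobs just described (a 99% load for ten minutes), would look roughly like this as a single action inside an experiment. The urn and the pressureLevel parameter come from the published fault library; the selector id is a placeholder:

```json
{
  "type": "continuous",
  "name": "urn:csci:microsoft:agent:cpuPressure/1.0",
  "duration": "PT10M",
  "selectorId": "Selector1",
  "parameters": [
    { "key": "pressureLevel", "value": "99" }
  ]
}
```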
Now, those two previous slides are technically all you really need to know about managing Azure Chaos Studio: you deploy resources, you create experiments, and you run them. So with that, let's have a look at a couple of demo scenarios and what it actually looks like in real life.
So, this is my Azure environment, where I already enabled Chaos Studio. And again, if you don't really know how to do that: you go into your subscriptions, and within your subscription you search for resource providers. Go a little bit down here into the resource provider section, and that's where you literally enable the Azure fabric features of the platform; there, you're going to search for "chaos". In my case it's obviously already registered, because otherwise I could not really demo anything. You click that register option here on top, giving it a couple of seconds, worst case a few minutes. And from there, once it shows a green "Registered" status, it means you can start using your chaos environment, or the Chaos Studio service, I should say. So let's jump
back to Chaos Studio, where the first thing we're going to do is define a target. A target is, again, anything that's already running in your environment; quite important, it needs to be up and running. The reason why is because you need to define how you're going to manage your target. As you can see here, I've got two scenarios already enabled: I have my Ubuntu Linux virtual machine, where I can manage actions, and the second scenario I've got is my AKS cluster. You can see that there are a lot of other scenarios available. I can literally target my virtual machine scale set over here. I can test against an NSG. I don't have an example for App Services, although technically you can actually stop the App Service itself. The virtual machine is pretty obvious, but you can also validate interactions against, for example, a Key Vault. So it's not just about virtual machines, it's not just about Kubernetes; the target environment keeps expanding.
How do we install that agent? Right, that would be the next step, where again you've got service-direct and you've got agent-based. For a virtual machine you're probably going to go for agent-based. It's nothing harder than selecting your target virtual machine, going up here, enable target, and since, again, it's a VM, we're going to install that package. The next thing you need is a managed identity. Now, what is a managed identity? For Chaos Studio, the first time around, you need to create that managed identity. If you don't really know the details: a managed identity is an Azure AD service principal, like a security object that allows interaction from one Azure service, Azure Chaos Studio in this case, with other parts of the platform: virtual machines, NSGs, Kubernetes clusters, App Services, Redis Cache, Cosmos DB, and so many other services.
So, that would be step one. I already created my managed identity; very important, it's a managed identity for the Chaos Studio service, not a managed identity for the virtual machine that you're going to use as a target. The second dependency component is Application Insights. Again, you already know we need Application Insights for our observability, providing the metrics and showing you the output you need to dive into. In your Azure portal, or again using some automation engine, Terraform, PowerShell, Azure CLI, it doesn't really matter, you're going to deploy that Application Insights service. From there, you just need to define which one you want to use. As you can see, I've got quite a lot of them, because I take monitoring quite seriously, and we're going to enable it as well. And that's all it takes; from here it's going to install that agent. Now, to speed up my demo a little bit, I already have this for my Ubuntu VM, and
you can validate your deployment. You don't really have to wait in the portal to just validate what's going on; it's nothing more than any traditional extension. So maybe you're already familiar with using Chef, Puppet, or some anti-malware scenario like Microsoft Defender, installing it as an extension, like adding a little piece of software, an agent, inside your virtual machine. My portal seems to not be refreshing; let's try that again. So, I've got my chaos agent and provisioning succeeded. Again, this takes just a couple of minutes, but I didn't really want to make you wait while showing how it works on the web VM itself. Keep in mind, if it's a Linux back end, you need to install that stress-ng package inside your VM as well. And again, you can probably figure out how to do that using the traditional approach for your Linux flavor, like apt-get, depending a bit on the distribution you're using. So, what we have right now is our target: we have the virtual machines defined, and if you want, we could also go back to where the deployment is still running,
totally fine. Taking a little step back to my Chaos Studio, where now I can target a different service using the same concept. I'm going to give it a couple of seconds before it pulls up the capable, or compatible, resources, and maybe use my Cosmos database, where this time I'm going to enable it for the service-direct model, which means I don't need to deploy an agent. That's really how easy it is. It's going to flip back, but we're not going to wait for it; you probably get the idea of how to do that. The next step is defining an experiment. I already have a few experiments up and running that I'm going to reuse, just to keep it a little bit entertaining and not waste time on a lot of stuff happening in the back end. But nothing blocks me from showing you how to create a new experiment.
So, once you have this, it opens up an experiment's configuration. Interestingly enough, an experiment by itself is nothing more than a standalone Azure resource. That also means you can automate the deployment using ARM templates or Bicep, to just give you an example. And we're going to call this one "the lord chaos". My targets are running in Central US; in the end it's not really that important, but I typically like deploying my experiments in the same region where I'm going to run my testing, or run the engineering experiment itself. From here, as I outlined in the presentation, the highest level is a step, within the step you've got a branch, and within the branch you define a fault, where a fault can be multiple actions.
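Since the experiment is a standalone Azure resource, that same step / branch / fault layout can also be deployed as code. Here is a minimal Bicep sketch; the API version, the experiment name, and the target id are assumptions on my side, so check them against the current template reference before using it:

```bicep
resource chaosExperiment 'Microsoft.Chaos/experiments@2023-11-01' = {
  name: 'theLordChaos'
  location: 'centralus'                 // same region as the targets
  identity: { type: 'SystemAssigned' }  // the principal you grant Reader later
  properties: {
    selectors: [
      {
        id: 'Selector1'
        type: 'List'
        targets: [ { type: 'ChaosTarget', id: '<target-resource-id>' } ]
      }
    ]
    steps: [
      {
        name: 'Step 1'
        branches: [
          {
            name: 'Branch 1'
            actions: [
              {
                type: 'continuous'
                name: 'urn:csci:microsoft:agent:cpuPressure/1.0'
                duration: 'PT10M'
                selectorId: 'Selector1'
                parameters: [ { key: 'pressureLevel', value: '99' } ]
              }
            ]
          }
        ]
      }
    ]
  }
}
```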
You can totally customize the step and branch names and add multiple branches. I'm going to keep it a little bit easy for now, because I don't want to run overtime, and just focus on creating a new fault, where a fault is, I would say, based on a library of fault injections: a lot of them are already available, and if needed you can insert some custom ones out of a fault library that I'll show you in a minute. Now, since we have different target
endpoints, it means that we can go for different fault scenarios. A couple of obvious ones: validating the shutdown of a virtual machine, or shutting down the full scale set and finding out what the impact is for my application workload. Or specifically targeting Cosmos DB, where again I didn't need to deploy an agent, it's that service-direct model, and I can start hammering my Cosmos DB: running a failover from one region to another, shutting down, dialing down the RUs, the request units, causing some kind of latency, and again testing how my web front end responds to that. Right, there's a nice list of AKS-specific scenarios, where again all of them are based on that open-source Chaos Mesh project, and then from there a whole collection of standalone ones. Let me change the color to just highlight that it's slightly different: interacting with key vaults, like no longer giving you access to your secrets and certificates; CPU pressure; physical memory testing; virtual memory; CPU load; stopping a service; killing a process; and so many other scenarios available.
So, let's check out what that fault library is about. In the official Microsoft docs there's a pretty long list of potential faults and specific actions that you can use. Now, the way it works for the easier tasks, or the easier fault validations that I talked about: you can just select them, and you don't really have to do anything else. But you might also end up in a more complex scenario. Take, for example, CPU pressure: quite easy to understand what it's going to do, and it doesn't really require a lot of settings. For some faults, though, what you need is a JSON file. You could have an open-source library, again like Chaos Mesh, giving you YAML syntax for Kubernetes environments. Now, for the AKS-specific chaos engineering testing that we offer out of Chaos Studio in Azure, you need to translate those YAML files into JSON, and that's also documented in this article. Don't worry too much about where to find that fault library.
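As an example of that YAML-to-JSON translation (a sketch on my side, not taken from the session; the fault version number and the selector values are assumptions): a Chaos Mesh PodChaos spec that you would normally write in YAML for Kubernetes becomes the jsonSpec parameter of the AKS fault, with the same fields serialized as JSON:

```json
{
  "type": "continuous",
  "name": "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.1",
  "duration": "PT10M",
  "selectorId": "Selector1",
  "parameters": [
    {
      "key": "jsonSpec",
      "value": "{\"action\":\"pod-failure\",\"mode\":\"all\",\"selector\":{\"namespaces\":[\"default\"]}}"
    }
  ]
}
```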
You have the link in the Chaos Studio portal, and I also have it in my slides, all the way at the end. Good. So let's
go back to our scenario, where we're going to simulate CPU pressure. Now, I can define some of my settings, right? I'm going to run this for, let's say, ten minutes, and I'm going to hammer my server with 99% CPU. Don't ask me why it's not possible to select 100%, but you probably get the idea: once it reaches 100% CPU, it typically crashes the server. Next, we're going to define the target resources, where I'm going to hammer my Ubuntu and my web VM. So, the nice thing here is that you can run the same
performance testing against multiple endpoints. It could be the same server workload, or different ones: my web VM is running my web engine, my Ubuntu VM is running a MySQL or Postgres database, and I'm going to test the behavior when I hammer both with a CPU load. And that's literally all it takes. If you want, you can add a delay: testing something for a few minutes, then waiting a couple of minutes, testing it again, waiting a few minutes, testing it again. So that's another, maybe more complex, scenario. And then adding an additional step is nothing more than doing the same thing, where again it's going to run through that step-by-step sequence. Let's say, for now, we're going to kill a process, and the process name is cmd.exe. Not the most fancy one, but again, this could be any super important, business-critical workload where you're just going to stop and kill that process. That's the idea: while my CPU load is hammering, I'm going to trigger some other process to stop. And again, from here you can add multiple steps, making it more complex, integrating the time-waiting scenario; you probably get the idea. From here, you review and create. That's basically it.
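For reference, the kill-process step just configured would look roughly like this as an action (again based on the published fault library; cmd.exe simply stands in for your own process name, and the selector id is a placeholder):

```json
{
  "type": "continuous",
  "name": "urn:csci:microsoft:agent:killProcess/1.0",
  "duration": "PT10M",
  "selectorId": "Selector2",
  "parameters": [
    { "key": "processName", "value": "cmd.exe" }
  ]
}
```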
Now, again, I already ran this, so what I have is a CPU experiment, and later on I've got something for my AKS cluster as well. So far we have Chaos Studio: we enabled the service in the platform; next, we defined targets, service-direct or using that agent-based deployment; and step number three, we defined experiments, as easy or as complex as you want, targeting a single step with a single fault against a single VM target object, or making it more complex: adding VM scale sets, web apps, testing Kubernetes clusters, and anything else that I already talked about. Now, from here, the
next step is obviously triggering an experiment.
Now, before this is capable of kicking off, you need to define role-based access permissions: you need to grant the correct permissions. And, interestingly enough, it only needs reader permissions, so there's no real security violation, or security escalation, better said. The CPU experiment by itself becomes a standalone Azure resource, which means it has a service principal in the back end, and we need to define our RBAC permissions: granting reader role-based access for my PDT CPU experiment on that specific target resource. So, we go into our target scenario, in this case my Ubuntu VM, we go into access control, and we're going to define that my PDT CPU experiment gets reader permissions. So: add new role assignment, reader permissions is all it needs, and we specify the member, where my member is the PDT CPU experiment. That should be able to find something; not right now, for some weird reason, but you probably get the idea. Not gonna wait for it. So, now
the next step is running your test. As you can see, I've got a few other ones from other demos that I did; some of them failed, some of them were successful, and we're obviously aiming for the successful ones. So, what I'm going to do here is kick it off, starting my process, just to show you what it's about. It's going to move this into the processing queue, so nothing really to wait for. And while it is running, you can validate the details. Looks like my portal is running behind a little bit. Now, here it tells me, because I was not able to actually define that RBAC, it's literally going to tell me: too bad, it's not working, you don't have permissions to run this. Which is totally fine; I explicitly wanted to show you that. But I can go back to another task and show you the outcome of the exact same scenario. I simulated CPU pressure, running the CPU test for ten minutes. You can nicely see here it started at 4:46 and finished at 4:56, so it ran for about ten minutes, and the outcome is complete. Now, the part that I cannot really show you, but that you should check out: you can go back to your Azure Monitor, or to Log Analytics, or go into your virtual machine if you want to run this live, go into the metrics, check the CPU load, and see that the CPU spikes up to 99% and then, after ten minutes, totally drops back. And at the same time, and that's obviously more important, you're going to validate the impact on your workload.
Another scenario, for my AKS cluster, why not? First of all, Chaos Studio is enabled; we already did that. Then defining the target, agentless, so the service-direct option, and creating an experiment. What I did with this one is define a step where I'm going to run a predefined Chaos Mesh fault, so from that open-source library, the fault family that I showed you earlier in the docs, grabbing one of those parameter sets, and you can see here parts of that JSON. Now, I don't really know if it's related to the preview, but if I go into edit mode it's not going to show me the details of that JSON. That's something I noticed while going through the configuration: from here it's not really, I would say, publishing, or showing you, the details of that JSON file experiment. So maybe something to keep in mind during your preview testing: copy your fault library snippets aside if you don't just want to rely on what's already available over here in that fault library when you spin up a new one. So, what I did here is target my pods, grabbing one of those JSON snippets; that's literally what I did, and you can see it down here. So, let's run this experiment as well. We're going to start it, kick it off. The best way to validate your running pods and how they're behaving is running kubectl, right? You can validate a little bit of it using Azure Monitor as well, but why not show you "kubectl get pods"? Now, out of my experiment, some of my pods should keep running while some other ones are getting stopped; that's literally what we're testing here. I can see that already: in just a couple of minutes it's terminating and restarting one, and because there's a little bit of time I kept between kicking off the task in the recording and moving to the behavior of the pods, you can see for the other ones that I'm literally stopping a pod, starting it again, stopping it, starting it again. That's the main scenario we're testing. I'm only running one node in the back end, but you probably get the idea from here.
Awesome. I'm so excited that all my demos actually worked. Now, to make sure I'm not running too much out of time here for my session, let's wrap up with some resources. The first link here is our official Chaos Studio documentation on the Microsoft Learn platform. The second one points to additional Learn resources that could be helpful if you're a little bit teased by the demos that I did, because those are the step-by-step instructions on how to integrate Chaos Studio for your Kubernetes clusters and for virtual machines using the agent-based option. With that, I would like to thank you for watching my session. I hope you learned from it, and obviously, even more, I hope you enjoyed it. Don't hesitate to reach out in case of any questions on the session, on Azure, and maybe even more so on Azure Chaos Studio specifically. Enjoy the rest of the Conf42 Site Reliability Engineering conference, and I hope to see you again in any of my other online sessions. Thank you so much, and enjoy the rest of the day.