Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey all, thank you
very much for tuning into my talk. And thank you conf 42 for
putting this virtual event together. Today I'm going to be
talking about sprinkles of chaos and fires
and things that can happen in the kitchen. How is it that we can take
a moment to learn in those situations? And what are those
learning methods that we use in the kitchen that we can actually
go ahead and apply to building reliable,
complex applications and systems? Today's first
question is, how did you learn how to cook? Are you someone
that had to learn by watching others cook?
Whether it was YouTube tutorials, cooking shows, watching a
family member? Are you someone that had to learn by actually getting hands on and attempting some of those recipes?
And are you someone that just needs to have feedback
early and often in the cooking process to be able to know
that you're doing things right? This might just be asking yourself as you cook: how much salt am I supposed to be using? What temperature do I need to run this at?
Or are you one of those folks that had to use a fire
extinguisher to learn? There's no right or wrong answer. Everyone learns
differently. And that's why I want to kick off today's talk with
this question. How do you learn? Are you someone that needs
a very specific set of methods in
order to pick up a concept? And when we bring it back to non-cooking topics, what is the best way that you learn? Are you someone that does need that content? Are you someone that needs that hands-on learning and to be able to apply it?
We see some folks really go to a lot of conferences
in order for them to learn. A lot of them want to just watch
YouTube content and go through tutorials and read blog
posts in order to get an understanding of a technology,
a different tool, or just take a deep dive
into a certain technology topic. And I know for
me, my favorite way was building and breaking. I know
I wanted to always get hands on learning and then I had
to break it and be able to learn how to debug it and be able
to share a little bit about that. So I am
talking about the kitchen, so I do have to bring it back to things that happen in the kitchen. How is it that I, in my cooking non-career, have learned what food tastes good to me? I have to spend a lot of time tasting
as I cook, whether it's as I'm seasoning or just trying
to make sure that the food is going to turn out tasty and that I cook the shrimp and the chicken at the proper temperatures. It comes down to that. But the biggest part of learning how to cook, for me, is that I'm
not someone that likes cooking for one. Cooking for myself is not
fun. For me to actually be able to learn and have growth
in this cooking space, I have to cook for others. And as
I cook, I do spend a lot of time reading
content, watching YouTube tutorials, and I end up meshing, like four to five
recipes together. But I'm someone
that has to taste the food as I go. And I
brought that example of, am I using enough salt, or is this
too much salt, or is this so much salt that I need to throw away my plate and start over? It goes back to me learning that I have to go ask for feedback early and often.
And that's okay. It's you having to learn what works best for
you. And sometimes you're trying to learn a new plate. You might
have to burn it one time, two times, three times,
until you start getting the handle of the recipe and the cooking methods
that you need to use. So, yes, Ana, I'm at a tech conference and I'm supposed to be talking about chaos engineering. Can I stop talking about cooking? Yes. Yes, I will. I want to bring it back
to this. There's a lot of beauty in cooking,
and that is because we can learn. We can constantly take a
step back and improve. When I cook, I share that I
love tasting the plates as I go. This is something
very similar to what we do in building our applications.
We want to go ahead and observe and do gradual
rollouts of our applications, whether it's perfecting a
recipe before you show it to a loved one, or whether
it's making sure that you take out the chicken from
the pan and make sure that it's actually cooked properly.
And I also mentioned that other portion that I
love cooking for others. And that is because
my focus is on that end user experience,
those customers. And this is exactly that beauty where experimentation
comes in. Whether I'm actually trying to add a little sweeter kick or
a spicier kick to my plate, or I'm just trying to get my empanadas
crispier. And sometimes, and often,
this actually happens in my house, I actually need to
burn my plate in order for me to learn. This is where we
go ahead and take that concept into building applications by replicating past incidents. We have to learn and practice in order to perfect a skill.
And if that means having to use a fire extinguisher along the
way or throw away your plate, that's okay, because you're doing this
for learning. With that, I wanted to take a moment,
introduce myself. My name is Ana Margarita Medina, senior chaos engineer at Gremlin. I love
introducing myself as a self taught engineer. I started coding
in 2007, got a chance to do a lot of front
end work, moved on to learn a little bit of back end,
and I somehow transitioned to build iOS and Android
applications. In 2016, I got
a chance to come into this beautiful world of site reliability
engineering and get a chance to actually learn.
Where is it that my code runs? How do I make sure it stays up and running? And that is also when I picked up chaos engineering. And it was that moment where I was like, oh, I love learning by building and breaking and being able to take
these concepts and break them down into little chunks.
Whether it was trying to understand Linux capabilities,
a certain application, trying to understand the complexity
of microservices, all those things made it really fun.
And one of the things that really, really matters to me is representation.
If you can't see it, you can't be it. So I love making a
comment about being a Latina. I was born and raised in Costa Rica.
My parents are from Nicaragua. So for any underrepresented
person in tech that's watching this, keep on going.
You got this. To bring it back to today's focus,
we're going to be talking about learning.
And what are those things that we can do every single
day in order to push ourselves past our comfort zone
and take a step into learning. Hopefully all of
you have had a chance to think a little bit more about how you learn, and what is the best way that you can pick up something new?
And maybe you learn very differently than me, and that's okay.
Maybe I actually didn't even cover the way that you actually like learning.
And I did want to touch upon some of the other ways
that folks do learn. And that could just be by
practicing, whether it's trying to pick up a new instrument and going through
and doing that work to learn it, or just making
sure that you're building that muscle memory in order for you
to continue doing this practice. And hey,
if you have to use a fire extinguisher or burn yourself in
the oven before learning, I've been there. I just burned myself two
days ago trying to use my cast iron. It's always going to
happen. And that's okay. I am trying to bring this back
to software and technologies. As we know, the software and technology that we use every single day breaks. The world that we're building relies more and more on
the stability of naturally brittle technology. The challenge that
we face now is how is it that we continue innovating
and delivering products and services for our customers in
a way that minimizes the risk of failure as
much as possible. And when we talk about delivering
these experiences to our customers, we have to understand
that when we are not able to have applications
and systems that are up, we suffer downtime. And downtime
costs a lot of money. We have things that happen during the outage that can be quantified,
and those things come down to revenue.
Go and ask your accounting or sales team to try to understand
what are some of those costs that come into play. We also
have the portion of employee productivity: as your engineers are dealing with an outage, they're not working on features or things to make your product better. And then that brings us to things that can happen after the outage, such as customer chargebacks. Maybe you're breaking some of those
service level agreements and you have to give money back
to your customers. We have this other bucket that
makes downtime really expensive, and those are those
unquantifiable costs. These things can be seen as brand defamation, whether it's the media picking
up that your company or systems are down, or maybe it's
just happening all over Twitter. And the thing, too, is that customers
don't want to use broken products or applications.
And sometimes you can actually go ahead and see that happen pretty
easily, especially in the stock market overnight. One of those other portions of unquantifiable costs comes down to
employee attrition. People don't want to work at places where they're
constantly going to be firefighting. You're going to suffer
burnout rates that are really, really high.
And word gets around in the tech industry, where folks
talk about this vicious cycle of fewer people to handle those incidents, which just leads to more burnout. The average company is expected to lose
around $300,000 per hour that they're down, and that number has nothing to do with their high traffic events or any new launch that they're coming up with.
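To put that hourly figure in perspective, here is a rough back-of-the-envelope calculation in Python. The $300,000 per hour number is the one from the talk; the outage durations are made-up examples, not real incident data.

```python
# Back-of-the-envelope downtime cost, using the ~$300,000/hour figure above.
# The outage durations below are illustrative, not real incident data.
COST_PER_HOUR = 300_000

outage_hours = [0.5, 2.0, 1.25]  # three hypothetical incidents in a year

total_hours = sum(outage_hours)
total_cost = total_hours * COST_PER_HOUR
print(f"{total_hours} hours of downtime ~= ${total_cost:,.0f}")
```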
And when we talk about building reliable applications, we also
have to understand that the world that we're building is
only getting more complex, which makes it
very difficult for us to operate our applications and keep them reliable. The pressure for
faster innovation is driving the adoption of new technologies,
whether it's new types of infrastructure, new coding languages and
architectures, or just new processes that we want to get
a handle on. And when we talk about this complexity,
we could also take a step back and understand that the complexity hasn't always been like this. In legacy applications, when we were doing waterfall processes in our companies, we only had one release. We only had one thing to care about. We also only had one service to keep up and running when we had monolith architectures, and maybe we were only managing hundreds of servers as our organizations were on prem, so
that complexity was a lot smaller. But we've
lifted and shifted and rearchitected our applications,
and now we're in this world where a lot of things are cloud native.
And thankfully, we've seen a lot of organizations adopt things
like DevOps, and that allows for us to have daily
releases that allow for us to deliver better experiences to
our customers. We now have microservices. So instead
of having one service to keep up and running during all
this time, we now have hundreds of services to keep up.
And all of those have interdependencies with each other or with other third party vendors. We've also seen
that in this cloud native world, we now don't
only just have hundreds of servers to take care of,
but we have hundreds of thousands of Kubernetes resources that
we need to make sure are all reliable and tested, and that we have
documentation on how to keep them up and running. So with this current
complexity of our systems, we really, really need
experimentation. Folks just want to move fast and
break things. But what if I told you that there is a better world? A world where you can slow down just a bit and spend
more time experimenting and verifying that you're building
things reliably, specifically for our users
to constantly be happy with our products, our services,
and continue being customers of our companies. At the end of
the day, we're building a complex and distributed system,
and there are things that we must test for, or we will suffer an outage. There are failures that you might see in the industry that happen every few months or once a year, or just outages that get so large that we can take a moment to actually learn from other companies' pain points and make our systems better. And that brings
me to my favorite ingredient for today's talk,
chaos engineering. We're going to talk about this the entire rest
of the conversation. The definition of chaos engineering is thoughtful, planned experiments designed to reveal the weaknesses in our systems. And I have bolded the words thoughtful, planned, and reveal weaknesses in our systems.
Because this is not about just breaking production for fun or
making sure that the team that you work with can actually
handle their on call rotation. This is about doing it in a very thoughtful, planned way where you communicate and you build that maturity. The purpose is not just to break things; we break things on purpose, to learn from those failure points and improve our
applications. As we talk about chaos engineering,
I want to take a step back and just explain some of
the terminology that's going to come up in today's talk. We're going to be
using experiments. This goes back to using the scientific method
to go ahead and learn from our systems. By following
that scientific method that we learned many years ago,
we have that fundamental of creating a hypothesis.
If such a failure happens to my system, this is what I expect
will happen. We also have some safeguards that come into play
with chaos engineering, such as blast radius.
Blast radius is that surface area that you're running that
experiment on. This can be seen as one server,
ten servers, 10% of your infrastructure, or only one service out of your 100-microservice architecture.
That is that blast radius. The other terminology that
we have, very similar to blast radius, is magnitude.
Magnitude is the intensity of the chaos engineering experiment
that you're unleashing. This can be seen as
increasing CPU by 10%, then gradually going to 20%,
30%, and such. Or it can be seen as just
injecting 100 milliseconds of latency, going up
to 300, and incrementing all the way to 800 milliseconds of latency.
That is your magnitude. By using blast radius and magnitude, you can really tailor your experiments to be thoughtful and planned. That last term that I want to cover
in this section is abort conditions. Abort conditions are
those conditions that can happen to your systems or things that you
might see in the monitoring or user experience that will tell you that you need to stop this experiment. This portion is really critical for protecting your application. You want to make sure to ask yourself, when is it that I stop running this experiment?
When is it that I can make sure that the experiment rolls back?
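To make those four terms a little more concrete, here is a minimal sketch in Python of how an experiment could be written down before you run it. This is illustrative only, with hypothetical names, and is not any vendor's API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A minimal, illustrative way to write down an experiment before running it.
# All names here are hypothetical -- this is not any tool's real API.

@dataclass
class ChaosExperiment:
    hypothesis: str                      # what we expect to happen
    blast_radius: str                    # surface area: hosts, services, % of fleet
    magnitude_steps: List[int]           # intensity ramp, e.g. CPU % or latency in ms
    abort_conditions: List[Callable[[], bool]] = field(default_factory=list)

    def should_abort(self) -> bool:
        """Stop the experiment as soon as any abort condition is true."""
        return any(check() for check in self.abort_conditions)


def error_rate_too_high() -> bool:
    # In a real setup this would query your monitoring system.
    return False


experiment = ChaosExperiment(
    hypothesis="Adding latency to the cache should not affect checkout success rate",
    blast_radius="1 service out of 100 (the caching layer), 10% of its pods",
    magnitude_steps=[100, 300, 800],     # milliseconds of injected latency
    abort_conditions=[error_rate_too_high],
)
```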
Now that we've covered the terminology that
gets used in chaos engineering experiments,
let's actually talk about how this scientific method comes
together. That first step that we take in the chaos
engineering experiment is by taking a step
back and observing our systems. This can be done by
just looking at that architecture diagram, trying to
understand the mental models that you currently have of your application.
Or maybe it's trying to understand how all of your microservices talk
to each other. The next step that we take after that is that we
want to go ahead and understand how our system behaves
under normal conditions. This can be done by
just baselining your metrics. This can also be seen as a great opportunity to set some service level objectives and service level indicators that allow for you to understand whether your application is healthy or not.
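As a rough illustration of what baselining can look like, this small sketch turns raw request counts into an availability SLI and compares it against an SLO. The numbers and function name are made up.

```python
# A rough sketch of baselining: compute an availability SLI from request
# counts and compare it to a service level objective. Numbers are made up.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded over the measurement window."""
    if total_requests == 0:
        return 1.0
    return successful_requests / total_requests


SLO_TARGET = 0.999  # 99.9% availability objective for this service

# Pretend these counts came from your metrics backend for the last 30 days.
sli = availability_sli(successful_requests=2_991_020, total_requests=2_994_000)

print(f"SLI: {sli:.4%}, SLO: {SLO_TARGET:.1%}, healthy: {sli >= SLO_TARGET}")
```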
This allows for us to move on to that next step: forming a hypothesis with abort conditions.
This is one of those important steps where you get a chance to take a step back and try to understand: what do I think will happen to my application? And how can I make sure that we don't cause a failure that affects our customers? That is where we set those abort conditions and get ready to take action on them. Then we can actually go ahead
and define that blast radius and magnitude and
say, I want to run a CPU experiment on just 20% of my infrastructure, and that experiment is going to increase CPU so that it's running at 70% on all of those hosts. We then go
ahead and we run an experiment. This is that
fun time that you get a chance to do with your team.
But many teams don't always get a chance to run the experiment. That doesn't mean that they didn't learn anything from steps one through five. As you run that experiment, you want to take a moment
to analyze those results. You want to understand,
after you've introduced these conditions into your system,
how did it behave? How did this behavior correlate to
the hypothesis that you created? And if your experiment
is successful, go ahead and expand your blast radius,
expand that magnitude, and get ready to run
that experiment once again. And if your experiment was
unsuccessful, hey, that's okay. You just learned something.
Take a moment to actually see what will make your application be more
reliable and work on that. Then go ahead and run
this type of experiment again, just to make sure that the improvements that you've put
in actually help your application's reliability.
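One way to picture that run-analyze-expand loop is a small driver that ramps the magnitude step by step and rolls back as soon as an abort condition fires. This is only a sketch; the injection and monitoring helpers are stand-ins for whatever tooling you actually use.

```python
import time

# Illustrative only: ramp up magnitude gradually and stop on abort conditions.
# inject_cpu_load, remove_cpu_load, and error_rate are stand-ins for whatever
# injection tooling and monitoring you actually use.

def error_rate() -> float:
    return 0.001  # placeholder: query your observability stack here


def inject_cpu_load(percent: int) -> None:
    print(f"injecting {percent}% CPU load")  # placeholder for the real injection


def remove_cpu_load() -> None:
    print("rolling back CPU load")


ABORT_ERROR_RATE = 0.05          # abort if more than 5% of requests fail
MAGNITUDE_STEPS = [10, 20, 30]   # CPU percentages to walk through

for step in MAGNITUDE_STEPS:
    inject_cpu_load(step)
    time.sleep(5)                       # let the system settle, then observe
    if error_rate() > ABORT_ERROR_RATE:
        print(f"abort condition hit at {step}% -- rolling back and investigating")
        remove_cpu_load()
        break
    print(f"hypothesis held at {step}%; expanding magnitude")
else:
    remove_cpu_load()                   # clean up after a successful run
```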
And that last step is one of the most important steps that we have
in the chaos engineering process, and that is sharing the results.
This comes down to actually sharing the results with your leadership team and across your organization. And I always
take it a step further and say, go ahead and share the results
and share those learnings with the wider communities, whether it's
the chaos engineering community, the open source communities of
the tools that you're building with, or just any other type of
tech conference, and talk a little bit more about some of the ways that
you've been building and breaking things. I did want to go over some
chaos engineering experiments that we can kind of create,
at least to get you all started in thinking about this.
One of the big ones that I've been seeing across the board,
whether it's folks on containerized Kubernetes environments or
those that have adopted cloud technologies, or are hopeful that their applications will scale with regular use, is making sure that you're planning and testing those resource limits. On Kubernetes, resource limits are there for you to
make sure that things are scaling properly.
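As one small, hedged example of checking resource limits, this sketch uses the official Kubernetes Python client to flag containers that don't declare CPU or memory limits. It assumes you have a working kubeconfig and the kubernetes package installed.

```python
# Flag containers without CPU/memory limits, using the official Kubernetes
# Python client. Assumes a working kubeconfig and `pip install kubernetes`.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    for container in pod.spec.containers:
        limits = container.resources.limits or {}
        missing = [r for r in ("cpu", "memory") if r not in limits]
        if missing:
            print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                  f"container {container.name} has no {', '.join(missing)} limit")
```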
But we can also take it a step back and think, how is it that we're making sure that
when we're using the cloud technologies, auto scaling is actually
set up and that you actually have an understanding on how
long it takes auto scaling to bring a new node in,
how long it takes for that new node to join the rest of them,
and for it to report back to your proper monitoring and observability dashboards in order for you to make sure that things are up
and running. And these things can actually be implemented in
a chaos engineering experiment by just having a
resource impact. So for some of the auto scaling work that I do, I always start out by just saying, go ahead and run a chaos engineering experiment that increases CPU up to 60% on your servers. Go ahead and run that small experiment on all of your hosts, and create those abort conditions so that you'll stop that experiment if your application is not responsive, if you start seeing HTTP 400 or 500 errors, or anything that doesn't feel right for the customer. You can also take a step back and understand what metrics you were looking at for your systems. It might be things like response rates
or traffic rates slowing down. When we
think about the hypothesis for an experiment like this, we want to ask what is going to happen to my system when CPU increases. Do I expect that in 2 minutes the new node will be up and running? Or do I expect that traffic from one server is also going to be routed to another one because this new node is actually coming up?
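As a sketch of how that hypothesis could be checked, the snippet below waits up to a two-minute budget for a new node to appear after the CPU experiment starts. The node-counting function is a placeholder for a call to your cloud provider or cluster API.

```python
import itertools
import time

# Hypothetical check for the autoscaling hypothesis: "within 2 minutes of the
# CPU spike, a new node joins the pool." count_ready_nodes() is a placeholder
# for a call to your cloud provider or cluster API; here it just pretends a
# fourth node shows up after a few polls.
_fake_node_counts = itertools.chain([3, 3, 3], itertools.repeat(4))


def count_ready_nodes() -> int:
    return next(_fake_node_counts)  # replace with a real API call


def wait_for_scale_up(baseline_nodes: int, timeout_seconds: int = 120) -> bool:
    """Return True if the node count grows above the baseline within the budget."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if count_ready_nodes() > baseline_nodes:
            return True
        time.sleep(10)
    return False


baseline = 3  # node count observed before the experiment
# ... trigger the CPU experiment here (e.g., raise CPU to 60% on the hosts) ...
if wait_for_scale_up(baseline):
    print("hypothesis held: autoscaling brought in a new node in time")
else:
    print("hypothesis failed: no new node within 2 minutes -- time to investigate")
```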
One of the other ways that we can think about chaos engineering experiments is trying to
understand what happens to our systems when one
of our dependencies fails. This can be a dependency
on an image provider, or a third party vendor that actually processes payments. When our application can't access that resource, what does your user see? What is the user experience like?
And with these types of experiments, we get a chance to do things like inject latency, or block off traffic to a certain port, application, API, or URL. We can start doing that to try to understand how the UI handles this failure, and how tightly our microservices are coupled such that this becomes a single point of failure that can actually bring us down for a while. And when I set the slides up,
the experiment that comes to mind is something running on a Kubernetes
environment that on your architecture diagram might just not
be seen as a primary dependency. We see it as
just a caching layer, and that is the Redis cart that I have written down here. That hypothesis comes down to me thinking that when my caching layer has a latency increase, since this is just my caching layer, the application should still continue working without any issues. If you're interested in learning about the effects of this experiment, come to one of my boot camps and you'll get a chance to see how this all couples together.
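A very rough way to express that caching-layer hypothesis in code is the fallback pattern below: if the cache is slow or unavailable, the application should fall back to the primary data source instead of failing. All of the names here are hypothetical, not from any specific demo app.

```python
# Illustrative fallback pattern for the caching-layer hypothesis: if the cache
# (e.g., a Redis-backed cart) is slow or down, the app should degrade
# gracefully instead of failing the request. Names here are hypothetical.

class CacheTimeout(Exception):
    pass


def get_cart_from_cache(user_id: str) -> dict:
    # Placeholder: in a real system this would be a Redis call with a timeout.
    raise CacheTimeout("cache did not answer within 200ms")


def get_cart_from_database(user_id: str) -> dict:
    # Placeholder for the slower but authoritative data source.
    return {"user_id": user_id, "items": []}


def get_cart(user_id: str) -> dict:
    """The hypothesis: latency or failure in the cache never breaks checkout."""
    try:
        return get_cart_from_cache(user_id)
    except CacheTimeout:
        return get_cart_from_database(user_id)


print(get_cart("user-123"))  # still works even though the cache "timed out"
```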
So, I am in a kitchen talk, and I now have to talk about the recipe that I do have for building reliable applications. It first starts off by
making sure that we can have availability, that we have capacity
to actually run our applications at the large scale that
we do need to. When we're talking about cloud native
applications, we have to make sure that we're ready for
failure, whether it's an entire region having issues or
that we're ready to fail over from one data center to
the other, or from one cloud to the other if you're multicloud or hybrid. And it takes it back to that last step, where you also want to make sure that you have some form of disaster recovery and business continuity plan, and that you've been exercising those plans in a frequent manner. It also comes
down to that portion of reliability, making sure
that our systems can sustain these failures that happen
day to day to our applications. It comes to that
moment where we want our engineering teams to actually
experiment and try to build better products and features
and that we also get a chance to continue innovating.
As I've mentioned multiple times, practice is one
of these key terms in building resilient applications.
We're building really complex things that have a lot of dependencies.
By practicing, we are able to understand a little bit more about
how all these services and tools play together.
But your team is also going to have a chance to be better equipped
to go back to that point of reliability and keep things up and running. So the best thing is that all these things get
a chance to come together and be tested and worked on
and constantly improved on. If you do perform some chaos engineering
experiments, you get a chance to understand the failures,
constantly be learning from them, and continuously improve on those issues across people, processes, and portions of technology. Our applications live in
such a distributed architecture that things are always
going to be breaking. In complex systems, you have
to always assume that it will break, or we take
it back to Murphy's law. Anything that can go wrong
will go wrong. We have to prepare for those failures,
and we have to always tell ourselves and our teams: always test it, go ahead and break it, before you go ahead and implement it. You want to go ahead and battle test some of the technologies that you're trying to bring into your organization.
This allows for you to understand those dependencies,
those bottlenecks, those black swans that you might not
be able to see until you get a chance to put it all
together with the rest of your applications. You want to understand what
the default parameters of this tool are and
whether or not this actually works straight out of the box.
Are there any security concerns that you need to have in mind with any of these tools? And how is it that this tool or application needs to be connected with the rest of my application in order for me to build it in a reliable manner?
You also want to go ahead and always ask,
what is it that's going to happen when X fails?
X can be any URL,
any API endpoint, any little
box in your architecture diagram, or even just one of
the processes that you have in place. And especially when you're
looking at those architecture diagrams, please ask yourselves,
what is going to happen if this tier two
application goes down? Hopefully you have a good hypothesis
for it. Hopefully you've gotten a chance to practice on
it and ask that hypothesis question.
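One lightweight way to keep asking what happens when X fails is to turn the question into a repeatable test. The sketch below simulates a failed payments dependency and asserts that the caller degrades gracefully; the names and behavior are hypothetical.

```python
# A tiny, illustrative test for "what happens when X fails?" -- here X is a
# hypothetical payments dependency. The point is to encode the hypothesis as
# something you can rerun, not to prescribe a specific framework.

class DependencyDown(Exception):
    pass


def call_payments_api(order_id: str) -> dict:
    raise DependencyDown("simulated outage of the payments dependency")


def place_order(order_id: str) -> dict:
    """Hypothesis: if payments is down, the order is queued, not lost."""
    try:
        return {"order_id": order_id, "status": "paid", **call_payments_api(order_id)}
    except DependencyDown:
        return {"order_id": order_id, "status": "queued_for_retry"}


def test_order_survives_payments_outage() -> None:
    assert place_order("order-42")["status"] == "queued_for_retry"


test_order_survives_payments_outage()
print("hypothesis held: orders are queued when the payments dependency fails")
```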
You also want to take a step forward
and ask, what is it that your organization is doing
day to day to focus on reliability?
This is something that the entire company needs to be focused on in
order to have the uptime that your customers might be needing,
but these might be things that happen behind the scenes, the shadow work.
Or this is you actually picking up technologies like chaos engineering
to innovate in your engineering workspace. You can also
just start asking what work is being done today
that makes sure that we're actually not regressing
into a past failure, that we're not about to relive
that past incident that you were on call for five months ago,
making sure that you've gone through those retro tickets,
making sure that you've actually gone through your systems
and maybe replayed some of those conditions that caused that
last incident and ask, can our system sustain such failures if they were to happen again? And you
want to understand how your system behaves on a day to
day basis under normal conditions, so that you can get ready for those peak traffic events, for those days that you're going to have more users on
your website, or that other things might be
breaking within the dependencies that you've built in. You want to
remember that you have to practice. You have to question
everything, whether it's in systems or general knowledge about
your applications. And that is that beauty that always keeps me
coming back to chaos engineering. It is a proactive
approach to building reliable systems, and you get a
chance to build reliable applications and systems,
but you also build reliable people and organizations
with that. I would like to close out and offer you a
nice little takeaway. If you're interested in joining the
chaos engineering community and getting some of the chaos engineering
stickers that you see up here on the slide, head on over to
the gremlin.com link for this Conf42 talk that's up on the slide. And if you have any questions
about this talk, the topic, or anything to do with chaos engineering or Gremlin, feel free to reach out via email at ana@gremlin.com.
Or you can reach out via any of the social media platforms.
I'm usually Ana_M_Medina.
And if you're interested in giving Gremlin a try for free, you can always go to go.gremlin.com/ana to sign up and try the full suite of Gremlin attacks. With that, thank you all very much. Have a great one.