Transcript
Hi. Welcome to my talk.
It's a pleasure to come share with you some of the lessons I've learned along the way in building
reliable systems. Let me jump in
by introducing myself. My name is Kolton Andrus, CTO of a
chaos engineering company named Gremlin. But prior to that, I worked
at Amazon and Netflix, where I was responsible for helping ensure
that our systems were reliable, and that when they weren't, we were
acting and fixing it as quickly as possible. Why are we
here today? We're here today because we don't want to end up on this page.
We don't want to end up in the news for an outage that occurred.
And unfortunately, this is becoming more and more commonplace in today's
day and age, and we feel for the folks that
are in these circumstances. We're operating very complex
systems with lots of moving pieces and interconnected
dependencies. And this makes the job of building a reliable system
fairly difficult. In the past, this was a world where one architect or
one person could hold all of the things in their head and make sure
the right decisions were being made. But today,
reliability is really everyone's problem.
It's really up to us to be able to mitigate issues and
failures early and often along the path, so that they
don't compound into larger failures and cascading
failures. And so I'd love to talk a little bit about what I've
learned and what I think is most effective in building complex distributed
systems that are reliable. The first of a couple of
simple tenets is to just expect failure.
Sometimes, as engineers, we're thoughtful about the happy path,
and we really need to be considering the alternate paths that our users may experience.
Failure can occur at any time, at any level, and we really
need to ensure that we're handling that failure gracefully. And by doing that,
we're helping simplify the surface area and
making it so that when failures do occur, there's less for us to
figure out. But this rolls right into the second tenet,
which is, we need to keep our system simple. We need to, where possible,
try to do the simpler, easier thing, as opposed to the clever
or more complex things. As we saw in the previous slide, our systems
are complex enough. And so a bit of a rule of thumb here is
that if some fix, some optimization,
is going to yield an order of magnitude better performance, then it's probably worth
that overhead or complexity. But if we're seeing only an incremental gain,
we may want to think about the side effects of that additional
complexity and choose to leave it. But the
next one is also one that's probably fairly straightforward and you've
heard before, but we need to talk about it. We need
redundancy. We need redundancy in our hosts and containers. We need
redundancy in our data. We need redundancy in the ways in which we can
do work within our system. Ultimately, a lot of failure
testing and a lot of reliability comes down to, is there another way to get
it done, or are we at a hard failure point? And so,
thinking about our single points of failure and how to mitigate those
along the way helps us to balance that concern.
Now, I will mention that keeping things simple is also somewhat in
conflict with this. And if you have redundant everything, you actually
have a much more complicated system than a single instance of everything.
So we need to weigh the pros and cons of that approach.
How do we test to know if we're in compliance with this?
The question is, are we comfortable bringing down a host,
a service, a zone, a region? If we
are, then we feel good that our system will continue operating
and the right things will happen. If we don't have that confidence,
if we're unwilling to do it, then we're probably not there yet.
By the way, this is why Chaos Monkey was built: in
the cloud, hosts can fail, and we have to be able to handle that contingency.
And by forcing it to occur, and that's a social strategy,
we're expecting our engineers to consider and be able
to address it, so that we're able to handle those
host failures.
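As a rough sketch of what that kind of host-failure experiment can look like, here's a small Python example in the spirit of Chaos Monkey. It assumes AWS with boto3; the "Service" tag and the "checkout" value are hypothetical placeholders, and the point is simply to pick one instance at random from a fleet and terminate it so the team has to handle the loss.

```python
# A minimal, Chaos Monkey-style experiment: pick one random instance
# from a service's fleet and terminate it. Assumes AWS credentials and
# boto3; the "Service" tag and "checkout" value are hypothetical.
import random
import boto3

ec2 = boto3.client("ec2")

def random_instance_for(service_tag):
    # Find running instances tagged as belonging to this service.
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Service", "Values": [service_tag]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instances = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    return random.choice(instances) if instances else None

if __name__ == "__main__":
    victim = random_instance_for("checkout")
    if victim:
        print(f"Terminating {victim} -- the rest of the fleet should absorb the loss.")
        ec2.terminate_instances(InstanceIds=[victim])
```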
Next, I want to think a little bit about how we design for operations. We often don't think
about operations when we're designing our services. We wait until it's built
or deployed or until there's an issue to really come back and
put a fine tooth comb through things.
And so if we consider this earlier on, one, we can think about
how to observe the service. How do we know that it's behaving well?
And more than just error rates at the front door, how do we know that
the underlying components are healthy and that they're performing the way
we expect? We need to expose configurability that allows us
to tune and to make sure that the system has the right
edges and the right points that are going to protect itself.
And all of these things that we're building likely should be treated as first
class citizens, checked into source control alongside our source code:
our configurations, our runbooks, our alerts,
our monitors, our scripts.
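To make the observability point concrete, here's a small sketch of the kind of "deeper than the front door" health check I have in mind. The component names and check functions are hypothetical placeholders; the idea is to report on the underlying dependencies and their latency, not just whether the process answers.

```python
# A sketch of a "deep" health check: report on the underlying
# components, not just error rates at the front door. The component
# names and checks here are hypothetical placeholders.
import json
import time

def timed(check):
    """Run a single check, returning its status and how long it took."""
    start = time.monotonic()
    try:
        check()
        status = "healthy"
    except Exception as exc:  # a real service would be more selective
        status = f"unhealthy: {exc}"
    return {"status": status, "latency_ms": round((time.monotonic() - start) * 1000, 1)}

def check_database():
    # Placeholder: a real service might run "SELECT 1" against the primary.
    pass

def check_cache():
    # Placeholder: a real service might set and read back a sentinel key.
    pass

def health_report():
    checks = {"database": check_database, "cache": check_cache}
    return {name: timed(check) for name, check in checks.items()}

if __name__ == "__main__":
    print(json.dumps(health_report(), indent=2))
```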
Next, we just need to think about this a little earlier and more consistently throughout
the process. If we, similar to test driven
development, are thinking about the failure modes up front,
we're more likely to be able to leave room for them,
design for them, or be able to address them, than if we have to come
back and duct tape a solution on after the fact. A case in
point here: every time we add a new dependency to our service
is a great opportunity to ask ourselves, is this a critical dependency?
Can I gracefully degrade? Do I really need it?
Lastly, we want to be thoughtful about traffic control
for our services. We never want to allow in more work than
we're able to complete. And so in the interest of protecting ourselves and
the work we have taken on, we need to shed work if we're unable
to do it well. This creates a faster failure loop for our
consumers that helps them understand the state of our service,
as opposed to a delay and a wait while things are
unknown.
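As a sketch of what shedding work can look like in practice, here's a minimal example. The in-flight limit and the shape of the handler are hypothetical; the point is to reject new work quickly with a clear signal once we're beyond what we can complete.

```python
# A minimal load-shedding sketch: once we have more work in flight than
# we can finish promptly, reject new requests immediately rather than
# queueing them into the unknown. The limit here is a hypothetical number.
import threading

MAX_IN_FLIGHT = 100  # tune to what the service can actually complete

_in_flight = 0
_lock = threading.Lock()

class Overloaded(Exception):
    """Signal the caller to fail fast (e.g. map this to HTTP 503)."""

def handle(request, do_work):
    global _in_flight
    with _lock:
        if _in_flight >= MAX_IN_FLIGHT:
            raise Overloaded("shedding load; try again later")
        _in_flight += 1
    try:
        return do_work(request)
    finally:
        with _lock:
            _in_flight -= 1
```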
Similarly, when we're calling our dependencies, it's an opportunity for us to, a, be good citizens and, b, protect ourselves if
that dependency fails, and the circuit breaker pattern works
wonders here. First, if that service is
failing, we don't want to continue calling it. We don't want to make things worse,
and if we know the outcome is going to be a failure, we don't want
to waste our time. But second, if that service is failing,
often circuit breakers are helping us to think about how we could gracefully
degrade. What do we do if that does fail? And what other
source of truth could we find for that information?
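Here's a minimal sketch of that circuit breaker idea, with a fallback for graceful degradation. The thresholds, the dependency call, and the fallback are all hypothetical, and real implementations (including off-the-shelf libraries) add a half-open probing state and more nuance.

```python
# A minimal circuit breaker sketch: stop calling a failing dependency for
# a while, and serve a fallback instead. Thresholds and the fallback are
# hypothetical; real breakers also add a half-open "probe" state.
import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after_seconds=30):
        self.max_failures = max_failures
        self.reset_after = reset_after_seconds
        self.failures = 0
        self.opened_at = None

    def _is_open(self):
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at > self.reset_after:
            # Give the dependency another chance after the cool-off.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, dependency, fallback):
        if self._is_open():
            return fallback()  # degrade gracefully, don't pile on
        try:
            result = dependency()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
```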
To steal a page from the security world, I think that failure modeling is actually a very useful exercise as we're
designing and building our systems. And it's worth the team's time to
spend a little bit of time together brainstorming. What are the types of things
that could go wrong? What happens if two things go wrong at the same time?
Which of these failures are non-starters, things
that we just must live with, that we cannot change? And what are the failures
that we can turn into non-failures by gracefully degrading, that
we can ensure behave the way we expect? Ultimately,
this is a risk assessment, and it's a business and a technological decision
in what we think sounds most appropriate. And so the one
bit of advice I'll give here is that many things can fail,
and it seems like any particular combination of failures might be rare.
But as we're dealing with components with thousands of moving pieces,
perhaps many, many more, failure becomes commonplace. And something
that seemed unlikely may occur more frequently than you
think.
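To put a rough number on that intuition, here's a tiny calculation with made-up figures: something that looks vanishingly rare for one component becomes routine across a large fleet.

```python
# Illustration with made-up numbers: a failure that's "one in ten
# thousand" per host per day is an everyday event across a big fleet.
p_per_host_per_day = 1 / 10_000
hosts = 2_000

p_no_failure_anywhere = (1 - p_per_host_per_day) ** hosts
p_at_least_one = 1 - p_no_failure_anywhere
print(f"Chance of at least one such failure today: {p_at_least_one:.1%}")
# Roughly 18% per day, i.e. about once or twice a week, not "almost never".
```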
A little visual for how we think about this is looking at a service diagram. And from this we can get a view of the
dependencies that we rely upon. We can reason about how important
those dependencies are. In the case of an ecommerce app,
we know that the checkout service is going to be critical and we're not going
to be able to perform meaningful work without it. But something
like the recommendation service that offers other things we might like to
buy could fail, and we could hide that from the user,
and they could still continue on completing the mission that they
have come to us to accomplish.
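A sketch of what that looks like in code for the recommendation example: the function names and page structure here are hypothetical, but the shape is simply to catch the failure and hide the widget rather than fail the whole page.

```python
# Graceful degradation sketch for a non-critical dependency: if
# recommendations fail, hide that section rather than failing checkout.
# fetch_recommendations and the page structure are hypothetical.
import logging

log = logging.getLogger("storefront")

def fetch_recommendations(user_id):
    raise TimeoutError("recommendation service unavailable")  # stand-in failure

def render_product_page(user_id):
    page = {"checkout": "ready"}  # the critical part must work
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except Exception:
        # Non-critical: log it, hide the widget, let the user continue.
        log.warning("recommendations unavailable; hiding the widget")
        page["recommendations"] = []
    return page

print(render_product_page(user_id=42))
```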
Which leads us to talk a little bit about our dependencies. There's more than
just the services we rely on when it comes to the dependencies.
There's infrastructure, there's the network, there's our cloud providers,
there's our external service provider tools,
whether that's a database, whether we need to hit the IRS's
or the government's financial services to get the latest interest rate.
There's lots of moving pieces here, and the first step is
just being aware of them. Do we know what we rely upon?
The second piece is knowing what's critical. And for those things that we rely
upon, do we need it? Must we have it? In many cases,
this is often a bit of a guess. And from my experience,
this is a great place where failure testing comes into play. By going and
actually causing the failure of a dependency, we can see if the
system is able to withstand it and we can test if we're able to
gracefully degrade. And that's how we turn as many critical
failures into non-critical failures as possible.
Now, some we're going to have to rely upon,
and in those cases, we want to work with those teams to make sure that
they're doing the best they can to build a highly reliable service.
If we're trying to build a service with four nines of uptime
on top of a bunch of services that have three nines of uptime, we're probably
going to be disappointed, because as we do the math on
the availability of those services, it actually gets a little worse as we
compose them, not better.
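Here's that math with some example numbers: composing serially required dependencies multiplies their availabilities (assuming independent failures), so the result is only as good as, and usually worse than, the weakest one.

```python
# Composed availability of serially required dependencies is the product
# of their individual availabilities (assuming independent failures).
three_nines = 0.999

for n_dependencies in (1, 3, 5, 10):
    composed = three_nines ** n_dependencies
    print(f"{n_dependencies:>2} deps at 99.9% -> {composed:.4%} at best")
# Ten such dependencies already cap you near 99.0% -- well short of 99.99%.
```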
So I want to share a quote that's one of my personal favorites from James Hamilton;
it comes from On Designing and Deploying Internet-Scale Services. It's about 15 years old,
but the advice in there is all still applicable, and it's
a great read and worth your time. But he essentially challenges us
that if we're unwilling to go fail our data centers and go cause
the failures that occur, then we're not actually confident
our system can withstand them. And if we're unwilling
to go do this testing and to validate our recovery
mechanisms, then likely those recovery mechanisms won't work
when needed. And as someone who's been in this situation in the past: if there are
two failures happening at the same time, and we go
out to mitigate them, and we have a recovery mechanism that doesn't
work, now we're dealing with multiple failures that we have to debug,
diagnose, and fix at the same time,
while in the middle of dealing with it, in the middle of the night,
urgently. And so we want those recovery mechanisms
that are there to save us and to make things go smoothly. We want to
make sure they work and have confidence in them.
So, great. What should you do today?
What are the most effective places to begin for a team that's new to this,
or a team that wants to get better at this? The first is to go hit
redundancy. Start simple. Make sure you can lose a host.
Graduate to being able to lose a zone or a data center, not your
entire application, but a big piece of it. And then
lastly, turn your disaster recovery plan from an academic paper exercise
into a real live exercise that your
company executes. Do a failover between zones or
regions, do a traffic shift, and I promise you, you'll find and
uncover lots of little things you weren't aware of that would have caused this
to be a problem if the real-world event had occurred.
And having lived through that, you can make it so that when the real event
occurs, it runs smoothly. And that's what we want here.
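For the traffic shift specifically, here's one hedged sketch of what the mechanics can look like using weighted DNS records in AWS Route 53 via boto3. The hosted zone ID, record names, and weights are all hypothetical; the real value of the exercise is everything you discover while actually running it.

```python
# A sketch of a weighted-DNS traffic shift between two regions using
# Route 53 (boto3). Zone ID, record names, and weights are hypothetical.
import boto3

route53 = boto3.client("route53")

def set_weight(zone_id, name, region_target, set_id, weight):
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Comment": f"traffic shift: {set_id} -> weight {weight}",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "CNAME",
                    "SetIdentifier": set_id,
                    "Weight": weight,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": region_target}],
                },
            }],
        },
    )

# Shift 90% of traffic to us-west-2, keep 10% on us-east-1, and watch closely.
set_weight("ZEXAMPLE123", "app.example.com.", "app.us-west-2.example.com.", "us-west-2", 90)
set_weight("ZEXAMPLE123", "app.example.com.", "app.us-east-1.example.com.", "us-east-1", 10)
```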
The second thing is to understand our dependencies. First, we must
know them all. Next, we must know which of them are critical,
which we can do by going and testing them with a hard failure.
And then lastly, we can tune how we operate and interact with those
dependencies to ensure that we back off, that we time out,
or that we stop calling them if things are going wrong.
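A small sketch of that tuning, using the common requests library: the URL, timeout, and retry budget are hypothetical starting points, and a circuit breaker like the one sketched earlier would sit on top of this to stop calling entirely when things stay bad.

```python
# Calling a dependency with a hard timeout and exponential backoff.
# URL, timeout, and retry budget are hypothetical starting points.
import time
import requests

def call_dependency(url, attempts=3, timeout_seconds=0.5):
    delay = 0.1
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=timeout_seconds)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == attempts:
                raise  # out of budget: surface the failure (or degrade)
            time.sleep(delay)
            delay *= 2  # back off so we don't pile onto a struggling service
```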
Lastly, and not to be understated, we need to be able to train our
teams. Some folks have been living this for many years.
Some folks are new to this approach and giving them a place to
ask questions, a place to practice, a place
to ensure that they can get the runbooks and that they're up to date,
that they know what the alerts and the monitors look like, that they have access
to their systems is paramount. It's key to allow people
to practice and train during the day, after the coffee's kicked
in, as opposed to 2:00 in the morning when things are going wrong.
So with that, I want to thank you.