Transcript
Hello. It's great to be with you all here today at Conf 42, talking about chaos engineering with some of my friends. Today I'm going to talk about normalizing chaos: cloud architectures for embracing failure. My name is Ryan Guest. I'm a software architect at Salesforce. If you'd like to chat over email, my email is included, or on social media, my Twitter handle is @ryanguest, and I'd love to follow up after this conference and talk about any of these topics.
So to begin, I'm going to start with a premise: what if we could optimize for getting paged at 2:00 p.m. instead of 2:00 a.m.? I think it's kind of funny that every seasoned engineer I talk to has a story about getting paged at 2:00 or 3:00 in the morning. It seems like it's almost become a rite of passage from junior engineer to senior engineer: you have to have a situation like that. But I'd like to throw out there that it doesn't have to be this way.
As we look at designing systems for reliability and other aspects, one of the things we can consider is whether we could optimize so that pages, and the system errors behind them, happen during normal business hours, not in the evening or the middle of the night. Taking some of the principles from chaos engineering, could we apply them to the architectures we build? That's exactly what we did, and I want to share one reference architecture that we used at Salesforce, pretty successfully, to push failures away from off hours in the middle of the night and into normal business hours, when we have full cognitive ability and the full team on staff to respond to them and make changes.

Now, before I go into that, I think it's important to give you a little background on myself.
Like I said, my name is Ryan. I'm a software architect and I've been at Salesforce for 13 years. Over those 13 years, I've spent most of my time on the core platform; data privacy and security are probably the two areas that interest me the most. Salesforce is at an interesting point right now when we talk about infrastructure and reliability, in that we just completed a complete rearchitecture of our most important infrastructure. It goes by the name Hyperforce. With Hyperforce we had a couple of key values that we focused on. One is local data storage, so our customers, or tenants, can know where their data is stored. Next is performance at both B2B and B2C scale, and each of those has different sorts of requirements. We also want built-in trust; we know that, being a cloud provider, trust is key to our success. And we want backwards compatibility: at Salesforce we've been running cloud systems for 20-plus years, so we want to interoperate with the latest and greatest while keeping the legacy systems that are mission critical for our business up and running.
Another important value of Hyperforce is that we wanted to be multi-substrate: whatever cloud provider we use under the covers to provide infrastructure, we want to be able to migrate easily and let customers choose which one to use.

Given those challenges in running a large-scale cloud operation, I think it's easy for engineers to fall into the firefighter mentality. Like I was talking about earlier, all seasoned engineers have those stories, or a lot of them have multiple stories, about getting woken up in the middle of the night by a page or a phone call or a text and having to respond to an issue.
We feel good about being able to jump in and save the day. Like I said, there's a certain badge of honor that folks wear: they can recall the times they've done this, or say, I was the only one who could solve the problem, or I had the magic keystrokes to bring things back to normal. So I want to acknowledge that it does feel good. But on the other hand, it can lead to a firefighter mentality where we spend a lot of our time fighting fires and not enough time asking how we can prevent them in the first place. This is not something you can do forever; constantly fighting fires is one of those obvious paths to burnout if you don't handle it well. It can also take away from innovation: if you're always trying to keep things up and fighting fires, it can hurt innovation and product growth. So you have to have a balance between the two.
At some organizations this is compounded by organizational problems, where we end up rewarding the people who save the day. Say you have two engineers. One builds a system that is reliable by design, so you don't have to SSH into production at 3:00 a.m. and run a magic script because a disk is overflowing with logs. The other one does exactly that, and is always on call and always available. This is a bit facetious, but we need to think about which is better for your career. There's a natural tendency for the spot bonuses and the rewards to go to the person fighting the fire, because all of that is more visible. But it's also important to recognize the folks who are building maintainable, reliable systems; that work matters too, and long term it's better for the company.
So I think we need to balance both. Although it's okay to appreciate folks when they do fight fires, we need to think in a bigger context about what we can do to prevent fires in the first place. So I want to show you how our architecture evolved to get there, so we can minimize the number of fires that do happen, and when fires do happen, choose where they go.
When you think about failover, I think most services start like this. With traditional failover, and this is pseudocode, you try to connect to server A; if that doesn't work, you try to connect to server B. This is circa-2001 failover logic, a very basic example, but calls like this are all over the place; it's used in production at a bunch of companies, and it's pretty straightforward. For a lot of places, for simple architectures, this works really well.
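A minimal sketch of that pattern in Python (the hosts and ports are hypothetical; the talk's original pseudocode isn't shown in the transcript):

```python
import socket

PRIMARY = ("server-a.example.com", 8080)    # hypothetical primary
SECONDARY = ("server-b.example.com", 8080)  # hypothetical failover target

def connect_with_failover(timeout: float = 2.0) -> socket.socket:
    """Classic failover: always try the primary, fall back only on error."""
    try:
        return socket.create_connection(PRIMARY, timeout=timeout)
    except OSError:
        # This branch only ever runs when the primary is already down.
        return socket.create_connection(SECONDARY, timeout=timeout)
```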
But when we start to apply some chaos engineering tools, we can see that this architecture can be exploited, and as things get bigger and more complex, it can be hard to keep up with it.
Now, one of my all-time favorite distributed systems papers is called Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems. It came out of the University of Toronto a couple of years ago. What this paper did is dig into popular open source systems, Cassandra, Hadoop, Redis, and a few others, and ask: where are failures coming from? They found that the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code. They did the research and found that error handling is where a lot of the big bugs occur. And if we think back to the previous architecture I was talking about, we only really reach the error handling scenarios if things have already gone bad in the first place, right? If the happy path, the normal server I connect to, is down, only then do we start to exercise all this exceptional logic. Their conclusion was that simple testing can prevent most critical failures. And it begs the question: why don't we test these systems more?
So I would say, think about that: how can we test the exceptional use cases more? How much have we focused on that? As I mentioned, this paper is one of my favorites. They dug into these open source systems and looked at failure modes; they went through and sampled the highest-severity bugs from the public bug trackers and did some great research. What I also found interesting is that they scraped the code, looking in error handling code for TODO or FIXME comments, the comments developers leave when they know it's an exceptional case, they know they probably need to do more, but they're focused on the happy path. That was a great way to identify: these are the key places where, if a failure happens, it's going to be an issue.
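The pattern they flagged looks something like this contrived sketch: an error handler that exists, but was never really finished or tested:

```python
def replicate_block(block: bytes, replicas: list) -> None:
    """Write a block to every replica (contrived example)."""
    try:
        for replica in replicas:
            replica.write(block)
    except IOError:
        # TODO: retry or rebalance on partial replication.
        # Stubbed-out handlers like this are exactly what the study
        # found behind many catastrophic production failures.
        pass
```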
Looking at that research, and thinking about why I always get paged at 2:00 a.m. and not 2:00 p.m., I started thinking: what if we just failed over all the time, so there was no exceptional case? What if we were in failover mode all the time and just operated like that? You could call it a sort of chaos engineering mode, but really, what if that's our standard? What if we just normalized it and said, hey, this is how it works, so that the happy path and the non-happy path don't diverge?
So here's a very rudimentary version in more pseudocode: we just randomly pick between two servers. Flip a coin; if it's heads, go to server one, if it's tails, go to server two, and just bounce back and forth all the time. If server one goes down, then we just keep going to server two. And this works great, because if one server goes down, it's not the end of the world. We can check the logs, or look at alerts or warnings, however you've instrumented things, whatever type of observability you have, and do that during normal business hours: okay, these are the things that went wrong, these are the errors we saw. To the end user, things are still working, but we know we have to fix these things, change the design, or even just bring the server back up. And I would much rather analyze and think about those in the afternoon than in the middle of the night.
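A sketch of that coin flip in Python, building on the earlier hypothetical hosts:

```python
import random
import socket

SERVERS = [
    ("server-1.example.com", 8080),  # hypothetical
    ("server-2.example.com", 8080),  # hypothetical
]

def connect_coin_flip(timeout: float = 2.0) -> socket.socket:
    """Pick a server at random on every call, so the 'failover' path
    is exercised constantly instead of only during an outage."""
    first, second = random.sample(SERVERS, k=2)  # random order each call
    try:
        return socket.create_connection(first, timeout=timeout)
    except OSError:
        # Not an exceptional code path: the other server is just
        # the other side of the coin.
        return socket.create_connection(second, timeout=timeout)
```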
Expanding on this, we can do other things. If our main service and the servers it talks to are all in one cloud provider, we can split them between two different cloud providers for more reliability. This type of thinking changes our mindset: we're no longer thinking, okay, this is the main server and this is the failover, this is the primary. We're now thinking there's a pool of servers, and there's no difference between failing over and the normal happy path. Both the client and the server don't care, and behave the same either way.
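Extending the coin flip across providers just means a bigger pool; a sketch with placeholder provider names:

```python
import random
import socket

# One pool spanning two cloud providers; no server is "primary".
SERVERS = [
    ("app-1.cloud-a.example.com", 8080),  # provider A
    ("app-2.cloud-a.example.com", 8080),
    ("app-1.cloud-b.example.com", 8080),  # provider B
    ("app-2.cloud-b.example.com", 8080),
]

def connect_any(timeout: float = 2.0) -> socket.socket:
    """Try the pool in random order; failover and happy path are identical."""
    for host in random.sample(SERVERS, k=len(SERVERS)):
        try:
            return socket.create_connection(host, timeout=timeout)
        except OSError:
            continue  # a down server is not a special case
    raise ConnectionError("no server in the pool is reachable")
```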
There were also some issues that cropped up, and some pragmatic things we did. We ended up signing requests, so you could say: this is the type of data I'm looking for. You can get into situations, and you have to manage this for your environment, where different services may be out of date or need a certain version, and you want to account for that. But this is expanding on our architecture.
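The transcript doesn't show the actual scheme, but a minimal sketch of a signed request that declares what data and version it expects might look like this (the key handling and field names are assumptions):

```python
import hashlib
import hmac
import json

SECRET_KEY = b"shared-secret"  # hypothetical; use a managed key in practice

def sign_request(resource: str, min_version: str) -> dict:
    """Build a request stating the data and version it wants, plus an HMAC."""
    payload = {"resource": resource, "min_version": min_version}
    body = json.dumps(payload, sort_keys=True).encode()
    signature = hmac.new(SECRET_KEY, body, hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}

# An out-of-date server can reject this cleanly instead of answering
# in a shape the client no longer expects.
request = sign_request("accounts", min_version="2.3")
```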
We eventually went a step further; let me dig into that.
When we talk about failing over everywhere, we realized that 50/50 is kind of a naive approach and we could do better. So we did some probabilistic modeling. Essentially, the difference is that the choice of which service you go to becomes a function, and in that function we can have logic and decide what to do. So maybe 90% of the time you do want to go to the closest server, because in some cases you're limited by the speed of light: how fast can I get around the world, or bounce between data centers? Then you have a pool of outside servers, and you may want to hit them a smaller percentage of the time. This is a fictional example, but it's similar to what we're doing right now: we favor a primary pool of servers over secondary servers, but during the course of a normal day of operations we do send requests to the secondaries, and we do ping them and see how things are going.
Now, we came up with a sort of formalized model for this. What's really important is that all these probabilities add up to one. Like I said, you can have individual functions for each service and say: 90% of the time I want to go to my local service; 2% of the time I want to go to a service on another cloud provider, just to make sure that failover case works; maybe 2% of the time I want to go to a service in another geographic region, if the service is built like that; and 3% of the time I want to go to another service that may be an on-premise service. As I mentioned earlier, in our future architecture we're expanding right now to multi-substrate services, and one of the substrates may be a colocated or on-premise substrate, so you have to factor that into the balance.
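A sketch of such a routing function, with illustrative pools and weights (nudged so they actually sum to one, as the model requires):

```python
import random

# Illustrative weighted routing table; the weights must sum to 1.0.
ROUTES = [
    ("local",        0.90),  # closest servers; bounded by the speed of light
    ("other-cloud",  0.04),  # keep the cross-cloud failover path warm
    ("other-region", 0.03),  # exercise the cross-region path
    ("on-premise",   0.03),  # exercise the colocated/on-premise substrate
]

def pick_pool() -> str:
    """Choose a server pool with weighted probability on every request."""
    pools, weights = zip(*ROUTES)
    assert abs(sum(weights) - 1.0) < 1e-9, "probabilities must add up to one"
    return random.choices(pools, weights=weights, k=1)[0]
```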
Now, this architecture is good for some use cases, but there are major trade-offs, and you're probably already thinking, oh, this is good for this but not for that. One thing it's not good for is data that changes often. If you're constantly updating data, if you have a write-heavy database, this is not the failover architecture for you. You really want something where changes don't happen that often, because you don't want to worry about replication or keeping data up to date while you're bouncing around all over the place. Similarly, it isn't good if you have a very chatty protocol: if you're constantly talking back and forth, you don't want to be sending those messages all around the world; you probably want to keep that traffic in a smaller region, and that makes sense. And similarly, if you value performance over reliability, if you'd say, I'd rather have this request happen really quickly or fail, then this isn't the approach for you. This has worked best in our systems where it's been the opposite: reliability is the key, and if a request takes ten times as long, that's fine. I'd rather get an answer than no answer at all, and I can wait a little bit. That's key.
So like I said, if your data is mostly constant, if stale data isn't a big deal or you have a versioning system for updates, and you really value reliability as one of your key tenets over things like performance, then this is a great architecture to choose. But as with all engineering decisions, it's important to consider the trade-offs and look at what's best for you.
Thanks, I really appreciated talking to y'all. Feel free to hit me up on Twitter or email, or both, if you have questions or would like to talk about this further. This is an architecture that we've evolved here, and I'd love to hear your thoughts and hear from folks doing similar things, so we can really optimize for the chaos mindset, so that failover isn't something that just happens at 2:00 a.m. in exceptional circumstances; it can happen all the time. I would love to hear more about that.