Transcript
I'm Vilas. I am a director of engineering at Walmart.
I focus on enabling developers to deploy their code
in a resilient, performant way into the
cloud of their choice. That's what I focus on. What I'm going to talk about
today is a little bit about what we saw on our journey towards becoming more resilient as a software company: the challenges we faced, the mistakes we made, and all of that stuff. So before I go there,
obviously everyone in the room knows what this is.
I'm not going to repeat, everyone else is going to talk about this.
Suffice to say, the way we looked at this was we realized the importance
of this at Walmart, but we also realized that to
execute something like this at Walmart scale was going to be
a huge challenge. Right. What is Walmart scale?
These are some numbers. We have more than 11,000
stores worldwide that are supplied with software that actually runs everything in the store. All of the management on the supply chain side, all of the retail business, is managed by software. The annual revenue of all of this combined is more than half a trillion dollars, which speaks to the amount of goods that are moved across the world for it. Right. That number, 270,000,000, is the number of folks who transact on our omni services, including the website and the stores, in a week. So that's the amount of foot traffic or
transactions that we see in a week. So if you think about scales
like that, any kind of disruption could cause
a massive amount of damage. So that is something that we wanted to sort of
think about when we even thought about doing this kind of journey
at Walmart. So to begin,
initially, we had to establish some truths, right? So we said,
okay, the following things are true. One,
reliability is no longer just a function of redundancy
and over-scaled hardware, right? We are not going to just throw servers at a problem and assume that will obviously fix it.
That's no longer the case. Specifically, the way we were thinking about it, we wanted to exist in a hybrid cloud environment. If we wanted to do that, we had to acknowledge that external cloud providers, no matter how strong their guarantees are, are still a dependency, a variable. And if something bad happens, we still have to be ready and able to serve what our customers need.
Customers do not like the idea of scheduled downtimes, right. There was a time when that used to happen. That is no longer true, which means customers expect you to be on all the time and have the best service, no matter how many parallel connections are open or how many parallel transactions are running. They want everything to be just as smooth at peak as it is during off-peak hours. Right. Any user performing any transaction on the service does not expect any loss of functionality. Essentially, your cart not being available, your items not being available, is not acceptable; they want it all available, all the time. And that's the expectation. And that's not wrong.
That's how our customers are today.
And a direct corollary of that is that users can lose trust in a brand because of a single moment of a bad experience, a glitch, something bad happening in the back end. Right? That loss could be temporary, meaning they feel, oh yeah, this is not a good place for me right now, and I'll come back later. Or it could be for a lifetime: this is it, I don't trust this brand at all. And they could communicate that to their families. You could lose their entire business for their entire lifetime. Right. These are pretty big numbers that we had to
consider. So these are truths. We held them as truths. And using this, we said,
okay, fine. So what is the goal? So obviously you will see a
lot of connections to the principles of chaos engineering directly.
But this was our goal: to maintain an application ecosystem where, if there are failures in infrastructure and dependencies, they cause minimal disruption to the end-user experience. And what is minimal disruption is something that we defined over time. We refined it from a very macro-level, sort of amorphous notion to something more and more fine-grained over time.
Right? So the first thing that we started talking about is, how do we inculcate the idea of running a chaos exercise? So we said, fine, let's talk to teams about an outage. You had an outage; there were probably a lot of things going wrong. How did you manage it? By talking about it, we realized that every outage was essentially a chaos exercise that was completely unintentional, but was happening nonetheless and was causing people to be reactive.
It was exposing gaps in our systems. It was exposing things where we
were not as good as we could be. Obviously, the revenue impact
could be huge, but essentially this entire thing constituted
a chaos exercise. So you were sort of in that mode already.
The only idea was to say, let's not have it unintentionally.
Let's prepare for this from the start.
Right? Obviously, the measure of all of these exercises that we did at the start was downtime: we said, okay, we want to calculate what the downtime looks like. Right? So we wanted to calculate a per-incident cost. For any incident that happens, we wanted to calculate exactly how much it costs the company. We wanted to break it down by quarter and see trends, find exactly where we are super efficient and where we are not. And then we wanted to track this and find a path forward, essentially: to plan for a culture of software resilience, instead of saying, yeah, just fix this for now and it'll be fine.
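To make that concrete, here is a rough sketch of what such a per-incident cost roll-up could look like. The cost components and rates below are made-up assumptions for illustration, not actual figures or the method we used.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    quarter: str
    minutes_of_impact: int   # how long the customer-facing impact lasted
    engineer_hours: float    # total people time spent responding

# Hypothetical rates, purely for illustration.
REVENUE_LOSS_PER_MINUTE = 5_000   # dollars lost per minute of impact
LOADED_ENGINEER_RATE = 120        # dollars per engineer-hour on the response

def incident_cost(incident: Incident) -> float:
    """Per-incident cost = revenue impact + people cost of the response."""
    revenue_impact = incident.minutes_of_impact * REVENUE_LOSS_PER_MINUTE
    response_cost = incident.engineer_hours * LOADED_ENGINEER_RATE
    return revenue_impact + response_cost

def cost_by_quarter(incidents: list[Incident]) -> dict[str, float]:
    """Aggregate per-incident costs so trends can be compared quarter over quarter."""
    totals: dict[str, float] = {}
    for inc in incidents:
        totals[inc.quarter] = totals.get(inc.quarter, 0.0) + incident_cost(inc)
    return totals

incidents = [
    Incident("Q1", minutes_of_impact=42, engineer_hours=30),
    Incident("Q2", minutes_of_impact=12, engineer_hours=8),
]
print(cost_by_quarter(incidents))  # trend this number down over time
```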
So when we started doing calculations for an incident (this should be familiar to anyone who's dealt with incidents), there is a path, right? You identify the incident, you page out, there is an on-call who receives it and files an issue, or maybe your L1 support files an issue. There is some logging that is also sending out alerts. There is initial triage with the folks who are on call: they try out some stuff, they figure out exactly what the playbooks say, and they try some things. If that doesn't work, you assign it to a subject matter expert for a fix, which could take some time. You could escalate if necessary. Finally, you resolve it and close it. If you find a bug, you fix it and deploy it, or you do a hot deploy. Or if something is fundamentally wrong and requires you to rearchitect, then you write that change in and then you release it. Right. This is expensive if you look at it.
We want to try to solve problems, or reduce the costs, in such a way that we stop it at that third block, right where the initial triage happens. We want that on-call person to have enough experience and enough information to be able to solve it at that third level. Right? So this is what we called an incident cost, and this is what we calculated for all of the incidents that we had. So once we had done this
as the initial step, we still had a long way to go because we had
to do a lot of homework. The teams that were running
this had to do some homework. Right? And this
is where I would start talking about how you could also institute this in your own companies. Right.
The homework that is needed is very generic. Everyone who's trying to
run chaos exercises should be doing this. First is
observability. This is non-negotiable. If you cannot keep a pulse on the system, understand when something goes wrong, and identify it within a specific period of time, then you are not observable, right? So obviously everyone relies on logging, and there are a lot of ways to do logging. We have Splunk dashboards, Kibana dashboards, things like that. But that's not enough; that only solves part of the problem. You need intelligent second-level metrics to figure out exactly what changed, right? You want those deltas, you want those trends and the changes. You obviously need alerts set up. But I would say the last thing is the most important. For any team that you run, the measure is: can an on-call engineer who gets paged for an incident resolve the issue successfully within the SLA of a P1? Typically, the SLAs for P1s are short, both for acknowledgment and for incident resolution. If that person can find the issue, root-cause it to a successful degree, and push in a fix within that time, that, to me, indicates that there is significantly good observability in the system. Right? Of course, it's not perfect. I'm sure there are better ways to measure this, but this is something that I think gives us a good measure of how the system works.
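To give a flavor of what a second-level metric could look like in practice, here is a minimal sketch. The metric, window size, and threshold are purely illustrative assumptions; a real system would pull these counters out of the metrics backend rather than compute them in-process.

```python
from collections import deque

class ErrorRateTrend:
    """Tracks error-rate deltas over a sliding window and flags anomalies."""

    def __init__(self, window_minutes: int = 15, delta_threshold: float = 0.05):
        self.window = deque(maxlen=window_minutes)  # one sample per minute
        self.delta_threshold = delta_threshold      # alert if the rate jumps by 5 points

    def record_minute(self, errors: int, requests: int) -> None:
        rate = errors / requests if requests else 0.0
        self.window.append(rate)

    def should_alert(self) -> bool:
        if len(self.window) < 2:
            return False
        *history, latest = self.window
        baseline = sum(history) / len(history)   # trend of the previous minutes
        return (latest - baseline) > self.delta_threshold

trend = ErrorRateTrend()
trend.record_minute(errors=12, requests=10_000)
trend.record_minute(errors=900, requests=10_000)
if trend.should_alert():
    print("error rate trending up; page the on-call")  # hook into real alerting here
```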
The other piece of homework is the on-call prerequisites. Imagine your team has an on-call engineer and they wake up in the middle of the night. Let's say they wake up at 2:00 a.m. and they're on a call with the VP and the CTO, and the company is losing millions every second. How can we fix this? Right? At that point, you do not want them
stumbling to try out things. You want them to have a specific playbook
for all of the issues that can go wrong, right? So you need to have
a disaster recovery playbook, and you need to have this playbook well tested.
Having a playbook is not the answer. Testing a playbook continuously is
the answer. Right. You need to make sure that every application or every microservice, however you define it, defines its own critical dependencies: the dependencies which, when they go down, impact the application to a degree where it's not functional, where it cannot do its job.
Those are critical dependencies, right. For every critical dependency failure,
there needs to be a detailed playbook or an automated way to
have fallbacks instituted whenever the critical dependency is
not available. Non-critical dependencies also have to be defined, right? For example, maybe you send logs asynchronously to another service. Initially, that failure may not affect you, but over time, the accumulation of those logs may impact you in some way. Right. Maybe the space on your system is being eaten up and eventually you run out of disk space. Right. That kind of non-critical dependency needs to be defined along with the thresholds at which it will start impacting the service.
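As a loose illustration, a per-service dependency definition could be captured in something as simple as the structure below. The service, dependency names, playbook paths, and thresholds are hypothetical.

```python
# Hypothetical example of how a service might declare its dependencies,
# fallbacks, and thresholds so playbooks and automation can act on them.
CART_SERVICE_DEPENDENCIES = {
    "critical": {
        "inventory-api": {
            "impact": "cart cannot validate item availability",
            "fallback": "serve last-known-good availability from cache",
            "playbook": "runbooks/inventory-api-outage.md",
        },
        "payments-gateway": {
            "impact": "checkout fails entirely",
            "fallback": "queue orders for deferred authorization",
            "playbook": "runbooks/payments-outage.md",
        },
    },
    "non_critical": {
        "async-log-shipper": {
            "impact": "local log accumulation",
            # threshold at which this starts to hurt the service
            "threshold": "local disk usage > 80% for 30 minutes",
            "playbook": "runbooks/log-shipper-backlog.md",
        },
    },
}
```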
Right. If you have this, then an on-call engineer, even in the middle of the night, on a call, in a high-pressure situation, will be able to navigate it in a systematic way.
So that's what we think is required homework.
Because we realized that anyone who goes through this exercise uncovers a lot of gaps in their system automatically, even without running a chaos exercise, right. So these are the two key pieces of homework. I would say if you do not have observability and you do not have these on-call prerequisites, you will not be able to get the most out of a chaos exercise.
The other thing I would say is there needs to be a way for you
to be able to understand your production load, what it looks like,
and be able to generate it in a sensible way. Right.
This is crucial for two reasons. One,
you do not ever want to run in prod
unless you really are confident. So you have to always have
a do-no-harm approach. Right. You cannot knowingly cause harm to prod, so you want to know how to do this in a pre-prod environment, but using prod traffic, so you can be confident of the results. There is no point in running failure injection tests if you can't really verify what will happen when there is a failure and what happens to prod traffic. Right? And you want to try to do that as early as possible in the cycle, so that you're not costing the company money.
By all means, if you're confident, you should be pushing your testing to prod as well. But if you're not mature enough, this pre-prod approach is important.
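As a very rough sketch of the idea of replaying captured production traffic against a pre-prod environment: the log format, endpoint, and sampling rate here are invented for illustration, and the replay is deliberately limited to read-only requests in the spirit of do no harm.

```python
import json
import random
import urllib.request

# Hypothetical: a file of captured production requests, one JSON object per line,
# e.g. {"method": "GET", "path": "/cart/12345"}
CAPTURED_TRAFFIC = "prod_requests.jsonl"
PREPROD_BASE_URL = "https://cart.preprod.example.com"  # never point this at prod
SAMPLE_RATE = 0.10  # replay roughly 10% of captured traffic

def replay() -> None:
    with open(CAPTURED_TRAFFIC) as f:
        for line in f:
            if random.random() > SAMPLE_RATE:
                continue
            req = json.loads(line)
            if req["method"] != "GET":
                continue  # keep the sketch read-only: do no harm
            url = PREPROD_BASE_URL + req["path"]
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    print(req["path"], resp.status)
            except Exception as exc:
                print(req["path"], "failed:", exc)

if __name__ == "__main__":
    replay()
```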
Then there is the build-or-buy question. This is something that I think has been answered by lots of folks who have more knowledge about it than me. I don't think there is one right answer. If you have a system that can generate production-like load internally, that's great. If you have proprietary needs and build it yourself, that's fine. But there are also buy options that other teams and other companies have used. Right? So I don't have a strong say in that, but it is something that has been debated. Anyway,
the other thing I would say needs investment is a CI/CD workflow, right? The diagram that I have on the screen essentially demonstrates what a real CI/CD pipeline looks like in this day and age. There is a lot of automation. You'll notice some of the stuff is very familiar to you, right? Which is plan, code, build, test, and deploy. Right. In between, there is a profile stage, right? A profile stage could be something like: run a performance test, find out exactly how much I've actually been utilizing, and recommend the best configuration for me to run in. Like, give me the right kind of CPU and memory allocations for my Kubernetes pods, or give me the right flavor for my VMs on Azure, whatever those are, right. That has essentially now become part of the CI/CD cycle. And having that enables you to solve a lot of these reliability issues way earlier in the pipeline, rather than waiting for some signal to come out of production.
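As a toy sketch of what that profile stage might compute, consider the function below. The headroom factor and the example inputs are made-up assumptions, not any real recommendation engine.

```python
def recommend_k8s_resources(observed_cpu_millicores: float,
                            observed_memory_mib: float,
                            headroom: float = 1.3) -> dict:
    """Turn peak utilization observed in a performance test into suggested
    Kubernetes requests and limits. Purely illustrative."""
    return {
        "requests": {
            "cpu": f"{int(observed_cpu_millicores * headroom)}m",
            "memory": f"{int(observed_memory_mib * headroom)}Mi",
        },
        "limits": {
            "cpu": f"{int(observed_cpu_millicores * headroom * 1.5)}m",
            "memory": f"{int(observed_memory_mib * headroom * 1.5)}Mi",
        },
    }

# e.g. a performance test that peaked at 400m CPU and 512Mi of memory
print(recommend_k8s_resources(400, 512))
```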
You will also notice there are two key things here. There is a multi-cloud environment; so, hybrid cloud, I obviously represented that. But there are two things. First, the inference layer, right? The inference layer is essentially a back channel out of prod coming into the CI/CD pipeline. This is why I say observability is important: if you have an observable system, you can read what's happening in production and then feed that back in to make your code do better, right? The other thing you'll see is a decision engine that sits between the CI/CD pipeline and the clouds. This is something that many teams and many big companies have started investing in, which is figuring out what's the best kind of cloud configuration, if you will, for my application. Would it be better on prem because it has latency restrictions? Would it be better on a certain cloud provider, because that provider offers a certain kind of platform service? These decisions have now started to get automated, right? And you'll see more of this in the coming years. But this is an investment that I feel enables you to have a better system in production.
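To illustrate the flavor of such a decision engine, here is a deliberately tiny sketch. The placement options, requirement fields, and rules are hypothetical and only encode the examples mentioned above; a real engine would also weigh cost, capacity, and failure domains.

```python
from dataclasses import dataclass

@dataclass
class AppRequirements:
    max_latency_ms: int            # end-to-end latency budget
    needs_managed_platform: bool   # relies on a cloud-specific platform service
    data_residency_on_prem: bool   # regulatory or data-gravity constraint

def choose_placement(req: AppRequirements) -> str:
    """Pick a deployment target from a fixed set of options."""
    if req.data_residency_on_prem or req.max_latency_ms < 10:
        return "on-prem"
    if req.needs_managed_platform:
        return "public-cloud-with-platform-service"
    return "any-cloud"

print(choose_placement(AppRequirements(max_latency_ms=5,
                                       needs_managed_platform=False,
                                       data_residency_on_prem=False)))
```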
Building a maturity model. So we did this at Walmart.
We did not allow teams to just come in and run chaos engineering tests,
right? We have a detailed maturity model from level one of
maturity to level five, with requirements at each level. And this is all detailed
in that blog post. It talks about
how as a team, you can enable yourself to grow from one level to
another. And that also enables the team to become more and more confident.
It enables management to be more confident in the team running these exercises. That maturity model, I have seen, really helps. Right.
What happens essentially, because of that maturity model, is that over time,
as you go from red, which is level one, to green,
at level five, the support costs go down. Right.
What also happens is that the revenue lost per second, or per minute, however you calculate it, also goes down, and the resiliency obviously shoots up. At level one, I would say the support costs could be a five-digit dollar number per minute, whereas at level five, you're probably looking at a few hundred dollars per minute. Right. So imagine a couple of engineers at level five working with good observability systems and just getting all the answers because they have everything in place, versus a whole team of engineers trying to figure out what to do next and what to shut down to solve the problem. The other
thing is build the right tools. So we invested a lot of time in making
sure that we have the right tools to enable our teams to run resiliency tests. One of the tools that we did invest in is Resiliency Doctor. You can read all about it in the article that Vijita has published. So investing in tools is important, because I believe the ecosystem is still sort of getting up to speed; that killer product isn't quite there yet. So we all need to contribute, chime in, and do our best. This is essentially something that I would say is good value for money for any company. And it has really helped us enable that maturity model in all the teams.
So apart from the maturity model for the teams, the other thing that I would
focus on also is building the right mindset of each and every individual.
Right? So what we suggest all companies do, because we've seen this succeed, is a lot of training. Right. The path is not easy. The journey is not easy. You have to make sure that people understand what they're trying to accomplish before they do it. So trainings, certifications, brown bag sessions, all of the usual stuff. Making sure that you're building resilience after an
outage. Right. And resilience doesn't just mean software resilience.
And I know there are other speakers who will talk about this, about human resilience as well. Right. You want to make sure
that the person who was on call doesn't treat this as a traumatic experience.
They actually learn from it, because the postmortem itself is blameless. You're not really pointing fingers; you're trying to figure out what the system looks like and how to make it better. And teaching that is also something that doesn't come naturally to human beings. Right. We tend to want to find someone to blame, and blameless postmortems are something that I think really takes a culture change for companies to accept.
And the last thing is carrots, not sticks: you treat teams with rewards, not punishments. Basically, the idea is
you provide some kind of incentive for folks doing these kinds of resiliency exercises. It's a hard thing to do. The fact that they're accepting it and doing it means that they're committed to it, they're passionate about
it. You want to make sure that you enable them and you incentivize them.
So what did we learn on our journey? All of this stuff, I think, is something that works for any company anywhere.
But there are some things that I would say you should not do, and that's what I'm going to talk about next. So these are the don'ts.
These are our learnings. The first learning we found was that we mistakenly created vanity positions, right? And this is something that other folks have also done, and which we realized we did not want: this person is the designated chaos practitioner, or this person is the resiliency expert, and such and such. That doesn't work because, one, it shifts the responsibility for getting something done onto that one person instead of democratizing it to everyone, saying, oh yeah, we are all enabled and empowered to do this. It shouldn't be this one person. We learned that pretty early on, once we realized it was not helping our cause. The second thing was, the exercises cannot
be conducted without complete participation. When I say complete participation,
it doesn't mean complete participation of just the team. It has to be the
team's dependencies. Maybe the SRE teams,
maybe the other teams that support you in production. All of these
folks need to be signed up; they need to be committed to this cause. And that also takes effort. It also takes interest and passion from their side.
So this is something that we sort of found out over time, and we realized
that we have to sort of fix this. The other thing was
don't assume, verify. So this is sort of a
take on the same thing, right? So like trust but verify kind of thing.
But it always starts with assumptions, right? All of us in this room,
we know what the value prop of chaos engineering is, but you can't
go in with that assumption with everyone, right? So, for example, observability: if you ask a team, hey, do you think you can find an issue in 15 minutes, they'll probably say yes. I don't think there is any team that has said no to us or anyone else. Right. But you have to verify that by checking whether the observability truly is as good as they say it is. Right. No team, I would say, is as well instrumented as they think they are. On call, being reactive is always the problem. Right. You have to make sure that you test the on-call systems, and make sure that the on-call person has everything they need. That doesn't mean literally testing the on-call person, like, hey, I just created an issue, did you actually catch it? Not that way. But you want to make sure that they are empowered before they get into a bad situation.
The other thing is that specifying what the team's goals around resiliency are is crucial. Not every team sees it the same way. It will change team to team, it will change application to application. And those goals should be prioritized by that team; it is not something that you can centrally prioritize. You can say that loss of revenue should be minimized, but that is interpreted differently by different teams. And I think this is something that we learned, and it was a pretty painful learning, for the simple reason that we didn't really understand that teams already had a plan for how resilient they want to be. Instead, we were imposing a certain metric which, for them, was meaningless. Right. So that's something I would suggest folks have discussions
about. And obviously their deployment pipelines have to be verified.
But this, I would say, is something that is part of the exercise.
So, are teams ready for exercises? If you just start your chaos engineering group and you go to someone and you say, okay, are you ready to do this, they are going to be very reluctant. Maybe someone will say, yes, we will try it, but in reality, none of them are ready, right? So you have to make sure that you build the training, you teach them that it's okay to learn, but don't fail in such a way that you cause trauma to the team, where they say, I'm never going to do this again, this was a
bad idea. So in order to teach the right way to do things,
you have to treat chaos engineering as it's supposed to be. It's an experimentation process.
It's a process where you establish a hypothesis and you
prove or disprove it. It's a scientific experiment. It's not a,
let's just hammer some nails into this server
here, and let's see what happens. That's not what chaos engineering is. So I just
want to give you guys a sense of where we are now.
The report card reads really well. I would say that all of these learnings that you saw have obviously given us a lot to think about in terms of how we want to progress in the future, but they have only given us more will to do more. The application
teams are eager to run these tests, and because of the maturity model,
they've understood that we do not have to just go in and be glamorous
overachievers overnight. It has to be a slow process.
Management confidence is good because management
invests in this.
Again, in order for management to invest in this,
they need to understand that chaos is not the goal. The goal
is resiliency, but the way to get to that is chaos. Right. And this
has been repeated by many folks, not just me. There are other speakers who will repeat this as well.
Increased resiliency in engineering basically means that if your engineering team itself is resilient, it opens the way to subsequent learning. It encourages them to test things out even more. And frankly speaking, everyone defines an end state. All of the application owners define: okay, if something bad happens, this is what I want, right? But that's all aspirational; there is no verification. This now allows them to stand in meetings where maybe there are multiple high-level tech leads or senior management
and be able to say, okay, this is what my application does, and I'm confident
about it. And that's actually, I think, the best thing to come out of
it, because you want to empower your tech leads, your engineering managers and such,
so that they can stand their ground whenever they are talking to folks about
how to do this thing better or whatever. Right? Like, good design, good architecture,
all of that is improved by this. So this is where we are today.
And so all of the learning that we have done has led us to this. Going forward, what we want to do is build on these models, right? The maturity models that we have today, the ones I just described, were rough around the edges, right? There were things that we had not defined very well, like, as I said, the resilience of the individual person, the engineer. So we are trying to work on those things and make sure
that we commit to this in a way that is sustainable
and that's actually important for this kind of exercise.
That's the entire story that I wanted to share with you guys today,
and that's all I had. Thank you so much. If you have questions, I'm ready to hear them.