Transcript
Hi, I'm Piyush Verma. I am CTO and co-founder at Last9. This talk is an SRE diary. As SREs, we are trained to be on pager all the time. I mean, I've held a pager for almost a decade now, in fact from before the role was even called SRE. But this talk is about the opposite end of it, where we talk about all the times when the pager did not ring. And that's why I say: make a sound when it breaks. Why do I say that? If you look at these photos, most of the things in the upper half would make a sound when they break: starting from the sonic boom of an aircraft, to a balloon, to a heart, and then the RAM and the BIOS. Not sure how many of us would associate RAM breaking with making a sound, because I haven't seen modern computers do that. I've not even seen a BIOS screen in ages. So that could also be a case where people don't associate with it. But in an era prior to cloud computing, when these things would break, they would make a sound. So it was very easy to diagnose that something had gone wrong. Probably not the case anymore.
Most failures, the flavors of them, come in the form of software, human, network, process and culture. I'm going to talk about everything but software failures. A software failure is the easiest to identify. There will be a 500 error somewhere which will be caught using some sort of regular expression, some sort of forwarding rule. It will reach a PagerDuty, a VictorOps or some other tool of this kind, an incident management system which has been set up. And that would ring your phone, ring your pager, ring your email or something like that, which works pretty well. All the other failure flavors, starting from human, network, process or culture, are the ones which make for really interesting RCAs.
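As a rough illustration of that software-failure path, here is a minimal sketch, in Python, of the kind of log-scanning rule the talk alludes to. The regex, the log path and the webhook URL are illustrative assumptions, not anything from the talk; a real setup would use whatever forwarding rule your log pipeline or incident management tool already provides.

```python
# Minimal sketch: match 5xx status codes in an access log and forward them
# to an incident-management webhook. All names here are hypothetical.
import json
import re
import urllib.request

FIVE_XX = re.compile(r'"\s(5\d{2})\s')             # e.g. '" 502 ' in a combined access log
ALERT_WEBHOOK = "https://alerts.example.com/hook"  # placeholder endpoint

def scan(log_path: str) -> None:
    with open(log_path) as f:
        for line in f:
            match = FIVE_XX.search(line)
            if match:
                body = json.dumps({"status": match.group(1), "line": line.strip()}).encode()
                req = urllib.request.Request(
                    ALERT_WEBHOOK, data=body, headers={"Content-Type": "application/json"}
                )
                urllib.request.urlopen(req)  # ring the pager

if __name__ == "__main__":
    scan("/var/log/nginx/access.log")
```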
A software failure is very easy to identify: you find something in a log line. A culture failure, on the other hand, is something that you identify pretty late. And the interesting part is, as you go from top to bottom, you realize that the error, or the failure, that eventually shows up is more and more latent. A culture failure is the slowest to identify. A software failure is the easiest to identify. A human failure is slower than a software failure, but way faster than a culture failure. So I'm going to split this talk into a few parts where I speak of a few outages. Some of you may have heard this talk before, but this is a slightly different variation of my earlier talk.
And these outages traverse very simple scenarios which otherwise, as SREs, we would bypass, thinking: hey, I could use X tool for it, I could use Y tool for it, I could use Z tool for it. But those tools do not talk about the real underlying cause of a failure. I want to dissect these outages and go one level further to understand why it failed in the first place. Because as SREs, we have three principles that we follow. The very first thing is we rush towards a fire. We say: we've got to bring this down, we're going to mitigate this and fix it as fast as possible. That's the first one, obviously. The second question, a very important one that we have to ask ourselves, is: where else is this breaking? Because chances are, our software, our situation, our framework is a byproduct of the practices and cultures that we follow. So if something is failing in one place, there's a very high likelihood that it fails in another place as well. That's the second most important question we have to ask, because the intent is to prevent another fire from happening. And the third one, the really important one that we mostly fail to answer, is: how do I prevent this from happening one more time? Unlike product, reliability cannot be improved one feature at a time. It cannot be done one single failure at a time. A failure happens, I fix something, then another failure happens, I fix something. Sadly, actionability doesn't work that way, because customers do not give us that many chances.
The first outage that I want to speak about: customer service is reporting login to be down. We check Datadog, Papertrail, New Relic, CloudWatch. Everything looks perfectly okay. Most of these are hinting at the fact that everything looks okay: servers look okay, load looks okay, errors look okay. But we can't figure out why login is down. Now here's an interesting fact. Login, if it's down, means that it is not accessible. So for requests that don't reach us, our inside-out monitoring systems, which sit in the form of a sophisticated Prometheus chain, a Datadog chain, a New Relic chain, are not going to buzz either, because something that doesn't hit you doesn't create an alarm on failure. You don't know what you don't know. So what gives? Meanwhile, on Twitter, there's a lot of sound being made. I mean, surprisingly, I've often seen that my automated system alerts are slower than people manually identifying that something is broken. So on Twitter, there's a lot of noise happening. What was the real cause? After a bit of debugging, we identified that a DevOps person had manually altered a security group, accidentally deleting the 443 rule as well. Quite possible to happen, because ever since the COVID lockdown started, people started working from home. And when they work from home, you are always whitelisting some IP address or the other. And those cloud security group operation tabs usually allow you to make just one single press with which you can accidentally end up deleting a rule that you should not have, which results in a failure.
How do we prevent this from happening again? Well, the obvious second question that we asked was: okay, where else is this happening? We may have deleted other rules as well. But what's the real root cause? If we have to dissect this, it's not setting up another tool. It's certainly not setting up an audit trail policy, because, well, that could be one of the changes as well, that you set up an audit trail policy, but then somebody may miss having a look at it. I mean, the answer to a human problem cannot be another human reviewing it every time, because if one human has missed it, another human is going to miss something as well. Then what is the real root cause? The real root cause here is that we do not have this culture, or rather, we allow exceptions where things can be edited manually. Now, if these cloud states were being maintained religiously using a simple Terraform script or a Pulumi script, it is highly likely that the change would have been recorded somewhere and a rollback would also have been possible. But that wasn't the case, and we had these exceptions of being able to manually go in and change something, even if it was: hey, I just have one small change to make, why do I really need to go via the entire Terraform route? Because it takes longer. Because every time I have to run an apply operation, it takes around ten minutes just to sync all the data providers and fetch the real state, and only then does it tell me: here's a diff. So it gets in the way of my fast application of a change. Now, the side effect of that is that every once in a while we are going to make a mistake like this, which ends up resulting in a bigger outage. In this particular case, it wasn't just that /login was down. Imagine deleting the 443 rule from the inbound of a security group of a load balancer. It's not that just login was down; everything was down. It's just that login was what got reported at that point of time, because it was the endpoint one of the customers had complained about. So the real impact was larger, to a degree where we can't even tell how big the outage was, unless we now start auditing the request logs from the client-side agents, et cetera, which themselves are at times missing. So the real impact, how much business we lost, we don't even know. We just don't know anything about it. So the only way to overcome the situation is to build a practice where we say that no matter what happens, we are not going to allow ourselves to make these manual short circuits.
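One way to back such a practice with tooling, purely as a sketch and not something from the talk, is a periodic drift check that compares a load balancer's live security group rules against the ports declared in code. The group ID, the expected ports and the boto3-based approach below are my assumptions.

```python
# Hypothetical drift check: alert if the live security group no longer
# matches the ports we declared in Terraform (e.g. 443 silently deleted).
import boto3

EXPECTED_TCP_PORTS = {80, 443}        # what our Terraform config declares (assumed)
SECURITY_GROUP_ID = "sg-0123456789"   # placeholder ID

def live_tcp_ports(sg_id: str) -> set[int]:
    ec2 = boto3.client("ec2")
    sg = ec2.describe_security_groups(GroupIds=[sg_id])["SecurityGroups"][0]
    ports: set[int] = set()
    for rule in sg["IpPermissions"]:
        if rule.get("IpProtocol") == "tcp" and "FromPort" in rule:
            ports.update(range(rule["FromPort"], rule["ToPort"] + 1))
    return ports

if __name__ == "__main__":
    missing = EXPECTED_TCP_PORTS - live_tcp_ports(SECURITY_GROUP_ID)
    if missing:
        # In practice this would page someone; printing keeps the sketch simple.
        print(f"DRIFT: inbound ports missing from {SECURITY_GROUP_ID}: {sorted(missing)}")
```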
This still looks pretty simple and easy. I want to cover another outage, which is outage number two. This one is from when we were dealing with data centers; there weren't really these cloud providers here. Not that it changes the problem definition, but it's an important addition to the scope of the problem. Around 25 hours before a new country launch that we were supposed to do, PagerDuty goes off. We check our servers and we see that there are certain 500s which do show up in Elasticsearch. Now, because we are in a tight data center environment, proper log analysis isn't really a luxury we have, because you have to go over a firewall hop, et cetera. So one of us just decides to start copying the logs with a script, which comes in really handy a few minutes later. A few minutes later, what happens is that the 500s just stop coming in. PagerDuty auto-resolves. Everything looks good. Though, to our curious minds, we are still wondering: why did something stop working, and how did it auto-resolve? So we are still debugging it.
Five minutes later, PagerDuty goes off again, and again the public API is unreachable. We do get our Pingdom alerts; all the alerts that we had set up go off. It looks like something fishy, because this is right before a public launch. So we start checking Rundeck, because what is the most important question that we ask when something fails? The most important question I have asked when something stops working is: hey, who changed what? Because a system in its stable, stationary state doesn't really break that often. It's only when it is subjected to a change that it starts breaking. The change could be a change in time, it could be that something was altered, or it could be that a new traffic source was added. But it's one of these changes that actually breaks a system. We also ask ourselves this important question: hey, is my firewall down? Why is the firewall important here? Well, because it's a data center deployment, so one of the changes could have happened on the firewall as well, and inbound connectivity could have gone away. But that wasn't the case either. Okay, standard checklist. Check Grafana: nothing wrong. Check Stackdriver, because we were still able to send some data there: nothing wrong. Check all servers: nothing wrong. Check load: nothing wrong. Check Docker, has anything restarted? Nothing restarted. Check APM: everything looks okay. Now, in the logs we had copied, we start realizing that there were some database errors, but only on some of the requests, not everything, which looked really suspicious when we took a look at it. We had all the tools that we wanted: we had Elasticsearch available, we had Stackdriver available, we had Sentry available, we had Prometheus available, we had SREs available. One of the ways to solve a problem is to throw more bodies at the problem. We had really all the SREs that we wanted; we were a team of ten people,
but we couldn't find the problem. 20 hours later, after a lot of toil, we found that the mount command hadn't run on one of the DB shards. Now why did this happen? These were data center machines that we had provisioned, and one of the machines had come back from a fix the previous night. We had an issue where a mount command would mount a temporary file system, and the mount wasn't really persisting across a reboot. We had applied a fix across ten machines, but the machine that was broken at that point of time did not get that Ansible fix; we did not run Ansible on it. Just before the country launch, we decided: okay, let's insert this machine so that we have it available in case load arrives. So when the machine went in, it did not have that one particular fix that we had applied everywhere else. Data goes onto that shard as well. The machine wasn't fixed properly, so it rebooted again. And when it rebooted, that data was lost. It was just a slice of data. So every time requests would hit that slice of data, they would result in errors, which would make the load balancer see too many errors and cut the machine off, send a PagerDuty alert, and break the traffic. Then the health check comes in, and the health check obviously is not checking that data, so the machine would come back into circulation, and this cycle goes on and on.
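The flapping here comes from a shallow health check: it says the process is up, but never touches the data slice that is actually broken. A deeper readiness probe, sketched below under my own assumptions about paths and a sentinel record, would have kept that machine out of rotation; nothing here is from the talk's actual setup.

```python
# Hypothetical "deep" readiness check for a DB shard: the machine is only
# healthy if the data volume is mounted AND a known sample record is readable.
import os

DATA_MOUNT = "/var/lib/db-shard"                      # assumed mount point
SENTINEL_FILE = os.path.join(DATA_MOUNT, "sentinel")  # assumed known-good record

def is_ready() -> bool:
    # 1. Is the data volume really a mount point, not a directory on the root disk?
    if not os.path.ismount(DATA_MOUNT):
        return False
    # 2. Can we actually read a known piece of data from it?
    try:
        with open(SENTINEL_FILE, "rb") as f:
            return len(f.read(1)) == 1
    except OSError:
        return False

if __name__ == "__main__":
    # Exit code feeds the load balancer's health check (0 = healthy).
    raise SystemExit(0 if is_ready() else 1)
```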
Pretty interesting story, but what's the real root cause here? The real root cause, if you try to dissect this further, was not, in this particular case, that we were not running any automation; we had state-of-the-art Ansible configuration. If I ask myself, could I have avoided this outage? Probably not. What did that outage teach me so that it doesn't happen a second time? That is what is important. Most of the time you won't be able to avoid or avert an outage the first time, because they're so unique in how they happen. It's going to happen. But how can we prevent this from happening again? And when I say this, I don't mean this particular error, but this class of errors. The only way to solve that is to realize that a simple script could have done the job, if only we had a startup script or an osquery check which would just verify the configuration drift of the machine. That probably could have saved us.
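As a sketch of what such a startup check might look like (the expected mount table below is an assumption on my part; the talk only says osquery or a simple script would do), the idea is just: on boot, compare the machine's actual mounts against what configuration management says should be there.

```python
# Hypothetical boot-time drift check: verify that every mount Ansible is
# supposed to have configured is actually present in /proc/mounts.
EXPECTED_MOUNTS = {
    "/var/lib/db-shard": "ext4",   # assumed: data volume for the DB shard
    "/var/log/app": "ext4",        # assumed: persistent log volume
}

def actual_mounts() -> dict[str, str]:
    mounts = {}
    with open("/proc/mounts") as f:
        for line in f:
            _device, mount_point, fs_type, *_ = line.split()
            mounts[mount_point] = fs_type
    return mounts

def drift() -> list[str]:
    seen = actual_mounts()
    problems = []
    for path, fs_type in EXPECTED_MOUNTS.items():
        if path not in seen:
            problems.append(f"missing mount: {path}")
        elif seen[path] != fs_type:
            problems.append(f"{path} is {seen[path]}, expected {fs_type}")
    return problems

if __name__ == "__main__":
    issues = drift()
    for issue in issues:
        print("CONFIG DRIFT:", issue)
    raise SystemExit(1 if issues else 0)
```

The osquery equivalent would simply select from its mounts table; the point is only that the check runs before the machine takes traffic.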
Now, one of the big questions that we ask while something is failing is: how was it working so far? This is one of the most interesting questions. And the next question is: what else is breaking? On the lines of these two outages, I want to cover another one, which is outage number three.
Outage number three is about a distributed lock which was going wrong in production. We used to have a sort of hosted, managed Terraform which would allow multiple jobs to run at the same time, and there was a lock in place. But despite the lock, multiple jobs would come in, and the contention on the lock was failing, almost to a degree that it looked like this: there was a lock, but it was pretty useless. So, to define the ideal behavior of the lock: we were using etcd for cluster-wide lock maintenance, and the standard compare-and-swap algorithm was being used. How does the compare-and-swap algorithm work? If I just quickly walk through it: I set in a value, which is one. Using the previous value of one, I set in a value of two. Everything works well. If I now try to put in a value of three as a fresh key, it will say the key already exists, because it already exists. And if I try to set a three with a previous value that no longer matches the current one, it will say the compare operation failed. So it's a pretty standard compare-and-swap API, and it works.
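To make that walkthrough concrete, here is a tiny in-memory sketch of compare-and-swap semantics. It is illustrative only; it is not the etcd client the team actually used (etcd v2 exposed this behavior through prevValue and prevExist parameters on its key API).

```python
# Toy in-memory store showing the compare-and-swap behavior described above.
class CasError(Exception):
    pass

class TinyStore:
    def __init__(self):
        self._data = {}

    def set(self, key, value, prev_value=None, must_not_exist=False):
        if must_not_exist and key in self._data:
            raise CasError("key already exists")
        if prev_value is not None and self._data.get(key) != prev_value:
            raise CasError("compare failed")
        self._data[key] = value

store = TinyStore()
store.set("state", "1", must_not_exist=True)      # fresh key: OK
store.set("state", "2", prev_value="1")           # CAS 1 -> 2: OK
try:
    store.set("state", "3", must_not_exist=True)  # fails: key already exists
except CasError as e:
    print(e)
try:
    store.set("state", "3", prev_value="1")       # fails: current value is "2", not "1"
except CasError as e:
    print(e)
```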
But what was really happening here? We started with a default key whose value is "stopped", and both processes try to set the key to "started" with a previous value of "stopped". The expected behavior is that only one should win: the previous value is "stopped", I set a new value of "started", and only one process should go through. What's really happening is that run A acquires the lock, and with a TTL. The TTL is very important, because every time you have a lock-based system, in the absence of a time-to-live expiry, what ends up happening is that the process which acquired the lock may die. So it's extremely important that the process which has set the lock also sets an expiry with it, so that in case the process dies, the lock is freed up and is available for others to use.
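A minimal sketch of that TTL-protected lock pattern, assuming the python-etcd3 client against etcd v3 (the talk's setup used the older etcd v2 TTL keys, so treat the names and the API as my illustration, not theirs):

```python
# Sketch: a lock whose key carries a TTL, so a crashed holder can't wedge everyone.
import time
import etcd3  # pip install etcd3

def run_terraform_apply():
    # Placeholder for the real job guarded by the lock.
    time.sleep(1)

client = etcd3.client(host="127.0.0.1", port=2379)

# python-etcd3's Lock keeps its key alive on a lease with the given TTL,
# so if this process dies mid-run, the lease expires and the lock frees itself.
with client.lock("terraform-apply", ttl=10):
    run_terraform_apply()
```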
That part works well. The only difference being that when A tries to update the status, we get "key not found". B tries to update the status again, and we get "key not found" again, which is very weird, because the process which had set the lock should be able to unset it and free it, or at least both of them should be able to see the contention. But this is very unique here, because both of the processes say: I have not found the key. A says key not found, B says key not found.
So this is where we all become full Stack Overflow developers. We start hunting our errors, we start hunting for a diagnosis, and we land on a GitHub page, a very interesting one, which mentions that etcd has a problem where TTLs expire too soon, related to etcd leader election. It looks like a fairly technical and sophisticated explanation, and we say: all right, that makes sense, this must be it. So a GitHub ticket on an open source project pretty much tells us that, hey, etcd is the problem. Our solution is: well, we have to replace etcd with Consul, for reasons of a better API, better state maintenance, et cetera. And we almost convince ourselves that this is the right way to go. We make the change; it's a week-long sprint. We spend a lot of time and effort at replacing it, and things go back to normal.
Is that the real reason, though? One of us decides: look, if this was the reason, how was it working earlier? That's the most important question, right? We're not convinced that this could really be the explanation. So we dig deeper, and one of us really dug in deep to find out that the only problem that existed on those servers was that the clock was adrift. One of the machines was running behind the other machines of the cluster in time. Now, why would this fail? Because the TTL that the leader was setting at that point of time was very short, shorter than the drift. By the time the rest of the cluster got that value, the TTL had already expired. So the subsequent operations that went into the cluster would say that the key is not found. Something as elementary as a basic checklist. This is something that we all know: clocks are important, NTP is extremely important. But this is where we forget to have basic checklists in place. And we almost ended up convincing ourselves that, hey, etcd was the problem, and we should replace it with an entirely new cluster solution based on a new consensus algorithm. It worked, but it just wasted so many hours of the entire team. Such a simple thing. I mean, it just baffles me today that a very elementary thing, the first thing we learn when we start debugging servers, is that a time drift can cause a lot of issues. And this is the first time that we really experienced it. It's one of those bugs that you hit every five years and then forget about. But what was the real reason here? The real reason was not automation. The real reason was not any SRE tooling. It obviously wasn't running on Kubernetes, which many a time I find offered as a solution. It was that a simple checklist of our basic first principles could have done the job, a very simple one which asks: is an NTP check installed or not? And that could have saved this.
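As an illustration of that kind of check (the thresholds and the use of timedatectl and ntplib are my assumptions; the talk only asks whether NTP is installed and active):

```python
# Sketch: fail loudly if the host is not NTP-synchronized or its clock
# has drifted more than a small threshold from a public NTP server.
import subprocess
import ntplib  # pip install ntplib

MAX_DRIFT_SECONDS = 0.5  # assumed tolerance; pick what your TTLs can survive

def ntp_synchronized() -> bool:
    # systemd-based hosts expose this via timedatectl.
    out = subprocess.run(
        ["timedatectl", "show", "--property=NTPSynchronized", "--value"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip() == "yes"

def clock_offset() -> float:
    # Offset (in seconds) between this host and an NTP server.
    response = ntplib.NTPClient().request("pool.ntp.org", version=3)
    return response.offset

if __name__ == "__main__":
    ok = ntp_synchronized() and abs(clock_offset()) < MAX_DRIFT_SECONDS
    raise SystemExit(0 if ok else 1)
```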
And that's the real root cause behind it. And this is what I want to highlight here: most of the time, our SRE journey assumes that another monitoring system, another charting system, another alerting system could have saved us from either of the problems we just described. And these are failures with real business cost, because each one of them led either to a delay in our project or to a downtime which really affected the business. This isn't about SLOs. This isn't about SLIs, because that's typically the material we find when we read about SRE on the Internet. This is about getting the first principles right. The first principle of: hey, we're not going to make any manual changes to a server, no matter what. And if the reason is that we are being too slow in applying changes, if Terraform is too slow, we need to learn how to make it faster; but we are not going to violate our own policy about manual changes. It's not about anomaly detection or any fancy algorithms there. The second outage is probably a derivative of the fact that we did not have a simple configuration validator. A simple, basic tool like osquery, which runs at the start of a system and checks whether my configuration has drifted, could have been the answer. The third one is the simplest one, but the toughest of all. We spent our time checking timestamps in Raft and Paxos, but we didn't even have to go that far. The answer was that our own server's timestamp was off. It could have been saved if there was a simple check which asks: hey, is NTP installed and active on this system? That could have been done with a very simple, elementary bash script as well. The fact that we don't push ourselves to ask these questions and instead look for answers in these tools is where, most of the time, we forget that what we see as errors are actually a byproduct of things that we have failed to do, or, failed is perhaps too strong a word, things we have ignored to take care of at the start of the scaffolding of how we build this entire thing. Which takes me to outage number four. This isn't a real outage; this is the one that you are going to face tomorrow. What will be the real root cause behind it? The real root cause behind that one is that we didn't learn anything from the previous one. A very profound line from Henry Ford: the only real mistake is the one from which we learn nothing. And that would be the reason for our next outage.
Why is this important? Using all these principles is what we do at Last9. What we have built is the Last9 knowledge graph. Our theory, our thesis, says that hey, all these systems are actually connected, just like the World Wide Web is. If you try diagnosing a problem after a situation has happened, and there is no way of understanding these relations, it's almost impossible to tell the impact of it. For example: my S3 bucket permissions are public right now, and this is something that I want to fix. While I know the fix is very simple, I do not know what the cascading impact of that is going to be on the rest of my system. This is extremely hard to predict right now, because we don't look at it as a graph, we do not look at it as a connected set of things. And this is exactly what we're trying to solve at Last9, and that's what we have built the knowledge graph for: it goes through the system and builds these components and their relationships, so that any cascading impact is understood very clearly.
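Purely to illustrate the idea of cascading impact over such a graph (the components and edges below are made-up examples, and this uses networkx rather than anything Last9 has published):

```python
# Toy dependency graph: edges point from a component to the things that depend on it,
# so the set of descendants of a node is its potential blast radius.
import networkx as nx

g = nx.DiGraph()
g.add_edge("s3://assets-bucket", "cdn")         # CDN pulls from the bucket
g.add_edge("cdn", "web-frontend")               # frontend serves assets via the CDN
g.add_edge("s3://assets-bucket", "report-job")  # nightly job reads the same bucket
g.add_edge("web-frontend", "login-endpoint")    # login page needs the frontend

changed = "s3://assets-bucket"
print(f"Changing {changed} can cascade to:", sorted(nx.descendants(g, changed)))
# -> ['cdn', 'login-endpoint', 'report-job', 'web-frontend']
```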
To give an example of how we put it to use: we load the graph and try to find out, hey, what does the spread look like for my particular instances right now? If I run that query, it quickly tells me that the split is 4, 4 and 3. Clearly there is an uneven split across one of the availability zones, which may matter if one of them were to fail. Well, in this particular case, because I wanted an odd number of machines, this is a perfect scenario.
But was this the case a few days back? Clearly not. What did it look like? It was two, three and three, which adds up to an even number and is probably not the right way to split my application.
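A rough sketch of that kind of spread query, assuming boto3 and a tag filter of my own invention rather than the actual Last9 graph query:

```python
# Sketch: count running instances per availability zone and flag uneven spreads
# or an even total (bad for quorum-style deployments).
from collections import Counter
import boto3

def az_spread(service_tag: str) -> Counter:
    ec2 = boto3.client("ec2")
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "tag:service", "Values": [service_tag]},  # hypothetical tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    spread = Counter()
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                spread[instance["Placement"]["AvailabilityZone"]] += 1
    return spread

if __name__ == "__main__":
    spread = az_spread("login")
    total = sum(spread.values())
    print(dict(spread))
    if total % 2 == 0 or max(spread.values(), default=0) - min(spread.values(), default=0) > 1:
        print("WARNING: spread looks unsafe:", dict(spread))
```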
Being able to answer questions like these quickly is what allows us to make these systems reliable. That's what we do at Last9. We haven't open sourced this thing yet, but we are more than happy to help: if you want to give it a shot, if you want to try this out, just drop us a line, and we are extremely, extremely happy to set that up. And obviously there's a desire to put this out in the open source domain once it matures enough. That's all for me.