Transcript
Hello everyone. I'm joined today by Erin McEwen and Charity Majors. Erin comes from Zendesk. She is the director of engineering resilience there. She has a lot of experience in crisis communications, leadership, and disaster recovery. She spent some time at Google and Salesforce in both the risk management and business continuity worlds. Charity Majors is the co-founder and CTO of Honeycomb, the observability platform for finding critical issues in your infrastructure.
So today we're going to be talking about one of my favorite topics, and I know one of their favorite topics as well: incident management. We all come to incident management from different parts of the spectrum. Charity has a lot of insight into before the incident happens and what you're doing during the incident.
Erin comes from a world where she is really actively communicating
with customers and also communicating with the folks in her
organization to understand what is happening in the incident and understand what
is happening afterwards. And they're both leading organizations doing
this as well. And I'll quickly intro myself. My name
is Nora. I'm the founder and CEO of Jeli.io. We are a post-incident platform, but we kind of cover the whole stack as well, from responding to the incident to diagnosing
it to understanding it afterwards.
So thank you both for being here with me today. Thanks so
much for having me. This is going to be super fun. Yeah, same. Super happy
to be joining the conversation. Awesome. So I'm going to first ask Erin
a question. I want to understand a little bit more about your leadership
style when an incident happens at Zendesk today. Yeah,
sure. So I think what's interesting is I'll give a little context on the parts of how we operate in an incident at Zendesk. I have a really amazing global incident management team, and we also have different roles that are constantly in an incident: my incident managers, we have an eng lead on call, we also have, you know, our advocacy customer-facing communication role, as well as our ZNOC, which is all about the initial triage, and then all of our engineers that get pulled in. Wait, did you say ZNOC? Yeah. So Zendesk NOC. I love that. So my first ever management experience was at Linden Lab, where I founded the DNOC, the distributed NOC.
So I was super tickled. I don't think I've ever heard of another company doing that. I actually managed them for a little bit of time, but they've moved over to different leadership. But we stay very close. We're very close with our ZNOC buddies. But yeah, I think
what's interesting is during an incident, especially when I'm getting involved,
because a lot of our incidents, those are managed at the
incident manager level. We have our eng leads on call
that come in. If I'm there, it's a little bit of a different situation
because that means that it's probably a pretty big one or something that has
high visibility. And so a lot of what I'm there to do is to make
sure that I'm supporting the team that's responding and keeping them
in scoping those things. Right. But I really serve as a
conduit to our leadership. It's really about making sure that boots on the ground are able to do exactly what they need to do.
The executives that need to be informed are informed and pulled in at
the right increments, in the right timing. So getting that alignment is really
important. And I think that having
that calm demeanor and making sure that you're managing up
in these situations so that you can protect the folks that really
need to be paying attention to what they're doing right in front
of them is really important during an incident. Totally.
The emotional tenor really gets set by whoever is in charge
of the incident. If you're just, like, flailing around and just, like, reacting to everything, everybody in the company is going to freak out.
And if you can project calmness, but intensity.
Right. Like, you don't want people to just be like, oh, nothing's wrong. You want
them to be aware that something's happening, that it's not normal,
but that it's under control.
I think part of the key thing is that people need to
feel like they can be confident that they will be pulled in if needed,
so that the entire company doesn't just have to hover and look,
am I needed there? Am I needed there? Do I need to respond here?
No. In order for everyone to be able to stop hovering, everyone has to
have confidence they will be pulled in if they are needed or if they can
help. Yeah. And I think one of the things that I will say, too,
is that part of being able
to come in and have that calmness also is part of
having a really strong program around incident management, having good structure in how your tooling operates, and making sure that your folks are educated on their roles and responsibilities.
We also have crisis management that is the next level above that. So I
think that at Zendesk, I feel very
fortunate because everybody jumps into an incident,
they're ready to go. They are happy to help out. They want to know what
they can do. They're contributing in the space that they can.
And I think that's a very important piece
of it makes it a lot easier to be a leader in those situations
when you have folks like that that are jumping in and really wanting to just
collaborate and figure out what's going on.
Totally agree.
It sounds like the way you handle it is a lot more of an art
than it is a science, too. And it's just really keeping
a pulse on what's happening, making sure you're really present
in that moment. How do you prepare folks
for those situations as well before you're in them? Yeah,
I think it's a great question. There's a couple of different layers for that.
So we actually are right now
doing a lot of training. We've got folks that
have been around for a long time. We realize we have a lot of new
people. So my incident management teams have been doing some regional
training. They did AMER recently, and they're in the middle of doing EMEA, and we're getting amazing feedback from that. I think for both newer and older folks, it's good reminders, but it's also, at the same time, really great education. And I think we do
other things such as shadowing. We make sure that folks that are getting onto new
on calls, that they're doing
a couple of sessions where they're following along and shadowing during someone else's
on call so they can learn from that. Do you do primary,
secondary? Yeah. So it depends on the team.
So the structure can
vary a little bit depending on the size of the team, the criticality
of what services they are over, things like
that. So just a little bit of variance here and
there. But the one last thing that I would say is we do
a lot of exercising.
We have our chaos engineering team that does that. They have a whole suite of tests that we practice. It's really important for us to operate in that exercise as if it's an incident.
We have an incident manager that participates in every single one of those exercises.
We use as much of our tooling as we possibly can to make it realistic and to get in and do all of those things. I think that's the space where we get the most value. And I will just say that trying to expand on that a little bit wider is something that we're looking at across Zendesk, just because we learned
so much from it. How time consuming is that? I think that we have one primary one. Zendesk is an AWS shop, and so, for example, we have AZ resilience testing that we do.
We do an AZ resilience failover exercise, I think it's on
a quarterly basis. Right. You know, that takes
a lot of prep work, because usually what we're doing with that depends on what incidents we have had, what areas do we need to press on, even using AWS incidents. You have AZ resiliency; do you have region resiliency for data? We have it for data, but we're not, like... Got you. Yeah, we're not across multiple regions per account. I'm curious, because
it can be incredibly time consuming. And you're right, the payoff can be
enormous for companies that are at the stage where it's
worthwhile. But where would you say is the point at which it's worth
starting to invest in that? Because it is pretty time consuming,
and I think it depends on the commitments that
you've made to your customers. What thresholds would
you look for to say, okay, you should start really doing this for real as
a driver? Yeah, I think that's a really great question, and I think that
there's a couple of different inputs to that. First of all,
starting as small as possible, tabletop exercises are a great
way to go if you don't have a lot of time to do things.
If you're not able to prepare your pre prod environment,
to be able to actually put hands on keyboard and simulate things,
you can start at a much lower level and build on top of that.
I think people think right away they've got to go in and
do the big bang. Right. At what stage
should they start doing that?
I think it depends on the company, because I think that there's a different desire
for whatever. Absolutely right. What characteristics
or something would you look for in a company to be like,
okay, it's time for you to really start investing in these affirmative,
active sort of tests? Yeah. So I'm of the
persuasion that I think that companies right now are not starting
early enough with getting incident management practices
and teams and structure and testing in place.
I hate to be annoying about this, I super agree.
But specifically for regional failovers or AZ failovers, that's a big gun in my opinion. Where would you say that comes in? And it's expensive, super expensive. It's not a small cost. You can't just go, oh, everyone should do it as soon as they can, because that's not actually true. Yeah, but it makes total sense
for Zendesk to do it if you think of the nature of what they're offering
and what their customers are going to say when they have incidents. If Zendesk
tooling goes down and one of their customers is in the middle of an
incident, they're not able to resolve their incident.
And so I feel like the urgency of doing what Erin and team are doing becomes higher in that case. You know, we're trying to think about it on a weighted spectrum, but, yeah, you're right.
Like, for some organizations, it probably is not worth the time. I think right now there might be no good answer, just because... I think that, again, that's why, kind of like what Nora is saying, we're a critical
service for our customers.
They depend on us to be able to deliver their customer
support to their customers. So if we're impacted and they're unable
to support their customers, that reflects on them. They don't necessarily know, down the line, that it's a Zendesk incident that's causing that customer pain.
Right. So we have an obligation and the pressure,
too, to actually minimize as much as possible of that impact.
The tolerance these days is a lot lower for that, right?
Yeah, it used to be like, oh,
AWS is down, everybody just go out for coffee. It's not like that.
I mean, I agree. I don't know if it definitely is not
a one shoe fit, all type of a timing, but I do think
that there are baby steps that people can take in understanding what
their business needs are. And again,
the first question that I would ask is, what's the tolerance from the business
for downtime or for impact? I think that that
question is one to put back to the business and to the leaders and say,
what is our tolerance for this? Are you okay? If we're down for this amount
of time, do we have financial commitments? Do we have just, like, word of
mouth commitments? I think that for me, the two things that I would say that
every company should do pretty much across the board, if you have customers,
you should do this. Number one, something is better than nothing.
If it's literally like your homepage is just returning a 500 or a 502, that's unacceptable for anyone these days, at least. If your storage, if your back end writes are failing, cool: try to degrade gracefully enough that you give your customers some information about what's going on and you return something. Right. You should return something, and you should handle as much as possible without making false promises about what you've handled.
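To make the return-something idea concrete, here is a minimal, hypothetical sketch of a handler that degrades gracefully when its backing store is unavailable. The fetch_dashboard_data function, the status page URL, and the fallback message are invented for illustration; this is not code from Honeycomb, Zendesk, or Jeli, just one way the advice could look in practice.

```python
# Hypothetical sketch: serve a degraded-but-honest response instead of a bare 500.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

STATUS_URL = "https://status.example.com"  # assumed status page, not a real one

def fetch_dashboard_data():
    # Stand-in for the real backend call; raises when the data store is down.
    raise ConnectionError("backend writes are failing")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            body, status = json.dumps(fetch_dashboard_data()), 200
        except ConnectionError:
            # Return *something*: tell the user what is going on and where to look,
            # without pretending the request was fully handled.
            body = json.dumps({
                "degraded": True,
                "message": "Live data is temporarily unavailable.",
                "status_page": STATUS_URL,
            })
            status = 503  # or 200 with a degraded page, depending on the contract
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("", 8080), Handler).serve_forever()
```

The design point is just that an honest degraded response beats an unexplained error page.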
And number two, I think to your point about tabletop exercises, I think that as soon
as you're mature enough to have a platform team, an Ops team,
whatever team you're mature enough to have a meeting
at least once a quarter where you look at, okay,
look at your risk path, right? Your critical path. Like, what are
the elements in there? How can they fail?
And what is the limited thing that you can always do if
whatever component goes down? What's your path? What's your story?
What do you try and return? Like at Parse, we had all of these replica sets. The minimum thing that we could do, if almost everything was down except this one AZ, was we could at least return a page that said blah, blah, blah, or whatever. But you should think
through what is my end user's experience going to be like
and have an answer for it and try to make it as good as possible,
because often that takes very little work. You just have to think
about it. And I think you both have touched on the
real benefit behind this is coming up with what is a
good tabletop exercise to run. Right.
It makes sense for Erin and her team to run these region
exercises, because those are issues they've experienced before. But what you're getting at, Charity, is just thinking about your user
lifecycle and thinking about what matters to your users,
like returning something or
thinking about what they're going to experience at that moment, or thinking about what the
time of day is for them, or thinking about when they're most using your
platform. Those are all things to think about. It probably doesn't make sense
for most orgs to just take a standard regional failover exercise, but it does make sense for them to think of something. And I think a lot of the value in it is actually
taking the time to work together to think of that thing. What is
the best for us to do right now? And really digging
into what that means for everyone and why. Facebook got 15 years in. And whatever about
Facebook, but they were 15 years old as a company
before they started actually investing in being able to shut down a
data center. And that was driven not by them wanting to be able to do
this, but by the fact that their original data center was shutting down.
And so they were like, well, we got to figure this out anyway. Let's make
it so that it's a repeatable process. And then kind of like what you were
saying, Erin, like once a quarter from there on out, they do these
highly prepared, shut down a region failover sort of practices just for
hygiene. But, yeah, I think it's easy to say, oh, you should do this, it's easy. But when you sit down and think about it, it's costly, it's hard, it's going to take resources away. And so you should be thinking pretty narrowly about what is the least you can do. What's the 20% solution here that gives 80% of the benefit? Exactly.
And again, I do think it's important to know that what I'm
speaking about has been years of work. I mean, I've been at Zendesk for over seven years. We started little bits
and pieces of incident management when I came on board and
building out our resilience organizations, our reliability organization,
there's been a lot of time and effort. What does your resilience
organization look like, if you don't mind?
Yeah, so I actually have a very interesting team under
resilience. I've got our technology continuity organization,
which is all about our disaster recovery and continuity
of our technology. And we have
incident management, which I've mentioned already. And then I also
am responsible for business continuity and crisis management at Zendesk.
So not only are we focusing on the engineering pieces
of things, but we also are responsible for what happens with
business disruptions to the overall business operations.
And then there's one other part of my team. I've got our
resilience tooling and data, and they are responsible for... the engineers are awesome, and they've built all of the
in house tooling that we have for incident management,
and they support the integrations and things like
that for the other tooling that we leverage outside of Zendesk.
And then we have our data team, which is newer. We've been like, data! Obviously, data is the lifeblood these days.
That team is about a year old. But really doing
the crunching of the data that we have related to incident management and
the impact that we have as a business and bringing those back to the business
so that we can make critical decisions.
So it's an interesting,
wide breadth of responsibility when it comes to disruption.
And the alignment there makes quite a bit of
sense for the Zendesk organization. Thank you.
Yeah, that's awesome. And I feel like what you're getting at
is a lot of, I think, the real value of SRE, which is
understanding relationships and organizations like technology
relationships, human relationships, all kinds of things. And it becomes invaluable
after a while. Charity. I'm going to turn the question
over to you. Can you tell me a little bit about your leadership
style when an incident happens,
and this could be your philosophy towards how folks should
handle incidents in general? It can be anything.
Yeah. These days, I don't respond to incidents at
all. I think I am very high up in the escalation chain; it fails over past the front end engineer, and then the back end engineer, and then the successive engineer, and then something, something. But I maybe get a page once a year, and it's usually an accident,
which I feel
like the people that we have are so amazing. Like fucking
Fred and our SRE team and Ian, and they're amazing.
And so I can't really take any credit for all of the amazing stuff
that they're doing these days. I feel like I can take some credit for the
fact that we didn't make a bunch of mistakes
for the first couple of years. And so, you know how there's a stage that
most startups go through where they're like, whoa, we're going through a lot
of growth, and suddenly everything's breaking. Let's bring on someone who knows something about operations.
We never went through that. In fact, we didn't hire our
first actual SRE for four years. Then he was
there for a year, and then he left, and then we didn't hire another for
another year. We really
have the philosophy of engineers write their own code, and they
own their code in production, and for a long time,
that was enough. Now these days it's not. And much
like Erin and much like you, Nora, we have a service that people
rely on above and beyond their own monitoring,
right? And above and beyond their own systems, because they need to be able to
use it to tell what's wrong with their own systems,
and which I know we're going to talk about build versus buy a little later
on. And this is where I have to put my plug in: you should never, as an infrastructure company, put your monitoring and your observability on the same fucking hardware, the same colo, as your own production system, because then when one goes
down, you've got a million problems, right? They should be very much
segregated and separated, whether you're buying something or
building it yourself. But TL;DR,
I think we did a lot of things early on
that were helpful. For a long time, we only
had one on call rotation, and all the engineers were in it.
One of my philosophies about on call is that the rotation should be somewhere around seven people, right? Because you don't want it to be so short that everyone's getting burned out. You want them to have like a month to two months between on call stints, right? You want to give them plenty of time to recover, but you don't want to give them so long that they forget how the system works, or so long that the system changes out from under them. And when your system is changing rapidly, more than two months is just too long.
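As a rough worked example of that sizing rule, assuming one-week shifts (an assumption; the shift length isn't stated here and this isn't a documented Honeycomb policy), the arithmetic looks something like this:

```python
# Back-of-the-envelope check of the "about seven people" rule, assuming weekly shifts.
WEEKS_PER_MONTH = 4.33

def months_between_stints(rotation_size: int, shift_weeks: float = 1.0) -> float:
    """How long one engineer goes between on-call stints."""
    return (rotation_size - 1) * shift_weeks / WEEKS_PER_MONTH

for size in (4, 7, 10):
    print(size, "people ->", round(months_between_stints(size), 1), "months between stints")
# 4 people  -> ~0.7 months: too frequent, people burn out
# 7 people  -> ~1.4 months: inside the one-to-two-month sweet spot
# 10 people -> ~2.1 months: long enough that a fast-changing system may have drifted
```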
Now we have multiple on call rotations, of course.
But I guess the main philosophical thing
that I will point out here is that the
way that I grew up dealing with
outages and monitoring and everything, they were very much intertwined.
Right? The way that you dealt with outages was the way that you debugged
your system and vice versa. And really,
when you were dealing with a lamp stack or a monolith, you could look at
the system, predict most of the ways it was going to fail, write monitoring checks
for them. Like, 70% of the work is done because all of the
logic was tied up inside the application, right. It would fail
in some weird ways. You tease them out over the next year or so,
and only like a couple of times a year, would you really just be completely
puzzled, like, what the hell is going on? It didn't happen that often,
but these days it happens all the time. It's almost like
every time you get alerted these days, it should be something genuinely new,
right? We've come a long ways when it comes to resilience and reliability.
And when things break, we typically fix them so they don't break anymore.
Right. It should be something new. And now you've got microservices
and multitenancy, and you're using third party services and platforms
and everything, and it's just wildly more complex. Which is why I
think you have to have engineers own their own code in production. But I also
feel like the processes and the tools and the
ways we think about debugging,
you need to kind of decouple them from the process of
the sites down. We have to get it back up, just the emergency sort of
stuff, because you can't look at it and predict 70% of the ways it's
going to fail. You can't predict almost any of the ways it's going to fail.
You can write some checks, to Erin's point, but that's not
going to save you much because the system is going to fail in a different
way every time. And this is where dashboards
just become kind of useless, right? If you're used to just like,
eyeballing your dashboards and going, it's that component or it's that metric
or something that's not good enough anymore. And this is where
obviously, everyone's heard me ranting about reliability, about observability, for a
long time now. But I think that the core components of observability you
can really boil down to. You need to support high cardinality,
high dimensionality, and explorability. And the shift
for engineers is going from a world where you had fixed dashboards
and you just eyeballed them and kind of pattern matched. Oh, that spike
matches that spike. So it's probably that that's
not enough anymore. You can't do that. And the shift to moving to a more
explorable, queryable world where you're like, okay, I'm going to start at the edge.
Break down by endpoint, break down by request
code, break down by whatever. Or if you're using BubbleUp, you're just kind of like, okay, here's the thing I care about.
How is it different from everything else? But it's, like, getting over that hump of learning to be more fluid and using your tools in a more explorable way. It's different, and it's a really
important mental shift that I think all engineers are having to make
right now. Yeah, absolutely. And you mentioned earlier you
didn't really have a philosophy. You haven't responded to an incident in
a while, but it's like, that's a philosophy. Yeah, that's true.
In your organization, there was a point when you last responded to an incident deep in the weeds, and I wonder when that point was for your business. Right. Obviously, the business shifted in
its own way from a market perspective and things like that,
and it wasn't maybe the most
valuable use of time. And by
doing that, you're also giving everyone else a chance to do what they
do best at, which is what you mentioned. Right. Which is an amazing
philosophy on its own, is like giving people the time and space to
become experts and to build their expertise. Yeah, well, that was
pretty early on for us, and it wasn't an intentional thing so much
as it was. I was CEO for the first three and a half years.
I was constantly flying.
I couldn't do it, and the other engineers had
to pick it up. And before long, there was so much drift that I wasn't
the most useful person. Right. These are complex, living, breathing systems
that are constantly changing, and it doesn't take very long
for you to not be up on them anymore. But you'd be surprised how many leaders don't necessarily let go. That's a mistake. That's a big mistake.
I love responding to incidents.
I love them. That's my happy place. I have ADHD,
which I just found out about, like, two years ago. So I never understood why
everyone else is freaking out about incidents. And that's when I get dead calm and
I can focus the best. The scarier it is, the better I can focus. And I'm just like, whee. So that's really hard for me
to want to give up because I suddenly don't have any times where I really
feel like I'm useful anymore. But you have
to. You have to, because just like you said, you have
to let your team become the experts. If you cling to that,
you cling to it too hard.
You're just preventing them from being the best team that
they could be. I will give you that. And they won't feel like you trust them. That's the thing. Like if I show up every time, they're just going to be like, she doesn't trust us. I was going to say, or that they're not up for the job anymore.
It's like there are a number of different things. Sorry, I didn't mean to interrupt
you guys. I was just going to say that I have an old
leader back in the day. He was incredible and very technical and very smart, and I would have to have side conversations with him and
be like, people love having you in there, but you
also are preventing them from learning this. Let's wait.
I know we want to get this taken care of, but give them that little bit of room. Because it does have a different reaction
than people all of a sudden are like, oh boy, they're here. They're going to
tell us what to do. We'll just wait to be told. Instead of saying,
actually, I know this, and this is how I should handle it. We fixed this
before. You could have a more junior engineer
in there that's never worked with this SVP before.
And they're like, whoa, I'm not speaking up right now. Yeah,
totally. It removes a sense of safety, it adds an edge of fear.
And the thing is that when you're dealing with leaders like this, and I've been
one before, you just have to look them in the eye and shake their shoulders
and say, it's not about you. It's not about you anymore.
You are not as clutch as you think you are. And it's not your
job to be as clutch as you want to be. You need to step aside
and let the next generation. They'll fumble it a little bit. It might take them
a little longer the first time, but they'll get there and they have to.
I was in an organization that was like, it was a public
facing company. It was used by so many different folks
in the world. And still, even after all that,
the CTO would jump into incidents and as
soon as they joined the channel, everybody stopped talking. Well, no, it would go from one person talking to several people talking, and the time between people hitting enter increased.
This person didn't realize it, but it was almost like we had to
take them through it. It's just showing them how
the tenor of the incident changes. And it's not even just people
that high up. It's certain senior engineers, too. I think also
as leaders, we have to be
really intentional about
what we're rewarding and how things are happening,
because I also saw at that organization, as well, very senior engineers, even when they entered the channel, they would come in and just save the
day. There was one engineer that when they joined, everyone reacted with the
Batman emoji, and they were like the most principal engineer.
And I saw more junior engineers acting like this person,
and they were hoarding the information. They had their own private dashboards, because what we were incentivizing was this culture where you came in and saved the day as well.
And it was kind of fascinating. Yeah, I think,
Nora, super real. One of the things along
that line is that obviously
we have Jeli that we've been using, and one of
the things that is really shining through there is where we've
got those folks that the incident is called. They're hopping in there. They're not on
call, and they're doing work. And there's
a lot of opportunity to learn about that because
there are situations where maybe we don't have the right page set up
for that particular type of event, or there's
many different elements that can contribute to that happening. But one of them is this: when shit really hits the fan, I bet you all have your top four or five engineers that you're pinging on the side, being like, hey, are you watching this? And holding them in reserve, because you know that they're going to be able to come in and really help.
But at the same time, there's just so much to learn about
how to manage the people element of it. Because again, if you keep having them come in and do that,
that's not going to help the organization overall.
Ultimately, in that case, you have to get to a point where you
say, thank you so much for being here. We do rely on you sometimes,
but wait for us to ask you. Wait for someone to escalate to you before
you jump in. Because that gives the incident commander, that gives them a genuine chance
to try and solve it themselves without just constantly looking over their shoulder going,
when is so and so going to jump in and fix it for us,
right? And they know that they need to give it a shot, and that
person is still available. But like the training wheels,
you need to try and wobble around with the training wheels off a little bit
before you reach for them. Yeah.
And it's giving them the opportunity
to teach others their skills. And this is how
you become even more senior, by doing stuff like this. Also, when you
do escalate to someone, ask them
to come in and try not to do hands on themselves, but just to answer
questions about where to look. Yeah.
One of the things that we did recently,
I guess it's not that recent; man, my brain, it's been a while. But we got away from having an engineering lead constantly in our incidents, our high severity incidents, and it worked pretty well for a while, and we were seeing that things were going well. But one of the things that we were noticing is
that there wasn't always someone there to kind of take the brunt
of a business decision being made. And what I mean by that
is when trade off conversations are happening,
having the right person to ask a couple of those pointed questions, someone that has a broader breadth, to really be able to raise their hand and say, yes, go ahead and do that.
I'm here. I've got you. Yes, go.
And really doing exactly what you're saying, which is, most of
the time, really listening and making sure that the team has what they
need, but then finding the opportune moments to make sure that
things are moving along, the questions are getting asked, and,
again, really taking on the responsibility and taking that pressure off those engineers, who may not be as comfortable saying, yes, we need to do this.
One of the things at Zendesk right now that we're very laser focused on is reducing all of our mean-times-to. So we've got a bunch of different mean-times-to, and I'm saying that in the sense that we've got a bunch of them: mean time to resolve, mean time to activate, all of those. And I think that there's these
very interesting pockets of responding
to the incident where, little by little, reducing that time,
especially around that type of a thing, where, again,
there's a debate going back and forth of whether or not to do something.
And it's, again, just especially
when it comes down to not the engineering decisions, but the
business decisions. Right. A lot of engineers will feel like, okay,
but now I'm at this risky precipice, and I'm not sure
how to make this decision about this risk to the business and how to analyze that. That's really important. Do you solve this with roles?
Yeah,
we now have our engineering lead on call rotation, which is
pretty much, this is a little inside baseball, but like director-plus. We follow the sun, primary and secondary, and we have a handful of senior managers that are in that. But for those senior managers, this is a growth opportunity for them to be able to get to that level.
It's a great opportunity to get some visibility and to show leadership skills outside
of what their area is.
And I think that what's interesting,
again, observing different incidents where we have different
leaders, they take different approaches. Right. Like,
you have some of the ones that are like,
they're there, they have their presence, they're listening, and then they're
ready to go with their three very specific questions as soon as there's
a break in the conversation, whereas there's other leaders that get a little bit more involved. And so
I think the most important piece of that role
is the relationship with the incident manager. The IMs should be managing; they are in charge ultimately, right? They are the ones that are making sure that all the things are getting done. But I think that that
partnership between the eng lead and the IM, and even having a little back channel conversation between those two on Slack to just say, hey, here's what I'm hearing. I'd rather make sure that they're pushing things along instead of having me jump in, because if I jump in, then it can shift to, like, oh, whoa, why is so-and-so all of a sudden asking this question?
So I think that there's a lot of opportunity there. And that works well in a situation where you don't have a lot of people that are trying to exert a lot of control. It's not those power grab situations.
We don't really have that at Zendesk. I was mentioning
earlier, it's a really collaborative and all for one type
of the way that we operate.
But I think that there's unique elements like that
that, like, I wouldn't necessarily recommend having that role for
all programs. Right. Like, that might not be the right fit.
That might actually not be the way to become more effective.
But for us, that's helpful in the sense of let's make sure that there's someone
there to really own those critical decisions.
Yeah, makes sense. Erin, you touch on this
a little bit, but I mean, a lot of the folks that are going to
be listening to this have no idea the amazing learning
from incidents culture that you are building at Zendesk
and just how hard and intentional that
is with a really large organization with
a very pervasive tool in the industry.
And I'm curious,
you obviously can't distill it in like a five minute thing. But I'm
curious what advice you might have for other folks looking to build that out, looking to create an intentional learning culture there. And I think one of the things you suggested was that you put
it on the career ladder a little bit. Right. And so folks are being rewarded
for it in a way. And I think more of the tech industry needs to
do that and be that intentional about it. But is there any other advice you
might give to folks? Yeah, I think that that's a really good question.
And it's funny, like, yeah. Trying to squeeze it into a five minute,
keeping it short. I think
that it really starts from
the folks that are responsible for incident management. Right.
Wanting to make sure that they are understanding and
operating their process as efficiently as possible.
Right. So when we do that and we learn from it,
we take the opportunity to really have a blameless culture.
And that is something that we really do stand by at Zendesk. We always have.
I feel like I keep kind of mentioning these rah-rah things about Zendesk,
but it is true. You should be proud.
We do practice these things. And I do think, though, that that
started from a very early foundation. I think that we have
maintained that culture, which helps quite a bit from the
learning. Right. Like the opportunity to learn. I will say
that what we ended up doing is,
I mentioned this earlier, we collect a lot of data. Our incident retrospectives are very intricate. At the
end of every incident, there's a report owner that's
assigned. If it's a large incident,
there may be more than one. And they
work on doing their full RCA, and there's so much
information. What's an RCA? Root cause analysis.
Yeah. And as part of that process,
it's really the work that they do to really understand what had happened,
providing the story, the narrative, the timeline, the impact, the remediations, and being able to share that out.
Then we have our meeting where we go through,
we talk about the incident, and then from
there we push out our public post mortem that
is shared out to customers. We also have an event analysis
that we do prepare within 72 hours that gets shared
with some of our customers upon request. It's just, this is what we know at this point. That also helps a lot with our customer piece. That's not about learning, though. So I think
that what we've done is we have established a pretty strong
process that has worked pretty well for Zendesk, in the sense that we've been able to learn quite a bit from our incidents. And in our remediation item process, we have SLAs against our remediations.
And with those remediations, I think that I was mentioning the chaos engineering; that's a more recent thing. And I'm saying these timelines loosely, where me saying recent could actually mean a little bit longer. But taking those and
saying, hey, we had this incident. Let's do this exercise and
validate that all of these remediations that you said are completed,
we are good. We've mitigated this risk.
And there's this whole piece of, I mentioned
about the data that we collect, right. And being
able to put together a narrative and a story that is
really able to help the business understand
what it is that we're dealing with, and how
are we actually taking that data and making changes? How do
we drive the business to actually recognize where
we need to be making improvements? And again,
we've been collecting this data for years now, so we've got a lot of historical
context, and it's very helpful. And I think what we've done now with adding Jeli to the mix is that it's a whole other layer for us; we haven't been able to get that level of insight before.
And we were ready for it in terms of
being able to do things like pulling our timeline
together from various slack channels and being able
to see the human element and involvement
and participation. I think those
are two really big spaces that we had some blind spots
to. Because again, when I talk about us trying to reduce our mean-times-to, there are things that are happening in other Slack channels, where there's time spent on triage. There's these pieces that, sure, you can put into a big write-up of a report saying, oh, we triaged this in the such-and-such Slack channel, with a link to the Slack channel, but that's not doing you any good, actually. When you talk about blameless culture and all this stuff, is that just within engineering or is that
company wide? Good question.
I think it's company wide. I think every organization
kind of has their own little. Because this is something
that I have definitely struggled with a little bit. And I don't mean to
throw others under the bus or anything, but in engineering,
I feel like we have for a couple of decades now,
been working on training. None of us
just shows up out of college or whatever,
just like, yay, I know how to admit my mistakes and feel safe and
everything. This has been like a very conscious, very intentional, multi year,
multi decade effort on behalf of the
entire engineering culture to try and depersonalize
everything in code reviews to create blameless retros,
and we still struggle with it. And it's something where I feel like other
parts of the business, like sales and marketing, they're still fucking petrified.
They're just, like, so afraid to talk about mistakes
in public. So I keep trying to find out if
there are companies out there who are trying to expand this beyond engineering because it
feels so needed to me, and I'm not sure how to do it.
So I can talk to that in a little bit of a different way than
how I would say, blameless. So we do
a lot of work with our go to market folks in terms of enabling
them to be informed during an incident.
Immediately after an incident, they have process around where they can
understand what's happening. We obviously maintain
our space where we're actively responding to the incident, but there's other ways for them
to gain information so that they can communicate with their customers.
I think that Zendesk is also a pretty transparent
company when it comes to taking responsibility for where we've
had missteps.
But I will say, when we have, I will call them noisy times, where we may not have that many high severity incidents, right, but we could have an enterprise customer that has a bug that hasn't been fixed, that's taking longer than they expect. They could be
having an integration issue that has nothing to do with Zendesk side, but it's
on the other side of it. There could be multitudes of
different things that have nothing to do with our incidents.
Right. And so that's where we run into
the frustration from, I think, our folks that are
trying to just support our customers, because I think that
that's a big piece of being able
to enable them to be able to have the right pieces
come together, to be able to work and interact
with their customers on those things. Totally.
And I think that that's what you're describing, too,
is like, creating this big, same team vibe. I've definitely seen engineering
cultures where they're like, no, if you refer to this doc
that we've pinned in our channel, it's actually not an incident. So you can't
talk to us right now. File a ticket. And it becomes incredibly frustrating for everybody. And the engineers, even though they may be creating this internal blameless culture, they're not really having that vibe with their other colleagues, too. I really like that. Totally. One of the other things that we implemented
a while back was we have an engineering leadership escalation
process. So if
there's a customer that is frustrated and they have
whatever's going on, our customer success folks can go in
and put in a request, and basically, depending on
either executive sponsorship and or expertise in that particular area,
we then can line them up to be able to have a direct
conversation with one of our leaders within engineering.
This is something that originated from the fact that I used to have to manage crazy intake spreadsheets of frustration from certain
customers. Right.
And there were a couple of us that would take those calls
and just like, you would get on a call and you would listen and
you like, you know. Yeah, most of the time it was they just wanted to
be heard, and they just wanted to hear directly from somebody that was
closer to it. And I think that, again, I mentioned
that event analysis document that we put together. I think that also went
a long way with bridging that gap a little bit with like, hey, we don't
know everything yet, but this is what we do know,
and this is what we're doing about it right now.
I think that there's ways to kind of bring it in so that the rest of the organization is trusting in the engineering organization as a whole. There's definitely ways to do that. But again, it's what
we were saying. I think it goes back quite a bit to the culture element.
I mean, that artifact that you're talking about creating is really
a shared vernacular. Right? It's like thinking about who is going to read what you're giving to them. If you're creating something that is just for engineering and is just about the technical system issues, those do impact customer success, but they're not written in any sort of language that they can understand. That doesn't give that same-team vibe, and it makes them maybe not want to participate afterwards.
I think the tech industry is really cool, and engineering organizations have a lot of influence. And so if you can enroll other
teams and other departments into your processes that are usually
quite good, they might be able to learn as well.
I was in an incident like several years ago on
Super Bowl Sunday, where none of the engineers that were on call knew that we were running a Super Bowl commercial. And we went down during the Super Bowl commercial, and I came to the incident review the next day, and it was only SREs in the room. No one from marketing
was in the room, no one from PR was. And it was just
like, it was very much a we-should-be-prepared-for-every-situation vibe, but that's not really it. Maybe we can talk about that. I feel like a lot of this is how we could have coordinated here with other departments, and how we were exacerbating the issue by not talking to them about it.
Totally. We only have about 15
minutes here, and I know you both have really great
recent incident stories that I kind of want to hear about, so I
want to shift gears a little bit.
Charity, I would love to hear about a
recent outage that you all had in as much detail as you're
willing to go into and what you've learned from it.
And what I will say, that as
much detail as I'm willing to go into is all of the detail. This is
one of the marvelous things about starting
a company. I was always so frustrated with the editing. Like, I would be writing a postmortem of here's what happened, and they'd be like, scratching things out: no, no, you can't say MongoDB, say our data store; or no, you can't say AWS, say our hosting provider. I'm just like, what the fuck?
Or maybe, no, this reveals too much about the algorithm.
And I'm just like, that ties into the build versus buy culture that we're supposed to talk about later. But if you're not allowed to name that, you're being like, no one needs to know that we paid for this thing. It's like, yeah, they should know that, because that has nothing to do with your core business.
Yeah. I feel like everybody's
always so worried about losing trust with their users, and I just feel like the
more transparent you are, the more trust you'll build, because these
problems are not easy. And when you go down a couple of
times, users might be frustrated. Then you explain what happened
and they're going to go, oh, respect. Okay, we'll let
you get to it. We'll talk later when things are up. In my experience,
more detail is always better anyway.
And so that was one of the first things that I did as
CEO at the time. I was just like, look,
whatever you want to put in the outage report, anything that's relevant,
just write it all, let people see it. Anyway,
yeah, we had this really interesting outage a week or two ago. Maybe it was just last week. And it's the first one in a while; the last time we really had an outage might have been something this spring, but before that it was like a Kafka thing,
like a year ago. So this was significant for us.
And I got a couple of alerts that honeycomb was down, and I was
like, wow, this is very unusual. And it
took us a couple of days to figure out what was going on, and basically
what was happening was we were using up all the Lambda capacity, period, repeatedly. And the way our system works is we have hot data, which is stored on local SSDs, and then cold data, which is stored in S3. And stuff ages out to S3 pretty quickly. And this
outage had to do with SLOs, because every time that someone sets an SLO, they set an SLI, right, which rolls up into an SLO, and the SLI will be like, tell me if we're getting X number of 504s over this time or whatever. So we have this offline job, right, which periodically kicks off and just polls to see if things are meeting the SLI or not. And we realized after a while that we were seeing timestamps way into the future. And what happened was, anytime a customer requests an SLO that doesn't exist yet, we backfill everything, because they might be turning on a new SLI, right? They're writing something new. So as soon as somebody requests it, we backfill. And this was happening just over and over and over and over. It didn't have any valid results, so it wouldn't make a cache line. So every single minute it would launch a backfill, and all the Lambda jobs were just spun up just trying to backfill these SLOs.
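Here is a toy model of the failure mode Charity is describing, plus one possible guard (caching the empty result so the per-minute poller stops re-launching backfills). The function names, the cache shape, and the guard itself are illustrative assumptions; the actual Honeycomb code and fix may look nothing like this.

```python
# Toy model: a poller that keeps backfilling an SLO whose window is in the future.
import time

cache = {}                 # (slo_id, window_end) -> SLI results
backfills_launched = 0

def compute_sli(slo_id, window_end):
    # A window entirely in the future has no events yet, so there are no valid results.
    return None if window_end > time.time() else {"good": 991, "total": 1000}

def get_slo_results(slo_id, window_end):
    global backfills_launched
    key = (slo_id, window_end)
    if key in cache:
        return cache[key]
    backfills_launched += 1            # in the real system: spin up Lambda jobs
    results = compute_sli(slo_id, window_end)
    if results is None:
        # Without this caching of the empty result, nothing gets stored and the
        # next poll backfills again, every minute, forever.
        cache[key] = {"empty": True}
        return cache[key]
    cache[key] = results
    return results

future_window = time.time() + 7 * 24 * 3600
for _ in range(5):                     # five ticks of the per-minute poller
    get_slo_results("checkout-latency", future_window)
print("backfills launched:", backfills_launched)   # 1 with the guard, 5 without it
```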
What was interesting about this was that we
often think about how users are going to abuse our
systems, and we try to make these guardrails and
these limits and everything. And in fact,
most of our interesting outages or edge cases
or whatever come from people who are using the system as
intended, but in a really creative, weird way,
right? This is not something that we ever would have checked for,
because it's completely valid. People will spin up SLIs and need to
backfill them. People will set dates in the future. All of these
are super valid. I feel like when I was working on Parse,
I learned the very hard lesson repeatedly about creating these bounds checks
to protect the platform from any given individual, right?
But it becomes so much more interesting and difficult of a problem if what they're
doing is valid, but in some way because of the size
of the customer or because of whatever,
it just is incredibly expensive.
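A hedged sketch of the kind of bounds check being described here: a per-tenant cap on work that is perfectly valid but can get expensive at scale. The limit value, the exception type, and the function names are all hypothetical.

```python
# Illustrative guardrail: cap concurrent expensive jobs per tenant so one customer's
# valid-but-extreme usage cannot consume the platform's shared capacity.
from collections import defaultdict

MAX_CONCURRENT_BACKFILLS_PER_TENANT = 3   # made-up number for illustration
in_flight = defaultdict(int)

class TooExpensive(Exception):
    """Raised when a valid request would exceed the tenant's fair share."""

def start_backfill(tenant_id: str) -> None:
    if in_flight[tenant_id] >= MAX_CONCURRENT_BACKFILLS_PER_TENANT:
        # The request itself is legitimate; we defer it rather than reject the tenant.
        raise TooExpensive(f"{tenant_id} already has {in_flight[tenant_id]} backfills running")
    in_flight[tenant_id] += 1

def finish_backfill(tenant_id: str) -> None:
    in_flight[tenant_id] = max(0, in_flight[tenant_id] - 1)
```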
We have a really big write up. Go ahead. I was going
to say, it's always very interesting to see how creative people can get with using
your systems. Like, some things you would never expect; you're just like, wow. Often it's a mistake, or something where it might be intentional and completely legit, but they're doing a very extreme version of it, right? Like one of the very first things that we ran into with Honeycomb: we do high cardinality, high dimensionality,
all this stuff. And we started
seeing customers run out of cardinality,
not in the values but in the keys, because they accidentally swapped the key-value pairs. So the key would have the value in it, instead of it being in the value. And that's where we realized that our cap on the number of unique keys is somewhere around the number of Unix file system handles that we can hold open. So now we limit it to like 1,500 or something like that. Right. This is why I'm like this; you can see my shirt: test in prod or live a lie. This shit is
never going to come up in test environments. You're only
ever going to see it at the conjunction of your code, your infrastructure,
a point in time with users using it. Right. And this
is why it's so important to instrument your code. Keep a close eye on
it while it's new. Deploying is not the end of the story. It's the beginning
of baking your code. It's the beginning of testing it out and
seeing if it's resilient and responsible or not.
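As one sketch of what instrumenting your code and keeping a close eye on it can look like, here is a small example using the OpenTelemetry Python API. The span name and attribute names are invented, and this is one possible approach rather than a Honeycomb-prescribed schema.

```python
# One way to emit wide, high-cardinality telemetry for a new code path.
from opentelemetry import trace

tracer = trace.get_tracer("billing-service")   # hypothetical service name

def handle_export(user_id: str, team_id: str, rows: int) -> None:
    with tracer.start_as_current_span("handle_export") as span:
        # High-cardinality fields like user_id and team_id are what let you later
        # "break down by" them, instead of eyeballing a fixed dashboard.
        span.set_attribute("app.user_id", user_id)
        span.set_attribute("app.team_id", team_id)
        span.set_attribute("app.export.rows", rows)
        # ... do the real work, recording outcomes as you learn what matters ...
        span.set_attribute("app.export.succeeded", True)
```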
And you just have to put it out there, keep an eye on it
and watch it for a little bit.
Fortunately, we had the kind of instrumentation and the kind of detail
where we were able to pretty quickly figure it out
once we went to look. But you can't predict
this stuff. You shouldn't even try. It's just impossible. You got to test in prod. It's a waste of time to try to
predict it, too. Waste of time, right.
You'll find out. It will come up quickly.
Every limit in every system is there for a reason. There is a
story behind every single one. Exactly. And people
who spend all this time writing elaborate tests, just trying to test everything. It's a waste of time, because they spend a lot of time on it, the time it takes them to run their tests is forever, and it takes them a long time to know.
Take that time and energy that you're investing into making all these wacky tests
and invest in instrumentation. That effort will
pay off. Yeah, totally.
That is an awesome one. Erin, I'm going to turn it over
to you. Can you tell us about a recent outage that you all had?
Yes, and unlike Charity, I am not the CEO,
so I have to be a little bit more careful with
some of the details that I possibly share. But I
think that before we got on this
call, I was going back and kind of looking through recent incidents
that we've had. And I think that one of the things that I
just want to call out is that it's interesting to
see how over time, your type
of incidents and the nature of them can change.
And I don't have one of my obvious
ones that normally I'd be like, yes, I'm sure we have one of these recently.
And no, it's been pretty quiet in that space for a while.
Right.
I want to just talk a little bit.
This is a little bit different because I think that this is an interesting one
in the sense that it's more about the process of
how I'm going to talk about it than it is about the actual incident itself.
Of course I am because that's where I always come from.
But we had an incident that was called recently,
and the nature of it was, the question came up of,
is this a security incident or is this a service incident? And the team realized, like, oh, we can handle this ourselves; we already tried escalating up through security, and they said that this was on the customer side, so technically it's not a security incident. And so we managed it
because we could take those actions through the service side. We had the right engineers
and all of that. But I think that one of the things that
came back out of that was more of this understanding and discussion around risk. Was that actually the right direction for us to go in terms of handling that from a process perspective, or would it have been better served, managed differently? Because the engineering work that was required was not necessarily high severity, we downgraded the incident to a lower severity level, because for them it wasn't that. But reflecting back and looking at it, it was like, there's a couple of gotchas in here that we need to be more aware of. Are we actually following that process from the right standpoint, even though it doesn't fit in that perfect box of
what it is? Security incidents also roll up through us; like, we are responsible for service incidents, and for security incidents as well, but our threat and triage team is through our cybersecurity organization, so we partner very closely with them. And so obviously that's something where there's some really strong learnings to kind of follow the trail of,
because that's also one where it was very specific for a
particular customer. Right. It wasn't a widespread thing. It was one particular
customer that was having this issue. So following
that trail and determining when it's
a situation of rightfully so, the individual is like,
we need to take care of this. So they raised an incident and
it's like, wow, is it truly? Is it not? But it needed to get taken
care of. But I think
those are the types of things that are always very interesting in terms of like,
it's totally off. It's not a common thing that would happen. And so you
got to go back and just kind of review and figure out how to those
incidents where it's not affecting everyone, those can be the
trickiest ones to figure out. Especially when your slo
or your SlI is like, yeah, we're at 99.9% reliability,
but it's because everybody whose first name starts with like,
slid thinks you're 100% down, or some of
those really weird little edge cases. This is where a lot of my anger at
dashboards comes from, because dashboards just cover over all those sins.
You're just like, everything's fine while people are just like, no,
it's not. Yeah, Erin, you were talking about a lot of the metrics you're measuring earlier, and recently I've been coming around on this. I think metrics have value on their own, and I have a lot of gripes with some of them; I think, you know, sometimes we over-index on them. But I've been trying to come up with more creative metrics
lately, and one of them is how long someone spends waffling
trying to decide if something's an incident.
I love it.
Trying to figure it out, or trying to figure out the severity, and the time it takes, how many people they rope into a conversation. I was at one org that used to get paged all the time, and people just stopped wanting to bother each other, because everyone was getting woken up so much, that one day people would open up incident channels on their own and then sit there by themselves typing everything they were doing for like two hours before they brought anyone else in, which was fascinating. And so we started recording that: how long before paging another person in and deciding it was a serious thing. So it just reminded me of that.
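For anyone who wants to try measuring that, here is a rough sketch of the time-spent-alone-in-the-channel idea, computed from a generic list of (timestamp, author) messages. It is not tied to any particular Slack API, and it is not how Jeli computes its metrics; it just shows the shape of the calculation.

```python
# How long did the first responder sit alone before anyone else joined the conversation?
from datetime import datetime

def minutes_alone(messages):
    """messages: list of (iso_timestamp, author) tuples, sorted by time."""
    if not messages:
        return 0.0
    first_ts, first_author = messages[0]
    start = datetime.fromisoformat(first_ts)
    for ts, author in messages[1:]:
        if author != first_author:
            return (datetime.fromisoformat(ts) - start).total_seconds() / 60
    # Nobody else ever showed up in the channel.
    return (datetime.fromisoformat(messages[-1][0]) - start).total_seconds() / 60

incident_channel = [
    ("2024-03-01T02:04:00", "oncall_a"),
    ("2024-03-01T02:40:00", "oncall_a"),
    ("2024-03-01T04:01:00", "oncall_b"),
]
print(round(minutes_alone(incident_channel)))   # 117 minutes before a second responder
```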
Cool. Well, we are wrapping up time now.
We probably have time for one more question.
And there's like three really
great ones. I'm trying to figure out which one I want you all to answer.
We're just going to have to do a second round at some point. Yeah,
like a follow up, a part two. We'll just end it on a cliffhanger.
Yeah, I guess
I will ask. I know, Charity, you have thoughts on this. I want to hear you talk
about build versus buy with regard to tooling for any
phase of the incident lifecycle. You touched on it a little bit earlier, so I
feel like it's good for us to call back to that. Yeah,
so I preach with the religious zeal
of the converted. For the longest time, I was one of the neckbeards who was just like, no, I will not outsource my core functions. I am going to run my own mail because I want to be able to grep through the mail spools, and it's so scary when I can't figure out exactly where the mail went, like, it goes off to Google and, what the fuck? All right, that was a long time ago, but still,
it was a big deal for me to overcome that and go, okay, I can outsource this to Google, okay, I can do this, whatever. But even up to and including Parse, I was like, no, metrics are not a thing that you get to give to someone else. They're too critical, they're too core. I need to be able to get my
hands in there and figure it out. And I have come around full
circle on that, and did so before starting Honeycomb, just because of the thing that I mentioned earlier, about how you never want your telemetry to be on the same anything as your production systems.
Huge thing. But also it's
a big and really important part of the entire, call it
DevOps transformation or whatever, which is just that we
as engineers, the problem space is exploding.
Used to be you had the database to run. Now you've got fucking how many
data stores do you have? And you can't be expert in all
of them. You probably can't be expert in any of them and also do a
bunch of other stuff. Telemetry. For a long time,
it was just the case that you're going to get kind of shitty metrics and
dashboards, whether you ran them yourself or you paid someone else to do it.
That's not the case anymore. It's come a long way.
You could give somebody else money to do it way better than you could.
And that frees up engineering cycles for you to do
what you do best, right? What you do best is the reason that your
company exists, right? They're your crown jewels. It's what you spend your
life caring about. And you want to make that as an engineering leader,
you should make that list of things that you have to care about as small
as possible so you can do them as well as you can, right?
And everything that's not in that absolutely critical list,
get someone else to do it better than you can. It's so much cheaper
to pay someone else to run services than to have this enormous,
sprawling engineering team that's focused on 50 million different things.
That's not how you succeed at your core business. Exactly; take your headcount and spend them on the things that you do, and do them well. Love it.
I love it. Erin, anything to add there
before we wrap up here? Yeah, no, I mean, I think that it makes a
lot of sense. And again, I'm also responsible for our vendor
resilience. So dealing with understanding what our capabilities
are of the vendors and what we've agreed to, going back with
understanding. Now, I want to pick your brain about that, Erin.
There's a whole binder full of goodies that I have for it.
But, yeah, I think that there's a way to look at it from a perspective
of there's ways in which that we are getting
more sophisticated in how we are creating our systems and
our ecosystems in general overall. And the dependency there
is the web of things that everybody is connecting to and using
and leveraging. And I think that what is best,
again, I will go back to what I was bringing up earlier, which is starting
from a risk perspective of really actually understanding
what your risk tolerance is. If you're looking to use this tool, okay, what are the risks associated with doing that? And are you willing to take those risks, such as the dependency? Right. And again,
what are the things that you can do to ensure that you are mitigating that
risk as much as possible by creating workarounds and really thinking
about your technology continuity element of things and how
you're developing around that dependency. Right. When I was at Parse, we were like, okay, what we do is
we build the infrastructure so mobile developers can build their apps.
Right. So operations was absolutely core to us. We had to
run our own MongoDB systems because it was what we
did. Right. Yeah. But for most people, it's not what
you do. Go use RDS.
It's fine. You're still a good engineer if you do. I promise.
Yeah. And I think it's interesting, too,
because I think that we have certain things internally at Zendesk that have been built, and there's been a lot of work that's gone into them, and a lot of people know what it is, the classic fallacies. But also at the same time, there's this tipping point where that becomes like, okay, we have this dependency,
we created this, we have to maintain it. We have to keep it alive.
At what point do you get to where that becomes like, okay, this is
a much larger task than it would be to outsource it. So I think that
there's different phases and also future proofing
because the people who built it and run it might not be there forever.
Especially if that thing is involved in a ton of incidents.
They are going to burn out and leave, and no one's going to know how to run it. True facts. True facts.
We went through a huge exercise a couple of years ago, this whole entire thing around ownership and self service, really making sure that everything had a clear owner.
It seems like a simple thing to say, but when you have multiple different things, it's stuff that just existed from the beginning.
And it's like, we don't own that. We just use it. We don't
own it. We just use it. Well, somebody's got to own it
because this thing keeps breaking. So can we please figure out
who's actually going to take on the responsibility of it? And I
think that's also it. Yeah, you end up with people that leave, and then you're like, okay, does anybody know what's going
on here? Right? And it becomes more critical than
ever right now. I really want to do a part two with you all
in person. Wine, whiskey, part two. We can totally
do that. And we can record those as well. But thank you both
for joining me. Really, really appreciated the conversation.
And, yeah, look forward to talking more later.
Thanks, Nora. Yeah, thanks, Nora.