Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with Chaos Native. Create your free account at Chaos Native Litmus Cloud. Hello, and welcome to our talk, The Elephant in the Blameless War Room: Accountability. This talk started when Emily
and I were encountering executives in Fortune 500
companies who owned software reliability for their entire company but didn't really believe in blameless culture. They would ask us point blank, well, somebody still has to get fired,
right? And that actually was very poignant,
and it got me thinking, what about accountability?
And so that's something that Emily and I spent a lot of time thinking about, and we distilled the answers into this talk. We really wanted to reconcile the
idea of being totally blameless, but still holding people personally accountable when that is the best solution. And we're really excited to share what we found
today. First, a round of
intros. I'm Christina. I'm on the strategy
team at Blameless, strategizing for executive team cohesion and
also market positioning. I'm really passionate about making blameless
culture work, not only for engineers, but also for
business leaders. And I'm Emily. I'm a content writer at Blameless.
I'm originally an outsider to the world of SRE, but I've been really excited to
learn about the space and to start sharing my perspectives with the community.
So we started thinking about factors that have a huge impact
on business value. And one of the major ones agreed upon by every
study is developer velocity. Then we found that a major factor
in developer velocity is psychological safety. And what do you
think is a major factor in psychological safety? Blameless culture.
That would be correct. So it really is a big deal.
You can't overstate the importance of being blameless.
Yeah. And especially when speaking with business leaders,
it's really important to speak in their language, and that is
the currency of communication. And so showing
that there is business value in having a blameless culture is tremendously
powerful. So picture
this. We have an engineer working one night on
the testing environment admin panel. However,
on this dark and stormy night, they slowly realize this
isn't the testing environment. This is the production environment.
Of course, these changes lead to a major incident.
And just at this time, the door opens and the
executive walks in and asks, what happened?
Who's responsible? So this is
a pretty chaotic situation. A lot of things have gone wrong,
and a lot of emotions are running high. Let's break down what happened. The shared
reality is pretty simple. Leadership walked in and asked, what happened? Who's responsible? Probably their forehead was furrowed, their voice was raised. They're a little agitated,
speaking faster, and physically hovering around people's desks,
really trying to get to the bottom of this. Now, as members
of the engineering team, how would this shared reality be
interpreted? Well, it's very natural and
human to feel blamed, frustrated, or even afraid, imagining all the different scenarios of what the possible repercussions could be, and that could make it really difficult to focus on resolving
the issue. So even if that's what the engineering team's thinking,
let's look inside the head of the leadership. What might they be going
through? Exactly. And here I wanted to say that in psychology, there's this idea of the fundamental attribution error, where when we feel hurt, we assume that the other person has bad intentions or is a bad person, whereas when we hurt someone else, as you can see, we assume that it was an accident, that we were just having a bad day. And so it's very natural and human, again, to feel blamed and assume that that is the objective reality, when really the only shared reality is that leadership asked, who's responsible? So let's
see what might actually be going on with the leadership
without assigning intent based on the
engineering team's feelings. So their
goals are probably very similar to the engineering team's. They're really just focused on resolving the incident, preventing it from recurring, and restoring trust with the stakeholders. Now, sometimes we
could have the same objectives and get there through different paths.
And so, given what we each know as
the engineering team versus leadership, we might think that
for the leader, holding someone responsible is the best path forward,
whereas for the engineer, there might be other paths. So really
it's easy to think, oh, leadership doesn't respect psychological
safety. They're wrong. This is not the way to resolve incidents. I hate
this toxic culture, but that doesn't actually solve the problem.
And what I found in my experience is that it
really helps when you try to step into the other person's shoes
and see how, under their set of assumptions, their conclusion
may be reasonable or logical. And once you uncover those
implicit assumptions, then you can directly speak to those as a
starting point of influencing and causing change.
So again, you can easily see this as a starting source of conflict. What starts to
dissolve conflict is when you start to see commonalities
between the two parties. So here we can see that not only do they have
the same goals, but they're really feeling the same way too. Both of them are pretty tense. They're pretty stressed, and they're very afraid of the consequences this incident could have. They're both very motivated to
resolve the incident and restore the service to functionality. So let's
start breaking this down. How can we bridge this gap? How can we bring these two
groups together? This will be the main backbone
for our talk. We're going to establish empathy for leadership,
understand their goals and perspective, and look at the assumptions
and perspective differences that might be driving their blameful
behavior. We'll address their concerns in three major areas, looking at the incident, the engineers involved, and the stakeholders' trust. Then we'll cover how to be blamelessly accountable, how to incorporate accountability
into the best solution. So what are the assumptions under which leadership wanting to hold someone accountable would be the correct way of resolving the incident? That's the question
that Emily and I asked ourselves. So they might
assume some things about the incident, like it just straight up should never have happened,
or that the best way to deter other people from making the same error
is to punish someone. And some of the assumptions
about the engineer could include: a skilled engineer would
never make this mistake. If someone made a mistake like this, it must mean
that there's an issue with their competence or skill.
Removing the engineer will remove the problem.
And without punishment, the engineer won't fully understand the impact of their mistake.
They could have some assumptions about how the stakeholders are feeling too.
They might believe that the stakeholders want to see someone singled out and perhaps fired, that this is the most persuasive way to convince them that the incident is resolved. They might
also think the stakeholders are expecting some sort of fairness, that because they've
experienced pain from the incident, they'd want to see pain experienced by the engineering
team as well. But we know this
isn't really how things will play out. Even though blame seems like
a good way to achieve your goals, given these assumptions,
we know that systemic changes are far more enduring and beneficial.
So how do we close this gap of understanding? Absolutely.
And do remember that even at this point,
both the engineering team and the leadership have the same
shared goals. They both want to resolve the incident. It's just given
what we each know about incidents, engineers, and stakeholders, we might have
different ways of getting to that outcome.
So first, let's understand leadership's perspective on the
incident. If they assume that it should never have happened, or that punishment will deter others from making the same error, then how can we skillfully address it so that we challenge their thinking without triggering their defense mechanisms?
And so Emily and I came up with a way to essentially uncover
these perspective differences. What better way to have
open minded conversations than to ask questions?
And keep in mind that these are not questions that we are expecting
you to ask during the incident. Everyone is
still stressed and tense and focused on the resolution during the incident. We highly suggest that you ask these probing questions
when both the leader and you are in a calm state of mind where you
can meaningfully engage in a conversation and where you're both open
to changing your mind. So these are some of the possible questions.
One thing to ask is, is 100% reliability even possible?
Is it worth the cost if it is? And what kind
of tradeoff are you willing to make between trying to prevent incidents and
preparing yourself to react to them, given that you only have
so much engineering capacity? Another question to ask is, are there other ways of
making people more careful than using punishment?
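To make the first question concrete, here is a small illustrative sketch of our own (the availability targets shown are example values, not figures from the talk) that converts a reliability target into the downtime it actually allows over a 30-day window:

```python
# Illustrative only: how much downtime a given availability target allows
# over a 30-day window. The targets below are example values, not numbers
# discussed in the talk.
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes

for target in (0.99, 0.999, 0.9999):
    allowed_downtime = MINUTES_PER_30_DAYS * (1 - target)
    print(f"{target:.2%} availability -> {allowed_downtime:.1f} minutes of downtime allowed")
```

Each additional nine shrinks the downtime budget by a factor of ten, which is why the cost of chasing 100% reliability grows so quickly.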
So let's see how we could address leadership's concerns about the incident.
So one thing to emphasize when you're having these conversations is that systemic changes
are more enduring and beneficial. That if you actually change
the system at the heart of the incident, then you'll do way
more to prevent the incident from recurring than just swapping out people.
And it helps to give a specific example of how in the past
you've had a system change that was more effective than letting go
of someone. Based on Aristotle's model of persuasion,
ethos, pathos, logos, it's actually very important to
strategically sequence your appeals based on how likely the person you're talking to is to agree with you.
So, for example, if you anticipate that the leader is less likely
to agree with you, it's more important to establish your credibility first,
then use logic, then use emotional appeal at the end. So don't start
with, oh, the team feels really guilty. But start
with, I've been doing this for many years, and I've
seen this example where the systemic change was a lot more effective
and enduring than actually letting go of someone, and then, from an emotional standpoint, help the leader understand how the engineer might be feeling.
Another thing to really make clear is that there's no way that
complex systems won't ever fail. There's no way to prevent incidents
100% of the time, as much as you might try. Yeah. So Dave Rensin, the former head of Customer Reliability Engineering at Google, would often say in his talks that to err is human, that even the sun is not 100% reliable because, well, it will fail at the very end, that our hardware is not 100% reliable, and that complex systems are inevitably going to fail. So helping the leader understand that software doesn't
just work the way it probably did in the 90s
is going to be very helpful.
Another thing to really get across, and this is more kind of on the emotional
side, maybe, as you were talking about, that engineers are not at their best when
they're stressed. If they're in a fight or flight mode where they think, like,
every mistake they make could lead to the end of their job,
they're not going to be able to focus at all on actually solving the problem.
Absolutely. While points 01 and 02 can wait until after the incident is resolved, you can see that point 03 is actually the key to helping the leader understand
the situation, to give enough room for the engineers
to focus on resolving the issue in the moment. So instead of letting
leadership hover and ask what's going on, what's going to happen, who's responsible?
You can say, hey, we're all here trying to resolve the problem as
soon as possible, and engineers don't solve problems well in
fight or flight mode. So it might be helpful if you could take some time
and give the engineering team some space to resolve the issue first, and then we'll come back
to looking at what actually happened. Now let's take
a look at how the leader might be feeling about the engineers involved in the
incident. They might have some assumptions about the
engineer that we covered before, like a skilled engineer
would never make a mistake like this, or that if you just get
rid of this one bad apple, then everything should be fine. Or that
the engineers won't really understand the severity of an incident without punishment.
So let's try to look at how we can bridge the gap in
our perspectives that would lead them to have these assumptions.
So ask them, do you think there are deeper causes of incidents beyond individuals?
I love these open-ended questions because they really give everyone in the conversation a chance to do creative problem solving together
instead of fixating on things that have gone wrong. This question embeds
some forward momentum towards solving the problem together.
Another question is, do engineers understand the business impact of incidents?
It could be that they don't, and there needs to be more of an open
dialogue between leadership and engineering teams so that they can understand
how their development choices will translate into money gained or lost
by the company. More than likely, though, they do understand that
the incident was severe, and they're probably already feeling plenty guilty.
Yeah. So let's talk about addressing leadership's concerns
about the engineer. Anyone in this position could have made that mistake.
And in Emily's and my conversation with an emergency physician at Stanford Emergency Medicine, they said that there are more missed heart attacks in the US than we would expect, because doctors,
even though they've gone through so many years of very tough training,
will have interruptions. Sometimes their brain is holding a
different set of contexts. Maybe the engineer was helping another engineer with a deploy in production and, coming back immediately under a time rush, perhaps didn't realize that this was actually not the testing environment but the production environment.
And another thing, again on more the emotional side of appeals,
really try to create empathy between the leader and the engineer that
was involved. Really emphasize that nobody wanted this outcome.
It's very easy for people in leadership to feel isolated in
their roles, that only they can grasp the magnitude of
all of these incidents. But the
engineer is obviously suffering very much as well.
And closing the gap on that empathy will allow them to understand a lot
better. Absolutely. And it's also about building common ground,
because leaders are typically people who take extremely high ownership of company performance, they will feel like, oh, I'm the only one that cares here. They're also in a mode of stress when they're trying to assign blame, if that is the case. And so when that is happening, it's difficult to have the mental space to recognize that, no, we actually all feel bad. And it is possible that the engineer doesn't fully understand the business impact, and then it is leadership's responsibility to actually help the engineers understand.
And what I've found is that it actually builds tremendous trust when you can facilitate a conversation where the engineer does acknowledge their understanding of the business impact. It doesn't mean you're taking full responsibility for something that could have happened with other people, but it means that you understand the pain
that leadership is experiencing too.
Another thing from a business perspective to emphasize is that it's way
more costly to hire someone new than to train the existing team, even if there are gaps in knowledge. It's way easier to bring someone up
to speed than to hire someone brand new and teach them all
of the intricacies of your system. As you can see, the first and third points here are logical appeals, and the second one is an emotional appeal. So if the leader is likely to agree with you, you can start with the emotional appeal. But if you foresee them disagreeing with you, start with points one and three first.
Finally, let's take a look at their perspective on stakeholder trust.
Again, they might have some assumptions about what the stakeholders want to see: that they might want to see someone get fired as a way to be convinced that the incident is resolved, or to maintain some
fairness among different teams and stakeholders.
So again, here are some questions to ask to try to figure out where your
perspectives are different on this situation. Just ask open-endedly, are there other ways to rebuild trust with stakeholders besides retribution?
Another thing to ask is how will they respond to retribution versus
being informed on your long term plans to resolve this incident
and other incidents like it? As you can see, the first question here also allows everyone to come together and brainstorm different options rather than just saying it's wrong to punish the engineer, because
that actually cuts the momentum of the conversation and
stops it at that point. Whereas you can direct that energy towards
a new option and see, okay, what are some other ways that we could build
trust?
So now that you've opened up this dialogue, how can you present your
own perspective? One thing to really
try to convince leadership of is that your action plan will inspire
confidence, that once it's explained to stakeholders, they'll see how it leads to a more enduring solution. The other point to
mention is to really show that you hear and acknowledge the
pain of your customers and also all stakeholders impacted.
So who else could be impacted besides customers? Well, the customer success team is often the team
that is responsible for retention metrics. And so if customers are impacted
by incidents, they may be more at risk of churn,
which makes it harder for the customer success team to do their job.
And so extending empathy and understanding not only to
customers but to internal stakeholders as well,
is very important.
So let's return to this incident, where the leader has come in and asked what happened and who's responsible. In the moment, what is the best way to respond to this? We think some of the elements it should have are to be very direct and succinct.
No beating around the bush. Yeah, because beating around the bush could actually make you seem suspicious, like you're trying to hide something.
Another thing to really focus on is building common ground, looking for
things that you can both agree on, things that you're both feeling and goals
that you share. You also really want to create psychological
safety. And if you see any rush to point fingers and blame,
really try to alleviate that with some of the questions we mentioned before.
Like I said, having shared goals is extremely important. So explicitly
articulate them, make sure you're all on the same page and facing the same direction,
and then give visibility into what the next steps will be.
Now that you have set up the goals that you share, how are you going
to achieve them without using blame? So let's go back to that moment where leadership
walks in through the door. So what happened
here? Who is responsible for this mess?
It really could have been anyone. We're all focused on resolving
the incident as quickly as possible. So why don't we give the team some time and space to focus on the resolution first?
And I understand the impact this has on customers. We're committed to restoring stakeholder trust, and we'll take full ownership of working towards preventing incidents like this in the future through in-depth contributing factor analysis and follow-up actions over the next two weeks after this incident is resolved.
That seems fair. I look forward to seeing what you find.
Wow. That was actually a very scary and stressful
experience for me, even hearing Emily say that because
I could feel blamed. I felt like I needed to hold
someone accountable in that moment. But I wanted to actually ask,
how did you feel as you were asking those questions? I tried to really
embody the feeling that this was a big deal and that I had the entire
company perhaps riding on resolving this quickly.
I really wanted to get across how passionate I felt about
this going wrong and convey the importance to everyone
else. So if that came across as scary, we can see now where
these gaps start to pop up. Yeah. Wow. That's powerful.
See, even when I was scared, I actually lost the ability in that moment to understand that you were just really prioritizing this issue. It came off definitely feeling like
you're trying to hold a specific person responsible. So yeah,
that was very powerful. Thanks. So the immediate response is actually not enough.
Let's look at what the follow up investigation could look like.
Rather than saying this engineer screwed up,
let's dig a little bit deeper, do the hard work, and see what the other contributing factors are. So as an example, we can
return to our story from the start and ask a few questions
about how this may have happened. Like why do the admin
control panels for production and testing look
really, really similar? Yeah. And should production have a big flashing banner saying, this is production, this is production? Yeah. And maybe just a single person acting alone shouldn't be able to make these changes. Maybe there should be an oversight step where someone has to review changes before they go through. And should we
maybe be selective about the engineers who can make changes on the production
admin panel? So just by digging into it,
we can come up with all sorts of enduring systemic changes that can prevent
this specific incident and all sorts of other incidents like it going forward.
It's so much better than just getting rid of one engineer.
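As one illustration of what such a guardrail might look like, here is a minimal sketch of our own (not taken from the actual incident; the APP_ENV variable name and the prompt wording are hypothetical) that forces an explicit confirmation before any admin action runs against production:

```python
import os

def confirm_production_action(action: str) -> bool:
    """Require explicit confirmation before running an action in production.

    Illustrative sketch only; assumes the environment is exposed through a
    hypothetical APP_ENV variable.
    """
    env = os.environ.get("APP_ENV", "testing")
    if env != "production":
        return True  # no extra gate needed outside production
    print("*** THIS IS PRODUCTION ***")
    answer = input(f"Type 'production' to confirm '{action}': ")
    return answer.strip().lower() == "production"

if confirm_production_action("update admin panel settings"):
    print("Applying change...")
else:
    print("Change aborted.")
```

Pairing a visible warning like this with a required review step for production changes addresses the contributing factors at the system level rather than at the individual level.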
Yeah. So again, you should have this follow up
conversation a little while after the incident is resolved
and you've uncovered these perspective differences. You've uncovered
what assumptions they might have had, and then you can start really meaningfully
implementing these changes. So now that you've done the investigation,
you've come up with really great systematic changes that will
help prevent issues like this in the future. Are you done?
Well, as you can imagine, no. There's also follow up
planning for reliability overall. So looking at these incidents, how does that inform the three pillars of planning: people, process, and tooling? And process includes prioritization as well. So for people, how do incidents inform headcount planning? Do we need more people? And for process, how can we update runbooks or production readiness checklists? Are there things that we can do to consistently improve our performance in resolving incidents in the future? And for tooling, consider investment in tools that will really up-level the effectiveness of the engineering team. So just from one incident, we can dive into major priorities for the entire organization. Absolutely.
Incidents can uncover issues that maybe the
company isn't even ready to hear about yet. Because after asking
enough whys of how something happened and not just going
down one path of the tree, but exploring multiple options,
it often ends up revealing something about
leadership and also about how the team is structured.
So there are really interesting insights when we dig deeply into
incidents. So let's go back to
the beginning of the talk where executives asked, someone still
has to get fired, right? Well, sometimes,
yes. But when is it fair to hold someone accountable in the
traditional sense where you are actually letting someone go?
Well, Emily and I came up with a number of prerequisite questions
to ask as a starting point. So were the expectations for this person's job clear? Were they realistic?
Were they well documented? Did they know what they were supposed to be doing?
And were the mistakes of the incident a result of their lack of
skill, good intentions, or honest effort?
Have you been sharing feedback about their gaps in performance
on a real, consistent basis, making sure they know that they're
not up to par? And also, have you accounted for all
other contributing factors? When it is in the context of an incident,
holding someone accountable shouldn't be the easy way out. It shouldn't be something that you
leap to as the simplest solution, but instead something
that you resort to after accounting for every other circumstance that could have
led to the mistake. And as you can see, this is a very distinct
and separate process from incident resolution. This is performance
management. And just because there are incidents, which are normal and natural and happen with every company and every system, it doesn't mean that they can be a substitute for proper performance management.
So let's talk about being blamelessly accountable, having your cake and eating it
too, being both blameless in culture, but holding people accountable when necessary.
Well, at Twitter, we learned that accountability faces forward.
So it means that the team that is accountable will
take full ownership of improving reliability from the incident
point moving forward. It also means that
you're separating the reliability outcomes from performance management. Like we
were talking about before, performance management
should never be a substitute for resolving the incident in the best possible way.
And likewise, it shouldn't come at the cost of in depth contributing factor
analysis. You shouldn't give up on trying to find other causes
of incidents just because you've decided to hold someone accountable.
Absolutely. So really, there's no trade off between
blameless culture and accountability. It's not that if you are blameless,
you sacrifice accountability. You could very much have both a
blameless culture and also people feeling a tremendous sense of
ownership about improving the system together as a whole.
Leadership is critical in fostering a psychologically safe
culture, and it takes incredible empathy, stress tolerance
and critical thinking to get blamelessness and accountability
working together in harmony. But it is possible, as we've seen done in this example. So the example we shared is actually based on a true story, and in real life, the engineer, of course, felt bad, but was not punished in any way as a result of the incident. And the team worked together to implement the systemic changes to make the distinction between the testing environment and the production environment more clear. A perfect example of blamelessness and accountability working harmoniously together. So, as we worked on this
talk, Christina and I found a wealth of valuable resources.
If this subject interests you, we encourage you to check them out. We learned a
lot about empathy and conflict resolution. We looked at the reliability
journey of other companies and how they reached this point of maturity.
And we learned a lot about just what it means to be blameless.
So what do we do about the elephant in the blameless war room? We shouldn't
hide it. Let's ride it. Yeah. Thanks for coming to our talk.
Thank you.