Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome, everybody, to the importance of communication in SRE.
Today we will cover psychological safety in the context
of SRE. We have some case studies ready for you.
We plan to talk around 25 minutes, something like this.
And we will also show a little of the tool we used in our case
study. And yeah, let's start right away.
So my name is Florian Hubner, I'm in tech strategy at
Accenture and I'm working out of New York. Alex?
Thanks, Florian. Everybody, my name is Alexander Schopper, also based in New York.
I'm a principal director with our talent and communication practice and
lead our psychological safety capability here at Accenture. Over to
you, Marco. Thanks, Alex.
Hello everyone, my name is Marco Torre, I'm a senior manager
at Accenture, also a site reliability engineering
subject matter expert. So happy to talk to you about this
topic. Perfect. So then let's start
right away. Yeah, sorry, Florian, I was just going
to kick it off and just say, with respect to open communication
and SRE, what do you mean by that?
Why is it so important? So,
yeah, I mean, when we take one step back, what we
see is that the situation in IT, in corporate IT and
so on, is getting more and more complex for all of us.
It's getting more complex because of two run models.
A lot of companies, a lot of our clients, are still working on one
side in a waterfall model and on the other side with
agile teams, build-run teams, really progressive,
doing agile, doing DevOps, doing SRE and so on. So you
have two run models working in parallel in most companies. And this is
leading to more and more complexity. On the other side,
shifting responsibilities. Shifting responsibilities means here that
in the past everything was maybe on prem, and you had some vendors,
some solutions outside of your company. But this
is changing now. Over time, you consume much more
of the cloud, and the cloud provider is getting more important. Now you have
relationships inside of your company you must manage,
so the model there is changing. And then also
in production we have much more stress, so we need more
robustness in production, because more and more changes are coming
into production. Because of the change in frequency with
DevOps, with agile, we are changing how we are working, we are changing
the way we are working, and this leads to more frequent changes in
production, smaller batch sizes and so on.
So this makes a lot of sense, Florian.
When I think about this, though, and I think about open communication,
what I'm anchored on is the people aspect.
And what you're taking us through right now seems a lot like
process and technology. And I know there's a communication aspect to it,
but what about the lens on people? Yeah,
valid point. I mean, in all of these areas, the people are
key. And what we normally see here is that in this complex
situation, communication is really the most important thing because
you have these changing technologies. Changing
technologies means ongoing learning. This means we have to work better together,
we have to learn faster, and we are learning from each other.
Learning from each other always means communication, talking to
each other, understanding what each person is doing.
Collaboration is a key topic. At the beginning
we always have the silos: dev and quality,
security, operations and so on, silo-based working.
If we keep on going like this, collaboration is key,
but we have to bridge this wall, we have to tear down these walls,
and we have to work together and collaborate. There, too, communication
is key. Incident reviews:
there's a blaming culture sometimes in some companies.
So what we want to see there more and more is this working together,
a blameless post mortem culture,
speaking up, working together in a team when you have an incident, and
things like this. And then here, the last item:
different perspectives. In a team, we have different perspectives.
Yeah. So this makes perfect sense to me,
Florian. With my site reliability engineering background and
working with a lot of project teams, it's interesting.
All of these challenges that you're pointing out relate to
a topic that we've been focused on a lot around psychological
safety. Are you familiar with that term? A little bit,
but I think it would be good if you can describe it just for our
audience. Yeah, no, perfect. So on the next slide,
I mean, I'll give you the definition verbatim, right. And this definition comes from
Amy Edmondson, who's one of the lead researchers in this.
But essentially, psychological safety is the belief that
one will not be punished or
humiliated for speaking up with ideas,
questions, concerns or mistakes.
And when you put this in the context of site reliability engineering,
think about the soft skills that a site reliability
engineer needs, right. They're in a position where they have
got to bring up or voice concerns over the reliability
of systems. How resilient are our systems?
Problem solving: how is it that they're
applying this? Right. This might mean engaging with the business,
engaging with architects, and having some really uncomfortable
conversations. Right. So I'm in
no way, shape or form an expert in the theory
behind psychological safety, but if you move on to the next slide,
I'd like to talk about the differences that we see
across four dimensions. Right. When we shift our attention
from the traditional approach to how we build
confidence and trust with our teams, how do
we shift that when we're thinking about this in the context of site reliability
engineering? So let's think about it from a blame perspective,
mistakes that can be made, feedback that we're sharing back with
our teams, and then, of course, the elephant in the room.
Right. Management, our leadership. Right. So a
lot of times we're heads down, focused on the technical work that we have
to accomplish. But like I said before, given
the soft skills that an SRE needs, it's very important that we
have an environment where we feel safe,
where we can talk about these items, right. And so I'll quickly touch on some
key points within these four dimensions. But number
one, blame. Do you feel like you're in a team where
there's constant finger pointing going on?
So there's an issue in production, and your first instinct
is, it's not my fault. My code passed
all tests in the lower environments. I'm sure that's networking.
I'm pretty sure that they missed the configuration. We're always telling the
infrastructure teams, you've got to get on GitHub, right? You've got to version control your
changes. But even still, if there was a mistake,
right, how are we making our team members feel
when they actually do make a mistake? Right. Do we talk about it?
Do we voice it? Is it okay? Or are
we penalizing folks for that? That goes on to
that next dimension of feedback: are we sharing back that feedback?
Do we have the trust and confidence to accept feedback from our
team members? Right. And most importantly,
do we have confidence and trust in our teams?
Whether we're in a leadership position or in an individual
contributor position, do we trust our colleagues,
our coworkers, or our team members to actually deliver what they have to? And
why is this important in the context of site reliability engineering?
Well, think about it, right? We're not going to get into the concepts
of, say, for example, service level objectives and setting up your service level
indicators, but these are essentially tools and practices
that allow us to take risk in production environments.
Right. And do we feel that we have a culturally safe
environment to take risk? And that's what we wanted to focus
on as part of why psychological safety is so important.
And I want to point out one topic here also on this slide.
What we see in the traditional approach is that we have maybe the structure
like biz, dev, QA, support, and maybe security also in
the middle. It's easy to blame someone else because
this person is maybe not sitting in my team. We can also see,
when we break down the silos in an organization, that when I'm
responsible end to end for something, there's not so much blame anymore.
So this also leads to a better culture, a culture
where people are responsible end to end for something and
there's not so much finger pointing anymore. So the organization,
the form, how we structure teams, also leads a little bit in
the direction of how we treat people, how we
talk about blame, and whether we point fingers or something like this.
Definitely makes sense, Florian.
So on this slide, I mean, when we started our journey, when we
started to talk about psychological safety and trust, we also
thought, okay, is this a huge topic? Is this a topic which
can really improve our team?
We looked into the research, and what we found there is
that the numbers clearly indicate that focusing
on psychological safety, focusing on improving trust,
is a key topic for your team, a key topic
for improving your innovation, for example, your productivity
and your skill application. Marco, maybe you would
like to give us some more details on the slide and
explain a little bit how you see this.
Yeah. One thing I would like our audience
to anchor on is that first metric that we call out, right?
So a 67% higher probability
that employees will apply a newly learned skill.
So essentially, for us to
introduce any type of change, and that could be a technology change,
that could be a process change, or just introducing the role
of an SRE in an organization, that's going to introduce new
silos, new trainings. But as most of
us know, having worked in organizations for many years,
organizations tend not to like change.
So investing in communication is definitely going
to pay off, as we see: having that ability
to quickly pivot an organization and the skill sets of a whole
team or a whole group or a whole technology group to start
enabling the culture of, yeah, sure, I'll pick up and learn a
new skill if that's where the organization
is technically headed. Right. So it's definitely something
that we want to call out and focus on.
And there are other metrics, other benefits that we can see:
76% more engagement, eleven times
more innovation, and 50%
more productivity. And so you might be asking,
well, how is it that we approach this? Right.
So the way we approach this is
essentially by focusing first on what those
challenges are. And I touched a little bit on some of those. So number
one, culture is intangible, right.
It's very hard to change culture.
The way we approach this is with a data-driven approach that
we'll give you guys an overview of. Also, on
that same theme, organizations are just resistant
to change. Right. It's very important that we get leadership buy-in.
Right. They need to be fully on board as well.
And then overall, you guys are starting to understand what it
means to be in a low psychological safety environment.
Right. It starts with our practitioners.
Right. They're the only way we're going to be able to spread
that bottom-up approach. And on the right hand side, I mean,
you'll see a couple of the activities that we run through.
We won't spend too much time on that because we'll run through those activities as
we look at the insight scan tool
that we use to measure the effectiveness
of culture change. So with that said,
we're going to hand it off to Alex, who is
going to give us the case study overview and how it is
that we actually applied this. Cool. Thank you, Marco. So that's exactly what
I'm going to do in the next few minutes: talk a little bit about
actually bringing this to life. So how did we experience this with a
team that represents, I think, some of the experiences that you've probably had
in the last few years and something we come across very often?
And then through that, I want to share, as Marco said, a tool that we
use quite frequently, but also a framework to really understand
how you make psychological safety actionable. It's kind of a big
term, and I think, as Marco and Florian pointed out,
it means many different things to different people. So I want to share a framework
for how you can bring this to your teams and make it quite
actionable. So in this particular case I want to talk about,
we worked with a product engineering team that needed
to respond to a really dramatic increase
in sales and really get more client centric
during the pandemic. I think you might be able to relate to this.
We saw incredible productivity and efficiency gains.
Online sales went up as a result of that. And so that put a lot
of pressure on engineering and SRE teams.
And so the challenge really for this team was
how do we respond to the demand, the increased demand in business by
introducing new features, keeping up with the number of site visitors
and traffic, all without compromising quality and
security. And the questions I want you to think about over the next
few minutes are: how did this team overcome a culture of blame?
How was it able to learn from mistakes? How did we eventually
evolve the ways of working as we saw the increasing demand?
And ultimately, is this a model that you can scale across the
organization, taking some of the agile best practices
and bringing those back into SRE?
So let's go to the next slide quickly, because
what I want to highlight here is essentially the approach we took,
and you'll see that we have different key moments
in this journey that we did with the team. And it all started with
the first point here, which you'll see is called insight scan and debrief.
Another way to look at that is as a retrospective or blameless post mortem.
So what we did over twelve weeks was run
three or four retrospectives or
post mortems over the course of three months that were all data driven
using the insight scan tool, and we used online collaboration tools like Mural
and Miro to facilitate anonymous conversations. I'll talk a little bit more about
that in a few minutes. And now you might be wondering,
like, why? It seems kind of random. But what we did,
too, is look at what it takes for behaviors
and new ways of working to actually
stick with the team. So the research points out that teams and
leaders start to internalize new behaviors on their own
once they've repeatedly done them over the course of three months.
So every two weeks, through smaller actions and
reinforcement. And so, to make a long story short,
we followed the cycle of running these workshops, using this tool
we're going to show, and then off the back of that, really getting to
action very quickly to understand what it is that the leader needs to change and
what it is that the team members can change, and addressing those with leadership trainings,
coachings and workshops. And if we go to the next slide,
you can see this is the framework that I mentioned earlier that
can help us think about bringing psychological safety to teams.
And you might recognize some of these terms here around flow state.
But before we go there, I quickly wanted to highlight that the research is
clear in pointing out that it's not sufficient
just to focus on psychological safety. And that's been our point
of view and what our work has shown. To really understand what
it takes to be an effective team in an SRE context, and what's expected
of the behaviors that Marco and Florian pointed out earlier,
you also need to look at intrinsic motivation, or as you can see here on
the x axis, team intrinsic motivation, these motivational drivers.
So that talks a lot about purpose and collaboration. And when we
see teams that have both high psychological safety and
high motivational drivers, they actually proactively assume
accountability. By comparison, teams that are in the
fear zone are more often held accountable by leaders.
It's a big difference. So it's, do I actually feel comfortable? It's a step up.
Do I have psychological safety? Am I motivated to own a mistake, to flag
a risk, to bring up a concern in a post mortem? It's very
different from somebody asking a team member, what concerns did you see?
Or, please tell me who made that mistake. That's typically what we see
in the fear zone. But the flow zone is really characterized
by teams that want to learn from their mistakes, which is what we call positive
error orientation, and that have much higher levels of proactive
communication. You don't see those behaviors in either the apathy
or the fear zone. So that's actually what's holding
some of these teams back, that lack of psychological safety. And,
Alex, I know you're touching a lot on the point of fear
and making mistakes, but we also touched on
the topic of being more innovative. Right. So you can
also think about, hey, it could be a team that's pretty
good, not making that many mistakes,
but they're also not raising their hand and saying,
hey, we want to try new technologies or we want to test this out.
We want to take a risk. Right. And
the way I understand it, teams that are in that flow zone
are more prone to take risks. Right. That's definitely the
type of quality and culture that we need
if we want to enable an SRE organization.
Right? That's correct. Yeah. It's a really good point. It's really around
calculated risks or acceptable risk, which is if we're trying to
solve for complex problems and we default to the
ways of working and the inputs and the ways we're familiar with, we're not going
to be able to do that. So as requirements change,
you want both the team proactively to share new ideas, but you also want the
leader in the silos state to be able to frame that correctly and say,
here's what we're trying to solve for. We don't know the answer yet,
and here's the acceptable bandwidth of mistakes and errors we can make.
And here's where it's not acceptable. So a clear delineation of where mistakes
are okay and where not. And I like the term calculated
risk that you use because essentially, if you're familiar
with the concept or practice of service level objectives,
right, we're taking calculated risks,
monitoring our systems in production, right,
while enabling new features, and we're using the error budget to
ensure that we're within those guardrails.
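As a rough illustration of the error budget idea mentioned here (this was not shown in the talk), a minimal Python sketch might look like the following. The function names, the 99.9% target in the usage note, and the 20% threshold are all hypothetical choices made for the example, not something the speakers specified.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The SLO tolerates this many failed requests in the measurement window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def can_take_risk(remaining: float, min_remaining: float = 0.2) -> bool:
    """Hypothetical guardrail: ship risky changes only while enough budget is left."""
    return remaining >= min_remaining
```

With a 99.9% target over one million requests, the budget is about 1,000 failed requests, so 250 failures leave roughly three quarters of the budget and the guardrail would still allow a risky release, while 1,200 failures overspend the budget and it would not.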
Yeah, sorry. This topic is exciting.
Right. But, yeah, I mean, what I also wanted to call
out, what it sounds like, Alex, is that you essentially have
a quantitative approach to measuring culture,
which is something that is not really well known.
Right. They always say it's hard to measure culture, track culture. So that's
something you can actually show us. Yeah, that's a great point.
Let me jump into the tool that we use, because I think when you look
at comfort and apathy and fear, even that means a lot of different things to
different people. Right. And I think everybody listening probably
has been part of retros or post mortems where you
talk about mistakes or you talk about vulnerability.
Are you actually able to share what you did wrong? Even those things are
very subjective. Or the leader says we need to become more
effective. Well, what does that even mean? So I think what we tried to do,
and I'll share my screen here,
is really take a data-driven approach to blameless post
mortems and retrospectives. So what that
allows us to do, and I'll go through this in a second here, is really
to anchor a conversation on an objective baseline.
So we're coming into this with a survey tool called the Insight scan.
It's 15 questions that generate real
time data on team level psychological safety and motivation.
And we look at factors like courage or vulnerability,
accountability. So a lot of these things that are front and center to
a blameless SRE culture. And what we
want is for the team, during the retrospective
or the blameless post mortem, to actually measure this in real time.
And this is the case study of a team we worked with that, as you can see
here, was in the flow zone and was doing quite well.
So in this particular case, I can show you this dot right here.
We started working with the team, and it exhibited both high psychological safety
and motivation. And then something interesting happened. You'll see here,
highlighted by this dot, that there was a security incident,
there was a data leak of private information. And ultimately
what happened, you can see here in that drop of psychological safety
that shows how the team responded. So what we observed
from that drop in psychological safety was personal breaking.
There was some undermining going on. There was finger pointing specifically
to two people who were supposedly responsible for the incident.
And that's how the conversation took
place. What we did was combine some
of the best practices of a retrospective, to really focus on what
the ways of working are on this team, with the best practices
of a blameless post mortem. So we addressed questions
like, what were the failures that led to this in terms of
our ways of working, how much do we want to own this decision
as a team, and what will remain on us in the future if we don't
solve this issue? So anchoring it to the future success
of the team and asking, what's the cost of not talking about the root cause
here, not from who's to blame, but from what ways
of working and what processes got us here in the first place?
And then, really importantly, what can each individual
on that team do better to improve the situation? And what
you can see here, actually marked on the left
by these little dots where my cursor is: courage was incredibly
low, as were vulnerability, inclusion and collaboration.
And so we talked about each of these dimensions, about what each
individual and what the leader can do on that
team to address some of those challenges. And so what that led to was
a change in the ways of working, but from a process side,
also improving feedback loops, being more thoughtful
and reflective, and actually visualizing some of the risks more clearly
when they created tickets. So we had some very tactical takeaways and
then some more on the soft skills. We talked about how
the leader needs to frame for uncertainty and actually lead a team
like this. And so you can see here, the team, because of that,
was able to jump back into the flow zone and actually sustain that
for quite a while. But the key point here, as
you can see by these micro movements and then even by this last drop,
which had to do with an external development, a change in strategy which
caused uncertainty, is that psychological safety
is not something that stays once you've achieved it. It's not that once we
feel psychologically safe, it's always going to remain that way or remain that way for
a longer period of time. It can actually change with
the word of a leader, or with an incident that was flagged or not.
And as you can see here, by this last drop
from when the team was in the flow zone back down to the apathy zone,
that reflected a change of strategy. There was nothing particularly wrong
that happened. There was no incident, there was no security breach. It was a
change of strategy and a shifting expectation of the team. That caused a lot
of uncertainty. And you can see the impact that this had on this
team, and this is really a proxy for team effectiveness. If a
team drops like that, that's going to have an impact on the velocity and output
of a team. And because we did a similar style retrospective
that I just outlined, the team was able to bounce back pretty
quickly, redefine its ways of working, had to change its north star and
jump back into the flow zone. So we did this whole process over
three to four months with this team. OK, Alex, I think this
is really... Yeah, four months.
And then what happens when you're not there anymore?
How is the team behaving and acting when
this is over, when the psychological safety scan is done?
Yeah, that's a good point. I think the first part is that what
we saw is teams start to have this vocabulary around
vulnerability and inclusion. So they start using these terms to
actually make their conversations a little bit more specific. And then
I'm going to show you in a second an example of kind
of a best practice retrospective.
And I think that's the key piece, that teams need
to do this on a recurring basis. And I think what we saw
with this particular team is not just a post mortem when the
incident occurs, but actually continual retrospectives. So the
team builds the muscle memory of being able to have these conversations.
That's a really excellent question because a lot of teams don't feel comfortable
having these types of conversations or being vulnerable.
And then when we do ask them to do that, we do it kind of on
an ad hoc basis with post mortems.
But it's actually a muscle memory that a team needs to build up over
time. So after a while you are really changing the behavior, you really
change the language of the team and you change how they act.
So they are bouncing back much faster into psychological safety and
the flow area when you stop doing the scans.
Yeah, exactly. The ability to sort of get to the root cause of
what's going on is a lot higher.
So I talked a little bit about these retrospectives and blameless post mortems.
So here's an example and some best practices of how we actually brought
the best of both worlds together. And I hope that
you can think about this too in your respective teams,
in your organizations, as some guiding points
that you can bring back for retrospectives and postmortems.
So the first one is using real time data. I think, you know, through the
tool, hopefully it became clear that it's not so much a
conversation of did Florian or did Marco say something,
but what is our team level perception? It's not about individual perceptions
necessarily, but really about how we function as a team.
And that data has to be anonymous. That's very important. But it
needs to be a data-driven conversation because it takes away
that initial personal element and anchors it on an objective
truth. The other piece here is that
in the beginning, as I said, it's really hard for teams to open up
and actually develop psychological safety, sometimes even
incredibly hard if that low psychological safety is coming
from the leader themselves. And so you want a neutral
third party to be able to facilitate the retros or post mortems and
give everybody a fair chance to speak without criticism
and without the leader jumping in to defend certain points.
And the second piece you see here is
around these sticky notes, and this is, I think, one of the great benefits of
working remotely or virtually through Mural or Miro: you can
actually give people the ability to comment and share their perspectives
anonymously and confidentially. So often
retrospectives are held where we expect people to come off mute and share their perspective.
That's a big jump if you don't feel psychologically safe. So a
way to get there is to enable anonymous mode
and then pinpoint and have conversations around specific data points,
maybe, why is courage low? Why aren't we collaborating? It's not so
much about who said what in this context. It's about understanding:
do we have a common perception? And what are some of these pain points that
we're seeing across the team? If we go to the next slide quickly,
I can make this very short because you can see here that I think the
best practice is you don't just come out of this with a root cause analysis,
but you really want to leave with specific improvement
ideas, not just as a team. That's what you see here with number three,
but then number four, who's going to own this? And I think that's often
the big gap in post mortems, is you agree, you understand you
have a way forward, but there's not a clear set of expectations
and accountability structure around, well, who's actually going to do this? And I think the
critical piece is it's not just leaders or the scrum
master or other people who need to own this going forward. It's very much around
what can every single individual do to contribute to this?
And I think the key piece is small steps, micro learning. So that could be
a specific activity to say, I will raise one concern
in the next post mortem, because maybe previously you haven't spoken
up at all, but just really raising one concern and starting there and seeing what
the feedback is, and maybe the next time you feel more comfortable to raise two
or three things that you've seen, but starting small and giving
everybody the chance to weigh in is a critical piece here.
And so you might ask,
okay, why does this approach matter? And I think we talked about some of the
immediate benefits as they relate to incident
review. But if you take a step back and you look at
why teams benefit from this long term approach of working
over twelve weeks and internalizing new behaviors, this is
the data we saw across all the projects that we've worked on.
So when teams internalize these best practices, take a data-driven
approach, reflect not just in post mortems but regularly
in retrospectives to build that ability, and the leader takes an
active part and not the role of the justifier,
you see that psychological safety increases by an
average of 30%. Motivation goes up by 57%.
About two out of three teams were able to move into the flow zone.
And just to double click on that, if you look at what specific behaviors improved
as teams moved into the flow zone, it's the ones on the right. So purpose,
relatedness, accountability and courage. And then coming
back to our earlier point, if we want to really have a true learning culture
where we take the best practices of incidents and learning
from mistakes or sharing new ideas,
you need the behaviors of courage and accountability specifically, and you can
see some of the improvements that have been made here. So there are real tangible
team benefits, real benefits to team performance
through this approach. Exactly. So I appreciate
all the insight, Alex. And anchoring this back to
the theme of site reliability engineering, I get a lot of questions from
the project teams that I work with and from
reliability leaders, and they ask, how is it that we can
enable SRE across our organization? Because it's not so much just the
role, but it's the practice, right. We want to enable some of these qualities and
characteristics, and even if we're hiring externally, right,
the SRE role is so niche that it's hard to fill.
So when we start thinking about how we can enable some of the folks that
we have already, it's not intuitive to
think about, right? We're always thinking technical, we're always thinking skills. But how
about some of those culture changes that we can make, right.
And like I said before, this will enable us
to have those conversations with our product owners, with our business counterparts about,
hey, we want to define some service level objectives for some of our critical business
applications. Or on that theme of
post mortems, right, being able to be more open
will allow us to maybe find some of these systemic
issues that are around that perhaps somebody is
just too scared to bring up because they're like, well, I know that's going to
open up a can of worms and I
don't want to get involved in that project, right. But that's
essentially the difference between someone or
a team who's in the apathy or fear zone versus being in the flow zone.
Right. And I'm sure there are also some organizational
impacts that, Florian, you probably know about and would love
to touch on. Yeah, for sure. So one key topic
I want to touch on is really this: we talked now about the
team level, but this is also enabling full enterprise transformation.
We see that many organizations or many teams we
are talking to are a little bit in this "yes, but" situation.
Yes, we understand we have to change. Yes, we understand we have to pick up
a new skill, for example, but we don't have time for that,
but we don't have the tool for that, and so on and so on.
So a lot of teams are finding excuses all the time for why they cannot
do something. And many times this can be because they
don't feel psychologically safe enough to go into a risky situation.
So the learning here, also for us, is that
a transformation is slow, or a transformation may stop totally,
when people don't feel psychologically safe, because they
just want to keep on doing whatever they are doing. They're not changing their
behavior. So this is important for an enterprise, this is important for a
full organization, if you would like to change, if you would like to
transform, and we all have to change and transform all the time.
So psychological safety is a key term here,
especially in SRE because SRE is something new.
It's a new term, it's a new behavior, it's a new skill
and topic for most companies.
Okay. I hope we could explain
a little bit about psychological safety, why it's so
important and so on. If you have more questions, feel free
to reach out to us. You can find us on LinkedIn.
Drop us a message. We are always happy to talk about this topic or also
other topics in SRE, DevOps and so on.
Please feel free to reach out to us.
Thank you. Thank you.