Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, this is Saurabh Bangad, and I'm a technical account manager
at Google Cloud. Today's topic for the conf
SRE 2023 track is "Divided by responsibilities,
united by the DevOps and SRE cultures."
Let's dive into it. From an agenda perspective,
I'll discuss how enterprises end up in this situation,
what new problems that creates, and how to solve them.
Let's get started. Enterprises
evolve in line with industry-wide
best practices such as ISO 27001, which lists
a lot of best
practices and recommendations that enterprises can adopt.
And it all starts with good intentions.
I'm just calling out a couple of examples, but there are a lot of
guidelines. The primary one I'm focusing
on is separation of duties, which essentially means that
no business-critical decision should be made entirely
by one individual or one party.
It has to be divided into multiple separate
duties so that, in case of
any security or privacy incident,
the risk is spread across more
than one party. This is just the example
I'm focusing on; there are other practices, such as least
privilege security policies. Essentially,
give people only what they need, and no more than that.
It's a great security policy, but it can also
get enterprises into a situation where they
make one party more powerful than the other,
and that slightly reduces
collaboration. So this talk is all about how we can
adopt these best practices without giving up on collaboration,
using some of the SRE and DevOps best practices.
So let's dive right into it. Take the
previously highlighted best practice: that we should be
practicing separation of duties. One
common way of doing it is to let developers remain
focused on developing new functionality and
features that put the organization
in a better place from a business standpoint.
That essentially means developing features that
competitors have not developed yet, and continuing to make this
progress at a high velocity. Meanwhile, the other
set of people, the operators, should be responsible
for the maintenance, reliability, and
servicing of the given service.
If you look at this, these two goals
point in almost opposite directions.
One set of people, the developers, will be focused on
agility. They will be rewarded for the pace
at which they bring innovation and a new kind of culture.
The other set of people is responsible for
more and more stability, fewer
incidents, and potentially no outages
or minimal outages. These two
are not 100% aligned. One would
try to have more and more velocity; the other would prefer
to move slowly,
which reduces risk
for the overall environment. So essentially these two
incentives are not aligned. And it is
very common in enterprises to divide
duties between two or more sets of
people. At scale, it looks something like this, because
the larger the organization, the more the responsibilities
get split further: pre-prod versus
prod versus any other environment. The people would differ
across those as well. And there is not
just one application that we are focused on;
there will be more than one application, and these applications
interact with one another. So the responsibilities will be divided again.
So you can see where I'm going with this.
We have people with different goals, and they
don't necessarily collaborate with one another.
They haven't been given a mandate or a culture in which they would
necessarily collaborate. On top of
that, we also have organizations, fairly
common among enterprises and large banks,
that constantly bring in
new functions. For example, there will be a security team
that looks after security policies across the
organization. You may have an auditing team that
again sits independently but has a role to play. So for any
party that has some kind of role to play,
we are talking about the whole separation of
duties getting scaled up, and along with that,
people getting different goals.
So how do we actually address this? I'll be coming
back to the previous slide again, but let's see the common ways
these kinds of models are addressed in
a better way. One approach is DevOps style. If
you look at it holistically, we start from a concept that is relevant
to a business. We do the development of
that concept, and we have operations, which
keeps the lights on, and the market
drives the whole growth in a flywheel model.
The DevOps style
tries to bring developers and operations together
in more of a "you build it, you run it" fashion, which
gives them some kind of ownership, which is great.
There is also Agile, which tries to bridge the gap between
product development and product management. The business side
brings the product management perspective, and
development drives the product development.
SRE, on the other hand, takes a slightly more
detailed approach, more or less end to end,
with SLOs and mutual agreements that
are either formal or informal. Organizations tend
to get to a better place with the SRE style.
Now, circling back to the slide where we
looked at the bigger picture:
reorganizations can cause these divisions
to scale even further.
What does this mean? Look at
the gray portion that's sitting in between,
almost like a wall between the two or more parties.
It can be a gap if it's not documented, or if it's not
properly acknowledged by each party that this is the responsibility
of party X or party Y.
So each gray area represents risks
that are undocumented and not owned by any
party, and they can trigger at any point and cause a lot of
issues for the organization. To avoid this,
we will be looking into some collaborative practices.
Also, in an enterprise,
communication gaps make automation
go away. You can imagine two
or more parties working on a single piece, where a workflow can
have much more human intervention than intended,
which brings more and more risk.
And statistics suggest that
change management is mainly responsible
for outages. From a cultural standpoint,
if 70% of outages are caused by
changes in a live system, this is another pointer that
if changes are not coordinated well,
they will almost certainly end up causing an outage.
And this happens even more at
scale. So let's look into what else we
can do. Step zero is always to bring in
a culture where failures are
not looked at as a bad thing. First, we need to have
a blameless culture, a blameless philosophy, where postmortems
or retrospective analyses
are done to fix processes and systems,
not people. If people are blamed for issues
that emerge from systems, gaps, and processes,
it won't be a psychologically safe environment. So
first things first: you need to have a blameless culture
where people are highly curious
and are rewarded for learning, and where they're not
celebrated for heroism in recovering from an outage.
In the short term it works well, but in the medium and long term,
celebrating heroism
doesn't go anywhere. The organization continues to have more
and more failures, and you just don't evolve in terms
of moving from a bad system to a good one.
So let's dive into the main part,
the solutioning. First,
culture. Bringing in a culture requires a few things.
This is a text-heavy slide, so you could definitely pause and go through
each point. I'm going to elaborate on each point
as much as possible. First, a sense of ownership and accountability
is required from each party. They
need to see this as a joint problem, rather than "my
problem," "your problem," or "not my problem." It has
to be seen as a joint problem because, even if they
have different goals or sub-goals,
if they succeed individually but fail as an
organization, that is a bigger failure. So a sense of
ownership is the most important one to get there.
We need to have some core metrics that are agreed on by
each party. They all have to agree on some
common signals on which they collaborate and try
to improve. Moving from, let's say, 10% to 20%
to 30% is considered progress.
That would be a good signal, and they all have to
jointly influence those signals.
Communication and collaboration is something I'll describe more on the next
slide. They also have to agree on change management practices.
For example, team one has upstream
and downstream dependencies. They
should be communicating and coordinating on how
they manage change. Also, change management practices
in pre-prod versus prod would differ, as would
the type of change: is it a P1 or a
P2, and what kind of impact would one expect
from it? These are the basic parameters that agreeing
on change management practices would cover.
Continuous improvement is another one.
All the parties have to work together, and they should agree
on continuous improvements. For example,
automating a lot of things that don't require
human decision making, or reducing time by
eliminating some unnecessary processes. And people should
be rewarded for this. This is the best way to get from
a good environment to a better one. Lastly,
and this is probably the most important for enterprises where
reorgs are common: after every reorg, you should go
over all the decisions and
the working model, so that you end
up in a situation where all
the accountable parties agree again,
even under the new leadership or the new goals.
Let's dive into communication and collaboration. First,
you need to have bidirectional modes, such as Google
Chat spaces or Slack channels, where these teams
are present and have threaded discussions
on topics. Why do I suggest bidirectional?
Most often, if you're using just the
ticketing system or just emails,
it may not function as a collaborative environment:
the flow of information is usually one way, and it's
a formal way of documenting everything. While these tools
have their advantages, from a collaboration perspective
they also have disadvantages. So they should be used for tracking
purposes. But for collaboration purposes, bidirectional communication
channels work best. Also,
consider having cadences where you
bring initiatives for the next quarter,
plan them well, and execute them together.
Second, you should have joint artifacts,
so that documentation pages are not silos
or islands where people
develop their own things. Instead, they should be collaborative,
jointly owned documentation pages. This also
applies to roadmap items, where all
the teams together plan a lot of items that would
address future issues as well.
Every environment is dynamic, and you have changes of
some nature all the time. So roadmap items are definitely
one of the important ones. Lastly, the tooling.
If you have common tooling, you would be
encouraging collaboration to improve the tool as well.
If you have better tooling, you end up in a much more efficient environment
that is operationally smoother.
As previously suggested, changes should be coordinated.
So in your regular syncs
and regular cadences, you should be talking about what
kind of changes
each team is going to deploy, what impact they expect
on the other teams, either downstream or upstream,
and how it may affect their day-to-day
operations. These things
need to be defined as a common set of standards.
And lastly, going into best practices:
informal learning. Informal collaboration happens
when you are meeting agenda-free.
This could mean one of the teams sharing
its lessons from the most recent deployment and
how it brought new kinds of
changes to the environment, or TGIFs,
which are fairly common, as one can see in Google's
culture. Something similar exists in many other corporate environments.
So you could encourage those.
Any kind of knowledge sessions or shadowing opportunities would
allow one team to step into
another team's shoes and learn
what kinds of challenges they face. This would allow
them to bond and understand each other's constraints
and make progress together.
Finally, my last take on this: because
enterprises have very different dynamics,
collaboration can lead to complexities
that were not predicted. For example,
if you have meetings that are scheduled on a monthly basis
but don't result in outcomes,
they become toil and don't
bring any improvements. Unnecessary approvals
can creep in when you have too many teams.
In the name of collaboration, teams may start adding
approvals that work for the time being,
but after some reorgs or the deprecation
of some tooling, they become unnecessary approvals.
So it becomes important to assess whether
some approvals are unnecessary. Next, unaligned
maintenance windows. This means,
let's say, you have three teams that
have no change management protocol in
place. You may end up with one team doing
maintenance on a Monday, one team on a Wednesday,
and one team on a Friday.
If they have no relationship or dependency between
one another, it's absolutely okay. But if
they are upstream or downstream of one another,
they may cause a much wider impact for your
end users, where you end up with three
maintenance windows in the same week
that are not even on the same day. You basically have downtime on
Monday, downtime on Wednesday, and downtime on Friday,
which may work for those specific teams but will definitely
be a bad idea for your end users and your organization's
reputation. Encouraging
collaboration and better communication may also
create an environment where you are encouraged to build
tooling. This tooling may be absolutely fancy,
but it may be unnecessary and unusable from a business standpoint.
So you need to assess whether you are doing the right thing
from time to time.
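To make the maintenance-window point concrete, here is a minimal sketch in Python. It is not from the talk: the team names, the weekday schedule, and the helper functions are all hypothetical, and it assumes each team has exactly one weekly maintenance day. It reports which dependency chains expose end users to downtime on more than one day.

```python
# Hypothetical illustration: detect dependency chains whose teams
# schedule maintenance on different days of the week.
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def downtime_days(windows, chain):
    """Days on which end users of a dependency chain see downtime."""
    days = {windows[team] for team in chain}
    return sorted(days, key=WEEKDAYS.index)

def unaligned_chains(windows, chains):
    """Chains whose teams do maintenance on more than one day."""
    return [chain for chain in chains if len(downtime_days(windows, chain)) > 1]

# Assumed data: three teams that are upstream/downstream of one another.
windows = {"team_a": "Mon", "team_b": "Wed", "team_c": "Fri"}
chains = [["team_a", "team_b", "team_c"]]

print(downtime_days(windows, chains[0]))  # ['Mon', 'Wed', 'Fri']
print(unaligned_chains(windows, chains))  # [['team_a', 'team_b', 'team_c']]
```

If the three teams agreed on a shared window, for example all on Wednesday, `unaligned_chains` would return an empty list for the same chain.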
As we saw in earlier slides, there can be gray areas that pose
risk. You may have an environment
where there are no incentives for reducing these gray areas.
People may not do anything about them and just continue
to operate, even if it's an inefficient environment.
And lastly,
all the metrics should be aligned, as we discussed with
golden signals and agreeing on a common set of
metrics. In many environments, it
happens that, for example, if we go back to the previous
slide where we had a set of developers and a set of operators,
they have contradicting metrics, which makes the environment
not so great for the organization.
So that's it. That's all I had on my agenda.
Thank you so much for joining me.