Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, hello. My name is Jorge Castro. I work as
a transformation leader. I have some experience in DevOps,
SRE, quality, and the Agile world, working at
large enterprises and also with
different kinds of programs, IT, testing, and so forth.
So I'm very happy to be part of the event. I hope you
can enjoy my talk. Okay, the name
of my talk is Building More Reliable Products Through SRE
Community of Practices: Connecting People Makes Better Continuous Delivery.
So basically, during my talk I'm going to share with you real experiences
helping and working with my clients on the challenge of building
more reliable products and services.
The SRE community of practices helped us
to reach that goal. First of all,
my introduction. As I mentioned before, my name is Jorge Castro.
I work as an agility, DevOps, software
engineering, test, and data transformation lead, coach,
agile coach as well, and program manager. And also I
am very lucky because I had the chance to
be a speaker and keynote speaker at some events about agile,
testing, and so forth. So you can see here
my contact information and also my LinkedIn account. So please
add me to your LinkedIn. It would be awesome if we can
keep in touch after this session and we can build community
and share experiences. Sharing is caring is
my mindset, so I believe in that.
And actually that is the reason why I'm here, sharing knowledge.
Okay. Basically in this talk we are going to share our
experiences facing the business challenge to build more reliable products
to meet customer needs at enterprise level. As you know,
when you work with teams or at the team-of-teams level
in different companies, the size is quite important if you
are going to talk about large enterprises.
So my experience is basically helping these large enterprises
to build more reliable products,
designing and building community of practices,
in this case, SRE community of practices.
So the idea with this is that when you
have this challenge to build more reliable products,
you are going to have several obstacles like
lack of team collaboration and so forth. So something
key here is that a community helps us to
foster team collaboration, share knowledge, and fix
the skill gaps, and also promote
hands-on work while we promote
that walk-the-talk culture, right, in different situations.
So yeah, that's part of the story. We are going to share
with you our real experiences from the trenches, dealing with these
bottlenecks with software development teams.
Let's take care of the basics.
I assume that most of you know what SRE is
and so forth. But anyway, I think that we need to
start with the basics. Okay. What is site
reliability engineering? It's a framework, right?
It's a framework to handle
this operations structure, to manage
the reliability of the products in production.
SRE is what happens when you ask a software engineer to
design an operations function.
SRE focuses on running systems in production, and basically
this work is done by development teams.
So another approach or another concept that
is quite important in SRE is service
level objectives, the SLOs,
which are basically agreements about the
expected reliability and availability of products and
services in production. It's a kind of agreement
between development teams and operations and also
our customers. It's quite important for SRE
purposes. Another key point about SRE
is incident response processes.
Because we are talking about production, as you know,
in production we have a lot of situations, right? Especially bugs or
incidents, you know. So SRE is also
about these whole processes, about catching,
finding, sorting out, and
fixing the root cause of issues in production.
So yeah, it's a very important topic
in SRE, these incident response processes.
So this is a kind of summary about what is
SRE, about this framework.
So I hope that basically this can help us to
align our main knowledge about SRE.
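
As a rough illustration of the SLO idea described above, here is a minimal Python sketch that checks a measured availability against an agreed target and reports how much of the allowed failures (often called the error budget) is left. The class name, function names, and the request-count inputs are assumptions made for this example; they are not from the talk or from any specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """Hypothetical availability objective agreed for one service."""
    service: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def measured_availability(good: int, total: int) -> float:
    """Fraction of requests that succeeded in the measurement window."""
    if total == 0:
        return 1.0  # no traffic: treat the service as fully available
    return good / total


def slo_met(slo: AvailabilitySLO, good: int, total: int) -> bool:
    """True when the measured availability reaches the agreed target."""
    return measured_availability(good, total) >= slo.target


def error_budget_remaining(slo: AvailabilitySLO, good: int, total: int) -> float:
    """Share of the allowed failures still unspent (negative if overspent)."""
    if total == 0:
        return 1.0  # nothing was served, so nothing was spent
    allowed_failures = (1.0 - slo.target) * total
    actual_failures = total - good
    if allowed_failures == 0:
        # A 100% target leaves no room for any failure.
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures
```

For example, with a 99% target and 1,000 requests, 10 failures are allowed, so 5 failed requests would leave roughly half of the budget, while 20 failed requests would overspend it.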
Okay. About the principles, which are
quite important. You know, principles are important in
any kind of framework or methodology or
mindset. Number one, SRE needs
SLOs with consequences. Yeah, that's quite important.
As I mentioned before, SLOs are service
level objectives, which are agreements about the
expected availability and reliability of our products.
And those agreements are between,
or among, sorry, development, customers, and operations.
So yes, if you don't achieve
some specific SLO, there are some consequences for
the service, for the quality of your product, for the trust of your
clients, for how trusted you are
in terms of the quality of your products.
Number two, SRE must have time to make tomorrow
better. I think that is quite important because, like any other kind
of framework that maybe has some similar
roots or
origin in the Lean mindset,
or maybe a framework that has some common
things with Agile, DevOps, and so forth,
SRE is also about continuous
improvement, you know. Be ready for that and
analyze metrics, analyze processes, analyze what
is happening end to end in software development, to make better
products in the future, in the near future. So yeah, that is a very
important principle, which is aligned to the Kaizen
mindset, this continuous improvement mindset,
right, to improve the operation of your
products in production: the quality, the availability, and the reliability.
Number three, SRE teams have the
ability to regulate their workloads. That is quite
important, right. And actually I
like to relate this number
three principle to the cognitive
load approach, right. SRE teams,
like any kind of team, should be
able to regulate and manage
the workload in a good way, to avoid
the famous cognitive overload, right, and to avoid
overloading the team with work and
producing a negative impact on the ways of working and on
quality and availability, of course. That is
quite important. And number four, failure is an option.
Sorry. Failure is an opportunity to improve.
That is quite important. And actually this is
part of, I'm not going to say a new
mindset, but at least a good mindset that
we need to sell and we need to foster in
our development teams, right: that failure is
not something bad, it's something that we
need to use and we need to go through if we
want to be masters of something, if we want to get the
best quality of something, if we want to do better
and better in each iteration. Most probably failure is
the path. So, yeah, it's quite important that
we foster this kind of mindset in our teams.
And actually this number four principle,
I relate it to the psychological safety
approach, right. Because it's about feeling
okay about failing, but having
the idea that
you need to take the best from that failure to improve the future,
right, to improve your product. So, yeah, that's quite important.
Okay, so now after this alignment on
basic SRE concepts,
we can talk about a real story,
right? A real experience. So real life, real-life
business, right? Dealing with customers, with developers,
testers, production, with issues and so forth.
So, yeah, once upon a time we were
working at a large enterprise,
a large IT enterprise.
Something that happened in that company is this, right?
We had global and diverse teams involved in continuous
delivery. Yes. We had people
from Latin America, from Europe, from different countries as well.
So maybe more than 600 people
working with different products, moving code to production,
maintaining code, doing quality, and so forth.
And of course,
people, software is about people, right?
I think that is key to understand if you are in this business.
So in our teams, we had different people with
different cultures, time zones,
skills, ways of working, and people
in different roles as well. So that is something quite
important if you want to develop whatever
practice or whatever enterprise capability in your
company. My first advice is
to understand the reality of your team,
how global your team is, and also
how diverse your team is in terms of technology, locations,
time zones, skills, ways of working, and so forth.
So that was part of our situation
with this company. In that company, we had some problems
with our mobile products, with our web
applications, and actually
with our in-house applications. Those were
the problems after going to production.
We faced that, to be honest with you.
And also we faced other problems, like
unreliable products and services. You know, we had problems
in production, so our products were not reliable.
Low product, sorry, low product availability as
well. We had some times when
our products were not available.
Lack of enterprise capabilities. Yes. We had only a few
people with strong capabilities in, for
example, SRE, right. We didn't have a pool of
engineers with SRE capabilities,
only some of them. So yeah, it was a problem.
Low organizational resilience. I think that is quite important,
right. Because if you don't have that, most probably
when you face some kind of change in
your architecture, infrastructure, platforms, and so forth,
you are going to suffer a lot of pain during
the change, of course. And finally, lack of collaboration and
sharing. Right. We had some people who
knew the business of SRE. They were very technical. As
I mentioned, right, we had people from different countries,
different. But we noticed that we
suffered from a lack of collaboration and sharing.
People were not working together. It looked
like we worked for
different companies. So we didn't share
goals. Okay.
That was the context we needed
to change. Right. And as you know, change is hard, as Nancy Heart
said, but not changing is worse.
I totally agree with her. So we decided to change,
of course. Because of the situation
that I explained before, we decided to apply communities
of practice, CoPs. I like this
concept, this definition from
Etienne Wenger and Beverly Wenger.
According to them, a CoP is a group of
people who share a concern or a passion
for something they do and learn how to do it better
as they interact regularly. I like this idea because
in the end that is the approach that we
wanted to sell to our client, right, to our company.
Community is about people. So people taking care of a problem,
right. A business problem, a real business problem. And community is
people learning together to improve something, right,
have fun, and interact, right, in a
positive, proactive way.
Okay. About this first approach, okay.
We said, okay, we have these communities of practice, we have
these challenges about SRE and our unreliable products,
lack of collaboration, low
availability, and so forth. We had our first
thought about it. Number one, our community should
help us build ways of working, SRE ways of
working. Foster experimentation, yes,
because we noted that most of the problems that we
had were because people didn't want to try
SRE practices or DevOps practices or
new tooling and so forth. So yes, we had to foster
experimentation. Another critical topic
is collaboration, right. As you know,
the most important asset in any kind of
company, more than software, is people and their
skills and their knowledge. And if you want to develop
this kind of capability through your entire organization,
collaboration should be part of your DNA as a company.
So that was part of our thoughts, what we were looking
for in this community. And finally, build outcome-centered
planning. We were very sure about designing
a community not only for, you know,
bringing people together and sharing stuff.
We wanted to impact the business, right?
Make people design and run the
community and look for results to have an
impact on our outcomes. So that was part of the approach.
So we decided to create our SRE,
site reliability engineering, community of practices.
Our second thought was about the learning experience,
right? Experiential learning,
learning by doing, or walk-the-talk learning,
right. The idea was that we need to learn
new stuff. We need to prepare people to learn more stuff,
build new capabilities, SRE capabilities, in our community,
through our community. And for that purpose, we followed
this approach, this learning
by doing approach. First of all, concrete experience.
So basically, in our community, we shared real
experiences, real situations, working to
solve problems with our clients.
Then we had a reflective observation of the experience.
So basically, we analyzed the good things, the bad things,
the context of the experience, the metrics
involved, the people involved, and the whole situation,
because we considered that we
needed to get that experience from this kind of shared
knowledge from our community. And then we
went to the abstract conceptualization, which
was concluding and learning from the experience.
So basically, it was: okay, about this situation,
new situation, new skills, new practices,
I analyzed the context, the experience, the metrics,
and so forth. So I conclude with some ideas about
what the best moves are to implement this approach in
my teams: maybe run some workshops,
promote some gamification approaches,
do some mentoring and coaching.
And finally, learning by doing, right, active experimentation.
At this point, something that is key is psychological
safety, or basically trust
in your experiment.
Don't feel, you know, panic about failures,
and do the experiment, right, and do
it right. That is the most important part, of course.
Do experiments in, you know,
maybe small contexts, and then if that works,
you can scale the solution. Of course,
that was our approach for learning by doing. Okay,
about the team, right. I think it's quite a traditional
team. You know, we have our community with
different engineers from different countries,
business units, and so forth. And we have a core team
inside the community. You know, the core team was in charge of
designing, facilitating, and organizing at
least the first sessions and the
first steps of the community, because our purpose was
to rotate this core organizing team. So anyone
in the community could have the chance to organize some
sessions of the CoP. We have the leader,
which is basically the guy,
the person in charge of leading
this whole approach, dealing with the upper managers, with the stakeholders,
with the other communities, to design
and to foster the best practices inside the community
and drive the community in terms of value,
impact, and the best for the practice
and its development, you know. So yeah,
the lead is a very important role. And as part
of this approach, we had the backlog of the community, you know,
with all the challenges that
we wanted to develop and sort out with our community,
you know, lack of some skills, certifications,
some business implementations, some SRE customer challenges,
and so forth. That was part of our CoP
backlog. You know, the gaps in our current capabilities
in terms of SRE. As part of that, we also have
OKRs. Our community of practices, our SRE
community of practices, has some OKRs.
Okay. Something that was quite important was to learn
from the past, especially from the failures. And as you may know,
this approach to creating an SRE community was not the first
attempt to create a community inside the company. So that is why
learning from the past was quite important in our experience.
So about this topic, please be sure that you
understand and share this advice with your team members, with your stakeholders,
and so forth. A CoP is an investment. So it's
an investment: an investment of time, an investment
of talent, and so forth. So you need to handle
this approach in that way. It's an investment.
And then, the SRE CoP is aligned to
business strategy. That is quite important.
You need to understand what the business
context and the business challenges are. So with
your SRE community of practices,
move your OKRs inside your community to
produce impact on these business goals.
Right. The business challenges are
going to be more products, velocity,
quality, reliability, winning more clients,
and so forth. And I'm pretty sure that an SRE CoP can
help you with that. I'm pretty sure about it.
Okay. Some examples of OKRs that
we designed in our community. Number one, improve your reliability
and availability. Okay. That was one objective.
And as an example key result: achieve an X
percent reduction in the number of incidents impacting production
services. Another example, number two,
improve team collaboration. Key result:
launch X cross-functional workshops or
hackathons with global groups from different teams.
Right. And number three, increase SRE enterprise capabilities.
Key result: increase participation in SRE-related
training courses or certifications by X percent
within the community. Okay, those are examples that
we used in our community. You can add more, you can
choose different ones, but basically please remember that depending
on your business challenges, depending on the business
strategy that you are aligned to, you need to define your OKRs.
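
To make a key result of the "X percent reduction in incidents" kind concrete, here is a small Python sketch of how progress toward it could be tracked. The function names and the idea of comparing a baseline period with the current period are assumptions for illustration only; the talk does not say how the community measured its key results.

```python
def percent_reduction(baseline: int, current: int) -> float:
    """Percent drop from a baseline count to the current count.

    A negative result means the count went up instead of down.
    """
    if baseline <= 0:
        raise ValueError("baseline must be a positive count")
    return (baseline - current) / baseline * 100.0


def key_result_progress(baseline: int, current: int, target_percent: float) -> float:
    """Progress toward the key result, clamped to the range 0.0 to 1.0."""
    if target_percent <= 0:
        raise ValueError("target_percent must be positive")
    achieved = percent_reduction(baseline, current)
    return max(0.0, min(1.0, achieved / target_percent))
```

For instance, going from 40 production incidents in the baseline quarter to 30 in the current one is a 25 percent reduction, which is half of the way toward a hypothetical 50 percent target.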
Okay? Now this is a very, very important tool
that you can use to design your community. This is the minimal
viable community, the MVC. And as you can
see here, it is a canvas that helps
you to design your first approach to an
SRE community. Actually, you can use this for any kind of
community, but in this case we used it for the SRE community.
So now we are going to check
topic by topic and we are going to
share our experience about that. Okay? Number one is
the purpose. In our case, our purpose was bringing together
experts and enthusiasts
to share knowledge, skills, and experiences related to
improving the reliability and performance of digital services
and to build a doers culture. That is quite important,
right? Because more than bringing people
to work together, to share knowledge, to help each other,
we also want to make builders, right.
We want to build doers, doers who, in
the end, are the ones who create impact through experiments,
through trying new stuff, and who deal with
real problems in production or in business.
So that was our purpose for our community.
Number two, the audience. Well, basically the audience of the community was
our SRE engineers, developers, DevOps engineers,
operations engineers, and so forth, right? All the people involved in end-to-end
software development, product development,
they were our public, our team members in the
community. Number three, the values. We
promoted the values of sharing knowledge, experimentation, collaboration, and being outcome
based, which was quite important for the success,
for the future success of our community. Number four,
the goal, right. Well, the OKRs that
I showed before are examples of the goal.
Please be sure to align to the transformation
and business goals. It is quite important that you align your community
goals to the transformation approach that you are following in your
company and to your business goals.
Number five, which is quite important, is expectations, you know,
and basically it's about the community member
experience, right. You know, we have in the market developer
experience, sorry about that.
That was Alexa. So we have in the market,
sorry, we have in the market customer experience,
developer experience. And basically this topic is about community
member experience, which is a function of reality and expectation.
And that is quite important because we said before that
the community, the SRE community, is an investment.
We said before that you need to align your SRE
community OKRs or goals to your business strategy, and
you need to have an outcome-based approach inside
your community and in all the activities that
you are going to do, training sessions, workshops, and so forth.
So that is quite important as well. Your team members,
the people who are going to be part of the community, are your
clients. So you need to take care of your clients and you need to take
care of what they think about the community and
what they are expecting from the community. It's a key
topic, right. So for that, for example, at the beginning of the
community we ran these kinds of feedback loops and
we got this feedback from our SRE engineers,
with our former team members. Very interesting, right.
As you can see here, basically people are saying that
they don't want more PPTs or
more talks from the community. They want
real experiences, hands-on approaches, and
also they wanted to know more about real
failures and real victories or success stories in
SRE projects. That was quite important for us,
especially for designing our community.
Okay, number six, the roles. Basically, as I mentioned, we had
the CoP lead and the core team, you know.
Number seven, the rules. Basically it is about the schedule, the participation,
core team agreements, and so forth. You know,
it's about all the
topics that you need to set up with your teams in terms of the
operational functioning of your community.
Number eight, goals: how to prioritize the backlog, OKR updates,
learning initiatives, decision making, and so forth. That is quite important,
right. Number nine,
communication: basically the channels to communicate inside your
community, Slack, Teams, internal social networking,
et cetera. Okay, a very important topic
here is the metrics. Some metrics recommendations,
well, basically three. We recommend that you use
metrics of sharing and collaboration, basically how your
teams collaborate with your other teams.
Another indicator is the number of experiments, Kaizen experiments
for example. And finally the outcomes, which are quite important:
the quality, the speed, the savings, the reliability that
you are reaching because of your community and
its operation. How do we make a CoP
last longer and be more engaged? Yeah, I think that is a
good topic because we noticed that in the previous
CoP attempts, the CoP at the
beginning was strong, but after some iterations it disappeared.
So we wanted to change that. And basically, for that,
to make longer-lasting communities,
we applied this, the minimum enjoyable game. So we
applied gamification, right? We combined some gamification approaches with
Lean Startup approaches to design the most
valuable and simple games inside the
community to foster collaboration,
learning, and so forth, and to make our
team members enjoy the experience.
For that approach, I recommend you use this framework, Octalysis.
It's a game design and human-focused design framework. Very useful.
And also as part of that we created this game right inside
the community. We created the Reliability League game,
which is basically a combination of game design, the Octalysis framework,
human-focused design approaches, and also Lean Startup.
This game was quite simple, right. We had the
people with strong skills in SRE, who were
the Batmans inside this game, and each Batman
had a sidekick, right? The SRE sidekick. And
those SRE sidekicks were the juniors or the developers who needed to
develop SRE capabilities and so forth. So each Batman worked together
with the sidekicks, and the Batman did whatever
he or she needed to do to create
more heroes, you know, to develop the sidekick
and move their practice, their skills, to another level. It was very
fun, you know. We had a lot of Batmans, we had a lot of sidekicks,
Robins. It was very fun to work with that.
Finally, what did we achieve? A lot of things, I think. An increased number of experiments.
That was quite important, you know, doing more experiments in the company.
Improved service availability. Of course we improved
that metric. That actually was a pain, a real
pain in our business. We improved our turnover, right?
Because with these kinds of approaches, gamification, community,
people feel different, right? With this kind of learning,
they feel motivated to share with their mates
and have fun. Through gamification, it helped us to improve the turnover
rate. And finally, the developer experience.
Yes. When we ran some feedback loops about the
NPS of the sessions of the community,
we got very good results about the experience of our developers.
So finally, some lessons learned. Rotate the CoP core
team. That is quite important. Please try to let more people have
the responsibility of facilitating different
sessions. That is quite important. You are what your community is.
Yeah, that's true. So if your community fosters
team collaboration, experimentation, and an outcome-based approach, the people
inside the community are going to get that. So please be sure about
it. Your business grows as your communities and
people grow. Yeah, that's quite important. If you
can impact your business, I'm pretty sure that your community is going
to grow, not only in people but maybe also in
budget and more resources. So yeah. And finally, CoPs
improve developer experience. Yeah, that is quite important. So
if you're facing some developers leaving
or some bad numbers in terms of developer
experience, I recommend you use CoPs and also gamification for that.
An SRE CoP can help you to build, enable, and develop SRE
enterprise capabilities as part of your business goals while
building social and technical learning spaces where people
benefit and also have fun, right.
People- and business-oriented collaboration inspires
people to become doers. And those doers
make it possible to build reliable products.
So finally, some books that I recommend. Those are really nice books that
I can recommend to you. You can search for them on the Internet.
So enjoy them. Finally,
please remember, don't forget we have dreams. So help and share more.
Sharing is caring, and maybe also have fun, continuous fun.
So that's it. I really appreciate your time. I hope you
enjoyed the talk. Thank you very much for your time, and please
reach out to me after the session and add me to your LinkedIn
accounts.