Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey everybody, thanks for coming to my talk about organizational chaos and how
recipes for service ownership can overcome those challenges.
I'm really excited to share. So let's get started. So who am I?
I'm Joey Parsons. I'm a longtime veteran in the infrastructure
and reliability space. I've led engineering and operations teams
for the last 20 years at a handful of companies you've probably
heard of, running anything from platform and backend engineering
teams to old-school operations teams. Fun fact:
I've actually carried a literal pager for on-call work. And,
more recently, running site reliability at Airbnb, I've seen how folks
organize both their people and their infrastructure for success.
These days I'm building a company called effx, where we've built a platform
to help engineering teams better operate and use microservices,
including helping them overcome the human and cognitive
load challenges that can occur around service ownership.
What I'm going to talk about today comes from
my years of experience, as well as from our customers who come to us
to solve some of these challenges. So what is service ownership?
It's a pretty broad topic, to be honest, and there's plenty of opinions in the
space that are all generally correct, and I'm pretty glad that it's actually become
a topic that people talk about. Some folks like to distill it down into
simply, you code it, you own it. But it goes far beyond that.
Ownership is more of a mindset of responsibility that gives you freedom
as well as autonomy. It's not all just on call and deploys.
So how do we define it? Service ownership means that teams are fully
autonomous and responsible for the delivery of their service
component or piece of functionality. It goes both ways.
In order to have true ownership and have the autonomy over your work,
you take on responsibility. And thankfully, the responsibility parts
of ownership come with the freedom and flexibility and autonomy that
we know every engineer craves when it comes to their work. This includes,
but is definitely not limited to, incident response. So this is
what most folks think of when they think about service ownership, but it's really just
a part of it. But ultimately, it's a pretty big part of the responsibility
side of things. Service uptime matters, so the on
call part of ownership is naturally critical. But it's not just on
call. It's managing observability, monitoring, and ensuring that
everyone has all the information they need to take action
when something dire happens. Performance and efficiency is pretty self-explanatory,
but as a service owner, you're responsible for making sure that
a service runs as fast (or as slow, I guess) as it needs to;
tuning for performance and efficiency of resources falls
upon the service owners. Building features and tracking bugs is
the most obvious of all of these, but it's critical to call out: who do those
Jira tickets get assigned to? Capacity planning and costs may not
matter in some companies, but in others they're absolutely critical.
Keeping track of your resources and using them appropriately is key in
any low-margin business. In some orgs, it's simply a measure
of dollars, but in others, this is something that can show up and funnel up
to a P&L report. Last but not least, failure injection
and chaos. Who's running game days for a service? Who's ensuring that
it's fault tolerant to the rigors of your platform,
compute, and user behavior? More often than not, it's a critical
part of an engineer's workload. So speaking of chaos,
what happens when service ownership doesn't go
right? Organizational chaos. So this is Conf42: Chaos
Engineering. Before you can even think about running failure injection or setting
up game days, you need to know who to involve in a world of
microservices. Imagine trying to run a game day where you're going to
take part of the infra down. Let's say it's a really critical database that a
lot of services need to engage with through its presentation service,
and this may have repercussions for a lot of services. If you're unable
to figure out who to coordinate with for those other services for that game day,
you end up getting stuck.
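As a minimal sketch of what that coordination lookup could be, assuming you keep a dependency map and an owner registry somewhere queryable (all service and team names here are hypothetical):

    # Walk outward from the service being broken and collect the owning
    # team of every service that could be affected by the game day.
    from collections import deque

    DEPENDENTS = {  # service -> services that depend on it (hypothetical)
        "users-db": ["users-api"],
        "users-api": ["checkout", "recommendations"],
    }
    OWNERS = {  # service -> its one owning team (hypothetical)
        "users-db": "storage",
        "users-api": "identity",
        "checkout": "payments",
        "recommendations": "ml-platform",
    }

    def teams_to_involve(target):
        teams, queue, seen = set(), deque([target]), {target}
        while queue:
            service = queue.popleft()
            teams.add(OWNERS.get(service, "UNOWNED"))  # unowned = stuck
            for dependent in DEPENDENTS.get(service, []):
                if dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
        return teams

    print(teams_to_involve("users-db"))
    # e.g. {'storage', 'identity', 'payments', 'ml-platform'}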
Frankly, beyond that, without a sense of service ownership, your engineers
are probably struggling on a day-to-day basis to get anything done in their
own worlds, much less across teams or infrastructure-wide. Chaos inevitably
ensues, engineers are left distraught that the people side of things simply
isn't set up properly, and they look to management for answers. So what
are all the things that can cause organizational chaos without service
ownership? First and foremost: unowned services. This is probably
the worst thing that can happen. Poor ownership is replaced by nonexistent ownership,
it leaves all the responsibilities of a service owner to nobody,
and no one knows who to talk to. Secondly, no easy way to contact a
service owner. Knowing who an owner is is one thing, but how do I
get in contact with them? What's their Slack channel? What's their email address?
How do I page them if there's an emergency? If it's not easy, it's not
getting done without engineers being, frankly, absolutely livid.
Poor documentation or runbooks. Part of ownership is making sure
that documentation is up to date and solid, so that not only can your team follow it,
but others can too if they ever need to operate your service. Bus factors of one
usually come hand in hand with the hero complex. I'm sure we've
all had it before, but there's usually that one engineer who knows
everything and can operate every system or service. Yet when they leave the
team, switch roles, leave the company, or, frankly,
something terrible happens to them, the rest of the team is left holding the bag.
Their knowledge needs to be dispersed to a team or teams, not only to prevent their
own burnout, but also to share the load and responsibility.
Make this a priority to end if you can. Broken
or missing on-call rotations: I bet we all have hilarious
on-call rotation stories. I remember getting paged for a startup I
hadn't worked for in over two years because something esoteric broke. Yeah,
without service ownership maintained, this sort of funny thing
is bound to happen. There's plenty more where all that came from, and I'd love
to hear your stories, but I want to run through some tenets, or even recipes,
for how to ensure solid service ownership at your company.
First thing. No unowned services ever.
None. Everything needs to have an owner. How many of you have lived that
story where there's this critical service that everybody knows exists,
but folks have played hot potato with it for years, to where when there's a
security issue, nobody quite knows who to reach out to? That's an
unowned service, and probably the worst thing you can have for your culture.
It's important to define an owner for each service and keep it up to date
and maintained in a place that's easy for folks to consume and digest.
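A minimal sketch of what such a record could look like, with hypothetical field names; note that a single required owner field also enforces the one-team rule coming up next:

    # Each service gets exactly one owning team; an owner of None is the
    # hot-potato case and should fail CI or a nightly audit.
    SERVICE_CATALOG = [
        {"name": "users-api", "owner": "identity",
         "slack": "#team-identity", "email": "identity@example.com"},
        {"name": "legacy-billing", "owner": None},  # the hot potato
    ]

    def unowned_services(catalog):
        """This list should always be empty."""
        return [s["name"] for s in catalog if not s.get("owner")]

    assert unowned_services(SERVICE_CATALOG) == ["legacy-billing"]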
Which brings me to my next point, which may be a little controversial.
One service for one team. A service cannot
be owned by two teams. When two teams own it, nobody does.
You need a single, well-defined team to contact.
Imagine being in a situation where something dire is happening, security issue,
incident response, something like that, and you're unsure of which team to reach
out to. That's a decision point that shouldn't exist. Sure, you could ping
both teams, but then you're likely to get someone involved who doesn't need to be.
One of the tenets of service ownership is being autonomous, and you can't be
if you need to coordinate elsewhere. Keep a one-team-to-one-service ratio.
Never toss over the fence, even within a team. So the embed model
for SREs is becoming quite prevalent, especially if you're able to hire
these sorts of highly trained
folks. However, if you're just replicating the pre-DevOps model
of separating dev from ops, where they're the only ones who are dealing with
the hard parts of ownership, you're doing it wrong. Every engineer on a
team is part of the ownership, not just the SREs or the ops. Every engineer
needs to be on the on-call rotation, and the SREs should be
up-leveling everybody's ability to do it better. You should be tracking
production operational readiness constantly and always.
Constantly and always. Part of true service ownership is having quality
documentation, whether that be runbooks, API specs,
general documentation, observability dashboards, and more.
Having this information readily available makes ownership easier,
not just for you, but for the other folks who may need to operate your
system in a pinch. And don't just do it as part of a preflight
checklist before a service is launched for the first time. Do it constantly
and make sure everything is kept up to date. And when it comes
to tracking readiness, measure what matters most to you. Every
organization is different: you may care deeply about having
in-depth runbooks, while a company under regulatory compliance
may care more about something else, like having a security scan run every day.
Build a mechanism to measure what matters most to your organization,
not necessarily just taking the cookie-cutter approach from
other companies you may have heard from.
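A sketch of what such a mechanism could look like; the checks and the metadata fields are hypothetical, and the whole point is that you'd swap in your own:

    # A readiness scorecard: each org picks its own checks and thresholds.
    READINESS_CHECKS = {
        "has_runbook": lambda svc: bool(svc.get("runbook_url")),
        "has_dashboard": lambda svc: bool(svc.get("dashboard_url")),
        "oncall_defined": lambda svc: bool(svc.get("oncall_rotation")),
        "security_scan_fresh": lambda svc: svc.get("days_since_scan", 999) <= 1,
    }

    def readiness_score(svc):
        passed = sum(1 for check in READINESS_CHECKS.values() if check(svc))
        return passed / len(READINESS_CHECKS)

    svc = {"runbook_url": "https://wiki.example.com/users-api",
           "oncall_rotation": "identity-primary", "days_since_scan": 3}
    print(f"{readiness_score(svc):.0%}")  # 50% -- track it constantly, not once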
My next point: contact information should be easy and simple to find, and if nothing else,
it needs to be incredibly intuitive. For example, let's say you have
a team in your engineering org called User Growth. They're responsible for
building services and functionality to help grow your user base.
Simple enough, but what if their Slack channel is something called
#scale and their email address is just growth@?
No one's going to know how to get in contact with them. Keep things simple
and make them consistent, or else your engineers are just left floundering.
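One way to keep that intuitive, sketched with a hypothetical convention: derive every contact endpoint from the team name itself, so nothing can drift into #scale territory:

    # Slack, email, and pager handles all follow from the team name.
    def contact_info(team_name):
        slug = team_name.lower().replace(" ", "-")
        return {
            "slack": f"#team-{slug}",
            "email": f"{slug}@example.com",  # example.com is a placeholder
            "pager": f"{slug}-oncall",
        }

    print(contact_info("User Growth"))
    # {'slack': '#team-user-growth', 'email': 'user-growth@example.com',
    #  'pager': 'user-growth-oncall'}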
Speaking of keeping things simple, on-call rotations should be intimately
tied to your user directory system, whether that's Google, LDAP, Active Directory,
whatever you use. Otherwise, you end up with too many inconsistencies
between the two groups, and even funny situations where people have left the team or
the company and are still on call for your service.
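A sketch of that consistency check, with the directory lookup stubbed out since the real call depends on whether you're on Google, LDAP, or Active Directory:

    # Flag anyone still on the rotation who is no longer in the owning
    # team's directory group (i.e., left the team or the company).
    def stale_oncall(rotation, directory_group):
        return [person for person in rotation if person not in directory_group]

    rotation = ["ana", "bo", "joey"]      # pulled from your paging tool
    group = {"ana", "bo"}                 # pulled from your user directory
    print(stale_oncall(rotation, group))  # ['joey'] -> fix the rotation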
Let's talk about reorgs: the absolutely dreaded reorg. Nobody loves reorgs,
and they're usually done in quite a haste. But often, especially with microservices,
the people get reorganized, but the microservices don't. A goal for
every reorg should be to assign or reassign every
service or piece of software in existence, so that the negative ramifications
for service ownership are mitigated from the start and you don't end up
with that one person owning a service or, even worse,
unowned services where people just shrug and say, hey, I guess
I used to own that, but I switched teams, so I don't anymore.
Good luck. You never want that to happen. So make sure you reorg
all of your services too, anytime you do an organizational reorg.
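A sketch of the audit that could gate a reorg being "done", using the same hypothetical catalog shape as before:

    # After the shuffle, any service owned by a team that no longer
    # exists must be explicitly reassigned before the reorg is closed out.
    def orphaned_by_reorg(catalog, surviving_teams):
        return [s["name"] for s in catalog
                if s.get("owner") not in surviving_teams]

    catalog = [{"name": "users-api", "owner": "identity"},
               {"name": "emailer", "owner": "growth"}]
    print(orphaned_by_reorg(catalog, {"identity", "platform"}))
    # ['emailer'] -> assign it a new owner as part of the reorg itself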
Ownership is a challenge; it's definitely not easy to get right.
There's probably a jillion other recipes I could have mentioned today that would
help all of you, but hopefully this is a set of rules that you can
live by and it can be a start for you. One last thing to mention
at the end here is that the folks on your team are the most precious
part of your organization and company. Always think about how to do service
ownership and what you can do or tools you can implement to help
them do it better. You want them building products and software to
help your end users, not thinking about tracking down a person,
a service, a dashboard, or an owner, and leaving the day frustrated.
Think about the owners. Don't forget them. Thanks again for listening.
Feel free to hit me up at @joeyparsons on Twitter to chat.
Have a great day.