Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, my name
is Bart Angalau, and welcome to my presentation at
Conf42 Chaos Engineering. Let's talk about site reliability
engineering. So as I mentioned,
I'm lead SRE at bol.com, which is a large
online retailing platform in the Netherlands and Belgium,
the largest one. And we've been experimenting with
SRE for about the last two years
now, I guess. So in this presentation, I kind of
want to go into our learning so far, how we've approached it,
etc. A little bit about who we are as
a company. So as I mentioned, we're the largest online retailer
in the Netherlands and Belgium. We're part of the Ahold Delhaize
group of companies, and we're a retailing platform. So we
sell things ourselves online. But other retailers
can also sell their items via our website. And that
means that we sell pretty much anything non-food,
so that can be pets and children's stuff or furniture.
And we ship that all from both our
partners' warehouses, but also from our own warehouses. And we
have a couple hundred thousand packages that we ship each
day, with like 10 million active users.
We have a pretty involved and very interesting microservice
landscape, and we work with about 600,
700 engineers to innovate there.
And we've existed and been developing the back
end since like 1999 or something. And we've
had quite a long journey and we've been doing DevOps for quite a
while already. So, yeah, how did we get to SRE at bol.com?
So let's quickly go through the journey that we
had as a company towards SRE, and then we can go into how that
has shaped our vision on it and shaped what we're doing
now and what we've learned from all of that. So bol.com's
journey to SRE kind of started in 2009
when we had our agile transformation and we started adopting
things from scrum and the like.
But as we developed and wanted to innovate more and more,
originally we had our operations externally managed by a different
partner. And you can imagine that that doesn't really benefit the
release cycle. So in 2013, a big project
was to bring these operations in house, and we
established like a top of the line operations team that managed our
mostly, I think we had four or five huge monoliths that we ran at
this time, and they ran that in our own data center.
And then in 2014, I joined bol.com.
It's not really applicable to the SRE journey, but it's just fun to
put in there. And then around 2016, we had our DevOps
transformation. So we had a big project that we called Man
on the Moon. And we had like a catchphrase that was, you build it,
you run it, you love it. So this was where we really wanted to
push the operations as far to the innovation team
as possible. And that was quite successful,
actually, at the time. So we
provided quite a lot of tooling so that all the CI/CD could
work all the way towards production,
theoretically. I'm not sure if anyone at that time was actually doing continuous deployment,
but definitely awesome continuous integration. But also around that
time, Google published the first site reliability engineering
book. And it was at this time that we first
hosted our internal conference,
which was like the Spaces Summit. And it was at that first internal
conference that David Ferguson from Google also
dropped by. He's like, this is a thing that we've just published. You should totally
be doing that. And quite a lot of people at bol.com
agreed with him. And then in 2018,
we were also moving towards the cloud, so big
operations team for our own data center, we acknowledged that,
yeah, maybe this cloud thing is actually kind of useful,
so let's do this. And with that migration,
we also started to move towards a product driven organization.
So what we mean here is that instead of organizing our teams
and our microservices in a way that it's organized around
the technical elements, we really organized teams
around the value chains that span all the different technical elements
so that we can hopefully autonomously
innovate within those products and have a less
complex web of communication between different products.
So now you just need to talk between different products instead of talking
between all thousands of microservices. So that's the context
of how we came to SRE. And as I was saying,
from 2016 on, there were quite a few individuals in the company who were already
quite excited about this concept. So in early 2019,
we got all those together, and we started with a pilot
on how SRE can work well for us. Now,
we had several specific learnings from this pilot that I'm
going to go back into later. But more concretely,
the pilot was not so successful. Parts of it were, parts of it
were not so much. So based on the learnings from that, in 2020 we started with
a dedicated, full time SRE team, and this is also
a team I'm a member of. And it's with this step
that we've really started to get the ball rolling to
get this mindset shift that SRE can really give
an organization. And I think that is also really one of the
major added benefits of SRE over DevOps. Because where
DevOps is kind of like a strategy
for putting parts of your innovation cycle close together
in a single role, SRE really provides a set of practices
and structure for thinking about this combination of the
two practices of development and operations that really allows you
to integrate them for maximum user happiness,
I think. Because over
the last year and as part of the SRE team... Right. Yeah. So I
don't think in this conference I really need to go into what the
concept of SRE is. But just to really highlight
it quickly, I think the most important part is that you acknowledge
that reliability in any system is your number one feature,
but that we also live in a VUCA world:
volatile, uncertain, complex, ambiguous.
And that these two aspects mean that in
order to maximize your user happiness, you cannot just
say, we have to reach this reliability and that's final.
But you also cannot say we can innovate
whatever we want because all innovation will bring risk, and when
you are not there, it doesn't matter how much you've innovated. So there's
a fine balance between these two aspects that
needs to be managed. DevOps, of course,
theoretically sort of addresses that. But in my
discussions with other companies and in research
into how we can best solve this problem for bol.com, we found
that there's like two different ways to approach SRE.
So you can approach it from the development angle and you
can approach it from the operations angle. And what we
often find is that development managers
often say they want DevOps,
but then when they say that often what they actually want is
just better CI/CD, like a smoother journey from
code written to code at the user.
And what they often don't really want is things like having
to manage disaster recovery and doing postmortems,
incident handling, fire extinguishing, stuff like that.
So that's then the call from the dev side. And then from the ops side,
people are saying, no, we need to do SRE. And what they actually want
often is reproducible infrastructure, less toil, infrastructure
as code, stuff like that. And they don't really want just, like, death
by YAML and to lose their jobs. Because I
had like a management column at some point here, but it was just
all less cost, less cost, less cost.
And you often see this mismatch like, oh yeah, we now do DevOps,
so we don't need ops anymore. But that's not the way it works because operations
is a deep specialization with relevant insight
into a proper functioning microservice landscape. And also
the way you do these two things together is complex and SRE
has a specific opinion about that. And if you're
just going to be applying software engineering practices to your
operation department, then you're robbing your developers of
the opportunity to learn from production. And production
is your most direct link to your users' dreams.
So in your innovation cycle, you really need to include both
your operations and your development. And as such, I think Google's
mantra that SRE implements DevOps is
vital. So SRE really is a way to get DevOps
right. So that's also how we see it at bol.com.
And that means that for our SRE team,
our dedicated full time SRE team, we've really set the
mission that its purpose is to support our innovation
team. So it helps develop the tooling and the practices
to enable all other teams to find this
balance between reliability and innovation
in the right way. And that means that the team itself, in essence, does not
have run responsibility. This is something that
is also quite often challenged by Google, because
they literally told us, like, yeah, the consultancy model, we've seen that fail
many times. But for our scale and the way we are
working, it really seems the best fit because bringing
production, operation and development closer
has been of such benefit. So this is
the way we're approaching it and let's see what we've learned
so far. So if we go back to this slide, and I
already announced it quickly when we talked about this
pilot, that there were some learnings. So what did this
pilot entail? So there were five
software and operations engineers and they were given 40%
of their time to go forth and SRE, because
we thought SRE is awesome and it will make everything better.
And so you just do this with like two days a week, that should
be enough, right? So you can imagine with such a clear direction
and purpose, this team really enabled things,
or not. So,
yeah, it didn't achieve as much as we'd hoped,
but there was actually a different pilot running
parallel to this which achieved a lot more,
but also didn't really reach the impact that was
desired. So after this didn't go so successfully,
there was really like this idea of, yeah,
so we tried it and it's not going to work for us.
But a critical part of getting SRE right is learning
from failure. So that's also what we did here.
We looked at what we did, analyzed it and
there were like a couple concrete learnings from this.
And for us, the most relevant one was that SRE
as a part time role next to full time software engineer
doesn't really work so well because the pressures of
backlog innovation always outcompeted our
SRE work. And also while we
were doing this pilot, there were quite a few, especially managers, who were
saying, like, oh, so SREs are a new way to do operations. Then you're
really missing again that connection between innovation
and operation, but you're just approaching it from the other side.
But at the same time, in our investigations and discussions,
we did see that there were particular problems all around the organization
that we could solve with SRE, but that we just didn't have the capacity
to solve properly right now. So this specialty
that SRE is and the solutions
it offers, it was really something we did need. So based
on that, we went for the full time central
dedicated SRE team and we wrote a big plan and we sent it
to IT management and they were like, oh yeah, this is actually a pretty
kind of good idea that you got here. So then we had
the central SRE team and for 2020,
the main purpose of the central SRE team was to increase
knowledge about good DevOps practices and
SRE practices, increase observability of product
value chains and fix a problem that we hadn't solved yet of
how we do outside of office hour support for cloud
applications. So from the knowledge side,
we built some trainings and we did
presentations. And it was through these
different presentations and engagements with different teams
that we kind of had some of our first setbacks,
maybe failures, in that we in the team all knew
this subject matter intimately. So we were really familiar with
the concepts and the benefits it can bring.
So we were like, look, this is how it goes, let's go do it.
Yeah. And then we gave a presentation like
that and people were, yeah, let's go do it. Okay. And then
a month later we gave another presentation. Yeah, so this is how it works
and let's go do it. Yeah. But even with people who were in
both of those presentations and were excited by both presentations,
by the second presentation, there wasn't real progress because
in the meantime they had been doing completely other work in
a completely different context. So having this large
2,500-person organization and getting them
to completely rethink the way they think about reliability
right now, because again and again the uptime versus
availability discussion came up and that's just something that's
not trivial to grasp. So we really had to
start from scratch there every time we had a new engagement with a new team
and then just repeat and repeat and repeat. And we can't expect
other people who are not fully focused on reliability to
have the same adoption of and immersion in
the concepts beneath it as we who are full time
working on it do. So this was like a
big learning for us that we can't run ahead of the pack and
then expect the pack to catch up. We really have to
work together with them and then take them along for the ride.
And then if we work with someone else, start again at the beginning and
take them along for a ride. And then slowly you're building up a
broader knowledge base about these subjects and it'll enter
common knowledge. It's a long process and it's not comparable
to how quickly a person can learn. So that's what
we learned in that knowledge sharing aspect. And then for the
value chain observability, first we wanted to go
like, yeah, we're going to take our most critical value chains and
add SRE to that. But then we were thinking, well,
maybe wouldn't it make sense to start with a smaller value chain
so that we can have a quick example of how this
works and then we can roll it out more easily to other value chains. So we
chose the content product catalog value
chain. And this is like a selling page on our web shop. And content
just manages all product text assets and product image assets.
So that's images like these, the title,
the category, but also the main image here.
And so we thought like, okay, this is a fairly isolated system.
We can make a quick iteration here, and then we'll have a big success
and share it with everyone. So we had this look into the
whole system, and we looked for
what were the most important parts of that system, and how
can we get complete coverage. And it turns out we got 45
SLIs. Well, that's maybe a bit much. So we
abstracted it a bit and we made a down-select
of five SLIs that we wanted to monitor
to observe the value chain reliability on.
So we were really focusing on SLIs that gave us
the high-level value chain availability. But if you look
at this slide, you can see all kinds of different
systems in different places. And there were
actually many teams, there were like six different teams involved
in innovating and working on these places.
And by only having SLIs that go over the
whole stack, there was no clear owner of the individual SLIs.
So if you're in a situation like us where
there's no clear owner of the high-level product SLIs,
what you need is to also have lower-level SLIs and
then a clear relation between the two so that you can engage both the
teams on their system level, but also engage the
product managers on the product level. And then with that clear
relation between the two, you can still get the big benefits
of SLIs while also having the engagement of
the people who are really involved on the different abstraction levels. So that's
something we kind of did wrong here and learned from.
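
As a rough illustration of that relation between product-level and system-level SLIs, here is a minimal Python sketch. It is not what was built at the company in the talk; the class names, team names, and numbers are all made up for the example. The idea it shows is only that each high-level SLI carries an explicit list of lower-level SLIs, each with a named owning team, so a drop at the product level points at a team to talk to.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ComponentSLI:
    """A low-level SLI owned by one team (all names here are hypothetical)."""
    name: str
    owner_team: str
    good_events: int
    total_events: int

    def ratio(self) -> float:
        # Fraction of good events; treat "no traffic" as fully healthy.
        if self.total_events == 0:
            return 1.0
        return self.good_events / self.total_events


@dataclass
class ProductSLI:
    """A value-chain SLI that is explicitly linked to component SLIs."""
    name: str
    product_owner: str
    components: List[ComponentSLI] = field(default_factory=list)

    def weakest_component(self) -> ComponentSLI:
        # The component with the lowest ratio is the first team to engage.
        return min(self.components, key=lambda c: c.ratio())


# Hypothetical example data for a content value chain.
catalog = ProductSLI(
    name="product page shows correct content",
    product_owner="content product manager",
    components=[
        ComponentSLI("image service success rate", "team-images", 9_990, 10_000),
        ComponentSLI("text import success rate", "team-import", 9_700, 10_000),
    ],
)

worst = catalog.weakest_component()
print(worst.owner_team, round(worst.ratio(), 3))
```

The product owner stays accountable for the top-level SLI, while `weakest_component` gives the conversation a concrete starting point at team level.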
And partly also due to this mismatch
in abstraction level, we never really got to the part
where the teams were actually using these SLIs to
actively make that balance between reliability and innovation.
We never really got the error budget policy properly set up,
properly burned through our error budgets, and then properly used the error budget
policy to recover from that state.
And we really think that this is very much due to the
fact that we did not have that engagement on the low level, but also we
didn't have the right engagement on the higher level. So not so much
buy in from upper management, essentially. So what
we've really learned here is that while SLIs and SLOs, et cetera,
can give you great benefits,
it's really important to engage with the whole vertical stack of the people
that are involved in the value chain, because you need buy-in
for the error budget policy to roll it out over all the different
teams. You need ownership on a product level from product
management and tech leads, but you also need ownership on the
teams to really engage with the SLIs and use that
to basically act before there's a problem.
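
To make the error budget idea concrete, here is a small Python sketch of the arithmetic. It is an assumption-laden illustration, not the policy used by the company in the talk: the function names, the 25% warning threshold, and the three actions are invented for the example. It computes how much of the budget implied by an SLO is left, and maps that to an agreed action so that a team can act before the budget is gone.

```python
def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Return the fraction of the error budget still left (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # nothing served yet, so nothing spent
    allowed_bad = (1.0 - slo_target) * total_events  # bad events the SLO tolerates
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def policy_action(remaining: float, warn_below: float = 0.25) -> str:
    """Map remaining budget to an action; thresholds and wording are assumptions."""
    if remaining <= 0.0:
        return "freeze feature releases, work on reliability"
    if remaining < warn_below:
        return "slow down, prioritise reliability work"
    return "keep innovating"


# Hypothetical month: 99.9% SLO, one million requests, 800 of them failed.
remaining = error_budget_remaining(0.999, good_events=999_200, total_events=1_000_000)
print(f"{remaining:.0%} of the budget left:", policy_action(remaining))
```

With a 99.9% target, one million requests allow roughly 1,000 failures, so 800 failures leave about a fifth of the budget and the sketch recommends slowing down rather than waiting for an outage.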
So if we sum up all those learnings,
we are not Google, our scale is not their scale. So we
truly think it's okay to not be Google and
do it in a way that uses their experience and the
capacity that they offer, but uses it in a way that works best
for your own company. Then there is that SRE is a full
time long term investment. It really is
a different way of thinking about both reliability and
innovation. And as such, it touches all kinds of aspects
of your innovation way of working. So it really
takes time to get that mindset shift. So yeah,
that doesn't come quickly and you really need to engage with all
the levels of the organization to get that properly rolled out.
And especially since it's such a long
term process, it's really important to celebrate your successes,
otherwise you're not going to have the stamina to
ride it out to actual success. Right. So in
a nutshell, that was what we learned at bol.com
over the last one and a half years. I noticed that I forgot
to put attributions for the graphics I used
in this recording. I'll add them to the slide deck, and I'm
sure the slides will be available.
So thanks for joining my presentation.
Have lots of fun with Conf42 Chaos Engineering and
I'm looking forward to any kind of questions.