Transcript
Hey everyone, I'm Daniel, VP of Engineering at Checkly.
In my talk today, "Going from quarterly releases to weekly in a small startup company," I would like to take
you back through my experiences of being in
situations where my DevOps maturity wasn't always the way I wanted it to
be, and give you a blueprint with concrete steps
for how you can come out of those situations as a winner and take your
team or organization to high DevOps maturity.
I'm happy to say that at Checkly, where I work today,
we are already there. We are a DORA elite performer. We deploy multiple
times a day, normally without any incidents,
and we have a very fast lead time. So the
talk is mostly about my past experiences, where I
saw the immense value of synthetic monitoring as a driver to get you out of firefighting
and into high DevOps maturity. And to
be honest, this was part of why I joined Checkly: I
immediately saw the value proposition back then.
But still today I use the DORA metrics as an engineering leader
to make engineering decisions, so we can stay sharp
as an engineering organization. Before
we start, I want to emphasize this, and I think I'm
going to emphasize it more and more as
a technical leader: the DORA metrics are very relevant as a
tool and you should think about them now and then.
I was working in a startup, for example, where I decided to take my team
from manual, error-prone monthly or quarterly releases
to stable weekly deploys, in weeks rather than months.
And this is of course important, because DevOps work always
comes with opportunity cost. You will have to convince some
business-minded manager that this is where you should invest time:
you're not fixing bugs and you're not shipping features, but you're
making your organization quicker and faster. So let
me show you how I did it and increased velocity by 70%
along the way, in a startup with a roughly 40-person engineering team.
DORA metrics recap. For those of you who have never
heard of the term, or maybe need a reminder: what are the
DORA metrics? The first one is lead time for changes. This means
the amount of time it takes for code to get from idea to
production, which is very important. Then there is deployment
frequency: how often do you deploy? Good organizations
deploy multiple times per day. Next is time
to restore service: you have an incident, how long does
it take you to recover from a failure? And the last one is
change failure rate, the percentage of deployments causing a failure in production.
Why I like these metrics and why I think you should use them is
because they're not easy to game.
Normally, if you only pick one metric, it's easy to
game it and have a useless measurement. So oftentimes
I hear people only talk about deploy frequency and that alone is not
good enough because just deploying stuff quickly doesn't really
tell you anything about how good your team is. It could be that you're just
deploying version updates or Dependabot merges frequently,
but it doesn't say anything about lead time. So how long does it take
me from Jira ticket to production, from code commit to being
live? Does every deploy cause an incident
or introduce new bugs? That's also not great. You might be
shipping very quickly, but then you're introducing bugs the same
speed, which of course is nothing that you would like. So I think the
metrics encompass DevOps maturity quite
well, and they can be absolutely used as a tool to assess yourself
and know where to go.
By the way, I made a bold claim saying that
I was able to increase velocity by 70%. How do
I know? Well, in one startup that I worked at,
a program manager was so kind as to actually measure this using Jira
and GitHub metrics, and you can see that
I am being absolutely truthful here. This is what happened. Just by
implementing the following blueprint, we were able to speed up engineering
velocity significantly. And I
think you'll be able to as well, depending on where you are right now.
My question, of course: you might have been in different
companies, and I wonder, have you ever been in those retrospectives?
Right? Where you had these blockers or
impediments visible on your board:
the system is too complex and brittle, we don't know where to make
changes. Manual QA is slowing
us down. Deploys take days,
we need a Slack channel to coordinate them. We have little to no
test automation. Or: the DevOps team is constantly busy
with their own work and the dependency on them makes us slow.
And then you might see action items that might solve
one problem. But it's not quite a systemic approach that will get you
out of the mess, right? So we can rewrite the service to
make it less complex in maybe a new programming language, and it
will be faster with all the things we have learned.
It might be a nicer architecture. We might reduce
the deploy frequency, right? Like maybe let's deploy quarterly and not
monthly, because deploying is expensive. It takes three days and people
need to coordinate on it. Let's introduce a meeting
to organize cross-team dependencies,
because we have a lot of them and they need to be managed. And I
guarantee you none of these measures will get you out of your DevOps hell.
But don't despair. Use DevOps
to guide your actions. There are a couple of things you
can do and a couple of actions you can take that will improve things.
Of course, and make no mistake, this will be a tough journey and
many people will have to be convinced on the way.
I think this is something that is dear to my heart and so I
took an extra slide to talk just about this. A few
words on culture and change management in general.
Over the years, I have had to implement
change and modify processes and the way that
people worked quite often.
And I can tell you, from introducing agile
to a team that has never worked with anything like it,
to introducing DevOps maturity to a startup
under high time pressure to deliver and with
no time for DevOps: it can be challenging,
but the following patterns always work.
So if you want to introduce a new process
or a new way of working, first of all, you have to understand
why things are the way they are. You should never
judge anyone based on how things are, right? People came to decisions
and conclusions, and they made sense at the time. That doesn't mean that they're wrong.
You may know a way to improve things, but you should have the patience
to listen and to understand why people are where they are, and you
need to take them seriously, even if you think they
are doing some things very wrong. And then
the next step is that you have to explain the purpose and get
buy-in from people. They need to hear it often. Why are we
doing this? Why do we suddenly need to write tests?
Why do we suddenly have a CI? Why does it need to be automated?
Why do we need to deploy so often? I mean, it's not even done.
No changes were made. Why do we need to deploy? All of these
questions might come up and you need to be ready,
able and patient enough to discuss them and remind people often enough that
this is what you need. Even if you do all of this,
in the end some people might leave during the process; they might not like the new changes that
you're introducing. But eventually you will win over most
people's hearts by being successful.
So, so much for the general introduction.
Now let's go through the concrete steps. I have
organized them by difficulty level, so we are going from
easy to medium to hard.
Concrete steps to take: tech, easy, first.
And of course this depends on where you are now, but this is
a blueprint, right? So not everything might apply to your concrete situation.
First, convince developers to build CI pipelines
on GitHub Actions or similar. Maybe there's already a company Jenkins that
you can use. Maybe you have a GitHub account that's not being used properly.
There are usually multiple options. Or you can grab a free GitHub Actions
account if you don't have any better solution.
But this is important: you need CI.
The software needs to be built and tested, and
sometimes just running it on CI, just starting the application,
is already better than nothing, to make sure that it works.
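As a rough sketch of what that can look like, here is a minimal GitHub Actions workflow, assuming a Node.js project; the npm commands are placeholders for whatever build and test commands your project actually uses:

```yaml
# .github/workflows/ci.yml - minimal CI sketch; commands are placeholders.
name: CI
on:
  push:
    branches: [main]
  pull_request:

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci        # install dependencies
      - run: npm run build # build the software
      - run: npm test      # run the tests
      # Even a simple smoke step that just boots the application and checks
      # that it responds is already better than nothing.
```

Start small like this and grow the pipeline as the team gains confidence.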
And you will see that automation, of course, is key for cycle time. And cycle
time is important because it impacts lead time, one of
the big DORA metrics that we want to improve. And then,
of course, this is general advice: simplify, simplify,
simplify, and again, simplify.
Carelessly introduced complexity is the root of all evil.
I am very convinced of that. Reduce the number of branches that people
use, delete unused code, remove unused dependencies.
You can already start with that and you will make everything way less complex and
easier to understand.
Concrete steps to take: tech, easy, number two. Use synthetic
monitoring. It is perfect for such a scenario, to
get you out of firefighting. Why is that?
Because what it essentially is, is taking
end-to-end tests, and maybe you already even have them,
right? An end-to-end test will exercise critical
flows in your web application as your users would.
It will log into the application, as you can see on the screenshot,
and do something that is relevant: create a
document, buy something from the store,
read important data, open a list of users,
whatever it is that is important to your application. End-to-end
tests can do that, and they can do it as
synthetic users. So the moment you break something,
you will know: my application is broken. And now the cool thing is
that you can set this up in days. Building end-to-end tests
and running them on a schedule is something that doesn't take long.
It gives you amazing coverage over the whole application,
and it gives you back the focus that you need to
boost your DORA metrics and get to high DevOps maturity.
In the example screenshot, you can see that something was apparently deployed
that broke the login. You immediately have a strong signal:
oh no, the login is broken, we need to roll back.
And so synthetic monitoring for me was the vital foundation
of making the DevOps transformation work.
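To make this tangible, here is a minimal sketch of such a check as a Playwright test in TypeScript; the URL, labels and credentials are placeholder assumptions, and a tool like Checkly (or a scheduled CI job) could run a script like this against production on a schedule:

```typescript
// login.check.spec.ts - a minimal synthetic check for one critical flow.
// The URL, labels and credentials below are placeholders, not real values.
import { test, expect } from '@playwright/test';

test('a user can log in and reach the dashboard', async ({ page }) => {
  // Open the login page of the application under test.
  await page.goto('https://app.example.com/login');

  // Log in with a dedicated test user; keep real secrets in environment variables.
  await page.getByLabel('Email').fill(process.env.TEST_USER_EMAIL ?? 'user@example.com');
  await page.getByLabel('Password').fill(process.env.TEST_USER_PASSWORD ?? 'secret');
  await page.getByRole('button', { name: 'Log in' }).click();

  // The core flow works: we land on the dashboard and see our documents.
  await expect(page).toHaveURL(/dashboard/);
  await expect(page.getByRole('heading', { name: 'Documents' })).toBeVisible();
});
```

The moment a deploy breaks the login, a check like this fails and alerts you, which is exactly the safety net the rest of the blueprint relies on.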
Concrete steps to take: tech, medium. So now that
we have the synthetic monitoring in place to really
keep us safe when we make changes, because we will immediately
know when we break core flows through the application,
it's time to also start automating deployments, if that's not
already there, because manually coordinated
deploys are just not a good process. Deployment needs to be automated,
it needs to be repeatable, so you
can do it more and more often and therefore tune the second DORA
metric, which is deploy frequency. It is very important to be able to deploy easily,
because it also means you can roll back when something questionable or problematic happens.
And of course, introduce zero-downtime
deployable services if you don't already have them everywhere. Remove
state from REST APIs, remove state from services
if you can, and make them zero-downtime
deployable if needed for the business. I can definitely
see scenarios where some services don't have to be zero-downtime
deployable: maybe it's fine for them to be offline for two seconds, or maybe
to be down on weekends. This depends on the business.
But for those services that need to run all the time, like an API:
make sure you make them zero-downtime deployable if they're not
already, so you can deploy them often.
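To make that a bit more concrete, here is a small sketch in TypeScript with Express of the two ingredients that usually make a service safe to deploy at any time: no in-process state, and a graceful shutdown so a rolling deploy can drain in-flight requests. The Redis store, route names and environment variables are illustrative assumptions, not something from the talk:

```typescript
// server.ts - sketch of a stateless, zero-downtime deployable API (illustrative).
import express from 'express';
import Redis from 'ioredis';

const app = express();

// Keep state (sessions, counters, caches) in a shared store instead of process
// memory, so any instance can serve any request and instances stay disposable.
const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Health endpoint: the load balancer only sends traffic to instances that
// report healthy, which is what makes rolling deploys possible.
app.get('/healthz', (_req, res) => {
  res.status(200).send('ok');
});

app.get('/visits', async (_req, res) => {
  const visits = await redis.incr('visits'); // shared state survives a redeploy
  res.json({ visits });
});

const port = Number(process.env.PORT ?? 3000);
const server = app.listen(port);

// Graceful shutdown: on SIGTERM (sent during a rolling deploy), stop accepting
// new connections, let in-flight requests finish, then exit cleanly.
process.on('SIGTERM', () => {
  server.close(() => {
    redis.quit().finally(() => process.exit(0));
  });
});
```

With instances that look like this, replacing them one by one during a deploy does not lose any user-visible state, which is what lets you deploy as often as you like.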
And of course: simplify, simplify, simplify. Carelessly introduced
complexity is the root of all evil. Remove redundant dependencies:
maybe you have two libraries doing the same thing, or maybe the
standard library of whatever ecosystem you're using already has that feature now,
so you can remove the dependency and thereby reduce complexity.
Update versions, delete unused configurations, delete unused
feature flags. It will make it so much easier to build deployments if you
know what configuration options actually do and which ones are used.
And then: concrete steps to take, tech, hard. Simplify
again. Simplify. Merge code bases.
Why is this code split up into 15 repositories, so that you
suddenly need to do 15 PRs, 15 merges, 15 CI
runs, and update dependencies somewhere? It's just
needlessly complex. You can probably merge it into one or
maybe a few code bases. Are there any
services that might be redundant or not even used? Remove them,
right? Be bold. You can merge them into one and maybe be
a little more monolithic if you don't really need all of those microservices
because they increase load and system complexity.
Remove any DevOps over-engineering.
Maybe you don't need an EBS block storage volume mounted
into a service; maybe S3 is good enough. Or maybe you can use a
hosted database instead of running your own.
And then of course it's time to simplify the code. You can do
that now, because you have synthetic monitoring in place, and it will make
sure that you don't break anything without noticing; if you do, you'll immediately
know. And then: concrete
steps to take, people. And this is hard.
So how do you get buy-in
from people? I think this is one of the hardest things you can do,
because like I said, people in the past have made
decisions in this company, in this team, and they made sense to them,
right? Nobody takes random actions.
They are where they are because a series of experiences
has led them there. And you need to first understand this, and
then you need to come up with a strategy to get buy-in from them
and explain why we are suddenly changing everything
now. So what you want to do is, for example,
share past experiences of your own. Maybe you
have been in a company where DevOps maturity was way higher than it is now, and
you can tell from experience: what you are doing
makes sense, but we can be faster and better. And
you need to explain it. Build confidence through
renowned material like the Google DevOps blog and the Accelerate book.
And there are many more; so many companies are doing this,
and elite DORA performers are usually outperforming
the competition. And then of course, like I said,
and I can't put enough emphasis on this, constantly explain the changes,
constantly explain the purpose and the vision to everyone as often as they need
to hear it and show and lead by example.
So maybe you actually need to step in and write the first deploy
script yourself. Maybe you need to write the first CI pipeline
script yourself. Maybe you even need to write the first test yourself.
It doesn't matter, do it and show it to others.
Lead by example. And then, once they have learned
what the vision is, you give them back control.
Concrete steps to take: process, hard.
Move to trunk-based development. Have PRs
and code review, merge to main, and keep it stable
at all times. Roll back fast if needed. Fix bugs offline.
This is a very significant cultural change if you don't already have that.
And I felt it was very challenging coming
into an organization that, for example, uses Gitflow or some other model.
Suddenly they don't have five different release branches, or a
development branch where they can safely merge everything and then later
integrate a huge chunk of code changes into the main branch
while praying to God that it will still work.
Trunk-based development, for those of you who don't know, is the idea that
you always have a stable main branch, and this is the one that you also
deploy to production automatically.
And the idea now is, of course, and you can see how it
relates to the DORA metrics: when you make a change on this kind
of branch, you want to make sure the change is actually small.
First of all, someone needs to review it. Try sending
a 500-file change review request to an engineer.
They might not be very happy about this because it will take them a day
to review it. Send them a small one. Here's a file that
I need to change. Here's five lines of code. Can you
please review it? Easy to understand. And they
can review it quickly, merge it to main, it can be deployed.
And now the next good thing is, and this immediately impacts the next
DORA metrics, mean time to restore service
and change failure rate: if you deploy a small change
and something goes wrong, you'll be able to
very quickly understand what caused the issue because it was a
very small change. So there are not many things that can actually have caused it.
Whereas if you have a huge one spanning multiple modules
and projects, the diff gets larger.
And especially as teams and engineering headcount scale up,
the diff gets larger and larger the longer you wait with deploying
something, and then it becomes infinitely
more difficult to actually debug issues,
because if you have the diff created by three teams over one
month deployed to production and it breaks, good luck
figuring out what the hell went wrong.
And then: concrete steps to take,
process, hard. All the changes that you have done now:
you might not have executed all of them, but you have the synthetic
monitoring in place, which is important. It has your back, so you know
when something breaks. And now it's time to finally increase
our deployment velocity, or deploy frequency.
Nice. The question is how much should we deploy,
and how frequently? Well, that depends on where you are,
quite frankly: on your current DevOps maturity,
where you are coming from,
and what the state of the engineering team in the company is like.
I think if you're coming from a
very low DevOps maturity space, or depending on business needs,
weekly deploys are actually perfectly fine.
They're not DORA elite, right? They're not 30 times a
day or more, but they are better
than monthly or quarterly, they reduce
the diff significantly, and they come with a stable deploy process.
And of course we have to face some realities. Like I said initially,
investing in DevOps is also opportunity cost.
So there's going to be some manager or some exec above
you wondering: hey, Daniel, what the hell are you doing?
You've built stuff for two weeks, yet I have not
seen any new features or any bug fixes. And now
we need to actually show them something. And if we can now,
and this is what I was able to do, prove to
them that suddenly lead time is reduced significantly through
deployment frequency, through automation and through all these things that
we have just looked at, suddenly delivering
a bug fix takes days rather than months or
weeks or a quarter. And of
course this is not easy, right? So here it's on the hard slide for a
reason. So the first weekly deploy
that we did took us a day because we uncovered many
issues that we hadn't seen before, right? Like we thought everything was automated,
but it wasn't; there were still manual steps involved. We had to
do some things by hand and coordinate. It wasn't great, but of course we learned
a lot. There was a retrospective, we implemented a series of changes,
and the week after we reduced it to a couple of hours and eventually
even less than that.
And now, if you're still
not satisfied with your DevOps change, you could go to difficulty
mode: emotional damage. By the way, I love Steven He. He's an awesome
YouTuber, responsible for this funny meme and many cool videos.
I think emotional damage is appropriate, because the
next step that you could take is organizational change. And this is
very hard, right? All the changes before, you can do
them on a team level or maybe across multiple teams.
This, however, is an organizational change. So you will need to get buy-in
from senior management, the exec
team, someone pretty high up, and this can be
a tough one to sell. So you need a lot of patience,
and a lot of good, well-argued,
structured documents and reasons
to convince them. However,
what helps is if you've already been able to show the
first success, right? So hey, we went from quarterly to weekly. Look at that.
We can be even faster. Or sometimes
you even have someone who has an open ear for you, right?
So they also see the issues and they're looking for a solution. And now you
can come in and suggest organizational change.
And the things that I am very convinced you need for
being a DORA elite performer are, first of all,
cross-functional teams and a "you build it, you run it" mentality.
You need to have stream-aligned, cross-functional teams
who are able to deliver everything that is required
themselves. They need to have front-end engineers and back-end engineers, and they need to
have the operational knowledge to deploy their own service and to monitor it.
This is very important. Only then can they be fast.
There are no dependencies, there's no dependency meeting. You don't
need agile coaches to synchronize 15 teams.
No, they can deliver what they need themselves. You don't
want to have a standalone DevOps team
whose job is to "do DevOps", never talks
to software engineers, and just does whatever, nobody really knows what.
They sit there in a corner building Terraform stuff all day and nobody gets
it. That's a big antipattern, right? So if you have a DevOps team, then
it should be temporary. And the idea is that they
help other teams to increase DevOps maturity and
at some point they dissolve again. It's not a permanent solution.
Test automation and exploratory testing are important.
You need test automation; at some point, every
team has to do it. And your QA
team, right, must not be a manual gate for
releases, because this will reduce your DORA performance significantly
if you have to wait for someone to manually test
something and then approve a deploy. What you want is automation.
And as soon as the automation is satisfied and happy you roll it out
because you have the monitoring in place through your synthetics.
If something goes wrong, you can roll back in record time because deployment
now is automated. So there's also no need anymore for
manual QA as a release gate. What they can do instead is exploratory
testing, because of course, with test automation you don't
want to test everything. You don't want to have a synthetic monitoring setup
that has 5,000 tests and tests every click path through the application.
No, it's supposed to only test core flows. And if there
is a typo in the FAQ somewhere, this is where, for example,
manual QA can come in handy. Or if there is a broken chart
that is displayed somewhere, this is where they can be useful.
So thanks for listening. If you
are interested in nerding out about synthetic
monitoring or all things DevOps, come seek me out in the Checkly Slack community.
It's linked in the slides, or you can also ping
me on LinkedIn. And yeah, I'm happy to answer all your questions
afterwards. I hope that you were able to learn something.
Have a great day everyone.