Conf42 DevOps 2024 - Online

Going from Quarterly to Weekly Releases in a Small Startup Company

Abstract

As a technical leader, I think the DORA metrics are very relevant as a tool to guide my actions. I decided to take my team from manual, error-prone quarterly releases to stable weekly deploys in weeks rather than months. Let me show you how I did it and increased dev velocity by 70% along the way.

Summary

  • Daniel: Going from quarterly releases to weekly in a small startup company. He uses the DORA metrics as an engineering leader to make engineering decisions. He says the metrics encompass DevOps maturity quite well and can be used as a tool to assess yourself and know where to go.
  • Concrete steps to take (tech, easy) include convincing developers to build CI pipelines on GitHub Actions. Use synthetic monitoring to get out of firefighting. Simplify, simplify, simplify, and simplify again. Carelessly introduced complexity is the root of all evil.
  • The people side is hard. How do you get buy-in from people? Share past experiences of your own. Build confidence through renowned material like the Google DevOps blog. Lead by example. And then, once they have learned what the vision is, give them back control.
  • Concrete steps to take (process, hard): move to trunk-based development. Have PRs and code review, merge to main, and keep it stable at all times. Roll back fast if needed. Fix bugs offline.
  • The question is how much should we deploy and how frequently? That depends on where you are coming from. If you're coming from very low DevOps maturity, weekly deploys are actually perfectly fine. Investing in DevOps is also an opportunity cost.
  • The next step you could take is organizational change. You need stream-aligned, cross-functional teams who can deliver on their own. Test automation and exploratory testing are important. If you want to nerd out about synthetic monitoring or all things DevOps, seek him out in the Checkly Slack community.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, I'm Daniel, VP of engineering at Checkly. In my talk today, going from quarterly releases to weekly in a small startup company, I would like to take you back through my experiences of being in situations where my DevOps maturity wasn't always the way I wanted it to be, and give you a blueprint with concrete steps for how you can come out of those situations as a winner and take your team or organization to high DevOps maturity. I'm happy to say that at Checkly, where I work today, we are already there. We are a DORA elite performer. We deploy multiple times a day, normally without any incidents, and we have a very fast lead time. So the talk is mostly about my past experiences, where I saw the immense value of synthetic monitoring as a driver to get you out of firefighting and into high DevOps maturity. And to be honest, this was part of why I joined Checkly, because I immediately saw the value proposition back then. But still today I use the DORA metrics as an engineering leader to make engineering decisions, so we can stay sharp as an engineering organization. Before we start, I still think, and I'm going to emphasize this more and more, that as a technical leader the DORA metrics are very relevant as a tool, and you should think about them now and then. I was working in a startup, for example, where I decided to take my team from manual, error-prone monthly or quarterly releases to stable weekly deploys in weeks rather than months. And this is of course important because DevOps work always comes with opportunity cost. You will have to convince some business-minded manager that this is where you should invest time, where you're not fixing bugs and you're not shipping features, but you're making your organization quicker and faster. So let me show you how I did it and increased dev velocity by 70% along the way, in a startup with a roughly 40-person engineering team.
DORA metrics recap. For those of you who have never heard of the term or maybe need a reminder, what are the DORA metrics? The first one is lead time for changes. This means the amount of time it takes for code to get from idea to production, which is very important. Then there is deployment frequency: how often do you deploy? Good organizations deploy multiple times per day. Next, time to restore service: you have an incident, how long does it take you to recover from a failure? And the last one, change failure rate: the percentage of deployments causing a failure in production. Why I like these metrics, and why I think you should use them, is because they're not easy to game. Normally, if you only pick one metric, it's easy to game it and end up with a useless measurement. So oftentimes I hear people only talk about deploy frequency, and that alone is not good enough, because just deploying stuff quickly doesn't really tell you anything about how good your team is. It could be that you're just deploying version updates or Dependabot merges frequently, but it doesn't say anything about lead time: how long does it take me from Jira ticket to production, from code commit to being live? Does every deploy cause an incident or introduce new bugs? That's also not great. You might be shipping very quickly, but then you're introducing bugs at the same speed, which of course is nothing that you would like. So I think the metrics encompass DevOps maturity quite well, and they can absolutely be used as a tool to assess yourself and know where to go.
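To make the four metrics a bit more tangible, here is a minimal sketch of how you could compute them from deploy and incident records you probably already have in your tooling. This is my own illustration, not something shown in the talk; the data shapes, field names, and the choice of plain averages are assumptions (many teams use medians or percentiles instead).

```typescript
// Minimal sketch: computing the four DORA metrics from hypothetical
// deploy and incident records. Field names and data shapes are assumptions.

interface Deploy {
  commitTime: Date;       // when the change was committed
  deployTime: Date;       // when it reached production
  causedFailure: boolean; // did this deploy cause a failure in production?
}

interface Incident {
  startedAt: Date;
  resolvedAt: Date;
}

const hours = (ms: number) => ms / 36e5;

function doraMetrics(deploys: Deploy[], incidents: Incident[], periodDays: number) {
  // Lead time for changes: average time from commit to production.
  const leadTimeHours =
    deploys.reduce((sum, d) => sum + hours(d.deployTime.getTime() - d.commitTime.getTime()), 0) /
    Math.max(deploys.length, 1);

  // Deployment frequency: deploys per day over the observed period.
  const deploysPerDay = deploys.length / periodDays;

  // Time to restore service: average incident duration.
  const timeToRestoreHours =
    incidents.reduce((sum, i) => sum + hours(i.resolvedAt.getTime() - i.startedAt.getTime()), 0) /
    Math.max(incidents.length, 1);

  // Change failure rate: share of deploys that caused a failure in production.
  const changeFailureRate =
    deploys.filter((d) => d.causedFailure).length / Math.max(deploys.length, 1);

  return { leadTimeHours, deploysPerDay, timeToRestoreHours, changeFailureRate };
}
```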
By the way, I made a bold claim saying that I was able to increase velocity by 70%. How do I know? Well, in one startup that I worked at, a program manager was so kind as to actually measure this using Jira and GitHub metrics, and you can see that I am being absolutely truthful here. This is what happened: just by implementing the following blueprint, we were able to speed up engineering velocity significantly. And I think you'll be able to as well, depending on where you are right now. My question, of course, since you might have been at different companies: have you ever been in those retrospectives? Where you had these blockers or impediments visible on your board: the system is too complex and brittle, we don't know where to make changes. Manual QA is slowing us down. Deploys take days, we need a Slack channel to coordinate them. We have little to no test automation. Or: the DevOps team is constantly busy with their work and the dependency on them makes us slow. And then you might see action items that might solve one problem, but it's not quite a systemic approach that will get you out of the mess, right? So we can rewrite the service to make it less complex, in maybe a new programming language, and it will be faster with all the things we have learned, it might be a nicer architecture. We might reduce the deploy frequency, right? Like maybe let's deploy quarterly and not monthly, because deploying is expensive, it takes three days and people need to coordinate on it. Let's introduce a meeting to organize cross-team dependencies, because we have a lot of them and they need to be managed. And I guarantee you, none of these measures will get you out of your DevOps hell. But don't despair. Use the DORA metrics to guide your actions. There are a couple of things you can do and a couple of actions you can take that will improve things. Of course, and make no mistake, this will be a tough journey, and many people will have to be convinced along the way.
I think this is something that is dear to my heart, so I took an extra slide to talk just about this: a few words on culture and change management in general. Over the years I have had to implement change and modify processes and the way that people worked quite often. And I can tell you, from introducing agile to a team that has never worked with anything like it, to introducing DevOps maturity to a startup that has high time pressure to deliver things and no time for DevOps, it can be challenging, but the following patterns always work. If you want to introduce a new process or a new way of working, first of all, you have to understand why things are the way they are. You should never judge anyone based on how things are right now. People came to decisions and conclusions that made sense at the time. That doesn't mean they're wrong. You may know a way to improve things, but you should have the patience to listen and to understand why people are where they are, and you need to take them seriously, even if you think they are doing some things very wrong. And then the next step is that you have to explain the purpose and get buy-in from people. They need to hear it often. Why are we doing this? Why do we suddenly need to write tests? Why do we suddenly have a CI? Why does it need to be automated? Why do we need to deploy so often? I mean, it's not even done, no changes were made, why do we need to deploy?
All of these questions might come up, and you need to be ready, able, and patient enough to discuss them and remind people often enough that this is what you need. And if you do this, in the end some people might leave during the process, they might not like the new changes that you're introducing, but eventually you will win over most people's hearts by being successful.
So much for the general introduction. Now let's get to the concrete steps. I have organized those by difficulty level, so we are going from easy to medium to hard. Concrete steps to take, tech, easy, first. And of course this depends on where you are now, but this is a blueprint, right? So not everything might apply to your concrete situation. First, convince developers to build CI pipelines on GitHub Actions or similar. Maybe there's already a company Jenkins that you can use. Maybe you have a GitHub account that's not being used properly. There are usually multiple options, or you can grab a free GitHub Actions account if you don't have any better solution. But this is important: you need CI. The software needs to be built and tested, and sometimes just running it on the CI, just starting the application, is already better than nothing to make sure that it works. And you will see that automation, of course, is key for cycle time. And cycle time is important because it impacts lead time, one of the big DORA metrics that we want to improve. And then, of course, this is general advice: simplify, simplify, simplify, and again, simplify. Carelessly introduced complexity is the root of all evil. I am very convinced of that. Reduce the number of branches that people use, delete unused code, remove unused dependencies. You can already start with that, and you will make everything way less complex and easier to understand.
Concrete steps to take, tech, easy, number two: use synthetic monitoring. It is perfect for such a scenario to get you out of firefighting. Why is that? Because what it essentially is, is taking end-to-end tests, and maybe you even already have them, right? An end-to-end test that will exercise critical flows in your web application as your users would. It will log into the application, as you can see on the screenshot, and do something that is relevant: create a document, buy something from the store, read important data, open a list of users, whatever it is that is important to your application. End-to-end tests can do that, and they can do it as synthetic users. So the moment you break it, you will know: my application is broken. And the cool thing is you can set this up in days. Building end-to-end tests and running them on a schedule is something that doesn't take long. It gives you amazing coverage over the whole application, and it gives you back the focus that you need to boost the metrics and get to high DevOps maturity. In the example screenshot, you can see something was apparently deployed that broke the login. You immediately know, you have a strong signal: oh no, the login is broken, we need to roll back. And so synthetic monitoring, for me, was the vital foundation of making the DevOps transformation work.
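As an illustration of what such a check can look like, here is a minimal Playwright test in TypeScript that exercises a login flow; run on a schedule (for example as a Checkly browser check or a CI cron job) it becomes a synthetic monitor. The URL, selectors, and environment variable names below are placeholders I made up, not the setup from the talk.

```typescript
// Minimal sketch of a synthetic check: an end-to-end login flow run on a schedule.
// URL, selectors, and env var names are placeholders.
import { test, expect } from '@playwright/test';

test('user can log in and reach the dashboard', async ({ page }) => {
  await page.goto('https://app.example.com/login');

  // Log in the way a real user would.
  await page.fill('input[name="email"]', process.env.CHECK_USER_EMAIL ?? '');
  await page.fill('input[name="password"]', process.env.CHECK_USER_PASSWORD ?? '');
  await page.click('button[type="submit"]');

  // The moment a deploy breaks this core flow, the check fails and you get alerted.
  await expect(page).toHaveURL(/dashboard/);
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
});
```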
Concrete steps to take, tech, medium. Now that we have the synthetic monitoring in place to really keep us safe when we make changes, because we will immediately know when we break core flows through the application, it's time to also start automating deployment if it's not already there, because manually coordinated deploys are just not a good process. Deployment needs to be automated, it needs to be repeatable, so you can do it more and more often and therefore tune the second DORA metric, which is deploy frequency. Being able to deploy like this is very important because it also means you can roll back when something questionable or problematic happens. And of course, introduce zero-downtime deployable services if you don't already have them everywhere. Remove state from REST APIs, remove state from services if you can, and make them zero-downtime deployable if needed for the business. I can definitely see scenarios where some servers don't have to be zero-downtime deployable; maybe it's fine for them to be offline for 2 seconds, or to be down on weekends. This depends on the business. But for those services, like an API, that need to run all the time, make sure you make them zero-downtime deployable if they're not already, so you can deploy them often. And of course, simplify, simplify, simplify. Carelessly introduced complexity is the root of all evil. Remove redundant dependencies: maybe you have two libraries doing the same thing, or maybe the standard library of whatever ecosystem you're using already has that feature now, so you can remove the dependency and thereby reduce complexity. Update versions, delete unused configurations, delete unused feature flags. It will make it so much easier to build the deployment if you know what the configuration options actually do and which ones are used.
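On the zero-downtime point above: one common building block is a service that exposes a health endpoint and shuts down gracefully, so a rolling deploy can drain traffic before replacing an instance. The sketch below is my own minimal Node.js/TypeScript illustration, not code from the talk; the port, endpoint path, and timeout are arbitrary assumptions.

```typescript
// Sketch of graceful shutdown for rolling, zero-downtime deploys:
// report health, stop taking new connections on SIGTERM, let in-flight
// requests finish, then exit. Port, path, and timeout are assumptions.
import { createServer } from 'node:http';

let shuttingDown = false;

const server = createServer((req, res) => {
  if (req.url === '/healthz') {
    // The load balancer stops routing traffic here once we start draining.
    res.writeHead(shuttingDown ? 503 : 200);
    res.end(shuttingDown ? 'draining' : 'ok');
    return;
  }
  res.writeHead(200);
  res.end('hello');
});

server.listen(3000);

process.on('SIGTERM', () => {
  shuttingDown = true;
  // Stop accepting new connections; existing requests are allowed to complete.
  server.close(() => process.exit(0));
  // Safety net: force exit if draining takes too long.
  setTimeout(() => process.exit(1), 10_000).unref();
});
```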
And then, tech, hard: simplify again. Merge code bases. Why is this code split up into 15 repositories, so you suddenly need to do 15 PRs, 15 merges, 15 CI runs, and update dependencies somewhere? It's just needlessly complex. You can probably merge it into one, or maybe a few, code bases. Are there any services that might be redundant or not even used? Remove them, right? Be bold. You can merge them into one and maybe be a little more monolithic if you don't really need all of those microservices, because they increase load and system complexity. Remove any DevOps over-engineering. Maybe you don't need an EBS volume mounted into a service, maybe S3 is good enough. Or maybe you can use a hosted database instead of running your own. And then, of course, it's time to simplify the code. You can do that now, because you have synthetic monitoring in place and it will make sure that you cannot break anything, and if you do, you'll immediately know.
And then, concrete steps to take: people. And this is hard. How do you get buy-in from people? I think this is one of the hardest things you can do because, like I said, people in the past have made decisions in this company, in this team, and those decisions made sense to them, right? Nobody takes random actions. They are where they are because a series of experiences has led them there. And you need to first understand this, and then you need to come up with a strategy to get buy-in from them and explain why we are suddenly changing everything now. So what you want to do is, for example, share past experiences of your own. Maybe you have been at a company where DevOps maturity was way higher than it is now, and you can tell from experience: what you are doing makes sense, but we can be faster and better. And you need to explain it. Build confidence through renowned material like the Google DevOps blog and the Accelerate book, and there are many more. So many companies are doing this. Elite DORA performers usually outperform the competition.
And then of course, like I said, and I can't put enough emphasis on this, constantly explain the changes, constantly explain the purpose and the vision to everyone, as often as they need to hear it, and show and lead by example. Maybe you actually need to step in and write the first deploy script yourself. Maybe you need to write the first CI pipeline yourself. Maybe you even need to write the first test yourself. It doesn't matter, do it and show it to others. Lead by example. And then, once they learn what the vision is, you give them back the control.
Concrete steps to take, process, hard: move to trunk-based development. Have PRs and code review, merge to main and keep it stable at all times. Roll back fast if needed. Fix bugs offline. This is a very significant cultural change if you don't already have that, and I felt it was very challenging coming into an organization that, for example, uses Gitflow or some other model. Suddenly they don't have five different release branches or a development branch where they can safely merge everything and then later integrate a huge chunk of code changes into the main branch and pray to God that it will still work. Trunk-based development, for those of you who don't know, is the idea that you always have a stable main branch, and this is the one that you also deploy to production automatically. And the idea now is, of course, and you can see how it relates to the DORA metrics, that when you make a change on this kind of branch, you want to make sure the change is actually small. First of all, someone needs to review it. Try sending a 500-file change review request to an engineer. They might not be very happy about this, because it will take them a day to review it. Send them a small one: here's a file that I need to change, here's five lines of code, can you please review it? Easy to understand, they can review it quickly, merge it to main, and it can be deployed. And the next good thing is, and this immediately impacts the next DORA metrics, mean time to restore, or restoring from failure, and change failure rate: if you deploy a small change and something goes wrong, you'll be able to very quickly understand what caused the issue, because it was a very small change, so there are not many things that can actually have caused it. Whereas if you have a huge one spanning multiple modules and projects, the diff gets larger, and especially as teams and headcount scale up, the diff gets larger and larger the longer you wait to deploy something, and then it becomes infinitely more difficult to actually debug issues. Because if you have the diff created by three teams over one month deployed to production and it breaks, good luck figuring out what the hell went wrong.
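One small, optional nudge toward keeping changes reviewable, purely my own illustration rather than anything prescribed in the talk, is a CI step that fails when a branch's diff against main grows too large. The branch name, the threshold, and the assumption that origin/main has been fetched in CI are all placeholders.

```typescript
// Illustrative CI helper: fail the build when the diff against main exceeds a
// threshold, nudging toward small, reviewable trunk-based changes.
// Assumes origin/main is fetched; branch name and threshold are assumptions.
import { execSync } from 'node:child_process';

const MAX_CHANGED_LINES = 400;

// Example output: " 3 files changed, 42 insertions(+), 7 deletions(-)"
const stat = execSync('git diff --shortstat origin/main...HEAD', { encoding: 'utf8' });

const insertions = Number(/(\d+) insertion/.exec(stat)?.[1] ?? 0);
const deletions = Number(/(\d+) deletion/.exec(stat)?.[1] ?? 0);
const changed = insertions + deletions;

if (changed > MAX_CHANGED_LINES) {
  console.error(`Diff touches ${changed} lines (limit ${MAX_CHANGED_LINES}); consider splitting the change.`);
  process.exit(1);
}
console.log(`Diff size OK: ${changed} changed lines.`);
```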
And then, concrete steps to take, process, hard: all the changes that you have done now, you might not have executed all of them, but you have the synthetic monitoring in place, which is important. It has your back, so you know when something breaks. And now it's time to finally increase our deployment velocity, or deploy frequency. Nice. The question is how much should we deploy and how frequently? Well, quite frankly, that depends on where you are, right? Depending on your current DevOps maturity, where you are coming from, and what the state of the engineering team in the company is like, I think if you're coming from very low DevOps maturity, or depending on business needs, weekly deploys are actually perfectly fine. They're not DORA elite, right, not 30 times a day or more, but they are better than monthly or quarterly, they reduce the diff significantly, and you get a stable deploy process. And of course we have to face some realities. Like I said initially, investing in DevOps is also an opportunity cost. So there's going to be some manager or some exec on top of you wondering, hey Daniel, what the hell are you doing? You've built stuff for two weeks, yet I have not seen any new features or bug fixes. And now we need to actually show them something. And if we can now, and this is what I was able to do, prove to them that suddenly lead time is reduced significantly, through deployment frequency, through automation, and through all these things that we have just looked at, then suddenly delivering a bug fix takes days rather than weeks, months, or a quarter. And of course this is not easy; it's on the hard slide for a reason. The first weekly deploy that we did took us a day, because we uncovered many issues that we hadn't seen before. We thought everything was automated, but it wasn't, so manual steps were still involved, we had to do some stuff, coordinate, it wasn't great. But of course we learned a lot, there was a retrospective, we implemented a series of changes, and the week after we reduced it to a couple of hours, and eventually even less than that.
And now, if you're still not satisfied with your DevOps change, you could go to difficulty mode: emotional damage. By the way, I love Steven He, he's an awesome YouTuber, responsible for this funny meme and many cool videos. I think emotional damage is appropriate, because the next step that you could take is organizational change. And this is very hard, right? All the changes before, you can do them on a team level or maybe across multiple teams. This, however, is an organizational change. So you will need to get buy-in from senior management, the exec team, someone pretty high up, and this can be a tough one to sell. You need a lot of patience and a lot of good, well-argued, structured documents and reasons to convince them. However, what helps is if you've already been able to show the first success, right? So hey, we went from quarterly to weekly, look at that, we can be even faster. Or sometimes you even have someone who has an open ear for you: they also see the issues and they're looking for a solution, and now you can come in and suggest organizational change. And the things that I am very convinced you need to be a DORA elite performer are, first of all, cross-functional teams and a you-build-it-you-run-it mentality. You need to have stream-aligned, cross-functional teams who are able to deliver everything that is required themselves. They need to have front-end engineers, back-end engineers, and the operational knowledge to deploy their own service and to monitor it. This is very important. Only then can they be fast: there are no dependencies, there's no dependency meeting, you don't need agile coaches to synchronize 15 teams. No, they can deliver what they need themselves. You don't want a standalone DevOps team whose job is to "build DevOps", that never talks to software engineers and just does whatever, nobody really knows what. They sit there in a corner building Terraform stuff all day and nobody gets it. That's a big antipattern, right? So if you have a DevOps team, then it should be temporary.
The idea is that they help other teams to increase DevOps maturity, and at some point they dissolve again. It's not a permanent solution. Test automation and exploratory testing are important. You need test automation; at some point every team has to do it. And your QA team, right, must not be a manual gate for releases, because this will reduce your DORA performance significantly if you have to wait for someone to manually test something and then approve a deploy. What you want is automation, and as soon as the automation is satisfied and happy, you roll it out, because you have the monitoring in place through your synthetics. If something goes wrong, you can roll back in record time, because deployment is now automated. So there's also no need anymore for manual QA as a release gate. What they can do instead is exploratory testing, because of course with test automation you don't want to test everything. You don't want a synthetic monitoring setup that has 5,000 tests and covers every click path through the application. No, it's supposed to only test core flows. And if there is a typo in the FAQ somewhere, this is where, for example, manual QA can come in handy. Or if there is a broken chart displayed somewhere, this is where they can be useful.
So thanks for listening. If you are interested in nerding out about synthetic monitoring or all things DevOps, come seek me out in the Checkly Slack community. It's linked in the slides, or you can also ping me on LinkedIn. And yeah, I'm happy to answer all your questions afterwards. I hope that you could learn something. Have a great day, everyone.
...

Daniel Paulus

Director of Engineering @ Checkly



