Conf42 DevOps 2023 - Online

Baking in Reliability


Abstract

If you were tasked with implementing “reliability engineering”, how would you approach it? The last few years saw a lot of interest in Site Reliability Engineering (SRE). Ever since Google published the first SRE book (https://sre.google/books/), there has been an “explosion” of interest in the subject. Many saw it as an approach to implementing DevOps. Others said it was an evolution of DevOps. Regardless, SRE is here to stay. Moreover, reliability is here to stay. But if you were tasked with implementing “reliability engineering”, how would you approach it? What steps would you take to achieve that audacious goal, both cultural and technical? That’s precisely what we’ll explore in this talk. We’ll cover practical advice on how to get started and also on how to keep the ball rolling. We’ll discuss culture and tech and how they have to work together if you wish to prioritize reliability engineering.

Summary

  • My talk at Conf42 DevOps 2023 is about baking in reliability. We're going to cover how you can get started and keep the ball rolling. At the end we're going to do a quick recap of what we learned today.
  • In 2016, Google came out with this book about site reliability engineering. It described the way that Google did operations. Many organizations still struggle to implement DevOps. It's sometimes hard to make that transition from the way that Google does things to our reality.
  • The first thing that I do is ask this set of questions. I want to understand people's expectations and what reliability means to them. By having that sense, we can approach the way that we implement reliability engineering in different ways.
  • Don't be afraid to ask questions, no matter how stupid they might seem. The second tip is to expose yourself. You want to gain knowledge from the experience of others, but you also want to build empathy. These tips will help you progress a lot when you're starting your practice.
  • Another important aspect when starting a reliability engineering practice is to make yourself available. Also important is to communicate extensively. Everyone should contribute to the reliability efforts in the company. Finding your niche is especially important when you're starting out.
  • It's very helpful to have a reliability framework. It will create a shared language to talk about reliability between teams. As with any practice, having executive support can be very beneficial. Investing in observability is also very important for reliability engineering.
  • We talked about why there's a sudden interest in reliability engineering and SRE in particular. And then we went through some useful tips, things that I usually do on a daily basis to help me do this reliability work regularly. Each of these tips could be a presentation on its own.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Welcome to my talk at Conf42 DevOps 2023 about baking in reliability. Let me start by thanking the organization for accepting my talk and thanking all of you for attending. So let's get to it.

So what are we going to cover today? We're basically going to cover how you can start, and keep the ball rolling, on baking in reliability within your organization. We'll start by giving a little bit of context. I'll describe the first thing I do when I approach this kind of thing, and then we'll go through some very useful tips that we can use to get that ball rolling. And then at the end we're going to do a quick recap of what we learned today.

So how did this all start, all this commotion about reliability engineering and SRE? It's mostly due to one book in particular. In 2016, Google came out with this book about site reliability engineering, and it described the way that Google did operations. This was very interesting because we had been dealing with the DevOps movement for a few years by then. And although the DevOps movement started to take shape around 2007, 2008 and the term was coined around 2009, many organizations still struggle to implement DevOps. That's because DevOps is a culture; it's not a set of practical things. From the start, it can be quite hard for some companies to actually understand what they need to do or what they could do to implement DevOps. So it wasn't any surprise that when Google came out with the site reliability engineering book, which describes both an engineering practice and a position, companies started to look at this as a new approach to doing things. On top of that, it's a book that describes how Google does things, right? Google is one of the most successful companies in the entire world, and it's very appealing for companies to just look at what these types of companies do and start to implement the same things. The problem is that this book covers a lot of stuff.
So it can be quite hard to really understand what's going on, and it can be quite hard to adapt that reality to our own reality. And that's the other problem: Google has a set of challenges that most companies won't have, and the way that they have to deal with them might be different. So it's sometimes hard to make that transition from the way that Google or any other big company does things to our reality. So we need to have a set of things that we can do to actually approach it.

So how do I first start when people want to talk about reliability or implement reliability? The first thing that I do is ask this set of questions. I approach people and ask: what does reliability mean to you, and what does reliability mean to the organization? So what am I trying to achieve with these couple of questions? First of all, I want to understand people's expectations and what reliability means to them. And simultaneously I want to find out if there is a common language to talk about reliability across the organization, and if there is a mismatch between what people think reliability is and what the organization wants reliability to be. By having that sense, we can approach the way that we implement reliability engineering in different ways. Of course you need to ask these questions to many, many people so that you can figure out whether different clusters of people within your organization have a different sense of what that means. And then that can help you shape the way that you're going to implement a reliability culture.

So let's get to some useful tips on how to approach this baking in reliability thing. The format that I'm going to use for these tips is that I'm going to describe why the tip is important, how we can implement it, and then, in the "what" section, give actually useful tips and practical things that you can do on a daily basis.
So the first one might not come as a surprise, and it will be common to implementing whatever practice you want: it's the concept of just asking questions. So what are we trying to do with this asking questions tip? We want to understand the pain points that the company or the engineers or whoever works in the company have. We also want to understand the current perceptions that people have about reliability, or even just why things are the way they are. And at the same time you want to understand where you can contribute, where you can put some effort to actually improve people's lives. So how are we going to do this? We want to increase our knowledge based on the experience of the people that are already working in the company. So, very important, what can we do? We mustn't be afraid to ask questions, no matter how stupid they might seem and no matter the forum. You could be in a meeting or you could be in an incident. Don't be afraid to ask questions.

Something that I like to do when I join an organization, or a new part of the organization, is to use the Career Cold Start Algorithm, which was popularized by Meta's CTO. It's essentially a set of three questions that you ask in 30-minute meetings. So you book 30-minute meetings with a few people and then you ask three questions. First, you ask: what do I need to know to do my job? And this could be for you in particular, a team that you're building, a whole cluster, or even a practice that you want to implement. Then you can ask what they think will be the difficulties that you will encounter doing that, and you list them out. And the last question is: who should I talk to next? With those three questions, when you start talking with many people, you will start to encounter common things. So you will start to understand what the pain points are, what people think reliability is, and what difficulties you're going to encounter.
And you will also start building a graph of people, understanding who does what within the company and who you should talk with.

The second tip is to expose yourself. The why for this part is fairly similar to asking questions. You will also want to understand the pain points, the current perceptions, and where you can contribute. Again, you want to gain knowledge from the experience of others, but you also want to build empathy. You want to put yourself in the shoes of other people and really feel what they feel and deal with what they deal with. So what can you do? You can participate in many events. It could be meetings, it could be incident bridges, incident reports, whatever makes sense. With time you will start to filter those out. For some of them you can expose yourself and conclude: okay, it might not be the best use of my time to be here. But in the beginning, just expose yourself, be there and try to understand how things are. One very important thing is that you should listen a lot more than you speak, and of course ask questions. You should be there in an observing capacity, trying to understand things. Just listen. Listen to what people say, observe what they are doing. When you don't understand something, just ask questions, and take a lot of notes. If you're anything like me, you will forget most of the stuff. So I just take as many notes as I can and then I can try to summarize.

As you might have figured out by now, exposing yourself and asking questions go hand in hand. You can expose yourself and ask a lot of questions, and the other way around is also true. That being said, I wanted to highlight them separately because you could ask questions without deliberately exposing yourself, and you can deliberately expose yourself and not ask questions. So these two actually go very well hand in hand.
They're two things that will help you progress a lot, especially in the beginning when you're starting your practice or you're starting in a new company, because you will gain knowledge from the ones that are already in the company.

Next, also very important, is that you want to educate yourself about what this reliability thing might mean. You want to understand the fundamental concepts around reliability engineering, and you also want to understand what other companies are doing, right? You want to understand how company X does this, or maybe Google, or maybe Meta, or even companies that are within your region. So how are you going to do this? You're going to build your own foundational knowledge. You will have to read books, watch videos, do courses, attend events. So here are a few tips in the "what" section. The first one is the Site Reliability Engineering book, the one that I described in the beginning. It's like the Bible for site reliability engineering, and there's a lot of information there that could be useful for you and for anyone who's implementing reliability engineering. The second book, a very interesting one, is Implementing Service Level Objectives by Alex Hidalgo. It talks about SLOs and how you can implement a reliability culture and have a reliability framework to actually measure and assess your reliability. The third link is a course on Coursera about reliability engineering. It's taught by a few engineers from Google and it's more practical. You'll have videos and assignments that you can do to start building that reliability engineering foundation. And of course, as important as the other ones, is just going to events. It could be conferences, it could be meetups.
It's very interesting because you can interact directly with people, see what they're doing, ask follow-up questions, and more importantly you can see what worked for them and what didn't. So it's very useful to just go to meetups: find one in your geographical area, go there and talk with people.

Something that you can do in the beginning is immediately start to attack pain points that the company has. So why would you do that? You want to alleviate issues that are hurting engineers and the business as a whole. And you also want to create space for project work. So you want to alleviate some pain points and reduce some toil so that you can actually focus on long-term, sustainable projects. And of course you want to make yourself valuable. From the get-go you want to increase your value to the company, so it's good that you help attack pain points. So how do you do this? You address your organization's pains. What can you do in practice? You have to make a deliberate effort to identify recurring issues. That might mean going to events or meetings just to understand what's really hurting people on a daily basis. And after you identify those, you will have to schedule work to address them. One caveat here: if you're not careful, there's a serious risk of that being the only thing that you do, and people get used to this new team or this new part of the organization just dealing with this kind of stuff. If you want to create the space for project work, this needs to be time-bounded, for example, or very contained. So you address some pain points, but you need to create the space for project work.

Another important aspect when starting a reliability engineering practice is to make yourself available. You want to build relationships with other teams, you want to reduce friction in communication, and you want to make people comfortable approaching you.
Whether it's just to discuss reliability practices or because they have some kind of trouble, people need to be at ease just coming to you and asking questions. So you want to create easy-to-use communication channels. How can you do that? You can create a communication channel, like a Slack channel, for example, where people can just ask questions and someone from the team answers them. You can also create office hours. That would be, for example, a scheduled hour every day or every week where someone from the team is just there and people can pop in. It could be physical or it could be a Zoom call. People can just pop in and ask questions. Another thing is to have, for example, a mailing list where people can send questions or raise something they want to discuss, and people from the team answer. Overall, what you want to create is an open-door policy where people feel at ease just coming to you and asking questions or discussing a topic around operations and reliability engineering.

Also important is to communicate extensively. Why would you want to do that? Everyone should contribute to the reliability efforts in the company. It doesn't make sense for one specific team on its own to do reliability work, so it should be a collaborative effort. And for something to be addressed, people need to be aware of it. You need to keep the topic of reliability fresh. You don't want people to think that reliability is just something that one team does, or that all the work has been done and everything is fine. So you want to make sure that reliability is present in people's minds. How can you do that? You can do talks, both internal and external. You can create a newsletter where you periodically send information about reliability or efforts that are being done within the company. You can create documentation that people can refer to when they are doing their work.
You can create blog posts, you can create articles that you share within the company or outside about reliability and operational work. And of course you can send periodic surveys. For example, let's say you have a reliability maturity model: you can periodically send surveys asking your colleagues how they think their operational work fits within that model, and that will keep the reliability efforts in people's minds.

Something very important is to make reliability work visible. Reliability, as we said in the previous slide, shouldn't be something obscure or something that just one team does. If people know what's going on, what people are working on, and what the efforts are going to be, they will be more comfortable with it, because it's not something obscure that just someone is doing and that will then somehow impact their work. It will also open the door for other people to collaborate. So you want to make reliability work a first-class citizen, and the first thing is to make it visible. People should be able to quickly find out what work is being done around operations or reliability and understand it. So you can track it the same way you track any other type of work. For example, if you use tools like Jira or Trello, you could use the same thing. You could create tickets, you could have epics, so you would use a similar set of tools that people already understand and are used to using. And you could use similar working models. For example, if the company is using an agile development process, be it Scrum, Kanban, or whatever it is, you could use a similar type of working model so that people can quickly understand what's being done, when it's going to be delivered, et cetera.

Finding your niche is especially important when you're starting out in a company that already has some size.
Many companies have different teams already addressing operational work, and it doesn't make sense to have competing efforts because that will just be a waste of resources. And of course, competing efforts create bad incentives and promote a bad culture. So you want to find reliability areas that are not being actively addressed and tackle those. Make sure that your work doesn't overlap too much with other teams. You want to identify and stop competing efforts as soon as possible, and you want to create a collaborative culture. So maybe you have a cloud engineering team or a DevOps team that is focusing on a particular area. It doesn't really make sense to go there and try to change all their work. It makes a lot more sense to focus on something else, for example observability, if you don't have a team focusing on that in particular. And then when some work overlaps with those teams, you will want to collaborate with them and not replace the work that they're doing.

Very, very important for any effort, and reliability engineering in particular, is to promote independence. Teams should be as autonomous as possible. Your team or your practice shouldn't be the bottleneck. You need to allow people to learn and build things on their own. You want to build tools that allow people to progress on their own and gain traction. You can do that by providing documentation, both written and video. You can build tools, you can build platforms, you can build CLIs, you can build bots, et cetera. And of course you can do training. It could be formal training, for example certification, or you can do your own internal workshops. The idea here is that teams can gain traction independently: they have the necessary tools and the necessary documentation to progress on their own, and they don't have to wait for you to do some kind of work.

Talking more specifically about reliability itself, it's very helpful to have a reliability framework. Why is that?
By having a reliability framework, you have a way to define reliability. You also have a way to measure and assess reliability. It will create a shared language to talk about reliability between teams, and it will facilitate prioritization. You can create your own or use an existing reliability framework. One that is very popular at the moment is the use of SLOs. To gain more knowledge about that, you can read about SLOs in the SRE book, you can read about them in the SRE workbook, which is the second book that Google released, with more practical implementations, and of course you can read Implementing Service Level Objectives by Alex Hidalgo, which talks extensively about reliability. The idea here is that different teams talk about reliability using the same language. This will help avoid conflict because, for example, you won't have a development team saying that this service needs to be reliable for X, Y or Z while operations teams talk about reliability in a different way. And that will also, of course, facilitate prioritization, because if you have a way to measure and assess reliability, you can invest in what makes sense. If you're within your reliability bounds, you can probably release more features for your system. But if you're below the reliability that you have defined, maybe you need to invest in more operational work.

As with any practice, having executive support can be very beneficial. Executives need to understand why reliability efforts are important and how they affect the business. Along the way, some changes could be hard and could require a push from the top, and it's also helpful to align business with engineering. You do that by engaging periodically with executives, so you need to interact with executives regularly. Very important is that you need to back your claims with data.
For example, DORA metrics are a good way to translate engineering metrics into things that the business can understand as being beneficial in terms of operational work, and you want to connect reliability engineering efforts directly to business outcomes. That way, executives will understand the impact that these engineering practices will have on the business as a whole.

And last but not least, very important for reliability engineering is to invest in observability. Why is that? You want to understand how users are interacting with your services. You want to understand how happy users are with your services. For that, you need to understand how your systems are behaving. You want to improve your mean time to detection, you want to improve your mean time to repair, and of course you want to reduce your change failure rate. You can do that by investing in observability. So you'll need to equip your system with metrics, logs, traces, stack traces, continuous profiling, etc. There's a good book, Observability Engineering, released by the team at Honeycomb, and it talks about what observability is and how you can implement observability within your systems. You can also use open source tools and standards like OpenTelemetry, and even leverage automatic instrumentation to get something off the ground very quickly and make your life a lot easier.

So before we go, let's quickly recap what we talked about. First of all, we talked about context: why there's a sudden interest in reliability engineering and SRE in particular. That started when Google released their book. A lot of people saw that as a new approach to doing operations and tried to incorporate SRE within their organizations. But because SRE can be so broad and so diverse, and some of the things that work for Google might not work within your organization, it can be quite hard to make that translation between what works at Google and what will work within your company.
So the first thing I do is ask individuals what reliability means to them and what it means within the organization. That will help set the stage for you to understand whether the company or the people have a framework in place or not, and it will help you understand if there's a mismatch between what they think reliability is and what the company thinks reliability is.

And then we went through some useful tips, things that I usually do on a daily basis to help me do this reliability work regularly. Keep in mind that most of these tips are not done once and then forgotten. These are things that can help you start but also keep the ball rolling. And most of these things can be done in parallel, so you can, for example, attack pain points and make yourself available at the same time. It makes sense for you to do these things in parallel.

And this is all from my part. I hope this talk was informative for you. Each of these tips that I mentioned could be a presentation on its own, so feel free to reach out to me. You could meet me at events or you could send me a message on social media and we can keep discussing these topics. So thank you very much and have a great conference.
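
Editor's illustration (not part of the talk): the reliability framework tip says that a way to measure reliability makes prioritization easier, releasing features while within the reliability bounds and investing in operational work when below them. A minimal Python sketch of that reasoning, using the error-budget idea from the SLO literature, could look like the following. The names `SLO`, `error_budget_remaining` and `suggest_focus`, the 99.9% target and the request counts are all assumptions made for this example, not something the speaker showed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A hypothetical availability SLO measured over a window of requests."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed

    def __post_init__(self) -> None:
        # A target of exactly 1.0 would leave no error budget at all.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def error_budget_remaining(slo: SLO, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = (1.0 - slo.target) * total
    return 1.0 - failed / allowed_failures


def suggest_focus(remaining: float) -> str:
    """Map the remaining budget to the prioritization described in the talk."""
    if remaining > 0:
        return "release features"
    return "invest in operational work"


if __name__ == "__main__":
    slo = SLO(name="checkout-availability", target=0.999)  # assumed target
    for failed in (400, 1500):  # assumed failure counts out of 1,000,000 requests
        remaining = error_budget_remaining(slo, total=1_000_000, failed=failed)
        print(f"{failed} failures: {remaining:.0%} budget left, {suggest_focus(remaining)}")
```

The point of the sketch is only that a single shared number lets development and operations teams have the same conversation; a real policy would normally use rolling windows and more than one SLO.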

Ricardo Castro

Principal SRE @ FanDuel / Blip.pt
