Conf42 Site Reliability Engineering 2023 - Online

Overcoming SRE Anti-Pattern Roadblocks: Rebranding the Operations Team

Abstract

This talk will detail the anti-pattern where traditional operations are rebranded as “SRE” but everything else stays nearly the same. The same tools, processes, and interactions persist. SRE is seen as a mandate without any real change. In practice, the only change is the team’s renaming.

Summary

  • Ricardo Castro: Today we're going to talk about SRE antipatterns, and one in particular: rebranding the operations team. We'll briefly touch on what SRE is, how it came to be, and what it looks like today. Then we'll talk about the antipattern itself and how to fix or avoid it.
  • The advent of the Internet and modern computing platforms called for a new and different engineering approach to running these systems. In 2003, facing massive growth, Google needed a unique way to manage its operations. Today, site reliability engineering aims to develop robust applications in the simplest way.
  • The QWERTY keyboard was developed to slow typists down a bit, so that the typewriter's needles didn't hit each other. Because we got so used to it, that pattern turned into the antipattern we have today. Studies have shown that, for speed, this is not the preferred layout.
  • The antipattern is when an operations team is rebranded to SRE, but little else changes. How can we fix this? First and foremost, we have to educate ourselves. It is also always a good idea to invest in observability.
  • A brief summary of what we discussed today: we saw the why of SRE, its origins, and what it looks like today. Antipatterns are a problem for our industry. And at the end, we saw that SRE can have more antipatterns. I hope this was informative for you guys.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Good morning and welcome to Conf42 Site Reliability Engineering 2023. My name is Ricardo Castro, and today we're going to talk about SRE antipatterns, and one in particular, which I call rebranding the operations team.
So what do we have on the menu? We're going to briefly touch on what SRE is, how it came to be, and what it looks like today. We're going to briefly describe what antipatterns are and the why of SRE antipatterns. Then we're going to touch on the SRE antipattern that we're discussing today, rebranding the operations team: we'll talk about the antipattern itself and how to fix or avoid it. And before I leave you, I'll tell you about some more SRE antipatterns that I usually give talks about.
So why SRE? Over the years, computer systems have evolved in many different ways. They increased in scale and complexity and gave birth to different ways of developing and managing such systems in production. The advent of the Internet and modern computing platforms called for a new and different engineering approach to running these systems.
The 80s were seminal in shaping the Internet as we know it today. 1983 marks the official birth of the Internet. Previously, computer networks didn't have a standard way to communicate. TCP/IP provided that standard and allowed computers in different networks to talk to each other. Then, in 1989 at CERN, Tim Berners-Lee defined what became known as the World Wide Web. His goal was to allow the sharing of information using the Internet. A year later, he wrote the first client and server, paving the way for the specification of URIs, HTTP and HTML. Then, building on previous successes, the late 90s saw the emergence of the Internet as we know it today, and marked the beginning of the first big Internet companies.
Some of the large Internet companies that we know today, like Amazon or Google, appeared during this period. Soon enough, they started to stretch the boundaries of what had been possible until then. They faced new challenges and started producing innovative solutions to this new set of challenges. The continued growth of the Internet quickly exposed other organizations to similar challenges. The emergence of cloud computing, whether public, private or hybrid, as the standard way to deploy and run services made it possible for everyone to manage their workloads and achieve massive scale. The traditional ways of managing these systems fell short when dealing with this new reality. And in 2009, the term DevOps was coined, with the proposition of hiring ops who think like devs, or devs who think like ops.
Software and systems architecture got more and more complex. Scale increased considerably. Traditional ways of running software started to fail: again, ops versus devs, and non-scalable engineering solutions. So in 2003, facing massive growth, Google needed a unique way to manage its operations. Having just joined the company, Benjamin Treynor Sloss, currently VP of Engineering, was tasked with running a production team to face these new challenges. Treynor Sloss describes his team's approach as “what happens when you ask a software engineer to design an operations team”. This team evolved and matured into what is known today as site reliability engineering.
So what are some of the principles of site reliability engineering? First, embracing risk. SREs assume that operations have risks, developing applications has risks, and putting them into production has risks. They define SLOs. SLOs are a framework that allows us to understand whether our users are happy or not, and allows us to define what reliability means, measure it and assess it. Another principle is eliminating toil.
Toil is work that needs to be done but doesn't offer enduring value, so SREs take it as one of their tasks to eliminate as much toil as possible. Then, of course, monitoring. Monitoring is essential for us to understand what's happening with our systems, be it to evolve them or to fix issues. Automation is closely linked with eliminating toil: it's the process by which we develop software that takes care of repetitive tasks. SREs are also very intertwined with release engineering, the process of releasing changes into production, because introducing changes to a system is seen as a risk. And of course, SREs strive for simplicity. Their goal is to provide tooling and ways to develop robust applications in the simplest way.
So where does SRE stand today? There was a big hype a few years ago about SRE, and the main culprit is the Site Reliability Engineering book, released in 2016. This book was released by Google, and it gave everyone a peek into how Google manages its operations on a daily basis. Google has released two more books on the topic: Building Secure and Reliable Systems, and a workbook. So it gave the whole world an idea of how Google develops and manages its applications in production, and gave them a blueprint to do that kind of work. The problem is that these books have so much information that it is very hard for a lot of organizations to onboard all of these practices, or even some of them. And sometimes it is even harder to see how the Google reality translates into our own reality. Because of this, antipatterns might arise.
So what is an antipattern? An antipattern is just like a pattern, except that instead of a solution, it gives something that looks superficially like a good solution, but isn't one.
I actually like the definition of antipattern by Martin Fowler, which says that an antipattern is a solution that initially looks like an attractive road lined with flowers, but further on leads you into a maze filled with monsters. It's this idea that we adopt something because it looks like it could be a good solution to our problem, but down the road it will actually be more harmful than beneficial.
I'm pretty sure you are familiar with this type of keyboard: it's the QWERTY keyboard. But do you know why we have this keyboard today? It's because of the typewriter. We might think: okay, we got so used to this layout on typewriters that it makes sense to use it for our computer keyboards. But why did we settle on this layout for typewriters? When we pushed a key on a typewriter, a needle printed a letter onto a sheet. That means that if we typed too fast, those needles could hit each other. So this layout was actually intended to slow typists down so that the needles didn't hit each other. This was a pattern back in the day, and we carried it over because we got good at using this keyboard. But studies have shown that, for speed, this is not the preferred layout. Because we got so used to it, the pattern that we used back in the day turned into the antipattern that we have today.
Translating this into the antipattern that we have here today, what I call rebranding the operations team: it's when an operations team is rebranded to SRE, but little else changes. I can generalize this to rebranding X to Y, and there are many examples across the industry.
We can think about agile or CI/CD. It's this idea that we give a new name to something, but nothing else really changes. Maybe you're using a slightly different tool, maybe you're using a slightly different approach, but on a day-to-day basis things stay pretty much the same. Unfortunately, this is the reality for many organizations. They essentially give a team a new name and adopt a new tool: maybe they start using Terraform, or configuration management, or even Kubernetes. But on a day-to-day basis the work remains the same. And that's not what SRE is all about.
So how can we fix this? First and foremost, we have to educate ourselves. We want to understand what this SRE thing is, and of course there are some good books from which we can start learning. We also want to understand how others perceive SRE, what worked for them and what didn't. Again, Google is an amazing technology company, but most of what it does will not be applicable to our own reality. So we need to educate ourselves, bring those discussions into our organizations, see what is happening in other organizations, and adapt it to our own reality.
Another way to fix or avoid this is to start with why. Why do we want SRE? Why should we adopt SRE? Once we have a reasonable understanding of what SRE is: is it fixing our problem? What is the problem that we are trying to fix? Start with the why instead of just going with a what, such as adopting a new tool or starting a new methodology just because someone else is doing it. We should try to understand why we are doing this.
Knowing what problem we're trying to fix, or what we are trying to improve, will make it a lot easier for us to onboard SRE and to start picking the things within SRE that make sense at each point in time.
Some general guidance: the Pareto principle says that 80% of the results come from 20% of the effort. This means that we can look at the why that we are trying to address and put our effort there. We're probably able to identify a few things within our organization, and then map them to SRE principles or approaches where a bit of effort will bring us 80% of the results.
Another good way to prioritize among all of the SRE things that we could onboard in our organizations is the Eisenhower matrix. The Eisenhower matrix categorizes tasks by urgency and importance. Urgent and important things are things that we just need to do. Maybe it's some kind of security fix that we need in production, so we need to get that done. If something is urgent but not important, we can probably delegate it. If it's neither important nor urgent, we probably don't need it. Where we can actually gain some time is with things that are not urgent but are important, because those things will eventually make their way into urgent and important. That is the part that we can plan out: we can plan the work around those things, and that gives us some leeway in the future.
And if you have no idea how to fix or avoid this, it's always a good idea to invest in observability. Observability is critical for SREs. SREs need to understand the states that systems have put themselves into. And of course it's good to have a reliability framework, something like SLOs.
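As an illustration of the SLO idea mentioned here (this is not something shown in the talk), a minimal Python sketch of defining, measuring and assessing an availability objective might look like the following. The names `SLO`, `sli` and `error_budget_remaining`, the 99.9% target and the request counts are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical availability SLO: the fraction of requests that must succeed."""
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def sli(good: int, total: int) -> float:
    """Service level indicator: share of good requests in the measurement window."""
    # With no traffic there is nothing to violate, so treat the window as fully good.
    return good / total if total else 1.0


def error_budget_remaining(slo: SLO, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_bad = (1.0 - slo.target) * total  # failures the SLO tolerates
    actual_bad = total - good                 # failures actually observed
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Made-up numbers: 100,000 requests in the window, 50 of them failed.
    slo = SLO(target=0.999)
    good, total = 99_950, 100_000
    print(f"SLI: {sli(good, total):.4%}")
    print(f"Error budget remaining: {error_budget_remaining(slo, good, total):.1%}")
```

With these made-up numbers about half of the error budget is left. A team could use a figure like this to decide whether to keep shipping changes or to spend time on reliability work.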
And observability will be essential for you to define, measure and assess what reliability means. The data that observability provides will then allow you to dig into issues that your systems have got themselves into, and will give you the confidence that you can evolve your systems with a lower risk of breaking them.
And this is just one of the antipatterns. There are more. The one that we just talked about is rebranding the operations team, but there are others that organizations struggle with. For example, the lack of observability: they don't have enough observability to understand what their systems got themselves into. Not assessing user satisfaction: just trying to make systems better, but not being really sure whether we are making our users happy. Poor incident management is a common antipattern as well: we just fix this fire and go on to the next, and we don't learn from issues. Another common antipattern is the SRE hero: the person or the team that assumes they will fix all the issues and be the heroes again. Another tenet of SRE is automation, so if you have poor automation, you have probably got yourself into an antipattern.
So, a brief summary of what we discussed today. We saw the why of SRE, its origins and what it looks like today. We saw what antipatterns are and why SRE antipatterns are a problem for our industry. Then we discussed rebranding the operations team: we discussed the antipattern itself and we saw how to fix and avoid it. And at the end, we saw that SRE, just like any other practice, can have more antipatterns, with a brief summary of what those could be.
And this is all from my part. I hope this was informative for you guys. I hope you have a great conference, and don't hesitate to connect with me on my socials. Let's keep this conversation going. Thank you very much and have a great day.
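The Eisenhower matrix described in the talk, which sorts tasks by urgency and importance, can be sketched in a few lines of Python. This is only an illustration under assumed names: the `Action` enum, the `eisenhower` function and the sample backlog are hypothetical and do not come from the talk.

```python
from enum import Enum


class Action(Enum):
    DO_NOW = "do it now"
    PLAN = "plan it"
    DELEGATE = "delegate it"
    DROP = "drop it"


def eisenhower(urgent: bool, important: bool) -> Action:
    """Map a task's urgency and importance to one quadrant of the matrix."""
    if urgent and important:
        return Action.DO_NOW    # e.g. a security fix needed in production
    if important:
        return Action.PLAN      # not urgent yet: schedule it before it becomes urgent
    if urgent:
        return Action.DELEGATE  # urgent but not important
    return Action.DROP          # neither urgent nor important


if __name__ == "__main__":
    # Hypothetical backlog entries: (task, urgent, important)
    backlog = [
        ("patch production security hole", True, True),
        ("define SLOs for the checkout service", False, True),
        ("answer a routine access request", True, False),
        ("rename old dashboards", False, False),
    ]
    for task, urgent, important in backlog:
        print(f"{task}: {eisenhower(urgent, important).value}")
```

The "plan it" quadrant is the one the talk singles out: important work that is not yet urgent is where a team can buy itself some leeway.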
Ricardo Castro

Founder & CTO @ FanDuel/Blip.pt
