Conf42 Site Reliability Engineering 2023 - Online

Divided by responsibilities united by the DevOps and SRE cultures!

Abstract

Short talk refresher for those who work within Enterprises and large organizations. The end goal of the talk will be to mitigate the organizational issues with the SRE and the DevOps cultural practices to foster a collaborative workplace. A collaborative environment = highly efficient systems!

Summary

  • Saurabh Bangad: Today's topic for Conf42 SRE 2023 is "Divided by responsibilities, united by the DevOps and SRE cultures". The primary best practice I'm focusing on is separation of duties. The talk is all about how we can adopt these best practices without giving up on collaboration.
  • One common way of separating duties is to let developers remain focused on developing new functionality and features. The other set of people is responsible for more stability, fewer incidents, and potentially no outages. How do we address this misalignment? One approach is DevOps style.
  • We need to have a blameless culture, a blameless philosophy where postmortems or retrospective analyses are done to fix processes and systems, not people. If people are blamed for issues that emerge from systems, it won't be a psychologically safe environment, and the organization continues to have more and more failures.
  • Bringing a culture requires a few things. A sense of ownership and accountability is required from each party. They also have to agree on change management practices. Lastly, and this is probably the most important for enterprises where reorgs are common, the working model should be revisited after every reorg.
  • Let's dive into communication and collaboration. You need to have bidirectional modes such as Google Chat spaces or Slack channels. Unnecessary approvals can creep in when you have too many teams. Informal collaboration happens when you are meeting agenda-free.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, this is Saurabh Bangad and I'm a technical account manager at Google Cloud. Today's topic for the Conf42 SRE 2023 track is divided by responsibilities, united by the DevOps and SRE cultures. Let's dive into it. So from an agenda perspective, I'll be discussing how enterprises end up in this situation, what new problems are created, and how to solve them. Let's get started.
So enterprises actually evolve with respect to all the industry-wide best practices, such as ISO 27001, which lists out a lot of best practices and recommendations that enterprises can adopt. And it all starts with good intentions. I'm just calling out a couple of examples, but there are a lot of guidelines. The primary one I'm focusing on is separation of duties, which essentially means that any business-critical decision shouldn't be made entirely by one individual or one party. It has to be divided into multiple separate duties, so that in case of any security or privacy related incident, you are essentially dividing the risk across more than one party. This is just one of the examples I'm focused on. There are other practices, such as least-privilege security policies: essentially, give people only what they need, and don't give more than that. It's a great security policy, but it also gets enterprises into a situation where they actually make one party more powerful than the other, and it slightly reduces collaboration.
So this talk is all about how we can take these best practices and still not give up on collaboration, with the help of some of the SRE and DevOps best practices. So let's dive right into it. Let's look at the previously highlighted best practice, separation of duties.
So one common way of doing it is to allow developers to remain focused on developing new functionality and features, which will actually get the organization to a better place from a business standpoint. That would essentially mean developing features which competitors have not developed yet, and making this progress at a higher velocity. Meanwhile the other set of people, the operators, should be responsible for the maintenance, reliability, and servicing of the given service. If you look at this, these two goals point in almost opposite directions. I mean, one set of people, the developers, will be focused on agility. They will be rewarded for the pace at which they bring innovation, they bring new kinds of cultures. And the other set of people is responsible for more and more stability, fewer incidents, and potentially no outages or minimal outages. And these two are not 100% aligned. One would try to have more and more velocity, while the other would prefer to move slowly, which reduces risk for the overall environment. So essentially these two incentives are not aligned. And it is very common in enterprises to divide duties among two or more sets of people.
At scale, it actually looks something like this, because the larger the organization, the further the responsibilities get divided: how do you handle pre-prod versus prod versus any other environment? So the people would differ even from that perspective. And there is not just one application that we are focused on. There will be more than one application, and these applications interact with one another, so the responsibilities will be divided again. So you can see where I'm going with this. We have people with different goals, and they don't necessarily all collaborate with one another. They haven't been given a mandate or a culture where they would necessarily collaborate.
And on top of that, we also have organizations, fairly common among enterprises and large banks, where we constantly try to bring in new practices. For example, there will be a security team which looks after all the security policies across the organization, and you may have an auditing team that again sits independently but has a role to play. So with every party that has some kind of role to play, we are talking about the whole separation of duties getting scaled, and along with that, people necessarily getting different goals. So how do we actually address this? I'll be coming back to the previous slide again, but let's see the common ways in which these kinds of models are addressed better.
So one approach is DevOps style. If you look at it holistically, we start from a concept that is relevant to a business. We do development of that concept, we have operations which keep the lights on, and the market actually drives the whole growth in a flywheel model. DevOps tries to bring those developers and operations together in more of a "you build it, you run it" fashion that gives them some kind of ownership, which is great. Also there is agile, which actually tries to bridge the gap between product development and product management. So the business side will bring the product management perspective, and development will actually be driving the product development. SRE, on the other hand, takes a slightly more detailed approach, more or less end to end, with SLOs and mutual agreements that are either formal or informal. Organizations tend to get to a better place with SRE style.
Now, circling back to the slide showing how this looks in the bigger picture: reorganizations can cause these divisions to scale even further. What does this mean? Look at the gray portion that's sitting in between, almost like a wall between the two or more parties.
It can be a gap if it's not documented, or if it's not properly acknowledged by each party that this is the responsibility of party X or party Y. So each gray area actually represents risks which are undocumented, not owned by any party, and which can trigger at any point and cause a lot of issues for the organization. To avoid this, we will be looking into some collaborative practices. Also, in an enterprise, communication gaps actually make automation go away. You can imagine two or more parties working on a single piece, where a workflow can have much more human intervention than intended, and that basically brings more and more risk. And statistics suggest that change management is mainly responsible for outages. So from a cultural standpoint, if 70% of the outages are caused by changes in a live system, this is another pointer that if changes are not coordinated well, it is almost certain that they will end up in an outage. And this happens even more at scale.
So let's look into what else we can do. Step zero is always to bring a culture where failures are not looked at as a bad thing. Firstly, we need to have a blameless culture, a blameless philosophy where postmortems or retrospective analyses are done to fix processes and systems, not people. If people are blamed for issues that emerge from systems, gaps, and processes, it won't be a psychologically safe environment. So first things first: you need to have a blameless culture where people are highly curious and are rewarded for learning, and where they are not celebrated for heroism in recovering from an outage. In the short term it works well, but in the medium and long term, celebration of heroism doesn't go anywhere. The organization continues to have more and more failures, and you just don't evolve in terms of moving from a bad system to a good one.
So let's dive into the main part, the solutioning. So firstly, culture. Bringing a culture requires a few things.
This is a text-heavy slide, so you could definitely pause and go through each point. I'm going to elaborate on each point as much as possible. So firstly, a sense of ownership and accountability is required from each party. They need to see this as a joint problem, rather than my problem, your problem, or not my problem. It has to be seen as a joint problem because even if they have different goals or sub-goals, if they succeed individually but fail as an organization, that will be a bigger failure. So a sense of ownership is the most important one.
To get there, we need to have some core metrics which are agreed on by each party. They all have to agree on some common signals where they collaborate and try to improve from, let's say, 10% to 20% to 30%, which is considered progress. That would be a good signal, and they all have to jointly influence those signals. Communication and collaboration is something that I'll be describing more on the next slide.
They also have to agree on change management practices. For example, team one has upstream and downstream dependencies, so they should be communicating and coordinating on how they manage change. Also, change management practices in pre-prod versus prod would differ, and so would the type of change: is it P1 or P2, and what kind of impact would one expect out of it? These are the basic parameters that agreeing on change management practices would cover.
Also, continuous improvement: as all the parties have to work together, they should be agreeing on continuous improvements. For example, automating a lot of things which don't require human decision making, or reducing time by eliminating some unnecessary processes. And people should be rewarded for this. This is the best way to get from a good to a better environment. Lastly, and this is probably the most important one for enterprises where reorgs are common.
After every reorg, you should be going over all the decisions and the whole working model, so that you end up in a situation where all the accountable parties agree again, even under the new leadership or under the new goals.
Let's dive into communication and collaboration. So firstly, you need to have bidirectional modes, such as Google Chat spaces or Slack channels, where these teams exist and have threaded discussions on topics. So why do I suggest bidirectional? Most often, if you're using just the ticketing system or just emails, it may not function as a collaborative environment, because the flow of information is usually one way and it's a formal way of documenting everything. While they all have their advantages, from a collaboration perspective they also have disadvantages. So they should be used for tracking purposes, but for collaboration purposes, bidirectional communication channels work best. Also, consider having cadences where you bring initiatives for the next quarter, plan them well, and execute them together.
Secondly, you should be having joint artifacts, so that documentation pages are not silos or islands where people develop their own things. Instead, they should be collaborative, jointly owned documentation pages. This also applies to roadmap items, where all the teams together plan a lot of items which would actually address future issues as well. Every environment is dynamic, and you have changes of some nature all the time, so roadmap items are definitely among the important ones.
Lastly, the tooling. If you have common tooling, you would actually be encouraging collaboration to improve the tool as well. If you have better tooling, you end up in a much more efficient environment which is operationally smoother. As previously suggested, changes should be coordinated.
So perhaps in your regular sync-ups, your regular cadences, you should be talking about what kind of changes one team is going to deploy, what they expect in terms of impact on the other teams, either downstream or upstream, and how it may affect their day-to-day operations. These things need to be defined as a common set of standards.
And lastly, going into best practices: informal learning. Informal collaboration happens when you are meeting agenda-free. This could mean one of the teams sharing their lessons about the most recent deployment and how it actually brought new kinds of changes to the environment. Or it could mean TGIFs, which are fairly common, as one can see in Google's culture, and something similar exists in many other corporate environments. So you could encourage those. Any kind of knowledge session or shadowing opportunity would allow you to actually push one team to step into the other team's shoes and learn what kind of challenges they may have. This would allow them to bond well, understand each other's constraints, and make progress together.
Finally, my last take on this: because enterprises have very different dynamics, this can actually lead to complexities which were not predicted. For example, if you have meetings that are scheduled on a monthly basis but don't result in outcomes, they actually become toil and don't necessarily bring any improvements. Unnecessary approvals can creep in when you have too many teams. In the name of collaboration, teams start adding approvals which work for the time being, but after some reorgs or the deprecation of some tooling, they actually become unnecessary. So it becomes important to assess your situation and check whether some approvals are unnecessary. Next, unaligned maintenance windows.
This one means, let's say, that if you have three teams with no change management protocol in place, you may end up with one team doing maintenance on a Monday, one team on a Wednesday, and one team on a Friday. If they have no relationship or dependency between one another, it's absolutely okay. But if they are upstream or downstream of one another, they may actually cause a much wider impact for your end users, because you end up with three maintenance windows in the same week which don't even happen on the same day. You basically may have downtime on Monday, downtime on Wednesday, and downtime on Friday, which may work for those specific teams, but it will definitely be a bad idea for your end users and your organization's reputation.
When we are encouraging collaboration and better communication, it may also create an environment where you are encouraged to build tooling. This tooling may be absolutely fancy, but it may be unnecessary and unusable from a business standpoint. So you need to assess from time to time whether you are doing the right thing. As we saw in earlier slides, there can be gray areas which pose risk, and you may have an environment where there are no incentives for reducing these gray areas. People may not actually do anything about them, and they just continue to operate even if it's an inefficient environment.
And lastly, one should have aligned metrics, as we discussed with golden signals and agreeing on a common set of metrics. In many environments it actually happens that, for example, if we go back to the previous slide where we had a set of developers and a set of operators, they had contradicting metrics, which made the environment not so great for the organization.
So that's it. That's all I had on my agenda. Thank you so much for joining me.
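The unaligned maintenance windows example from the talk (three dependent teams doing maintenance on Monday, Wednesday, and Friday) can be made concrete with a small sketch. This is only an illustration, not something shown in the talk: the team names, the dependency map, and the helper functions `dependency_chain` and `downtime_days` are all hypothetical. It counts the days on which end users of the most downstream service can see downtime.

```python
# Hypothetical data: team names, weekdays, and dependencies are illustrative only.
MAINTENANCE_DAY = {"team_a": "Mon", "team_b": "Wed", "team_c": "Fri"}

# Maps each team to the teams it depends on (its upstream teams).
UPSTREAM = {"team_a": [], "team_b": ["team_a"], "team_c": ["team_b"]}


def dependency_chain(team, upstream):
    """Return the team plus every team upstream of it, transitively."""
    seen = set()
    stack = [team]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(upstream.get(current, []))
    return seen


def downtime_days(user_facing_team, upstream, maintenance_day):
    """Days on which users of `user_facing_team` can be affected by maintenance."""
    chain = dependency_chain(user_facing_team, upstream)
    return {maintenance_day[team] for team in chain}


# Uncoordinated: every team keeps its own window.
uncoordinated = downtime_days("team_c", UPSTREAM, MAINTENANCE_DAY)

# Coordinated: the dependent teams agree on one shared window (Wednesday here).
shared_window = {team: "Wed" for team in MAINTENANCE_DAY}
aligned = downtime_days("team_c", UPSTREAM, shared_window)

print(sorted(uncoordinated))  # ['Fri', 'Mon', 'Wed']
print(sorted(aligned))        # ['Wed']
```

With independent teams (an empty dependency map) each service is only affected by its own window, which matches the point that separate windows are fine when there is no upstream or downstream relationship.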
Saurabh Bangad

Technical Account Manager @ Google
