Transcript
Hi. This session is titled Get Ready to Recover with Reliability Management. I'm Jeff Nickoloff. I'm currently working with Gremlin, but I have 20 years in industry. I've worked with some of the largest organizations on the planet, and I've been working with mission critical, business critical systems for a very, very long time. Some of those companies include Amazon, Venmo, and PayPal, and several others.
In my time working on those critical systems,
I've been involved with hundreds of incidents,
both as a person on call,
as well as someone who's responsible for communicating across an organization, running incidents, communicating with customers, digesting and translating between engineering information, technician information, and customer impact. And that's a
long way of saying I've seen a lot of different kinds of incidents,
but I think the type of incidents that bring us together, that really drive us all to be interested in a session like this, in a conference like this, are really those with the most pain. And I don't necessarily mean those with the greatest customer impact, although those are very important.
I'm talking about situations that I like to call scrambling
in the dark. Now, this might be literally in the dark, where it's the middle of the night and you're tired, but it doesn't have to be. It might also mean situations where you don't know, your customer doesn't know, your technicians don't know, no one's sure, you're having communication problems, miscommunication. You have to do just-in-time research. Scrambling in the dark might look like your technicians trading dashboards like trading cards. There's an increased desperation to the situation.
And those are the moments where the customer impact is one thing, but it's really those moments that create a bit of a crisis of conscience, or rather a crisis of confidence: for yourself, for your team itself, for your business partners, and in your engineering organization. People begin to ask the hard question of, can I rely on our systems?
And many, many uncomfortable questions fall out
of these times where we're scrambling in the dark.
And so this is a problem we need to solve as much as, and with the same urgency as, the things we typically talk about, like time to recovery or time to response, those somewhat more concrete metrics. But in my experience, more often than not, these have similar solutions. It really comes down to preparation. You need to get ready for the incident before the incident. You're going to have incidents; that is a given.
And especially for systems that undergo regular
change, high velocity change, either in the system itself, the system
design, or in the business context in which you're operating the
system, either side of the coin can
change a system in ways that are difficult to anticipate.
But I don't want to be another one of those people who stands up and says, well, you need to be better prepared, or just be better prepared. Preparation is important, but it's important not to minimize the level of effort and the expense that goes into preparation. Preparing for incidents is nontrivial.
As soon as you dive into it, you have to ask yourself,
what does it mean to be prepared? What level of readiness
do we need and what do we need to prepare for?
And the level of effort increases with the complexity
of your system. The number of individual components of the system,
the number of people on your team, the number of teams you have,
the different ways that your system might interact with other systems
or your partner systems, your upstream dependencies,
or with your customers. How many different ways does your
system interact with your customers? How many different ways can
it fail? And that's a hard story to tell when you're asking for funding to invest in programs so that you can be better prepared, so that your teams can feel better prepared, so that you can be more confident before, during, and after an incident. Because this is table stakes for getting to those places where you can begin to talk about time to recovery.
It all starts with the people and understanding the space.
And it's critical not to minimize the
level of effort it requires to be better prepared.
So when we talk about preparation, your investment really falls into two categories. You have your detective controls: identifying when your system is failing, potentially identifying what parts of your system are failing, and potentially identifying a likely resolution. That all falls into a world that we typically think of as automation. Automation looks like programs or systems that provide continuous monitoring and automated recovery. Things like resilience platforms, like Kubernetes: this is specifically what Kubernetes is intended to provide. Automated release control, automated rollback control, automated recovery. If a pod or process fails, it will automatically reconcile the desired state back to having it run.
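To make that reconciliation idea concrete, here is a minimal conceptual sketch of the desired-state control loop pattern that platforms like Kubernetes implement. It is not Kubernetes code; the hook names (get_desired_replicas, get_running_replicas, start_replica, stop_replica) are placeholders for whatever your platform actually provides.

```python
import time

def reconcile(get_desired_replicas, get_running_replicas, start_replica, stop_replica):
    """One pass of a desired-state reconciliation loop (conceptual sketch only)."""
    desired = get_desired_replicas()   # e.g., the replica count declared in a deployment spec
    running = get_running_replicas()   # what is actually alive right now
    if running < desired:
        for _ in range(desired - running):
            start_replica()            # automated recovery: replace failed processes
    elif running > desired:
        for _ in range(running - desired):
            stop_replica()             # converge back down after a rollback or scale-in

def control_loop(get_desired_replicas, get_running_replicas, start_replica, stop_replica,
                 interval_seconds=10):
    """Continuous monitoring: re-check and repair drift on a fixed cadence."""
    while True:
        reconcile(get_desired_replicas, get_running_replicas, start_replica, stop_replica)
        time.sleep(interval_seconds)
```

The point of the sketch is the shape of the investment: a declared desired state, a monitor, and an automated corrective action running continuously.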
Those types of investments in automation can be very powerful, but they can also end up being very tool or stack dependent. And this makes them a little bit more fragile, and it takes a little bit more engineering effort to pursue robust solutions over the lifetime of your team and your product and your company. They need a little bit of love, and that's okay. These are very powerful tools, but they are expensive to implement, maintain, and continuously improve.
They're not bad. They're critical. Right. The other side of this is getting your team on autopilot. And what I mean by that is bringing a high degree of consistency into your incident response: the skill set and context that your technicians bring, a consistent and logical way of following through and problem solving the incident, a fairly deterministic and consistent set of remediation options. How do we recover? Getting that on autopilot.
And when you're not on autopilot, what that looks like is incident responders and technicians with a high degree of variation in their readiness to handle the incident. Some people understand some systems more than others. Some people are more familiar with recent changes to a system than others will be. Sometimes people have different problem solving responses, different problem solving workflows. At other times, different people will be familiar with different sources of truth. This might look like dashboards.
It might look like awareness of specific alarms, or maybe non-alarming monitors that have also been set up, that are in place to try to help triage and identify a path to recovery. There is a missed opportunity if your team can't use them with consistency. So when I say getting on autopilot,
I mean bringing a high degree of readiness
to the people on your team, making sure that
they're aware, making sure that they understand the systems and how
those systems fail, making sure that they understand the tooling and everything else that is available to them,
and making sure that they get reps, making sure that they get practice,
making sure that they've seen the various kinds of
failure before they show up in
an incident.
This is an investment, and a regular investment, into the human side of things. But either way, you're going to end up investing in both of these things. The question is what to spend on each side of these things, and then also identifying what things to prepare for, which kinds of incidents to prepare for. Most systems can become quite complicated quite quickly with the number of dependencies,
the number of ways things can break,
and under which kinds of conditions different parts of the
system may break or may need different kinds of love in
order to recover.
In some cases, you might be asking yourself, what can we change about the system before an incident, to either reduce the probability of an incident or to speed recovery?
But again, coming back to the complexity
and the level of effort that goes into just preparing,
it's going to be very difficult to prepare for everything.
That's a very long tail that we'll all have
to be chasing. So the question really comes down
to not just
do we invest in automation versus autopilot,
but which types of incidents,
which types of failures should we invest in preparing
for? And that's a
nontrivial question. You can answer it trivially: some people might be more familiar with different types of failures than others, and so they'll naively lean towards the things that they are familiar with failing, or the last thing that they ended up being paged for. There's typically a strong bias for that. But if you're standing in a position where you have the opportunity to choose, there's a better way. I want to
talk about the relationship between incident management and incident response on one side, and reliability, the reliability of your systems and running a reliability program, on the other. These two things are definitely separate efforts, but there's an inherent relationship between the two.
Your reliability program: we put these things in place so that we can proactively, not retrospectively looking at what has been breaking, but proactively identify and regularly assess what incidents we are at risk for, the probability of those risks, and the severity if these things break. And we use that information to inform what incidents we should prepare for. And we use the information that comes out of managing incidents, how prepared we are to handle these incidents, how long it takes to recover, what the financial impact has been the last three times, or however many times this type of failure has happened. We use that as an input back into the reliability program so that we can prioritize what to change and how to measure. So, a reliability program.
This is a very high level, abstract idea, but in general, what this looks like is being able to enumerate the components in your system, being aware of the ways that they might fail, being aware of your dependencies, being aware of the value, usually by rate, of how valuable certain systems are, and then regularly measuring and determining what types of operational conditions those components can survive and specifically where they tip.
And then obviously there's a whole thing around identifying,
funding and staffing for engineering
improvements so that you can hit reliability goals.
But I really want to zoom in on the mechanism, the high level core mechanisms of a solid reliability program. There's a tool called failure mode and effects analysis. This is a pretty robust framework. It's rare that I've seen it implemented in the SaaS and software space in a deep way, but it's a really important system, even if you're taking only high level inspiration from it. A failure mode and effects analysis is
a robust and opinionated framework for
cataloging the components in your system,
the failure modes for each of those components,
the probability of those failures (and that term starts to get a little bit fuzzy and really dependent on your business), and the severity impact of that type of failure. For many groups, this might look like financial impact; this might look like downstream impact. If this fails, you can begin to talk about cascading failures, although failure mode and effects analysis is really not so much concerned about that; it's usually first order impacts. But if you can get to money, it helps you craft a better story later.
And these analyses also typically discuss, and present an opportunity for you to determine, whether or not you can detect the type of failure. But the big idea is you get this information, you build out a big table, it might be a spreadsheet, whatever it is, and this helps you identify all sorts of risks in your system, to really identify where the risks are.
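As a rough illustration of what that table can look like if you keep it in code instead of a spreadsheet, here is a minimal sketch using the classic FMEA convention of scoring severity, occurrence, and detectability on 1-10 scales and multiplying them into a risk priority number. The components and scores below are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class FmeaRow:
    component: str
    failure_mode: str
    severity: int     # 1 (negligible) .. 10 (catastrophic)
    occurrence: int   # 1 (rare) .. 10 (near-certain)
    detection: int    # 1 (caught immediately) .. 10 (customers find it first)

    @property
    def rpn(self) -> int:
        # Classic FMEA risk priority number: severity x occurrence x detection.
        return self.severity * self.occurrence * self.detection

# Made-up example rows; in practice you would enumerate every component and failure mode.
table = [
    FmeaRow("payment-service", "dependency timeout", severity=8, occurrence=6, detection=4),
    FmeaRow("image-resizer", "out of memory", severity=3, occurrence=7, detection=2),
]
for row in sorted(table, key=lambda r: r.rpn, reverse=True):
    print(f"{row.component:20} {row.failure_mode:22} RPN={row.rpn}")
```

Sorting by that number is one simple, consistent way to surface where the risk actually sits.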
I want to talk for a moment about failure modes and your detective controls, because this goes directly to your incident preparedness. As your reliability management program is enumerating the types of failures for each component and whether or not they can survive them, another big question for
each of those types of failure modes is, can you detect it before
your customers do, or how
quickly can you detect it?
And that's really because if you can't,
these are clearly going to be gaps in your
preparedness. If you can't detect
whether or not a failure mode has happened,
you're going to have poor response time. If you
can't detect whether or not this failure has happened,
your incident responders, when they do respond, are going to
have a more difficult time identifying the nature of the failure.
And so it would be naive for me to stand up here and say, make sure that you've got detective controls for everything. This is one of those cases where you want to look at that breakdown and ask: does this type of failure mode warrant investment into detective controls?
And it's important to be able to test your detective controls regularly. I don't mean at one point in time, but to regularly create failure conditions in whatever environment, to verify that your detective controls operate the way that they're intended.
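Here is a sketch of what that kind of regular verification might look like, assuming you already have some way to inject the failure and some way to ask your monitoring whether an alert fired; all three hooks (inject_failure, clear_failure, alert_fired) are hypothetical placeholders, not a real API.

```python
import time

def verify_detective_control(inject_failure, clear_failure, alert_fired,
                             timeout_seconds=300, poll_seconds=10):
    """Inject a known failure condition and confirm the detective control notices it.

    Returns detection latency in seconds, or None if nothing fired before the timeout,
    which would indicate a gap in preparedness for this failure mode.
    """
    started = time.time()
    inject_failure()                          # deliberately create the failure condition
    try:
        while time.time() - started < timeout_seconds:
            if alert_fired():                 # did monitoring catch it?
                return time.time() - started
            time.sleep(poll_seconds)
        return None
    finally:
        clear_failure()                       # always roll the experiment back
```

Run something like this on a schedule and the detection latency itself becomes a metric you can track over time.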
The next step, and it's not really a step, but the other part that I'd already discussed a little bit: when we bring it back to how we prioritize what to invest in, the real big question is, well, where's our biggest risk? And when I say risk here, I mean not just the probability of failure, but the probability multiplied by the severity of the failure. If you have something that is very expensive if the failure occurs, but is extremely unlikely, then it might be a lower priority to prepare for than a type of failure that happens three or four times a day, is likely to continue, and has a more mild cost associated with it.
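One minimal way to make that probability-times-severity comparison explicit, assuming you can estimate a frequency and a per-incident cost for each failure mode (the numbers below are invented):

```python
from dataclasses import dataclass

@dataclass
class Risk:
    failure_mode: str
    incidents_per_year: float   # estimated or measured frequency
    cost_per_incident: float    # severity expressed in dollars (illustrative assumption)

    @property
    def expected_annual_cost(self) -> float:
        # Risk = probability (here, frequency) multiplied by severity (here, cost).
        return self.incidents_per_year * self.cost_per_incident

# Invented numbers: a rare-but-expensive failure versus a frequent-but-cheap one.
risks = [
    Risk("primary database failover fails", incidents_per_year=0.5, cost_per_incident=250_000),
    Risk("cache node loss degrades search", incidents_per_year=1200, cost_per_incident=150),
]
for risk in sorted(risks, key=lambda r: r.expected_annual_cost, reverse=True):
    print(f"{risk.failure_mode:40} expected annual cost ${risk.expected_annual_cost:,.0f}")
```

With these invented numbers, the frequent, low-cost failure actually carries the larger expected annual cost, which is exactly the kind of comparison that should inform where you prepare first.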
But you have to do that reflection activity. You have to actually ask yourself,
how likely is something to happen? And that
typically requires some type of experimentation. You should test it.
Can this happen? Under which conditions can it happen?
And when it does, dive into the business. Look at your volumes. If it's a revenue type business, how much revenue is associated with these interactions?
If this type of failure might result
in some breach of contract,
it's important to understand the penalty for those types of violations and bring
that in. Let the business inform your
engineering decisions.
And so there are a lot of different ways to do this. It's easy to say probability and severity and talk about risk; I've seen it done a lot of different ways. And one of the concerns there is having inconsistency in your organization. If you have ten different groups in your organization and each of the ten groups is doing it slightly differently, it becomes very difficult to prioritize for your organization because you're often comparing apples to oranges.
So regardless of what happens, regardless of how you move forward: consistency in measurement, consistency in those metrics that you're using to drive prioritization decisions that dictate how you're going to spend your money in improving your preparation. Consistency is key.
And this is one of the problems that we're solving at Gremlin that I'm so passionate about: our new product, Reliability Management. Product scoring is really central to it. And at minimum, this is something that we've learned from the vast experience of building reliability programs with companies. More often than not, from my experience and other conversations I've had with people at Gremlin, these are the dimensions that our customers find great success with. And like I said, the specific scoring mechanism you use is less important than having consistency in scoring. So what we've done is we've gone ahead and built a consistent scoring mechanism on their behalf.
And so this is just an example. We do regular testing for redundancy, scalability, and surviving dependency issues. We combine those into an easy to understand score, and we help present this to customers in a way that they can understand reliability issues between different services. Now, if you were to dive in, you can see specific conditions, and you can use those types of failures to inform the types of incidents that you should be prepared for. But the real power here is being able to know and identify: what things can we survive? What things can we not survive? For those things that we can't survive, what's our impact? So however you end up implementing it, this is what I believe to be a fantastic example of what you should end up with at the end of the day.
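As a purely illustrative sketch of what that kind of roll-up could look like, here is a hypothetical composite: a pass rate per test category combined into a single 0-100 service score. This is not Gremlin's actual scoring formula; the categories mirror the ones mentioned above, and the equal weighting is an assumption.

```python
# Hypothetical composite score; not Gremlin's actual formula.
TEST_CATEGORIES = ("redundancy", "scalability", "dependencies")

def service_score(results):
    """Average pass rate across the three test categories, as a 0-100 score.

    `results` maps each category to the pass/fail outcomes of its recent test runs,
    e.g. {"redundancy": [True, True], "scalability": [True, False], "dependencies": [True]}.
    """
    per_category = []
    for category in TEST_CATEGORIES:
        outcomes = results.get(category, [])
        # An untested category counts as 0 so gaps in coverage stay visible in the score.
        per_category.append(sum(outcomes) / len(outcomes) if outcomes else 0.0)
    return 100.0 * sum(per_category) / len(per_category)

print(service_score({"redundancy": [True, True],
                     "scalability": [True, False],
                     "dependencies": [True]}))   # roughly 83.3
```

Whatever the exact formula, the value comes from applying the same one to every service so the scores are comparable.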
That's why I'm so excited about what we're building here at Gremlin.
At Gremlin, this has been our focus since the beginning, but we're really making it explicit now that our mission is to help teams standardize and automate reliability one service at a time, and to help them understand at the service level, in a consistent, repeatable way, what they can tolerate and what they can't, so that they understand how to prioritize their improvements, either in product engineering or in incident response preparedness, their incident preparedness. Thank you.
That's all I have today, but if you have any other questions, I would love
to see them in the chat. Thank you for everything.