Transcript
Hello everyone. I'm joined today by Erin McEwen and Charity Majors. Erin comes from Zendesk. She is the director of engineering resilience there. She has a lot of experience in crisis communications, leadership, and disaster recovery. She spent some time at Google and Salesforce in both the risk management and business continuity worlds. Charity Majors is the co-founder and CTO of Honeycomb, the observability platform for finding critical issues in your infrastructure.
So today we're going to be talking about one of my favorite topics, and I know one of their favorite topics as well: incident management. We all come to incident management from different parts of the spectrum. Charity has a lot of insight into before the incident happens and what you're doing during the incident.
Erin comes from a world where she is really actively communicating
with customers and also communicating with the folks in her
organization to understand what is happening in the incident and understand what
is happening afterwards. And they're both leading organizations doing
this as well. And I'll quickly intro myself. My name
is Nora. I'm the founder and CEO of Jeli.io. We are a post-incident platform, but we kind of cover the whole stack as well, from responding to the incident to diagnosing
it to understanding it afterwards.
So thank you both for being here with me today. Thanks so
much for having me. This is going to be super fun. Yeah, same. Super happy
to be joining the conversation. Awesome. So I'm going to first ask Erin
a question. I want to understand a little bit more about your leadership
style when an incident happens at Zendesk today. Yeah,
sure. So I think what's interesting is I'll give a little context on the parts of how we operate in an incident at Zendesk. I have a really amazing global incident management team, and we also have different roles that are constantly in an incident: my incident managers, we have an eng lead on call, we also have, you know, our advocacy customer-facing communication role, as well as our ZNOC, which is all about the initial triage, and then all of our engineers that get pulled in. Wait, did you say ZNOC? Yeah. So Zendesk NOC. I love that. So my first ever management experience was at Linden Lab, where I founded the DNOC, the distributed NOC.
So I was super tickled. I don't think I've ever heard of another company doing that. I actually managed them for a little bit of time, but they've moved over to different leadership. But we stay very close. We're very close with our ZNOC buddies. But yeah, I think
what's interesting is during an incident, especially when I'm getting involved,
because a lot of our incidents, those are managed at the
incident manager level. We have our eng leads on call
that come in. If I'm there, it's a little bit of a different situation
because that means that it's probably a pretty big one or something that has
high visibility. And so a lot of what I'm there to do is to make
sure that I'm supporting the team that's responding and keeping them
in scoping those things. Right. But I really serve as a
conduit to our leadership. It's really about making sure that boots on the ground are able to do exactly what they need to do.
The executives that need to be informed are informed and pulled in at
the right increments, in the right timing. So getting that alignment is really
important. And I think that having
that calm demeanor and making sure that you're managing up
in these situations so that you can protect the folks that really
need to be paying attention to what they're doing right in front
of them is really important during an incident. Totally.
The emotional tenor really gets set by whoever is in charge
of the incident. If you're just, like, flailing around and just, like, reacting to everything, everybody in the company is going to freak out.
And if you can project calmness, but intensity.
Right. Like, you don't want people to just be like, oh, nothing's wrong. You want
them to be aware that something's happening, that it's not normal,
but that it's under control.
I think part of the key thing is that people need to
feel like they can be confident that they will be pulled in if needed,
so that the entire company doesn't just have to hover and look,
am I needed there? Am I needed there? Do I need to respond here?
No. In order for everyone to be able to stop hovering, everyone has to
have confidence they will be pulled in if they are needed or if they can
help. Yeah. And I think one of the things that I will say, too,
is that part of being able
to come in and have that calmness also is part of
having a really strong program around incident management, having good structure in how your tooling operates, and making sure that your folks are educated on their roles and responsibilities.
We also have crisis management that is the next level above that. So I
think that at Zendesk, I feel very
fortunate because everybody jumps into an incident,
they're ready to go. They are happy to help out. They want to know what
they can do. They're contributing in the space that they can.
And I think that's a very important piece
of it makes it a lot easier to be a leader in those situations
when you have folks like that that are jumping in and really wanting to just
collaborate and figure out what's going on.
Totally agree.
It sounds like the way you handle it is a lot more of an art
than it is a science, too. And it's just really keeping
a pulse on what's happening, making sure you're really present
in that moment. How do you prepare folks
for those situations as well before you're in them? Yeah,
I think it's a great question. There's a couple of different layers for that.
So we actually are right now
doing a lot of training. We've got folks that
have been around for a long time. We realize we have a lot of new
people. So my incident management teams have been doing some regional
training. They did AMER recently, and they're in the middle of doing EMEA, and we're getting amazing feedback from that. I think for both newer and older folks, it's good reminders, but it's also, at the same time, really great education. And I think we do
other things such as shadowing. We make sure that folks that are getting onto new
on calls, that they're doing
a couple of sessions where they're following along and shadowing during someone else's
on call so they can learn from that. Do you do primary,
secondary? Yeah. So it depends on the team.
So the structure can
vary a little bit depending on the size of the team, the criticality
of what services they are over, things like
that. So just a little bit of variance here and
there. But the one last thing that I would say is we do
a lot of exercising.
We have our chaos engineering team that does that. They have a whole suite of tests that we practice. It's really important for us to operate in that exercise as if it's an incident.
We have an incident manager that participates in every single one of those exercises.
We use as much of our tooling as we possibly can to make it realistic and to get in and do all of those things. I think that's the space where we get the most value. And I will just say that trying to expand on that a little bit wider is something that we're looking at across Zendesk, just because we learned
so much from it. How time consuming is that? I think that we have one primary one. Zendesk is an AWS shop, and so, for example, we have AZ resilience testing that we do.
We do an AZ resilience failover exercise, I think it's on
a quarterly basis. Right. You know, that takes
a lot of prep work, because usually what we're doing with that depends on what incidents we have had, what areas do we need to press on, even using AWS incidents. You have AZ resiliency; do you have region resiliency for data? We have it for data, but we're not, like... Got you. Yeah, we're not across multiple regions per account. I'm curious, because
it can be incredibly time consuming. And you're right, the payoff can be
enormous for companies that are at the stage where it's
worthwhile. But where would you say is the point at which it's worth
starting to invest in that? Because it is pretty time consuming,
and I think it depends on the commitments that
you've made to your customers. What thresholds would
you look for to say, okay, you should start really doing this for real as
a driver? Yeah, I think that's a really great question, and I think that
there's a couple of different inputs to that. First of all,
starting as small as possible, tabletop exercises are a great
way to go if you don't have a lot of time to do things.
If you're not able to prepare your pre prod environment,
to be able to actually put hands on keyboard and simulate things,
you can start at a much lower level and build on top of that.
I think people think right away they've got to go in and
do the big bang. Right. At what stage
should they start doing that?
I think it depends on the company, because I think that there's a different desire
for whatever. Absolutely right. What characteristics
or something would you look for in a company to be like,
okay, it's time for you to really start investing in these affirmative,
active sort of tests? Yeah. So I'm of the
persuasion that I think that companies right now are not starting
early enough with getting incident management practices
and teams and structure and testing in place.
I hate to be annoying about this, I super agree.
But specifically for regional failovers or AZ failovers, that's a big gun in my opinion. Where would you say that comes in? And it's expensive, super expensive. It's not a small cost. You can't just go, oh, everyone should do it as soon as they can, because that's not actually true. Yeah, but it makes total sense
for Zendesk to do it if you think of the nature of what they're offering
and what their customers are going to say when they have incidents. If Zendesk
tooling goes down and one of their customers is in the middle of an
incident, they're not able to resolve their incident.
And so I feel like the urgency of doing what Erin and team are doing becomes higher in that case. You know, we're trying to think about it on a weighted spectrum, but, yeah, you're right.
Like, for some organizations, it probably is not worth the time. I think right now there might be no good answer, just because... I think that, again, that's why, kind of like what Nora is saying, we're a critical
service for our customers.
They depend on us to be able to deliver their customer
support to their customers. So if we're impacted and they're unable
to support their customers, that reflects on them. They don't necessarily know, down the line, that it's a Zendesk incident that's causing that customer pain.
Right. So we have an obligation and the pressure,
too, to actually minimize as much as possible of that impact.
The tolerance these days is a lot lower for that, right?
Yeah, it used to be like, oh,
AWS is down, everybody just go out for coffee. It's not like that.
I mean, I agree. I don't know if it definitely is not
a one shoe fit, all type of a timing, but I do think
that there are baby steps that people can take in understanding what
their business needs are. And again,
the first question that I would ask is, what's the tolerance from the business
for downtime or for impact? I think that that
question is one to put back to the business and to the leaders and say,
what is our tolerance for this? Are you okay? If we're down for this amount
of time, do we have financial commitments? Do we have just, like, word of
mouth commitments? I think that for me, the two things that I would say that
every company should do pretty much across the board, if you have customers,
you should do this. Number one, something is better than nothing.
If it's literally like your homepage is just returning a 500 or a 502, that's unacceptable for anyone these days, at least. If your storage, if your back end writes are failing, cool: try to degrade gracefully enough that you give your customers some information about what's going on and you return something. Right. You should return something, and you should handle as much as possible without making false promises about what you've handled.
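To make the return-something idea concrete, here is a minimal, hypothetical sketch of a handler that degrades gracefully when its backing store is unavailable. The fetch_dashboard_data function, the status page URL, and the fallback message are invented for illustration; this is not code from Honeycomb, Zendesk, or Jeli, just one way the advice could look in practice.

```python
# Hypothetical sketch: serve a degraded-but-honest response instead of a bare 500.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

STATUS_URL = "https://status.example.com"  # assumed status page, not a real one

def fetch_dashboard_data():
    # Stand-in for the real backend call; raises when the data store is down.
    raise ConnectionError("backend writes are failing")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            body, status = json.dumps(fetch_dashboard_data()), 200
        except ConnectionError:
            # Return *something*: tell the user what is going on and where to look,
            # without pretending the request was fully handled.
            body = json.dumps({
                "degraded": True,
                "message": "Live data is temporarily unavailable.",
                "status_page": STATUS_URL,
            })
            status = 503  # or 200 with a degraded page, depending on the contract
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("", 8080), Handler).serve_forever()
```

The design point is just that an honest degraded response beats an unexplained error page.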
And number two, I think to your point about tabletop exercises, I think that as soon
as you're mature enough to have a platform team, an Ops team,
whatever team you're mature enough to have a meeting
at least once a quarter where you look at, okay,
look at your risk path, right? Your critical path. Like, what are
the elements in there? How can they fail?
And what is the limited thing that you can always do if
whatever component goes down? What's your path? What's your story?
What do you try and return? Like at Parse, we had all of these replica sets. The minimum thing that we could do, if almost everything was down except this one AZ, was we could at least return a page that said blah, blah, blah, or whatever. But you should think
through what is my end user's experience going to be like
and have an answer for it and try to make it as good as possible,
because often that takes very little work. You just have to think
about it. And I think you both have touched on the
real benefit behind this is coming up with what is a
good tabletop exercise to run. Right.
It makes sense for Erin and her team to run these region
exercises, because those are issues they've experienced before. But what you're getting at, Charity, is just thinking about your user
lifecycle and thinking about what matters to your users,
like returning something or
thinking about what they're going to experience at that moment, or thinking about what the
time of day is for them, or thinking about when they're most using your
platform. Those are all things to think about. It probably doesn't make sense
for most orgs to just take a standard regional failover exercise, but it does make sense for them to think of something. And I think a lot of the value in it is actually
taking the time to work together to think of that thing. What is
the best for us to do right now? And really digging
into what that means for everyone and why. Facebook got 15 years in. And whatever about
Facebook, but they were 15 years old as a company
before they started actually investing in being able to shut down a
data center. And that was driven not by them wanting to be able to do
this, but by the fact that their original data center was shutting down.
And so they were like, well, we got to figure this out anyway. Let's make
it so that it's a repeatable process. And then kind of like what you were
saying, Erin, like once a quarter from there on out, they do these
highly prepared, shut down a region failover sort of practices just for
hygiene. But, yeah, I think it's easy to say, oh, you should do this, it's easy. But when you sit down and think about it, it's costly, it's hard, it's going to take resources away. And so you should be thinking pretty narrowly about what is the least you can do. What's the 20% solution here that gives 80% of the benefit? Exactly.
And again, I do think it's important to know that what I'm
speaking about has been years of work. I mean, I've been at Zendesk for over seven years. We started little bits
and pieces of incident management when I came on board and
building out our resilience organizations, our reliability organization,
there's been a lot of time and effort. What does your resilience
organization look like, if you don't mind?
Yeah, so I actually have a very interesting team under
resilience. I've got our technology continuity organization,
which is all about our disaster recovery and continuity
of our technology. And we have
incident management, which I've mentioned already. And then I also
am responsible for business continuity and crisis management at Zendesk.
So not only are we focusing on the engineering pieces
of things, but we also are responsible for what happens with
business disruptions to the overall business operations.
And then there's one other part of my team. I've got our
resilience tooling and data, and they are responsible for... the engineers are awesome, and they've built all of the
in house tooling that we have for incident management,
and they support the integrations and things like
that for the other tooling that we leverage outside of Zendesk.
And then we have our data team, which is newer. We've been like, data! Obviously, data is the lifeblood these days.
That team is about a year old. But really doing
the crunching of the data that we have related to incident management and
the impact that we have as a business and bringing those back to the business
so that we can make critical decisions.
So it's an interesting,
wide breadth of responsibility when it comes to disruption.
And the alignment there makes quite a bit of
sense for the Zendesk organization. Thank you.
Yeah, that's awesome. And I feel like what you're getting at
is a lot of, I think, the real value of SRE, which is
understanding relationships and organizations like technology
relationships, human relationships, all kinds of things. And it becomes invaluable
after a while. Charity. I'm going to turn the question
over to you. Can you tell me a little bit about your leadership
style when an incident happens,
and this could be your philosophy towards how folks should
handle incidents in general? It can be anything.
Yeah. These days, I don't respond to incidents at
all. I think I am very high up in the escalation chain; it fails over past the front end engineer, and then the back end engineer, and then the successive engineer, and then something, something. But I maybe get a page once a year, and it's usually an accident,
which I feel
like the people that we have are so amazing. Like fucking
Fred and our SRE team and Ian, and they're amazing.
And so I can't really take any credit for all of the amazing stuff
that they're doing these days. I feel like I can take some credit for the
fact that we didn't make a bunch of mistakes
for the first couple of years. And so, you know how there's a stage that
most startups go through where they're like, whoa, we're going through a lot
of growth, and suddenly everything's breaking. Let's bring on someone who knows something about operations.
We never went through that. In fact, we didn't hire our
first actual SRE for four years. Then he was
there for a year, and then he left, and then we didn't hire another for
another year. We really
have the philosophy of engineers write their own code, and they
own their code in production, and for a long time,
that was enough. Now these days it's not. And much
like Erin and much like you, Nora, we have a service that people
rely on above and beyond their own monitoring,
right? And above and beyond their own systems, because they need to be able to
use it to tell what's wrong with their own systems,
and which I know we're going to talk about build versus buy a little later
on. And this is where I have to put my plug in: you should never, as an infrastructure company, put your monitoring and your observability on the same fucking hardware, the same colo, as your own production system, because then when one goes
down, you've got a million problems, right? They should be very much
segregated and separated, whether you're buying something or
building it yourself. But TL;DR,
I think we did a lot of things early on
that were helpful. For a long time, we only
had one on call rotation, and all the engineers were in it.
One of my philosophies about on call is that the rotation should be somewhere around seven people, right? Because you don't want it to be so short that everyone's getting burned out. You want them to have like a month to two months between on call stints, right? You want to give them plenty of time to recover, but you don't want to give them so long that they forget how the system works, or so long that the system changes out from under them. And when your system is changing rapidly, more than two months is just too long.
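As a rough worked example of that sizing rule, assuming one-week shifts (an assumption; the shift length isn't stated here and this isn't a documented Honeycomb policy), the arithmetic looks something like this:

```python
# Back-of-the-envelope check of the "about seven people" rule, assuming weekly shifts.
WEEKS_PER_MONTH = 4.33

def months_between_stints(rotation_size: int, shift_weeks: float = 1.0) -> float:
    """How long one engineer goes between on-call stints."""
    return (rotation_size - 1) * shift_weeks / WEEKS_PER_MONTH

for size in (4, 7, 10):
    print(size, "people ->", round(months_between_stints(size), 1), "months between stints")
# 4 people  -> ~0.7 months: too frequent, people burn out
# 7 people  -> ~1.4 months: inside the one-to-two-month sweet spot
# 10 people -> ~2.1 months: long enough that a fast-changing system may have drifted
```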
Now we have multiple on call rotations, of course.
But I guess the main philosophical thing
that I will point out here is that the
way that I grew up dealing with
outages and monitoring and everything, they were very much intertwined.
Right? The way that you dealt with outages was the way that you debugged
your system and vice versa. And really,
when you were dealing with a lamp stack or a monolith, you could look at
the system, predict most of the ways it was going to fail, write monitoring checks
for them. Like, 70% of the work is done because all of the
logic was tied up inside the application, right. It would fail
in some weird ways. You tease them out over the next year or so,
and only like a couple of times a year, would you really just be completely
puzzled, like, what the hell is going on? It didn't happen that often,
but these days it happens all the time. It's almost like
every time you get alerted these days, it should be something genuinely new,
right? We've come a long ways when it comes to resilience and reliability.
And when things break, we typically fix them so they don't break anymore.
Right. It should be something new. And now you've got microservices
and multitenancy, and you're using third party services and platforms
and everything, and it's just wildly more complex. Which is why I
think you have to have engineers own their own code in production. But I also
feel like the processes and the tools and the
ways we think about debugging,
you need to kind of decouple them from the process of
the sites down. We have to get it back up, just the emergency sort of
stuff, because you can't look at it and predict 70% of the ways it's
going to fail. You can't predict almost any of the ways it's going to fail.
You can write some checks, to Erin's point, but that's not
going to save you much because the system is going to fail in a different
way every time. And this is where dashboards
just become kind of useless, right? If you're used to just like,
eyeballing your dashboards and going, it's that component or it's that metric
or something that's not good enough anymore. And this is where
obviously, everyone's heard me ranting about reliability, about observability, for a
long time now. But I think that the core components of observability you
can really boil down to. You need to support high cardinality,
high dimensionality, and explorability. And the shift
for engineers is going from a world where you had fixed dashboards
and you just eyeballed them and kind of pattern matched. Oh, that spike
matches that spike. So it's probably that that's
not enough anymore. You can't do that. And the shift to moving to a more
explorable, queryable world where you're like, okay, I'm going to start at the edge.
Break down by endpoint, break down by request
code, break down by whatever. Or if you're using BubbleUp, you're just kind of like, okay, here's the thing I care about.
How is it different from everything else? But it's, like, getting over that hump of learning to be more fluid and using your tools in a more explorable way. It's different, and it's a really
important mental shift that I think all engineers are having to make
right now. Yeah, absolutely. And you mentioned earlier you
didn't really have a philosophy. You haven't responded to an incident in
a while, but it's like, that's a philosophy. Yeah, that's true.
In your organization, there was a point when you last responded to an incident deep in the weeds, and I wonder when that point was for your business. Right. Obviously, the business shifted in
its own way from a market perspective and things like that,
and it wasn't maybe the most
valuable use of time. And by
doing that, you're also giving everyone else a chance to do what they
do best at, which is what you mentioned. Right. Which is an amazing
philosophy on its own, is like giving people the time and space to
become experts and to build their expertise. Yeah, well, that was
pretty early on for us, and it wasn't an intentional thing so much
as it was. I was CEO for the first three and a half years.
I was constantly flying.
I couldn't do it, and the other engineers had
to pick it up. And before long, there was so much drift that I wasn't
the most useful person. Right. These are complex, living, breathing systems
that are constantly changing, and it doesn't take very long
for you to not be up on them anymore. But you'd be surprised how many leaders don't necessarily let go. That's a mistake. That's a big mistake.
I love responding to incidents.
I love them. That's my happy place. I have ADHD,
which I just found out about, like, two years ago. So I never understood why
everyone else is freaking out about incidents. And that's when I get dead calm and
I can focus the best. The scarier it is, the better I can focus. And I'm just like, whee. So that's really hard for me
to want to give up because I suddenly don't have any times where I really
feel like I'm useful anymore. But you have
to. You have to, because just like you said, you have
to let your team become the experts. If you cling to that,
you cling to it too hard.
You're just preventing them from being the best team that
they could be. I will give you that. And they won't feel like you trust them. That's the thing. Like if I show up every time, they're just going to be like, she doesn't trust us. I was going to say, or that they're not up for the job anymore.
It's like there are a number of different things. Sorry, I didn't mean to interrupt
you guys. I was just going to say that I have an old
leader back in the day. He was incredible and very technical and very smart, and I would have to have side conversations with him and
be like, people love having you in there, but you
also are preventing them from learning this. Let's wait.
I know we want to get this taken care of, but give them that little bit of room. Because it does have a different reaction
than people all of a sudden are like, oh boy, they're here. They're going to
tell us what to do. We'll just wait to be told. Instead of saying,
actually, I know this, and this is how I should handle it. We fixed this
before. You could have a more junior engineer
in there that's never worked with this SVP before.
And they're like, whoa, I'm not speaking up right now. Yeah,
totally. It removes a sense of safety, it adds an edge of fear.
And the thing is that when you're dealing with leaders like this, and I've been
one before, you just have to look them in the eye and shake their shoulders
and say, it's not about you. It's not about you anymore.
You are not as clutch as you think you are. And it's not your
job to be as clutch as you want to be. You need to step aside
and let the next generation. They'll fumble it a little bit. It might take them
a little longer the first time, but they'll get there and they have to.
I was in an organization that was like, it was a public
facing company. It was used by so many different folks
in the world. And still, even after all that,
the CTO would jump into incidents and as
soon as they joined the channel, everybody stopped talking. Well, no, it would go from one person talking to several people talking, and the time between people hitting enter increased.
This person didn't realize it, but it was almost like we had to
take them through it. It's just showing them how
the tenor of the incident changes. And it's not even just people
that high up. It's certain senior engineers, too. I think also
as leaders, we have to be
really intentional about
what we're rewarding and how things are happening,
because I also saw at that organization, as well, very senior engineers, even when they entered the channel, they would come in and just save the
day. There was one engineer that when they joined, everyone reacted with the
Batman emoji, and they were like the most principal engineer.
And I saw more junior engineers acting like this person,
and they were hoarding the information. They had their own private dashboards, because what we were incentivizing was this culture where you came in and saved the day as well.
And it was kind of fascinating. Yeah, I think,
Nora, super real. One of the things along
that line is that obviously
we have Jeli that we've been using, and one of
the things that is really shining through there is where we've
got those folks that the incident is called. They're hopping in there. They're not on
call, and they're doing work. And there's
a lot of opportunity to learn about that because
there are situations where maybe we don't have the right page set up
for that particular type of event, or there's
many different elements that can contribute to that happening. But one of them is this: when shit really hits the fan, I bet you all have your top four or five engineers that you're pinging on the side, being like, hey, are you watching this? And holding them in reserve, because you know that they're going to be able to come in and really help.
But at the same time, there's just so much to learn about
how to manage the people element of it. Because again, if you keep having them come in and do that,
that's not going to help the organization overall.
Ultimately, in that case, you have to get to a point where you
say, thank you so much for being here. We do rely on you sometimes,
but wait for us to ask you. Wait for someone to escalate to you before
you jump in. Because that gives the incident commander, that gives them a genuine chance
to try and solve it themselves without just constantly looking over their shoulder going,
when is so and so going to jump in and fix it for us,
right? And they know that they need to give it a shot, and that
person is still available. But like the training wheels,
you need to try and wobble around with the training wheels off a little bit
before you reach for them. Yeah.
And it's giving them the opportunity
to teach others their skills. And this is how
you become even more senior, by doing stuff like this. Also, when you
do escalate to someone, ask them
to come in and try not to do hands on themselves, but just to answer
questions about where to look. Yeah.
One of the things that we did recently,
I guess it's not that recent; man, my brain, it's been a while. But we got away from having an engineering lead constantly in our incidents, our high severity incidents, and it worked pretty well for a while, and we were seeing that things were going well. But one of the things that we were noticing is
that there wasn't always someone there to kind of take the brunt
of a business decision being made. And what I mean by that
is when trade off conversations are happening,
having the right person to ask a couple of those pointed questions, someone that has a broader breadth, to really be able to raise their hand and say, yes, go ahead and do that.
I'm here. I've got you. Yes, go.
And really doing exactly what you're saying, which is, most of
the time, really listening and making sure that the team has what they
need, but then finding the opportune moments to make sure that
things are moving along, the questions are getting asked, and,
again, really taking on the responsibility and taking that pressure off those engineers, who may not be as comfortable saying, yes, we need to do this.
One of the things at Zendesk right now that we're very laser focused on is reducing all of our mean-times-to. So we've got a bunch of different mean-times-to, and I'm saying that in the sense that we've got a bunch of them: mean time to resolve, mean time to activate, all of those. And I think that there's these
very interesting pockets of responding
to the incident where, little by little, reducing that time,
especially around that type of a thing, where, again,
there's a debate going back and forth of whether or not to do something.
And it's, again, just especially
when it comes down to not the engineering decisions, but the
business decisions. Right. A lot of engineers will feel like, okay,
but now I'm at this risky precipice, and I'm not sure
how to make this decision about this risk to the business and how to analyze that. That's really important. Do you solve this with roles?
Yeah,
we now have our engineering lead on call rotation, which is
pretty much, this is a little inside baseball, but like director-plus. We follow the sun, primary and secondary, and we have a handful of senior managers that are in that. But for those senior managers, this is a growth opportunity for them to be able to get to that level.
It's a great opportunity to get some visibility and to show leadership skills outside
of what their area is.
And I think that what's interesting,
again, observing different incidents where we have different
leaders, they take different approaches. Right. Like,
you have some of the ones that are like,
they're there, they have their presence, they're listening, and then they're
ready to go with their three very specific questions as soon as there's
a break in the conversation, whereas there's other leaders that get a little bit more involved. And so
I think the most important piece of that role
is the relationship with the incident manager. The IMs should be managing; they are in charge ultimately, right? They are the ones that are making sure that all the things are getting done. But I think that that
partnership between the eng lead and the IM, and even having a little back channel conversation between those two on Slack to just say, hey, here's what I'm hearing. I'd rather make sure that they're pushing things along instead of having me jump in, because if I jump in, then it can shift to, like, oh, whoa, why is so-and-so all of a sudden asking this question?
So I think that there's a lot of opportunity there. And that works well in a situation where you don't have a lot of people that are trying to exert a lot of control. It's not those power grab situations.
We don't really have that at Zendesk. I was mentioning
earlier, it's a really collaborative and all for one type
of the way that we operate.
But I think that there's unique elements like that
that, like, I wouldn't necessarily recommend having that role for
all programs. Right. Like, that might not be the right fit.
That might actually not be the way to become more effective.
But for us, that's helpful in the sense of let's make sure that there's someone
there to really own those critical decisions.
Yeah, makes sense. Erin, you touch on this
a little bit, but I mean, a lot of the folks that are going to
be listening to this have no idea the amazing learning
from incidents culture that you are building at Zendesk
and just how hard and intentional that
is with a really large organization with
a very pervasive tool in the industry.
And I'm curious,
you obviously can't distill it in like a five minute thing. But I'm
curious what advice you might have for other folks looking to build that out, looking to create an intentional learning culture there. And I think one of the things you suggested was that you put
it on the career ladder a little bit. Right. And so folks are being rewarded
for it in a way. And I think more of the tech industry needs to
do that and be that intentional about it. But is there any other advice you
might give to folks? Yeah, I think that that's a really good question.
And it's funny, like, yeah. Trying to squeeze it into a five minute,
keeping it short. I think
that it really starts from
the folks that are responsible for incident management. Right.
Wanting to make sure that they are understanding and
operating their process as efficiently as possible.
Right. So when we do that and we learn from it,
we take the opportunity to really have a blameless culture.
And that is something that we really do stand by at Zendesk. We always have.
I feel like I keep kind of mentioning these rah-rah things about Zendesk,
but it is true. You should be proud.
We do practice these things. And I do think, though, that that
started from a very early foundation. I think that we have
maintained that culture, which helps quite a bit from the
learning. Right. Like the opportunity to learn. I will say
that what we ended up doing is,
I mentioned this earlier, we collect a lot of data. Our incident retrospectives are very intricate. At the
end of every incident, there's a report owner that's
assigned. If it's a large incident,
there may be more than one. And they
work on doing their full RCA, and there's so much
information. What's an RCA? Root cause analysis.
Yeah. And as part of that process,
it's really the work that they do to really understand what had happened,
providing the story, the narrative, the timeline, the impact, the remediations, and being able to share that out.
Then we have our meeting where we go through,
we talk about the incident, and then from
there we push out our public post mortem that
is shared out to customers. We also have an event analysis
that we do prepare within 72 hours that gets shared
with some of our customers upon request. It's just, this is what we know at this point. That also helps a lot with our customer piece. That's not about learning, though. So I think
that what we've done is we have established a pretty strong
process that has worked pretty well for Zendesk, in the sense that we've been able to learn quite a bit from our incidents. And in our remediation item process, we have SLAs against our remediations.
And with those remediations, I think that I was mentioning the chaos engineering; that's a more recent thing. And I'm saying these timelines loosely, where me saying recent could actually mean a little bit longer. But taking those and
saying, hey, we had this incident. Let's do this exercise and
validate that all of these remediations that you said are completed,
we are good. We've mitigated this risk.
And there's this whole piece of, I mentioned
about the data that we collect, right. And being
able to put together a narrative and a story that is
really able to help the business understand
what it is that we're dealing with, and how
are we actually taking that data and making changes? How do
we drive the business to actually recognize where
we need to be making improvements? And again,
we've been collecting this data for years now, so we've got a lot of historical
context, and it's very helpful. And I think what we've done now with adding Jeli to the mix is that it's a whole other layer for us; we haven't been able to get that level of insight before.
And we were ready for it in terms of
being able to do things like pulling our timeline
together from various slack channels and being able
to see the human element and involvement
and participation. I think those
are two really big spaces that we had some blind spots
to. Because again, when I talk about us trying to reduce our mean-times-to, there are things that are happening in other Slack channels, where there's time spent on triage. There's these pieces that, sure, you can put into a big write-up of a report saying, oh, we triaged this in the such-and-such Slack channel, with a link to the Slack channel, but that's not doing you any good, actually. When you talk about blameless culture and all this stuff, is that just within engineering or is that
company wide? Good question.
I think it's company wide. I think every organization
kind of has their own little. Because this is something
that I have definitely struggled with a little bit. And I don't mean to
throw others under the bus or anything, but in engineering,
I feel like we have for a couple of decades now,
been working on training. None of us
just shows up out of college or whatever,
just like, yay, I know how to admit my mistakes and feel safe and
everything. This has been like a very conscious, very intentional, multi year,
multi decade effort on behalf of the
entire engineering culture to try and depersonalize
everything in code reviews to create blameless retros,
and we still struggle with it. And it's something where I feel like other
parts of the business, like sales and marketing, they're still fucking petrified.
They're just, like, so afraid to talk about mistakes
in public. So I keep trying to find out if
there are companies out there who are trying to expand this beyond engineering because it
feels so needed to me, and I'm not sure how to do it.
So I can talk to that in a little bit of a different way than
how I would say, blameless. So we do
a lot of work with our go to market folks in terms of enabling
them to be informed during an incident.
Immediately after an incident, they have process around where they can
understand what's happening. We obviously maintain
our space where we're actively responding to the incident, but there's other ways for them
to gain information so that they can communicate with their customers.
I think that Zendesk is also a pretty transparent
company when it comes to taking responsibility for where we've
had missteps.
But I will say, when we have, I will call them noisy times, where we may not have that many high severity incidents, right, but we could have an enterprise customer that has a bug that hasn't been fixed, that's taking longer than they expect. They could be
having an integration issue that has nothing to do with Zendesk side, but it's
on the other side of it. There could be multitudes of
different things that have nothing to do with our incidents.
Right. And so that's where we run into
the frustration from, I think, our folks that are
trying to just support our customers, because I think that
that's a big piece of being able
to enable them to be able to have the right pieces
come together, to be able to work and interact
with their customers on those things. Totally.
And I think that that's what you're describing, too,
is like, creating this big, same team vibe. I've definitely seen engineering
cultures where they're like, no, if you refer to this doc
that we've pinned in our channel, it's actually not an incident. So you can't
talk to us right now. File a ticket. And it becomes incredibly frustrating for everybody. And the engineers, even though they may be creating this internal blameless culture, they're not really having that vibe with their other colleagues, too. I really like that. Totally. One of the other things that we implemented
a while back was we have an engineering leadership escalation
process. So if
there's a customer that is frustrated and they have
whatever's going on, our customer success folks can go in
and put in a request, and basically, depending on
either executive sponsorship and or expertise in that particular area,
we then can line them up to be able to have a direct
conversation with one of our leaders within engineering.
This is something that originated from the fact that I used to have to manage crazy intake spreadsheets of frustration from certain
customers. Right.
And there were a couple of us that would take those calls
and just like, you would get on a call and you would listen and
you like, you know. Yeah, most of the time it was they just wanted to
be heard, and they just wanted to hear directly from somebody that was
closer to it. And I think that, again, I mentioned
that event analysis document that we put together. I think that also went
a long way with bridging that gap a little bit with like, hey, we don't
know everything yet, but this is what we do know,
and this is what we're doing about it right now.
I think that there's ways to kind of bring it in so that the rest of the organization is trusting in the engineering organization as a whole. There's definitely ways to do that. But again, it's what
we were saying. I think it goes back quite a bit to the culture element.
I mean, that artifact that you're talking about creating is really
a shared vernacular. Right? It's like thinking about who is going to read what you're giving to them. If you're creating something that is just for engineering and is just about the technical system issues, those do impact customer success, but they're not written in any sort of language that they can understand. That doesn't give that same-team vibe, and it makes them maybe not want to participate afterwards.
I think the tech industry is really cool, and engineering organizations have a lot of influence. And so if you can enroll other
teams and other departments into your processes that are usually
quite good, they might be able to learn as well.
I was in an incident like several years ago on
Super Bowl Sunday, where none of the engineers that were on call knew that we were running a Super Bowl commercial. And we went down during the Super Bowl commercial, and I came to the incident review the next day, and it was only SREs in the room. No one from marketing
was in the room, no one from PR was. And it was just
like, it was very much a we-should-be-prepared-for-every-situation vibe, but that's not really it. Maybe we can talk about that. I feel like a lot of this is how we could have coordinated here with other departments, and how we were exacerbating the issue by not talking to them about it.
Totally. We only have about 15
minutes here, and I know you both have really great
recent incident stories that I kind of want to hear about, so I
want to shift gears a little bit.
Charity, I would love to hear about a
recent outage that you all had in as much detail as you're
willing to go into and what you've learned from it.
And what I will say, that as
much detail as I'm willing to go into is all of the detail. This is
one of the marvelous things about starting
a company. I was always so frustrated with the editing. Like, I would be writing a postmortem of here's what happened, and they'd be like, scratching things out: no, no, you can't say MongoDB, say our data store; or no, you can't say AWS, say our hosting provider. I'm just like, what the fuck?
Or maybe, no, this reveals too much about the algorithm.
And I'm just like, that ties into the build versus buy culture that we're supposed to talk about later. But if you're not allowed to name that, you're being like, no one needs to know that we paid for this thing. It's like, yeah, they should know that, because that has nothing to do with your core business.
Yeah. I feel like everybody's
always so worried about losing trust with their users, and I just feel like the
more transparent you are, the more trust you'll build, because these
problems are not easy. And when you go down a couple of
times, users might be frustrated. Then you explain what happened
and they're going to go, oh, respect. Okay, we'll let
you get to it. We'll talk later when things are up. In my experience,
more detail is always better anyway.
And so that was one of the first things that I did as
CEO at the time. I was just like, look,
whatever you want to put in the outage report, anything that's relevant,
just write it all, let people see it. Anyway,
yeah, we had this really interesting outage a week or two ago. Maybe it was just last week. And it's the first one in a while; the last time we really had an outage might have been something this spring, but before that it was like a Kafka thing,
like a year ago. So this was significant for us.
And I got a couple of alerts that honeycomb was down, and I was
like, wow, this is very unusual. And it
took us a couple of days to figure out what was going on, and basically
what was happening was we were using up all the Lambda capacity, period, repeatedly. And the way our system works is we have hot data, which is stored on local SSDs, and then cold data, which is stored in S3. And stuff ages out to S3 pretty quickly. And this
outage had to do with SLOs, because every time that someone sets an SLO, they set an SLI, right, which rolls up into an SLO, and the SLI will be like, tell me if we're getting X number of 504s over this time or whatever. So we have this offline job, right, which periodically kicks off and just polls to see if things are meeting the SLI or not. And we realized after a while that we were seeing timestamps way into the future. And what happened was, anytime a customer requests an SLO that doesn't exist yet, we backfill everything, because they might be turning on a new SLI, right? They're writing something new. So as soon as somebody requests it, we backfill. And this was happening just over and over and over and over. It didn't have any valid results, so it wouldn't make a cache line. So every single minute it would launch a backfill, and all the Lambda jobs were just spun up just trying to backfill these SLOs.
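Here is a toy model of the failure mode Charity is describing, plus one possible guard (caching the empty result so the per-minute poller stops re-launching backfills). The function names, the cache shape, and the guard itself are illustrative assumptions; the actual Honeycomb code and fix may look nothing like this.

```python
# Toy model: a poller that keeps backfilling an SLO whose window is in the future.
import time

cache = {}                 # (slo_id, window_end) -> SLI results
backfills_launched = 0

def compute_sli(slo_id, window_end):
    # A window entirely in the future has no events yet, so there are no valid results.
    return None if window_end > time.time() else {"good": 991, "total": 1000}

def get_slo_results(slo_id, window_end):
    global backfills_launched
    key = (slo_id, window_end)
    if key in cache:
        return cache[key]
    backfills_launched += 1            # in the real system: spin up Lambda jobs
    results = compute_sli(slo_id, window_end)
    if results is None:
        # Without this caching of the empty result, nothing gets stored and the
        # next poll backfills again, every minute, forever.
        cache[key] = {"empty": True}
        return cache[key]
    cache[key] = results
    return results

future_window = time.time() + 7 * 24 * 3600
for _ in range(5):                     # five ticks of the per-minute poller
    get_slo_results("checkout-latency", future_window)
print("backfills launched:", backfills_launched)   # 1 with the guard, 5 without it
```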
What was interesting about this was that we
often think about how users are going to abuse our
systems, and we try to make these guardrails and
these limits and everything. And in fact,
most of our interesting outages or edge cases
or whatever come from people who are using the system as
intended, but in a really creative, weird way,
right? This is not something that we ever would have checked for,
because it's completely valid. People will spin up SLIs and need to
backfill them. People will set dates in the future. All of these
are super valid. I feel like when I was working on Parse,
I learned the very hard lesson repeatedly about creating these bounds checks
to protect the platform from any given individual, right?
But it becomes so much more interesting and difficult of a problem if what they're
doing is valid, but in some way because of the size
of the customer or because of whatever,
it just is incredibly expensive.
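A hedged sketch of the kind of bounds check being described here: a per-tenant cap on work that is perfectly valid but can get expensive at scale. The limit value, the exception type, and the function names are all hypothetical.

```python
# Illustrative guardrail: cap concurrent expensive jobs per tenant so one customer's
# valid-but-extreme usage cannot consume the platform's shared capacity.
from collections import defaultdict

MAX_CONCURRENT_BACKFILLS_PER_TENANT = 3   # made-up number for illustration
in_flight = defaultdict(int)

class TooExpensive(Exception):
    """Raised when a valid request would exceed the tenant's fair share."""

def start_backfill(tenant_id: str) -> None:
    if in_flight[tenant_id] >= MAX_CONCURRENT_BACKFILLS_PER_TENANT:
        # The request itself is legitimate; we defer it rather than reject the tenant.
        raise TooExpensive(f"{tenant_id} already has {in_flight[tenant_id]} backfills running")
    in_flight[tenant_id] += 1

def finish_backfill(tenant_id: str) -> None:
    in_flight[tenant_id] = max(0, in_flight[tenant_id] - 1)
```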
We have a really big write up. Go ahead. I was going
to say, it's always very interesting to see how creative people can get with using
your systems. Like, some things you would never expect; you're just like, wow. Often it's a mistake, or something where it might be intentional and completely legit, but they're doing a very extreme version of it, right? Like one of the very first things that we ran into with Honeycomb: we do high cardinality, high dimensionality,
all this stuff. And we started
seeing customers run out of cardinality,
not in the values but in the keys, because they accidentally swapped the key-value pairs. So the key would have the value in it, instead of it being in the value. And that's where we realized that our cap on the number of unique keys is somewhere around the number of Unix file system handles that we can hold open. So now we limit it to like 1,500 or something like that. Right. This is why I'm like this; you can see my shirt: test in prod or live a lie. This shit is
never going to come up in test environments. You're only
ever going to see it at the conjunction of your code, your infrastructure,
a point in time with users using it. Right. And this
is why it's so important to instrument your code. Keep a close eye on
it while it's new. Deploying is not the end of the story. It's the beginning
of baking your code. It's the beginning of testing it out and
seeing if it's resilient and responsible or not.
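As one sketch of what instrumenting your code and keeping a close eye on it can look like, here is a small example using the OpenTelemetry Python API. The span name and attribute names are invented, and this is one possible approach rather than a Honeycomb-prescribed schema.

```python
# One way to emit wide, high-cardinality telemetry for a new code path.
from opentelemetry import trace

tracer = trace.get_tracer("billing-service")   # hypothetical service name

def handle_export(user_id: str, team_id: str, rows: int) -> None:
    with tracer.start_as_current_span("handle_export") as span:
        # High-cardinality fields like user_id and team_id are what let you later
        # "break down by" them, instead of eyeballing a fixed dashboard.
        span.set_attribute("app.user_id", user_id)
        span.set_attribute("app.team_id", team_id)
        span.set_attribute("app.export.rows", rows)
        # ... do the real work, recording outcomes as you learn what matters ...
        span.set_attribute("app.export.succeeded", True)
```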
And you just have to put it out there, keep an eye on it
and watch it for a little bit.
Fortunately, we had the kind of instrumentation and the kind of detail
where we were able to pretty quickly figure it out
once we went to look. But you can't predict
this stuff. You shouldn't even try. It's just impossible. You got to test in prod. It's a waste of time to try to
predict it, too. Waste of time, right.
You'll find out. It will come up quickly.
Every limit in every system is there for a reason. There is a
story behind every single one. Exactly. And people
who spend all this time writing elaborate tests, just trying to test everything. It's a waste of time, because they spend a lot of time on it, the time it takes them to run their tests is forever, and it takes them a long time to know.
Take that time and energy that you're investing into making all these wacky tests
and invest in instrumentation. That effort will
pay off. Yeah, totally.
That is an awesome one. Erin, I'm going to turn it over
to you. Can you tell us about a recent outage that you all had?
Yes, and unlike Charity, I am not the CEO,
so I have to be a little bit more careful with
some of the details that I possibly share. But I
think that before we got on this
call, I was going back and kind of looking through recent incidents
that we've had. And I think that one of the things that I
just want to call out is that it's interesting to
see how over time, your type
of incidents and the nature of them can change.
And I don't have one of my obvious
ones that normally I'd be like, yes, I'm sure we have one of these recently.
And no, it's been pretty quiet in that space for a while.
Right.
I want to just talk a little bit.
This is a little bit different because I think that this is an interesting one
in the sense that it's more about the process of
how I'm going to talk about it than it is about the actual incident itself.
Of course I am because that's where I always come from.
But we had an incident that was called recently,
and the nature of it was, the question came up of,
is this a security incident or is this a service incident? And the team realized, like, oh, we can handle this ourselves; we already tried escalating up through security, and they said that this was on the customer side, so technically it's not a security incident. And so we managed it
because we could take those actions through the service side. We had the right engineers
and all of that. But I think that one of the things that
came back out of that was more of this understanding and discussion around risk. Was that actually the right direction for us to go in terms of handling that from a process perspective, or would it have been better served, managed differently? Because the engineering work that was required was not necessarily high severity, we downgraded the incident to a lower severity level, because for them it wasn't that. But reflecting back and looking at it, it was like, there's a couple of gotchas in here that we need to be more aware of. Are we actually following that process from the right standpoint, even though it doesn't fit in that perfect box of
what it is? Security incidents also roll up through us; like, we are responsible for service incidents, and for security incidents as well, but our threat and triage team is through our cybersecurity organization, so we partner very closely with them. And so obviously that's something where there's some really strong learnings to kind of follow the trail of,
because that's also one where it was very specific for a
particular customer. Right. It wasn't a widespread thing. It was one particular
customer that was having this issue. So following
that trail and determining when it's
a situation of rightfully so, the individual is like,
we need to take care of this. So they raised an incident and
it's like, wow, is it truly? Is it not? But it needed to get taken
care of. But I think
those are the types of things that are always very interesting in terms of like,
it's totally off. It's not a common thing that would happen. And so you
got to go back and just kind of review and figure out how to those
incidents where it's not affecting everyone, those can be the
trickiest ones to figure out. Especially when your slo
or your SlI is like, yeah, we're at 99.9% reliability,
but it's because everybody whose first name starts with like,
slid thinks you're 100% down, or some of
those really weird little edge cases. This is where a lot of my anger at
dashboards comes from, because dashboards just cover over all those sins.
You're just like, everything's fine while people are just like, no,
it's not. Yeah, Erin, you were talking about a lot of the metrics you're measuring earlier, and recently I've been coming around on this. I think metrics have value on their own, and I have a lot of gripes with some of them; I think, you know, sometimes we over-index on them. But I've been trying to come up with more creative metrics
lately, and one of them is how long someone spends waffling
trying to decide if something's an incident.
I love it.
Trying to figure it out, or trying to figure out the severity, and the time it takes, how many people they rope into a conversation. I was at one org that used to get paged all the time, and people just stopped wanting to bother each other, because everyone was getting woken up so much, that one day people would open up incident channels on their own and then sit there by themselves typing everything they were doing for like two hours before they brought anyone else in, which was fascinating. And so we started recording that: how long before paging another person in and deciding it was a serious thing. So it just reminded me of that.
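For anyone who wants to try measuring that, here is a rough sketch of the time-spent-alone-in-the-channel idea, computed from a generic list of (timestamp, author) messages. It is not tied to any particular Slack API, and it is not how Jeli computes its metrics; it just shows the shape of the calculation.

```python
# How long did the first responder sit alone before anyone else joined the conversation?
from datetime import datetime

def minutes_alone(messages):
    """messages: list of (iso_timestamp, author) tuples, sorted by time."""
    if not messages:
        return 0.0
    first_ts, first_author = messages[0]
    start = datetime.fromisoformat(first_ts)
    for ts, author in messages[1:]:
        if author != first_author:
            return (datetime.fromisoformat(ts) - start).total_seconds() / 60
    # Nobody else ever showed up in the channel.
    return (datetime.fromisoformat(messages[-1][0]) - start).total_seconds() / 60

incident_channel = [
    ("2024-03-01T02:04:00", "oncall_a"),
    ("2024-03-01T02:40:00", "oncall_a"),
    ("2024-03-01T04:01:00", "oncall_b"),
]
print(round(minutes_alone(incident_channel)))   # 117 minutes before a second responder
```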
Cool. Well, we are wrapping up time now.
We probably have time for one more question.
And there's like three really
great ones. I'm trying to figure out which one I want you all to answer.
We're just going to have to do a second round at some point. Yeah,
like a follow up, a part two. We'll just end it on a cliffhanger.
Yeah, I guess
I will ask. I know, Charity, you have thoughts on this. I want to hear you talk
about build versus buy with regard to tooling for any
phase of the incident lifecycle. You touched on it a little bit earlier, so I
feel like it's good for us to call back to that. Yeah,
so I preach with the religious zeal
of the converted. For the longest time, I was one of the neckbeards who was just like, no, I will not outsource my core functions. I am going to run my own mail because I want to be able to grep through the mail spools, and it's so scary when I can't figure out exactly where the mail went, like, it goes off to Google and, what the fuck? All right, that was a long time ago, but still,
it was a big deal for me to overcome that and go, okay, I can outsource this to Google, okay, I can do this, whatever. But even up to and including Parse, I was like, no, metrics are not a thing that you get to give to someone else. They're too critical, they're too core. I need to be able to get my
hands in there and figure it out. And I have come around full
circle on that, and did so before starting Honeycomb, just because of the thing that I mentioned earlier, about how you never want your telemetry to be on the same anything as your production systems.
Huge thing. But also it's
a big and really important part of the entire, call it
DevOps transformation or whatever, which is just that we
as engineers, the problem space is exploding.
Used to be you had the database to run. Now you've got fucking how many
data stores do you have? And you can't be expert in all
of them. You probably can't be expert in any of them and also do a
bunch of other stuff. Telemetry. For a long time,
it was just the case that you're going to get kind of shitty metrics and
dashboards, whether you ran them yourself or you paid someone else to do it.
That's not the case anymore. It's come a long way.
You could give somebody else money to do it way better than you could.
And that frees up engineering cycles for you to do
what you do best, right? What you do best is the reason that your
company exists, right? They're your crown jewels. It's what you spend your
life caring about. And you want to make that as an engineering leader,
you should make that list of things that you have to care about as small
as possible so you can do them as well as you can, right?
And everything that's not in that absolutely critical list,
get someone else to do it better than you can. It's so much cheaper
to pay someone else to run services than to have this enormous,
sprawling engineering team that's focused on 50 million different things.
That's not how you succeed at your core business. Exactly; take your headcount and spend them on the things that you do, and do them well. Love it.
I love it. Erin, anything to add there
before we wrap up here? Yeah, no, I mean, I think that it makes a
lot of sense. And again, I'm also responsible for our vendor
resilience. So dealing with understanding what our capabilities
are of the vendors and what we've agreed to, going back with
understanding. Now, I want to pick your brain about that, Erin.
There's a whole binder full of goodies that I have for it.
But, yeah, I think that there's a way to look at it from a perspective
of there's ways in which that we are getting
more sophisticated in how we are creating our systems and
our ecosystems in general overall. And the dependency there
is the web of things that everybody is connecting to and using
and leveraging. And I think that what is best,
again, I will go back to what I was bringing up earlier, which is starting
from a risk perspective of really actually understanding
what your risk tolerance is. If you're looking to use this tool, okay, what are the risks associated with doing that? And are you willing to take those risks, such as the dependency? Right. And again,
what are the things that you can do to ensure that you are mitigating that
risk as much as possible by creating workarounds and really thinking
about your technology continuity element of things and how
you're developing around that dependency. Right. When I was at Parse, we were like, okay, what we do is
we build the infrastructure so mobile developers can build their apps.
Right. So operations was absolutely core to us. We had to
run our own MongoDB systems because it was what we
did. Right. Yeah. But for most people, it's not what
you do. Go use RDS.
It's fine. You're still a good engineer if you do. I promise.
Yeah. And I think it's interesting, too,
because I think that we have certain things internally at Zendesk that have been built, and there's been a lot of work that's gone into them, and a lot of people know what it is, the classic fallacies. But also at the same time, there's this tipping point where that becomes like, okay, we have this dependency,
we created this, we have to maintain it. We have to keep it alive.
At what point do you get to where that becomes like, okay, this is
a much larger task than it would be to outsource it. So I think that
there's different phases and also future proofing
because the people who built it and run it might not be there forever.
Especially if that thing is involved in a ton of incidents.
They are going to burn out and leave, and no one's going to know how to run it. True facts. True facts.
We went through a huge exercise a couple of years ago, this whole entire thing around ownership and self service, really making sure that everything had a clear owner.
It seems like a simple thing to say, but when you have multiple different things, it's stuff that just existed from the beginning.
And it's like, we don't own that. We just use it. We don't
own it. We just use it. Well, somebody's got to own it
because this thing keeps breaking. So can we please figure out
who's actually going to take on the responsibility of it? And I
think that's also it. Yeah, you end up with people that leave, and then you're like, okay, does anybody know what's going
on here? Right? And it becomes more critical than
ever right now. I really want to do a part two with you all
in person. Wine, whiskey, part two. We can totally
do that. And we can record those as well. But thank you both
for joining me. Really, really appreciated the conversation.
And, yeah, look forward to talking more later.
Thanks, Nora. Yeah, thanks, Nora.