Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, my name is Nishant and I'm really excited to be here today at the Conf42 incident
management event to talk to you about that very topic, incident management.
So after this session, you should have a good idea of how to run an effective incident management process within your team or company,
how to minimize the impact to your users or customers,
and finally, how to continue to learn and improve from past experience.
So before we get started, here's a little about myself.
I'm the engineering manager for the ad services platform team at Pinterest, and our team owns multiple critical systems that power a multibillion dollar ad business. In that role, as well as through my involvement with our various incident manager on-call rotations, I've experienced several high severity incidents and learned a lot from that process.
So I'm hoping these learnings can be of use to others who are in similar
positions and help you run an effective incident response process of
your own.
So to start off, let's define what an incident is.
There are a few different situations that we may classify as an incident.
For example, if a system is not behaving as expected, let's say a particular system
has higher latency than normal. Perhaps when users aren't able
to access our service, they can't log in, they can't get to the front page,
or when employees aren't able to do their work, if employees can't submit pull requests,
can't comment on other people's code, so on and so forth.
So we can say an incident is an event that is not part of the regular operations of a service, causing an interruption in its availability or a reduction in its quality.
Then you might say, well, the same also applies to most bugs, right? They're also a manner in which a service may experience interruptions or quality degradations.
And you'd be completely right. That is also what a bug usually is.
So then we ask ourselves, what is an incident really?
How do we differentiate one from a bug? And especially
since we're saying we need a process for handling incidents, we also therefore need
an actionable definition. So here's
how I define an incident. I say an incident is when we need
one or more people to stop everything else that they're doing and come help us stop the bleeding immediately. This is also what differentiates
an incident from a bug, which typically will also have an SLA
for response and may be treated urgently as well.
But for instance, a bug may only require a response during business hours,
and typically only from the team that owns the component that is facing
the bug.
So now that we've defined what an incident is and talked about how it's different
from a bug, let's talk about what incident management looks like.
There's typically five main phases. First of all,
detection. The detection might happen through automated alerts
from our observability or monitoring systems, or through manual reports.
After that, triage the situation,
identify which component is responsible, and identify how
severe the impact is. Next, we go to mitigation, which means we have to stop the bleeding as soon as possible, minimizing the impact to our users, to our employees, so on and so forth. After that, we can talk about prevention. How do we figure out what the root cause was and actually solve it, so that we're confident that this issue is not going to reoccur? And then finally, remediation. What are the follow up action items that we need to make sure that our systems are more resilient to such situations in the future?
So the first thing to discuss is the process of actually declaring
an incident. So given our definition above, we expect someone to
drop everything and respond, which means we need a clear set of criteria for
defining an incident. So let's talk about
when. And to talk about this, I'm going to borrow a
story from the famous Toyota manufacturing line.
Toyota had something called an Andon cord, which was a rope that
any employee on their manufacturing line could pull whenever they found a problem
with production. Pulling this cord would immediately halt
production. The team leader would come over and ask why the cord was
pulled, and then the teams would work together to solve the problem before resuming production.
So in software engineering, we typically don't need to pause all of production.
There may be some really severe cases where we do. However, typically we just
ask one team or a handful of team members to pause their other work, respond to the incident, and help production resume as normal. So based on this, we can say that we should create an incident when we need immediate support.
And what that means will be different between
different components of an organization, different teams, so on and so forth. For instance,
an ads team may define their incident criteria based on revenue impact. Another team may use login failures, app crashes, stale user content, high latency, so on and so forth, content safety, et cetera. So the next question is,
who should be declaring the incident? And the answer here is very easy.
Anybody who notices the impact should be empowered
to declare an incident. And it's extremely, extremely important to empower everyone
and anyone in your team or organization to declare an incident and
make it easy to refer to and understand the incident criteria so
they can do so.
The last question is, how do we declare an incident? Going back
to the Toyota story, all they had to do was pull this cord to declare
that something was wrong. And we need to make it just as easy. So different teams may adopt different processes. We may have a Google form that you fill out to immediately file an incident. You may file a certain type of Jira ticket under a project, which declares an incident, so on, so forth.
Whatever the preferred method is, it just needs to be extremely easy.
And when we file that ticket or fill out that form,
basically when a new incident is declared, we need to make sure that someone is notified. Someone knows that something is wrong. This may be the system owner. Perhaps in the form you can select what system is impacted. It could be the person on call for that system, or the incident manager at the team or company level. It doesn't matter. Someone needs to be notified.
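To make that concrete, here's a minimal sketch of what the notification step behind such a form or ticket might look like. The system names, contacts, and the notify() transport are hypothetical examples, not any particular company's tooling.

```python
# Minimal sketch: route a new incident report so someone is always notified.
# System names, contacts, and the notify() transport are hypothetical.

ON_CALL = {
    "ads-serving": "ads-serving-oncall@example.com",
    "login": "identity-oncall@example.com",
}
DEFAULT_CONTACT = "incident-manager-oncall@example.com"

def notify(contact: str, message: str) -> None:
    """Stand-in for a paging or chat integration (email, Slack, PagerDuty, etc.)."""
    print(f"NOTIFY {contact}: {message}")

def declare_incident(reporter: str, impacted_system: str, description: str) -> None:
    # Anyone who notices impact can call this; even if the impacted system
    # isn't recognized, the incident manager on call still gets paged.
    contact = ON_CALL.get(impacted_system, DEFAULT_CONTACT)
    notify(contact, f"Incident declared by {reporter} on {impacted_system}: {description}")

declare_incident("jane@example.com", "login", "Users report login failures")
```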
So now that we know what an incident is and how to declare one,
how do we tell which incidents are more severe than others?
So we know that all incidents are not the same. Even though we're
saying that incidents require one or more people to drop everything and respond,
the scope of how many people are expected to respond may vary
greatly based on the impact. For instance, let's say you're a
company or team that has an Android app, and you
have 1% of Android users reporting that the app occasionally crashes
every couple of days, versus we say that 20%
of your Android users cannot log into the app at all.
Those are two vastly different scenarios, and the number of people
or teams expected to respond may be vastly different as well.
Which means, as we said before, every team, component,
or service within your company's architecture needs some customized criteria
for incidents.
The general framework we can adopt is how many people need to respond or
need to know, since we're kind of using this framework to determine what an incident
is anyway. So typically at most places,
we think about incident severity levels as starting at sev three or sev four for the lowest severity incidents, and then going all the way up to sev zero, which is typically a complete outage, all hands on deck.
We need to fix the situation immediately. So one
way to think about this is how many people are expected to respond?
Or how many people, or how many levels up do we need
to notify about the impact? So again, going back to some of
these examples, we said, let's say if there's only an elevation
in latency, there's a regression in latency. Your home page is loading slower
than it usually does, that's probably a low severity incident. Let's say it's a sev three or sev four. Typically only the team that owns that component, their manager, and a few others might need to know or respond to the situation.
However, if user login is down and no one can get
into your site or your app, typically company execs would like to
know. And that sort of gets us into the sev zero or sev one criteria.
So these are two frameworks to think about how to define your severity levels.
Based on that, then you will define certain thresholds for different parts of
your stack, your architecture, different components.
You'll come up with different thresholds to define what an incident is.
So if there's $1 in revenue loss, that is one level of
severity. If there's $10 in revenue loss, that's a higher level. So on and
so forth.
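To illustrate what per-component criteria could look like in practice, here's a minimal sketch. The components, metrics, thresholds, and dollar amounts are made-up examples, not actual values from the talk.

```python
# Minimal sketch of per-component severity criteria.
# All components, metrics, and thresholds below are illustrative assumptions.

SEVERITY_THRESHOLDS = {
    # component: list of (metric, threshold, severity), checked from most to least severe
    "ads": [
        ("revenue_loss_per_hour_usd", 100_000, 0),  # sev 0: all hands on deck
        ("revenue_loss_per_hour_usd", 10_000, 1),
        ("revenue_loss_per_hour_usd", 1_000, 2),
    ],
    "login": [
        ("login_failure_rate_pct", 20, 0),
        ("login_failure_rate_pct", 5, 1),
        ("login_failure_rate_pct", 1, 3),
    ],
}

def classify_severity(component: str, metric: str, value: float) -> int | None:
    """Return the matching severity level, or None if incident criteria aren't met."""
    for m, threshold, severity in SEVERITY_THRESHOLDS.get(component, []):
        if m == metric and value >= threshold:
            return severity
    return None

print(classify_severity("login", "login_failure_rate_pct", 20))  # -> 0
```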
So now that we understand when to declare an incident, what do we actually do once that happens?
The first thing is some administrative work to take care of.
Open up your communication channels. This may include starting a
chat room or a slack channel, setting up a video conference
if you're in the office, even marking out a physical room that you're going to use to respond to this incident, do all your communication,
so on and so forth. Optionally, you may also choose to create
an investigation doc to keep notes. These may include what steps were taken, who got involved, what the impact is, et cetera. This can sometimes be helpful if the incident response requires many parallel tracks of investigation. If we're
expecting more and more people to jump in over time, this document can become the
source of truth for everyone to quickly ramp up and figure out
what the situation currently is.
Next, we need to define and assign some critical roles.
So the first one we have here is the incident runner and
this person is responsible for the outcome of the incident response.
That means they're responsible for helping drive this incident to a
resolution. This may involve communication with the
various folks involved in the investigation,
escalating to other teams or other folks on the team, and actually supporting
the investigation and the debugging process.
The other role is the incident manager role and this person is
responsible for coordinating the incident response. So what that means
is they're here to help the team understand the impact,
determine the severity, identify an appropriate incident
runner. This may be the subject matter expert, the system owner,
the on caller, so on. They're also here to support the incident runner,
handle broader communications. For example, status updates to a broader audience,
cross team escalations, so on and so forth. The incident manager
is also overseeing the process, asking the right questions and helping the team prioritize
between next steps. They're also here to keep the team calm and
like I said, prioritize those next steps. And then finally, it's really important that
this incident manager is someone who's a confident decision
maker with good problem solving skills.
So let's talk about how we do this impact assessment. How do we figure out what the severity is? This is extremely important to understand whether we need to escalate to other teams, what channels of communication we need to engage, and whether this is something that is user or potentially partner facing. If you have other businesses as your customers, for instance, you may need to send out external communications as well to inform them of your service's outage or degradation in quality. Finally, assessing the impact
is also really important to determine what your resolution process
is. For instance, let's say you have an incident late at
night, it's two in the morning, or an incident that's in the middle of the day when everyone is around, your whole team and your company are around to help with the response process. Based on the severity of the incident, we may choose different processes to resolve it. So, for instance, if it's the middle of the day, we might be okay with a hot fix or even rolling forward. Late at night, we may prefer to be extra cautious and only roll back.
So we're not introducing any new changes. We're not introducing
any new potential sources of instability to make the situation worse.
Once we've done this, once we've figured out the impact and determined the severity, the incident runner and incident manager can start to take on different responsibilities. The incident runner can start to play a more active role, actively debugging and escalating to relevant people or teams for support, while the incident manager takes care of internal comms to teams, to the overall company, to execs if necessary, and external comms, like we said, if partners are impacted.
The first priority of the incident response process here is always to stop the bleeding first. The root cause can wait if it's not immediately clear. Oftentimes it is, but if it's not, we need to first treat the symptoms. So this
may involve rolling back suspicious changes. If some
services were deployed, if some experiments were rolled out, if some config flags
were changed during this time, even if we don't feel,
based on our knowledge of the systems, that these changes should cause this incident,
if the timeline aligns, it's still worth rolling them back just to
rule them out. The other option here is to just put up a quick patch
to alleviate the symptom if we don't know the true root cause,
or if it'll take longer to fix the underlying root cause.
One example of this is if you have a data dependency and
the system you're calling is suddenly returning corrupt data or empty data,
and your service is crashing, we could potentially just put up a quick patch
to add a fallback path for that missing data.
thereby treating the symptom, stopping the bleeding,
and minimizing the impact to our downstream users.
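Here's a minimal sketch of that kind of quick patch, a fallback path for a misbehaving data dependency. The fetch_recommendations() dependency and the fallback content are hypothetical.

```python
# Minimal sketch of the "quick patch" idea: a fallback path so corrupt or empty
# data from an upstream dependency doesn't crash our service.
# The upstream call and the fallback content below are hypothetical.

import logging

FALLBACK_RECOMMENDATIONS = ["popular-item-1", "popular-item-2", "popular-item-3"]

def fetch_recommendations(user_id: str) -> list[str]:
    """Stand-in for a call to the upstream data dependency."""
    raise RuntimeError("upstream returned corrupt data")

def get_recommendations(user_id: str) -> list[str]:
    try:
        recs = fetch_recommendations(user_id)
        # Treat empty or malformed responses the same as a failure.
        if not recs:
            raise ValueError("empty response from upstream")
        return recs
    except Exception:
        # Stop the bleeding: serve a degraded-but-working fallback instead of crashing.
        logging.exception("Falling back to default recommendations for user %s", user_id)
        return FALLBACK_RECOMMENDATIONS

print(get_recommendations("user-123"))
```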
Once we've done this, once we've stopped the bleeding, cured the symptoms,
we can update our incident status to mitigated, informing everyone
that the impact is now taken care of. We can now focus on the next
steps. The next step, naturally, is to
resolve the root cause, understand what caused the system to get into a
state of an incident, and ensure that we're completing any action items needed
to resolve this, or guarantee that the issue is not going to reoccur
in the near future. What that near future horizon
looks like is sort of dependent on your component or team. But for instance,
you may say that as long as we have the confidence that this incident or
this issue is not going to reoccur for a month, we're okay with saying that the incident is resolved. The second part of resolution is ensuring that all your systems are restored to regular operations. So, for example, if you chose to pause your continuous deploys, if you rolled back any live experiments, or if you made any config changes or undid them, making sure all those are back to normal is a key part of ensuring that you can say that the incident is truly resolved
because your systems are now back to regular operations.
And then lastly, once we've gone through this process of resolution,
sending out a final update to internal and external parties to
inform them, perhaps you might want to inform them what the root cause was,
but most importantly, communicate that it has been resolved, our systems are back
to normal, and we are confident that our systems are stable for a certain
amount of time.
So at this point, we are confident that our incident is resolved, there's no more
impact. We can now start to think about the postmortem process.
So what do you do afterwards? First off,
why do we need a postmortem process? Our goal is to
continue to make our systems more resilient. Like we said earlier, we want to
learn and improve from our past experiences.
So we need to first define and follow a rigorous postmortem
process. This may involve creating a postmortem
document template so that we know that everyone is following a fixed
template, including the same level of detail and information
information that the broader team needs to identify
gaps and critical remediation items. The second
step for a postmortem process is to typically have an in person review
once the postmortem document is complete. Getting the team together to
ask questions and discuss areas of improvements is a critical part of this process
as well. The attendees for this postmortem review are typically the people who responded to the incident, the incident runner, the incident manager, and any other key stakeholders such as the system owners, the on callers, or SREs who have knowledge about how to improve stability overall. All these
folks should be required, or at least encouraged to attend this review.
And then finally, it's really important to make sure that we have an SLA
to complete this postmortem process and any remediation items that we identified as part of it. This helps us as a team and as an organization ensure that we're holding a high bar for quality
and stability, and that none of these tasks are slipping through the
cracks.
So let's talk a little bit about what this postmortem document might look
like. I find that these are sort of five things that are really important to
cover. You may choose to add more. These are the five things that
I strongly recommend. First of all, talk about the impact,
what happened? For example, if we were suddenly charging advertisers too much for their ad insertions or ad impressions, we may need to process refunds,
so on and so forth. This also involves describing
the root cause that caused this impact. Next,
providing a detailed timeline of events, including when the incident began.
So when the symptoms first started, what time did we detect the incident, when was the incident mitigated, that is, when did we fix the symptoms, and then when did we actually resolve the root cause? Based on how long
it took to detect, mitigate and resolve our incident,
we may be able to identify action items to
reduce that time. For instance, if it took really long to detect an incident,
a very clear action item here is that we need better alerting.
So then the last part that follows is to write down our remediation
ideas to improve these metrics above and identify owners
for these remediation items.
So one thing I want to call out is that it's really important to keep
our postmortem process blameless. Focus on the what and the why, not the who. Humans make mistakes.
It's natural, it's expected. We learn from them,
systems need to handle them. For instance, if we're
seeing our system crash due to a null pointer exception,
the questions we might ask here are, why did that happen? Why was the change
not detected by unit tests, by integration tests, or through an incremental rollout process causing failures in canary first, and so on and so forth? Or if someone accidentally causes a corruption or deletion of a prod database table, why is such access available in the first place? Why were they able to make such
a change? Asking the
right questions is really going to help us get to the point,
get to the root cause,
the underlying root cause, and identify what we need to do to fix
it going forward. A really
helpful technique, also coming from Toyota, is this idea of the
five whys, so we can ask the question why five times,
give or take, to determine the root cause of a problem.
There's two important guidelines to follow here. Make sure you're never identifying
a person or team as a root cause.
Secondly, five is a guidance. It may be more or less.
So let's take an example. The what here is that users can't log into their accounts. So we ask the first why, and we determine that the API is rejecting the login request. Okay, that's clearly a problem, but not really our root cause. The second why tells us that the API can't talk to our authentication service.
Next, we say that it can't do this because it doesn't have the
right SSL cert. Next, why did this
happen? Because someone copied the wrong config to prod. So now we're getting
somewhere. But based on our guideline, we said we never want to identify
a person or a team as the root cause. So we're going to ask ourselves
why one more time. Here we come to the answer that there's no validation
for the API config. So clearly, as we
went through this process, we dove deeper and deeper, peeled back the layers,
and identified that our root cause here is that there's no validation.
So a very clear fix, very clear remediation item,
is to ensure that we have such validation in place so that no one accidentally
or maliciously can cause it to change in the future.
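As an illustration of that remediation item, here's a rough sketch of what pre-deploy validation for such an API config could look like. The field names and checks are assumptions for the example, not the actual config from the story.

```python
# Minimal sketch of the five-whys remediation: validate the API config
# (including its SSL cert) before it can be promoted to prod.
# Field names and checks are illustrative assumptions.

import ssl

def validate_api_config(config: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the config is safe to deploy."""
    errors = []
    for field in ("auth_service_url", "ssl_cert_path"):
        if field not in config:
            errors.append(f"missing required field: {field}")
    cert_path = config.get("ssl_cert_path")
    if cert_path:
        try:
            # Raises if the cert file is missing or unparsable.
            ssl.create_default_context().load_verify_locations(cafile=cert_path)
        except Exception as exc:
            errors.append(f"invalid SSL cert at {cert_path}: {exc}")
    return errors

errors = validate_api_config({"auth_service_url": "https://auth.internal"})
if errors:
    raise SystemExit(f"Refusing to deploy config: {errors}")
```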
All right, now I'm going to dive into some lessons that I've learned over time.
First off, it's really, really important to destigmatize
incidents within your team and within your organization. Going back to Toyota's Andon cord story, someone asked them about this process where any employee had the right to pull the cord, and Toyota took issue with that wording. They said that employees didn't just have the right to pull that cord. They were obligated to pull the cord if they
saw anything wrong. It was part of their responsibilities as an employee.
So it's really important to reward our incident runners
and our incident reporters, treating incidents as a learning opportunity. Celebrating our incident responders and reporters will help continue to encourage this good behavior.
We should empower our employees to speak up when they see some impact.
It's really important as team managers and as leaders to
publicly praise those employees who identified the impact,
declared the incidents, as well as the ones who actually spent time
resolving them. Oftentimes incidents happen late at night.
People stay up all night resolving these incidents. It's really important to recognize
that hard work. It's also really important to reward them.
Employees who spend these long hours could receive rewards. These may be in the form of shoutouts, improved visibility, perhaps dollar bonuses, so on and so forth. And also encourage them
to take time off to recover when they're up late at night.
My opinion is that false positives here are significantly better than false negatives.
Obviously, there's a trade off. You don't want to get to a point where
everyone's declaring an incident at the drop of a hat, which is why, as we talked about earlier, we need a really clear set of incident criteria. However, it is always better to file an incident prematurely, get immediate support, resolve the issue, and then later decide that the impact actually didn't end up being significant enough to warrant a full incident and
a full postmortem process. At that point, we can just
lower the priority, lower the severity, and call it a bug or something like that.
The alternative, though, is that employees are not sure whether the impact is big enough yet to declare an incident, so they hold off, and by the time they decide that it is, the impact might be really big and balloon sort of out of control, requiring a lot more people to jump in to support and taking just a lot longer till we actually resolve the issue, leading to a much higher impact than we would necessarily have had if the person had just filed an incident earlier.
Okay, the next part is specifically
for the incident manager. The guiding principle for
an incident manager to follow is to minimize risk and drive
resolution. A key part of this is being confident.
So let's take a few examples. Let's say we have
an event where an ad system is misbehaving and causing our ad
load, which we define as the percentage of ads users see
compared to non-ad posts, to be much higher than normal.
So users are used to seeing only two
out of ten posts being ads. Suddenly that number goes up to six out of
ten. This could also be causing
a drop in ad relevance, users suddenly seeing ads that are no longer relevant to
them. In this case, is it better to turn off ads entirely and take that revenue hit, or show irrelevant ads for a period and potentially
impact our user experience, maybe even causing them to not come back to our site?
There's no easy answer here and the team is going to struggle with this question.
So this is one example where it's really important for the incident
manager to step in and help the team confidently make a decision.
This doesn't mean you're solely responsible for that decision
or making that decision. It means you have to be the one to
find the right information, whether that is predocumented, or identify the right person to make that call, and just guide the
team through that whole process.
Another example, like we talked about before, there's many ways we may choose to resolve
an incident. We may need to pick between these three options.
Do we roll back a change? Do we fix forward, which would include all changes
since the last time we deployed, or do we hot fix by cherry picking a patch on top of our latest deploy?
Again, going back to our guiding principle, we said we need to minimize risks.
So factors like time of day, availability of other teams, et cetera, come into play, and the incident manager can help make this decision.
Another example is if the incident runner or any
of the incident responders are sort of panicked and not able to make progress, the incident manager can step in and help them collect themselves, calm down, and refocus. Alternatively, if we feel like the incident runner hasn't quite had the experience yet that they need to effectively run this incident, identify the next best person to do so, and encourage this person to stick around and learn from the experience so that they can be
an effective incident runner the next time.
One last example is for these late night and after hours incidents,
the incident manager can help make a call on this question,
like should we pause our investigation till business hours when other teams are
available, or should we go ahead and wake other people up right now?
This involves understanding the impact, the severity, and just being confident in making this decision, ensuring that the team also, in turn, feels confident with the decision that they've made.
So how do we do this? As an incident manager, it's really,
really important to remember that it's okay to ask for help.
Most of the time, you will likely not be the subject matter expert in the room. You might not ever have worked with the system that is encountering this incident. So rely on your incident runner
or the subject matter expert. These may sometimes be the same person, sometimes they
may not, but rely on them to help make this call, or rely
on them to at least give you the information that you need to make difficult
calls. The other thing is sometimes
when you're the incident manager, there may be several incidents ongoing at the same
time. So loop in other folks to help, loop in other folks on the rotation. It's completely okay to ask for help, because the alternative is that you're just juggling too many balls. You don't know what's happening in each incident response process, things sort of slip through, and we may be in a situation where we're taking much longer to resolve the incident than is good or than could have been avoided.
So in order to ask for help, we need to know who to ask
for help. So make a list of these key contacts who can help make
these hard decisions. If we decide that we want to turn off all ads for
our site, typically you might not be the one who
is responsible or has the power to make that decision. So loop in
the folks who can. It's really important to document those so that it's easy
to pull them in when you need. The second one is
who to reach out to for external support. In today's world, a lot of
services are running on hosted infrastructure from partners like Amazon, Google, et cetera, or, for instance, relying on third party services like PagerDuty for alerting, for other communications, and so on and so forth.
If we rely on some of these services for critical parts of our
own applications, who do we reach out to for external support?
You don't want to be in a situation where one of these third party services
is down and you're scrambling at that point to figure out who to reach out
to. So it's really important to make this list ahead
of time and continue to update it as and when you find gaps.
The other lesson I learned is that it's really important to measure. Our goal
as part of an incident management process is ultimately to reduce downtime for
our services. However, how do we do this if we're
not measuring this in the first place? So what are some things that we can
measure? There's three metrics that I found to be very useful.
First off, MTTR or mean time to recovery. How long
did it take us from when the incident began till we resolved the incident?
That's a very easy one to measure.
Typically, you will document this as part of your
post mortem document and you can continue to monitor this over time.
If there are certain systems that are seeing longer time to recovery than
others, you can focus some efforts to improve the stability of those systems.
The second one is mean time between failures. Again, if a particular
system is failing very frequently, we know that as an organization or as a
team, we should focus our efforts there.
And then lastly, we talked about this in one of our examples, but mean time to detection. If we notice time after time that
we're not aware that there's an incident till much later,
that means that there's a gap in our alerting systems, our monitoring systems.
So if this metric is high, then we can clearly identify one
easy action item is to improve our alerts.
Only once we start measuring these can we identify these action items,
iterate, improve them and measure again.
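Here's a minimal sketch of computing these three metrics from postmortem timelines, interpreting the second metric as mean time between failures; the incident records are made-up examples.

```python
# Minimal sketch: compute MTTD, MTTR, and MTBF from postmortem timelines.
# The incident records below are made-up examples.

from datetime import datetime, timedelta

incidents = [
    # (began, detected, resolved)
    (datetime(2024, 1, 3, 2, 0), datetime(2024, 1, 3, 2, 40), datetime(2024, 1, 3, 5, 0)),
    (datetime(2024, 2, 10, 14, 0), datetime(2024, 2, 10, 14, 5), datetime(2024, 2, 10, 15, 0)),
]

def mean(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)

mttd = mean([detected - began for began, detected, _ in incidents])   # time to detection
mttr = mean([resolved - began for began, _, resolved in incidents])   # time to recovery
mtbf = mean([b2 - b1 for (b1, _, _), (b2, _, _) in zip(incidents, incidents[1:])])  # between failures

print(f"MTTD: {mttd}, MTTR: {mttr}, MTBF: {mtbf}")
```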
This will also really help motivate the team because we can see these metrics improve
over time, giving them the faith that this incident management process, this incident response process, is leading to some improvements.
And then finally, it's really helpful to use error budgets
when you need to trade off innovation work versus KTLO, or keep the lights on, work. Error budgets can be thought of as the maximum amount of time that a system can fail without violating its SLA. So as an example here, if we say that our SLA is three nines, 99.9%, our error budget works out to a little under 9 hours per year. If you're monitoring this on a team level or a service level, we can very easily say that once the system is out of its error budget, so once it's breaking its SLA, we clearly need to deprioritize some of our product innovation work and focus
more on system stability so that we can continue to offer our
services with high availability and a high quality to our
users and customers.
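For reference, here is the arithmetic behind that example, assuming the budget is measured over a year, since 0.1% of 8,760 hours comes out to roughly the "little under 9 hours" mentioned above.

```python
# Error budget arithmetic: a 99.9% availability SLA over a year
# leaves about 8.76 hours of allowed downtime.

HOURS_PER_YEAR = 365 * 24  # 8760

def error_budget_hours(sla: float, period_hours: float = HOURS_PER_YEAR) -> float:
    return (1 - sla) * period_hours

print(error_budget_hours(0.999))  # ~8.76 hours per year
```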
So to wrap up, we answered four questions here
today. Firstly, we established a clear definition
for incidents and how they're different from bugs. We went over how to declare an incident and defined a process to respond to an incident, to minimize the impact to our users, communicate with our stakeholders, and get the support that we need.
Then we talked about a postmortem process that we can use to reflect on
what went wrong and extract learnings to continue improving our systems
in the future. And then lastly, we discussed ways to
measure and improve on our incident handling, as well as ways to
reward and motivate our teams to take this work seriously and continue to
keep our systems stable.
And that's all. Thank you so much for having me here today.
I hope you found the session helpful and if you have any questions or if
you'd like to chat, you can find me at my email address.
I wish you all the very best with your own incident management programs,
and I hope you enjoy the rest of the conference.