Conf42 Incident Management 2024 - Online

- premiere 5PM GMT

Empowering SRE teams and incident management with AI

Abstract

In today’s fast-changing tech world, GenAI helps improve incident management by making it easier to detect, respond to, and resolve problems quickly. This session focuses on how GenAI empowers SRE teams to identify issues, automate tasks, and reduce downtime, leading to faster incident resolution.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, and thank you for joining today. I'm Spiros Economakis, Director of Product Operations at Mattermost, and I lead the Infrastructure and Customer Reliability Engineering teams. As you can imagine, there's a lot of incident management involved. Today we're going to explore how AI can help SRE teams manage incidents better and a bit more efficiently. As we know, every minute of downtime matters, because it can lead to SLA breaches, which directly affect our customers. So AI can potentially help us take some control of the chaos and improve the way we respond to incidents.

If you've ever been on call, you know it's tough. A call in the middle of the night disrupts your sleep, messes with your focus, and adds a lot of stress. You are under pressure to solve multiple problems at once, and there is a lot of multitasking. And these incidents don't just stress us out. They also slow down innovation, because there's constant firefighting rather than building.

So let's look at high-severity incidents and why they are such high-pressure situations. As we discussed, everything happens at once. You need to find the root cause, update internal teams, and provide customer updates, all while checking dashboards, logs, metrics, and the issues coming in from the support team, and while customer success is complaining. It can be really overwhelming. But what if AI could help us manage this complexity a bit better and empower us to do a few things faster?

Before diving into how AI can help, let's understand what AI can really do, because it's important to understand how exactly it's going to empower the teams. As you can see in this slide, AI can work across different types of data: text, image, audio, and video. By connecting all of these, it gives us a better, bigger picture.
So in practice, AI doesn't replace our decision making. We still decide, in the end, what the next action will be. Instead, it can empower us to make faster, better decisions, with more informed insights based on the data we provide. So let's see how this works in a real incident.

Imagine this. It's 2 a.m. You're in perfect REM sleep and you get paged with a high CPU alert for component X. You wake up to the scary alarm sound from PagerDuty or Opsgenie or whatever tool you are using. I believe everyone here has gone through this, wondering: Oh my God, is this real? What's going to break? You're sleepy, you go to your desk, sit down, turn on the laptop, and start checking what's going on. If you are lucky, you already have a well-defined process, and that is your incident response process. Based on it, you need to start taking some actions.

In our case, we use the Mattermost Playbooks feature to make sure everything is organized right from the start. We run the playbook, and this sets up an incident channel automatically so everyone can communicate in one centralized place. Communication is key. In this kind of situation it's crucial, because otherwise things slip through the cracks, which leads to miscommunication and delayed fixes. With everything organized in a checklist and a well-defined process, you can move on to the next step of the playbook, the incident response.

The number one priority, as you can understand, is to identify the severity, so we can communicate the impact of the incident and what is broken internally to teams like support or customer success, and eventually to our customers. AI can process multiple kinds of data, and this is where AI and Mattermost Copilot can help us as part of our incident management process.
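As a rough illustration of the triage step described above, the sketch below bundles incident signals into a single prompt and validates the severity label that comes back. It does not show how Mattermost Copilot works internally. The `Signal` type, the `SEV1` to `SEV4` scale, and the function names are all hypothetical, and the model call itself is deliberately left out.

```python
import re
from dataclasses import dataclass

# Hypothetical severity scale; substitute whatever your organization defines.
SEVERITIES = ("SEV1", "SEV2", "SEV3", "SEV4")


@dataclass
class Signal:
    source: str   # e.g. "alert", "dashboard", "support" (illustrative values)
    content: str  # free-text description of what was observed


def build_triage_prompt(signals):
    """Combine all collected signals into one prompt for an AI assistant."""
    lines = [
        "You are assisting an on-call engineer.",
        "Classify the incident as one of: " + ", ".join(SEVERITIES) + ".",
        "Reply with the label first, then a short summary.",
        "",
        "Signals:",
    ]
    lines += [f"- [{s.source}] {s.content}" for s in signals]
    return "\n".join(lines)


def parse_severity(reply):
    """Return the first recognized label, or None so that a human decides."""
    match = re.search(r"\bSEV[1-4]\b", reply.upper())
    return match.group(0) if match else None
```

Returning `None` for an unrecognized reply keeps the final call with the engineer, in line with the point that AI supports the decision and does not make it.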
In this example, you can see that Copilot analyzed the different kinds of data we provided, like graphs and data from our observability tool, specifically dashboard screenshots, and classified this incident as severity 2 due to the severe performance issues with the database. It also provided a detailed summary identifying the nature of the problem. This triage process allows us to communicate clearly with our internal teams.

Now that we have identified the severity, we can use the same context to communicate clearly what's happening to our internal teams. Instead of manually crafting these updates, AI can generate them quickly. Since we use the Mattermost Playbooks feature and we have the generated summary, with a few clicks we can post a status update internally in a channel and ensure everyone is aligned, from engineering to customer success.

It's just as important to keep our customers in the loop. With AI, we can quickly generate customer-facing updates. As you can see in the video, you can even provide the templates you want to use with your customers, and AI can use them to generate specific, clear status updates. In the past, writing both internal and external updates would take a lot of time and leave room for errors. Now AI handles this, giving us time to focus more on solving the issue rather than just managing it.

With communication taken care of, it's time to dive deeper into the investigation. Everyone is informed and knows exactly what's happening. Now we need to gather more information, so we dig deeper into our observability tools and logs, and we get more information from our support team and our customer success team about what doesn't work.
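To make the template idea above a little more concrete, here is a minimal sketch of filling a customer-facing status template. In the talk the wording is generated by AI; this only shows the template-filling half, using Python's standard `string.Template`. The template text and field names are made up for illustration, and in practice the field values would come from the reviewed incident summary.

```python
from string import Template

# Hypothetical customer-facing template; wording and fields are illustrative.
CUSTOMER_TEMPLATE = Template(
    "[$severity] $title\n"
    "Impact: $impact\n"
    "Status: $status\n"
    "Next update: $next_update"
)


def render_status_update(template, fields):
    """Fill the template with incident fields.

    Template.substitute raises KeyError when a placeholder has no value,
    so an incomplete update is never produced silently.
    """
    return template.substitute(fields)
```

Failing loudly on a missing field is a deliberate choice here: a half-filled update sent to customers during an incident is worse than a short delay.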
When we provide all this information to AI (you can see here that we provide logs and some information from support and customer success), it can do a contextual analysis. With the extra input, AI can pull in more information, combine it with the context we already gathered, and help us clarify what's happening. By bringing in logs, tickets, and input from team members, AI can help us highlight patterns or issues we might have missed during the initial response, because we were in a rush, we were under stress, we were sleepy. We had just woken up from our REM sleep. This makes the investigation process a bit faster and a bit more efficient, because AI is constantly adding more pieces to the puzzle and helping us see the full picture.

Now let's assume the issue has been resolved. What comes next? We need to generate a post-mortem. We need details about what happened: what was the evidence, what were the events, and what are the next action items, based on the whole thread we have. In this case we have a simple example we're just going through. But imagine multiple people talking and discussing. You can summarize all of that in one generated timeline with the events, the evidence, and how the issue was resolved. This used to be a super manual process. Now you have a quick summary as a starting point, which you can start editing and open up for collaboration with the team. This saves hours of manual work and helps ensure that we learn from the incident and prevent it from happening again.

Of course, AI isn't perfect. As we discussed, collaboration between the different people involved, with human context and input, is important here. Otherwise, the post-mortem is just a summary generated by AI, when in practice many more things happened.
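The timeline part of the post-mortem described above can be sketched very simply. The snippet below only orders the messages from an incident thread chronologically and formats them as a draft timeline. The summarization itself would still be done by an AI assistant and then reviewed by the team. The `Message` type and `build_timeline` function are hypothetical names, not part of any Mattermost API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Message:
    timestamp: datetime  # when the message was posted in the incident channel
    author: str
    text: str


def build_timeline(messages):
    """Sort thread messages chronologically and format a draft timeline."""
    ordered = sorted(messages, key=lambda m: m.timestamp)
    return "\n".join(
        f"{m.timestamp:%H:%M} {m.author}: {m.text}" for m in ordered
    )
```

A draft like this is a starting point for editing, which matches the speaker's caveat that human context is still needed to complete the picture.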
The generated post-mortem is just a good starting point for collaborating as a team. AI isn't perfect, and we know that. AI can make mistakes. So this is not an artificial SRE or someone who is going to do all the work for you. What AI can do to empower us is automate some repetitive tasks during stressful incidents. We discussed that being on call is very stressful because of the multitasking: multiple updates, multiple things to be done. You are lucky if you have a model with different roles, like an incident commander and someone who handles communication. Most of the time there's no such thing. So AI can give us a solid base and a starting point for automating some of the repetitive tasks that drag us away from resolving the incident and keep us just managing it. It's up to us as engineers to refine its suggestions and make the final decisions. AI can give us key insights so we can focus on fixing the problems. And over time AI improves, as we see with all the breakthroughs and innovation happening in this space, so it is becoming a more valuable tool for speeding up incident management.

Now let's see the key takeaways from today's talk. AI helps us understand problems faster by giving us key insights right away. You can combine different kinds of data, get a summary of it, understand what's going on, and add more context. AI can also reduce some of the multitasking. Before, you had to write all the internal updates yourself, and you were lucky if you had a template. You wrote all the external updates yourself, and you were lucky if you had another template there, and you needed another tool there. There was a lot of multitasking, which has been significantly reduced with the use of AI.
You spend less time managing: sharing updates, discussing with teams, sharing updates with the customer, analyzing some of the data, or adding more input to the context. So you spend less time managing the incident and more time fixing it. And you also have the post-mortems, which help you and your team move on to the next actions and improve your reliability.

Thanks, everyone. I hope this session gave you enough insight into how AI can support and empower SRE teams during incident management. If you are interested in diving deeper into the details and how AI can be applied to incident management in your organization, feel free to reach out. You can DM me, exchange contact info, or tag me on social media. I'd love to keep this conversation going, share insights, and hear about your experiences too.

Spiros Economakis

Director of Product Operations @ Mattermost
