Conf42 Incident Management 2024 - Online

- premiere 5PM GMT

Network Incidents: What to Know Before the Chaos Strikes!

Abstract

I’d like to share some insights on how to better prepare for and manage incidents arising from network infrastructure issues. Network-related disruptions often trigger a cascade of service outages, making them challenging to troubleshoot and resolve. These situations can be confusing and time-consuming, especially when you lose access to critical monitoring and diagnostic tools, or even to all services simultaneously. While there’s no one-size-fits-all solution, there are several key strategies that can help minimize the impact of such incidents, both in preparation and during the event itself.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to Conf42. I am Maksim Kupriianov, and my talk today is "Network Incidents: What to Know Before the Chaos Strikes."

A few words about myself. I have worked as a Site Reliability Engineer and Production Engineer, and before that as a System Administrator, for many years. Over that time I have seen a lot of incidents, more than I would like, and the ones rooted in network services were always the most devastating.

Let's dive into what a network incident actually is and why it is so devastating. A network incident is an unexpected event that disrupts or degrades critical network services, impacting their availability, integrity, or performance. That sounds very similar to any other incident, but the key phrase is "critical network services." What are they? They are infrastructure services: DNS, DHCP, routing, VPN. They are security services, including firewalls, intrusion detection and prevention, certificate authorities, and physical access control. They are directory services, without which you can't authenticate or authorize your users. They are cloud and container orchestration, and the web-serving layer that delivers your traffic: CDNs, load balancers, the web server architecture. This is a very sensitive layer of software for any organization.

Why are network incidents affecting these critical services so disruptive? First of all, they almost always lead to cascading failures. For example, if a DNS server fails, it is very likely that many services that depend on DNS will start to malfunction. By the way, never hard-depend on DNS if you can avoid it; unfortunately, a lot of services still do (a small sketch of one way to soften such a dependency follows at the end of this section). Then you lose access to your monitoring and diagnostic tools, usually very quickly. The downtime affects multiple stakeholders: people come to you, or to your network or site reliability engineers, asking "What's going on? We can't work. Nothing is working. Do something." Triage is difficult: modern networks are very complicated, especially big ones, and it takes a lot of time to find the root cause, triage it, and then fix it. Troubleshooting is time-consuming, and time is money; if your business is not working, you are losing money. There is also increased security risk, for two reasons: a firewall or another security component can itself be malfunctioning, or you can make a temporary exemption while fixing the issue, which can let attackers in to do real damage. And of course, business operations are reduced. That is essentially what a network incident means.
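As an illustration of the "don't hard-depend on DNS" aside above, here is a minimal sketch (not from the talk) of one common softening technique: remembering the last address that resolved successfully, so a resolver outage does not immediately take a client down. The hostname is a placeholder, and a real implementation would also have to respect TTLs and cache invalidation; treat this as a sketch of the idea, not a recipe.

```python
"""Minimal sketch: soften a hard DNS dependency by remembering the last
address that worked, so a resolver outage does not immediately break the
client. The hostname below is a placeholder."""
import socket

_last_good_addr = {}  # hostname -> last successfully resolved IP


def resolve_with_fallback(hostname, port=443):
    """Resolve hostname, falling back to the cached address if DNS fails."""
    try:
        # Take the first address returned by the resolver.
        addr = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)[0][4][0]
        _last_good_addr[hostname] = addr      # refresh the cache on success
        return addr
    except socket.gaierror:
        cached = _last_good_addr.get(hostname)
        if cached is not None:
            return cached                     # DNS is down: reuse what worked before
        raise                                 # never resolved and no cache: give up


if __name__ == "__main__":
    print(resolve_with_fallback("example.com"))
```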
Let's look at a few examples of network incidents that happened recently at big tech companies, mostly around 2020-2022. There are no 2023 or 2024 entries simply because I could not find good public write-ups for them; some incidents certainly happened anyway.

Briefly: the Cloudflare outage in July 2022 lasted about 20 minutes; a router misconfiguration caused the router to malfunction and overload one of the sites. Then a Google Cloud outage: a Google Accounts authentication failure meant that services like Gmail, YouTube, and Google Drive were down for about an hour. The Microsoft Azure and Teams outage: Teams was not working for about three hours, again because of an Active Directory failure, so authentication was the problem. The Fastly CDN outage, about one hour: a bug triggered by a customer configuration. The Facebook outage lasted six hours and happened because the software did not correctly handle certain conditions in the network; simply an error in development. And the AWS outage, six hours and thirty minutes, was again essentially a software issue.

What can we take from this? In the IT world, network incidents are as certain as sunrise, so if you have never had one, you probably will at some point.

Now, the anatomy of a network incident. Let's dive into how things usually start and develop. The initial situation: you have a healthy, green system. The network is fine, the systems are fine, everything is working perfectly. Then a trigger arrives.

Hardware failure: it could be the power supply of your router, a switch, a line card, whatever. It can be trickier; from my own experience, a malfunctioning CPU that starts corrupting data. Things can go wrong immediately, or the failure can take some time to develop.

Software issues come in two flavors. The first is a plain software bug, which happens quite frequently. The other is an unsupported scenario, which is very common nowadays: developers are not gods, they don't know everything, and systems can end up in situations they simply cannot handle.

Human error is usually a misconfiguration story: an untested configuration or just a wrong one, a typo, anything.

Security breaches are usually an external factor: someone trying to break your perimeter or to mount a denial of service against your service.

External factors are usually the simplest to triage. It could be a flood, or a fiber cut. Again, from my experience I have seen quite crazy things: an animal somehow got into a transformer, caused a short circuit, which caused a fire, which caused a power outage. That can happen too; unfortunately, there is not much you can do about pure external factors.

And my favorite: market-related events. I worked for a marketplace for several years, and Black Fridays were always very hot for us. When Black Friday starts, a lot of people come to your services and start making transactions, and how close you get to overload depends a lot on how professional your capacity planning was beforehand. Sometimes it happens at random, unpredictable moments; the market is chaotic.
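Since the human-error trigger above is so often an untested or mistyped configuration, here is a minimal, hypothetical sketch (not from the talk) of the kind of pre-apply check that catches the cheapest mistakes before a change reaches a device. The required fields and limits are invented for the example; real checks would mirror your own configuration schema.

```python
"""Minimal sketch: catch obvious configuration mistakes before they are
pushed to a device. The required fields and limits are made up for the
example; real checks should mirror your own config schema."""
import json
import sys

REQUIRED_FIELDS = {"hostname", "loopback_ip", "bgp_asn"}


def validate(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    missing = REQUIRED_FIELDS - set(config)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    asn = config.get("bgp_asn")
    if asn is not None and not (1 <= asn <= 4294967295):
        problems.append(f"bgp_asn {asn} is out of range")
    if config.get("mtu", 1500) < 1280:
        problems.append("mtu below 1280 will break IPv6")
    return problems


if __name__ == "__main__":
    cfg = json.load(open(sys.argv[1]))      # e.g. a proposed router config in JSON
    issues = validate(cfg)
    for issue in issues:
        print("CONFIG ERROR:", issue)
    sys.exit(1 if issues else 0)
```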
After the initial trigger, the effects can sometimes take a while to develop, but usually they start right away. What do you see? The service outage itself, obviously: a file server stops working, a database, a communication system, customer-facing applications and websites start to fail. At the same time, your monitoring and other tools send out a flood of alerts; multiple systems and services detect failures, and the team can be overwhelmed by them. Or it can be even worse: you receive no alerts at all, because the monitoring system is completely unreachable or malfunctioning. User complaints arrive, of course, from employees and customers, and social media and other channels start to join the party.

What's next? Unfortunately, network incidents very rarely remain isolated. Interdependent systems start to fail: if an authentication service goes down, all the services that depend on user authentication and authorization can fail right after it. Applications start to become slow, or fail, or become unresponsive. And the worst thing, which happens frequently, is that they start to retry: if an operation fails, the application retries it, and the application in front of it retries as well, and together they create such a big wave of requests that they usually kill the underlying services completely (a small sketch of how to keep retries under control follows after this section).

Traffic bottlenecks are the network-specific effect: if the incident happened because of a route misconfiguration or a fiber cut, the remaining links can become overloaded, and every service using those links starts failing. Resource exhaustion is the retry story again: every device has its hardware limits, and if a server suddenly receives a million requests per minute instead of the usual hundred, it will not handle it; it will fail.

Then you lose visibility and control, and that is a very difficult place to be. Your monitoring tools become inaccessible, which happens very frequently, and you simply don't know what is going on in your network and your services. You can lose remote access entirely, because of authentication, because of a physical failure, anything; it just isn't there anymore. And there is communication breakdown: email or messaging applications can be down, so it is difficult to coordinate and communicate. Teams don't know what is going on, and individual people can't tell what is happening or what to do.

An incident may also expose the network to additional threats, not only because of the incident itself but because of how you try to fix it: holes get punched in firewalls, or a firewall is disabled, or simplified, less secure workarounds are put in place. They can help, but sometimes attackers are waiting for exactly this to happen. So be aware: lowering your security level during an incident is usually a very bad practice.

And of course, the team faces multiple obstacles in terms of stress and pressure. People get very nervous when nothing is working, you feel pressure from your stakeholders and your team members, and you have prioritization dilemmas to solve: what do you fix first, this or that? You never quite know, and you need someone to consult. It is difficult.
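Here is that sketch: a minimal, hedged illustration of retries with exponential backoff, jitter, and a hard attempt cap, so a struggling dependency is not flattened by synchronized retry waves. The wrapped operation and the URL are just stand-ins for any network call.

```python
"""Minimal sketch: retries with exponential backoff, jitter and a hard cap,
so a failing dependency is not flattened by a synchronized retry storm.
The called operation is a stand-in for any network call."""
import random
import time


def call_with_backoff(operation, max_attempts=4, base_delay=0.2, max_delay=5.0):
    """Run operation(); on failure wait up to base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError:
            if attempt == max_attempts - 1:
                raise                                # budget exhausted: give up loudly
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))     # full jitter de-synchronizes clients


if __name__ == "__main__":
    import urllib.request
    # Hypothetical call; any flaky network operation fits here.
    body = call_with_backoff(
        lambda: urllib.request.urlopen("https://example.com", timeout=2).read()
    )
    print(len(body), "bytes")
```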
Communication with stakeholders also gets tough: you are busy fixing the service, you don't have time to talk to anyone else, and the stakeholders just want the service back. They try to reach you, you are not answering, and all of this piles up.

What are the key takeaways from this anatomy? Network incidents have multiple triggers, so almost anything can start one; sometimes it is something you could never have imagined, like my animal in the transformer. How did it even get in there? I don't know. Interconnectedness magnifies the impact: network services are critical because so many other services depend on them, so failures cascade in a chain and things can go pretty bad. And loss of visibility complicates the response: if you can't see what is happening, very often you can't even tell whether the network is at fault or something else is.

Now, preparation. How can you be prepared for a network incident, and what key steps should you take?

First, you need network redundancy, and by that I don't just mean multiple network paths or multiple routers. I mean geographical redundancy as well, because with external-factor events it can happen that one particular area goes dark and you need to be present somewhere else to continue operations. This has to exist before the incident strikes.

An out-of-band management network is crucial. If the network itself is the issue, you lose access to your services and to your devices, and the only remaining way to do something about them is to drive to a data center with a console cable and fix things from there. If you think about this in advance and build an out-of-band management network, things go much better: you connect from the other side, make the changes, and fix the issue.

Data backups, of course. Network incidents can unfortunately lead to data loss, and backups are crucial for that. Just a reminder: backups should not live in the same location as the systems they protect.

Failover systems are fairly obvious: if you know that any server can fail, you need something that can take over.

A very important step is to conduct regular system audits and health checks. By that I don't just mean a PDF with green check marks saying everything is fine; it should be a continuous activity covering network assessments, hardware maintenance, and capacity planning. You should work on this all the time.

And a subset of your monitoring tools and dashboards absolutely must be accessible out of band. Imagine your network and services are down: how do you understand what is actually going on inside? You need some way to monitor this from outside. You can rent an external service and export part of your data there; usually it is nothing secret, just general health metrics, but you need to have them somewhere reachable.
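As a minimal illustration of that last point, here is a sketch (not from the talk) of a tiny exporter that periodically pushes a few coarse, non-sensitive health metrics to an external collector. The collector URL is a placeholder, a Unix-like host is assumed, and a real setup would add authentication and more signals.

```python
"""Minimal sketch: periodically push a few coarse health metrics to an
external collector that lives outside your own network, so some visibility
survives an internal outage. The endpoint URL is a placeholder."""
import json
import os
import time
import urllib.request

EXTERNAL_COLLECTOR = "https://status-collector.example.net/ingest"   # hypothetical


def collect():
    """A few cheap, non-sensitive signals; extend with whatever you can share."""
    return {
        "ts": int(time.time()),
        "host": os.uname().nodename,
        "load1": os.getloadavg()[0],
    }


def push(sample):
    req = urllib.request.Request(
        EXTERNAL_COLLECTOR,
        data=json.dumps(sample).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5).read()


if __name__ == "__main__":
    while True:
        try:
            push(collect())
        except OSError as exc:            # never let the exporter itself crash
            print("push failed:", exc)
        time.sleep(60)
```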
Another very important thing is to develop comprehensive documentation and playbooks. At every company I have worked with, keeping documentation comprehensive was painful, because systems are developed and changed continuously and the best documentation you have is in your code. But for network-related incidents you need at least playbooks, and those playbooks and your network documentation must be available remotely: out of band, on an external backup, whatever, because you will not have access to them when the systems fail. Keep a copy somewhere else and regularly check that you can actually reach it, because crazy things happen; people simply forget the alternative password, or forget where they stored it.

Finally, drills. You need the equivalent of fire drills for your network, and they should be done on a regular basis: drills and simulations. You need to be able to take down systems, or components of them, and see how things go. For instance, what happens if a DNS server goes down? Just turn it off in one location and watch, if you have the error budget for it on your systems (a small probe script for exactly this kind of drill follows after this section). Ensure that your team is prepared, and evaluate and update the response plans afterwards. This may look like a lot of work, but if you want to save time, money, nerves, and your mental health during network incidents, doing it means you will actually be prepared for them.

Let's recap the preparation part: preparation is an ongoing process, always have a backup plan, and run regular drills. With all of that, the benefits outweigh the costs: you will spend a lot of time and money on it, but you are much better off doing it than not.
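Here is that probe sketch: a minimal script you might run during such a drill, while a component (say, one DNS server) is deliberately switched off, to record which critical endpoints still answer. The hostnames and ports are placeholders for your own services.

```python
"""Minimal sketch for a drill: while a component (for example, one DNS
server) is deliberately switched off, probe a list of critical endpoints
and record which of them still answer. Hostnames and ports are placeholders."""
import socket

CRITICAL_ENDPOINTS = [
    ("intranet.example.com", 443),
    ("db.example.com", 5432),
    ("ldap.example.com", 636),
]


def probe(host, port, timeout=3.0):
    """Return 'OK' if a TCP connection succeeds, otherwise the failure reason."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "OK"
    except OSError as exc:          # covers DNS failure, refusal, timeout, ...
        return f"FAIL ({exc})"


if __name__ == "__main__":
    for host, port in CRITICAL_ENDPOINTS:
        print(f"{host}:{port:<6} {probe(host, port)}")
```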
Now let me give you some tips on how to behave during the incident itself, so you can fix it sooner and suffer fewer penalties from it.

The first recommendation is to stay calm, and it is probably the most important one. If you slip into panic, you will be less effective, the incident will take much longer, and it will be harder to fix. Stay calm.

The second step is to gather initial information. There are logs and other data from your systems, and for that you need your out-of-band access or backups; sometimes the systems are still accessible and available, which is good, it just depends on how lucky you are. There are also user reports; sometimes users find the problem before any monitoring system does. And you should determine the scope of the impact: which systems are affected, and how many of them.

Then establish incident command. This is again a very important step. You need to assign someone as the incident manager. It does not mean they are the boss or the captain of everything, but they are the point of contact for everyone else. While part of your team is deep in debugging, triaging, and fixing things, this person maintains the current state of the incident, communicates with other teams and with the stakeholders, and synchronizes the team's activity. It is very important, and it is better prepared beforehand; as I said about training and drills, work out in advance who will take the incident manager role.

Then, a secure communication channel. Your team needs an effective way to communicate during incidents. It can easily happen that the company-wide messenger is broken by the network incident itself, which is especially true for big companies that run their own communication solutions. You should have an alternative, and everybody should know how to use it and how to get into it. A lot of incidents went badly because people simply were not aware that there was a backup IRC channel or an alternative messenger, or did not know how to reach other people; sometimes they could not even find a colleague's phone number. Please sort this out beforehand.

Inform stakeholders, and here honesty is the key. Use very simple, non-technical words to describe, very briefly, what is going on and what you currently think about it, but they should know. If they don't get the information, things can get much worse for you after the incident, so be communicative with them.

Then you have to prioritize what to fix first. This is also very important: identify the critical services, and identify which services you can sacrifice, maybe not completely, but at least put them in the second tier. Determine the root cause as well; everything matters here. And develop an action plan and follow it. These rules sound very simple and are usually very hard to follow, but they are important: develop the action plan and follow it all the time, because if different people on the team start doing different things outside the plan, the situation gets even worse.

Communication: everyone affected by the incident, internally and sometimes externally, including external clients, should know what is going on. They don't need technical details; usually what they want to know is when you will fix it, and whether they can still run something even while these things are failing. So establish a communication protocol and give regular updates with the current status; that is what the incident manager is for. For very big companies and very big incidents there is sometimes a separate "journalist" role, but usually the incident manager handles this fine. Inform all affected parties, and maintain open lines within the team. First of all, share your findings somewhere, an internal chat, anything. Second, don't be afraid to ask for help; don't sit inside the problem alone. Share it with people and let them help you. You never know where help will come from, but there are a lot of smart people around with their own experience, and they may have exactly the idea you need, or simply lend a hand. And stay positive. Yes, that is hard, I know, but if you are not calm and not positive, you will be less effective and it will take longer to fix the issue.

We also have to talk about common pitfalls that can happen during an incident. Number one is rushing in without a plan; it just adds more chaos to the situation. The second is poor communication, either within your team or with the stakeholders. Poor communication within the team adds stress and makes you less effective.
Poor communication with the stakeholders, not giving them information about the current state, will probably leave them disappointed in you as a service provider or as an employee. Number three is ignoring protocols and procedures. Please don't do that: many protocols and procedures were introduced not for fun but for security or legal reasons, and bypassing them can make things even worse. The fourth is overlooking the root cause: if you miss it, you only treat the symptoms and the situation can reappear very soon, so look deeper. And the final, also very important one, is neglecting team well-being. If the incident is long, lasting hours or sometimes days, the team gets overstressed and prone to mistakes, so if you can introduce some kind of rotation within the team, it can help; think about that.

Finally, let's recap. During the incident, stay calm and positive; as I said, it is the most important rule. Second, effective communication is critical: within the team, with the stakeholders, and of course with the clients. Teamwork matters: don't be afraid to ask for help, always share your findings, and work together; you will solve the incident faster and more efficiently. Prioritize actions: approach the problem step by step, take the most critical step, finish it, then the next one, and so on. And avoid the pitfalls from the previous slide.

When the incident is over, there are still several things to do. First, analyze what went wrong: you need a post-incident review procedure and a root cause analysis. You need to clearly understand why the incident happened, what was done well during it, and what could be improved, those three things. Then create a feedback loop: update all the runbooks of the systems that were involved in the incident, because they may be outdated, wrong, or simply missing, and if you don't do this, you are waiting for history to repeat itself. And finally, every network incident is at the very least a lot of trouble, and it can be very expensive, but it is an invaluable source of learning. Use it to strengthen your systems and your infrastructure, and to train your team. Stay calm and be positive about it: it already happened to you, you already planned for it, and you passed through the incident.

I hope network incidents never happen to your company, but if they do, you are now ready for them. Thank you for your attention. This was Conf42. If you have any questions, you are welcome.

Maksim Kupriianov

Production Engineer @ Meta

Maksim Kupriianov's LinkedIn account


