Conf42 Prompt Engineering 2024 - Online

- premiere 5PM GMT

Predictive Network Maintenance: How AI Forecasts System Failures

Abstract

Imagine a world where network outages are nearly obsolete, and system failures are predicted before they happen. In my talk, Predictive Network Maintenance: How AI Forecasts System Failures, I’ll show how AI predicts issues and suggests preventive actions, saving time and costs.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to my talk on predictive network maintenance. Imagine a world where network issues almost never happen, where we can actually prevent network failures before they even have a chance to disrupt operations. This is the promise of predictive network maintenance. It's a revolutionary way of using AI to identify and prevent potential problems in our networks, allowing for smoother operations and fewer interruptions. I am Akshat Kapoor, and I will walk you through how AI-driven predictive maintenance works, why it's important, and how it's being applied across different industries to create more reliable networks.
Since the inception of the internet, network reliability has always been a concern. Various protocols and technologies were developed to counter this issue. Most of them have been based on reactive approaches, which require a failure to happen before it can be mitigated. But as networks grew in complexity, the focus shifted from merely reacting to failures to proactively predicting them and avoiding potential disruptions to the network. Predictive network maintenance is the next evolution in this journey, aiming to reroute or repair systems in anticipation of a failure, a major shift from traditional reactive approaches.
Now, what is a network failure? A network failure is a complete or partial failure of the hardware or software components that comprise a network, which causes service disruption for the end users. There are various causes for it. One study found that the most prominent cause of network failures was link failures, which could be a fiber cut or a configuration error on a link, and these accounted for about 32 percent of the failures reported in the network. This is followed by operating system failures, such as software bugs.
Operating system failures can also come from system updates that cause glitches, and together these make up about 18 percent of failures. Next are human errors, also about 18 percent, caused by people misconfiguring the network. These are followed by hardware failures, due to components reaching end of life, memory failures, card failures and the like, or the overuse of those components, and, in the modern world, even cyber attacks.
The cost of these network failures is very high. The average cost of network downtime, as per Gartner, is about $5,600 per minute. That's a lot of money. Globally, for telecom operators, the annual cost can exceed $60 billion. This is not a small cost to pay.
So what are the problems with current network maintenance? It has been a topic since the early days of the internet, and the paradigm has not changed much. First, a failure or a path disruption must be detected, followed by traffic rerouting to an alternate path, which is either pre-programmed or computed on the fly. So it's largely reactive, right? Companies only address problems once they have already happened and been identified. But this approach has serious drawbacks. It means unexpected downtime and very high operational costs.
I would like to cite the example of the Rogers outage that happened in Canada in 2022. It was a day-long outage that left a quarter of the population, about 12 million people, without internet access, severely impacting their daily lives. Even emergency and payment services were cut. Rogers needed an excessive amount of time to identify the root cause of the failure, which turned out to be a software update that had taken their routers down. This is a perfect example of how costly reactive maintenance can be, both financially and in terms of public safety.
Reactive maintenance is not very efficient, as we can see, and it is expensive. Take the smaller networks that have wholesale agreements with service providers like Rogers.
These smaller networks have service level agreements with their service providers, under which the providers are supposed to guarantee network uptime. Otherwise the providers have to pay these smaller networks a fee. So the more time it takes to bring the network back up, the more it costs them.
Another important point is something called gray failures. A gray failure is when a path is technically alive but is underperforming, or cannot support the quality-of-experience requirements placed on it. A reactive maintenance approach will still see the path as active and keep routing traffic over it, but users will get degraded performance. Predictive networks can anticipate and mitigate these issues. This makes a proactive approach like predictive maintenance essential for preventing these types of disruptions.
So why predictive network maintenance? I think by now we understand that reactive approaches are not working, but there is something more. Networks are growing more complex with the influx of IoT devices that are getting onboarded every day. You have multi-cloud environments. Operational technology networks, for example your factory automation networks or your train networks, are merging with IT networks. So networks are more and more heterogeneous and more and more complex. Existing solutions can neither scale nor cope with the complexity and heterogeneity of these modern networks, and that's where predictive networks come in. They offer better fault detection accuracy; some studies have shown that their ability to detect faults is up to 95 percent, which is quite high.
Secondly, networks are highly diverse and dynamic. In a multi-cloud, highly virtualized world, where the network keeps changing and applications constantly move, it has never been so important to equip the network with an ability to learn.
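As a rough illustration of the gray failures described above, the sketch below separates a path that is down from one that is up but underperforming. It is a minimal example under stated assumptions, not a production monitor: the `PathSample` fields, the `classify_path` function, and both thresholds are hypothetical and would need to reflect the real quality-of-experience requirements.

```python
from dataclasses import dataclass


@dataclass
class PathSample:
    """One measurement of a network path (hypothetical fields)."""
    is_up: bool        # liveness, which is all a purely reactive check looks at
    latency_ms: float  # measured latency on the path
    loss_pct: float    # measured packet loss on the path, in percent


# Hypothetical quality-of-experience limits; real values depend on the application.
MAX_LATENCY_MS = 150.0
MAX_LOSS_PCT = 1.0


def classify_path(sample: PathSample) -> str:
    """Return 'down', 'gray', or 'healthy' for one measurement."""
    if not sample.is_up:
        return "down"
    # The path is alive, so a reactive check would keep using it.
    if sample.latency_ms > MAX_LATENCY_MS or sample.loss_pct > MAX_LOSS_PCT:
        return "gray"
    return "healthy"


if __name__ == "__main__":
    for sample in (
        PathSample(is_up=True, latency_ms=20.0, loss_pct=0.0),
        PathSample(is_up=True, latency_ms=400.0, loss_pct=0.2),
        PathSample(is_up=False, latency_ms=0.0, loss_pct=100.0),
    ):
        print(sample, "->", classify_path(sample))
```

The point of the middle case is that a liveness check alone reports the path as fine, while the quality measurements show it is not.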
That ability to learn is what is missing in reactive networks. They have zero ability to learn from the faults or errors that have happened in a network. AI plays a crucial role here, enabling us to predict potential failures and address them in advance. This proactive model leads to a more reliable network and greater efficiency. Predictive networks can even identify gray failures, issues that degrade network performance without causing complete outages, helping us keep both connectivity and quality of service high.
Now let's move on to how to build a predictive network. I think we all know that the first step in building a model is to define the problem. We may already know that we want to detect faults, or proactively find where the next fault could happen in the network. But we still need to define what kind of fault we want to identify. Is it a gray fault, or is it a black fault, which is a total outage of a link? Do we want to identify hardware failures or software failures? What is our forecasting horizon? Are we doing short-term forecasting, for faults happening within a month, or do we have a mid-term or longer forecasting horizon? All of this comes under defining the problem, which is the most important thing to do before we start building our model.
The next step is collecting the data for this model. There are primarily two types of data that we use in the network. One is the historical record of faults that have happened in the network and the associated network state at that time: what the flows were, what the status of the network components was, and so on. There are historical records available, and there are also public data sets that log this network information, but those are mostly outdated and may not be suitable for training the model. The other type is real-time data collected through continuous monitoring of the network.
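One way to connect the forecasting horizon to the historical fault records is to label each telemetry sample by whether a fault began within the horizon after it. The sketch below assumes that framing; the `label_samples` function and the timestamps are made up for illustration and do not come from the talk.

```python
from datetime import datetime, timedelta


def label_samples(sample_times, fault_times, horizon):
    """Label each sample 1 if a fault starts within `horizon` after it, else 0.

    sample_times: datetimes at which network state was recorded
    fault_times:  datetimes at which historical faults began
    horizon:      timedelta giving the forecasting horizon
    """
    labels = []
    for t in sample_times:
        fault_ahead = any(t < f <= t + horizon for f in fault_times)
        labels.append(1 if fault_ahead else 0)
    return labels


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 0, 0)
    samples = [start + timedelta(hours=h) for h in range(6)]  # hourly samples
    faults = [datetime(2024, 1, 1, 4, 30)]                    # one recorded fault
    print(label_samples(samples, faults, horizon=timedelta(hours=2)))
    # prints [0, 0, 0, 1, 1, 0]
```

A shorter horizon produces fewer positive labels and a longer one produces more, which is one reason the horizon has to be fixed as part of defining the problem.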
Real-time data, collected and integrated through continuous monitoring, comes from different network components. It could be the routers, the switches, the servers, the applications. The amount of data available in networks is huge. We have packet logs, network alarms, flow telemetry data, packet traces, and device configuration. What is the configuration of all those devices? What is the topology? What kind of applications are running over the network? There is a great deal of data available.
So in the next step we have to look at preparing this data, which is called feature engineering or feature extraction. We need to extract data from the network based on our KPI, on what problem we are trying to solve, and this goes back to point number one, defining the problem. For fault detection, what we typically want to look at is network alarms: whether there are sudden spikes in packet loss, whether there is jitter in the network, whether there are a lot of packet errors, whether latency is rising, and of course whether the quality of user experience is deteriorating, for example your video is jittery, and so on. You also consider the seasonality of the data.
Now let's get to fault prediction and building our model. On the right side, you see a flowchart. We discussed some of these stages in the previous slides. You start with the data, pre-process it, clean it, and extract the features you want, looking only at the features you want to focus on. The next step is training the model. The data is split into training and test data sets, usually 70/30, and you need spatial and temporal diversity in the data.
From an algorithm perspective, there is no single best algorithm for predicting faults.
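The feature extraction and the 70/30 split mentioned above could be sketched as follows. The example assumes a packet-loss series sampled at regular intervals and uses a trailing-window baseline to expose sudden spikes. Splitting chronologically, so that the test set is strictly later than the training set, is an assumption of this sketch; the talk only gives the 70/30 ratio. The function names are hypothetical.

```python
from statistics import mean


def rolling_features(loss_pct, window=3):
    """Return (baseline, spike) pairs for a packet-loss series.

    baseline: mean of the previous `window` values
    spike:    current value minus that baseline
    The first `window` samples are skipped because they have no full baseline.
    """
    features = []
    for i in range(window, len(loss_pct)):
        baseline = mean(loss_pct[i - window:i])
        features.append((baseline, loss_pct[i] - baseline))
    return features


def chronological_split(rows, train_fraction=0.7):
    """Split time-ordered rows into (train, test) without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


if __name__ == "__main__":
    loss = [0.1, 0.2, 0.1, 0.1, 3.5, 0.2, 0.1, 0.1, 0.2, 0.1, 0.1, 4.0, 0.1]
    rows = rolling_features(loss, window=3)
    train, test = chronological_split(rows)
    print(len(rows), "rows:", len(train), "train,", len(test), "test")
```

Keeping the split chronological avoids training on samples that come after the ones being tested, which would make the evaluation look better than it should.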
In practice, long short-term memory (LSTM) networks, decision trees, and random forests are used in most fault prediction use cases. Models can be trained either in a supervised manner, which uses labeled data where the faults have already been identified, or with unsupervised learning, which identifies unusual patterns in the data without needing predefined fault labels. Typically a hybrid approach gives better results in overall fault detection, but this again goes back to the first point: what we are trying to optimize and what kind of faults we are trying to detect.
Next comes evaluating the model that has been trained and then further optimizing it. Forecasting accuracy is a very critical topic in fault prediction. Every forecasting system will inevitably make errors. Therefore, it is essential that predictive engines are designed to effectively manage trade-offs among true positives, false positives, and false negatives. True positives are instances where the model predicts a failure and the failure then occurs. False positives are cases where the model predicts a failure that does not materialize. And false negatives occur when the model fails to predict a failure that does happen. For fault prediction, and for predictive networks in general, it is advisable to minimize false positives, as acting on them can be disruptive. So in short, accept imperfection: aim for minimal or zero false positives, even if that means you are not able to detect all the faults.
So what has been the outcome? Studies have demonstrated that predictive maintenance can yield significant cost savings for companies, primarily by reducing downtime and optimizing repair schedules in the telecommunications and IT sectors. Network downtime costs are often cited as being as high as $5,600 per minute, which we talked about on one of the earlier slides.
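The trade-off among true positives, false positives, and false negatives described above can be made concrete with a small sketch. It assumes the model outputs a failure score between 0 and 1 for each sample and that a fault is predicted when the score reaches a threshold. The function names and the example scores are illustrative only.

```python
def confusion_counts(scores, actual, threshold):
    """Count true positives, false positives, and false negatives."""
    tp = fp = fn = 0
    for score, failed in zip(scores, actual):
        predicted = score >= threshold
        if predicted and failed:
            tp += 1      # predicted a failure and it occurred
        elif predicted:
            fp += 1      # predicted a failure that did not materialize
        elif failed:
            fn += 1      # missed a failure that did occur
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Return (precision, recall); an empty denominator is treated as 1.0."""
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def lowest_threshold_without_false_positives(scores, actual, candidates):
    """Return the smallest candidate threshold with zero false positives, or None."""
    for threshold in sorted(candidates):
        _, fp, _ = confusion_counts(scores, actual, threshold)
        if fp == 0:
            return threshold
    return None


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.6, 0.4, 0.2]           # hypothetical model outputs
    actual = [True, True, False, True, False]    # whether a fault really occurred
    t = lowest_threshold_without_false_positives(scores, actual, [0.3, 0.5, 0.7])
    print(t, precision_recall(*confusion_counts(scores, actual, t)))
```

With these made-up numbers the chosen threshold is 0.7: there are no false positives, but one real fault is missed, which is the kind of imperfection the talk suggests accepting.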
That $5,600-per-minute figure comes from a Gartner study, and it illustrates the severe financial impact of unexpected outages. Predictive maintenance, which leverages AI and machine learning to monitor equipment in real time and anticipate faults, has been shown to reduce downtime by approximately 50 percent. This proactive approach not only prevents disruptions but also reduces the need for emergency repairs, cutting maintenance costs by around 30 to 40 percent in some cases. Along with the fault alert, a predictive system can also offer insights into the possible root cause, enabling teams to tackle the fundamental issue rather than merely addressing the symptom.
Now, there are some challenges here as well. Collecting appropriate data and extracting relevant features involves several considerations. The data may be inconsistent and disorganized. We should have a fair understanding of which data features correlate with the faults we are trying to predict. Next is determining the most suitable machine learning technique for a particular networking challenge, which is essential. Various methodologies exist for addressing issues related to traffic prediction, classification, and detection. It is also crucial to ensure that the solution can scale effectively to accommodate large and varied networks. Additionally, strategies must be developed to enable machine learning models to learn consistently across networks that are designed non-uniformly.
So what's the future of network maintenance? As AI continues to evolve, we can expect predictive maintenance to become even more accurate and reliable. Real-time analytics and fully automated maintenance could become the norm, with networks becoming smarter and more resilient over time. Emerging AI techniques like state transition learning and hierarchical models will likely push the boundaries of predictive accuracy, leading to even more robust and intelligent systems.
Autonomic, self-healing networks are the logical next step in network reliability, as they minimize disruption and enhance efficiency without requiring manual intervention. These networks create a continuous feedback loop that enables real-time responses to faults, perhaps within a few seconds, while maintaining optimal performance.
We talked a lot about reactive versus predictive approaches earlier in the slides, but it's not that reactive is no longer required at all. Reactive and predictive approaches are complementary, because predictive models cannot predict every issue. The reactive approach comes into play where the predictive model either cannot predict a fault or predicts one incorrectly. For example, if traffic is rerouted onto an alternative path in anticipation of an issue in the network, the networking gear could detect the incorrect prediction and immediately revert to the original route. Such a mechanism is less suited to centralized operations, though.
So what are the key takeaways? Predictive network maintenance offers a proactive, reliable solution that reduces downtime, optimizes operations, and boosts network reliability. As AI technology continues to advance, the potential to create smarter, more resilient networks is within reach. Predictive maintenance is not just a way to improve network efficiency. It's a way to transform how we manage networks altogether.
Thank you for listening to this talk today. I hope you enjoyed it and learned something from it. I have mentioned a few references here that were used in this presentation. Thanks again. Bye. Until next time.

Akshat Kapoor

Director Product Line Management @ Alcatel-Lucent Enterprise
