Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to my talk on predictive network maintenance.
Now imagine a world where network issues almost never happen, where we can actually
prevent network failures before they even have a chance to disrupt operations.
This is the promise of predictive network maintenance.
It's a revolutionary way of using AI to identify and prevent potential problems
in our networks, allowing for smoother operations and fewer interruptions.
I am Akshat Kapoor, and I will walk you through how AI-driven
predictive maintenance works, why it's important, and how it's being
applied across different industries to create more reliable networks.
Since the inception of the internet, network reliability has always been a concern.
Various protocols and technologies were developed to counter this issue.
Most of them have been based on reactive approaches,
which require a failure to happen first before it can be mitigated.
But as networks grew in complexity, the focus shifted from merely
reacting to failures to proactively predicting them and avoiding potential
disruptions to the network.
Predictive network maintenance is the next evolution in this journey,
aiming to reroute traffic or repair systems in anticipation of a failure, a major shift
from traditional reactive approaches.
Now, what is a network failure?
A network failure is a complete or partial failure of hardware or software components
that comprise a network, which causes service disruption for the end users.
Now, there are various causes for it.
According to one study, the most prominent
cause of network failures was link failures, such as a fiber cut
or a configuration error on a link, which accounted for about 32 percent of
the failures reported in the network.
This is followed by operating system failures, meaning software bugs
or glitches caused by system updates, which account for about
18 percent.
Next are human errors, such as misconfiguration of the network, which also
account for about 18 percent.
These are followed by hardware failures, due to components reaching end of life,
memory failures, card failures and the like, or overuse of those components.
And in the modern world, there are even cyber attacks.
The cost of these network failures is very high.
The average cost of network downtime, as per Gartner, is about $5,600 per minute.
That's a lot of money.
Globally for telecom operators, this annual cost can exceed 60 billion a year.
This is not a small cost to pay.
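To put that per-minute figure in perspective, here is a quick back-of-the-envelope sketch in Python. It uses only the 5,600 per minute average cited above. The outage durations are made-up examples, and the name `downtime_cost` is purely illustrative.

```python
# Average downtime cost per minute, as cited in the talk (Gartner figure).
COST_PER_MINUTE = 5_600


def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE) -> float:
    """Return the estimated cost of an outage lasting `minutes` minutes."""
    return minutes * cost_per_minute


if __name__ == "__main__":
    # Hypothetical outage durations, chosen only for illustration.
    for label, minutes in [("10 minutes", 10), ("1 hour", 60), ("1 day", 24 * 60)]:
        print(f"{label}: {downtime_cost(minutes):,.0f}")
```

This is a linear estimate from an average; real outage costs vary a lot by business and time of day.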
So what are the problems with current network maintenance?
Network maintenance has been a topic since the early days of the internet,
and the paradigm has not changed much.
First, a failure or a path disruption must be detected, followed by traffic rerouting
to an alternate path, which is either pre programmed or computed on the fly.
So it's largely reactive, right?
Companies only address a problem once it has already happened
and been identified.
But this approach has serious drawbacks.
It means unexpected downtime and very high operational costs.
Now, I would like to cite the example of the Rogers outage
that happened in Canada in 2022.
It was a day-long outage that left a quarter
of the population, about 12 million people, without internet access,
severely impacting their daily lives.
Even emergency and payment services were cut.
Rogers needed an excessive amount of time to identify the root cause of the failure, which
turned out to be a software update that had taken their routers down.
This is a perfect example of how costly reactive maintenance can be, both
financially and in terms of public safety.
And since the reactive approach is not very efficient, as we can see, it is expensive.
Smaller networks have wholesale agreements with
service providers like Rogers.
These are service level agreements in which the
service providers are supposed to give them network uptime guarantees, right?
Otherwise, the service providers have to pay these smaller networks a fee.
So the more time it takes for them to bring the network back up,
the more it costs them.
Another important point is something called gray failures.
A gray failure is when a path is technically up and alive, but is underperforming
or cannot meet the quality of experience requirements for that path.
A reactive maintenance approach will still see the path as active and
try to route traffic over it, but
that will cause degraded performance for users.
Now, predictive networks can anticipate and mitigate these issues.
This makes a proactive approach like predictive maintenance essential for
preventing these types of disruptions.
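To make the gray failure idea concrete, here is a minimal Python sketch that separates a path that is down from one that is up but underperforming. The `PathSample` fields and the latency and loss thresholds are assumptions chosen for illustration; real limits depend on the application's quality of experience requirements.

```python
from dataclasses import dataclass


@dataclass
class PathSample:
    """One measurement of a network path (hypothetical fields)."""
    is_up: bool         # liveness as seen by a reactive mechanism
    latency_ms: float   # measured latency
    loss_pct: float     # measured packet loss, in percent


def classify_path(sample: PathSample,
                  max_latency_ms: float = 100.0,
                  max_loss_pct: float = 1.0) -> str:
    """Return 'down', 'gray', or 'healthy' for a path sample.

    The default thresholds are illustrative assumptions only.
    """
    if not sample.is_up:
        return "down"
    # The path is alive, but it may still violate the quality targets.
    if sample.latency_ms > max_latency_ms or sample.loss_pct > max_loss_pct:
        return "gray"
    return "healthy"


if __name__ == "__main__":
    print(classify_path(PathSample(is_up=True, latency_ms=250.0, loss_pct=0.2)))
```

A purely reactive check would only look at `is_up`, which is why it keeps using a path that this sketch would flag as gray.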
So why predictive network maintenance?
I think by now we understand that reactive approaches are not working,
but there is something more.
Networks are growing more complex with the influx of IoT devices that
are getting onboarded every day.
You have multi-cloud environments.
Operational technology networks are merging with IT networks.
For example, your factory automation networks or your train networks
are merging with IT networks.
So networks are becoming more and more heterogeneous and more and more
complex. Existing solutions are able neither to scale nor to
cope with the complexity and heterogeneity of these modern networks.
And that's where predictive networks come in.
They offer better fault detection accuracy, and some studies have shown
that they can detect faults with up to 95% accuracy, which is quite high.
Secondly, networks are highly diverse and dynamic.
In a multi-cloud, highly virtualized world, where the network keeps changing
and applications constantly move, it has never been so important to equip
the network with an ability to learn.
This is what is missing in reactive networks.
They have zero ability to learn from the faults or errors that
have happened in the network.
Now, AI plays a crucial role here, enabling us to predict potential
failures and address them in advance.
This proactive model leads to a more reliable network and greater efficiency.
Predictive networks can even identify gray failures, issues that degrade
network performance without causing complete outages, helping us to keep both
connectivity and quality of service high.
Now let's move towards how to build a predictive network.
I think we all know here that the first step in building a
model is to define the problem.
And maybe we already know the problem: we want to identify and detect
faults, or we want to proactively find where the next
fault could happen in the network.
But we still want to define what kind of fault we want to identify.
Is it a gray fault, or is it a black fault, which is a total outage of a link?
Do we want to identify hardware failures or software failures?
What is our forecasting horizon?
Are we looking at short-term forecasting, faults happening within
a month, or do we have a mid-term or longer forecasting horizon?
All this comes under defining the problem, which is the most important
thing before we start building our model.
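One way to pin these choices down before any modeling is to write them out as a small configuration object. The following Python sketch is only an illustration: the class names, fields, and the 30-day cutoff for "short term" (taken from "within a month") are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class FaultKind(Enum):
    GRAY = "gray"    # path is up but underperforming
    BLACK = "black"  # total outage of a link


class Component(Enum):
    HARDWARE = "hardware"
    SOFTWARE = "software"


@dataclass(frozen=True)
class ProblemDefinition:
    """What the model is supposed to predict (hypothetical structure)."""
    fault_kind: FaultKind
    component: Component
    horizon_days: int  # forecasting horizon, in days

    def is_short_term(self, short_term_limit_days: int = 30) -> bool:
        """Treat horizons of up to about a month as short term (assumption)."""
        return self.horizon_days <= short_term_limit_days


if __name__ == "__main__":
    problem = ProblemDefinition(FaultKind.GRAY, Component.SOFTWARE, horizon_days=7)
    print(problem, problem.is_short_term())
```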
Now, the next step is collecting the data for this model.
There are primarily two types of data that we use in the network.
One is the historical record of faults that have happened in the network
and the associated network state at that time: what were the flows, what
was the status of the network components at that time, all of that. There are
historical records available, and there are also public data sets available
that log this network information, but those are mostly outdated and may not
be suitable for training the model.
The second type is real-time data collected through
continuous monitoring of networks.
This real-time data is collected and integrated
from different network components.
It could be the routers, the switches, the servers, the applications. The amount of
data that's available in networks is huge.
We have packet logs, we have network alarms, we have flow
telemetry data, packet traces,
and device configuration.
What is the configuration of all those devices?
What is the topology?
What kind of applications are running over the network?
There is lots and lots of data available.
So when we move to the next step, we have to look at preparing this data,
which is something called feature engineering or feature extraction.
We need to extract the data from the network based on our KPI,
on what problem we are trying to solve.
And this goes back to point number one,
defining the problem.
So for fault detection cases, typically what we want to look at is network alarms:
whether there are sudden spikes in packet loss, whether there is jitter
in the network,
whether a lot of packet errors are occurring,
whether latency is rising and, of course, whether the quality of user
experience is deteriorating,
for example, your video is jittery and so on and so forth.
You also consider the
seasonality of the data.
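As a rough illustration of feature extraction, the Python sketch below turns one window of raw measurements into a handful of features of the kind mentioned above. The function name, the input layout, and the definition of jitter as the mean absolute difference between consecutive latency samples are assumptions made for this example.

```python
from statistics import mean


def window_features(latencies_ms, loss_pct, alarm_count):
    """Summarize one monitoring window into model features.

    latencies_ms: latency samples in the window (at least one)
    loss_pct:     packet loss samples in the window (at least one)
    alarm_count:  number of network alarms raised in the window
    """
    # Jitter here is the mean absolute change between consecutive samples.
    deltas = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return {
        "mean_latency_ms": mean(latencies_ms),
        "max_latency_ms": max(latencies_ms),
        "jitter_ms": mean(deltas) if deltas else 0.0,
        "mean_loss_pct": mean(loss_pct),
        "max_loss_pct": max(loss_pct),  # captures sudden loss spikes
        "alarm_count": alarm_count,
    }


if __name__ == "__main__":
    print(window_features([10, 12, 11], [0.0, 0.5, 1.0], alarm_count=2))
```

Seasonality could be handled by adding features such as hour of day or day of week to the same dictionary; that is left out here to keep the sketch short.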
Now let's get to fault prediction and building our model.
On the right side, you see a flowchart.
Some of these stages we discussed in the previous slides.
You start with the data, pre-process it, clean it,
and extract the features you want.
You only look at those features which you want to focus on.
And the next step is
training the model.
The data is split into training and test data sets.
It's usually a 70/30 split, and you need spatial and
temporal diversity in the data.
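A 70/30 split can be sketched in a few lines of Python. This version splits chronologically, so the model is tested on data that comes after its training data. That ordering is an assumption on my part, a common choice for forecasting problems rather than something stated in the talk, and the `timestamp` key is hypothetical.

```python
def chronological_split(records, train_fraction=0.7):
    """Split records into (train, test) by time, oldest records first.

    Each record is assumed to be a dict with a sortable 'timestamp' key.
    """
    ordered = sorted(records, key=lambda r: r["timestamp"])
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


if __name__ == "__main__":
    # Ten dummy records with integer timestamps, for illustration only.
    data = [{"timestamp": t, "fault": t % 4 == 0} for t in range(10)]
    train, test = chronological_split(data)
    print(len(train), len(test))
```

To get the spatial diversity mentioned above, the records would need to come from many different devices and sites, which this sketch does not check.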
From an algorithm perspective, there is no single algorithm that can be said to be
the best at predicting faults.
But typically, long short-term memory networks, decision trees, and random
forests are used in most of the use cases for predicting faults.
Models can be trained either in a supervised manner, which
uses labeled data where we know what faults have already been identified, or
with unsupervised learning, which identifies unusual patterns in the data
without needing predefined fault labels.
Typically, a hybrid approach gives better results in overall fault
detection, but it also goes back to the first point: what we are trying
to optimize and what kind of faults we are trying to detect with the model.
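To show what "unusual patterns without labels" can look like in practice, here is a deliberately simple Python sketch. It is not one of the algorithms named above (LSTM, decision trees, random forests); it swaps in a basic z-score check purely to illustrate the unsupervised idea. The threshold of 3 standard deviations is an assumption.

```python
from statistics import mean, pstdev


def zscore_anomalies(values, threshold=3.0):
    """Return indices of values more than `threshold` standard deviations
    away from the mean. No fault labels are needed."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


if __name__ == "__main__":
    # Twenty normal latency samples followed by one spike (made-up data).
    latencies = [10.0] * 20 + [100.0]
    print(zscore_anomalies(latencies))
```

In a hybrid setup, anomalies flagged this way could be reviewed and turned into labels for a supervised model.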
Now comes evaluating the model that has been trained and then further optimizing it.
Forecasting accuracy is a very critical topic in fault prediction.
Every forecasting system will inevitably make errors.
Therefore, it is essential that predictive engines are designed
to effectively manage the trade-offs
among true positives, false positives, and false negatives.
True positives are instances where the model predicts a
failure and the failure then occurs.
False positives are cases where the model predicts a failure
that does not materialize.
And false negatives occur when the model fails to predict a failure
that does happen.
In fault prediction, and in predictive networks in general, it
is advisable to minimize false positives, as acting on them can be disruptive.
So in short, accept imperfection:
make sure there are minimal or zero false positives, even if that means you're
not able to detect all the faults.
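The trade-off can be made concrete with precision (how many predicted failures were real) and recall (how many real failures were predicted). The Python sketch below is a minimal illustration: the function names are my own, and the scores and labels in the usage example are made-up data.

```python
def confusion_counts(predicted, actual):
    """Count true positives, false positives, and false negatives."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def evaluate_at(scores, actual, threshold):
    """Predict a failure when the model score reaches `threshold`."""
    predicted = [s >= threshold for s in scores]
    return precision_recall(*confusion_counts(predicted, actual))


if __name__ == "__main__":
    scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.2]        # hypothetical model outputs
    actual = [True, True, False, True, False, False]  # whether a failure occurred
    for threshold in (0.5, 0.85):
        print(threshold, evaluate_at(scores, actual, threshold))
```

With this made-up data, raising the threshold removes the false positive at the price of missing one real failure, which is the "accept imperfection" trade described above.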
So what has been the outcome?
Studies have demonstrated that predictive maintenance can yield
significant cost savings for companies,
primarily by reducing downtime and optimizing repair schedules in the
telecommunications and IT sectors.
Network downtime costs are often cited as high as $5,600 per minute, which we
talked about in one of the earlier slides.
That was a study from Gartner, which illustrates the severe financial
impact of unexpected outages.
Predictive maintenance, which leverages AI and machine learning to
monitor equipment in real time and anticipate faults, has been shown to
reduce downtime by approximately 50%.
This proactive approach not only prevents disruptions, but also reduces
the need for emergency repairs, cutting maintenance costs by around
30 to 40 percent in some cases.
Along with the fault alert, a predictive system can also offer insights
regarding the possible root cause, enabling teams to tackle
the fundamental issue rather than merely addressing the symptom.
Now, there are some challenges here as well.
Collecting appropriate data and extracting relevant features
involves several considerations.
The data may be inconsistent and disorganized.
We should have a fair understanding of which data features
correlate with the faults we are trying to predict.
Next, determining the most suitable machine learning technique
for a particular networking challenge is essential.
Various methodologies exist for addressing issues related to traffic
prediction, classification, and detection.
It's also crucial to ensure that the solution can scale effectively to
accommodate large and varied networks.
Additionally, strategies must be developed to enable machine learning
models to learn consistently across networks that are designed non-uniformly.
So what's the future of network maintenance?
As AI continues to evolve, we can expect
predictive maintenance to become even more accurate and reliable.
Real-time analytics and fully automated maintenance could become
the norm, with networks becoming smarter and more resilient over time.
Emerging AI techniques like state transition learning and hierarchical
models will likely push the boundaries of predictive accuracy, leading to even
more robust and intelligent systems.
Now, autonomic self healing networks are the logical next step in
network reliability as they minimize disruption and enhance efficiency
without requiring manual intervention.
These networks create a continuous feedback loop that enables
real-time responses to faults, maybe within a few seconds, while
maintaining optimal performance.
We talked a lot about reactive versus predictive earlier in the
slides, but it's not that reactive is not required at all. Reactive and
predictive approaches are complementary to each other, as we know that predictive
models cannot predict all of the issues.
So the reactive approach comes into play where the predictive model
either cannot predict a fault or predicts
one incorrectly.
For example, if traffic is rerouted onto an alternative path in anticipation
of an issue that could happen in the network, the networking gear could
detect the incorrect prediction and immediately revert to the original route.
Such a mechanism is less suited for centralized operations though.
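The interplay between the predictive reroute and the reactive revert can be sketched as a small decision function in Python. Everything here is an assumption for illustration: the function name, the inputs, and especially the rule for spotting an incorrect prediction (the original path is still healthy after the prediction horizon has passed).

```python
def decide_route(on_alternate: bool,
                 failure_predicted: bool,
                 original_healthy: bool,
                 horizon_elapsed: bool) -> str:
    """Return which path to use next: 'original' or 'alternate'.

    on_alternate:      traffic was already moved in anticipation of a fault
    failure_predicted: the predictive model currently forecasts a failure
    original_healthy:  the original path is measured as healthy right now
    horizon_elapsed:   the window in which the fault was predicted has passed
    """
    if not on_alternate:
        # Predictive step: move traffic before the forecast failure.
        return "alternate" if failure_predicted else "original"
    # Reactive safety net: the forecast failure never showed up, so revert.
    if horizon_elapsed and original_healthy:
        return "original"
    return "alternate"


if __name__ == "__main__":
    # Traffic was moved, the horizon passed, and the original path is fine.
    print(decide_route(True, False, True, True))
```

Because the decision only needs local measurements of the two paths, it is the kind of check a device could run on its own.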
So what are the key takeaways from here?
Predictive network maintenance offers a proactive, reliable solution that
reduces downtime, optimizes operations, and boosts network reliability.
As AI technology continues to advance, the potential to create smarter, more
resilient networks is within reach.
Predictive maintenance is not just a way to improve network efficiency.
It's a way to transform how we manage networks altogether.
Thank you for listening in to this talk today.
I hope you enjoyed it and learned something from it.
I have listed a few references here, which were used in this presentation.
Thanks again.
Bye.
Until next time.