Transcript
This transcript was autogenerated. To make changes, submit a PR.
My name is Pawel Skrzypek. I'm the CTO at AI Investments, and today, together with Anna Warno, data scientist from 7bulls.com, we will present an overview of the most advanced and, at least in our opinion, most effective machine learning time series forecasting methods.
I will make a short introduction with slides, and then we will have a real hands-on session on how to use some of these methods on real data sets. The agenda is as follows: I will tell a little bit more about time series, what they are, about the statistical methods which were used very extensively until about two years ago, and about the M4 and M5 competitions. Then I will briefly go through the most advanced and effective machine learning time series forecasting methods, and finally there will be the hands-on session on how to forecast different time series using the presented methods.

For a long time, a very long time, the domination of statistical methods was obvious. Machine learning methods achieved much worse results, and the most popular approach was to use the statistical methods listed here. I will not go into the details; I only want to highlight the fact that the most typical and most popular way of using the statistical methods was to ensemble different methods, which is also used very extensively by machine learning methods.
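For illustration (this is my own minimal sketch, not something shown in the talk), ensembling two classical statistical models can be as simple as averaging their forecasts; the series, model orders and horizon below are placeholders:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

y = pd.Series(np.random.rand(500))   # stand-in for a real series
h = 6                                 # forecast horizon

f_arima = ARIMA(y, order=(1, 1, 1)).fit().forecast(h)             # ARIMA forecast
f_ets = ExponentialSmoothing(y, trend="add").fit().forecast(h)     # exponential smoothing forecast

# Simple ensemble: average the two forecasts point by point.
ensemble = (np.asarray(f_arima) + np.asarray(f_ets)) / 2
```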
The breakthrough came in 2018, when the results of the M4 competition were announced. The M competition is probably the most prestigious, scientifically backed competition for time series forecasting methods. It is organized by professor Spyros Makridakis from the University of Nicosia, and the fourth edition of this competition is called M4. The first and second places were won by so-called hybrid methods, methods which use both statistical approaches and machine learning, and the machine learning part was much more important in these methods compared to the statistical one. M5, whose results were presented last year, was dominated by purely machine learning methods. As in the M4 competition, the goal was to predict over 100,000 time series, so a very, very big number of time series. We consider these results very comprehensive and reliable.
The first method is the ES hybrid method by Slawek Smyl. It is the winning method from M4. It uses a statistical approach, exponential smoothing (Holt-Winters), for the data preprocessing, and also uses a very novel way of learning a neural network with a special architecture, which I will describe a little more on the next slide, jointly with the parameters of the exponential smoothing. This method also uses model ensembling very extensively, in a very unique and novel way. Another novelty was the LSTM network, not a typical LSTM but one with dilations and residual connections. Both of these concepts are very popular in image recognition with convolutional neural networks, and this application of those concepts was, at least on this scale, the first one for LSTMs and time series forecasting. The results were obviously great, so Slawek Smyl won the M4 competition, and this architecture has been heavily studied by other scientists and people working on time series forecasting.
One more very important thing about the ES hybrid method is that it uses ensembling in a very advanced way. The best models for a given time series (please remember that it was used for 100,000 time series) were collected, and for the final predictions these models were used together to achieve the most robust and accurate results.
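To make the exponential smoothing part more concrete, here is a heavily simplified sketch of the idea (my own illustration, not Smyl's implementation): per-series level and multiplicative seasonality are updated Holt-Winters style, and the series is normalized by them before being fed to the dilated, residual LSTM. The smoothing coefficients and season length are illustrative values.

```python
import numpy as np

def es_normalize(y, season=24, alpha=0.2, beta=0.1):
    """Holt-Winters style level/seasonality used to normalize inputs for the network."""
    level = y[0]
    seasonality = list(np.ones(season))          # initial multiplicative seasonal coefficients
    normalized = []
    for t, value in enumerate(y):
        s = seasonality[t]                       # coefficient computed one season earlier
        level = alpha * value / s + (1 - alpha) * level
        seasonality.append(beta * value / level + (1 - beta) * s)
        normalized.append(value / (level * s))   # deseasonalized, level-free input for the LSTM
    return np.array(normalized)

x = es_normalize(np.random.rand(1000) + 10)      # toy series with a daily season of 24
```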
The second method, which is a purely machine learning method, is N-BEATS. It was published after the M4 competition and claims to achieve better results on the M4 data sets compared to the ES hybrid. It has a unique stack- and block-based architecture with different types of blocks (trend, seasonality and generic), it has some explainability and transfer learning features, and it also uses advanced model ensembling on a very big scale, ensembling over 100 models to make the predictions.
This is the architecture of the N-BEATS method. As you can see, there are many stacks. Each stack is built of blocks, and there are residual connections within the stack, so the input of one layer is passed to that layer and also skips it, going directly to the next one. Another unique concept is to take the results of each block, combine them together, and use them as the output of the stack, and each stack adds its output to the global forecast, which is ensembled inside the given model. So N-BEATS clearly relies on ensembling in a very effective way, and it is a fully pure machine learning method.
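To give a feel for this doubly residual stacking, here is a heavily simplified, generic-block-only sketch in PyTorch (an illustration under my own assumptions, not the authors' code): each block emits a backcast and a forecast, the backcast is subtracted from the block's input, and the block forecasts are summed into the final forecast.

```python
import torch
import torch.nn as nn

class NBeatsBlock(nn.Module):
    def __init__(self, backcast_len, forecast_len, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(backcast_len, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, backcast_len + forecast_len),
        )
        self.backcast_len = backcast_len

    def forward(self, x):
        theta = self.mlp(x)
        return theta[:, :self.backcast_len], theta[:, self.backcast_len:]

class NBeats(nn.Module):
    def __init__(self, backcast_len=168, forecast_len=6, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [NBeatsBlock(backcast_len, forecast_len) for _ in range(n_blocks)]
        )

    def forward(self, x):
        forecast = 0
        for block in self.blocks:
            backcast, block_forecast = block(x)
            x = x - backcast                  # residual: pass what is left to the next block
            forecast = forecast + block_forecast
        return forecast

y_hat = NBeats()(torch.randn(32, 168))        # batch of 32 input windows -> (32, 6) forecasts
```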
The next method which I wanted to mention is a complete framework called GluonTS. It is a complete framework for time series forecasting. It includes various models, including the most advanced neural network architectures like the Transformer, and different methods of data transformation. It also allows for probabilistic time series modeling to determine the forecast distribution, it supports cloud-based training and inference, and it has very strong community support. As I said, this framework is ready to use: you can download the library and start using it. It's not easy to use, but it is easier than the previous two methods, which are available as source code that you need to download and build yourself before you can start using them.
Here you can see, I'm not sure if it's the latest diagram, but it shows how many components are already included in GluonTS, and the framework is still being developed and used for time series forecasting.
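For a rough idea of how the framework is used in practice, a workflow could look roughly like the sketch below. This is an assumption-laden example rather than canonical usage: exact module paths and arguments depend on the GluonTS version and backend, and the choice of a DeepAR estimator on synthetic hourly data is mine.

```python
import numpy as np
from gluonts.dataset.common import ListDataset
from gluonts.torch import DeepAREstimator
from gluonts.evaluation import make_evaluation_predictions

values = np.random.rand(1000)                       # stand-in for a real hourly series
train_ds = ListDataset([{"target": values[:-6], "start": "2021-01-01 00:00"}], freq="H")
test_ds = ListDataset([{"target": values, "start": "2021-01-01 00:00"}], freq="H")

estimator = DeepAREstimator(freq="H", prediction_length=6,
                            trainer_kwargs={"max_epochs": 5})
predictor = estimator.train(train_ds)               # training

forecast_it, ts_it = make_evaluation_predictions(test_ds, predictor=predictor)
forecast = next(iter(forecast_it))                  # probabilistic forecast object
print(forecast.mean, forecast.quantile(0.9))        # point and quantile predictions
```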
The next method which I wanted to mention is the Tsetlin machine. It is based on stochastic learning automata invented by the Russian scientist Tsetlin in the previous century, so it is quite old, but for a long time this algorithm was used only for scientific purposes. Now it is used both for machine learning and for reinforcement learning, so for supervised learning and reinforcement learning.
From my point of view, the biggest innovation of this approach is that it allows us to create, or to learn, a stochastic distribution for each of the parameters. So this algorithm learns probabilistic distributions, which are learned in a supervised way and also constantly updated after each prediction. That is the reason this approach is considered self-learning and can be used for reinforcement learning as well as for predictions. The advantage is that we do not need to retrain the model after each prediction, yet the model is, in a sense, retrained after each prediction anyway: the weights of the probabilistic distributions are changed after every prediction.
Very briefly, it works this way: we have the input, and for each parameter of the input a separate stochastic distribution is created. Based on the rules of the Tsetlin machine, the probabilistic distributions are updated, and for the final prediction the value for the output of each parameter is sampled from the currently learned distribution, and finally the samples are ensembled in a given way.
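As a self-contained illustration of the underlying idea (my own toy sketch, not the talk's code), a single two-action Tsetlin automaton keeps an integer state that encodes how strongly it prefers one of two actions, and reward/penalty feedback nudges that state after every decision:

```python
import random

class TsetlinAutomaton:
    def __init__(self, n_states=100):
        self.n = n_states                            # states 1..n -> action 0, n+1..2n -> action 1
        self.state = random.choice([n_states, n_states + 1])

    def action(self):
        return 0 if self.state <= self.n else 1

    def update(self, reward):
        if reward:                                   # reinforce: move away from the decision boundary
            if self.action() == 0:
                self.state = max(1, self.state - 1)
            else:
                self.state = min(2 * self.n, self.state + 1)
        else:                                        # penalize: move toward the other action
            self.state += 1 if self.action() == 0 else -1

# Toy usage: the environment rewards action 1 with probability 0.9.
automaton = TsetlinAutomaton()
for _ in range(1000):
    a = automaton.action()
    automaton.update(reward=(random.random() < (0.9 if a == 1 else 0.1)))
print("learned action:", automaton.action())         # almost always 1
```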
The Tsetlin machine is a very different approach compared to, let's say, traditional machine learning, which is usually based on neural networks, stochastic gradient descent and the backpropagation process, because it is trained in a different way, and it could be used as one additional method, for example to be included in an ensemble. Anna, in her hands-on session, will present traditional machine learning methods, so ES hybrid and N-BEATS, and one more method called the Temporal Fusion Transformer, which is considered one of the most advanced machine learning methods currently available. Anna will introduce these methods, so I will skip the Temporal Fusion Transformer for now.

That's all about reviewing the methods. A very short summary of the highlights: from my point of view, the most important thing is that time series forecasting methods are currently developing very dynamically and new methods keep appearing. They are not, let's say, typical convolutional, LSTM or Transformer methods, but much more advanced ones, and their efficiency is much higher; in terms of prediction accuracy they are much better than statistical methods.
Of course, forecasting methods have many, many areas of application. At AI Investments we are using them for financial time series, and we achieve over 60% accuracy on long test periods of over ten years. But of course, time series forecasting methods are used in many different areas like business, sales and retail, and also for social purposes like health, environment and mobility, and many more. So having a more accurate method gives a significant edge in many areas, and that is the reason we are presenting it here.

Okay, now it's time for the hands-on session by Anna. As I said, Anna will show the N-BEATS method, the TFT method, and also give a basic introduction to how to properly forecast time series using machine learning approaches. I hope you find our sessions valuable and can learn something interesting from them. So that's all from my side, and now it's time for Anna Warno's hands-on session.
Hello. After the theoretical introduction, I would like to show something practical. I would like to present how we work with time series data on a daily basis. I will talk shortly about data preprocessing, model choice, model evaluation, boosting accuracy, and explainable AI, which can be used with time series data. To have some examples, I chose a publicly available data set. The selection criteria were multidimensionality, difficulty and data size, and I will briefly show what can be done with such data.
As I mentioned before, these data are open source. There are around 40,000 rows, one for each timestamp; the frequency of the data is one hour, and we have around 15 columns: six main air pollutants, six connected with the weather, and the rest expressing the date. Here we have some example columns plotted. As we can see, the data look messy; we have large amplitudes. After zooming into the data plots, it looks slightly better; however, there is no visible pattern at first sight. Only after aggregation, for example over a week, can you see some regularities. Normally we would now do some exploratory data analysis, et cetera; however, we don't have that much time, so we will focus only on the parts which are absolutely necessary, which are crucial for modeling.
One of the first things which needs to be done is handling the missing data. Firstly, we need to understand the source of the missingness. Does it occur regularly? What are the largest gaps between consecutive non-NaN values? Here I have plotted some missing data statistics, starting from a basic bar plot. As you can see, many columns do not contain any NaNs, but there are columns with a significant amount of missing values, such as carbon monoxide. Next, a heat map: it helps us determine which occurrences of NaNs in different columns are correlated. We can see a strong correlation between the columns describing the weather, such as pressure and temperature. Correlation of occurrences of missing values in different columns may also be expressed with a dendrogram, shown here. Apart from basic statistics and correlations, we can check the distribution for specific columns. We can select a column, and here we have a histogram of the lengths of consecutive NaN runs. As we can see in this example, most consecutive NaN sequences are short; however, runs as long as 60 also exist. The red plot shows the lengths of gaps between missing values, so if it were a straight line, that would mean that missing values occur regularly. They do not in this case.
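As a small sketch of these diagnostics (not the talk's exact notebook; the file and column names are assumptions), pandas plus the missingno library cover the bar plot, heat map and dendrogram, and a short helper gives the consecutive-NaN run lengths:

```python
import pandas as pd
import missingno as msno

df = pd.read_csv("air_quality.csv", parse_dates=["date"], index_col="date")  # hypothetical file

msno.bar(df)          # fraction of non-missing values per column
msno.heatmap(df)      # correlation of missingness between columns
msno.dendrogram(df)   # hierarchical clustering of missingness patterns

def nan_run_lengths(series: pd.Series) -> pd.Series:
    """Lengths of consecutive-NaN runs in a series."""
    is_na = series.isna()
    groups = (is_na != is_na.shift()).cumsum()        # label each run of equal values
    run_lengths = is_na.groupby(groups).sum()         # number of NaNs in each run
    run_is_nan = is_na.groupby(groups).first()        # whether the run is a NaN run
    return run_lengths[run_is_nan]

print(nan_run_lengths(df["CO"]).describe())            # e.g. carbon monoxide column (name assumed)
```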
So now we need to handle the missing data. We could apply standard basic methods like backward or forward filling, or linear or polynomial interpolation. We could also use more advanced methods, for example based on machine learning models. Here we have examples: in the plot we can see a fragment of one of the time series. For visualization purposes we can artificially increase the amount of missingness: we select the percentage of values which will be randomly removed and see how different simple imputation methods fill these values, starting from forward filling, through linear interpolation, to a spline of higher order, which gives us a smoother curve. From the analysis of missing values we know that in our case the gap between two non-missing values is sometimes very large; here we have a plotted example. In that case we will not impute anything, but just split the series into two shorter series.
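A minimal sketch of these simple imputation options (with a toy series standing in for one of the pollutant columns) could look like this in pandas:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.sin(np.arange(500) * 2 * np.pi / 24) + 10)   # toy hourly-like series

# Artificially remove 10% of the values to compare imputation methods visually.
s_missing = s.copy()
s_missing[s_missing.sample(frac=0.1, random_state=0).index] = np.nan

filled_ffill  = s_missing.ffill()                                 # forward filling
filled_linear = s_missing.interpolate(method="linear")            # linear interpolation
filled_spline = s_missing.interpolate(method="spline", order=3)   # higher-order spline, smoother curve
```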
The second thing which needs to be done is data transformation. This step is crucial for some statistical models, which often require the series in a specific format, for example stationary. For more advanced models it's often not essential, but it may also help with numerical issues. Here we have listed some basic transformations which can be applied to time series, and we can see how our series would look after a given transformation. We can also use more advanced transformations, like embeddings. An example of a simple but effective time transformation is encoding cyclical features like hours, days, et cetera onto the unit circle, like in the presented GIF, and for that we are using the sine and cosine formula shown in the sketch below. So, before the modeling, for our task we will fill missing values with linear interpolation, normalize the features (sometimes we also use the Box-Cox transformation), and encode the cyclical features onto the unit circle, in our case hours and days.
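The encoding referred to above is the usual pair sin(2πt/T), cos(2πt/T) for a feature with period T. As a small sketch (the column names are my own), in pandas it can be applied like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"timestamp": pd.date_range("2021-01-01", periods=100, freq="H")})
hour = df["timestamp"].dt.hour          # period 24
dow = df["timestamp"].dt.dayofweek      # period 7

df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)
df["dow_sin"]  = np.sin(2 * np.pi * dow / 7)
df["dow_cos"]  = np.cos(2 * np.pi * dow / 7)
```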
For modeling we will choose one column, nitrogen dioxide, and a prediction horizon equal to six. Firstly, we will train baselines and simpler statistical models to have some point of reference, and then we will move to neural network methods. Before the model results, a few words about the training setup: we use a train/validation/test split, and for evaluation we use a rolling window. Here we have plotted the train, validation and test splits.
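A minimal sketch of rolling-window evaluation (my own illustration): the model is repeatedly asked for a 6-step forecast, the window then rolls forward, and the errors are averaged. `fit_and_forecast` is a hypothetical stand-in for any model.

```python
import numpy as np

def rolling_window_mae(y, horizon=6, start=1000, step=6, fit_and_forecast=None):
    errors = []
    for t in range(start, len(y) - horizon, step):
        history = y[:t]                                   # everything observed so far
        forecast = fit_and_forecast(history, horizon)
        errors.append(np.mean(np.abs(y[t:t + horizon] - forecast)))
    return float(np.mean(errors))

# Example with a naive "repeat last value" model:
y = np.random.rand(2000)
naive = lambda history, h: np.repeat(history[-1], h)
print(rolling_window_mae(y, fit_and_forecast=naive))
```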
We will start with extremely simple models: naive predictors. It's good to always look at them in time series forecasting. They are very easy to use, and it often happens that the metrics, graphs and result statistics of our model look okay at first glance, but then it turns out that the naive prediction is better and our model is worthless. So it's good practice to start with naive predictions first. It's also worth mentioning that an alternative to comparing against naive predictions is to use metrics like the mean absolute scaled error.
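For reference, here is a small sketch (my own, with an assumed season length of 24 hours) of the two naive baselines used below and of the mean absolute scaled error, which scores a forecast relative to the in-sample seasonal naive forecast:

```python
import numpy as np

def naive_last_value(history, horizon):
    return np.repeat(history[-1], horizon)

def seasonal_naive(history, horizon, season=24):
    return history[-season:][:horizon]          # same hour of the previous day

def mase(y_true, y_pred, history, season=24):
    naive_error = np.mean(np.abs(history[season:] - history[:-season]))  # in-sample seasonal naive MAE
    return np.mean(np.abs(y_true - y_pred)) / naive_error
```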
Apart from the naive baselines, we also train some classical models, for example SARIMA, Prophet, TBATS or exponential smoothing. These models are fine-tuned with a rolling window, with a hyperparameter grid search or Bayesian hyperparameter search.
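As a rough sketch of such tuning (assumptions: statsmodels, hourly seasonality of 24, a validation slice taken from the rolling split), a small grid search over SARIMA orders might look like this:

```python
import itertools
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

def sarima_grid_search(train, valid, horizon=6):
    best = (np.inf, None)
    for p, q in itertools.product([0, 1, 2], repeat=2):
        model = SARIMAX(train, order=(p, 1, q), seasonal_order=(1, 0, 1, 24)).fit(disp=False)
        forecast = model.forecast(horizon)
        mae = np.mean(np.abs(np.asarray(valid[:horizon]) - np.asarray(forecast)))
        if mae < best[0]:
            best = (mae, (p, 1, q))
    return best   # (validation MAE, best non-seasonal order)
```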
Okay, so as we have seasonal data, we use two naive predictors: last value repetition, and repetition of the value from the previous day at the same hour. Here are the results; we'll compare them later with the other methods.
For the advanced neural network models, we chose two methods: the Temporal Fusion Transformer and N-BEATS. Both of them are state-of-the-art models, but they have different advantages and complement each other very well. In this picture we can see the architecture of the Temporal Fusion Transformer. As you can see, it's quite complicated, but we will not talk about the technical details; we will focus on the advantages of this model. First of all, good results: according to the paper, it compares favorably to statistical and neural network models. A very, very big advantage of the Temporal Fusion Transformer is the fact that it works with multivariate time series with different types of data, like categorical, continuous or static. The Temporal Fusion Transformer also has a built-in variable selection network, so it allows us to save time during data preparation. Its results are interpretable thanks to the attention mechanism. It also works with known future inputs, which allows us to create conditional predictions. In general, it's applicable without modification to a wide range of problems, and we can obtain explainable predictions.
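To make this concrete, a rough sketch of training such a model with the pytorch_forecasting library might look like the code below. This is an assumption-laden illustration, not the exact setup used here: the dataframe layout, column names, cutoff variable and hyperparameters are mine, and exact arguments depend on the library version (the Lightning import may be `pytorch_lightning` in older setups).

```python
import lightning.pytorch as pl
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

# Assumed: `df` has columns time_idx (consecutive int), series (a constant id, since we
# model a single series), NO2 (the target) and the cyclical time features created earlier;
# `training_cutoff` is the last time_idx of the training split.
training = TimeSeriesDataSet(
    df[df.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="NO2",
    group_ids=["series"],
    max_encoder_length=168,                   # one week of hourly history
    max_prediction_length=6,                  # our 6-hour horizon
    time_varying_unknown_reals=["NO2"],
    time_varying_known_reals=["hour_sin", "hour_cos", "dow_sin", "dow_cos"],
)
validation = TimeSeriesDataSet.from_dataset(training, df, predict=True, stop_randomization=True)

train_dataloader = training.to_dataloader(train=True, batch_size=64)
val_dataloader = validation.to_dataloader(train=False, batch_size=64)

tft = TemporalFusionTransformer.from_dataset(
    training, hidden_size=16, attention_head_size=1, dropout=0.1, loss=QuantileLoss()
)
trainer = pl.Trainer(max_epochs=30, gradient_clip_val=0.1)
trainer.fit(tft, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader)
```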
The second chosen model was N-BEATS. N-BEATS outperformed the winning model from the prestigious M4 competition, which means it achieved the highest scores on 100,000 time series from different domains, so that is a strong mark of quality. It's designed for univariate time series. Its results are also interpretable, thanks to special blocks which try to explain the trend and seasonality. To sum up, TFT and N-BEATS both have very good scores and try to deliver interpretable predictions, but they are optimized for different types of data: N-BEATS is optimized for univariate time series, and TFT is optimized for any type of time series with any types of data.
Okay, so for neural network training we use early stopping, a learning rate scheduler and gradient clipping, and sometimes, but not in this case, we also use a Bayesian optimization framework like Optuna for hyperparameter and network architecture optimization.
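A brief sketch of this kind of training setup, assuming PyTorch Lightning (the import may be `pytorch_lightning` in older versions; the learning rate scheduler itself is typically configured inside the model's `configure_optimizers`):

```python
import lightning.pytorch as pl
from lightning.pytorch.callbacks import EarlyStopping, LearningRateMonitor

trainer = pl.Trainer(
    max_epochs=100,
    gradient_clip_val=0.1,                               # gradient clipping
    callbacks=[
        EarlyStopping(monitor="val_loss", patience=10),  # early stopping on validation loss
        LearningRateMonitor(logging_interval="epoch"),   # logs the scheduler's learning rate
    ],
)
# trainer.fit(model, train_dataloader, val_dataloader)
```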
Okay, so let's move to the results. The exact metric results will be presented later in a table, but here we have a GIF of the N-BEATS performance on the test set. The gray rectangle represents the prediction horizon, so its width is equal to six, because our horizon is equal to six hours. The same GIF was prepared for TFT. The data are noisy, but the predictions sometimes look okay; the model often correctly predicts the future forecast direction, which is good. And here are some predictions for TFT from the test set which are actually very good. They were selected randomly, but luckily we got very good samples; for sure there are also worse examples in this test set.
And here we have a table with different experiments with TFT and N-BEATS and different loss functions for the regression problem, along with typical regression metrics like mean absolute error, mean absolute percentage error, et cetera; the best scores are highlighted in green. As we can see, the Temporal Fusion Transformer with quantile loss scored the best. The mean absolute errors for the naive predictions were around 25, so our neural networks clearly learned something and are significantly better than the naive predictions.
The next question is: can we do better? Of course we can try to optimize hyperparameters or the network architecture, but there is one thing which requires less time and is extremely effective: ensembling. Even models with the same architecture, trained with a different loss function, input length, training hyperparameters or transformations, can contribute to a score improvement when ensembled. And here we have a proof. These are experiments with TFT or N-BEATS differing only in the loss function; for example, we used quantile loss, mean absolute error loss, root mean square error loss, or a new loss function, delayed loss, which is significantly different from these other losses. We end up with over 15 percent mean absolute error improvement over the best single model, and even single models with a low score, like this one, TFT with delayed loss, the worst model from all the experiments with TFT, contribute positively to the score improvement.
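A minimal sketch of the ensembling step itself (my own illustration; the numbers are placeholders): average the forecasts of the differently trained models and score the averaged forecast.

```python
import numpy as np

def ensemble_forecast(model_forecasts):
    """Average the forecasts of several models (e.g. TFT/N-BEATS trained with different losses)."""
    return np.mean(np.stack(model_forecasts), axis=0)

# Toy illustration: three 6-step forecasts from differently trained models.
f1, f2, f3 = np.random.rand(3, 6) * 25
y_true = np.random.rand(6) * 25
mae_ensemble = np.mean(np.abs(y_true - ensemble_forecast([f1, f2, f3])))
```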
As I mentioned, both TFT and N-BEATS aim to give explainable predictions, and here we have the results obtained from TFT. The first plot shows the importance of values in time: the higher the value here, the more important that time point was during the prediction. In this case, the most influential data were measured 168 hours, so seven days, before the prediction time, which suggests that we may have weekly seasonality here. Here we have the feature importances from the variable selection submodel; as expected, the most important feature is nitrogen dioxide, so our target. And here is the decoder variable importance plot for the known future values, like those connected with time, so it shows which features were the most important among the known future inputs. With a small modification of the architecture, we can also see which features were the most important for a specific timestamp. As I mentioned, to obtain such a result we need to slightly change the architecture, so there is no guarantee that the model will work as well as the original one on any type of data, but for some examples it also works. It relies on the same mechanism as the original TFT, so it uses the attention layer for explainability.
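For reference, with the pytorch_forecasting implementation assumed in the earlier sketch, interpretation plots of this kind are typically produced along these lines (treat this as an approximation; exact call signatures vary between library versions):

```python
# `tft` and `val_dataloader` are the trained model and dataloader from the earlier TFT sketch.
raw = tft.predict(val_dataloader, mode="raw", return_x=True)
# Depending on the library version, the raw network output is either `raw` itself or `raw.output`.
interpretation = tft.interpret_output(raw.output, reduction="sum")
tft.plot_interpretation(interpretation)   # attention over time plus encoder/decoder variable importances
```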
Okay, and that's all. Thank you all for your attention.