Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everybody. A warm welcome to
my talk here today, titled Forecasting Time Series
with Polars and Deno. Yeah,
about myself. So I have
a background in computer science,
software engineering, and I
currently lead the data science team at Infinity AI.
We mostly work with time series data and time series
forecasting.
Let's start with the plan
and a few takeaways for this talk.
So I'm going to start with a story, just to show
you an example of a practical time
series analysis problem.
That way I'll try to explain
how I even started thinking
about WASM and using WASM for machine learning
ops. Then I'm going to present
something that I think is also useful,
which is a daily pattern model. I'm going to
use it as an example of something not too complicated
that is easy to implement in Rust, and
that I used for the experiment
I'm also going to show you.
For this experiment I used a technology
that I'm going
to present as the last point.
So let's start with the story. The story is
about a city in Canada,
a city called York. Imagine
that you are an engineer, a water engineer,
and you work for the city. You collect
data from sensors
at different locations all around the city.
What you are interested in is water-related
measurements. You would
like to know the level in the pipe
and the velocity, and also temperature and
other environmental variables.
The problem that you need to solve is
that there
are quality issues with the data. For some reason
a sensor might be malfunctioning,
maybe the battery goes down, maybe there is a
spike in temperature,
or maybe there are other
things going on related to data quality.
What you would like to do is detect those changes
and hopefully fix them.
Imagine that you are a very good data scientist and
you found a perfect solution, a very good model for that.
Just for simplicity's sake, let's assume that this
perfect solution is a linear function.
That's what I show
here. You have three different models, three different lines with
different slopes.
If the solution is simply a linear relationship,
you can model that, and we can fairly
easily find the slope. What
makes this problem complicated, and very special, is
actually the time series related issues.
So for time series data,
we're going to have autocorrelation. Especially for
sensor data, one observation
is strongly dependent on the previous ones.
There is also seasonality, especially if we have environmental
variables like temperature.
But water-related
variables are also strongly seasonal, because
consumption is different in different
parts of the day, of the month, and across the
year. So instead
of one model fitted
once, as in
other machine learning problems, for instance one
that you solve with a neural net, where you fit the
neural net once and just get predictions from
this one model deployed
once, here you're going to end up with
several models. For each location we're going to have
a different model, because you
need to feed them with different data, and thus you're going to
model a different slope.
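As a rough illustration of "one model per location", here is a minimal Python sketch that fits a separate least-squares line for each site. The site names and the readings are made up for this example and do not come from the talk, and `fit_line` is a hypothetical helper, not the speaker's code.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical (x, y) readings per site, invented for illustration only.
readings = {
    "site_a": ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]),
    "site_b": ([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 3.0]),
    "site_c": ([1.0, 2.0, 3.0, 4.0], [5.0, 4.0, 3.0, 2.0]),
}

# One independent model per site: same code, different data, different slope.
models = {site: fit_line(xs, ys) for site, (xs, ys) in readings.items()}

for site, (slope, intercept) in models.items():
    print(f"{site}: slope={slope:.2f}, intercept={intercept:.2f}")
```

The fitting code is identical for every site; only the data changes, which is why the number of fitted models grows with the number of locations.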
And just think about
how you'd like to deploy this thing. In the
software world, we have containers. So if
you'd like to deploy it independently of your technology,
whatever kind of stack you use, whether it's Python or any other
language, you're probably going to end up
with some environment where you would like to use containers.
Just to show you the
solutions that we use for other problems:
we use containers, as any
software engineer would nowadays. The
"running models" part of this slide
shows the Kubernetes environment where we deploy
our models. There is time series storage,
which in our case is Cassandra.
We have several workers, which are services
that process the data and
process jobs that
are gathered in a message queue, in our case
Kafka.
And this pipeline is just fine.
It's fine as long as you don't need to scale. Just
think about it: with hundreds of those
templates, where you need to
refit the same model 100 times, it's still fine.
It depends on your resources. But if you have thousands,
if you have hundreds of thousands of sites,
this problem becomes difficult to scale
with Docker and containers because
of two things.
The main thing related to Docker is latency.
And the main reason for this latency
is process overhead, because you not only need
to run your code, you also need
to initialize the whole Docker environment for
each model that you run in production.
And that's how I started to
think about using WASM as a solution
for this kind of problem.
Because with WASM you
have a different story. I'm going to briefly describe
what WASM is for those of you who are not very familiar with it.
First of
all, it is a binary instruction format
for a stack-based VM.
So it's a way to run what is naturally
server-side code in your browser. It was first introduced
by Mozilla and
adopted by all major web browsers.
WASM modules are faster and smaller than
containers, and you don't have this whole
process overhead that I mentioned before. The glue
between WASM and your OS is called WASI.
It's your OS interface, and
it's already there. So now we are able to
run WASI modules outside of browsers,
which makes it
an extremely interesting solution for the server-side
scaling and
containerizing kind of problem.
There is also a tool called wasm-pack, which I
personally recommend if you would like to start your journey with
WASM. It basically
makes your life easier if you want to experiment.
When I started my experiment with WASM,
I thought about
something easy to implement in Rust, because I'm
a beginner Rust coder.
So I thought it would
be good if it was something easy.
Plus, to be honest, I wanted to have something easy enough
to be able to present it during the conference.
So I used a model
called daily pattern. Daily pattern
is a very simple yet
powerful idea:
you take your time series data,
you average it by five-minute intervals,
and you end up with something that you
see here as a line, which represents
the signal during the day. On
my team, we usually use it as a baseline
model, but also
on many other
occasions. And interestingly,
it's a very difficult model to beat
if you want to predict a signal.
In the background of this slide, you have
the Rust code that implements this pattern
model.
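The Rust code from the slide is not part of this transcript. As a stand-in, here is a minimal Python sketch of the idea described above: group readings by five-minute time-of-day slot and average them across days. It uses only the standard library rather than Polars, the function names `daily_pattern` and `predict` are hypothetical, and the sample data is synthetic.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def slot_of(ts, slot_minutes=5):
    """Index of the time-of-day slot that a timestamp falls into."""
    return (ts.hour * 60 + ts.minute) // slot_minutes


def daily_pattern(samples, slot_minutes=5):
    """Average values per time-of-day slot across all days.

    samples: iterable of (datetime, float) pairs.
    Returns a dict {slot_index: mean_value}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in samples:
        slot = slot_of(ts, slot_minutes)
        sums[slot] += value
        counts[slot] += 1
    return {slot: sums[slot] / counts[slot] for slot in sorted(sums)}


def predict(pattern, ts, slot_minutes=5):
    """Forecast for a timestamp is the historical mean of its slot."""
    return pattern.get(slot_of(ts, slot_minutes))


# Synthetic data: two days of one-minute readings; day two is shifted by +10.
start = datetime(2024, 1, 1)
samples = [
    (start + timedelta(minutes=m), float(m % 1440) + 10.0 * (m // 1440))
    for m in range(2 * 1440)
]

pattern = daily_pattern(samples)
print(len(pattern))                                   # number of slots per day
print(predict(pattern, datetime(2024, 1, 3, 0, 2)))   # forecast for 00:02
```

With five-minute slots there are 288 values per day, and a forecast is just a lookup into that table, which is part of why the model is cheap to refit per site.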
Here I introduce Polars. Polars is a library written
in Rust that gives you
a super powerful interface
to data. You may think about it as
a better version of pandas, so I think it's worth
using in your projects. It's
super fast because it
uses the Arrow data model behind the scenes and
because of lazy evaluation. That
makes this library unbeatable
in dataframe processing.
And then comes WASM.
First of all, I thought that it was going to be easy
to do something in Python, in WASM, but actually it wasn't.
I struggled a lot with this kind of
approach and I failed. I failed
also because of the lack of sockets in WASM,
which makes things like HTTP
requests a complicated
problem if you would like to run them in your
WASM-compiled code.
Another problem came up when I tried something
called Wasmtime for
my experiment: I had a problem with manual
memory allocation. Those of you who
are familiar with coding
in C probably won't
have problems with memory allocation,
but most people nowadays, including me,
are used to languages like Python,
where memory allocation isn't an issue.
I also ended up in dependency hell, and this
whole experience was really painful
for me. At some point I
even thought that what I was going to show during
this talk would be lessons
learned, my failure, and how it was actually
a bad idea to use WASM for MLOps.
But all of a sudden I realized that there is something that makes
this experiment possible: an environment,
a framework that I found, and that I recommend for
your projects, maybe not only for experiments.
I encourage you to experiment with this framework and
see what's
possible there. It's a better version of Node called Deno,
and the reason it solves all my pains with WASM is that
it natively supports WASM binaries.
So what I'm
going to show you right now is an experiment
that I ran on this
platform.
So here we have the source code of my solution.
First I compiled it with wasm-pack.
wasm-pack is the tool that I've already recommended to
you. What it does for you,
especially if you'd like to run
a web project, is create the
whole structure for you, basically with one line
of script. Then what I
have here is the compiled WASM
binary, and I've created something called a runner.
In
this JS code I run two things.
One is our daily pattern written in Rust and
compiled to WASM. I run it many
times for different data sets.
These are
example data sets, and I run it a
random number of times. And I'm going to
compare it with something
that I wrote using NumPy.
So I coded
the same daily pattern in NumPy and I just
run it from JavaScript. Here we have the result
of running my Rust
compiled to WASM for
randomly selected CSV files of different sizes.
At the end of this process I print the
overall size of the files
processed and the time. This one maybe is
not a good example because it's too many runs,
so I'm going to run it just a limited number of times
to show an example. And it's one second,
almost 150 megabytes.
And then let's go with Python.
The same daily pattern
in Python: it
initializes, it takes
some time, and it goes.
And yeah, it's slightly
slower than the previous version:
we have 26 seconds and the
same number of megabytes.
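Taking the two figures quoted in the demo at face value, the difference can be put into numbers with a short calculation. This is only arithmetic on one informal run, and it assumes the "almost 150 megabytes" is rounded to 150 MB.

```python
# Figures quoted in the demo; the size is an approximation ("almost 150 MB").
size_mb = 150.0
wasm_seconds = 1.0      # Rust compiled to WASM, run from Deno
python_seconds = 26.0   # NumPy version, launched from JavaScript

wasm_throughput = size_mb / wasm_seconds
python_throughput = size_mb / python_seconds
speedup = python_seconds / wasm_seconds

print(f"WASM:    {wasm_throughput:.1f} MB/s")    # 150.0 MB/s
print(f"Python:  {python_throughput:.1f} MB/s")  # 5.8 MB/s
print(f"Speedup: {speedup:.0f}x")                # 26x
```

The Python timing includes interpreter start-up, as mentioned in the demo, so this compares end-to-end wall time rather than the numerical code alone.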
So, coming
to conclusions: as we saw,
there is potential in
WASM as a runner
for your forecasting code for time
series data,
but there are still some
issues that need to be solved before that can
really happen and it can really be used in production.
In my opinion, the first is Python,
which is very, very painful if you
would like to do anything with Python in WASM.
It's not there yet:
you basically cannot compile your Python code
into WASM. And unfortunately,
or fortunately,
most of the world uses Python for data science, so that makes it
really difficult as a solution
for data science. Second, sockets.
As I already said,
as long as there is no support
for sockets in WASM, that also
makes it really, really difficult to
use. And third,
parallel processing. In my experiment I
simplified it, because Deno did the parallel
processing for me. But actually this is something
that you need to solve for yourself, maybe using a Rust
library or whatever else.
So yeah,
it looks like the Docker world is somehow
not in competition with
WASM. Actually, what I read
not that long ago is that there
are some WASM-based Kubernetes
containers, or
at least I saw some comments on that. So it
looks like I'm not
the only one thinking
about making use of WASM
for production deployments.
Hopefully one day I can give another
part of this talk about how
to actually use it in production.
Okay, I encourage you to stay in touch with me.
I have a GitHub account and also a LinkedIn account.
If you want to ask me anything, please ask,
I'm open. Any feedback you have is also really
welcome. And yeah,
have a nice rest of the conference and see you
later.