Conf42 Machine Learning 2024 - Online

Forecasting time-series with Polars and Deno

Abstract

This session presents a novel concept for a time-series processing pipeline for MLOps: Polars for data modelling, a fast task-running framework written in Deno, and WASM tasks in which we do forecasting. The focus is on possible performance improvements in production deployments of time-series models.

Summary

  • I currently lead the data science team at infinitii ai. We mostly work with time series data and time series forecasting. In the software world we deploy with containers, and that is how I started to think about using WASM as a solution for machine learning ops.
  • There is potential in WASM as a runner for your forecasting code for time series data, but there are still some issues that need to be solved before it can really be used in production.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everybody, a warm welcome to my talk here today, titled Forecasting Time Series with Polars and Deno. A bit about myself: I have a background in computer science and software engineering, and I currently lead the data science team at infinitii ai. We mostly work with time series data and time series forecasting.

Let's start with the plan and a few takeaways for this talk. I'm going to start with a story, just to show you an example of a practical time series analysis problem, and this way I'll try to introduce how I even started thinking about Wasm and using Wasm for machine learning ops. Then I'm going to present something which I think is also useful, a daily pattern model. I'm going to use it as an example of something not too complicated, something easy to implement in Rust, that I used for an experiment I'm also going to show you. For this experiment I used the technology that is presented as the last point.

So let's start with the story. The story is about a city in Canada called York. Imagine that you are a water engineer and you work for the city. You collect data from sensors in different locations all around the city, and what you are looking at are water-related sensors: you would like to know the level in the pipe and the velocity, but also the temperature and other environmental variables. The problem you need to solve is that there are quality issues with the data. For some reason a sensor might malfunction, maybe a battery goes down, maybe there is a spike in temperature, maybe there are other data quality problems going on. What you would like to do is detect those changes and hopefully fix them.

Imagine that you are a very good data scientist and you have found a perfect solution, a very good model for that. For simplicity's sake, let's assume that this perfect solution is a linear function. That's what I show here: three different models, three different lines with different slopes. Assuming the solution is simply a linear relationship, you can model that and fairly easily find the slope. What makes this problem special is the time-series-related issues. For time series data we're going to have autocorrelation: especially for sensor data, one observation is strongly dependent on the previous ones. There is also seasonality, especially for environmental variables like temperature, but water-related variables are strongly seasonal too, because consumption differs across the day, the month and the year.

So instead of one model fitted once, like for other machine learning problems, for instance one you can solve with a neural net, where you fit the network once and just get predictions from that one deployed model, here you're going to end up with several models. For each location you're actually going to have a different model, because you need to feed it with different data and thus you're going to model a different slope. Just think about how you'd like to deploy this thing. In the software world, we have containers.
So if you'd like to deploy it independently of your technology stack, whether it's Python or any other language, you will probably end up in some environment where you would like to use containers. Just to show you the solution that we use for other problems: we use containers, as any software engineer would nowadays. The running-models part of this slide shows a Kubernetes environment where we deploy our models. There is time series storage, which in our case is Cassandra, and we have several workers, which are services that process the data and the jobs gathered in a message queue, in our case Kafka.

This pipeline is just fine, as long as you don't need to scale. Just think about it: hundreds of these kinds of pipelines, where you need to refit the same model a hundred times, is still fine, depending on your resources. But if you have thousands, or hundreds of thousands, of sites, this problem becomes difficult to scale with Docker and containers. The main problem related to Docker is latency, and the main reason for this latency is process overhead, because you not only need to run your code, you also need to initialize the whole Docker environment for each model that you run in production.

That's how I started to think about using Wasm as a solution for this kind of problem, because with Wasm the story is different. Let me shortly describe what Wasm is, for those of you who are not very familiar with it. First of all, it is a binary instruction format for a stack-based VM, so it's a way to run what is traditionally server-side code in your browser. It was first introduced by Mozilla and adopted by all major web browsers. Wasm modules are faster and smaller than containers, and you don't have the whole process overhead that I mentioned before. The glue between Wasm and your OS is called WASI; it's the OS interface, and it's already there, so we are now able to run Wasm modules outside of browsers, which makes it an extremely interesting solution for this server-side scaling and containerizing kind of problem. There is also a tool called wasm-pack, which I personally recommend if you would like to start your journey with Wasm; it basically makes your life easier if you want to experiment.

For my experiment with Wasm I thought about something easy to implement in Rust, because I'm a beginner Rust coder, so I thought it would be good if it were something easy. Plus, to be honest, I wanted something easy enough to be able to present during the conference. So I used a model called the daily pattern. The daily pattern is a very simple yet powerful idea: you take time series data, average it by five-minute intervals, and you end up with what you see here as a line which represents the signal during the day. My team usually uses it as a base model, but also on many other occasions, and interestingly it's a very difficult model to beat if you want to predict a signal. As a background for this slide, you have the Rust code that implements this daily pattern model.

Here I introduce Polars. Polars is a library written in Rust that gives you a super powerful interface to data; you may think about it as a better version of pandas, so I think it's worth using in your projects.
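To make the daily pattern idea concrete, here is a minimal sketch of how the grouping and averaging could be expressed with Polars in Rust. It is not the code from the slide: the column names ("ts" and "value") are assumptions, and Polars API names (for example group_by) change between releases, so treat it as an illustration rather than a drop-in implementation.

```rust
use polars::prelude::*;

/// "Daily pattern": average the signal over the 288 five-minute intervals
/// of a day. Column names ("ts" for the timestamp, "value" for the signal)
/// are illustrative assumptions.
fn daily_pattern(df: DataFrame) -> PolarsResult<DataFrame> {
    // minute of the day (0..=1439), derived from the timestamp column
    let minute_of_day = col("ts").dt().hour().cast(DataType::Int32) * lit(60)
        + col("ts").dt().minute().cast(DataType::Int32);

    df.lazy()
        // floor each observation down to the start of its 5-minute bucket
        .with_column((minute_of_day.clone() - minute_of_day % lit(5)).alias("bucket"))
        // one average per bucket, i.e. the line shown on the slide
        .group_by([col("bucket")])
        .agg([col("value").mean().alias("daily_pattern")])
        .collect()
}
```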
Polars is also super fast, because it uses the Arrow data model behind the scenes and because of lazy evaluation, and that makes the library very hard to beat at dataframe processing.

And then comes Wasm. At first I thought it would be easy to do something with Python in Wasm, but actually it wasn't. I struggled a lot with that approach and I failed, partly because of the lack of sockets in Wasm, which makes things like HTTP requests a genuinely complicated problem if you want to run them from Wasm-compiled code. Another problem I had, when I tried something called wasmtime for my experiment, was manual memory allocation. Those of you who are familiar with coding in C probably won't have problems with memory allocation, but most people nowadays, including me, are used to languages like Python, where memory allocation is not something you deal with yourself, and I ended up in dependency hell. The whole experience was really painful for me. At some point I even thought that what I would show during this talk would be lessons learned, my failure, and why Wasm was actually a bad idea for MLOps.

But then I realized that there is something that makes this experiment possible, a runtime that I found and that I recommend for your projects, not only for experiments; I encourage you to experiment with it and see what's possible there. It's a better version of Node called Deno, and the reason it solves all my pains with Wasm is that it natively supports Wasm binaries.

So what I'm going to show you right now is the experiment that I ran on this platform. Here we have the source code of my solution. First I compiled it with wasm-pack, the tool I've already recommended to you; what it does for you, especially if you'd like to run a web project, is create the whole project structure with basically one line of script. Then what I have here is the compiled Wasm binary, and I've created something called a runner (a sketch of the kind of function such a binary exports is shown below). In this JS code I run two things. One is our daily pattern written in Rust and compiled to Wasm, and I run it many times, on different example data sets, a random number of times. I compare it with something I wrote using NumPy: I coded the same daily pattern in NumPy and I just run it from JavaScript.

Here we have the result of running my Rust code compiled to Wasm on randomly selected CSV files of different sizes. At the end of the process I print the overall size of the files processed and the time. This first run maybe is not a good example because it's too many runs, so I'm going to run it a limited number of times, just to show an example. And there it is: about one second for almost 150 megabytes. Then let's go with Python, the same daily pattern in Python. It initializes, which takes some time, and then it goes. It's slower than the previous version: we get 26 seconds for the same number of megabytes.
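For reference, here is a hedged sketch of what the wasm-pack step above can expose. wasm-pack wraps Rust functions with wasm-bindgen so they can be called from JavaScript; the function name and the CSV-in / CSV-out contract below are illustrative assumptions, not the exact interface used in the experiment, and the body uses plain Rust instead of Polars just to keep the example self-contained.

```rust
use wasm_bindgen::prelude::*;

/// Entry point exported to JavaScript/Deno by wasm-pack via wasm-bindgen.
/// The CSV layout ("minute_of_day,value" rows with a header) is an
/// illustrative assumption.
#[wasm_bindgen]
pub fn daily_pattern_from_csv(csv: &str) -> String {
    // running sum and count for each of the 288 five-minute buckets in a day
    let mut sums = [0.0f64; 288];
    let mut counts = [0u64; 288];

    for line in csv.lines().skip(1) {
        let mut fields = line.split(',');
        if let (Some(minute), Some(value)) = (fields.next(), fields.next()) {
            if let (Ok(minute), Ok(value)) =
                (minute.trim().parse::<usize>(), value.trim().parse::<f64>())
            {
                let bucket = (minute / 5).min(287);
                sums[bucket] += value;
                counts[bucket] += 1;
            }
        }
    }

    // emit "bucket_start_minute,average" rows for the buckets that had data
    let mut out = String::from("bucket,avg\n");
    for (i, (sum, count)) in sums.iter().zip(counts.iter()).enumerate() {
        if *count > 0 {
            out.push_str(&format!("{},{}\n", i * 5, sum / *count as f64));
        }
    }
    out
}
```

After a build such as wasm-pack build --target deno (or --target web), the generated module can be imported from the Deno runner like any ES module and the exported function called directly on the CSV contents, which is what makes the timing comparison against the NumPy version straightforward.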
So, coming to conclusions: as we saw, there is potential in Wasm as a runner for your forecasting code for time series data, but there are still some issues that need to be solved before it can really be used in production. In my opinion the first is Python, which is very, very painful: if you would like to do anything with Python in Wasm, it's not there yet, you basically cannot compile your Python code to Wasm. And, unfortunately or fortunately, most of the world uses Python for data science, which makes Wasm a really difficult sell as a data science solution. The second is sockets: as I already said, as long as there is no support for sockets in Wasm, it is really difficult to use. And the third is parallel processing: in my experiment I simplified this, because Deno handled parallel processing for me, but it is actually something you need to solve for yourself, maybe with a Rust library or something else.

So it looks like the Docker world is somehow not in competition with Wasm. Actually, what I read not that long ago is that there are some Wasm-based Kubernetes containers, or at least I saw some comments on that, so it looks like I'm not the only one thinking about using Wasm for production deployments. Hopefully one day I can give another part of this talk about how to actually use it in production.

Okay, I encourage you to stay in touch with me. I have a GitHub account and also a LinkedIn account. If you want to ask me anything, please do, I'm open to any questions, and any feedback is really welcome. Have a nice rest of the conference and see you later.

Piotr Stepinski

CTO @ infinitii ai

Piotr Stepinski's LinkedIn account


