Conf42 Python 2025 - Online

- premiere 5PM GMT

Introducing FireDucks: A Multithreaded DataFrame Library with JIT Compilation


Abstract

We introduce a couple of frequently occurring, intricate performance issues in large-scale data processing with pandas, along with FireDucks, a compiler-accelerated, high-performance DataFrame library that auto-detects and optimizes those issues without any manual effort.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone. Welcome to my talk. Today I'll be talking about a DataFrame library named FireDucks that we are working on at our research lab. We will explore the motivation behind developing FireDucks by stating some of the key challenges with pandas. Of course, we'll also talk about the tips, tricks, and best practices when dealing with large-scale data processing in pandas; what FireDucks is and what it offers; the optimization strategies available with it, with some live demos; and the resources available on FireDucks.

Let's have a quick introduction. My name is Sourav. I'm currently working as a research engineer at NEC Corporation, where I have been for the last 11 years, doing different work in the fields of high-performance computing, distributed processing, and AI/machine learning. We have a long history of working with vector supercomputers, starting with the SX series, and we have worked with various legacy applications on this vector architecture, from tsunami prediction to earthquake simulation. My team leader, Stacy Sarkar, first thought: given our expertise in HPC, compilers, and distributed processing, how about creating something that uses compiler technology (we have a keen interest in compilers) and at the same time does something with Python for the data science community? That is when we thought: data scientists often face challenges with pandas computation, so let's create a library that looks exactly the same as pandas but internally has a compiler doing a lot of magic, so that a pandas programmer can experience much faster data processing when using FireDucks.

So please stay with me till the end of this presentation. I'll be talking about various best practices, give you a walkthrough of FireDucks, its offerings, the motivation, and a comparison with the other DataFrame libraries that are available, and at the very end I'll cover some frequently asked questions, since this is an online event. Of course, if you have any questions, you can get in touch with me over LinkedIn or other social media; I'll be happy to collaborate and help you in whatever way I can.

So this is the flow. I believe all of you can relate to what we do as data scientists: we collect data from various sources, and data in its raw format is not readily digestible by AI and machine-learning algorithms. So we clean the data and make it more meaningful, so that we can extract better features from it and build a better model. The better the data, the better the model, as the old saying we have all learned goes. But the important thing to understand about this entire cycle is that, although training and deploying a model take time, the most time-consuming part is the preparation of the data. According to research from Anaconda, data loading, cleaning, and visualization occupy almost 75 percent of a data scientist's effort. So if you have a lot of questions in mind and want to extract many features, you need to spend a lot of time during the first few months on efficient data exploration.
That is why a lot of research has been happening in this field; it is one of the most important areas where we need optimization.

Now, with that, let me say a few things about pandas. I believe all of you are from a data science background, so you must have worked with pandas. It is one of the most popular libraries even today, with the second-most downloads after NumPy. But it comes with some problems. The most common one is that it's not multithreaded: if you have a good system with many cores available, you cannot leverage them, so your program will be slow, and you have to wait a long time for your result when working with large data. Longer waiting times and slow execution also mean higher cloud costs if you run on the cloud, and of course longer execution emits more carbon dioxide. These are the issues associated with pandas; it is ideally not suited for large-scale execution.

Beyond those challenges, the key point is that there are various ways of writing the same program in pandas. If we ourselves can identify the best approaches and best practices and take care of them when writing our program, that is good enough, because by default pandas doesn't offer you optimization; it is eager in nature, so whatever you ask it to do, it does. If you have written well-optimized code, you are in good shape, but if your program has a performance issue, you may pay a large performance cost as your data grows in size or your program grows in complexity. So let's talk about some crucial performance issues that appear in pandas programs, how to take care of them, and what the automation possibilities are for those cases using FireDucks and other libraries.

Going forward, let us understand, one by one, the best practices and the challenges associated with large-scale data analysis programs. With that, here is a first quiz: can you identify which one is the better code, the left one or the right one? Since I'm presenting online, I cannot take responses from you, but when I asked this question at an event recently, around 90 percent answered that the right one is the better-performing query, and for good reason: as you can see, it is written as a chain, and that is ideally the best way of writing this kind of query in pandas or a pandas-like library.

Why? Let's understand it with an illustration. Say this is the data I have loaded from a file. You can see there are some duplicates; once I remove them, the data might be reduced to half. Then I sort the result by the B column, select the top two rows, and that's it. Now let's attach a memory footprint to the data. Say the loaded data is 16 GB; after removing duplicates it is 8 GB; the sorted copy is another 8 GB; and so on, so by the end something like 32 GB is in play. And as long as foo is alive, all the intermediate references like df2 and df3 stay alive: Python cannot make them ready for garbage collection because live references to them still exist. That is the problem with not writing a chained expression.

What happens when you go for the chained expression? The data is loaded and we remove the duplicates, but when the sorting is applied, we only need the input coming from the drop_duplicates side, so the original data is no longer needed; it goes out of reference and Python can garbage-collect it as needed. In the same way, when I take the top two, I no longer need the drop_duplicates result, only the sorted result. This way the intermediate data generated during the steps of a long expression can be freed before the end of the program, and that definitely helps when you are dealing with large-scale data. So that is one of the best practices: write your expressions as chained expressions, which is of course possible in plain pandas. As much as possible, go for the chained expression so that you can avoid this kind of memory and performance issue, as in the sketch below.
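To make the two styles concrete, here is a minimal sketch of the quiz, assuming a hypothetical CSV file and a hypothetical sort column b (neither is named in the talk):

```python
import pandas as pd

def foo_unchained(fname):
    df = pd.read_csv(fname)        # say ~16 GB, stays referenced until foo returns
    df2 = df.drop_duplicates()     # another ~8 GB kept alive
    df3 = df2.sort_values("b")     # another ~8 GB kept alive
    return df3.head(2)             # every intermediate is still referenced here

def foo_chained(fname):
    # each intermediate loses its last reference as soon as the next step
    # consumes it, so Python can garbage-collect it during the chain
    return (
        pd.read_csv(fname)
        .drop_duplicates()
        .sort_values("b")
        .head(2)
    )
```

Both functions return the same result; the difference is purely in how long the intermediate DataFrames stay referenced.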
Now let's consider the second quiz. It solves a similar problem: find the top five values of the A column after sorting by the B column. Can you guess which is the better code, the top one or the bottom one? When I asked this at a recent event, the answers were mixed: 40 to 50 percent said the top one, and 50 to 60 percent said the bottom one. The majority is right: it is the bottom one. If you also picked the bottom one, give yourself a pat on the back.

Now let's understand why. In reality your data may have many columns, not only A and B. If you sort the entire data, then not only the target columns A and B but all the remaining columns, C to J, get sorted as well, and they are never used in the final result, because you are only interested in the values of the A column after the result is sorted by B. So the C to J columns get sorted along with the B column, occupy memory, and are never needed: a waste of computation time and memory. What you can do instead is create a view of the data of interest. I'm interested in the A and B columns only, so why not create a view of just those columns and perform the same steps on it? That definitely reduces a lot of computation, because it no longer matters how many columns the data has: you know you only care about A and B, so don't bother with the rest; create a view and work on that, as sketched below. This kind of optimization is called projection pushdown, and it is very important to take care of it when dealing with large-scale data analysis.
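A minimal sketch of the projection pushdown idea, with hypothetical columns a through j:

```python
import pandas as pd
import numpy as np

# toy data: many columns, but the query only needs "a" and "b"
df = pd.DataFrame(np.random.rand(1_000_000, 10), columns=list("abcdefghij"))

# without projection pushdown: all ten columns get rearranged by the sort
top5_slow = df.sort_values("b").head(5)["a"]

# with projection pushdown: narrow to the needed columns first, then sort
top5_fast = df[["a", "b"]].sort_values("b").head(5)["a"]
```

Both versions give the same answer; the second simply never touches columns c through j.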
Now let's consider another example. Before going to the quiz, let me show what it is trying to do. Say I have an employee table and a country table, and I want to merge them into a larger joined result to get the country-wise count of male employees. How can it be done? One way: just join the two tables, then filter the male category from the resulting data, and perform a groupby count. Looking at this illustration, can you identify the performance bottleneck? Yes: the issue is the expensive merge operation. Although we are only interested in the male-category employees, we performed the merge on the entire employee table. What could we do instead? We could first filter only the male category from the employee table and then do the merge, followed by the groupby count. That brings a big benefit when, say, the gender ratio is around 50-50, or when there are very few male employees, because the filter can reduce the data to a great extent and the merge then takes far less time. This is called predicate pushdown in compiler terminology. Again, it is one of the most important optimization strategies we should take care of, and we'll see with a demo how much it matters for large-scale data analysis in the upcoming slides.

Now let's put these two together into a rule of thumb. To see the importance of execution order, take this example: sort by the A column, filter based on some condition on the B column, select the E column, and take the top two. To illustrate, consider this data. If I sort it, with this color code I get the sorted result in yellow, red, green, and blue. Say B equals 1 for the darker shade and B equals 2 for the lighter shade; if I filter, say for B equals 2, only the lighter-shaded part remains. Then I select only the E column and take the top two. Now, looking carefully at this entire data flow, can you identify what is wrong in this journey? Of course you can: although I am interested in only the A, B, and E columns, I have also dragged along the rest of the columns, like C and D, which exist in the data but are never referenced. That is a waste of memory and computation time.

Instead, we can first recognize that this query only needs the A, B, and E columns, so let's create a view of those columns, reducing the scope of the data in the horizontal direction by applying projection pushdown. Then, rather than sorting this entire data, notice that we are only interested in the lighter-shaded part, so before sorting we can reduce the data further by applying the filter. Once the filtration is done, the reduced data is exactly the part we want to sort. Now apply the sorting, select the E column, and take the top two. By applying these pushdowns you optimize the data flow, and this execution order is very important: following it in your large-scale data analysis will definitely save a lot of execution time and memory. Both ideas are sketched below.
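Here is a sketch of both ideas; the table and column names (gender, country_id, the filter value b == 2, and so on) are illustrative assumptions, since the talk only describes the tables loosely:

```python
import pandas as pd

employee = pd.read_csv("employee.csv")  # assumed columns: gender, country_id, ...
country = pd.read_csv("country.csv")    # assumed columns: country_id, country_name

# naive: merge the full employee table, then filter
merged = employee.merge(country, on="country_id")
naive = merged[merged["gender"] == "male"].groupby("country_name").size()

# predicate pushdown: filter first, so the expensive merge sees fewer rows
males = employee[employee["gender"] == "male"]
pushed = males.merge(country, on="country_id").groupby("country_name").size()

# the combined execution order from the rule-of-thumb example:
df = pd.read_csv("data.csv")            # assumed columns a, b, c, d, e
slow = df.sort_values("a")              # sorts every column of every row
slow = slow[slow["b"] == 2]["e"].head(2)

fast = (
    df[["a", "b", "e"]]                 # projection pushdown
    .loc[lambda d: d["b"] == 2]         # predicate pushdown (filter before sort)
    .sort_values("a")["e"]
    .head(2)
)
```

In both pairs, only the order of operations changes, not the result.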
We'll look at this with a real example in the upcoming slides. Now, with that, let's understand what FireDucks is and why you should explore this library. First of all, FireDucks is a high-performance, compiler-accelerated DataFrame library. Why is it called that? "Ducks" is just an animal name, because DataFrame libraries like pandas and Polars are all named after animals. We first thought of "dacs", for DataFrame accelerator, but that name was already taken at the time, so we thought of naming it "ducks". It has no link with DuckDB, for your information. The "Fire" part does have a meaning: it comes from Flexible IR Engine, where IR stands for intermediate representation, the backbone of the FireDucks JIT compilation.

So what does FireDucks do? It uses lazy execution, unlike pandas, where you ask and it immediately does. Instead, you ask and it creates an instruction. For example, you ask it to sort; instead of sorting, it just records: you asked me to sort this data on this column, perfect. Then you ask it to filter based on a condition; it just creates an instruction, without any computation, followed by a projection instruction and a slicing instruction. Only instructions are generated. If you measure the execution time of this part, you will see it completes in the blink of an eye, because it does nothing but create instructions.

Now you might ask when these instructions get executed. They are triggered automatically by an action on the result: when you want the result printed, when you want it dumped to a file with a to_csv or to_parquet kind of operation, or when you apply some aggregation to the result, for example finding a maximum or minimum value. That is when the recorded expressions are automatically triggered and executed, with compilation-related optimization.

Say I want to print the result. At that moment the compiler is activated and works out the instructions behind the result: find the top two based on the E column, after a filter on the B column, after a sort on the A column. From this the compiler understands: I have this data, but only the A, B, and E columns are of interest, so it adds a new instruction to project only those target columns before the other operations, reducing the data in the horizontal direction first. Then it sees a sort followed by a filter, meaning only a selective set of rows actually needs sorting, so it can interchange the instructions: filter first, then sort, with the rest of the operations unchanged. This is how the compiler optimizes the generated IR. Once the optimized IR is ready, it can be translated; if you render it as a pandas program, for example, it becomes: project A, B, and E; filter on the condition; perform the sorting; project the column; and take the top two.
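A minimal sketch of the lazy model just described, reusing the assumed file and columns from the running example; the timing pattern is the point, not the exact numbers:

```python
import time
import fireducks.pandas as pd  # the drop-in import described in the talk

df = pd.read_csv("data.csv")   # assumed columns a, b, e

t0 = time.time()
res = df[["a", "b", "e"]]
res = res[res["b"] == 2].sort_values("a")["e"].head(2)
print(f"plan built in {time.time() - t0:.4f} s")  # near zero: only IR instructions exist

t0 = time.time()
print(res)  # an action: the compiler optimizes the IR here and the kernels run
print(f"executed in {time.time() - t0:.4f} s")
```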
And the target need not be pandas: it could be a C API call or any other library call of your interest, because this IR is quite flexible and can be translated to any other DataFrame library or method call. I rendered it as pandas just to make it easier to understand. That is the idea: write one query, and with this design it can execute on any target of your choice. We have developed a multicore kernel in C++ that uses multithreading and other high-performance techniques, like effective utilization of the caches and vectorization, so that the kernel operations themselves, just the filter, just the join, just the groupby, are faster thanks to careful optimization and careful implementation of the algorithms. I'll talk about this in detail in the benchmarking slides.

So why should you be interested in exploring this library? First, pandas doesn't offer you optimization, but FireDucks is lazy, so it can do just-in-time optimization for you. If you have a good system with multiple cores available, it can parallelize the workload across multiple threads, which results in much faster data analysis. If you work in the cloud, you will see lower cloud costs, and since it runs much faster, it also contributes less carbon dioxide. But the most important thing: if you know pandas, that's enough; you don't need to learn a new DataFrame library. At a time when technologies like AI and LLMs are progressing so fast, it's better to spend your time learning a new technology than a new library, right? Most of us start from pandas anyway, so knowing pandas is sufficient, and the challenges associated with pandas are addressed by FireDucks automatically.

Now, does it handle any kind of pandas application? Yes, it handles any kind of pandas application, and not only pandas applications but also the libraries that expect you to provide pandas data, for example seaborn, matplotlib, scikit-learn, imbalanced-learn, and category_encoders. To any library that expects a pandas DataFrame as input, you can provide a FireDucks DataFrame and it works seamlessly. No extra learning, no code modification: you can experience much better performance just by switching from pandas to FireDucks.

Let's look at a quick demo. In this example I use Bitcoin data from the last three to five years and compute a moving average: load the data, create a window, perform the rolling-mean calculation, and plot the result. The code is the same on the left and the right; the only difference is the import. The left side imports pandas and the right side imports fireducks.pandas. That is the only change, but if you look at the execution time, pandas takes around 4.06 seconds, whereas FireDucks completes it within 275 milliseconds. For this simple code, with no modification, that is about a 15x speedup. I'll show another demo in the upcoming slides; a sketch of this one follows.
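A sketch approximating the demo; the file name, column names, and window size are assumptions, since the talk only shows the slide:

```python
# the only line that changes between the two runs:
#   import pandas as pd           # ~4.06 s in the talk's demo
import fireducks.pandas as pd     # ~275 ms in the talk's demo
import matplotlib.pyplot as plt

btc = pd.read_csv("bitcoin_prices.csv", parse_dates=["date"])  # assumed schema
btc = btc.sort_values("date").set_index("date")

# 30-day moving average of the closing price
btc["ma30"] = btc["close"].rolling(window=30).mean()

btc[["close", "ma30"]].plot()
plt.show()
```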
Okay, with that, let me explain the usage of FireDucks. First of all, FireDucks is currently available for Linux only, so if you're a Windows or Mac user, at this moment you can consider using Google Colab or another platform that supports Linux. On Windows you can definitely try WSL, because FireDucks works under WSL. The supported Python versions are 3.9 to 3.12. If you satisfy these two conditions, you can install it using pip, and then, instead of import pandas, write import fireducks.pandas; that is sufficient to execute your existing program with FireDucks and have it optimized.

Now, how about zero code modification? If your program imports pandas but also imports other modules, those modules might in turn import pandas as well. Do you need to replace all of those imports yourself? No. There is a monkey-patching mechanism: when executing the program with the python command, just pass the -m option followed by fireducks.pandas. It automatically replaces pandas with fireducks.pandas and performs the optimization for you without any code modification. For notebook-like platforms it is the same story: the rest of your notebook stays unchanged; just load the FireDucks extension module above your import pandas line, and pandas is automatically replaced by fireducks.pandas so you get the benefit. A quick summary of these options follows.
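The zero-code-change invocation below follows the talk's description; the FireDucks documentation also describes a dedicated import-hook module, so confirm the exact module path there before relying on it:

```bash
# Linux only for now; Python 3.9-3.12
pip install fireducks

# option 1: explicit import in your code
#   import fireducks.pandas as pd

# option 2: zero code modification via the hook described in the talk
python -m fireducks.pandas your_program.py

# option 3: notebooks, add this above your pandas import
#   %load_ext fireducks.pandas
```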
Now let's talk about the challenge of seamless integration with pandas. FireDucks is of course what I'm talking about, but there are other high-performance pandas alternatives as well, like DuckDB, Polars, and Dask. These libraries attack the same problem: pandas is slow, and they each have their own version of optimization to address that. But the major challenge is compatibility with pandas. If you want to use those libraries, you have to learn them first, because they are either not compatible or not fully compatible with pandas. Pandas is full of features, and a given library might not offer some feature, so you may need to convert back and forth using from_pandas and to_pandas mechanisms, and only once your complete program has been migrated can you compare performance or check the results. Those are the associated costs, so there is always a cost-versus-performance trade-off in your mind: whether to take that effort for that optimization. You have options like Modin, Dask, and Vaex, but they are ideally for multi-node environments. There is Polars for the single-node environment, which is super fast, but it is not very compatible with pandas; to use Polars you probably need to learn the library in the first place. That is where FireDucks can help: it is highly compatible with pandas, with no code modification and no new learning associated with it, and you get single-node speedups. If you're working on data that fits in memory, you can work with FireDucks. And it provides optimization: even if you have big data, you may not be interested in every part of it, and FireDucks can optimize for you, selectively loading only the target columns and rows, so it does a lot of justice to a limited set of resources, memory and cores.

Now let's see how seamless the integration with pandas is. Here is a demo notebook; you can get the actual notebook from this link. What is it trying to do? It loads Parquet data of NYC parking violations. After the data loading is done, it performs a query to see which parking violation is most commonly committed by vehicles from the various US states; to execute the query, it does some groupby, head, and sorting kinds of steps. The basic code is written entirely in pandas, so there is nothing new to identify. I simply executed the program with python followed by the program name: the data loading takes around 2.4 seconds and the query takes around 2.8 seconds, so overall around 5.3 seconds. To execute the same program with FireDucks, there is no code change, just add -m fireducks.pandas, and it runs almost eight times faster: both the data loading and the query processing get much faster when switching from pandas to FireDucks, just by adding a program option, nothing else. That is how easy it is to integrate: once you install it, it can be integrated into any pandas application and you can experience the benefit. Definitely try the notebook for further exploration; a sketch of the query pattern follows.
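A sketch of the query pattern from the notebook, with assumed file and column names (the real notebook is linked in the talk):

```python
import fireducks.pandas as pd  # swap with plain pandas to compare timings

df = pd.read_parquet("parking_violations.parquet")  # assumed file name

# most common violation per registration state: count, sort, take the top per state
top_violations = (
    df.groupby(["Registration State", "Violation Type"])  # assumed column names
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
    .groupby("Registration State")
    .head(1)
)
print(top_violations)  # an action: triggers FireDucks' optimized execution
```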
Now let's consider the optimizing features available in FireDucks. This is the design of FireDucks: it looks like a pandas program, because we mirror all the user-facing APIs with a pandas-like interface, but underneath everything is executed by native kernels once the instructions are optimized by the compiler. With pandas, you write the program and it executes right away; with FireDucks, as I explained, instructions are created, those instructions are optimized by the compiler, and the optimized instructions are dispatched to a backend that executes them on the multiple cores available in your system. All the available cores are used effectively, the system cache is used effectively, and vectorization is used effectively; because of this, you can experience a much faster speedup when using FireDucks.

So what do these two layers offer? First, there is a compiler, so you can expect compiler-specific optimizations, like common subexpression elimination and dead code elimination. You can expect domain-specific optimizations, like the predicate pushdown and projection pushdown I have talked about so far. You can also expect some pandas-specific tuning, which I'll cover in upcoming slides. What about the second layer, the backend? The backend is multithreaded, first of all, so if you have a good system with multiple threads available, you can get much more throughput. It makes good use of memory, since the major data structures are backed by Apache Arrow, giving compact memory usage as well. And all the kernels, join, groupby, filter, dropna, everything, are written from scratch with our own patented algorithms, so the kernel operations themselves are much faster compared to pandas or other high-performance pandas alternatives. We'll see that in an upcoming slide for sure.

So let's understand the compiler-specific optimizations. I took this example from a Kaggle notebook I noticed recently. Here df["time"] is a column of string type, and the author wants to group by year and month to find the average sales. So what does he do? He converts the string column to datetime once to extract the year field, and again to extract the month field. But as you can understand, to_datetime is itself a costly operation, because it parses a string to convert it to a datetime; doing the same operation on the same data more than once is going to cost you as your data grows in size. Ideally, you compute it once, put it in a placeholder, and use it wherever you need it. This is basic programming know-how, but because FireDucks uses a compiler, it can identify this kind of common expression: it creates an instruction for the first conversion, and when it meets the same conversion again it reuses the existing instruction wherever it is referred, instead of creating a new one. This common subexpression elimination (CSE) benefit you definitely get from FireDucks.

Another one is dead code elimination. For example, take this function: it merges table x with table y, then sorts the merged result, then performs a groupby, but the groupby uses the merged variable, not the sorted variable. In the entire function, the sort line is computed but never referenced. With the eager execution model of pandas, it will be executed, because you asked it to sort. But FireDucks can identify: you asked me to sort, fine, I create an instruction, but you never used it, so I will not execute that instruction; I eliminate that dead code from my execution graph. This kind of dead code elimination is possible when you switch from pandas to FireDucks. Both patterns are sketched below. If you want to explore more, you can definitely read the article that covers these compiler-specific optimizations in detail.
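Minimal sketches of both patterns; the column names (time, sales, key, value) are stand-ins for whatever the Kaggle notebook actually used:

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2024-01-05", "2024-01-20", "2024-02-11"],  # string column, as in the notebook
    "sales": [100.0, 150.0, 90.0],
})

# the costly pattern: the same to_datetime parse is evaluated twice
monthly = df.groupby([
    pd.to_datetime(df["time"]).dt.year,
    pd.to_datetime(df["time"]).dt.month,
])["sales"].mean()

# the manual fix: hoist the common subexpression into a placeholder
ts = pd.to_datetime(df["time"])
monthly = df.groupby([ts.dt.year, ts.dt.month])["sales"].mean()
# FireDucks' compiler performs this CSE rewrite automatically on its IR

# dead code elimination: eager pandas pays for the sort even though the
# result is never used; FireDucks drops the unused instruction instead
def report(x, y):
    merged = x.merge(y, on="key")
    sorted_df = merged.sort_values("value")  # computed, but never referenced again
    return merged.groupby("key").sum()
```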
Now let's talk about the domain-specific optimizations we discussed so far, projection pushdown and predicate pushdown. For that, I have taken a sample query from the TPC-H benchmark: query Q3, the top 10 unshipped orders with the highest value. Originally it is written in SQL; let's convert it to pandas and see what it does.

If you look at the query carefully, it doesn't follow any of the best practices we discussed, like reducing the data in the row direction or the column direction. It does nothing of the sort: it loads the entire data from three tables, customer, orders, and lineitem; after loading, it joins them into one bigger table, and then performs the rest of the operations, the filtration, the groupby, and so on, as the query demands. When we execute this query with pandas, it takes around 203 seconds and the memory consumption is around 60 GB at scale factor 10. But when we execute the same program using FireDucks, just by adding the plugin, it finishes within 4.24 seconds and the memory consumption is around 3.3 GB. Although we wrote it without any consideration of optimization, the FireDucks compiler does it for you: it figures out the reduction of the data in the horizontal and vertical directions and applies it automatically, which is why you see such a large speedup for a program that takes no care of optimization.

To understand that, let's manually optimize this program. What are the best practices? First, instead of operating on the entire data, reduce it in the columnar direction. By carefully observing the query, you can see we only need two columns from the customer table, although it has eight; only four columns from lineitem, although it has sixteen; and only four columns from orders, although it has nine. So instead of loading the entire data, load only those selective columns. Second, we only need particular rows from these tables because of the filter conditions, so filter each target table on its condition first. Only then join the tables and do the groupby, the aggregation, and the sorting. When I execute this manually optimized query in pandas, it takes 13 seconds: 203 seconds comes down to 13 seconds, with memory reduced from 60 GB to 5.5 GB, just by using pandas effectively. But look at FireDucks: whether the manual optimization is present (opt_q3) or absent (q3) doesn't matter; it executes with the same speed and the same memory consumption. If you can do the manual optimization, fair enough; when you cannot, FireDucks does it for you, so you can definitely rely on FireDucks for such optimization. A condensed sketch of the manual version follows.
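A condensed sketch of the manually optimized Q3, using the standard TPC-H column names; the filter date and market segment are the usual Q3 defaults, not values taken from the slide:

```python
import pandas as pd

DATE = pd.Timestamp("1995-03-15")

# projection pushdown: load only 2 of customer's 8 columns,
# 4 of orders' 9 columns, and 4 of lineitem's 16 columns
customer = pd.read_parquet("customer.parquet", columns=["c_custkey", "c_mktsegment"])
orders = pd.read_parquet(
    "orders.parquet", columns=["o_orderkey", "o_custkey", "o_orderdate", "o_shippriority"])
lineitem = pd.read_parquet(
    "lineitem.parquet", columns=["l_orderkey", "l_extendedprice", "l_discount", "l_shipdate"])

# predicate pushdown: filter each table before the expensive joins
customer = customer[customer["c_mktsegment"] == "BUILDING"]
orders = orders[orders["o_orderdate"] < DATE]
lineitem = lineitem[lineitem["l_shipdate"] > DATE]

# only now join, aggregate, and sort
q3 = (
    customer.merge(orders, left_on="c_custkey", right_on="o_custkey")
    .merge(lineitem, left_on="o_orderkey", right_on="l_orderkey")
    .assign(revenue=lambda d: d["l_extendedprice"] * (1 - d["l_discount"]))
    .groupby(["l_orderkey", "o_orderdate", "o_shippriority"], as_index=False)["revenue"]
    .sum()
    .sort_values("revenue", ascending=False)
    .head(10)
)
```

With FireDucks, the unoptimized version that loads whole tables and joins first runs just as fast, because the compiler derives these pushdowns itself.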
Now let's understand one kind of parameter tuning: first, what parameter tuning you can do in pandas, and then how FireDucks takes care of it for you. Consider this query: group by the department column of the employee table, calculate the average salary, and sort the salary in descending order. What happens? It creates the groups, admin, finance, corporate, and sales, and computes the average of each group; that is the groupby operation. But there is a hidden cost associated with it. The groupby by default has a parameter called sort, whose value is True, and its meaning is that the result of the groupby is sorted by the key column, here the department. So it will not only compute the groupby aggregation but also sort the result by the key: admin comes first, followed by corporate, finance, IT, and sales, sorted by department. In this query, though, you don't need that sorting to happen; you want the result sorted by the salary, the value column, in descending order. And again, because you asked pandas to do so, it does it: it sorts the result in descending order when it encounters the sort_values. So the key-sorting step takes place because of this default groupby parameter, but it is a redundant step that can be avoided; skipping it does not affect the result at all. How do we avoid it? Simply pass sort=False for this case, and pandas will skip that step and give much better performance.

Now, this toy data has very low cardinality, only five groups, but when your data has very high cardinality, with many groups present, you can expect much more speedup. When I tried 100 million samples with high cardinality, without sort=False the query took 50 seconds, and with sort=False it took 30 seconds; the key-sort alone contributed more than 50 percent of the query's cost. This is what I call parameter tuning, and because of the compiler in FireDucks, it can detect that after the groupby you want to sort the result by the value, so the default sorting in the groupby is not needed, and FireDucks automatically imputes such a parameter, setting it to False, so you avoid this kind of hidden performance cost in pandas. That is again one of the offerings when you use FireDucks. A sketch follows.
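A minimal sketch of the tuning, with a toy employee table (names and salaries are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Admin", "Finance", "Sales", "Admin", "Corporate"],
    "salary": [70_000, 50_000, 90_000, 65_000, 52_000, 80_000],
})

# default groupby(sort=True): also sorts the groups by the key column,
# a redundant step here because we re-sort by the aggregated value anyway
avg = df.groupby("department")["salary"].mean().sort_values(ascending=False)

# tuned: skip the hidden key-sort; the final result is identical
avg = df.groupby("department", sort=False)["salary"].mean().sort_values(ascending=False)
```

FireDucks' compiler detects that a sort_values follows the groupby and supplies sort=False on its own.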
Now, with that, let me talk about some benchmarks. There is a very popular benchmark called db-benchmark, with around ten queries of different complexity over data of different cardinalities and sizes, covering the join and groupby operations. Looking at the comparison table, pandas is of course slow, which is why it sits near the very bottom, but among the other high-performance alternatives to pandas, like Polars, Dask, and DuckDB, you can see that FireDucks outperforms the other libraries as well and sits around the top.

For a general overview of these popular libraries, I keep this kind of picture in mind. In terms of pandas compatibility and single-node performance, DuckDB and Polars are the libraries that do well on single-node performance, whereas Spark, Dask, and Modin work better in multi-node computation environments, because they target multi-node setups. In terms of pandas compatibility, DuckDB has very little, because it exposes a SQL API, and Polars is not that compatible either, although its interface somewhat resembles pandas. FireDucks and Modin are the libraries that are fully compatible with pandas. And of course RAPIDS has its own cuDF for GPUs; that is a different story, but RAPIDS is again compatible with pandas and provides good performance, just on the GPU platform, so it stands apart from all of these. At this moment Polars also has its own GPU version available, and the same is in progress for FireDucks.

Now for TPC-H, another popular benchmark: in my demo I used query three only, but there are actually 22 queries. Against pandas and other high-performance alternatives like Polars and Modin, we saw that Modin did not do that well, performing somewhat similarly to pandas, but Polars was quite fast, around 57 times faster on the processing part. FireDucks, however, outperformed with an average speedup of around 125 times, without modifying a single piece of the existing pandas code. And if you compare the scalability of the libraries on the TPC-H framework, Polars, DuckDB, and FireDucks are the ones that offer multithreading, whereas pandas behaves the same even as we increase the number of cores; there is no visible change in pandas performance whether we include or exclude the I/O. DuckDB and FireDucks show quite good scalability as the number of threads increases during execution. Polars also scales, but only to an extent: at least for the TPC-H benchmark it was fine up to 8 cores, but beyond 8 cores its scalability was not that good in our comparison.

So here are some resources on FireDucks. If you want to know more, explore the articles written on FireDucks and related libraries; check out the website; and follow us on Twitter, where we publish information about articles, new developments, new releases, and so on. If you find any issue and want to report it back to us, refer to the GitHub page, and you can definitely connect with us over the Slack channel; we'd be happy to welcome you there and collaborate directly with you if you have any questions.

With that, let me conclude my talk. Thank you once again for listening to me. I hope you enjoyed the talk, the key insights I shared about best practices, and how FireDucks can help you improve your journey with pandas. So enjoy green computing and explore FireDucks. I understand that, this being online, I probably cannot make it interactive this time, but if you have any questions, you can connect with me over LinkedIn or any other preferred social media network, or get in touch via the Slack channel and other media.
I'd be happy to help you optimize whatever problems you may have with pandas, and to discuss collaboration if you are interested. Yes, with that, I'm concluding my talk. Thank you very much once again for listening till the end; I hope you enjoy the rest of the presentations as well. Happy collaboration, happy learning. Thank you so much.
...

Sourav Saha

Senior Research Engineer @ NEC Corporation

Sourav Saha's LinkedIn account Sourav Saha's twitter account


