Conf42 Python 2024 - Online

Mastering Data Refinement: Unleashing the Power of SQL in Python using Polars


Abstract

Dive into ‘Mastering Data Refinement’ with me! Learn to filter datasets using SQL queries in Python, amplified by the power of the Polars package. Elevate your skills for unparalleled data analysis.

Summary

  • Polars is a data frame interface on top of an online analytical processing query engine. It's implemented in Rust using Apache Arrow as the memory model. In this talk we'll learn how to utilize lazy frames for big data sets using SQL context in Polars.
  • Polars can process huge amounts of data very smoothly on almost every machine. Lazy frames in Polars are a clever approach to data analysis. This method enables better efficiency and optimization. By using streaming, Polars can run queries over data sets larger than memory.
  • Streaming processes the result in batches, and to_dicts converts each result row into a dictionary. A quick disclaimer: please note that Polars does not support all SQL operations.
  • Polars' SQL context offers a unique and powerful tool for data manipulation. It combines the familiarity of SQL with the performance of Rust-powered operations on Apache Arrow. I encourage you to explore this library and experience the efficiency gains firsthand.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Everyone, welcome to my talk on enhancing data refinement with Polars. Today we dive into how Polars leverages SQL in Python for efficient data manipulation. I hope you all enjoy this talk, so let's start.

Our agenda for today: first we're going to answer the question, why Polars? Then we'll do an introduction to SQL context in Polars, learn how to utilize lazy frames for big data sets using SQL context, and learn about common table expressions, which is a nice feature Polars ships with in its SQL context.

So why Polars? I think first we need to answer the question: what is Polars? Polars is a data frame interface on top of an online analytical processing query engine. It's implemented in Rust, using Apache Arrow as the memory model. Now, why should you choose Polars when you have so many options out there?

Firstly, performance: Polars is designed for speed, leveraging Rust's efficiency and safety to perform data operations. You can visit this link to see detailed Polars benchmarks against other well-known libraries such as pandas. We have memory efficiency: since Polars is written in Rust, it uses enhanced memory management without needing a garbage collector. We have I/O: Polars supports all common data storage layers, such as local and cloud storage. We have lazy evaluation; we're going to go into this more deeply in this talk, but in short, Polars offers a LazyFrame that allows for lazy evaluation to reduce memory usage. We have an easy-to-use API: despite its high performance, Polars provides a user-friendly API, making data manipulation and analysis straightforward for Python users, especially those coming from other data processing libraries. We have SQL context, which is our subject for today. We have versatile data manipulation: Polars supports a wide range of operations, including filtering, sorting, joining, grouping and aggregating data. We have interoperability: Polars can interact easily with other Python data science tools and libraries, again such as pandas, allowing for seamless integration into existing data science workflows. And we have open source: Polars is an open source project, and according to the owners it will always be one. It is driven by an active community of developers, and everyone is encouraged to add new features and contribute to the project. You can visit the official user guide with this link if you need more details about Polars; I recommend it.

Okay, so let's start with the introduction to SQL context. We're going to go hands-on right away. We have this block of code. First we are importing polars, then we're creating a new instance of SQLContext; that's the syntax for using it. Then we're creating a small data frame with three columns: id, name and badges. The ids are one, two and three, we have Charlie, Bob and Alice, and each one has a certain number of badges: one, five and eight. Then we are registering a new table in our SQL context. The syntax for that is register: our first argument is the table name, so it's going to be trainers, and our second argument is either a data frame or a lazy frame; here we're using a data frame. Then we are executing an SQL query over our SQL context: SELECT id, name FROM trainers WHERE badges = 8. A simple SQL query that will fetch us all the ids and names from our trainers table that have eight badges. Here we are using the argument eager=True.
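Here is a minimal sketch of the snippet described above; the values are the ones from the talk, and the calls assume a reasonably recent Polars release:

```python
import polars as pl

# Create a new SQL context.
ctx = pl.SQLContext()

# A small data frame with three columns: id, name and badges.
df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["Charlie", "Bob", "Alice"],
        "badges": [1, 5, 8],
    }
)

# Register the data frame under the table name "trainers".
ctx.register("trainers", df)

# Run a query; eager=True asks for a DataFrame rather than a LazyFrame.
result = ctx.execute(
    "SELECT id, name FROM trainers WHERE badges = 8",
    eager=True,
)
print(result)
```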
eager=True basically means that Polars will return a DataFrame as the result, and not a LazyFrame, which is the default. In case our result is small and can fit in memory, we can use eager=True. Then we are printing the result. So let's look at our result: we have a small data set with the columns id and name, where id is an integer and name is a string, and we have Alice, who is the only one that matches our query because she has eight badges. So that's SQL context in Polars.

Now let's go over lazy frames. So what are lazy frames? Lazy frames in Polars are a clever approach to data analysis, allowing the entire query to be reviewed before execution. This method not only enables better efficiency and optimization, but also supports working with data sets larger than your machine's memory by using streaming. This means Polars can process huge amounts of data very smoothly on almost every machine.

So let's go a bit hands-on with Polars lazy frames and SQL context. First we have two data sets: one is pokemon and the other one is trainers. As you can see, pokemon is a very big data set: it has 200 million rows with four columns, id, name, type and trainer. And we have another data set for trainers, which contains three columns: id, name and badges. So pokemon cannot fit in memory on a lot of machines, and we want to utilize lazy frames in order to query over this big data set. Let's see how we can do that with Polars.

First we have scan_csv. When you use it, it scans a file, and rather than reading the entire CSV into memory, Polars creates a LazyFrame with a reference to the file data, so no processing of the data happens until you explicitly execute a query over it. Here we're using the infer_schema_length argument. infer_schema_length makes our life easier: if we don't know the schema of our data set, Polars scans the number of rows it takes as an argument and sets the schema itself. With a bigger number, inferring the schema will usually be slower. So here we are reading a hundred rows in order to understand our schema. And as you can see, we have another scan_csv for our trainers. Here we are not using infer_schema_length; instead we are telling Polars what the schema of our CSV is: id is an integer, name is a string and badges is also an integer.

Then we're creating a new SQL context the same way as we did before. Now we are using register_many. register_many allows us to register more than one table at a time, so we are registering two tables: one is pokemon, which points to our pokemon LazyFrame, and the other one is trainers. Now we're printing our schemas in order to see what our result will be, and we are executing a very simple SQL query, SHOW TABLES. We are also using eager=True because we want a DataFrame as the result.
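A sketch of the lazy setup just described. The file names are assumptions, the hundred-row schema inference follows the talk, and collect_schema assumes Polars 1.0 or later (older versions expose the same information via the .schema property):

```python
import polars as pl

# Lazily scan the big CSV: nothing is read into memory yet, the
# LazyFrame only holds a reference to the file. Polars peeks at the
# first 100 rows to infer the schema.
pokemon = pl.scan_csv("pokemon.csv", infer_schema_length=100)

# For trainers we skip inference and declare the schema ourselves.
trainers = pl.scan_csv(
    "trainers.csv",
    schema={"id": pl.Int64, "name": pl.String, "badges": pl.Int64},
)

# Register both lazy frames in one call.
ctx = pl.SQLContext()
ctx.register_many({"pokemon": pokemon, "trainers": trainers})

# Print the inferred and the declared schemas.
print(pokemon.collect_schema())
print(trainers.collect_schema())

# List the registered tables; eager=True returns a DataFrame.
print(ctx.execute("SHOW TABLES", eager=True))
```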
Let's see the result. First, printing our schemas gives us two ordered dicts. As you can see, Polars inferred the schema pretty well: id is an integer, name is a string, type is also a string and trainer is also a string. So Polars does a very good job of inferring the schema itself. And of course trainers has the schema we configured. Then let's see what SHOW TABLES gives us: a very small data frame with only one column, name, and we have pokemon and trainers as our tables. So far so good.

Now let's see how we can run SQL queries over our two new tables. Here we are using our SQL context that contains the two tables, pokemon and trainers. We are running a simple SQL query: select the id and trainer from pokemon where the Pokemon's name is Charizard. Why? Because everyone loves Charizard. And we're using sink_csv with one argument, result.csv. Since we are not using eager=True like we did before, the result of our query is going to be a LazyFrame, so in order to dump it we are using sink_csv, which evaluates the query in streaming mode and writes it to a new CSV file. So here we are dumping the whole result of this query into result.csv.
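Roughly, the query just described might look like this (reusing ctx from the previous sketch; the output file name comes from the talk):

```python
# Without eager=True the result is a LazyFrame, so instead of
# collecting it we sink it: sink_csv evaluates the query in
# streaming mode and writes the rows straight to a new CSV file.
ctx.execute(
    "SELECT id, trainer FROM pokemon WHERE name = 'Charizard'"
).sink_csv("result.csv")
```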
Now let's look at the result. That's how result.csv looks: we have the ids of the Pokemon that are Charizards, and we have the trainer each Charizard belongs to.

Okay, now that we went over a very simple SQL query, let's go over common table expressions, or CTEs. Common table expressions are a feature of SQL that allows you to define a temporary named result set that can be referenced within a SQL statement. CTEs provide a way to break down complex SQL queries into smaller, more manageable pieces, making them easier to read, write and maintain.

So let's look at an example of a CTE. As you can see, the query is a bit more complex. We're using WITH as the keyword in order to create a common table expression. We're going to call our new table champions, and it's going to contain all the trainer names and trainer badges from our trainers data set where the trainer has eight badges. So it's going to give us all our trainers that have eight badges. By the way, we call it champions because if you ever played a Pokemon game, you know that in order to complete the game and master the league, you have to get eight badges. After we have our champions, we're going to select our trainer from our pokemon data set; here we're using an alias called trainer_name, and we are also going to select all the Pokemon of that specific trainer. That's why we're using a JOIN: we are going to join on the champion's name, which is the trainer name, and on our Pokemon's trainer. We're also using LIMIT, which is another nice feature Polars supports, along with ORDER BY. So we're going to get the first six Pokemon of the champions that exist in our trainers data set.

Here we are using collect. collect is another way of getting the data from a query result. We are using streaming=True because we don't want to collect all the results into our memory in case it's a very large result, so we're using streaming. Streaming basically processes the result in batches, and then we're using to_dicts in order to convert each result row into a dictionary.
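A sketch of the CTE query, again reusing ctx; the exact column qualifiers are an assumption, but the shape follows the talk:

```python
query = """
WITH champions AS (
    SELECT name, badges
    FROM trainers
    WHERE badges = 8
)
SELECT
    champions.name AS trainer_name,
    pokemon.name
FROM pokemon
JOIN champions
    ON champions.name = pokemon.trainer
LIMIT 6
"""

# streaming=True evaluates the query in batches, so the intermediate
# results never have to fit in memory all at once.
result = ctx.execute(query).collect(streaming=True)

# to_dicts converts each result row into a dictionary.
print(result.to_dicts())
```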
Now let's see our result and how it looks. We have six results because our limit was six, we only have one champion, called Alex Bell, and we have our Pokemon names in the first six results. As you can see, the alias was also extracted into the dictionary as expected.

A quick disclaimer: please note that Polars does not support all SQL operations. If you need more information about which SQL operations are supported, you can go over the user guide; there's a very nice list there that tells you all the operations and whether Polars supports them.

And that concludes our talk, so thank you very much. What we learned today: Polars' SQL context offers a unique and powerful tool for data manipulation, combining the familiarity of SQL with the performance of Rust-powered operations on Apache Arrow. Whether you are a seasoned data professional or very new to the field, I encourage you to explore this library and experience the efficiency gains firsthand, both for small and large data sets. Thank you for your attention. I look forward to any questions you might have; you can mail me at tomnabi two@gmail.com. Enjoy the rest of Conf42 Python.

Tom Levy

Software Engineer @ Own Company



