Transcript
This transcript was autogenerated. To make changes, submit a PR.
Everyone, welcome to my talk on enhancing data refinement with Polars.
Today we dive into how Polars leverages SQL in Python for efficient
data manipulation. I hope you all enjoy this talk, and let's start.
So our agenda for today is: first, we're going to answer the question,
why Polars? We're going to do an introduction to the SQL
context in Polars. We're going to learn how to utilize lazy frames
for big data sets using the SQL context in Polars.
And we're going to learn about common table expressions, which is a nice feature
Polars supports in its SQL context.
So why Polars? I think first we need to answer the question
of what Polars is. So Polars is a DataFrame interface on
top of an online analytical processing (OLAP) query engine. It's implemented
in Rust, using Apache Arrow as the memory model.
Now why should you choose Polars when you have so many options out there?
So firstly, performance: Polars is designed for speed, leveraging
Rust's efficiency and safety to perform data operations.
You can visit this link in order to see detailed
Polars benchmarks against other known libraries such as pandas.
We have memory efficiency: since Polars is written
in Rust, it uses enhanced memory management without having a garbage
collector. We have IO: Polars supports all common
data storage layers, such as local and cloud storage.
We have lazy evaluation; we're going to
go into this more deeply in
this talk, but in short, Polars offers a LazyFrame
that allows for lazy evaluation to reduce memory usage.
We have an easy-to-use API: despite its
high performance, Polars provides a user-friendly API, making
data manipulation and analysis straightforward for Python users,
especially for those coming from other data processing libraries.
We have the SQL context, which is our subject for today,
and we have versatile data manipulation:
Polars supports a wide range of operations, including filtering,
sorting, joining, grouping and aggregating data.
We have interoperability, so Polars can interact easily
with other Python data science tools and libraries, again
such as pandas, allowing for seamless integration into existing
data science workflows. And we have open source:
Polars is an open source project right now, and according to the owners it
will always be an open source project. It is driven by an active
community of developers, and everyone is encouraged to add new features and
contribute to the project. You can visit the official
user guide with this link if you need more details about
Polars; I recommend it. Okay,
so let's start with the introduction to the SQL context. We're going
to go hands on right away. So we
have this block of code. First we are importing
polars, then we're creating a new instance
of SQLContext. So that's the syntax for using
that. Then we're creating a small DataFrame.
Here we have a small DataFrame with three columns: id,
name and badges. The ids are one, two, three; we have Charlie, Bob and
Alice; and each one has a certain number of badges: one,
five and eight. Then, we are registering a
new table in our SQL context.
So the syntax for that is register. Our first
argument is the table name, so it's going to be trainers.
And our second argument is either a DataFrame or a LazyFrame;
here we're using a DataFrame. And then
we are executing an SQL query over our SQL
context. So we are using SELECT id, name FROM
trainers WHERE badges = 8. So a simple
SQL query that will fetch us the ids and
names from our trainers data set for trainers that have eight badges.
Here we are using the argument eager=True. Eager basically means
that Polars will return a DataFrame
as the result and not a LazyFrame, which is the default. In case
our result is small and can fit in memory,
we can use eager=True. And then we are printing
the result.
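To make that concrete, here's a minimal sketch of the block we just walked through; the variable names are my own, but the calls are the standard Polars API:

```python
import polars as pl

# Create a new SQL context.
ctx = pl.SQLContext()

# A small DataFrame with three columns: id, name and badges.
df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Charlie", "Bob", "Alice"],
    "badges": [1, 5, 8],
})

# Register the DataFrame under the table name "trainers".
ctx.register("trainers", df)

# eager=True returns a DataFrame instead of the default LazyFrame.
result = ctx.execute(
    "SELECT id, name FROM trainers WHERE badges = 8",
    eager=True,
)
print(result)
```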
So let's look at our result here. We have a small data set with the columns id
and name; id is an integer and name is a string.
And we have Alice, who is the only one that matches our
query, because she has eight badges. So that's the SQL
context in Polars. Now let's go over lazy
frames. So what are lazy frames? LazyFrames
in Polars are a clever approach to data analysis, allowing the entire
query to be reviewed before execution. This method
not only enables better efficiency and optimization, but also
supports working with data sets larger than your machine's memory.
By using streaming, Polars
can process huge amounts of data very smoothly
on almost every machine.
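As a tiny illustration of that idea (my own example, not from the talk), nothing below touches the data until collect is called:

```python
import polars as pl

# scan_csv only records a reference to the file; nothing is read yet.
lf = pl.scan_csv("pokemon.csv")

# Build up a query plan; still nothing executes.
query = lf.filter(pl.col("type") == "Fire").select("name")

# Only collect() triggers execution, after Polars has optimized the whole plan.
df = query.collect()
```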
So let's go a bit hands on
with how to use Polars LazyFrames with the SQL
context.
So first we have two data sets:
one is Pokemon and the other one
is trainers. As you can see,
Pokemon is a very big data set. It has
200 million rows with
four columns: id, name, type and trainer. And we
have another data set for trainers, which contains
three columns: id, name and badges.
So Pokemon cannot fit in memory
on a lot of machines, and here we want to utilize lazy frames
in order to query over this big data set.
So let's go a bit hands on and see how
we can do that with Polars. First
we have scan_csv. When you use scan_csv,
it scans a file, and rather than reading the entire CSV into
memory, Polars creates a LazyFrame with a reference to
the file data, so no processing of the data happens until
you explicitly execute a query over it. Here we're using
the infer_schema_length argument.
infer_schema_length makes our life easier
if we don't know the schema of our
data set: Polars basically scans the number
of rows it takes as an argument and sets
the schema itself. With a bigger number,
inferring the schema will usually be slower.
So here we are reading a hundred rows in order to infer
our schema. And as you can see, we have another scan_csv
for our trainers. Here we are not using infer_schema_length;
instead we are telling Polars what the schema of our CSV is.
So we are telling it that id is an integer, name is a string and
badges is also an integer.
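Here's roughly what those two scans look like in code; the file names and the value 100 for infer_schema_length are my own assumptions, and newer Polars versions take schema= while older ones use dtypes=:

```python
import polars as pl

# Scan, don't read: Polars keeps a reference to the file and infers the
# schema from the first rows (100 here is an assumed value).
pokemon = pl.scan_csv("pokemon.csv", infer_schema_length=100)

# For trainers we declare the schema ourselves instead of inferring it.
trainers = pl.scan_csv(
    "trainers.csv",
    schema={"id": pl.Int64, "name": pl.String, "badges": pl.Int64},
)
```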
Then we're creating a new SQL context the same way as we
did before. Now we are using register_many.
register_many allows us to register more than one table at a time.
So now we are registering two tables: one is pokemon,
which points to our Pokemon LazyFrame, and
the other one is trainers. Then
we're printing our schemas in order to see what
our result will be, and we are executing a very simple
SQL query, SHOW TABLES. We
are also using eager=True because we want a DataFrame as the result.
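A sketch of the registration step, continuing from the two LazyFrames above:

```python
# Register both LazyFrames in a single call.
ctx = pl.SQLContext()
ctx.register_many({"pokemon": pokemon, "trainers": trainers})

# Print the inferred and declared schemas (older Polars versions
# expose this as the .schema attribute instead).
print(pokemon.collect_schema())
print(trainers.collect_schema())

# eager=True gives us the list of tables back as a DataFrame.
print(ctx.execute("SHOW TABLES", eager=True))
```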
So let's see the result. First,
our schema prints give us two dictionaries.
As you can see, Polars inferred the schema
pretty well: id is an integer, name is a string,
type is also a string and trainer is also a string. So Polars is
doing a very good job of inferring the schema itself. And of
course trainers has the schema we configured. Then let's see
what SHOW TABLES gives us. Here
we have a very small DataFrame with only one column, name,
and we have pokemon and trainers as our tables.
So far so good. Now let's see how we can
run SQL queries over our two new
tables.
So here we are using our SQL context that contains
two tables, pokemon and trainers. We are running a
simple SQL query: SELECT id, trainer FROM pokemon, where the
Pokemon's name is Charizard. Why Charizard? Because everyone
loves Charizard. And we're using sink_csv
with one argument, the path result.csv. Since
we are not using eager=True like we did before,
the result of our query is going to be a LazyFrame.
So in order to dump it, we are using sink_csv,
which evaluates the query in streaming mode and writes to a new
CSV file. So here we are dumping the entire result of
this query into result.csv.
this query into result CSV. Now let's look at the
result. So that's how resort
csv look like. We have the ideas of the Pokemon
that are charizards and we have the trainer that
the charizard belongs to. Okay,
now that we went over a very simple SQL query, let's go
over common table expressions or CTE. So common table expressions
are a feature of SQL that allows you to define a temporary
named result set that can be referenced within
a SQL statement.
CTEs provide a way to break down complex
SQL queries into smaller, more manageable
pieces, making them easier to read,
write and maintain. So let's
look at an example of a CTE.
So as you can see, the query is a bit more complex.
We're using WITH as the keyword
in order to create a common table expression. We're going to call
our new temporary table champions, and it's going to contain all
the trainer names and trainer badges from our trainers data set
for trainers that have eight
badges. So it's going to give us all our trainers that
have eight badges.
By the way, we call it champions because, if you've ever played a Pokemon game,
you know that in order to complete the game and
master the league, you have to get eight badges.
So after we have our champions, we're going to select the
trainer from our Pokemon data set; here we're
using an alias called trainer_name. And we are also
going to select all the Pokemon of this specific trainer,
and that's why we're using a join:
we are going to join on the champion's name, which is the
trainer name, and on our Pokemon's trainer column.
We're also using LIMIT, which is another nice feature that
Polars has: it supports LIMIT and ORDER BY.
So we're going to get six Pokemon of every
champion that exists in
our trainers data set. Now here we are using collect.
Collect is another way of getting our data from a query result.
We are using streaming=True because we don't want to collect
all the results into our memory in case it's
a very large result. So we're using streaming; streaming basically
processes the query in batches. Then we're using to_dicts in order
to convert the result rows
into dictionaries.
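Putting the whole CTE query together, a sketch might look like this (the exact aliases are my own reconstruction of what's on the slide):

```python
query = """
    WITH champions AS (
        SELECT name, badges
        FROM trainers
        WHERE badges = 8
    )
    SELECT pokemon.trainer AS trainer_name, pokemon.name
    FROM pokemon
    JOIN champions ON champions.name = pokemon.trainer
    LIMIT 6
"""

# execute returns a LazyFrame by default; streaming=True processes it
# in batches, and to_dicts() turns each result row into a dictionary.
rows = ctx.execute(query).collect(streaming=True).to_dicts()
print(rows)
```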
Now let's see our result and what it looks like. So we have
six results, because our limit was six, and we only
have one champion, called Alex Bell, and our
Pokemon names in
the first six results. As you can see,
the alias is also reflected in the dictionary
keys.
A quick disclaimer: please note that Polars does not support all
SQL operations. If you need more information about which
SQL operations are supported, you can go over the user
guide; there's a very nice list there that tells
you all the operations and whether Polars supports them.
So that's it, and that concludes
our talk. Thank you very much.
What we learned today: Polars'
SQL context offers a very unique and powerful
tool for data manipulation, combining the familiarity
of SQL with the performance of Rust-powered
operations on Apache Arrow. Whether you are a
seasoned data professional or very new to the field, I encourage
you to explore this library and experience the efficiency gains
firsthand, both for small and large data sets.
Thank you for your attention. You can mail me at tomnabi
two@gmail.com. I look
forward to any questions, and enjoy the rest of Python 42.