Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, let's talk about Polars. So again, I'm Matt Harrison.
I'm the author of Effective Pandas, Machine Learning Pocket Reference,
and Illustrated Guide to Python 3, among other books.
I'm an advisor to a company called Ponder that creates an
enterprise version of pandas that lets you run pandas
on your BigQuery or
on your Snowflake data sets. And I
also do corporate training. So I teach people Python and teach people data science.
So I'm excited to talk about Polars today.
Okay, quick history of Polars, or relevant background.
So my background is I've worked with data since
1999. In 2006, I created my own OLAP
engine in Python. Later I heard about pandas and I
started using pandas. In 2016, I wrote a book
about pandas. In 2019 I did some Spark training.
In 2020, I wrote the second edition of the Pandas Cookbook.
In 2021, I wrote a book called Effective Pandas.
And recently I've done a bunch of training around cuDF
and Polars as well.
So I've got a bunch of opinions here, and I'm going to just walk you
through a data set. We're going to look at the data
set and look at the types, talk about this thing called
chaining, we'll talk about function application, and
we'll talk about aggregation as well.
Okay, so let's run our data. Okay,
so we're going to load our libraries here.
I'm running Polars version 0.17,
and I'm going to download my data. This is vehicle
data from the US government about cars and
their mileage. So it looks something like this. We've got 41,000 rows
and 83 columns. So we're going to look, and there's a bunch of
columns in here. I'm not concerned with all of those, but our initial data
size is like 33 megs. Okay, so I'm
going to walk through this data set, and one of the things you want to
do is get the types right. I'm going to use a subset of the columns
here. So instead of all 83 columns, I'm just going to use this subset here.
And you can see that we've got various types of columns in this Polars
DataFrame. If you're familiar with pandas, this looks pretty familiar.
We've got Int64s, Float64s,
Utf8. And one thing to note is
that Polars does have a native string type,
which in pandas 1 wasn't the case. If we look at the size
of this, we're looking at around eight megs. So what I'd do is
I'd just go through this data. Let's pull
out the integers. So I'll do something like this. I'm going to say pull out
the columns I'm interested in and then select the columns that are Int64.
So these are the columns that are Int64. We've got city mileage,
combined mileage, highway mileage, cylinders, the fuel
cost per year, the range (that's how far an electric car will go), and the
year. I might want to do a describe on this. And this looks very
similar to what we'd see in pandas: here we get summary statistics of this.
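The code for this step might look something like the following sketch. The file name and the column names (city08 for city mileage, comb08 for combined, and so on) are assumptions based on the fueleconomy.gov vehicles data:

```python
import polars as pl

# Load the raw data. The path and the null marker are assumptions.
autos = pl.read_csv('vehicles.csv', null_values='NA')

# Assumed subset of the 83 columns, per the talk.
cols = ['city08', 'comb08', 'highway08', 'cylinders', 'displ',
        'drive', 'eng_dscr', 'fuelCost08', 'make', 'model',
        'range', 'createdOn', 'year', 'trany']

# Size of the subset in megabytes (around 8 MB in the talk).
print(autos.select(cols).estimated_size('mb'))

# Pull out just the Int64 columns and get summary statistics.
autos.select(cols).select(pl.col(pl.Int64)).describe()
```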
Now, one thing to note about Polars: if you're familiar with pandas,
pandas has an index. We don't have an index here; we just have
column names, and we have a bunch of columns.
Now, I wouldn't write this like the first option. I would write it like the
second option here. It's the same code, but I put parentheses around it,
and the parentheses allow us to not
worry about whitespace. So what I can do is put each operation on
its own line. You can see that that gives me the same result.
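As a sketch, the two styles side by side, using the same assumed columns as above:

```python
# One long line, versus the same query wrapped in parentheses so each
# operation can sit on its own line.
autos.select(cols).select(pl.col(pl.Int64)).describe()

(autos
 .select(cols)
 .select(pl.col(pl.Int64))
 .describe()
)
```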
Now, one thing to be aware of is that you can see the types for
each of these. And I probably
don't need to use a floating point, a 64-bit
floating point, for city mileage, which goes up to, it looks like, 150.
Combined mileage goes up to 136, so I could probably reduce
these. Cylinders goes up to 16. And I'm going to
just see if I could use an Int8,
an 8-bit integer. And you can see that that goes from negative 128
to 127. There's also an unsigned 8-bit integer, which
goes up to 255, which might be more appropriate. So I'm going to say let's
take my DataFrame, pull out the columns I want, and then we're going to
use this with_columns method, and we're going to try to convert
the combined mileage to an Int8. And then we'll do a describe
on that. When I do that, I get an error, and the error here
says that this strict conversion failed; we've got values like
131, and that did not work.
So instead of doing that with an Int8, let's try an unsigned
8-bit integer. And it looks like that does indeed work. So this is kind of nice.
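A sketch of the failing and succeeding casts (same assumed column names; the exact exception type varies across Polars versions):

```python
# Int8 tops out at 127, so the strict cast fails on values like 131.
try:
    autos.select(cols).with_columns(pl.col('comb08').cast(pl.Int8))
except Exception as e:
    print(e)  # strict conversion failed

# UInt8 covers 0-255, so this works.
autos.select(cols).with_columns(pl.col('comb08').cast(pl.UInt8)).describe()
```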
Pandas 1 didn't check those types; it would just allow you to do
those. We can use other types as well, such as Int16.
So I might do something like this, where I say let's take the city mileage,
the combined mileage, the highway mileage, the cylinders, and displacement,
convert those to unsigned 8-bit integers, and then
range, fuel cost, and year we'll convert to
unsigned 16-bit integers, and we'll look at the size.
We're now down to about 5.8 megabytes
just by doing those operations, without losing any of
our data.
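A sketch of those downcasts might look like this; note that displ is fractional in the raw data, so it's kept as a 32-bit float here rather than truncated to an integer:

```python
autos2 = (autos
 .select(cols)
 .with_columns([
     pl.col(['city08', 'comb08', 'highway08', 'cylinders']).cast(pl.UInt8),
     pl.col('displ').cast(pl.Float32),  # fractional, so not an int
     pl.col(['range', 'fuelCost08', 'year']).cast(pl.UInt16),
 ])
)
print(autos2.estimated_size('mb'))  # about 5.8 MB in the talk

# Just the string columns.
autos2.select(pl.col(pl.Utf8))
```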
Okay, let's look at strings here, and
I'm going to just select the string types. So again,
you can see that we've got a native string type here. We've got
drive, which appears to be categorical. We've got this engine
description column, which looks like it's pseudo-categorical;
it's got a bunch of different entries in there with parentheses and commas,
probably free-form text. Make looks categorical.
Model is probably categorical.
Trany actually looks like it has two pieces of information: whether it's manual or
automatic, and the number of speeds. And then this createdOn actually looks
like a date. So we'll look at these and try to deal with them.
Okay, so let's just look at our size.
Remember we had 5.8 megs, and I'm going to convert
drive, make, and model to categoricals and look at our size
after that. We're down to four megs by doing that. A categorical
works well for a string column whose values
repeat a lot. That looks pretty good.
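A sketch of the Categorical conversion and the size check:

```python
# Cast the low-cardinality string columns to Categorical.
autos2 = autos2.with_columns(
    pl.col(['drive', 'make', 'model']).cast(pl.Categorical)
)
print(autos2.estimated_size('mb'))  # about 4 MB in the talk
```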
Let's look at the FFS value here.
I'm going to look at these remaining columns. So I've got engine description,
which has FFS in it; I've got the transmission, which we want to split
into two pieces, the manual flag and the speeds; and then createdOn, which we'll deal with
when we talk about dates. So I'm just
going to make a column here. This is an expression, so I
can pull up the documentation, and you can see that
off of a column expression we have this str namespace, and off of str
we have various things that we can do. You can see that we can extract
values from that. So I'm going to do an extraction here.
I'm going to say let's extract, and I'm using a raw string with, in
parentheses, backslash d plus; that will match the
numeric values inside of the transmission. And then I'm
also saying alias that as the speeds column.
And then from that same trany column, I'm going to see whether
it contains the Manual string and alias that as the manual
column. And it looks like that does indeed work.
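A sketch of those two expressions, assuming the transmission column is named trany:

```python
autos2.with_columns([
    # Capture the first run of digits as the speeds column.
    pl.col('trany').str.extract(r'(\d+)').alias('speeds'),
    # Boolean flag for manual transmissions.
    pl.col('trany').str.contains('Manual').alias('manual'),
])
```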
Okay, let's look at the columns after I do that. These are the columns that I
have. And what I'm going to do is change my code a
little bit here: I'm going to take the columns I
just printed out and put them in the select method down below
to select those columns. Let's look at our estimated size. We're now
down to 2.5 megabytes. Now, we might be missing
some values. So I'm going to use a filter method here, and I'm going to
say: where is drive null? Let's run that. And you can
see here are the values where it's null. Looks like a lot of those are
electric vehicles. So what am I going to do here?
I'm going to say, okay, on drive, fill null with "Other";
that's this value right here, you can see this.
And we're also going to replace
engine description: you can
see the engine description becomes whether it contains FFS.
Let's just run that and see if that works.
So do we have a drive here?
And it looks like drive worked there.
It's not complaining. And we've got some missing cylinders as
well, so let's just find out where those are missing. And it
looks like those are electric vehicles as well. So I'm just going to come in
here and say cylinders: if we have missing values,
fill those in with zero and cast that to a UInt8. And it
looks like that works as well.
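A sketch of that null handling (since drive was already cast to Categorical above, the fill happens on the string version first):

```python
autos2 = autos2.with_columns([
    # Fill missing drive values before the Categorical cast.
    pl.col('drive').cast(pl.Utf8).fill_null('Other').cast(pl.Categorical),
    # Replace the free-form engine description with an FFS flag.
    pl.col('eng_dscr').str.contains('FFS').alias('ffs'),
    # Electric vehicles have no cylinders; treat missing as 0.
    pl.col('cylinders').fill_null(0).cast(pl.UInt8),
])
```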
Okay, let's look at our dates. To do this, I'm going to have
to clean up this date column a little bit.
And I need to know about this replace method here.
So I'm going to use this replace method on createdOn. If I try
to convert the dates, it's not going to like those EDT and EST values,
so I'm just going to replace those with the time zone offsets.
And then I'm going to call str.strptime to convert it
to a datetime and say utc=True.
You can see that createdOn now says that it is a datetime.
So that's pretty cool. If I want to convert this to
a New York City time zone, after I've converted it to a datetime,
I can say dt.convert_time_zone. And it looks like that
works as well.
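A sketch of the date cleanup; the strptime format string is an assumption about how createdOn looks (e.g. "Tue Jan 01 00:00:00 EST 2013"):

```python
autos2 = autos2.with_columns(
    pl.col('createdOn')
      # Swap the named zones for offsets strptime can parse.
      .str.replace('EDT', '-0400')
      .str.replace('EST', '-0500')
      .str.strptime(pl.Datetime, '%a %b %d %H:%M:%S %z %Y')
      .dt.convert_time_zone('America/New_York')
)
```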
Okay. At this point my data is looking pretty good. I'm going to convert this
into a function here, and here is my function. So one of the things I like
to do is work with the raw data. You can see that I've chained up my
operations, starting from autos and then doing all these
operations. As I've been creating these, I've been checking them to see that
they work along the way. This is my recommended way, and
the Polars-recommended way. We call this chaining,
and it's also called flow programming. And you
can see how I created this chain as I was going through.
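Collected into a function, the chain might look like this sketch, under the same column-name assumptions as above:

```python
def tweak_autos(autos):
    cols = ['city08', 'comb08', 'highway08', 'cylinders', 'displ',
            'drive', 'eng_dscr', 'fuelCost08', 'make', 'model',
            'range', 'createdOn', 'year', 'trany']
    return (autos
        .select(cols)
        .with_columns([
            pl.col(['city08', 'comb08', 'highway08']).cast(pl.UInt8),
            pl.col('displ').cast(pl.Float32),
            pl.col(['range', 'fuelCost08', 'year']).cast(pl.UInt16),
            pl.col(['make', 'model']).cast(pl.Categorical),
            pl.col('drive').fill_null('Other').cast(pl.Categorical),
            pl.col('cylinders').fill_null(0).cast(pl.UInt8),
            pl.col('eng_dscr').str.contains('FFS').alias('ffs'),
            pl.col('trany').str.extract(r'(\d+)')
              .cast(pl.UInt8).alias('speeds'),
            pl.col('trany').str.contains('Manual').alias('manual'),
            pl.col('createdOn')
              .str.replace('EDT', '-0400')
              .str.replace('EST', '-0500')
              .str.strptime(pl.Datetime, '%a %b %d %H:%M:%S %z %Y')
              .dt.convert_time_zone('America/New_York'),
        ])
    )

autos2 = tweak_autos(autos)
```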
Now, one of the things that Polars gives us that pandas doesn't is the
ability to be lazy. And it actually has
a query engine that will optimize what's going on there.
So here I'm saying make a function called tweak_autos_lazy.
It's taking in a path, a path to the file. And then you can see I'm saying
scan_csv, which gives me a lazy frame, and then down here I'm saying collect.
So what this is going to do is it's not going to execute anything until
I run collect. Then it's going to look at
the columns that I use down here, and it's only going to read from the
CSV the columns that I specified.
So even though there are 83 columns, it's not going to read everything. And it
can do other things like predicate pushdown, et cetera. But you can
run this and see that it does work. So this is one of the reasons
you'll want to chain with Polars: it can actually
make your queries run faster when you do that.
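A minimal lazy sketch of that idea, trimmed to a few columns:

```python
def tweak_autos_lazy(path):
    # scan_csv defers execution; nothing runs until collect, and the
    # query engine only reads the columns the query actually uses.
    return (pl.scan_csv(path, null_values='NA')
        .select(['city08', 'comb08', 'highway08', 'make', 'year'])
        .with_columns([
            pl.col(['city08', 'comb08', 'highway08']).cast(pl.UInt8),
            pl.col('make').cast(pl.Categorical),
        ])
        .collect()
    )

tweak_autos_lazy('vehicles.csv')
```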
Now, a lot of people say this isn't easy to debug. I disagree with that. I mean,
if I want to debug this, I can stick in a pipe here.
Pipe allows you to take the current state of the DataFrame and just return
whatever you want. So here I'm just printing out the shape and then returning the
DataFrame using a short circuit. You can see that I'm doing that before this
operation, I'm doing that down here, and I'm doing this at the end.
So I should be able to see how big each of these is.
I'm also piping in a get_var helper here if you want the
intermediate state. This is just using Python's globals to
make a variable called df2 with the intermediate state right there.
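A sketch of that debugging pattern; get_var here is a hypothetical helper, not a Polars built-in:

```python
def get_var(df, name):
    # Side effect: stash the intermediate frame in a global.
    globals()[name] = df
    return df

(autos
 .select(cols)
 .pipe(lambda df: print(df.shape) or df)  # short-circuit print
 .with_columns(pl.col('comb08').cast(pl.UInt8))
 .pipe(get_var, 'df2')                    # saves df2 for inspection
 .pipe(lambda df: print(df.shape) or df)
)
```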
Let's run that and see what happens.
Okay, you can see that we printed out the size as we're going through
here. And if we check df2, this is that intermediate state,
if we wanted it. So I don't think that chaining
is necessarily hard to debug. It's just something that,
as you're building up your code, you debug as you go. And then
if you put it into a function, that makes it really easy to use.
Now, another thing you want to be aware of is function application.
So I'm going to make this variable called autos2, which is our cleaned-up
data set. And this is, again, pretty US-centric here, so let's try
to convert the city mileage to liters per 100 km. We could
do this by saying apply on that column, and that looks like it
is working: our city mileage is converted to liters
per 100 km. However, we can get the same result
by just saying 235 divided by that column.
Now, when you're doing function application,
you need to be aware that that can be a slow process, because what you're
doing is taking your data out of the backing store.
In the case of Polars, that's
a Rust implementation of the Arrow memory
representation. And you're crossing that boundary, converting each
individual value to a Python value, running the function, and sticking
it back in. So you can see that this
took, in this case, four milliseconds to run the slow code
and 150 microseconds
to run the fast code. The last time I ran this, it was a similar order
of magnitude: roughly 30 times slower to do
the function application. So just be aware of that. You're going to
want to stay away from that function application if
you can.
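A sketch of the slow and fast versions (apply was renamed map_elements in later Polars versions):

```python
# Slow: a Python function runs per value, crossing the Rust-to-Python
# boundary each time.
autos2.select(
    pl.col('city08').map_elements(lambda mpg: 235 / mpg,
                                  return_dtype=pl.Float64)
)

# Fast: the vectorized expression stays in Rust.
autos2.select(235 / pl.col('city08'))
```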
Okay. Another thing you'll want to do is master this thing called aggregation.
That's like a pivot table or a grouping, if
you're familiar with Excel or SQL.
And so here we go. We're going to say let's get the mean
by year. What you want to do is group by the
year column, and then we can take the mean of that. You can see this
makes a column called year that has each year, and then
for each numeric column we have the mean value for that year.
This is super powerful, and it's basically one line
of code, but I've written it as this chain. Now, one of the things
I do like to do is visualize that. So I'm going to
load some plotting code here. Now, another way to do this, if I don't want
to get the mean of everything, is to specify the
columns that I want using this agg method. So here
I'm saying pull off the combined mileage column
and the speeds column and take the mean of both of those.
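A sketch of both aggregations (note that group_by was spelled groupby before Polars 0.19):

```python
# Mean of every numeric column, by year.
autos2.group_by('year').mean()

# Just the columns we care about, via agg.
(autos2
 .group_by('year')
 .agg([pl.col('comb08').mean(), pl.col('speeds').mean()])
)
```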
Now, one of the things that Polars doesn't give you that pandas does is the
ability to plot. So if I want to plot this, I'm generally
going to stick it back into pandas. So here I call to_pandas, which actually
gives me a pandas DataFrame.
And because I don't have an index there,
I'm going to stick the year into the index, and then I'm going
to call plot on that. And when I do this, the plot is kind of
ugly. The issue here is that if you look at the data,
the index is not sorted, because of the optimizations that Polars
makes; it tries to run as fast as possible. So I'm going
to just stick in a pandas sort_index there. And you can see that we
get a pretty plot coming out of this, allowing us to quickly see that around the
year 2010 or so, the combined mileage shot up
quite a bit for the average value there.
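A sketch of the pandas round-trip for plotting:

```python
(autos2
 .group_by('year')
 .agg([pl.col('comb08').mean(), pl.col('speeds').mean()])
 .to_pandas()
 .set_index('year')
 .sort_index()   # Polars doesn't guarantee group order
 .plot()
)
```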
Okay, here I'm going to group by year, and
instead of taking the mean, I'm going to take the standard deviation.
So once you've got these figured out, it's really easy to swap mean for
standard deviation or for other operations that you
might want to do.
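For example, the same query with std swapped in:

```python
(autos2
 .group_by('year')
 .agg([pl.col('comb08').std(), pl.col('speeds').std()])
)
```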
Now in this case, I'm going to try to add
a country, and so we might want to know about this when
operation, and we're
going to use the is_in method
on a column as well. So here I'm going to try to add
a country column. I'm going to say with_columns, and I'm going
to say pl.when; this is how you do an if statement. And we're going
to say take that make column, and if it's in Chevy,
Ford, Dodge, GMC, or Tesla, then I want a
value of "US", otherwise I want a value of "Other", and
we're going to call that country. And when I run this, I get an error,
and the error is that it didn't like that I
did this on a categorical. So to get around this, I'm going to
actually cast this back to Utf8 and run this. I should get
a country column now. And then I'm going to say group by; we're going to
group by both year and country, then we're going to aggregate that, and here's the
aggregation for that.
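A sketch of that logic; the make spellings (e.g. "Chevrolet" rather than "Chevy") are assumptions about the raw values:

```python
(autos2
 .with_columns(
     # Cast back to Utf8 to avoid the categorical is_in error.
     pl.when(pl.col('make').cast(pl.Utf8)
               .is_in(['Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla']))
       .then(pl.lit('US'))
       .otherwise(pl.lit('Other'))
       .alias('country')
 )
 .group_by(['year', 'country'])
 .agg([pl.col('comb08').mean(), pl.col('speeds').mean()])
)
```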
Now in this case, if I wanted to plot this by year, I would have to basically pull
out that country, and I could do that in pandas.
But if I wanted to do that in Polars,
I actually want to use what's called a pivot to do that. So here I'm
going to say let's make our country column, and then we're going to
call pivot. We're going to stick year into the index; even though Polars
doesn't have an index, when we say pivot we can specify what we want as
the index. And here we're going to say the values are the combined
mileage and the speeds, the columns is the country column,
and we're going to aggregate that with the mean function.
So this columns argument is going to take the values of country and stick them
up into the columns. Let me just run this for you by
commenting out these other values so you can see what's going on here.
And then we'll just step through the chain. Okay, so this is the result
that we get from doing this. And then
if we convert this to pandas, it looks pretty similar. We need an index here,
so we're going to stick the year into the index. It looks semi-sorted,
but it's not, so we're going to sort it, and then we'll call
plot on this. That looks pretty good, but the legend's in the middle, so we'll
just stick the legend off onto the side there.
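A sketch of the pivot-and-plot chain; the pivot keyword names have shifted across Polars versions (columns was later renamed on):

```python
(autos2
 .with_columns(
     pl.when(pl.col('make').cast(pl.Utf8)
               .is_in(['Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla']))
       .then(pl.lit('US'))
       .otherwise(pl.lit('Other'))
       .alias('country')
 )
 .pivot(index='year', columns='country',
        values=['comb08', 'speeds'], aggregate_function='mean')
 .to_pandas()
 .set_index('year')
 .sort_index()
 .plot()
 .legend(bbox_to_anchor=(1, 1))  # move the legend off to the side
)
```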
Okay, so this lets us look and see that around
2008 or so, we did see a bump up in the
combined mileage. It also looks like speeds has
an upward trend, or a bend, at that point as well.
Okay, so we've looked at the Polars library.
A couple of things to note about this. If you
change your types, the correct types will allow you to save space,
which can allow you to load more data and run things
at a quicker clip. Chaining operations
is going to make your code more readable; it's going to make it look like
a recipe of operations. I also think it removes bugs and makes it
easier to debug.
In pandas I recommend chaining as
well, but in Polars you actually get an additional benefit from chaining, because
it does have a query planner and it can do things like predicate pushdown,
et cetera. So you will want to chain
in Polars to make the most of it. And then remember
that function application is slow for math
operations. You want to try to avoid crossing that,
what I would call the Rust-to-Python boundary there.
And aggregations are super powerful. If you're not familiar with
them, play around with them. Once you get the pattern down,
they're going to be super powerful and help you answer a lot of questions
that you might have. And then I do like to visualize;
Polars does not have visualization, so I'm going to jump back to pandas to
visualize that. Well, I hope you've enjoyed this
quick introduction to Polars. It's a super powerful library
and it's progressing fast. It has a lot of
capabilities and is a lot faster than pandas
for a lot of operations. So check it out if you need
an alternative to pandas, or pandas isn't working for you. Polars, I think,
is in a great place. If you're interested in more content like this,
you can follow me on Twitter at __mharrison__, where I talk a lot
about Python and data science. Have a
great rest of your day and enjoy the conference. Thanks everyone.