Transcript
Okay. Hi, I'm Matt Harrison and we're going to be talking about idiomatic pandas.
So I'm excited to be here today. I run a company called Metasnake that does
corporate training and consulting in python and data science. Some of the reasons you
might want to listen to me talking about pandas is because I've written a couple
books on pandas. I co-wrote the Pandas 1.x Cookbook, also wrote Learning the Pandas Library, and I have this Machine Learning Pocket Reference that also uses a
lot of pandas as well. In addition, I've taught pandas to thousands of people over
the years and used it for various consulting projects. So I've seen it basically since it was released and have used it from about that time as well.
I've seen a lot of good pandas code and bad pandas code. So what I
want to do today is talk about some ways that you can make your pandas
code better. Let me just share my notebook here and here's the outline
of what we're going to do. We're going to be looking at loading our data
really quickly. Then we'll talk about using the correct types for your data. We'll talk
about chaining, mutation, application or the
apply method, and finally aggregation. Okay, I'm going to do my imports here at the
top. We're going to be using numpy and pandas. We are using pandas version
1.2.3. I'm going to set some options here just so I can see a few
more rows than the default, and load our data set. This data set is from fueleconomy.gov. It's got information about cars that were sold in the US, I believe from 1984 to 2018 or so. So it looks something like this, and it's got
quite a bit of data in it. So we've got 40,000 rows and 83 columns
of data about this. So this is a nice little data set to let us
explore various features, since it has different types in it as well. So if
you look at the columns in here, here's some of the columns we have.
Like how many barrels of gas does a car consume in a year? What's the
city mileage, highway mileage, engine description, the drive?
So a bunch of good information about cars. Okay, so that's the data.
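In code, the setup looks something like this minimal sketch; the option value and file name are assumptions (fueleconomy.gov distributes this data as a CSV):

```python
import numpy as np
import pandas as pd

pd.set_option('display.min_rows', 20)  # show a few more rows than the default

autos = pd.read_csv('vehicles.csv')    # assumed file name for the fueleconomy.gov data
autos.shape                            # roughly (40000, 83)
```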
Let's dive in. So the first part here is using the right types. So getting
the right types in pandas can be a challenge. But if you can do that,
it's going to save you memory. It's going to make your code easier to work
with and it can save you computation time as well. I'm just going to
limit what we're going to do here to a few columns. So again, we had
83 columns in that data set. I don't want all 83 columns here. So what
I'm going to do is pull off the columns I want to look at and
look at the dtypes attribute. This is going to give me a series that has
the column names in the index and the dtypes for those columns on the right-hand side there. So you can see I've got a bunch of int64, some float64, some objects, and those are the types. One thing you
might want to do is you might want to look at how much memory this
is using. So here's our memory usage for each column. If we sum that up, we can see the total amount of memory we're using. Just look at that number. It's around 19, blah, blah, blah. So we'll just say it's 19 megabytes right now. And we're going to basically try and lower that if we can.
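A sketch of that memory check; the cols list is an assumption reconstructed from the fueleconomy.gov column names mentioned in the talk:

```python
cols = ['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive',
        'eng_dscr', 'fuelCost08', 'make', 'model', 'trany', 'range',
        'createdOn', 'year']

# Per-column memory, then the total (deep=True counts string contents too).
autos[cols].memory_usage(deep=True)
autos[cols].memory_usage(deep=True).sum()
```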
So the first thing is to consider our integer columns. Again, I can look at some of the integer columns here. So I'm going to use select_dtypes, pass int into it, and then chain that with the describe method here. And here are the integer types that are
currently on this data frame that I have. We've got city mileage, combined mileage,
highway mileage, fuel cost, the range and the year.
So fuel cost is how much you would spend on fuel, range is
how far an electric car will go, and city combined and highway,
that's miles per gallon. How far these cars go, how many miles
they go with a gallon of gas or petroleum. And then year, that's the year
the car was released. Okay? So first of all, I probably wouldn't write this code like this. I would write it in this chain style right here. So you'll
see throughout the course today that I do a lot of this chaining. And I
just think it makes it easier to read. So you can see, first I'm going
to take the autos, then I'm going to pull off the columns and I'm going
to select the dtypes. And then finally I'm going to describe that. It's a few more lines; we're just splitting it across a few more lines. And we're also wrapping
it with parentheses, which allows us to put it on multiple lines. Now, one thing
you want to do when you do this describe is you want to look at
the min and max values because that can indicate what types you can use for
integers. By default, pandas is going to use this lowercase int64 type.
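That chain might look like this sketch, reusing the cols list from above:

```python
# Pull off the columns, keep the integer dtypes, and describe them to
# inspect the min and max values.
(autos
 [cols]
 .select_dtypes(int)
 .describe()
)
```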
And you might want to use a different type. So there are other types in
here. Let's just show some of them. For example, this one down here is an int8. If you use that type, you can go from negative 128 to 127.
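You can check those ranges directly with numpy:

```python
np.iinfo('int8')    # min = -128, max = 127
np.iinfo('int16')   # min = -32768, max = 32767
```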
So if we look at our values here, for example highway mileage, we might get away with using this int8. There's also an int16 here. An int16 goes up to around 32,000. So it looks like all of our values there would support
being changed to int16 instead. So what we can do is we can stick this astype call into our chain. And then in select_dtypes we'll just select int and int8. And you can see that this is now showing us what is either an int or an int8. It's not showing us the int16 there,
but we did convert it. Okay. And so if you're not familiar with what this astype is doing, here's the documentation. Basically you can pass in a single dtype here, or you can pass in a dictionary of types.
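Sticking that astype into the chain might look like this sketch; the column-to-type mapping follows the description below:

```python
(autos
 [cols]
 .astype({'highway08': 'int8', 'city08': 'int16', 'comb08': 'int16'})
 .select_dtypes(['int', 'int8'])
 .describe()
)
```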
So what we're doing above with this astype is we're passing in a dictionary with the columns and the types that we want those columns to be. So you
can see that highway we're casting to an int8, and city and combined we're changing to int16s. That's why city and combined don't show up in this output: when we said select_dtypes, we did not select the int16 dtype. Okay, so here I'm just fleshing this out a little bit more. I'm filling in range and year as well.
Note that I'm now saying select_dtypes with the string 'integer'. And if we do that, this is going to pull off anything that's integer-esque. And so it will
get all those types in there. This is looking okay. We haven't lost any precision by casting our integer columns to these other types. We still have the same values
in them. But now we should be using a lot less memory to do that.
Let's just look at the memory here. And now we're using 18 instead of 19.
So we've got a little bit of savings in that. Let's look at floating point
numbers. So I'm going to say, let's describe what's a float and
we've got cylinders and displacement in there. So I'm
going to jump into cylinders. One thing that strikes me as I look at this is that cylinders actually looks integer-like. If you look at the values that came up there, those look very integer-esque. What I'm going to do is
just stick a describe on here to look at cylinders. And this is confirming my
suspicion that we have the mins and the quartiles there that are all whole numbers.
So why is that being represented as a float instead of an integer? That might be the question that pops up in your mind. Well, one thing you can do is use this value_counts call.
And if I do that and say dropna is equal to False, we can see why this is represented as a float. And the reason why is because in pandas, the lowercase int64 type does not have support for missing values. And you
can see right here, there are 206 entries where the cylinders are missing. These are cars that do not have cylinders in them. So you might
want to dive into that and figure out why those cylinders are missing. So this
code right here will just use query. I want to query where cylinders is NA. And so these are the cars where cylinders are missing.
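Those two inspections might look like this sketch:

```python
# Frequency of each cylinder count, keeping missing values visible.
autos.cylinders.value_counts(dropna=False)

# Rows where cylinders is missing; engine='python' lets query call .isna().
autos.query('cylinders.isna()', engine='python')
```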
And it looks like these are mostly, if you look in this model column right
here, it looks like these are mostly electric cars, which makes sense. I'm not a
super car buff, but apparently electric motors do not have cylinders in
them. So at this point we need to figure out what we want to do
with that cylinders column. So what I'm going to do is I'm going to make
an executive decision. Rather than having missing numbers in there, I'm just going to fillna with this little operation here. And then with that, I should be able to cast it to an int8. With this displacement here, I'm going to do a similar thing: fill in the missing values and then cast it.
Let's just look at the integer types after we do that. And we should see
that cylinders pops up here as an integer type. And the lowest value for a
cylinder is zero. That's right here. And the highest value is 16. Okay, so we've
converted cylinders successfully from a floating point to an integer.
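A sketch of that conversion; filling with 0 is an assumption consistent with the "no cylinders" reading for electric cars:

```python
(autos
 [cols]
 .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
         displ=autos.displ.fillna(0))
 .select_dtypes('integer')
 .describe()
)
```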
Let's see what else we have. We can also, if we want to dive into
these float types, the default float type is float64. If we wanted to look at something like a float16, we can do this numpy call here to see what the minimum and maximum values for a float16 can be.
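That numpy call is simply:

```python
np.finfo('float16')   # max is about 65500, plenty of range for displacement
```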
And it looks like we can support that here. So for the displacement, we should be able to use a float16. I'm going to stick that in there, and we'll have a float16 for displacement. Now let's look
at our memory usage now. After we've done that, our memory usage is now 17.5.
So again we've saved a little bit more memory. We started at 19 and
we've gone down a little bit. The next type that I want to look at
is object types. Pandas supports these native float and integer types, and basically it's allocating a block of memory of the specific type. So if it's an int8, it's going to use eight bits of memory for each integer rather than a whole Python object. But pandas also allows you to stick Python objects into each cell there. Now when you're doing that, you're going to use a lot more memory, because Python objects take up more memory, and you're also going to lose speed, since you're executing things at the Python level rather than at pandas' optimized, basically C, speed that you get if you stay at those low-level types. So let's look at what is
an object type in these columns we care about. We've got drive, engine description, make, model, transmission, and created on. So generally when you
read a CSV file, an object represents string data, though it could
also represent numeric data that had string-esque values in it. You can also see that created on looks like a date, but by default read_csv does not convert that into a date. Now I generally like to treat string types as
basically a few types. One is categorical. And what do I mean by categorical?
Well, if you look at like make, there's only a few different makes in there,
so that could be categorical. Model on the other hand is more free form.
There's going to be basically for every make, there'll be a couple of models for
every year. So it's kind of a gray area. It's a little free form.
It could be categorical-esque, but it's going to have a higher cardinality, a higher number of unique values. Transmission looks categorical. Created on, I mean, you could say it's categorical, but it's actually a date type. These entries all look
the same. So because there's so many entries that look the same, you might think
that's categorical, but I'm going to convert that to a date because then I can
leverage pandas date operations on that. We also have this eng_dscr, which is the engine description, and it looks like it's got parentheses and strings
inside of it. We'll dig into that a little bit more and then we have
drive, which looks categorical as well. So we kind of need to figure out what
we want to do with these if we want to keep them as strings.
If we convert them to categorical types, we'll save memory. If we convert them to dates, we'll be able to do date operations on them. Let's just start
off with drive. So what I like to do when I come across a string
column is stick it into this value counts call. That's going to give me the
frequency and let me evaluate whether it is indeed categorical or not. Again, I put in that dropna is equal to False to see if there are missing values in there. So these are the unique entries for drive.
And I would say that this is categorical. Now, it's interesting that, again, there are
some missing values in there. NaN is used to represent missing values. So, small aside: an object type in pandas can mean string, but it can also mean a mixed type. So in this case, drive is actually a mixed type. It's actually strings with floating-point NaNs; the floating-point NaNs are representing those missing values. So let's just dive into that. Like, where is the drive missing? I'm going to do this little call here: query where drive is NA. Looks like a
lot of these are electric, but some of them are not. There's like a Nissan 300. I'm not sure why the drive is missing from that. There are some Corvettes where the drive is missing. So is this front-wheel drive or rear-wheel drive? Why is this missing? Again, I'm not a car expert per se, but if you talk
to a subject matter expert, they could probably give you some insight into figuring out
the appropriate label for those various types. What I'm going to do, given that I'm not really a subject matter expert here, is say, let's just fill in the missing values with the string 'Other', and we're going to convert it to a category type. So I'm going to say astype('category').
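A sketch of that conversion, with 'Other' as the placeholder label:

```python
(autos
 [cols]
 .assign(drive=autos.drive.fillna('Other').astype('category'),
         make=autos.make.astype('category'))
 .memory_usage(deep=True)
 .sum()
)
```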
Let's just look at the memory after we do that, and look at that: just by changing the types there. Well, we also changed the make type down here. So we changed basically two types, drive and make. We went from, I believe, around 17, and we've now dropped to twelve. So categoricals are a way to save
a lot of memory in pandas. Why do we care about saving memory in pandas?
Again, I talked about that before, it might make things run faster. But the other
thing is that pandas is an in-memory tool. So in order to use pandas,
you need to be able to load your data into memory. So if you're sort of pushing that edge where you're running out of memory, using the right types can allow you to process more data. Okay, let's jump into the transmission column, trany. Again, value_counts is my go-to here.
Here's the call for value counts. And it looks like there are a bunch of
entries in there. Does it say at the bottom here? Sometimes it'll say how many
rows there are, but I'm going to estimate there's around 30 different entries in here.
So is this categorical? Maybe. I mean, it looks a little free form too.
There's these parentheticals in there. I'm not quite sure what the 4 versus the S7 versus the AM S7 means per se, but to me, as sort of a novice, one thing that sticks out is whether it's automatic or manual. The other thing is there's this number in there,
right? So those are both things that might stick out, that I might want to
keep track of in this data set. So what I'm going to do is I'm
going to put a new column here called automatic,
and that's just whether it contains 'Automatic' in there. And then I'm going to put another new column here called speeds. And that's just using this str.extract, which is passing in a regular expression here and pulling out that number.
If there are missing values, I'm sticking in the string 20 instead. It looks like the ones that are missing, if you dive into that, are probably variable-speed transmissions. So I'm just putting 20 in as a large number for speeds, and then I'm casting that as an int8. So if we do that and we look at our memory, our memory now has gone down from 19 to ten megabytes. So we've almost halved our memory by doing these little operations here. Note that I also now have a few more columns that might have more interesting information.
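A sketch of those two transmission features; the regular expression and fill value follow the description above:

```python
(autos
 [cols]
 .assign(automatic=autos.trany.str.contains('Automatic'),
         speeds=autos.trany.str.extract(r'(\d+)', expand=False)
                     .fillna('20').astype('int8'))
)
```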
Is there more that you could do with this column, the trany column? Yeah,
there might be. Right, so again, talking to a subject matter expert might reveal more
insight, other features that you can create for your data. Let's look at the date
columns here. So I do have this created on column. So what
I'm going to do is I'm going to run this two date time call.
This is a function in pandas, and then I'm going to call this tz localize
down here because it looked like it was in the eastern time zone. So let's
just run that and I get a warning here. It complains about that, but if
you look at my memory usage again, I'm going down quite a bit. I'm now
at 7000 here. So this warning here is due to a
Python issue where Python doesn't understand this EDT or EST in there. So with this little change here, replacing EDT and EST with their numeric UTC offsets, when I rerun this here, I don't have any issues.
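A sketch of that fix; the numeric offsets for EDT and EST are assumptions:

```python
(autos.createdOn
 .str.replace('EDT', '-04:00')
 .str.replace('EST', '-05:00')
 .pipe(pd.to_datetime, utc=True)
 .dt.tz_convert('America/New_York')
)
```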
Now, in this case, I'm pulling off the value counts and just looking at the engine description value counts. But let's just drop the engine description. So that's what I'll do down here, because this looks like free form. But before I drop it, I'm going to add this little thing here: does it contain FFS? Again, I'm not a pro at these engine things, but it looks like there's a bunch of FFS in there, so that might be something that I want to keep. So this is an example of just making a new column, whether it contains FFS or not. So let's run that. And we're now down to eight. So we're not quite a third of our original data size, but we're getting close to that. Okay, so this is kind of nice. Now, one thing you may or may not
have noticed is that I've just sort of built this up as we've been going
along, and Jupyter makes it really nice to do that. What I like to do
now is I like to make a function that will take my raw data that
I've read directly out of a CSV file, and I generally just call it tweak_ whatever the name is. I pass in my raw data here and I just apply
all these chains right here, and this gives me my cleaned up data. So this
is really nice to have a function like this in one place, in one cell.
Generally, I'll put it right after the load up at the top. Then when I
want to revisit my data, I know that I can load the raw data and
I can immediately get my cleaned up data. This also allows for what's called data
provenance, where I can track through and see what happened, where every row is coming
from. Generally your pointy-headed boss, after you do an analysis or calculation, is going to ask you to explain it. And if you don't have this provenance, where you can come back to the original data, that might be problematic and bite
you. So this is what I like to do, and this should read pretty clearly here. I'm taking autos, I'm pulling off these columns, I'm assigning a cylinders column, a displacement column, a drive column, an automatic column, a speeds column, a created on column, and an ffs column. I'm changing the types for various ones of these, and I'm dropping these columns. So this sort of reads like a recipe: do these steps and then you'll have your cleaned-up data.
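Reconstructed as a sketch, the whole function might look something like this; the exact fill values, regexes, and column names are assumptions pieced together from the steps above:

```python
def tweak_autos(autos):
    cols = ['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive',
            'eng_dscr', 'fuelCost08', 'make', 'model', 'trany', 'range',
            'createdOn', 'year']
    return (autos
        [cols]
        .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
                displ=autos.displ.fillna(0).astype('float16'),
                drive=autos.drive.fillna('Other').astype('category'),
                automatic=autos.trany.str.contains('Automatic'),
                speeds=autos.trany.str.extract(r'(\d+)', expand=False)
                            .fillna('20').astype('int8'),
                createdOn=pd.to_datetime(autos.createdOn
                                         .str.replace('EDT', '-04:00')
                                         .str.replace('EST', '-05:00'),
                                         utc=True)
                            .dt.tz_convert('America/New_York'),
                ffs=autos.eng_dscr.str.contains('FFS', na=False))
        .astype({'highway08': 'int8', 'city08': 'int16', 'comb08': 'int16',
                 'fuelCost08': 'int16', 'range': 'int16', 'year': 'int16',
                 'make': 'category'})
        .drop(columns=['trany', 'eng_dscr'])
    )

autos2 = tweak_autos(autos)
```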
Okay, so that is my first thing to do with pandas: you probably want to look at your data and make sure that your types are correct. The next section here is to talk about chains.
So chains are something that I actually just showed you. Sometimes this is called flow
programming. But what we're doing is we're leveraging the fact that pandas generally does not mutate data; rather, it returns new data frames. And so we're just going to take the new data frame and do operations on that. So on that note, because I said pandas is an in-memory tool, you want to have three to ten times the amount of memory as the data frame you're working with, because you need
to have some overhead for doing these intermediate operations. I know some of you who probably use pandas are like, no Matt, that's not the way to do it, there's an inplace parameter that you can use on various methods, and pandas will do things in place. Yeah, there is an inplace parameter. However, the inplace parameter generally doesn't do things in place as you might think it does. So I
do have a hint here if you find some code that you can't change: sometimes I like to stick in a pipe call. Pipe is a nice way to inline any arbitrary code. Basically you can stick in a function that takes a data frame and returns a data frame for you.
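A sketch of pipe with a hypothetical helper (log_shape is not from the talk):

```python
def log_shape(df):
    print(df.shape)   # any side effect you like
    return df         # hand the frame back so the chain continues

(autos
 [cols]
 .pipe(log_shape)
 .select_dtypes(int)
 .pipe(log_shape)
)
```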
So this is my chain here. Now I think this actually reads very cleanly. I just walked through it in the
last section here. Now compare this with what I generally see in pandas code in the wild. So basically I've taken just these few lines here and split them out into something like this: making all these intermediate variables, changing them, and then sticking them back in. And I'm not even doing that for the rest of the code. For the rest of the code, I'm not even making the intermediate variables, I'm just assigning these directly. And generally, again, I wouldn't
see that, but just for the sake of typing this in as an example,
you could imagine that generally what I see when I'm working with students or clients
is that this code here is going to be like three times as long as
it is here. So contrast that with this, which I think is really easy to
read, it's easy to step through and is going to make your code and
life a lot easier. Now let's just run this here. And actually when you run
it, you'll notice all these little warnings here that are sort of the bane of pandas people everywhere: the SettingWithCopyWarning. You'll note that I didn't run into that at all, because I used assign here. So if you use assign and you're chaining, you're never going to run into this SettingWithCopyWarning, which is kind of confusing; you go and read the descriptions and the Stack Overflows and you're still kind of confused about what to do. This is also easy to
debug as well. So people sometimes say, Matt, that's not easy to debug, because I don't have the intermediate objects. I say it is easy to debug. Let me
just show you some things you can do. One thing I can do is I
can just comment out everything and walk through it as I'm going through it.
So here I've got a call to tweak autos here, let me just show you.
So here is tweak autos. If I want to, I can just come in here
and say, well, what did it do before this? This is what it did,
right? And I can look at that intermediate result and walk through from the
start to the end. So that's one way to debug it, which I think
is actually very useful, more so than having all these arbitrary variables sitting around.
Another thing I can do if I want an arbitrary variable sticking around is make a function like this get_var. It's going to take a data frame and a variable name, and look what it does. This is sort of a dirty hack: it just basically injects that variable name, pointing to the current data frame, into my globals. So guess what? Right down here I've got this pipe get_var 'df3'. So this is right after getting the columns. And if I come down below this and look at df3, this is the intermediate state. So I can get that intermediate state very easily by piping in this little hack function that updates it if I need it.
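The hack looks something like this sketch:

```python
def get_var(df, var_name):
    globals()[var_name] = df   # inject the intermediate frame into globals
    return df

(autos
 [cols]
 .pipe(get_var, 'df3')   # right after getting the columns
 .select_dtypes(int)
)

df3   # now holds the intermediate state for inspection
```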
So again, I claim that this is a lot cleaner. It's going to allow you to very easily look at your code,
but pull out stuff that you need to. One more thing that I have in
here is this pipe, and this is a pipe to display a data frame here.
So I can display one as well. And I didn't really show this, but if
you look at the output here, there's actually two data frames here. So here's the
first one. And if I scroll down a little bit more,
let me rerun this. This was commented out, so I'm going to rerun this. Okay, so here's the first one, and there's actually the second one here. So I put a little function in here that I piped in. It's just a lambda function: it calls display and then returns the data frame after displaying it. So the pipe takes a data frame and returns a data frame. We're just shimming in a print along the way so we can inspect it. So this
is the way that it's going to make it very easy for you to debug these chain operations.
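A sketch of that display shim, assuming a Jupyter session where display is available:

```python
(autos
 [cols]
 .pipe(lambda df: display(df) or df)   # display returns None, so "or df"
 .select_dtypes(int)                   # hands the frame back unchanged
)
```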
Okay, the next step that I have is do not mutate. And if you look at my chained operations,
because I'm using chains, chains don't mutate. And so I'm obeying this.
And I've got a quote here from one of the core developers of pandas; I've got the link to the issue as well. It's talking about getting rid of the inplace option in pandas. And the quote is, Jeff Reback says: you're missing the point. Inplace rarely actually does something in place. You are thinking you're saving memory, but you're not. And I come across this
all the time when I'm teaching pandas or looking at code: people will use inplace and they think, oh, it says in place, so it's just updating the memory in place. Generally it's not doing that. Generally when you call inplace, it's making the new object that you'd get anyway and then shimming it back in. So it's
not saving you any memory over what you would get by chaining. But basically, by forcing yourself to chain, you're going to make that clean code, and you're also going to get rid of the bane of Python developers everywhere, that annoying error that's like, what does this mean? By using chains and assigns, you get rid of that. This is a short section, but I've sort of demoed
it up to this point. In general, there are no performance benefits from doing inplace. So if you think you're getting them, you're just fooling yourself.
Pandas is not really doing an in place operation, even though it says it is.
It also prohibits this chaining, which I think is going to really mess up your
code, make it uglier and harder to read. And I think that's one of the
most important things: to have code that's easy to read. And then again, you will get rid of that SettingWithCopyWarning that is the bane of pandas
developers everywhere. Okay, the next section is don't apply if you can.
So let's just make our data frame here. I'm going to call it autos2 here. And this is sort of US-centric, but maybe I want to move to something that's more world-centric. A lot of people, instead of doing miles per gallon, they don't even know what a mile, let alone a gallon, is. They will use liters per 100 km. Right. As a US person, I don't really know what that is, but Americans are kind of weird that way. But I've got this function here that will convert miles per gallon to liters per 100 km; the conversion there is 235 divided by the miles-per-gallon value. So what I can do is I
can take the city mileage and call the apply method on it with that function. And there's our conversions there. This looks like that works.
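A sketch of the two approaches; to_lper100km is a hypothetical name and 235.215 is the usual conversion constant:

```python
def to_lper100km(mpg):
    return 235.215 / mpg

autos2.city08.apply(to_lper100km)   # runs Python once per value

235.215 / autos2.city08             # one vectorized operation
```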
Note that instead of doing apply here, I can also just do that arithmetic directly on the column. And this gives me something that is actually
the same values. Now the output is the same, but what's going on under the covers is actually very different. When I'm calling apply, what apply is doing is taking each individual entry in that column and running each individual entry through this Python function in this cell down here. When I operate on the column directly, what this is going to do is a vectorized or broadcast operation.
Remember, pandas stores this as a block of data. It's going to take that block of data, do that arithmetic with this number, and give me a new block of data without going back into Python. We're saving sort of the serialization/deserialization step. In fact, we can just time this here. I'm going to run this little
time operator and we'll see how long this takes with both of these
options here. So it looks like the apply takes about six milliseconds.
Let's look at the vectorized version here. It takes about 143 microseconds. Sometimes this changes between runs, and these calculations are from a previous run of the numbers, but it's pretty similar. It's around a 50x difference between doing an apply and doing the vectorized operation. So my point here is anytime you find yourself
doing apply, check whether that is what you want to do. Now, there are some
cases where apply sort of makes sense. Let's just sort of talk about them.
Let's just say I want to check whether the make is American or not. And I'm going to apply this little function here that just says: is the value for that American? And I'll also do this other one; there's also an isin on make. And so if I do apply, that's 1.5 milliseconds. You'll notice that if I do isin, that's about half the amount of time. So even with this string operation here, by doing an isin, I am saving some time, because pandas can do it a little bit faster.
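A sketch of that comparison; us_makes is a hypothetical, abbreviated list:

```python
us_makes = {'Chevrolet', 'Ford', 'Dodge', 'GMC'}

autos2.make.apply(lambda val: val in us_makes)   # Python-level loop
autos2.make.isin(us_makes)                       # pandas-level, roughly 2x faster here
```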
Now, if we do something a little bit more complicated here, we might do something
like making a country column in here, and it's either going to be US or Other. Again, I'm sort of US-centric; apologies if that offends you. Here, the timing for that is like 2.6 milliseconds.
Generally, if I was doing a calculation and it had some
conditional like this, I would use a where statement here. So this is using the
where down here. This is how I canonically would do this with a calculation.
However, in this case, because it's working with strings instead of numbers, it's going to be slower. So this is almost two times as slow to do with the where. However, if you were to do a where on a numeric column rather than a string, you would actually see an improvement in speed. Let's just
do one more operation here. There's also this np.select, which is a nice little tool, and there's np.where, which again are very useful for basically building up conditional statements. They're a little bit more powerful than the where in pandas. However, in this case, for strings, they are slower; but if you're doing numeric calculations, all of these will be faster than the apply operation there. My takeaway here with this apply: I see a lot of people using apply.
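Sketches of those conditional approaches, reusing the hypothetical us_makes set from above:

```python
# pandas where: keep 'US' where the condition holds, else 'Other'.
pd.Series('US', index=autos2.index).where(autos2.make.isin(us_makes), 'Other')

# numpy equivalents:
np.where(autos2.make.isin(us_makes), 'US', 'Other')
np.select([autos2.make.isin(us_makes)], ['US'], default='Other')
```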
Don't use it unless you're in a string situation. Otherwise you're going to want to use one of these: np.select, pandas where, or np.where, and your code is going to run a lot faster. I get that where
might be a little bit confusing. Your code might be a little bit harder to
read. But if speed is of the essence, try and avoid apply if
possible. And really, with string operations, because strings are stored as Python objects, you're going back to that Python layer anyway. So in that case I'll say, yeah, apply is okay there. But generally I try to avoid apply for numeric
operations. Okay, the last section here is aggregation. Let's talk about aggregation.
Let's say that I want to compare mileage by country by year.
And so when I start hearing "by this" or "by that", basically it means to me that I want to do a group by aggregation. Pandas kind of has two ways to do this. One is with a group by; the other one is with a pivot table. I'm a software engineer by training, and so group by makes sense to me. I'm not really an Excel master. People who are very adept with Excel tend to favor the pivot table way of doing this, but both of them will get you to the same place. I'm just going to
show the group by way here. So what I'm going to say is group by
and then I'm going to take the mean here. So if you're not familiar with
this, this is really powerful. What it's doing is it's going to take and put
in the index the year. You can see in the left hand side, the bolded
thing is the index. And then it's going to say, okay, for every row that
had the year of 1984, get all the numeric columns and take the mean of
each numeric column. Put that in the column. So this is actually pretty cool.
I've written it out as three lines of code, but you could write it as
a single line, a single operation to do this aggregation here, which is pretty cool.
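As a sketch:

```python
# For each year, take the mean of every numeric column.
(autos2
 .groupby('year')
 .mean()
)
```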
Now the sky's sort of the limit on what you can do with these. So instead of
just doing all the columns, I can say I want to pull off combined and
speed and just get the mean for those if I want to. So I stick
in this little operation here. Note that I am chaining these. I'm writing these out
each on their own line, and I can sort of walk through and debug that
if I need to. Now, one thing I like to do is just stick on
a plot. Humans aren't optimized for looking at tables of data. So I can look
at this table and I can see it looks like speed is going up,
but it's kind of hard for me to get a feel for what's going on.
But if I just stick a plot on the end of this, what this is
going to do, it's going to take the index by default, and each column it
will plot as a line in there. So look at this. There's a visualization.
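A sketch of pulling off two columns and plotting them:

```python
(autos2
 .groupby('year')
 [['comb08', 'speeds']]
 .mean()
 .plot()   # index (year) on the x-axis, one line per column
)
```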
This visualization tells me a lot more, because I can visually see the representation of what's going on there: our speeds are going up over time, and our combined mileage was sort of static until around 2010, then it started ticking up. Now, again, I'm not a car guru by any means, but this is kind of interesting just from looking at the visualization. So what happened? Why was there that inflection point around 2007-2008 that made things start ticking up? And this could also be the data. We're also taking the mean,
so there could be outliers in that. I'm imagining what this is, is electric cars,
and electric cars tend to skew the mean a lot, so maybe we'd look
at the median instead. And the median is still going up a little bit,
but it doesn't look like it's going up quite as much as the
mean there. And so in this case, what I'm going to do down here is
we said by country, right? And I've just been doing this by year, but now
what I'm going to do is I'm going to change some operations in here.
I'm going to make a country column. So I'm going to do that apply and get the country column in there. And then, look, I'm going to group by year and country. So let's look at what this gives
me. This gives me this data frame with the index, you can see that there's
both the year and the country in the index. So this is a hierarchical or
a multi index, which is a little bit confusing to use. But hopefully you can
see that there's power there. We can see, for 1984, here's the summary for the US rows, and there's a summary for the other, non-US rows as well. We can
go a little bit deeper as well. So we can go down this rabbit hole
here. I'm going to say, instead of just doing the mean here, I'm going to call agg, which allows me to do multiple aggregations. So I'm going to pass in the minimum, the mean, and I'm even going to pass in my own defined function here that's going to take a series and aggregate it, collapsing it to a single value.
So in this case, my aggregation is a little bit weird: it takes the second-to-last value. Now, this is just showing that you can pass in whatever function you want to. It just has to take a series and collapse it to a single value. And so here we go.
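A sketch of that aggregation; second_to_last assumes each group has at least two rows, and the country column reuses the hypothetical us_makes set from earlier:

```python
def second_to_last(ser):
    return ser.iloc[-2]

(autos2
 .assign(country=np.where(autos2.make.isin(us_makes), 'US', 'Other'))
 .groupby(['year', 'country'])
 [['comb08', 'speeds']]
 .agg(['min', 'mean', second_to_last])
)
```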
Now, you can see that, in addition to the hierarchical index, we have hierarchical columns. So for every numeric column here,
we have the minimum value, the mean value, and the second to last value as
well. So the sky's sort of the limit with what you want to do with these things. So what I'm going to do is just go back to what I had, year and country, take the mean of that, and then plot that. And we get this plot here that's a little bit hard to grok.
I'm not quite sure what's going on. The issue here is that, remember, pandas is going
to plot the index along the x axis. So now it's actually plotting these tuples
of year and country, like US or Other, in there, which makes it a little
bit hard. So what you might want to do is one of these more advanced
operations here, where I'm going to use this unstack. So if you're not familiar with unstack, let me just comment it out and show you: here is grouping by year and country, and then we're going to take the mean. If we come in here and say unstack, what that's going to do is take the innermost index level, which is country, and pop it up into the columns. And so you can see that we now have country in the columns.
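A sketch of that unstack, again with the hypothetical country column:

```python
(autos2
 .assign(country=np.where(autos2.make.isin(us_makes), 'US', 'Other'))
 .groupby(['year', 'country'])
 .mean()
 .unstack()   # pop country out of the index into the columns
)
```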
Now we only have a single thing in the index. We've got year. And if
we wanted to look at, for example, just the city mileage for a year from
that, we could pull off the city column, which is going to give us the
sub-columns there, because that's hierarchical. Let me just comment that out and show you what that's giving us. Again, we're going from this unstacked version, and we're going to say city. And you can see I'm using this debug style to pull that out. Now we're going to plot that: plot the index, plot each of those columns as a line, and we get something that looks like this. So I've sort of used this construct of grouping by
and unstacking to illustrate some of the power. And this allows
me to quickly visualize that and see what's going on with the US. We can see that the US city mileage is going up, maybe faster than the non-US mileage, at least in this data set that I have here. Okay, so in
summary, pandas is super powerful. You can do a lot of things with it with
very few lines of code. In my experience, I've seen good pandas code and I've
seen some bad pandas code. If you follow these steps, I guarantee
that your code will be better. It will have fewer bugs, it will probably run
faster, and your colleagues will probably be more happy with you because they're going to
be able to understand your code. So, correct types: again, those save memory and make things fast, and also give you functionality like date and string manipulation. Chaining operations: that's going to make your code more readable. It's going to remove bugs.
It's going to be easier to debug as well. I showed some examples of that.
Don't mutate. If you find yourself mutating, you should take a step back and think
about how to do it without mutating. Again, apply is generally your
enemy, so if you're trying to do math operations with apply, you're probably doing it
the wrong way. Apply is okay for string operations, but generally should be avoided.
And then aggregations are very powerful. I get that they can be confusing, but if
you use this chain style and start building them up, it's going to help you
understand what's going on and you're going to be very empowered. So thanks for watching.
I hope you enjoyed this. If you want to learn more about pandas, you can follow me on Twitter: @__mharrison__.
I also have a discount code to my pandas course. It's going to be 50% off for people who are watching this. So the code is learn pandas, and that
will be active for the next week. Good luck with your pandas. It's been a
pleasure being with you.