Abstract
Nowadays data is getting bigger and bigger, making it almost impossible to process it on desktop machines.
To solve this problem, a lot of new technologies (Hadoop, Spark, Presto, Dask, etc.) have emerged in recent years to process all the data using clusters of computers. The challenge is that you will need to build your solutions on top of these technologies, which requires designing data processing pipelines and, in some cases, combining multiple technologies.
However, sometimes we don't have enough time or resources to learn and set up a full infrastructure just to run a couple of experiments. Maybe you are a researcher with very limited resources or a startup with a tight schedule to launch a product to market.
The objective of this talk is to present multiple strategies to process data as it grows, within the limitations of a single machine or with the use of clusters. The strategies will focus on technologies such as Pandas, PySpark, Vaex and Modin.
Outline
1.- Introduction (2 mins)
2.- Vertical scaling with Pandas and the Cloud (3 mins)
3.- Keeping the memory under control by reading the data by chunks (5 mins)
4.- Processing datasets larger than the available memory with Vaex (5 mins)
5.- Scaling Pandas with Modin and Dask (5 mins)
6.- All-in with Pyspark (5 mins)
Transcript
This transcript was autogenerated.
Strategies for working with data as it grows.
Hello everyone, my name is Marco Carranza. I'm an entrepreneur and also the technical cofounder of Teamcore Solutions. I have been a Python user for more than 15 years and I have had the opportunity to use it extensively to develop many technological solutions for the retail industry on a global scale. At Teamcore Solutions, we process sales information through machine learning algorithms, giving companies visibility into the execution of their products at each store and generating insights and specific actions for the office and the field teams.
The objective of this talk is to review alternatives for processing data using different types of technologies and data frames. First we will take a look at different techniques to keep pandas memory usage under control and allow us to process larger files. Then we will take a look at how we can vertically scale our pandas workloads using Jupyter notebooks. Next we will learn about the amazing Python library Vaex, so we can process a large amount of data that cannot fit in memory. Then we will try Modin, so we can speed up our pandas code with minimal changes to our Python source code. And finally, we will take a look at PySpark and understand in which cases it is a great alternative.
Introduction. Nowadays, data is getting bigger and bigger, making it almost impossible to process it on regular desktop machines. To solve this problem, in recent years a lot of new technologies have emerged to process all the data using clusters of computers. The challenge is that you will need to build your own solution on top of these technologies, which requires designing data processing pipelines and, in some cases, combining multiple technologies. However, sometimes we don't have enough time or resources to learn and set up a full infrastructure just to run a couple of experiments. Maybe you are a researcher with very limited resources or a startup with a tight schedule to launch a product to the market. Usually the software that processes data works fine when it's tested with a small sample file, but when you load the real data, the program crashes. In some cases, some simple optimizations can help to process the data, but when the data is much larger than the available memory, the problem is harder to solve. In this talk, we will show you multiple strategies to process data locally and review some alternative tools that can help us process large data sets using a distributed environment.
Pandas is the de facto tool when we are working with data in Python environments. Now we're going to see a couple of tricks that will help us keep the memory usage of our workloads under control.
Trick number one: sparse data structures. Sometimes we have data sets that come with many, many empty values. Usually these empty values are represented as NaN values, and using a sparse column representation can help us save some memory. In pandas, sparse objects use much less memory, especially if we save these objects as pickles on disk and when we are using them inside a Python interpreter. Let's take a quick look at a small example. As we can see in this data frame, when we list the content of the column named education 2003 revision, we'll see that there are a lot of rows with NaN values. And then, if we take a deeper look at how much memory we are using, we realize that we are consuming a lot of memory, close to 19 megabytes. So with a very simple command we can change the data type of that column and tell pandas to use a sparse type. After doing that, if we take a look again at the memory usage, we'll see that it has been reduced a lot. Basically, after changing only the data type, we have reduced the memory by 41%. This may not seem like much, but it's very useful, especially when you have very large data sets that come with a lot of empty values.
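As a rough sketch of this trick, the snippet below converts a mostly-empty column to a pandas sparse dtype and compares memory usage. The file name is a placeholder; the column name follows the one mentioned in the example.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("mortality.csv")  # placeholder file name

before = df["education_2003_revision"].memory_usage(deep=True)

# Convert the column to a sparse representation; NaN becomes the fill
# value and is no longer stored explicitly for every row.
df["education_2003_revision"] = df["education_2003_revision"].astype(
    pd.SparseDtype("float64", np.nan)
)

after = df["education_2003_revision"].memory_usage(deep=True)
print(f"before: {before / 1e6:.1f} MB, after: {after / 1e6:.1f} MB")
```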
Trick number two: sampling. Sampling is a very interesting and useful technique that will help us create a smaller data set from a larger one, and if it's done in the right way, it will help us run a faster analysis without sacrificing the quality of the results. Pandas has a special function for this, named sample. Let's see an example. In this example we have one large data frame, so we are creating a sample of 1,000 rows. But before running a sample, you need to be careful, because a common mistake is to think that simply picking the first thousand rows gives a proper sample. In reality, the correct way is to use this function, because you will get a more uniform sample that is better for further analysis. For example, if we later run the pandas describe function, we'll get some histograms and also some descriptive statistics, and if we compare the result of the original data frame with the sampled one, the results are pretty similar. Also, if we take a look at the histograms, we'll see that both are very similar. But if we make the mistake of only picking the first n rows, we will get a completely different result.
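A minimal sketch of the sampling trick, assuming a data frame df is already loaded; the random_state argument and the column name are added here only to make the sketch reproducible and concrete.

```python
import pandas as pd

df = pd.read_csv("mortality.csv")  # placeholder file name

# Random sample of 1,000 rows: rows are drawn uniformly from the whole
# data frame, unlike df.head(1000), which only takes the first rows.
sample = df.sample(n=1000, random_state=42)

# The descriptive statistics of the sample should be close to those of
# the full data frame.
print(df.describe())
print(sample.describe())

# The histograms of both should also look very similar.
df["detail_age"].hist()      # placeholder column name
sample["detail_age"].hist()
```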
Okay, trick number three: load only the columns that you need. In some cases we have very large data sets that come with many columns; sometimes you can have hundreds of columns. There's no point in loading all of these columns into memory, so the basic rule is: fewer columns, less memory. Let's take a quick look at an example. In this small example we have a large text file that is 3.8 GB on disk. After reading it, we realize that we have 77 columns, and if we analyze the memory usage we will see that we're using 4.5 GB of memory. A quick way to reduce the memory usage is to select only the columns that we are going to work with. After selecting the four columns that you can see in the example, we realize that we have reduced the memory from 4.5 GB to a little more than 300 megabytes. This can also be done directly when reading the CSV file.
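A sketch of both variants follows: selecting columns after loading, and the usually better option of passing usecols to read_csv so the unused columns are never parsed at all. The file and column names are placeholders, not the exact ones from the example.

```python
import pandas as pd

columns = ["resident_status", "sex", "detail_age", "month_of_death"]  # placeholders

# Option 1: load everything, then keep only the columns of interest.
df = pd.read_csv("mortality.csv")
df = df[columns]

# Option 2 (better): never load the other columns in the first place.
df = pd.read_csv("mortality.csv", usecols=columns)

print(df.memory_usage(deep=True).sum() / 1e6, "MB")
```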
Trick number four: use smaller numerical data types. There are multiple numerical data types in pandas, and depending on the type we can store more or less information. For example, if we want to store the age of a person, there's no point in using the int64 data type, because we are going to waste a lot of memory. In that case it's much better to use a smaller data type such as int8. Let's take a quick look at an example. In this data frame we can see the column named detail age type, and we realize that all the values are between one and nine. But when we analyze the data type of that specific column, we see that we are using int64 and consuming 196 megabytes of memory. The recommendation is to change that data type: after switching to a smaller one, we will be using only 24 megabytes. That means that with only one line we have reduced the memory consumption by 87.5%.
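A sketch of this downcasting trick, using a placeholder file name and the column name mentioned in the example; the dtype can also be set directly at load time.

```python
import pandas as pd

df = pd.read_csv("mortality.csv")  # placeholder file name

print(df["detail_age_type"].dtype)                               # int64 by default
print(df["detail_age_type"].memory_usage(deep=True) / 1e6, "MB")

# The values only range from 1 to 9, so a single byte per row is enough.
df["detail_age_type"] = df["detail_age_type"].astype("int8")
print(df["detail_age_type"].memory_usage(deep=True) / 1e6, "MB")

# The same can be done directly while reading the file:
# df = pd.read_csv("mortality.csv", dtype={"detail_age_type": "int8"})
```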
Trick number five: using categorical data types. In some cases it's possible to shrink non-numerical data and reduce the total memory footprint. For these cases pandas has a custom categorical data type. Let's take a quick look at an example. In this example we can see that there is a column named sex that can have only two categorical values, F and M. When we analyze the type of data that we are using, it's an object, and if you remember, objects can consume a lot of memory. When we look deeper, we realize that we are using more than 142 megabytes of memory just for that column. But if we change the data type to a categorical type, we will see a huge memory reduction: from 142 megabytes to only 2.4. That means we are reducing the total memory by about 98%.
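A minimal sketch of the categorical conversion, again with a placeholder file name and the column from the example.

```python
import pandas as pd

df = pd.read_csv("mortality.csv")  # placeholder file name

# As an object column, every row stores a full Python string.
print(df["sex"].memory_usage(deep=True) / 1e6, "MB")

# A categorical column stores each distinct value only once and keeps
# small integer codes per row.
df["sex"] = df["sex"].astype("category")
print(df["sex"].memory_usage(deep=True) / 1e6, "MB")
```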
Trick number six: reading data by chunks. Attempting to read a large file can lead to a crash if there's not enough memory for the entire file to be read at once. Reading the file in chunks makes it possible to access very large files by reading one part of the file at a time. Let's take a look at a small example. In this example we have a very large file that is almost 4 GB on disk. Basically, we are reading the CSV file in pandas, but we are adding a parameter named chunksize. In this case we are iterating over the file and reading the rows in blocks of 500,000. Every time we read a part of the CSV file, we count the values and accumulate the results in a separate variable. As we continue looping, the memory of the previous chunk is released and the next chunk is read. After finishing the whole process, we get the result of the desired calculation.
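A sketch of this pattern, assuming a hypothetical file and column; the chunksize matches the block size of 500,000 rows mentioned above.

```python
import pandas as pd

totals = pd.Series(dtype="int64")

# Only one chunk of 500,000 rows is held in memory at a time; the
# previous chunk is released when the next iteration starts.
for chunk in pd.read_csv("mortality.csv", chunksize=500_000):
    counts = chunk["resident_status"].value_counts()  # placeholder column
    totals = totals.add(counts, fill_value=0)

# After the loop, totals holds the value counts over the whole file.
print(totals)
```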
Vertical scaling with Jupyter on the cloud. The easiest solution to not having enough RAM is to throw money at the problem: that's basically vertical scaling, and thanks to cloud computing this is a very easy task. Vertical scaling is the ability to increase the capacity of existing hardware or software by adding resources: CPU, memory, disk, et cetera. On the other hand, we have horizontal scaling, which involves adding machines to the pool of existing resources. Jupyter is a very popular tool that helps us create documents with live code. Thanks to this tool, it's very easy to run this code on the cloud, and there are plenty of large and cheap machines available from the cloud providers. What is the advantage of this approach? Basically, no code change is needed, and it's very easy if you are using the right cloud tools. There are plenty of options like Binder, Kaggle Kernels, Google Colab, Azure Notebooks, et cetera. It's very good for testing, cleaning data and visualizing it. And the good thing is that you only pay for what you use; of course, if you forget to turn off your machine, you'll have to pay for that too. What are the disadvantages of this approach? In the long run it's very expensive, because you haven't really optimized your code, you're only throwing more money at the problem. It does not scale very well, and it's also not production ready: normally your code is not optimized for production.
Speeding up pandas with Modin. Modin is a multiprocess data frame library with an API identical to pandas. That means you don't need to change your code, because the syntax is the same. Modin allows users to speed up their pandas workflows because it unlocks all the cores of your machine. It's also very easy to install and only requires changing a single line of code: you replace the pandas import with import modin.pandas as pd, so it's very easy.
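A minimal sketch of what that one-line change looks like in practice; the file and column names are placeholders.

```python
# Before: import pandas as pd
import modin.pandas as pd  # the only line that changes

# The rest of the code stays exactly the same, but reading the file and
# most operations are now distributed across all available CPU cores.
df = pd.read_csv("mortality.csv")               # placeholder file name
result = df.groupby("resident_status").size()   # placeholder column name
print(result)
```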
The pandas implementation is single threaded, which means that only one of your CPU cores can be utilized at any given point in time. But if you use the Modin implementation, you will be able to use all the available cores of your machine, or even all the cores available in an entire cluster. The main advantage of Modin is that it unlocks all the CPU power of your machine. Only one import change is needed, so no other changes in the code are required, and it's really fast when reading data. It also has multiple compute engines available to distribute the computation over clusters; these clusters can be implemented with Dask or Ray. The main disadvantage of Modin is that you need some extra setup depending on the compute engine chosen (Ray or Dask), plus the configuration of the clusters. Also, distributed systems are complex, so debugging can be a little bit tricky. And finally, Modin, like pandas, requires a lot of memory.
large data sets with by Biax
is a python library with a similar syntax to pandas that
help us work with large data sets in machines with limited resources
where the only limitations is the size of a hard drive.
By provides memory mapping so it will never touch or copy that data
to memory unless it's explicitly requested.
Bix also provides some cool features
like virtual columns and calculate statistics such as mean sum
count standard deviation, with a blazing speed of
up to billion object rows per second.
For the best performance, bikes recommends using the HDF
five format. HDF five is able
to store a large amount of numerical data set with their metadata.
But what do I do if my data is on a CSV format?
No problem. Bikes include the functionality to read a large CSV
file by chunks and convert them to HDF files on the
fly. To do this, it's very easy,
we only need to add a parameters convert equals true to the
function from csV.
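A sketch of this conversion, using a hypothetical CSV file and column names: from_csv reads the file in chunks, writes an HDF5 file next to it, and returns a memory-mapped data frame on which statistics are computed lazily.

```python
import vaex

# Reads the CSV by chunks and converts it to HDF5 on the fly; on the
# next run, the existing HDF5 file is opened directly instead.
df = vaex.from_csv("mortality.csv", convert=True, chunk_size=5_000_000)

# Virtual column: defined by an expression, computed on the fly,
# without allocating extra memory.
df["age_in_months"] = df["detail_age"] * 12  # placeholder column name

# Statistics are evaluated out of core over the memory-mapped data.
print(df.mean(df["detail_age"]))
print(df.count())
```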
Vaex also provides a data frame server, so calculations and aggregations can run on a different computer. Vaex provides a Python API over WebSockets and a REST API to communicate with the Vaex data frame server. The advantages of Vaex are: it helps control the memory usage with memory mapping, and there are amazing examples on their website. Vaex also allows computing on the fly with lazy expressions and virtual columns. It's possible and easy to build visualizations with a data set larger than memory. Machine learning is also available through the Vaex ML package, and data can be exported to a pandas data frame if you need a feature that is only available in pandas. The disadvantages of Vaex are: you need to make some modifications to your code, although the syntax is quite similar to pandas. Also, Vaex is not as mature as pandas, but it is improving every day. And it's a little bit tricky to work with some of the data types in HDF5.
All-in with PySpark. When you need to work with very large scale data, it's mandatory to distribute both the data and the computation to a cluster, and this cannot be achieved with pandas. Spark is an analytics engine used for large scale data processing. It lets you spread both data and computation over clusters to achieve a substantial performance increase. PySpark is the Python API for Spark. It combines the simplicity of Python with the high performance of Spark, and it also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark's features, such as Spark SQL, data frames, streaming, the machine learning libraries and Spark Core. The advantages of PySpark are: first, you will get great speed with a large data set. Also, it's a very rich and mature ecosystem with a lot of libraries for machine learning, feature extraction and data transformation. In addition, Spark can run on top of Hadoop, so it can access other tools in the Hadoop ecosystem. On the other hand, we have some disadvantages. The first thing that you will notice is that you need to modify your source code, because the syntax of PySpark is very different from pandas syntax. PySpark can also have very bad performance if the data set is very small; in that case, it's much better to keep using pandas. In Spark we have the machine learning library, but at this moment it has fewer algorithms. Finally, Spark requires a huge amount of memory to process all the data, so in some cases it may not be very cost effective.
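To give an idea of how different the syntax is from pandas, here is a minimal PySpark sketch that reads a CSV and runs a grouped count on a local session; the file and column names are placeholders.

```python
from pyspark.sql import SparkSession

# Local session for experimentation; in production the master would
# point to a real cluster instead of local[*].
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

df = spark.read.csv("mortality.csv", header=True, inferSchema=True)  # placeholder

# Transformations are lazy; show() triggers the actual computation.
df.groupBy("resident_status").count().show()  # placeholder column name

spark.stop()
```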
Final notes. There are multiple options to scale up your workloads. The easiest way is to vertically scale your resources with Jupyter and a cloud provider, but first, don't forget to optimize your data frame, or you will be wasting money. There are also some powerful alternatives for working with large data sets, like Vaex, so it's worth giving it a try if you need to process a huge amount of data. You can use Modin with Ray or Dask to distribute your workload, but this will add some extra complexity. Finally, you can rewrite your pandas logic to make it run over Spark data frames and take advantage of the many platforms available in almost all the cloud providers.
Thank you very much for your attention.