Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, I'm Mario Garcia. I'm so happy to be here sharing with the global Python community at Conf42 Python 2023. Today's talk will be about how to generate test data for your database project. I'm a Technical Evangelist at Percona. You can find me on Twitter or LinkedIn, or you can contact me by email. And this is the agenda that I will follow today.
If you need to generate data for testing your application or validating the schema of your database, you have two options. You can get a dataset from sites like Kaggle and then use pandas for processing that data and inserting it into the database. Or you can use a data generator. We have one that is available in one of our repositories, but it only works for MySQL. If you need a data generator for Postgres or MongoDB, I will show you how you can use Python for creating your own.
First we need to install the dependencies for our project and create the database where we will insert the data that we will generate. I will explain the directory structure. I will show you how to use Faker, which is a library for generating data, and also how to create a pandas data frame. We need a data frame for storing the data that we will generate. Then I will show how to establish a connection to the database and how to define the schema of the database. We have two options here. We will use SQLAlchemy for establishing the connection to our MySQL or Postgres database, and SQLAlchemy has a schema definition language that we can use for defining the schema. But as we are using pandas, and pandas has a built-in method for inserting data into SQL databases like MySQL or Postgres, we will use that built-in method instead. We will also see how to use multiprocessing for improving the runtime of our application, and then we will see how to generate data by running the scripts that we will create.
First we need to create a requirements.txt file that will contain a list of the dependencies of our project. Or, if you're using Anaconda, we need to create an environment.yml file that will contain the name of our environment, the version of Python that we are using, and the libraries that we are using for this project, including pandas, SQLAlchemy, tqdm, which is a library for adding a progress bar, and Faker, which is the library that we use for generating data. We also need the drivers for the databases that we are generating data for, including MySQL, Postgres and MongoDB. After we add that information to our requirements.txt file or our environment.yml file, we need to install our dependencies by running pip install or conda env create, depending on how we are configuring our development environment.
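As a rough sketch, a requirements.txt for this kind of project might look something like the following; the exact driver package names are assumptions, so use whichever driver you prefer for each database:

    pandas
    sqlalchemy
    tqdm
    faker
    pymysql           # MySQL driver (assumption)
    psycopg2-binary   # Postgres driver (assumption)
    pymongo           # MongoDB driver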
Then, after we install our dependencies, we need to create the database. This is an example; you can adjust this information according to the requirements of your project. First we create a database for MySQL or Postgres. For MongoDB we don't need to do that: the database and the collection where the information will be stored are created while running the script.
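For example, from the MySQL or Postgres shell a single statement is enough (the database name here is only an illustration):

    CREATE DATABASE mydb;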
Well, this is the directory structure of our project. We have three modules: one containing information about how to establish the connection to the database, including connection information for MySQL, Postgres and MongoDB; one module that contains the instructions for generating the data and storing it in a pandas data frame that will later be processed for inserting that information into our database; and the schema.py file, where we specify the schema of our database. We will see that later, but we are not using SQLAlchemy's schema definition language. Here we also create an environment.yml file and a requirements.txt file for our dependencies. As the way of establishing the connection to the database and inserting the data is similar for MySQL and Postgres, we create one script for both database technologies, and we create a separate script for MongoDB.
Well, how do we use Faker? Faker has some built-in providers and properties that we use for generating data. Name is one of those properties. So first we need to import Faker and initialize it, and then, by calling one of the properties, we generate ten randomly generated names. This list will change every time you run this code, as the data is generated using a random method.
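A minimal sketch of what that looks like in code:

    from faker import Faker

    fake = Faker()  # initialize the Faker generator

    # Calling the name() method ten times gives ten randomly generated names
    names = [fake.name() for _ in range(10)]
    print(names)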
What providers and properties are available with Faker? We have some built-in providers, but we also have providers created by the community. The ones listed here are the built-in ones, the ones created by the developers behind Faker. Well, these are some of the providers that are available, including the person provider that we can use for generating names. The name property will generate a pair of first name and last name, or we can generate those separately by calling the first_name and last_name methods. We also have an address provider that we can use for generating an address, city or country, a job provider that can be used for generating jobs, a company provider, and an Internet provider that we use for generating an email or company email.
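For instance, a quick tour of those properties might look like this sketch:

    from faker import Faker

    fake = Faker()

    # person provider
    print(fake.name())
    print(fake.first_name())
    print(fake.last_name())

    # address provider
    print(fake.address())
    print(fake.city())
    print(fake.country())

    # job, company and internet providers
    print(fake.job())
    print(fake.company())
    print(fake.email())
    print(fake.company_email())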
Well, now that we know how to use Faker for generating data, it's time to create our script. First we need to import some libraries and methods that we will use for the whole process. First we import the cpu_count method from the multiprocessing module, which will return the number of CPU cores available. We need this information as we are implementing multiprocessing to use all the CPU cores available; we will see that later. Then we import pandas as well as tqdm and Faker. tqdm is a library for adding a progress bar.
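A sketch of those imports, assuming the libraries named above:

    from multiprocessing import cpu_count

    import pandas as pd
    from faker import Faker
    from tqdm import tqdm  # progress bar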
We create and initialize a Faker generator, and we get the number of CPU cores available minus one. This way we avoid the computer freezing when using all the CPU cores available. So, for example, instead of using 16 cores on a CPU that has that number of cores available, we use 15 of them and leave one free to avoid the computer freezing.
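In code, that setup is roughly:

    fake = Faker()               # Faker generator used by every worker
    num_cores = cpu_count() - 1  # leave one core free so the machine stays responsive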
Okay, and these are the columns that we will add to this table, and we will generate that data by calling some of the properties available in the Faker library. So we will generate first name, last name, job, company and the other columns listed there. So how do we generate that data and then store it in a pandas data frame? As we are implementing multiprocessing, the idea is that we divide the whole process into small pieces. So, for example, if we are generating 60,000 records, we divide that number of records across every CPU core available, every CPU core will generate part of that information, and then we concatenate that data to create a single pandas data frame. That's why we are dividing the number of records that we will generate by the number of CPU cores minus one. Then we create an empty data frame, create all the columns in the data frame, and assign as values the values returned by calling some of the properties provided by Faker.
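A minimal sketch of that generator function, using the first few columns mentioned above; the function signature and the 60,000-record total are taken from the description in the talk, but the exact code is an assumption:

    def create_dataframe(arg):
        # Each worker builds its own chunk: the total divided by the cores in use
        x = int(60000 / num_cores)
        data = pd.DataFrame()
        data['first_name'] = [fake.first_name() for _ in range(x)]
        data['last_name'] = [fake.last_name() for _ in range(x)]
        data['job'] = [fake.job() for _ in range(x)]
        data['company'] = [fake.company() for _ in range(x)]
        # ...the other columns are generated the same way with their Faker properties
        return data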
Okay, now that we have generated data, it's time to establish a connection to the database. This is part of the base.py module that we created before, the one I explained as part of the directory structure. What we do here is first import the create_engine and sessionmaker methods, the first one for establishing the connection to the database. Then we create a session that is bound to the connection that we are establishing to our database. For MySQL we use this driver. We need to specify the database technology that we are using plus the driver that we will use for establishing the connection to the database, as well as for running all the queries that we need. Okay, then we specify the user, the password, and the URL for the host, which can be localhost, an IP address, or a URL if you are hosting your database in the cloud. And then we create a session that is bound to our connection to the database.
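A sketch of that MySQL connection with SQLAlchemy; the pymysql driver, the credentials and the database name here are assumptions, not the exact values from the talk:

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    # dialect+driver://user:password@host:port/database
    engine = create_engine('mysql+pymysql://user:password@localhost:3306/mydb')

    Session = sessionmaker(bind=engine)
    session = Session()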
For Postgres, the way to establish a connection is similar. What changes here is that we are using Postgres and a different driver, and we need to specify the user, the password, the port that is assigned to this server, and the name of the database. Then we create a session.
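The Postgres variant only swaps the dialect, driver and port (psycopg2 here is an assumption):

    engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/mydb')
    Session = sessionmaker(bind=engine)
    session = Session()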
For MongoDB we use PyMongo. We establish a connection to the database by calling the MongoClient method, and we need to specify the user, the password, the port and the URL of our MongoDB server.
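And a sketch of the MongoDB connection with PyMongo, again with placeholder credentials and host:

    from pymongo import MongoClient

    client = MongoClient('mongodb://user:password@localhost:27017/')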
Okay, so there are two ways of defining the schema of your database. We can use the schema definition language provided by SQLAlchemy. But remember that while we can use SQLAlchemy for MySQL, Postgres and SQLite, we can't use SQLAlchemy for MongoDB. So this time we are using the built-in method provided by pandas for inserting the data frame that we previously created into the database, and the way that we define our schema is different from the way we define it when we use the schema definition language. We only need to specify the names of the columns and their value types.
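A sketch of what schema.py could contain under that approach: a plain mapping from column names to SQLAlchemy types that can later be handed to pandas' to_sql() through its dtype argument. The exact columns and types here are assumptions:

    from sqlalchemy.types import String

    schema = {
        'first_name': String(50),
        'last_name': String(50),
        'job': String(100),
        'company': String(100),
        # ...one entry per column generated in the data frame
    }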
After we define the schema of the database, we need to understand why we are using multiprocessing. With multiprocessing, we take advantage of all the CPU cores available. So when we are running this script, this is how the CPU usage appears, for example, in our monitoring tool, Percona Monitoring and Management: we see all the CPU cores running part of the process at the same time. That's why we are implementing multiprocessing, instead of having just one CPU core running the whole script and generating all the records that we need. For example, we are generating 60,000 records, and by default Python will only use one CPU core. With multiprocessing we divide the whole process into small pieces, and every CPU core runs part of the script. For example, here we are using 15 CPU cores and every CPU core is generating 4,000 records. After that we concatenate all that data into a single data frame.
Okay, after learning a little about multiprocessing, we need to import all the methods and libraries for the rest of our script. We are going to create a multiprocessing pool so that every CPU core runs part of the process of generating all the data. We need to get the number of CPU cores available by calling the cpu_count method, and we need to import pandas as well as some of the methods available in the modules that we created before, including base.py, schema.py and dataframe.py. Then every CPU core calls the create_dataframe method, every CPU core generates 4,000 records, and we concatenate that data into a single data frame. For MySQL and Postgres we call the to_sql method provided by pandas. We need to specify the name of the table where the information will be stored, the database connection information, and that, if the table already exists, we will append the data; we can change that later. We also need to specify the schema, the one that we defined before, so that pandas knows where to insert all the information.
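Putting those pieces together, the main script might look roughly like this; the module and object names (engine, schema, create_dataframe) follow the structure described above, but the exact code is a sketch, not the talk's verbatim implementation:

    from multiprocessing import Pool, cpu_count

    import pandas as pd

    from base import engine                 # connection from base.py
    from dataframe import create_dataframe  # generator function from dataframe.py
    from schema import schema               # column/type mapping from schema.py

    num_cores = cpu_count() - 1

    if __name__ == '__main__':
        # Each worker in the pool builds its own chunk of the data
        with Pool(num_cores) as pool:
            frames = pool.map(create_dataframe, range(num_cores))

        # Concatenate the chunks into a single data frame
        df = pd.concat(frames, ignore_index=True)

        # For MySQL/Postgres: pandas' built-in to_sql() inserts the whole frame,
        # appending if the table already exists
        df.to_sql('employees', con=engine, if_exists='append',
                  index=False, dtype=schema)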
Okay, for MongoDB, this is the way that we insert data into the MongoDB database. First we need to convert the data that we generated into a Python dictionary. Then we need to specify the name of the database and the collection where the information will be stored, and by calling insert_many we insert all the data into the MongoDB database.
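A sketch of that MongoDB path, with the database and collection names as placeholders:

    # Convert the data frame into a list of dictionaries, one per record
    records = df.to_dict(orient='records')

    # The database and collection are created on first insert if they don't exist
    db = client['mydb']
    db['employees'].insert_many(records)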
After we insert the data into every database, MySQL, Postgres and Mongo, we need to modify the tables, especially for MySQL and Postgres, because we are not defining a column in those tables that will be our primary key. So we need to add another column, and it will hold auto-increment values. The way that we do that for MySQL and Postgres is slightly different. For MySQL we execute an ALTER TABLE statement on the employees table, adding an id column that will be our primary key. This column will be NOT NULL and its values will be AUTO_INCREMENT, and with the FIRST option we are saying that the column will be added at the beginning of the table. For Postgres, we don't have the AUTO_INCREMENT option; what we have in Postgres is the SERIAL type, which is similar to AUTO_INCREMENT in MySQL. That is why we have two different instructions for the two database technologies.
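Reconstructed from that description, the two statements could be run from Python roughly like this; the exact SQL is an assumption based on the clauses mentioned above:

    from sqlalchemy import text

    # MySQL: NOT NULL, AUTO_INCREMENT primary key, placed FIRST in the table
    mysql_stmt = text(
        "ALTER TABLE employees "
        "ADD COLUMN id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST"
    )

    # Postgres: SERIAL plays the role of AUTO_INCREMENT
    postgres_stmt = text(
        "ALTER TABLE employees ADD COLUMN id SERIAL PRIMARY KEY"
    )

    # Run the statement that matches the engine you are connected to
    with engine.begin() as conn:
        conn.execute(mysql_stmt)   # or postgres_stmt for a Postgres engine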
Okay, so it's time to run the script. Well, I already created a repository where you can find all the scripts that I explained here. I'm using Anaconda, but you can install all the dependencies by running pip install. If you want to use Anaconda for better managing your environment and all the dependencies of your project, you can download it here; you can find installation files for Windows, Mac and Linux. Then we install Anaconda (this is the instruction for Linux distributions), and then we clone the repository. You can find the repository that contains all the examples that I explained here at this repository. We change to the data generator directory and configure the Python environment by installing all the dependencies and configuring the environment; this will create the environment defined in the environment.yml file. We need to edit the base.py module that contains the information about how to establish the connection to the database, and then we run the script. Finally, we can check the database for the data that we generated by running this script.
Here are some resources that can be useful. You can check the Faker website, where you can find more information on the bundled providers, the ones that are available with Faker by default, and on the community providers. You can check the SQLAlchemy website as well as the schema definition language to learn how you can define the schema of your database using SQLAlchemy. We didn't do that here because we are using the built-in method provided by pandas for inserting data into the database. Here you can also find more information about using multiprocessing to make Python code faster. That is the idea behind multiprocessing: instead of having one CPU core running all the tasks in our script, we have all the CPU cores available running part of the process, and we can improve the runtime of our application.
Okay, so here you can find more information about the careers available. I just want to say thank you for being here, and I really enjoyed sharing with the community today. If you have any questions about what I shared today, you can contact me on social media or by email. You can also check the blog posts that I already published about this topic, and I'm planning to update the code of this project.