Conf42 Python 2023 - Online

How To Generate Test Data for Your Database Project


Abstract

For testing purposes, especially if you’re building an app that uses any database technology to store information, you may need data to try out your project. In that case, you have to find a good dataset or use Faker, a Python library that you will learn how to use in this talk.

Summary

  • Mario Garcia is a technical evangelist at Percona. Today's talk is about how to generate test data for your database project: how to use Python to create your own data generator and how to create the database where the generated data will be inserted.
  • Faker has some built-in providers and properties that we use for generating data. By calling one of the properties we generate ten randomly generated names. We then store that data in a pandas DataFrame.
  • We create a multiprocessing pool so that every CPU core runs part of the process of generating the data. With multiprocessing, we take advantage of all the CPU cores available and see them all running at the same time.
  • Using multiprocessing makes the Python code faster. Some useful resources are listed at the end. If you have any questions about what I shared today, you can contact me on social media or by email.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Mario Garcia, I'm so happy to be here sharing with the global Python community at Conf42 Python 2023. Today's talk will be about how to generate test data for your database project. I'm a technical evangelist at Percona. You can find me on Twitter and LinkedIn, or you can contact me by email. And this is the agenda that I will follow today. If you need to generate data for testing your application or validating the schema of your database, you have two options. You can get a dataset from sites like Kaggle and then use pandas for processing that data and inserting it into the database. Or you can use a data generator. We have one available in one of our repositories, but it only works for MySQL. If you need a data generator for Postgres or MongoDB, I will show you how you can use Python for creating your own data generator. First we need to install the dependencies for our project and create the database where we will insert the data that we generate. I will explain the directory structure, show you how to use Faker, which is a library for generating data, and also how to create a pandas DataFrame; we need a DataFrame for storing the data that we generate. Then we will see how to establish a connection to the database and how to define the schema of the database. We have two options here. We could use SQLAlchemy for establishing the connection to our MySQL or Postgres database, and SQLAlchemy has a schema definition language that we can use for defining the schema. But as we are using pandas, and pandas has a built-in method for inserting data into SQL databases like MySQL or Postgres, we will use that built-in method instead. We will also see how to use multiprocessing for improving the runtime of our application, and then we will generate data by running the scripts that we create. First we need to create a requirements.txt that will contain a list of the dependencies of our project. Or, if you're using Anaconda, we need to create an environment.yml file that will contain the name of our environment, the version of Python that we are using, and the libraries that we are using for this project, including pandas, SQLAlchemy, tqdm, which is a library for adding a progress bar, and Faker, which is the library that we use for generating data. We also need the drivers for the databases that we are generating data for, including MySQL, Postgres and MongoDB. After we add that information into our requirements.txt file or our environment.yml file, we install our dependencies by running pip install or conda env create, depending on the way we configured our development environment. Then, after we install our dependencies, we need to create the database. This is an example; you can adjust this information according to the requirements of your project. We only create a database for MySQL or Postgres. For MongoDB we don't need to do that, because the database and the collection where the information will be stored are created while running the script. Well, this is the directory structure of our project. We have three modules: one contains the information about how to establish the connection to the database for MySQL, Postgres and MongoDB; another contains the instructions for generating the data and storing it in a pandas DataFrame that will later be processed for inserting that information into our database; and in the schema.py file we specify the schema of our database.
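As a rough sketch, an environment.yml along these lines would cover the dependencies mentioned above; the environment name, the Python version and the specific driver packages are assumptions for illustration, not values taken from the talk:

    name: data-generator              # assumed environment name
    dependencies:
      - python=3.10                   # assumed Python version
      - pandas
      - sqlalchemy
      - tqdm                          # progress bar
      - faker                         # fake data generation
      - pip
      - pip:
          - mysql-connector-python    # MySQL driver (one possible choice)
          - psycopg2-binary           # PostgreSQL driver (one possible choice)
          - pymongo                   # MongoDB driver

With pip you would put the same package names in requirements.txt and run pip install -r requirements.txt; with Anaconda you would run conda env create -f environment.yml.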
We will see that later, but we are not using the SQLAlchemy schema definition language. Here we also create an environment.yml file and a requirements.txt file for our dependencies. As the way of establishing the connection to the database and inserting the data is similar for MySQL and Postgres, we create one script for both of those database technologies and a separate script for MongoDB. Well, how do we use Faker? Faker has some built-in providers and properties that we use for generating data, and name is one of those properties. So first we need to import Faker and initialize it, and then by calling one of the properties we generate ten randomly generated names. This list will change every time you run this code, as this data is generated by using a random method. What providers and properties are available with Faker? We have some built-in providers, the ones created by the developers behind Faker, but we also have some providers created by the community. These are some of the built-in providers that are available, including the person provider that we can use for generating a name. This will generate a pair of first name and last name, or we can generate those values separately by calling the first_name and last_name properties. We have an address provider that we can use for generating an address, city or country, a job provider that can be used for generating a job, a company provider for generating a company name, and an Internet provider that we use for generating an email or company email. Well, now that we know how to use Faker for generating data, it's time to create our script. First we need to import some libraries and methods that we will use throughout the whole process. We import the cpu_count method from the multiprocessing module; it returns the number of CPU cores available, and we need this information as we are implementing multiprocessing to use all the CPU cores available. We will see that later. Then we import pandas as well as tqdm and Faker; tqdm is a library for adding a progress bar. We create and initialize a Faker generator, and we get the number of CPU cores available minus one. This way we avoid the computer freezing from using all the CPU cores available. So, for example, instead of using 16 cores on a CPU that has that number of cores available, we use 15 of them and leave one free. Okay, and these are the columns that we will add to this table. We will generate that data by calling some of the properties available in the Faker library, so we will generate first name, last name, job, company and the other columns listed there. So how do we generate that data and then store it in a pandas DataFrame? As we are implementing multiprocessing, the idea is that we divide the whole process into small pieces. So, for example, if we are generating 60,000 records, we divide that number of records among the CPU cores available, and every CPU core will generate part of that information. Then we concatenate that data to create a single pandas DataFrame. That's why we divide the number of records that we will generate by the number of CPU cores minus one. Then we create an empty DataFrame, create all the columns in the DataFrame, and assign as values the values returned by calling some of the properties provided by Faker.
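A minimal sketch of this data-generation step could look like the following; the exact column list, the record count and the function name create_dataframe are assumptions based on the description above, not the talk's actual code:

    from multiprocessing import cpu_count

    import pandas as pd
    from faker import Faker

    fake = Faker()                               # initialize the Faker generator
    print([fake.name() for _ in range(10)])      # ten randomly generated names

    num_cores = max(cpu_count() - 1, 1)          # leave one core free so the machine stays responsive
    records_per_core = 60_000 // num_cores       # each core generates its share of the records

    def create_dataframe(_):
        # Build an empty DataFrame and fill each column with values
        # returned by Faker's built-in providers
        df = pd.DataFrame()
        df['first_name'] = [fake.first_name() for _ in range(records_per_core)]
        df['last_name'] = [fake.last_name() for _ in range(records_per_core)]
        df['job'] = [fake.job() for _ in range(records_per_core)]
        df['company'] = [fake.company() for _ in range(records_per_core)]
        df['email'] = [fake.company_email() for _ in range(records_per_core)]
        return df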
Okay, now that we have generated the data, it's time to establish a connection to the database. This is part of the base.py module that I explained as part of the directory structure. What we do here is first import the create_engine and sessionmaker methods, the first one for establishing the connection to the database, and then we create a session that is bound to the connection that we establish to our database. For MySQL we use this driver, so we need to specify the database technology that we are using plus the driver that we will use for establishing the connection to the database, as well as for running all the queries that we need. Then we specify the user, the password, and the URL of the host, which can be localhost, an IP address or a URL if you are hosting your database in the cloud. And then we create a session that is bound to our connection to the database. For Postgres, the way to establish a connection is similar. What changes here is that we are using Postgres and a different driver, and we need to specify the user, the password, the port assigned to the server and the name of the database. Then we create a session. For MongoDB we use PyMongo. We establish a connection to the database by calling the MongoClient method, and we need to specify the user, the password, the port and the URL of our MongoDB server. Okay, so there are two ways of defining the schema of your database. We can use the schema definition language provided by SQLAlchemy, but remember that while we can use SQLAlchemy for MySQL, Postgres and SQLite, we can't use SQLAlchemy for MongoDB. So this time we are using the built-in method provided by pandas for inserting the DataFrame that we previously created into the database, and the way that we define our schema is different from the way we define it when using the schema definition language: we only need to specify the names of the columns and the value types. Well, after we define the schema of the database, we need to understand why we are using multiprocessing. With multiprocessing, we take advantage of all the CPU cores available. So when we are running this script, this is how the CPU usage appears, for example, in our monitoring tool, Percona Monitoring and Management: we see all the CPU cores running part of the process at the same time. That's why we are implementing multiprocessing, instead of having just one CPU core running the whole script and generating all the records that we need. For example, we are generating 60,000 records, and by default Python will only use one CPU core. With multiprocessing we divide the whole process into small pieces and every CPU core runs part of the script. Here we are using 15 CPU cores and every CPU core generates 4,000 records. After that, we concatenate all that data into a single DataFrame. Okay, after learning a little about multiprocessing, we need to import all the methods and libraries for the rest of our script. We are going to create a multiprocessing pool so that every CPU core runs part of the process of generating all the data. We need to get the number of CPU cores available by calling the cpu_count method, and we need to import pandas as well as some of the methods available in the modules that we created before: base.py, schema.py and dataframe.py.
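To illustrate the connection and schema part, here is a rough sketch of what base.py and schema.py might contain; the credentials, hosts, ports, database name and column types are placeholders, and passing the column-to-type mapping works through pandas' dtype argument to to_sql:

    # base.py (sketch): one engine and session per SQL database, plus a MongoDB client
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker
    from sqlalchemy.types import String
    from pymongo import MongoClient

    # MySQL: database technology plus driver, then user, password, host and database name
    mysql_engine = create_engine('mysql+mysqlconnector://user:password@localhost/employees')
    MySQLSession = sessionmaker(bind=mysql_engine)

    # PostgreSQL: the same pattern with a different driver and the server port
    postgres_engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/employees')
    PostgresSession = sessionmaker(bind=postgres_engine)

    # MongoDB: PyMongo's MongoClient takes the user, password, host and port
    mongo_client = MongoClient('mongodb://user:password@localhost:27017/')

    # schema.py (sketch): column names mapped to SQLAlchemy types,
    # later passed to DataFrame.to_sql(..., dtype=schema)
    schema = {
        'first_name': String(50),
        'last_name': String(50),
        'job': String(100),
        'company': String(100),
        'email': String(100),
    }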
Then every CPU core calls the create_dataframe method and generates 4,000 records, and we concatenate that data into a single DataFrame. For MySQL and Postgres we call the to_sql method provided by pandas. We need to specify the name of the table where the information will be stored and the database connection information, and if the table already exists we append the data; we can change that later. We also need to specify the schema, the one that we defined before, so that pandas knows where to insert all the information. Okay, for MongoDB, this is the way that we insert data into a MongoDB database. First we need to convert the data that we generated into a Python dictionary, then we need to specify the name of the database and the collection where the information will be stored, and by calling insert_many we insert all the data into the MongoDB database. After we insert the data into each database, MySQL, Postgres and Mongo, we need to modify the tables, especially for MySQL and Postgres, because we did not define a column in the table to be our primary key. So we need to add another column, and it will hold auto-increment values. The way we do that for MySQL and Postgres is slightly different. For MySQL we execute this SQL statement, ALTER TABLE employees ADD id, and that column will be our primary key; it will be NOT NULL and its values will be auto-incremented. With the FIRST option we are saying that the column will be added at the beginning of the table. Postgres doesn't have the AUTO_INCREMENT option; what it has is the SERIAL type, which is similar to AUTO_INCREMENT in MySQL. That's why we have two different instructions for the two database technologies. Okay, so it's time to run the scripts. I already created a repository where you can find all the scripts that I explained here. I'm using Anaconda, but you can install all the dependencies by running pip install. If you want to use Anaconda for better managing your environment and all the dependencies of your project, you can download it here; you can find installation files for Windows, Mac and Linux. Then we install Anaconda, and this is the instruction for Linux distributions. Then we clone the repository that contains all the examples that I explained here, change to the data generator directory, and configure the Python environment by installing all the dependencies; this will create the environment defined in the environment.yml file. We need to edit the base.py module that contains the information about how to establish the connection to the database, and then we run the script. Finally, we can check the database for the data that we generated by running the script. Here are some resources that can be useful. You can check the Faker website, where you can find more information on the bundled providers, the ones that are available with Faker by default, and on the community providers. You can check the SQLAlchemy website, as well as the schema definition language, to learn how you can define the schema of your database by using SQLAlchemy. We didn't do that here because we are using the built-in method provided by pandas for inserting data into the database. And here you can find more information about using multiprocessing to make Python code faster.
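Putting those steps together, the main script might look roughly like this. It reuses the create_dataframe, mysql_engine, mongo_client and schema names from the sketches above, and the table, database and collection names ('employees') are again assumptions for illustration rather than the repository's actual code:

    from multiprocessing import Pool, cpu_count

    import pandas as pd
    from sqlalchemy import text

    # create_dataframe, mysql_engine, mongo_client and schema come from the
    # sketches above (dataframe.py, base.py and schema.py in the real project)

    if __name__ == '__main__':
        num_cores = max(cpu_count() - 1, 1)

        # Each worker in the pool generates its share of the records
        with Pool(num_cores) as pool:
            frames = pool.map(create_dataframe, range(num_cores))

        # Concatenate the partial results into a single DataFrame
        df = pd.concat(frames, ignore_index=True)

        # MySQL / Postgres: pandas' built-in to_sql inserts the whole DataFrame,
        # appending if the table already exists and using the column types we defined
        df.to_sql('employees', con=mysql_engine, if_exists='append',
                  index=False, dtype=schema)

        # Add an auto-increment primary key afterwards (MySQL syntax shown;
        # Postgres would use a SERIAL column instead of AUTO_INCREMENT)
        with mysql_engine.begin() as conn:
            conn.execute(text(
                "ALTER TABLE employees "
                "ADD COLUMN id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST"
            ))

        # MongoDB: convert the DataFrame to dictionaries and call insert_many
        collection = mongo_client['employees']['employees']
        collection.insert_many(df.to_dict('records'))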
That is the idea behind multiprocessing: instead of having one CPU core running all the tasks in our script, we have all the CPU cores available running part of the process, and we can improve the runtime of our application. Okay, so here you can find more information about the careers available. I just want to say thank you for being here; I really enjoyed sharing with the community today. If you have any questions about what I shared today, you can contact me on social media or by email. You can also check the blog post that I already published about this topic, and I'm planning to update the code of this project.
...

Mario Garcia

Technical Evangelist @ Percona



