Conf42 Platform Engineering 2024 - Online

- premiere 5PM GMT

Billions of Records in MySQL: Lessons Learned


Abstract

Database performance issues are always a concern. If you’re asking, “How do I optimize MySQL for big data?” or struggling to achieve and maintain high performance on your MySQL instances, this talk is for you. We will also cover bugs in MySQL discovered by Lukas.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Welcome to Conf42, and welcome to Billions of Records in MySQL: Lessons Learned. I'm very excited to walk you through my experience working with bigger data sets, with billions of records, in MySQL, and I hope you will learn something from my talk. So let's get right into it. To begin with, I would like to show you two queries. One query searches for data without an @ sign, and the second searches for data with an @ sign in Boolean mode. These queries are what is known as full-text search queries. Such queries are very prevalent in MySQL when we are searching for data using a full-text search index, meaning that queries using the index will have MATCH ... AGAINST clauses. As you can see, they both match against something. If your query runs on MySQL and has such clauses, it is most likely using a full-text search index, and a full-text search index is privy to all kinds of behavior that you most likely wouldn't expect. In this case, one query completes very quickly and the other times out. The reason is that the second query hits a bug. The bug has a number that you can see on your screen; I found it somewhere around 2014 or 2015, when I was building a data breach search engine, which I will also share with you. The core reason behind this bug, I assume, is that the query takes a different path within the database, and as such, the query times out. Again, this is some weird behavior within MariaDB. So this is a glance at something you will see during this talk, but before we get too much into it, I should probably introduce myself. My name is Lukas. I'm an author.
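The two queries described above can be sketched roughly as follows. The table and column names here are hypothetical (the talk doesn't show its schema); the relevant detail is that in InnoDB's Boolean mode, `@` is reserved as the proximity operator, so search terms containing an `@` sign typically need to be quoted and can behave very differently from plain natural-language searches.

```sql
-- Hypothetical table with a full-text index.
CREATE TABLE accounts (
    id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    FULLTEXT INDEX ft_email (email)
) ENGINE=InnoDB;

-- Query 1: a search term without an "@" sign, natural-language mode.
SELECT id FROM accounts
WHERE MATCH(email) AGAINST('demo');

-- Query 2: a term with an "@" sign in Boolean mode — the variant the
-- speaker says can time out. The term is quoted because "@" is the
-- proximity operator in Boolean mode.
SELECT id FROM accounts
WHERE MATCH(email) AGAINST('"demo@example.com"' IN BOOLEAN MODE);
```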
I am originally a security guy, but I switched to databases a couple of years ago. I wrote a book titled Hacking MySQL: Breaking, Optimizing, and Securing MySQL for Your Use Case, which can be accessed at hackingmysql.com. And I've built one of the most prominent data breach search engines to this day: I run BreachDirectory.com, a data breach search engine which, in essence, lets you provide your email address, username, IP address, or whatever details you want, and then runs a search through stolen and leaked data to tell you whether you are at risk or not. I also blog very frequently in various places, including BreachDirectory and my own blog. I blogged for Arctype when it existed, which was a SQL client, and for DbVisualizer, Redgate, and other mediums as well. Aside from that, I have a YouTube channel called Database Dive; if anyone wants to have a gander through YouTube videos about databases, you're more than welcome to. I have also found bugs in MySQL, and I run my own blog that you can see on screen as well.

So the agenda for this talk would be: performance basics; the storage engines available in MySQL; how to optimize your schema and data types; how to optimize your data for performance and security; how to optimize for big data; how to work with and optimize your backups; a glance at advanced MySQL features; and some tools to improve performance as well.

Starting from the bottom, we have performance basics. To follow along, you should have some minimal understanding of MySQL or databases in general. When you see names on the screen such as my.cnf, these are most likely file names related to the database; I will try to explain them as they come across. For example, my.ini or my.cnf is the configuration file that you use to configure your MySQL server. Configuring your server means that you glance through the documentation and pick out the parameters applicable to the storage engine of your database, because all databases come with certain storage engines. For MySQL, the primary storage engine is InnoDB, and InnoDB has a couple of parameters that you need to optimize: the buffer pool size (innodb_buffer_pool_size), the log file size (innodb_log_file_size), and some other parameters such as innodb_flush_log_at_trx_commit, and so on. After these parameters have been optimized, only search for what you need, so that your database is not very busy to begin with. If you only need to return one book, search with a LIMIT 1 clause, so your database doesn't scan through maybe 15,000 or 20,000 books and returns the results faster. If you use MySQL, you should use InnoDB, the primary storage engine, which is the default for all MySQL databases anyway. But if you changed the storage engine for whatever reason, please switch it back, because MyISAM, which you probably changed it to, is not a very reliable storage engine, and most of the features that were available within that storage engine are available within InnoDB as well. You also need to follow performance best practices, which can often be found on database YouTube channels, in the documentation, or by reading blogs.

So after you have the basics sorted out, you need to choose your storage engine, or change it, for that matter. The majority of you, whether you use MySQL, MariaDB, Percona Server, or whatever else, are going to be using InnoDB or XtraDB, which are the primary storage engines available. And there are a couple of other options, as you can see on the screen.
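The parameters mentioned above live in my.cnf (or my.ini on Windows). A minimal sketch of such a fragment is below; the values are illustrative examples, not recommendations from the talk, and should be tuned to your hardware and workload.

```ini
[mysqld]
# Cache for InnoDB data and indexes; often sized to a large share of RAM
# on a dedicated database server.
innodb_buffer_pool_size = 8G

# Size of the redo log files (superseded by innodb_redo_log_capacity
# in MySQL 8.0.30 and later).
innodb_log_file_size = 1G

# 1 = flush the redo log to disk at every commit (safest);
# 0 or 2 trade durability for write throughput.
innodb_flush_log_at_trx_commit = 1
```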
There is MyISAM, there is MEMORY, there is CSV, there is ARCHIVE, there is BLACKHOLE, and if you use MariaDB, there are going to be a couple of other options, such as CONNECT. If you use Percona Server, instead of InnoDB you may see XtraDB, as you can see here. XtraDB is essentially the same as InnoDB, but with a couple of enhancements. Choose your storage engine carefully, depending on your use case. Most of you, for most use cases, will need InnoDB. But if you really need the exact count of rows in your tables, sometimes you may put MyISAM into the mix. Again, I wouldn't recommend using MyISAM for anything other than counting rows in your table, because it's not meant for much else. The one thing it is good at is counting rows, because MyISAM stores the internal row count of the table in its metadata, which InnoDB does not; that is precisely the reason why InnoDB cannot instantly return the exact row count of your table while MyISAM can. So anyway, after you've chosen the storage engine, you need to optimize the schema. Here you need to keep a couple of principles in mind. Keep things simple: no matter what you're working with, keeping things simple will spare you most of the problems. Avoid inventing new things or changing things where it is not necessary. Adhere to the documentation, because it will answer most of the questions that you have. And if you have any further questions, you can consider visiting conferences like this one, or you can ask questions on Stack Overflow; there is even the DBA Stack Exchange forum, and people are very active there, so you will get your questions answered relatively quickly.
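The row-count difference between the two engines can be sketched like this; the table names are hypothetical.

```sql
-- Two tables with the same shape but different engines.
CREATE TABLE logs_myisam (id INT NOT NULL) ENGINE=MyISAM;
CREATE TABLE logs_innodb (id INT NOT NULL) ENGINE=InnoDB;

-- On MyISAM, COUNT(*) with no WHERE clause reads the row count kept
-- in the table's metadata, so it returns instantly.
SELECT COUNT(*) FROM logs_myisam;

-- On InnoDB, the same query has to scan an index, which can be slow
-- on tables with billions of rows.
SELECT COUNT(*) FROM logs_innodb;

-- InnoDB does expose an estimated (not exact) row count via metadata:
SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_NAME = 'logs_innodb';
```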
Now, the other thing you can do is disable certain SQL modes, for example STRICT_TRANS_TABLES or STRICT_ALL_TABLES, to get automatic truncation. That means that if your data is too long to fit into a column, it will be automatically truncated, and MySQL will not come back with any errors. Again, this is not applicable to all use cases, but maybe for yours it is. You may also choose to avoid TEXT values, which means that your data will not take up as much space on the disk. And keep in mind that, as I said, all data types come with a certain footprint on disk: VARCHAR values, for example, will most likely occupy their length in bytes plus one or two bytes on the disk, and so on. If you run bigger data sets, loading data in is another issue you need to deal with, and for that you need to avoid big overhead. That means you should avoid INSERT queries and use LOAD DATA INFILE if possible. It has lots of available parameters, as you can see on the screen: you can set the character set, and you can set what your fields will be terminated by. There is a typo here on the slide; it should be TERMINATED BY, LINES TERMINATED BY, or ESCAPED BY, and yada yada yada. These options help your database understand what kind of raw data you're working with. And the good thing about working with raw data is that the database loads it into the table directly, as opposed to INSERT statements, which have to go through a bunch of operations just to insert the data: they have to lock and unlock the table, check data integrity, insert the data, prepare the database for the next insert, and yada yada, and do this for every insert, so it takes up time. Another thing you can do is partitioning. Partitioning is very useful if you run, for example, data breach search engines.
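A bulk load along these lines might look as follows. The file path, table name, columns, and delimiters are illustrative; note also that the server's secure_file_priv setting restricts which directories LOAD DATA INFILE may read from.

```sql
LOAD DATA INFILE '/var/lib/mysql-files/breach_dump.csv'
INTO TABLE accounts
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY '"'
    ESCAPED BY '\\'
LINES TERMINATED BY '\n'
IGNORE 1 LINES                      -- skip a header row
(email, username, ip_address);     -- map file columns to table columns
```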
When you have bigger data sets, partitioning, in essence, means you split your table into smaller tables that are treated as smaller tables by your application, but your database still sees them as a whole. So on one level, they're treated as, let's say, smaller tables when you need to search through them, but when you need to work with them, they're still one intact table. Partitioned tables can also take up noticeably more space when you look at their size on disk, because each partition is stored as its own table with its own overhead. And this is why people working with big data sets in MySQL often run out of disk space, or run into issues similar to that. They most likely run into issues working with big data because of one thing they don't understand: how certain queries, such as ALTER TABLE, work internally. ALTER TABLE makes a copy of the table on the disk. So when you need to change a table with ALTER TABLE, it makes a copy of the table on the disk, updates that copy, and then swaps the newly created table with your original table. For tables that have a lot of data, that will consume a lot of time, and if you don't have enough space on the disk, for example if your table is 200 gigabytes and you have only 57 gigabytes of free space, you will run out of space. So instead of running ALTER TABLE queries, which would make a copy of the table on the disk, some people make use of the DEFAULT clause. The DEFAULT clause is used when you create a table and you set the default value of a column to some value.
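A partitioned table of the kind described above can be sketched like this. The table, column names, and year ranges are illustrative; one MySQL constraint worth knowing is that every unique key, including the primary key, must include the partitioning column.

```sql
CREATE TABLE breach_records (
    id        BIGINT UNSIGNED NOT NULL,
    email     VARCHAR(255)    NOT NULL,
    leaked_at DATE            NOT NULL,
    PRIMARY KEY (id, leaked_at)     -- must include the partitioning column
) ENGINE=InnoDB
PARTITION BY RANGE (YEAR(leaked_at)) (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Queries filtering on the partitioning column can be pruned so that
-- only the matching partition is scanned:
SELECT email FROM breach_records
WHERE leaked_at BETWEEN '2023-01-01' AND '2023-12-31';
```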
That means, from your database's point of view, that one or more columns, whatever the count is for you, have a value that is already preset, and it doesn't need to be updated, right? The second thing you need to keep in mind is character sets and collations. These directly depend on the kind of data that you are storing in the database. So if you're storing, for example, Arabic data sets, you will need to make use of Arabic character sets; if you're storing, I don't know, Hebrew data sets, you need to make use of Hebrew character sets. But in most cases, you will want to use utf8mb4 with the utf8mb4_unicode_ci collation, which covers the full range of Unicode. And please don't use MySQL's utf8, because it is not really UTF-8: original UTF-8 supports up to 4 bytes per character, as I said, but utf8 in MySQL, which is an alias for utf8mb3, supports only 3 bytes per character, which means that some queries can error out. Especially if you're working with big data, that can be a concern. If you add a column to a table and that column is the last column, it is wise to apply a sort of hack, SET SESSION alter_algorithm='INSTANT', which you can see on the screen. It means you set the algorithm of the ALTER query to INSTANT, and the column, if it is the last column you add to the table, is added instantly, without any I/O on the disk or whatever, but it only works in newer MariaDB versions. Such hacks can be found all across the documentation and all across Stack Overflow, so please look around. So after you optimize your data for MySQL, or your MySQL for big data, whichever way around, you also need to back up your bigger data sets. Backing up 1 billion rows in MySQL isn't a very easy task, so let's say it like that, because mysqldump likely won't work.
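The instant-ALTER hack can be sketched as below. The table and column names are illustrative; the session variable is MariaDB-specific (10.3.7+), while MySQL 8.0 expresses the same intent per statement.

```sql
-- MariaDB: make INSTANT the default algorithm for this session's ALTERs.
SET SESSION alter_algorithm = 'INSTANT';

-- Adding a column at the end of the table can then complete as a
-- metadata-only change, with no table copy:
ALTER TABLE breach_records ADD COLUMN source VARCHAR(64) DEFAULT NULL;

-- MySQL 8.0 equivalent: request the algorithm explicitly, so the server
-- errors out instead of silently copying the table if the operation
-- cannot be done instantly.
ALTER TABLE breach_records
    ADD COLUMN notes VARCHAR(255) DEFAULT NULL,
    ALGORITHM = INSTANT;
```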
mysqldump will back up your data sets, and you will be able to restore them. But the first thing that's going to happen is that your file is going to be exceedingly large. And the second thing is that the time you need to recover the data will also be out of proportion: you will need to wait a week, two weeks, three weeks, even a month, or your query will even time out. The other thing you need to take into account is that backups come in multiple types. They can be physical, which means that you take backups of the physical files of your database. They can be logical, which means that you take backups of the queries that recreate the data inside your database, and so on. But with big data sets, the problem is that these approaches all come with overhead, and basically, to avoid that, you need to make use of queries like SELECT ... INTO OUTFILE and LOAD DATA INFILE. These two queries, in essence, back up and restore raw data sets. SELECT ... INTO OUTFILE will, in essence, select every row in a table and write it into a file, which for the first part is significantly smaller than a file with a logical backup, that is, with INSERT INTO statements, CREATE TABLE statements, whatever. And LOAD DATA INFILE is then used to load the data back into the database. Backups taken with SELECT ... INTO OUTFILE and restored with LOAD DATA INFILE have significantly less overhead than logical mysqldump backups, so this is something you need to keep in mind. In some cases, you also need to keep in mind some advanced features, such as the DEFAULT clause, which can sometimes be used in place of an UPDATE, as I mentioned before. And if you need to add a unique index, meaning you need to get rid of duplicate values in a column, and that column is pretty large, MySQL again will have to make use of an ALTER query, which in essence copies your data into another table and then swaps it over.
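The raw backup-and-restore pair described above can be sketched as follows. The paths, table, and columns are illustrative, and, as with any LOAD DATA INFILE, the secure_file_priv setting controls where the server may read and write files.

```sql
-- Back up: dump every row of the table as raw tab-separated text.
SELECT id, email, username
FROM accounts
INTO OUTFILE '/var/lib/mysql-files/accounts_backup.tsv'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';

-- Restore: load the raw file back into an (already created) empty table.
LOAD DATA INFILE '/var/lib/mysql-files/accounts_backup.tsv'
INTO TABLE accounts
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(id, email, username);
```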
To avoid all this, you can deduplicate outside the database using sort with its unique flag. That means that you, in essence, run Linux, or Cygwin on Windows, load your data into a text file using the same SELECT ... INTO OUTFILE or whatever, and then run sort with the unique flag, directing the output into some other file. The command that you can see on your screen right now will drop all of the duplicates from that file, so in essence, you will load the data into your database without duplicates. You can sometimes use INSERT IGNORE instead of INSERT, which means that you will ignore some or all errors that may occur when inserting data; that may be useful when you're inserting into some columns instead of all columns, or whatever. And you can also delete data according to time periods, if necessary. If you run, for example, a DELETE FROM query scoped to a time period, from this date to that date, it will delete rows relatively quickly. Also, you can make use of the TRUNCATE clause, which means that, in essence, your table will be truncated; in other words, all rows within that table will be deleted. Such queries are significantly faster than using the DELETE clause, because they don't come with as much overhead for your database.
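The deduplication step might look like this; the file names and sample data are illustrative, standing in for a raw dump produced by SELECT ... INTO OUTFILE.

```shell
# Create a small stand-in for a raw dump file with a duplicate line.
printf 'b@example.com\na@example.com\nb@example.com\n' > dump.txt

# Sort the file and drop duplicate lines in one pass (-u = unique).
sort -u dump.txt > dump_deduped.txt

# The deduplicated file is what you would then feed to LOAD DATA INFILE.
cat dump_deduped.txt
```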
And the last thing you need to keep in mind is that you are able to inspect and kill running queries using SHOW PROCESSLIST and KILL. So if you run a data breach search engine, and you forgot to rate limit, let's say, and some genius decides to run 10 million queries at once, and your application obviously lags out because there is a lot of strain on the database, you can use SHOW PROCESSLIST to return the IDs of the queries that are currently running in the database, and then you can kill them one by one, or all of them, by using the KILL statement: KILL, and then the ID. Your query will be killed and will no longer exist in the database, your application will be less laggy, and then you can implement your load balancers, your rate limiters, whatever, and go on. The tools that you may need to use are varied. The most useful tool for your use case will be you yourself, because nobody knows your database better than you, right? MySQL and MariaDB also ship tools such as mysql_secure_installation and mariadb-secure-installation (one is a symlink to the other), and these tools will walk you through a couple of things that you may need to do to secure your database instance. And if your queries are slow, MariaDB and MySQL have such a thing as the slow query log, which logs all slow queries. Advanced users of MySQL will also know that there is such a thing as MySQL Enterprise Edition, which, among other things, includes a firewall and an audit appliance, if I remember correctly; it costs thousands of dollars, if I remember correctly, but it lets you further secure MySQL. Aside from the Enterprise Edition, there are tools like DbVisualizer and Arctype, which is now a part of ClickHouse. There is Redgate Monitor, there is ClusterControl by Severalnines, et cetera, et cetera.
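The inspect-and-kill workflow looks roughly like this; the process ID 42 is illustrative.

```sql
-- List the sessions and queries currently running on the server;
-- the Id column is what you pass to KILL.
SHOW PROCESSLIST;

-- Kill a runaway connection by its Id, terminating its query:
KILL 42;

-- Or abort only the running statement and keep the connection open:
KILL QUERY 42;
```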
So some of these things act as SQL clients, for example DbVisualizer and Arctype. Some of them double as SQL editors, the same DbVisualizer: they let you edit your data as if it were a spreadsheet. Some of them monitor things, et cetera, et cetera. Aside from tools, there is such a thing as what I call the Schrödinger's paradox, which you may or may not have heard of. In the database world, there are data definition files, the .frm files, which you can see on the screen, and then the data files. The .frm files, in essence, contain information about the data files, and the data files contain the data itself. In some very rare situations, if you insert bigger data sets into MySQL and, for example, the electricity goes out, you may find that your data file is missing, but the .frm file still exists. And that is the case of the Schrödinger's paradox, where your database sees the table as existing, but you cannot drop it. In order to drop it, you most likely need to shut down the database instance, add a data file for the table in question to your data directory, and then just drop it like that. And there are a bunch of other ways, but I will not get into that, because we simply don't have time. So this was my talk, and these were the things I learned and wanted to share about working with billions of rows and bigger data sets in MySQL and MariaDB. I would be very grateful if you could share some feedback from our talk; that is on my website, lukasvileikis.com/feedback. It will direct you to a Google form, which I would ask you to fill out. It takes one or two minutes, and then I will see whether I am a good speaker or not, what I can improve, and so on. And also, I would suggest you visit my website, breachdirectory.com, to check, in essence, whether you are at risk of identity theft. Put your email address into the data breach search engine, and it will run a search through a bunch of known data breaches and return the results. Aside from that, keep in touch with the documentation, blogs, and database YouTube channels, such as Database Dive, and you should be good to go. So thank you so much for watching. I hope you enjoyed this talk, and I hope you will enjoy the upcoming talks as well. Thank you.

Lukas Vileikis

Author, Database Wrangler, Security Engineer



