Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome to Conf42, and welcome to Billions of Records in MySQL: Lessons Learned.
I'm very excited to walk you through my experience working with bigger
data sets with billions of records in MySQL, as you can see on the screen.
And I hope you will learn something from my talk.
So let's get right into it.
To begin with, I would like to show you two queries.
One query is searching for data without an at sign (@).
And the second query is searching for data with an at sign, in Boolean mode.
These queries are what is known as queries using a full-text search index.
Such queries are very prevalent in MySQL when we are searching for data using the full-text search index.
That means the queries using the index will have MATCH ... AGAINST clauses.
As you can see, they have MATCH against something.
And if your query runs on MySQL and has such clauses, it is most likely using the full-text search index.
And the full-text search index means that it's privy to all kinds of behavior that you most likely wouldn't expect.
In this case, one query completes very quickly and the other times out.
And the reason behind this is that the second query hits a bug.
The bug has a number that you can see on your screen.
I found this bug somewhere around 2014 or 2015, when I was building a data breach search engine, which I will also share with you.
The core reason behind this bug, I assume, is that the query takes a different path within the database, and as such the query times out.
Again, this is some weird behavior within MariaDB.
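As a rough sketch, with a placeholder table that is assumed to have a FULLTEXT index on its email column (this is not the exact table from my slide), the two kinds of queries look something like this:

    -- Query 1: searching without an at sign
    SELECT * FROM accounts
    WHERE MATCH(email) AGAINST ('example');

    -- Query 2: searching for a term that contains an at sign, in Boolean mode;
    -- this is the shape of query that hit the bug and timed out for me
    SELECT * FROM accounts
    WHERE MATCH(email) AGAINST ('"user@example.com"' IN BOOLEAN MODE);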
So this is a glimpse of something you will see during this talk, but before we get too much into the talk, I should probably introduce myself.
My name is Lukas.
I'm an author.
I am originally a security guy, but I switched to databases a couple of years ago.
I wrote a book titled Hacking MySQL: Breaking, Optimizing, and Securing MySQL for Your Use Case, which can be accessed at hackingmysql.com.
And I've built one of the most prominent data breach search engines to this day.
I run BreachDirectory.com, which is a data breach search engine. In essence, it lets you provide your email address, username, IP address, or whatever details you want, and then runs a search through stolen and leaked data to tell you whether you are at risk or not.
I also very frequently blog in various places, including BreachDirectory and my own blog. I blogged for Arctype when it existed, which was a SQL client, and I blog for DbVisualizer, Redgate, and other mediums as well.
Aside from that, I also have a YouTube channel called Database Dive.
If anyone wants to have a gander through YouTube videos depicting
databases, you're more than welcome to.
And I also found bugs in MySQL.
And I also run my own blog that you can see on screen as well.
So the agenda for this talk starts with performance basics.
I'm going to walk you through the engines available in MySQL, how to optimize schemas and data types, how to optimize your data for performance and security, how to optimize for big data, and how to work with and optimize your backups.
We're also going to take a look into advanced MySQL features and some tools to improve performance.
So starting from the bottom we have performance basics.
Now to follow along with this, you would have to have some
sort of minimal understanding of MySQL or databases in general.
So when you see some titles here on the screen, such as my.cnf, these are most likely file names that are related to the database.
I will try to explain them as they come across.
So for example, my.ini or my.cnf is the configuration file that you use to configure your MySQL database.
Configuring your server means that you glance through the documentation and get the parameters applicable to the storage engine for your database, because all databases come with certain storage engines.
For MySQL, the primary storage engine is InnoDB.
And InnoDB has a couple of parameters that you need to optimize.
That is the buffer pool, innodb_buffer_pool_size, then innodb_log_file_size, and some other parameters such as innodb_flush_log_at_trx_commit and so on.
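As a minimal sketch, with made-up values that you would size for your own server, these parameters can be set in my.cnf or, for the dynamic ones, at runtime:

    -- Illustrative values only; size these for your own hardware
    SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;  -- resizable at runtime in MySQL 5.7+
    SET GLOBAL innodb_flush_log_at_trx_commit = 2;                -- trades a bit of durability for write speed
    -- innodb_log_file_size is usually set in my.cnf and, on older versions, needs a restart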
After these parameters have been optimized, you only search for what you need, so that your database is not very busy to begin with.
So if you search, for example, for one book with a LIMIT 1 clause, if you only need to return one book, your database doesn't scan through maybe 15,000 or 20,000 books and returns the results faster.
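For example, something like this, with a placeholder table and author:

    -- Return one matching book instead of scanning for every match
    SELECT title FROM books WHERE author = 'Some Author' LIMIT 1;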
If you use MySQL, you should consider using InnoDB, which is the primary storage engine and is going to be used by all MySQL databases anyway.
But if you changed the storage engine for whatever reason, please switch it back, because MyISAM, as you probably changed it to, is not a very reliable storage engine.
And most of the features that have been available within that storage engine are available within InnoDB as well.
And also you need to follow performance best practices.
Performance best practices can often be found on the same YouTube channels, in the documentation, or by reading just simple blogs.
So after you have the basics down, you need to choose your storage engines, or change them for that matter, right?
So the majority of you, whether you use MySQL or MariaDB or Percona Server or whatever, are going to be using InnoDB or XtraDB, which are the primary storage engines available within MySQL and its flavors.
And there are a couple of other options, as you can see on the screen.
There is MyISAM, there is Memory, there is CSV, there is Archive, there is Blackhole, and if you use MariaDB, there are going to be a couple of other options such as Connect.
If you use Percona Server, instead of InnoDB you will probably use XtraDB, as you can see here.
XtraDB is essentially the same as InnoDB, but with a couple of enhancements.
Choose storage engines carefully, depending on your use case.
It depends on what kind of storage engine you will need.
Most of you, for most of your use cases, will need InnoDB, but if you really need the exact count of rows within all of your tables, sometimes you may put MyISAM into the mix.
But again, I wouldn't recommend using MyISAM for anything other than counting rows in your table, because it's not meant for much else.
The one thing it is good for is counting rows within a table, because MyISAM stores the internal row count of the table in its metadata, which InnoDB does not.
That is precisely the reason why InnoDB cannot show the exact row count of the rows in your table, while MyISAM can.
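As a small illustration, with a placeholder table name:

    -- On MyISAM this is answered from table metadata and returns instantly;
    -- on InnoDB the rows have to be counted, which is slow on very large tables
    SELECT COUNT(*) FROM big_table;

    -- An InnoDB-friendly alternative: an estimated row count from the statistics
    SELECT TABLE_ROWS
    FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = 'my_db' AND TABLE_NAME = 'big_table';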
So anyway, after you choose the storage engine, you need to optimize the schema.
Here you need to keep in mind a couple of principles.
Keep things simple: no matter what you're working with, keeping things simple will spare you most of the problems.
Avoid inventing new things and changing things where this is not necessary.
Adhere to the documentation, because it will answer most of the questions that you have.
And if you have any further questions, you can consider visiting conferences like this one, or you can consider asking questions on Stack Overflow.
I think there is even the DBA Stack Exchange forum, or whatever it's called.
People are very active there, and you will get your questions answered relatively quickly.
Now, the other thing you can do is disable certain parameters, for example STRICT_TRANS_TABLES or STRICT_ALL_TABLES, to get automatic truncation.
That means that if your data is too long to be entered into the database, it will be automatically truncated and MySQL will not come back with any errors.
Again, this is not applicable for all use cases, but maybe for you it is.
You may also choose to avoid TEXT values, which means that your data will not take up as much space on the disk.
And keep in mind that, as I said, all data types come with a certain footprint on the disk, right?
VARCHAR data types will most likely occupy the length in bytes plus one or two bytes on the disk, and so on.
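A minimal sketch of the strict-mode change, assuming automatic truncation is acceptable for your use case (the exact list of modes you keep is up to you):

    -- See what is currently set
    SELECT @@GLOBAL.sql_mode;

    -- Keep everything except the strict modes, so over-long values are truncated
    -- with a warning instead of an error
    SET GLOBAL sql_mode = 'ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION';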
If you run bigger datasets, this is another issue that you need to deal with.
To load data in, you need to avoid big overhead.
That means that you should avoid using INSERT queries and use LOAD DATA INFILE instead, if possible.
It has lots of available parameters that you can use.
As you can see on the screen, you can set the character set, and you can set what your fields will be terminated by.
There is a typo on the slide here; it should be FIELDS TERMINATED BY, LINES TERMINATED BY, or ESCAPED BY, and so on.
It helps your database understand what kind of raw data you're working with.
And if you're working with raw data, the good thing is that the database loads it almost instantly, as opposed to INSERT statements, which have to go through a bunch of operations just to insert data.
They have to lock and unlock the table, check for data integrity, insert the data, prepare the database for the next insert, and so on.
And they do this for every insert, so it takes up time.
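A hedged example of what such a statement can look like, with a made-up file path and table (note that the server's secure_file_priv setting controls where files can be read from):

    LOAD DATA INFILE '/var/lib/mysql-files/books.csv'
    INTO TABLE books
    CHARACTER SET utf8mb4
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;  -- skip a header row if the raw file has one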
Another thing that you can do is partitioning.
So partitioning is very useful if you run, for example, data breach search engines.
When you have bigger data sets, partitioning, in essence, means you split your table into smaller tables that are treated as smaller tables by your application, but your database still sees them as a whole.
So on one level, they're treated as, let's say, smaller tables when you need to search through them, but when you need to work with them, they're still one intact table.
And this is why they take so much space when you see their size on the disk: because if you have, for example, five partitions, your table size will most likely be five times bigger than the ordinary size, right?
And this is why people working with big data sets in MySQL often run out of space, or run into issues similar to that.
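As a sketch only, with a made-up table and date ranges, range partitioning can look roughly like this (the partitioning column has to be part of every unique key, which is why it is included in the primary key):

    -- Rows are routed into partitions by year; the names and ranges are illustrative
    CREATE TABLE leaked_records (
        id BIGINT NOT NULL,
        leaked_at DATE NOT NULL,
        email VARCHAR(255),
        PRIMARY KEY (id, leaked_at)
    )
    PARTITION BY RANGE (YEAR(leaked_at)) (
        PARTITION p2022 VALUES LESS THAN (2023),
        PARTITION p2023 VALUES LESS THAN (2024),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );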
So they most likely run into issues working with big data because of one thing they don't understand: how certain queries, such as ALTER TABLE, work internally.
How ALTER TABLE works internally is that it makes a copy of the table on the disk.
So when you need to change data with ALTER TABLE, it makes a copy of the table on the disk, it updates that copy, and then it swaps the newly created table on the disk with your original table.
For tables that have a lot of data, that will consume a lot of time, basically, and if you don't have enough space on the disk, for example, if your table is 200 gigabytes and you have only 57 gigabytes of space on the disk, you will run out of space.
So what some people do, instead of running those changes through ALTER TABLE, which would make a copy of the table on the disk, is that they, in essence, make use of the DEFAULT clause.
So the DEFAULT clause is used when you create a table and you set the default value of a column to some value.
From your database's point of view, that means that one or more columns, whatever the count is for you, have a value that is already preset.
And it doesn't need to be updated, right?
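A small sketch of the idea, with placeholder names; changing just a column's default later is a metadata change, so no table copy is made:

    -- The default is set when the table is created, so new rows get it automatically
    CREATE TABLE users (
        id BIGINT PRIMARY KEY,
        email VARCHAR(255),
        status VARCHAR(16) NOT NULL DEFAULT 'active'
    );

    -- Changing only the default is a metadata-only operation
    ALTER TABLE users ALTER COLUMN status SET DEFAULT 'inactive';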
The second thing you need to keep in mind are the character sets and collations.
So these character sets and collations, they directly depend
on the kind of data that you are storing in the database, right?
So if you're storing, for example, Arabic data sets, you will need to make use of Arabic character sets.
If you're storing, I don't know, Hebrew datasets, you need to make use of Hebrew character sets.
But in most cases, you will need to use utf8mb4 with the utf8mb4_unicode_ci collation, which gives you the real form of Unicode.
And please don't use utf8, because in MySQL it is not really UTF-8.
The original UTF-8 supports up to 4 bytes per character, as I said, right?
But utf8 in MySQL supports only 3 bytes per character, which means that some queries can error out.
Especially if you're working with big data, that can be a concern.
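For example, something along these lines (the table is a placeholder):

    -- utf8mb4 stores the full 4-byte range; utf8 (utf8mb3) in MySQL tops out at 3 bytes per character
    CREATE TABLE messages (
        id BIGINT PRIMARY KEY,
        body TEXT
    ) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;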
If you add columns into a table and the column is the last column, it can be wise to apply a hack like SET SESSION alter_algorithm='INSTANT', which you can see on the screen, meaning that you set the algorithm of the ALTER query to INSTANT.
And when you set the ALTER algorithm to INSTANT, the column, if it is the last column you add to the table, is added instantly, without any I/O on the disk or whatever, but it only works in newer MariaDB versions, right?
And such hacks can be found all across the documentation and all across
Stack Overflow, so please look around.
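A sketch of that hack, assuming a recent MariaDB for the session variable (table and column names are made up); on MySQL 8.0 the rough equivalent is to request the algorithm in the ALTER statement itself:

    -- MariaDB: set the algorithm for the session, then add the column as the last column
    SET SESSION alter_algorithm = 'INSTANT';
    ALTER TABLE big_table ADD COLUMN notes TEXT;

    -- MySQL 8.0 rough equivalent: request the algorithm on the statement itself
    ALTER TABLE big_table ADD COLUMN notes2 TEXT, ALGORITHM = INSTANT;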
So after you optimize your data for MySQL, or your MySQL for big data, whichever way around, you also need to back up your bigger datasets.
Backing up 1 billion rows in MySQL isn't a very easy task, let's say it like that, because mysqldump likely won't work well.
It will back up your datasets and you will be able to restore them.
But the first thing that's going to happen is that your file is going to be exceedingly large.
And the second thing that's going to happen is that the time you need to recover the data will also be out of proportion.
You will need to wait a week, two weeks, three weeks, or even a month, or your query will even time out.
The other thing you need to take into account is that backups have multiple types.
They can be physical, which means that you take backups of the physical files within your database.
They can be logical, which means that you take backups of queries that recreate the data inside of your database, and so on.
But with big data sets, the problem is that these things all come with overhead.
And basically, to avoid that, you need to make use of queries like SELECT ... INTO OUTFILE and LOAD DATA INFILE.
These two queries, in essence, back up raw data sets.
So SELECT ... INTO OUTFILE will, in essence, select every row in a table and load it into a file, which for the first part is significantly smaller than files with logical backups, that is, files with INSERT INTO statements, CREATE TABLE statements, whatever.
And LOAD DATA INFILE will be used to load the data back into the database.
Backups taken with SELECT ... INTO OUTFILE and restored with LOAD DATA INFILE have significantly less overhead than those taken with logical mysqldump backups.
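A rough example with a placeholder table and file path (again subject to secure_file_priv):

    -- Dump raw rows to a file on the database server
    SELECT * FROM accounts
    INTO OUTFILE '/var/lib/mysql-files/accounts.txt'
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

    -- Load them back later
    LOAD DATA INFILE '/var/lib/mysql-files/accounts.txt'
    INTO TABLE accounts
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';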
So this is something you need to keep in mind.
In some cases, you also need to keep in mind some advanced features, such as the fact that the DEFAULT clause can sometimes be used in place of an UPDATE clause, as I mentioned before, right?
If you need to use a unique index, meaning that you need to get rid of some duplicate values in a column, and that column is pretty large, that means that MySQL again will have to make use of the ALTER query, which in essence copies your data into another table and then swaps it over.
To avoid all this, you can use the unique features of sort.
That means that you, in essence, run Linux, or Cygwin on Windows, and then load your data into a text file using the same SELECT ... INTO OUTFILE or whatever.
And then you basically run sort with the unique flag and direct your file into some other file.
The command that you can see on your screen right now will drop all of the duplicates from that file.
So in essence, you will load the data into your database without duplicates.
You can sometimes use INSERT IGNORE instead of INSERT, which means that you will ignore some or all errors that may occur when inserting data.
That may be useful when you're inserting into some columns instead of all columns, or whatever.
And you can also delete data according to time periods if it's necessary.
So if you use, for example, queries like DELETE FROM something for a portion of a time period, from this date to that date, it will also delete rows relatively quickly.
Also, you can make use of the TRUNCATE clause, which means that, in essence, your table will be truncated; in other words, all rows within that table will be deleted.
Such queries are significantly faster than using the DELETE clause because they don't come with as much overhead for your database.
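A few hedged examples of those three ideas, with placeholder tables and dates:

    -- Ignore errors such as duplicate keys instead of failing the whole statement
    INSERT IGNORE INTO accounts (email) VALUES ('user@example.com');

    -- Delete only a slice of time instead of everything at once
    DELETE FROM logs
    WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01';

    -- Or wipe the whole table; much faster than DELETE because it effectively
    -- drops and recreates the table instead of removing rows one by one
    TRUNCATE TABLE logs;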
And the last thing that you need to keep in mind is that if you use the InnoDB storage engine, you are able to kill queries using the SHOW PROCESSLIST query.
So if you run a data breach search engine, for example, and you forgot to rate limit, let's say, and some genius decides to run 10 million queries at once, your application obviously lags out because there is a lot of strain on the database.
You can use queries like SHOW PROCESSLIST, which will return the IDs of the queries that are currently running in the database, and then you can kill them one by one, or all of them at once, by using the KILL statement, KILL and then the ID.
Then your query will be killed and no longer exist in the database, your application will be less laggy, and then you can implement your load balancers, your rate limiters, whatever, and go on.
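For example (the ID 42 is made up; use whatever SHOW PROCESSLIST returns):

    -- List the queries currently running, together with their IDs
    SHOW PROCESSLIST;

    -- Kill a whole connection by its ID
    KILL 42;

    -- Or kill just the running statement and keep the connection open
    KILL QUERY 42;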
The tools that you may need to use are varied.
The most useful tool for your use case will be you yourself, because nobody knows your database better than you, right?
MariaDB also has tools such as mariadb-secure-installation and mysql_secure_installation, one being a symlink to the other.
These tools will walk you through a couple of things that you may need to do to secure your database instance.
And if your queries are slow, MariaDB and MySQL have such things as the slow query log, which logs all slow queries.
Advanced users of MySQL will also know that there is such a thing as MySQL Enterprise Edition, which comes with a firewall and an audit capability, if I remember correctly.
It costs thousands of dollars, if I remember correctly, but it lets you further secure MySQL.
Aside from the Enterprise Edition, there are tools like DbVisualizer and Arctype, which is now a part of ClickHouse.
There's Redgate Monitor, there's ClusterControl by Severalnines, et cetera.
Some of these act as SQL clients, for example DbVisualizer and Arctype.
Some of them also double as SQL editors, the same DbVisualizer, right?
They let you edit your data as if it was a spreadsheet.
Some of them monitor your database, and et cetera, et cetera.
So aside from using tools, there's such a thing as the Schrödinger's paradox that you may or may not have heard of.
In the database world, there are data definition files, which are the .frm files you can see on the screen, and then the data files.
The .frm files, in essence, contain information about the data files, and the data files contain the data itself.
In some very rare situations, if you insert bigger datasets into MySQL and, for example, your electricity goes out, you may see that your data file is missing, but the .frm file still exists.
And that will be the case of the Schrödinger's paradox, where your database sees the table as existing, but you cannot drop it.
In order to drop it, you most likely need to shut down the database instance.
Or you can just add the data file back to your data directory for the database in question, and then just drop the table like that.
And there's a bunch of other ways, but I will not get into that because we simply don't have time.
But thank you.
So this was my talk, and these were the things I learned and wanted to share from working with billions of rows and bigger data sets in MySQL and MariaDB.
And I would be very grateful if you could share some feedback on my talk.
That is on my website slash feedback, so lukasvileikis.com/feedback.
It will direct you to a Google form, which I would ask you to fill out.
It takes one or two minutes, and then I will see if I am a good speaker or not, and what I can improve, and so on.
And also, I would suggest you visit my website, breachdirectory.com, to in essence check if you are at risk of identity theft.
Put your email address into the data breach search engine, and it will run a search through a bunch of known data breaches and return the results.
Aside from that, keep in touch with the documentation, blogs, and database YouTube channels such as Database Dive, and you should be good to go.
So thank you so much for watching.
I hope you enjoyed this talk and I hope you will enjoy upcoming talks as well.
Thank you.