Transcript
Hello everyone, this is Trista. I'm very happy to give this talk about how to upgrade your existing traditional databases, for example MySQL or PostgreSQL, to become distributed ones, so that you can leverage a distributed database system for higher performance, faster queries, and an out-of-the-box deployment solution.
So let's get started. First, a little about myself. I'm Trista, and I work at SphereEx as the co-founder and CTO. My area is distributed database systems and automated database management platforms. I also spend time in open source foundations, for example the Apache Software Foundation, where I'm an Apache member and an Incubator mentor, helping open source projects and their communities. I sometimes post articles about open source, open source monetization, databases, and cloud databases on my Twitter and LinkedIn, so if you're interested in those topics, you're welcome to take a look there.
Today we will cover the following topics. First, I will describe the current scenarios: we have new technologies and new needs, so we need to make some changes to our database infrastructure. Second, we will dive into distributed database systems and look at what is happening in that area now. Then I will present some new ideas and solutions, as I said at the beginning, for how to make your favorite PostgreSQL, MySQL, SQL Server, or Oracle become a distributed system. Finally, if we have time, I will introduce my demo so you can learn it and deploy it step by step.
So, first, the new scenarios. In this new era we have new technologies, and all of them are good features or good technologies that provide value for BI, I mean for your business, your commercial side, your customers. But for big data and databases, these new scenarios give us a new definition of big data characteristics: the five Vs of big data, which are value, veracity, volume, velocity, and variety.
I will give a brief introduction to each of them. The first one is value: we need to use AI or big data architectures and solutions to help us do statistics, so that we can learn more about our business and give the product useful feedback. The second one is veracity: we need to guarantee that the data we collect is correct, because if we use this data to build data models for BI work, the results are only valuable if the data is right. If we feed wrong data into AI, we will get wrong results that guide our product or company in the wrong direction. The next one is volume: on the Internet, people create data every day, every minute, every second, so we need to manage tremendous data sizes. Then there is velocity: users and other departments want to get their query results as quickly as possible, so they want real-time or near-real-time results, or stream processing of the data. The last one is variety: your databases or data stores need to help you manage both structured and unstructured data. So those are the main big data characteristics today, and based on all of them, we need to solve these issues with new database architectures, new database solutions, or new products.
But before we introduce those, I just want to give a short introduction to OLTP and OLAP, because you know we cannot use one single database or product to solve all of the issues, or to meet all of the requirements from the previous slide. Generally speaking, we split the scenarios into two. The first one is focused on OLAP, which means processing large amounts of data quickly to do analytics. The second one is OLTP, where the databases or products need to handle large volumes of transactional data involving multiple users. When we speak of MySQL, PostgreSQL, or Oracle, they all belong to this second type of database. So today I just want to pick the OLTP area and give my solution and my idea there. If you are interested in OLAP, maybe later on I can give another talk around that topic.
Now, I know people love MySQL or PostgreSQL, or some companies buy Oracle databases. But you can see that for all of these monolithic databases, it is too hard or too difficult to manage structured data at TB or PB scale, that kind of tremendous data size. So people consider using distributed databases or NewSQL databases, for example CockroachDB, MongoDB, Aurora, or similar database solutions. But we need to ask why people want to adopt these new solutions at this point. As I introduced before, for monolithic databases it is very hard to manage all the structured data at TB or PB scale.
And besides, you know that at the beginning services were single ones deployed on a monolithic machine, but now they are split into microservices or serverless functions, and people use Kubernetes to manage all of the services. So you can see that services have become distributed: they are split, they are sharded. And your servers went the same way. At the beginning maybe you could just use one server to do all the work, but later you found that if you use the cloud, pods, containers, or Docker, it is easy to manage everything and to scale in or scale out. It makes the system, the servers, the whole infrastructure become distributed and elastically scalable. That is the power of distribution.
And how about databases, especially big databases? At this point people apply the same idea: a distributed or sharding architecture that splits one single database into a distributed cluster, so that the cluster can help you manage tremendous, enormous amounts of data and elastically scale in and scale out based on your needs. You can choose a pay-as-you-go model. So at this point, you know that maybe we need to use a distributed database for some necessary cases.
But I know that most of you, maybe five or ten years ago, were strong supporters of PostgreSQL or MySQL, and now you need to consider upgrading everything. OK, you could actually choose a distributed database from the very beginning, but I guess most of us are still in the process of evolution, figuring out how to become distributed. So today I want to share another solution that helps you own a distributed system based on your existing PostgreSQL or MySQL cluster, and adds elastic scale-in and scale-out, read/write splitting, and traffic governance features to this distributed PostgreSQL cluster. So how do we do that?
We need to know that when we speak of distributed databases, most of them adopt a compute-storage separation architecture. That means the distributed system is made up of two important parts: the first is the computing part, the computing nodes, and the second is the storage nodes. Based on this idea, if we want to make our monolithic PostgreSQL become distributed, we can regard PostgreSQL or MySQL as the storage nodes: each PostgreSQL instance is one storage node, and all the storage nodes can do some local computing work and help us persist the data. So if we want to create this distributed system, we still need the global computing nodes. Who can become the computing nodes? Later on I will introduce Apache ShardingSphere, which can work as the computing nodes. So with Apache ShardingSphere plus your existing favorite MySQL, PostgreSQL, or Oracle, you can have a distributed database. So what is Apache ShardingSphere?
All right, before that I need to mention some benefits of this solution. The first one is that it can leverage your existing database cluster, which means you don't need to overturn your whole infrastructure. The second is that you can leverage the mature PostgreSQL or MySQL ecosystem for deployment and maintenance work; you have a lot of mature tools, and your DBAs can help you deal with all of that. The third is that it adds more features, like SQL audit, traffic governance, and elastic scaling, into this solution. The last one is that it can help you consider how to move your databases onto the cloud. Because we use the storage and compute separation architecture, your PostgreSQL works as the storage nodes and can stay deployed on premises, while Apache ShardingSphere works as the computing nodes and can be deployed on Kubernetes or on the cloud. That means you can use this solution to solve the headache of getting your databases onto the cloud. And one more thing: it provides an out-of-the-box deployment. I guess most projects provide such a feature, so it's not the highlighted one. So let's move to the next page, about ShardingSphere.
Before introducing the features of this product, I just want to say that it has a big open source community. It is an Apache Top-Level Project, and it has been open source for more than six years, so you know that people love it, and there are many active contributors and committers there. It has shipped more than fifty releases, so you don't need to worry that it is a brand new project. People worry, "if I have some questions, I cannot get help from the community," or "it's a new one, so what about the quality and stability?" But none of those issues exist here, because it is a big community and many people and companies use this project in their production environments, and it provides friendly documentation for you to learn more about it. That is why I want to introduce the features of this project.
Generally speaking, ShardingSphere provides two clients for you to choose from. The first one is ShardingSphere-JDBC. It is a lightweight JDBC driver for your Java application, a framework, so you can use the native Java interface with ShardingSphere-JDBC, and it helps you with the following functions. The first is data sharding, which means sharding your data across different database instances and making it a distributed database. The second is distributed transactions, and there is also read/write splitting. If your database has many replicas and you want to improve the QPS or TPS of the system, you have to consider how to leverage all of those replicas, for example by sending some SELECT queries to the replicas.
At the beginning, maybe you would consider using a load balancer. But if you use ShardingSphere, no matter whether it is the JDBC driver or the proxy, it can automatically do that load-balancing work for you: it sends UPDATE or DELETE queries to the primary instance, sends the SELECT queries to the replicas, and can use, for example, a random strategy or a round-robin strategy to spread the load across your replica instances.
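As a rough sketch, this is what a read/write splitting rule can look like in ShardingSphere's DistSQL. The storage unit names below are placeholders, and the exact keywords differ between ShardingSphere versions, so check the documentation for your release:

```sql
-- Assumed placeholders: write_ds is the primary, read_ds_0 / read_ds_1 are replicas,
-- all already registered with ShardingSphere. Syntax follows the 5.x DistSQL style.
CREATE READWRITE_SPLITTING RULE rw_group_0 (
    WRITE_STORAGE_UNIT = write_ds,
    READ_STORAGE_UNITS(read_ds_0, read_ds_1),
    TYPE(NAME="random")   -- or "round_robin" for the load-balancing strategy
);
```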
It can also help you with data masking and data encryption. I think these are very valuable features, because you want the data in your databases to be encrypted. But how do you do that? If you use ShardingSphere, you can see that it is deployed between your application and your databases, so all of your queries first reach ShardingSphere, and ShardingSphere can automatically do the data masking, encryption, and decryption, and then it writes encrypted data into your databases. So you don't need to worry that your data sits as plain text in the databases, because ShardingSphere does all of the encryption and decryption work for you.
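To make that concrete, here is a minimal sketch of an encryption rule in DistSQL. The table, column, and key values are made up for illustration, and the encrypt-rule keywords have changed between ShardingSphere versions, so treat this as the general shape rather than copy-paste syntax:

```sql
-- Hypothetical example: transparently encrypt the phone column of t_user with AES.
-- The application keeps reading and writing the logical column "phone";
-- ShardingSphere stores the ciphertext in "phone_cipher" on the real tables.
CREATE ENCRYPT RULE t_user (
    COLUMNS(
        (NAME=phone, CIPHER=phone_cipher,
         ENCRYPT_ALGORITHM(TYPE(NAME='AES', PROPERTIES('aes-key-value'='replace-with-your-key'))))
    )
);
```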
By the way, ShardingSphere has another product, ShardingSphere-Proxy. As the name hints, it is a database proxy, and the proxy can provide all of the features I mentioned earlier as well. So that is all about ShardingSphere-JDBC and ShardingSphere-Proxy and the features they can provide. Actually, since ShardingSphere-JDBC just implements the native JDBC interface, it works no matter whether your database is PostgreSQL, MySQL, SQL Server, or Oracle.
The last part is about the deployment architecture: if you use ShardingSphere-Proxy, it becomes the one shown in the right column. At the beginning, your application had to manage your database connections and data sources by itself. But now ShardingSphere-Proxy can help you manage all of the database instances and connections, and do the data sharding, data masking, and data encryption, all of that good stuff, so your application can just focus on itself and use this distributed database system.
And another point: what if my application lives on Kubernetes? If so, you can just use our ShardingSphere on Cloud project and its Helm charts, with one command to deploy a ShardingSphere cluster in your Kubernetes cluster, and then let your application, living in Kubernetes, connect to ShardingSphere-Proxy, and the proxy can help you manage all of the databases and do the data sharding. It doesn't matter whether your databases are on the cloud, on premises, in the Kubernetes cluster, or on RDS; as long as ShardingSphere can connect to your databases, you can solve all of these issues. That is thanks to the compute and storage separation architecture, which means your databases, the storage nodes, can be located anywhere.
In the next part, I will use the ShardingSphere on Cloud project to do the demo. I will use Helm to deploy the ShardingSphere cluster, and use the PostgreSQL chart to deploy PostgreSQL on Kubernetes. If you already have your own RDS, you can ignore that part and just use the ShardingSphere operator chart to deploy ShardingSphere in Kubernetes, then let ShardingSphere connect to your RDS, and it can help you manage the connections and do the data sharding.
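Once the proxy is up, you connect to it with a normal PostgreSQL or MySQL client and tell it where your databases live. A minimal sketch in DistSQL looks like this; the storage unit names, JDBC URLs, and credentials are placeholders, and older ShardingSphere releases use ADD RESOURCE instead of REGISTER STORAGE UNIT:

```sql
-- Create the logical (sharding) database on the proxy, then register
-- the two PostgreSQL instances as storage units (placeholder URLs and credentials).
CREATE DATABASE sharding_db;
USE sharding_db;

REGISTER STORAGE UNIT ds_0 (
    URL="jdbc:postgresql://postgres-0.default:5432/demo_ds_0",
    USER="postgres",
    PASSWORD="postgres"
), ds_1 (
    URL="jdbc:postgresql://postgres-1.default:5432/demo_ds_1",
    USER="postgres",
    PASSWORD="postgres"
);
```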
So once you finish the deployment work, the second step is to create your database and a sharding table. This is not a single, standard table in one PostgreSQL database; it is a sharding table, which means the table is actually made up of many shards, many subtables, many physical tables, and those subtables are located in different PostgreSQL instances. But your application just visits ShardingSphere-Proxy and thinks there is only one logical table, named t_user. In reality, this t_user consists of four subtables spread across two PostgreSQL instances. And each PostgreSQL cluster here has a primary and a replica, so ShardingSphere will send queries to the primary or the replica inside that cluster. Also, if it finds that the rows you want live in cluster two, I mean group two, it will finish the SQL parsing and routing work and send the query to your PostgreSQL cluster two.
To define a sharding table, we need to use distributed SQL, or DistSQL, which is ShardingSphere's own SQL dialect, similar in spirit to your MySQL or PostgreSQL dialect. Before, you could just use CREATE TABLE to create a single table, but here you use a similar sharding-rule statement to create a sharding table. For this sharding table, you can see that the sharding column is order_id, and it uses the hash-mod strategy to split one single logical table into four subtables, four shards.
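Here is a rough sketch of what that looks like in DistSQL, matching the example of a t_user table sharded by order_id into four shards with hash_mod. The storage unit names are placeholders, and older releases use RESOURCES instead of STORAGE_UNITS:

```sql
-- Sharding rule: split t_user across ds_0 and ds_1 into 4 shards by hash(order_id).
CREATE SHARDING TABLE RULE t_user (
    STORAGE_UNITS(ds_0, ds_1),
    SHARDING_COLUMN=order_id,
    TYPE(NAME="hash_mod", PROPERTIES("sharding-count"="4"))
);

-- Then create the logical table as usual; the proxy creates the physical
-- subtables (t_user_0 ... t_user_3) on the storage units for you.
CREATE TABLE t_user (
    order_id BIGINT NOT NULL,
    user_name VARCHAR(64),
    PRIMARY KEY (order_id)
);
```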
So once we finish this demo, your application just visits one of the ShardingSphere-Proxy instances, and ShardingSphere does all of the internal workflow: it uses the sharding algorithm and strategy to locate the exact subtable where your expected record lives.
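In other words, the application keeps writing ordinary SQL against the logical table and lets the proxy handle the routing. A small illustrative example, assuming the t_user rule above:

```sql
-- These statements go to the proxy, not to any single PostgreSQL instance.
-- hash(order_id) mod 4 decides which subtable (and therefore which instance) is hit.
INSERT INTO t_user (order_id, user_name) VALUES (1, 'alice'), (2, 'bob');

-- A point query is routed to exactly one shard; as a SELECT, it can also be
-- served by a replica when read/write splitting is configured.
SELECT * FROM t_user WHERE order_id = 1;
```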
And if it finds that it is a SELECT query, it will also leverage the replicas and send the query to one of them. Therefore you will find that the QPS of this new database system becomes higher, because it automatically does read/write splitting and similar features. So the demo includes the following steps. I have done it myself, but I don't have time to walk through all of it here; if you have these slides, you can follow my guide and do it by yourself.
First, deploy the cluster using the ShardingSphere-Proxy charts; then use DistSQL to create the sharding database and sharding table, to make ShardingSphere aware of your databases; and then run some queries to test whether it works or not. All right, so that is all for my talk.
If you have some questions, we can talk later, or you can just send me a message on my LinkedIn, GitHub, or Twitter. See you later.