Transcript
Hello everyone, I'm Alexey Ozeritskiy, and today I'm going to talk about performance optimization of the distributed SQL engine which is used by the YDB project.
First, a few words about me. I'm a software engineer and I have been working with distributed systems for many years. Since the beginning of last year I have been deeply involved in the YDB project, working on optimization of its distributed SQL engine.
Here is my agenda for today. Firstly, I'll give some background information about the YDB engine itself and where it is used. Then I'll discuss my testing methodology, and then my investigations. The final part of my talk will be about containerization and performance. Let's get started.
Firstly, a few words about the distributed SQL engine. The distributed SQL engine which is used by YDB is called YQL, or YDB Query Language. YQL is a library which was designed to parse and execute SQL queries.
Currently it is used by four projects. The first is YDB itself. YDB is a distributed, open-source SQL database. Then YQL is used by YTsaurus. YTsaurus is an open-source big data platform which is similar to Apache Hadoop, and with the YQL engine YTsaurus can provide features like Apache Hive does.
There are also two projects similar to Google's BigQuery. The first one is a project called YQL; it is an internal Yandex service. The second project is Yandex Query. This project is also similar to Google BigQuery, and it is used by Yandex Cloud for our external customers. Yandex Query is also an open-source project and it is part of the YDB project.
Now let's have a look at these numbers. I got these statistics from the internal Yandex service YQL, and they show that we process a lot of queries: 600,000 queries per day and eight petabytes of data per day. These numbers are huge, I think, and even a small improvement of our distributed SQL engine, even by 1%, could give big value.
Now I'll talk about the YQL architecture. This is a brief introduction to it. YQL consists of four main components: the parser, the execution plan builder, the execution layer and the compute layer.
The parser parses a query and constructs an abstract syntax tree, or AST. Then the plan builder gets the AST and constructs an execution plan; the plan builder can also optimize this plan. Then this plan is executed by the execution layer, and the execution itself is made with the help of the compute layer. The compute layer handles execution of individual plan nodes and is responsible for computations like SQL filters, SQL projections, expressions, SQL functions, joins and so on.
And now let's have a look at this very basic example. In this example we want to join two tables, filter the result and get the top rows.
Now let's have a look at YQL's execution plan. The execution plan is a graph. The graph consists of nodes, or stages, and each stage is divided into tasks. As you can see, the leaves of this graph read the tables, customers and orders, and some read stages can also contain a filter, like this read-orders stage. Then after the read stages we have a join stage, then an aggregate and sort stage, and the final stage.
Now let's have a look at my testing methodology.
For my testing methodology I use a benchmark-driven approach. This approach has some advantages. First of all, it provides us with metrics, for example the execution times of our benchmarks. Then, with the help of a benchmark, we can find bottlenecks. A benchmark is also a good scalability test: for example, we can tune the scale parameter of the benchmark and see how well our system scales for some amount of data. Benchmarks also provide some real-world simulation, so if we improve on the benchmark, we will improve real users' tasks. And benchmarks are also vendor neutral, so we can run the same benchmark on different systems and compare our system with its competitors.
And this is the benchmark that I used: the TPC-H benchmark. This is a very famous benchmark for OLAP systems, or analytical databases. The benchmark consists of 22 SQL queries and eight tables, and it contains a data generator. On the right side you can see the TPC-H database schema.
Well, now let's consider the TPC-H benchmark data generator, the dbgen tool. With the help of the dbgen tool you can generate data of any size. For example, dbgen has a -s, or scale, parameter; for instance, scale 100 means generating 100 GB of data. Then there are the very useful -C and -S keys; with the help of these keys you can generate a really big amount of data on a MapReduce system. And of course I generated everything on MapReduce. Then I converted the generated data to Parquet format and uploaded it to S3 storage. Here you can see my Parquet file properties, like compression, row group size and table split parts.
Now let's talk about continuous integration. To see improvements in the performance of our engine, we should set up continuous integration, or CI. For CI I used virtual machines and the TPC-H benchmark at a small scale of 10. I run this continuous integration daily, and I run it on Parquet files. I also set up per-commit runs and commit-to-commit comparison.
As you can see on this graph, when I started we could not pass some TPC-H tests. For example, we had a lot of problems with tests 29 and 21, and we had a lot of issues with scale 100. This graph was constructed at scale 10.
And for running this continuous integration pipeline I used a utility which can execute the whole engine in one process. This is possible in our architecture because we use the so-called actor approach, and actually our tasks in the execution plan are actors. These actors can work in a single process, or they can work on some large distributed system, and so on.
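To give a rough idea of what an actor means here (this is just an illustration with made-up names, not the actual YDB actor system), an actor is basically an object with a mailbox and a message handler, and the same handler code can be driven by an in-process loop or by messages arriving over the network:

    #include <functional>
    #include <iostream>
    #include <queue>
    #include <string>

    // A minimal, hypothetical actor: a mailbox plus a message handler.
    // The engine's tasks are actors in a similar sense, but this is only
    // an illustration of the pattern, not the real implementation.
    struct Actor {
        std::queue<std::string> Mailbox;
        std::function<void(const std::string&)> OnMessage;

        void Send(const std::string& msg) { Mailbox.push(msg); }

        // Drain the mailbox; in-process this is just a function call,
        // in a cluster the same messages would arrive over the network.
        void Run() {
            while (!Mailbox.empty()) {
                OnMessage(Mailbox.front());
                Mailbox.pop();
            }
        }
    };

    int main() {
        Actor filterTask;
        filterTask.OnMessage = [](const std::string& row) {
            std::cout << "filtering row: " << row << "\n";
        };
        filterTask.Send("row-1");
        filterTask.Send("row-2");
        filterTask.Run();  // the whole "plan" can be driven by one process
    }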
And now let's consider the utility that was used for testing, for continuous integration, and actually for everything. This utility is called dqrun, or distributed query run. It can run all components of the distributed engine in a single process, and it is designed to execute SQL queries on Parquet files. It doesn't contain a lot of the layers of the big YDB project; for example, it doesn't contain the transactional layer, the replication layer, the storage layer and so on. And for running benchmarks with network interaction I implemented two more utilities, service node and worker node.
To run the test with network interaction, you should start one or more worker node instances and only one service node instance. Worker nodes are responsible for the compute part, and the service node is the execution part: the service node actually controls the compute layer which is executed on the worker nodes. Also, to run this test in a distributed configuration, you should construct a plan, and the plan can be constructed with the help of the dqrun utility. So first you run dqrun; dqrun constructs the execution plan, then it sends the execution plan to the service node, and so on. To achieve this, you should provide two additional parameters to the dqrun utility. Here they are: --dq-host and --dq-port.
Now let's consider what we actually measure. For measurements I use UnixBench-style measures, and here they are: I execute each test N times, then I discard the lower third of the results, and I calculate the final value using the geometric mean of the remaining results. This is actually a very effective method for getting a reliable measure of performance.
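As a small sketch of this scheme (the function name is hypothetical, and the real harness is more involved), the combination step looks roughly like this:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Combine N timed runs into one number, UnixBench-style:
    // drop a third of the results and take the geometric mean of the rest.
    // Here the slowest runs are dropped; this is one reading of the rule.
    double CombineRuns(std::vector<double> seconds) {
        if (seconds.empty()) return 0.0;
        std::sort(seconds.begin(), seconds.end());       // fastest first
        const size_t keep = seconds.size() - seconds.size() / 3;
        double logSum = 0.0;
        for (size_t i = 0; i < keep; ++i) {
            logSum += std::log(seconds[i]);
        }
        return std::exp(logSum / keep);                   // geometric mean
    }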
Let's move on. These are our target values. When you are improving something, you need to compare your values with something, and I think the best approach is to compare your values with the values of your competitors. I came across an article about benchmarking three databases, YDB, Greenplum and Apache Spark, on TPC-H 100, and this article provides the following numbers. So I used these numbers as my target values, and since that benchmark was run on 120 cores, I decided to use similar hardware.
And this is my hardware; I used it for the final results and for debugging. It is a big machine which contains two Xeon processors, with a total of 64 cores, or 128 threads. Also it has 512 GB of RAM.
Okay, let's move on to my investigations. I'll focus only on the most meaningful low-level improvements, because I found them especially interesting. Of course, I worked on low-level improvements as well as high-level plan improvements. Let's move on.
First of all, let's consider the tools which I used. There are three tools. The first one is the perf utility. This is a well-known Linux profiler; with the help of this tool you can collect performance metrics from running processes. I also used two other utilities, stackcount and memleak. Stackcount is a very useful utility, and it is especially useful when you use it in pair with the perf top command. In perf top you can see the hottest functions, and with stackcount you can find out who calls these hottest functions. There is also the memleak utility, which is also very useful. With memleak you can find the memory consumption of parts of your code, and with this utility it's very easy to resolve problems such as incorrect behavior of the back-pressure component; back pressure is often used during communication between tasks.
There are also more Linux performance tools. First of all, there is the BCC project. Actually, the picture that you can see was taken from the BCC project; I think this is a well-known picture. The BCC project uses the eBPF functionality of newer Linux kernels. It provides a C library and Python bindings, and with the help of this library and the Python bindings you can collect any performance metrics of your program. Of course, this project is quite low-level, so a lot of useful utilities were created on top of BCC. For example, in the BCC repository you can find special utilities for performance analysis of databases such as MySQL and PostgreSQL. There is also a similar tool, bpftrace. Bpftrace is more high-level because it is implemented as a programming language; it looks like the awk language. And there is also a very useful script by Brendan Gregg which is called FlameGraph; with the help of this script you can visualize the output of the BCC utilities and bpftrace.
Let's move on
to my first investigation. I ran some TPC-H query which contains a join, I collected perf counters, and I constructed this flame graph. I saw that our grace join algorithm consumed a lot of CPU time. If you zoom in, you will see that there is nothing interesting, just the AddTuple function. It's very difficult to see anything more on this flame graph. After that, of course, you could look at your code and read it line by line, but the best solution is to use perf report to look at the raw perf data.
And let's do it. This is perf report, and with perf report you can zoom into your code. Let's zoom into the AddTuple function. Here it is. You can see that an atomic fetch-add consumes a lot of CPU. Actually, this is very strange.
When I first looked at it, I felt that there was something wrong, because it looked like a mistake. And it actually was a mistake: when this code was written, someone added these atomic counters for debugging purposes and forgot to remove them.
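To show the kind of pattern this was (the names here are made up for illustration, this is not the literal YQL code), imagine a hot tuple-adding function with a leftover debug counter bumped via an atomic fetch-add on every call:

    #include <atomic>
    #include <cstdint>
    #include <vector>

    std::atomic<uint64_t> DebugTuplesAdded{0};  // leftover debug counter

    // Hypothetical hot-path function, roughly the shape of the problem:
    // the atomic increment on every tuple dominates the profile.
    void AddTuple(std::vector<uint64_t>& bucket, uint64_t value) {
        DebugTuplesAdded.fetch_add(1, std::memory_order_relaxed);  // the mistake
        bucket.push_back(value);                                   // the real work
    }

    // The fix was simply to delete the counter, leaving only the real work.
    void AddTupleFixed(std::vector<uint64_t>& bucket, uint64_t value) {
        bucket.push_back(value);
    }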
And here is the patch: these atomic counters were just removed, and I got this impressive performance improvement. As you can see, query 29 was improved from 15 seconds to 7 seconds, and actually all other queries with joins were improved by about half as well. I think this is a very big improvement.
Let's move on to my second investigation. When I am running benchmarks, I like to run the perf top command in parallel to see the hottest functions in real time. Once, when I ran perf top, I saw that some kernel symbol, osq_lock, was shown in perf top, and it was very strange. To investigate who called this osq_lock, I used the stackcount utility, and stackcount showed me that this osq_lock is part of the mmap syscall. But why do we use this mmap syscall? The answer was very simple: we use it because we use our own memory allocator in our compute layer.
And why do we do this? First of all, we do this because our memory allocator is optimized for concurrency and for running in a multithreaded environment. In theory it should work very fast, but actually it wasn't very fast.
Second, we create an allocator instance per query. This approach has some advantages. First of all, we isolate queries from each other. Then, it is much easier to allocate and release memory: with this approach you can write exception-unsafe code in the compute layer, because at the end of your query all its memory will be released automatically.
And now let's have a look at our problem, the problem with the high frequency of mmap calls. I solved this problem in a very simple way, I think: I just started to allocate 32 memory pages per allocation call. Before this, we allocated one memory page on each call. Now I allocate 32 pages, return one page to the caller, and store the rest of the pages in a cache; the next time the caller calls the allocator, it gets a page from the cache. This is a very simple patch, and here you can see the performance improvements.
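A rough sketch of the idea looks like this (the class and names are hypothetical, and the real allocator is more complex):

    #include <sys/mman.h>
    #include <cstddef>
    #include <vector>

    // Hypothetical page allocator illustrating the batching idea:
    // one mmap call brings in 32 pages; subsequent requests are served
    // from the cache, so the mmap syscall rate drops roughly 32x.
    class PageAllocator {
        static constexpr size_t PageSize = 4096;
        static constexpr size_t Batch = 32;
        std::vector<void*> Cache;

    public:
        void* AllocPage() {
            if (Cache.empty()) {
                char* block = static_cast<char*>(
                    mmap(nullptr, PageSize * Batch, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
                if (block == MAP_FAILED) return nullptr;
                for (size_t i = 1; i < Batch; ++i) {
                    Cache.push_back(block + i * PageSize);  // keep 31 pages cached
                }
                return block;                               // hand out the first page
            }
            void* page = Cache.back();
            Cache.pop_back();
            return page;
        }
    };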
Actually, I think this is a very big improvement. And here is the final execution time for YQL: I got 154 seconds. This run was on Parquet files. With YDB I got 209 seconds; YDB means I ran this benchmark on a YDB cluster. And as you can see, we outperformed our competitors.
And now let's move on to the final part of this talk.
First of all, let's have a look at this very interesting feature of our engine. This is an SQL query with an embedded Python script. You can switch on this feature in the engine's configuration. As you can see, a user can execute any Python code; we don't have any limitations on this. For example, a user can use the ctypes library to call any C code. In this example the user calls ctypes to dereference an invalid pointer and, as a result, gets a segmentation fault. And actually, one binary of our engine can execute a lot of queries of different users, and if one query crashes, the other queries will crash too. This is very bad, and we wanted to resolve it.
To resolve it, we use the following execution scheme. Let's recall our execution plan, which consists of stages, and each stage is divided into tasks, and so on. Now let's divide our task into two components. The first component will be responsible for network interactions, and the second component will be responsible for the computation itself, and for the second component we will start a container. Here it is. So we have a lot of containers, one container per task. These containers contain the compute part, and tasks communicate with the containers with the help of Unix pipes. We use bidirectional communication between tasks and containers, so each task uses two pipes per container: one pipe for input and the other pipe for output. I think this is a very simple scheme.
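To make this concrete, here is a simplified sketch of the plumbing (not the actual engine code): the task creates two pipes, forks the compute process, and wires the child's stdin and stdout to the pipe ends:

    #include <unistd.h>
    #include <cstdio>

    // Simplified sketch: two pipes give bidirectional communication
    // between a task (parent) and its containerized compute part (child).
    int main() {
        int toChild[2];    // task -> container (input pipe)
        int fromChild[2];  // container -> task (output pipe)
        if (pipe(toChild) != 0 || pipe(fromChild) != 0) return 1;

        if (fork() == 0) {
            // Child: in the real scheme this process would be started
            // inside a container and run the compute part of the task.
            dup2(toChild[0], STDIN_FILENO);
            dup2(fromChild[1], STDOUT_FILENO);
            char buf[64];
            ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
            if (n > 0) write(STDOUT_FILENO, buf, n);  // echo stands in for compute
            _exit(0);
        }

        // Parent (the task): send a message and read the reply.
        write(toChild[1], "hello", 5);
        char reply[64];
        ssize_t n = read(fromChild[0], reply, sizeof(reply));
        if (n > 0) fwrite(reply, 1, n, stdout);
        return 0;
    }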
And of course I ran the TPC-H benchmark with this feature switched on, and I got the following numbers: I got 561 seconds. I thought that this was very slow, and my first thought was that it was a problem with the pipes themselves. Why did I decide that? Let's have a look at this well-known picture. This picture was taken from the ipc-bench project by Peter Goldsborough, and as you can see, pipes are very slow in Linux; the fastest IPC in Linux is memory-mapped files. There is also a very interesting article about how to write to pipes fast, by Francesco Mazzoli. He said that with the help of the vmsplice syscall, you can read from and write to pipes very fast. So I decided to try these two techniques. First, I tried to replace the pipes with memory-mapped files, with an in-memory queue on top of memory-mapped files. Then I tried to use the vmsplice syscall. I spent a day on it and achieved nothing. No improvements.
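For reference, the basic shape of the vmsplice experiment was roughly this (a simplified sketch assuming a Linux-only build, not the code I actually kept):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <fcntl.h>      // vmsplice is declared here on Linux
    #include <sys/uio.h>    // struct iovec
    #include <unistd.h>
    #include <cstring>

    // Simplified sketch of writing to a pipe with vmsplice instead of write().
    // vmsplice hands user pages to the pipe, which can avoid a copy;
    // in my case it did not help, but this is the basic call shape.
    int main() {
        int fds[2];
        if (pipe(fds) != 0) return 1;

        static char buffer[4096];
        std::memset(buffer, 'x', sizeof(buffer));

        struct iovec iov;
        iov.iov_base = buffer;
        iov.iov_len = sizeof(buffer);

        ssize_t written = vmsplice(fds[1], &iov, 1, 0);
        if (written < 0) return 1;

        char out[4096];
        read(fds[0], out, sizeof(out));  // the reader end sees the same bytes
        return 0;
    }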
And after that, only after that, I decided to try perf. And perf showed me the following.
Let's have a look at the second square. The second square shows that we call some kind of statistics very often, and as you can see, these statistics use hash maps. Actually, it turned out that these statistics were designed to be computed once per query, at the end of the query, but they were touched on every pipe message, and that was very slow. It was very slow because these statistics were very inefficient: they used string-based hash maps, which are very slow.
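To illustrate why this hurt (hypothetical names, not the real statistics code), compare a per-message update of a string-keyed hash map with plain counters that are only turned into named statistics at the end of the query:

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // Hypothetical illustration of the problem: a string-keyed hash map
    // updated on every pipe message. Every call hashes and compares the
    // key strings, which is far too much work for a per-message path.
    struct SlowStats {
        std::unordered_map<std::string, uint64_t> Counters;
        void OnMessage(size_t bytes) {
            Counters["messages"] += 1;   // string hashing on the hot path
            Counters["bytes"] += bytes;
        }
    };

    // The cheap alternative: plain integer counters on the hot path,
    // converted to named statistics once, at the end of the query.
    struct FastStats {
        uint64_t Messages = 0;
        uint64_t Bytes = 0;
        void OnMessage(size_t bytes) { ++Messages; Bytes += bytes; }
    };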
And I just removed these statistics, and I got these numbers. After the optimizations I got 223 seconds, and this was achieved with all benchmarks running in containers, so I started a container per query plan task. I think this is an excellent number.
What's next?
First of all, we are going to work on TPC-H at terabyte scale. I think we will get a lot of issues with it, but who knows? And second, we are going to work on the TPC-DS benchmark. TPC-DS is also an OLAP benchmark, or a benchmark for analytical databases, but it's more modern. It contains 99 queries, most queries contain joins, and a typical join involves ten tables. So for this benchmark, plan-level optimizations will be important.
That's it. Thank you for watching and listening.
And if you liked my talk, please hit the like button, and feel free to follow me on social media. Thank you very much.