Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. As you all know,
SQL is everywhere, from extracting niche
information, to analyzing sales information, to
analyzing the company's finances. And today we are
here to explore a fascinating aspect of SQL
queries that you may not have been aware of:
the ability to debug them. Like any other
code, SQL queries are prone to errors and
bugs, and while fixing a syntax error like
the one in that query may seem trivial,
identifying and resolving other issues in queries can
be quite challenging, as things are not
always as they seem, and silent
errors are lurking beneath the surface, waiting to
catch us off guard. These silent errors can be particularly
elusive when they go undetected until later
stages of the pipeline, making it challenging
to spot discrepancies in the dashboards at the end
of the funnel. In the realm of regular software
development, bugs can be uncovered through various means like
debugging, testing, monitoring and even user feedback.
However, in this talk our focus will be on
SQL and on debugging specifically,
though we will also talk about monitoring and how debugging
and monitoring can be relatively similar
and achieved in similar ways. So there
are several kinds of issues that can happen
in a SQL query. We can have missing records,
too many records, duplications, nulls, business
logic or computations that are wrong, and many,
many more. And identifying flaws in SQL
queries is tough. It requires skill
and experience, and the database, or at least most of
them, does not provide a debugger.
It does provide, however, an execution plan.
And with that execution plan we
can understand what's happening beneath the
surface. And the reason databases
don't provide debugging capabilities is that
there is a lot of data, and stepping through it
would be overwhelming and could really
impact performance. But those that do
provide them do it by creating a
materialized sub-expression of a specific relation
or part of the relation. So, buzzword
alert: using the execution plans from above
will allow us to debug those systems through tracing. It
will allow us to understand the behavior of the database from the inside,
and by that, identify bottlenecks and bugs.
So we should democratize execution
plans. So how does it actually work?
We have a SQL query that we want to execute, and
it goes through a journey from parsing to the execution
itself. In parsing we create the structure and
do some basic validation. Then in the analysis we
add some semantic validation.
Then we rewrite the tree to
be a bit more optimized.
And the two last steps add
really important information. We plan which
execution strategy we want to select, which comes with many,
many statistics, and then we execute it and add the
statistics from the actual run. So let's delve
into EXPLAIN, which gives us the
execution plan. EXPLAIN on its own
gives us what the database
wanted, or planned, to do, while if we add
ANALYZE, with EXPLAIN ANALYZE, we get what the database planned
to do and what actually happened. And you
can see here an example of an execution
plan, in Postgres specifically, but it's quite similar across
databases. And you can see that we have EXPLAIN, then
the options, and then the query itself. So one option
is the ANALYZE we just discussed. We can tell
it which format we want, like JSON or XML or stuff like that.
And we have many, many options
that add information, for example, adding information
about the costs or the buffers.
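In Postgres, that might look roughly like this; the table and filter here are only for illustration:

    EXPLAIN (ANALYZE, COSTS, BUFFERS, FORMAT JSON)
    SELECT * FROM orders WHERE status = 'shipped';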
And a pro tip from me to you: go over an execution plan
at least once, because they're similar across databases
and it will probably help you in your career.
So let's explain it.
Let's say we have a query that counts the number
of users that have a Twitter handle.
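Something like this, assuming a hypothetical users table with a twitter_handle column:

    SELECT count(*)
    FROM users
    WHERE twitter_handle IS NOT NULL;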
So we run EXPLAIN ANALYZE on top of that and
we get that blob of text. So we can
see that it's pretty cryptic at first
and it's even longer than our query. So if
we have a bigger query, it can be overwhelming.
But there are numbers here. So probably we
can understand some of the things.
So we can see here the time that the query
took to execute and the time that it
took to plan that query.
And if we go to the main
part, we can see that it's structured as an inverted
tree. And we have many, many operations.
We have operations related to scans,
which read the tables; operations that
relate to joins; and operations that relate
to aggregations. And each of those
nodes holds the information that is
relevant to it. So we have the information
on the aggregation and on the sequential scan,
which includes the filter. And you
don't need to remember all the things that I'm teaching you
now. You can cheat when you're back at your own work and just
look for the weird numbers and ask
Google, ask ChatGPT, they know it even better than me,
but you need to know what to feed them. And I
think it's kind of the easiest way to
do it. And after you identify the issues,
they can even help you find an appropriate
solution. But let me try to explain it anyway.
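For the counting query from before, the relevant part of the output looks roughly like this; the exact numbers are illustrative, not from a real run:

    Aggregate  (cost=2041.00..2041.01 rows=1 width=8) (actual time=18.120..18.121 rows=1 loops=1)
      ->  Seq Scan on users  (cost=0.00..1791.00 rows=100000 width=0) (actual time=0.015..12.480 rows=100000 loops=1)
            Filter: (twitter_handle IS NOT NULL)
            Rows Removed by Filter: 20000
    Planning Time: 0.120 ms
    Execution Time: 18.200 ms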
So if we look at the aggregate node,
we can see several fields.
So we have the planned rows, the estimated number
of rows that will be produced by this node,
the aggregate. So in this example it's one.
And we have the actual rows, which is the number
that was actually produced. There is an asterisk
regarding loops that we will discuss a bit later.
And we have the plan width, which means the average
size of a row. And here
we can see that it's eight bytes.
And next there is stuff that relates to time,
to the execution time and the planning time. So startup
cost and total cost are what the planner thinks will
happen, in arbitrary units, and we have the actual
startup time and the actual total time, where the actual
total time tells us the time to return all the
rows of this specific node, in milliseconds.
And again, it's per loop, which
brings us to loops, the thing that I was waiting
to not discuss. So loops are
kind of weird: a node can basically be executed several times,
and the actual total time and
the actual rows reported are per loop, so they need to be multiplied
by the number of loops that were executed.
They do that to keep the numbers comparable
between the cost estimates and the actual execution.
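For example, if a node reported actual time=0.010..0.800 rows=50 loops=10, it really returned about 500 rows in total (50 rows times 10 loops) and spent about 8 milliseconds across all its executions (0.8 ms times 10 loops); these numbers are made up just to show the arithmetic.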
So you need to know this. And when you calculate,
take that kind of stuff into consideration, which is a
bit annoying. And to be honest,
every database has its own quirks. Some use
slightly different semantics, but the idea is the same.
So how can we use this technique
to find issues? I will give you a silly example, an
empty result, and you can consider it the
same if you have, say, ten joins and
just one of them returns zero records
and you don't know where it comes from.
So we can see here the same query, but I filtered for
specific users whose Twitter handle
is Donald Trump's. And we can
see that the number of rows returned
here is zero. And it
happened both on the aggregate and on the sequential
scan, which means that it originated
from a wrong filter. You can see here the number
of rows that were filtered out. And after we did our
investigation, we understood that the Twitter
handle is spelled a bit differently. So we went back,
we fixed it and we found the issue.
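To make it concrete, the query might have looked something like this; the handle and column names are only illustrative:

    EXPLAIN ANALYZE
    SELECT count(*)
    FROM users
    WHERE twitter_handle = 'realdonaldtrump';

If the stored handle is spelled differently, the Seq Scan node in the plan reports actual rows=0 together with a large Rows Removed by Filter count, which points straight at the filter as the culprit.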
So we can use these kinds of techniques
to help us when we know a problem exists,
and then it serves us as a productivity tool.
And in cases where we don't know whether
a problem exists, it may protect us
and give us more knowledge about whether an
issue exists or not. But as you saw,
it requires knowing quite specifically where
to look for an issue and where not to. And a real
execution plan is much, much bigger,
which is not that nice, to be honest.
So aren't there any easier
ways? So I know that I like
a UI, and I want something to tell me exactly
where a particular issue originated. And I
even want hints to help me rewrite
the query. Maybe now with ChatGPT that's a bit less important.
And I want to know that my query
is correct right now, and that even if I push it to
production and the data changes, I'll know how
to handle it a bit later. And unfortunately,
most of the current tools focus
on the performance part, which is critical.
But we also want our query to be actually correct.
Like, you have the finance queries, which you over
test, so they are probably okay. But the sales and
marketing queries and the rest of the pipeline are kind
of buggy. And you try to find
and fix issues in the wrong way. Like, if
you find duplications, you might add a DISTINCT at the
end of the funnel. But then other consumers
in the middle will still have problems.
So what can we do?
So I wrote a tool that is called queryflow,
which is not that mature and not that
ready to be used. But I want to show you how easy
it can be and how we should strive
to have things like this around. So let's say I have
a query that reads from the titles table
and then joins to crew and then does another
join to people. And I want to get the movies
that Mr. Bing is
playing in. So obviously I have
a mistake and this will return zero
records.
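A rough sketch of that query, with IMDb-style table and column names that are only illustrative:

    SELECT t.title
    FROM titles t
    JOIN crew c ON c.title_id = t.id
    JOIN people p ON p.id = c.person_id
    WHERE p.name = 'Mr. Bing';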
So instead of looking at a huge JSON plan,
I can look at this specific visualization,
which tells me where the issue originates.
So I can visualize any number that we saw before,
and according
to how big that number is, the bar for that
operation is that long. So we can see that reading
crew is big, there are a lot of rows,
people is quite big and titles is smaller.
And after we do the filter, we can see that not only
is it quite small, it's colored
in red. And as a very opinionated
person, I like to add visualizations and
stuff like that. So every color has its meaning.
So here, let's say you have ten joins,
you can see exactly the table where
the issue originated. And instead of visualizing,
you can also use the intermediate representation
to tell you, and track for you, whether you got
zero records anywhere, and even
push it somewhere like Slack or whatever.
But this is an easy example. I want to show
you something a bit different. So let's say I have a join
between titles and genres. And I want to find only movies
that are safe for my kids. So I can see
here that I have a lot of titles,
I have a few genres.
And I can see that after I join the genres and the titles,
the result is much bigger. So here, if I
expected one-to-one or many-to-
one, I can see that I have duplications here and I can find
it again either as a visualization, or I can give it
even a coloring and other things like that.
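Roughly this kind of query, again with illustrative table and column names:

    SELECT t.title
    FROM titles t
    JOIN title_genres g ON g.title_id = t.id
    WHERE g.genre IN ('Family', 'Animation');

If the join node in the plan produces far more actual rows than the titles scan beneath it, that fan-out is exactly the duplication we did not expect.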
And I can find many issues like too many
records, nulls and stuff like that. And it's easy, you just
need to parse the execution plan and then add
several rules on top of that. So what's happening behind
the scenes is that I parse, or I create a parser
for, each database's execution plan, and then it's engine
agnostic and it can support multiple queries and
metrics, which the database itself will never
be able to do. So it can be extremely useful for
performance optimization as well, but that's not this
talk. And each node is enriched with additional
information, and you can use the UI, which indicates
the proportions of the metrics, to find
issues. But obviously it's not mature, it's extremely opinionated.
Still, I think we should expect opinionated
things that will help us make queries correct.
SQL is awesome to use because it's easy,
but we don't test it in most cases,
even with things like dbt, which is great.
So I'm pretty optimistic about the future.
And let's face it, queryflow is still not mature, you should probably
not use it yet. If I work
a bit more on queryflow you might, but I hope that other
people like you will try to make it
easy to work with SQL, because SQL will not leave us any time
soon. And those tools, I think, need
to be easy and intuitive,
integrated in an IDE, and proactive,
helping you with stuff like not having to go back every
single time the data changes, because you want to monitor it
as well. So maybe even things like saving execution
plans, which you can do in Postgres, and then running the
analysis on the saved plans beforehand and stuff like that.
But I think this will be a major game changer.
So thank you very much for listening to me and please enjoy the
rest of the conference. Thank you very much.