Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. As you all know,
SQL is everywhere, from extracting niche
information, to analyzing sales information, to
analyzing the company's finances. And today we are
here to explore a fascinating aspect of SQL
queries that you may not have been aware of:
the ability to debug them. Like any other
code, SQL queries are prone to errors and
bugs, and while fixing a syntax error like
the one in that query may seem trivial,
identifying and resolving other issues in queries can
be quite challenging, as things are not
always as they seem, and silent
errors are lurking beneath the surface, waiting to
catch us off guard. These silent errors can be particularly
elusive when they go undetected until later
stages of the pipeline, making it challenging
to spot discrepancies in the dashboards at the end
of the funnel. In the realm of regular software
development, bugs can be uncovered through various means like
debugging, testing, monitoring and even user feedback.
However, in this talk our focus will be on
SQL and on debugging specifically,
though we will also talk about monitoring and how debugging
and monitoring can be relatively similar
and achieved in similar ways. So there
are several kinds of issues that can happen
in a SQL query. We can have missing records,
too many records, duplications, nulls, business
logic or computations that are wrong, and many,
many more. And identifying flaws in SQL
queries is tough. It requires skill
and experience, and the database, or at least most of
them, does not provide a debugger.
It does provide, however, an execution plan.
And with that execution plan we
can understand what's happening beneath the
surface. And the reason databases
don't provide debugging capabilities is that
there is a lot of data, and stepping through it
would be overwhelming and could really
impact performance. But those that do
provide them do it by creating a
materialized sub-expression of a specific relation
or part of the relation. So, buzzword
alert: using the execution plans from above
will allow us to debug those systems through tracing. It
will allow us to understand the behavior of the database from the inside,
and by that, identify bottlenecks and bugs.
So we should democratize execution
plans. So how does it actually work?
We have a SQL query that we want to execute, and
it goes through a journey from parsing to the execution
itself. In parsing we create the structure and
do some basic validation. Then in the analysis we
add some semantic validation.
Then we rewrite the tree to
be a bit more optimized.
And the two last steps add
really important information. We plan which
execution strategy we want to select, which comes with many,
many statistics, and then we execute it and add the
statistics from the actual run. So let's delve
into EXPLAIN, which gives us the
execution plan. EXPLAIN on its own
gives us what the database
wanted, or planned, to do, while if we add
ANALYZE, with EXPLAIN ANALYZE, we get what the database planned
to do and what actually happened. And you
can see here an example of an execution
plan, in Postgres specifically, but it's quite similar across
databases. And you can see that we have EXPLAIN, then
the options, and then the query itself. So one option
is the ANALYZE we just discussed. We can tell
it which format we want, like JSON or XML or stuff like that.
And we have many, many options
that add information, for example, adding information
about the costs or the buffers.
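In Postgres, that might look roughly like this; the table and filter here are only for illustration:

    EXPLAIN (ANALYZE, COSTS, BUFFERS, FORMAT JSON)
    SELECT * FROM orders WHERE status = 'shipped';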
And a pro tip from me to you: go over an execution plan
at least once, because they're similar across databases
and it will probably help you in your career.
So let's explain it.
Let's say we have a query that counts the number
of users that have a Twitter handle.
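Something like this, assuming a hypothetical users table with a twitter_handle column:

    SELECT count(*)
    FROM users
    WHERE twitter_handle IS NOT NULL;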
So we run EXPLAIN ANALYZE on top of that and
we get that blob of text. So we can
see that it's pretty cryptic at first
and it's even longer than our query. So if
we have a bigger query, it can be overwhelming.
But there are numbers here. So probably we
can understand some of the things.
So we can see here the time that the query
took to execute and the time that it
took to plan that query.
And if we go to the main
part, we can see that it's structured as an inverted
tree. And we have many, many operations.
We have operations related to scans,
which read the tables; operations that
relate to joins; and operations that relate
to aggregations. And each of those
nodes holds the information that is
relevant to it. So we have the information
on the aggregation and on the sequential scan,
which includes the filter. And you
don't need to remember all the things that I'm teaching you
now. You can cheat when you're back at your own work and just
look for the weird numbers and ask
Google, ask ChatGPT, they know it even better than me,
but you need to know what to feed them. And I
think it's kind of the easiest way to
do it. And after you identify the issues,
they can even help you find an appropriate
solution. But let me try to explain it anyway.
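For the counting query from before, the relevant part of the output looks roughly like this; the exact numbers are illustrative, not from a real run:

    Aggregate  (cost=2041.00..2041.01 rows=1 width=8) (actual time=18.120..18.121 rows=1 loops=1)
      ->  Seq Scan on users  (cost=0.00..1791.00 rows=100000 width=0) (actual time=0.015..12.480 rows=100000 loops=1)
            Filter: (twitter_handle IS NOT NULL)
            Rows Removed by Filter: 20000
    Planning Time: 0.120 ms
    Execution Time: 18.200 ms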
So if we look at the aggregate node,
we can see several fields.
So we have the planned rows, the estimated number
of rows that will be produced by this node,
the aggregate. So in this example it's one.
And we have the actual rows, which is the number
that was actually produced. There is an asterisk
regarding loops that we will discuss a bit later.
And we have the plan width, which means the average
size of a row. And here
we can see that it's eight bytes.
And next there is stuff that relates to time,
to the execution time and the planning time. So startup
cost and total cost are what the planner thinks will
happen, in arbitrary units, and we have the actual
startup time and the actual total time, where the actual
total time tells us the time to return all the
rows of this specific node, in milliseconds.
And again, it's per loop, which
brings us to loops, the thing that I was waiting
to not discuss. So loops are
kind of weird: a node can basically be executed several times,
and the actual total time and
the actual rows reported are per loop, so they need to be multiplied
by the number of loops that were executed.
They do that to keep the numbers comparable
between the cost estimates and the actual execution.
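For example, if a node reported actual time=0.010..0.800 rows=50 loops=10, it really returned about 500 rows in total (50 rows times 10 loops) and spent about 8 milliseconds across all its executions (0.8 ms times 10 loops); these numbers are made up just to show the arithmetic.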
So you need to know this. And when you calculate,
take that kind of stuff into consideration, which is a
bit annoying. And to be honest,
every database has its own quirks. Some use
slightly different semantics, but the idea is the same.
So how can we use this technique
to find issues? I will give you a silly example, an
empty result, and you can consider it the
same if you have, say, ten joins and
just one of them returns zero records
and you don't know where it comes from.
So we can see here the same query, but I filtered for
specific users whose Twitter handle
is Donald Trump's. And we can
see that the number of rows returned
here is zero. And it
happened both on the aggregate and on the sequential
scan, which means that it originated
from a wrong filter. You can see here the number
of rows that were filtered out. And after we did our
investigation, we understood that the Twitter
handle is spelled a bit differently. So we went back,
we fixed it and we found the issue.
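To make it concrete, the query might have looked something like this; the handle and column names are only illustrative:

    EXPLAIN ANALYZE
    SELECT count(*)
    FROM users
    WHERE twitter_handle = 'realdonaldtrump';

If the stored handle is spelled differently, the Seq Scan node in the plan reports actual rows=0 together with a large Rows Removed by Filter count, which points straight at the filter as the culprit.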
So we can use these kinds of techniques
to help us when we know a problem exists,
and then it serves us as a productivity tool.
And in cases where we don't know whether
a problem exists, it may protect us
and give us more knowledge about whether an
issue exists or not. But as you saw,
it requires knowing quite specifically where
to look for an issue and where not to. And a real
execution plan is much, much bigger,
which is not that nice, to be honest.
So aren't there any easier
ways? So I know that I like
a UI, and I want something to tell me exactly
where a particular issue originated. And I
even want hints to help me rewrite
the query. Maybe now with ChatGPT that's a bit less important.
And I want to know that my query
is correct right now, and that even if I push it to
production and the data changes, I'll know how
to handle it a bit later. And unfortunately,
most of the current tools focus
on the performance part, which is critical.
But we also want our query to be actually correct.
Like, you have the finance queries, which you over
test, so they are probably okay. But the sales and
marketing queries and the rest of the pipeline are kind
of buggy. And you try to find
and fix issues in the wrong way. Like, if
you find duplications, you might add a DISTINCT at the
end of the funnel. But then other consumers
in the middle will still have problems.
So what can we do?
So I wrote a tool that is called queryflow,
which is not that mature and not that
ready to be used. But I want to show you how easy
it can be and how we should strive
to have things like this around. So let's say I have
a query that reads from the titles table
and then joins to crew and then does another
join to people. And I want to get the movies
that Mr. Bing is
playing in. So obviously I have
a mistake and this will return zero
records.
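A rough sketch of that query, with IMDb-style table and column names that are only illustrative:

    SELECT t.title
    FROM titles t
    JOIN crew c ON c.title_id = t.id
    JOIN people p ON p.id = c.person_id
    WHERE p.name = 'Mr. Bing';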
So instead of looking at a huge JSON plan,
I can look at this specific visualization,
which tells me where the issue originates.
So I can visualize any number that we saw before,
and according
to how big that number is, the bar for that
operation is that long. So we can see that reading
crew is big, there are a lot of rows,
people is quite big and titles is smaller.
And after we do the filter, we can see that not only
is it quite small, it's colored
in red. And as a very opinionated
person, I like to add visualizations and
stuff like that. So every color has its meaning.
So here, let's say you have ten joins,
you can see exactly the table where
the issue originated. And instead of visualizing,
you can also use the intermediate representation
to tell you, and track for you, whether you got
zero records anywhere, and even
push it somewhere like Slack or whatever.
But this is an easy example. I want to show
you something a bit different. So let's say I have a join
between titles and genres. And I want to find only movies
that are safe for my kids. So I can see
here that I have a lot of titles,
I have a few genres.
And I can see that after I join the genres and the titles,
the result is much bigger. So here, if I
expected one-to-one or many-to-
one, I can see that I have duplications here and I can find
it again either as a visualization, or I can give it
even a coloring and other things like that.
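Roughly this kind of query, again with illustrative table and column names:

    SELECT t.title
    FROM titles t
    JOIN title_genres g ON g.title_id = t.id
    WHERE g.genre IN ('Family', 'Animation');

If the join node in the plan produces far more actual rows than the titles scan beneath it, that fan-out is exactly the duplication we did not expect.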
And I can find many issues like too many
records, nulls and stuff like that. And it's easy, you just
need to parse the execution plan and then add
several rules on top of that. So what's happening behind
the scenes is that I parse, or I create a parser
for, each database's execution plan, and then it's engine
agnostic and it can support multiple queries and
metrics, which the database itself will never
be able to do. So it can be extremely useful for
performance optimization as well, but that's not this
talk. And each node is enriched with additional
information, and you can use the UI, which indicates
the proportions of the metrics, to find
issues. But obviously it's not mature, it's extremely opinionated.
Still, I think we should expect opinionated
things that will help us make queries correct.
SQL is awesome to use because it's easy,
but we don't test it in most cases,
even with things like dbt, which is great.
So I'm pretty optimistic about the future.
And let's face it, queryflow is still not mature, you should probably
not use it yet. If I work
a bit more on queryflow you might, but I hope that other
people like you will try to make it
easy to work with SQL, because SQL will not leave us any time
soon. And those tools, I think, need
to be easy and intuitive,
integrated in an IDE, and proactive,
helping you with stuff like not having to go back every
single time the data changes, because you want to monitor it
as well. So maybe even things like saving execution
plans, which you can do in Postgres, and then running the
analysis on the saved plans beforehand and stuff like that.
But I think this will be a major game changer.
So thank you very much for listening to me and please enjoy the
rest of the conference. Thank you very much.