Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, and thank you for coming to this talk. My name is
Adam Fulmanik and we are going to talk a little bit about
databases, guardrails, deployment,
CI/CD, and what can go wrong.
So in the next minutes we are going to see how our industry changed over the last few years, how we haven't managed to build good processes and tools around databases yet, and most importantly, how to fix all of that so we can use databases effectively from our applications and prevent bad code from reaching production. This is the idea. So let us begin.
The very important thing I'd like to start with is the
world has changed significantly in the last two decades.
Previously, what we were doing is whenever we had a team working on something, generally there was one product with one database and a couple of hosts running behind a load balancer. That was the idea. If you wanted to deploy something new, then it was deployed completely independently along the way. So each product, something like this on the slide, was basically running with one database, a couple of hosts, a load balancer, and obviously the user calling all of that, right? And if we were to deploy a new product, then it was completely independent. It wasn't necessarily talking to our product; it was using a separate database and whatnot. So this is how the world looked before. When there were issues with the setup, then they were quite easy to manage.
I'm not going to get into solving those, because that depends on the issue, obviously, but figuring out how to approach the issue was rather straightforward. We knew where the logs were. We knew how to find files,
requests, whatever else. And when the database was slow,
we generally knew where to tackle it. So one DBA
could actually deal with all of that. And it was enough for us
to have a DBA around that was capable of solving
our performance issues, problems with databases, or whatever
else. However, this has changed significantly.
The world we are in now is way more complex.
We have microservices, we have multiple components.
We have multiple databases for each application.
We have SQL databases, NoSQL databases. We have
various other serverless solutions. We have S3 buckets, we have all of that, and most importantly, everything talking to everything else.
So the connections, interconnections between applications are
now much more complex and much more complicated.
This is where we are now. And if the problem appears,
then we don't even know how to track it. Because in
the world of microservices, logs are scattered
across multiple components in our ecosystem.
It's not that we just go to the web server and capture the logs. No. Now, first, web servers are scaled out; second, we have multiple independent microservices, so web servers talk to each other; and third, we have so many databases that when a problem appears, we do not necessarily know whether it started where we see it, or whether it's just an outcome of something else.
And this something else can be of a different nature. It can be because the data distribution changed, or maybe because we have a spike in the traffic, or maybe because it's a different time of the year and people are generally using our platform way more because it's Christmas or Halloween or whatever else. So understanding all those things is now much, much harder.
Most importantly, one DBA just won't do.
I mean, we have many databases now, and those independent
databases, they generally need to be tuned and adjusted.
And it doesn't scale well because adjusting and tuning one
database takes time, and adjusting many databases
takes much more time to do that. So how
can we deal with that? Well, our current industry is
telling us that yes, we can use monitoring, we can use monitoring
tools, and this way we can figure out what is going
on. But there are a couple of problems to that. First problem is
everything is complex. This is like a
slide I took from the Internet describing an architecture of
some monitoring solution. And we can see how many things there are in here, capturing just the monitoring data, and how much there is around it just to give us monitoring and observability of the system. But even if we take the monitoring
solutions, they are less than ideal.
The problem with monitoring solutions is, well,
they are kind of complex and sophisticated.
Nowadays we have so many charts. We have charts
showing like raw signals, trends over time,
various data from various sources.
We have all of that shown on a single chart that is not very readable. And most importantly, if there is a problem with the data, with the traffic, with our applications behind the scenes, then based on this chart we just don't know what to do, because these charts, these diagrams just show us, hey, there is a fire in here, but they don't tell us how to put out this fire, how to tame the complexity, or how it started.
We see the fire is there, but we have no idea what
is the reason behind the fire and how to cope with it.
So this is where we are now. And the thing we need to address is we need to see what breaks in the modern world and how we can fix that. And the solution to fixing that is building a completely new approach we call database guardrails. Just like we had to learn that sysops and developers can't work independently, that they need to sit next to each other and work hand in hand, and later on we started doing DevOps engineering, so there are people that know both sides of this scenario. The same way, we need to build new solutions that will let us tame the complexity of databases, protect them from breaking, monitor them when something goes wrong, and finally automatically root cause all the issues. And this is a must have.
No matter whether we are a big Fortune 500 company or a small startup just treading water and building its first product, we need to be able to manage our databases effectively, because when the time comes and the problems begin to happen, it's way too late to figure out what solutions to put in place. So before going
to that, let's see what breaks in the current world.
And this way we'll be able to see how to fix
those issues. So generally, as with every database,
we have like three different groups of problems, three different
areas where problems can appear. First group is code
changes. So we change our code, we change the code of
our JavaScript application so that it reads the data from the database in a different manner. So we have a different query using different tables, different views, basically extracting different data, right? Obviously, once we change the query, things may begin to break.
Okay, second area is schema changes.
So we change the schema of our database,
we change the schema of the columns, types, whatever else.
And this may break the way queries are executed, which indexes are used, or maybe our ORM may go wrong and decrease the quality of the data.
So this is around how we store the data in the database.
But there is also a third important area, which I call here
query changes. And what I mean here is not that we
change the query that we execute, no, we run exactly
the same query as before. But this time, the query that we execute executes differently. It executes differently because something has changed. This something can be database configuration, ORM configuration, maybe an index went wrong, maybe statistics became outdated, or other stuff. Or maybe there are many more transactions in the database, so the database engine needs to cope with those transactions differently.
There are many things that can change outside of
the query, and this query may start running
completely differently, which leads to lower performance.
And this is something that is actually very important in
our databases. And we are going to see throughout this talk
how to deal with that and how our current solutions, current CI/CD pipelines, and current tests don't capture those scenarios. So let's jump straight to it. The very first thing is we have slow queries. So this is the scenario: as you can see here on the screen, we have the query that is generated from our JavaScript application. We are getting a user for some specific user ID, and this is just an aggregate root. So this user has much more data related to it, for instance details, pages, questions, location, whatever else.
We get this data and we finally return the
user in our application code and this query generates
the following SQL statement sent to our database. This SQL statement has something like, what, eight different tables being used? And because it uses so many tables, it exponentially increases the number of rows it extracts from the database and how long it takes to process all of them. This is something I observed in one of my production codebases.
And generally this simple query was extracting 300,000
rows from the database and ultimately
the code was running for nearly 25 seconds.
That wasn't fast; let's just say that was terribly slow. Now the problem with this approach is that when we run this and we'd like to test it locally, then we generally have no way to test it in our local developer environment. Why is that? That is because in our local database, what happens is we have, I don't know, a hundred rows. So when we have a hundred rows, then this query on the right side of the screen won't generate many, many more rows, and effectively it won't cause us any issues. We won't see the performance problems. However, once we go to production, we'll immediately see that something is wrong. But it's only when we go to production. That's one of the issues we may have with databases.
Now, the way we actually fixed this issue in this case was to change this one query, this one big application query extracting all the tables. The solution was to change it to getting the aggregate root and then sending many independent queries. This resulted in sending many SQL statements to the database, which is correct. It may seem like, okay, it's going to be terrible for the performance, but actually those queries are much smaller, they get a much smaller amount of data, so they execute much faster. And ultimately we just concatenate and join all the data in the application code. And this solution, this change, made the query run in something like a split second, generally way faster, even though we were running many more queries.
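To make that concrete, here is a minimal sketch of the many-small-queries approach, assuming PostgreSQL and the node-postgres driver; the table and column names are hypothetical, not the ones from the slide.

```typescript
// Hypothetical sketch: fetch the aggregate root and its related data with
// several small queries instead of one huge multi-table join.
import { Client } from "pg";

async function getUserAggregate(client: Client, userId: number) {
  const user = await client.query("SELECT * FROM users WHERE id = $1", [userId]);
  const details = await client.query("SELECT * FROM user_details WHERE user_id = $1", [userId]);
  const questions = await client.query("SELECT * FROM questions WHERE author_id = $1", [userId]);
  const locations = await client.query("SELECT * FROM locations WHERE user_id = $1", [userId]);

  // Join everything in application code instead of in the database.
  return {
    ...user.rows[0],
    details: details.rows,
    questions: questions.rows,
    locations: locations.rows,
  };
}
```

Each query stays small and index-friendly, so even though there are more round trips, the total time can drop dramatically, which is exactly what happened in the case above.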
Another problem we may see is let's say that we refactor the code
to make it more readable and we break the performance.
So let's say that I have a boarding passes table with
something like 8 million rows. Now, what I'd like to show is a handcrafted query, so probably not something you do often in your production code. But let's say that I would like to use a CTE, a common table expression, to extract boarding passes and calculate a double checksum of the ticket number.
So a CTE, if you don't know what that is, is a kind of named temporary table that lives only for the duration of the query. So what we do is we get everything from boarding passes, we take the ticket number and we calculate the checksum of the ticket number, and we do it twice just to show some CPU-intensive operation. And we store that as an additional column called double hash. So this is our temporary table with this double hash. And then what we do is we join this table three times to match rows one to one based on the ticket number, flight ID, and boarding number, and we look for some specific checksum of the ticket number. However,
we don't need to use the CTE, the common table expression, for that. We can just take the boarding passes table directly, join it three times, and then calculate the checksum of the ticket number right inside the WHERE condition. Those two queries are equivalent and they return exactly the same result. The problem is the query at the bottom runs in 8 seconds, while the query at the top runs in 13, so nearly twice as long. So now imagine that you come to your application
and you'd like to rewrite this query to make it more readable. You have all your unit tests, you have all your integration tests, and they are still green after you rewrite the code. Why is that? That's because those tests only compare the result of the query, not the query performance. They verify whether you extracted exactly the same rows and whether you extracted the same data. So they will be green after you rewrite the code, but they won't notice the issue. The same thing would actually happen with the example we had previously. If you have unit tests here and here, then those unit tests will behave exactly the same. They will be green because you extract the same data; it's just that you extract it differently. So here with this example, you changed the statement, the query, to make it more readable, and you decreased the performance.
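For reference, here is a rough sketch of the two equivalent query shapes, assuming a PostgreSQL-style boarding passes table; the column names and the md5-based checksum are assumptions, not the exact query from the slide.

```typescript
// Readable version using a CTE; the engine may materialize the temporary result.
const withCte = `
  WITH bp AS (
    SELECT *, md5(md5(ticket_no)) AS double_hash FROM boarding_passes
  )
  SELECT *
  FROM bp b1
  JOIN bp b2 ON b2.ticket_no = b1.ticket_no AND b2.flight_id = b1.flight_id AND b2.boarding_no = b1.boarding_no
  JOIN bp b3 ON b3.ticket_no = b1.ticket_no AND b3.flight_id = b1.flight_id AND b3.boarding_no = b1.boarding_no
  WHERE b1.double_hash = $1;
`;

// Equivalent version without the CTE: same rows back, different execution plan.
const withoutCte = `
  SELECT *
  FROM boarding_passes b1
  JOIN boarding_passes b2 ON b2.ticket_no = b1.ticket_no AND b2.flight_id = b1.flight_id AND b2.boarding_no = b1.boarding_no
  JOIN boarding_passes b3 ON b3.ticket_no = b1.ticket_no AND b3.flight_id = b1.flight_id AND b3.boarding_no = b1.boarding_no
  WHERE md5(md5(b1.ticket_no)) = $1;
`;
```

Both shapes return the same data, so result-comparing tests stay green; only the execution plan, and therefore the runtime, differs.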
Yet another issue is changes in schema.
We may actually add columns, drop columns, change the column type, or whatever else. And all those operations may break our performance and how things work in the production database. And we won't notice that with our unit tests. Adding a column may be detrimental for the performance when we use SELECT * statements, because if you add a column, then SELECT * will pick it up and will start extracting more and more data. This may actually decrease the performance. And you won't notice that from your unit tests because, hey, the data that was needed for the unit test does not change. And also, when adding a column, you may cause a table reorganization, a table rewrite, because sometimes the database engine can't add a column to the table just like that. It needs to take the data, copy it to the side, recreate the table, and then copy the data back to the original table. This may take minutes or even
hours. And in some cases I heard about migrations
taking more than a month. Now if
the migration is so long and you didn't prepare for
that, then you risk taking your database down because the
database may simply not be available for your users just
because it's working hard to reorganize the schema.
So such migrations are something that may kill your performance and take your business down for a prolonged time. Now the question is, how do you verify that before
going to production? You don't, because your
unit tests, your integration tests, they generally work against
the latest schema. They don't work against the schema that is currently being migrated. Not to mention that you won't even notice the performance issues with a schema migration, because in your local environment that migration will run nearly instantaneously, in no time.
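As an illustration, a minimal sketch, assuming PostgreSQL (the exact behavior depends on the engine and version) and a hypothetical users table: two migrations that look almost identical can have wildly different cost.

```typescript
// Typically cheap: in recent PostgreSQL versions a constant default is stored
// as metadata, so the table usually does not need to be rewritten.
const cheapMigration = `
  ALTER TABLE users ADD COLUMN is_active boolean DEFAULT false;
`;

// Typically expensive: a volatile default or a column type change generally
// forces a full table rewrite, which can block the table for a long time on
// a large production dataset, even though it finishes instantly locally.
const expensiveMigration = `
  ALTER TABLE users ADD COLUMN created_at timestamptz DEFAULT clock_timestamp();
  ALTER TABLE users ALTER COLUMN id TYPE bigint;
`;
```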
There are other issues with schemas, for instance dropping a column, which is obviously never safe, and which gets especially hard when we are dealing with heterogeneous applications, when we have a database that is used by multiple applications. In such
a scenario, what happens is that you change the schema,
you test your application, one application in isolation,
and all your unit tests are green, obviously, and everything goes well.
But your second application doesn't know that the
column was removed. So again,
unit tests for the second application will still be green
because hey, it's not aware that the column was dropped.
And then you deploy to production and you break the other application.
Changing column types is yet another issue you may face. When you change a column type, then first, you may break the data quality just because you change the representation of numbers, dates, geolocation information. Or you may break your application altogether just because your ORM cannot read and use the data anymore. And you can also change the performance. For instance, previously you were storing geolocation data as a pair of numbers, and now you need to serialize it as a string. And again, your unit tests won't capture this change, first because you have a small database, and second because everything will work correctly from the user perspective. It's only the performance that will break once you go to production. Another thing with schema migrations is
production. Other thing with schema migrations is
missing indexes. So what happens is, okay, you send
a query and this query is slow because
what? Because you lack an index. So what do you do?
You need to add an index to increase the query performance, right? So you
need to pick a proper index type. There are various index types.
You need to analyze how it's going to store the data and whatnot.
But generally you can do that. The important part to understand here is that an index only helps in the read scenario. It's not helping in the data update scenario. Actually, it makes data updates even worse, because you need to update the data in the table, but also all the indexes configured on the table. So while you can configure an index and you can even confirm that, yes, it increased the performance for reading, you also need to check everything else around the performance of data modification. And as before, your unit tests are going to be green. But there is yet another problem with
that. Let's say that you have a junior, inexperienced developer on your team, and they have a slow query, and you suggest what to do: add an index. What they do now is add indexes everywhere. So you have so many indexes because you index basically everything.
And this is yet another problem. You have an index that increases
the performance for reading but kills the performance for writing and
updating. And what's more, when you have many indexes,
you don't actually know whether those indexes are still used,
because it may be that today you configure such an index
and it is being used properly, but you can configure it
and forget about it. And six months from now the index is
not used anymore. Are your unit tests, integration tests
going to tell you that the index is not being used? They won't tell
you that. Why? Because nothing changes. The query
gets the same data with or without an index. It doesn't matter.
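The database itself can answer that question, though. A minimal sketch, assuming PostgreSQL, which keeps per-index usage statistics:

```typescript
// Indexes that have never been scanned since the statistics were last reset;
// each one still pays a price on every INSERT and UPDATE.
const unusedIndexes = `
  SELECT schemaname, relname AS table_name, indexrelname AS index_name, idx_scan
  FROM pg_stat_user_indexes
  WHERE idx_scan = 0
  ORDER BY relname, indexrelname;
`;
```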
Are you going to face the performance issue? Obviously you are, because those indexes will be staying around, but they are not being used at all. Not to mention that there are bugs. Obviously, some of them we will capture with our unit tests; some of them we won't, just because they are tied to the engine we are using.
An important part here is to understand that engines
differ. It may be that in production you have a different type of database just because you have a different license, or you are using some cloud component that you can't run locally, whereas in the local environment you just use something like an in-memory database or a free-tier edition of your cloud component. Right? Now, those two products, the one you use in development and the one you use in production, they may differ.
And because they differ, you may hit different bugs.
One of the bugs that we faced years back, I mean that the industry faced years back, was the Halloween problem.
And the Halloween problem is actually pretty interesting. And now databases
protect themselves from facing this Halloween problem.
But let's see what was going on. Let's say that we would like to give
a pay rise to the employees. So we get the table employees,
and we increase salary by 10% for everyone that earns
less than 10,000, right? Obviously the exact numbers here don't matter, and so on. So what we do is we scan rows one by one, and once we find a row, okay, this row is below 10,000, so we increase the salary by 10%. Now, the problem with the table was that rows were ordered by the salary. So what happens now is, if you get a row whose salary is below 10,000 and you increase the salary, the updated row moves later in the scan order, and then you continue scanning, you find the same row again, and you process it again. So what happened? The problem takes its name from when it happened: it happened on the last day of October, so it was Halloween. And what happened is everyone in the company started earning at least 10,000, because we processed the same row many times. Obviously that was a bug in the database.
But if you are testing your code against a different
database type, then you won't find this bug
at all until you hit production.
There are some other issues with ORMs. ORM libraries help us read the data, write the data, map SQL entities into our objects, and other stuff, and they are pretty useful. However, they pose their own issues in the way they work. So let's say that we have a table called aircraft, and a single aircraft may have many seats.
What we would like to do is we would like to get all the aircraft,
then iterate through every single aircraft and just
get the number of seats. What may happen behind the scenes is that we may face an n plus one queries problem just because of how the ORM gets the data. The ORM can get the data in eager mode or lazy mode. In eager mode, the ORM gets the aggregate root with all the data behind it. In lazy mode, the ORM just gets the basic data, and when we try accessing something else, it fetches that something else and fills in the gaps as we go. So if our ORM is configured in lazy mode, then we may start with one query to get all the aircraft.
And then just as we iterate over those entities,
we will be getting another query for every single
entity along the way. Ultimately we get n plus one queries,
one query for aircraft and n additional queries for
every single aircraft to get the seats. We can optimize that by reconfiguring the ORM to get the data in eager mode. And in eager mode what would happen is we would
get all the aircraft and join them with the seats in
one query. So this seems like a solution.
There are two problems to that though. First problem is
we won't notice this problem from our
unit tests, integration tests or whatever else, just because
the data is correct. The data doesn't change behind the scenes; it's just the way we extract the data that changes. And the second problem is, let's say that we would like to change our ORM from lazy mode to eager mode everywhere. Then we end up with the same problem that we already saw: you join the big aggregate root and you end up extracting hundreds of thousands of rows just because you wanted to get all of that. So neither lazy nor eager mode is a silver bullet. You don't just flip the switch and you're good to go. No, you need to understand your scenario and you need to decide on a case by case basis.
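To make the difference concrete, here is a minimal, ORM-agnostic sketch of the two access patterns, using raw SQL through the node-postgres driver; table and column names are assumptions.

```typescript
import { Client } from "pg";

// Lazy-style access: 1 query for all aircraft plus n queries for seats (n+1 total).
async function seatsPerAircraftLazy(client: Client) {
  const aircraft = await client.query("SELECT aircraft_code FROM aircraft");
  for (const a of aircraft.rows) {
    const seats = await client.query(
      "SELECT COUNT(seat_no) AS seat_count FROM seats WHERE aircraft_code = $1",
      [a.aircraft_code],
    );
    console.log(a.aircraft_code, seats.rows[0].seat_count);
  }
}

// Eager-style access: one query that joins and aggregates everything at once.
async function seatsPerAircraftEager(client: Client) {
  const result = await client.query(`
    SELECT a.aircraft_code, COUNT(s.seat_no) AS seat_count
    FROM aircraft a
    LEFT JOIN seats s ON s.aircraft_code = a.aircraft_code
    GROUP BY a.aircraft_code
  `);
  for (const row of result.rows) console.log(row.aircraft_code, row.seat_count);
}
```

Both functions print exactly the same numbers, which is why tests stay green; only the number of round trips and the amount of data pulled from the database differ.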
Other problems that ORMs introduce are around polymorphism and, generally, impedance mismatch. In general, impedance mismatch is a term meaning that we want to map two incompatible paradigms on top of each other. So we have the object-oriented paradigm or functional paradigm in our application code, and we want to map it onto the SQL paradigm, the relational paradigm. And the opposite obviously goes the same way: we take the SQL paradigm and map it onto OOP, object-oriented programming. So this doesn't go well.
If we want to do it naively, then yes, we can do it the naive way, and there are, for instance, patterns for how to do polymorphism in SQL. Let's say that you have a class that inherits from some other class, and you want to see how to put this class into the database. In a SQL database there are patterns like table per hierarchy, table per type, et cetera. However, they will decrease the performance. So it's not that you can take the way you implement your applications, the object-oriented way, and just put it in a SQL database. This doesn't work, and it goes both ways. So what happens with the data types
in SQL databases? For instance, we have a thing regarding varchar, regarding strings. Basically, we store in the database a thing that is called collation. Collation is basically a configuration in the database that dictates how we compare characters in the database. So we decide whether a lowercase character is equal to an uppercase character, or whether we should put, I don't know, nulls first, or maybe how we compare international characters and whatnot.
And this is something we can configure per database. But can we
configure something like that in our object oriented
programming language? Can we configure how we compare
strings in JavaScript?
We can't, at least not for the built-in comparison operators, and this goes for other types as well. We can specify the precision of numbers with decimal in a SQL database, but in JavaScript we do not have that precision at all. We can, for instance, use spatial data that can be stored as a pair of numbers, but our ORM may not be capable of handling it that way, so it may actually serialize and deserialize that to a string or to JSON. This may hit the performance or even make it unusable, just because we want to map two incompatible paradigms on top of each other. But this goes even further.
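A tiny sketch of that mismatch on the JavaScript side: the built-in comparison is code-point based, while the database compares according to its collation, so you only get collation-like behavior if you reach for something like Intl.Collator explicitly.

```typescript
const words = ["Zebra", "apple", "Łódź"];

// Default sort: by UTF-16 code points, so uppercase sorts before lowercase
// and accented characters end up at the back.
console.log([...words].sort());                                // ["Zebra", "apple", "Łódź"]

// Locale-aware sort, closer to what a database collation would produce.
console.log([...words].sort(new Intl.Collator("en").compare)); // ["apple", "Łódź", "Zebra"]

// Case-insensitive equality, which a case-insensitive collation gives you in SQL
// but the === operator never will.
const ci = new Intl.Collator("en", { sensitivity: "base" });
console.log("adam" === "Adam");                 // false
console.log(ci.compare("adam", "Adam") === 0);  // true
```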
Again, what happens if we have heterogeneous applications?
How can we deal with that? It may be that we change the type
of a column and we break some other application just because it cannot deal with this new column type, or because its ORM can't handle
that. Other problems that ORMs introduce are around lack of visibility. Do you know what your isolation level is? Do you know where your transaction begins or where it rolls back? Do you know who commits the transaction and who rolls it back? What about caching? What about connection pooling? What about query hints? What about migration configuration? Is your ORM going to deal with the migration, with the schema, when it realizes that there is a schema drift? Or is it going to throw an exception at you, or is it going to go and change the schema on your behalf?
Other things may go wrong as well, and we could list more things around migrations, rollbacks, et cetera, heterogeneous applications, how we manage them, how we fix them, and so on and on. Generally there are so many issues around, and managing all of that is hard. And the most important part is we won't notice those when we implement the application in our developer environment.
We just won't notice that. That is
because unit tests and integration tests are going to be
green, but we will hit errors when we go to production.
So ORMs and databases in general are hard, because changes are scattered in various places. There are so many moving pieces, so many things that are not clear to us: whether we load the data eagerly or lazily, which indexes we use, whether we run n plus one queries or just one query, whether the problem is in the database or in the code, whether there are triggers, stored procedures, functions, views. Many, many more issues that we can't deal with easily. And now talking about tests,
do they help? Do they work in our case?
So unit tests obviously won't capture those issues, because, first, they run against the latest schema, and second, they only check the correctness of the data. So they only check whether we read and returned correct data, not how we extracted this data. The same goes for integration tests. They will use a physical component instead of mocked databases or in-memory databases or whatever else, but still, they are probably run against small databases to make them run much faster, so they won't observe the issues. End-to-end tests, on the other hand, if we don't run them in production, face exactly the same problem, so they won't capture the issues either. So it seems like the typical test pyramid is not going to do for us.
But you might say, okay, we have load tests.
Load tests should capture those issues, right? Shouldn't they?
And they can capture some of those issues, but they
have problems of their own. The first problem is they are super expensive. They may take hours to complete. You need to provision hardware for them. You need to host dedicated environments for those. You need to gather the data, you need access to production for that, and you need to anonymize the data so you are compliant with GDPR, California privacy regulations, and other rules. This gets even harder with machine learning scenarios, or with, I don't know, specific mobile device scenarios, and whatnot.
So there are many issues with those. But the biggest issue we have with load tests is that they happen way too late. They happen at the very end of our pipeline. We run load tests after we wrote the code, probably committed it to the branch, even reviewed it with teammates, merged it to the main branch, and deployed it to non-production environments. This is when we run load tests. And if there is an issue observed during the load test, then we need to start from scratch. This is something that is way too late for our cases. So yes, load tests will help us a bit, but they won't prevent bad code from reaching the pre-production environment, meaning that they won't save us time.
We need something better; we need some other solution that is going to prevent bad code as early as possible. And even when we deploy to production and we face issues, we need something that will show us what the problem is and how to automatically fix that. And the solution to that is database guardrails. Database guardrails is the concept of maintaining and managing databases in an efficient manner, so that we can prevent bad code from reaching production, monitor databases semantically, understanding that we monitor databases that are used by developers and by applications, and finally automatically troubleshoot the issues. So in order
to build those database guardrails and to
improve what we have today, we need to understand what
we are dealing with. Because we are talking about databases, we can't use generic solutions and expect them to work well. We need specifically crafted solutions for the business we are in, for the domain we work with, for the architecture of our system. We need to capture the context to understand what is going on. And to do that we need to understand all sides of our application, the developer side and the production side.
So let's see how we can capture this context in order to
build solutions that will prevent bad code from reaching production
and help us diagnose it once we make a mistake.
So the very first thing is that there are multiple parts, multiple pieces, to creating full observability and monitoring of the solution. We start with telemetry. Telemetry is the ability to capture the raw signals. Signals could be logs, metrics, traces, whatever else. So those signals, we need to have a way of capturing them, and telemetry lets us do that. And now there are open source, standardized solutions for capturing the telemetry. For instance, OpenTelemetry is a library, or a set of libraries, that we can use for that. Once we can capture signals, we need to process them in some centralized way, and we need to obviously be able to manage those. Then we can start building the
visibility. Visibility is the state when we see
what's inside the system in every single place. Meaning that no matter whether it's a web server, a database, a queue, an enterprise service bus, a load balancer, a serverless application, whatever else, we basically have telemetry wired in everywhere, in all the places. Once we have that and we capture signals from everywhere, we can say that we have visibility. Visibility shows us what's inside the system.
Then we can start building APM, application performance management. That gives us the ability to say whether our system is healthy or not. And we do it by aggregating the data from all the places; we build meaningful aggregates, meaningful charts, meaningful dashboards showing us what's going on, so that we can just glance at those and immediately say, okay, it's all green, our system is up and running. APM is going to give us exactly that. We can just look at the dashboards and see it's green. However, when things break,
APM is not going to tell us what the reason is. It's going to tell us that, hey, this component is slow, there are problems in this component. But APM won't tell us that this component is slow because we changed a migration, we deployed it last Wednesday, now there is a different data distribution coming from our users, and we stopped using the index. APM won't give us this full story. APM will only tell us the database is slow and the CPU spikes. And now what can you do with that? Well, if you don't know the reason, you may just try and scale up your database, and this might actually help, but it's not the real problem. The true problem is that you stopped using an index that was in the database. So now you need to modify your code to fix that. APM is not going to tell you that, but observability
is. Observability gives you this; it lets you connect all the dots from the developer environment to the production environment. It connects the dots from CI/CD, from your application code, from your web servers, queues, databases, whatever else, and with proper tooling around it, with database guardrails, it can tell you the full story of how things worked.
This is what you can do. Now the question is okay,
but what can we do? How do we do that? We have telemetry,
we have signals, why can't we have observability right away?
And the problem is that monitoring, the way we do it today, captures signals that are generally agnostic to what we do. They can show you errors, they can show you charts, but you need to know how to configure thresholds and what to do with them. Observability, on the other hand, needs to connect the dots from multiple places and understand the characteristics of your application. And specifically, it
needs to understand what is running in your application.
And in order to do that, we need to use OpenTelemetry to capture the stuff from the databases. How can we do that? Well, OpenTelemetry is basically a tool for capturing logs, traces, and metrics. And the way we do that is we can use the set of SDKs from OpenTelemetry that let us capture the signals. Those SDKs are built for the programming language of your choice, and they are also nowadays integrated with many of the pieces you already have: your databases, ORMs, SQL drivers, web frameworks, whatever else.
The way OpenTelemetry works is it captures the so-called signals, sorry, traces and spans. So imagine that we have a workflow where the user comes and wants to, I don't know, buy a ticket, so they click the purchase button. Now what happens is the request comes to the load balancer. It is then being sent to, I don't know, your web server, to the database, the queuing system, the mailer system, and whatever else. What OpenTelemetry does is it lets us compose a trace, a trace of everything that is actually happening behind the scenes as part of this workflow. And the trace is composed of spans that represent what happened in a particular piece of our system. And such a span is basically a big JSON with metadata, a human-friendly explanation: what the events were, when the operation started, when it finished, what the ID of the operation was, what the attributes of the operation were, and other stuff. So this is what OpenTelemetry can do for us. We can capture those signals and use them to build traces and spans.
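For example, here is a minimal sketch of creating such a span with the OpenTelemetry API in a Node.js application, assuming an SDK and exporter are configured elsewhere; the operation and attribute names are made up.

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("ticket-service");

async function buyTicket(userId: number) {
  return tracer.startActiveSpan("buyTicket", async (span) => {
    try {
      // Attributes become part of the span and help connect the dots later.
      span.setAttribute("user.id", userId);
      // ... call the database, the queue, the mailer; instrumented libraries
      // attach their own child spans to the same trace automatically.
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```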
Now the question is, what do we actually capture to
build proper database guardrails? And the answer is: we need to understand databases, as simple as that. And to do that, we need to understand how databases work behind the scenes. So whenever you send a query to the database, it goes through various stages: it is being parsed, it is being rewritten by your database engine. But most importantly, a planner comes in. The planner is a component of the database engine that checks your query and tries to figure out, okay, how do I execute this query to get the highest possible performance?
And let's see how it works. Imagine that we have a query like this one, SELECT * FROM whatever, and we can use the EXPLAIN keyword to get the execution plan of this query. The execution plan could be JSON, could be some other format, explaining what's going on. So if we now dissect this execution plan provided by the database, we can see that it consists of multiple nodes. Basically, every row here is a node in this plan, and each node describes a particular operation that is being executed by the database. Nodes have different types, so we can, for instance, sequentially scan the table, we can use an index, we can join two tables with a nested loops strategy, or whatever else. And more importantly, every operation, every node, has a cost associated with it. This cost basically indicates how hard it is to execute a particular operation. It's not some meaningful value like it will take 6 seconds or load 600 megabytes, nothing like this. It's just an arbitrary value telling us how expensive it is. Now the idea is the database engine can create many plans for a particular query, meaning that you can have multiple execution plans with different costs, and then the database engine picks the cheapest plan, the one with the lowest cost.
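For example, a minimal sketch of asking the planner for a plan, assuming PostgreSQL and the node-postgres driver; the table and the sample output are illustrative.

```typescript
import { Client } from "pg";

async function showPlan(client: Client) {
  const plan = await client.query(
    "EXPLAIN SELECT * FROM boarding_passes WHERE ticket_no = '0005432000284'",
  );
  // Each returned row is one node of the plan, for example something like:
  //   Index Scan using boarding_passes_pkey on boarding_passes
  //     (cost=0.43..8.45 rows=1 width=25)
  // The cost is an arbitrary unit; the engine picks the plan with the lowest one.
  for (const row of plan.rows) console.log(row["QUERY PLAN"]);
}
```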
So that's how databases work. Now this
plan includes everything. It includes which tables are going
to be read, how we are going to read them, which indexes
we're going to use, how much data we're going to extract
and whatnot. So this is what we can do. And now the idea is we use OpenTelemetry and other observability solutions to extract these things from the database when we run stuff in our local environment. And this way we can explain what is going on, because now, just by looking at the query in our local environment, we can tell whether this query is going to work well, because we can take the execution plan and compare it with the production database. And we can see that it doesn't matter that we have just one row in our local database, because we take this plan and check how it would execute in the production database, how long it would take, how heavy it would be. So this way we can prevent the bad code from
reaching production, because we can analyze the changes in the developer environment easily. We can analyze the changes, we can see the migrations, we can see whether indexes are used, and most importantly, we can find things that unit tests won't find for us, because we can see, okay, you are running n plus one queries instead of one query, you are converting the data from numbers to strings, you are not using indexes, and whatnot. So this way we can prevent bad code from reaching production.
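A minimal sketch of what such a check could look like, assuming PostgreSQL and its machine-readable plan format; the rule and the threshold are made up, and a real guardrail would compare against production-like statistics rather than the tiny local dataset.

```typescript
import { Client } from "pg";

// Walk the JSON plan and flag patterns that tend to hurt in production,
// for example sequential scans over large estimated row counts.
async function checkQuery(client: Client, sql: string) {
  const result = await client.query(`EXPLAIN (FORMAT JSON) ${sql}`);
  const root = result.rows[0]["QUERY PLAN"][0]["Plan"];

  const warnings: string[] = [];
  const visit = (node: any) => {
    if (node["Node Type"] === "Seq Scan" && node["Plan Rows"] > 100_000) {
      warnings.push(
        `Sequential scan over ~${node["Plan Rows"]} rows on ${node["Relation Name"]}`,
      );
    }
    for (const child of node["Plans"] ?? []) visit(child);
  };
  visit(root);
  return warnings;
}
```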
But the story doesn't end here. What we can do also
is we can monitor the production database the same
way. And when something breaks, when the database is slow, when the CPU spikes, we can see it's not because you have, I don't know, a slow CPU, but because you stopped using an index, just because you deployed recently and it changed the query, or maybe just because the database engine decided to run the query differently, because you have a different data distribution nowadays. And ultimately we can automatically troubleshoot the stuff, just because we know how it went from your local development box all the way to production.
We can tell you the full story, the full big picture of
what is going on and how to deal with that. So those
are database guardrails. Prevent bad code from
reaching production and push all your checks to the left as
early as possible. Monitor your solution constantly, with an understanding of the databases, to know what is going on, and finally troubleshoot all of that automatically to get the big picture. And now, how can we do that? Well, the interesting thing is that it only takes one JavaScript dependency to be installed in your application to have all of that, to be proactive, to push all the stuff to the left, and to save you from the headache of debugging issues in production, because you can just capture them as early as possible. And Matis does exactly
that. Matis lets you see the stuff and
protects you from breaking your production. So the way
we do it is we integrate with source code, we analyze
the stuff and let me show you really quickly what we do.
So once you create a
project in Matis, you have three pillars, just like we spoke about before. So the very first pillar is we integrate with your application, and we can show you easily that, hey, your application exposes this particular REST endpoint. And in this REST endpoint, what happens is when someone called it, they got this HTTP code. However, there were some issues with this endpoint. What happens is you can dig deeper and you can see and understand that, hey, this is the query that has been executed. This is the actual SQL statement that was run against the database. This is the execution plan with all the details of how things were executed in the database. Those are statistics of tables,
what was read and whatnot. And most importantly we can
analyze that and automatically tell you hey,
you are scanning a table and you read something like 16
million rows, that's not going to fly well. This is
the impact on your database and this is how you can fix that. For instance
by creating such an index to improve the performance.
So this is the important stuff. We can see what happens behind
the scenes, and, using expert knowledge, using machine learning solutions, using whatever else, we can find the anomalies and show you this is not going to fly well once you go to production. And we can do that from your local
development environment. So we take your local box and
we compare that with production and we can show you this is not going
to fly well. So this is about the queries that
you run in your database. But we can also do other stuff.
We can, for instance, analyze your CI/CD. So, for instance, we can show you, hey, you submitted this pull request and this is what happened inside this particular pull request, this is the performance. But we can also find pull requests with a schema migration and show you that, hey, you are trying to create this thing and it's not going to fly well, because you lack indexes when creating this migration, or whatever else. So we analyze this migration even though in your local environment it can run in a split second just because you have hundreds of rows, but we can analyze it and tell you that once you go to production, it's not going to fly well. But this
is just for the prevention. We also cover the stuff
for monitoring. So what we can do is we can show you, hey, this is your production database, and we can show you metrics for your database host, showing you the things that you are interested in from the database perspective. We can analyze your configuration and tell you what is going wrong, but apart from the host, we can also analyze particular databases. So we can, for instance, tell you that, hey, these are the transactions you're running, this is the number of rows, temporary files, whatever else.
We can analyze your indexes, extensions, we can analyze the
live queries as you go through the database and we can show
you hey this is the query that is being executed in your database
and this is how it goes over time. And we can find
insights, show you what is wrong. We can provide anomaly detection for all of that. So this is how we can deal with the databases. And most importantly, we can do all of that with an understanding of databases, with knowing what you do, and building this full picture.
So coming back to the slides, we integrate with the source code, with your CI/CD pipelines, and with your databases, no matter whether it's a local, pre-production, or production database. We can do on-demand analysis and we can provide the database observability that you need to build database guardrails. So we integrate with the source code of whatever language you have, JavaScript, Python, any other languages, frameworks, libraries, drivers. We can integrate with CI/CD, with GitHub Actions or whatever else, to analyze what you do. We can provide you with observability dashboards showing you what's going on and how things change along the way. So in summary,
there are many things that can break with databases today, many things that we won't capture with unit tests or with integration tests, and we need to be proactive. We can't wait for load tests to happen. We need to have those checks as early in the pipeline as possible, we need to push them to the left as much as possible, and we need to build understanding to have successful database guardrails. No matter whether we are a startup or a Fortune 500 company, we need to understand all those big pieces, because just monitoring and just APM won't do for us. And Matis does exactly that. It covers all the things, all the problems, and all the areas of your application. And Matis is the ultimate database guardrails solution that you can just take and start using in your applications. And with all of that said,
I'd like to thank you for attending this talk. I hope you found it useful. I hope you enjoyed it. Join our Discord channel, check out the Matis platform, drop me a line via email or Twitter, whatever works for you, and enjoy the rest of the conference. Thank you.