Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, and thank you for coming to this talk. My name is
Adam Fulmanik and we are going to talk a little bit about
databases, guardrails, deployment,
CI/CD, and what can go wrong.
So in the next minutes we are going to see how our industry changed over the last few years, how we haven't managed to build good processes and tools around databases yet, and most importantly, how to fix all of that so we can use databases effectively from our applications and prevent bad code from reaching production. This is the idea. So let us begin.
The very important thing I'd like to start with is the
world has changed significantly in the last two decades.
Previously, what we were doing is whenever we had a team working on something, generally there was one product with one database and a couple of hosts running behind a load balancer. That was the idea. If you wanted to deploy something new, then it was deployed completely independently along the way. So each product, something like this on the slide, was basically running with one database, a couple of hosts, a load balancer, and obviously the user calling all of that, right? And if we were to deploy a new product, then it was completely independent. It wasn't necessarily talking to our product; it was using a separate database and whatnot. So this is how the world looked before. When there were issues with the setup, then they were quite easy to manage.
I'm not going to get into solving those, because that depends on the issue, obviously, but figuring out how to approach the issue was rather straightforward. We knew where the logs were. We knew how to find files,
requests, whatever else. And when the database was slow,
we generally knew where to tackle it. So one DBA
could actually deal with all of that. And it was enough for us
to have a DBA around that was capable of solving
our performance issues, problems with databases, or whatever
else. However, this has changed significantly.
The world we are in now is way more complex.
We have microservices, we have multiple components.
We have multiple databases for each application.
We have SQL databases, NoSQL databases. We have
various other serverless solutions. We have S3 buckets, we have all of that, and most importantly, everything talking to everything else.
So the connections, interconnections between applications are
now much more complex and much more complicated.
This is where we are now. And if the problem appears,
then we don't even know how to track it. Because in
the world of microservices, logs are scattered
across multiple components in our ecosystem.
It's not that we just go to the web server and capture the logs. No. Now, first, web servers are scaled out; second, we have multiple independent microservices, so web servers talk to each other; and third, we have so many databases that when a problem appears, we do not necessarily know whether it started where we see it, or whether it's just an outcome of something else.
And this something else can be of a different nature. It can be because the data distribution changed, or maybe because we have a spike in the traffic, or maybe because it's a different time of the year and people are generally using our platform way more because it's Christmas or Halloween or whatever else. So understanding all those things is now much, much harder.
Most importantly, one DBA just won't do.
I mean, we have many databases now, and those independent
databases, they generally need to be tuned and adjusted.
And it doesn't scale well because adjusting and tuning one
database takes time, and adjusting many databases
takes much more time to do that. So how
can we deal with that? Well, our current industry is
telling us that yes, we can use monitoring, we can use monitoring
tools, and this way we can figure out what is going
on. But there are a couple of problems to that. First problem is
everything is complex. This is like a
slide I took from the Internet describing an architecture of
some monitoring solution. And we can see how many things there are in here, capturing just the monitoring data, and how much there is around it just to give us monitoring and observability of the system. But even if we take the monitoring
solutions, they are less than ideal.
The problem with monitoring solutions is, well,
they are kind of complex and sophisticated.
Nowadays we have so many charts. We have charts
showing like raw signals, trends over time,
various data from various sources.
We have all of that shown on a single chart that is not very readable. And most importantly, if there is a problem with the data, with the traffic, with our applications behind the scenes, then based on this chart we just don't know what to do, because these charts, these diagrams just show us, hey, there is a fire in here, but they don't tell us how to put out this fire, how to tame the complexity, or how it started.
We see the fire is there, but we have no idea what
is the reason behind the fire and how to cope with it.
So this is where we are now. And the thing we need to address is we need to see what breaks in the modern world and how we can fix that. And the solution to fixing that is building a completely new approach we call database guardrails. Just like we had to learn that sysops and developers can't work independently, that they need to sit next to each other and work hand in hand, and later on we started doing DevOps engineering, so there are people that know both sides of this scenario. The same way, we need to build new solutions that will let us tame the complexity of databases, protect them from breaking, monitor them when something goes wrong, and finally automatically root cause all the issues. And this is a must have.
No matter whether we are a big Fortune 500 company or a small startup just treading water and building its first product, we need to be able to manage our databases effectively, because when the time comes and the problems begin to happen, it's way too late to figure out what solutions to put in place. So before going
to that, let's see what breaks in the current world.
And this way we'll be able to see how to fix
those issues. So generally, as with every database,
we have like three different groups of problems, three different
areas where problems can appear. First group is code
changes. So we change our code, we change the code of
our JavaScript application so that it reads the data from the database in a different manner. So we have a different query using different tables, different views, basically extracting different data, right? Obviously, once we change the query, things may begin to break.
Okay, second area is schema changes.
So we change the schema of our database,
we change the schema of the columns, types, whatever else.
And this may break the way queries are executed, which indexes are used, or maybe our ORM may go wrong and decrease the quality of the data.
So this is around how we store the data in the database.
But there is also a third important area, which I call here
query changes. And what I mean here is not that we
change the query that we execute, no, we run exactly
the same query as before. But this time, the query that we execute executes differently. It executes differently because something has changed. This something can be database configuration, ORM configuration, maybe an index went wrong, maybe statistics became outdated, or other stuff. Or maybe there are many more transactions in the database, so the database engine needs to cope with those transactions differently.
There are many things that can change outside of
the query, and this query may start running
completely differently, which leads to lower performance.
And this is something that is actually very important in
our databases. And we are going to see throughout this talk
how to deal with that and how our current solutions, current CI/CD pipelines, and current tests don't capture those scenarios. So let's jump straight to it. The very first thing is we have slow queries. So this is the scenario: as you can see here on the screen, we have the query that is generated from our JavaScript application. We are getting a user for some specific user ID, and this is just an aggregate root. So this user has much more data related to it, for instance details, pages, questions, location, whatever else.
We get this data and we finally return the
user in our application code and this query generates
the following SQL statement sent to our database. This SQL statement has something like, what, eight different tables being used? And because it uses so many tables, it exponentially increases the number of rows it extracts from the database and how long it takes to process all of them. This is something I observed in one of my production codebases.
And generally this simple query was extracting 300,000
rows from the database and ultimately
the code was running for nearly 25 seconds.
That wasn't fast; let's just say that was terribly slow. Now the problem with this approach is that when we run this and we'd like to test it locally, then we generally have no way to test it in our local developer environment. Why is that? That is because in our local database, what happens is we have, I don't know, a hundred rows. So when we have a hundred rows, then this query on the right side of the screen won't generate many, many more rows, and effectively it won't cause us any issues. We won't see the performance problems. However, once we go to production, we'll immediately see that something is wrong. But it's only when we go to production. That's one of the issues we may have with databases.
Now, the way we actually fixed this issue in this case was to change this one query, this one big application query extracting all the tables. The solution was to change it to getting the aggregate root and then sending many independent queries. This resulted in sending many SQL statements to the database, which is correct. It may seem like, okay, it's going to be terrible for the performance, but actually those queries are much smaller, they get a much smaller amount of data, so they execute much faster. And ultimately we just concatenate and join all the data in the application code. And this solution, this change, made the query run in something like a split second, generally way faster, even though we were running many more queries.
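To make that concrete, here is a minimal sketch of the many-small-queries approach, assuming PostgreSQL and the node-postgres driver; the table and column names are hypothetical, not the ones from the slide.

```typescript
// Hypothetical sketch: fetch the aggregate root and its related data with
// several small queries instead of one huge multi-table join.
import { Client } from "pg";

async function getUserAggregate(client: Client, userId: number) {
  const user = await client.query("SELECT * FROM users WHERE id = $1", [userId]);
  const details = await client.query("SELECT * FROM user_details WHERE user_id = $1", [userId]);
  const questions = await client.query("SELECT * FROM questions WHERE author_id = $1", [userId]);
  const locations = await client.query("SELECT * FROM locations WHERE user_id = $1", [userId]);

  // Join everything in application code instead of in the database.
  return {
    ...user.rows[0],
    details: details.rows,
    questions: questions.rows,
    locations: locations.rows,
  };
}
```

Each query stays small and index-friendly, so even though there are more round trips, the total time can drop dramatically, which is exactly what happened in the case above.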
Another problem we may see is let's say that we refactor the code
to make it more readable and we break the performance.
So let's say that I have a boarding passes table with
something like 8 million rows. Now, what I'd like to show is a handcrafted query, so probably not something you do often in your production code. But let's say that I would like to use a CTE, a common table expression, to extract boarding passes and calculate a double checksum of the ticket number.
So a CTE, if you don't know what that is, is a kind of named temporary table that lives only for the duration of the query. So what we do is we get everything from boarding passes, we take the ticket number and we calculate the checksum of the ticket number, and we do it twice just to show some CPU-intensive operation. And we store that as an additional column called double hash. So this is our temporary table with this double hash. And then what we do is we join this table three times to match rows one to one based on the ticket number, flight ID, and boarding number, and we look for some specific checksum of the ticket number. However,
we don't need to use the CTE, the common table expression, for that. We can just take the boarding passes table directly, join it three times, and then calculate the checksum of the ticket number right inside the WHERE condition. Those two queries are equivalent and they return exactly the same result. The problem is the query at the bottom runs in 8 seconds, while the query at the top runs in 13, so nearly twice as long. So now imagine that you come to your application
and you'd like to rewrite this query to make it more readable. You have all your unit tests, you have all your integration tests, and they are still green after you rewrite the code. Why is that? That's because those tests only compare the result of the query, not the query performance. They verify whether you extracted exactly the same rows and whether you extracted the same data. So they will be green after you rewrite the code, but they won't notice the issue. The same thing would actually happen with the example we had previously. If you have unit tests here and here, then those unit tests will behave exactly the same. They will be green because you extract the same data; it's just that you extract it differently. So here with this example, you changed the statement, the query, to make it more readable, and you decreased the performance.
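For reference, here is a rough sketch of the two equivalent query shapes, assuming a PostgreSQL-style boarding passes table; the column names and the md5-based checksum are assumptions, not the exact query from the slide.

```typescript
// Readable version using a CTE; the engine may materialize the temporary result.
const withCte = `
  WITH bp AS (
    SELECT *, md5(md5(ticket_no)) AS double_hash FROM boarding_passes
  )
  SELECT *
  FROM bp b1
  JOIN bp b2 ON b2.ticket_no = b1.ticket_no AND b2.flight_id = b1.flight_id AND b2.boarding_no = b1.boarding_no
  JOIN bp b3 ON b3.ticket_no = b1.ticket_no AND b3.flight_id = b1.flight_id AND b3.boarding_no = b1.boarding_no
  WHERE b1.double_hash = $1;
`;

// Equivalent version without the CTE: same rows back, different execution plan.
const withoutCte = `
  SELECT *
  FROM boarding_passes b1
  JOIN boarding_passes b2 ON b2.ticket_no = b1.ticket_no AND b2.flight_id = b1.flight_id AND b2.boarding_no = b1.boarding_no
  JOIN boarding_passes b3 ON b3.ticket_no = b1.ticket_no AND b3.flight_id = b1.flight_id AND b3.boarding_no = b1.boarding_no
  WHERE md5(md5(b1.ticket_no)) = $1;
`;
```

Both shapes return the same data, so result-comparing tests stay green; only the execution plan, and therefore the runtime, differs.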
Yet another issue is changes in schema.
We may actually add columns, drop columns, change the column type, or whatever else. And all those operations may break our performance and how things work in the production database. And we won't notice that with our unit tests. Adding a column may be detrimental for the performance when we use SELECT * statements, because if you add a column, then SELECT * will pick it up and will start extracting more and more data. This may actually decrease the performance. And you won't notice that from your unit tests because, hey, the data that was needed for the unit test does not change. And also, when adding a column, you may cause a table reorganization, a table rewrite, because sometimes the database engine can't add a column to the table just like that. It needs to take the data, copy it to the side, recreate the table, and then copy the data back to the original table. This may take minutes or even
hours. And in some cases I heard about migrations
taking more than a month. Now if
the migration is so long and you didn't prepare for
that, then you risk taking your database down because the
database may simply not be available for your users just
because it's working hard to reorganize the schema.
So such migrations are something that may kill your performance and take your business down for a prolonged time. Now the question is, how do you verify that before
going to production? You don't, because your
unit tests, your integration tests, they generally work against
the latest schema. They don't work against the schema that is currently being migrated. Not to mention that you won't even notice the performance issues with a schema migration, because in your local environment that migration will run nearly instantaneously, in no time.
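As an illustration, a minimal sketch, assuming PostgreSQL (the exact behavior depends on the engine and version) and a hypothetical users table: two migrations that look almost identical can have wildly different cost.

```typescript
// Typically cheap: in recent PostgreSQL versions a constant default is stored
// as metadata, so the table usually does not need to be rewritten.
const cheapMigration = `
  ALTER TABLE users ADD COLUMN is_active boolean DEFAULT false;
`;

// Typically expensive: a volatile default or a column type change generally
// forces a full table rewrite, which can block the table for a long time on
// a large production dataset, even though it finishes instantly locally.
const expensiveMigration = `
  ALTER TABLE users ADD COLUMN created_at timestamptz DEFAULT clock_timestamp();
  ALTER TABLE users ALTER COLUMN id TYPE bigint;
`;
```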
There are other issues with schemas, for instance dropping a column, which is obviously never safe, and which gets especially hard when we are dealing with heterogeneous applications, when we have a database that is used by multiple applications. In such
a scenario, what happens is that you change the schema,
you test your application, one application in isolation,
and all your unit tests are green, obviously, and everything goes well.
But your second application doesn't know that the
column was removed. So again,
unit tests for the second application will still be green
because hey, it's not aware that the column was dropped.
And then you deploy to production and you break the other application.
Changing column types is yet another issue you may face. When you change a column type, then first, you may break the data quality just because you change the representation of numbers, dates, geolocation information. Or you may break your application altogether just because your ORM cannot read and use the data anymore. And you can also change the performance. For instance, previously you were storing geolocation data as a pair of numbers, and now you need to serialize it as a string. And again, your unit tests won't capture this change, first because you have a small database, and second because everything will work correctly from the user perspective. It's only the performance that will break once you go to production. Another thing with schema migrations is
production. Other thing with schema migrations is
missing indexes. So what happens is, okay, you send
a query and this query is slow because
what? Because you lack an index. So what do you do?
You need to add an index to increase the query performance, right? So you
need to pick a proper index type. There are various index types.
You need to analyze how it's going to store the data and whatnot.
But generally you can do that. The important part to understand here is that an index only helps in the read scenario. It's not helping in the data update scenario. Actually, it makes data updates even worse, because you need to update the data in the table, but also all the indexes configured on the table. So while you can configure an index and you can even confirm that, yes, it increased the performance for reading, you also need to check everything else around the performance of data modification. And as before, your unit tests are going to be green. But there is yet another problem with
that. Let's say that you have a junior, inexperienced developer on your team, and they have a slow query, and you suggest what to do: add an index. What they do now is add indexes everywhere. So you have so many indexes because you index basically everything.
And this is yet another problem. You have an index that increases
the performance for reading but kills the performance for writing and
updating. And what's more, when you have many indexes,
you don't actually know whether those indexes are still used,
because it may be that today you configure such an index
and it is being used properly, but you can configure it
and forget about it. And six months from now the index is
not used anymore. Are your unit tests, integration tests
going to tell you that the index is not being used? They won't tell
you that. Why? Because nothing changes. The query
gets the same data with or without an index. It doesn't matter.
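The database itself can answer that question, though. A minimal sketch, assuming PostgreSQL, which keeps per-index usage statistics:

```typescript
// Indexes that have never been scanned since the statistics were last reset;
// each one still pays a price on every INSERT and UPDATE.
const unusedIndexes = `
  SELECT schemaname, relname AS table_name, indexrelname AS index_name, idx_scan
  FROM pg_stat_user_indexes
  WHERE idx_scan = 0
  ORDER BY relname, indexrelname;
`;
```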
Are you going to face the performance issue? Obviously you are, because those indexes will be staying around, but they are not being used at all. Not to mention that there are bugs. Obviously, some of them we will capture with our unit tests; some of them we won't, just because they are tied to the engine we are using.
An important part here is to understand that engines
differ. It may be that in production you have a different type of database just because you have a different license, or you are using some cloud component that you can't run locally, whereas in the local environment you just use something like an in-memory database or a free-tier edition of your cloud component. Right? Now, those two products, the one you use in development and the one you use in production, they may differ.
And because they differ, you may hit different bugs.
One of the bugs that we faced years back, I mean that the industry faced years back, was the Halloween problem.
And the Halloween problem is actually pretty interesting. And now databases
protect themselves from facing this Halloween problem.
But let's see what was going on. Let's say that we would like to give
a pay rise to the employees. So we get the table employees,
and we increase salary by 10% for everyone that earns
less than 10,000, right? Obviously the exact numbers here don't matter, and so on. So what we do is we scan rows one by one, and once we find a row, okay, this row is below 10,000, so we increase the salary by 10%. Now, the problem with the table was that rows were ordered by the salary. So what happens now is, if you get a row whose salary is below 10,000 and you increase the salary, the updated row moves later in the scan order, and then you continue scanning, you find the same row again, and you process it again. So what happened? The problem takes its name from when it happened: it happened on the last day of October, so it was Halloween. And what happened is everyone in the company started earning at least 10,000, because we processed the same row many times. Obviously that was a bug in the database.
But if you are testing your code against a different
database type, then you won't find this bug
at all until you hit production.
There are some other issues with ORMs. ORM libraries help us read the data, write the data, map SQL entities into our objects, and other stuff, and they are pretty useful. However, they pose their own issues in the way they work. So let's say that we have a table called aircraft, and a single aircraft may have many seats.
What we would like to do is we would like to get all the aircraft,
then iterate through every single aircraft and just
get the number of seats. What may happen behind the scenes is that we may face an n plus one queries problem just because of how the ORM gets the data. The ORM can get the data in eager mode or lazy mode. In eager mode, the ORM gets the aggregate root with all the data behind it. In lazy mode, the ORM just gets the basic data, and when we try accessing something else, it fetches that something else and fills in the gaps as we go. So if our ORM is configured in lazy mode, then we may start with one query to get all the aircraft.
And then just as we iterate over those entities,
we will be getting another query for every single
entity along the way. Ultimately we get n plus one queries,
one query for aircraft and n additional queries for
every single aircraft to get the seats. We can optimize that by reconfiguring the ORM to get the data in eager mode. And in eager mode what would happen is we would
get all the aircraft and join them with the seats in
one query. So this seems like a solution.
There are two problems to that though. First problem is
we won't notice this problem from our
unit tests, integration tests or whatever else, just because
the data is correct. The data doesn't change behind the scenes; it's just the way we extract the data that changes. And the second problem is, let's say that we would like to change our ORM from lazy mode to eager mode everywhere. Then we end up with the same problem that we already saw: you join the big aggregate root and you end up extracting hundreds of thousands of rows just because you wanted to get all of that. So neither lazy nor eager mode is a silver bullet. You don't just flip the switch and you're good to go. No, you need to understand your scenario and you need to decide on a case by case basis.
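To make the difference concrete, here is a minimal, ORM-agnostic sketch of the two access patterns, using raw SQL through the node-postgres driver; table and column names are assumptions.

```typescript
import { Client } from "pg";

// Lazy-style access: 1 query for all aircraft plus n queries for seats (n+1 total).
async function seatsPerAircraftLazy(client: Client) {
  const aircraft = await client.query("SELECT aircraft_code FROM aircraft");
  for (const a of aircraft.rows) {
    const seats = await client.query(
      "SELECT COUNT(seat_no) AS seat_count FROM seats WHERE aircraft_code = $1",
      [a.aircraft_code],
    );
    console.log(a.aircraft_code, seats.rows[0].seat_count);
  }
}

// Eager-style access: one query that joins and aggregates everything at once.
async function seatsPerAircraftEager(client: Client) {
  const result = await client.query(`
    SELECT a.aircraft_code, COUNT(s.seat_no) AS seat_count
    FROM aircraft a
    LEFT JOIN seats s ON s.aircraft_code = a.aircraft_code
    GROUP BY a.aircraft_code
  `);
  for (const row of result.rows) console.log(row.aircraft_code, row.seat_count);
}
```

Both functions print exactly the same numbers, which is why tests stay green; only the number of round trips and the amount of data pulled from the database differ.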
Other problems that ORMs introduce are around polymorphism and, generally, impedance mismatch. In general, impedance mismatch is a term meaning that we want to map two incompatible paradigms on top of each other. So we have the object-oriented paradigm or functional paradigm in our application code, and we want to map it onto the SQL paradigm, the relational paradigm. And the opposite obviously goes the same way: we take the SQL paradigm and map it onto OOP, object-oriented programming. So this doesn't go well.
If we want to do it naively, then yes, we can do it the naive way, and there are, for instance, patterns for how to do polymorphism in SQL. Let's say that you have a class that inherits from some other class, and you want to see how to put this class into the database. In a SQL database there are patterns like table per hierarchy, table per type, et cetera. However, they will decrease the performance. So it's not that you can take the way you implement your applications, the object-oriented way, and just put it in a SQL database. This doesn't work, and it goes both ways. So what happens with the data types
in SQL databases? For instance, we have a thing regarding varchar, regarding strings. Basically, we store in the database a thing that is called collation. Collation is basically a configuration in the database that dictates how we compare characters in the database. So we decide whether a lowercase character is equal to an uppercase character, or whether we should put, I don't know, nulls first, or maybe how we compare international characters and whatnot.
And this is something we can configure per database. But can we
configure something like that in our object oriented
programming language? Can we configure how we compare
strings in JavaScript?
We can't, at least not for the built-in comparison operators, and this goes for other types as well. We can specify the precision of numbers with decimal in a SQL database, but in JavaScript we do not have that precision at all. We can, for instance, use spatial data that can be stored as a pair of numbers, but our ORM may not be capable of handling it that way, so it may actually serialize and deserialize that to a string or to JSON. This may hit the performance or even make it unusable, just because we want to map two incompatible paradigms on top of each other. But this goes even further.
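A tiny sketch of that mismatch on the JavaScript side: the built-in comparison is code-point based, while the database compares according to its collation, so you only get collation-like behavior if you reach for something like Intl.Collator explicitly.

```typescript
const words = ["Zebra", "apple", "Łódź"];

// Default sort: by UTF-16 code points, so uppercase sorts before lowercase
// and accented characters end up at the back.
console.log([...words].sort());                                // ["Zebra", "apple", "Łódź"]

// Locale-aware sort, closer to what a database collation would produce.
console.log([...words].sort(new Intl.Collator("en").compare)); // ["apple", "Łódź", "Zebra"]

// Case-insensitive equality, which a case-insensitive collation gives you in SQL
// but the === operator never will.
const ci = new Intl.Collator("en", { sensitivity: "base" });
console.log("adam" === "Adam");                 // false
console.log(ci.compare("adam", "Adam") === 0);  // true
```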
Again, what happens if we have heterogeneous applications?
How can we deal with that? It may be that we change the type
of a column and we break some other application just because it cannot deal with this new column type, or because its ORM can't handle
that. Other problems that ORMs introduce are around lack of visibility. Do you know what your isolation level is? Do you know where your transaction begins or where it rolls back? Do you know who commits the transaction and who rolls it back? What about caching? What about connection pooling? What about query hints? What about migration configuration? Is your ORM going to deal with the migration, with the schema, when it realizes that there is a schema drift? Or is it going to throw an exception at you, or is it going to go and change the schema on your behalf?
Other things may go wrong as well, and we could list more things around migrations, rollbacks, et cetera, heterogeneous applications, how we manage them, how we fix them, and so on and on. Generally there are so many issues around, and managing all of that is hard. And the most important part is we won't notice those when we implement the application in our developer environment.
We just won't notice that. That is
because unit tests and integration tests are going to be
green, but we will hit errors when we go to production.
So ORMs and databases in general are hard, because changes are scattered in various places. There are so many moving pieces, so many things that are not clear to us: whether we load the data eagerly or lazily, which indexes we use, whether we run n plus one queries or just one query, whether the problem is in the database or in the code, whether there are triggers, stored procedures, functions, views. Many, many more issues that we can't deal with easily. And now talking about tests,
do they help? Do they work in our case?
So unit tests obviously won't capture those issues, because, first, they run against the latest schema, and second, they only check the correctness of the data. So they only check whether we read and returned correct data, not how we extracted this data. The same goes for integration tests. They will use a physical component instead of mocked databases or in-memory databases or whatever else, but still, they are probably run against small databases to make them run much faster, so they won't observe the issues. End-to-end tests, on the other hand, if we don't run them in production, face exactly the same problem, so they won't capture the issues either. So it seems like the typical test pyramid is not going to do for us.
But you might say, okay, we have load tests.
Load tests should capture those issues, right? Shouldn't they?
And they can capture some of those issues, but they
have problems of their own. The first problem is they are super expensive. They may take hours to complete. You need to provision hardware for them. You need to host dedicated environments for those. You need to gather the data, you need access to production for that, and you need to anonymize the data so you are compliant with GDPR, California privacy regulations, and other rules. This gets even harder with machine learning scenarios, or with, I don't know, specific mobile device scenarios, and whatnot.
So there are many issues with those. But the biggest issue we have with load tests is that they happen way too late. They happen at the very end of our pipeline. We run load tests after we wrote the code, probably committed it to the branch, even reviewed it with teammates, merged it to the main branch, and deployed it to non-production environments. This is when we run load tests. And if there is an issue observed during the load test, then we need to start from scratch. This is something that is way too late for our cases. So yes, load tests will help us a bit, but they won't prevent bad code from reaching the pre-production environment, meaning that they won't save us time.
We need something better; we need some other solution that is going to prevent bad code as early as possible. And even when we deploy to production and we face issues, we need something that will show us what the problem is and how to automatically fix that. And the solution to that is database guardrails. Database guardrails is the concept of maintaining and managing databases in an efficient manner, so that we can prevent bad code from reaching production, monitor databases semantically, understanding that we monitor databases that are used by developers and by applications, and finally automatically troubleshoot the issues. So in order
to build those database guardrails and to
improve what we have today, we need to understand what
we are dealing with. Because we are talking about databases, we can't use generic solutions and expect them to work well. We need specifically crafted solutions for the business we are in, for the domain we work with, for the architecture of our system. We need to capture the context to understand what is going on. And to do that we need to understand all sides of our application, the developer side and the production side.
So let's see how we can capture this context in order to
build solutions that will prevent bad code from reaching production
and help us diagnose it once we make a mistake.
So the very first thing is that there are multiple parts, multiple pieces, to creating full observability and monitoring of the solution. We start with telemetry. Telemetry is the ability to capture the raw signals. Signals could be logs, metrics, traces, whatever else. So those signals, we need to have a way of capturing them, and telemetry lets us do that. And now there are open source, standardized solutions for capturing the telemetry. For instance, OpenTelemetry is a library, or a set of libraries, that we can use for that. Once we can capture signals, we need to process them in some centralized way, and we need to obviously be able to manage those. Then we can start building the
visibility. Visibility is the state when we see
what's inside the system in every single place. Meaning that no matter whether it's a web server, a database, a queue, an enterprise service bus, a load balancer, a serverless application, whatever else, we basically have telemetry wired in everywhere, in all the places. Once we have that and we capture signals from everywhere, we can say that we have visibility. Visibility shows us what's inside the system.
Then we can start building APM, application performance management. That gives us the ability to say whether our system is healthy or not. And we do it by aggregating the data from all the places; we build meaningful aggregates, meaningful charts, meaningful dashboards showing us what's going on, so that we can just glance at those and immediately say, okay, it's all green, our system is up and running. APM is going to give us exactly that. We can just look at the dashboards and see it's green. However, when things break,
APM is not going to tell us what the reason is. It's going to tell us that, hey, this component is slow, there are problems in this component. But APM won't tell us that this component is slow because we changed a migration, we deployed it last Wednesday, now there is a different data distribution coming from our users, and we stopped using the index. APM won't give us this full story. APM will only tell us the database is slow and the CPU spikes. And now what can you do with that? Well, if you don't know the reason, you may just try and scale up your database, and this might actually help, but it's not the real problem. The true problem is that you stopped using an index that was in the database. So now you need to modify your code to fix that. APM is not going to tell you that, but observability
is. Observability gives you this; it lets you connect all the dots from the developer environment to the production environment. It connects the dots from CI/CD, from your application code, from your web servers, queues, databases, whatever else, and with proper tooling around it, with database guardrails, it can tell you the full story of how things worked.
This is what you can do. Now the question is okay,
but what can we do? How do we do that? We have telemetry,
we have signals, why can't we have observability right away?
And the problem is that monitoring, the way we do it today, captures signals that are generally agnostic to what we do. They can show you errors, they can show you charts, but you need to know how to configure thresholds and what to do with them. Observability, on the other hand, needs to connect the dots from multiple places and understand the characteristics of your application. And specifically, it
needs to understand what is running in your application.
And in order to do that, we need to use OpenTelemetry to capture the stuff from the databases. How can we do that? Well, OpenTelemetry is basically a tool for capturing logs, traces, and metrics. And the way we do that is we can use the set of SDKs from OpenTelemetry that let us capture the signals. Those SDKs are built for the programming language of your choice, and they are also nowadays integrated with many of the pieces you already have: your databases, ORMs, SQL drivers, web frameworks, whatever else.
The way OpenTelemetry works is it captures the so-called signals, sorry, traces and spans. So imagine that we have a workflow where the user comes and wants to, I don't know, buy a ticket, so they click the purchase button. Now what happens is the request comes to the load balancer. It is then being sent to, I don't know, your web server, to the database, the queuing system, the mailer system, and whatever else. What OpenTelemetry does is it lets us compose a trace, a trace of everything that is actually happening behind the scenes as part of this workflow. And the trace is composed of spans that represent what happened in a particular piece of our system. And such a span is basically a big JSON with metadata, a human-friendly explanation: what the events were, when the operation started, when it finished, what the ID of the operation was, what the attributes of the operation were, and other stuff. So this is what OpenTelemetry can do for us. We can capture those signals and use them to build traces and spans.
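For example, here is a minimal sketch of creating such a span with the OpenTelemetry API in a Node.js application, assuming an SDK and exporter are configured elsewhere; the operation and attribute names are made up.

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("ticket-service");

async function buyTicket(userId: number) {
  return tracer.startActiveSpan("buyTicket", async (span) => {
    try {
      // Attributes become part of the span and help connect the dots later.
      span.setAttribute("user.id", userId);
      // ... call the database, the queue, the mailer; instrumented libraries
      // attach their own child spans to the same trace automatically.
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```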
Now the question is, what do we actually capture to
build proper database guardrails? And the answer is: we need to understand databases, as simple as that. And to do that, we need to understand how databases work behind the scenes. So whenever you send a query to the database, it goes through various stages: it is being parsed, it is being rewritten by your database engine. But most importantly, a planner comes in. The planner is a component of the database engine that checks your query and tries to figure out, okay, how do I execute this query to get the highest possible performance?
And let's see how it works. Imagine that we have a query like this one, SELECT * FROM whatever, and we can use the EXPLAIN keyword to get the execution plan of this query. The execution plan could be JSON, could be some other format, explaining what's going on. So if we now dissect this execution plan provided by the database, we can see that it consists of multiple nodes. Basically, every row here is a node in this plan, and each node describes a particular operation that is being executed by the database. Nodes have different types, so we can, for instance, sequentially scan the table, we can use an index, we can join two tables with a nested loops strategy, or whatever else. And more importantly, every operation, every node, has a cost associated with it. This cost basically indicates how hard it is to execute a particular operation. It's not some meaningful value like it will take 6 seconds or load 600 megabytes, nothing like this. It's just an arbitrary value telling us how expensive it is. Now the idea is the database engine can create many plans for a particular query, meaning that you can have multiple execution plans with different costs, and then the database engine picks the cheapest plan, the one with the lowest cost.
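For example, a minimal sketch of asking the planner for a plan, assuming PostgreSQL and the node-postgres driver; the table and the sample output are illustrative.

```typescript
import { Client } from "pg";

async function showPlan(client: Client) {
  const plan = await client.query(
    "EXPLAIN SELECT * FROM boarding_passes WHERE ticket_no = '0005432000284'",
  );
  // Each returned row is one node of the plan, for example something like:
  //   Index Scan using boarding_passes_pkey on boarding_passes
  //     (cost=0.43..8.45 rows=1 width=25)
  // The cost is an arbitrary unit; the engine picks the plan with the lowest one.
  for (const row of plan.rows) console.log(row["QUERY PLAN"]);
}
```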
So that's how databases work. Now this
plan includes everything. It includes which tables are going
to be read, how we are going to read them, which indexes
we're going to use, how much data we're going to extract
and whatnot. So this is what we can do. And now the idea is we use OpenTelemetry and other observability solutions to extract these things from the database when we run stuff in our local environment. And this way we can explain what is going on, because now, just by looking at the query in our local environment, we can tell whether this query is going to work well, because we can take the execution plan and compare it with the production database. And we can see that it doesn't matter that we have just one row in our local database, because we take this plan and check how it would execute in the production database, how long it would take, how heavy it would be. So this way we can prevent the bad code from
reaching production, because we can analyze the changes in the developer environment easily. We can analyze the changes, we can see the migrations, we can see whether indexes are used, and most importantly, we can find things that unit tests won't find for us, because we can see, okay, you are running n plus one queries instead of one query, you are converting the data from numbers to strings, you are not using indexes, and whatnot. So this way we can prevent bad code from reaching production.
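A minimal sketch of what such a check could look like, assuming PostgreSQL and its machine-readable plan format; the rule and the threshold are made up, and a real guardrail would compare against production-like statistics rather than the tiny local dataset.

```typescript
import { Client } from "pg";

// Walk the JSON plan and flag patterns that tend to hurt in production,
// for example sequential scans over large estimated row counts.
async function checkQuery(client: Client, sql: string) {
  const result = await client.query(`EXPLAIN (FORMAT JSON) ${sql}`);
  const root = result.rows[0]["QUERY PLAN"][0]["Plan"];

  const warnings: string[] = [];
  const visit = (node: any) => {
    if (node["Node Type"] === "Seq Scan" && node["Plan Rows"] > 100_000) {
      warnings.push(
        `Sequential scan over ~${node["Plan Rows"]} rows on ${node["Relation Name"]}`,
      );
    }
    for (const child of node["Plans"] ?? []) visit(child);
  };
  visit(root);
  return warnings;
}
```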
But the story doesn't end here. What we can do also
is we can monitor the production database the same
way. And when something breaks, when the database is slow, when the CPU spikes, we can see it's not because you have, I don't know, a slow CPU, but because you stopped using an index, just because you deployed recently and it changed the query, or maybe just because the database engine decided to run the query differently, because you have a different data distribution nowadays. And ultimately we can automatically troubleshoot the stuff, just because we know how it went from your local development box all the way to production.
We can tell you the full story, the full big picture of
what is going on and how to deal with that. So those
are database guardrails. Prevent bad code from
reaching production and push all your checks to the left as
early as possible. Monitor your solution constantly, with an understanding of the databases, to know what is going on, and finally troubleshoot all of that automatically to get the big picture. And now, how can we do that? Well, the interesting thing is that it only takes one JavaScript dependency to be installed in your application to have all of that, to be proactive, to push all the stuff to the left, and to save you from the headache of debugging issues in production, because you can just capture them as early as possible. And Matis does exactly
that. Matis lets you see the stuff and
protects you from breaking your production. So the way
we do it is we integrate with source code, we analyze
the stuff and let me show you really quickly what we do.
So once you create a
project in Matis, you have three pillars, just like we spoke about before. So the very first pillar is we integrate with your application, and we can show you easily that, hey, your application exposes this particular REST endpoint. And in this REST endpoint, what happens is when someone called it, they got this HTTP code. However, there were some issues with this endpoint. What happens is you can dig deeper and you can see and understand that, hey, this is the query that has been executed. This is the actual SQL statement that was run against the database. This is the execution plan with all the details of how things were executed in the database. Those are statistics of tables,
what was read and whatnot. And most importantly we can
analyze that and automatically tell you hey,
you are scanning a table and you read something like 16
million rows, that's not going to fly well. This is
the impact on your database and this is how you can fix that. For instance
by creating such an index to improve the performance.
So this is the important stuff. We can see what happens behind
the scenes, and, using expert knowledge, using machine learning solutions, using whatever else, we can find the anomalies and show you this is not going to fly well once you go to production. And we can do that from your local
development environment. So we take your local box and
we compare that with production and we can show you this is not going
to fly well. So this is about the queries that
you run in your database. But we can also do other stuff.
We can, for instance, analyze your CI/CD. So, for instance, we can show you, hey, you submitted this pull request and this is what happened inside this particular pull request, this is the performance. But we can also find pull requests with a schema migration and show you that, hey, you are trying to create this thing and it's not going to fly well, because you lack indexes when creating this migration, or whatever else. So we analyze this migration even though in your local environment it can run in a split second just because you have hundreds of rows, but we can analyze it and tell you that once you go to production, it's not going to fly well. But this
is just for the prevention. We also cover the stuff
for monitoring. So what we can do is we can show you, hey, this is your production database, and we can show you metrics for your database host, showing you the things that you are interested in from the database perspective. We can analyze your configuration and tell you what is going wrong, but apart from the host, we can also analyze particular databases. So we can, for instance, tell you that, hey, these are the transactions you're running, this is the number of rows, temporary files, whatever else.
We can analyze your indexes, extensions, we can analyze the
live queries as you go through the database and we can show
you hey this is the query that is being executed in your database
and this is how it goes over time. And we can find
insights, show you what is wrong. We can provide anomaly detection for all of that. So this is how we can deal with the databases. And most importantly, we can do all of that with an understanding of databases, with knowing what you do, and building this full picture.
So coming back to the slides, we integrate with the source code, with your CI/CD pipelines, and with your databases, no matter whether it's a local, pre-production, or production database. We can do on-demand analysis and we can provide the database observability that you need to build database guardrails. So we integrate with the source code of whatever language you have, JavaScript, Python, any other languages, frameworks, libraries, drivers. We can integrate with CI/CD, with GitHub Actions or whatever else, to analyze what you do. We can provide you with observability dashboards showing you what's going on and how things change along the way. So in summary,
there are many things that can break with databases today, many things that we won't capture with unit tests or with integration tests, and we need to be proactive. We can't wait for load tests to happen. We need to have those checks as early in the pipeline as possible, we need to push them to the left as much as possible, and we need to build understanding to have successful database guardrails. No matter whether we are a startup or a Fortune 500 company, we need to understand all those big pieces, because just monitoring and just APM won't do for us. And Matis does exactly that. It covers all the things, all the problems, and all the areas of your application. And Matis is the ultimate database guardrails solution that you can just take and start using in your applications. And with all of that said,
I'd like to thank you for attending this talk. I hope you found it useful. I hope you enjoyed it. Join our Discord channel, check out the Matis platform, drop me a line via email or Twitter, whatever works for you, and enjoy the rest of the conference. Thank you.