Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, everyone.
My name is Rotem, and welcome to my talk, Strategies for Reliable
Database Schema Migrations.
Before we jump into what that exactly means, let me introduce myself.
So my name is Rotem.
For the past three and a half years, I've been the CTO and co-founder
of a company called Ariga.
As part of my role, I have the awesome responsibility of being a co-maintainer of
two fairly large open source projects.
One of them is called Ent; it's an entity framework for Go.
It's part of the Linux Foundation, started by my co-founder when
he was working at Facebook.
And the second one, which we will go into in depth today, is called Atlas, a
database schema management tool.
Before we go into how to make schema migrations more reliable,
I want to tell you a story.
And it's my favorite kind of story.
It is a developer horror story.
It's a true story.
Like in the TV show Fargo, all of the events here really took place.
We just changed the names.
But the rest has been told exactly as it occurred.
It was a sunny day in Tel Aviv.
The product manager asked our developer for a simple request.
Can we add a new status to the user entity in our backend application?
No problem, thought the dev.
I'm a professional.
In this case, our developer is using an ORM in Node.js called Sequelize.
Sequelize lets you define your data model as JavaScript objects, and this
object has a status field. In it, we list the possible values for this enum.
So in this case, we added the invited status to the list of
possible values for the user status.
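For readers following along, here is a hedged sketch of what such a Sequelize model change might look like; the file layout, field options, and other status values are illustrative, with 'invited' standing in for the newly added value from the story.

```js
// models/user.js -- a hypothetical Sequelize model definition
const { DataTypes } = require('sequelize');

module.exports = (sequelize) =>
  sequelize.define('User', {
    // The one-line data model change: adding 'invited' to the enum values.
    status: {
      type: DataTypes.ENUM('active', 'pending', 'invited'),
      allowNull: false,
      defaultValue: 'pending',
    },
  });
```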
Now, when you make a change to the data model, you need to
change the database as well.
Otherwise, you're going to have drift.
You're going to have an unsynchronized situation where your application doesn't
understand your database and your database doesn't understand your application.
For this purpose, virtually all frameworks provide some mechanism for
what we call database schema migrations.
Basically, it's a way to express the logic that needs to be applied to the
database to bring it to the most recent version.
We call them migrations because they migrate the database schema from the
previous version to the next.
Our developer did just that.
It was a tiny pull request, 10 lines of migration script, one
line change on the data model.
And our developer did all the standard things.
He created a pull request; the CI, all of the automation, the unit
tests, everything worked great.
The team reviewed it, even the tech lead gave it a thumbs up.
Everything was by the book.
Now came the best part of any developer's life, right?
All of the hard work that we put into our software engineering,
now we get to ship it to the world.
And ship, our developer did.
Can anyone guess what happened?
The shocking sound of a full-blown, database-related outage.
The system blows up.
Database CPU shoots to the roof.
100 percent error rate on all the critical backend endpoints.
And of course, 100 percent unsatisfied management.
The CTO is breathing down our developer's neck.
What's going on?
How do we fix this outage?
So why did this happen?
Let's do a quick postmortem to get to the bottom of the core
reason this outage happened.
On the technical level, the explanation may not be known to everyone,
but it's fairly simple.
Once our migration kicked in, we altered the table, we modified the enum values.
The way we did the change caused the table to lock.
When the table is locked, nobody can write any new data into this table.
The reason is that, because of the way MySQL represents enums
on disk, if we don't do the change in the right way, the database
needs to rewrite the entire table.
Now, if the table is small, it may take a few milliseconds.
But in our case, with a very busy database and a very busy table,
this lock can be held for hours, which means a very long incident.
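To make this concrete, here is a hedged sketch in plain MySQL DDL of the difference between a safe and an unsafe version of this kind of change. The table and enum values follow the story, but the exact statements are illustrative; on MySQL 8.0, appending a member to the end of an ENUM can usually be done in place (as long as the storage size doesn't change), while inserting one in the middle forces a full table copy under a lock.

```sql
-- Appending the new value at the END of the ENUM: typically a metadata-only
-- change on MySQL 8.0, no table copy, no long lock.
ALTER TABLE users
  MODIFY COLUMN status ENUM('active', 'pending', 'invited') NOT NULL,
  ALGORITHM=INSTANT;

-- Inserting the value in the MIDDLE changes the stored ordinal values, so the
-- database has to rewrite (copy) the whole table while holding a lock on it.
ALTER TABLE users
  MODIFY COLUMN status ENUM('active', 'invited', 'pending') NOT NULL;
```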
By the way, there's nothing you can do to stop this once it starts rolling
out; the rollback will only start being applied after this thing completes.
But in postmortems, we're not usually interested in the
superficial technical mechanics of the outage.
We want to get to the bottom of it.
Why did this thing really happen?
And for this, we have the famous technique of the five whys.
The idea is that we ask why five times to get to the root cause or a
deeper explanation of the incident.
Why did the outage happen?
A migration locked the table.
Why did the migration lock the table?
Because our developer shipped broken code.
Very simple.
Why did our developer ship broken code?
It wasn't caught during continuous integration, and it wasn't caught
in our staging environment.
But why didn't we catch it?
The truth is that we don't really have automation to verify these changes, and
we actually rely on human reviewers.
In this case, the pull request was approved.
And it passed all of our quality checks.
But how can something this severe pass all of our quality checks?
Let's try to explore that.
Okay, but why should we care?
Why should we even be talking about making database schema changes more reliable?
For many teams, there is an inherent tradeoff between velocity, how fast
they can move, how fast they can make changes to their database, and
the reliability of their service.
Because the negative impact of a bad schema change is very high,
people start to fear change.
They become slow.
They avoid refactorings.
They maybe batch many changes together.
In general, the team becomes slow.
So there is a very big payoff in making this tension obsolete.
If we can be both faster and more reliable at the same time,
we can provide a much better service.
Okay, with the introductions behind us, we are finally ready
to discuss five strategies for reliable database schema migrations.
The first one I want to talk about is so important that it gets the digit zero.
It's not included in the five strategies; it's just a big baseline that I
want to get out of the way before we start discussing this stuff.
For a surprisingly large number of companies that I've talked to over the
past three years, database schema changes are not automated.
It means that when you want to deploy a version of your application that requires
a change to the database schema, someone manually or semi-manually connects
directly to the database and applies these changes before this version is deployed.
Now, this is a very bad practice for multiple reasons.
One, it's error prone.
We know from the DevOps movement that automation is all about
reducing the rate of defects.
If we let a machine do something that a human might mess up, we
greatly reduce the chances of error.
Also, migrations become the slowest link in the chain.
If you automate everything except for database schema changes, your
velocity is limited by the rate at which you can do schema migrations.
If that's manual, that is slow.
It's not repeatable.
It has no audit log.
And I think maybe most importantly, it's super unfair to your
engineers and team members.
Making a manual change on the production database of a real
application is one of the most stressful things you can do as an engineer.
So if you are not automating yet, please start doing so today.
The next principle I want to talk about is something that I
call database schema as code.
This is the core principle behind Atlas, the database schema management tool
that my company Ariga is building.
It was started by my co-founder and me around 2021.
Today, it's very popular in many organizations and projects
all around the world.
And the main idea is that, similar to Terraform, instead of planning
changes manually and listing out every change to the database by yourself,
you define the desired state of the database and let the tool figure out the
diff and calculate the migrations for you.
By removing another manual step from the chain of creating
database changes, you get a more consistent and less error-prone result.
I want to show a quick demo of using Atlas now.
So let's see what we have here.
So in our small project, we have two files.
One is the atlas.hcl file.
This is the configuration file that tells Atlas that the desired state
of the database is stored in this schema file, and that the URL of the
target database we are managing points to a local SQLite file.
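As a rough sketch (not the exact demo project; the table definition and URLs here are illustrative), the two files might look something like this:

```hcl
# atlas.hcl -- defines a "local" environment: where the desired state lives
# and which database Atlas manages.
env "local" {
  src = "file://schema.hcl"
  url = "sqlite://file.db"
}

# schema.hcl -- the desired state of the database.
schema "main" {}

table "users" {
  schema = schema.main
  column "id" {
    type = int
  }
  column "name" {
    type = text
  }
  primary_key {
    columns = [column.id]
  }
}
```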
Now, when I want to apply this schema, I run a simple command, atlas schema apply,
with the local environment. What Atlas does is connect to the database, which
currently does not exist, plan the migration, run some diagnostics (in this
case it's a simple additive operation, so no diagnostics are found), and prompt us
whether we want to approve the plan.
Let's approve it.
Now, the next time we run an Atlas schema apply, as you can guess, Atlas detects
that there is no difference between the current and desired state, and therefore
tells us that there is nothing left to do.
Next, let's see what happens when we add another column.
As you might expect, Atlas will plan the addition of the email column by
planning an ALTER TABLE statement.
Again, this is a very simple operation, so no diagnostics found,
and we can safely approve and apply.
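For reference, the change is just another column block in the desired state. Continuing the hypothetical schema sketched earlier, it might look like this, and Atlas diffs it into an ALTER TABLE statement that adds the column:

```hcl
# schema.hcl -- the users table from the earlier sketch, now with an email column.
table "users" {
  schema = schema.main
  column "id" {
    type = int
  }
  column "name" {
    type = text
  }
  column "email" {
    type = text
    null = true # nullable, so this stays a purely additive change
  }
  primary_key {
    columns = [column.id]
  }
}
```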
The next strategy I want to mention today is to test database logic the same
way that you test code with unit tests.
So databases are not only just containers of information in tables.
Databases have lots of application logic that can be expressed inside them.
For example: constraints, triggers, stored procedures, functions,
extensions, and more.
So the way that we ensure a reliable application with code is that we write
unit tests to verify that everything works correctly, but also that future
changes don't break the logic and the expectations of the application.
To this day, it's very difficult to express these tests for the logic
that resides in your database.
But like we said, the core principle of Atlas is to take the database
schema and turn it into code.
And using this principle, we're also able to create a testing
infrastructure for our database logic.
Let's see what this looks like.
Okay, let's take a look at our project.
So we have a new schema file here.
You will notice it's called schema.pg.hcl.
This is a Postgres schema, but we are using HCL.
This is the Atlas HCL dialect.
It's our configuration language for defining databases.
You can find here stuff like tables and columns, and it will be familiar to
you if you used Terraform in the past.
I want to point your attention to a bit of logic that is
called the positive function.
This block will end up becoming a function defined on the database.
Now, if database schema is code, we should be able to test it.
And Atlas provides you with a testing framework.
You can see this little green arrow; your IDE recognizes this
block of text as a test, and you can make assertions, for instance,
that this statement, select positive(1), returns a truthy value.
Now, let's try to make this test fail first, to verify that it works.
I run it, and you can see that the test failed.
Now let's correct the input and now we can show that the test passes.
So by treating our database schema as code, we are able to write the unit test
for it just like we do for our code.
We can make our database changes more reliable by preventing regressions.
This test protects our users from any developer in the future that makes
a change that will break this logic.
Most database migrations are about making changes to the structure
of the database, to its schema.
But, oftentimes, teams need to make changes to the actual data.
For instance, if you have a full name column, and you would like to
split it into a first and last name.
Now, the trouble with these types of changes is that they are especially risky
if you make this change in the wrong way.
In most modern relational databases, there is no undo button that is going
to revert the changes to your data set.
You might accidentally delete data or cause some other very
negative impact to the business.
The second piece of trouble is that, to this day, there is no framework
or harness that makes it easy to write these kinds of tests.
But thankfully, for people who adopt the database schema as code mindset, Atlas
has a testing framework that enables you to write tests for data migrations.
Let's see how this works.
In our example, we have a migration directory with a bunch of migrations.
We are now in a scenario where we create some trigger and we
need to backfill existing records.
Okay, so we have a trigger that's going to populate some column, but
we need to make sure that all of the existing records in the users table
also will contain the correct value.
We create this data migration, this backfill operation that updates
the existing records in the users table with the correct value.
Now, how can we be sure that this query, this backfill statement, is correct?
The only way that I know to verify software quality in a
reliable way is to use testing.
So Atlas provides you with this framework to write tests for migrations.
And it works like this.
We start by telling Atlas to migrate the database to the version before
the migration we want to test.
Then we seed some data using INSERT INTO statements, and then we run the
migration that we're interested in testing.
And finally, we make some assertions.
In this case, we verify that the latest post timestamp
column is populated correctly.
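Put together, a test along those lines might look roughly like this in Atlas HCL; the version numbers, table, and column names are invented for illustration:

```hcl
# A sketch of a data-migration test, runnable with `atlas migrate test`.
test "migrate" "backfill_latest_post" {
  # 1. Migrate the dev database to the version just before the backfill.
  migrate {
    to = "20240601000000"
  }
  # 2. Seed records that the backfill is expected to update.
  exec {
    sql = "INSERT INTO users (id, name) VALUES (1, 'rotem')"
  }
  exec {
    sql = "INSERT INTO posts (id, user_id, created_at) VALUES (1, 1, '2024-06-01 10:00:00')"
  }
  # 3. Apply the migration under test (the trigger plus the backfill).
  migrate {
    to = "20240601000001"
  }
  # 4. Assert that the backfilled column is populated correctly.
  exec {
    sql    = "SELECT latest_post_ts FROM users WHERE id = 1"
    output = "2024-06-01 10:00:00"
  }
}
```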
Let's see that this works.
Amazing.
So what Atlas did was run this test scenario on our local dev
database to verify that we get a reliable result.
So far, we mostly covered strategies that look at the application logic.
How do we ensure that the application logic that is stored in the database
behaves correctly and reliably keeps doing what it is supposed to do?
But if you remember the beginning of the talk, we looked at something
that was completely different.
We looked at cases where the actual change of the schema is a risk factor.
And if you think about it, migrations are a very risky business.
Let's mention some of the ways you can shoot yourself in
the foot if you're not careful.
You can accidentally drop a column or a table, potentially losing
data that you cannot recover.
And of course, if your application relies on this resource
to exist, you can cause a production outage.
This might sound to you like something that never happens, but if you look at
industry postmortems that are published, such as this one that was published
by Resend in February of this year, you will notice that in this outage,
which the company suffered around 5 a.m., they started a database migration,
and around a minute later they noticed that some tables had accidentally
been dropped from the production database.
The next 24 hours were not a lot of fun for this team, which needed to work
really hard to recover from this outage.
Next, many migrations may succeed in dev.
They may succeed in staging, where we are dealing with smaller or empty tables.
But when we are, for example, changing constraints, like adding a
uniqueness check on a column, our migration might fail in production,
whereas it succeeded in development.
Now, if this happens to us midway through a migration, depending on our database
and the circumstances, we might find ourselves in a limbo state where
we are not at version A anymore, but we are not yet at version B, making
rollbacks a potential issue.
And of course, our server doesn't necessarily know how
to handle this situation.
Additionally, we have the potential to make breaking changes.
So everybody today is talking about data contracts, mostly between the
production database and data warehouse.
But there is also a data contract between our backend application and our database.
If we rename a column in a way that we didn't think through
all the way, we might end up breaking the contract
between backend and database, and our queries will begin to fail.
Finally, as we saw in the horror story at the beginning, certain types of operations
require the database to rewrite the table on disk, which is a slow operation, and
one that requires the database to take a lock on the resource.
This usually translates into a production outage for any processes
that rely on this table during runtime.
So how do we prevent such dangerous stuff from happening
to our database in production?
The way to do this is by automating testing and verification to catch
this kind of dangerous change early in the process way before
it hits our production database.
Luckily, Atlas comes with a migration analysis engine that knows how to
understand the semantics of the schema changes that are planned and
classify them according to their risk factors.
Let's see what this looks like.
So in this case, we have a small project.
This is an SQLite database with three tables, T1, T2, T3.
Let's apply this locally.
We are now creating these resources.
And suppose now we want to make a change to the T3 table.
By commenting it out, I'm deleting this table completely; now let's
see the change as it is planned.
Now, first of all, the plan is to drop table T3.
Next, Atlas uses its analysis engine, and it tells us that a
destructive change was detected: dropping table T3.
It also gives us a suggestion, but we'll get into that at a later phase.
Okay, so now Atlas warned us that this type of thing might be risky
and perhaps we should not do this.
Let's look at another small example.
Suppose we want to rename a column.
If you recall from previous examples, Our examples before, renaming a column is
considered a backward incompatible change.
So when we do this, first of all, Atlas asks us if our intent
was to rename or to drop an ad.
We choose the rename option, but then Atlas gives us the diagnostics
and it tells us that a backward incompatible change was detected.
This warns us in time just before we apply the dangerous change to the database.
Now, this is very powerful during development when I want a tool to assist
me in finding these risky changes, but it becomes much more powerful when we
combine it with continuous integration.
What if we could have Atlas run and analyze these changes
during the CI phase, to warn us way before we merge the code to master
and deploy it to production?
If we can catch these risky changes early, we can end up with much
more reliable database schema migration deployment pipelines.
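In practice, that usually means running Atlas's linter over the migrations added in a pull request, for example with a CI step along these lines (the directory path and dev-database URL are illustrative; Ariga also publishes ready-made CI integrations on top of the same command):

```sh
# Lint only the most recently added migration file against a disposable
# dev database, failing the build on destructive or risky changes.
atlas migrate lint \
  --dir "file://migrations" \
  --dev-url "docker://mysql/8/dev" \
  --latest 1
```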
Detecting risky migrations during CI is a great strategy to enable
reliable database schema migrations.
However, some changes can only be verified against a specific target database.
For instance, suppose that you are adding a unique constraint to
a column in one of your tables.
Adding this unique constraint will always succeed on an empty database,
such as the one you might be developing against, or the one that
you're using in your CI environment.
However, in production, if your table already contains a duplicate
value for this column, this migration will fail, as we have shown
in one of the previous slides.
So to help you get around this problem, Atlas supports a really neat feature
called pre-migration checks.
Essentially, instead of providing just a plain SQL file when defining a
migration, you can define a file using the txtar (text archive) format.
If we provide this txtar directive at the top of the file, it tells
Atlas that our migration file is composed of two different sections.
One is the checks section, which allows us to run some logic before
the migration begins.
For example, here we are checking that the users table is
empty before dropping it.
During runtime, Atlas will run the pre migration checks before deploying
the migration, ensuring that we don't lose any data in production.
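A minimal sketch of such a migration file, following the layout just described (the table and the check mirror the example from the talk, but the exact file is illustrative), might look like this:

```sql
-- atlas:txtar

-- checks.sql --
-- The assertion below must evaluate to true for the migration to proceed:
-- in this case, the users table must be empty before we drop it.
SELECT NOT EXISTS (SELECT 1 FROM users);

-- migration.sql --
DROP TABLE users;
```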
Okay, wrapping up, let's consider how the five strategies that we presented
today can make database schema migrations more reliable.
So, database schema as code, Terraform for databases: in general,
automatically planned migrations are less error-prone and more predictable
than manually planned changes, most of the time.
Testing schema logic: being able to write unit tests for your database
logic can prevent regressions as your application evolves.
Testing data migrations: when we need to rewrite data in our tables as
part of the migration process, tests can prevent data corruption and
save us from a very nasty incident where we need to restore data from backups.
Automated quality checks can detect things like destructive changes,
table locks, and more early in the software development lifecycle,
locally during development or during CI automation.
Finally, we demoed pre-migration checks, which allow you to verify
data-dependent changes just a moment before you roll them out,
because they depend on the actual data that's inside the database.
Thanks a lot for your time.
I really appreciate it.
And if any of this is interesting to you, please visit us at atlasgo.io.
See you soon.