Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, everyone.
My name is Rotem, and welcome to my talk, Strategies for Reliable
Database Schema Migrations.
Before we jump into what that exactly means, let me introduce myself.
So my name is Rotem.
For the past three and a half years, I've been the CTO and co-founder
of a company called Ariga.
As part of my role, I have the awesome responsibility of being a co-maintainer of
two fairly large open source projects.
One of them is called Ent; it's an entity framework for Go.
It's part of the Linux Foundation, started by my co-founder when
he was working at Facebook.
And the second one, which we will go into in depth today, is called Atlas, a
database schema management tool.
Before we go into how to make schema migrations more reliable,
I want to tell you a story.
And it's my favorite kind of story.
It is a developer horror story.
It's a true story.
Like in the TV show Fargo, all of the events here really took place.
We just changed the names.
But the rest has been told exactly as it occurred.
It was a sunny day in Tel Aviv.
The product manager asked our developer for a simple request.
Can we add a new status to the user entity in our backend application?
No problem, thought the dev.
I'm a professional.
In this case, our developer is using an ORM in Node.js called Sequelize.
Sequelize lets you define your data model as JavaScript objects, and this
object has a status field. In it, we list the possible values for this enum.
So in this case, we added the invited status to the list of
possible values for the user status.
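For readers following along, here is a hedged sketch of what such a Sequelize model change might look like; the file layout, field options, and other status values are illustrative, with 'invited' standing in for the newly added value from the story.

```js
// models/user.js -- a hypothetical Sequelize model definition
const { DataTypes } = require('sequelize');

module.exports = (sequelize) =>
  sequelize.define('User', {
    // The one-line data model change: adding 'invited' to the enum values.
    status: {
      type: DataTypes.ENUM('active', 'pending', 'invited'),
      allowNull: false,
      defaultValue: 'pending',
    },
  });
```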
Now, when you make a change to the data model, you need to
change the database as well.
Otherwise, you're going to have drift.
You're going to have an unsynchronized situation where your application doesn't
understand your database and your database doesn't understand your application.
For this purpose, virtually all frameworks provide some mechanism for
what we call database schema migrations.
Basically, it's a way to express the logic that needs to be applied to the
database to bring it to the most recent version.
We call them migrations because they migrate the database schema from the
previous version to the next.
Our developer did just that.
It was a tiny pull request, 10 lines of migration script, one
line change on the data model.
And our developer did all the standard things.
He created a pull request; the CI, all of the automation, the unit
tests, everything worked great.
The team reviewed it, even the tech lead gave it a thumbs up.
Everything was by the book.
Now came the best part of any developer's life, right?
All of the hard work that we put into our software engineering,
now we get to ship it to the world.
And ship, our developer did.
Can anyone guess what happened?
The shocking sound of a full-blown, database-related outage.
The system blows up.
Database CPU shoots to the roof.
100 percent error rate on all the critical backend endpoints.
And of course, 100 percent unsatisfied management.
The CTO is breathing down our developer's neck.
What's going on?
How do we fix this outage?
So why did this happen?
Let's do a quick postmortem to get to the bottom of the core
reason this outage happened.
On the technical level, the explanation may not be known to everyone,
but it's fairly simple.
Once our migration kicked in, we altered the table, we modified the enum values.
The way we did the change caused the table to lock.
When the table is locked, nobody can write any new data into this table.
The reason is that, because of the way MySQL represents enums
on disk, if we don't do the change in the right way, the database
needs to rewrite the entire table.
Now, if the table is small, it may take a few milliseconds.
But in our case, with a very busy database and a very busy table,
this lock can be held for hours, which means a very long incident.
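To make this concrete, here is a hedged sketch in plain MySQL DDL of the difference between a safe and an unsafe version of this kind of change. The table and enum values follow the story, but the exact statements are illustrative; on MySQL 8.0, appending a member to the end of an ENUM can usually be done in place (as long as the storage size doesn't change), while inserting one in the middle forces a full table copy under a lock.

```sql
-- Appending the new value at the END of the ENUM: typically a metadata-only
-- change on MySQL 8.0, no table copy, no long lock.
ALTER TABLE users
  MODIFY COLUMN status ENUM('active', 'pending', 'invited') NOT NULL,
  ALGORITHM=INSTANT;

-- Inserting the value in the MIDDLE changes the stored ordinal values, so the
-- database has to rewrite (copy) the whole table while holding a lock on it.
ALTER TABLE users
  MODIFY COLUMN status ENUM('active', 'invited', 'pending') NOT NULL;
```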
By the way, there's nothing you can do to stop this once it starts rolling
out; the rollback will only start being applied after this thing completes.
But in postmortems, we're not usually interested in the
superficial technical mechanics of the outage.
We want to get to the bottom of it.
Why did this thing really happen?
And for this, we have the famous technique of the five whys.
The idea is that we ask why five times to get to the root cause or a
deeper explanation of the incident.
Why did the outage happen?
A migration locked the table.
Why did the migration lock the table?
Because our developer shipped broken code.
Very simple.
Why did our developer ship broken code?
It wasn't caught during continuous integration, and it wasn't caught
in our staging environment.
But why didn't we catch it?
The truth is that we don't really have automation to verify these changes, and
we actually rely on human reviewers.
In this case, the pull request was approved.
And it passed all of our quality checks.
But how can something this severe pass all of our quality checks?
Let's try to explore that.
Okay, but why should we care?
Why should we even be talking about making database schema changes more reliable?
For many teams, there is an inherent tradeoff between velocity, how fast
they can move, how fast they can make changes to their database, and
the reliability of their service.
Because the negative impact of a bad schema change is very high,
people start to fear change.
They become slow.
They avoid refactorings.
They maybe batch many changes together.
In general, the team becomes slow.
So there is a very big payoff in making this tension obsolete.
If we can be both faster and more reliable at the same time,
we can provide a much better service.
Okay, with the introductions behind us, we are finally ready
to discuss five strategies for reliable database schema migrations.
The first one I want to talk about is so important that it gets the digit zero.
It's not included in the five strategies; it's just a big baseline that I
want to get out of the way before we start discussing this stuff.
For a surprisingly large number of companies that I've talked to over the
past three years, database schema changes are not automated.
It means that when you want to deploy a version of your application that requires
a change to the database schema, someone manually or semi-manually connects
directly to the database and applies these changes before this version is deployed.
Now, this is a very bad practice for multiple reasons.
One, it's error prone.
We know from the DevOps movement that automation is all about
reducing the rate of defects.
If we let a machine do something that a human might mess up, we
greatly reduce the chances of error.
Also, migrations become the slowest link in the chain.
If you automate everything except for database schema changes, your
velocity is limited by the rate at which you can do schema migrations.
If that's manual, that is slow.
It's not repeatable.
It has no audit log.
And I think maybe most importantly, it's super unfair to your
engineers and team members.
Making a manual change on the production database of a real
application is one of the most stressful things you can do as an engineer.
So if you are not automating yet, please start doing so today.
The next principle I want to talk about is something that I
call database schema as code.
This is the core principle behind Atlas, the database schema management tool
that my company Ariga is building.
It was started by my co-founder and me around 2021.
Today, it's very popular in many organizations and projects
all around the world.
And the main idea is that, similar to Terraform, instead of planning
changes manually and listing out every change to the database by yourself,
you define the desired state of the database and let the tool figure out the
diff and calculate the migrations for you.
By removing another manual step from the chain of creating
database changes, you get a more consistent and less error-prone result.
I want to show a quick demo of using Atlas now.
So let's see what we have here.
So in our small project, we have two files.
One is the atlas.hcl file.
This is the configuration file that tells Atlas that the desired state
of the database is stored in this schema file, and that the URL of the
target database we are managing points to a local SQLite file.
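As a rough sketch (not the exact demo project; the table definition and URLs here are illustrative), the two files might look something like this:

```hcl
# atlas.hcl -- defines a "local" environment: where the desired state lives
# and which database Atlas manages.
env "local" {
  src = "file://schema.hcl"
  url = "sqlite://file.db"
}

# schema.hcl -- the desired state of the database.
schema "main" {}

table "users" {
  schema = schema.main
  column "id" {
    type = int
  }
  column "name" {
    type = text
  }
  primary_key {
    columns = [column.id]
  }
}
```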
Now, when I want to apply this schema, I run a simple command, atlas schema apply,
with the local environment. What Atlas does is connect to the database, which
currently does not exist, plan the migration, run some diagnostics (in this
case it's a simple additive operation, so no diagnostics are found), and prompt us
whether we want to approve the plan.
Let's approve it.
Now, the next time we run an Atlas schema apply, as you can guess, Atlas detects
that there is no difference between the current and desired state, and therefore
tells us that there is nothing left to do.
Next, let's see what happens when we add another column.
As you might expect, Atlas will plan the addition of the email column by
planning an ALTER TABLE statement.
Again, this is a very simple operation, so no diagnostics found,
and we can safely approve and apply.
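For reference, the change is just another column block in the desired state. Continuing the hypothetical schema sketched earlier, it might look like this, and Atlas diffs it into an ALTER TABLE statement that adds the column:

```hcl
# schema.hcl -- the users table from the earlier sketch, now with an email column.
table "users" {
  schema = schema.main
  column "id" {
    type = int
  }
  column "name" {
    type = text
  }
  column "email" {
    type = text
    null = true # nullable, so this stays a purely additive change
  }
  primary_key {
    columns = [column.id]
  }
}
```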
The next strategy I want to mention today is to test database logic the same
way that you test code with unit tests.
So databases are not only just containers of information in tables.
Databases have lots of application logic that can be expressed inside them.
For example: constraints, triggers, stored procedures, functions,
extensions, and more.
So the way that we ensure a reliable application with code is that we write
unit tests to verify that everything works correctly, but also that future
changes don't break the logic and the expectations of the application.
To this day, it's very difficult to express these tests for the logic
that resides in your database.
But like we said, the core principle of Atlas is to take the database
schema and turn it into code.
And using this principle, we're also able to create a testing
infrastructure for our database logic.
Let's see what this looks like.
Okay, let's take a look at our project.
So we have a new schema file here.
You will notice it's called schema.pg.hcl.
This is a Postgres schema, but we are using HCL.
This is the Atlas HCL dialect.
It's our configuration language for defining databases.
You can find here stuff like tables and columns, and it will be familiar to
you if you used Terraform in the past.
I want to point your attention to a bit of logic that is
called the positive function.
This block will end up becoming a function defined on the database.
Now, if database schema is code, we should be able to test it.
And Atlas provides you with a testing framework.
You can see this little green arrow; your IDE recognizes this
block of text as a test, and you can make assertions, for instance,
that this statement, select positive(1), returns a truthy value.
Now, let's try to make this test fail first, to verify that it works.
I run it, and you can see that the test failed.
Now let's correct the input and now we can show that the test passes.
So by treating our database schema as code, we are able to write the unit test
for it just like we do for our code.
We can make our database changes more reliable by preventing regressions.
This test protects our users from any developer in the future that makes
a change that will break this logic.
Most database migrations are about making changes to the structure
of the database, to its schema.
But, oftentimes, teams need to make changes to the actual data.
For instance, if you have a full name column, and you would like to
split it into a first and last name.
Now, the trouble with these types of changes is that they are especially risky
if you make this change in the wrong way.
In most modern relational databases, there is no undo button that is going
to revert the changes to your data set.
You might accidentally delete data or cause some other very
negative impact to the business.
The second piece of trouble is that, to this day, there is no framework
or harness that makes it easy to write these kinds of tests.
But thankfully, for people who adopt the database schema as code mindset, Atlas
has a testing framework that enables you to write tests for data migrations.
Let's see how this works.
In our example, we have a migration directory with a bunch of migrations.
We are now in a scenario where we create some trigger and we
need to backfill existing records.
Okay, so we have a trigger that's going to populate some column, but
we need to make sure that all of the existing records in the users table
also will contain the correct value.
We create this data migration, this backfill operation that updates
the existing records in the users table with the correct value.
Now, how can we be sure that this query, this backfill statement, is correct?
The only way that I know to verify software quality in a
reliable way is to use testing.
So Atlas provides you with this framework to write tests for migrations.
And it works like this.
We start by telling Atlas to migrate the database to the version before
the migration we want to test.
Then we seed some data using INSERT INTO statements, and then we run the
migration that we're interested in testing.
And finally, we make some assertions.
In this case, we verify that the latest post timestamp
column is populated correctly.
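Put together, a test along those lines might look roughly like this in Atlas HCL; the version numbers, table, and column names are invented for illustration:

```hcl
# A sketch of a data-migration test, runnable with `atlas migrate test`.
test "migrate" "backfill_latest_post" {
  # 1. Migrate the dev database to the version just before the backfill.
  migrate {
    to = "20240601000000"
  }
  # 2. Seed records that the backfill is expected to update.
  exec {
    sql = "INSERT INTO users (id, name) VALUES (1, 'rotem')"
  }
  exec {
    sql = "INSERT INTO posts (id, user_id, created_at) VALUES (1, 1, '2024-06-01 10:00:00')"
  }
  # 3. Apply the migration under test (the trigger plus the backfill).
  migrate {
    to = "20240601000001"
  }
  # 4. Assert that the backfilled column is populated correctly.
  exec {
    sql    = "SELECT latest_post_ts FROM users WHERE id = 1"
    output = "2024-06-01 10:00:00"
  }
}
```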
Let's see that this works.
Amazing.
So what Atlas did was run this test scenario on our local dev
database to verify that we get a reliable result.
So far, we mostly covered strategies that look at the application logic.
How do we ensure that the application logic that is stored in the database
behaves correctly and reliably keeps doing what it is supposed to do?
But if you remember the beginning of the talk, we looked at something
that was completely different.
We looked at cases where the actual change of the schema is a risk factor.
And if you think about it, migrations are a very risky business.
Let's mention some of the ways you can shoot yourself in
the foot if you're not careful.
You can accidentally drop a column or a table, potentially losing
data that you cannot recover.
And of course, if your application relies on this resource
to exist, you can cause a production outage.
This might sound to you like something that never happens, but if you look at
industry postmortems that are published, such as this one that was published
by Resend in February of this year, you will notice that in this outage,
which the company suffered around 5 a.m., they started a database migration,
and around a minute later they noticed that some tables had accidentally
been dropped from the production database.
The next 24 hours were not a lot of fun for this team, which needed to work
really hard to recover from this outage.
Next, many migrations may succeed in dev.
They may succeed in staging, where we are dealing with smaller or empty tables.
But when we are, for example, changing constraints, like adding a
uniqueness check on a column, our migration might fail in production,
whereas it succeeded in development.
Now, if this happens to us midway through a migration, depending on our database
and the circumstances, we might find ourselves in a limbo state where
we are not at version A anymore, but we are not yet at version B, making
rollbacks a potential issue.
And of course, our server doesn't necessarily know how
to handle this situation.
Additionally, we have the potential to make breaking changes.
So everybody today is talking about data contracts, mostly between the
production database and data warehouse.
But there is also a data contract between our backend application and our database.
If we rename a column in a way that we didn't think through
all the way, we might end up breaking the contract
between backend and database, and our queries will begin to fail.
Finally, as we saw in the horror story at the beginning, certain types of operations
require the database to rewrite the table on disk, which is a slow operation, and
one that requires the database to take a lock on the resource.
This usually translates into a production outage for any processes
that rely on this table during runtime.
So how do we prevent such dangerous stuff from happening
to our database in production?
The way to do this is by automating testing and verification to catch
this kind of dangerous change early in the process way before
it hits our production database.
Luckily, Atlas comes with a migration analysis engine that knows how to
understand the semantics of the schema changes that are planned and
classify them according to their risk factors.
Let's see what this looks like.
So in this case, we have a small project.
This is an SQLite database with three tables, T1, T2, T3.
Let's apply this locally.
We are now creating these resources.
And suppose now we want to make a change to the T3 table.
By commenting it out, I'm deleting this table completely; now let's
see the change as it is planned.
Now, first of all, the plan is to drop table T3.
Next, Atlas uses its analysis engine, and it tells us that a
destructive change was detected: dropping table T3.
It also gives us a suggestion, but we'll get into that at a later phase.
Okay, so now Atlas warned us that this type of thing might be risky
and perhaps we should not do this.
Let's look at another small example.
Suppose we want to rename a column.
If you recall from previous examples, Our examples before, renaming a column is
considered a backward incompatible change.
So when we do this, first of all, Atlas asks us if our intent
was to rename or to drop an ad.
We choose the rename option, but then Atlas gives us the diagnostics
and it tells us that a backward incompatible change was detected.
This warns us in time just before we apply the dangerous change to the database.
Now, this is very powerful during development when I want a tool to assist
me in finding these risky changes, but it becomes much more powerful when we
combine it with continuous integration.
What if we could have Atlas run and analyze these changes
during the CI phase, to warn us way before we merge the code to master
and deploy it to production?
If we can catch these risky changes early, we can end up with much
more reliable database schema migration deployment pipelines.
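In practice, that usually means running Atlas's linter over the migrations added in a pull request, for example with a CI step along these lines (the directory path and dev-database URL are illustrative; Ariga also publishes ready-made CI integrations on top of the same command):

```sh
# Lint only the most recently added migration file against a disposable
# dev database, failing the build on destructive or risky changes.
atlas migrate lint \
  --dir "file://migrations" \
  --dev-url "docker://mysql/8/dev" \
  --latest 1
```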
Detecting risky migrations during CI is a great strategy to enable
reliable database schema migrations.
However, some changes can only be verified against a specific target database.
For instance, suppose that you are adding a unique constraint to
a column in one of your tables.
Adding this unique constraint will always succeed on an empty database,
such as the one you might be developing against, or the one that
you're using in your CI environment.
However, in production, if your table already contains a duplicate
value for this column, this migration will fail, as we have shown
in one of the previous slides.
So to help you get around this problem, Atlas supports a really neat feature
called pre-migration checks.
Essentially, instead of providing just a plain SQL file when defining a
migration, you can define a file using the txtar (text archive) format.
If we provide this txtar directive at the top of the file, it tells
Atlas that our migration file is composed of two different sections.
One is the checks section, which allows us to run some logic before
the migration begins.
For example, here we are checking that the users table is
empty before dropping it.
During runtime, Atlas will run the pre migration checks before deploying
the migration, ensuring that we don't lose any data in production.
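A minimal sketch of such a migration file, following the layout just described (the table and the check mirror the example from the talk, but the exact file is illustrative), might look like this:

```sql
-- atlas:txtar

-- checks.sql --
-- The assertion below must evaluate to true for the migration to proceed:
-- in this case, the users table must be empty before we drop it.
SELECT NOT EXISTS (SELECT 1 FROM users);

-- migration.sql --
DROP TABLE users;
```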
Okay, wrapping up, let's consider how the five strategies that we presented
today can make database schema migrations more reliable.
So, database schema as code, Terraform for databases: in general,
automatically planned migrations are less error-prone and more predictable
than manually planned changes, most of the time.
Testing schema logic: being able to write unit tests for your database
logic can prevent regressions as your application evolves.
Testing data migrations: when we need to rewrite data in our tables as
part of the migration process, tests can prevent data corruption and
save us from a very nasty incident where we need to restore data from backups.
Automated quality checks can detect things like destructive changes,
table locks, and more early in the software development lifecycle,
locally during development or during CI automation.
Finally, we demoed pre-migration checks, which allow you to verify
data-dependent changes just a moment before you roll them out,
because they depend on the actual data that's inside the database.
Thanks a lot for your time.
I really appreciate it.
And if any of this is interesting to you, please visit us at atlasgo.io.
See you soon.