Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome everyone to this session on Chaos Engineering for SQL
Server. A little bit about myself first. My name is Andrew Pruski.
I am a SQL Server DBA and Microsoft Data Platform MVP.
Originally from Swansea, but I've been living in Dublin for just
over six years now. My Twitter handle, @dbafromthecold, and
my email address, dbafromthecold@gmail.com, are on the slide there.
So if you have any questions after today, please feel free to reach out.
I'm always willing to talk about this stuff. My blog is up there as well,
dbafromthecold.com. I'm posting an article next week about
SQL Server and Chaos Engineering. And then finally at the bottom there is my
GitHub account. All the slides and the code for the demos
I'll be running are available there and I'll post an exact link at the
end of the session. So onto the session.
The aim of this session is to talk about how we can apply chaos
engineering principles and practices to SQL Server and the
systems around it. I'm a database administrator
by trade. Have we got any DBAs in? I can't see... no?
Good, so I can say pretty much what I want. I'm a
database administrator by trade and we're taught to think about
failures. For example, one of the core aspects of being a DBA
is backups. We backup our databases and we
don't just back up our databases and forget about them. We verify those backups and
we practice restores because we need to be able to
restore our databases if they encounter a failure in production
to our company's RPO (recovery point objective)
and within our company's RTO (recovery time objective). And it's because of practices like
this. I think that being a DBA and chaos engineering
really do go hand in hand. I think all DBAs should be embracing
chaos engineering principles and performing chaos engineering
experiments. So here's what we're going to cover. We're going to talk about
identifying potential failures and weaknesses in SQL
Server and the systems around it that we can
test with chaos engineering experiments.
Then we're going to actually run a chaos engineering experiment up in a lab
that I've got in Azure. If the Wi-Fi holds out on me,
fingers crossed. And then finally to round off the session, just have a bit
of fun. We're going to talk about SQL Server on Kubernetes,
and we've got a little tool that I found online that we can use to
test SQL running on Kubernetes. It's a really exciting time to be a SQL
Server DBA at the moment; over the last few years we've had all
these brand new platforms. We've had SQL Server on Linux, SQL Server in Docker
containers, and SQL Server on Kubernetes.
So we'll have a little play around with testing SQL Server's
HA capabilities on Kubernetes to round the session off.
But first things first, let's talk about how we
can identify weaknesses for SQL Server and its systems
that we want to test. So the best way to start with this is
to do a past incident analysis, or as
Ross would say, past surprise analysis.
So the reason we do this is because we want to find a
potential weakness that has happened in the past or is likely to happen. There's no
point in running a chaos engineering experiment against something that's never going to happen,
or is very unlikely to happen, in your environment. We want to get some actionable
results from our chaos engineering experiments.
So look at the past incidents that have happened in your environment,
what's happened and what technologies and strategies were put in place to
mitigate that issue. So say we had a standalone SQL
Server instance in production. That instance went down
and some sort of HA strategy was put in place,
whatever that may be. Maybe it's mirroring, maybe it's clustering, maybe it's an Always
On availability group. Or if you're mad, we can use replication.
So maybe we want to run a chaos engineering experiment against that HA
strategy to further build confidence in our systems
so that they will react as we expect them to
when, not if, they encounter a failure,
because we want to learn from the previous failures and that will guide us to
our future experiments so we can do a past surprise
analysis. Another really good method is to do what's called a
likelihood impact map. So we think
about all the failures that could happen in our systems and
we rank them on how likely they are against their impact.
So all the way down the bottom there in the green, least likely, least impact,
all the way up to the top there with most likely, oh holy moly,
impact. And when people do this,
they do tend to be either overly pessimistic or
overly optimistic about the failures in their systems. I'll let you decide
which way I fall on that one.
But if that does happen, you'll see that all your failures clump down at
the bottom or up at the top. If that happens, drill in and really,
really think about the failures and perform the analysis again. So you
get a nice spread across the graph with hopefully
three or four up in the red there that are potential candidates
for you to run your chaos engineering experiments against.
So let's think about potential
failures for SQL Server. I've already mentioned one: high
availability. Say we had a standalone
SQL instance in production that went down, so a
two-node Always On availability group cluster was built. So we had a primary and
a secondary. If the primary fails, we've configured automatic
failover to the secondary. The secondary will stay up and our databases
will remain online. Now we can perform tests
in SQL Server to fail over that availability group and fail it back with
T-SQL. We can say ALTER AVAILABILITY GROUP ... FAILOVER, hit execute,
and watch it fail over. But that's not how SQL
Server is going to fail over in production. What about
just turning off the primary node, nuking that primary
node, seeing what happens? Isn't that a more realistic test of
how SQL will fail over in production? Yes, absolutely.
So that is one test we could possibly run.
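For reference, a minimal sketch of that "sanitised" manual failover test. The availability group name (AG1) and the replica instance names are hypothetical placeholders, and it assumes the SqlServer PowerShell module is installed.

```powershell
Import-Module SqlServer

# Run the planned failover on the secondary you want to promote
Invoke-Sqlcmd -ServerInstance "AP-SQL-02" -Query "ALTER AVAILABILITY GROUP [AG1] FAILOVER;"

# Check which replica now holds the primary role
Invoke-Sqlcmd -ServerInstance "AP-SQL-02" -Query @"
SELECT ar.replica_server_name, ars.role_desc
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id;
"@
```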
So let's think about some more failures. Let's think about
monitoring systems. Monitoring is important: something like PagerDuty
gets alerted, you get paged. If that goes down, you're in trouble. So let's
think about a runaway query in the database. Runaway query
fills up the transaction log of a SQL server database. If that happens in SQL
Server, no more writes can hit that database.
Now there's no point in really writing a chaos engineering experiment
to test that. We know what's going to happen.
But what about your monitoring? When did you get alerted there
was an issue? Did you get an alert at the end saying, oh, the transaction
log's full, you need to do something? Or did you get preemptive alerts saying,
hey, there's a lot of growth happening here, you need to investigate
before an issue occurs? So maybe we could write a chaos engineering
experiment to test that.
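A rough sketch of the kind of preemptive check that monitoring could run, rather than waiting for the "log full" error. The instance name, database name and the 80% threshold are illustrative assumptions, not anything from the talk.

```powershell
Import-Module SqlServer

# sys.dm_db_log_space_usage reports log usage for the current database
$log = Invoke-Sqlcmd -ServerInstance "AP-SQL-01" -Database "SalesDB" -Query @"
SELECT used_log_space_in_percent FROM sys.dm_db_log_space_usage;
"@

if ($log.used_log_space_in_percent -gt 80) {
    # Hook this into whatever pages you - PagerDuty, email, Teams...
    Write-Warning "Transaction log is over 80% used - investigate before writes stop."
}
```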
Thinking of another one, let's go with backups. I've already mentioned backups.
We test our backups, right? We back them up to the same server the databases
are on? No, that server goes offline,
we've lost not only our databases, we've lost the ability to recover them.
So we put them off site somewhere on
an external server. We just back them up there and
leave them, right? No, we regularly run restore
tests against those backups. We verify them and then we regularly restore those
databases because issues with restores
do happen. A few jobs ago, I was working as
a SQL Server DBA and I was using a third party tool to back up
my databases. The reason for this was we had kind of
large databases and I wanted
to have backup compression. We were
working with SQL Server 2005 (this is how long ago it was) and SQL Server
didn't have native backup compression then, so this third party tool did it. I'm not going
to mention it because I don't want to get sued,
but it gave us the compression. It was great. It worked everywhere. I had implemented
it absolutely everywhere, compressing my backups. Happy as Larry.
One day we had an issue in
production and I needed to restore a three terabyte database.
Not a problem, I said, I have the backups, they've all been verified.
Went to the software, clicked the backup file,
clicked restore, and I also clicked a
feature that this third party tool had called Instant Restore,
because who wouldn't want to instantly restore a three terabyte
database? Awesome. So I clicked instant restore,
I hit execute and the software went to 99%
and there it stayed for 5 hours and then promptly failed.
I hadn't been regularly testing my restores using that feature.
I should have regularly been restoring databases to a test server
using those settings, so that I knew that process would
work as I expected it to when I really
needed it. And guess what? I do now. I mean, there's loads of tools we
have out there these days. Has everyone here heard of dbatools,
the PowerShell module? Yay?
Nay? Yep, we've got a few in there. So it's a PowerShell module
specifically designed for database administration and you can do database restores with
it. So I have it running on a separate server and it picks databases at
random, grabs their backups, restores them for me to another server,
and then flags any issues, if there are any. So that when I
come to needing to restore a production database,
I know that process will work.
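A minimal sketch of that automated restore-testing job using dbatools. Test-DbaLastBackup restores a database's latest backup to another instance, runs DBCC CHECKDB against it, then drops it. The server names are hypothetical, and picking the database with Get-Random is my own simplification.

```powershell
Import-Module dbatools

# Pick a user database at random from the production instance
$database = Get-DbaDatabase -SqlInstance "PROD-SQL-01" -ExcludeSystem | Get-Random

# Restore its last backup to a test server, run DBCC CHECKDB, then drop it
Test-DbaLastBackup -SqlInstance "PROD-SQL-01" -Destination "RESTORE-TEST-01" -Database $database.Name |
    Where-Object { $_.RestoreResult -ne "Success" -or $_.DbccResult -ne "Success" } |
    ForEach-Object { Write-Warning "Restore test failed for $($_.Database)" }
```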
Okay, more failures. Let's go for an obvious
one. Let's go with user error. Who here has ever
run an UPDATE statement against a production table without a WHERE clause? So you've
updated all the rows instead of just the ones you wanted.
Be honest. Happened to
a friend of mine in a previous job that I did. He ran an update
statement against a production table, the outcome of which was that, for a period of about half
an hour, a 15% discount was applied to every single
transaction that company did. Lovely.
So this is a test of maybe our deployment pipelines or our
practices. Should you have been able to run an ad hoc update query
against those production servers? Do your admins really need admin
access? All the time? Really?
So maybe this is a way we can test a deployment pipeline. Maybe this
is a test of our security policies. Does he need admin
access all the time? The outcome of this actually was that we restricted our AD accounts
to read-only access to the databases. And then when we needed to
do stuff like that, we logged in with SQL authentication accounts. And then we
installed an add-in called SSMSBoost into Management Studio,
the tool for managing SQL Server databases. And that would scan any script
that you're about to run on certain servers and say, hey, this UPDATE
statement doesn't have a WHERE clause. Are you sure you want to run this?
So maybe there's extra steps we could put in there. Maybe there's something around that
we could test to try and prevent human error. You're never going to get rid
of it completely. One of the things with SSMSBoost is you just get really
used to clicking OK. Are you sure you want to run this? Yes, click.
But maybe there's something we could test there.
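A hedged sketch of that kind of restriction. The domain group, database and instance names are made up for illustration: day-to-day AD access only gets db_datareader, and anything more goes through a separate SQL authentication login.

```powershell
Import-Module SqlServer

Invoke-Sqlcmd -ServerInstance "PROD-SQL-01" -Query @"
-- Day-to-day Windows group only gets read access
CREATE LOGIN [DOMAIN\DBA_ReadOnly] FROM WINDOWS;
USE [SalesDB];
CREATE USER [DOMAIN\DBA_ReadOnly] FOR LOGIN [DOMAIN\DBA_ReadOnly];
ALTER ROLE db_datareader ADD MEMBER [DOMAIN\DBA_ReadOnly];
"@
```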
Okay, let's do one more. Let's go nuclear. Let's think about
a production data center outage.
You're running all your production systems in a private data center.
That data center has an outage. Okay.
Likelihood, not very likely. Impact,
pretty bad. So this would be a test of your
DR solution. You need to have a DR solution in place
and you need to be regularly testing that DR plan. You do not want
to be enacting your DR strategy for the first time when
you have an issue in production. Now I've said fairly unlikely,
and it is, but it does happen. It happened to an ex
colleague of mine not two months ago. Their production data center
had a power outage. They didn't have a DR plan. They were out for 4
hours. Think of the damage that'll do to a business. You need
to have a DR strategy in place and you need to be regularly testing it.
Okay. So we do the likelihood impact map. We think about
all these failures because we want to identify things like which failure has the highest
likelihood, which failure has the highest impact, and what will we
gain from testing these failures? Remember, we want to get some actionable results from running
these chaos engineering experiments. Okay, it's great if you run an experiment and
everything passes and it works exactly as we expect it to; we've built confidence
in the system. But I'm pretty sure I can guarantee
you will find issues out there in the wild. And the
last point there is: is there anything else that can be tested?
Now this is sort of, is there anything else that you haven't thought of?
That's a bit of a weird question because it's Andrew, is there anything you haven't
thought of? I don't know. I haven't thought of it. What this means is go
and talk to people outside of your team.
Go and talk to, say, sysadmins, your network admins, your database developers.
Talk to your end users because I guarantee they
will come up with perceived weaknesses or failures that they've seen that you
haven't thought of. So maybe there's something there that you can test to further build
confidence in all of your systems that they will react
as you expect them to when they encounter a failure.
So let's go ahead and let's have a look at running an experiment.
So what test are we going to run? So I've
already mentioned it. Let's have a look at what happens if the primary node
in an availability group cluster fails.
So we're going to have a two node cluster. That primary
node is going to go down. What are
we expecting to happen? So let's define
the experiment. The hypothesis: the listener of the availability group
should remain online. The listener is the endpoint. This is what all the applications
and users connect to so that they can access the databases. So if the
primary node goes down, we want the availability group to fail
over to the secondary node, and that listener to remain online so that our
databases are accessible. The method.
Now I've said we can run T-SQL statements to fail availability groups over,
but that's a very sanitized way of testing.
So how about stopping the database engine service on
that primary node? So we're simulating that primary node going down
and we'll see what happens with that availability group.
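A rough sketch of that failure injection step, assuming a default SQL Server instance (service name MSSQLSERVER), PowerShell remoting enabled, and a hypothetical primary node name.

```powershell
$primaryNode = "AP-SQL-01"

# Simulate the primary node going down by killing the database engine service
Invoke-Command -ComputerName $primaryNode -ScriptBlock {
    Stop-Service -Name "MSSQLSERVER" -Force
}
```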
Of course, because we're running this test, we need a rollback
strategy. Now if this was me, even if I was doing it in a development
environment at work, I would have a set of scripts
that would perform a point-in-time restore of all the databases there, just in
case something went horribly wrong. But as this is just a demo environment,
all I'm going to do is restart the database
engine service on the primary node. So let's go
ahead and let's run a chaos engineering experiment.
Okay, here we have a two node availability group cluster.
So we have node one, which is our primary, and
node two, which is the secondary.
So let's use SQL to failover from the primary to the secondary so
we can execute this. And this is where we tell SQL to perform the failover
of our availability group from one to two.
And there we are, that's completed. So if we do a refresh here,
we can see one is now the secondary and
two is our primary. So if I connect back to
the one, I can fail it back just to verify that failover works
both ways. And now we're saying fail back
from two to one. So we're verifying that the
AG can fail from one node to the other and then back again.
And there we go, that's completed. So if I do a refresh,
one is back to our primary and
two is our secondary. But this isn't how an availability
group will fail over in the wild. So let's run a
chaos engineering experiment to simulate situations in production
that an availability group would experience, say the primary
node failing. And then we'll see if the availability group can
still fail over. So if we jump into Visual Studio Code,
I have this script here; scroll to the top. First thing
it does is just grab the listener name by running some SQL,
and then we have a load of Write-Hosts here just to make it basically
look pretty, to be honest. And then we run a Test-NetConnection
against the listener to make sure the listener is online.
And then we put that into a Pester test to verify whether or
not it's online. If Test-NetConnection comes back with
a value of true, the Pester test will pass; anything else,
the Pester test will fail. And this is our steady state hypothesis.
We run this to make sure the listener is online before
we actually run our chaos engineering experiment.
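A minimal sketch of that steady state check, assuming Pester v4-style syntax and a hypothetical listener name; the real script in the demo also pulls the listener name out of SQL first.

```powershell
Import-Module Pester

$listener = "AP-SQL-LISTENER"

Describe "Steady state hypothesis" {
    It "availability group listener is online" {
        # Test-NetConnection returns TcpTestSucceeded = $true if port 1433 answers
        $result = Test-NetConnection -ComputerName $listener -Port 1433
        $result.TcpTestSucceeded | Should -Be $true
    }
}
```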
So if the steady state hypothesis
passes, we then stop the
database engine service on the primary. So we're simulating
the primary failing, and then we test
the listener again. If the listener is online, it means it's failed over to the
secondary node, the availability group is online,
and SQL has reacted to the primary node shutting down
as we expect it to. So let's run our chaos
engineering experiment. Let's make sure I'm in the right location and
let's run. So first thing we're
doing is our steady state hypothesis: is that listener online?
The listener is online, so now we can shut down SQL on
the primary node. So we wait for that to go down.
And now we're running our test again to see if that listener is
online. And boom, there we are.
Looks like our chaos engineering experiment
has passed. Let's have a look in SQL itself
and let's do a refresh.
And one is now the secondary and two is the primary.
So the same as running the T-SQL: by shutting down the node, sorry,
by shutting down SQL on the primary node, the availability group has failed
over to the secondary node as we expected it to.
So let's run the test again. Let's fail back from two to one,
verifying that the availability group can fail both ways. So same
again. Starting off: is that listener online
on two? It's online. So now we can shut down the
database engine service on that node. And let's see if
the availability group fails over. So database
engine service going down and
we rerun our testnet connection against the listener.
So this has taken a little bit longer than before. There might be
something gone wrong. We'll have to give it a little bit to see.
And there we go, it looks like it's failed.
Yes. Okay, so the listener is no longer online. There we
go, and there's our chaos engineering experiment rolling back.
So let's have a look in SQL.
Let's do a refresh here.
Okay, so two is still the primary and
one is still the secondary. So our chaos
engineering experiment has failed. We were unable to fail that availability
group back from two to one. Now the reason for this
is the settings in the clustered
role. If I come down here to properties and click failover,
we have these settings here. And what these settings are saying is
the maximum failures in the specified period is one, and the period is six hours.
So if there is more than one failure of that
clustered role, the availability group, within six hours, and another
failover occurs, the clustered role, the availability
group, will be left in the failed state.
So I can use T-SQL to fail the availability group backwards and forwards as much as I want.
But if the availability group experiences a primary node
failure like we've simulated with our chaos engineering experiment more than once
in six hours with these settings, that failover
won't work. And guess what the default settings
are for a clustered role. They are these ones here.
So by performing this chaos engineering experiment, we've identified
a setting in our availability group that
could possibly cause SQL to react in a
way we don't expect it to. So we could come in,
make changes here. I'm just going to pick, say, ten, and
rerun our experiment.
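If you'd rather script that change than click through Failover Cluster Manager, something roughly like this works with the FailoverClusters PowerShell module; the clustered role name for the availability group here is an assumption.

```powershell
Import-Module FailoverClusters

# The demo showed defaults of 1 failure in a 6 hour period for the AG role
$agRole = Get-ClusterGroup -Name "AG1"
$agRole.FailoverThreshold = 10
$agRole.FailoverPeriod    = 6
```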
So let's check that the listener is online again.
It's online. And now let's shut down that database engine service on
two, and let's see if the setting changes we've made will
allow that availability group to fail over. So waiting for the database
engine service to go down and we're rerunning our test.
Boom, listener is online.
Come into SQL, let's do a refresh.
SQL is now the pri-, sorry, one is now the primary and
two is the secondary. So by making that configuration change, we are now able to
see SQL failing back over from two when we shut down the database engine
service on two.
Okay, little bit of a contrived experiment, but those are the default settings in that
clustered role of one failure every six hours. And I have to
admit, when I was building this test, this demo,
I was sort of sitting there thinking, I can't really remember
the last time I checked that setting for my production SQL
instances. And I went and checked and guess what, a couple of them
had those default settings I need to go in and change. So even
just by performing a chaos engineering experiment against this SQL instance
in a lab, I got some actionable results that I could take
away and put into my production environment.
Okay, so that
is running a chaos engineering experiment against a SQL Server
Always On availability group. Now to round the session off, I want
to talk about SQL Server running on Kubernetes. Anyone here
got experience with Kubernetes? Oh, loads of you. All right,
cool. I love Kubernetes. I think it's really cool. I really
think it's really, really cool that we can run SQL Server on it because I
like to take advantage of Kubernetes'
built-in high availability features.
So I have my desired state. I define my deployment in
say a config file or a YAML file, and then I deploy it into Kubernetes
and it becomes my running state. And Kubernetes is constantly checking between the two
to see if there are any issues. If there are any issues,
Kubernetes will fix them for me. So if my instance of SQL running in my
pod has an issue, Kubernetes will spin me up a new one automatically.
And I was demoing this to my boss, and we'll run through that
now. He was saying, that's great, Andrew, but this is a brand new platform
for SQL Server. So doing it in the command line,
that's okay, but can you jazz up your presentation a little bit? Can you make
it a little bit more wow factor? I went, well, you want
me to use a GUI? And he went, yeah, yeah, a nice GUI would be
good. And I went, great, I can use a GUI. And wouldn't it be great
if that GUI was Space Invaders?
This is a project I mentioned earlier. It's on GitHub.
It's called KubeInvaders. And guess what? Your pods are the invaders and you play
the spaceship, and you can nuke your pods and you can test
your pods' high availability by killing them playing
Space Invaders. So let's run through the
demo. So I'm first on my laptop here and then
we'll run up into Azure, if I knew what I was
doing. So I have a one-node cluster running
on my laptop here. It's called MicroK8s. It's actually really good.
If you're running on Windows, you have the ability to run Docker Desktop and that
has a Kubernetes feature. You just have a tick box and you've got
it up and running. So let's run one pod with SQL
Server and let's expose that on its default port. So there
we are.
There's my pod, there's my service up and running. Lovely stuff.
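A hedged sketch of the kind of deployment behind that demo. The image tag, SA password and names are illustrative only, and using a Deployment rather than a bare pod is what gives Kubernetes something to reconcile back to when the pod gets killed.

```powershell
@"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mssql
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mssql
  template:
    metadata:
      labels:
        app: mssql
    spec:
      containers:
      - name: mssql
        image: mcr.microsoft.com/mssql/server:2019-latest
        ports:
        - containerPort: 1433
        env:
        - name: ACCEPT_EULA
          value: "Y"
        - name: SA_PASSWORD
          value: "Testing1122"
"@ | kubectl apply -f -

# Expose SQL Server on its default port
kubectl expose deployment mssql --type=LoadBalancer --port=1433 --target-port=1433
```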
So I can connect to it. So say this is my steady state hypothesis.
Is SQL up and running? Connecting with mssql-cli,
just a little command line tool. And there we are, we have Microsoft SQL
Server 2019 RC1.
So there's my pod. We can see its IP address and
an age of about 28 seconds.
Let's nuke that pod. There we go. So what's happened here
is the running state no longer matches the desired state. Kubernetes should automatically
say, oh hang on, something's not right here, and it should fix it for me.
So what we should see when this comes back is that
I now have a new
pod running for 16 seconds with a new IP address. And
I can further test, because
I still have my service, by running my query again, and that
should, using the same service IP address, go in; I should be able to query
that and get my version back. So that's a very,
very basic chaos engineering experiment against SQL Server running
on Kubernetes, making use of Kubernetes' HA systems.
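The manual version of that experiment, without the Space Invaders, is roughly this; the app=mssql label matches the hypothetical deployment sketched above.

```powershell
# Grab the running SQL Server pod's name
$pod = kubectl get pods -l app=mssql -o jsonpath='{.items[0].metadata.name}'

# Nuke it
kubectl delete pod $pod

# Watch Kubernetes reconcile running state back to desired state
kubectl get pods -l app=mssql -w
```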
Okay so that's great. Let's clean up and
I'm going to switch to a cluster running up in Azure and
I'm going to deploy ten pods to it.
Let's see how that gets on. Okay, I've
got a few pods coming. There we are, running,
and now I can play my game. Oh hang on, let's grab that.
Here we go. And okay, let's
get rid of that one. Let's get rid of that one. And we
can see the pods on the side there being killed, and
hopefully Kubernetes will realize that the running state
no longer matches the desired state and it will fix them for me.
It should sync up and I should get a whole bunch of brand new pods
coming into the system there.
Another cool thing about this is I can stick it on automatic mode and I
can let it play for me. So I can let that go off, and I
can stand here and talk about it, and it can decide when it
wants to start killing pods and whether any of them are going to sync back
in. One of the really nice things about this is that it's
shooting little mini Kubernetes logos, which I think is a really
nice touch.
So we can let that play there. Some of them are coming back in.
There we go. See, Kubernetes realized that the
running state no longer matched the desired state and fixed that for me,
brought all those pods back in. And I'm going to let that play and it'll
just keep killing them as I'm talking.
So let's do a cleanup there and jump
back into the slides. So to round off, I've just got
a couple of resources there. The top one there is the link to my GitHub
account with the slides and the code for all the demos I just ran.
And the second link there is to the
KubeInvaders project on GitHub as well. Thank you very much.
Are there any questions? Yes:
any sort of persistent storage, like relational
databases, within Kubernetes and Docker,
are you talking about that as just a playground?
No. At the moment we are in the middle of running a
proof of concept. So the question is, is SQL Server
running on Kubernetes and Docker just a playground? Are we looking at something
more serious? We are looking at something more serious. We've built our own Kubernetes
cluster running off a Pure Storage array,
obviously provisioning persistent volumes for the databases,
the system databases, the transaction logs, and then backing them up off site.
And then we are properly nuking it. I've managed
to corrupt quite a few databases so far,
but that again will figure into how quickly we can restore.
There's a lot of learning going on here, because maybe we got the configuration wrong.
The Pure Storage array is so lightning fast that that's not going to be an issue, I
reckon. I think it's more us nuking pods whilst there are heavy transactions
going on, which would happen if you ran SQL Server in a VM,
to be honest.