Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome everyone to this session on Chaos Engineering for SQL
Server. A little bit about myself first. My name is Andrew Pruski.
I am a SQL Server DBA and Microsoft Data Platform MVP.
Originally from Swansea, but I've been living in Dublin for just
over six years now. My Twitter handle, @dbafromthecold, and
my email address, dbafromthecold@gmail.com, are on the slide there.
So if you have any questions after today, please feel free to reach out.
I'm always willing to talk about this stuff. My blog is up there as well,
dbafromthecold.com. I'm posting an article next week about
SQL Server and Chaos Engineering. And then finally at the bottom there is my
GitHub account. All the slides and the code for the demos
I'll be running are available there and I'll post an exact link at the
end of the session. So onto the session.
The aim of this session is to talk about how we can apply chaos
engineering principles and practices to SQL Server and the
systems around it. I'm a database administrator
by trade. Have we got any DBAs in? I can't see... no?
Good, so I can say pretty much what I want. I'm a
database administrator by trade and we're taught to think about
failures. For example, one of the core aspects of being a DBA
is backups. We backup our databases and we
don't just back up our databases and forget about them. We verify those backups and
we practice restores because we need to be able to
restore our databases if they encounter a failure in production
to our company's RPO (recovery point objective)
and within our company's RTO (recovery time objective). And it's because of practices like
this. I think that being a DBA and chaos engineering
really do go hand in hand. I think all DBAs should be embracing
chaos engineering principles and performing chaos engineering
experiments. So here's what we're going to cover. We're going to talk about
identifying potential failures and weaknesses in SQL
Server and the systems around it that we can
test with chaos engineering experiments.
Then we're going to actually run a chaos engineering experiment up in a lab
that I've got in Azure. If the Wi-Fi holds out on me,
fingers crossed. And then finally to round off the session, just have a bit
of fun. We're going to talk about SQL Server on Kubernetes,
and we've got a little tool that I found online that we can use to
test SQL running on Kubernetes. It's a really exciting time to be a SQL
Server DBA at the moment; over the last few years we've had all
these brand new platforms. We've had SQL Server on Linux, SQL Server in Docker
containers, and SQL Server on Kubernetes.
So we'll have a little play around with testing SQL Server's
HA capabilities on Kubernetes to round the session off.
But first things first, let's talk about how we
can identify weaknesses for SQL Server and its systems
that we want to test. So the best way to start with this is
to do a past incident analysis, or as
Ross would say, past surprise analysis.
So the reason we do this is because we want to find a
potential weakness that has happened in the past or is likely to happen. There's no
point in running a chaos engineering experiment against something that's never going to happen,
or is very unlikely to happen, in your environment. We want to get some actionable
results from our chaos engineering experiments.
So look at the past incidents that have happened in your environment,
what's happened and what technologies and strategies were put in place to
mitigate that issue. So say we had a standalone SQL
Server instance in production. That instance went down
and some sort of HA strategy was put in place,
whatever that may be. Maybe it's mirroring, maybe it's clustering, maybe it's an Always
On availability group. Or if you're mad, we can use replication.
So maybe we want to run a chaos engineering experiment against that HA
strategy to further build confidence in our systems
so that they will react as we expect them to
when, not if, they encounter a failure,
because we want to learn from the previous failures and that will guide us to
our future experiments so we can do a past surprise
analysis. Another really good method is to do what's called a
likelihood impact map. So we think
about all the failures that could happen in our systems and
we rank them on how likely they are against their impact.
So all the way down the bottom there in the green, least likely, least impact,
all the way up to the top there with most likely, oh holy moly,
impact. And when people do this,
they do tend to be either overly pessimistic or
overly optimistic about the failures in their systems. I'll let you decide
which way I fall on that one.
But if that does happen, you'll see that all your failures clump down at
the bottom or up at the top. If that happens, drill in and really,
really think about the failures and perform the analysis again. So you
get a nice spread across the graph with hopefully
three or four up in the red there that are potential candidates
for you to run your chaos engineering experiments against.
So let's think about potential
failures for SQL Server. I've already mentioned one: high
availability. Say we had a standalone
SQL instance in production that went down, so a
two-node Always On availability group cluster was built. So we had a primary and
a secondary. If the primary fails, we've configured automatic
failover to the secondary. The secondary will stay up and our databases
will remain online. Now we can perform tests
in SQL Server to fail over that availability group and fail it back with
T-SQL. We can say ALTER AVAILABILITY GROUP ... FAILOVER, hit execute,
and watch it fail over. But that's not how SQL
Server is going to fail over in production. What about
just turning off the primary node, nuking that primary
node, seeing what happens? Isn't that a more realistic test of
how SQL will fail over in production? Yes, absolutely.
So that is one test we could possibly run.
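For reference, a minimal sketch of that "sanitised" manual failover test. The availability group name (AG1) and the replica instance names are hypothetical placeholders, and it assumes the SqlServer PowerShell module is installed.

```powershell
Import-Module SqlServer

# Run the planned failover on the secondary you want to promote
Invoke-Sqlcmd -ServerInstance "AP-SQL-02" -Query "ALTER AVAILABILITY GROUP [AG1] FAILOVER;"

# Check which replica now holds the primary role
Invoke-Sqlcmd -ServerInstance "AP-SQL-02" -Query @"
SELECT ar.replica_server_name, ars.role_desc
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id;
"@
```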
So let's think about some more failures. Let's think about
monitoring systems. Monitoring is important: something like PagerDuty
gets alerted, you get paged. If that goes down, you're in trouble. So let's
think about a runaway query in the database. Runaway query
fills up the transaction log of a SQL server database. If that happens in SQL
Server, no more writes can hit that database.
Now there's no point in really writing a chaos engineering experiment
to test that. We know what's going to happen.
But what about your monitoring? When did you get alerted there
was an issue? Did you get an alert at the end saying, oh, the transaction
log's full, you need to do something? Or did you get preemptive alerts saying,
hey, there's a lot of growth happening here, you need to investigate
before an issue occurs? So maybe we could write a chaos engineering
experiment to test that.
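A rough sketch of the kind of preemptive check that monitoring could run, rather than waiting for the "log full" error. The instance name, database name and the 80% threshold are illustrative assumptions, not anything from the talk.

```powershell
Import-Module SqlServer

# sys.dm_db_log_space_usage reports log usage for the current database
$log = Invoke-Sqlcmd -ServerInstance "AP-SQL-01" -Database "SalesDB" -Query @"
SELECT used_log_space_in_percent FROM sys.dm_db_log_space_usage;
"@

if ($log.used_log_space_in_percent -gt 80) {
    # Hook this into whatever pages you - PagerDuty, email, Teams...
    Write-Warning "Transaction log is over 80% used - investigate before writes stop."
}
```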
Thinking of another one, let's go with backups. I've already mentioned backups.
We test our backups, right? We back them up to the same server the databases
are on? No, that server goes offline,
we've lost not only our databases, we've lost the ability to recover them.
So we put them off site somewhere on
an external server. We just back them up there and
leave them, right? No, we regularly run restore
tests against those backups. We verify them and then we regularly restore those
databases because issues with restores
do happen. A few jobs ago, I was working as
a SQL Server DBA and I was using a third party tool to back up
my databases. The reason for this was we had kind of
large databases and I wanted
to have backup compression. We were
working with SQL Server 2005 (this is how long ago it was) and SQL Server
didn't have native backup compression then, so this third party tool did it. I'm not going
to mention it because I don't want to get sued,
but it gave us the compression. It was great. It worked everywhere. I had implemented
it absolutely everywhere, compressing my backups. Happy as Larry.
One day we had an issue in
production and I needed to restore a three terabyte database.
Not a problem, I said, I have the backups, they've all been verified.
Went to the software, clicked the backup file,
clicked restore, and I also clicked a
feature that this third party tool had called Instant Restore,
because who wouldn't want to instantly restore a three terabyte
database? Awesome. So I clicked instant restore,
I hit execute and the software went to 99%
and there it stayed for 5 hours and then promptly failed.
I hadn't been regularly testing my restores using that feature.
I should have regularly been restoring databases to a test server
using those settings, so that I knew that process would
work as I expected it to when I really
needed it. And guess what? I do now. I mean, there's loads of tools we
have out there these days. Has everyone here heard of dbatools,
the PowerShell module? Yay?
Nay? Yep, we've got a few in there. So it's a PowerShell module
specifically designed for database administration and you can do database restores with
it. So I have it running on a separate server and it picks databases at
random, grabs their backups, restores them for me to another server,
and then flags any issues, if there are any. So that when I
come to needing to restore a production database,
I know that process will work.
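A minimal sketch of that automated restore-testing job using dbatools. Test-DbaLastBackup restores a database's latest backup to another instance, runs DBCC CHECKDB against it, then drops it. The server names are hypothetical, and picking the database with Get-Random is my own simplification.

```powershell
Import-Module dbatools

# Pick a user database at random from the production instance
$database = Get-DbaDatabase -SqlInstance "PROD-SQL-01" -ExcludeSystem | Get-Random

# Restore its last backup to a test server, run DBCC CHECKDB, then drop it
Test-DbaLastBackup -SqlInstance "PROD-SQL-01" -Destination "RESTORE-TEST-01" -Database $database.Name |
    Where-Object { $_.RestoreResult -ne "Success" -or $_.DbccResult -ne "Success" } |
    ForEach-Object { Write-Warning "Restore test failed for $($_.Database)" }
```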
Okay, more failures. Let's go for an obvious
one. Let's go with user error. Who here has ever
run an UPDATE statement against a production table without a WHERE clause? So you've
updated all the rows instead of just the ones you wanted.
Be honest. Happened to
a friend of mine in a previous job that I did. He ran an update
statement against a production table, the outcome of which was that, for a period of about half
an hour, a 15% discount was applied to every single
transaction that company did. Lovely.
So this is a test of maybe our deployment pipelines or our
practices. Should you have been able to run an ad hoc update query
against those production servers? Do your admins really need admin
access? All the time? Really?
So maybe this is a way we can test a deployment pipeline. Maybe this
is a test of our security policies. Does he need admin
access all the time? The outcome of this actually was that we restricted our AD accounts
to read-only access to the databases. And then when we needed to
do stuff like that, we logged in with SQL authentication accounts. And then we
installed an add-in called SSMSBoost into Management Studio,
the tool for managing SQL Server databases. And that would scan any script
that you're about to run on certain servers and say, hey, this UPDATE
statement doesn't have a WHERE clause. Are you sure you want to run this?
So maybe there's extra steps we could put in there. Maybe there's something around that
we could test to try and prevent human error. You're never going to get rid
of it completely. One of the things with SSMSBoost is you just get really
used to clicking OK. Are you sure you want to run this? Yes, click.
But maybe there's something we could test there.
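A hedged sketch of that kind of restriction. The domain group, database and instance names are made up for illustration: day-to-day AD access only gets db_datareader, and anything more goes through a separate SQL authentication login.

```powershell
Import-Module SqlServer

Invoke-Sqlcmd -ServerInstance "PROD-SQL-01" -Query @"
-- Day-to-day Windows group only gets read access
CREATE LOGIN [DOMAIN\DBA_ReadOnly] FROM WINDOWS;
USE [SalesDB];
CREATE USER [DOMAIN\DBA_ReadOnly] FOR LOGIN [DOMAIN\DBA_ReadOnly];
ALTER ROLE db_datareader ADD MEMBER [DOMAIN\DBA_ReadOnly];
"@
```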
Okay, let's do one more. Let's go nuclear. Let's think about
a production data center outage.
You're running all your production systems in a private data center.
That data center has an outage. Okay.
Likelihood, not very likely. Impact,
pretty bad. So this would be a test of your
DR solution. You need to have a DR solution in place
and you need to be regularly testing that DR plan. You do not want
to be enacting your DR strategy for the first time when
you have an issue in production. Now I've said fairly unlikely,
and it is, but it does happen. It happened to an ex
colleague of mine not two months ago. Their production data center
had a power outage. They didn't have a DR plan. They were out for 4
hours. Think of the damage that'll do to a business. You need
to have a DR strategy in place and you need to be regularly testing it.
Okay. So we do the likelihood impact map. We think about
all these failures because we want to identify things like which failure has the highest
likelihood, which failure has the highest impact, and what will we
gain from testing these failures? Remember, we want to get some actionable results from running
these chaos engineering experiments. Okay, it's great if you run an experiment and
everything passes and it works exactly as we expect it to; we've built confidence
in the system. But I'm pretty sure I can guarantee
you will find issues out there in the wild. And the
last point there is: is there anything else that can be tested?
Now this is sort of, is there anything else that you haven't thought of?
That's a bit of a weird question because it's Andrew, is there anything you haven't
thought of? I don't know. I haven't thought of it. What this means is go
and talk to people outside of your team.
Go and talk to, say, sysadmins, your network admins, your database developers.
Talk to your end users because I guarantee they
will come up with perceived weaknesses or failures that they've seen that you
haven't thought of. So maybe there's something there that you can test to further build
confidence in all of your systems that they will react
as you expect them to when they encounter a failure.
So let's go ahead and let's have a look at running an experiment.
So what test are we going to run? So I've
already mentioned it. Let's have a look at what happens if the primary node
in an availability group cluster fails.
So we're going to have a two node cluster. That primary
node is going to go down. What are
we expecting to happen? So let's define
the experiment. The hypothesis: the listener of the availability group
should remain online. The listener is the endpoint. This is what all the applications
and users connect to so that they can access the databases. So if the
primary node goes down, we want the availability group to fail
over to the secondary node, and that listener to remain online so that our
databases are accessible. The method.
Now I've said we can run T-SQL statements to fail availability groups over,
but that's a very sanitized way of testing.
So how about stopping the database engine service on
that primary node? So we're simulating that primary node going down
and we'll see what happens with that availability group.
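A rough sketch of that failure injection step, assuming a default SQL Server instance (service name MSSQLSERVER), PowerShell remoting enabled, and a hypothetical primary node name.

```powershell
$primaryNode = "AP-SQL-01"

# Simulate the primary node going down by killing the database engine service
Invoke-Command -ComputerName $primaryNode -ScriptBlock {
    Stop-Service -Name "MSSQLSERVER" -Force
}
```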
Of course, because we're running this test, we need a rollback
strategy. Now if this was me, even if I was doing it in a development
environment at work, I would have a set of scripts
that would perform a point-in-time restore of all the databases there, just in
case something went horribly wrong. But as this is just a demo environment,
all I'm going to do is restart the database
engine service on the primary node. So let's go
ahead and let's run a chaos engineering experiment.
Okay, here we have a two node availability group cluster.
So we have node one, which is our primary, and
node two, which is the secondary.
So let's use SQL to failover from the primary to the secondary so
we can execute this. And this is where we tell SQL to perform the failover
of our availability group from one to two.
And there we are, that's completed. So if we do a refresh here,
we can see one is now the secondary and
two is our primary. So if I connect back to
the one, I can fail it back just to verify that failover works
both ways. And now we're saying fail back
from two to one. So we're verifying that the
AG can fail from one node to the other and then back again.
And there we go, that's completed. So if I do a refresh,
one is back to our primary and
two is our secondary. But this isn't how an availability
group will fail over in the wild. So let's run a
chaos engineering experiment to simulate situations in production
that an availability group would experience, say the primary
node failing. And then we'll see if the availability group can
still fail over. So if we jump into Visual Studio Code,
I have this script here; scroll to the top. First thing
it does is just grab the listener name by running some SQL,
and then we have a load of Write-Hosts here just to make it basically
look pretty, to be honest. And then we run a Test-NetConnection
against the listener to make sure the listener is online.
And then we put that into a Pester test to verify whether or
not it's online. If Test-NetConnection comes back with
a value of true, the Pester test will pass; anything else,
the Pester test will fail. And this is our steady state hypothesis.
We run this to make sure the listener is online before
we actually run our chaos engineering experiment.
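A minimal sketch of that steady state check, assuming Pester v4-style syntax and a hypothetical listener name; the real script in the demo also pulls the listener name out of SQL first.

```powershell
Import-Module Pester

$listener = "AP-SQL-LISTENER"

Describe "Steady state hypothesis" {
    It "availability group listener is online" {
        # Test-NetConnection returns TcpTestSucceeded = $true if port 1433 answers
        $result = Test-NetConnection -ComputerName $listener -Port 1433
        $result.TcpTestSucceeded | Should -Be $true
    }
}
```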
So if the steady state hypothesis
passes, we then stop the
database engine service on the primary. So we're simulating
the primary failing, and then we test
the listener again. If the listener is online, it means it's failed over to the
secondary node, the availability group is online,
and SQL has reacted to the primary node shutting down
as we expect it to. So let's run our chaos
engineering experiment. Let's make sure I'm in the right location and
let's run. So first thing we're
doing is our steady state hypothesis: is that listener online?
The listener is online, so now we can shut down SQL on
the primary node. So we wait for that to go down.
And now we're running our test again to see if that listener is
online. And boom, there we are.
Looks like our chaos engineering experiment
has passed. Let's have a look in SQL itself
and let's do a refresh.
And one is now the secondary and two is the primary.
So the same as running the T-SQL: by shutting down the node, sorry,
by shutting down SQL on the primary node, the availability group has failed
over to the secondary node as we expected it to.
So let's run the test again. Let's fail back from two to one,
verifying that the availability group can fail both ways. So same
again. Starting off: is that listener online
on two? It's online. So now we can shut down the
database engine service on that node. And let's see if
the availability group fails over. So database
engine service going down and
we rerun our testnet connection against the listener.
So this has taken a little bit longer than before. There might be
something gone wrong. We'll have to give it a little bit to see.
And there we go, it looks like it's failed.
Yes. Okay, so the listener is no longer online. There we
go, and there's our chaos engineering experiment rolling back.
So let's have a look in SQL.
Let's do a refresh here.
Okay, so two is still the primary and
one is still the secondary. So our chaos
engineering experiment has failed. We were unable to fail that availability
group back from two to one. Now the reason for this
is the settings in the clustered
role. If I come down here to properties and click failover,
we have these settings here. And what these settings are saying is
the maximum failures in the specified period is one, and the period is six hours.
So if there is more than one failure of that
clustered role, the availability group, within six hours, and another
failover occurs, the clustered role, the availability
group, will be left in the failed state.
So I can use T-SQL to fail the availability group backwards and forwards as much as I want.
But if the availability group experiences a primary node
failure like we've simulated with our chaos engineering experiment more than once
in six hours with these settings, that failover
won't work. And guess what the default settings
are for a clustered role. They are these ones here.
So by performing this chaos engineering experiment, we've identified
a setting in our availability group that
could possibly cause SQL to react in a
way we don't expect it to. So we could come in,
make changes here. I'm just going to pick, say, ten, and
rerun our experiment.
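If you'd rather script that change than click through Failover Cluster Manager, something roughly like this works with the FailoverClusters PowerShell module; the clustered role name for the availability group here is an assumption.

```powershell
Import-Module FailoverClusters

# The demo showed defaults of 1 failure in a 6 hour period for the AG role
$agRole = Get-ClusterGroup -Name "AG1"
$agRole.FailoverThreshold = 10
$agRole.FailoverPeriod    = 6
```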
So let's check that the listener is online again.
It's online. And now let's shut down that database engine service on
two, and let's see if the setting changes we've made will
allow that availability group to fail over. So waiting for the database
engine service to go down and we're rerunning our test.
Boom, listener is online.
Come into SQL, let's do a refresh.
SQL is now the pri-, sorry, one is now the primary and
two is the secondary. So by making that configuration change, we are now able to
see SQL failing back over from two when we shut down the database engine
service on two.
Okay, little bit of a contrived experiment, but those are the default settings in that
clustered role of one failure every six hours. And I have to
admit, when I was building this test, this demo,
I was sort of sitting there thinking, I can't really remember
the last time I checked that setting for my production SQL
instances. And I went and checked and guess what, a couple of them
had those default settings I need to go in and change. So even
just by performing a chaos engineering experiment against this SQL instance
in a lab, I got some actionable results that I could take
away and put into my production environment.
Okay, so that
is running a chaos engineering experiment against a SQL Server
Always On availability group. Now to round the session off, I want
to talk about SQL Server running on Kubernetes. Anyone here
got experience with Kubernetes? Oh, loads of you. All right,
cool. I love Kubernetes. I think it's really cool. I really
think it's really, really cool that we can run SQL Server on it because I
like to take advantage of Kubernetes'
built-in high availability features.
So I have my desired state. I define my deployment in
say a config file or a YAML file, and then I deploy it into Kubernetes
and it becomes my running state. And Kubernetes is constantly checking between the two
to see if there are any issues. If there are any issues,
Kubernetes will fix them for me. So if my instance of SQL running in my
pod has an issue, Kubernetes will spin me up a new one automatically.
And I was demoing this to my boss, and we'll run through that
now. He was saying, that's great, Andrew, but this is a brand new platform
for SQL Server. So doing it in the command line,
that's okay, but can you jazz up your presentation a little bit? Can you make
it a little bit more wow factor? I went, well, you want
me to use a GUI? And he went, yeah, yeah, a nice GUI would be
good. And I went, great, I can use a GUI. And wouldn't it be great
if that GUI was Space Invaders?
This is a project I mentioned earlier. It's on GitHub.
It's called KubeInvaders. And guess what? Your pods are the invaders and you play
the spaceship, and you can nuke your pods and you can test
your pods' high availability by killing them playing
Space Invaders. So let's run through the
demo. So I'm first on my laptop here and then
we'll run up into Azure, if I knew what I was
doing. So I have a one-node cluster running
on my laptop here. It's called MicroK8s. It's actually really good.
If you're running on Windows, you have the ability to run Docker Desktop and that
has a Kubernetes feature. You just have a tick box and you've got
it up and running. So let's run one pod with SQL
Server and let's expose that on its default port. So there
we are.
There's my pod, there's my service up and running. Lovely stuff.
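A hedged sketch of the kind of deployment behind that demo. The image tag, SA password and names are illustrative only, and using a Deployment rather than a bare pod is what gives Kubernetes something to reconcile back to when the pod gets killed.

```powershell
@"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mssql
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mssql
  template:
    metadata:
      labels:
        app: mssql
    spec:
      containers:
      - name: mssql
        image: mcr.microsoft.com/mssql/server:2019-latest
        ports:
        - containerPort: 1433
        env:
        - name: ACCEPT_EULA
          value: "Y"
        - name: SA_PASSWORD
          value: "Testing1122"
"@ | kubectl apply -f -

# Expose SQL Server on its default port
kubectl expose deployment mssql --type=LoadBalancer --port=1433 --target-port=1433
```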
So I can connect to it. So say this is my steady state hypothesis.
Is SQL up and running? Connecting with mssql-cli,
just a little command line tool. And there we are, we have Microsoft SQL
Server 2019 RC1.
So there's my pod. We can see its IP address and
an age of about 28 seconds.
Let's nuke that pod. There we go. So what's happened here
is the running state no longer matches the desired state. Kubernetes should automatically
say, oh hang on, something's not right here, and it should fix it for me.
So what we should see when this comes back is that
I now have a new
pod running for 16 seconds with a new IP address. And
I can further test, because
I still have my service, by running my query again, and that
should, using the same service IP address, go in; I should be able to query
that and get my version back. So that's a very,
very basic chaos engineering experiment against SQL Server running
on Kubernetes, making use of Kubernetes' HA systems.
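The manual version of that experiment, without the Space Invaders, is roughly this; the app=mssql label matches the hypothetical deployment sketched above.

```powershell
# Grab the running SQL Server pod's name
$pod = kubectl get pods -l app=mssql -o jsonpath='{.items[0].metadata.name}'

# Nuke it
kubectl delete pod $pod

# Watch Kubernetes reconcile running state back to desired state
kubectl get pods -l app=mssql -w
```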
Okay so that's great. Let's clean up and
I'm going to switch to a cluster running up in Azure and
I'm going to deploy ten pods to it.
Let's see how that gets on. Okay, I've
got a few pods coming. There we are, running,
and now I can play my game. Oh hang on, let's grab that.
Here we go. And okay, let's
get rid of that one. Let's get rid of that one. And we
can see the pods on the side there being killed, and
hopefully Kubernetes will realize that the running state
no longer matches the desired state and it will fix them for me.
It should sync up and I should get a whole bunch of brand new pods
coming into the system there.
Another cool thing about this is I can stick it on automatic mode and I
can let it play for me. So I can let that go off, and I
can stand here and talk about it, and it can decide when it
wants to start killing pods and whether any of them are going to sync back
in. One of the really nice things about this is that it's
shooting little mini Kubernetes logos, which I think is a really
nice touch.
So we can let that play there. Some of them are coming back in.
There we go. See, Kubernetes realized that the
running state no longer matched the desired state and fixed that for me,
brought all those pods back in. And I'm going to let that play and it'll
just keep killing them as I'm talking.
So let's do a cleanup there and jump
back into the slides. So to round off, I've just got
a couple of resources there. The top one there is the link to my GitHub
account with the slides and the code for all the demos I just ran.
And the second link there is to the
KubeInvaders project on GitHub as well. Thank you very much.
Are there any questions? Yes:
any sort of persistent storage, like relational
databases, within Kubernetes and Docker,
are you talking about that as just a playground?
No. At the moment we are in the middle of running a
proof of concept. So the question is, is SQL Server
running on Kubernetes and Docker just a playground? Are we looking at something
more serious? We are looking at something more serious. We've built our own Kubernetes
cluster running off a Pure Storage array,
obviously provisioning persistent volumes for the databases,
the system databases, the transaction logs, and then backing them up off site.
And then we are properly nuking it. I've managed
to corrupt quite a few databases so far,
but that again will figure into how quickly we can restore.
There's a lot of learning going on here, because maybe we got the configuration wrong.
The Pure Storage array is so lightning fast that that's not going to be an issue, I
reckon. I think it's more us nuking pods whilst there are heavy transactions
going on, which would happen if you ran SQL Server in a VM,
to be honest.