Transcript
            
            
            
            
            
            
              Welcome everyone to this session on Chaos Engineering for SQL
            
            
            
              Server. A little bit about myself first. My name is Andrew Pruski.
            
            
            
I am a SQL Server DBA and Microsoft Data Platform MVP.
            
            
            
              Originally from Swansea, but I've been living in Dublin for the last just
            
            
            
over six years now. My Twitter handle, @dbafromthecold, and my email address, dbafromthecold@gmail.com, are on the slide there.
            
            
            
              So if you have any questions after today, please feel free to reach out.
            
            
            
I'm always willing to talk about this stuff. My blog is there as well, dbafromthecold.com; I'm posting an article next week about SQL Server and chaos engineering. And then finally, at the bottom there, is my
            
            
            
              GitHub account. All the slides and the code for the demos
            
            
            
              I'll be running are available there and I'll post an exact link at the
            
            
            
              end of the session. So onto the session.
            
            
            
              The aim of this session is to talk about how we can apply chaos
            
            
            
              engineering principles and practices to SQL Server and the
            
            
            
              systems around it. I'm a database administrator
            
            
            
by trade. Have we got any DBAs in? I can't see... no? Good, so I can say pretty much what I want. I'm a
            
            
            
              database administrator by trade and we're taught to think about
            
            
            
              failures. For example, one of the core aspects of being a DBA
            
            
            
is backups. We back up our databases and we
            
            
            
              don't just back up our databases and forget about them. We verify those backups and
            
            
            
              we practice restores because we need to be able to
            
            
            
              restore our databases if they encounter a failure in production
            
            
            
to our company's RPO, recovery point objective, and within our company's RTO, recovery time objective. And it's because of practices like this that I think being a DBA and chaos engineering really do go hand in hand. I think all DBAs should be embracing
            
            
            
              chaos engineering principles and performing chaos engineering
            
            
            
              experiments. So here's what we're going to cover. We're going to talk about
            
            
            
              identifying potential failures and weaknesses in SQL
            
            
            
              Server and the systems around it that we can
            
            
            
              test with chaos engineering experiments.
            
            
            
              Then we're going to actually run a chaos engineering experiment up in a lab
            
            
            
that I've got in Azure, if the Wi-Fi holds out on me,
            
            
            
              fingers crossed. And then finally to round off the session, just have a bit
            
            
            
of fun. We're going to talk about SQL Server on Kubernetes,
            
            
            
              and we've got a little tool that I found online that we can use to
            
            
            
test SQL running on Kubernetes. It's a really exciting time to be a SQL Server DBA at the moment; over the last few years we've had all these brand new platforms. We've had SQL Server on Linux, SQL Server in Docker containers, and SQL Server on Kubernetes. So we'll have a little look at testing SQL Server's HA capabilities on Kubernetes to round the session off.
            
            
            
              But first things first, let's talk about how we
            
            
            
              can identify weaknesses for SQL Server and its systems
            
            
            
              that we want to test. So the best way to start with this is
            
            
            
              to do a past incident analysis, or as
            
            
            
              Ross would say, past surprise analysis.
            
            
            
              So the reason we do this is because we want to find a
            
            
            
              potential weakness that has happened in the past or is likely to happen. There's no
            
            
            
point in running a chaos engineering experiment against something that's never going to happen, or is very unlikely to happen, in your environment. We want to get some actionable
            
            
            
              results from our chaos engineering experiments.
            
            
            
So look at the past incidents that have happened in your environment: what happened,
            
            
            
              what's happened and what technologies and strategies were put in place to
            
            
            
mitigate that issue. So say we had a standalone SQL Server instance in production. That instance went down and some sort of HA strategy was put in place, whatever that may be. Maybe it's mirroring, maybe it's clustering, maybe it's an Always On availability group. Or, if you're mad, replication.
            
            
            
So maybe we want to run a chaos engineering experiment against that HA
            
            
            
              strategy to further build confidence in our systems
            
            
            
              so that they will react as we expect them to
            
            
            
              when, not if, they encounter a failure,
            
            
            
              because we want to learn from the previous failures and that will guide us to
            
            
            
              our future experiments so we can do a past surprise
            
            
            
              analysis. Another really good method is to do what's called a
            
            
            
              likelihood impact map. So we think
            
            
            
              about all the failures that could happen in our systems and
            
            
            
we rank them on how likely they are versus their impact.
            
            
            
              So all the way down the bottom there in the green, least likely, least impact,
            
            
            
all the way up to the top there with most likely, oh-holy-moly impact. And when people do this, they do tend to be either overly pessimistic or overly optimistic about the failures in their systems. I'll let you decide
            
            
            
              which way I fall on that one.
            
            
            
But if that does happen, you'll see that all your failures clump down at the bottom or up at the top. If that happens, drill in and really,
            
            
            
              really think about the failures and perform the analysis again. So you
            
            
            
              get a nice spread across the graph with hopefully
            
            
            
              three or four up in the red there that are potential candidates
            
            
            
              for you to run your chaos engineering experiments against.
            
            
            
              So let's think about potential
            
            
            
failures for SQL Server. I've already mentioned one: high availability. Say we had a standalone SQL instance in production that went down, so a two-node Always On availability group cluster was built. So we had a primary and
            
            
            
a secondary. If the primary fails, we've configured automatic
            
            
            
              failover to the secondary. The secondary will stay up and our databases
            
            
            
              will remain online. Now we can perform tests
            
            
            
in SQL Server to fail over that availability group and fail it back with
            
            
            
SQL. We can say ALTER AVAILABILITY GROUP FAILOVER, hit execute, watch it fail over. But that's not how SQL Server is going to fail over in production. What about
            
            
            
              just turning off the primary node, nuking that primary
            
            
            
              node, seeing what happens? Isn't that a more realistic test of
            
            
            
how SQL will fail over in production? Yes, absolutely.
            
            
            
              So that is one test we could possibly run.
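For reference, the kind of scripted failover being described might look something like this. The availability group and replica names are assumptions, and it uses Invoke-Sqlcmd from the SqlServer PowerShell module; a manual failover is run against the replica that should become the new primary.

```powershell
# Hedged sketch of the "sanitised" T-SQL failover, not the exact demo code.
# Assumes an availability group called AG1 and a secondary replica AP-SQL-02.

# Run the failover while connected to the replica that will become primary.
Invoke-Sqlcmd -ServerInstance 'AP-SQL-02' -Query 'ALTER AVAILABILITY GROUP [AG1] FAILOVER;'

# Confirm which replica now holds the primary role.
Invoke-Sqlcmd -ServerInstance 'AP-SQL-02' -Query @'
SELECT ar.replica_server_name, ars.role_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar ON ars.replica_id = ar.replica_id;
'@
```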
            
            
            
              So let's think about some more failures. Let's think about
            
            
            
monitoring systems. Monitoring is important: something like PagerDuty gets alerted, you get paged. If that goes down, you're in trouble. So let's
            
            
            
              think about a runaway query in the database. Runaway query
            
            
            
fills up the transaction log of a SQL Server database. If that happens in SQL
            
            
            
              Server, no more writes can hit that database.
            
            
            
              Now there's no point in really writing a chaos engineering experiment
            
            
            
              to test that. We know what's going to happen.
            
            
            
              But what about your monitoring? When did you get alerted there
            
            
            
              was an issue? Did you get an alert at the end saying, oh, the transaction
            
            
            
log's full, you need to do something? Or did you get preemptive alerts saying,
            
            
            
              hey, there's a lot of growth happening here, you need to investigate
            
            
            
              before an issue occurs. So maybe we could write a chaos engineering
            
            
            
experiment to test that.
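As a rough illustration of the kind of preemptive check that alerting could be built on (the instance, database and threshold here are assumptions, not anything from the talk):

```powershell
# Hypothetical sketch: poll transaction log usage and warn before the log fills,
# rather than alerting only once writes are already blocked.
$logUsage = Invoke-Sqlcmd -ServerInstance 'AP-SQL-01' -Database 'SalesDB' -Query @'
SELECT DB_NAME() AS database_name, used_log_space_in_percent
FROM sys.dm_db_log_space_usage;
'@

if ($logUsage.used_log_space_in_percent -gt 80) {
    # In a real setup this would raise a monitoring/PagerDuty alert rather than write to the console.
    Write-Warning "Transaction log for $($logUsage.database_name) is over 80% full - investigate before writes are blocked."
}
```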
            
            
            
Thinking of another one, let's go backups. I've already mentioned backups.
            
            
            
              We test our backups, right? We back them up to the same server the databases
            
            
            
are on? No. If that server goes offline,
            
            
            
              we've lost not only our databases, we've lost the ability to recover them.
            
            
            
              So we put them off site somewhere on
            
            
            
              an external server. We just back them up there and
            
            
            
              leave them, right? No, we regularly run restore
            
            
            
tests against those backups. We verify them and then we regularly restore those
            
            
            
              databases because issues with restores
            
            
            
              do happen. A few jobs ago, I was working as
            
            
            
a SQL Server DBA and I was using a third-party tool to back up
            
            
            
              my databases. The reason for this was we had kind of
            
            
            
large databases and I wanted to have backup compression. We were working with SQL Server 2005 (this is how long ago it was), and SQL Server didn't have native backup compression. So this third party tool did it. I'm not going
            
            
            
              to mention it because I don't want to get sued,
            
            
            
but it gave us the compression. It was great, it worked everywhere. I had implemented it absolutely everywhere, compressing my backups. Happy as Larry.
            
            
            
One day we had an issue in production and I needed to restore a three-terabyte database.
            
            
            
              Not a problem, I said, I have the backups, they've all been verified.
            
            
            
              Went to the software, clicked the backup file,
            
            
            
              clicked restore, and I also clicked a
            
            
            
function that this third party tool had called Instant Restore,
            
            
            
              because who wouldn't want to instantly restore a three terabyte
            
            
            
              database? Awesome. So I clicked instant restore,
            
            
            
              I hit execute and the software went to 99%
            
            
            
              and there it stayed for 5 hours and then promptly failed.
            
            
            
              I hadn't been regularly testing my restores using that feature.
            
            
            
              I should have regularly been restoring databases to a test server
            
            
            
              using those settings so that I know that that process will
            
            
            
              work as I expect it to when I really
            
            
            
              needed it. And guess what? I do now. I mean, there's loads of tools we
            
            
            
have out there these days. Has everyone here heard of dbatools, the PowerShell module? Yay? Nay? Yep, we got a few in there. So it's a PowerShell module specifically designed for database administration, and you can do all sorts of database administration tasks with it. So I have it running on a separate server and it picks databases at
            
            
            
              random, grabs their backups, restores them for me to another server,
            
            
            
              and then flags any issues, if there are any. So that when I
            
            
            
need to restore a production database,
            
            
            
              I know that process will work.
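A rough sketch of that kind of automated restore check with dbatools might look like the following; the instance names and database selection are assumptions, and output property names can vary between dbatools versions.

```powershell
# Pick a few user databases at random and restore their latest backups to a test
# instance; Test-DbaLastBackup also runs DBCC CHECKDB against the restored copies.
$databases = Get-DbaDatabase -SqlInstance 'PROD-SQL-01' -ExcludeSystem |
    Get-Random -Count 3 |
    Select-Object -ExpandProperty Name

Test-DbaLastBackup -SqlInstance 'PROD-SQL-01' -Destination 'TEST-SQL-01' -Database $databases |
    Where-Object { $_.RestoreResult -ne 'Success' -or $_.DbccResult -ne 'Success' } |
    Format-Table SourceServer, Database, RestoreResult, DbccResult
```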
            
            
            
              Okay, more failures. Let's go for an obvious
            
            
            
one. Let's go user error. Who here has ever run an UPDATE statement against a production table without a WHERE clause? So you've
            
            
            
              updated all the rows instead of just the ones you wanted.
            
            
            
              Be honest. Happened to
            
            
            
              a friend of mine in a previous job that I did. He ran an update
            
            
            
statement against a production table, the outcome of which was that, for a period of about half
            
            
            
              an hour, a 15% discount was applied to every single
            
            
            
              transaction that company did. Lovely.
            
            
            
              So this is a test of maybe our deployment pipelines or our
            
            
            
              practices. Should you have been able to run an ad hoc update query
            
            
            
              against those production servers? Do your admins really need admin
            
            
            
              access? All the time? Really?
            
            
            
              So maybe this is a way we can test a deployment pipeline. Maybe this
            
            
            
              is a test of our security policies. Does he need admin
            
            
            
access all the time? The outcome of this actually was: we restricted our AD accounts to read-only access to the databases, and then when we needed to do stuff like that, we logged in with SQL authentication accounts. And then we installed another tool called SSMSBoost into Management Studio, a tool for managing SQL Server databases. And that would scan any script
            
            
            
that you're about to run on certain servers and say, hey, this UPDATE statement doesn't have a WHERE clause. Are you sure you want to run this?
            
            
            
              So maybe there's extra steps we could put in there. Maybe there's something around that
            
            
            
              we could test to try and prevent human error. You're never going to get rid
            
            
            
of it completely. One of the things with SSMSBoost is you just got really used to clicking OK. Are you sure you want to run this? Yes, click.
            
            
            
              But maybe there's something we could test there.
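One hypothetical guard rail for that kind of ad hoc change, purely as an illustration (the table, column and expected row count are made up, not from the talk):

```powershell
# Wrap the ad hoc update in a transaction, check how many rows it touched,
# and only commit if the count looks right.
Invoke-Sqlcmd -ServerInstance 'PROD-SQL-01' -Database 'SalesDB' -Query @'
BEGIN TRANSACTION;

UPDATE dbo.Transactions
SET    Discount = 0.15
WHERE  TransactionID = 12345;   -- the WHERE clause that was missing in the story

IF @@ROWCOUNT = 1
    COMMIT TRANSACTION;
ELSE
    ROLLBACK TRANSACTION;       -- touched more (or fewer) rows than expected
'@
```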
            
            
            
              Okay, let's do one more. Let's go nuclear. Let's think.
            
            
            
              A production data center outage.
            
            
            
              You're running all your production systems in a private data center.
            
            
            
              That data center has an outage. Okay.
            
            
            
              Likelihood, not very likely. Impact,
            
            
            
              pretty bad. So this would be a test of your
            
            
            
DR solution. You need to have a DR solution in place and you need to be regularly testing that DR plan. You do not want to be enacting your DR strategy for the first time when
            
            
            
              you have an issue in production. Now I've said fairly unlikely,
            
            
            
and it is, but it does happen. Happened to an ex-colleague of mine not two months ago. Their production data center
            
            
            
had a power outage. They didn't have a DR plan. They were out for 4
            
            
            
              hours. Think of the damage that'll do to a business. You need
            
            
            
to have a DR strategy in place and you need to be regularly testing it.
            
            
            
              Okay. So we do the likelihood impact map. We think about
            
            
            
              all these failures because we want to identify things like which failure has the highest
            
            
            
              likelihood, which failure has the highest impact, and what will we
            
            
            
              gain from testing these failures? Remember, we want to get some actionable results from running
            
            
            
these chaos engineering experiments. Okay, it's great if you run an experiment and
            
            
            
              everything passes and it works exactly as we expect it to, we've built confidence
            
            
            
              in the system. But I'm pretty sure I can guarantee you
            
            
            
              you will find issues out there in the wild. And the
            
            
            
              last point there is, is there anything else that can be tested?
            
            
            
Now this is sort of: is there anything else that you haven't thought of? That's a bit of a weird question, because it's "Andrew, is there anything you haven't thought of?" I don't know, I haven't thought of it. What this means is go
            
            
            
              and talk to people outside of your team.
            
            
            
              Go and talk to, say, sysadmins, your network admins, your database developers.
            
            
            
              Talk to your end users because I guarantee they
            
            
            
              will come up with perceived weaknesses or failures that they've seen that you
            
            
            
              haven't thought of. So maybe there's something there that you can test to further build
            
            
            
              confidence in all of your systems that they will react
            
            
            
              as you expect them to when they encounter a failure.
            
            
            
              So let's go ahead and let's have a look at running an experiment.
            
            
            
              So what test are we going to run? So I've
            
            
            
              already mentioned it. Let's have a look at what happens if the primary node
            
            
            
              in an availability group cluster fails.
            
            
            
              So we're going to have a two node cluster. That primary
            
            
            
              node is going to go down. What are
            
            
            
              we expecting to happen? So let's define
            
            
            
the experiment. The hypothesis: the listener of the availability group
            
            
            
              should remain online. The listener is the endpoint. This is what all the applications
            
            
            
              and users connect to so that they can access the databases. So if the
            
            
            
primary node goes down, we want the availability group to fail over to the secondary node and that listener to remain online so that our databases are accessible. The method:
            
            
            
Now, I've said we can run T-SQL statements to fail availability groups over, but that's a very sanitized way of testing.
            
            
            
              So how about stopping the database engine service on
            
            
            
              that primary node? So we're simulating that primary node going down
            
            
            
              and we'll see what happens with that availability group.
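As a sketch of that method, stopping the database engine service on the primary could be as simple as the following; the node name is an assumption, and MSSQLSERVER is the service name for a default instance.

```powershell
$primaryNode = 'AP-SQL-01'

# Simulate the primary node going down by stopping the SQL Server service on it.
Invoke-Command -ComputerName $primaryNode -ScriptBlock {
    Stop-Service -Name 'MSSQLSERVER' -Force
}

# dbatools offers an equivalent that also understands named instances:
# Stop-DbaService -ComputerName $primaryNode -Type Engine
```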
            
            
            
              Of course, because we're running this test, we need a rollback
            
            
            
              strategy. Now if this was me, even if I was doing it in a development
            
            
            
              environment, in my work, I would have a set load of scripts
            
            
            
              that would perform a point in time restore of all the databases there just in
            
            
            
case something went horribly wrong.
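Those rollback scripts might look roughly like this with dbatools; the instance, backup path and database name are assumptions.

```powershell
# Point-in-time restore of a database as a rollback, in case the experiment
# leaves things in a bad state.
Restore-DbaDatabase -SqlInstance 'AP-SQL-01' `
    -Path '\\backupserver\sql\AdventureWorks' `
    -DatabaseName 'AdventureWorks' `
    -RestoreTime (Get-Date).AddMinutes(-10) `
    -WithReplace
```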
            
            
            
But as this is just a demo environment, all I'm going to do is restart the database
            
            
            
              engine service on the primary node. So let's go
            
            
            
              ahead and let's run a chaos engineering experiment.
            
            
            
              Okay, here we have a two node availability group cluster.
            
            
            
So we have AP SQL AG one, which is our primary, and AP SQL AG two, which is the secondary.
            
            
            
So let's use SQL to fail over from the primary to the secondary so
            
            
            
              we can execute this. And this is where we tell SQL to perform the failover
            
            
            
              of our availability group from one to two.
            
            
            
              And there we are, that's completed. So if we do a refresh here,
            
            
            
              we can see one is now the secondary and
            
            
            
              two is our primary. So if I connect back to
            
            
            
              the one, I can fail it back just to verify that failover works
            
            
            
              both ways. And now we're saying fail back
            
            
            
              from two to one. So we're verifying that the
            
            
            
              AG can fail from one node to the other and then back again.
            
            
            
              And there we go, that's completed. So if I do a refresh,
            
            
            
              one is back to our primary and
            
            
            
              two is our secondary. But this isn't how an availability
            
            
            
group will fail over in the wild. So let's run a chaos engineering experiment to simulate situations in production
            
            
            
              that an availability group would experience, say the primary
            
            
            
              node failing. And then we'll see if the availability group can
            
            
            
still fail over. So if we jump into Visual Studio Code, I have this script here. Scrolling to the top, the first thing it does is just grab the listener name by running some SQL, and then we have a load of Write-Hosts here just to make it basically look pretty, to be honest. And then we run a Test-NetConnection against the listener to make sure the listener is online.
            
            
            
And then we put that into a Pester test to verify whether or not it's online. If Test-NetConnection comes back with a value of true, the Pester test will pass; anything else, the Pester test will fail. And this is our steady state hypothesis.
            
            
            
              We run this to make sure the listener is online before
            
            
            
              we actually run our chaos engineering experiment.
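A minimal sketch of that steady state check, assuming a listener called AP-SQL-LISTENER on port 1433 and the Pester module installed:

```powershell
$listener = 'AP-SQL-LISTENER'

Describe 'Steady state hypothesis' {
    It "availability group listener $listener is online" {
        # Test-NetConnection returns an object with a TcpTestSucceeded flag.
        $result = Test-NetConnection -ComputerName $listener -Port 1433
        $result.TcpTestSucceeded | Should -Be $true
    }
}
```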
            
            
            
So if the steady state hypothesis
            
            
            
              passes, we then stop the
            
            
            
              database engine service on the primary. So we're simulating
            
            
            
              the primary failing and then we test
            
            
            
the listener again. If the listener is online, it means it's failed over to the secondary node, the availability group is online,
            
            
            
              and SQL has reacted to the primary node shutting down
            
            
            
              as we expect it to. So let's run our chaos
            
            
            
              engineering experiment. Let's make sure I'm in the right location and
            
            
            
              let's run. So first thing we're
            
            
            
doing is our steady state hypothesis: is that listener online?
            
            
            
              Listener is online. So now we can shut down SQL on
            
            
            
              the primary node. So wait for that to go down.
            
            
            
              And now we're running our test again to see if that listener is
            
            
            
              online. And boom, there we are.
            
            
            
              Looks like our chaos engineering experiment
            
            
            
              has passed. Let's have a look in SQL itself
            
            
            
              and let's do a refresh.
            
            
            
And one is now the secondary and two is the primary. So the same as running the SQL: by shutting down the node, sorry, by shutting down SQL on the primary node, the availability group has failed
            
            
            
              over to the secondary node as we expected it to.
            
            
            
              So let's run the test again. Let's fail back from two to one,
            
            
            
              verifying that the availability group can fail both ways. So same
            
            
            
              again. Starting off is that listener online
            
            
            
              on two. It's online. So now we can shut down the
            
            
            
              database engine service on that node. And let's see if
            
            
            
              the availability group fails over. So database
            
            
            
              engine service going down and
            
            
            
              we rerun our testnet connection against the listener.
            
            
            
So this has taken a little bit longer than before. This might mean something's gone wrong. We'll have to give it a little bit to see. And there we go, it looks like it's failed. Yes. Okay, so the listener is no longer online. There we go, and the chaos engineering experiment performs its rollback.
            
            
            
              So let's have a look in SQL.
            
            
            
              So refresh here.
            
            
            
              Let's do a refresh here.
            
            
            
Okay, so two is still the primary and one is still the secondary. So our chaos
            
            
            
              engineering experiment has failed. We were unable to fail that availability
            
            
            
              group back from two to one. Now the reason for this
            
            
            
              is settings in the clustered
            
            
            
              role. If I come down here to properties, I click failover.
            
            
            
              We have these settings here. And what these settings are saying is
            
            
            
the maximum failures in the specified period can be one and the period is 6 hours. Now what this is saying is that if there is more than one failure of that clustered role, the availability group, in 6 hours, then on the next failover the clustered role, the availability group, will be left in a failed state.
            
            
            
So I can use SQL all I want to fail the availability group backwards and forwards.
            
            
            
              But if the availability group experiences a primary node
            
            
            
              failure like we've simulated with our chaos engineering experiment more than once
            
            
            
              in 6 hours with these settings, that failover
            
            
            
won't work. And guess what the default settings are for a clustered role? They're these ones here.
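Those settings can also be checked and changed from PowerShell with the FailoverClusters module, roughly like this; the availability group role name is an assumption.

```powershell
Import-Module FailoverClusters

# The availability group is a clustered role (group) on the Windows cluster.
$agRole = Get-ClusterGroup -Name 'AG1'

# In the demo the defaults were 1 failure (FailoverThreshold) in a 6 hour period (FailoverPeriod).
$agRole | Select-Object Name, FailoverThreshold, FailoverPeriod

# Allow more failovers within the period, as in the demo.
$agRole.FailoverThreshold = 10
```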
            
            
            
              So by performing this chaos engineering experiment, we've identified
            
            
            
              a setting in our availability group that
            
            
            
              could possibly cause SQL to react in a
            
            
            
              way we don't expect it to. So we could come in,
            
            
            
make changes here. I'm just going to pick, say, ten, and let's rerun our experiment.
            
            
            
              So let's check that the listener is online again.
            
            
            
              It's online. And now let's shut down that database engine service on
            
            
            
two, and let's see if the setting changes we've made will
            
            
            
              allow that availability group to fail over. So waiting for the database
            
            
            
              engine service to go down and we're rerunning our test.
            
            
            
              Boom, listener is online.
            
            
            
              Come into SQL, let's do a refresh.
            
            
            
SQL is now the... sorry, one is now the primary and two is the secondary. So by making that configuration change, we are now able to see SQL failing back over from two when we shut down the database engine
            
            
            
              service on two.
            
            
            
              Okay, little bit of a contrived experiment, but those are the default settings in that
            
            
            
              clustered role of one failure every 6 hours. And I have to
            
            
            
              admit, when I was building this test, this demo,
            
            
            
              I was sort of sitting there thinking, I can't really remember
            
            
            
              the last time I checked that setting for my production SQL
            
            
            
              instances. And I went and checked and guess what, a couple of them
            
            
            
              had those default settings I need to go in and change. So even
            
            
            
              just by performing a chaos engineering experiment against this SQL instance
            
            
            
              in a lab, I got some actionable results that I could take
            
            
            
              away and put into my production environment.
            
            
            
              Okay, so that
            
            
            
              is running a chaos engineering experiment against a SQL
            
            
            
              always on availability group. Now to round the session off, I want
            
            
            
to talk about SQL Server running on Kubernetes. Anyone here got experience with Kubernetes? Oh, loads of you. All right,
            
            
            
              cool. I love Kubernetes. I think it's really cool. I really
            
            
            
think it's really, really cool that we can run SQL Server on it, because I like to take advantage of Kubernetes' built-in high availability features.
            
            
            
              So I have my desired state. I define my deployment in
            
            
            
say a config file or a YAML file, and then I deploy it into Kubernetes
            
            
            
              and it becomes my running state. And Kubernetes is constantly checking between the two
            
            
            
              to see if there are any issues. If there are any issues,
            
            
            
              Kubernetes will fix them for me. So if my instance of SQL running in my
            
            
            
              pod has an issue, Kubernetes will spin me up a new one automatically.
            
            
            
And I was demoing this to my boss, and we'll run through that now. He was saying, that's great, Andrew, but this is a brand new platform for SQL Server. So doing it in the command line,
            
            
            
              that's okay. But can you jazz up your presentation a little bit? Can you make
            
            
            
              it a little bit more wow factor? I went, well, you want
            
            
            
me to use a GUI? And he went, yeah, yeah, a nice GUI would be good. And I went, great, I can use a GUI. And wouldn't it be great if that GUI was Space Invaders?
            
            
            
This is a project I mentioned earlier. It's on GitHub.
            
            
            
              It's called Kube Invaders. And guess what? Your pods are the invaders and you play
            
            
            
              the spaceship and you can nuke your pods and you can test
            
            
            
your pods' high availability by killing them playing
            
            
            
              Space Invaders. So let's run through the
            
            
            
demo. So I'm first on my laptop here and then we'll run up into Azure, if I knew what I was doing. So I have a one-node cluster running on my laptop here. It's called MicroK8s, which is actually really good.
            
            
            
If you're running on Windows you have the ability to run Docker Desktop and that
            
            
            
              has a Kubernetes feature. You just have a tick box and you've got
            
            
            
it up and running. So let's run one pod with SQL Server and let's expose that on its default port. So there
            
            
            
              we are.
            
            
            
              There's my pod, there's my service up and running. Lovely stuff.
            
            
            
              So I can connect to it. So say this is my steady state hypothesis.
            
            
            
Is SQL up and running? Connecting with mssql-cli, just a little command line tool. And there we are, we have Microsoft SQL Server 2019 RC1. So there's my pod. We can see its IP address and an age of about 28 seconds.
            
            
            
              Let's nuke that pod. There we go. So what's happened here
            
            
            
              is the running state no longer matches the desired state. Kubernetes should automatically
            
            
            
say, oh hang on, something's not right here, and it should fix it for me.
            
            
            
              So what we should see when this comes back is that
            
            
            
              I now have a new
            
            
            
              pod running for 16 seconds with a new IP address and
            
            
            
              I can further test because
            
            
            
              I still have my service and I can run my query again and that
            
            
            
should, using the same IP address, go in, and I should be able to query that and get my version back. So that's a very, very basic chaos engineering experiment against SQL Server running on Kubernetes, making use of Kubernetes' built-in HA systems.
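The flow just described might look roughly like this from the command line; the image tag, password and names are assumptions rather than the exact commands used in the demo.

```powershell
# Deploy one SQL Server pod behind a deployment and expose it on the default port.
kubectl create deployment mssql --image=mcr.microsoft.com/mssql/server:2019-latest
kubectl set env deployment/mssql ACCEPT_EULA=Y SA_PASSWORD='Testing1122!'
kubectl expose deployment mssql --type=LoadBalancer --port=1433

# Steady state: is SQL up and answering queries through the service?
kubectl get pods
mssql-cli -S <service-ip> -U sa -P 'Testing1122!' -Q 'SELECT @@VERSION;'

# Inject the failure: delete the pod and watch the deployment replace it.
kubectl delete pod -l app=mssql
kubectl get pods   # a new pod with a new name, a new IP and a fresh age
```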
            
            
            
              Okay so that's great. Let's clean up and
            
            
            
              I'm going to switch to a cluster running up in Azure and
            
            
            
              I'm going to deploy ten pods to it.
            
            
            
Let's see how that gets on. Okay, I've got a few pods coming. There we are, running,
            
            
            
              and now I can play my game. Oh hang on. Let's grab that.
            
            
            
              Here we go. And okay let's
            
            
            
              get rid of that one. Let's get rid of that one. And we
            
            
            
can see the pods on the side there being killed, and
            
            
            
              hopefully Kubernetes will realize that the running state
            
            
            
              no longer matches the desired state and it will fix them for me.
            
            
            
              It should sync up and I should get a whole bunch of brand new pods
            
            
            
              coming into the system there.
            
            
            
              Another cool thing about this is I can stick it on automatic mode and I
            
            
            
can let it play for me, so I can let that go off and I can stand here and talk about it, and it can decide when it wants to start killing pods and if any of them are going to sync back in. One of the really nice things about this is that it's shooting little mini Kubernetes logos, which I think is a really
            
            
            
              nice touch.
            
            
            
              So we can let that play there. Some of them are coming back in.
            
            
            
              There we go. See, Kubernetes realized that the
            
            
            
              running state no longer matched the desired state and fixed that for me,
            
            
            
brought all those pods back in. And I'm going to let that play and it'll
            
            
            
              just keep killing them as I'm talking.
            
            
            
              So let's do a cleanup there and jump
            
            
            
back into the slides. So to round off, I've just got a couple of resources there. The top one there is the link to my GitHub
            
            
            
              account with the slides and the code for all the demos I just ran.
            
            
            
              And the second link there is to the Kube
            
            
            
              Invaders project on GitHub as well. Thank you very much.
            
            
            
              Are there any questions? Yes,
            
            
            
              any sort of persistent storage like relational
            
            
            
              databases within Kubernetes and Docker?
            
            
            
              Are you talking about that as just a playground?
            
            
            
              No. At the moment we are in the middle of running a
            
            
            
proof of concept. So the question is, is SQL Server
            
            
            
              running on Kubernetes and Docker just a playground? Are we looking at something
            
            
            
              more serious? We are looking at something more serious. We've built our own Kubernetes
            
            
            
cluster running off a Pure Storage array, with
            
            
            
              obviously provisioning persistent volumes for the databases,
            
            
            
the system databases, transaction logs, and then backing them up off site.
            
            
            
              And then we are properly nuking it. I've managed
            
            
            
              to corrupt quite a few databases so far,
            
            
            
but that again will figure into how quickly we can restore, and it's a lot of learning processes here, because maybe we got the configuration wrong.
            
            
            
The Pure Storage arrays are lightning fast, so that's not going to be an issue, I reckon. I think it's more us nuking pods whilst there are heavy transactions going on, which would happen if you ran SQL Server in a VM,
            
            
            
              to be honest.