Conf42 Python 2023 - Online

Automate your CloudOps and Spend Less Time Doing Operations

Abstract

Studies show that DevOps teams spend as much as 55% of their time on system and operational tasks. In this talk, we’ll use Pythonic Jupyter Notebooks to automate common DevOps tasks, and introduce a repository with hundreds of prebuilt Connectors and Actions to get started immediately.

Summary

  • This talk is about automating your CloudOps so that you can spend less time doing operations and more time building the things you need to build. It starts with internal documentation: runbooks and checklists. Application support and operations can take up as much as 55% of developer time, and good, up-to-date runbooks lower mean time to resolution.
  • unSkript is an open source CloudOps automation platform built on top of Jupyter notebooks. Each notebook cell holds Python code, and documentation can sit between the cells. Because it's Python, each step of a runbook can be automated.
  • There are hundreds of actions built into unSkript, so you can have a working runbook in just a matter of minutes. The demo shows how easy it is to start from scratch and have something running in almost no time.
  • unSkript can be used to automate DevOps tasks and to organize your internal documentation. There are hundreds of automations built in and it's easy to create new ones, which you can contribute back to the open source project.
  • The talk references a blog post about managing site reliability for some of the world's busiest gambling sites. Doug Sillars runs developer relations at unSkript. The easiest way to get hold of him is on Twitter.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Welcome to Python 2023. Today I'm going to be speaking about automating your CloudOps so that you can spend less time doing operations and more time building the things you need to build.

Let's take a quick step back and talk about documentation. As developers, we often think about documentation as something external: when we want to build something, we go to a website and read what we need to do in order to implement a new feature. But documentation is also really important internally, right? When something goes sideways, do we have a checklist of how to fix what's wrong? When we need to roll out new infrastructure, do we have docs that walk us through all of the steps from a CloudOps, DevOps, or site reliability engineer point of view? A lot of times these are called runbooks or checklists. These are just things that we have internally to help us get through our day and do all of the tasks that need to get done. And I think what we can do to make our lives better is automate some of these things, because a recent study found that application support and operations take up as much as 55% of your developer time, giving you a lot less time to build those new features or functions that you need to work on.

So clearly there are advantages. But let's actually walk through a blog post. This is a blog post written by a gentleman whose job was to be a site reliability engineer for a gambling website. Now, if you think about a gambling website, its entire motivation is to take money from people. So if it goes down, it can't take in money. So site reliability, keeping the site up and making sure that everything is available, is really, really important. So this guy walked through some of the things he found. He found that when he had good documentation and good runbooks, issues were resolved a lot faster. He found that the escalations were better.
If they had a good checklist, maybe the on-call engineer could solve it and they didn't need the expert. But if they did need the expert, they knew who to reach out to. Maybe even the NOC team could solve the issues because they could walk through the steps. They found that it was easier to onboard people, right? Because if you're hiring somebody and you say, here's how everything works, here's the checklist, it's a lot easier and a lot faster to understand what's going on. And they found that that improved training. And as people came on board and started using all of the runbooks that he created, they found that they wanted to update the runbooks, so the runbooks didn't get stale. That's another problem that happens with your documentation, right? Oh, jeez, we wrote that six months ago, and everything's changed since then. That doesn't help you when things are broken. You need to have something that's up to date so you can go boom, boom, boom, boom, boom and fix the outage. That lowers your mean time to resolution. And of course, once you have it all written down, you can start to automate it, which is where we're going to get to in this talk.

Now, this guy had a really, really cool boss. His boss said, all right, you can take seven months off and just build these runbooks. And he looked at every single issue, every single outage that happened for the last 18 months, and built hundreds and hundreds of runbooks. And then once he had them, he had to spend 10% of his time on an ongoing basis to keep them up to date. Now, let's take a step back and go back to reality. At most companies, you're not going to get seven months to stop what you're doing, or take one person off your team, just to build runbooks. We don't have that luxury. We need something that's going to get us there a little bit faster. And so that's what we're going to talk about today.
I work for a company called unSkript, and we've built an open source CloudOps automation platform. What's great about this is that it helps you get started immediately, building automations into your CloudOps, into your site reliability, into everything that you would consider DevOps. It's open source and it's built on top of Jupyter notebooks. As Python developers, you're probably familiar with Jupyter notebooks. They're online notebooks that walk through your Python code. Each one of the little boxes is a set of code, and they build on top of each other and create an overall application. And if you think about a runbook, it's a bunch of steps or actions that you take to work through a process and solve it, sort of like a checklist. Jupyter notebooks are really, really well suited to do this same thing.

So what's great about Jupyter notebooks? They're online and they're collaborative. The worst thing is when there's an outage and it's like, oh, I've got that script on my laptop. That doesn't help the other person when they're on call. So if it's online and everyone can access it, they can go through the checklist, go through the runbook, and solve the problem. That's great. It's Python, so there's no domain-specific language, there's no YAML, there's nothing like that. We understand Python, we can write with Python, easy peasy. You can easily document what you're doing. Each one of the code snippets can be separated by a doc section where you can write up what's going on, why you're doing this, and what the inputs are. Put it all in there via text or Markdown so that people can follow along with your checklist or your runbook and know exactly what's happening in each one of those steps. And because it's Python, we can automate each one of those steps. So Jupyter notebooks are perfectly suited for this sort of CloudOps automation. So let's get started.
You can find our GitHub repository at runbooks.sh. When you go there, you'll find that we have hundreds of runbooks and actions already built. So not only are Jupyter notebooks a great way to outline a runbook and build a backbone for what we're doing, but we already have a bunch of the code written, so you can just drag and drop in the pieces that you need. And if the pieces that you are looking for aren't there, it's just Python, so we can connect it all up.

So here's an example runbook that I built. It does a health check of our Kubernetes clusters. The first action here just contacts our Kubernetes cluster. I configured a login, and all the credentials are taken care of in unSkript as well. So it's really easy to write each one of these actions because the credentials are stored elsewhere. You don't have to worry about the login or having the right API key, since that's all stored inside unSkript, and you just have to write the code saying, hey, give me a list of all of the Kubernetes pods. And then once we have those Kubernetes pods, let's get the logs from each one of those. So if you have ten pods, you might get ten sets of logs. And then I wrote just a simple Python script here that says, all right, go through all those logs and tell me if there's a warning in the logs. If something's written in the logs as a warning, that's like the check engine light coming on in your car, right? Something's wrong, we should take a look at it. But nobody reads the logs. So if I run this every hour, look in the logs for the last hour, collect all the warnings, and then post a message to Slack, the team can be aware that there's a problem. What's really cool is that three of the four actions in this runbook are prebuilt. They're in unSkript today.
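As a rough illustration of the custom step in that health check (scanning pod logs for warnings and summarizing them for Slack), here is a minimal sketch. It is not the action from the talk: the function names `find_warnings` and `format_slack_message` are hypothetical, and fetching the logs from Kubernetes and posting to Slack are assumed to happen in the other, prebuilt actions.

```python
from typing import Dict, List


def find_warnings(logs_by_pod: Dict[str, str], keyword: str = "warning") -> Dict[str, List[str]]:
    """Return, per pod, the log lines containing the keyword (case-insensitive).

    logs_by_pod maps a pod name to that pod's raw log text. How the logs are
    fetched is out of scope here; this sketch assumes an earlier step did it.
    """
    needle = keyword.lower()
    hits: Dict[str, List[str]] = {}
    for pod, text in logs_by_pod.items():
        matching = [line for line in text.splitlines() if needle in line.lower()]
        if matching:
            hits[pod] = matching
    return hits


def format_slack_message(hits: Dict[str, List[str]]) -> str:
    """Build a short plain-text summary suitable for posting to a chat channel."""
    if not hits:
        return "No warnings found in pod logs."
    parts = [f"Warnings found in {len(hits)} pod(s):"]
    for pod in sorted(hits):
        parts.append(f"- {pod}: {len(hits[pod])} warning line(s)")
    return "\n".join(parts)
```

Keeping the scan as a pure function of the log text means it can be tried on sample logs before it is wired to a real cluster.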
So you just drag and drop them in and set up your configuration, which is the input parameters, like what URL you're connecting to, what your credentials are, et cetera. And we'll show that in a second. The last one I had to write. But this whole thing is in a PR to our open source repo. So if you want to run this very health check, you can, because soon it'll be in our open source repo and it'll just show up in the tool.

So when you're getting started, you end up at our GitHub repo. While you're there, give us a star if you feel like it. At the bottom there are instructions on how to install it. It's a Docker install, locally onto your machine, or you can put it up in the cloud, right into your cloud instance, so that your team can share it. But rather than describe it, let's just go straight into Chrome and see how it works.

When you fire up the Docker version of unSkript, you land at a welcome page where you can read about us, report bugs, request features, and explore the docs. Standard stuff. But here's a list of all of the runbooks, and if you click one of these, it'll open up in a new tab and you can actually configure your runbook so that you can run it. Let's also look over here. When you're writing a runbook, you can save it. You can add actions or notes. An action is code; the notes are the Markdown or text, code comments essentially. Credentials I'll talk about in just a second. But we can also have input parameters. With an input parameter, when you run this runbook, you could set the region, if you're running on AWS, so that it is always checking the same region. Or you could send in other parameters, like what you want to study. So you can input variables into the entire runbook that can then be applied to each one of the actions in your runbook.

So, credentials. Credentials are the way we connect into the different services. And obviously you can't just have those open to the Internet.
We need credentials for the different services. But as you can see here, you can connect to almost every type of database that's out there. Your observability platforms are all there. Your CI/CD platforms too. We support all the major clouds and Kubernetes, and we can send to Slack. And if you don't see a service you're using in this list, we offer REST APIs, or you can just SSH into the box that you're using if you want to do that.

Okay, but let's actually build a runbook. I just want to show you how fast and easy it is to start from scratch and have something running in almost no time. So I have this empty runbook here. Every single runbook has an unSkript internal action that runs first. What I want to do here is get a list of all my EC2 instances at AWS, and then we'll play around with that list of instances. There are hundreds of actions built into unSkript. We can look at Google Cloud and see all of the actions that are available. We can look at Kubernetes; some are listed like this and some as k8s. We can look at Elastic, and we can see, well, it helps if I could spell. We're going to look at AWS, filter all images. So I want to filter all EC2 instances. I'm going to just drag this action over. This is the one I want to use. If we open this up, we can see there's Python code in here. It doesn't really matter, because we don't need to change it. If you did need to change it, you could go in there and edit it. Not a big deal. But let's just configure it and get it set up so it can be run. Credentials are set up in here, or you can add a credential right here. I already have AWS credentials connected into my account. That way I don't have to show my secret key in the video. So now unSkript knows how to connect into my AWS infrastructure. And then I want to find all the EC2 instances for a specific region.
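For readers who want to see what "find all the EC2 instances for a specific region" can look like in plain Python, here is a minimal sketch using boto3's `describe_instances` paginator. It is not the code inside the unSkript action; the helper names are hypothetical, and it assumes boto3 is installed and AWS credentials are already configured when used against a real account.

```python
from typing import Any, Iterable, List


def instance_ids_from_pages(pages: Iterable[dict]) -> List[str]:
    """Collect instance IDs from describe_instances response pages."""
    ids: List[str] = []
    for page in pages:
        for reservation in page.get("Reservations", []):
            for instance in reservation.get("Instances", []):
                ids.append(instance["InstanceId"])
    return ids


def list_instance_ids(ec2_client: Any) -> List[str]:
    """List all instance IDs visible to the given EC2 client (one region)."""
    paginator = ec2_client.get_paginator("describe_instances")
    return instance_ids_from_pages(paginator.paginate())


# Example usage (assumes boto3 and configured credentials):
#   import boto3
#   region = "us-west-2"  # plays the role of the runbook's input parameter
#   instances = list_instance_ids(boto3.client("ec2", region_name=region))
#   print(instances)
```

Passing the client in, rather than creating it inside the function, keeps the region and credentials as configuration, much like the runbook's input parameters and stored credentials.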
And this is just a Python variable, so I can just say us-west-2. I can apply these changes and run this, and we get a list of all the instances. Another way we could do this: if you remember, earlier I talked about input parameters. In this runbook, I have an input, region, as a variable set to us-west-2. So if I instead use the variable here, apply that, and hit run, we're going to get the same answer, but it's using the variable. Now what if I wanted to use these instances somewhere? I can set up an output and save this into the instances variable, and then rerun it so that it's in the variable. Let's add an action and do something really simple, like print instances, and it printed them. That makes sense. But what are some other things I could do here? I could take a list of EC2 instances and restart them. In my configuration I could say instances, because it's in a variable, set my input region, apply these, set the credential, and then restart all the instances. Some of these are running in production, so I'm not going to do that for the video. But you get the idea of how easy it is to quickly drag and drop in the bits that you want, and you can have a working runbook in just a matter of minutes.

I'm going to give one more example here. Again, it's an AWS example, but the idea here is service quotas. There are certain features inside AWS where Amazon gives you a limit. You can only have so many of these in your AWS infrastructure. For example, for Elastic IPs for EC2, in one region you can only have five. The way you find this out is you've got to know the service and you need to know this random code that exists here. And then you can call an API and you get the number five.
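The quota lookup described here (a service, a quota code, one API call) can be sketched with boto3's Service Quotas client and its `get_service_quota` call. The helper name is hypothetical, and the quota code in the usage comment is my assumption for the EC2-VPC Elastic IPs quota; it is not shown in the transcript and should be checked against your own account.

```python
from typing import Any


def get_quota_value(quotas_client: Any, service_code: str, quota_code: str) -> float:
    """Return the applied quota value for one service and quota code."""
    response = quotas_client.get_service_quota(
        ServiceCode=service_code,
        QuotaCode=quota_code,
    )
    return response["Quota"]["Value"]


# Example usage (assumes boto3 and configured credentials):
#   import boto3
#   client = boto3.client("service-quotas", region_name="us-west-2")
#   # "L-0263D0A3" is assumed to be the EC2-VPC Elastic IPs quota code.
#   print(get_quota_value(client, "ec2", "L-0263D0A3"))
```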
Now we had a customer who was running into problems because their automations were failing. It turned out they were hitting some of their service quota limits and they didn't know it, so their automation was failing because they couldn't actually provision more, in this case, Elastic IPs. So let's look at unSkript: we actually have five of these in use right now. So we can't add another Elastic IP at this point; we're at the limit. This was just a runbook that I created for this talk. Our customer was really interested in this, so we prebuilt a series of actions to help them understand where they stood. Here's an action that takes every single VPC service quota, finds the quota limit, and then also queries to see how many are used. Again, the parameter is set here. The region is set to us-west-2. And I am saying, let me know if I've hit 50% of my usage. So in this case, for my NAT gateways per availability zone, which has this code right here, the limit is five and I've used four of them. Then there's another action that I have right here that says, let's ask for an increase. And this isn't automatic; it gets put into a queue and then it gets run. But let's look at this. In the configuration I have here, again, the region is a variable set up here. And then I'm saying for VPC, because these are VPC services, here's the quota code. Because this is a demo, the quota code is right here, and it matches right here. You could programmatically say, take the output from here, find the quota code, and then run it. But in this case, it's hard-coded. You get the idea. And I only have one shot at this, because once you ask for an increase, you can't ask for another increase until it's been completed. So I'm going to run this, and I forgot to put in my credentials. So let's put in the credential and now let's run it.
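The two steps in this part of the demo, flagging a quota once usage crosses a threshold and then asking for an increase, might be sketched as follows. This is an illustration under assumptions, not the prebuilt unSkript actions: the helper names are hypothetical, the usage count is assumed to come from a separate query, and the increase uses boto3's `request_service_quota_increase`, which queues a request rather than applying it immediately.

```python
from typing import Any


def over_threshold(used: float, limit: float, threshold_percent: float = 50.0) -> bool:
    """True when usage is at or above the given percentage of the limit."""
    if limit <= 0:
        return False
    return (used / limit) * 100 >= threshold_percent


def request_increase(quotas_client: Any, service_code: str,
                     quota_code: str, desired_value: float) -> str:
    """Submit a quota increase request and return its reported status."""
    response = quotas_client.request_service_quota_increase(
        ServiceCode=service_code,
        QuotaCode=quota_code,
        DesiredValue=float(desired_value),
    )
    return response["RequestedQuota"]["Status"]


# Example usage (assumes boto3 and configured credentials; the quota code is
# left as a placeholder because the transcript does not state it):
#   import boto3
#   client = boto3.client("service-quotas", region_name="us-west-2")
#   if over_threshold(used=4, limit=5):
#       print(request_increase(client, "vpc", "<quota code>", 10))
```

With the numbers from the demo, four NAT gateways used against a limit of five is 80% usage, so a 50% threshold flags it.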
And it says I have asked for ten. I forgot to show that I asked for ten. I said, hey, for VPC, for this quota code, the limit was five, I want ten. And so I just sent an API request saying, hey, I want ten for this. And if you look down here, it says 200, meaning it's in the queue to now be set to ten, which is pretty awesome.

Now, one thing that's really interesting with these AWS service quota codes is that they are really hard to find. Part of the reason for that is there are about 2,600 of them. AWS is huge. There are so many things going on. As I was building these out, I really wanted to get a better understanding of what was going on. So I'm going to scroll to the top of this runbook real quick to show you what I did to better understand this. The first thing I did, after I installed some pip packages, was to get the list of all the service names. The service name here is VPC. If you run this, it's going to give you a list of all the service names, and there are 222 of them. Cloud9, Cognito, all these different things that Amazon offers. I'm just going to minimize that. Then I built another action, and we'll look at this here, that says, for us-west-2, for a specific service, like VPC or EC2, get me all the service quotas. Now, you may notice that I'm not actually putting in the service name here, and that's because I'm using an iterator. I'm taking a list, all 222 of them, and running through this. So this is going to run 222 times and give me every single service quota that exists inside AWS. I'm not going to hit run here because that takes a long time. Once I have all 2,600 of these service quotas, I can create a CSV of those and then save them to Google Sheets. You can see I actually have that action right here, and it comes over here into Google Sheets. So let's actually run it, so you can see that this actually happens.
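The iteration described above (list every service, fetch each service's quotas, flatten the result into CSV) can be sketched with boto3's `list_services` and `list_service_quotas` paginators. The helper names and column headers are hypothetical, and the upload to Google Sheets is left out, since the transcript does not show how that action works.

```python
import csv
import io
from typing import Any, List, Tuple

Row = Tuple[str, str, str, float]


def all_service_codes(quotas_client: Any) -> List[str]:
    """Return the service code of every service known to Service Quotas."""
    codes: List[str] = []
    for page in quotas_client.get_paginator("list_services").paginate():
        codes.extend(service["ServiceCode"] for service in page["Services"])
    return codes


def quotas_for_service(quotas_client: Any, service_code: str) -> List[Row]:
    """Return (service, quota name, quota code, value) rows for one service."""
    rows: List[Row] = []
    paginator = quotas_client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode=service_code):
        for quota in page["Quotas"]:
            rows.append((quota["ServiceCode"], quota["QuotaName"],
                         quota["QuotaCode"], quota["Value"]))
    return rows


def quotas_to_csv(rows: List[Row]) -> str:
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["service", "quota_name", "quota_code", "value"])
    writer.writerows(rows)
    return buffer.getvalue()


# Example usage (assumes boto3 and configured credentials; slow, because it
# makes at least one paginated call per service):
#   import boto3
#   client = boto3.client("service-quotas", region_name="us-west-2")
#   rows = [r for code in all_service_codes(client)
#           for r in quotas_for_service(client, code)]
#   print(quotas_to_csv(rows))
```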
I can come here, delete everything that's in the Google Sheet, and run this action. It says that it did it. So let's go over there, and you can see it did it. We can do it again, just because it's sort of fun to watch. Come over here. We can make this a little smaller. Run action. And bam, it created it. So now this is an automation. I can make this happen every single day. I've been running this every day for about the last week, and four or five new services are added every single day. And you can see here, here are all my VPC service quotas. Here are the names, here's that secret code, and here's the quota value for my account. If you look here, you can see my NAT gateways per availability zone, and it's set to five. I haven't refreshed these numbers yet because, remember, this is querying across 222 services. It takes a long time. So if I run it again, this might say ten now, but right now it still says five. What's really cool is that this is automated: every single day it gets uploaded to our docs. So if you're curious about this, you can go to the unSkript docs and get the quota code for all 2,640 of these. There are tons and tons of these things. It's really interesting. But again, this is one of those things that's hard to find. So why not write a little bit of automation to get it into one place so that it's a lot easier? Instead of being hidden behind a bunch of API calls, you can now look it up in a Google Sheet. That might be useful for you. Maybe you would prefer the API. For me, this was super, super helpful.

So, in conclusion, runbooks are a way to describe your internal documentation. When there's an outage, or when you need to provision something new, you have a checklist or a runbook that walks through all the different steps. And research and people's blog posts have shown that when you have really good internal documentation, your outcomes are improved. You lower your mean time to resolution.
When there's an outage, when you're ripping your hair out, the last thing you want is an out-of-date runbook. If it's up to date and it's automated, it's going to make things easier. It improves your collaboration, and you can automate it. That automation reduces manual DevOps toil: those things that you have to do, the 20 things you need to do every single day, the things that derail your day because you've got to go do stuff that you weren't planning to do, because if you don't, things will break. You can auto-remediate things, and you can increase your observability. The opportunities here are really endless. And unSkript is open source, it's Python, and it's based on Jupyter notebooks. There are hundreds of automations built in, and it's easy to create new ones that you can use yourself. If you're so inclined, you can contribute them back to our open source repo. So hopefully this has given you an idea of how you can use your Python experience to automate a lot of the day-to-day toil that you need to do to keep your infrastructure up and running.

A few resources, if you're interested. Again, our GitHub repo is at runbooks.sh. If you want to give us a star, we'd love that. You can reach out to us at unskript.com; there's a link to our Slack community there if you'd like to sign up, and if you have questions, we'd be happy to help you on your DevOps automation journey. Here's the blog post, "Things I learned managing site reliability for some of the world's busiest gambling sites". It's an interesting read. I'm Doug Sillars and I run developer relations at unSkript. If you want to get a hold of me, probably the easiest way is on Twitter. My Twitter handle is Doug Sillars, so I'm pretty easy to find. If you just search Doug Sillars, you'll find me. Thank you very much for listening, and I look forward to talking with you and helping you on your CloudOps automation journey.
You can learn more at unSkript, and thanks for watching.

Doug Sillars

Head of Developer Relations @ unSkript
