Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome to Python 2023. Today I'm going to be speaking about
automating your CloudOps so that you can spend less time doing operations
and more time building the things you need to build. Let's take a quick
step back and talk about documentation. As developers, we often
think about documentation as when we want to build something, we go to a website,
we read what we need to do in order to implement a new feature.
But documentation is also really important internally,
right? When something goes sideways, do we have a checklist of how to fix
what's wrong? When we need to roll out new infrastructure, do we have
docs that walk us through all of the steps from a
cloud ops or a DevOps or a site reliability engineer point
of view? A lot of times these are called runbooks or they're called
checklists, right? These are just things that we have internally to help us
get through our day and do all of these tasks that need to get done.
And I think what we can do to make our lives better is we can
automate some of these things, because it turns out
that in a recent study we found that application support
and operations takes up as much as 55% of your developer
time, giving you a lot less time to build those new features
or those new functions that you need to work on.
So clearly those are the advantages. But let's actually walk
through a blog post, and this is a blog post written by a
gentleman whose job was to be a site reliability engineer for
a gambling website. Now, if you think about a gambling website,
if they go down, like a gambling website's entire
motivation is to take money from people. So if they go down,
they can't take in money. So site reliability and keeping
the site up and making sure that everything is available is really, really important.
So this guy walked through some of the things he found.
He found that when he had good documentation and good runbooks,
that issues were resolved a lot faster. He found that
the escalations were better. If they had a good checklist, maybe the on call
could solve it and they didn't need the expert. But if they did need the
expert, they knew who to reach out to. Maybe even like the NOC team
could solve the issues because they could walk through the steps.
They found that it was easier to onboard people, right? Because if you're hiring
somebody and you say, here's how everything works, here's the checklist,
it's a lot easier, you understand what's going on. It's a lot faster to understand
what's going on. And they found that that improved training. Right.
And as people came on board to this and started using
all of the runbooks that he created, they found that they wanted to
update the runbooks. They didn't get stale. That's another problem that happens with your
documentation, right? Oh, jeez. We wrote that six months ago, and everything's
changed since then. That doesn't help you when things are broken,
right? You need to have something that's up to date so you can go boom,
boom, boom, boom, boom and fix the outage. Right? Lower that
mean time to resolution.
And of course, once you have it all written down, we can start automating
it, which is where we're going to get to in this talk. Now,
this guy had a really, really cool boss. His boss said,
all right, you can take seven months off and just build these runbooks.
And he looked at every single issue, every single outage that happened
for the last 18 months and built, like,
hundreds and hundreds of runbooks. And then once he had them, he had
to spend 10% of his time ongoing to keep them up to date.
Now, let's take a step back and go back to reality.
Most companies, you're not going to get seven months to stop what
you're doing or take one person on your team just to build runbooks.
We don't have that luxury.
We need something that's going to get us there a little bit faster. And so
that's what we're going to talk about today. I work
for a company called unSkript, and we've built an open source
CloudOps automation platform. And what's
great about this is it helps you get started immediately,
building automations into your CloudOps, into your site
reliability, into everything that you would consider DevOps.
It's open source and is built on top of Jupyter notebooks.
And so, as Python developers, you're probably familiar with Jupyter notebooks.
They're online notebooks that walk through your
Python code. And each one of the little boxes is a set of code,
and they build on top of each other and create an
overall application. And if you think about a runbook, it's a bunch of
steps or actions that you take to work
through a process and solve it. It's sort of like a checklist or a runbook.
And Jupyter notebooks are really, really well situated
to be able to do this same thing.
So what's great about Jupyter notebooks? They're online, they're collaborative,
right? The worst thing is when there's an outage and it's like, oh, I've got
that script on my laptop. That doesn't help the other person when
they're on call. Right? So if it's online and everyone can access it, they can
go through the checklist, they can go through the runbook and solve the problem.
That's great. It's Python. So there's no domain
specific language, there's no YAML, there's nothing like that.
It's all just, we understand Python. We can write with Python,
easy peasy. You can easily document
what you're doing. Each one of the code snippets can be
separated by a doc section where you can write up what's going
on, why you're doing this, what are the inputs?
Put it all in there via text or markdown so that people can follow along
with your checklist or your runbook, and know
exactly what's happening in each one of those steps. And because
it's Python, we can automate each one of those steps. So Jupyter
notebooks are perfectly situated for this sort of cloud ops
automation.
So let's get started. Our GitHub repository,
you can find it at runbooks.sh. When you
go there, you'll find that we have hundreds of runbooks
and actions already built. So not only do
you have this, you know, Jupyter notebooks are a great way to outline
it and build a backbone for what we're doing, but we already have
a bunch of the code written, so you can just drag and drop in the
pieces that you need. And if the pieces that you are looking for aren't there,
it's just Python, we can connect it all up.
So here's an example runbook that I built,
and what this is doing is it's doing a health check of our Kubernetes clusters.
And so the first
action here is just contacting our Kubernetes cluster.
I configured a login and everything. All the credentials
are taken care of in unSkript as well. So it's really easy to write
each one of these actions because the credentials
are stored elsewhere, so you don't have to worry about the login or having the
right API key. That's all stored inside unSkript, and you just
have to write the code saying, hey, give me a list of all of the
Kubernetes pods, right? And then once we have those Kubernetes
pods, let's get the logs from each one of those, right? So if you have
ten pods, you might get ten sets of logs. And then I
wrote just a simple Python script here that says, all right,
go through all those logs and tell me if there's a warning in the logs,
right? If something written in the logs is a warning, that's like the check engine
light coming on on your car, right? Something's wrong, we should take
a look at it. But nobody reads the logs. So if I run
this every hour and just look in the logs for the last hour and write
all the warnings and then post a message to slack, the team
can be aware that there's a problem.
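As a rough illustration of the custom step described above, the sketch below scans per-pod log text for warning lines and builds a message for the team. It is not the code from the talk: the function names, the sample data, and the message format are assumptions, and fetching the logs and posting to Slack are left to the other actions in the runbook.

```python
from typing import Dict, List


def find_warnings(logs_by_pod: Dict[str, str], keyword: str = "warning") -> Dict[str, List[str]]:
    """Return, per pod, the log lines that mention the keyword (case-insensitive)."""
    needle = keyword.lower()
    hits: Dict[str, List[str]] = {}
    for pod, log_text in logs_by_pod.items():
        matching = [line for line in log_text.splitlines() if needle in line.lower()]
        if matching:
            hits[pod] = matching
    return hits


def format_message(hits: Dict[str, List[str]]) -> str:
    """Build a plain-text summary that a later step could post to a chat channel."""
    if not hits:
        return "No warnings found in the last hour."
    parts: List[str] = []
    for pod in sorted(hits):
        parts.append(f"{pod}: {len(hits[pod])} warning line(s)")
        parts.extend(f"  {line}" for line in hits[pod])
    return "\n".join(parts)


# Hypothetical sample data standing in for the output of the log-fetching action.
sample_logs = {
    "web-1": "INFO started\nWARNING disk usage at 91%\nINFO request served",
    "web-2": "INFO started\nINFO request served",
}
print(format_message(find_warnings(sample_logs)))
```

Keeping the scan separate from the log fetch and the Slack post mirrors the runbook layout in the talk, where each action does one thing and hands its output to the next.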
What's really cool is three of the four actions in
this runbook are prebuilt. They're in unSkript today.
So you just drag and drop them in and set up your configuration,
which is the input parameters, like what's the URL you're connecting
to, what your credentials are, et cetera, et cetera. And we'll show that in
a second. The last one I had to write. But this whole thing is in
a PR to our open source. So if you wanted to run this very health
check, you can, because soon it'll be in our open source repo and it'll
just show up in the tool.
So when you're getting started, you end up at our GitHub repo.
While you're there, give us a star if you feel like you want
to. But at the bottom there's instructions on how to
install it, and it's a Docker install, locally onto your machine, or you
can put it up in the cloud, right into your
cloud instance so that your team can share it. But rather
than describe it, let's just go straight into Chrome and
we can see how it works. When you fire up the
Docker version of unSkript, you land at a welcome page: read about
us, report bugs, request features, explore the docs, right?
Standard stuff. But here's a list of all of the runbooks,
and if you click these, it'll open up in a new tab and you can
actually configure your runbook so that you can run it.
Let's also look over here. When you're writing a runbook, you can
save it. You can add actions or notes. An action
is code; the notes are the markdown or the text, code
comments, essentially. Credentials I'll talk about in just a second.
But we can also have input parameters. And so what an input parameter is,
is when you run this runbook, you could set the region
if you're running in AWS, so that it always is checking the same
region. Or you could send in parameters like I want to study
this, and so you can input variables into
the entire runbook that can then be applied to each one
of the actions in your runbook. So, credentials:
credentials are the way we connect into the
different services. And obviously you can't just have those
open to the Internet. We need to have different services. But as
you can see here, almost every single type of
database that's out there you can connect into. Your observability platforms,
they're all there. Your CI/CD platforms,
we support all the major clouds and Kubernetes. We
can send to Slack. And if you don't see a service you're using in
this list, we offer REST APIs, or you can just SSH
into the box that you're using if you'd like to do that.
Okay, but let's just actually build a runbook. I just want to show you
how fast and easy it is to start from scratch and have something running in
almost no time. So I have this empty
runbook here. Every single runbook has an unSkript internal
action that runs first. But what I want to do here is
get a list of all my EC2 instances at AWS.
And then we'll play around with that
list of instances. So what we can do here is there are
hundreds of actions built into unSkript, right? We can look at Google Cloud
and we can see all of the actions that are available.
We can look at Kubernetes,
some are like this and some are K8s, right?
We can look at Elastic and we can see... it
helps if I could spell. We're going to look at AWS, filter all images.
So I want to filter all EC2 instances.
So I'm going to just drag this action over. This is the one I want
to use. And if we open this up,
we can see there's Python code in here. It
doesn't really matter because we don't need to change it. If you did
need to change it, you can go in there and just configure it, right?
Not a big deal, but let's just configure it. Let's get this set up so
it can be run. So credentials are set up in here,
or you can add a credential right here. I already have AWS
credentials connected into my account built in.
That way I don't have to show my secret key in the video.
So now unSkript knows how to connect into
my AWS infrastructure.
And then I want to find all the EC2 instances for a specific region.
And this is just a Python variable. So I can just say
us-west-2, right.
I can apply these changes and run this and
we get a list of all the instances. Another way we could do
this is, if you remember, earlier I talked about the input parameters. In
this runbook, I have input region as a variable set
to us-west-2. So instead if I use the variable
here and
apply that and hit run, we're going to get the same answer,
but it's using the variable. Now what if I
wanted to use these instances somewhere? I can set up an output
and save this into the instances variable and
then rerun it so that it's in the variable.
And let's add an action and we can
do something really simple like print(instances)
and it printed them. Right. That makes sense. But what are some
other things I could do here? I could take
a list of EC2 instances and
restart them, right?
So in my configuration I could say instances,
right, because it's in a variable and I
could set my input region,
apply these, do the credential and then restart all the
instances. Some of these are running in production, so I'm not going to do
that for the video. But you get the idea how easy it is
to just really quickly drag and drop in the bits
that you want to do and you can have a working runbook in just
a matter of minutes.
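For readers who want to see roughly what such a step looks like in plain Python, here is a minimal sketch using boto3. It is not the code of the built-in unSkript action: the helper name `list_instance_ids` and the `main` wrapper are assumptions for illustration, and credentials are assumed to come from the standard boto3 configuration.

```python
from typing import Any, List


def list_instance_ids(ec2_client: Any) -> List[str]:
    """Collect the ID of every EC2 instance visible to the given client."""
    instance_ids: List[str] = []
    paginator = ec2_client.get_paginator("describe_instances")
    for page in paginator.paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                instance_ids.append(instance["InstanceId"])
    return instance_ids


def main(region: str = "us-west-2") -> None:
    # boto3 is imported here so the helper above can be exercised without it.
    import boto3

    ec2 = boto3.client("ec2", region_name=region)
    # Saving the result in a variable plays the role of the action's output:
    # a later step can print it or act on it.
    instances = list_instance_ids(ec2)
    print(instances)
```

Calling `main()` requires valid AWS credentials; passing the region in as an argument corresponds to the runbook-level input parameter described above.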
I'm going to give one more example here. And this is again, it's an
AWS example. But the idea here is service quotas.
So there are certain features inside AWS where
Amazon gives you a limit. You can only have so many of these in
your AWS infrastructure. For example, your Elastic
IPs for EC2: in one region you can only have five of those.
And so the way you find this out is you've got to
know the service and you need to know this like random
code that exists here. And then you can call an API
and you can get the number five.
Now we had a customer who was running into problems because their automations
were failing, because it turned out they were hitting some of their service
quota limits and they didn't know it. And so then
their automation was failing because they couldn't actually provision more, in this
case, Elastic IPs, right? So let's look at
unSkript. Right now, we actually have five of these in use.
So we can't actually add another Elastic
IP at this point. We're at the limit. And so this was
just a runbook that I created for this
talk. So our customer was really interested in this. So we
prebuilt a series of actions to
help them understand where they stood.
Here's a runbook, or an action that takes every
single VPC service quota and finds
the quota limit and then also queries to see how many are
used.
And again, the parameter is set here. The region is
set to us-west-2. And I am saying,
let me know if I've hit 50% of my usage.
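The threshold check itself is simple arithmetic. The sketch below assumes the limit and the current usage have already been fetched for each quota; the `QuotaUsage` type, the function name, and the sample numbers are hypothetical and only illustrate the comparison.

```python
from typing import Dict, List, NamedTuple


class QuotaUsage(NamedTuple):
    name: str
    limit: float
    used: float


def over_threshold(usages: List[QuotaUsage], threshold_percent: float = 50.0) -> List[Dict[str, object]]:
    """Return the quotas whose usage is at or above the given percentage of the limit."""
    flagged: List[Dict[str, object]] = []
    for usage in usages:
        if usage.limit <= 0:
            continue  # skip quotas without a meaningful limit
        percent = 100.0 * usage.used / usage.limit
        if percent >= threshold_percent:
            flagged.append({"name": usage.name, "percent_used": round(percent, 1)})
    return flagged


# Hypothetical numbers, not taken from a real account.
sample = [
    QuotaUsage("example quota A", limit=5, used=4),
    QuotaUsage("example quota B", limit=20, used=3),
]
print(over_threshold(sample))
```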
Right. So in this case, my NAT gateways per availability zone,
that has this code right here, the limit is five and I've used
four of them. So what we can do is
then there's another action that I have right here that says,
let's ask for an increase. And this isn't automatic, like it gets put
into a queue and then it gets run. But let's look at this. So the
configuration I have here is, again, the region is a variable set
up here. And then I'm saying
for a VPC, because these are VPC services,
the quota code, and because this is a demo that I'm just doing here,
right here's the quota code. It matches right here. You could programmatically
say, take the output from here and find the quota code and then
run it. But in this case, it's hard coded.
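A request like this can be sketched with the boto3 Service Quotas client, whose `request_service_quota_increase` call takes a service code, a quota code, and a desired value. This is an illustration under assumptions, not the unSkript action's code: the wrapper name is made up, and the quota code in the usage comment is a placeholder rather than a real code.

```python
from typing import Any, Dict


def request_increase(client: Any, service_code: str, quota_code: str, desired_value: float) -> Dict[str, Any]:
    """Ask AWS to raise one quota and return the details of the queued request."""
    response = client.request_service_quota_increase(
        ServiceCode=service_code,
        QuotaCode=quota_code,
        DesiredValue=desired_value,
    )
    # The request is queued rather than applied immediately; the returned
    # record carries a status field describing where it stands.
    return response["RequestedQuota"]


# Usage, assuming boto3 is installed and credentials are configured
# ("L-XXXXXXXX" is a placeholder, not a real quota code):
#   import boto3
#   client = boto3.client("service-quotas", region_name="us-west-2")
#   print(request_increase(client, "vpc", "L-XXXXXXXX", 10.0))
```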
But you get the idea. And I only have one shot at this, because once
you ask for an increase, you can't ask for another increase until it's been
completed. So I'm going to run this,
and I forgot to put in my credentials. So let's put in the credential and
now let's run it. And it says,
I have asked for ten. I forgot to show that I asked
for ten. I said, hey, for VPC,
for this quota code, the limit was five,
I want ten. And so I just sent an API request saying,
hey, I want ten for this. And if you look down here,
it says 200, like it's in the queue to now be set
up to ten, which is pretty awesome. Now,
one thing that's really interesting with these AWS service codes
is that these are really hard to find.
And part of the reason for that is there's like 2600 of them,
right? AWS is huge. There's so many things going on.
And as I was building these out, I really wanted to get a better understanding
of what was going on. And so I'm going to scroll to the top of
this runbook real quickly to show you what I did to better understand this.
The first thing I did after I installed
some pip stuff was to get
the list of all the service names, right? So the service name here is VPC.
If you run this, it's going to give you a list of all the service
names. And there's 222 of them, right? There are a lot of
them, right? Cloud9, Cognito,
all these different things that Amazon offers, like 222
service names. There's a lot of them. I'm just going to minimize that.
Then I built another action. And we'll look
at this here that says, for us-west-2, for a specific service,
like VPC or EC2, get me all the service quotas.
Now, you may notice here, I'm not actually putting in the service name
here, and that's because I'm using an iterator. So I'm actually taking
a list, all 222 of them,
and running through this. So this is going to run 222 times and give me
every single service quota that exists inside AWS.
I'm not going to hit run here because that takes a long time.
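The iteration described above can be approximated in plain boto3 by paging through `list_services` and then `list_service_quotas` for each service code. This is a sketch under assumptions, not the runbook's actual code: the function names and the shape of the returned rows are made up for illustration.

```python
from typing import Any, Dict, List


def list_service_codes(client: Any) -> List[str]:
    """Return the service code of every service known to Service Quotas."""
    codes: List[str] = []
    for page in client.get_paginator("list_services").paginate():
        codes.extend(service["ServiceCode"] for service in page["Services"])
    return codes


def list_all_quotas(client: Any) -> List[Dict[str, Any]]:
    """Walk every service and gather its quotas into one flat list of rows."""
    rows: List[Dict[str, Any]] = []
    paginator = client.get_paginator("list_service_quotas")
    for service_code in list_service_codes(client):
        for page in paginator.paginate(ServiceCode=service_code):
            for quota in page["Quotas"]:
                rows.append(
                    {
                        "service": service_code,
                        "quota_name": quota["QuotaName"],
                        "quota_code": quota["QuotaCode"],
                        "value": quota["Value"],
                    }
                )
    return rows


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   rows = list_all_quotas(boto3.client("service-quotas", region_name="us-west-2"))
```

With a couple of hundred services this makes many API calls, which is consistent with the remark that a full run takes a long time.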
So once I have all 2600 of
these service quotas, I can create a CSV
of those and then I can save them to Google Sheets. And you can see
I actually have that action right here. And it comes over
here into Google Sheets. So let's actually run it
so you can see that this actually happens. I can come here, I can delete
everything that's in the Google Sheet, and I can run this action.
And it says that it did it. So let's go over to there and
you can see it did it. We can do it again, just because it's sort
of fun to watch. Come over here.
We can make this a little smaller.
Run action. And bam, it created it.
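The CSV step can be sketched with the standard library alone. The column names and the sample row below are assumptions, and the upload to Google Sheets is not shown, since it depends on whichever Sheets client and credentials are in use.

```python
import csv
import io
from typing import Dict, Iterable

# Hypothetical column layout for the quota export.
FIELDS = ["service", "quota_name", "quota_code", "value"]


def quotas_to_csv(rows: Iterable[Dict[str, object]]) -> str:
    """Serialize quota rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder row; "L-XXXXXXXX" is not a real quota code.
example_rows = [
    {"service": "vpc", "quota_name": "Example quota", "quota_code": "L-XXXXXXXX", "value": 5.0},
]
print(quotas_to_csv(example_rows))
```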
So now this is an automation. I can make this happen every single
day. I've been running this every day for about the last week.
And like four or five new services are added every single
day. And you can see here, here are all my VPC
service quotas, right? And here are the names, here's that
secret code, and here's the quota value for my
account. If you look here, you can see my NAT gateways per availability zone.
And it's set to five. I haven't refreshed these numbers yet because
remember, this is querying it across 222 of them. It takes a long
time. So if I run it again, this might say ten now, but right now
it still says five. But what's really cool about this is
this automates every single day it gets uploaded to our docs.
So if you're curious about this, you can go to the unscript docs and you
can get the quota code for all 26 40
of these, right? There's tons and tons of these things. It's really interesting.
But again, this is one of those things that's hard to find.
So why not write a little bit of automation to get it into one place
so that it's a lot easier?
Instead of being hidden behind a bunch of API calls, you can now look it
up in a Google Sheet. That might be useful for you. Maybe you would
prefer the API. For me, this was super, super helpful.
So, in conclusion,
runbooks are a way to describe your internal documentation.
When there's an outage, when you need to provision something new,
you have a checklist or a runbook that walks through all the different steps.
And research and people's blog posts have shown
that when you have really good internal documentation, your outcomes are improved.
You lower your mean time to resolution. When there's an outage, right? When you're ripping your
hair out, the last thing you want is an out of date runbook.
If it's up to date and it's automated, it's going to make things easier.
It improves your collaboration and you can automate it.
That automation reduces that manual DevOps toil, those things
that you have to do, the 20 things you need to do every single day,
the things that derail your day, because you've got to go do stuff
that you weren't planning to do. Because if you don't, things will break.
You can auto remediate things, you can increase your observability.
The opportunities here are really endless. And with
unSkript, it's open source, it's Python, it's based on Jupyter
notebooks. There are hundreds of automations built in and it's all
open source and easy to create new ones that
you can use yourself. If you're so interested, you can contribute them
back to our open source. So hopefully
this has given you an idea of how you can use your Python experience to
help you automate a lot of the day to day toil
that you need to do every day to keep your DevOps, to keep your infrastructure
up and running. A few resources if you're
interested. Again, our GitHub repo is at runbooks.sh.
If you want to give us a star, we'd love that.
You can reach out to us at unskript.com. There's a link to our
Slack community there, if you'd like to sign up. And if you have
questions we'd be happy to help you on your DevOps
automation journey. Here's the blog post on things I learned
managing site reliability for some of the world's busiest gambling sites.
It's an interesting read. And I'm Doug Sillars
and I run developer relations at unSkript. If you want to get a hold of
me, probably the easiest way is on Twitter, and my Twitter handle is
Doug Sillars, so I'm pretty easy to find. If you just search Doug
Sillars, you'll find me. Thank you very much for
listening and I look forward to talking with you and helping you
on your CloudOps automation journey.
You can learn more at unSkript and thanks for watching.