Transcript
This transcript was autogenerated. To make changes, submit a PR.
I'm Doug Sillars, and today I'm going to give a talk about how I'm
using an AI bot, basically ChatGPT,
to create DevOps runbook automations.
So building runbooks and automating them and writing
the code with the help of my pair programmer, ChatGPT. So let's just jump right in. The way I think about runbooks
is that in DevOps, they're checklists: the lists of things you need to do to accomplish a task. Maybe there's an
outage, and your runbook lists all the things you need to do to bring the system
back online. Or maybe it's just to provision something. You've got a runbook,
you work through the steps, and at the end you've got it done.
And being someone who works in DevRel, I am really passionate
about documentation. If you're coming in to use a product and
the documentation is not good, it's incredibly frustrating.
The same thing happens internally. If you've got an outage and someone says, oh,
we have a runbook for that, and you go through the steps and
the last step doesn't work because the runbook hasn't been updated, or it's out
of date, or something changed and we didn't update the documentation, that's just as frustrating.
So let's just give an example. You know, one example is
a few weeks ago I was looking at my GitHub data and there's
this great chart showing you how many people have visited your repo and
how many pages they visited every single day. And being
in DevRel, this is data that's really interesting and I don't want it to
be ephemeral up at GitHub. I want to just load this into a database so
that I can keep a history of this over a longer period of time than
it is stored at GitHub. And I can control how I'm displaying
it if I collect this data. And what's really cool is GitHub
has an API. So I was looking at the API and I'm like, this is
going to work. I'm going to take this data, I'm going to suck it into
Postgres, and then I'll have this history. So I
followed all the instructions in the documentation
and of course I got a 403 error that this resource is not
accessible by a personal access token. Of course the documentation
says you need a personal access token. And so I
banged my head against the wall for a couple of hours, and what I discovered is I didn't
have the right access in my personal access token, so the error message was
just wrong, and I eventually got it working. And now this is getting
sucked into a Postgres database every single day. And I can collect this
data, which is really awesome, but we got stuck.
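Once it worked, the collection job boiled down to something like this sketch. The endpoint is GitHub's documented traffic API, but the owner, repo, and token values are placeholders, and the Postgres insert is only indicated in a comment:

```python
import json
import urllib.request

def traffic_url(owner: str, repo: str) -> str:
    # GET /repos/{owner}/{repo}/traffic/views; needs a token with push access
    return f"https://api.github.com/repos/{owner}/{repo}/traffic/views"

def flatten_views(payload: dict) -> list:
    # GitHub returns {"views": [{"timestamp": ..., "count": ..., "uniques": ...}]}
    return [(v["timestamp"], v["count"], v["uniques"]) for v in payload.get("views", [])]

def fetch_views(owner: str, repo: str, token: str) -> list:
    req = urllib.request.Request(
        traffic_url(owner, repo),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return flatten_views(json.load(resp))

# Each returned row can go straight into a Postgres table, e.g. with psycopg2:
#   INSERT INTO repo_views (day, views, uniques) VALUES (%s, %s, %s)
```

Run daily (GitHub only keeps the last two weeks of traffic data), this gives you the long-term history the GitHub UI won't.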
And when you have documentation or you have a runbook, you don't want people
to get stuck in their runbooks or their checklists.
So another stat that's out there for
cloud native, for SREs, for DevOps, is that the Ops
teams and the application support teams and the SREs are
spending up to 55% of their time just doing stuff.
The stuff that gets into your inbox and you've got to do.
But it isn't the most important thing that you need to get
done. It's that manual, tedious stuff. And wouldn't it be
great if we could automate some of this away? And so
this is where these runbooks come into play. Like, if you have a good
runbook, boop, boop, boop, boop, boop, maybe you can finish it faster.
If we can automate some of those steps, then we're getting even a step further.
So there's this great blog post. The URL is down there at the bottom,
and it'll be at the end and it'll be in the slides. But this
guy works for a gambling website. And a gambling
website, if they go down, they're losing money, because really, all a
gambling site is there to do is take money from people. And he
found that when he built a thorough library
of runbooks, that all of their issues were getting resolved faster.
He found that escalations were easier because maybe
the NOC or the team that was on call could run through
the runbooks before they called the person. So rather than just like calling
Tom, they could try the runbook. Boop, boop, boop, boop, boop, boop, boop. And if
it doesn't work, that last step is call Tom.
He found that hiring new developers and new DevOps
members of the team was easier because if they had
a question, they said, oh, we've got a runbook here. Read how it works.
And they could read how everything is provisioned, how it all runs.
And that just kind of goes into the training, right. If things work right,
then it's easier to train people on how to use it because
you already have everything documented. And then finally, as people
found how useful the runbooks were, once they worked, they kept
at it and they kept updating them. And you never ran into this bad documentation
where the runbook was out of date. And then the next
step is they started automating it. And that was speeding things up even further.
Now, when you read this guy's post,
he got seven months. When he started at the company, his boss gave him seven
months to focus solely on every single issue, every single
outage, and create a runbook for everything. And then ongoing,
there was this 10% work to keep them going. But by
then the whole team was on board, right, because they found
how great these runbooks were. Now, the thing is,
of course, like very few of us have seven months to do this.
So another approach is from this other blog post here,
which is called "Do-nothing scripting: the key to gradual automation." And his whole goal is
let's open up a notebook, let's say a Jupyter notebook,
or even just a text editor or a code editor, and just write
out all the steps. These are the steps we need to do. So build the
runbook and then as you get time, automate some
of the steps. So it's like do this manually, do this manually, run this code,
do this manually, do this manually, run this code. And as you automate
some of those steps, as you use the runbook more and more often, you just
automate more and more steps of it, and gradually you come to the state where
the runbook becomes fully automated.
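The pattern from that do-nothing scripting post can be sketched like this. The steps themselves are made up for illustration: each step is either a manual prompt for the operator or a bit of code, and you convert them one at a time.

```python
# A minimal "do-nothing script": every step is an object. Manual steps just
# tell the operator what to do; automated steps actually do it.
class ManualStep:
    def __init__(self, instructions: str):
        self.instructions = instructions

    def run(self, context: dict) -> None:
        print(self.instructions)
        input("Press Enter when done...")

class AutomatedStep:
    def __init__(self, func):
        self.func = func

    def run(self, context: dict) -> None:
        self.func(context)

def make_username(context: dict) -> None:
    # The one step we've automated so far: derive a username from the name.
    context["username"] = context["full_name"].lower().replace(" ", ".")

RUNBOOK = [
    ManualStep("Create the new hire's ticket in Jira."),
    AutomatedStep(make_username),
    ManualStep("Email the username to IT."),
]

def run_runbook(steps, context: dict) -> None:
    for step in steps:
        step.run(context)
```

To automate another step, you swap a `ManualStep` for an `AutomatedStep` and the rest of the runbook doesn't change.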
I would add one more step here in that if you have all these runbooks,
that's great, but if they're not somewhere where the rest of the team can get
to them, you're still going to be the person who's getting paged.
So let's talk a little bit about unSkript and the open
source tooling that we have to help you automate your runbooks.
It's all open source. You can see the URL down here at the bottom of
our GitHub repo, so you can check that out if you're interested. Our open source
runbook automation is built on top of Jupyter notebooks.
And so what's great about that is if you go back to that whole like
do nothing scripting, the whole idea is you can
add text sections in the middle and then you
can add code in between as you want to automate stuff.
The other advantage is these are online, so it's easy to share amongst
the team. In our open source version, you have a Docker image that everyone
can have access to. We also have an enterprise version that is in
the cloud. You've got your text and markdown fields.
As I was alluding to a second ago, this is where you can write down your do-nothing
script: these are the steps. And then in between, and I've minimized
the code here just so that it all fits on the screen, we've got
automation fields and it's all Python based, so it's pretty straightforward and easy to get
started with. You don't need that seven month kickoff. You can start using your automations
bit by bit when you have a couple of minutes and you can build these
runbooks very, very quickly.
Another great advantage that we have is we have hundreds of pre built actions
that you can just drag into your runbook using AWS,
Google Cloud, Kubernetes, lots of databases,
Jira; there are about 30 integrations. And we have about 400 actions
that can just be dragged in and used by just wiring them
up to your credentials. So here's
an example runbook, and this is a Kubernetes health check.
And so what we're doing here is the first action is we're
going to list all the pods in our namespace, then we're going to get the
logs. And then I wrote this code here just in Python to
parse the logs and look for warnings in the logs,
right? And then if there's a warning in the logs, we'll post a message to
Slack, say hey guys, found a warning in the logs. Here's what it
is. Here's the pod that's having issues, we can
resolve this. This is the beginning of a
full automation. Maybe once we
diagnose that, we could create some actions to
auto-remediate the issue with that Kubernetes pod.
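That glue step, the one I hand-wrote, boils down to something like this sketch. The pod logs would come from the previous action's output, and the Slack webhook URL is a placeholder:

```python
import json
import urllib.request

def find_warnings(log_text: str) -> list:
    # Keep any log line that mentions a warning, case-insensitively.
    return [line for line in log_text.splitlines() if "warn" in line.lower()]

def post_to_slack(webhook_url: str, pod: str, warnings: list) -> None:
    # Slack incoming webhooks accept a JSON body with a "text" field.
    message = f"Found {len(warnings)} warning(s) in pod {pod}:\n" + "\n".join(warnings)
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": message}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# In the runbook, for each pod's logs from the previous step:
#   warnings = find_warnings(logs)
#   if warnings:
#       post_to_slack(SLACK_WEBHOOK_URL, pod_name, warnings)
```

The parsing is deliberately dumb; the point of the runbook is wiring it between the prebuilt "get logs" and notification actions.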
The cool thing is, when I built this, these three actions right
here are all pre built. Just had to drag them in and wire them
up with the configuration and the credentials to
log into my Kubernetes namespace,
then I had to write this one. This whole runbook is available in
our open source, so if you wanted to use it, you can just use the
entire runbook, it's there. You just have to wire up the four
different steps here, and you can run this on the regular
to see if there's any issues with your Kubernetes
deployments. So let's talk
a bit about these actions. In this screenshot here,
you can see it says 342. We're right at almost 400
right now. So it's growing rapidly, and you can
create your own. They're all Python, so it's very straightforward.
And if the desired action doesn't exist,
you can write your own. And you can see here, when I took this screenshot,
I had 24 that I had written. You can also extend an
existing action. So an example I like to give is we have
an action that will list all the open pull requests at GitHub,
but we don't have one that lists all the closed pull requests.
But if you go into the Python code, you can see where it says open.
You can change that to closed, and you've changed the entire functionality
of the action, and it will now list
all of your closed GitHub pull requests.
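The change amounts to flipping one query parameter on the GitHub REST API. A sketch, with owner, repo, and token as placeholders:

```python
import json
import urllib.request

def pulls_url(owner: str, repo: str, state: str = "open") -> str:
    # Per the GitHub REST API, state can be "open", "closed", or "all".
    return f"https://api.github.com/repos/{owner}/{repo}/pulls?state={state}"

def list_pulls(owner: str, repo: str, token: str, state: str = "open") -> list:
    req = urllib.request.Request(
        pulls_url(owner, repo, state),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Changing `state="open"` to `state="closed"` is the entire "new" action.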
Or you could create a new action. If there's not one that's close enough,
you might just have to write some Python code to create a new action,
and this is where I started working with ChatGPT.
You can connect to an external service, so do something with Jira,
or do something with GitHub, or with Google Cloud or Azure or AWS.
Or you could create one of these glue actions like I did in
the Kubernetes runbook, where I just took the output of the logs and did some
parsing: are there any warnings? And then send a message to
Slack if there are warnings.
Here are two actions that I wanted to create. I wanted to be able to
tag an EC2 instance. Tagging is a common way for
people to understand what that instance, that virtual machine, is
being used for. And if you have good tagging, then it's
pretty obvious which ones are for production, which ones are for staging, which ones need
to stay up and which ones don't. It's also another way
of managing cost: if a project winds down and an
EC2 cluster is tagged with that project, you can turn it down and it won't
hurt anything, and it'll save the company money. The other one I wanted to do
is I wanted to look at all of my Google Cloud virtual machines,
and I wanted to know if they were public or not. You can't go a
week without hearing about some company's
S3 buckets or virtual machines being exposed to the Internet and getting
hacked. So if you wanted to write some security runbooks,
this is one way you could do that.
And as my copilot to
help me write these, I thought it would be fun to have ChatGPT help
me out. And if you haven't heard of ChatGPT,
the URL is down there: chat.openai.com.
And you can log in and you can ask it questions and it
will write poetry for you. It will write essays for your school.
Don't do that because your teachers know and your professors know,
but you could. And it
also writes code. And when I created this talk, we were
on ChatGPT with GPT-3. GPT-4
is coming out in beta right now, and I'm super excited to check it out,
but I don't have access yet. So here's
how it works. With ChatGPT right here, you can
just ask it a question, and you can see, I said: can you write a Python
script to add a cost center tag with the value marketing to an EC2
instance?
And ChatGPT comes back and says, sure, I can do that for you.
And it's
importing boto3, which is the AWS SDK for Python.
It sets up your instance ID and
it puts a key and a value in a variable
and then makes the API call.
And just like that, here's your code.
I like that it's all commented nicely, and then it also gives you a
description to tell you exactly what it thinks it needs to do to make this
work.
And so there's the code, and I could take that
code and drag it right into my unSkript
runbook. And when I ran it,
it didn't work quite right. And the reason for that is when
you create your boto3
client, you also need to put a region in there. So
ChatGPT got it this close; we're almost there.
Of course, the error message told me exactly what was wrong. So then I said,
hey, ChatGPT, doesn't it need a region?
So let's use a variable for the region, and let's give that
variable the value us-west-2.
And ChatGPT says, oh yeah, I made a mistake. That's right,
it does require that.
So now we've got the region, and you can see it's setting the region name
equal to region.
And the rest of the code is similar to what we saw in
the first video. And so when we do this,
we can see that the response comes back and we get a 200
response, meaning, okay, it actually worked.
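The corrected script came out roughly like this sketch; I'm reproducing the shape rather than the exact code, with instance ID, key, value, and region as the inputs:

```python
def build_tags(key: str, value: str) -> list:
    # boto3's create_tags wants a list of {"Key": ..., "Value": ...} dicts.
    return [{"Key": key, "Value": value}]

def tag_instance(instance_id: str, key: str, value: str, region: str) -> int:
    # Import inside the function so the helper above works even without boto3 installed.
    import boto3

    # The region argument is the part ChatGPT originally left out.
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.create_tags(Resources=[instance_id], Tags=build_tags(key, value))
    # AWS returns HTTP 200 in the response metadata on success.
    return response["ResponseMetadata"]["HTTPStatusCode"]

# Example call (placeholder instance ID):
# tag_instance("i-0123456789abcdef0", "costcenter", "marketing", "us-west-2")
```

Keeping key, value, and region as parameters is what makes the action reusable as an unSkript input form.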
This is super, super cool. What I then did is you can
see I have inputs set for my key,
my value and my region. And that makes it even more
modular. So I have this action. I can feed variables into it,
almost like an API or a microservice, so I can feed the variables
in, and it takes those variables and makes the connection
to EC2 and then creates the tags.
And this works. And if you go into unSkript today,
this action is there, and it was built using ChatGPT.
You can see the result now. I ran it twice, once with costcenter
and marketing, and then I added a 1 to each just to show that it
happened twice. So it does actually add the key-value
pair that you want to your EC2
instance.
So let's walk through how I set this
all up once you have the code written. And what
makes the unSkript open source so
easy to use is that if you create a credential, which
for AWS is a key and a secret key,
you can reuse that without knowing what the key or the secret
is, just by selecting your AWS credential,
whatever name you give it. And you can select this for all of your actions.
And it'll just run, it'll say, oh, I know which key value to
use, it's stored over here in the vault. And now I can run this
action and you can see my variables. I have a region
variable, an instance, and a key value.
And so now by setting these all as variables, I can
run this action and it will tag this instance in
us-west-2 with costcenter1 and marketing1.
Once you have actions like this created, you can just drag them and drop
them into your Jupyter notebook, and it makes it really, really easy to use.
So then the second one I'm going to build here in this video is
to get all of the Google Cloud virtual machines
and then see if they're publicly available.
So, you know, I asked this question, and I had to be a little bit
more specific here, because when we build things
here at unSkript, or with our open source platform,
I want to use the same SDK, and it was using a different SDK.
So I said, hey, can you just make sure that you're using the Google
Cloud project library to do this? And then here's
my project. Here's my region.
Actually, first it says, no, I can't do that,
because I can't execute code, but I can give
you example code of how it would actually work. And so it instantiates
the Compute Engine client, it sets
the region and the project,
and for all of the VMs, it just checks to see
if each one is publicly available or not. And if it is publicly available,
it prints the list.
It says you need to install the library, and you need to have authentication.
And that's the cool thing about unSkript: we take care of all the authentication
and we take care of the library.
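A sketch of that check, assuming the google-cloud-compute library. The public/private decision is just "does any network interface have an external NAT IP attached"; the SDK field mapping here (e.g. `nat_i_p`) follows that library's protobuf naming and should be treated as an assumption:

```python
def is_public(network_interfaces: list) -> bool:
    # A VM is reachable from the Internet if any interface has an access
    # config with an external (NAT) IP attached.
    return any(
        cfg.get("natIP")
        for nic in network_interfaces
        for cfg in nic.get("accessConfigs", [])
    )

def public_vms(project: str, zone: str) -> list:
    # Import inside the function so the pure check above runs without the SDK.
    from google.cloud import compute_v1

    client = compute_v1.InstancesClient()
    exposed = []
    for inst in client.list(project=project, zone=zone):
        # Flatten the SDK objects into plain dicts for the check above.
        nics = [
            {"accessConfigs": [{"natIP": cfg.nat_i_p} for cfg in nic.access_configs]}
            for nic in inst.network_interfaces
        ]
        if is_public(nics):
            exposed.append(inst.name)
    return exposed
```

Run on a schedule, the list of exposed VM names is exactly what you'd feed into a Slack alert or an auto-remediation step.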
So in summary, runbooks are a form of
internal documentation: they're the checklists that you use when you need to provision
something or when you have an outage, the steps that you need to
take to resolve the issue. When you have good
internal documentation, it improves the outcomes,
right? Just like externally: if you have good documentation, people say, this
company has good documentation, they make it really easy to learn, I got started really
easily. If you have good documentation, you have good
outcomes. You're going to lower the mean time to response. Because when step
three isn't working (oh yeah, we changed something, right?) and the stress levels
are already really high, that just pushes the
stress levels even higher. They improve your team collaboration.
This isn't working. I built a runbook for that. Try this.
Oh that solved it. Thanks. And then we can automate them. If we can
automate them, then we don't actually ourselves have to go through all
the manual tasks. We can zip through at least a few of those steps because
we've automated it, and we let the computer do the repetitive, boring
bits. By automating those steps, we're reducing that day-to-day toil,
that up to 55% of some professionals' time that's spent
just doing the mundane things we need to do to keep everything running.
You could build auto-remediations, like if a Kubernetes
pod is unhealthy, or your VM
is publicly available: you could just make it not publicly available,
right? Hide it. Turn off that IP forwarding,
don't let anything access it. And it could be automatically remediated,
so we don't have a problem. By increasing
the observability, by testing these things on the regular,
we're going to be alerted if something has changed.
We also have runbooks now in unSkript
that look at your cloud spend every single day so you
don't get a surprise AWS bill at the end of the month because you'll get
an alert within 24 hours or maybe 48
hours that hey, your spend went up earlier this week,
did you know that? And you can go back and say, oh yeah, I turned
on a bunch of xlarge machines, I should spin those down now.
So unSkript has this neat
niche where we're open source. We're built on top of Jupyter notebooks,
which makes them easily shareable with the whole team. They help
you automate. There's hundreds of built in automations to help
you get started really, really quickly, and they help you build these runbooks and
they help you build these runbooks in an automated way. So you're improving your outcomes,
you're lowering your MTTR, you're reducing your toil,
and you're increasing your observability.
If you use this along with ChatGPT, you get
this rapid prototyping of your automation. And while unSkript
is really fast to get you to the state where you have a
runbook, once you add ChatGPT in there, you actually get there even faster,
because if you have to write an action, ChatGPT can take
you there, shaving off 80% of the
time, and then you've got your automation even faster.
So with that, thank you so much for watching the talk.
Go check out unSkript. We're at runbooks.sh, and
while you're there you can see the docker instructions to download
and install it and run it. Give us a star while you're there.
If you want to read more, we have lots of blog posts and documentation at unskript.com.
If you want to play around with ChatGPT, it's a lot of fun.
I recommend it: chat.openai.com.
and then the two blog posts I talked about: the guy who built 1,800
runbooks over seven months, and then do-nothing scripting.
And so those are the links there.
And with that, thank you so much for watching. I'm really happy
to have been a part of the Cloud Native 2023 conference.
If you have any questions, feel free to reach out to
me on the discord. I am there so ask me questions.
I would be happy to help with any sort of automation DevOps
runbook sort of questions. Thanks again and I'll
see you in the Discord server.