Transcript
This transcript was autogenerated. To make changes, submit a PR.
All right, thanks for listening to my session.
We're going to talk about some automation and hopefully controlling some toil
with automation. I'm Mandy Walls. I am a DevOps advocate at PagerDuty.
I'm lnxchk on most of the social
medias. If you like, get in touch with me. I'm happy to chat
about stuff and you can always email me, which is great too, right?
We all get email. So I'm going to talk broadly about automation
and complex environments. But let's start with
a task that is pretty common and
maybe more common in larger organizations than smaller ones. But as an example,
hopefully it makes sense to people. So I have a developer. Her name's
Alice, and Alice works on a microservice.
It's a feature in a customer facing application.
When Alice or one of her team members is ready to
push some amount of work into the main pipelines, they're responsible for doing
some first sort of sanity checks on their code. And because
of the way it fits into this microservices environment, they need
a bigger integration environment than they could just run on their laptops.
So they do this in the cloud. So to ensure that her
code is going to work, Alice needs the sanity check environment
in the development cloud account. Unfortunately,
because of compliance or requirements or whatever,
Alice doesn't have direct access to the cloud account that's
restricted. So she has to put in a ticket
with another team and they'll do it for her.
So her request is submitted to the cloud operations team.
It goes into a queue with everybody else's request, whatever might
be happening for cloud ops, and it'll get handled eventually,
right? The first subtask for her
ticket is to get approval from Finops. The Finops
team reviews the request tickets, and if a
team is over budget or they have too many environments
already, they haven't cleaned up all of their old stuff, the ticket's rejected
and gets sent back to the requester to try again.
If Finops approves the ticket, it proceeds to cloudops. At some
point, hopefully, Alice will get her environment, but it
feels so clunky. We want to make sure
that all the boxes are checked, that we're using the most recent security
profiles, that we're using the right firewall rules, that we're following all
the rules and regulations and financial guidelines and all of those
other things. But this takes time away from
getting work done, and Alice should be continuing
on her development journey and getting more stuff done.
She wants to get back to work so she can ship more awesome features.
So we need a way of dealing with getting
through all this stuff. The hard part is,
you know, all of the folks in our example have goals and projects and
things that they're responsible for. They have different products and services that
they're permitted to have access to, but most of all, and sort
of most importantly, they have specific expertise for
their roles, right? So how much organizational
expertise gets shared varies from team to team,
right. It's definitely common in small companies. If you work in a startup,
you might be one day helping fix a printer and the next day you're provisioning
stuff in the cloud and the next day you're like debugging the mobile app.
It totally happens. But when organizations get larger and their services
get more complex, you start to naturally see some of this siloing
off because their requirements now need somebody to
go really deep in that product, into that environment or
whatever it happens to be. So folks need to cultivate more expertise.
In our fictitious organization, the development team isn't permitted to
directly provision assets in the main cloud accounts. They don't necessarily even know
how to do that. They're in there, they're writing the code, they know their libraries
and the runtimes and the things that impact their application,
and they just want to get code into production. Right.
Then we've got our finops folks, and they're analysts mostly,
right? They may not even do a whole lot of development work other than some
tooling, but they're there to watch the spend and estimate costs
and say, oh, make some recommendations on the budgets and
say, well, we've been deploying this over here, but it might be
cheaper if we moved it to somewhere else. It doesn't really impact us, do that
kind of analysis on the thing. So they know their stuff
and then cloud ops is there. They maintain the access and security and
make sure everything is constantly updated for best practices
or recommended practices for the environment that they're using.
And they're sort of responsible for how all those
things fit together. And then they provide environments back to all the
other teams. So each of these teams knows their stuff, but they all have enough
stuff that they're responsible for to keep them busy. And it's
challenging to try and cross train folks across all these
tasks. So unfortunately, we get to a point where we've
got these important skills that get siloed off and it happens kind
of naturally. Eventually, teams get too large to share all of this stuff,
so we want to get to the point where teams
are spending time doing important work shipping code to production and
keeping customers happy. But we still have to get all the boring stuff accomplished
and that's where we're going to need some automation. So what is automation?
We know it sounds pretty good, right? We know that we probably want some.
And the dirty secret of software development is
really that a lot of the work that goes into getting
software from idea to production has nothing to do with
the application code itself, right? It's all that other stuff
that has to happen. We want our code to have a nice, safe,
cozy place to live when it gets to production. It has nothing to do
with what the application is going to do, right? So our cloud ops team knows
how to get that done. But once you've done
that once or twice and you know how to do it, it's not that exciting.
We need to get stuff there the right way, with the right controls in the
right environments. We don't want to do it ourselves.
The application engineers don't really have the time to
get well versed in all the hundreds of different things
available in our cloud platform, and we might not be
able to give that to them anyway. So we look for another layer, another process
to provide access in a safe way. So we're going to dig a bit
into automation and we're going to dig a bit into building for delegation.
First we'll take a short side quest because we get this question from time to
time. Both of these. Is AI automation? And is automation
AI? Well,
take a step back. We can use AI to help build automation.
You may have more or less success with this, depending on the nature
of your environment. If you use a lot of open source
software, a lot of publicly available stuff, you'll have more luck than if
you're using a bunch of COTS stuff or everything you have is homegrown,
obviously, because these large language models that you might be asking questions to
are using publicly available information. And if your information on
your software is behind a login or a support portal that
they can't get into, they're not going to know stuff, right? So you're basically limited
to what's been published online. So if you are using a lot of open
source, you might be okay, right? So it can at least help you get
your automation built. But automation itself isn't AI.
It's not at a point yet where we can build
some automation and have it learn and grow with your systems.
It's not to a point where it finds an error
code and then goes out and looks at that error code
and then creates more code around it to the point where you don't know
what even is in there. So we're not at that point yet.
So we're not to the point where automation is going to let you, like, just
sit back, relax, have a margarita and hope that the automation
is going to take care of all of this for you. There's been a number
of research trends over the years in complex systems,
stuff about self healing, stuff about intelligent systems,
and there's lots of pathways forward for
these to happen, but we're not there yet. So we're going to continue to write
most of this stuff ourselves and keep an eye on it because
we're looking for a handful of basic benefits, right? Because we
are going to invest in automation. Even if we're using AI tools
to help us generate it, we still need to invest people
hours into maintaining it, making sure it works, all that great stuff. So it is
an investment, right? It doesn't happen overnight. Hopefully you're not doing it on
your off hours. You want it to be part of the actual products that you're
working on. So you've got these complex environments,
they're built with hundreds or even thousands of individual components.
So you've got all this complexity and all this stuff. And it's more than most
humans are capable of sort of rationalizing across all the environments.
So we want to be able to contain all of that so that we don't
get too many people confused about it. Then we
want to be able to cope with all the change that comes downstream to us
from all those components. Because things are updating constantly. You might be getting regular
emails, hey, we're sunsetting XYZ. Hey the deprecation
date on this is April 30 or whatever it is. You want to be able
to maintain all of your components in a way that reduces
risk, right? Because we don't want to be running stuff that's end of life,
but we still want to be able to maintain things in a sane and
rational way. To do that we're probably going to need some automation to
help us out. And then the hard part is that
with all these components they're all slightly different. Some of them
are configured in YAML and some of them are in their own markup language and
some of them are like maybe still using a whole lot of GUIs or dot
files and that kind of stuff. And it's hard
to do all of that work without making mistakes.
And we're used to bugs, we're software developers, it happens.
But we want to get to a point where we're reducing that
risk as much as possible. And if we're doing the same process
over and over and over again and multiple people have to be able to do
it, the best way to do that is to encode it somewhere that makes it
easier so people aren't copying and pasting and accidentally missing a line or missing
the long strings of options at the end of a
command. And obviously we want to get to the point where
we're reducing toil. If you're not familiar with the word toil,
it's kind of the way that the Google
SRE books talk about work that,
let's say boring, right? They don't usually use the word boring, but it's the
work that has to get done. And it expands sort
of linearly as your environment grows linearly,
right? So things like doing your security updates or
making sure that all of your containers are in line with things like
all that kind of boring stuff that has to happen. And you're doing like you're
rerolling images for security updates, all that kind of stuff.
So we know it has to get done. It's like brushing your teeth.
If you skip a couple of days, maybe, but if you
skip months, you're going to have a problem, right? So you don't want your teeth
falling out because you're not doing your toil work.
So there are some potential drawbacks to automating too much stuff,
and I'll say too much, but you kind of get to a point where things
are more comfortable. One to look
out for is loss of expertise, because you're not going to be able to completely
replace everybody with automation. But you don't
want to get to a point where folks don't know what's going on. There's a
lot of research in systems engineering and automation engineering that
talks about what do you do with junior
engineers, new team members in automated environments.
You want to make sure that they are constantly in the workflow
and constantly working with the products to make sure that they can maintain the automation
and they know what's going on. When the more complex problems come through
that the automation isn't built to work through, they still understand what's
going on and then you have a little bit
of risk from brittleness, right.
Automation is part of your lifecycle.
The systems are part of the automation as much as
your automation code is. It needs to change when the
services change, it will probably need to be updated when the operating
system or dependencies are updated. And you want to be
in a place where you can maintain those tools for a
service as well as you maintain the service itself.
So you don't want to get to a point where you cannot relaunch a thing
because you've lost the start scripts or you don't know where that stuff got to,
or nobody knows how to maintain it. We don't want to get to that point
where we're breaking things because the automation hasn't kept up.
So we want to keep up on that. And another big one is
more political and social than technical, but it's so prevalent,
it's basically a bad management meme as like the dream
of kicking back and drinking a margarita while the automation does all the work.
On the opposite side of that is the manager is like, why are you guys
all here if we've got everything automated? Like you're all fired, right? So we don't
want to get to that point either. We're not at a point where the system
is going to run fine without you. We're trying to make more time in your
day by automating some stuff. So when we go
for automation, it doesn't matter
what you're writing the automation in. And depending on your platform, you could
do things in something as simple as PowerShell or Bash or your favorite
programming language, whether that's Go or Python or whatever else. You
might be deploying other automation-focused tools, right?
Like infrastructure as code components and that kind of stuff. So there's
lots of ways of putting automation into your workflow
that aren't necessarily you have to build the whole thing yourself. There are lots of
building blocks for it, but you should be looking for some basic
characteristics to make things a little easier on yourself over
the lifecycle of the services that you're supporting. And these
probably look like what you want for software development because they totally are.
So I've pulled these from Lee Atchison's Architecting for Scale,
which is a great book. You should take a look at that one if you
haven't. But it's ways to think about
the automation for a service as part of the product itself, and the characteristics
that are going to help you maintain things going forward. And these
should hopefully be pretty familiar to you as you have
gone on your DevOps journey. We want things to be testable, whether there's
a test harness for the language that we've chosen to write
our application in, or there's another whole, like, Test Kitchen kind
of application that lives along with it. We want to get to that. You can
apply test driven development methods if that's part of your
culture, that's great too. You also want things to be flexible.
You want them to be implemented for future improvements. So we're not hard
coding things. Maybe we're going out to the configuration database to pull more information about
services and connectivity and things like that. We also want to put all
of it in the version control system so that we can review it
and we have access to its history if we ever need to
go back and look at something. We also want to keep this automation for related
systems the same. And this is like I say that
and you think, oh yeah, that's obvious. But I have worked with customers in
different places over the years that you could have two teams
sitting on the opposite sides of a wall of cubicles and they're both
running the same platform, but their stuff is binary incompatible
because they've been allowed to recompile everything
separately and then they can't share scripts and tools and all
that kind of stuff. And that's crazy. That's such a waste of time and
resources. So from a political
standpoint, you might have to have some hard talks with
the folks that have recompiled things or
have built their own stuff or have kind of gone on side quests
to make their stuff special, because that's going to be harder for absolutely everybody in
the long run. You gain more benefit from your automation
if you can share things across different teams. Right.
So a little bit of a rant there. We also
want things to be repeatable. We definitely want them to be auditable. Right.
I want it to run the same. Every time the tool runs, I want it
to do the same thing. And I want to get to the point where if
I run it, I know I ran it. I want that to be centralized.
And folks know that because the goal here is
for me to give this to Alice, right? So I want to know when Alice
ran the thing, and I want to know that every time Alice runs
the thing, it's going to act the same way for her too. So there's not
a lot of confusion that I don't have to then deal with. So let's
take a look at what it might mean to automate for delegation.
And for folks who are used to sort of writing for themselves.
There's some things that might be a little bit different when we think about it,
right? Because we have a bunch of different teams, right? So all those folks
on the left hand side of the screen there, they're not
dumb. They know their own thing, right? They've got all their own set of tools
and all their own set of knowledge and all that kind of stuff. And then
we've got all of our tools on the right hand side there that are doing
all the cool stuff in our environments, but it's hard to share all
of the things on the right with all of the people on the left because
they don't know all of the same things and they don't have
all of exactly the same skills. And importantly,
they also don't all have exactly the same access,
right, which is another issue that always pops up. So as
we are building for the goal of taking toil off
my plate and giving it to Alice as maybe a click button
somewhere, then I want to make sure that I am resolving all of her
questions via the automation and taking care of all that stuff.
So I want to be able to say, oh, I know Alice knows
X, Y and Z and will understand it this way. And I can address
those kinds of gaps for her before I pass
this automation over to her team. So we're
basically building an internal product, right,
when we build this kind of automation. And like I mentioned before, this leads
really well into folks that are on a platform engineering journey,
right? We are designing for delegation
to another team and part of that is going to be supporting
how our users work. Knowing that some of our engineering teams
are super comfortable on the command line, some of them aren't.
A lot of them really just want to log into a web page and click
a button, or maybe they're working in something like Backstage and they want a module
there, or they want things in a certain place. So that stuff
is super easy for them and that's great, but you need to know that and
think about it. We also want to provide results that make sense
whether things are successful or not. If it produces an error,
it should be a helpful error. Oh hey, you're over budget, or hey,
you've already got six environments deployed in your
account, you can't have any more. Please go clean something up. Right. Rather than just
saying error, failing, right. So being helpful to
the users and for folks who have worked in the back end for a long
time, some skills that you might have to build as
you're thinking about how this delegation is going to work, you also want consistency of
experience. You want the same kind of error messages, you want the same kind of
options for the commands, you want things to be named similarly.
You don't want to use a lot of jargon if you don't have to.
You certainly don't want to use like inside jokes, right? And that kind of stuff
in these kinds of internal products because you want as many
people as possible to be able to use them and be successful with them without
having a lot of insider knowledge, right? So making things as
accessible as possible to as wide an
audience as you can, knowing that they're all just your coworkers,
right? If they don't understand something they're going to be pinging you on Slack or
Teams or whatever. But at the same time they really just want to get their
work done and not be burdened with a bunch of silly stuff.
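As a rough illustration of the helpful-error idea above, here is a minimal Python sketch. The names RequestDenied, check_request, and MAX_ENVIRONMENTS are made up for this example, and the cap of six only echoes the number used in the talk. It is not a real PagerDuty or Rundeck API, just one way to make every denial say what happened and what to do next, in the same shape every time.

```python
class RequestDenied(Exception):
    """A denial that explains the reason and suggests a next step (hypothetical)."""

    def __init__(self, reason: str, next_step: str) -> None:
        super().__init__(f"{reason} {next_step}")
        self.reason = reason
        self.next_step = next_step


MAX_ENVIRONMENTS = 6  # assumed cap, borrowed from the "six environments" example


def check_request(team: str, spend: float, budget: float, env_count: int) -> None:
    """Raise RequestDenied with an actionable message, or return None if all is well."""
    if spend > budget:
        raise RequestDenied(
            f"Team {team} is over budget ({spend:.2f} spent of {budget:.2f}).",
            "Please talk to FinOps before requesting again.",
        )
    if env_count >= MAX_ENVIRONMENTS:
        raise RequestDenied(
            f"Team {team} already has {env_count} environments deployed.",
            "Please clean one up, then request again.",
        )


# Usage: the requester sees a plain-language reason instead of a bare "error".
try:
    check_request("checkout", spend=900.0, budget=1000.0, env_count=6)
except RequestDenied as err:
    print(err)
```

Keeping the reason and the next step as separate fields is a design choice that lets a web form, a chat bot, or a command-line tool all render the same denial consistently.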
We also want to include as much documentation and context as
we can, whether that's like hover overs or whatever it looks like for
your tools. So that, like I said, folks don't have to come back to you
on Slack all the time. So we are designing
for other people that kind of look like us, right? They are
other folks that are technically minded,
they are doing technical work and they are headed in the
same direction we are, right. We want to create awesome products for
the customers. It's just that some of us work on the backside and they work
on the front side. So like any software design we are focusing on
empathy for the users. It just happens to be that the users are the
next team over in the chart.
So we want to turn our expertise, the things we know, things cloud
ops knows, things finops knows into automation.
So that when Alice wants to request that
environment she logs into our automation platform,
whatever it is, and requests her environment: provision environment.
And it says oh you're Alice, you're on this team,
you have these permissions to these jobs in these environments and
it just goes through the workflow right? Oh here's
the jobs for Alice's team. Click the button, off it goes, it goes
through the budget check. Doesn't need a human, does it
automatically. Creates the environment, pops it back up to Alice, says here's
your environment, please log in at da da da.
And every job gets logged for audit.
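The flow just described (check who is asking, run the budget check without a human, create the environment, log the job) could be sketched in Python roughly like this. Everything here is an assumption for illustration: the Team record, the six-environment cap, the in-memory AUDIT_LOG list, and the fake environment names stand in for whatever your real automation platform, identity system, and cloud APIs provide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Team:
    """Hypothetical team record; a real system would pull this from elsewhere."""
    name: str
    members: set[str]
    budget: float
    spend: float
    environments: list[str] = field(default_factory=list)


MAX_ENVIRONMENTS = 6  # assumed cap
AUDIT_LOG: list[dict] = []  # stand-in for a centralized audit store


def audit(user: str, action: str, outcome: str) -> None:
    """Record who ran what, when, and how it ended."""
    AUDIT_LOG.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "outcome": outcome,
    })


def provision_environment(user: str, team: Team) -> str:
    """Run the permission, budget, and capacity checks, then 'create' an environment."""
    action = "provision_environment"
    if user not in team.members:
        audit(user, action, "denied: not on team")
        return f"Sorry {user}, you don't have permission to run jobs for {team.name}."
    if team.spend > team.budget:
        audit(user, action, "denied: over budget")
        return f"Team {team.name} is over budget. Please check with FinOps."
    if len(team.environments) >= MAX_ENVIRONMENTS:
        audit(user, action, "denied: account full")
        return f"Team {team.name} has no room left. Please clean something up."
    env_name = f"{team.name}-env-{len(team.environments) + 1}"
    team.environments.append(env_name)  # placeholder for the real cloud call
    audit(user, action, f"created {env_name}")
    return f"Here's your environment, {user}: {env_name}"


# Usage: Alice clicks the button; the job is logged either way.
checkout = Team("checkout", {"alice", "ben"}, budget=1000.0, spend=200.0)
print(provision_environment("alice", checkout))
print(AUDIT_LOG[-1]["user"], AUDIT_LOG[-1]["outcome"])
```

Note that denials are logged too, not just successes, so the audit trail shows every attempt.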
So that I know Alice provisioned an environment on Thursday
because then when her co worker Ben tries to
provision something on Friday and the account is full
he can say oh Alice has one, let's see if she's
done. We can turn things off. So what kinds of
things should we automate? Like you may be thinking in your head, well I've got
all this junk that I do. What of it makes the most
sense to automate? So there's a couple of ways of framing up thinking
about automation that you can take a look at. We're going to look at two
of them. So the first one is to automate based on
what feels most comfortable to sort of automate
around a specific set of tasks. Now, we're PagerDuty,
so we took a look at some tasks that you
might do during an incident response, right? Something's wrong. There might
be some things that you need to do. Happens all the time,
right? So looking at tasks
on a set of axes, on the x axis there
on the bottom, we are looking at things that go from simple to
sophisticated or complex, right? So simple things are
like maybe one single command in one set of services, and the
next one sophisticated stuff might be looking across
lots of different services or trying to coordinate something. And then
as we look at the y axis, at the bottom, it's stuff that has
no impact. This is your read only kind of stuff,
right? And then when we get
higher and higher on the y axis, things get a little bit more high impact
and might be a little scary to get automating,
especially at the beginning, right? So thinking about the incident
response context, things that in the green there
are super easy maybe, and you might actually already have
access to them, it's just a matter of putting them somewhere more easily
available. Things like just info gathering, right?
Memory, cpu error messages, get that performance
check, that kind of thing. Getting out into the yellow,
things get a little bit more complex and we might need some help
from our environment for starter restarts or
maybe multi step restarts or other more complex diagnostics
out of like maybe Kafka is pushing something wrong or there's a
weird error message there. And when we get into the red, there's some
things here that folks aren't super comfortable automating, and some of it fortunately
will be handled by your cloud. But it might not be. Whether it's things like
scaling up a complex environment, where you push a bunch of
new containers into a pod or something like that. Or you need to do a
multi service rolling restart, maybe across different locations geographically,
so that things all sync up and reconnect. Or you might have a process
for rollback or redeploy that you get to the point where
you really want to automate those kinds of things, right, for incident response, but you
can sort of take that and break it down. In the Alice example
too, we could do the same sort of tasks in a graph
based on provision environments in the cloud and figure out what
to do there. Another way to look at automating tasks
is through this XKCD example.
There's always an XKCD example, right? So you can
also think about automating the tasks that you do most often or
the stuff that takes the most time. Right. Because we want to look at
getting time back out of our day by not doing these sort of toily tasks.
So you can actually do this kind of graph for yourself and
think about how often do you do a task? Well cloud ops
has to provision environments like 40 times a week. It's crazy. Oh my gosh.
And how long that actually takes. And get yourself to
the point where you figure out okay, we're going to get back x number of
person hours every week by automating this thing. And some stuff
might be hard to justify in this way, right?
You might have some jobs that you only run once a quarter, maybe it's
reporting or something like that. You think oh, we only run this once
every three months and it doesn't run that long. But even that stuff
you can benefit from automating because of the
other things that automation gives you. Because you're only running it once a quarter,
it's easy for people to forget how to do it and mess it up.
So having some automation around that can prevent that stuff too. So you
have all of those sort of benefits that you want to think
about when you're thinking about what kind of things
to automate and what you're going to get back out of all of that.
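The back-of-the-envelope math behind that kind of chart (how often you do a task times how long it takes) can be sketched in a few lines of Python. The 40 runs a week comes from the cloud ops example above. The 30 minutes per run and the 80 hours to build the automation are invented numbers, purely assumptions to make the arithmetic concrete.

```python
def hours_saved_per_week(runs_per_week: float, minutes_per_run: float) -> float:
    """Person hours handed back each week if the task no longer needs a human."""
    return runs_per_week * minutes_per_run / 60


def weeks_to_break_even(build_hours: float, saved_per_week: float) -> float:
    """How many weeks until the time spent building the automation is paid back."""
    if saved_per_week <= 0:
        raise ValueError("saved_per_week must be positive")
    return build_hours / saved_per_week


# Usage: 40 provisioning requests a week (from the talk), 30 minutes each (assumed).
saved = hours_saved_per_week(runs_per_week=40, minutes_per_run=30)
print(saved)  # 20.0 person hours per week
print(weeks_to_break_even(build_hours=80, saved_per_week=saved))  # 4.0 weeks
```

This only counts time. It does not capture the other benefits mentioned above, like fewer mistakes on a job people only run once a quarter, so treat it as one input rather than the whole decision.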
And eventually you get to the point where you've got a bunch of automation or
you have a bunch of targets for automation. You could think about
what exactly the end state is for that automation.
Because not everything's going to get to a perfect run on its own
kind of life, right? So think, too,
as you're building your automation portfolio,
about the evolution of those automation components. Whatever you've
written them in doesn't even matter. Starting on the left we
have automation opportunities. That's just fancy way of saying things we haven't automated yet,
right? They're just in the backlog, we'll get to them. But then we
get to the most common one which is human initiated automation.
This is your common scripts and other tools the team members
are running on demand. This is Alice getting her environment right.
We want something done, we need to complete a task and it's unscheduled. This is
going to happen when some other work has been completed. So it's
human dependent, human initiated. A lot
of stuff hangs out in that part, right? That's where
a lot of automation kind of lives because that's the workflow,
that's the nature of this kind of work. If you get to the point where
you've got stuff that you want to kick off, say during an incident.
We get to the point where we might get some automation with oversight. This is
automation that runs on its own in response to some environmental trigger.
Some event happens. And it might be simple things like your
cron jobs that rotate logs or other stuff like
restarting a service when it stops responding to queries or whatever. The
automation still requires maybe some humans to make sure it runs
okay. It might require permission, to be able to say,
hey, this alert came in, should we try this remediation?
And some human responder has to press the button. You could
totally do that. And when we get to the point where
we're pretty comfortable with that, we might get to the point where the automation
runs on its own with a fallback stage. It runs and it only
requires a human intervention if something goes wrong. So it might just keep
running in the background that one time out of ten that it fails
with some weird error code. Then it pops back up to the human responder
and says, hey, I tried this, it didn't work. You need to take a look
at it, right? It will let humans know when it's not able to fix a
thing. If all goes well, the other nine times
out of ten it just does what it needs to do, doesn't necessarily need to
report. Eventually you might get to the last
phase on some things: monitor and evaluate. Tasks get done with
automation. Edge cases are pretty well managed. You don't get
errors all that often and instead of tasks coming
in and creating tickets or alerts, they just create metrics,
right? Rather than saying, like, Brian and April cleared N
requests for this error code this week, you might know
this error code was completed by automation N times this week.
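To make the fallback and metrics stages a little more concrete, here is a minimal Python sketch. It assumes a remediation is just a function you can call and that notifying a human is another function; run_with_fallback, METRICS, and the error code are hypothetical names for illustration, not part of any real PagerDuty or Rundeck interface.

```python
from collections import Counter
from typing import Callable

METRICS: Counter = Counter()  # stand-in for a real metrics system


def run_with_fallback(
    error_code: str,
    remediation: Callable[[], None],
    notify_human: Callable[[str], None],
) -> bool:
    """Try the automated fix; only involve a human if it fails."""
    try:
        remediation()
    except Exception as exc:
        # The one time in ten: count it and pop it back up to a responder.
        METRICS[f"{error_code}.escalated"] += 1
        notify_human(
            f"Tried the automated fix for {error_code} and it didn't work: "
            f"{exc}. Please take a look."
        )
        return False
    # The quiet path: no ticket, no alert, just a metric.
    METRICS[f"{error_code}.resolved_by_automation"] += 1
    return True


def restart_service() -> None:
    """Hypothetical remediation that succeeds."""


def stuck_restart() -> None:
    """Hypothetical remediation that fails."""
    raise RuntimeError("service did not come back")


# Usage: one quiet success, one escalation.
pages: list[str] = []
run_with_fallback("E1234", restart_service, pages.append)
run_with_fallback("E1234", stuck_restart, pages.append)
print(METRICS["E1234.resolved_by_automation"], METRICS["E1234.escalated"], len(pages))
```

The counters are what turn "somebody restarted it a few times" into a number you can watch week over week.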
So you can monitor it over time, right? And rather than relying on, like,
how Brian and April feel about having restarted this thing or whatever
the task is, you actually have metrics based on the alerts
coming in, the automation handling things. So you end up being able to
monitor and manage the performance of the system a little bit better because you've
got the actual raw data there. Not all of
your tasks are going to reach all of these phases. Some things,
like I said, are more naturally going to hang out in that human initiated automation
part, but even that is going to gain back so much time for
your team to get all those sort of toily tasks out of your way
and put them into your automation portfolio so that you can
do more interesting stuff and learn new things and do
whatever you need to do otherwise. So that is most of
my talk. Hopefully, that was helpful for folks. We love to talk about this stuff.
I didn't talk much about PagerDuty. If you want to talk about PagerDuty, we are
always available. You can find more information at pagerduty.com.
We also have an automation platform that was built off of Rundeck,
which there's information there at rundeck.com. There's also
a community version if you want to try that out. And there are
some links there on our community site for our
automation resources and code and all kinds of
examples and things like that, too. So we love to talk about this stuff,
and it would be great to hear from any of you. If you'd like to
reach out on our community site, we'd love to hear from you. So I hope
you have a great rest of Conf42.