Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi,
welcome to "Improve Your Automation to Reduce Toil."
My name is Mandi Walls. I am a
DevOps advocate at PagerDuty. If you'd
like to get in touch with me at any point, I am lnxchk
on Twitter, or you can email me at mwalls@pagerduty.com,
and you can reach out to PagerDuty at any time on pagerduty,
our Twitch handle. So let's talk about automation,
right? So we're going to start with a bit of the basics. Hopefully you've
done a bit of automation before, but maybe you haven't really
thought about what you're actually
doing, like in an abstract way. Right. We've heard the word and
have a kind of a concept of what we're after and the goals
that we have for automation. So we're going to
talk about exactly what that means and how to sort of put
it together in a way that is going to be successful for not just
yourself but for your team and your larger organization,
right? So when we're thinking about automation for IT
practices, right, we're going to take some manual processes that
someone on our team had to run piece by piece,
probably, right. We're going to take that
stuff and we're going to ask some kind of machine to do it for us.
We're going to look for tasks that machines don't need
to do any other thinking for, things that machines will do very well. They don't
require a lot of nuance, they don't require a lot of creativity,
they don't require a lot of lateral thinking in the moment.
And those are the things we're going to encode so the machine understands
how to manage them. Right? And then we're going to let our humans,
they're going to go do something more interesting and hopefully more valuable
to our organization. Right? Automation is a key component
in the management of complex real time systems that
a lot of us are immersed in, right? The amount of information that
has to be processed and taken into account when
you're making a decision or making a change is
immense, right? Microservices, cloud platforms,
third party and internally developed software, all these things,
they all have their own idiosyncrasies and
behaviors and there's just so much stuff that
one person really can't hope to know. All of
the things about all the things that we work on. So to
get work done and to do it right, teams rely on automation
for their common tasks. And automation is going to help them avoid mistakes.
It's going to increase their reliability, it's going
to increase the repeatability of those tasks.
And overall, we're really after reducing
toil. We'll talk about what toil actually is in a moment.
Right. We're going to use automation to build and test and deploy our
software. We are going to use it to create and maintain
our infrastructure. Anytime we perform
a task more than once, or we
have a task that more than one person should be able to perform
in exactly the same way, we really should be considering
automation so that we can ensure that those tasks are
completed as expected. Our team members aren't automatons.
Right? They're not robots. So when we want to make sure processes
are always performed the exact same way, we produce our
automation and we create little robots, right? So automation can
take different forms. It might be a library of
scripts and tools. It might be a set of YAML files
that are ingested by another tool. It might be encoded
into the configurations of even just your build server,
right? These are things that collect knowledge from your
colleagues and store it in a place where
everyone can use all of the amassed knowledge of your entire team.
You are collecting up expertise, encapsulating it,
and sharing it. So let's talk about what toil means,
right? Toil for IT teams is a
four letter word, literally and figuratively.
Right? It's things that you
have to get done, but nobody really wants to do them. They're not
fun, they're not interesting. Right? Teams want to do interesting things.
They want to do tasks that impact the bottom line, that create value,
that require some kind of expertise, but also
creativity, right? And toil is kind of the opposite
of that. It's the repetitive tactical work that increases
linearly as the size of the environment increases. So think about
things like deploying software to all of the systems,
adding user or system accounts to all the systems,
scaling environments, or testing backups,
if you ever actually do that, right? Even in fully
modern systems, the task of regularly building
your short-lived environments
just becomes a new version of toil, just rerunning those things
over and over. It's not something you want to be doing by hand.
We know this work has to be done. It doesn't particularly
add value in the moment to your customer or
your end user. But leaving it neglected will definitely
impact user experience at some point sooner or later, right?
It might eventually allow a security breach. It might
cause degraded performance. It's kind of like hygiene,
right? Think about brushing your teeth. I know we've
all been stuck at home for a year and a half, but if
you skip a day brushing your teeth, okay,
most people won't notice. But if you quit completely
or you put it off for a long time, people are going to notice and
they're probably going to be concerned for you, right? And that's sort of the same
kind of thing with these toil-based tasks, right?
They have to be completed. But fortunately,
unlike brushing your teeth, we don't have to do these tasks
with humans, right? We really should be automating them so
that our humans don't have to do them.
So looking at some of the things that we might
have as drivers of automation, right, we might have complexity,
speed, curbing mistakes, and reducing toil as
some of our goals for the things we are
targeting for automation. Manual processes in
general are prone to mistakes. There's plenty of opportunity
for typos to cause havoc, right, when you're trying to work in
large environments, anything from typing commands
in the wrong terminal window, right, to missing options off a
very long command string, to skipping a step when you're copying and
pasting out of a wiki onto another page or onto a documentation or
out of there into the terminal or whatever it is. Modern IT systems
might be comprised of hundreds or even thousands of individual
components, right? So you've got your cloud infrastructure,
you've got containers, you've got hosts, you've got networks, you have services for monitoring,
collecting metrics, you have alerting systems like PagerDuty,
you have log collection, you have authentication and authorization. Maybe you're
A/B testing or beta testing for new features. You have storage and runtimes
and all of this stuff. The number of possible combinations is
nearly infinite. So interacting with all of these, you're just
multiplying how many possible ways things can go wrong.
And any of them can change at any time,
right? Third party services and resources get changed by the vendor
on their own schedule. They don't care what your schedule is, right, requiring updates
and changes. And then you've got your internal development and that
requires changes too. It's hard to keep up with where all the things
are and what they're supposed to be doing. Even for teams that are super
conscientious about documentation, and I know you all are,
my list of service instances might be outdated before I finish it,
right? Especially if my environment uses sophisticated auto scaling. Any work
that I need to perform on the instances needs the most recent
data and also probably some kind of query to
an API instead of a hard coded list of instances. Hopefully we
are all sort of beyond that point where we've got like one big hosts
file for everything. You might not be. It happens, right?
But it's easy to make mistakes when we're using these manual processes,
right? Especially things with lots of steps or complex
commands. So one thing automation really does for manual processes
is that it allows your team to permanently record all
the good options and all the preferences and put them
in a place where you don't have a chance to forget them or
leave them off or whatever it is, right? So you're
getting to a place where all of those good things that
you learned are recorded, right? And then finally we
automate to avoid toil, all that repetitive work that we just talked about,
things like patching systems and restarting services or whatever it is.
A team still has to know how to do all these things in order to
automate them. But performing the same tasks over and over isn't the
best use of our team members' time. And
then thinking about what we actually do use our time for,
right, there are a couple of myths about automation, right? One of them
is, is it possible to automate yourself out of a job?
And it's a fun myth. And I love this XKCD
cartoon, right, because it kind of links back to what we were talking about with
saving time. But the cartoon is,
I spend a lot of time on this task, so I should write a program
automating it. And you find that in your future, you spend most of your
time refactoring your automation. Hopefully you're not doing that, right?
So when you're thinking about what your job is
going to be after you've automated all of your tasks, right, you're really
automating yourself into a new job. Your team
has automated the toil out of the everyday tasks. You're hopefully
left with the work that requires more creativity, things that
require more long term planning that provides
more strategic value to the organization than deploying thousands
of patches over the course of a month, right? Things like planning
improvements, building more robust disaster recovery
plans, building and shipping more new features for your users.
Smoothing out the tasks required to get basic tasks done
leaves more time for doing all the fun stuff, right? So at some
point, your day to day job will look significantly different
from what it once was, right? Instead of running manual processes,
you'll be maintaining and updating those things periodically and performing
higher-value tasks on a daily basis.
So the cartoon is a little bit tongue in cheek. You definitely
don't want to be spending all of your time maintaining your automation,
but there's going to be work involved and that
hopefully will be fun and of a higher value than some of the
toil things that you're doing. Another myth is
you don't have to know anything if it's all automated.
And this is an interesting one, right, because there is research in
systems engineering and automation engineering that
indicates having fully automated environments can
hinder skill development for junior engineers or new
team members. To create really useful automation,
someone on the team has to have had at some point
really robust expertise in the systems and processes
being automated. Your team won't be successful if no
one knows what's really going on, even with automation.
So you have the automation, and maybe the person who wrote
it has left. Automation is part of the lifecycle of
your application and your systems, and it will need to change
as those services change and are updated.
It will probably need to be updated when the operating
system or dependencies are updated. Maintaining the
automation tools for a service is part
of maintaining that service itself. So learning how to maintain
and test the automation and runbooks and other tools is a
key way to help your new team members learn
about the services. Does the start stop script need to
be updated? Are logs now going to a different location?
Are updates now being downloaded from a different artifact repository?
These are key pieces of system information that your team will use to
maintain all the tools and things that you help run. And they
are components that will also help new members gain
more expertise into the systems that they're
working on. So keeping this in mind, right.
Making sure you're doing skill development and knowledge sharing with
younger team members, with your junior team members, the new folks on your team,
it can be super important. Absolutely. So let's
talk a little bit about what actually makes good
automation. We've seen it,
but how do we know what it is? Right? So I've borrowed
a set of requirements from Lee Atchison's
book Architecting for Scale because I think he has really encapsulated
the key points here into a nice bulleted list.
So there's a wide variety of tools
and platforms available to help you automate workflows across
all kinds of options and ecosystems. Right. It's hard to know
which tools might work best in your environment,
but as a baseline, there's a list of requirements here
that might be helpful. Right. So looking at some of these here,
we want our automation to be testable.
We want to be able to test that the automation is correct.
It's going to be out there in our ecosystem, it's doing things on our behalf,
so we want to trust it, right? Maybe we're going to apply some TDD
methods to the automation code; that will help us as well. So we're relying
on our test suite when we're making changes and making updates to
not just the application code but also the automation around it.
And then we want it to be flexible. We want to get
a lot of value out of it over time, right? So we're not relying on
hard coded system lists or other data
when we can add maybe a query or an API call.
So variables or version numbers and service names are going to help
us for upgrades and making sure that we get a long lifecycle
out of this automation that we've invested in, right?
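As a minimal sketch of that kind of flexibility (Python, with hypothetical names throughout): instead of a hard coded host list, the function below takes a lookup callable that stands in for whatever inventory query or API call your environment offers, and the service name and version are parameters rather than literals. `build_upgrade_plan`, `fetch_instances`, `fake_inventory`, and the `upgrade-service` command string are all illustrative assumptions, not any particular tool's interface.

```python
from typing import Callable, Iterable, List


def build_upgrade_plan(
    service: str,
    version: str,
    fetch_instances: Callable[[str], Iterable[str]],
) -> List[str]:
    """Return one upgrade command per instance currently running `service`.

    `fetch_instances` is a hypothetical stand-in for an inventory query or
    cloud API call, so the host list is looked up at run time, not hard coded.
    """
    hosts = sorted(set(fetch_instances(service)))
    # "upgrade-service" is a made-up command used only for illustration.
    return [f"ssh {host} upgrade-service {service} {version}" for host in hosts]


def fake_inventory(service: str) -> List[str]:
    """Test double; a real lookup would query your inventory or cloud API."""
    return {"web": ["web-2", "web-1", "web-2"]}.get(service, [])


if __name__ == "__main__":
    for command in build_upgrade_plan("web", "1.4.2", fake_inventory):
        print(command)
```

Because the plan is built from its inputs and nothing else, it can also be checked in a test suite with a fake inventory before it ever touches a real system.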
And then we want to put our automation into our version control
systems and practice code reviews. And that will help us
maintain this piece of automation over
time, right? It's far better than having a directory full of
script.sh_bak files for folks
to wade through when they have questions, right? You use your version control,
you use your code reviews and your test suites, and that's going to
help you manage assumptions and catch issues before they become production
problems, right? They also help your new team members, like we were
just talking about, become familiar with the services and the automation
and all the rest of the pieces in the ecosystem. So keep your automation
for related systems the same, right? Make it applicable
to all the other things. This can easily get out of hand if
your application teams aren't required to use the same runtimes or other
tools. I totally understand that. But anywhere
you can reuse components, you should do it. Create official libraries,
best case methods for dealing with your most common components.
Make these solutions the easiest way to get work done. Maybe it's a
fast track, maybe you don't need a change ticket or other permissions to
use it if you're using the blessed version, right? And then finally we want
repeatability and audibility, right? Auditability. We want to
know that every time the tool runs it's
going to produce a predictable outcome. And some tools
are far better at this than others and provide easier tracking
for who made a change or who ran a command. But overall,
across the sort of entire marketplace of automation tools,
this has been getting a lot better over the past several years. So hopefully the
tools that you're using and the things that you're looking at or considering are
also providing that kind of feature. So then let's talk about
keeping that in mind, right? All those requirements. Thinking about
the big vocabulary word in automation, and that's
idempotency. I've also heard it pronounced "id-em-potency."
I'll go with "eye-dem-potency." It just flows off the tongue a little better,
right? Hopefully you've heard this term before when you're thinking about
automation and automation products. But if not, don't worry,
let's do a quick review. Right, so idempotency
has a fancy mathematical definition, right?
So idempotency is the property of certain operations
whereby they can be applied multiple times
without changing the result beyond the initial application.
It sounds confusing and potentially ominous,
but it's super helpful for thinking about what happens when you
run a piece of automation more than once.
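In symbols, an operation f is idempotent when f(f(x)) = f(x). A minimal Python sketch of the difference, using a made-up list of configuration lines (the function names and the sample settings are assumptions for illustration):

```python
from typing import List


def append_line(lines: List[str], line: str) -> List[str]:
    """Not idempotent: every run adds another copy of the line."""
    return lines + [line]


def ensure_line(lines: List[str], line: str) -> List[str]:
    """Idempotent: add the line only if it is not already present."""
    return lines if line in lines else lines + [line]


config = ["port=8080"]

# Applying the idempotent operation a second time changes nothing.
once = ensure_line(config, "log_level=info")
twice = ensure_line(once, "log_level=info")
print(once == twice)  # True

# The naive version keeps pushing copies of the line into the config.
naive_twice = append_line(append_line(config, "log_level=info"), "log_level=info")
print(naive_twice.count("log_level=info"))  # 2
```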
How are you going to handle any messages that it might
generate, right? So what happens when, on your systems,
you want to add a user but that account is already present?
What happens if you want to rotate logs but they're already rotated?
Does your log rotation create empty files like certain versions of
Mandrake Linux did in 2001,
which I still have nightmares about? If you are installing a software package
and it's already installed, what happens? What if you are
concatenating a new configuration line to the bottom of the file? Does it create
a new file? Does it delete it? Does it just keep pushing more and more
and more copies of that line into the file? What happens, right? For tasks
that you want to automate, you're not going to have a human
there. By definition it's automation.
You're not going to have a human there to read the output and say oh,
this has already been done, I don't have to do it again. Right. You'll want
to add some check to your automation to determine if the change
needs to be made in the first place. Right. If the thing you're trying to
accomplish has already been completed,
job done, you don't have to do it, right? So this is where
automation starts to get pretty complex, right? You don't
want your scripts to bomb out, you don't want them to return an error
if the work that they want to have done is
already done. You're already in a successful state, so you should report
that, right? So you also want to make sure that the
state that the system is in matches what you want and do
those things. So some system tools will already have some
of this built in so they won't try and redo work that is
already done and they won't also then drop an error. Other tools
you definitely want to verify because they might not redo the work,
but they'll return an error code, which might mean that your automation fails.
And that's not fun either. Right? So if you're building your own tools, you can
keep this in mind, right. You'll want
to build in some of this idempotency yourself. I want to create this
file if it's already here, what kind of process am I going to
take? If it's here but the contents are wrong, what's my next process?
Those kinds of things and how you build that stuff in.
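Those three cases (the file is missing, it is present but wrong, or it is already correct) might be sketched in Python like this. `ensure_file` and the sample path are hypothetical names for illustration; the point is that finding the work already done is reported as success rather than raised as an error.

```python
import tempfile
from pathlib import Path


def ensure_file(path: Path, content: str) -> str:
    """Make `path` hold exactly `content` and report what was done.

    Returns "created", "updated", or "unchanged". Finding the work already
    done is reported as a successful state, never as an error.
    """
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content)
        return "created"
    if path.read_text() != content:
        path.write_text(content)
        return "updated"
    return "unchanged"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "app" / "settings.conf"
        print(ensure_file(target, "log_level=info\n"))   # created
        print(ensure_file(target, "log_level=info\n"))   # unchanged
        print(ensure_file(target, "log_level=debug\n"))  # updated
```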
So over time, as you're building up your automation library,
you're going to have more skill around checking things on your systems,
depending on what kind of systems you're running on, and get some
best practices around checking the state
of things and how those things are going to work. So another
fun thing to think about when we're automating
is what stuff do we
bother to automate? Right. You could think about,
oh, I want to automate everything. I just want it to run all by itself
without me. But that's not realistic, right. You want to automate
the tasks you do most often or the
ones that take the most time. Right. That's going to save you the most
effort over the longer term as well as reducing your
overall toil. And another XKCD cartoon, because there's
always an XKCD cartoon for these. Right. This is
kind of, again, a little bit irreverent: a plot
of how long you can work on making a routine task more
efficient before you're spending more time than you save. And it
plots it across five years, right. So looking at how much time do you
shave off the task along the y axis
and across the x axis is how often you do the task.
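The arithmetic behind that chart is just time shaved per run, times how often the task runs, times the horizon. A small Python sketch, assuming a five-year horizon to match the cartoon's framing; the function name and the sample numbers are made up for illustration:

```python
def worthwhile_automation_seconds(
    seconds_saved_per_run: float,
    runs_per_day: float,
    horizon_years: float = 5.0,
) -> float:
    """Total time saved over the horizon, which is also the most you can
    spend on automating before you spend more time than you save."""
    return seconds_saved_per_run * runs_per_day * 365 * horizon_years


# Shaving 30 seconds off a task you do 5 times a day:
budget = worthwhile_automation_seconds(30, 5)
print(f"{budget / 3600:.1f} hours")  # 76.0 hours

# Shaving an hour off a task you only run once a year:
yearly = worthwhile_automation_seconds(3600, 1 / 365)
print(f"{yearly / 3600:.1f} hours")  # 5.0 hours
```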
So if you're doing something all the time,
automating that even just a little bit is going to save you a
lot of time. But if you only have a task that you only have to
run in February or whatever, maybe
the time it takes to automate that isn't worth it if you aren't going
to save that much time. So something to think about when you are thinking about
your tasks, right? We're going to keep in mind our
task requirements, we're going to think about our idempotency, and we're going to think
about the right tasks to automate at the right time. So looking at
maybe some tasks that we could automate,
thinking about these in a slightly different way.
Right. So this particular graph is looking
at specific kinds of tasks, rather than sort
of the abstract view of the last one. We're looking at tasks that
we might need for incident response, right? Because I work at PagerDuty and
we do a lot of incident response. And the x axis here is impact,
right? And it tells me if a change needs to be made
and I'm going to automate it, is it a change that makes
maybe no impact on the running systems, or is it something that has a high
impact? And then across the y axis,
we have things that are simple, maybe a single step, and then more sophisticated
things that might be multistep or multi node or complex workflows or
need a little bit of orchestration, right. So if we're struggling with
what things should be automated, you can make a list of tasks
and sort of plot them out for your team so that you can
kind of tackle the things that are no impact and simple, the things on
the bottom left to give yourself some confidence
in building automation. Then the tasks with
higher impact are things like restarting
services, maybe it's a single service or a group of services, or you're performing a
failover or whatever. And then the highest impact
tasks can change key pieces of your infrastructure,
changing your firewalls, rolling back or redeploying software
and those kinds of things. So thinking about the
kinds of tasks that one you do most often or take the longest
time and you can save the most time off of,
plus the tasks that are where
your level of comfort is in the kind of complexity and impact
the tasks themselves will have for you. So how comfortable your
team is with any of these things being automated definitely varies,
right? There's definitely places, different teams that are like,
we don't want to automate anything. We're super afraid we're not
skilled in this. We're thinking like, it's going
to go crazy, like the brooms in Fantasia
or whatever, right? So you might have a lot of
things that actually are pretty complicated, right? You might
have a service that compiles all of its libraries into memory,
and you can't really cold start or restart it fast.
So that might be something that goes to the bottom of the list. And we'll
think about automating that after we gain a bit of confidence in
our overall automation skills and
then looking at building up our
library. The way your humans interact with all
this automation is a bit of a maturity
scale as well, right? How we look
at automation evolves over time.
Right. As our team gets more comfortable with specific types
of automation and you get better at creating it, the human interaction
really should decrease, right? And the automation runs more on its own.
You're building confidence. You're building trust in the processes that you have.
So you start out with what we call automation opportunities.
Really a fancy way of saying things haven't been automated yet.
Everything's really still manual, right? What tasks exist
in your environment that could be automated but haven't?
Make your plot and figure out your good targets for that.
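One way to make that plot concrete is to score each task on impact and complexity and sort so the bottom-left corner comes first. This is only a sketch: the 0 to 3 scales, the `Task` and `automation_order` names, and the scores given to each example task are assumptions for illustration, not values from the graph.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    name: str
    impact: int      # 0 = no impact on running systems ... 3 = high impact
    complexity: int  # 0 = single step ... 3 = multi-node orchestration


def automation_order(tasks: List[Task]) -> List[Task]:
    """Sort so low-impact, simple tasks (the bottom left of the plot) come first."""
    return sorted(tasks, key=lambda t: (t.impact + t.complexity, t.impact, t.name))


tasks = [
    Task("roll back a deployment", impact=3, complexity=2),
    Task("gather logs", impact=0, complexity=0),
    Task("restart a single service", impact=2, complexity=0),
    Task("change firewall rules", impact=3, complexity=3),
]

for task in automation_order(tasks):
    print(f"{task.name}: impact={task.impact}, complexity={task.complexity}")
```

Starting from the top of that sorted list gives the team low-risk wins before it takes on the tasks that touch key infrastructure.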
Then we look at human initiated automation. These are our common
scripts and other tools that our team members can run on demand when
they want something done or need to complete a task in an unscheduled manner.
Right? Your basic scripts and pieces
in your library directory. Super helpful, right?
Automation with oversight is automation that
starts running on its own in response to some environmental
trigger. This might be simple things like your cron jobs that
rotate your logs, or more interesting things like
restarting a service when it stops responding to queries.
Depending on your environment, you might have auto scaling
in this category. You don't quite trust it yet, so you
do keep an eye on it, but it kicks itself off automatically.
It might still require some humans making sure it runs okay.
You might have a little alert that pops up and says hey, script a
is running, please check me out or whatever. So while it might start
on its own, it might also let folks know that it's
doing a thing. And if folks aren't comfortable with it yet, they can check it
out. But eventually you get to automation with fallback and the automation runs
and only requires humans to look at it
if something goes wrong, if it finds an error or an unexplored
edge case. It has its own escalation
functionality to let humans know that it wasn't able to finish its
task or fix what it was supposed to fix. But if all goes well,
the automation doesn't necessarily need to report, right? You'll see
a significant reduction in overall toil at
this point, right? So when you have built up
all of this trust in your library of
automation and components and scripts like that, and eventually
you might get to the monitor and evaluate phase and
you might get to this phase with certain tasks and not others, right?
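The fallback behavior described above, and the move from reports to metrics, can be sketched as a small wrapper that stays quiet on success and escalates only on failure. Everything here is a hypothetical stand-in: `completed` plays the role of a metrics backend, `escalations` plays the role of a paging or ticketing system, and the two placeholder tasks do no real work.

```python
from collections import Counter
from typing import Callable, List

completed = Counter()        # stand-in for a metrics backend
escalations: List[str] = []  # stand-in for a paging or ticketing system


def run_with_fallback(name: str, task: Callable[[], None]) -> bool:
    """Run `task`; stay quiet on success, escalate to a human on failure."""
    try:
        task()
    except Exception as exc:  # any failure or unhandled edge case
        escalations.append(f"{name} could not finish: {exc}")
        return False
    completed[name] += 1  # a metric, not a ticket or an alert
    return True


def rotate_logs() -> None:
    """Placeholder for real work that succeeds."""


def restart_service() -> None:
    """Placeholder for real work that hits an unexplored edge case."""
    raise RuntimeError("service did not come back up")


run_with_fallback("rotate_logs", rotate_logs)
run_with_fallback("rotate_logs", rotate_logs)
run_with_fallback("restart_service", restart_service)

print(completed["rotate_logs"])  # 2
print(escalations)
```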
When things get done, edge cases
are managed and instead of tasks creating tickets or
alerts, they might just create metrics. Rather
than a report saying hey, Brian and April cleared n
requests for x task this week, you might have a metric instead
that says x task was completed n times this week by
the automation, and then you're still sort of managing it as part of
your environment, but your automation is taking care of all the work and you
don't have to do that, right? So not all of your automation tasks, not all
of your systems will get to this point. They won't all go through all the
phases. You might already have some stuff that you totally trust, right? But some
of your complex tasks might only ever get to
automation with fallback due to the nature of your complex systems. That's totally
fine. And the important part is to be thinking
about where you're headed, the things that you need to do and the things you
need to accomplish with your automation so that
you're sort of constantly evolving and improving and making
sure that the automation that you're producing is
actually helping your teams get better and reduce
their overall toil. So let's talk, just to sort
of wrap up, about a tool called Rundeck. And I won't
do a full demo, but we'd love to show you all the wonderful
things that Rundeck does if you'd like to see them. But Rundeck is
an automation tool. It's an automation platform, right. It's a software
solution really specifically built for
the kind of automation that production teams,
teams that are working on services that are
customer facing, user facing, and the tasks that need to
get done there. Right. You can combine Rundeck
with PagerDuty, right, for auto remediation of issues before
they become incidents and accessible tooling for responders during
incidents. And we love that because we're driving down our mean time
to resolve and those kinds of things. The
ability to automate some of those tasks, whether it's something simple like just gathering
up the logs or doing a restart,
can create a lot of improvements when we're dealing with incident response.
But Rundeck itself really provides more of a generalized
platform for your team to securely
perform kinds of tasks in production and then
delegate things to other teams. So what you really have is a
way to encapsulate expertise. You have the folks that are
the subject matter experts and they can write little bits
of automation and different steps. And the
best practice for restarting this thing, or here's how we
do our patches and updates on this platform,
and here's how you rotate the logs for this particular
runtime or whatever it is, and you
take that piece of knowledge and you stick it into your Rundeck server,
and then you make it available to anyone who might need it, and you
can hook it into your authorization and authentication
servers and off it goes, right. So they
can build up these complex workflows and
allow anybody else to manage things. So if you're coming at it from the
perspective of say, an SRE team, you create all the
tasks and tools and little bits of things that people
ask you to do all the time. And there's tickets coming in and they're like,
can you rebuild this dev environment for me? And blah, blah,
blah. And what you can do then is take
all this stuff that you've learned and put it into Rundeck and say, hey,
yeah, here Brian, go run this task.
You now have permissions to run this task in the dev environment and that's going
to redeploy the thing that you needed and you don't have to ask us for
it anymore. So your automation is
less about running on its own all the time. It's going to
be human initiated automation,
but the humans that are initiating it are folks
that don't necessarily need to have all that same expertise that, say, your SRE
team does. So it can help you deal with that kind of
everyday toil and requests and things like that
that come in. And the way that ends up working is
you have your users who maybe need a thing done right now
because they're blocked on something and they can just go
to the Rundeck server, request the task and off it
goes. Because someone has already prepared the automation, tested it,
and then provided it for them in a secure way. They don't ever
have to touch any of the nodes
that might be out there living in the real world that you don't want them
to touch. You have to stay away from them. And then one of the good
things about these kinds of platforms is that you get
reports back, right. One of the hard things about writing
your own automation and putting those platforms together is producing
the kinds of detailed information for
sort of the unskilled users, or not necessarily unskilled,
but not necessarily knowledgeable in the tasks that you know about,
right. They have other main primary tasks that they do,
and you're giving them a bit of abstraction for things
that they know what they want the outcome to be, but they don't know all
the details. But with some nice green text on a
screen, they can tell, hey, the thing did okay, and if they hit some
red text and it didn't go okay, then they can reach out to you.
But you're really giving them a way
to act like you do when you interact with
the system. They're doing the same things that you would do without having
to distract you from the work that you're doing on a regular basis.
And then one of the other really key pieces
for automation is that when a lot of organizations
have hesitancy around automating tasks,
it's because they're like, you have to tell us exactly who did what,
when they did it and what happened. And when you're
building your own automation, you can cobble together
some things and maybe have some log files and you
send status over to other components and
things like that. But providing it again to the folks that
might be your auditors or doing compliance reports or
those kinds of abstractions that really
aren't necessarily down to
digging through your crontabs or whatever, but want to see what kinds of things
were run, then you can provide them with an audit log.
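As a rough sketch of what "who did what, when they did it and what happened" looks like as data, here is a tiny Python wrapper that records one entry per run. This is not Rundeck's API or log format; `run_audited`, the `audit_log` list, and the field names are assumptions for illustration, and a real audit trail would go to durable, append-only storage.

```python
import getpass
import json
from datetime import datetime, timezone
from typing import Callable, Dict, List

audit_log: List[Dict[str, str]] = []  # stand-in for durable, append-only storage


def run_audited(task_name: str, task: Callable[[], None], user: str = "") -> Dict[str, str]:
    """Run `task` and record who ran what, when, and what happened."""
    entry = {
        "user": user or getpass.getuser(),
        "task": task_name,
        "started_at": datetime.now(timezone.utc).isoformat(),
        "outcome": "succeeded",
    }
    try:
        task()
    except Exception as exc:
        entry["outcome"] = f"failed: {exc}"
    audit_log.append(entry)
    return entry


def rebuild_dev_environment() -> None:
    """Placeholder for the real job."""


run_audited("rebuild_dev_environment", rebuild_dev_environment, user="brian")
print(json.dumps(audit_log, indent=2))
```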
And looking at automation platforms and automation tools that
provide you with that is another good way to help your team
build confidence around the automation that you're providing, the things that
you're writing, and over time, trusting all
of that stuff in a much more sophisticated way so
that they're more likely to want to automate more stuff in the
future. So for the next set of toil
tasks that you have, they're more comfortable
with producing automation for them. So there's lots of things to sort of
keep in mind as you're building your own automation, as you're looking
at tools to help you with your automation, thinking about your idempotency,
thinking about your flexibility and testability and all those
great things, and then finally thinking about how
your automation impacts how
your team interacts with other teams, how your automation is
going to impact the perception of your team, maybe from
other places like, oh, those folks will produce
lots of tasks for us and give us automation
so that we can do it ourselves and we don't have to wait for them,
right? So you're seen as the folks who are super responsive because you're
asking people to do the work themselves via the automation.
So lots of things to think about. If your team isn't super familiar with
automation, hasn't really gone on that jaunt
yet. Some things for them to maybe take a look at and think
about. We have some resources. There's lots of stuff
on the Rundeck website about the approaches to
automation and things like that. We have an entire,
we call them ops guides. It's kind of a white paper. It's at
autoremediation.pagerduty.com for sort
of a written format of the things discussed in this talk and to
give you some ideas on how to plot your journey
into automation. If you've just been kind of doing it catch as catch can
and thinking about it as more of a direct
part of the job that you want to do as an
SRE or as platform engineering or whatever kinds
of tasks that you might be doing. So hopefully this was helpful.
We're happy to answer any questions; we'll be on the Discord.
And like I said earlier, if you'd like to reach out to me at any
time, I'm at lnxchk. I hope you enjoy the rest
of the conference and thanks for listening.