Transcript
I started Shoreline to solve the problem of runbook automation:
deal with the false positives, handle the simple,
repetitive issues, and make sure that even for
a complex issue, you have all the diagnostics before a
human has to come to the machine to take an
action. It just made so much sense to me. After eight teams
at AWS, there was one big problem,
though. Hardly anyone had runbooks, and even
when they did, they were stale or
inconsistent with other things out there.
So what was someone to do? What we did
here at Shoreline is decide to make it easy to create runbooks,
and that's what we're talking about here today. About
a year ago, when LLMs first started becoming popular,
even before ChatGPT took off,
we started working on prompt engineering using
the LLMs that were available then. And of course, we've made
those yet better over time, both for diagnostics
and for repair. Let's take a look at what that looks
like. So over here you'll see a
top-problem report where we're pulling data in
from a ticketing engine; in this case, it's PagerDuty.
We do that on a continuous basis and measure
MTTA, MTTR, the number of people involved in
an incident, how long it took, and the number of times we saw it.
So you might say, well, let me work on the things
that have the largest aggregate MTTR.
In this case, that's the Apache server being down.
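As a toy illustration (not Shoreline's implementation), ranking issue groups by aggregate MTTR boils down to a group-and-sum. Here's a sketch over a made-up CSV of incidents, where the group labels and minute counts are invented:

```shell
# Toy data: one incident per line, as "group,mttr_minutes".
cat > /tmp/incidents.csv <<'EOF'
apache_down,42
disk_full,10
apache_down,35
disk_full,12
EOF

# Sum MTTR per group and count occurrences, worst offender first.
awk -F, '{ total[$1] += $2; count[$1]++ }
         END { for (g in total) print total[g], count[g], g }' \
    /tmp/incidents.csv | sort -rn
# apache_down accumulates 42+35 = 77 minutes, so it sorts to the top.
```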
And the way we calculate these groups is that we
apply lightweight machine-learning clustering algorithms to
see which tickets look similar, so that, for example,
a hostname in a ticket doesn't split otherwise-identical issues. Then
we also apply some semantic understanding using
an LLM to say, hey, this thing is
talking about a disk being full and this other thing is talking about a persistent
volume claim being full: those are actually
the same issue, or at least they can be addressed with the same diagnostics
and repair commands. So let's take
this particular issue, the Apache server being down, and generate
a runbook. You can see what's happening:
second by second it's running,
getting me all the diagnostics to check the status
of the VM, check the logs, see if the deployment
is running, see if the VM itself is accessible, see
if the necessary ports are open, and so on. And then I can also pick
other things. So maybe I want to add a script to
say, hey, did it crash? Do I need to restart it?
And now it's going to take a second or two and
generate that script for me, I hope.
And there we are. So here's a
bash script to do that. You can also add your own diagnostic
prompts here, and you can also add
remediations. Right? So in this case, let me go
and do a restart and get
that in my bash script. There we are.
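A generated script along those lines might look roughly like this sketch. The service name `apache2`, the log window, and the `APPLY` guard are all my assumptions for illustration, not what the product actually emits:

```shell
#!/usr/bin/env bash
# Sketch of a "did it crash, and do I need to restart it?" check.
# Service name and commands are illustrative assumptions.

SERVICE="${SERVICE:-apache2}"

# Map a systemd state to a restart decision.
needs_restart() {
  case "$1" in
    failed|inactive|dead) return 0 ;;   # down: restart warranted
    *)                    return 1 ;;   # active/activating: leave it
  esac
}

# What state is the service in? (Falls back to "inactive" off systemd.)
state=$(systemctl is-active "$SERVICE" 2>/dev/null | head -n 1)
state=${state:-inactive}
echo "$SERVICE is $state"

# Pull recent crash evidence from the journal, if available.
journalctl -u "$SERVICE" --since "1 hour ago" 2>/dev/null \
  | grep -iE 'segfault|fatal|error' | tail -n 5 || true

if needs_restart "$state"; then
  echo "$SERVICE appears down; set APPLY=1 to restart"
  if [ "${APPLY:-0}" = 1 ]; then
    sudo systemctl restart "$SERVICE"
  fi
fi
```

The `APPLY` guard keeps the diagnostic half safe to run anywhere while making the repair half an explicit opt-in.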
And you can run this on Kubernetes, you can run it on VMs,
you can run it on Linux or Windows, you can
run it on the three major cloud providers, et cetera. And once
you're happy with it, after adding cards, replacing things,
et cetera, you can export it to Markdown,
in this case maybe a Confluence wiki, or you can export
it to Shoreline. And we'll talk about that in a moment.
So one of the problems with LLMs is
that they can hallucinate. We've all seen that over the last
year. And so one of the things we do here
at Shoreline is manually curate the runbooks
that we create or that are created by our customers.
We're at a little over 300 right now,
and I hope to get to about 1,000. That doesn't
mean you're going to use all thousand. It means that across the variety
of things that you use,
there's a runbook for you that's been created and
tested out already. So you can start from a
clean place, and you don't have to run into an issue
just to have the repair ready at hand. So, for example, here we're seeing
MongoDB issues. I'm not an expert at
MongoDB, but isn't it nice that we have eleven
of them from people who are?
For me, the core problem with runbooks
is that they don't actually run.
You have to cut and paste into every node that you
want to modify, and that makes it super inefficient.
So wouldn't it be nice if our runbooks were like Jupyter
notebooks and just had both the markdown and the
repair actions right there? So this is
something for an application load balancer that's running into
500 errors. And it's basically saying: across
all my hosts that are running, let's say,
an ALB, go grab
the load balancer names, get their details, describe the ingress
paths, and so on and so forth, and eventually
maybe even do some remediation with a heap dump and a rolling
restart and so on.
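Sketching what such a notebook's cells might run: `describe-load-balancers` and `kubectl get ingress` are real CLI calls, but the log format, deployment name, and guards here are my assumptions:

```shell
# Sketch of the ALB-500s runbook cells; each step is skipped
# when the relevant CLI isn't installed or configured.

# 1. Grab the load balancer names and their details.
if command -v aws >/dev/null 2>&1; then
  aws elbv2 describe-load-balancers \
    --query 'LoadBalancers[].LoadBalancerName' --output text || true
fi

# 2. Describe the ingress paths.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get ingress --all-namespaces || true
fi

# 3. Count recent 5xx responses in an access-log sample.
count_5xx() { grep -cE 'HTTP/[0-9.]+" 5[0-9]{2} ' "$1"; }

# 4. Remediation cells would follow: a heap dump, then a rolling restart,
#    e.g. `kubectl rollout restart deployment/my-app` (name invented).
```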
Last but certainly not least, once you
have these things in a controlled environment,
you can apply fine-grained access
control and audit capabilities. So, for example,
this is a notebook run for
high CPU, and in this case there's a set of actions
we want to take: make sure we have the right number of bookstore
instances, make sure that the release is correct,
and check them for high CPU over a minute so that we
aren't chasing an issue that has already gone away.
Then we keep the metrics over time
and list the top processes using a top command, or
the top threads. In this case, I'm going to see that,
yeah, it's the JVM thread, so I probably have a
JVM issue, and I might go and look
at the logs. And yeah, it's definitely
running into allocation failures. So let me go and
do a dump and restart, and after
that validate that the issue is gone. And so all
of that was done in the past, but I have all of the data
from an auditability perspective, including what
was returned on standard out and standard error.
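The diagnostic chain just described (top processes, then threads, then logs) can be sketched like this; the log path, service name, and repair commands are assumptions:

```shell
# Sketch of the high-CPU diagnosis steps from the notebook above.

# 1. Top CPU consumers (what the "top command" card would show).
ps -eo pid,pcpu,comm --sort=-pcpu 2>/dev/null | head -n 5 || true

# 2. Does the application log show GC allocation failures?
has_allocation_failure() { grep -qi 'allocation failure' "$1"; }

# 3. If it does, the repair cards would take a heap dump and restart:
#    jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>   # pid invented
#    systemctl restart bookstore    # service name from the demo
```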
And that's really important. The other thing that's important, which I'm
not showing here, is that you can also apply fine-grained
access control: not just who can run a runbook,
but who can run which actions on which resources
at what time, for example only when on call;
for the other commands, what sorts of approval workflows
they want and where those approvals should go; and how I
integrate all of this with my observability tool of choice,
with Slack or Teams, or with my incident
management or ticketing tools.
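As a toy illustration of the idea (not Shoreline's actual mechanism), gating an action on an allow-list while writing an audit line per attempt might look like:

```shell
# Toy allow-list gate with an audit trail; user names, log path,
# and policy are all invented for illustration.
ALLOWED_USERS="alice bob"
AUDIT_LOG="${AUDIT_LOG:-/tmp/runbook_audit.log}"

run_gated() {
  local user="$1"; shift
  if echo "$ALLOWED_USERS" | grep -qw "$user"; then
    echo "$(date -u +%FT%TZ) ALLOW $user: $*" >> "$AUDIT_LOG"
    "$@"                         # run the action as requested
  else
    echo "$(date -u +%FT%TZ) DENY $user: $*" >> "$AUDIT_LOG"
    return 1                     # refuse, but keep the record
  fi
}

# run_gated alice systemctl restart apache2    # allowed, logged
# run_gated mallory systemctl restart apache2  # denied, logged
```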
So hopefully that gives you a quick flyby.
I hope you found it interesting. You can always reach out to me at anurag
at shoreline.io if you want to hear more or have any
questions or feedback. Thank you so much.