Transcript
Well, hello and welcome to my talk, LMAO Helps During Outages. I'm Richard Lewis. I'm a senior DevOps consultant with 3Cloud Solutions, and I've had the pleasure of working with both software development teams and operations teams. I currently have over 20 years of experience working in this industry. I'm also the co-organizer of a Chicago user group, and as you can tell from everything around me, I'm a diehard White Sox fan.
So a little bit about my company: 3Cloud is the largest pure-play Azure services partner in the world. We have about 600 people who are dedicated Azure professionals focused on data and analytics, app innovation, and helping clients build a modern cloud platform.
So LMAO, as you could probably guess by now, is not just laughing your way through the outage; it's actually an acronym for something else. It's an acronym for Logs, Metrics, Alerts and an Observability tool. And these are the things that you need to have to make up an LMAO strategy. So what am I actually talking about today? I'm talking about providing platform support strategies for your team members, creating a standard for knowledge sharing, helping you reduce the mean time to resolution, and building the psychological safety of your team members so that they're able to LMAO their way through outages.
So if you're not familiar with logs and metrics, those are the things that are going to provide you the insight as to what is happening and when. They're going to help you figure out how many errors occurred and how many requests you got, and they're going to help you figure out the duration of those errors and requests.
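As a rough illustration of the kind of data being described here, this is a minimal Python sketch (not from the talk; the handler name and in-memory counters are hypothetical) that records the three basics: how many requests came in, how many errored, and how long each one took.

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("metrics-sketch")

# Hypothetical in-memory counters; a real system would ship these to a metrics backend.
counters = {"requests": 0, "errors": 0}

def instrumented(fn):
    """Record the what (errors), the how many (requests), and the how long (duration)."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        counters["requests"] += 1
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        except Exception:
            counters["errors"] += 1
            log.exception("request failed")
            raise
        finally:
            duration_ms = (time.monotonic() - start) * 1000
            log.info("handled %s in %.1f ms (requests=%d errors=%d)",
                     fn.__name__, duration_ms, counters["requests"], counters["errors"])
    return wrapper

@instrumented
def handle_checkout(order_id: str) -> str:
    # Placeholder for real application work.
    return f"order {order_id} accepted"

if __name__ == "__main__":
    handle_checkout("12345")
```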
Now as for alerts, they come in two main ways. Pages: those are critical things like, hey, our whole network is down, or hey, our website's offline. And tickets: those would be things like, hey, the hard drive that runs our website is at 80% full, or 85 or 90% full. Or we're seeing a slowness in our ecommerce website. Not an outage, just some general slowness, just some pain.
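To make the page-versus-ticket split concrete, here is a small, hypothetical routing sketch in Python; the severity labels and example messages are assumptions, not anything prescribed in the talk.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str  # "critical" or "warning" in this hypothetical scheme

def route(alert: Alert) -> str:
    # Critical, customer-facing failures page the on-call engineer immediately;
    # everything else becomes a ticket to handle during business hours.
    if alert.severity == "critical":
        return f"PAGE on-call: {alert.name}"
    return f"TICKET for backlog: {alert.name}"

if __name__ == "__main__":
    print(route(Alert("website offline", "critical")))          # page
    print(route(Alert("web server disk 85% full", "warning")))  # ticket
```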
And like most of you, I've experienced alert trauma. This is me at my first IT job, circa 2011, and I was supporting an Access 97 application. I was the only one doing this. I was on an on-call rotation 365 days a year for 16 hours a day, because it was also a call center and they worked two shifts. I had no way to connect to the office remotely. Luckily I lived within 10 miles of the office, so whenever there was an outage I would get paged, using actually this pager right here that you see on the left side of your screen, and I would respond to it. My company had no logging framework, no concept of a logging framework. And within six months, you guessed it, I burned out. But I didn't quit.
I let my boss know what was going on, my thoughts and opinions. My boss was very receptive to those thoughts and opinions, and we worked on putting together an LMAO strategy. We tackled the issue of the logging framework first; we used log4net. We tackled the on-call rotation issue by hiring additional people and spreading that load out. So we created a little schedule of when we were going to do these things. So managing your alerts effectively requires certain things. As you can guess from what I just said, scheduling your team members appropriately is an important part of it. You can't have a person on call 24 hours a day, 365 days a year. It's just not effective, and they're going to leave. Then they take their tribal knowledge with them, and now you're stuck in that position all over again.
Avoid alert fatigue wherever possible. If they don't need to be woken up or paged about something, let's not page them. Collect data on the alerts that are actually going out and look into ways to reduce those alerts; those are continual improvement opportunities. If the alerts that are going out are because the servers have reached a high level of CPU usage, look at things like autoscaling. Or if it's an ecommerce website where all of a sudden the traffic has gone through the roof, say every day between certain hours the website's traffic increases, then maybe look at having autoscaling there as well. Or if it's a hard drive, as I said earlier, that hit a certain capacity threshold, then look at some kind of runbooks that could automatically heal those kinds of things. And be cautious when you're introducing new alerts. Every alert has a purpose, and that's to, that's right, alert you. So be very cautious about the alerts that you're introducing and the impact that they're going to have on your team.
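As a sketch of the kind of self-healing runbook check mentioned for the disk example, the following Python snippet uses the standard library's shutil.disk_usage; the thresholds and the remediation steps are placeholders you would tune to your own environment.

```python
import shutil

# Hypothetical thresholds, roughly matching the 80/90% examples from the talk.
WARN_PERCENT = 80
PAGE_PERCENT = 90

def check_disk(path: str = "/") -> None:
    # Measure how full the volume is; a real runbook would rotate logs, clear temp
    # files, or expand the volume instead of just printing what it would do.
    usage = shutil.disk_usage(path)
    percent_used = usage.used / usage.total * 100
    if percent_used >= PAGE_PERCENT:
        print(f"{path} is {percent_used:.0f}% full: page on-call and run the cleanup runbook")
    elif percent_used >= WARN_PERCENT:
        print(f"{path} is {percent_used:.0f}% full: open a ticket, attempt automated cleanup")
    else:
        print(f"{path} is {percent_used:.0f}% full: nothing to do")

if __name__ == "__main__":
    check_disk("/")
```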
Observability, on the other hand, is a little bit different. Observability is more the voyeurism of it. You get to see what's actually going on on a nice, pretty screen or dashboard or something like that. It's going to be helpful for you to understand your KPIs or SLAs. You can actually monitor your usage of different systems the same way, and spot trends. Dashboards definitely help spot trends, and that's where you see products like Power BI and Tableau and things like that, so you can spot trends. Observability dashboards do the same thing.
And when it comes to observability tools to help you with your observability dashboards, there are hundreds of thousands of products out there. I actually took this screenshot here from the Cloud Native Computing Foundation's website. Under their monitoring section, they highlight a ton of different products. This is not exclusively all the products out there, but it's a wide range of different types of products that you can use for observability. I'm sure there are at least one or two products on this screen, if not more, that you're probably using at your current workplace. Looking at this screen, I counted at least five or six, okay, more like nine products that I regularly recommend to clients and that are helpful for their different situations. I don't think there's one size that fits all; I think you want to use the appropriate tool for the appropriate cost at the appropriate time. They all fluctuate on costs and trade-offs. This is actually a dashboard here from New Relic, and this dashboard highlights, in the left corner, the synthetic failures and the violations of policies that they have in place.
And I believe it's just a coincidence that it's working out where you have 13 policy violations and 13 synthetic failures; that correlation could be just because every time that synthetic failure policy is violated, it registers as a new violation, so the numbers line up at that point in time. But the top one is the violations of policies, and the one below it is the synthetic failures. Synthetic failures are usually generated by some third-party tool monitoring or testing your system. So that would be something like an alive-or-dead call to see if your service is up or down, or how long it takes for your website to load, using some kind of framework like Selenium or something like that. The next thing we see is the errors that are occurring per minute. So now we know how many failures have happened and the rate of failures we're seeing per minute.
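For reference, a bare-bones synthetic check like the ones described above can be as simple as the following Python sketch; the endpoint is a placeholder, and a real setup would use a dedicated monitoring tool or a Selenium-style browser test rather than a plain HTTP probe.

```python
import time
import urllib.error
import urllib.request

def synthetic_check(url: str, timeout: float = 10.0) -> dict:
    """A bare-bones 'is it alive, and how long did it take' probe."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code   # the server answered, but with an error status
    except urllib.error.URLError:
        status = None       # could not reach the server at all
    elapsed_ms = (time.monotonic() - start) * 1000
    return {"url": url, "status": status, "elapsed_ms": round(elapsed_ms, 1), "up": status == 200}

if __name__ == "__main__":
    # Hypothetical endpoint; point this at your own health check URL.
    print(synthetic_check("https://example.com/"))
```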
And this makes me wonder, what did we change recently
in our system that caused this problem?
And that's where you see this here. This is actually the deployment notes, and luckily they're writing good release notes, I'm assuming, and that's how you know what's in the most recent release or what has a possible impact on what's going on here. I doubt that the readme with endpoints is causing the problem, but it could be. This is another dashboard here, and this one's actually made by a company called Grafana.
I really like this dashboard. It's a good example of being able to embed a wiki and documentation directly into the dashboard, so that whoever's on call can just pull up the dashboard related to that application and click on a link to get from here to wherever they need to go to see a true source of truth, or to go access details about who to contact, or details related to the third-party service, or something like that. There's also a diagram in here, and it's showing how you can load images into it of what that service actually looks like, like that service's path and architectural diagram.
And we're highlighting here the ups and downs of the service, so now we know the frequency of the service going up or down. That's followed by, again, a highlight of the status codes that are being received by the service or sent out from this service, so we see a combination of 500 errors and 200 responses.
So the next thing that's really important to think about is preparing your team for those outages. Having a playbook is one of those critical and key things, and practicing for those outages as well. And having a playbook comes down to what you actually put in it. So I noted on here, it's important to have it in a location where it can be quickly accessed. Sometimes you may want to put your playbooks internally, and I'm kind of 50-50 on those kinds of situations. If your internal system has to be accessed through a third-party system that may possibly be having an outage itself, then you're going to delay yourself getting into your playbook. So if you're using something like Azure AD to authenticate to get into your company network, then if there's a problem with Azure AD, it's going to delay you getting to your playbook.
I like other systems as well, still tied together using SSO, single sign-on, but tied to other systems like Atlassian's Confluence or a third-party wiki system or SharePoint, so back to Azure AD again, or SharePoint, or Microsoft Teams, which has a wiki system built into it if you're a user of that as well. Somewhere where you can keep it outside of your network but still accessible, which kind of lessens the likelihood of having an issue there.
But inside those playbooks, you want to put things like links to your application that may be related to the observability tools you're using, as well as details about the golden signals of that application. That way, when a person is actually looking at what's going on and hearing from the users what's going on, they're able to say, this is within the normal range, this is not within the normal range. Any relevant notes or information from previous outages, those kinds of things help you tie things back together. So if you're doing a post-mortem after the outage, putting a link to the post-mortem notes is quite helpful. Also contacts for the application owner, or for any third-party services that own part of it as well. So say your application runs on something like Azure or AWS, then links to the premier support contact information, so that whoever's on call knows how to get a hold of them to escalate and get the right people on the call properly. That way they don't have to call a manager or something like that and say, who do I call about this? Or so forth. Or links to things like Stripe's website, if you have a payment service that may be causing a failure or something like that as well.
And anything else that you may think of; there's a ton of things you may want to put in your playbook around that application. I do suggest, though, dedicating a playbook per application as opposed to doing just a single playbook with hundreds of thousands of things in it. You can put all of your playbooks together in the same system, but you want to have them broken out by section at least.
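One way to picture that per-application breakout is a simple skeleton like the one below. It is only an illustration of the sections mentioned above (dashboards, golden signals, past outages, contacts), every value is a placeholder, and in practice this content would live in your wiki rather than in code.

```python
# A hypothetical per-application playbook skeleton; the structure is just to show the shape.
PLAYBOOK_TEMPLATE = {
    "application": "<app name>",
    "dashboards": ["<link to your observability tool's dashboard>"],
    "golden_signals": {
        "normal_error_rate": "<e.g. under 1% of requests>",
        "normal_latency": "<e.g. p95 under 500 ms>",
    },
    "previous_outages": ["<link to post-mortem notes>"],
    "contacts": {
        "application_owner": "<name / channel>",
        "cloud_support": "<Azure or AWS premier support contact>",
        "third_parties": ["<e.g. payment provider status page>"],
    },
}
```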
And as I was talking about preparation and training, I know I mentioned the importance of this before, but I come from the Midwest of the United States, where we have a lot of tornadoes.
And so as kids, we were trained, like you see on your screen here, when we heard that siren, to go into the hallway, put our hands over our heads, and curl up into a little ball, in preparation for a tornado coming through the area if one were to happen. But we trained for it, and we knew what to do as muscle memory. So the practice of chaos engineering is something still being worked out, though it's been around for a while.
The concept was created by Netflix. The goal there, really, is that it helps you increase resiliency, and you're able to identify and address single points of failure early. What you're doing is running controlled experiments against your system and predicting the possible outcome; that outcome could actually happen, or it could not. And that's where the chaos engineering comes into play. You don't really know if it's going to happen or not, but the goal, in the end, is to identify your failure points and address them, so that if something was to happen at those failure points, you'll be able to sustain it.
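To make the idea of a controlled experiment concrete, here is a toy Python sketch; the flaky dependency, failure rate, and fallback are all hypothetical, and real chaos tooling (Gremlin, Microsoft's offering mentioned later, and others) runs experiments like this against live infrastructure rather than a toy function.

```python
import random

# Inject failures into a hypothetical dependency call and verify the caller's
# fallback keeps the service usable.
FAILURE_RATE = 0.5  # fail half the calls during the experiment

def flaky_dependency() -> str:
    if random.random() < FAILURE_RATE:
        raise ConnectionError("injected failure")
    return "live recommendation data"

def get_recommendations() -> str:
    # Hypothesis: if the dependency fails, we fall back to cached data instead of erroring.
    try:
        return flaky_dependency()
    except ConnectionError:
        return "cached recommendation data"

if __name__ == "__main__":
    results = [get_recommendations() for _ in range(20)]
    assert all(results), "service should always return something"
    print(f"{results.count('cached recommendation data')} of 20 calls used the fallback")
```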
There's a great article about how Netflix practices their chaos engineering. I put a link below for you if you want to go take a look at it. And after those outages, though, you want to take the time to do a post-mortem, usually within a day or so of the outage, while it's still fresh in everyone's mind. You want to get everyone together around the conference table and just talk through what went right, what went wrong, where did you get lucky?
And just figure out what needs to be documented in terms of preparation for a possible future outage. I really like this quote here from Devon with Google, and it is: the cost of failure is education. So, going back to my first on-call rotation job, one of the things I was required to do regularly was ride-alongs with technicians for appliances.
And when we would go out to customers' houses, the customer would have something broken that they may have tried to fix themselves, but they may have done it wrong, not fully followed the instructions they got from the manufacturer, or not fully listened to a YouTube video they were following and missed some key details. So we would be able to quickly resolve those issues within a matter of minutes, and the cost of education in that case was our service call fee. And so the customer would learn something new. They would learn how to fix that problem in the future, but at the same time, it gives them the ability to get their system back online really quickly. So the cost of failure is education. It's a good quote.
So my takeaways from my talk today are pretty simple. Have an LMAO strategy in place; have it documented and ready to go, and make sure everyone knows where it's at. Update those documents regularly: after your outages, go back and update them, and have a revision date on those documents so you know the last time they were updated. And if they haven't been updated in six months, either you're not having outages around that system or you're not documenting what's happening with that system, so good or bad there. Avoid alert fatigue. The less alert fatigue you have, the more psychological safety you're building into your people, and the more comfortable they are, they know where their documents are and they're able to go forward from there, the less likely they're going to want to leave your organization and take that tribal knowledge with them. It'll cut down turnover and everything. And run readiness preparation drills regularly. Chaos engineering, again, is a newer thing, but there are a lot of tools out there that can help you. Gremlin makes some great products, with great documentation out there from them. Microsoft has a great chaos engineering product as well, with great documentation to help you think about ways to do these things.
And thank you so much for listening
to me today. Thank you for your time and
enjoy the rest of the conference.