Transcript
Hello. I hope you're enjoying Conf42 Chaos Engineering. My name is Quintessence, and I'm here to talk to you about sensory friendly monitoring, or how to keep the noise down. Relatedly, make sure you mute any Echo or Alexa devices in listening range, or use headphones, because, as a reminder, we are in each other's homes. But I won't be trolling you today. When we try to know everything,
it looks a little something like this. You start off with a bunch of Slack notifications. Maybe you're using Trello or Jira or both between different teams, and those get integrated into your chat platform. Or, going beyond engineering, you can also look at HR announcements and lunches and things that people are trying to do together, especially now that we're all very isolated and our primary means of communication are virtual. Add a status page and PagerDuty into the mix, and all of a sudden there's just a ton of notifications that need to be handled. And the end result of this is just too much noise.
It's too much noise to be effective. When you have too much noise, it buries high-importance, severe alerts in a sea of low-priority notifications. And then you're going to mute individual notification sources, right? How many people have closed Slack, or Teams, for focus time, for example? And the result of closing those notifications is that you won't get them. You've created an artificial silence. You said, okay, well, I'm getting too many notifications in Slack or Teams, so I'm going to close that. And now, if something important comes through, how is it going to reach you? And when that happens, it's not as cathartic as you would hope, because you know you've manufactured the silence. The silence isn't because nothing's wrong. The silence is because you've turned on mute. And what we need to do with all of this is find
a happy medium. So first, let's talk about the cost of noise.
Now, when we're thinking about our brains, we're actually thinking all the time. We're osmotically absorbing what's in our environment. Think of any time any of us have lived with another human, whether or not we're doing so now. If that other human had any significant or serious interest in a topic, you might find you're suddenly able to speak to that topic despite never having read, listened to, or otherwise engaged with it yourself, because they were engaging with it in your environment. This is one of the earliest indicators that your brain is always processing that background information. And this is relevant to alerts: if you think to yourself, oh, I'll just ignore these alerts, you're not actually ignoring the alerts. You're trying to multitask, after a fashion. And the time cost you lose when you redirect your focus is about 25 minutes. And I'm talking about the little alerts here.
This isn't something like, oh, there's an outage, and now I'm completely redirecting my attention. I'm talking about the walk-up days when we had an office: Joe the engineer would walk up for the Wi-Fi password, you would give him the Wi-Fi password, and then redirect your attention back. Or, on a more everyday level, you get a notification in the upper right-hand corner of your monitor and you choose to dismiss it. A very quick interaction, and the time loss is still about 25 minutes. That number comes from a study by Gloria Mark at UC Irvine, covering the cost of interrupted work. And what she found was that when
you're working through complicated tasks, like we do in engineering and in SRE and in chaos, trying to understand very complicated systems, you're building a mental model. Then you see the distraction, the alert, the notification, and all of a sudden that mental model is gone. In order to get back to where you were on the problem you were trying to solve, you need to rebuild that model. And that is where the time sink comes in. So what happens if,
instead of trying to redirect and direct back, you do try
and multitask? And this was actually covered by
George Mason University. They did a test with AP exams, actually. For those unfamiliar, AP exams are scored on a zero-to-five scale, and they wanted to see what would happen if people were interrupted. Again, these were the quick interruptions, not an entire repositioning of your workflow. They found that the AP test score dropped, on average, about half a point, for 96% of participants. Nearly all of them did worse, right? Because what they tried to do is ignore whatever the background noise was and focus, in this case, on the writing section of their test. But they lost the essay that they were building. They lost their place, and so they tried to add it back in, and unfortunately, that hit the overall quality of the essay.
So then there was a follow-up study by someone familiar with both the UC Irvine study and the George Mason University study, who decided to ask: okay, well, what happens if we give you your time back? They gave back approximately 30 minutes, in alignment with the earlier study, and they found that the same ninety-some-odd percent still did worse. Their scores did not recover even when the time was added back in. And that's a lot. And one
more study relevant to all of this: multitasking with sequence-based tasks. This is when you reduce the complexity of the task. So instead of writing an essay or building a cluster, imagine Bart writing lines on a chalkboard, the same line over and over. In this case, it was a very short, sequential task. And again, they introduced background noise into the environment: notifications, alerts, et cetera. And what happened? The error rate tripled. So even when you reduce the complexity of the task, you still encounter problems in high-distraction mode. None of
this sounds very great, so let's talk about what we
can do. How can we fix this? So, first thing we
can do is be aware, not overwhelmed.
You want to be aware of the notification sources and everything that's coming
through, but you don't want it to overwhelm you while you're trying to make your
way through your day. That said, you also need to determine those sources of
noise, and then you need to categorize them, channel them,
and then create a routine to clear the clutter. And I'll get into that later.
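To make that categorization concrete, here is a minimal sketch in Python of what tallying notifications by source might look like. Everything here is hypothetical: the notifications.json export, the field names, and the sources are stand-ins, not any particular tool's API.

import json
from collections import Counter

# Hypothetical export of a week's notifications, one record per event,
# e.g. {"source": "slack", "urgency": "low", "service": "billing-api"}.
with open("notifications.json") as f:
    events = json.load(f)

# Tally by source; the noisiest sources are the first candidates to
# recategorize, rechannel, or clean up.
by_source = Counter(event["source"] for event in events)
for source, count in by_source.most_common(5):
    print(f"{source}: {count} notifications this week")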
And so, to be aware without being overwhelmed, you're trying to determine the source. Instead of saying, oh, I wish Slack or Teams was less chatty, or, I wish this notification wasn't going off so much, or, I wish people didn't have so many questions, you can ask: okay, what are the sources of the noise? Is it human? Is it machine? Is it a chatty service, for example, that has a lot of low-level alerts that maybe aren't paging my phone but are certainly going into whatever channel they're being sent to? If you can say, oh, I know this source, you can actually do something productive with that. You can make decisions based on it. You don't necessarily want to mute everything, but you might want to redirect what's happening there. And you might also notice that you're keeping around a few legacy things that don't even belong. This goes into the clutter, right? So if you're thinking, okay, we actually changed that endpoint, and so the reason the service is chatty, or not chatty enough, is because it's checking on the wrong condition: easy, I can change it or delete it, depending on what the situation allows for. And this happens pretty
quickly. As a really quick example of a very basic infrastructure: you have your office wiki, you might have something like Elasticsearch, you have Slack, you have mail, you'll have an alert notification platform, you have the infra, you have the code base. And this isn't even getting into all the microservices that might exist around all of these things. Lots and lots of noise.
Then you'll also notice, and I've hinted at it a couple of times already, the next slide is a human, because that human is you. While you're being aware of the sources of noise, one source of noise is you and your other humans. How often are you introducing noise into your own environment? Think about when you check your email between other tasks, or check social media like Twitter or Facebook, or text messages or other forms of messages, and the list goes on, right? You might even have your phone on the desk right beside you, face up, screen on, where you can see the notifications or the news alerts just popping right through, cranking along throughout the day with our very rapid news cycle. With all of these things going on, you're introducing distraction in addition to whatever is happening in your work ecosystem and on the machine you're working on. And this brings us back to this. The reason I mentioned it at the
beginning is that another source of noise in this virtual reality is that we're in each other's homes. I'm presenting to you while you're at home, instead of on a stage in an in-person environment where we could say hi, give hugs or elbow taps or whatever we're comfortable with, right? But because I'm in your home, you could have something like a Google Home or an Echo device that could accidentally be activated by the current speaker, myself or anyone else. And that is another source of noise. And sometimes it's intentional, right? We sit on Hangouts and we have our huddles or our sprint sessions or whatever meeting we're having, and if someone notices you have an Echo hanging out in view somewhere behind you, they might try and talk to it just to get a laugh. And it's funny once, maybe not ten months in, or what's time anymore, really. So when you're interacting in virtual environments and virtual meetings like this one, you want to make sure those devices are muted, or you can just use your headphones, and then the device can't pick up anything but you anyway.
You also need to establish boundaries and communicate them to other people. You need to be able to set times to focus on work and then mute those noncritical alerts. You don't want to just mute the notifications without communicating that you're doing so; you want people to know that's what's about to happen. And this includes messages from friends and family. They need to know what's relevant and they need to know what's expected. So when you're setting those focus times, you might say, okay, this is my time for lunch, and after lunch is my focus time for writing code, review, whatever it is the focus time is being devoted to. And, okay, I know we're not really going out anymore, so this example is pre-pandemic, but you don't want to get a text message from your roommate or spouse or whomever saying, hey, can you pick up XYZ from the grocery store on your way home, when you're trying to focus, when that's the goal of your focus time. You can ask them: hey, can you just send it at the end of the workday, five or six or whatever that is, and I'll address it on my way out the door. Right? And that doesn't cover a genuine emergency; you can say, if there's a situation that needs my immediate attention, please still call me. I'm not muting everything. I just need enough quiet to focus. And as you communicate these things, make sure
you're communicating to coworkers too. Coworkers don't always know your schedule, even if you do have your calendar up to date, because there's one of you and however many of them, and to each of them, you're one of many. Right? Ten people, twenty people on a team, or two teams that are interacting together. It can be very hard to just pull up everyone's calendar to see when everyone's available or focusing or not. So you can just send out a message on the right communications platform, email, chat, whatever, and say: hey, I'm going heads-down for a few hours; please use this alternative communication method, and say what it is, to reach me if you need something urgent in that time; otherwise, I'll check all my messages after my focus time block is over. That gives them an expectation. They know when it starts, because they can see your message. They know when it ends, because you've told them how long. And they know what to do if something actually urgent pops up while your head's down.
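As a purely hypothetical example, such a message might read: hey, I'm heads-down until 3:00 working on the incident review doc; if something urgent comes up before then, call or text my cell; otherwise I'll catch up on chat and email after my focus block ends.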
And then there are the external sources of noise. These are the non-human noises, like PagerDuty or any other notifications that you're using to actually keep track of your infrastructure: monitoring, et cetera. This goes back to categorization.
You know that there might be false positives, false negatives, fragile systems, and then anything that fires very frequently. When you're getting into false positives and negatives, that means you're having either too much noise when you shouldn't or too much silence when you shouldn't, and neither of those sits very well when you're trying to make sure that everything's live in production the way it's supposed to be. So when you see a lot of these false positives and negatives, when you see something that's fragile or frequent, you can actually check the conditions. Do they make sense? Are they measuring what they mean to measure, or checking what they mean to check? Is what you're checking on, for example, actually notifying you of what you wanted to learn? For example, and this is a common one: an HTTP status code of 200 could mean that there are no errors, so you might only be checking for errors that come through on a different status code. But how many of you have seen the joke where it's HTTP 200 and the body text being sent over is an error message? It's just not being flagged as an error, for whatever reason, in the code. Now, obviously that's something else to fix, but that's also a situation that will arise and give you false negatives, right? Something's coming through that's an error, the error is not being categorized correctly, so you don't hear it. It doesn't come through as a notification.
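To make that concrete, here is a minimal health-check sketch in Python using the requests library. It inspects the response body instead of trusting the status code alone; the URL and the error-detection rule are hypothetical stand-ins for whatever your service actually returns.

import requests

def check_health(url: str) -> bool:
    # Return True only if the service looks genuinely healthy.
    resp = requests.get(url, timeout=5)
    if resp.status_code != 200:
        return False
    # The false-negative trap: an HTTP 200 whose body is an error message.
    # Inspect the payload, not just the status code.
    body = resp.text.lower()
    return "error" not in body and "exception" not in body

if not check_health("https://example.com/healthz"):
    print("raise an alert here")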
If something's fragile, that means you're getting a lot of low-priority notifications in a high-priority way.
That can be push notifications, again, look at the corners of your monitor, or you might be getting paged on your phone. This trains you to ignore noise, so when a really high-priority notification comes through, it's just going to get buried underneath all that. So if you can't get rid of things outright or pivot them correctly, you want to make sure that, at the very least, you're setting the urgency correctly. You can get it via email or a chat channel, whatever your integrations specify, rather than blowing up your phone.
You also want to make sure that you're creating a flow around it. You need to know what needs to be known, who needs to know it, how soon they should know, and how they should be notified.
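Sketched in Python, that flow might be captured as a small routing table. The alert classes, teams, and delivery methods below are made-up examples, not any vendor's schema.

# What needs to be known -> who needs to know, how soon, and how.
ROUTES = {
    "db-primary-down":  ("database", "immediately", "page"),
    "disk-70-percent":  ("platform", "within the hour", "chat"),
    "nightly-job-slow": ("data-eng", "next business day", "email"),
}

def route(alert_class: str):
    # Anything unrecognized goes to a triage channel for a human to
    # classify, which is also how misrouted alerts get caught and fixed.
    return ROUTES.get(alert_class, ("triage", "within the hour", "chat"))

print(route("db-primary-down"))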
And the idea here is to try and catch anything that's being misrouted. When I was working as an SRE, cloud engineer, whatever my title was at the time, it wasn't uncommon for me to get alerts routed to me when they might have been better suited to somebody else. They weren't in my area of expertise, and if they're not in my area of expertise, I can't take action on them. Which means there is now time lost, in an active incident, for me to hunt down the person who actually knows this particular service or situation and get them to respond to the alert. Every time that happens, you should definitely be tracking it. Reroute them one by one, and if you notice it's pervasive, sit down and take a look through your monitoring and notification systems to make sure that things are flowing correctly overall. And then, as you're taking a look at all that, when you're looking at
how soon and how to be notified: not everything is critical. There are things that should page me on my phone because I need to drop everything and address them. And then there are other things that I can get to in an hour, or by end of business, or, if it's after hours, the next business day. This will in part be determined by what the source of the noise is and what the situation is, but also by the criticality of the service. If you're looking at what's defined as a tier one, or in some cases a tier zero, service, where it's business critical, that's going to have a higher level of urgency almost across the board than anything that happens in a tier two or tier three, where you're still reliant on them but you don't necessarily need to wake up at 4:00 in the morning for them.
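As a hedged sketch of that tiering logic in Python, assuming a simple in-house severity model rather than any specific product's settings:

from datetime import datetime

# Hypothetical mapping of service tier to notification method.
TIER_METHOD = {
    0: "page",   # business critical: wake someone at 4:00 in the morning
    1: "page",
    2: "chat",   # still needed, but can wait until someone is working
    3: "email",  # review by end of business or next business day
}

def notify_method(tier: int, now: datetime) -> str:
    method = TIER_METHOD.get(tier, "email")
    # Lower tiers never page after hours; they wait for business hours.
    after_hours = now.hour < 9 or now.hour >= 17
    if method != "page" and after_hours:
        return "email"
    return method

print(notify_method(2, datetime.now()))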
You also want to make sure that you have enough redundancy to prevent a vacuum.
So, we started off this year, and I'm sorry to pick on Slack, but just a little: we started the beginning of this year with a massive Slack outage. And that was kind of great, to come back from the holidays and have an artificial vacuum that everyone got to experience together. But if that was your single point of failure for certain low-criticality alerts, and they weren't getting rerouted to email or somewhere else to compensate for the outage, then you are basically dropping those into the ether, and no one's going to see what's going on with the chatter. And remember, the idea with low-importance alerts is that you still need to know about them; you just don't need to know right away. So you don't want to lose that information either.
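Here is a minimal sketch of that kind of redundancy in Python, with the chat and email senders left as hypothetical stand-ins for whatever integrations you actually run:

def send_chat(message: str) -> None:
    # Stand-in for a real chat integration, e.g. a webhook post.
    raise ConnectionError("chat platform is down")

def send_email(message: str) -> None:
    # Stand-in for a real mail integration.
    print(f"emailed instead: {message}")

def deliver(message: str) -> None:
    # Low-importance alerts still need to land somewhere: if the primary
    # channel is out, fall back rather than dropping them into the ether.
    try:
        send_chat(message)
    except ConnectionError:
        send_email(message)

deliver("nightly backup ran 20% slower than usual")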
It's important to be able to change endpoints or change services as needed. So if you're looking at Slack and you're like, oh wow, we're really reliant on this, but it was just a bad day, in this case an unusually long one, then that's fine. But if you notice that you have a microservice, you need to be able to reevaluate it, because maybe it's becoming less reliable, or more reliable, over time, and you need to know whether to switch toward or away from it. You need enough flexibility in your design to accommodate that. All of this comes back to building trust, because the noise needs to be resilient in order to build trust. I need to trust that silence is actually because nothing is happening, and not because something was missed or something's been configured incorrectly or I've had to mute something, going back to anxious Squidward at the beginning. If I've muted it, then I've just manufactured that silence.
It's not actually silent. So again, check how reliable your tools and services are. Make sure they're keeping you in alignment with whatever SLIs, SLOs, and SLAs you have, so that you can keep those SLAs with your customers or the users of your services. And again, ask how much duplication is needed. You don't necessarily need a backup for every single thing; this ties into criticality. You want a backup, or a way to pivot away, for anything that's highly critical. But you don't necessarily need to invest a massive amount of time, energy, and money into things that are less critical. You just need to be aware of them.
And driving this point home just one more time: can you switch endpoints in the event of an outage? Because everyone has them. They're not normally as long as the one I mentioned a little earlier, but you want to make sure that you can accommodate them. And are you regularly evaluating the reliability of your services? This includes internal tools. I know I've talked a lot about Slack, since it's a big name we all recognize, but you might have some homegrown things too. Are you keeping track of their reliability to make sure they're staying in alignment with what you've promised and what you've come to expect from them?
And when you're evaluating those internal services, make sure that you also have the ability to switch. You might make the executive decision at some point that, you know what, we built this homegrown solution, it's not as reliable as we would like, and we would rather devote our engineering hours to this other project or feature or whatever, so we don't have the time to maintain it; maybe we'll switch to a third-party provider. What do those third-party providers offer us, and can we use them? So again, make sure that you're reevaluating reliability internally and externally, and make sure that you're keeping things as duplicated as makes sense for your business. And I've mentioned it a couple of times, and now I'm
going to actually talk about it: sprint cleaning, or how to clean things out when you notice that they're clogged. Every time an alert triggers, ask: is it needed? Was it resolved? Can it be automated? Is there a permanent solution? And was it urgent?
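Those questions can even be turned into a rough audit. Here is a sketch in Python, assuming a hypothetical alerts.json export with per-alert history; nothing here is vendor-specific.

import json

# Hypothetical export, e.g. [{"name": "disk-check", "fired": 40, "acted_on": 0}]
with open("alerts.json") as f:
    alerts = json.load(f)

for alert in alerts:
    # Fired many times but never acted on: probably not needed, or set
    # at the wrong urgency. Flag it for the sprint cleaning pass.
    if alert["fired"] > 10 and alert["acted_on"] == 0:
        print(f"candidate for deletion or rerouting: {alert['name']}")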
When you're looking at all these things, it goes back to: do I need this notification, is it at the correct urgency, and can I delete it? Sometimes we get a little anxious about deleting things because we think, oh, I might need that later. But if you can determine that you really don't need it later, you're just setting yourself up for unnecessary noise, so go ahead and delete it. These are questions that can help you know whether you can delete it. So if
something's been automated away, you might, instead of alerting on the outage itself, alert on the self-healing, because you don't care if it's starting to fail, you care if it can't recover, right? Or if you've built a permanent solution to whatever the problem is, then you don't want alerts off of that condition anymore, because you've changed something to make it irrelevant.
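For instance, here is a hedged sketch of that inversion in Python, alerting on the recovery mechanism instead of the original failure; the counters and threshold are purely illustrative:

# Once self-healing exists, don't alert on each individual failure;
# alert when the self-healing itself stops keeping up.
RESTART_THRESHOLD = 3  # illustrative: restarts per hour before a human looks

def should_alert(restarts_last_hour: int, recovered: bool) -> bool:
    if not recovered:
        return True  # the automation could not bring the service back
    # Recovering, but flapping too often to ignore.
    return restarts_last_hour >= RESTART_THRESHOLD

print(should_alert(restarts_last_hour=4, recovered=True))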
And when you're doing all these things, go ahead and delete. I called it sprint cleaning because you want to make it relatively frequent. Now, the first time you do this, if you haven't done it before, more likely than not you're going to need kind of a multi-phased approach, so that you can actually go through and see that everything's mapping correctly. And that's going to be a bit of a project. To prevent it from being a project again, do it iteratively, in small pieces, just like with sprints, just like with daily standups; you can prevent yourself from having to run a massive project like that again in the future. If you want to take a look at any
of these resources, I have them available online. Thank you all for your time. Again, that link is at the bottom. My name is Quintessence, and I'm a developer advocate at PagerDuty. If you have any other questions, please feel free to hit me up either on Twitter, my DMs are always open, or via email; I can be emailed at quintessence at pagerduty. I hope you enjoy the rest of the conference and the other speakers. They have a great lineup, and have a great rest of your day.