Conf42 Chaos Engineering 2021 - Online

Sensory Friendly Monitoring: Keeping the Noise Down

Abstract

As infrastructure increases in complexity and monitoring increases in granularity, engineering teams can be notified about each and every hiccup in each and every server, container, or process. In this talk, I’ll be discussing how we can stay in tune with our systems without tuning out.

The ability to monitor infrastructure has been exploding, with new tools on the market and new integrations so the tools can speak to one another, leading to even more tools and to a very loud monitoring environment, with various members of the engineering team finding themselves muting channels, individual alerts, or even entire alert sources so they can focus long enough to complete other tasks. There has to be a better way - a way to configure comprehensive alerts that send out notifications with the appropriate level of urgency to the appropriate people at the appropriate time. And in fact there is: during this talk I'll be walking through different alert patterns and discussing what we need to know, who needs to know it, and how soon and how often they need to know it.

Summary

  • When you have too much noise, it buries high importance and severe alerts in a sea of low priority notifications. What we need to do with all of this is find a happy medium. Mute any Echo or Alexa devices in listening range, or use headphones.
  • The time cost that you lose when you redirect your focus is 25 minutes. When researchers introduced noise into the environment during sequence-based tasks, the error rate tripled. Even when you reduce the complexity of the task, you still encounter problems from high-distraction mode.
  • You need to set times to focus on work and then be able to mute those noncritical alerts. You don't want to just mute the notifications without communicating that you're doing so. Make sure you're communicating to coworkers too. Monitoring, et cetera, goes back to categorization.
  • You also want to make sure that you have enough redundancy in order to prevent a vacuum. Are you regularly evaluating the reliability of your services? And this includes internal tools.
  • Sprint cleaning: how to clean things out when you notice that they're clogged. Every time an alert triggers, ask: is it needed? Was it resolved? Can it be automated? Is there a permanent solution? And was it urgent?

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello. I hope you're enjoying Conf42 Chaos Engineering. My name is Quintessence, and I'm here to talk to you about sensory friendly monitoring, or how to keep the noise down. Relevantly, make sure you mute any Echo or Alexa devices in listening range or use headphones, because, as a reminder, we are in each other's homes. But I won't be trolling you today. When we try to know everything, it looks a little something like this. You start off with a bunch of Slack notifications. Maybe you're using Trello or Jira or both between different teams, and those get integrated into your chat platform. Or when you're going beyond engineering, you can also take a look at HR things and lunches and things that people are trying to do together, especially now that we're all very isolated and our primary means of communication are virtual. Add a status page and PagerDuty into the mix, and all of a sudden there's just a ton of notifications that need to be handled. And the end result of this is just too much noise. It's too much noise to be effective. And when you have too much noise, it buries high importance and severe alerts in a sea of all these low priority notifications. And then you're going to mute individual notification sources, right? How many people have closed Slack for focus time, for example, or, whatever, Teams, right? And the result of these closed notifications is that you won't get them, right? You've created an artificial silence. You said, okay, well, I'm getting too many notifications in Slack or Teams, so I'm going to close that. And now, if something important comes through, how are they going to reach you? And when that happens, it's not as cathartic as you would hope, because you know you've manufactured the silence. The silence isn't because nothing's wrong. The silence is because you've turned on mute. And what we need to do with all of this is find a happy medium. So first, let's talk about the cost of noise. Now, when we're thinking about our brains, we're actually thinking all the time. We're osmotically absorbing what's in our environment. Think of any time any of us have actually lived with another human, whether or not we're doing so now. If that other human had any significant or serious interest in a topic, you might find you're suddenly able to speak to that topic despite never having read, listened to, or otherwise engaged with it personally yourself, because they're doing it in your environment. And this is one of the earliest indicators that your brain is always processing that background information. And relevantly to alerts: if you think to yourself, oh, I'll just ignore these alerts, you're not actually ignoring the alerts. You're trying to multitask, after a fashion. And the time cost that you lose when you redirect your focus is 25 minutes. And I'm talking about the little alerts here. This isn't something like, oh, there's an outage, and now I'm completely redirecting my attention. I'm talking about, in the walk-up days when we had an office, Joe the engineer would walk up for the WiFi password, and you would give him the WiFi password and redirect your attention. Or, on a more iterative level, you get a notification in the upper right hand corner of your monitor, and you choose to dismiss it. Right. Very quick interaction, and the time loss is actually 25 minutes. And the reason it's 25 minutes is actually a study by Gloria Mark, who covered the cost of interrupted work at UC Irvine.
And what she found was that when you're working through complicated tasks, kind of like we do in engineering and in SRE and in chaos, we're trying to understand very complicated systems and very complicated things. And what ends up happening is we're building this mental model, and then you see the distraction, the alert, the notification, and all of a sudden, that mental model is gone. And in order to get back to where you were on the problem that you were trying to solve, you need to rebuild that model. And that is where this time sink comes in. So what happens if, instead of trying to redirect and direct back, you do try and multitask? And this was actually covered by George Mason University. And they did a test with AP tests, actually. For those unfamiliar, AP tests are scored on a zero to five scale, and they wanted to see what would happen if people were interrupted. And again, these were the quick interruptions, not the entire repositioning of your workflow. And they found that the AP test score dropped, on average, about half a point for 96% of participants. They all did worse, right? Because what they tried to do is they tried to ignore whatever the background noise was and focus, in this case, on the writing section of their test. But they lost the essay that they were building. They lost their place, and so they tried to add it back in, and unfortunately, it hit the overall quality of the essay. So then there was a follow-up study by someone who's familiar with both the UC Irvine study and the George Mason University study, and they decided to say, okay, well, what happens if we give you your time back? And they gave back approximately 30 minutes, in alignment with the earlier study, and they found that the same 90-some-odd percent still did worse. Their score did not recover even when the time was added back in. And that's a lot. And one more study relevant to all of this: multitasking with sequence-based tasks. So this is when you reduce the complexity of the task. So instead of writing an essay or building a cluster, imagine Bart writing lines on a chalkboard, the same line over and over. And in this case, it was a very short, sequential task. And so they decided to do, again, background radiation, notification, alert noise, et cetera. They basically introduced noise into the environment. And what happened? Their error rate tripled. So even when you reduce the complexity of the task, you still encounter problems from high-distraction mode. None of this sounds very great, so let's talk about what we can do. How can we fix this? So, the first thing we can do is be aware, not overwhelmed. You want to be aware of the notification sources and everything that's coming through, but you don't want it to overwhelm you while you're trying to make your way through your day. That said, you also need to determine those sources of noise, and then you need to categorize them, channel them, and then create a routine to clear the clutter. And I'll get into that later. And so when you're not overwhelmed, right, you're trying to determine the source. So instead of saying, oh, I wish that Slack or Teams was less chatty, or, I wish that this notification wasn't going off so much, or, I wish that people didn't have so many questions, you can say, okay, well, what are the sources of the noise? Is it human? Is it machine?
Is it a chatty service, for example, that has a lot of low level alerts that maybe aren't paging my phone but are certainly going into whatever channel they're being sent into? And if you can say, oh, I know this source, you can actually do something productive with that. You can make decisions based on that. You don't want to necessarily mute everything, but you might want to redirect what's happening there. And you might also notice that you're keeping around a few legacy things that don't even belong. And this goes into the clutter, right? So if you're thinking, okay, we actually changed that endpoint, and so the reason the service is chatty or not chatty enough is because it's checking on the wrong condition: easy, I can change it or delete it, depending on what the situation allows for. And this actually adds up pretty quickly. As a really quick example of a very basic infrastructure, right, you have your office wiki, you might have something like Elasticsearch, you have Slack, you have mail, you'll have an alert notification platform, you have the infra, you have the code base. And this isn't even getting into all the microservices that might exist all around all of these things. Lots and lots of noise. Then you'll also notice, and I've hinted at it a couple of times already, the next slide is a human, because that human is you. While you're being aware of the sources of noise, one source of noise is you and your other humans. How often are you introducing noise into your own environment? Think about when you check your email between other tasks, or check social media like Twitter or Facebook, or text messages or other forms of messages, and the list goes on, right? You might even have your phone on the desk right beside you, face up, screen up, and you can just see the notifications or the news alerts popping right through, cranking along throughout the day with our very rapid news cycle. So with all of these things going on, you're introducing that distraction in addition to whatever is happening in your work ecosystem and the machine you're working on. And this brings us back to this. The reason I mentioned it in the beginning is because another source of noise in this virtual reality is that we're in each other's homes. I'm presenting to you while you're at home, instead of on a stage in an in-person environment where we could say hi, give hugs or elbow taps or whatever we're comfortable with, right? But because I'm in your home, you could have, like, a Google Home or an Echo device that could accidentally be activated by the current speaker, myself or anyone else. And that is another source of noise. And sometimes it's intentional, right? We sit on hangouts and we have our huddles or our sprint sessions or whatever meeting we're having, and if someone notices you have an Echo hanging out in view somewhere behind you, they might try and talk to it just to get a laugh. And it's funny once, maybe, not ten months in, or what's time anymore, really. So when you're interacting in virtual environments and virtual meetings like this one, you want to make sure those devices are muted, or you can just use your headphones, and then the device can't pick up anything but you anyway. You also need to establish boundaries and communicate them with other people. You need to be able to set times to focus on work and then be able to mute those noncritical alerts. You don't want to just mute the notifications without communicating
that you're doing so; you want people to know that that's what's about to happen. And this includes those messages from friends and family. They need to know what's relevant and they need to know what's expected. So when you're setting those focus times, you might say, okay, so this is my time for lunch, and after lunch is my focus time for writing code, review, whatever it is, whatever the focus time is being devoted to. And you can say, okay, I know we're not really going out anymore, so this example is pre-pandemic, but you don't want to get a text message from your roommate or spouse or whomever to say, hey, can you pick up XYZ from the grocery store on your way home, when you're trying to focus, when that's the goal of your focus time. You can ask them to just send it at the end of the workday, five or six or whatever that is, and I'll address it on my way out the door. Right, and that's not to say there can't be a relevant emergency; you can say, if there's a situation that needs my immediate attention, please still call me. Right? I'm not muting everything. I just need enough quiet to focus. And as you communicate these things, make sure you're communicating to coworkers too. Coworkers don't always know your schedule, even if you do have your calendar up to date, because there's one of you to however many of them, and however many of them to you. Right? Ten people, 20 people on a team, or two teams that are interacting together. It can be very hard to just pull up everyone's calendar to see when everyone's available or focusing or not. And so you can just send out a message on the right communications platform, email, chat, whatever, and say, hey, I'm going heads down for a few hours, please use this alternative communication method (and say what it is) to reach me if you need something urgent in that time; otherwise, I'll check all my messages after my focus time block is over. And that will give them an expectation. They know when it starts because they can see your message. They know when it ends because you've told them how long, and they know what to do if something actually urgent pops up while your head's down. And then there are the external sources of noise. So these are the non-human noises, like PagerDuty or any other notifications that you're using to actually keep track of your infrastructure. Monitoring, et cetera, goes back to categorization. You know that there might be false positives and negatives, fragile systems, and then anything that happens very frequently. And when you're getting into false positives and negatives, that means you're having either too much noise when you shouldn't or too much silence when you shouldn't. And neither of those sits very well when you're trying to make sure that everything's live in production the way it's supposed to be. And so when you see a lot of these false positives and negatives, when you see something that's fragile or frequent, you can actually check the conditions. Do they make sense? Are they measuring what they mean to measure, or checking what they mean to check? Is what you're checking on actually notifying you of what you wanted to learn? For example, and this is a common one, HTTP status code 200 could mean that there are no errors, so you might only be checking for errors if they come through on a different status code. But how many of us have seen the joke where it's HTTP 200 and then the body text that's being sent over is an error message? It's just not being flagged as an error, for whatever reason, in the code. Now, obviously that's something else to fix, but that is also a situation that will arise and give you false negatives, right? So something's coming through that's an error, the error is not being categorized correctly, so you don't hear it. It doesn't come through as a notification.
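As a minimal sketch of that false-negative pattern, not tied to any particular tool: assume a hypothetical health endpoint that returns a JSON body with "status" and "errors" fields. A check that inspects the payload, rather than trusting the status code alone, catches the "200 with an error in the body" case.

```python
# Minimal sketch: a 200 response is not proof of health if the body carries an
# error. The endpoint URL and the JSON fields ("status", "errors") are
# hypothetical, used only to illustrate inspecting the payload.
import requests

def check_service(url: str) -> bool:
    """Return True only if the service looks genuinely healthy."""
    try:
        resp = requests.get(url, timeout=5)
    except requests.RequestException:
        return False  # connection failures are the obvious case

    if resp.status_code != 200:
        return False  # non-200 codes are the easy case

    try:
        body = resp.json()
    except ValueError:
        return False  # a health endpoint that returns non-JSON is suspect

    # The subtle case: HTTP 200 with an error reported inside the body.
    return body.get("status") == "ok" and not body.get("errors")

if __name__ == "__main__":
    healthy = check_service("https://example.internal/healthz")  # placeholder URL
    print("healthy" if healthy else "unhealthy")
```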
If something's fragile, that means you're getting a lot of low priority notifications in a high priority way. That can be either push notifications, again, look at the corners of your monitor, or pages on your phone. This trains you to ignore noise, so when a really high priority notification comes through, it's just going to get buried underneath all that. So if you can't get rid of things outright or pivot them correctly, you want to make sure that, at the very least, you're setting the urgency correctly. You can get it via email or a chat channel, whatever your integrations specify, rather than blowing up your phone. You also want to make sure that you're creating a flow around it. You need to know what needs to be known, who needs to know it, how soon they should know, and how they should be notified. And the idea here is to try and catch anything that's being misrouted. So when I was working as an SRE, cloud engineer, whatever my title was at the time, it wasn't necessarily uncommon for me to get alerts that were routed to me when they might have been better suited to somebody else. They weren't in my area of expertise, and if they're not my area of expertise, I can't take action on them. Which means that there is now time lost in an active incident for me to hunt down whoever actually knows this particular service or situation and get them to respond to this alert. And every time that happens, you should definitely be tracking it. Just reroute them one by one, and if you notice it's pervasive, sit and take a look through your monitoring and notification systems to make sure that things are overall flowing correctly. And then, as you're taking a look at all that, when you're looking at how soon and how to be notified, not everything is critical. So there are things that should be paging me on my phone, where I need to drop everything and address them. And then there are other things that I can get to in an hour, or by end of business, or, if it's after hours, the next business day. And this will in part be determined by what the source of the noise is and what the situation is, but also the criticality of the service. If you're looking at what's defined as a tier one, or in some cases a tier zero, service, where it's business critical, that's going to have a higher level of urgency almost across the board than anything that happens in a tier two or tier three, where you're still reliant on them, but you don't necessarily need to wake up at 04:00 in the morning for them.
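As a rough illustration of that flow, what needs to be known, who needs to know it, how soon, and how, here is a small sketch that routes alerts by service tier and severity. The service names, teams, tiers, and channels are made up for the example; this is not any particular product's API.

```python
# Hedged sketch of routing: tier-1 critical alerts page a human, everything
# else goes to chat or email to be read during working hours. All names here
# are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Alert:
    service: str
    severity: str  # "critical", "warning", or "info"
    summary: str

# Hypothetical service catalog: which team owns what, and how critical it is.
SERVICE_CATALOG = {
    "payments-api": {"team": "payments", "tier": 1},
    "office-wiki": {"team": "it", "tier": 3},
}

def route(alert: Alert) -> dict:
    """Decide destination and urgency for an alert."""
    meta = SERVICE_CATALOG.get(alert.service, {"team": "unowned", "tier": 3})
    if meta["tier"] == 1 and alert.severity == "critical":
        return {"team": meta["team"], "channel": "page", "urgency": "high"}
    if alert.severity in ("critical", "warning"):
        return {"team": meta["team"], "channel": "chat", "urgency": "low"}
    return {"team": meta["team"], "channel": "email", "urgency": "low"}

print(route(Alert("payments-api", "critical", "5xx rate above threshold")))
print(route(Alert("office-wiki", "warning", "disk 80% full")))
```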
You also want to make sure that you have enough redundancy in order to prevent a vacuum. So, we started off this year, and I'm sorry to pick on Slack, but just a little, we started off the beginning of this year with a massive Slack outage. And that was kind of great: to come back from the holidays and just have an artificial vacuum that everyone got to experience together. But if that was your single point of failure for certain low criticality alerts, and they weren't getting rerouted to email or somewhere else to compensate for the outage, then you are going to basically be dropping those into the ether, and no one's going to see what's going on with the chatter. And remember, the idea with low importance alerts is that you still need to know it, you just don't need to know it right away. So you don't want to lose that information either. It's important to be able to change endpoints or change services as needed. So if you're looking at Slack and you're like, oh, wow, we're really reliant on it, but this was just a bad day, and in this situation it was unusually long, then that's fine. But if you notice that you have a microservice, you need to be able to reevaluate it, because maybe it's becoming less reliable or more reliable over time, and you need to know if you need to switch into or away from it. You need enough flexibility in your design to accommodate that. And all of this goes toward building trust, because the alerting needs to be resilient in order to build trust. I need to trust that silence is actually because nothing is happening, and not because something was missed, or something's been configured incorrectly, or I've had to mute something, going back to anxious Squidward in the beginning. And if I've muted it, then I've just made that silence. It's not actually silent. You want to make sure, again: how reliable are your tools and services? Make sure that they're keeping you in alignment with whatever SLIs, SLOs, and SLAs you have, so that you can keep those SLAs with your customers or the users of your services. And you want to make sure, again: how much duplication is needed? You don't necessarily need a backup for every single thing. This ties into criticality. You want a backup or a way to pivot away from anything that's highly critical, but you don't necessarily need to invest a massive amount of time, energy, and money into things that are less critical. You just need to be aware of them. And driving home this point just one more time: can you switch endpoints in the event of an outage? Because everyone has them. They're not normally as long as what I mentioned a little bit earlier, but you want to make sure that you can accommodate those. And are you regularly evaluating the reliability of your services? And this includes internal tools. So I know I talked a lot about Slack as the big name we all recognize, but really, you might have some homegrown things. And are you keeping track of their reliability, to make sure that they're keeping in alignment with what you've promised and what you've come to expect from them? And when you're evaluating those internal services, make sure that you also have the ability to switch. You might make the executive decision at some point that, you know what, we built this homegrown solution, it's not as reliable as we would like, and we would like to devote our engineering hours to this other project or feature or whatever, and so we don't have the time to actually maintain that. So maybe we will switch to a third-party provider. What do those third-party providers offer us, and can we use them? So again, make sure that you're reevaluating reliability internally and externally, and make sure that you're keeping things as duplicated as makes sense within your business.
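As one way to sketch that "no vacuum" idea, assuming a hypothetical chat webhook and an internal SMTP relay (the URLs and addresses below are placeholders, not real configuration), a notifier can fall back to a second channel rather than silently dropping low-urgency alerts when the primary channel is down.

```python
# Hedged sketch of redundancy for low-urgency notifications: try the chat
# webhook first, fall back to email if chat is unreachable, so the alert is
# delayed rather than lost. All endpoints and addresses are placeholders.
import smtplib
from email.message import EmailMessage

import requests

CHAT_WEBHOOK = "https://chat.example.com/hooks/alerts"  # placeholder
FALLBACK_SMTP = "smtp.example.internal"                 # placeholder

def notify(text: str) -> str:
    """Send to chat first; fall back to email if chat is unreachable."""
    try:
        resp = requests.post(CHAT_WEBHOOK, json={"text": text}, timeout=5)
        resp.raise_for_status()
        return "chat"
    except requests.RequestException:
        pass  # primary channel is down; do not drop the message

    msg = EmailMessage()
    msg["Subject"] = "[alert] " + text[:60]
    msg["From"] = "alerts@example.internal"
    msg["To"] = "oncall@example.internal"
    msg.set_content(text)
    with smtplib.SMTP(FALLBACK_SMTP) as smtp:
        smtp.send_message(msg)
    return "email"
```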
And I've mentioned it a couple of times, and now I'm going to actually talk about it: sprint cleaning, or how to clean things out when you notice that they're clogged. So every time an alert triggers, ask: is it needed? Was it resolved? Can it be automated? Is there a permanent solution? And was it urgent? So when you're looking at all these things, this goes back into: do I need this notification, is it at the correct urgency, and can I delete it? And sometimes we get a little anxious around deleting things because we think, oh, I might need that later. But if you can determine that you really don't need that later, you're just setting yourself up for unnecessary noise. So go ahead and delete it. These are questions that can help you know if you can delete it. So if something's been automated away, you might, instead of alerting on the outage itself, alert on the self-healing, because you don't care if it's starting to fail, you care if it can't recover, right? Or if you built a permanent solution to whatever the problem is, then you don't want alerts off of that condition anymore, because you've changed something to make it irrelevant. And when you're doing all these things, go ahead and delete. And you want to make it a habit. I called it sprint cleaning because you want to make it relatively frequent. Now, the first time you do this, if you haven't done it to this point, more likely than not you're going to actually need to do kind of a multi-phased approach to it, so that you can actually go through and see that everything's mapping correctly, and that's going to be a bit of a project. To prevent it from being a project again, if you do it iteratively, in small pieces, just like with sprints, just like with daily stand-ups, you can actually prevent yourself from having to take on a massive project like that in the future. If you want to take a look at any of these resources, I have them available up on Notist, and thank you all for your time. Again, that link is at the bottom. My name is Quintessence and I'm a developer advocate at PagerDuty. If you have any other questions, please feel free to hit me up either on Twitter, my DMs are always open, or via email; I can just be emailed at quintessence at pagerduty. I hope you enjoy the rest of the conference and the other speakers. It's a great lineup, and have a great rest of your day.
...

Quintessence Anx

Developer Advocate @ PagerDuty



