Conf42 Site Reliability Engineering 2022 - Online

LMAO Helps During Outages


Abstract

This short, fast-paced talk is aimed at those who might be on pager duty. Richard will cover the four things that will help you survive any outage nightmare. Because when your company has an outage, there's no need to sweat; just remember to LMAO and you'll get through it.

Summary

  • Richard Lewis is a senior DevOps consultant with 3Cloud. LMAO is an acronym for Logs, Metrics, Alerts, and an Observability tool. Managing your alerts effectively requires certain things.
  • Having a playbook is one of those critical and key things, as is practicing for outages. The value of a playbook comes down to what you actually put in it. The practice of chaos engineering is still being worked out.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to my talk, LMAO Helps During Outages. I'm Richard Lewis, a senior DevOps consultant with 3Cloud, and I've had the pleasure of working with both software development teams and operations teams. I currently have over 20 years of experience working in the industry. I'm also the co-organizer of a Chicago enthusiast user group, and as you can tell from everything around me, I'm a diehard White Sox fan. A little bit about my company: 3Cloud is the largest pure-play Azure services partner in the world. We have about 600 people who are dedicated Azure professionals focused on data and analytics, app innovation, and helping clients build a modern cloud platform.

So LMAO, as you can probably guess by now, is not just laughing your way through the outage; it's actually an acronym for something else. It's an acronym for Logs, Metrics, Alerts, and an Observability tool, and these are the things you need to have to make up an LMAO strategy. So what am I actually talking about today? I'm talking about providing platform support strategies for your team members, creating a standard for knowledge sharing, helping you reduce your mean time to resolution, and building the psychological safety of your team members so that they're able to LMAO their way through outages.

If you're not familiar with logs and metrics, those are the things that are going to give you insight into what is happening and when. They're going to help you figure out how many errors occurred, how many requests you got, and the duration of those errors and requests. Now as for alerts, they come in two main forms. Pages are critical things like, hey, our whole network is down, or hey, our website's offline. And tickets would be things like, hey, the hard drive that runs our website is at 80%, or 85%, or 90% full, or we're seeing slowness in our ecommerce website. Not an outage, just some general slowness, just some pain.

And like most of you, I've experienced alert trauma. This is me at my first IT job, circa 2011, and I was supporting an Access 97 application. I was the only one doing this. I was on an on-call rotation 365 days a year, 16 hours a day, because it was also a call center and they worked two shifts. I had no way to connect to the office remotely. Luckily I lived within 10 miles of the office, so whenever there was an outage I would get paged, using the actual pager right here that you see on the left side of your screen, and I would respond to it. My company had no logging framework, no concept of a logging framework. And within six months, you guessed it, I burned out. But I didn't quit. I let my boss know what was going on, my thoughts and opinions. My boss was very receptive to those thoughts and opinions, and we worked on putting together an LMAO strategy. We tackled the logging framework issue first; we used log4net. We tackled the on-call rotation issue by hiring additional people and spreading that load out, and we created a schedule for when we were going to do these things.
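To make the earlier point about logs and metrics concrete, the questions "how many errors, how many requests, and how long did they take", here is a minimal Python sketch of basic logging plus hand-rolled counters. It is not from the talk; the service name, endpoint, and metric names are illustrative assumptions, and a real system would use a logging framework such as log4net and a proper metrics library instead.

```python
import logging
import time
from collections import Counter

# Minimal sketch (not from the talk): basic logging plus a few hand-rolled
# metrics so you can answer "how many errors, how many requests, and how
# long did they take?" during an outage.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("checkout")  # hypothetical service name

request_count = Counter()  # metric: requests per endpoint
error_count = Counter()    # metric: errors per endpoint
durations_ms = []          # metric: request durations in milliseconds

def handle_request(endpoint: str) -> None:
    start = time.perf_counter()
    request_count[endpoint] += 1
    try:
        # ... the real request handling would go here ...
        log.info("handled request endpoint=%s", endpoint)
    except Exception:
        error_count[endpoint] += 1
        log.exception("request failed endpoint=%s", endpoint)
    finally:
        durations_ms.append((time.perf_counter() - start) * 1000)

if __name__ == "__main__":
    handle_request("/cart")  # hypothetical endpoint
    print(dict(request_count), dict(error_count), durations_ms)
```

In practice you would ship the log lines to a central store and scrape the counters into a metrics backend; the point is simply that counts and durations are what let you answer "what happened and when" later.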
So managing your alerts effectively requires certain things. As I just said, scheduling your team members appropriately is an important part of it. You can't have a person on call 24 hours a day, 365 days a year. It's just not effective, and they're going to leave. Then they take their tribal knowledge with them, and now you're stuck in that position all over again.

Avoid alert fatigue wherever possible. If someone doesn't need to be woken up or paged about something, let's not page them. Collect data on the alerts that are actually going out and look into ways to reduce those alerts; those are continual improvement opportunities. If the alerts going out are because the servers have hit high CPU, look at things like autoscaling. Or if it's an ecommerce website and the traffic has suddenly gone through the roof, say every day between certain hours the traffic increases, then maybe look at autoscaling there as well. Or if it's a hard drive that, as I said earlier, has hit a certain capacity threshold, then look at some kind of runbook that could automatically heal those kinds of things. And be cautious when you're introducing new alerts. Every alert has a purpose, and that's to, that's right, alert you. So be very careful about the alerts you're introducing and the impact they're going to have on your team.
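As a hedged sketch of the "runbook that automatically heals a full hard drive" idea, the following Python shows the shape such automation might take. The log directory and retention count are assumptions, the thresholds echo the 80% and 90% figures from the talk, and the paging step is just a print standing in for a real pager integration.

```python
import shutil
from pathlib import Path

# Sketch of an auto-remediation runbook for the "hard drive is 80/85/90%
# full" ticket described in the talk. The directory and thresholds are
# hypothetical; adapt them to your environment.
LOG_DIR = Path("/var/log/myapp")   # hypothetical log directory
TICKET_THRESHOLD = 0.80            # try to self-heal and open a ticket
PAGE_THRESHOLD = 0.90              # wake a human up

def disk_usage_fraction(path: Path) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def prune_old_logs(directory: Path, keep: int = 10) -> None:
    # Delete all but the newest `keep` rotated log files.
    rotated = sorted(directory.glob("*.log.*"), key=lambda p: p.stat().st_mtime)
    for old in rotated[:-keep]:
        old.unlink()

def run() -> None:
    used = disk_usage_fraction(LOG_DIR)
    if used >= PAGE_THRESHOLD:
        print(f"PAGE the on-call: disk at {used:.0%}")   # stand-in for a real page
    elif used >= TICKET_THRESHOLD:
        prune_old_logs(LOG_DIR)
        print(f"TICKET: pruned old logs, disk was at {used:.0%}")

if __name__ == "__main__":
    run()
```

The same pattern, checking a signal against a threshold and taking a safe automatic action before paging anyone, applies to the autoscaling examples as well.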
Observability, on the other hand, is a little bit different. Observability is more of the voyeurism of it: you get to see what's actually going on on a nice pretty screen or dashboard. It's going to help you understand your KPIs and SLAs, and you can monitor your usage of different systems the same way. Dashboards definitely help spot trends; that's where you see products like Power BI and Tableau. Observability dashboards do the same thing. And when it comes to observability tools to help you with your dashboards, there are hundreds of products out there. I actually took this screenshot from the Cloud Native Computing Foundation's website. Under their monitoring section they highlight a ton of different products. This is not exclusively all the products out there, but it is a wide range of the different types of products you can use for observability. I'm sure there are at least one or two products on this screen, if not more, that you're probably using in your current workplace. Looking at this screen, I counted at least five or six, okay, more like nine products that I regularly recommend to clients for their different situations. I don't think there's one size that fits all. I think you want to use the appropriate tool for the appropriate cost at the appropriate time. They all fluctuate on costs and trade-offs.

This is a dashboard from New Relic. In the left corner you're seeing synthetic failures and violations of policies that they have in place. I believe it's just a coincidence that you have 13 policy violations and 13 synthetic failures; it could be that every time the synthetic failure policy is violated it registers as a new violation, so the numbers correlate at that moment. The top panel is the policy violations, and the one below it is the synthetic failures. Synthetic failures are usually generated by some third-party tool monitoring or testing your system. That would be something like an alive-or-dead call to see whether your service is up or down, or a check of how long your website takes to load, using some kind of framework like Selenium. The next thing we see is the errors occurring per minute, so now we know how many failures have happened and the rate of failures we're seeing per minute. And this makes me wonder: what did we change recently in our system that caused this problem? That's where this comes in. These are the deployment notes, and luckily they're writing good release notes, I'm assuming. That's how you know what's in the most recent release and what could have an impact on what's going on here. I doubt that a readme change for the endpoints is causing the problem, but it could be.

Here is another dashboard, this one made by a company called Grafana. I really like this dashboard. It's a good example of being able to embed a wiki and documentation directly into the dashboard, so that whoever is on call can just pull up the dashboard related to that application and click a link to get from here to wherever they need to go: to a single source of truth, or to details about who to contact, or details related to a third-party service, or something like that. There's also a diagram in here showing how you can load images of what that service actually looks like, like the service's path and architecture diagram. We're also highlighting the ups and downs of the service, so now we know how often the service goes up or down, followed by the status codes this service is sending out. So we see a combination of 500 errors and 200 responses.
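Here is a hedged sketch of the kind of "is it alive, and how long does it take" synthetic check described above. A real setup would typically use Selenium or a vendor's synthetics product, as the talk says; the URL, timeout, and latency budget below are illustrative assumptions.

```python
import time
import requests

# Simple synthetic check: is the service up, and how long did it take to
# answer? The endpoint and thresholds are hypothetical examples.
URL = "https://example.com/health"
LATENCY_BUDGET_S = 2.0

def synthetic_check(url: str) -> tuple[bool, float]:
    start = time.perf_counter()
    try:
        resp = requests.get(url, timeout=10)
        return resp.status_code == 200, time.perf_counter() - start
    except requests.RequestException:
        return False, time.perf_counter() - start

if __name__ == "__main__":
    ok, elapsed = synthetic_check(URL)
    if not ok or elapsed > LATENCY_BUDGET_S:
        print(f"synthetic failure: ok={ok}, elapsed={elapsed:.2f}s")  # would feed the dashboard and alerting
    else:
        print(f"synthetic pass in {elapsed:.2f}s")
```

Run on a schedule from outside your own network, a check like this is what produces the "synthetic failures" panel on the New Relic dashboard discussed above.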
So the next thing that's really important to think about is preparing your team for outages. Having a playbook is one of those critical and key things, and so is practicing for those outages. And the value of a playbook comes down to what you actually put in it. As I noted here, it's important to keep it in a location where it can be accessed quickly. Sometimes you may want to keep your playbooks internally, and I'm 50/50 on those situations. If your internal system has to be accessed through a third-party system that may itself be having an outage, then you're going to delay getting to your playbook. So if you're using something like Azure AD to authenticate into your company network, and there's a problem with Azure AD, then it's going to delay you getting to your playbook. I like other systems as well, still tied together using SSO, single sign-on, such as Atlassian's Confluence, a third-party wiki system, SharePoint, or Microsoft Teams, which has a wiki built into it if you're a user of that: somewhere you can keep the playbook outside of your network but still accessible, which lessens the likelihood of having an issue there.

Inside those playbooks you want to put things like links to your application in the observability tools you're using, as well as details about the golden signals of that application. That way, when a person is looking at what's going on and hearing from the users, they're able to say, this is within the normal range, this is not within the normal range. Any relevant notes or information from previous outages helps you tie things back together, so if you're doing a post mortem after the outage, putting a link to the post mortem notes is quite helpful. Include contacts for the application owner and for any third-party services as well. Say your applications run on something like Azure or AWS: then include links to the premier support contact information, so that whoever's on call knows how to get hold of them to escalate and get the right people on the call, and doesn't have to call a manager and ask, who do I call about this? Or links to things like Stripe's website, if you have a payment service that may be causing a failure. And anything else you may think of; there's a ton of things you may want to put in your playbook around that application. I do suggest, though, dedicating a playbook per application as opposed to doing just a single playbook with hundreds of things in it. You can put all of your playbooks together in the same system, but you want to have them broken out by section at least.

As for preparation and training, I mentioned before how important this is. I come from the Midwest of the United States, where we have a lot of tornadoes. So as kids we were trained, like you see on your screen here, when we hear that siren, to go into the hallway, put our hands over our heads, and curl up into a little ball in case a tornado came through the area. We trained for it, and we knew what to do as muscle memory. The practice of chaos engineering is something still being worked out. It's been around for a while; the concept was created by Netflix. Increasing resiliency is really the goal, and you're able to identify and address single points of failure early. What you're doing is running controlled experiments against your system and predicting the possible outcome. That outcome could actually happen, or it could not, and that's where the chaos comes into play: you don't really know if it's going to happen or not. But the goal, in the end, is to identify your failure points and address them, so that if something does happen at those failure points, you'll be able to sustain it. There's a great article about how Netflix practices their chaos engineering; I put a link below for you if you want to take a look at it.

After those outages, though, you want to take the time to do a post mortem, usually within a day or so of the outage, while it's still fresh in everyone's mind. You want to get everyone together around the conference table and just talk through what went right, what went wrong, and where you got lucky, and figure out what needs to be documented in preparation for a possible future outage.
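To make the "controlled experiment" shape of chaos engineering concrete, here is a hedged Python sketch: state a steady-state hypothesis, inject a failure, observe, then revert. The endpoint is hypothetical, and the injection and revert steps are deliberate placeholders; in practice a tool like Gremlin or Microsoft's chaos tooling would perform the real fault injection.

```python
import time
import requests

TARGET = "https://example.com/health"   # hypothetical endpoint

def steady_state_ok() -> bool:
    # Hypothesis: the service answers 200 within two seconds.
    try:
        return requests.get(TARGET, timeout=2).status_code == 200
    except requests.RequestException:
        return False

def inject_failure() -> None:
    # Placeholder: a real experiment might terminate an instance, add
    # latency, or block a dependency via your chaos tool of choice.
    print("injecting failure (placeholder)")

def revert_failure() -> None:
    print("reverting failure (placeholder)")

def run_experiment() -> None:
    assert steady_state_ok(), "steady state not met; do not start the experiment"
    inject_failure()
    try:
        time.sleep(30)  # observation window
        print("system survived" if steady_state_ok() else "found a weakness to fix")
    finally:
        revert_failure()

if __name__ == "__main__":
    run_experiment()
```

Whatever the outcome, the findings feed straight into the post mortem and the playbook, which is the point about practicing before the real outage.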
I really like this quote here from Devon with Google: the cost of failure is education. So, talking about my first on-call rotation job, one of the things I was required to do regularly was ride-alongs with technicians for appliances. When we would go out to customers' houses, the customer would have something broken that they may have tried to fix themselves and gotten wrong: not fully following the instructions they got from the manufacturer, or not fully listening to a YouTube video they were following and missing some key details. We would be able to resolve those issues within a matter of minutes, and the cost of education in that case was our service call fee. The customer would learn something new, they would learn how to fix that problem in the future, and at the same time it gave them the ability to get their system back up online really quickly. So the cost of failure is education. It's a good quote.

My takeaways from my talk today are pretty simple. Have an LMAO strategy in place; have it documented and ready to go, and make sure everyone knows where it's at. Update those documents regularly: after your outages, go back and update them, and keep a revision date on those documents so that you know when they were last updated. If they haven't been updated in six months, either you're not having outages around that system or you're not documenting what's happening with that system, so good or bad there. Avoid alert fatigue. The less alert fatigue you have, the more psychological safety you're building into your people. The more comfortable they are, when they know where their documents are and are able to move forward from there, the less likely they are to want to leave your organization and take that tribal knowledge with them; it'll cut down on turnover. And run readiness preparation drills regularly. Chaos engineering, again, is a newer thing, but there are a lot of tools out there that can help you. Gremlin makes some great products, with great documentation out there from them. Microsoft has a great chaos engineering product as well, with great documentation to help you think about ways to do these things. Thank you so much for listening to me today, thank you for your time, and enjoy the rest of the conference.
...

Richard Lewis

Senior DevOps Consultant @ 3Cloud



