Conf42 Incident Management 2024 - Online

- premiere 5PM GMT

What can possibly go wrong: lessons in building resilient systems at Amazon and Meta


Abstract

In an ideal world, you wouldn’t want a pager waking you, your dog, or your neighbors in the middle of the night! How could this have been handled differently? Drawing from real-life experiences, this talk shares valuable lessons to help keep those late-night pager alerts at bay.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Today's talk is: what can possibly go wrong? Lessons in building resilient systems. The motivation for this talk is that we all know the cost of maintaining software is huge, and we all know that outages suck for everyone, especially for the on-calls who wake up in the middle of the night solving SEVs (incidents), thinking, what am I doing with my life? You might be waking up your neighbors, yourself, your spouse, your dogs. It's not a pretty scene. The question I keep asking myself is: how can we prevent outages from happening in the first place? This talk will go through a few real-life scenarios and the lessons learned, and then we'll put it all together on how to build resilient software.

Let's first start with real SEVs that happened and their lessons. The first one is a flood of traffic. You get an alert that CPU is near 100 percent on your hosts, followed by an availability drop and a high client-side error rate. The background can be an unpredictable site-wide event that simply sends you more traffic. Then you see in the auto scaling panel, if you're using a cloud provider, that hosts keep getting replaced because they fail health checks, and you're wondering, what's going on? I keep adding hosts, but they keep failing and staying down. The hosts are marked down because they fail the health check: when a host is at 100 percent CPU, it cannot respond to pings from the auto scaling service, so the auto scaling console thinks your host is dead, and you really can't spin any hosts back up.

What do you do then? You might realize that you need to drop traffic for the hosts to come back up: when a host doesn't get traffic, it doesn't go down from 100 percent CPU. So you can drop traffic, and you can set a max connection limit on the hosts, so that instead of accepting all these connections your host cannot handle, it only accepts the number of connections it can actually serve. The lesson here, for the long term, is to load test the maximum connections a host can possibly handle, so you don't run into the situation where a host gets too much traffic, hits 100 percent CPU, and you cannot spin up replacement hosts because they keep going down from the same traffic. It's a really bad loop to be stuck in.

The next one is a retry storm. Here you might get an alert like client availability below X percent, and when you look a little closer, you see traffic going up very high. The crux of the problem is that a lot of services retry when they see an error, and if each microservice retries, the number of calls to the database multiplies. Take the example on the right: the site calls service 1 once, service 1 calls service 2, service 2 calls the database, and the database says it's not available. Service 2 retries two times, and it's still not available, so it returns an error to service 1. Service 1 says, it's not available, let me retry two times, so it goes back to service 2, and for each of those requests service 2 retries two times again, so the total number of retries becomes six. Just imagine more services in between, each with its own retries: the calls multiply and multiply, and the service at the bottom layer gets overwhelmed and cannot handle the traffic at all.
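To see how quickly this compounds, here is a small illustration (a sketch, not from the talk): if every layer makes the original call plus R retries, each layer turns one incoming request into R + 1 attempts against the layer below it, so with D calling layers the bottom service sees (R + 1)^D calls per user request.

```kotlin
// Retry amplification: each layer turns one incoming request into
// (retriesPerLayer + 1) attempts against the layer below it.
fun callsToBottomLayer(layers: Int, retriesPerLayer: Int): Long {
    var calls = 1L
    repeat(layers) { calls *= (retriesPerLayer + 1) }
    return calls
}

fun main() {
    // The talk's example: 2 layers (service 1 and service 2), 2 retries each.
    // The database ends up receiving 9 calls instead of 1.
    println(callsToBottomLayer(layers = 2, retriesPerLayer = 2)) // 9
    // Five layers with 2 retries each: 243 calls to the bottom layer per request.
    println(callsToBottomLayer(layers = 5, retriesPerLayer = 2)) // 243
}
```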
The mitigation here, which happened during a big shopping event, was to drop a majority of the traffic to get things stable, and later to add retry configuration changes and more hosts. The lesson is to standardize your retry logic, such as exponential backoff, respect the Too Many Requests HTTP code (429), and add circuit breakers for dependencies. A circuit breaker is a way of saying: hey, this dependency is returning too many errors, I'll stop calling it, because I don't want to make matters even worse for it. It's about being a good citizen and not causing a retry storm for the services downstream.

This one is about a plan B that went poorly. An alert you might get is dependency availability below X percent. Sometimes there is more than one dependency that can satisfy what we need, so, in the spirit of circuit breakers, if dependency A is in trouble, we send the traffic to dependency B instead. But dependency B is usually serving its own traffic as well, so dependency B may immediately go down because it's unable to handle this newly redirected traffic, causing problems for the services that use dependency B. That makes matters worse: not only is your service not working, you've also made other services fail. The mitigation here is to really focus on fixing dependency A and whatever issue is in there. The lesson for the long term is to not introduce a backup dependency unless it has really been prepared well enough, for example as a lever for high-velocity events. In day-to-day operation, focus on the plan A that will get your service up and running.

The next one is a very common one: a bad commit. An alert you might get is server availability dropped by 100 percent. A bad commit that makes it out to production is not really the fault of the commit owner, but rather of the lack of a proper process to prevent it from happening in the first place. Think about the lines of defense: you have code review; you have testing that covers all aspects of the flow, such as unit testing, end-to-end testing, manual testing, QA testing, and the list goes on, and the testing did not catch any of this; you have the gating that people typically add so you can flip a feature on or off and monitor the release; you have change management and risk assessments; and you have pipelines that put all these changes together. So let me ask you this: if you have all of this and still cannot catch the bad commit, is it really the fault of the person writing the commit, especially if it hasn't happened before? No, right? Every time there's a bad commit, it exposes something that's missing. Did this commit somehow bypass a code review that should have caught it? That can happen. Did it not have proper testing, or did it not follow the guardrails for pushing those changes out, and somehow we missed all of this? Every bad commit actually carries a positive lesson: it makes you reflect on what was missing in the release process and how we can catch it and prevent it from happening in the future. So in many ways, a bad commit is actually a good way to improve the resiliency of your services.
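Tying together the retry and circuit breaker lessons above, here is a minimal sketch, not from the talk, of exponential backoff with jitter combined with a very simple circuit breaker. The call() lambda stands in for whatever dependency call your service makes; in a real JVM service you would more likely reach for an existing library such as Resilience4j than hand-roll this.

```kotlin
import kotlin.random.Random

// A very simple circuit breaker: after failureThreshold consecutive failures,
// stop calling the dependency for openMillis so we don't pile more load on it.
class CircuitBreaker(
    private val failureThreshold: Int = 5,
    private val openMillis: Long = 30_000,
) {
    private var consecutiveFailures = 0
    private var openedAt = 0L

    fun allowRequest(): Boolean =
        consecutiveFailures < failureThreshold ||
            System.currentTimeMillis() - openedAt > openMillis

    fun recordSuccess() { consecutiveFailures = 0 }

    fun recordFailure() {
        consecutiveFailures++
        if (consecutiveFailures == failureThreshold) openedAt = System.currentTimeMillis()
    }
}

// Exponential backoff with jitter: wait roughly 100ms, 200ms, 400ms between
// attempts, keep the retry budget small, and fail fast when the breaker is open.
fun <T> callWithRetry(
    breaker: CircuitBreaker,
    maxAttempts: Int = 3,
    baseDelayMillis: Long = 100,
    call: () -> T,
): T {
    var lastError: Exception? = null
    repeat(maxAttempts) { attempt ->
        if (!breaker.allowRequest()) error("Circuit open, failing fast")
        try {
            val result = call()
            breaker.recordSuccess()
            return result
        } catch (e: Exception) {
            breaker.recordFailure()
            lastError = e
            val backoff = baseDelayMillis shl attempt        // 100, 200, 400...
            Thread.sleep(backoff + Random.nextLong(backoff)) // add jitter
        }
    }
    throw lastError ?: error("Retries exhausted")
}
```

Usage would look like callWithRetry(breaker) { callDependency() }, where callDependency is a placeholder for your actual client call; a fuller version would also check for a 429 Too Many Requests response and back off accordingly.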
Now, the next one is lack of sufficient ownership. This one can sometimes be comical. You might receive an escalation from people asking you: hey, users reported they are unable to use this feature. And you're thinking, what is this feature? I've never heard about it. Then, after some back and forth, it turns out your team actually owns this change and feature, and you have to fix it and also be accountable for the SEV's impact. This can happen when there are team changes and some code is left without sufficient ownership, and it can lead to very bad situations. You want to avoid that: make sure there is a process for transferring code ownership, and make sure there are sufficient annotations of code ownership, whether in code, in tooling, or in a wiki. All of that can help you avoid this from happening.

Scripts are also notorious; some of the very worst SEVs happen because people run scripts without testing them sufficiently. You might get an alert saying the system is down at 0 percent availability, you're wondering what's going on, and it turns out to be a script referenced in the runbook that was not tested thoroughly. The lesson here is to treat production-impacting scripts and changes with the same rigor as production commits: sufficient testing, QA, and proper code reviews. Remember, production changes are all risky, and if you don't test your script properly, something bad can happen, especially if it's in the runbook, where people are very likely to execute it. So make sure the script, or any production-impacting change, is tested properly.

Okay, now let's talk about the next topic, which is prevention. Let's put everything together. The first thing is defensive coding practices; a lot of these SEVs could actually have been prevented with them. The first practice is to gate your changes. In many companies, people typically put their changes behind some gating, so you can use server-side knobs to increase the allocation over time. This is a great way to roll back quickly and to roll out gradually, so you impact fewer users and the blast radius stays relatively small. Also remember to log exceptions and corner cases, and to include critical information in your logging; if the critical information isn't there, sometimes people have to make a code change right away and wait for a deployment to finish, and you don't want that. Do not overly rely on backup options: like the example I mentioned earlier, it's better to fail fast and focus on plan A. I've seen incidents many times that happened just because there was no proper error handling; this can be null pointer exceptions, which happen a lot in Java, so make sure you follow null-safe coding practices. If you're using a more null-safe language like Kotlin, you wouldn't have this problem, and a lot of modern programming languages have this nowadays. Remember to set a timeout: if your service calls a dependency without a timeout, and the dependency doesn't have a timeout either, you get a situation where everything hangs, and you want to avoid that. The last thing I want to remind you of is good retry practice: use exponential backoff, and retry very sparingly to avoid a retry storm.
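To make the gating point above concrete, here is a minimal sketch, assuming a hypothetical percentage knob and feature name, of what a server-side feature gate with gradual rollout might look like; in practice most companies use an existing feature-flag or configuration system rather than writing their own.

```kotlin
// Percentage-based gate: a server-side knob controls what fraction of users
// see the new code path, so rollout is gradual and rollback is just turning
// the knob back to zero, with no deployment needed.
class FeatureGate(private val featureName: String) {
    // In a real system this value would be read from a config service so it
    // can change at runtime; it is a plain volatile field here for illustration.
    @Volatile var rolloutPercent: Int = 0

    fun isEnabledFor(userId: Long): Boolean {
        // Hash the user id into a stable bucket 0..99 so the same user keeps
        // the same experience as the percentage ramps up.
        val bucket = ((userId.hashCode() xor featureName.hashCode()) and Int.MAX_VALUE) % 100
        return bucket < rolloutPercent
    }
}

fun main() {
    val gate = FeatureGate("new_checkout_flow") // hypothetical feature name
    gate.rolloutPercent = 5                     // ramp to 5 percent of users
    if (gate.isEnabledFor(userId = 42L)) {
        println("serve the new code path")
    } else {
        println("serve the old, known-good code path")
    }
}
```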
The next section is detection, which has three parts. The first part is alerts. For alerts, having an operational goal is very important. Why are you setting this alert? Is it about availability, where you want to meet, say, 99.99 percent availability, or higher or lower? Is it about the latency your service has committed to, the service level agreement (SLA) you have? Based on these operational goals you can set reasonable thresholds, and you want to find the balance between noisy alerts and missed alerts: a very strict threshold gives you very noisy alerts, a relaxed one makes you miss alerts. You can also use historical data to inform the alert threshold: what was it before, what do we generally see, what's a reasonable threshold so we don't get too many alerts or miss real problems?

Alerts should have a runbook that covers the common steps of what to do and the lessons from past scenarios. When you get an alert on call, you should not be figuring out what to do; you should have a plan, a concrete runbook with links and steps to follow. You should not be thinking too much: a 3am incident shouldn't require much thinking, and you're not really capable of it at 3am anyway.

Dashboards offer a different benefit and, in my opinion, complement alerts a lot: they inform people about general trends. They can show seasonal patterns and help you debug and find correlations with what's happening. Your dashboards should have clear names, time periods, and useful references or markers, such as alert thresholds or week-over-week changes.

As part of your operational goal, you should also know your system's limitations. If you don't know your system's limits, you're just waiting for a SEV to happen. You want to know the maximum you can support: if you're running online shopping, how many orders you can support; if you're running a social media company, how many concurrent users you can support. You can find these limits by doing load testing, and you can prepare yourself for dependency failures and other types of failures by doing chaos testing.

Now, mitigation. When you get an alert, mitigation is like saving a life in the emergency room: time is of the essence, and you're running against the clock. Follow the runbook; don't wander into random brainstorming about what might be happening. Follow the runbook, and look at your alerts and dashboards. A few questions to ask: is it happening at a specific time? Does it line up with a specific commit, a version, or a gating change? Are there exception logs or metrics that pinpoint a specific failure? Are there related issues identified by other teams at the same time? Is it a company-wide event? Do you need to escalate?

There are also times when you need to prepare for high-velocity events. High-velocity events are dangerous times, something you want to prepare for in advance. This can be an anticipated large increase in traffic, say Prime Day, or a major feature release that sends a lot of traffic to your service. In those cases you should have a risk mitigation plan based on factors such as traffic estimations. Sometimes the risk goes beyond operations: it can be about integrity, media risk, PR risk. In any case, you need to set up dashboards and alerts specifically for the event, and pay very close attention to the critical metrics to monitor and the deltas the event will bring.
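As one way to tie an alert to the operational goal discussed above, here is a small sketch, not from the talk, of a sliding-window availability check. In a real service this would normally live in a monitoring system fed by emitted metrics (Prometheus, CloudWatch, and so on) rather than in application code, and the threshold and window would be tuned using historical data.

```kotlin
// Sliding-window availability alert: compare the success rate of the last N
// requests against a threshold derived from the availability goal.
class AvailabilityMonitor(
    private val alertThreshold: Double = 0.9999, // e.g. a 99.99% availability goal
    private val windowSize: Int = 10_000,        // last N requests
) {
    private val outcomes = ArrayDeque<Boolean>()
    private var successes = 0

    fun record(success: Boolean) {
        outcomes.addLast(success)
        if (success) successes++
        if (outcomes.size > windowSize) {
            if (outcomes.removeFirst()) successes--
        }
    }

    // Fire only once the window is full, to avoid noisy alerts on sparse data;
    // tune the threshold and window size to balance noisy alerts against missed ones.
    fun shouldAlert(): Boolean =
        outcomes.size >= windowSize &&
            successes.toDouble() / outcomes.size < alertThreshold
}
```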
You should also have levers for the worst-case scenarios. A lever can be: hey, if traffic is two times my expectation, what do I do in that case? How am I able to accommodate all that traffic? This is also a great time to audit your services: rerun your load tests and chaos tests, reassess your levers, think about what can go wrong, and do a check on your service health.

Now, when the SEV is finished, there is the SEV review, which in some companies is called by other names, but the essence is that you review the SEV as the person who was on call. The SEV review is intended to be a learning experience; it's supposed to be blameless. After the SEV, you want to write a report that includes the timeline of the events: when did it happen, when did we detect it, when did we start root-causing and mitigating it, so we can look at those timelines and find out how to improve them, such as a quicker time to detect and a quicker time to root-cause and mitigate the SEV.

For the root cause, one useful exercise is to keep asking why. For example: your service is down. Why is your service down? Because there's a memory leak on your hosts. Why is there a memory leak? Because a bad commit introduced a memory leak into production. Why did this memory leak end up in production? Because we did not have proper load testing to catch memory leak issues. Why did we not have that load test? And that leads to your action item: add the load test and the proper guidelines, and prevent it from happening in the future.

In general, you want to focus on the next steps and action items, making sure those action items are tracked and have a timeline for when they will be finished, because if there are no action items, all this learning is for nothing; you need to find ways to prevent it from happening again. Only in very rare cases is it an act of God where there are no possible action items to take. At the end of the day, the SEV review is really a place to foster continuous improvement, to make sure people are aware, so we can build resilient systems and make our on-call lives better.

That's the end of the talk. I want to thank you for attending, and hopefully there are some takeaways you get out of it, and hopefully you have fewer SEVs the next time you go on call, following some of the lessons we discussed in this session. Thank you.
...

Zuodong Xiang


