Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. I am Tanveer Gill, co-founder and CTO of
FluxNinja. I've spent the better part of the last decade working with SRE
and DevOps practitioners. I co-founded two companies in this space, the last
one in the observability domain, where I've gained a deep understanding
of the challenges and problems faced by practitioners like yourself.
Today, I'm eager to share my insights and knowledge with you in this presentation
on graceful degradation: keeping the lights on when everything goes
wrong. As operators and practitioners,
we know all too well that despite our best efforts to design robust
microservices, failures are an inevitable reality. Whether it's
due to bugs introduced through high velocity development or
unexpected traffic spikes, the complex interdependencies of microservices
can lead to cascading failures that can take down entire systems.
In this presentation, I'll be sharing practical techniques for implementing
graceful degradation through prioritized load shedding. By prioritizing
which workloads or users receive resources during a degraded state,
we can ensure that critical user experiences are preserved and
services remain healthy and responsive. I'll help you form an intuition
about load management by building up from basic principles of queuing theory
and Little's Law. These concepts are universally applicable to any
system that serves requests, making them valuable tools in
your arsenal. So join me as we explore how to keep the lights
on even in the face of unexpected failures. If you have any questions
during or after the presentation, please feel free to reach out to me
either over LinkedIn or Twitter. I've shared my handles here.
Let's go over the agenda for this presentation. In the first part,
we will discuss the consequences of poor load management in microservices.
This will involve exploring how failures in one part of the system can impact others.
Given the interdependent nature of microservices, we will see how a
lack of effective load management can lead to cascading failures and
even a complete system outage. In the second part, we will examine the limitations
of autoscaling as a solution for managing load in microservices.
We will use the case study of Pokemon Go's migration to Google Cloud Load Balancer (GCLB)
to understand the limitations of autoscaling and how it can impact the overall performance
of a system. The goal of this discussion is to highlight that auto
scaling is not a complete solution on its own, but rather a piece in
the larger load management puzzle. In the third part, we will discuss the
benefits of using concurrency limits in managing load in microservices,
but we will also highlight the challenges in implementing concurrency limits
in a constantly changing microservices environment. In the last part,
we will introduce you to Aperture, which addresses these challenges by providing
a dynamic and adaptive concurrency limit system. Let's get started.
Let's take a look at what happens when a service becomes overwhelmed.
The top of the diagram depicts a healthy service under normal load with a
steady response time. However, when the service becomes overloaded, requests start to back
up and response times skyrocket, eventually leading to timeouts.
This is depicted in the lower part of the diagram. There are several reasons why
a service may become overwhelmed, including unexpected traffic spikes during
new product launches or sales promotions. Or there could be service
upgrades that introduce performance regressions due to bugs, or there could just
be slowdowns in upstream services or third party dependencies. Through load
testing, we can determine that a service's latency increases under heavy load as various
factors such as thread contention, context switching, garbage collection,
or I/O contention become bottlenecks. These factors lead to a limit
on the number of requests a service can process in parallel, and this limit
is called the concurrency limit of the service. But no matter how complex the inner
workings of a service might be, it can still be modeled through Little's Law,
which states that L, the number of requests in flight, is
equal to lambda, the average throughput, multiplied by W, the
average response time. Let's apply Little's Law to a microservice.
As we discussed, L, the maximum number of requests in progress, is
capped due to the nature of the underlying resources, and W,
the response time, is predetermined by the nature of the workload.
Thus, the maximum average throughput lambda
can also be inferred from these two parameters.
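As a quick worked example (the numbers here are hypothetical, just to make the relationship concrete):

```python
# Little's Law: L = lambda * W
#   L      -> requests in flight (capped by the service's concurrency limit)
#   lambda -> average throughput (requests per second)
#   W      -> average response time (seconds)

concurrency_limit = 10   # hypothetical: at most 10 requests processed in parallel
response_time_s = 0.05   # hypothetical: 50 ms average response time

# Rearranging L = lambda * W gives the maximum sustainable throughput:
max_throughput = concurrency_limit / response_time_s
print(max_throughput)  # 200.0 requests per second
```

Beyond 200 requests per second in this example, excess requests can only queue up in front of the service.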
The service cannot handle any throughput beyond lambda, and any excess requests
must queue up in front of the service. Therefore, there is an inflection
point whenever the number of requests in flight exceeds the concurrency
limit of the service. Beyond this point, any excess requests begin
to queue up, leading to an increase in response time latency. The following
chart helps to illustrate the impact of a service becoming overwhelmed. The x-axis
plots the incoming throughput of requests, while the left y-axis
shows median latency and the right y-axis shows availability, represented
as a proportion of requests served within the timeout. As shown in the left portion of
the graph, as long as the number of requests in flight stays within the concurrency
limit, latency remains normal. Once the concurrency limit
is reached, any increase in throughput contributes to an increase in latency. As a
queue begins to build up, the availability line measures the number of requests served within
the timeout limit. Once median latency becomes equal to the timeout,
availability drops to 50%, as half of the requests are now timing
out. As throughput continues to increase, availability rapidly drops to zero.
This chart clearly shows the importance of managing load in a service to avoid latency
spikes and ensure that requests are being served within the desired time frame.
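A toy discrete-time simulation (entirely made-up numbers, not the talk's actual chart data) reproduces this shape: latency stays flat while demand is within capacity, then queueing pushes it past the timeout and availability collapses:

```python
def simulate(arrival_rate, capacity=100, timeout_s=1.0,
             base_latency_s=0.1, seconds=30):
    """Toy model of a service that completes `capacity` requests per second.

    Returns (latency, availability) after `seconds` of steady load. In this
    deterministic sketch every request sees the same latency, so availability
    collapses from 1.0 to 0.0 once latency crosses the timeout.
    """
    backlog = 0.0
    for _ in range(seconds):
        backlog += arrival_rate            # requests arriving this second
        backlog -= min(backlog, capacity)  # requests served this second
    # Waiting time grows with the backlog queued in front of the service.
    latency = base_latency_s + backlog / capacity
    availability = 1.0 if latency <= timeout_s else 0.0
    return latency, availability

for rate in (80, 100, 120):
    print(rate, simulate(rate))
```

At 80 and 100 requests per second the backlog never grows; at 120, the backlog grows every second and latency quickly blows past the timeout.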
In microservice architectures, the complex web of dependencies means
that any failure can quickly escalate into a wider outage due to cascading failures.
For example, when service a fails, it can also impact service b,
which depends on a. Additionally, failures can spread laterally within a service.
This can happen when a subset of instances fail, leading to an
increased load on the remaining instances. This sudden increase in load can cause these instances
to overload and fail, leading to a domino effect that causes a wider
outage. It's important to remember that we cannot make assumptions about the reliability
of any part of the system. To keep the system as a whole healthy, it's crucial
to implement measures that allow each subpart to degrade gracefully.
By doing so, we can minimize the chance of cascading failures
and keep our services up and running even during localized failures.
Now that we have a clear understanding of the consequences of poor load management in
microservices, let's see if autoscaling could offer a solution.
We'll examine the limitations of autoscaling and see why it's not sufficient as
a standalone load management strategy. Autoscaling is a popular solution
for managing load in microservices. While it's great for addressing persistent
changes in demand and optimizing cloud compute costs, it's not without its
limitations. One of the main limitations is that autoscaling can be slow to
respond, especially for services that need some time to become available. This means that
it may take a while for autoscaling to actually respond to a change,
which can result in increased latency for your end users. Another limitation
of auto scaling is that it's limited by resource usage quotas,
particularly compute quotas, which are often shared amongst multiple microservices.
This means that there may be limits on the number of resources that can be
added, which can limit the effectiveness of autoscaling in managing load.
Additionally, autoscaling can also contribute to load amplification in dependencies.
This means that adding more resources to one part of the system can actually
overload other parts of the system, potentially leading to cascading
failures. The case study of Pokemon Go's migration to
Google Cloud Load Balancer (GCLB) is a good illustration of this point. They
moved to GCLB in order to scale the load balancing layer, but once that
layer was scaled up, the increased load overwhelmed their
backend stack. This actually ended up prolonging their outage rather
than helping. So while autoscaling is a helpful tool for
managing load, it's important to be aware of its limitations and to consider graceful
degradation techniques such as concurrency limits and prioritized
load shedding. Graceful degradation techniques can ensure that services continue
to serve at their provisioned capacity while scaling is performed in the
background. In this section, we'll explore how to optimize the availability and
performance of a service using concurrency limits. The idea is to
set a maximum limit on the number of inflight requests a service can handle at
any given time. Any requests beyond that limit would be rejected, or load-shed.
By setting a concurrency limit, we can ensure that the service remains performant
even under high rate of incoming traffic. The approach is based on the assumption
that we have a clear understanding of the maximum concurrency limit that
the service can support. If we can accurately determine this limit, we can take proactive
measures to maintain high performance and availability for our users.
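A minimal sketch of that idea in Python, assuming a fixed, known limit (choosing that limit well is the hard part, as we'll see shortly):

```python
import threading

class ConcurrencyLimiter:
    """Load-sheds any request beyond a fixed number of in-flight requests."""

    def __init__(self, limit):
        self._slots = threading.Semaphore(limit)

    def handle(self, request_fn):
        # Non-blocking acquire: if every slot is taken, shed the request
        # immediately instead of letting it queue up in front of the service.
        if not self._slots.acquire(blocking=False):
            return "503 Service Unavailable"  # load-shed
        try:
            return request_fn()
        finally:
            self._slots.release()

limiter = ConcurrencyLimiter(limit=10)
print(limiter.handle(lambda: "200 OK"))  # slots free, so the request is served
```

Rejecting excess requests fast keeps response times bounded for the requests that are admitted.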
Let's take a look at how concurrency limits can help preserve service performance.
The following chart provides a visual representation of this concept. On the x-axis,
we have the incoming throughput of requests. The left y-axis shows the median latency,
while the right y-axis shows availability, which is represented as a proportion of
requests served before the timeout. As shown on the chart, the availability remains at
100% even when the incoming throughput becomes much higher than
what the service can process. This is due to the fact that any excess load
is shed because of the maximum concurrency limit set on the service.
This means that the service can continue to perform optimally even under high
traffic. The chart demonstrates how concurrency limits can help us preserve performance
and availability for our users even in the face of high traffic.
Implementing concurrency limits for a service can help to preserve performance, but it
also presents some challenges. One of the main challenges is determining the
maximum number of concurrent requests that a service can process.
Setting this limit too low can result in requests being rejected
even when the service has plenty of capacity, while setting the limit too high can
lead to slow and unresponsive servers. It's also difficult to determine
the maximum concurrency limit in a constantly changing microservices environment.
With new deployments, autoscaling, new dependencies
popping up, and changing machine configurations, the ideal value can
quickly become outdated, leading to unexpected outages or overloads. This highlights
the need for dynamic and adaptive concurrency limits that can adapt to changing workloads
and dependencies. Having such a system in place will protect not just against
traffic spikes, but also against performance regressions.
Now let's examine how we can implement concurrency limits effectively.
As we saw earlier, concurrency limits can be an effective solution for preserving
the performance of a service, but it can be challenging to determine the maximum number
of concurrent requests that a service can support. Therefore,
we will be using the open source project Aperture to implement dynamic concurrency limits,
which can adapt to changing workloads and dependencies. This is a high
level diagram of how the Aperture agent interfaces with your service. The Aperture agent
runs next to your services. On each request, the service checks with the Aperture agent
whether to admit the request or drop it. The Aperture agent returns a yes or
no answer based on the overall health of the service and the rate of incoming
requests. Before we dive deeper, here is some high level information about
Aperture. Aperture is an open source reliability automation platform.
Aperture is designed to help you manage the load and performance of your microservices.
With Aperture, you can define and visualize your load management policies using
a declarative policy language that's represented as a circuit graph.
This makes it easy to understand and maintain your policies over time.
Aperture supports a wide range of use cases including concurrency limiting,
rate limiting, workload prioritization, and auto scaling.
It integrates seamlessly with popular language frameworks so you can quickly
and easily add it to your existing environment. And if you're using a service mesh
like Envoy, you can easily insert Aperture into your architecture without having
to make any changes to your service. Aperture policies
are designed using a circuit graph. These policies can be used as ready-to-use
templates that can work with any service. The policy shown here adjusts
the concurrency limit of a service based on response times, which are an indicator of
service health. Let's take a closer look at how the circuit works.
On the top left, we have a PromQL component that queries the response time
of a service from Prometheus. This response time signal is then
trended over time using an exponential moving average. The current value
of the response time is compared with the long term trend to determine if the
service is overloaded. These signals are then fed into an AIMD
concurrency control component which controls the concurrency limit.
This control component is inspired by TCP congestion control
algorithms: it gradually increases the concurrency limit of a service by ramping up
the throughput. If the response times start deteriorating, there is multiplicative
backoff in place to prevent further degradation. To demonstrate
the effectiveness of Aperture's concurrency control policy,
we will be simulating a test traffic scenario. We have designed the service in a
way that it can serve only up to ten users concurrently. Using the k6 load
generator, we will alternate the number of users below and above ten in
order to simulate periodic overloads. This will help us show how Aperture dynamically
adjusts the concurrency limit based on service health. Let's
compare the response times of the service before and after installing Aperture's
concurrency control policy. The first panel shows the response times
of the service, while the second panel displays the count of different decisions
made by Aperture agents on incoming requests.
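For intuition, the additive-increase / multiplicative-decrease behavior described earlier can be sketched as a small control step (the thresholds and step sizes below are invented for illustration, not Aperture's actual policy):

```python
def aimd_step(limit, latency_s, latency_trend_s,
              overload_factor=1.1, increase=1, backoff=0.5, min_limit=1):
    """One AIMD adjustment of a concurrency limit.

    Overload is inferred by comparing current latency against its long-term
    trend (e.g. an exponential moving average), as in the circuit above.
    """
    if latency_s > overload_factor * latency_trend_s:
        # Multiplicative decrease: back off quickly under overload.
        return max(min_limit, int(limit * backoff))
    # Additive increase: gently probe for more capacity when healthy.
    return limit + increase

limit = 10
limit = aimd_step(limit, latency_s=0.05, latency_trend_s=0.05)  # healthy -> 11
limit = aimd_step(limit, latency_s=0.20, latency_trend_s=0.05)  # overloaded -> 5
print(limit)
```

The slow ramp-up and fast backoff is what keeps the service near, but not beyond, its true capacity as conditions change.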
The left part of the chart highlights the issue of high latency without Aperture.
However, once Aperture's policy is in place, it dynamically limits
the concurrency as soon as response times start to deteriorate. As a result,
the latency remains within a reasonable bound. The second panel shows the number
of requests that were accepted or dropped by Aperture when the policy is applied.
The blue and red lines indicate the number of requests that were dropped by
Aperture. One important question in load management is determining
which requests to admit and which ones to drop. This is where prioritized
load shedding comes in. In some cases, certain users or application paths are more
important and need to be given higher priority. This requires some sort of scheduling
mechanism. Inspired by packet scheduling techniques such as weighted fair queuing,
Aperture allows sharing resources among users and workloads in
a fair manner. For example, you can specify that /checkout is a higher priority
than /recommendation, or that subscribed users should be
allocated a greater share of resources compared to guest users. Aperture's scheduler
will then automatically figure out how to allocate the resources. In the test traffic
scenario, there are equal numbers of guest and subscribed users. The yellow line
in the second panel represents the acceptance rate for subscribed users,
whereas the green line represents the acceptance rate for the guest users.
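As a rough sketch of weighted sharing (the weights and numbers below are hypothetical, and Aperture's weighted-fair-queuing scheduler is considerably more involved than this):

```python
def allocate(capacity_rps, workloads):
    """Split available capacity across workloads in proportion to weight.

    workloads: {name: (weight, demand_rps)}. Capacity left over by workloads
    that demand less than their fair share is redistributed to the others.
    """
    remaining = dict(workloads)
    allocation = {name: 0.0 for name in workloads}
    while remaining and capacity_rps > 1e-9:
        total_weight = sum(w for w, _ in remaining.values())
        satisfied, used = [], 0.0
        for name, (weight, demand) in remaining.items():
            fair_share = capacity_rps * weight / total_weight
            if demand <= fair_share:
                allocation[name] += demand   # fully satisfied
                used += demand
                satisfied.append(name)
            else:
                allocation[name] += fair_share
                remaining[name] = (weight, demand - fair_share)
                used += fair_share
        capacity_rps -= used
        if not satisfied:
            break  # everyone is capped at their weighted share
        for name in satisfied:
            del remaining[name]
    return allocation

# Hypothetical: subscribed users weighted 4x over guests, equal demand.
print(allocate(100, {"subscribed": (4, 100), "guest": (1, 100)}))
# -> {'subscribed': 80.0, 'guest': 20.0}
```

With a 4:1 weight and equal demand, subscribed users end up with four times the admitted throughput of guests.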
As can be seen, during the overload condition, subscribed users are getting
roughly four times the acceptance rate of guest users due to the
higher priority assigned to them. And that concludes our talk on graceful degradation:
keeping the lights on when everything goes wrong. I hope you have gained valuable
insights on how to improve the reliability of your microservices through graceful degradation.
In this presentation, we have covered the importance of load management in microservices and
the consequences of poor load management. We've also explored the limitations of auto
scaling and the challenges of implementing concurrency limits. We introduced
you to Aperture, a platform for reliability automation which brings rate
limits and concurrency limits to any service and even performs load-based
autoscaling. With its integration with Prometheus and the ability to
perform continuous signal processing on metrics, Aperture offers a
comprehensive solution for controlling and automating microservices. We encourage you to check
out the Aperture project on GitHub and give it a try. Your feedback and contributions
are always welcome to help us improve the platform and make it better for the
community. Thank you for joining us for this talk today.