Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. I am Tanveer Gill, co-founder and CTO of
FluxNinja. I've spent the better part of the last decade working with SRE
and DevOps practitioners. I co-founded two companies in this space, the last
one in the observability domain, where I've gained a deep understanding
of the challenges and problems faced by practitioners like yourself.
Today, I'm eager to share my insights and knowledge with you in this presentation
on graceful degradation: keeping the lights on when everything goes
wrong. As operators and practitioners,
we know all too well that despite our best efforts to design robust
microservices, failures are an inevitable reality. Whether it's
due to bugs introduced through high velocity development or
unexpected traffic spikes, the complex interdependencies of microservices
can lead to cascading failures that can take down entire systems.
In this presentation, I'll be sharing practical techniques for implementing
graceful degradation through prioritized load shedding. By prioritizing
which workloads or users receive resources during a degraded state,
we can ensure that critical user experiences are preserved and
services remain healthy and responsive. I'll help you form an intuition
about load management by building up from basic principles of queuing theory
and Little's Law. These concepts are universally applicable to any
system that serves requests, making them valuable tools in
your arsenal. So join me as we explore how to keep the lights
on even in the face of unexpected failures. If you have any questions
during or after the presentation, please feel free to reach out to me
either over LinkedIn or Twitter. I've shared my handles here.
Let's go over the agenda for this presentation. In the first part,
we will discuss the consequences of poor load management in microservices.
This will involve exploring how failures in one part of the system can impact others.
Given the interdependent nature of microservices, we will see how a
lack of effective load management can lead to cascading failures and
even a complete system outage. In the second part, we will examine the limitations
of autoscaling as a solution for managing load in microservices.
We will use the case study of Pokemon Go's migration to Google Cloud Load Balancer (GCLB)
to understand the limitations of autoscaling and how it can impact the overall performance
of a system. The goal of this discussion is to highlight that auto
scaling is not a complete solution on its own, but rather a piece in
the larger load management puzzle. In the third part, we will discuss the
benefits of using concurrency limits in managing load in microservices,
but we will also highlight the challenges in implementing concurrency limits
in a constantly changing microservices environment. In the last part,
we will introduce you to Aperture, which addresses these challenges by providing
a dynamic and adaptive concurrency limit system. Let's get started.
Let's take a look at what happens when a service becomes overwhelmed.
The top of the diagram depicts a healthy service under normal load with a
steady response time. However, when the service becomes overloaded, requests start to back
up and response times skyrocket, eventually leading to timeouts.
This is depicted in the lower part of the diagram. There are several reasons why
a service may become overwhelmed, including unexpected traffic spikes during
new product launches or sales promotions. Or there could be service
upgrades that introduce performance regressions due to bugs, or there could just
be slowdowns in upstream services or third party dependencies. Through load
testing, we can determine that a service's latency increases under heavy load as various
factors such as thread contention, context switching, garbage collection,
or I/O contention become bottlenecks. These factors lead to a limit
on the number of requests a service can process in parallel, and this limit
is called the concurrency limit of the service. But no matter how complex the inner
workings of a service might be, it can still be modeled through Little's Law,
which states that L, the number of requests in flight, is
equal to lambda, the average throughput, multiplied by W, the
average response time. Let's apply Little's Law to a microservice.
As we discussed, L, the maximum number of requests in progress, is
capped due to the nature of the underlying resources, and W,
the response time, is predetermined by the nature of the workload.
Thus, the maximum average throughput lambda
can also be inferred from these two parameters.
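As a quick worked example (the numbers here are hypothetical, just to make the relationship concrete):

```python
# Little's Law: L = lambda * W
#   L      -> requests in flight (capped by the service's concurrency limit)
#   lambda -> average throughput (requests per second)
#   W      -> average response time (seconds)

concurrency_limit = 10   # hypothetical: at most 10 requests processed in parallel
response_time_s = 0.05   # hypothetical: 50 ms average response time

# Rearranging L = lambda * W gives the maximum sustainable throughput:
max_throughput = concurrency_limit / response_time_s
print(max_throughput)  # 200.0 requests per second
```

Beyond 200 requests per second in this example, excess requests can only queue up in front of the service.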
The service cannot handle any throughput beyond lambda, and any excess requests
must queue up in front of the service. Therefore, there is an inflection
point whenever the number of requests in flight exceeds the concurrency
limit of the service. Beyond this point, any excess requests begin
to queue up, leading to an increase in response time latency. The following
chart helps to illustrate the impact of a service becoming overwhelmed. The x-axis
plots the incoming throughput of requests, while the left y-axis
shows median latency and the right y-axis shows availability, represented
as a proportion of requests served within the timeout. As shown in the left portion of
the graph, as long as the number of requests in flight stays within the concurrency
limit, latency remains normal. Once the concurrency limit
is reached, any increase in throughput contributes to an increase in latency. As a
queue begins to build up, the availability line measures the number of requests served within
the timeout limit. Once median latency becomes equal to the timeout,
availability drops to 50%, as half of the requests are now timing
out. As throughput continues to increase, availability rapidly drops to zero.
This chart clearly shows the importance of managing load in a service to avoid latency
spikes and ensure that requests are being served within the desired time frame.
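A toy discrete-time simulation (entirely made-up numbers, not the talk's actual chart data) reproduces this shape: latency stays flat while demand is within capacity, then queueing pushes it past the timeout and availability collapses:

```python
def simulate(arrival_rate, capacity=100, timeout_s=1.0,
             base_latency_s=0.1, seconds=30):
    """Toy model of a service that completes `capacity` requests per second.

    Returns (latency, availability) after `seconds` of steady load. In this
    deterministic sketch every request sees the same latency, so availability
    collapses from 1.0 to 0.0 once latency crosses the timeout.
    """
    backlog = 0.0
    for _ in range(seconds):
        backlog += arrival_rate            # requests arriving this second
        backlog -= min(backlog, capacity)  # requests served this second
    # Waiting time grows with the backlog queued in front of the service.
    latency = base_latency_s + backlog / capacity
    availability = 1.0 if latency <= timeout_s else 0.0
    return latency, availability

for rate in (80, 100, 120):
    print(rate, simulate(rate))
```

At 80 and 100 requests per second the backlog never grows; at 120, the backlog grows every second and latency quickly blows past the timeout.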
In microservice architectures, the complex web of dependencies means
that any failure can quickly escalate into a wider outage due to cascading failures.
For example, when service a fails, it can also impact service b,
which depends on a. Additionally, failures can spread laterally within a service.
This can happen when a subset of instances fail, leading to an
increased load on the remaining instances. This sudden increase in load can cause these instances
to overload and fail, leading to a domino effect that causes a wider
outage. It's important to remember that we cannot make assumptions about the reliability
of any part of the system. To keep the system as a whole healthy, it's crucial
to implement measures that allow each subpart to degrade gracefully.
By doing so, we can minimize the chance of cascading failures
and keep our services up and running even during localized failures.
Now that we have a clear understanding of the consequences of poor load management in
microservices, let's see if autoscaling could offer a solution.
We'll examine the limitations of autoscaling and see why it's not sufficient as
a standalone load management strategy. Autoscaling is a popular solution
for managing load in microservices. While it's great for addressing persistent
changes in demand and optimizing cloud compute costs, it's not without its
limitations. One of the main limitations is that autoscaling can be slow to
respond, especially for services that need some time to become available. This means that
it may take a while for autoscaling to actually respond to a change,
which can result in increased latency for your end users. Another limitation
of auto scaling is that it's limited by resource usage quotas,
particularly compute quotas, which are often shared amongst multiple microservices.
This means that there may be limits on the number of resources that can be
added, which can limit the effectiveness of autoscaling in managing load.
Additionally, autoscaling can also contribute to load amplification in dependencies.
This means that adding more resources to one part of the system can actually
overload other parts of the system, potentially leading to cascading
failures. The case study of Pokemon Go's migration to
Google Cloud Load Balancer (GCLB) is a good illustration of this point. They
moved to GCLB in order to scale the load balancing layer, but once that
layer was scaled up, the increased load overwhelmed their
backend stack. This actually ended up prolonging their outage rather
than helping. So while autoscaling is a helpful tool for
managing load, it's important to be aware of its limitations and to consider graceful
degradation techniques such as concurrency limits and prioritized
load shedding. Graceful degradation techniques can ensure that services continue
to serve at their provisioned capacity while scaling is performed in the
background. In this section, we'll explore how to optimize the availability and
performance of a service using concurrency limits. The idea is to
set a maximum limit on the number of inflight requests a service can handle at
any given time. Any requests beyond that limit would be rejected, or load-shed.
By setting a concurrency limit, we can ensure that the service remains performant
even under high rate of incoming traffic. The approach is based on the assumption
that we have a clear understanding of the maximum concurrency limit that
the service can support. If we can accurately determine this limit, we can take proactive
measures to maintain high performance and availability for our users.
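A minimal sketch of that idea in Python, assuming a fixed, known limit (choosing that limit well is the hard part, as we'll see shortly):

```python
import threading

class ConcurrencyLimiter:
    """Load-sheds any request beyond a fixed number of in-flight requests."""

    def __init__(self, limit):
        self._slots = threading.Semaphore(limit)

    def handle(self, request_fn):
        # Non-blocking acquire: if every slot is taken, shed the request
        # immediately instead of letting it queue up in front of the service.
        if not self._slots.acquire(blocking=False):
            return "503 Service Unavailable"  # load-shed
        try:
            return request_fn()
        finally:
            self._slots.release()

limiter = ConcurrencyLimiter(limit=10)
print(limiter.handle(lambda: "200 OK"))  # slots free, so the request is served
```

Rejecting excess requests fast keeps response times bounded for the requests that are admitted.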
Let's take a look at how concurrency limits can help preserve service performance.
The following chart provides a visual representation of this concept. On the x-axis,
we have the incoming throughput of requests. The left y-axis shows the median latency,
while the right y-axis shows availability, which is represented as a proportion of
requests served before the timeout. As shown on the chart, the availability remains at
100% even when the incoming throughput becomes much higher than
what the service can process. This is due to the fact that any excess load
is shed because of the maximum concurrency limit set on the service.
This means that the service can continue to perform optimally even under high
traffic. The chart demonstrates how concurrency limits can help us preserve performance
and availability for our users even in the face of high traffic.
Implementing concurrency limits for a service can help to preserve performance, but it
also presents some challenges. One of the main challenges is determining the
maximum number of concurrent requests that a service can process.
Setting this limit too low can result in requests being rejected
even when the service has plenty of capacity, while setting the limit too high can
lead to slow and unresponsive servers. It's also difficult to determine
the maximum concurrency limit in a constantly changing microservices environment.
With new deployments, autoscaling, new dependencies
popping up, and changing machine configurations, the ideal value can
quickly become outdated, leading to unexpected outages or overloads. This highlights
the need for dynamic and adaptive concurrency limits that can adapt to changing workloads
and dependencies. Having such a system in place will protect not just against
traffic spikes, but also against performance regressions.
Now let's examine how we can implement concurrency limits effectively.
As we saw earlier, concurrency limits can be an effective solution for preserving
the performance of a service, but it can be challenging to determine the maximum number
of concurrent requests that a service can support. Therefore,
we will be using the open source project Aperture to implement dynamic concurrency limits,
which can adapt to changing workloads and dependencies. This is a high
level diagram of how the Aperture agent interfaces with your service. The Aperture agent
runs next to your services. On each request, the service checks with the Aperture agent
whether to admit the request or drop it. The Aperture agent returns a yes or
no answer based on the overall health of the service and the rate of incoming
requests. Before we dive deeper, here is some high level information about
Aperture. Aperture is an open source reliability automation platform.
Aperture is designed to help you manage the load and performance of your microservices.
With Aperture, you can define and visualize your load management policies using
a declarative policy language that's represented as a circuit graph.
This makes it easy to understand and maintain your policies over time.
Aperture supports a wide range of use cases including concurrency limiting,
rate limiting, workload prioritization, and auto scaling.
It integrates seamlessly with popular language frameworks so you can quickly
and easily add it to your existing environment. And if you're using a service mesh
like Envoy, you can easily insert Aperture into your architecture without having
to make any changes to your service. Aperture policies
are designed using a circuit graph. These policies can be used as ready-to-use
templates that can work with any service. The policy shown here adjusts
the concurrency limit of a service based on response times, which are an indicator of
service health. Let's take a closer look at how the circuit works.
On the top left, we have a PromQL component that queries the response time
of a service from Prometheus. This response time signal is then
trended over time using an exponential moving average. The current value
of the response time is compared with the long term trend to determine if the
service is overloaded. These signals are then fed into an AIMD
concurrency control component which controls the concurrency limit.
This control component is inspired by TCP congestion control
algorithms: it gradually increases the concurrency limit of a service by ramping up
the throughput. If the response times start deteriorating, there is multiplicative
backoff in place to prevent further degradation. To demonstrate
the effectiveness of Aperture's concurrency control policy,
we will be simulating a test traffic scenario. We have designed the service in a
way that it can serve only up to ten users concurrently. Using the k6 load
generator, we will alternate the number of users below and above ten in
order to simulate periodic overloads. This will help us show how Aperture dynamically
adjusts the concurrency limit based on service health. Let's
compare the response times of the service before and after installing Aperture's
concurrency control policy. The first panel shows the response times
of the service, while the second panel displays the count of different decisions
made by Aperture agents on incoming requests.
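For intuition, the additive-increase / multiplicative-decrease behavior described earlier can be sketched as a small control step (the thresholds and step sizes below are invented for illustration, not Aperture's actual policy):

```python
def aimd_step(limit, latency_s, latency_trend_s,
              overload_factor=1.1, increase=1, backoff=0.5, min_limit=1):
    """One AIMD adjustment of a concurrency limit.

    Overload is inferred by comparing current latency against its long-term
    trend (e.g. an exponential moving average), as in the circuit above.
    """
    if latency_s > overload_factor * latency_trend_s:
        # Multiplicative decrease: back off quickly under overload.
        return max(min_limit, int(limit * backoff))
    # Additive increase: gently probe for more capacity when healthy.
    return limit + increase

limit = 10
limit = aimd_step(limit, latency_s=0.05, latency_trend_s=0.05)  # healthy -> 11
limit = aimd_step(limit, latency_s=0.20, latency_trend_s=0.05)  # overloaded -> 5
print(limit)
```

The slow ramp-up and fast backoff is what keeps the service near, but not beyond, its true capacity as conditions change.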
The left part of the chart highlights the issue of high latency without Aperture.
However, once Aperture's policy is in place, it dynamically limits
the concurrency as soon as response times start to deteriorate. As a result,
the latency remains within a reasonable bound. The second panel shows the number
of requests that were accepted or dropped by Aperture when the policy is applied.
The blue and red lines indicate the number of requests that were dropped by
Aperture. One important question in load management is determining
which requests to admit and which ones to drop. This is where prioritized
load shedding comes in. In some cases, certain users or application paths are more
important and need to be given higher priority. This requires some sort of scheduling
mechanism. Inspired by packet scheduling techniques such as weighted fair queuing,
Aperture allows sharing resources among users and workloads in
a fair manner. For example, you can specify that /checkout is a higher priority
than /recommendation, or that subscribed users should be
allocated a greater share of resources compared to guest users. Aperture's scheduler
will then automatically figure out how to allocate the resources. In the test traffic
scenario, there are equal numbers of guest and subscribed users. The yellow line
in the second panel represents the acceptance rate for subscribed users,
whereas the green line represents the acceptance rate for the guest users.
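As a rough sketch of weighted sharing (the weights and numbers below are hypothetical, and Aperture's weighted-fair-queuing scheduler is considerably more involved than this):

```python
def allocate(capacity_rps, workloads):
    """Split available capacity across workloads in proportion to weight.

    workloads: {name: (weight, demand_rps)}. Capacity left over by workloads
    that demand less than their fair share is redistributed to the others.
    """
    remaining = dict(workloads)
    allocation = {name: 0.0 for name in workloads}
    while remaining and capacity_rps > 1e-9:
        total_weight = sum(w for w, _ in remaining.values())
        satisfied, used = [], 0.0
        for name, (weight, demand) in remaining.items():
            fair_share = capacity_rps * weight / total_weight
            if demand <= fair_share:
                allocation[name] += demand   # fully satisfied
                used += demand
                satisfied.append(name)
            else:
                allocation[name] += fair_share
                remaining[name] = (weight, demand - fair_share)
                used += fair_share
        capacity_rps -= used
        if not satisfied:
            break  # everyone is capped at their weighted share
        for name in satisfied:
            del remaining[name]
    return allocation

# Hypothetical: subscribed users weighted 4x over guests, equal demand.
print(allocate(100, {"subscribed": (4, 100), "guest": (1, 100)}))
# -> {'subscribed': 80.0, 'guest': 20.0}
```

With a 4:1 weight and equal demand, subscribed users end up with four times the admitted throughput of guests.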
As can be seen, during the overload condition, subscribed users are getting
roughly four times the acceptance rate of guest users due to the
higher priority assigned to them. And that concludes our talk on graceful degradation:
keeping the lights on when everything goes wrong. I hope you have gained valuable
insights on how to improve the reliability of your microservices through graceful degradation.
In this presentation, we have covered the importance of load management in microservices and
the consequences of poor load management. We've also explored the limitations of auto
scaling and the challenges of implementing concurrency limits. We introduced
you to Aperture, a platform for reliability automation which brings rate
limits and concurrency limits to any service and even performs load-based
autoscaling. With its integration with Prometheus and the ability to
perform continuous signal processing on metrics, Aperture offers a
comprehensive solution for controlling and automating microservices. We encourage you to check
out the Aperture project on GitHub and give it a try. Your feedback and contributions
are always welcome to help us improve the platform and make it better for the
community. Thank you for joining us for this talk today.