Transcript
Hi, my name is Brian Barkley. I'm an engineer at LinkedIn,
and I'm going to be joined by my colleague Vivek Deshpande to
discuss a framework that we've helped to develop called Hodor.
LinkedIn counts on Hodor to protect our microservices by detecting
when our systems have become overloaded and providing relief to
bring them back to a stable state. So this is
our goal. We want to increase our overall service resiliency,
uptime, and availability, while minimizing the impact to the members who are using the site. We also want to
do this with an out-of-the-box solution that doesn't require service owners to customize things or maintain configuration that's specific to
their service. I'll also note that what we'll be presenting
here is all within the context of code running on the Java virtual
machine, though some of the concepts would translate well to other
runtime environments. So here's our
agenda for the talk. I'll provide an overview of Hodor, then Vivek
will discuss the various types of overload detectors we've developed.
I'll talk about how we remediate overload situations,
how we rolled this out to hundreds of existing services safely.
We'll take a look at a success story from our
production environment and discuss some related work and
what else we have planned for the future.
So, for anyone wondering about the name, as you can see,
Hodor stands for holistic overload detection and
overload remediation. It protects our services and
so also has a similarity in that respect to a certain fictional character.
So let's get into some of the details of Hodor, the situations it's meant to address, and how the pieces fit together.
To start, let's talk about what it means for a service to be overloaded.
We define it as the point at which it's unable to serve traffic
with reasonable latency. We want to maximize the goodput
that a service is able to provide. And so when a service has saturated
its usage of a given resource, we need to prevent oversaturation
and degradation of that resource. What sort of
resources are we talking about here? They can be physical or virtual
resources. Some obvious examples of physical resources are CPU usage, memory usage, network bandwidth, or disk I/O, and all of these have hard
physical limits that are just not possible to push past or increase via
some configuration tweaks. Once these limits are hit,
performance starts to degrade pretty quickly.
Virtual resources are different altogether, but can have the same impact when they are fully utilized. Some examples include threads
available for execution, pooled DB connections, or semaphore permits.
These limits can be reached in different ways. A clear case is increased traffic to a service: if the number of requests
per second goes up 5x or 10x over normal, things aren't going to go well on that machine if it's not provisioned for that amount
of load. Another, more subtle case is if some service downstream starts to slow down. That effectively
applies backpressure up the call chain, and services higher up are affected by the added latency in those calls. It'll slow
down the upstream service and cause it to have more requests in flight, which could lead to memory issues or thread exhaustion.
One more example is if your service is running alongside others
on the same host without proper isolation. If a neighboring
process is using a lot of CPU, disk, or network bandwidth,
your service will be negatively impacted without any change to
traffic or downstream latencies. So this
is where the holistic part comes in. We want to be able to catch as many of these types of issues as possible, even though the
services using Hodor can have wildly different traffic patterns, execution and threading models, and workloads. And as
I mentioned before, it shouldn't take any configuration. We have hundreds and hundreds of services running on tens of thousands of machines, and tweaking
things for each of those just isn't feasible.
Now, on to addressing the problem once it's been detected: we begin dropping requests and returning 503 Service Unavailable responses.
But we want to minimize these and drop just enough traffic to
mitigate the problem. The tricky part here is that
the amount of traffic that needs to be dropped and the overall capacity of the
service can easily vary depending on the nature of the overload itself. For example, the amount of traffic that a service
can handle may be much different if the CPU is saturated
compared to if there's back pressure from a downstream
and memory usage is becoming a problem.
So we have to be flexible and dynamic, both in detecting overloads
and also in knowing how much traffic to drop.
So what's hodor made of? There are basically three main
components. First, there are what we call overload
detectors. There could be multiple of these registered,
including any that might be application specific.
Vivek will be talking about the standard ones we provide in a bit.
The detectors are queried for each inbound request with some metadata about the request. This allows the detectors to operate on a
context-specific level if needed, and potentially only detect overload and shed traffic for a targeted subset of requests.
Most of these detectors, though, operate on a global level and don't do any request-specific processing. Instead, they're fetching
an asynchronously calculated state indicating whether the detector considers things to be overloaded.
Second, we have the load shedder. This decides which traffic should be dropped. Once a detector has signaled that there's an
overload, the shedder needs to be intelligent about which requests should be dropped, making sure that too much traffic isn't rejected,
but that enough is dropped to exit the overloaded state.
The load shedder takes the signal from the detectors as input, as well as some contextual information about the request, to make these decisions.
Finally, there's a platform-specific component that combines the detectors and load shedder and adapts request metadata into Hodor's
APIs. The detectors and shedder are platform agnostic. At LinkedIn, we primarily use an open source project we developed
called Rest.li for communication between services, so that's the first platform we had Hodor running on. We've since
adapted it to work with gRPC as well as HTTP servers such as the Play Framework.
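To make that structure concrete, here is a minimal sketch in Java of how these pieces could fit together. The names (OverloadDetector, LoadShedder, RequestContext) and field choices are illustrative assumptions for this writeup, not Hodor's actual API.

```java
/** Illustrative request metadata; the exact fields Hodor uses are not spelled out in the talk. */
final class RequestContext {
    final String endpoint;
    final int priorityTier;
    RequestContext(String endpoint, int priorityTier) {
        this.endpoint = endpoint;
        this.priorityTier = priorityTier;
    }
}

/** Queried for each inbound request; most implementations ignore the request
 *  itself and simply return an asynchronously calculated overload state. */
interface OverloadDetector {
    boolean isOverloaded(RequestContext context);
}

/** Decides whether this particular request should be rejected with a 503,
 *  given the combined detector signal and the request metadata. */
interface LoadShedder {
    boolean shouldDrop(boolean overloadDetected, RequestContext context);
}
```

A platform adapter (for Rest.li, gRPC, or an HTTP server) would translate its own request type into something like RequestContext and invoke these interfaces early in the request pipeline.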
I'm going to hand things off to Vivek now, and he will get into more detail on how some of our detectors work to determine when a service is overloaded.
Thanks, Brian. Hello everyone. Now let's talk about how overload detection is done using different detectors.
The first detector is the CPU detector. The objective of the CPU detector is to quickly and accurately detect CPU overloads.
The idea is to have a lightweight background thread running at the same priority as the application threads, which execute the business logic.
This background thread, known as the heartbeat thread, is scheduled to
wake up every ten milliseconds. The overall amount of work here is trivial and has a negligible impact on application performance.
The heartbeat overload detection algorithm monitors whether the heartbeat thread is getting CPU time every ten milliseconds. Once the heartbeat algorithm
realizes that the heartbeat thread is consistently not getting CPU time at the expected intervals, we flag an overload.
It is important to note here that a few violations are not enough to flag an overload, and that the algorithm flags an
overload only when we have high confidence that the CPU is overloaded. In concept, this idea is straightforward,
but determining the appropriate variables and parameters for this algorithm to maximize precision while maintaining high
recall was challenging. To provide a concrete example, we have the thread sleep for ten milliseconds each
time, and if the 99th percentile in a second's worth of data is over 55 milliseconds, that window is in violation.
If eight consecutive windows are in violation, the service is considered overloaded. The values for these thresholds that we
use are determined by synthetic testing as well as by sourcing data
from production and comparing it with performance metrics when the
services were considered to be overloaded. The rationale behind using a heartbeat thread is, one, it directly measures
the useful CPU time available to the application in real time. What we mean by this is that just because you see 30% free CPU
in something like the top command does not mean that it is useful CPU. And, number two, the concept of the heartbeat thread is
applicable everywhere, irrespective of the environment or the application type.
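As a rough illustration of the mechanism just described, here is a minimal heartbeat detector sketch in Java. The 10 ms period, the 55 ms p99 threshold per one-second window, and the eight consecutive violating windows come from the talk; the class structure, names, and exact bookkeeping are assumptions, not Hodor's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class HeartbeatCpuDetector {
    private static final long PERIOD_MS = 10;      // heartbeat sleep interval
    private static final long P99_LIMIT_MS = 55;   // per-window p99 delay threshold
    private static final int WINDOWS_TO_TRIP = 8;  // consecutive violating windows

    private final List<Long> delaysInWindow = new ArrayList<>(); // touched only by the heartbeat thread
    private volatile boolean overloaded = false;
    private int consecutiveViolations = 0;
    private long windowStartNanos;

    /** Starts the heartbeat thread at normal (application) priority. */
    public void start() {
        Thread t = new Thread(this::loop, "heartbeat-detector");
        t.setDaemon(true);
        t.start();
    }

    /** Read on the request path; updated asynchronously by the heartbeat thread. */
    public boolean isOverloaded() {
        return overloaded;
    }

    private void loop() {
        windowStartNanos = System.nanoTime();
        while (!Thread.currentThread().isInterrupted()) {
            long before = System.nanoTime();
            try {
                Thread.sleep(PERIOD_MS);
            } catch (InterruptedException e) {
                return;
            }
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - before);
            // Scheduling delay: how much longer than the requested 10 ms sleep it took
            // for this thread to get CPU time again.
            delaysInWindow.add(Math.max(0, elapsedMs - PERIOD_MS));

            if (System.nanoTime() - windowStartNanos >= TimeUnit.SECONDS.toNanos(1)) {
                closeWindow();
                windowStartNanos = System.nanoTime();
            }
        }
    }

    private void closeWindow() {
        if (delaysInWindow.isEmpty()) {
            return;
        }
        Collections.sort(delaysInWindow);
        int idx = Math.min(delaysInWindow.size() - 1, (int) (delaysInWindow.size() * 0.99));
        long p99DelayMs = delaysInWindow.get(idx);
        if (p99DelayMs > P99_LIMIT_MS) {
            consecutiveViolations++;
        } else {
            consecutiveViolations = 0;
        }
        // Flag an overload only after sustained evidence, to keep false positives low.
        overloaded = consecutiveViolations >= WINDOWS_TO_TRIP;
        delaysInWindow.clear();
    }
}
```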
So this is an example of the CPU detector in action.
In the top left graph you can observe that the heartbeat detector is able to capture a CPU overload. Notice that the performance
indicators, such as the average and p90 latencies and p95 CPU, all spike when the heartbeat detector
flags an overload. Now let's move on to
the next detector which focuses on memory bottlenecks.
The objective of the GC overload detector is to quickly and accurately detect increased garbage collection activity for
applications with automatic memory management, like Java applications. The idea is to observe the overhead of GC activity
in a lightweight manner to detect overload in real time.
On each GC event, we calculate the amortized percentage of time spent in GC over a given time window.
We call that the GC overhead. A schedule is set on top of the GC overhead: the GC overhead
percentage range is divided into tiers, called GC overhead tiers. If the duration spent in a GC overhead tier
exceeds the volatility period for that tier, then a GC overload is signaled. The volatility period is smaller for a
higher GC overhead tier, as a higher tier indicates more severe GC activity. For example, a GC overhead
of 10% or more for, say, 30 seconds across consecutive GCs is considered an overload, while a
lower tier, such as a GC overhead of 8% or more for, say, 60 seconds, is also considered an overload,
and so on. The rationale behind using the percentage of time spent in GC is that it captures both GC duration
and GC frequency, so it can catch different kinds of GC issues. It also makes it possible to set a common threshold that
works across all applications with different allocation rates, old generation occupancy levels, and so on.
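Here is a minimal sketch of the same idea in Java. It polls GarbageCollectorMXBean rather than reacting to individual GC events as described above, so treat it as an approximation: the tier values (10% or more for 30 seconds, 8% or more for 60 seconds) come from the talk, while the polling approach and all names are assumptions.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverheadDetector {
    /** An overhead tier: signal overload once the overhead stays at or above
     *  the threshold for the tier's volatility period. */
    private static final class Tier {
        final double minOverheadPct;
        final long volatilityMs;
        long enteredAtMs = -1; // -1 means "not currently in this tier"
        Tier(double minOverheadPct, long volatilityMs) {
            this.minOverheadPct = minOverheadPct;
            this.volatilityMs = volatilityMs;
        }
    }

    // Higher tiers (more severe GC activity) trip faster.
    private final Tier[] tiers = {
        new Tier(10.0, 30_000),  // >= 10% time in GC for 30 seconds
        new Tier(8.0, 60_000),   // >= 8% time in GC for 60 seconds
    };

    private long lastGcTimeMs = totalGcTimeMs();
    private long lastSampleAtMs = System.currentTimeMillis();
    private volatile boolean overloaded = false;

    public boolean isOverloaded() {
        return overloaded;
    }

    /** Call periodically (e.g. every second) from a background thread. */
    public void sample() {
        long nowMs = System.currentTimeMillis();
        long gcTimeMs = totalGcTimeMs();
        long wallDeltaMs = Math.max(1, nowMs - lastSampleAtMs);
        // Amortized percentage of wall-clock time spent in GC since the last sample.
        double overheadPct = 100.0 * (gcTimeMs - lastGcTimeMs) / wallDeltaMs;
        lastSampleAtMs = nowMs;
        lastGcTimeMs = gcTimeMs;

        boolean tripped = false;
        for (Tier tier : tiers) {
            if (overheadPct >= tier.minOverheadPct) {
                if (tier.enteredAtMs < 0) {
                    tier.enteredAtMs = nowMs;
                }
                tripped |= (nowMs - tier.enteredAtMs) >= tier.volatilityMs;
            } else {
                tier.enteredAtMs = -1; // dropped below this tier, reset its timer
            }
        }
        overloaded = tripped;
    }

    private static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = bean.getCollectionTime();
            if (t > 0) total += t;
        }
        return total;
    }
}
```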
So, similar to the CPU detector, this is an example of the GC overload detector in action, when GC activity increases
because of an increase in traffic. In the top left graph you can observe that the GC detector is able to capture
a GC overload. Notice that the performance indicators, such as the p90 and p99 latencies, both spike
when the GC detector flags an overload.
Now we will look at a virtual resource overload detector.
A study of thread pool queue wait time and its data suggests that there is a good correlation between degraded KPIs,
such as increased latencies, and increased thread pool queue wait times for synchronous services. Consider a synchronous service:
requests will start spending more time waiting in a queue if the request processing time increases, either due to
an issue in the service itself or in one of its downstreams.
The capacity of a service can also be reached when the latencies of downstream traffic increase, which can cause the number of concurrent requests
being handled in the local service to increase with no change to the incoming request rate. But without knowing anything
about the downstream service, we can assert upstream, by monitoring thread pool queue wait time, that there is thread pool starvation,
and by dropping traffic we can help alleviate the downstream.
At LinkedIn we use the Jetty server-side framework extensively, and hence we targeted that as a first step. But the logic of observing thread pool queue wait time is widely applicable.
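To make the signal concrete, here is a minimal sketch of measuring queue wait time with a plain ThreadPoolExecutor wrapper. Hodor hooks into the server framework's own request pool rather than this class, and the 100 ms threshold plus the single-sample check are illustrative assumptions; a real detector would look at a windowed percentile.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class QueueWaitMeasuringExecutor extends ThreadPoolExecutor {
    private static final long WAIT_LIMIT_MS = 100; // illustrative threshold
    private final AtomicLong lastObservedWaitMs = new AtomicLong();

    public QueueWaitMeasuringExecutor(int threads) {
        super(threads, threads, 0, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    @Override
    public void execute(Runnable task) {
        long enqueuedAtNanos = System.nanoTime();
        super.execute(() -> {
            // Time the task spent waiting in the queue before a worker picked it up.
            long waitMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - enqueuedAtNanos);
            lastObservedWaitMs.set(waitMs);
            task.run();
        });
    }

    /** A real detector would use a windowed percentile, not a single sample. */
    public boolean isOverloaded() {
        return lastObservedWaitMs.get() > WAIT_LIMIT_MS;
    }
}
```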
Similar to the previous detectors, this is an example of the thread pool overload detector in action, where there is an issue in downstream processing
that causes an increase in the thread pool queue wait time. In the top left graph you can observe that the thread pool detector is able to
capture an overload. Notice that the performance indicators, such as the average and p99 latencies, spike when
the detector flags an overload. Now back to Brian, who is going to talk about remediation techniques. Thank you.
Thanks for that. So the question now is, once we've
identified there's a problem, how can we address it with minimal impact?
Well, we need to reduce the amount of work that a service is doing,
and we do that by rejecting some requests. The trick here
is to identify the proper number of requests to reject, since dropping too many would have a negative impact on our users.
We've tried and tested a few different load shedding approaches and found that
the most effective is to limit the number of concurrent requests handled by a service.
The load shedder adaptively determines the number of requests that need to be dropped
by initially being somewhat aggressive while shedding the traffic,
and then more conservative about allowing more traffic back
in. When the load shedder drops requests, they're returned as 503s, and these can be retried on another healthy
host if one is available. We experimented
with other forms of adaptive load shedding, including using a percentage
based threshold to adaptively control the amount of traffic handled
or rejected. But during our tests we found that a percentage-based shedder didn't do a very good job, especially when traffic
patterns changed, as it continually needed to adapt to the new traffic levels, whether they were increasing or decreasing relative to
previous thresholds. The graphs shown here
are from one of the experiments we ran where the red host
was unprotected and the blue host had load shedding enabled.
They started off receiving identical traffic levels until becoming overloaded, where the behavior diverged, as you can see in the middle graph.
As the overall queries per second increase, the protected host is forced to increase the number of requests
that are dropped. You can also see that the overall high-percentile latency is lower on the protected host,
but there are a few spikes where the load shedder is probing to see if the concurrency limit can be increased by slowly
letting in more traffic.
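Here is a minimal sketch of concurrency-limit-based shedding in the spirit described here: cut the limit aggressively when the detectors fire, and probe it back up slowly once things look healthy. The constants and the halve/increment update rule are illustrative assumptions, not Hodor's actual algorithm.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrencyLimitShedder {
    private static final int MIN_LIMIT = 8;
    private static final int MAX_LIMIT = 10_000;

    private final AtomicInteger inFlight = new AtomicInteger();
    private volatile int limit = 200; // starting concurrency limit (illustrative)

    /** Call at the very start of request handling. Returns false => reject with a 503. */
    public boolean tryAcquire() {
        if (inFlight.incrementAndGet() > limit) {
            inFlight.decrementAndGet();
            return false;
        }
        return true;
    }

    /** Call when a permitted request finishes. */
    public void release() {
        inFlight.decrementAndGet();
    }

    /** Call periodically with the combined detector signal. */
    public void onDetectorSignal(boolean overloaded) {
        if (overloaded) {
            // Aggressive while shedding: halve the limit, with a floor.
            limit = Math.max(MIN_LIMIT, limit / 2);
        } else {
            // Conservative while recovering: let a little more traffic back in
            // and watch whether the detectors fire again.
            limit = Math.min(MAX_LIMIT, limit + 1);
        }
    }
}
```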
So I'd mentioned that Hodor rejects requests with 503s.
This is done early on in the request pipeline before any business
logic is executed so they're safe to retry on another
healthy host. This reduces overall member impact
because the 503 response is returned quickly
to the client, giving it time to retry the request someplace else.
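As an illustration of rejecting that early in the pipeline, here is a sketch written as a plain servlet filter, reusing the hypothetical ConcurrencyLimitShedder from the earlier sketch. Hodor itself plugs into Rest.li, gRPC, and HTTP server filter chains rather than this exact API.

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

public class LoadSheddingFilter implements Filter {
    private final ConcurrencyLimitShedder shedder; // hypothetical shedder from the earlier sketch

    public LoadSheddingFilter(ConcurrencyLimitShedder shedder) {
        this.shedder = shedder;
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        // Reject before any business logic runs: the request is cheap to drop
        // and safe for the client to retry on another, healthier host.
        if (!shedder.tryAcquire()) {
            ((HttpServletResponse) res).sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            return;
        }
        try {
            chain.doFilter(req, res);
        } finally {
            shedder.release();
        }
    }

    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}
}
```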
But we don't want to blindly retry all requests that are
dropped by Hodor, because if all hosts in the cluster are overloaded, sending additional retry traffic can actually make the problem
worse. To prevent this, we've added functionality
on the client and server side to be able to detect issues that are cluster
wide and prevent retry storms. This is done by
defining client and server side budgets and not
retrying when any of these budgets have been exceeded.
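Here is a minimal sketch of what a client-side retry budget can look like, assuming a simple fixed window and a 10% retry ratio; the actual budgets and their accounting in our client stack aren't detailed here.

```java
import java.util.concurrent.atomic.AtomicLong;

public class RetryBudget {
    private static final double MAX_RETRY_RATIO = 0.1; // retries may add at most 10% extra load
    private final AtomicLong requestsInWindow = new AtomicLong();
    private final AtomicLong retriesInWindow = new AtomicLong();

    /** Record every outgoing request (first attempts only). */
    public void onRequest() {
        requestsInWindow.incrementAndGet();
    }

    /** Ask before retrying a 503 from an overloaded host. */
    public boolean tryAcquireRetry() {
        long requests = Math.max(1, requestsInWindow.get());
        long retries = retriesInWindow.get();
        if ((retries + 1) > requests * MAX_RETRY_RATIO) {
            // Budget exhausted: likely a cluster-wide problem, so don't pile on.
            return false;
        }
        retriesInWindow.incrementAndGet();
        return true;
    }

    /** Reset periodically (e.g. every 10 seconds) to approximate a rolling window. */
    public void resetWindow() {
        requestsInWindow.set(0);
        retriesInWindow.set(0);
    }
}
```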
I'm going to talk briefly now about how we went about rolling this out to
the hundreds of separate microservices
that we operate at LinkedIn. So we needed to
be cautious when rolling this out to make sure that we weren't causing impact
to our members from any potential false positives from the
detectors. We did this by enabling the detectors in monitoring mode,
where the signal from the detectors is ignored by the load shedder, but all relevant metrics are still active and collected.
This allowed us to set up a pipeline for rollout where we could monitor when detectors were firing and correlate those
events with other service metrics, such as latency, CPU usage, garbage collection activity, et cetera, over the same time periods,
before enabling load shedding. We monitored a service for at least a week, which would include
load tests that were running in production. During that time, we found that
some services were not good candidates for onboarding using our default
settings. These were almost always due to issues with garbage collection
and could usually be solved by tuning the GC. In some
cases, this actually led to significant discoveries around inefficient memory
allocation and usage patterns, which needed to be addressed in the service
but hadn't been surfaced before. Making changes
to address these ended up being significant wins for those services, as they led to reduced CPU usage, better overall performance
and scalability, and they were able to onboard to Hodor's protection as a side benefit.
So at the bottom here is a quote from one of the teams after adoption: "Adding overload detectors to
our service has surfaced unexpected behavior that owners were generally not familiar with."
And we've truly found some odd behavior surfaced by our detectors. For example, in one service
we found that thread dumps were being automatically triggered and written
to disk periodically based on a setting that the owners had enabled
and forgotten about. The manifestation of this
was periodic freezes of the JVM while the thread dumps were happening,
which lasted over a few seconds in some cases, but this
didn't register in our higher percentile metrics,
so the service owners were never aware of the problem. Once onboarded
to Hodor, though, it became very clear when the detectors fired
and the load shedder engaged. There are other examples similar
to this where fairly impactful and usually fairly interesting behavior
went unnoticed until uncovered by our system.
So next I'm going to go through a quick example from one of our production
services.
So this is from our flagship application, which powers the main LinkedIn.com site as well as the mobile
clients for iOS and Android. We periodically do traffic
shifts between our data centers for various reasons. In one of
these cases, there was a new bug that was introduced in the service that only
appeared when the service was extremely stressed.
This traffic shift event triggered the bug,
and Hodor intervened aggressively to handle the situation.
You can see in the top graph that Hodor engaged for a
good amount of time with a few spikes which lined up directly with when
latencies were spiking. Overall,
about 20% of requests were dropped
during this overload period, which sounds bad, but when SREs investigated further, they found that if the load hadn't been shed,
this would likely have become a major site issue instead of a minor one, with the service down instead of still
serving a partial amount of traffic. We currently have over 100 success stories similar to this, where Hodor engaged to protect
a service and mitigate an issue.
To end with, I'm going to talk about some related work that integrates
well with Hodor and some things that we have planned for the future.
First up is a project that has been integrated with Hodor and
live for some amount of time. Our term for it is traffic tiering,
but it's also known as traffic prioritization or traffic
shaping. It's a pretty simple concept. Some requests
can be considered more important than others. For example,
our web and mobile applications will often prefetch some data
that they expect might be used soon, but if it's not available, there's no actual impact to the user; it just gets fetched later
on demand. Requests like this can be considered lower priority than directly fetching data that the user has requested.
Similarly, we have offline jobs that utilize the same services
that take user traffic, but nobody's sitting
behind a screen waiting for that data. It's safe to retry
those offline requests at a later time when the service isn't overloaded.
So with traffic tiering, we're able to sort different types of requests into different categories and start
dropping the traffic with the lowest importance first, only moving on to affect higher-priority traffic if necessary.
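Here is a minimal sketch of the tiering idea, with hypothetical tier names and a simple escalation policy; the actual categories and policy we use aren't spelled out here.

```java
public class TieredShedder {
    /** Ordered from lowest to highest importance; names are illustrative. */
    public enum Tier { OFFLINE, PREFETCH, INTERACTIVE }

    // While overloaded, start by dropping only OFFLINE traffic and raise the
    // drop level if the overload persists; null means not shedding at all.
    private volatile Tier dropAtOrBelow = null;

    /** Drop the request if its tier is at or below the current drop level. */
    public boolean shouldDrop(Tier requestTier) {
        Tier level = dropAtOrBelow;
        return level != null && requestTier.compareTo(level) <= 0;
    }

    /** Called while the overload persists: affect the next-higher tier. */
    public synchronized void escalate() {
        if (dropAtOrBelow == null) {
            dropAtOrBelow = Tier.OFFLINE;
        } else if (dropAtOrBelow.ordinal() < Tier.values().length - 1) {
            dropAtOrBelow = Tier.values()[dropAtOrBelow.ordinal() + 1];
        }
    }

    /** Called as the overload clears: step back down one tier at a time. */
    public synchronized void relax() {
        if (dropAtOrBelow == null) {
            return;
        }
        int idx = dropAtOrBelow.ordinal();
        dropAtOrBelow = (idx == 0) ? null : Tier.values()[idx - 1];
    }
}
```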
Secondly, we're working on developing new types of detectors to cover blind
spots in the ones we have. One of these is actually a method of
confirming that the detected overload is impacting core metrics.
So we've had cases of false positives where there is an underlying
issue with the service, usually GC related, which isn't
affecting the user perceived performance, but is impacting the ability
of the app to scale further and maximize its capacity.
In these cases, we don't want to drop traffic, but we do want
to signal that there is an issue. So we're working on correlating CPU
or GC overload signals with latency metrics
and only dropping traffic when there's a clear degradation in performance.
We're also starting to adapt Hodor to frameworks other than Rest.li, such as the Play Framework, as well as Netty. These have
different threading models and in some cases don't work as well with the heartbeat-based CPU detection. So, for example, we're working
on a detector specifically for Netty's event loop threading model. Finally, we're looking into leveraging the overload
signals by feeding them into our elastic scaling system. This seems like an obvious match: if a service cluster has become overloaded,
we can just spin up some more instances to alleviate the problem,
right? Well, it turns out it's not that simple, especially when
there are a mix of stateful and stateless systems within an
overloaded call tree. In many cases, just scaling up one
cluster that is overloaded would just propagate the issue further
downstream and cause even more issues. This is an area
where we're still exploring and hope to address in the future.
So I hope that this presentation was enlightening for you and
you learned something new. Thank you for your time. I'd also
like to thank the different teams at LinkedIn that came together to make this project
possible and successful. That's it from
us. Enjoy the rest of the conference.