Transcript
This transcript was autogenerated.
Hi everyone. My name is Nishant Roy and I'm excited
to be here today at Conf42 Golang 2023
to talk to you about heap optimizations for Go systems.
After this session, you should have a good idea of how to triage
whether your application is being plagued by memory issues, how to
track down hotspots in your code, and how to go about optimizing
your application's performance.
Before we dive in, here's a little about myself.
I'm the engineering manager for the ad serving platform team at Pinterest,
and our team owns multiple critical systems that help power Pinterest's
over $2 billion a year ad delivery systems.
Our central ad serving platform itself is implemented in Go and
has really high performance requirements, which is why we spend a lot of time
thinking about how to scale our systems efficiently. And one
of the areas in particular that we spent a lot of time on is taming
the impact of the Go garbage collector to improve our system's performance.
So I'm here to talk about what I've learned from that experience.
So let's start with a really quick intro to memory management and how it works
in Go. Memory management at a high level refers to
allocating memory for an application upon request and then releasing it
for use by other applications once it's no longer needed.
The great part about Go is that it does not require users to perform any
manual memory management, so users do not need to manually allocate and
clear memory. Both these functionalities are abstracted away from
them, and the benefit of this is that it minimizes the chance of memory
leaks.
To decide when to run, the Go garbage collector basically has a
threshold: every time the heap hits a certain target size,
which by default is whenever the heap grows by 100% since the
last time the garbage collector ran, the garbage collector is going
to run one more time. This setting is configurable through the GOGC flag,
and there are more config flags that have been rolled out in recent versions
to make this tunable at a more granular level.
So the Go garbage collector uses what is known as a tricolor algorithm
for marking the objects, which means it divides objects into three different
sets. Objects that are marked as white are collectible, since that
means that they're not in use in memory. Objects marked as
black are not collectible since they are definitely in use in memory, and then objects
that are marked as gray, which is the third color, means they may be collectible,
but it hasn't been determined yet. So by using this
tricolor algorithm, the Go garbage collector is
able to run concurrently with your main program without
a long stop-the-world pause, which some other languages like Java famously
used to require, thereby minimizing
the impact of garbage collection on your main program itself.
So then the question is, how does garbage collection actually impact your
application's performance? The Go garbage collector
aims to use no more than 25% of the available cpu resources,
which ideally minimizes the impact on your program's
performance and latency, et cetera. However,
as memory pressure starts to increase, which means
the heap size is really large, the garbage collector suddenly
needs a lot more cpu resources. So it starts to steal resources
from your main program, which can then really start to hinder the performance
of your program itself. So, for instance, if the rate of memory allocation is
really high, then the Go garbage collector is going to start stealing
goroutines or threads from your main program to assist with the marking phase
in order to quickly and efficiently scan all the objects
in the heap and determine what can be cleared up. This does
two things. Firstly, it allows us to ensure that the rate of memory allocation
is not greater than the rate of memory cleanup, preventing the heap from growing to
be very large. Secondly,
it slows down your main program itself,
which therefore reduces the rate of memory increase as well.
So what causes GC to actually run slower? What does memory pressure
mean? So, in order to determine what memory is
ready to be cleaned up, the garbage collector needs to scan every single
object in the heap to see if it is still in use or not.
So as the number of objects in the heap grows,
so does the amount of time spent scanning the entire heap.
Then the next question is, what is actually on the heap in the first place?
And the heap essentially is one of two areas that a computer
system uses for memory allocation. The first one is known
as a stack, which is a special area of the computer's memory which stores any
temporary variables or memory allocations that are created by
a function or method. Since each function's stack frame is
cleared once the function is done executing, if the variables
within that function were not moved elsewhere, we would have no way of accessing
these variables later on. So that's where the heap comes in. The heap
is sort of a more free-floating memory region used
to store global variables, or variables that are referenced outside
the scope of a function, shared between functions, between packages, et cetera.
So how does Go determine what needs to go in the heap? There's this
process called escape analysis, which is beyond the scope of this talk, but at
a high level the way you can think about it is if an object is
only referenced within the scope of a certain function call, then we
can allocate it to the stack just for that function.
The stack will be cleared once that function is complete, and we'll lose that
object forever. So you don't need to worry about scanning it, cleaning it up later.
But if an object is accessed outside that function, then it needs
to be allocated to the heap in order
for it to be accessible later on. So that
is the essence of escape analysis.
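To make escape analysis concrete, here's a minimal sketch you can build with the compiler's -m flag to print its escape decisions (the function names here are just for illustration):

```go
package main

// Build with: go build -gcflags='-m' main.go
// The compiler will report which values escape to the heap.

type point struct{ x, y int }

// p never leaves this function, so it can live on the stack.
func sumLocal() int {
	p := point{1, 2}
	return p.x + p.y
}

// The returned pointer outlives this function's stack frame,
// so the compiler reports that &point{...} escapes to the heap.
func newPoint() *point {
	return &point{1, 2}
}

func main() {
	_ = sumLocal()
	_ = newPoint()
}
```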
So then how does one go about determining if garbage
collection is actually the problem for your application? So typically
the way this conversation starts is you see that your application is suffering from really
high latency issues. So that's your symptom, that's what you observe.
Intuition is really the first step towards figuring out if GC
is the problem. So typically, if garbage collection
is the reason for your application's performance suffering, you'll see really high tail
latency. And what that means is we have a small percentage of
requests to a system. So again, I'm talking about large scale distributed
systems with really high volumes of traffic,
enough to get a decent percentile breakdown of latency,
which is what Pinterest systems are like, of course. So tail latency
means that we have a small percentage of requests coming into our system that result
in really slow responses. So we often talk about latency as
percentiles. So high tail latency here might refer to really
high values for p99 latency or even p90 latency.
Typically for GC, what we've seen is the p99 latency is what really
gets affected: because of the infrequency of the garbage collector running,
it only really affects that last 1% of requests. So if
you're also observing symptoms like this, really high p99 latency,
then there's a good chance that garbage collection pressure could be the
root cause. Especially if you already know that your program has pretty
high memory usage, which you can tell by just
observing various system metrics how much memory is being used on the host
that is running your application, et cetera, et cetera. So the next step is
to confirm your hypothesis. You can use this runtime
environment variable that Go makes available, called GODEBUG.
By setting GODEBUG=gctrace=1, as you can see
on the slide here, you'll force your program to output debug logs
for every single GC cycle. And this will also
include a detailed printout of the time spent in
the various phases of garbage collection. And then
the last step is to take what you measured and align it with your system
metrics. So the way we did this was we looked at the logs from
gctrace, and if we noticed spikes in the system's
latency that aligned with when
the GC cycles were occurring, that's a great way to
conclude that there's a good chance that GC is
the cause of your performance regression.
So here's an example of what gctrace output looks like, with a
detailed breakdown of every single component in there.
Credits to Ardan Labs here. If you want to find the blog post, you can
just look up gctrace Ardan Labs. That's how I found this screenshot.
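For reference, a single gctrace line looks roughly like this (the values here are illustrative, and the exact fields vary a bit by Go version):

```
gc 4 @1.20s 2%: 0.02+1.3+0.01 ms clock, 0.1+0.5/1.1/2.9+0.08 ms cpu, 4->5->2 MB, 5 MB goal, 8 P
```

Reading left to right: the GC cycle number, seconds since program start, the fraction of CPU spent in GC so far, wall-clock and CPU time for the sweep-termination, concurrent-mark, and mark-termination phases, the heap size at GC start, at GC end, and the live heap, the heap goal, and the number of processors used.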
So taking a quick look at this, we see that gctrace gives us a
lot of information. It shows us how many GC cycles we've
had so far since our application started,
how much of our program's total cpu has been spent on garbage
collection, how much wall clock and cpu time was
spent in the various phases of GC, what our memory usage looks like
before and after garbage collection runs, et cetera. I'm not
going to go too deep into these aspects, but check out the blog
post if you're looking for a detailed breakdown of all
of these GC components. What I found helpful is really just to
let gctrace run in the background. And I added a separate background
thread to print out certain key system metrics, things like
p90 and p99 latency observed over, say,
a 1-minute to 30-second period. Print these
out at a regular interval and look for correlations between GC cycles occurring
and latency degradations.
So let's assume now that we have a reasonable amount of confidence
that garbage collection is the root cause for our application's
poor performance. How do we then go about profiling our heap
usage? So go has quite a few built in tools
to study our heap usage, and I'm going to talk about two main ones here.
These are the two that I found really helpful. The first one is the
runtime MemStats struct, and then the second one is the pprof package.
So MemStats is essentially a struct built
into the Go runtime that provides you with statistics about the memory
allocator itself, things like how much memory has been allocated,
how much memory has been requested from the system, how much memory has been
freed, GC metrics, et cetera, et cetera. I'll dive into
that a little bit more in a second. And the second one is pprof, which
is a system profile visualizer, and we'll talk about that in a little bit more
detail as well. But these are really helpful to understand how your application
is managing memory and also visually
inspect your system's cpu data
or cpu usage, heap usage, et cetera.
So here's just a really short glimpse into what MemStats gives
you. These are some stats that I found helpful. Like I said, it essentially exposes
these stats about the system's memory usage, garbage collector performance,
et cetera, et cetera. So we can use this library to monitor a
few different things. What I found helpful is to monitor the total number of objects
in the heap. We discussed this earlier, but as the number of objects in the
heap increases, it takes much longer for the garbage collector
to mark the entire heap to scan and mark the entire heap.
So if we notice this metric going up,
there's a good chance that GC pressure is going to increase.
Similarly, if that metric is going down, we made some good optimizations and
the impact of GC should be decreasing. So I used this
metric as one of my indicators for success. As I rolled out new
optimizations, this metric dropped and I noticed that the system's performance
started to improve. And the MemStats
docs provide a really clear explanation of all the various statistics.
I think there's close to 20. These are the three that I use, once again.
So HeapObjects is the number of allocated heap objects. HeapAlloc
is the actual bytes allocated to the heap. This is helpful because
this is how the Go runtime determines when to actually trigger GC:
like we said before, by default it triggers whenever your
heap grows by 100% since the last cycle, so that's
what HeapAlloc can be used for. And then lastly,
HeapSys is the total bytes of memory obtained from the OS.
So actually requesting memory from the operating system is a slightly
heavyweight process because it's essentially blocking.
So if you're seeing that this number is also continuously going
up, there's a good chance that you're continuously having
to request a lot of memory, which is also blocking threads and impacting
your system's performance. I don't have slides on this,
but one new cool feature that Go has rolled out
since I made these slides originally is another
runtime flag, which allows you to actually set a soft memory
limit. So rather than the default behavior
of GOGC triggering GC whenever your
heap grows by 100%, you can actually set a target saying: only
trigger GC when my heap size hits x megabytes,
x gigabytes, whatever it is, which therefore lowers the number of
times GC needs to run, therefore lowering the impact of
GC on your application's performance. That's one way to go about
it, and can be a quick and dirty way to just tame
the impact. However, some of the steps we'll talk about here will really just
help you tune your actual heap usage itself,
which is likely, well, one, it's a good practice,
and two, it's likely to give you more consistent and perhaps more significant wins
as well.
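For completeness, here's a minimal sketch of that newer knob, GOMEMLIMIT (Go 1.19+), which can also be set programmatically; the 4 GiB figure is just an example:

```go
package main

import "runtime/debug"

func main() {
	// Equivalent to setting the GOMEMLIMIT environment variable:
	// ask the runtime to keep total memory at or below ~4 GiB (a soft limit).
	debug.SetMemoryLimit(4 << 30)

	// Optionally, disable the heap-growth ratio trigger entirely so the
	// memory limit alone drives GC (use with care):
	// debug.SetGCPercent(-1)
}
```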
So here's a quick program that I put together on how to use MemStats.
I just wrote this little method on the right here to read
MemStats at whatever frequency you need, and
print out the number of heap objects allocated, the number of bytes allocated
to the heap, et cetera, as well as the number of GC cycles that have been
triggered. This can be really helpful to see how often
GC is getting triggered. In the example here,
we're essentially allocating this array of int slices,
and, as I'll show you in the next slide, you can essentially see how the number
of heap objects and heap-allocated bytes changes, as well as how the
GC counter increments as well.
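The slide itself isn't reproduced in this transcript, but a minimal sketch of such a program might look like this (intervals and allocation sizes are arbitrary choices):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// printMemStats reads the runtime's memory statistics and prints the
// three fields discussed above, plus the GC cycle count.
func printMemStats() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapObjects=%d HeapAlloc=%d HeapSys=%d NumGC=%d\n",
		m.HeapObjects, m.HeapAlloc, m.HeapSys, m.NumGC)
}

func main() {
	// Print stats on a background goroutine at a regular interval.
	go func() {
		for {
			printMemStats()
			time.Sleep(100 * time.Millisecond)
		}
	}()

	// Churn the heap: allocate an array of int slices, as in the talk's example.
	data := make([][]int, 0)
	for i := 0; i < 100_000; i++ {
		data = append(data, make([]int, 100))
	}
	printMemStats() // final snapshot; NumGC should have incremented
}
```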
So here's what we got when we ran it.
You can see that the heap objects drop whenever we
run GC, which is basically the penultimate
line in this slide. Otherwise, heap objects continue
to increase. You can see that on the last line we see num GC incremented
to one, and that's where heap objects dropped.
It's a clear indicator that things worked as expected. You can also see that
HeapAlloc dropped very significantly, almost to ten
or eleven percent of what it used to be. So GC did its job, and we
freed up a lot of space on the heap. This is a really
simple program, but you can use something very similar to essentially understand the memory
behavior of even more complex systems. So this is how MemStats
can be really helpful.
The second package that I talked about is pprof. It's a
built-in package as well. It allows us to visualize several
different system profiles: CPU usage, memory usage, heap, et cetera.
Here we're going to talk specifically about the heap profile.
So the tool comes with a bunch of options to investigate specific aspects
of the heap, and those are the ones listed here.
So if you were concerned about out-of-memory issues, you may be
interested in inspecting the actual amount of memory
used rather than the number of objects, for instance, so you can use the right option accordingly.
In our case, we know that GC pressure is what we're investigating,
and it's tied very closely to the number of objects in the heap. So the
inuse_objects or alloc_objects options are more
useful to us here. So the first command shown here,
go tool pprof with your chosen options: pass in
the URL of wherever your application is running, plus the
API endpoint that you want to hit, which is /debug/pprof/heap. This is going
to essentially download that profile data to your machine
and put you in an interactive command line tool to start visualizing this data,
and it's really helpful. So one thing I forgot to mention is, in order
to generate this profile, you do need to register this
HTTP endpoint upon application startup.
I don't have a slide for that either, but you can just quickly look up
the pprof docs on Go's main
doc site, and it's essentially one line to
register this HTTP endpoint and generate your heap profiles.
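For reference, the registration is roughly this (the ports are arbitrary choices; importing net/http/pprof for its side effects registers the /debug/pprof/* handlers on the default mux):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* handlers
)

func main() {
	// Serve the profiling endpoints on a separate port.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of your application ...
	// Then, from a terminal:
	//   go tool pprof -inuse_objects http://localhost:6060/debug/pprof/heap
	//   go tool pprof -http=:8080 <path-to-saved-profile>
	select {}
}
```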
So like I said, when you run this, it'll put you in a command line
interface to start playing around with the data. You can essentially run help in that
command line interface,
and it'll show you all the available options to slice and dice this data.
What I really like is to run a second command, the last one shown here,
which is go tool pprof: pass in the port that you want to run
the web UI on, and then the path to the actual profile
data itself, and it'll open up an interactive web UI in your browser, which I
find much easier and more helpful for inspecting heap usage.
So to jump ahead and show you what that looks like,
here is one of the visualizations that Pprof gives you.
It lets you see the number of objects in use by various call
stacks, which can be really helpful in narrowing down problematic code.
So here it's showing you the entire call stack.
The size of the box is roughly proportional to
the number of objects allocated, so it really helps you
narrow things down. In this case, you can see bufio.NewReaderSize accounts for
about 45% of our heap allocations. So we can
conclude that that is one of the reasons for our heap allocation,
or the number of objects in our heap, being so high.
Then we can trace through that stack and try and figure out what we can
do to optimize this. Some options are not creating
a new reader every single time we need to use it, perhaps reusing
one, pooling them, et cetera, et cetera.
This is another visualization that Pprof offers that I actually use really heavily.
It lets you visualize heap usage as a flame graph. And this flame
graph is also interactive, so you can click on any bar to focus in on
it and the call stack below it, et cetera, et cetera. The depth
of the call stack doesn't really matter here, but the width of the call stack
is what represents the number of heap objects that are allocated.
So essentially, the wider call stacks use a higher number of heap objects, at least
when this profile was captured. So it's really easy to just jump
in to certain hotspots and dig deeper into there to try
and find the lowest hanging fruit and the biggest possible optimizations.
So I'm also going to show you what the CLI
can be used for. So from the previous slide here,
we can try and figure out which method or which call stack
is allocating a large number of objects. And then through the CLI,
you can use this list command, which is really cool to pass in
a function name and see line by line which lines
of that method are allocating how many objects. So in this one,
this is a fake method, but let's say we have a method called createCatalogMap
that is essentially creating this map of products that a particular seller
has. We can jump in. We know
that this method creates a large number of objects itself. Here we can go in
and see, line by line, exactly how many objects are allocated by each
line in the method, and figure out where to focus
our efforts. So here you can see that lines
233 through 237 create a lot of new objects,
which results in a large number of heap allocations.
And then line 241, surprisingly,
is not actually creating new objects, but it's adding all those objects to a
map, which is also causing a large number of heap allocations. So that
looks a little suspicious. We'll come back to that in a second.
Let's first talk about how to lower or limit the impact of garbage collection on
your system. First one we've been talking about for a while, lower the
number of objects in your heap. This is going to reduce the amount of
time it takes the garbage collector to scan your heap and therefore lower its impact.
The second one is to reduce the rate of object allocation. And then the third
one is actually to optimize our data structures to minimize how much memory they
use, which will therefore reduce the need for more frequent
GC triggers. So these three are
ways that we can use to mitigate the impact of garbage collection, make our application
more lightweight, and free up more resources for our program to
operate efficiently.
So let's dive a little bit into the first one. How do we
reduce objects in the heap? So really the
question is, how do you reduce long living heap objects? Because these are objects
that are essentially living in the heap for a long time, and we
expect them to keep living there, which means every single time the garbage collector
runs, it needs to scan these objects, determine that they're still in use,
and they can't be cleaned up, et cetera, et cetera. So rather than having these
objects live on the heap, they can be created as values rather
than references on demand. So for instance, let's take
the Pinterest ad system as an example. Every single time that
we're determining which ads to show a user,
let's say we need some data for each item in that user request.
So every potential ad candidate has some data associated with it.
Rather than pre computing that data and storing it in this long lived map,
we could just compute it on a per-request basis to
reduce the number of objects in the heap. So what that is going to do
is increase the amount of computation for
each average request. However, it is going to reduce the
tail latency problem, because you have a very reliable
measure of how much compute is being used per request,
and it's easier to essentially optimize a particular request
than to optimize this long tail latency.
So that's one way to do it: create your objects on demand rather than storing
them in a long-lived map on the heap.
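As a hypothetical sketch of that trade-off (these ad-scoring names are made up for illustration):

```go
package main

// Before: per-candidate data precomputed into a long-lived map.
// Every GC cycle has to rescan all of these entries.
var scoreCache = map[int64]float64{} // candidate ID -> score, lives forever

type ad struct{ bid, quality float64 }

// After: compute the value per request. The result is short-lived,
// so it never accumulates in the heap between requests.
func score(a ad) float64 {
	return a.bid * a.quality
}

func main() {
	_ = score(ad{bid: 1.5, quality: 0.8})
	_ = scoreCache
}
```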
The second and third are very related, but be mindful of
where you're using pointers. Go makes it really easy to create and reference
pointers. However, if we have a reference to an object and that object
itself contains further pointers or further references within
it, these are all going to be considered individual objects
in the heap, even though they may be nested together. The reason for this is,
if you think about it: I have a pointer to some object x,
or let's say the object is of type Person.
Each person has a name, each person has an age, et cetera. If I have
a pointer to the person's name and it's referenced somewhere,
there's a good chance that the name may be used even after the main person
object ceases to exist. So the Go memory allocator
needs to store that object separately in memory, which means it's
a whole second object that needs to be scanned by the garbage
collector later on. So reducing the number of
pointers that we use, reducing the number of nested pointers is going to
reduce the number of objects that your garbage collector needs to scan.
The third one is just sort of a gotcha.
Strings and byte slices are treated as pointers under the hood, so each
one is going to be an object in the heap. So wherever possible,
try to represent these as other non-pointer values. Strings,
perhaps, you could represent as integers or floats if possible,
hashing them for instance, or representing
dates as actual time.Time objects instead of strings, and so on and
so forth. Those are ways to reduce the number of strings you're using, and therefore
reduce the number of pointers.
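Here's a small hypothetical illustration of both points, reusing the Person example (the fields are made up):

```go
package main

// Pointer-heavy: each field is a separate heap object that the
// garbage collector must find and scan on every cycle.
type personPointers struct {
	Name *string
	Age  *int
}

// Value-based: a single object with no interior pointers for the GC
// to trace. The string is replaced by a hash, per the talk's suggestion.
type personValues struct {
	NameHash uint64
	Age      int
}

func main() {
	name, age := "gopher", 13
	_ = personPointers{Name: &name, Age: &age}
	_ = personValues{NameHash: 0xcafef00d, Age: age}
}
```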
So going back to our example: if we
look at lines 233 through 237, we're creating a new catalog listing
each time. And then on line 241, we're assigning
it to a map. So we're using this catalog listing
key, which we're creating by encoding product ID and
seller ID together. Let's say
this catalog listing key is actually a string object.
If we then change how we're creating the key to
instead use a struct (lines 239 to 241
here show that we are starting to use a struct for the key,
rather than a string as previously), we can see that we
reduce the number of heap objects by 26 million between
these slides, which is around 20% of our heap usage. So we didn't actually change
that much, we just changed how we're representing the exact same data,
and we're able to significantly reduce the amount of work that our garbage
collector needs to do. So here's one example of how a simple thing like
removing strings can actually have a very significant impact on
your application's heap usage, and therefore its performance.
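A hypothetical sketch of that change; the names here are assumptions, not the actual Pinterest code:

```go
package main

import "fmt"

type listing struct{ price float64 }

// Before: a string key. Building it allocates, and the string header's
// pointer is one more thing the GC must trace for every map entry.
func stringKey(sellerID, productID int64) string {
	return fmt.Sprintf("%d:%d", sellerID, productID)
}

// After: a comparable value-type struct key. No allocation to build it,
// and no pointers inside the map's keys for the GC to scan.
type catalogKey struct {
	SellerID  int64
	ProductID int64
}

func main() {
	byString := map[string]listing{}
	byString[stringKey(42, 7)] = listing{price: 9.99}

	byStruct := map[catalogKey]listing{}
	byStruct[catalogKey{SellerID: 42, ProductID: 7}] = listing{price: 9.99}
}
```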
So the other thing you can think about is reducing the rate of allocation.
So if your program tends to create a large number of short-lived objects
in bursts, object pooling is something that might benefit you,
because object pools can essentially be
used to manually allocate and free memory blocks, reducing
the amount of work that your garbage collector needs to do.
Because object pools are expected to be retained for a longer scope, we don't
need to keep allocating and clearing up these objects, and the GC doesn't
scan them over and over again. However, I will put
out a warning here, because the garbage collector is not going
to scan and clear up your object pool for you,
it can lead to memory leaks if not used properly. So I'd only recommend
using this if you know what you're doing and if you've exhausted all other options.
For instance, if you're continuously allocating new objects
rather than reusing objects from
the pool, this could lead to a memory leak and cause your
application to crash due to out of memory errors. A second potential problem
here is if you're not properly sanitizing your objects before returning
them to the pool, data may be persisted beyond its intended
scope and could potentially be leaked to other scopes. So if we're storing some
sensitive, personally identifiable information on
a per request basis for each user, and we're
using pools to represent the user object, if we
don't sanitize that data, then there's a good chance that we could potentially leak
data from one user's profile to another user's profile, which would
obviously have really disastrous consequences,
not only in terms of our application itself, but in terms of the user's privacy
concerns, et cetera, et cetera. So these are the risks of object
pooling, but it can be a really powerful tool to reduce the amount of
work that your garbage collector needs to do and give you some more control over
memory management yourself.
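If you do go down this road, the standard library's sync.Pool is the usual starting point. Here's a minimal sketch including the sanitization step the warning above calls for (the request type is made up; note that sync.Pool, unlike a hand-rolled pool, does let the GC release idle objects between cycles):

```go
package main

import "sync"

// request holds per-request scratch data. Pooling it avoids a fresh
// allocation, and the GC work that follows, for every request.
type request struct {
	userID int64
	items  []string
}

// reset sanitizes the object so no data leaks into the next request.
func (r *request) reset() {
	r.userID = 0
	r.items = r.items[:0]
}

var requestPool = sync.Pool{
	New: func() any { return new(request) },
}

func handle(userID int64) {
	r := requestPool.Get().(*request)
	defer func() {
		r.reset() // always sanitize before returning to the pool
		requestPool.Put(r)
	}()
	r.userID = userID
	// ... use r ...
}

func main() {
	handle(42)
}
```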
The third thing that we talked about is thinking about how we organize and
represent our data to reduce the amount of memory that
it's using. So one way to do this is to clean up any unused
data fields. Basic types in Go are going to have default values.
For example, a boolean is going to default to false, an integer is going to
default to zero, et cetera, et cetera. So even if you're not using these fields,
the Go memory allocator still needs to allocate space on
the heap for these objects, and they're therefore consuming memory.
So fields I through L here are unused,
but they're still taking on their default values. So if we remove those,
we essentially went from 64 bytes to 40 bytes, which is a pretty significant
win if you think about the number of objects that you might be storing on
the heap in a very large scale application.
The other side benefit of this is that you're actually simplifying your code and making
it easier to understand and reducing the amount of errors that might come up from
someone who misunderstands what a field is in the future.
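A hypothetical reconstruction of the slide's struct (the field names are assumed; sizes are for a 64-bit platform):

```go
package main

import (
	"fmt"
	"unsafe"
)

// Fields I through L are never read, but their zero values still occupy space.
type withUnused struct {
	A bool
	B int64
	C bool
	D int32
	E bool
	F int64
	I int64 // unused
	J int64 // unused
	K bool  // unused
	L bool  // unused
}

// Dropping the unused fields takes us from 64 bytes down to 40.
type trimmed struct {
	A bool
	B int64
	C bool
	D int32
	E bool
	F int64
}

func main() {
	fmt.Println(unsafe.Sizeof(withUnused{}), unsafe.Sizeof(trimmed{})) // 64 40
}
```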
This one may be a little familiar to folks coming from a C or C++ background,
but the ordering of your fields can actually really impact
your memory usage as well. The Go memory allocator does
not optimize for data structure alignment. So in this case
we have two objects with completely identical fields. They're just ordered differently.
The way the memory allocator works is it goes down the
fields, allocates them one at a time. So in order to respect word
alignment, it might need to add padding to the data in
memory. So going through here, going to the bad object,
starting with field a, it's a boolean, which is one byte. So it allocates one
byte in memory, and then it needs to allocate eight bytes for field b,
which is an int64. Now, if it allocated
those eight bytes right after, it would break the system's word alignment.
So therefore, it needs to pad with seven bytes first, and then
allocate the next eight bytes for field b.
And you can see this goes on. So for field c, it allocates one byte,
and then field d is an int32, which means it needs four bytes.
So it pads with three bytes and then adds in field d, and so on and
so forth. If we simply reorder
these, as we did in the good object on the right, you can see the
memory allocation is much better aligned. And we
went from having an object that consumes 40 bytes
to an object that consumes 24 bytes. So we did two
things here. We just removed unused fields, which is great,
and then we rearranged the remaining fields that we actually need, and we went
from 64 bytes to 24 bytes, which is a 62%
drop in the amount of memory used per object.
Think about, again, a large scale system with thousands, millions, or even billions
of such objects in use. This simple method could
really reduce your memory usage and
improve your system's performance.
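And a hypothetical sketch of the reordering, with the same assumed fields as above:

```go
package main

import (
	"fmt"
	"unsafe"
)

// bad interleaves small and large fields, forcing 17 bytes of padding.
type bad struct {
	A bool  // 1 byte + 7 bytes padding before B
	B int64 // 8 bytes
	C bool  // 1 byte + 3 bytes padding before D
	D int32 // 4 bytes
	E bool  // 1 byte + 7 bytes padding before F
	F int64 // 8 bytes
}

// good orders fields widest-first, leaving just 1 byte of trailing padding.
type good struct {
	B int64
	F int64
	D int32
	A bool
	C bool
	E bool
}

func main() {
	fmt.Println(unsafe.Sizeof(bad{}), unsafe.Sizeof(good{})) // 40 24
}
```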
So, to conclude, the Go garbage collector is
highly optimized for most use cases. It's a fantastic piece of
technology, and most developers do not need to worry about how it's
implemented and don't need to worry about its performance. However,
for some heavy, very large scale use cases, the garbage collector
could have a pretty significant impact on your program's performance.
And in this case, having an understanding of how the GC works,
how memory management works, and then understanding some of the built
in tools that the Go team provides, can be really, really important
to understanding and reducing the problem.
From there, we have a lot of options to actually optimize our system,
improve performance, and have much happier users and
much happier engineers. So three steps.
Start with observing. We have some ways of knowing intuitively
that certain symptoms, like really high tail latency,
might be caused by GC. From there, we go
in and add some measurement. We can look at heap usage. We can look at
GC trace output, et cetera, to try and narrow down whether GC
actually is the problem. And then from there, we talked about a few different
ways by which we can start to optimize our system.
That's all I have for you today. Thank you all for listening. I hope this
helped you understand how garbage collection works in Go, and how you can go
about optimizing your system to minimize the impact of the garbage collector.
Thank you. And if you have any questions, feel free to reach out to me.
Have a great day.