Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good morning, good afternoon,
good evening everyone. Welcome to Conf42: Java.
Today I'm going to be sharing some really cool stuff about garbage collection,
so let's get on to it. These are some important disclaimers
saying that this presentation is for information purposes only. So this
is the rough outline of what I'm going to be going through with you today.
I'm going to start with a garbage collection overview. I'm going to start slowly and
then go into more detail with the OpenJ9 GC
algorithms, and I'm going to finish off with an advanced structure
called double mapping. A little bit about me: I'm a software developer here at IBM.
I've been at IBM for almost two years. I graduated from
the University of Waterloo, my Master's, in 2019, and
some of my interests are systems, compilers,
machine learning and AI. And a fun fact about me is that I'm a tennis
addict. All right, if you want to get builds for
either HotSpot or OpenJ9, I recommend going to AdoptOpenJDK.
There you can get both builds and try them,
so you can make sure you have the best JVM
and garbage collection, which is the main part of the talk. OpenJ9 is
a fairly mature project; it went open source in 2017.
A fun fact about it is that it started as an embedded
JVM. If you're familiar with Java Micro Edition, that's how OpenJ9
started, which is why it has some really good performance
when it comes to memory. And it's an open community:
we welcome anyone who's interested and wants to contribute to the project.
All right, garbage collection. So what is it? This is a
more by-the-book definition: garbage collection is a form of automatic
memory management. The garbage collector attempts to reclaim memory occupied by
objects that are no longer used by the application. That's opposed to
other programming languages like C and C++, where you have to manually allocate
memory with new and malloc and free that memory with free
and delete. So what are the responsibilities of garbage collection? They are: allocate
memory, identify the liveness of data, and everything else is going to be considered garbage,
which the garbage collector is going to reclaim. Those are the most important
responsibilities of the garbage collector. The positives of GCs:
they give you automatic memory management,
so it frees the developer from the burden of having to manually
deal with memory. It also helps reduce certain categories of
bugs like dangling pointers and double frees. On the other hand,
there is the negative part, right? It requires some additional resources, like
additional CPU cycles and more memory, and it can cause unpredictable pauses.
You will see that they are known as stop-the-world pauses.
It may introduce runtime costs, and the application has little control
over when memory is reclaimed. Here are some of the different GC algorithms.
I'm going to be explaining each of these as I go
through the presentation. One that I won't be talking about in depth is reference counting,
but in simple terms: imagine I'm an object and I have
two objects that point to me. My reference count is going to be two.
As soon as those objects delete their references to me,
my reference count goes to zero. So pretty much I'm counting how many
references I have, and once that count reaches zero, I'm considered garbage.
Therefore I can be reclaimed. That's reference counting in 30 seconds.
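To make that concrete, here is a minimal sketch of reference counting. The class and method names are just for illustration; this is not how any real collector is implemented.

```java
// Toy reference counting: each object tracks how many references point to it.
// When the count drops to zero, the object is considered garbage.
public class RefCounted {
    private int refCount = 0;
    private boolean collected = false;

    void addRef() { refCount++; }    // someone starts pointing at me

    void release() {                 // someone drops their pointer to me
        if (--refCount == 0) {
            collected = true;        // count hit zero: I can be reclaimed
        }
    }

    boolean isCollected() { return collected; }

    public static void main(String[] args) {
        RefCounted obj = new RefCounted();
        obj.addRef();                          // two objects point at obj
        obj.addRef();
        obj.release();                         // first reference removed
        System.out.println(obj.isCollected()); // false: one reference left
        obj.release();                         // second reference removed
        System.out.println(obj.isCollected()); // true: reclaimed
    }
}
```

Note that this simple scheme cannot reclaim reference cycles, which is one reason tracing collectors like the ones in this talk are used instead.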
All right, this is what I'll be going through with you today.
These are the different policies in OpenJ9. The first one is
optthruput, the stop-the-world parallel collector. The second is gencon,
the generational copy collector. Gencon CS is an improvement on
gencon; it's a pauseless collector. Metronome is an incremental,
soft real-time collector, and balanced is the region-based collector. Balanced
is really close to G1, and gencon is really close to
CMS from HotSpot. Optthruput: so optthruput is a parallel garbage
collector. GC operations are completely stop-the-world,
meaning all the application threads are paused. It's a mark-sweep collector with
an optional compact phase; I'll explain that in more detail in a little bit.
And all GC operations are completed in parallel, meaning multiple threads. The GC structures
are the heap plus GC native memory overhead for the mark map and work
packets. These are metadata needed by the GC
to collect some garbage. So whenever you start an application,
we allocate objects in what we call the heap.
Normally it's a contiguous block of memory. It doesn't necessarily have
to be, but over 90% of the time it's going to be a contiguous block
of memory. So we have a flat heap, but for optthruput we further
divide it into two logical spaces: the small object area and the large
object area. Most objects are allocated in the small object area.
so this is how an application behaves under this policy.
The white arrows represent the application thread and the red arrows represent the
GC thread. So the application starts, and as soon as we
hit an allocation failure, meaning we've depleted the memory of
the heap, we start a GC cycle. That's the red arrows there. The GC
collects the heap while all application threads are stopped; that's why
it's called stop-the-world. Then the application threads can continue from
there. So the GC is divided into three phases: marking, sweeping,
and compaction. Marking is where we find all the live objects:
we just go through the object graph, finding all the live objects.
Then we sweep the dead objects, meaning all those objects
that are not marked are dead, therefore we sweep them.
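Those first two phases can be sketched over a toy object graph like this; all the names here are illustrative and not OpenJ9 internals.

```java
import java.util.*;

// Toy mark-sweep: trace from the roots, mark everything reachable,
// then sweep (discard) every object that was never marked.
public class MarkSweepDemo {
    static class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();
        boolean marked = false;
        Obj(String name) { this.name = name; }
    }

    // Mark phase: depth-first walk of the object graph from a root.
    static void mark(Obj o) {
        if (o.marked) return;
        o.marked = true;
        for (Obj child : o.refs) mark(child);
    }

    // Sweep phase: anything unmarked is dead and gets reclaimed.
    static List<String> sweep(List<Obj> heap) {
        List<String> reclaimed = new ArrayList<>();
        for (Obj o : heap) if (!o.marked) reclaimed.add(o.name);
        heap.removeIf(o -> !o.marked);
        return reclaimed;
    }

    public static void main(String[] args) {
        Obj a = new Obj("a"), b = new Obj("b"), c = new Obj("c");
        a.refs.add(b);                   // a -> b; c is unreachable
        List<Obj> heap = new ArrayList<>(List.of(a, b, c));
        mark(a);                         // a is the root
        System.out.println(sweep(heap)); // prints [c]
    }
}
```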
And then there's a compaction phase, because the heap might get fragmented. So what does
it mean for a heap to get fragmented? Imagine that through the lifetime of an application,
after many GC cycles, our heap might
get like swiss cheese, meaning it's going to have lots of little
free spaces, but maybe not enough contiguous free space to allocate
an object. Therefore we need to compact the heap. This is what this slide
is trying to explain, where we take all those live objects, put them in one
part of the heap, and now we have all that contiguous memory that
we can work with. Gencon: if you're familiar with CMS
from HotSpot, it's really similar. Gencon is a generational copy collector,
where objects have ages. It provides a significant reduction
in stop-the-world pause times. It introduces a write barrier for a remembered set, and
we have a concurrent global marking phase where now we're going to be marking objects
concurrently while the application threads are running. So here, instead of dividing the
heap into small object area and large object area, we divide the
heap into nursery and tenure. In other words, nursery is a new space,
tenure is the old space. Furthermore, we divide nursery into allocate
and survivor, and all objects are actually allocated in the
allocate space. This is how it behaves, though. Don't panic. So again,
the white arrows represent the application threads and red,
the GC threads. Scavenge here represents the copy collector.
So what is a copy collector? A copy collector is copying objects from
one space to the next; here it's copying objects from allocate to survivor.
So when we hit the limit of the allocate space,
we start the scavenge, which is a stop-the-world copy collector, and we start
copying objects from allocate to survivor. That's what scavenge is
doing. The global mark phase is a concurrent
mark phase that marks the tenure space,
because scavenge is only dealing with the new space. So how do we deal with
the old space? That's what the global mark phase is doing: marking the
tenure space concurrently. And at the end of the global cycle,
we sweep the tenure space to collect the garbage there.
So because of this concurrency, we need what we call write
barriers. So imagine we are trying to store a reference
into an object field like that, and imagine
we have the following object graph. We are in the middle of marking
objects while the application is running. So the GC has already processed the roots.
GC has already processed object a. As you see with the green tick,
GC is in the middle processing object b and hasn't seen c yet.
Then this happens: the mutator thread comes along and
deletes the pointer from b to c and adds a reference from
a to c. And then the GC thread comes along and says, okay, b doesn't
have any pointers, so I'm done marking. Therefore this is my resulting
graph, where I found the roots, object a and object b.
Object c is not marked, therefore it's considered garbage, so I'm going to collect it erroneously.
So that's a problem. How can we remember object c?
We do that through write barriers. So how is this
implemented? We have a simple check there: if the concurrent
GC is active, we remember that object. So we put it in a card
table, so that at the end of the GC cycle
we go through the card table to pick up
these references that we might have missed.
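That kind of barrier can be sketched roughly as follows; the names here are made up for illustration and stand in for the real OpenJ9 helpers.

```java
import java.util.*;

// Sketch of a concurrent-mark write barrier: on every reference store,
// if the concurrent GC is active, remember the mutated object in a
// card-table-like remembered set so the marker can revisit it later.
public class WriteBarrierDemo {
    static boolean concurrentMarkActive = true;           // toggled by the GC
    static final Set<Object> cardTable = new HashSet<>(); // stand-in for the card table

    static class Node { Node field; }

    // a.field = c, with the barrier applied alongside the store.
    static void storeWithBarrier(Node a, Node c) {
        a.field = c;
        if (concurrentMarkActive) {
            cardTable.add(a);  // remember the mutated object for the marker
        }
    }

    public static void main(String[] args) {
        Node a = new Node(), c = new Node();
        storeWithBarrier(a, c);
        System.out.println(cardTable.contains(a)); // true: marker will rescan a
    }
}
```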
But we do need an extra barrier, because
remember, we have the new space and the old space,
and we might have references going from the old space
to the new space, and the scavenger is not aware of those. So in order to
capture these references coming from the old space to the new space, we need another write
barrier. And this write barrier looks like this: all we do is check if a
is in the old space and c is not, meaning a is
tenured and c is in the new space. Then we also remember that. Since
we have two barriers here, we can put them in the same check to save
some CPU cycles, and that's the resulting write barrier for
Gencon. Cool. Gencon concurrent scavenge:
that's an improvement upon Gencon. This is a pauseless collector. If you're familiar with
Shenandoah and ZGC, those are also pauseless collectors.
There are some differences there, but concurrent scavenge is very similar to Gencon.
The generational copy collector here introduces a read barrier, because now
we're going to be moving objects concurrently, not just marking, but also
moving objects. The heap structure is the same, right? We have
the new space, old space, allocate, survivor, nothing new there. So the
biggest difference here is the scavenger, which instead of being stop-the-world
is now going to be concurrent. The CS there stands for concurrent scavenge.
So the application starts, and instead of having one stop-the-world scavenge, we're going
to have two small pauses, one at the beginning and one at the
end, for things like synchronization and marking the roots. And the bulk of the
work is going to be done concurrently with the application, which is where
you see the yellow arrows there, which represent GC threads running concurrently
with the application threads. And the global marking phase here is still the
same, right? The only thing is that the copy collector now is running
concurrently with the application. Now we have a problem
here, right, because the GC thread is trying to move objects while the mutator
is trying to access those same objects. So let's step
back a little bit. Let's think about the parallel scavenger.
The parallel scavenger is a stop-the-world scavenger, so I have multiple GC threads
trying to move an object. We can deal with that with an atomic
operation, CAS, compare-and-swap. So let's see what happens in this
case. We have two GC threads trying to move an object,
the green and the brownish thread here. So what would
happen in this case? They both try to copy the object,
so they race to copy the object. In this case
the brown one wins the race: it copies the object and installs the
forwarding pointer. The green thread is going to say, okay, I lost, all good,
I'll just continue doing my work and copy some other objects. So far
so good. Now what if we had a mutator and a GC thread trying
to do that? So let's keep this same CAS
operation and see what happens.
So we have the same CAS there, but instead of two GC threads we have
a GC thread, which is the green one, and a mutator thread. Okay,
so let's see what happens. So in this case,
let's assume the GC thread won the race and copies the
object. Then the mutator thread comes along and says,
okay, I lost the race, but I see a forwarding pointer; it's going to
follow that to access the object in the to-space. But since
the GC thread is not done copying the object,
the mutator thread is going to access some garbage memory,
and that's a problem. So how do we fix it? That's how we fix it:
what we do is we unconditionally copy
the object. So both threads are going to copy the object,
and then they race to
install the forwarding pointer. So how does this work?
Both threads copy the object, so at the beginning
both are going to have their own copy. And at
this point they race to install the forwarding pointer in the
from-space copy. In this case, the GC thread won the race to install
the forwarding pointer. The mutator thread is going to say, okay, I lost, but no matter,
I'm just going to follow the forwarding pointer to access that
object. But since the GC thread has already finished copying
the object, everything is fine, because the
object is there. The GC thread is done copying the object, so everybody's
happy. So that's what we call a read barrier. If you're familiar with Shenandoah
version 2, it uses the same barrier as here.
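The copy protocol just described can be sketched with an AtomicReference standing in for the forwarding-pointer slot. Everything here is illustrative, not the actual OpenJ9 code.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the concurrent scavenge copy protocol: both the GC thread and
// the mutator copy the object fully into their own to-space slot first, and
// only then race (with a single CAS) to install the forwarding pointer.
// Whoever loses simply uses the winner's fully-copied object.
public class ForwardingDemo {
    static class Obj {
        final String payload;
        // Forwarding slot in the from-space object header.
        final AtomicReference<Obj> forward = new AtomicReference<>();
        Obj(String payload) { this.payload = payload; }
    }

    // Each thread copies unconditionally, then CASes its copy in.
    static Obj copyAndForward(Obj fromSpace) {
        Obj myCopy = new Obj(fromSpace.payload);          // complete copy first
        if (fromSpace.forward.compareAndSet(null, myCopy)) {
            return myCopy;                                // I won the race
        }
        return fromSpace.forward.get();                   // loser follows the winner
    }

    public static void main(String[] args) {
        Obj o = new Obj("data");
        Obj seenByGc = copyAndForward(o);       // "GC thread"
        Obj seenByMutator = copyAndForward(o);  // "mutator thread"
        // Both threads end up with the same, fully-copied object.
        System.out.println(seenByGc == seenByMutator); // true
    }
}
```

The key point is the ordering: because the copy is finished before the CAS publishes it, anyone who observes a forwarding pointer can safely use the object it points to.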
Okay, Metronome. Metronome is an
incremental, soft real-time collector. It provides an upper bound on GC
pause times. What I mean by upper bound is: imagine
a window of ten milliseconds. In that window,
we guarantee that 70% of that time is going to
be dedicated to the application, while the other 30% is going to be for
the GC. And it's up to the GC what to do with those 30%: it
can do some GC work or give some time
back to the application. It uses a different write barrier, called snapshot
at the beginning, and because of this barrier,
it has a high percentage of floating
garbage, and you're going to see why. So the Metronome heap is not divided into new
and old space; we actually divide the heap into regions.
Here regions have a fixed size of 64 KB, so the
bigger the heap, the more regions we have. Regions are assigned a cell size,
and a very nice characteristic here
is that objects in the same region have the same cell size. And if
an object is larger than a region,
it can actually span multiple regions, with the exception
of arrays. So if we have an array that's bigger than
a region, we divide that
array into chunks, put a chunk here, another chunk there,
and have the array header point to these different chunks.
That's because normally, if we have an object that's larger than a region
and we can't find contiguous regions for it,
we're going to trigger a stop-the-world GC, right?
But for arrays, we can just chunk the array and put the pieces
in different regions. That's what we call arraylets. I'm going to explain a little bit
more about what arraylets are at the end of the presentation.
This is how Metronome behaves. Very, very simple. If you
remember optthruput from the beginning of the presentation,
it had one huge stop-the-world pause, right?
So imagine that huge stop-the-world pause divided into little
increments. That's what Metronome is. It does a little
bit of work here, a little bit of work there, but never going
over that 30% threshold. Remember, in that ten millisecond
window, only 30% is going to be dedicated to the GC, and that's what these
increments are: a little bit of GC work here, here and here, until
it's done. So the write barrier that I mentioned to you was
snapshot at the beginning, right? And it causes a lot of
floating garbage, because the barrier is actually performed before the store.
So imagine we have the following object graph. We are marking objects concurrently,
right? And the GC has already processed the
roots, has already processed object a, is in the middle of processing object
b, and hasn't visited object c or object d yet.
The mutator thread comes along and does this: deletes the reference
from b to c and adds a reference from a to d. If
we didn't have a barrier, this would be the resulting object graph, and we
would erroneously collect object d. So how does
snapshot at the beginning actually work? It is implemented this way,
meaning we remember the reference before we actually do the
store. So we put it in a temporary field and remember it if the barrier is
active. If you come back to the object graph here, the barrier
is actually triggered whenever we delete the reference from
b to c; that's when the barrier is triggered. So instead of
having a resulting object graph like this, we would have something like this.
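A rough sketch of a snapshot-at-the-beginning barrier, where the old value is remembered before it is overwritten; the names are illustrative, not OpenJ9's actual implementation.

```java
import java.util.*;

// Snapshot-at-the-beginning (SATB) sketch: before overwriting a reference
// field, remember the value being overwritten, so the concurrent marker
// still sees the object graph as it was when marking started.
public class SatbDemo {
    static boolean barrierActive = true;           // concurrent mark running
    static final List<Object> rememberedSet = new ArrayList<>();

    static class Node { Node field; }

    // b.field = newValue, with the SATB barrier before the store.
    static void storeWithSatb(Node b, Node newValue) {
        Node old = b.field;                        // temporary: the old value
        if (barrierActive && old != null) {
            rememberedSet.add(old);                // snapshot it for the marker
        }
        b.field = newValue;
    }

    public static void main(String[] args) {
        Node b = new Node(), c = new Node();
        b.field = c;
        storeWithSatb(b, null);                       // mutator deletes b -> c
        System.out.println(rememberedSet.contains(c)); // true: c won't be lost
    }
}
```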
In this case we would actually remember object c, even though it is
considered garbage, because we don't really need object c;
it just happens to be in the path of object d. That's why we have
a lot of floating garbage with this type of barrier. Now, our last
but not least, GC policy is balanced. If you're familiar with
G1 from HotSpot, it's very similar: both are generational,
both are region-based. One difference that I see is that G1
from HotSpot has only three ages, and we
have 23 to 27 ages,
so we have a lot more ages. Like I said, they are both region-based.
Balanced provides a significant reduction in max stop-the-world
pauses. We introduce write barriers similar to the one from Gencon,
because you might have references from different regions coming into my
region. And we have incremental heap defragmentation. So similar to Metronome,
the heap is divided into regions. But here, the bigger the heap,
the larger the region, because we try to have between 1,000 and 2,000
regions. In this policy, objects are allocated in what we call eden regions,
which are our new regions. And if an object is
larger than a region, we just throw an out-of-memory error, with the exception
of arrays: same thing as Metronome, we divide the arrays into chunks
and put them into different regions. This is how a balanced
application behaves, the balanced policy. PGC stands for partial
garbage collection. It's a copy collector, so what it's doing is copying
the live objects from one region to another, and it tries to
find those regions with the highest return on investment. So what does
that mean? Imagine we have a region with only one live object.
All we have to do is copy that object over, and there we go,
we have an entire free region to work with.
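That region selection can be sketched like this, with the garbage ratio standing in for the return-on-investment heuristic; this is just an illustration, not the real PGC logic.

```java
import java.util.*;

// Sketch of choosing collection-set regions by return on investment:
// prefer the regions with the most garbage, since copying out their few
// live objects frees a whole region for the least work.
public class RegionPickDemo {
    record Region(String name, int liveBytes, int regionSize) {
        int garbage() { return regionSize - liveBytes; }
    }

    static List<String> pickMostGarbage(List<Region> regions, int count) {
        return regions.stream()
                .sorted(Comparator.comparingInt(Region::garbage).reversed())
                .limit(count)
                .map(Region::name)
                .toList();
    }

    public static void main(String[] args) {
        List<Region> regions = List.of(
                new Region("r1", 900, 1024),   // mostly live: poor ROI
                new Region("r2", 50, 1024),    // mostly garbage: great ROI
                new Region("r3", 400, 1024));
        System.out.println(pickMostGarbage(regions, 2)); // [r2, r3]
    }
}
```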
So PGC tries to find those regions which have the most
garbage, where it's going to do the least amount
of work to free the region. And GMP here
is the global marking phase. It doesn't collect
objects; it actually helps PGC.
One problem with PGC is that it only has
local knowledge about each region, and
as time goes on, this local knowledge gets stale.
So GMP is there to update this local knowledge
for PGC. At the end of the GMP, it updates PGC's
information so that PGC can do its work more
effectively. So GMP does not
reclaim memory, right? It performs a marking phase only.
It's scheduled to run between PGCs, and
it gives an accurate mark map of the entire heap.
And we're going to use this mark map to update
PGC. So this is more of a visual
representation of how PGC actually looks. On the X axis we
have time, and on the Y axis we have free heap. So lines
going up are the PGC doing its work, and lines going down
are the application consuming memory. You can see this trend going
down: that's the heap getting fragmented, and that happens because
PGC loses its local information. So GMP is smart enough to trigger at some point,
so that it finishes in a timely manner to update PGC, so
that it can defragment the heap. You can see around the point where
the trend reverses: that's where we just updated PGC's
information, and you see the trend going up. That's when PGC has the most
recent information about every region, so it
can better collect and defragment. With time that trend
starts going down again, and GMP is going to kick in
again to update PGC. And that's how the balanced policy
behaves. So in balanced, we do have a write
barrier. It's very similar to the Gencon barrier, because we
might have multiple references coming from different regions into this region,
and PGC doesn't really know how to keep track
of that. So we need a write barrier for that. Here we actually use an
unconditional barrier, if you think about it: we just dirty the card
whenever we set a field of an object, and then at the beginning of the
PGC we go through the card table to see if there are inter-region references.
You see here, whenever we think about read and write barriers,
it's always a matter of who we should put the burden on.
Should we put the burden on the mutator thread, or should we put the burden
on the GC thread? Here we chose to put the burden on the GC
thread, so that we don't penalize mutator threads as much.
Now, arraylets. You probably heard me talking about
arraylets a few times here. Arraylets are a
clever way to store large arrays in
a region-based GC. These arrays have a spine,
and this spine is going to point to the data associated
with this array. Each leaf of the array, which is
one of the different chunks of the array, consumes an entire region. So how does
it actually look, if I were to
visualize this? We have the heap here, and at the bottom we have the
array header. Arrayoids represent the pointers to
the array data. So imagine you want to
access element one of this array. What we do is follow the
first reference here, the first arrayoid, and you can see we go to the
last blue region and access the element that way.
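The element lookup just described is two steps: pick the arrayoid, then index into that leaf. Here is a sketch with an assumed, tiny leaf size; the real leaf size is the region size, and this is not OpenJ9's implementation.

```java
// Sketch of arraylet indexing: the spine holds arrayoids (pointers to
// leaves), and each leaf is a region-sized chunk of the array's data.
// Element i lives in leaf i / leafSize at offset i % leafSize.
public class ArrayletDemo {
    static final int LEAF_SIZE = 4;  // illustrative, not the real region size

    final int[][] leaves;            // the leaves, one per region

    ArrayletDemo(int length) {
        int leafCount = (length + LEAF_SIZE - 1) / LEAF_SIZE;
        leaves = new int[leafCount][LEAF_SIZE];
    }

    int get(int i) {
        return leaves[i / LEAF_SIZE][i % LEAF_SIZE]; // follow arrayoid, then offset
    }

    void set(int i, int value) {
        leaves[i / LEAF_SIZE][i % LEAF_SIZE] = value;
    }

    public static void main(String[] args) {
        ArrayletDemo arr = new ArrayletDemo(10); // spans three leaves
        arr.set(7, 42);                          // lands in leaf 1, offset 3
        System.out.println(arr.get(7));          // 42
    }
}
```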
So you can see it's a really nice way to store large
arrays in a heap, as opposed to having the array
span multiple regions. But there's a problem:
an array that's divided into regions cannot
work with APIs that require a contiguous representation
of the array. So how did we deal with this before? This
is the case with the JNI critical APIs, the Java Native Interface
critical APIs. They actually need a contiguous view
of the array, which is not the case with arraylets. So what do we
do? We actually make a copy of this array,
copying element by element to a temporary array, and pass that
copy to the JNI API. Then the JNI API is going to do whatever
it does with the array, maybe modify some of its elements,
and then we need to copy everything back. So you can see there
are two copies there. Imagine we have large arrays, like 10 or 50 megabyte
arrays: as you can see, that's really expensive. So how can
we deal with that? That's where double mapping comes in. We can make
these large, discontiguous arrays look contiguous.
We take advantage of the fact that physical memory is limited,
but virtual memory is not. On the contrary, the virtual address space
is very, very large on 64-bit systems: 2 to the power of 64 addresses,
in fact. So what we do is map two virtual memory addresses to
the same physical address. And the nice thing about this is that any
modification through one of the virtual addresses is reflected in the
other virtual address. ZGC does a similar trick to keep track
of the different phases of the GC: every object is
going to have multiple copies in the virtual address space, but they actually point
to the same physical address. Here we just do it for arrays.
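You can get a feel for the trick in pure Java by mapping the same file region twice: a write through one mapping is visible through the other, just like two virtual mappings of one physical page. This is only an analogy; the real OpenJ9 implementation uses OS shared-memory APIs on anonymous memory, not a temp file.

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Two mappings of the same backing storage: writing through one mapping
// is immediately visible through the other, which is the property double
// mapping relies on to make discontiguous arraylet leaves look contiguous.
public class DoubleMapDemo {
    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("dmap", ".bin");
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer viewA = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            MappedByteBuffer viewB = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            viewA.put(0, (byte) 42);          // write through mapping A
            System.out.println(viewB.get(0)); // read through mapping B: 42
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```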
How would this look if we were to visualize it? In the
bottom part there we have the actual arraylets, the different regions of the array,
and then we map them to a second virtual address, and we
do it in such a way that the second virtual mapping looks contiguous
in the virtual address space. All we have to do now is pass the address of
that second virtual mapping to the JNI API, and every
modification to that virtual memory is going to be reflected in the original
virtual address. That way we don't have to copy everything to a temporary array and
copy everything back. So that's a really nice trick that we do in our
region-based GC, and with it we actually boosted those array operations
by 30 times, which is pretty cool. So what's next? Double mapping
has some downsides, right? It's only available on Linux and 64-bit
systems. But can we do better? Some platforms don't
really support this double mapping trick. So what we're actually working
on right now is a technology called off-heap, where
we're going to store large objects in a secondary heap,
and that way we can do some more tricks and try to
optimize these large objects even further. So stay tuned.
This is a quick GC policy guide.
So on the left we have the GC policy and on the right we
have some characteristics. For instance, some
of these GCs are generational: Gencon, Gencon
CS and balanced. Some of them are going to have some concurrent phases.
The only one that has no concurrent phase at all is optthruput, but
optthruput has the highest throughput, and that's because it
doesn't have to deal with read or write barriers. And these different policies
have different heap layouts, right? Gencon has the
new and old separation, balanced and Metronome are region-based,
and optthruput has a small object area and a large object area.
This is another table, which shows which policy is
better for which domain. You shouldn't follow this to
the letter, right? You should actually test the policies, because
even though you might have, say, a desktop application, depending on your workload,
another policy might work best. So here are some general tips:
Gencon is good for web servers and desktop applications;
balanced, or G1, is good for very large heaps; and Metronome is
good for systems that have soft real-time
constraints. And optthruput is very good for small heaps,
or if you don't care about GC pauses at all. And you can see here
different policies are going to have different throughput and different GC pauses. You can
see Gencon CS does not have the best throughput,
but it has the fewest GC pauses; on the other hand, optthruput
has the highest throughput but the largest GC pauses.
And that's because of concurrency and how we deal
with it: read barriers and write barriers have an
overhead on the GC. And you can see here that all
policies are going to have some type of additional memory.
Maybe they have a mark map or something to keep
track of these objects, because the GC needs these extra data structures
to help collect the objects. That's what
this middle column is telling you. This is
like a diagram showing the similarities between the different
and most common GC policies out there. At the top there,
in yellow, we see the most common GCs. Gencon is very
similar to CMS from HotSpot:
both are generational, both have new and old space.
Balanced and G1 are very similar to each other: both are
region-based, both are generational. Then we have,
in the blue section there, the pauseless GCs,
whose objective is to have the fewest
and the shortest pauses, right? You
can see the lines there are thinner,
because even though they share some similarities, they are not as similar as the ones
above. So ZGC and Shenandoah are both region-based,
but they are not generational. Gencon CS is not region-based, but it
is generational. Metronome is region-based, but it's
not generational. And remember, ZGC uses
that multi-mapping kind of technology, right? Shenandoah version
2 uses the same read barrier as Gencon CS; Shenandoah
version 1 uses Brooks pointers, which is a different
technique. And on the green side there we have
Azul C4, which is also a pauseless GC, but it requires
special hardware or special software.
Now, this is the summary of the talk.
If there's one thing that I want you to get out of this, it's that
there's no perfect fit, no perfect GC that fits everything,
all workloads, right? If you want a GC that
has the best throughput and you don't care about pauses at all, you should go with
a purely stop-the-world GC, because no matter what, that's always going to have
a higher throughput than a pauseless GC. And that's because of the
read and write barrier overhead: even though the barriers are very important for
the correctness of these pauseless GCs, there is an overhead,
because for every object read or write you
need to make sure that the GC has that knowledge whenever
we are changing those references alongside the application,
right? So if you don't care about the pauses, go for a pure
stop-the-world GC. If you do care about the pauses, go for a pauseless
GC. But if you care about both, or you want a balance
between the two, you can go for the balanced GC or G1, right? Or Gencon
and CMS. So you have the options, right? Both extremes,
pure stop-the-world, pure pauseless, and something somewhere in the middle.
And these different policies deal with heap fragmentation
differently, right? They might have
different heap configurations, so they're going to collect objects
differently. Remember the barriers, right? A snapshot-at-the-beginning
barrier versus just a regular write barrier. So there
are a lot of things in play here when it comes
to picking a GC policy. But again,
there's not going to be a perfect GC for every workload.
Whatever workload you have, you should try three or four different
policies and see what's going to work best for that workload.
So here are some links. OpenJ9 is a very cool
project; we do recommend people come check it out. That second link there,
the command line migration guide, is for those people who are familiar
with HotSpot and want to try out OpenJ9, so I really recommend looking
at that. AdoptOpenJDK is the website where you'll get the binaries.
And OMR is another project that has many different
components for building a runtime, and OpenJ9 uses a lot of
these components, for instance the GC component and the compiler
component. It's really cool, also an open source project, and we welcome
contributors there as well. So that's it for me.
I'd be very happy if you connect with me on LinkedIn or
Twitter. And before I open it up to
questions, I want to show this book here. If you want
to know everything there is to know about garbage collection, this is the book. Everything you want
to know is in there. And these are some
of the common options to go through if you're
interested, if you want to start tweaking your
application. You can start with these ones; they are,
let's say, the simplest. The first one there is only available on
OpenJ9. And the nice thing about the verbose GC
option is that you can actually visualize how your memory is
behaving under an application with a tool called GCMV.
And you can only visualize this if you're running the application with
the OpenJ9 VM, because HotSpot has a different verbose format.
So if you're running with OpenJ9, you can use that verbose GC option to
capture the log of the GC and visualize it with GCMV, which is really
cool. That graph I showed you for balanced, I used GCMV
for that. I put some colors into it, but that's pretty much what you're going
to get with GCMV. So I think that's it for
me. I'll stop here, and I'll take any questions.
Thank you so much.