Transcript
Hello everyone, and thanks for tuning in to my talk at this year's Conf42 Java conference. Today I will talk about benchmarking the warmup performance of HotSpot VM, GraalVM, and OpenJ9. Let's get started.
So, my name is Frank Kriegl. I live in Heidelberg, Germany, and I've been working as a Java developer for almost four years. Currently I'm also finishing my Master of Science in parallel to my regular employment, which is also where this talk originates from. Recently I had to write a research paper, and I was simply curious to learn more about JVMs in general and saw this as a chance to deepen my knowledge. That's why this talk's subtitle is also "a learner's journey". So here's today's agenda.
First, I will start with a brief introduction. Then I will set the baseline with some information bits about JVM internals. Next, I will talk about my learnings when I tried to compare the warmup performance of the three different JVMs, so I will spend some words on the pitfalls of creating good or bad benchmark tests. Next, I will describe my test setup for benchmarking the warmup performance, and also mention some configurations I made for the JVMs under test. Finally, I'd like to present my test results and, for sure, also give some interpretation of them. In the end, I will draw a short conclusion.
The goal of my talk is actually to motivate you to start with Java microbenchmarking on your own. So I hope that at the end of this presentation you will have a basic understanding of JVMs and are ready to get started with your own Java or JVM benchmark measurements. So let's talk about the what and the why. What is warmup, actually?
Usually, warmup is defined as the number of iterations the JVM needs to increase the speed of method execution between the first and the nth invocation by applying JIT compiler optimizations to the bytecode. Okay, that's the definition; now let me show you what it actually means. In this chart you see the time per operation on the vertical axis and the number of iterations on the horizontal axis, and warmup is this part here: in the first iterations, it takes the JVM quite long to complete one operation, one method execution, and over time it gets faster, so that after 200 iterations it's much faster than in the first iteration. This decline in execution time is called warmup. Next, I'd like to answer the question of why I wanted to compare these JVMs.
Mainly out of curiosity, to be honest. But for sure there are also some actual reasons: I was searching on the Internet and only found little research in this area. The most interesting article I found was "Don't Get Caught in the Cold, Warm-up Your JVM", which presents HotTub, a new JVM implementation that reuses pre-warmed JVMs to avoid the warmup overhead. And so I thought, okay, why not do my own research? I just wanted to see how the JVMs under test differ in method warmup speed, whether there is a difference at all (well, there is one), and what it looks like between JIT compilers and AOT compilers and the code they produce. Okay, next it's about setting the baseline. On this slide you see a picture of the Java heap structure.
The example is for the HotSpot VM, and the Java heap is actually separated into several parts. You see here that there is the young generation and the old generation, and the young generation itself consists of the Eden space and two survivor spaces.
So how does memory allocation happen in the JVM? New memory is always allocated in the Eden space, and just before the Eden space fills up, a minor garbage collection occurs and the objects are transferred into the survivor spaces. One survivor space is always free while the other one is occupied, and if the occupied survivor space is running full, a minor garbage collection will just swap these spaces, clear out unused or unreferenced objects, and the survivors will stay there. If there are long-living objects that survive several minor garbage collection cycles, it can actually happen that a major garbage collection occurs and these objects are then transferred into the old generation space,
also called the tenured space. To make this a bit more concrete, there's a minimal allocation sketch below.
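As a small, hedged illustration of my own (not from the talk): the following program allocates nothing but short-lived objects, so almost everything dies young in the Eden space. Run it with -verbose:gc (or -Xlog:gc on JDK 9+) to watch the minor collections doing their work.

```java
// Hypothetical demo, not from the talk: short-lived allocations land in Eden,
// die young, and are reclaimed by minor GCs before ever reaching the old generation.
public class AllocationChurn {
    public static void main(String[] args) {
        long checksum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            byte[] shortLived = new byte[1024]; // allocated in the Eden space
            checksum += shortLived.length;      // use the array so it isn't optimized away
        }
        System.out.println(checksum);           // keep the result observable
    }
}
```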
Next, the topic of garbage collection, which I already briefly touched on. Meanwhile there exist, I think, seven garbage collection algorithms, at least as of Java 11. The thing is that garbage collection can have unwanted side effects in performance testing, so you'd better try to eliminate it. Luckily, there's JDK Enhancement Proposal (JEP) 318, which is about Epsilon, a no-op garbage collector. I linked it here; you can read the details if you like. That's actually a garbage collection algorithm which will always allocate memory but never free it up again.
Next, it's about JIT versus AOT compilation. As you might know, Java code is precompiled to Java bytecode, which can then be run on any JVM. The JIT (just-in-time) compiler first does some profiling on the bytecode and then applies optimizations like method inlining, branch prediction, loop unrolling, dead code elimination, and many more. It will also only compile parts of the bytecode to machine code, because it has to decide which parts of the code need to be optimized. On the other hand, there's the ahead-of-time (AOT) compiler, which will just directly compile all the bytecode to machine code when the JVM starts up. Now, for the JIT compiler: since JDK 8 there are actually five levels of JIT compilation, at least that's what applies to the HotSpot VM. The first level is just about interpreting bytecode: it's level 0, and the JVM will not compile anything at all but just run as an interpreter. After a few iterations, the JIT compiler will make use of its first compiler, the C1 compiler, also called the client compiler, and produce some simple C1-compiled code. So we talk about level 1, 2, and 3 compilations, which are all done by this C1 compiler. After about 10,000 invocations, code will eventually be marked as hot, and then it becomes subject to level 4 compilation, which is done by the C2 compiler. This one is called the server compiler, and it will do much better optimizations on your Java bytecode. If you want to see these tiers in action, there's a small sketch below.
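As a hedged aside of my own (not part of the talk): you can watch these tier transitions yourself by running a small hot loop with the -XX:+PrintCompilation flag and looking at the level column printed in the output.

```java
// Hypothetical demo: run with `java -XX:+PrintCompilation TierDemo` and watch the
// compilation level (1-4) printed next to TierDemo::compute as the method heats up.
public class TierDemo {
    static long compute(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += (i % 3 == 0) ? i * 2L : i; // some branching for the profiler to chew on
        }
        return sum;
    }

    public static void main(String[] args) {
        long total = 0;
        for (int i = 0; i < 20_000; i++) {    // enough invocations to cross the C2 threshold
            total += compute(1_000);
        }
        System.out.println(total);            // keep the result alive so nothing is eliminated
    }
}
```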
Okay, next we continue with Java microbenchmarking and my lessons learned. I was not sure in the beginning how to start my learner's journey. I searched on the Internet and found that there are existing benchmark suites like SPECjvm2008, which is from 2008, and the DaCapo benchmark suite, which was first released in 2009. But the last maintenance release of the DaCapo benchmark suite was almost two years ago, eight months before the release of JDK 11, so to me they felt quite outdated, and I didn't want to use them for that reason. Also, not all benchmark tests were working with the targeted Java version 11; I actually tried to use them, but failed. And finally, the output format: the measurements were not emitted in a suitable format which I could use for further analysis of the collected data. So simply using some out-of-the-box benchmark suites did not work for me.
So I came up with the idea of writing my own benchmark. You have to know that writing a good benchmark is not easy. There are two fault categories: on the one hand, there are conceptual flaws made when designing a microbenchmark, for which I will show you an example in a minute, and on the other hand, there are contextual effects when running it. Here is an example of a conceptual flaw.
On the left-hand side we have the method createArrayUpTo and the method deadCodeElimination, which invokes the first method to create an array with a length of 21,000, containing the values from 1 to 21,000. The array is then processed and all the values are accumulated into the result variable, but this variable is actually never returned, so the calculation result is not used at all. If we then execute this some 18,000 times and invoke System.currentTimeMillis() before and after the method invocation, we can calculate the duration it takes to execute the deadCodeElimination method by subtracting the start value from the end value. But here's the issue: when running the code, the JVM will first just interpret your method, eventually collect some profiling data on it, and figure out that the result of the method is actually never used, because it's never returned. So at some point in time the JVM will just eliminate this invocation, and you'll see in your output that at some point the execution time drops to almost zero milliseconds, because what you measure then is just the time between invoking System.currentTimeMillis() the first time and the second time. A sketch of what this flawed benchmark looks like follows below.
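Here's a hedged reconstruction of that flawed benchmark; the method names follow the slide, but the exact bodies are my guess:

```java
// Hypothetical reconstruction of the flawed benchmark from the slide.
public class FlawedBenchmark {
    static int[] createArrayUpTo(int n) {
        int[] values = new int[n];
        for (int i = 0; i < n; i++) {
            values[i] = i + 1;                 // values from 1 to n
        }
        return values;
    }

    static void deadCodeElimination() {
        int[] values = createArrayUpTo(21_000);
        long result = 0;
        for (int value : values) {
            result += value;                   // accumulated, but never returned: dead code!
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 18_000; i++) {
            long start = System.currentTimeMillis();
            deadCodeElimination();             // the JIT will eventually eliminate this call
            long end = System.currentTimeMillis();
            System.out.println(i + ": " + (end - start) + " ms"); // drops to ~0 once eliminated
        }
    }
}
```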
But there is JMH to the rescue: conceptual flaws can mostly be avoided by using frameworks like JMH. JMH is the Java Microbenchmark Harness, and it is a tool that was created with the intention of helping developers avoid common pitfalls when writing and executing Java benchmarks. So it's actually quite handy to use, but you still have to be careful about what you're doing. Here you can see one of my first tries, where I was using JMH to write my own benchmark. I actually asked for some feedback on Twitter and got none, but that didn't stop me from continuing my learning journey.
There are two things I'd like to point out here. One is that JMH provides you with Blackhole objects, which you can use to consume objects in your benchmark; this makes sure that the code is not eliminated by the JVM. You could also just return the value or print it to System.out, which would have the same effect, but that's what Blackholes are there for. The second is that you should also consider warmup: there's the @Fork annotation, where you can specify the number of forks you want to execute, that is, how often the benchmark test should be executed in standalone JVMs, and also the number of warmup iterations, to deliberately exclude warmup when benchmarking your code. But in my case I wanted to measure warmup, so I set this to zero to get some observations. Put together, this looks roughly like the sketch below.
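As a hedged sketch (the benchmark body is a stand-in of mine, not the talk's actual code), a minimal JMH benchmark along these lines could look like this:

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.SingleShotTime)      // measure each invocation individually
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(20)                                // 20 fresh, standalone JVMs
@Warmup(iterations = 0)                  // no warmup iterations: we want to observe warmup!
@Measurement(iterations = 21_000)
public class WarmupBenchmark {

    @Benchmark
    public void solveSomething(Blackhole blackhole) {
        long result = 0;
        for (int i = 1; i <= 21_000; i++) {
            result += i;
        }
        blackhole.consume(result);       // prevents dead code elimination
    }
}
```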
I tried several different approaches to writing some good benchmark tests. I tried to reuse existing benchmark tests from the DaCapo or SPECjvm suite, but that all didn't work out for me. In the end, I ended up with a Sudoku backtracking algorithm, which turned out to work quite well for my case. You can find that code on GitHub; I will not go into details there, but this is the code I used to benchmark the JVM warmup performance. For a rough idea of what such a solver looks like, see the sketch below.
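Since the actual code lives on GitHub, here is only a hedged, generic sketch of what a Sudoku backtracking solver typically looks like; it is my illustration, not the benchmarked code.

```java
// Hypothetical sketch of a Sudoku backtracking solver (not the benchmarked code).
public class SudokuSolver {

    // Tries digits 1-9 in each empty cell (0) and backtracks on dead ends.
    static boolean solve(int[][] board) {
        for (int row = 0; row < 9; row++) {
            for (int col = 0; col < 9; col++) {
                if (board[row][col] != 0) continue;
                for (int digit = 1; digit <= 9; digit++) {
                    if (isValid(board, row, col, digit)) {
                        board[row][col] = digit;
                        if (solve(board)) return true;
                        board[row][col] = 0;   // undo and backtrack
                    }
                }
                return false;                  // no digit fits: backtrack
            }
        }
        return true;                           // no empty cell left: solved
    }

    static boolean isValid(int[][] board, int row, int col, int digit) {
        for (int i = 0; i < 9; i++) {
            if (board[row][i] == digit || board[i][col] == digit) return false;
            int boxRow = 3 * (row / 3) + i / 3;
            int boxCol = 3 * (col / 3) + i % 3;
            if (board[boxRow][boxCol] == digit) return false;
        }
        return true;
    }
}
```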
So, here's my test environment.
I did all the benchmarking on a virtual machine, which is not optimal, but I tried to compensate for that with multiple test runs; more on that in a minute. The operating system is Ubuntu 20.04, 64-bit, and I had eight virtual CPU cores based on an AMD Opteron processor. There were 8 GB of RAM available, no swap configured, and 8 GB of hard disk storage. Now, to my test setup.
I decided to execute my benchmark tests with 21,000 iterations, to also see some effect when a method gets marked as hot. Every run consisted of 20 forks, which means that JMH will spawn 20 independent JVMs, so as not to accidentally reuse already compiled code. Then I executed twelve runs on different days and at different times of day, to eliminate the contextual effects I would face in a virtual environment. When you multiply all these numbers, 21,000 iterations in 20 forks and twelve runs, you get 5.04 million Sudokus solved per JVM; always the same Sudoku, though. As for the JVM parameters, I did not touch much, because I wanted to take the approach of simulating a daily user who would just throw code at the JVM and run it. There were two exceptions: one is that I was using the no-operation garbage collector Epsilon, or respective equivalents for the other JVMs, and the other is the pre-touch memory option, which I will explain in a minute.
So here are my JVMs under test. I decided on the tried and trusted HotSpot VM, where I used an OpenJDK 64-bit build from AdoptOpenJDK. As you can see, I also configured an alias for every JVM, which I use later to shorten the amount of text on my slides.
Secondly, I went for GraalVM, which is a polyglot VM. I used the Community Edition for my benchmark testing, in version 20.2. And last but not least,
OpenJ9, an enterprise JVM which actually promises on its website better performance than HotSpot VM; we'll talk about that later. With this test setup, I started my measurements. So let's take a look at
the runtime flags which I used to execute my benchmark. These are for HotSpot VM; let's go through them step by step. First, I specify the benchmark target, which is my backtracking algorithm; this is just the syntax given by JMH. In the next line, I provide some JVM arguments for JMH that it will use for every fork it spawns to execute the benchmark.
I used a configuration of 5 GB of heap and also provided the HeapDumpOnOutOfMemoryError flag to show me if my JVM crashes. Next, you see some -XX flags like UnlockExperimentalVMOptions, which I need in order to use the Epsilon garbage collector mentioned earlier, to avoid garbage collection interrupting my measurements. Then there's also the AlwaysPreTouch option, which claims physical memory from the operating system right at the beginning rather than on the fly; this also eliminates some interference by the JVM when it would find out that it needs more memory. The next flag just tells JMH where to store the measurement output and in which format; it can output things in JSON format, among others. Then I specify the number of iterations, which is 21,000 per fork; I run 20 forks; and the timeout is set to 360 minutes, which is very high, but I just didn't want JMH to time out and abort my measurements. Okay, and with the last line I just wanted to collect the output of my program into a log file. Expressed through JMH's Java API, the whole setup looks roughly like the sketch below.
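To make this concrete, here is a hedged sketch of roughly the same configuration expressed through JMH's programmatic API instead of command-line flags. The class and output file names are my stand-ins; the flag values follow the talk (5 GB heap, Epsilon GC, AlwaysPreTouch, 21,000 iterations, 20 forks, 360-minute timeout, JSON output).

```java
import org.openjdk.jmh.results.format.ResultFormatType;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.TimeValue;

public class WarmupRunner {
    public static void main(String[] args) throws Exception {
        Options opts = new OptionsBuilder()
                .include("SudokuBenchmark")              // hypothetical benchmark class name
                .jvmArgs("-Xmx5g",                       // 5 GB of heap, as in the talk
                         "-XX:+HeapDumpOnOutOfMemoryError",
                         "-XX:+UnlockExperimentalVMOptions",
                         "-XX:+UseEpsilonGC",            // no-op GC from JEP 318
                         "-XX:+AlwaysPreTouch")          // claim physical memory up front
                .warmupIterations(0)                     // we want to observe warmup itself
                .measurementIterations(21_000)
                .forks(20)
                .timeout(TimeValue.minutes(360))         // generous, so JMH never aborts a run
                .resultFormat(ResultFormatType.JSON)
                .result("hotspot-results.json")          // hypothetical output file name
                .build();
        new Runner(opts).run();
    }
}
```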
The runtime flags for GraalVM look quite similar, with one small exception: I did not find any no-operation garbage collection algorithm for GraalVM in this version, so I made use of a workaround. I set the MaxNewSize parameter to a number higher than the memory actually available for the heap, which makes the JVM create a huge young generation but no old generation space in the heap. The effect is that no garbage collection can actually occur; or rather, before it would occur, the JVM would run out of memory. So it's important to have enough memory available for your benchmark tests.
OpenJ9 also has a slight difference here: I unfortunately found that the Linux version of OpenJ9 does not offer a pre-touch option, so it will claim memory from the operating system on the fly whenever it needs more. Okay, that was the setup.
And now I would like to share some test results. Here you see the overall chart which I generated from the data collected while benchmarking HotSpot VM. On the vertical axis is again the time per operation in nanoseconds, and on the horizontal axis the number of iterations, up to 21,000. If we now zoom in a little, you can see that there is a light red colored background behind the warmup graph. I call this light colored area the scatter shade, because it represents the scattering of the different forks' individual data points at any given time slice. These are the interquartile ranges, Q1 to Q3, and the red line is the median value of the execution time. On this slide, I zoomed in again, to the first thousand executions.
And here you can actually see that there's already a significant drop in the execution time right at the beginning. There are several things we can observe here. First of all, we see that the scatter shade tightly follows the median curve and also narrows over time, which shows that the execution time is generally declining. Next, the median curve also tends to sit at the lower bound of the interquartile range of the scatter shade, which allows the conclusion that the data points between the median and the Q3 quartile are more spread out than those in the range from Q1 to the median. That makes absolute sense, because there's a physical lower bound on execution time. This behavior of the scatter shade can also be observed in the charts of the other JVMs, GraalVM and OpenJ9. Here we have the chart for GraalVM for the
first thousand benchmark iterations. Both GraalVM and HotSpot actually show this sudden decline at around 100 executions, where the execution time drops significantly; this significant decline shows not only in the median curve but also in the scatter shade. We also see that at this point the Q3 boundary, the upper edge of the scatter shade, eventually falls below the Q1 boundary of the previous data points. I tried to visualize that with this red bar: you can see that here the upper bound of the scatter shade lies below the earlier lower bound. Here's another view of the GraalVM warmup
chart: between iterations 6,400 and 6,800, GraalVM actually shows this bump. I did not dig into details here, because I didn't have a good profiler at hand. However, I think it would definitely be interesting to investigate this anomaly; if you have any guess what this bump is about, please let me know. So, the blue chart is for OpenJ9.
Again, we look at the first thousand iterations of this benchmark. You can already see that the warmup chart of OpenJ9 looks somewhat different from the others. First of all, there is no sudden decline at the 100-iteration mark; instead, there are spikes in the execution time for single iterations. You see some spikes here, and also later on; they become rarer over time, but they are always present. I was thinking: okay, maybe these spikes could be caused by the missing pre-touch option, which is not available in OpenJ9 for Linux. To find out whether this behavior, the spikes, could be attributed to the missing pre-touch option, I would have expected to observe similar behavior in the other two JVMs when disabling the AlwaysPreTouch option for them. So I made another measurement series with GraalVM and HotSpot with the AlwaysPreTouch option disabled. But the warmup charts looked the same: there were no spikes for GraalVM or HotSpot, and no hints supporting my suspicion. This leads to the conclusion that, in my test setup, fetching physical memory from the operating system had only a minor or even no effect on the measurement series, and that the spikes in the warmup graph of OpenJ9 cannot directly be attributed to fetching memory from the operating system. Okay, so up to now we've just had a look
at each JVM individually; now I'd like to go on and compare them. To get started, I'll just talk about the average execution times. On the left-hand side you see a histogram which includes the execution times for OpenJ9, HotSpot, and GraalVM, all in JIT compiler mode. You can see that the histograms for HotSpot and GraalVM look quite similar, while OpenJ9 describes a rather different curve; however, they all have this tail to the right. The average execution time for HotSpot and GraalVM is around 0.4 milliseconds, with GraalVM seemingly a little bit faster, and OpenJ9 follows tightly at almost 0.5 milliseconds. Then I also made some measurements where I enabled the AOT compiler for OpenJ9, and this one turned out to be faster than OpenJ9 in JIT mode, but still slower on average than GraalVM or HotSpot. That's also what you see here on the right-hand side in the chart: the purple curve is OpenJ9 in AOT mode, and it's faster than OpenJ9 in JIT mode overall. Okay, let's dig deeper. One interesting
thing to observe in the warmup charts is the amount of time, the number of iterations, it takes to speed up the method execution from 5 milliseconds to 0.5 milliseconds. I'm talking in units of milliseconds because that's easier to pronounce, but don't get confused by the scales here: it's still nanoseconds on the chart. The first red bar is at 5 milliseconds, the second one is at 2.5 milliseconds, and the third one is at 0.5 milliseconds. For this blue chart, which represents OpenJ9, it takes 150 iterations to gain a 90% performance improvement within the first iterations of the benchmark test. In numbers, this means that on average the JVM executes the method about 30 microseconds faster with every next execution than with the previous one. If we look at this KPI for HotSpot, we see that the negative slope is not as steep as for OpenJ9, and we can also prove that by calculating it: reaching the lower bound of 0.5 milliseconds from the 5 milliseconds we start at takes around 700 executions of the benchmark method. So we can say that with every next execution, HotSpot speeds up the method execution by about 6.4 microseconds per operation compared to the previous one, which means that during the first few iterations, where warmup takes place, HotSpot VM warms up roughly 4.6 times slower than OpenJ9. Here's the quick arithmetic behind those slopes.
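As a quick sanity check, here is my own back-of-the-envelope calculation, using the rounded numbers from the slides (the small gap to the factor of 4.6 quoted above comes from rounding):

\[
\text{slope}_{\text{OpenJ9}} \approx \frac{5\,\mathrm{ms} - 0.5\,\mathrm{ms}}{150\ \text{iterations}} \approx 30\,\mu\mathrm{s}\ \text{per iteration},
\qquad
\text{slope}_{\text{HotSpot}} \approx \frac{5\,\mathrm{ms} - 0.5\,\mathrm{ms}}{700\ \text{iterations}} \approx 6.4\,\mu\mathrm{s}\ \text{per iteration}
\]
\[
\frac{700}{150} \approx \frac{30\,\mu\mathrm{s}}{6.4\,\mu\mathrm{s}} \approx 4.7
\]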
If we compare all three JVMs together, you will see that OpenJ9 only wins the race within the first few hundred iterations. After around 600 iterations, the blue chart lies above the green and red charts of HotSpot and GraalVM, which means that in the end OpenJ9 is slower than its opponents; it just warms up faster right at the beginning. I also promised to briefly
talk about JIT compilers versus AOT compilers, and for that I made some measurements with OpenJ9 in JIT mode, which is the blue graph again, and in AOT mode, which is the purple graph. Here you can see the flags you need to provide to enable AOT mode on OpenJ9, and you can easily spot that, right from the beginning, the OpenJ9 AOT compiler starts at its maximum performance and always executes the code at the same speed, while the JIT compiler only catches up to that after a few hundred executions. So, having all
these nice-looking charts is quite cool, actually, but I also wanted to know what's actually happening there: why is the warmup as it is, and what's causing it? For that I found JITWatch, which is a log analyzer and visualizer for the HotSpot JIT compiler, and it's a really cool tool, actually. You can enable the required logging by providing a few runtime flags on your JVM (essentially -XX:+LogCompilation together with -XX:+UnlockDiagnosticVMOptions). However, you have to know that this will have a negative impact on performance, so do not do that during your benchmarking, but only afterwards to investigate. The output file, an XML log file, you can then just load into JITWatch afterwards. JITWatch
will then show you the compilations for every single method. So here's an example of the compilation list for the method solve(int[]), one of the methods in my Sudoku benchmark test. You can actually see that there are some C1 compilations happening, and also some C2 compilations and on-stack replacements, but all of them only after 20 seconds, which is about half of the time the benchmark test runs. So this is way beyond the initial warmup we saw, and actually they no longer have much effect on the execution time. So I was wondering what else would cause the warmup in the initial 1,000 iterations if all these compilations shown by JITWatch kick in much later.
While JITWatch is a useful tool to visualize the actions of the JIT compiler, I encountered a discrepancy between the compilations shown by JITWatch and the JIT compiler actions logged on the terminal by providing the runtime flags -XX:+PrintCompilation and -XX:+PrintInlining. The terminal log output showed several inlining operations taking place already during the first iterations of the benchmark execution. These inlining operations also fit the warmup charts, where we see a steep decline over the first few hundred or thousand iterations. So these inlining operations were probably the main driver of the fast decline in the warmup graphs we've just seen. The difference between the XML compilation log file used by JITWatch and the compilation log output on the terminal can actually be explained by a limitation of the LogCompilation option, which leads to the fact that the inlining decisions made by the C1 compiler early on are not included in the XML log file used by JITWatch. You can read the details via the link I provided to the OpenJDK wiki.
Okay, now I'd like to draw a short conclusion and also add some remarks about my benchmark measurements. First of all, all the benchmark measurements I made were done on JDK version 11 for all the mentioned JVMs; I did not perform measurements on any other JDK version. Secondly, I conducted the benchmark measurements in October and November 2020, so meanwhile there are newer versions of the JVMs, and it would be interesting to also take a look at them. Yeah,
and here are also my final thoughts. As just said, GraalVM version 21 was recently released. It now comes with the Espresso JVM, a JVM fully written in Java; it's Java on Truffle, if you know what that means. Maybe I'll find the time to also do some warmup performance benchmarking on the Espresso JVM. The second thought that comes to my mind is that OpenJ9's benefit is definitely its AOT mode: it's performing better in AOT mode, at least in my measurements. But I'm asking myself: why don't they make this the default configuration, if they also use it to advertise that they are faster than the HotSpot VM? Last but not least, I think there are many other JVMs that also deserve to be benchmarked on warmup performance, because they become more and more important in the world of different JDK distributions, for example the Amazon Corretto JVM or Alibaba Dragonwell.
All right, that was my presentation; I hope you liked it. It was the first talk I ever held in public. If you want to check out my references or take a look at the source code, you can find many more details in my blog post on this topic, which is linked here. If you have any questions about my measurements, my talk, or my learnings, or want to discuss something, just send me an email; here's my contact information. Thank you for tuning in, and see you.