Transcript
Hi everyone, welcome to Conf42 Golang. I hope you had a great time with all the talks so far. I am Yash, and today I am extremely excited to be talking about profile-guided optimization and how it can boost your code and your compiler. So moving on, let me first introduce myself. I am Yash, and currently I am working as a software engineer at Red Hat. I am also pursuing a master's at the University of Waterloo in Canada, and in my work I deal with OpenShift, Kubernetes, Golang, containers, Docker, and cloud-native stuff. Apart from that, in my free time, whenever I get some, I do open source dev, I love long distance running, and I like to just build stuff. So moving on, let's first talk
about compilation, right? Computers, as you all know, don't understand Golang, Python, Java, all that stuff. Computers only know zeros and ones, the binary language. So there should be an entity which translates Golang code into zeros and ones so that the computer can execute it, and that entity, as most of you would know, is called the compiler, as shown in the image below. You can see that on the left-hand side there's Golang code printing hello world, and there's some compiler magic happening which produces the machine code, the binary or assembly, and that is fed to the linker to produce the actual binary or executable which you run to print hello world. Across this talk, we won't talk much about the linker; we care about and will talk more about the compiler. So let's move on and dive a little bit into the magic inside the compiler. So the compiler is an
amazing piece of tech. I mean, it's doing so many things so gracefully and beautifully. Behind the scenes it does a bunch of steps. Firstly, it breaks down the code into a bunch of tokens by means of a lexer; that process is tokenization. Then that set of tokens is fed in and a tree is created out of it, called the abstract syntax tree, which is a representation of your code. That tree then goes through a phase of semantic analysis to determine whether the type checking is happening correctly, whether the code is correct, all that stuff. After that, the code is converted into a platform-independent representation called IR, the intermediate representation, and that is fed to the backend of the compiler, where a bunch of optimizations happen and a lot of magic happens, which, by the way, we will dive into further in this talk. And finally, this optimized IR is fed to the machine code generator, which actually produces the machine code and, yeah, feeds it to the linker to generate the executable.
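(For a concrete feel of the front half of that pipeline, here is a small, hedged sketch that uses Go's own go/parser and go/ast packages to tokenize and parse a hello-world program and print its abstract syntax tree. This is just an illustration of what the compiler builds internally, not the compiler's actual code.)

```go
package main

import (
	"go/ast"
	"go/parser"
	"go/token"
	"log"
)

func main() {
	src := `package main

import "fmt"

func main() { fmt.Println("hello world") }
`
	// A FileSet keeps track of token positions in the source.
	fset := token.NewFileSet()

	// Lexing + parsing: the source is tokenized and turned into an AST.
	file, err := parser.ParseFile(fset, "main.go", src, 0)
	if err != nil {
		log.Fatal(err)
	}

	// Dump the abstract syntax tree that represents the code above.
	ast.Print(fset, file)
}
```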
So let's talk about the optimizations, sorry, the magic we care about, which is compiler optimizations. The thing is that the code which you write, the code which I write, might not be the most efficient piece of code behind the scenes, meaning that there are a bunch of optimizations which can be made to it, and the compiler takes care of that for us. To give you an example, imagine your code has an if statement which will never be executed; for example, it says if false, print hello world. Now, you and I both know that this if statement is absolutely useless, and the compiler is also smart enough to go through these lines of code and see that, hey, this if statement would never execute, so let's not even include it during compilation.
So in this manner, the compiler ends up ignoring this small chunk of code and produces a much lighter executable or binary.
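(A minimal sketch of that idea in Go, as an illustration rather than the talk's exact snippet: the branch below is guarded by a compile-time constant, so the compiler is free to drop it entirely.)

```go
package main

import "fmt"

func main() {
	// The condition is a constant false, so this branch is provably dead;
	// the compiler can eliminate it instead of emitting instructions for it.
	if false {
		fmt.Println("hello world") // never reached
	}
	fmt.Println("this is the only line that survives")
}
```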
Now, this was one trivial example, but the compiler applies a bunch of crazy optimizations, and the ultimate benefit is that you end up with a smaller executable. There are fewer instructions and fewer code jumps happening at runtime because of the lower instruction count.
The compiler can also optimize your code during compilation on the basis of the underlying hardware on which your code will run. For example, if your code is doing a lot of GPU programming and the hardware it will run on has a lot of GPU capability, then the compiler can really make use of this information and optimize your code accordingly. With this power of compiler optimizations, you can write a very clean, beautiful and readable piece of code in an abstracted manner, and you won't have to pay the performance overhead for it, because the compiler deconstructs these abstractions behind the scenes.
So let me give you some examples of these optimizations. One optimization is pre-calculation of constants, where if you have an expression like a = 2 * 3 + 5, the compiler can calculate this at compile time itself and say, hey, there is no need to compute this again and again at runtime, let me just compute it straight away in the compilation phase. Similarly, there's a process called loop unrolling, where a for loop is unrolled further so that at runtime the loop does not have to check its condition as many times. That also saves a bunch of CPU cycles.
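(A hedged illustration of both ideas, not the slides' exact code, just what the transformations conceptually look like.)

```go
package main

import "fmt"

func main() {
	// Pre-calculation of constants: the compiler folds 2*3 + 5 into 11 at
	// compile time, so no arithmetic is left to do at runtime.
	a := 2*3 + 5

	// Loop unrolling: a loop such as
	//   for i := 0; i < 4; i++ { sum += a }
	// can conceptually be rewritten by the compiler into the straight-line
	// code below, trading a few extra instructions for fewer condition
	// checks and jumps.
	sum := 0
	sum += a
	sum += a
	sum += a
	sum += a

	fmt.Println(a, sum)
}
```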
And finally, there's an optimization called dead store elimination, where a bunch of useless code is ignored. For example, in the right-hand image you can see that a variable is constantly getting updated, but all of those updates don't matter, because the last update is going to overwrite all of the previous ones. The compiler is smart enough to notice this: it ignores the first two assignments and only considers the last one during compilation.
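(Roughly what that right-hand image describes, sketched in Go as an illustration rather than the slide's exact code.)

```go
package main

import "fmt"

func main() {
	x := 1 // dead store: overwritten before it is ever read
	x = 2  // dead store as well
	x = 3  // the only assignment whose value is observed

	// The compiler can drop the first two assignments entirely, since only
	// the final value of x is ever used.
	fmt.Println(x)
}
```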
So the compiler does a lot of these cool optimizations, but we are interested in one optimization, and that is inlining. This talk is going to be centered around that optimization.
So inlining originates from a problem, and the problem is the act of calling a function. All of us write code in a pretty functional manner: we define functions, and we call those functions multiple times. The problem is that at runtime, the act of calling a function is a slow operation, because when you invoke a function, a bunch of stuff happens behind the scenes which adds up to the runtime overhead. For example, a new stack frame is dedicated to the scope of the function, the parameters are pushed onto that stack, and when the function completes its execution, the return values are passed back to the caller. All of these things add up to the performance overhead at runtime. So what the compiler does is perform this optimization called inlining, where the compiler takes the body, the implementation, of the function and just places it wherever
that function is invoked. So if you look at the image below, we have a function called sum, which sums up two parameters, x and y, and we have a main function where we are invoking this function twice. If we perform inlining here, the compiler can just strip off the definition of the sum function and replace the invocations of sum with its actual implementation. So sum(1, 2) becomes 1 + 2, and sum(2, 4) becomes 2 + 4. And the beautiful part is that this newly inlined piece of code can be further optimized through, say, pre-calculation of constants, so that the compilation phase itself already knows that res1 is 3 and res2 is 6. From a runtime perspective, things become much more efficient, and it ends up eliminating the function calls as well.
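(Rendered as code, a sketch of the example from that image; the exact names on the slide may differ slightly.)

```go
package main

import "fmt"

// sum adds its two parameters.
func sum(x, y int) int {
	return x + y
}

func main() {
	res1 := sum(1, 2) // after inlining: res1 := 1 + 2, then folded to 3
	res2 := sum(2, 4) // after inlining: res2 := 2 + 4, then folded to 6
	fmt.Println(res1, res2)
}
```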
But we have a problem: nothing is perfect in this world. What if you have defined a function and it has too many invocations? If there are too many invocations and we perform the inlining blindly, that would lead to too many new lines of code. This means the code size would increase massively, which would lead to a bloated binary, and that has some bad consequences at runtime. For example, if I really go into the depths, a bloated binary basically means a higher number of static instructions, and the higher the number of static instructions in an executable, the more pressure on the instruction cache, which leads to a bad cache hit ratio, or cache hit rate, and that leads to instruction cache misses, which is quite bad at runtime. Moreover, a bloated binary causes page faults and whatnot, and if you are running your binary on a small device like a Raspberry Pi or something, even thrashing could happen. So blindly inlining a piece of code can be a bad situation. I have shown that as an example in the right-hand image as well, where inlining five or six lines of code can result in a 1,000-line result.
So we have a weird problem here. If we do less inlining, we have bad runtime performance because of the overhead introduced by those function calls. And if we do more inlining, we have bad runtime performance due to the bigger executable, the page faults, the instruction cache misses and whatnot. So what to do then? That is frustrating. Well, we just need the right amount of inlining. What does that mean? Ideally, we want to inline only the hot functions and not inline the cold functions. The hot functions are the ones which happen to execute a lot more frequently at runtime, while the cold functions are the ones which don't execute that often at runtime. The benefit of inlining hot functions is that these functions execute a lot of times at runtime; this means that if you inline them, you end up avoiding the function call overhead at runtime, so they give you the runtime benefit of inlining. You don't need to inline the cold functions, because they are not executing often enough for there to be any benefit in inlining them. And by not inlining the cold functions, you end up saving on the binary size and the instruction count, and you save yourself from all that bad stuff associated with excessive inlining.
So ultimately we just need a sprinkle of inlining. And the problem is that compilers don't know a lot. When a compiler is compiling your code, it is literally just reading through your code; it does not have much more information, and your written code alone is not enough to tell how frequently a function will execute at runtime. Clearly, compilers need more info. And what if we had a solution? The solution is: what if the compiler could look at the application at runtime and learn from it, learn which functions are hot and which functions are cold? Or, in more implementation-oriented terms, what if your application runs, you somehow collect various numbers and metrics about its behavior at runtime, and you feed those numbers and that behavior as information to the compiler the next time you compile your code? With that, the next time you compile, the compiler would know that, hey, this function is hot and that function is cold, so let me just inline this one. This kind of looks like a feedback loop, right? And that brings us to feedback-driven optimization. As the name suggests, it teaches the compiler how and where to optimize the code on the basis of a feedback. Now, that feedback can be benchmarks, user traffic, profiles, all that kind of stuff. The feedback ultimately tells the compiler which function is hot and which function is cold, and the compiler then does just the right amount of inlining.
Now, the early days of feedback-driven optimization were kind of governed by instrumentation-based processes. So what is instrumentation? It's very simple. With instrumentation, whenever your code is compiled, the compiler actually injects extra lines of code into it. And what are these extra lines of code? Some timers, some call counters, all that kind of stuff. The purpose of these additional lines of code is to track and instrument the behavior of your code at runtime. Once your code is compiled with these extra lines of code, a bunch of benchmarks are run against your application, and through those extra lines of code, while the benchmarks are running, a bunch of information gets instrumented and collected. This information becomes the feedback for the compiler's next build; it ends up representing how your application performs at runtime.
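(To picture what compiler-injected instrumentation conceptually looks like, here is a hand-written Go sketch; in reality the counter would be inserted by the compiler, not by you, and the reporting at exit is purely illustrative.)

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// renderCallCount is the kind of call counter an instrumenting compiler
// would inject for each function it wants to track.
var renderCallCount atomic.Int64

func render(md []byte) []byte {
	renderCallCount.Add(1) // injected: count every invocation
	// ... the original function body stays unchanged ...
	return md
}

func main() {
	for i := 0; i < 3; i++ {
		render([]byte("# hi"))
	}
	// At exit, the collected counts are dumped somewhere and become the
	// feedback (the profile) for the next compilation.
	fmt.Println("render was called", renderCallCount.Load(), "times")
}
```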
Now, this sounds good, but is it really that efficient or that reliable? The thing is that with this whole process, the code becomes much more bloated with all that compiler-introduced instrumentation, and we have an extra benchmarking step, which makes the entire process of compiling and building quite slow and boring. And the biggest problem is: what if the benchmarks don't even resemble the reality of how your code runs in production? What if it's actually quite the opposite? The benchmarks might end up introducing some wrongly assumed optimizations, and that can cause some really bad performance degradation. A potentially hot function might be perceived as a cold function by the benchmark, and your compiler would not optimize it rightfully.
So what do we want? Let's talk first principles. We want faster build times. We want realistic runtime data instead of benchmarks pretending to be real. And finally, we want light and small executables with no extra lines of code due to instrumentation. If you hate benchmarks, don't use them. What we can do instead is use the actual behavior of your code as feedback to your compiler. And the beauty is that if you do that, you have become independent of the benchmarking process as well: you can just compile your code and you are done. You don't have to execute the benchmarks, and this gives you the faster build times. It is also more realistic, because you're not relying on pretentious benchmarks pretending to be real users.
So how do you do that? To be specific, with sample-based profiling. The beauty of profiling is that it tracks the runtime behavior of your code without needing those extra lines of code inserted for the sake of instrumentation. And when I say profiling, I mean sample-based profiling, because if we get really technical, instrumentation is also a type of profiling. What I'm talking about here is sample-based profiling; across this talk, whenever I say profiling, just hear sample-based profiling.
Alright, so how does it work? It's pretty simple. Because we don't have those additional extra lines of code inserted into our code during the compilation phase, we need some external entity to poke, monitor and profile our application, and this is where the kernel enters the picture. So imagine an application running in user space. How does it get profiled? Well, the kernel has a beautiful component called Linux perf, and Linux perf schedules a bunch of interrupt events with the CPU hardware, and the hardware inside the CPU ends up triggering those interrupts. Whenever those interrupts fire, the interrupt handler captures them inside the kernel, and the handler correspondingly pokes your application in user space and grabs the runtime data at that point in time. That runtime data includes the instruction pointer and the call stack, which is enough to tell which function is executing at that time, which parameters are allocated in that function, what the memory footprint is, what the resource footprint of that function is. So a bunch of good profiling data, right? And that's it. Once all of this data has been collected, it is stored somewhere to be used later for whatever purpose.
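(In the Go world you usually don't drive perf by hand; the runtime ships its own sample-based CPU profiler. Below is a minimal, hedged sketch of collecting such a profile from inside a program; the file name, duration and workload are arbitrary choices of mine.)

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Start the sampling profiler: the runtime periodically interrupts the
	// program and records the current call stack, much like the
	// instruction-pointer and call-stack samples described above.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	busyWork(2 * time.Second) // whatever workload you want profiled
}

// busyWork burns CPU for roughly the given duration so the profiler has
// something to sample.
func busyWork(d time.Duration) {
	deadline := time.Now().Add(d)
	x := 0
	for time.Now().Before(deadline) {
		x++
	}
	_ = x
}
```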
And when we leverage that in the feedback-driven optimization setup, we get profile-guided optimization. PGO is a set of compiler optimizations which are guided by profiles of your code collected during its runtime. So again, the same thing: the compiler goes through a feedback loop where the application runs, its runtime behavior gets collected and fed back to the compiler, and the compiler optimizes accordingly. Here the feedback is just the profiles, sample-based profiles collected at runtime. Right, so enough talk, let's walk through the code.
So let's take an example. Let's say we have a very simple server which exposes a POST endpoint called /render, and whenever you call that endpoint with a markdown file as the body, you get rendered markdown, HTML, in response. Let's say this server is running, serving a bunch of users out there, and you want the compiler to optimize it just right, depending on the usage patterns.
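(For reference, a hedged sketch of what such a server might look like; the handler name, port and the stubbed-out markdown conversion are my assumptions rather than the talk's exact code, and the io.ReadAll call is the one the inlining discussion below refers to.)

```go
package main

import (
	"io"
	"log"
	"net/http"
	_ "net/http/pprof" // exposes /debug/pprof/ so the running server can be profiled under load
)

// renderHandler reads a markdown body and writes back rendered HTML.
func renderHandler(w http.ResponseWriter, r *http.Request) {
	md, err := io.ReadAll(r.Body) // the call we later hope to see inlined
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "text/html")
	w.Write(renderMarkdown(md))
}

// renderMarkdown is a placeholder; the talk's server presumably uses a real
// markdown-to-HTML library here.
func renderMarkdown(md []byte) []byte {
	return append(append([]byte("<pre>"), md...), []byte("</pre>")...)
}

func main() {
	http.HandleFunc("/render", renderHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```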
So first of all, let's just compile this code without PGO, without profile-guided optimization, and see what inlining decisions it takes. After compiling this code, you can see that it performs a bunch of inlining. It's quite readable here, but if I expand it further, you can see specifically which method calls, which function calls, are getting inlined.
But if you notice carefully, there are two places where this inlining is not happening. Now, the decision not to inline a function is a consequence of multiple factors. One reason is that if a function is a non-leaf function (imagine a tree where each function call leads to a child node), there's a good chance it won't get inlined. Not necessarily, but a good chance. So that might be the reason why these two functions are not getting inlined. But if you really think about it, inlining could have been useful here, because these functions happen to be part of the render function, which always gets executed whenever that endpoint is hit. And imagine thousands of users hitting that API again and again: the overhead associated with calling the function, with io.ReadAll, for example, not getting inlined, can be huge. So it would be quite nice to inline the io.ReadAll function and leverage the runtime benefits of it. But again, this is a piece of information which can only be known by observing the code at runtime; if there are barely any users, then it doesn't even make sense to inline this function, right?
So let's proceed and run and profile the program. In the first image, I built the binary without PGO and exported it as main_nopgo. I ran that no-PGO binary, so the server was being served by it, and then I ran a load test generator against this endpoint. Once the load was executing behind the scenes, I opened up a new terminal session and started profiling the server for 30 seconds to collect the data associated with it. And finally, when the profile was generated after those 30 seconds, I stored it in a file called default.pgo. So ultimately we ended up with these files: main_nopgo, the binary, roughly 8.5 megabytes large, and the CPU profile in pprof format, also saved as default.pgo; both of those are the same file. They represent the profile of our application running at runtime, facing the simulated user load. Now let's compile our program with
PGO. As soon as you notice the -pgo=auto flag here, that basically tells the Go toolchain to use the default.pgo profile as feedback while compiling the code. So this default.pgo actually told the Go toolchain how frequently the load was hitting the render function and io.ReadAll. And as you can see, the compiler now finally decided that it should inline the io.ReadAll function as well, which wasn't getting inlined previously, by the way. And to be honest, it's not just that: behind the scenes, a lot of other internal functions being called by internal libraries might be getting further inlined because of this profile-guided optimization. And we can actually confirm that: if
you list both the binaries, before and after PGO, you can see that the previous binary was smaller and this newer one is larger, roughly 0.2 megabytes larger. That's because the new binary has more inlining in it for the sake of better runtime benefits, so of course its size is going to be larger: with inlining, the code size increases, because instead of function invocations, your binary actually contains the definitions of the functions wherever they are inlined. Alright,
so now let's load test the old and new binaries. As a first step, I ran the old binary, main_nopgo, ran a bunch of benchmarks against it, and dumped the results into a text file called nopgo.txt. I did the same thing with the PGO binary and dumped its benchmark results into a separate text file called withpgo.txt. So ultimately we have two text files: one contains the benchmark results of the old binary, and the other contains the benchmark results of the new binary. And one thing to note here: there is no change of code between these two binaries. I did not change anything. It is just the magic of profile-guided optimization which we are about to witness.
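(For context, the kind of Go benchmark that produced those text files might look like the sketch below, kept in a _test.go file and run with something like go test -bench=. against each running binary; the address, payload and file names are my assumptions, not the talk's exact code.)

```go
package main

import (
	"bytes"
	"io"
	"net/http"
	"testing"
)

// BenchmarkRenderEndpoint fires requests at the already running server
// binary (built with or without PGO) so the two runs can later be compared
// with benchstat.
func BenchmarkRenderEndpoint(b *testing.B) {
	body := []byte("# Title\n\nSome *markdown* content to render.")
	for i := 0; i < b.N; i++ {
		resp, err := http.Post("http://localhost:8080/render", "text/markdown", bytes.NewReader(body))
		if err != nil {
			b.Fatal(err)
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
}
```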
So as we proceed further, if we use benchstat to compare these two text files, you will notice that with PGO the average runtime actually dropped, meaning performance improved, by roughly 2%, 1.88% to be precise.
Now, I know it's not much, but the thing is that we just simulated a very slight amount of load, and this was a fairly random, simple application. But imagine a fully fledged server with thousands or millions of users: in that case, some serious optimizations can be performed, and this small difference of 2% can stretch further to even 5 or 10%. You never know, right? And the best part is that the amount of effort involved with this improvement was negligible. I did not have to change a single line of code, and I got this benefit out of the box, right?
So let's do one thing, let's get our hands dirty. I mean, slides and all are fine, but let's actually play with profile-guided optimization on the terminal, right? Alright, so I pop open my terminal and you can see that I have a bunch of files here. Let me clean this stuff up first. I'll clean up all the text files, because we don't care about those results anymore, I'll clean up the old binaries like main_withpgo, I'll clean up all the old profiles, and of course let me remove the default.pgo file as well. So we are at scratch now; we don't have anything. Now let's first build our code plainly, and let's call the output main_nopgo. We are compiling the main.go file, simple. With this we have
compiled the code, and if I run this binary, the server is running. Alright, now let's go to a separate terminal, a separate session, and actually run the load test. Okay, the load test is now running behind the scenes against the application. Now, if I go back here and start the profiling process for, let's say, 30 seconds, then for those 30 seconds the profiling runs, and what's happening is that behind the scenes my binary, my server, is getting hit by requests from this load. That means it is facing what is effectively simulated real-time user traffic, and the profiling process is capturing all of the information at each point in time: what the memory footprint is, what the CPU footprint is, what the resource footprint of the binary's execution is.
So as you can see, we are done and we have the profile ready here. Let's just rename our profile to default.pgo, because that's the name recommended by the Go toolchain. Of course you can use a custom name as well, but just for the sake of convenience I'll put it here. Now, as you can see, we have the default.pgo here, so let's compile again. This time let's call the output binary main_withpgo, we are compiling main.go, and we will add the -pgo=auto flag. This basically tells the compiler to read the default.pgo file as feedback while performing the compilation. And as you can see, it's definitely taking some more time compared to last time, because now it is also scanning through the profiles and accordingly taking the right optimization decisions.
And yeah, it's just building a lot of suspense, doing a lot of compilation and whatnot, and taking its time. Yep, now it is built, and if I run go version -m on main_withpgo, you can actually see that it did consider this default.pgo file while doing the compilation.
Alright, now let's run this main_withpgo binary. But before that, let me show you something interesting. Let's look at the sizes: I'll do an ls, and let's grep for main as well. Notice how the size of the one with PGO is actually larger and the one without PGO is smaller, because the profile told the compiler to perform more inlining, which led to the increase in binary size. And finally, let's execute
main_withpgo. This time let's actually do some benchmarking. So I'll start the benchmarks here, and while they are running, let me tell you what's happening: the benchmarks are firing a bunch of load against the binary built with PGO, and they are storing the results inside this file called withpgo.txt. So right now we are doing the run with PGO; once these benchmarks are done, we will do the same without PGO, and then we can use a tool called benchstat to compare the results. Alright, let's wait a few seconds, this should be done anytime soon. And note there's nothing running in the second terminal now; the load we generated previously was only for the sake of generating profiles, nothing else. Right now we are benchmarking, so we are doing it in a different manner. And as you can see, the benchmarks have executed.
Now let's close this binary and run the one without PGO. And as we run the benchmarks again, this time we will snapshot them into a separate file called nopgo.txt. As I said, we keep these files separate and then run benchstat against them to compare the real numbers, how things actually happen at runtime. They might be slightly different from what I showed in the slides, because the load is kind of non-deterministic in that manner. But yeah, let's see how it goes. It should be done anytime soon; nothing else is running here, and the no-PGO binary is serving the benchmarks.
Yep, it's done. Now, if you run benchstat nopgo.txt withpgo.txt, you can see what a beautiful difference it is. With no PGO, the average time was 309 microseconds; with PGO it was 301. That's an even more significant difference with PGO, 2.33%, compared to the slides. So let's go back to the slides and make some concluding notes.
So that's pretty much it. I guess we learned quite a lot and saw how profile-guided optimization can really benefit us. We explored the process of compilation, how compilation can be made more effective in a feedback-driven way by feeding it some runtime data, and how sample-based profiles make profile-guided optimization work even more effectively in our favor. And we got a very practical perspective by getting our hands dirty, playing with profile-guided optimization on a server-like piece of code resembling a real-life scenario. You can find these slides and all of the associated content at the link below, under my GitHub profile, and let me share the references as well. There's a bunch of references I used to learn about this topic and to put together the content of these slides; of course, I just stood on the shoulders of the giants who implemented all of this cool stuff. Finally, feel free to connect with me; all of my handles are given here. And that's all, folks. Thanks a lot for your time, I appreciate your time and attention in attending this talk, and feel absolutely free to raise any questions or reach out to me whenever. Hope you have a great day.