Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome to Squeezing a Go Function. Today I'm going to talk about
how to squeeze the best performance out of your Go functions.
First of all, I want to talk about what optimizing is. If I ask
you what the best orange juice is, probably most of you will say that
fresh-squeezed orange juice is the best one, but it depends on what you want.
If you want a really good taste, fresh-squeezed orange juice is probably
the best option. But if you want something that is convenient,
maybe a carton of orange juice is great. If you
want something that has to last forever, probably a
tank will be a better option. So optimizing
is about that. Optimizing is about finding the right option for
you. Sometimes it's being faster, sometimes it's consuming
less memory, sometimes it's doing less I/O.
It's not necessarily just being faster or just consuming
less CPU. So take that into consideration whenever
you are optimizing.
Another important thing is to optimize at the right level.
You can buy an F1 car, but if this is your road, you are screwed.
There's no way you are going to reach 300 kilometers per hour on that
road. So probably you need to optimize your infrastructure first,
probably your architecture first, until you reach the point where you
need to do micro-optimizations at the Go level. Go is
a really fast programming language: the runtime is really efficient and
the compiler is really good. So probably your bottleneck is not going to be
in that Go function. Probably your bottleneck
is in the architecture, in the I/O,
maybe in some SQL queries or things like that.
Normally you don't need to optimize at the Go level.
Anyway, sometimes you have good reasons to
optimize at the Go function level.
But first of all, analyze all the other stuff first,
and optimize what you need when you need it.
If you think about optimizing a Go function, you can optimize it
a lot. You can optimize it to a completely absurd level. But the reality is
you need to know when you need to optimize that function
and when you need to stop. Sometimes you are looking for the bottleneck,
the bottleneck is in that function, and you get obsessed with optimizing
that function to the best possible level. But sometimes you
reach the point where the bottleneck is no longer there.
So if you keep over-optimizing that function,
you end up wasting time on something that is not needed.
You should just measure where your bottleneck is, fix that bottleneck,
and look for the next bottleneck to fix, not get obsessed with one
specific function.
Do not guess. One of the most important lessons about
optimizing is that guessing is not good. We are not good at guessing.
Nobody is good at guessing in these things.
Think about estimations in software: nobody is good at estimations,
nobody is good at guessing where a performance penalty is going to be,
or whether an optimization is going to work well.
This is because these systems are really complex.
You have to think about a lot of stuff:
I/O, hard drives, spinning disks, the CPU,
other processes working on that same CPU,
context switches between threads,
a lot of crazy stuff that the operating system is doing all the time,
a lot of crazy stuff that other processes can be doing.
There are so many variables, so many things
going on in your computer at the same time, that guessing is not
going to help you at all. So the
rule is basically: measure everything. If you
want to optimize, the first thing that you need to do is measure.
You need to know how much time, or how much memory, or how much of
whatever resource you want to optimize your program is consuming,
and from there you can start optimizing.
Well, how do you measure things?
The most common way of measuring things when you are doing
micro-optimizations of functions in Go
is writing benchmarks. A benchmark is just part
of the Go standard library. It's part of
the testing framework and allows you to write
these test-like functions
that measure how much time something is consuming.
For example, here I'm doing two benchmarks, one for MD5 and one for
SHA-256 hashes.
So I'm basically running a benchmark and seeing
which function is consuming more time.
If I execute that, I'm going to see that MD5 is twice as fast:
I can basically do double the MD5 hashes
compared to SHA-256 hashes in the
same time. So that's one thing.
Also, I can ask the benchmark to report the allocations:
how many allocations are all these hash functions doing?
If we look there, in this case, the hashes are very well implemented
and they have zero allocations. So they are really
fast and they don't consume
any heap memory.
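As a rough sketch, that pair of benchmarks can look something like this (the input data and its size are my own assumptions, not the exact code from the talk):

```go
package hash_test

import (
	"crypto/md5"
	"crypto/sha256"
	"testing"
)

var data = make([]byte, 1024) // hypothetical input

func BenchmarkMD5(b *testing.B) {
	b.ReportAllocs() // also report allocations per operation
	for i := 0; i < b.N; i++ {
		md5.Sum(data)
	}
}

func BenchmarkSHA256(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sha256.Sum256(data)
	}
}
```

Running them with go test -bench=. prints the nanoseconds per operation for each; passing -benchmem, or calling b.ReportAllocs as above, adds bytes and allocations per operation.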
Well, let's see another example: opening a file.
Whenever I open a file, in this case, it's taking 120
bytes per operation and it's doing three allocations.
So that's an example of something that is really doing allocations.
Later we will see why allocations are important.
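A minimal sketch of that kind of benchmark (the file path is a placeholder):

```go
package file_test

import (
	"os"
	"testing"
)

func BenchmarkOpenFile(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		f, err := os.Open("testdata/sample.txt") // hypothetical file
		if err != nil {
			b.Fatal(err)
		}
		f.Close()
	}
}
```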
Another way of getting information about
performance is profiling. Normally you want
to know which part of your system is the
bottleneck, and profiling is for that.
So here is the example of our benchmark from before,
and I can run it, but in this case I'm asking for a
memory profile. So I'm asking: hey, give me a memory profile,
and then I can use go tool pprof to
get the information about where my memory is being consumed.
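For reference, the usual invocation with the standard tooling looks something like this (a sketch, not the exact command from the talk):

```
go test -bench=BenchmarkOpenFile -memprofile=mem.out
go tool pprof mem.out
```

Inside the pprof prompt, commands like top, list and web show where the memory is going; -cpuprofile=cpu.out works the same way for CPU profiles.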
In this case, I can see there that 75 megabytes in
this benchmark are consumed by the os.newFile function.
I can also see that in an SVG format, to look at it in a nicer way,
and I clearly see there is a big red block saying: hey,
most of the memory consumed by your benchmark, by your binary in this case,
is because of the os.newFile function.
I can also see exactly the line where that's
happening: it's wherever I create this File struct.
I can also check the CPU profile. Again,
this one is more distributed across different places, but 30% of the time
is spent in the Syscall6 function.
And if I look at that in a graph, I see that Syscall6
is consuming 30%, there is 14% in epollwait
and 22% in epollctl.
Anyway, let's take a look. This is our Syscall6.
It is basically the function that makes syscalls to
the operating system, and it's written in
Go assembly. So probably you are not going to optimize
this, but you can dig into where that is happening and optimize
that. This one is from the standard library, so it's probably really
well optimized already. But you can use the same tools
to explore your own source code. That is the interesting part.
Okay, well, I'm going to show
some examples of small optimizations.
I want you to think about this as being more about the process.
I want you to learn things along the way, about the
different micro-optimizations that I will be using here,
but I think the most interesting part is the process: how you create the
benchmark, how you run it, how you optimize the code,
run the benchmark again and look at the results. And apart
from the benchmark, you should have some
unit tests to verify that the code stays correct.
But here I'm only showing the benchmarks and
the micro-optimizations.
One of the simplest optimizations for reducing CPU usage is basically
doing less. For example, I have here a find
function that takes a needle and a haystack; it
loops over the haystack, and when it finds the needle
it stores the index as the result, and at the end it returns that
result. This is a very naive way to
do this, but anyway, it's clear that you can optimize
it. Let's start with a benchmark. I create
a ton of data for a haystack, and I do
a test searching for a value that is a bit past
the half of the haystack.
And if I run that, it's taking 289 nanoseconds per operation.
Okay, sounds fast. Let's see if we can optimize it.
If we just return early, returning the result as soon as we find it,
because I don't need to keep searching if I already found the needle
in the haystack, and run the benchmark again,
it's going to be 172 nanoseconds per operation. That is a
bit more than half, because in the benchmark we are looking for a value
a bit past the half of the haystack. So that makes sense.
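A minimal sketch of the two versions and a benchmark, assuming this sits in a _test.go file; the haystack size and the needle position are placeholders, not the exact values from the talk:

```go
// find keeps scanning the whole haystack even after the needle is found.
func find(needle int, haystack []int) int {
	result := -1
	for i, v := range haystack {
		if v == needle {
			result = i
		}
	}
	return result
}

// findEarly returns as soon as the needle is found.
func findEarly(needle int, haystack []int) int {
	for i, v := range haystack {
		if v == needle {
			return i
		}
	}
	return -1
}

func BenchmarkFind(b *testing.B) {
	haystack := make([]int, 1000)
	for i := range haystack {
		haystack[i] = i
	}
	needle := 600 // a bit past the middle of the haystack
	for i := 0; i < b.N; i++ {
		find(needle, haystack) // swap in findEarly to compare
	}
}
```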
Basically what we are doing here is just being a bit smarter
and trying to not keep executing code if it's
not needed. Okay. Another interesting
thing is reducing allocations. Let's see
here. If I have a slice, and I initialize that slice and start
adding integers to it, in this case a million integers,
I'm going to benchmark that and see what happens when I run it.
And the thing is, each operation is doing 39 allocations,
and if you look there, it's taking 10 million nanoseconds;
that is like ten milliseconds per operation.
So let's see. I can initialize that. I can say:
hey, I know that I'm going to have 1 million integers there,
so I'm going to create that slice of 1 million elements upfront.
If I do that, now we are talking about less
than a millisecond per execution and
only one allocation. So that's a huge improvement.
We are reducing the time consumed by the function by one order of
magnitude. But we can do better. In this
case we know upfront, at compilation time, that
we are going to have a million elements there.
So instead of creating a slice with make
and giving it the initial size, I'm going to create an array directly.
In this case, it's now more
or less a third of what we had before, and zero allocations.
We are not allocating anything in the heap. That doesn't mean
it's not in memory; it means it's in the stack, not in the heap.
That's the idea.
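As a rough sketch, the three variants look something like this (the names and sizes are my own, not necessarily the ones from the talk):

```go
const n = 1_000_000

// Growing the slice from nil: append has to reallocate and copy repeatedly.
func fillAppend() []int {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// Preallocating with make: a single allocation of the right size.
func fillPrealloc() []int {
	s := make([]int, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// Using a fixed-size array: the size is known at compile time, so as long as
// the array does not escape, it can live on the stack with no heap allocation.
// Note the array is returned by value, i.e. as a copy.
func fillArray() [n]int {
	var a [n]int
	for i := 0; i < n; i++ {
		a[i] = i
	}
	return a
}
```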
Let's see another kind of optimization. Let's talk about
packing.
Well, whenever you have a struct,
the compiler is going to add padding to your fields to
keep them aligned for the processor,
basically to make things more efficient for the CPU.
In this case we have a struct that has a boolean, a float
and an integer, and the compiler is going to add bytes in between
those fields. So if we create a slice
of 1 million elements of that struct
and benchmark it, we are going to see that we
are consuming 24 megabytes per operation.
We get one allocation in this case, which is kind of fast, but still
24 megabytes per operation.
If we go here and just change the order of the fields in that struct,
the compiler is going to be better at generating that padding. It's going
to say: okay, the float is already aligned,
the int32 is already aligned, and the
bool is not aligned, but I'm going to align it with three
extra bytes, because I can pack it together with the previous int and all
that stuff. Before, we had 24 bytes
per instance of the struct, and now we have 16
bytes per instance. If we run
the benchmark there, it's still one allocation, but we are saving eight megabytes
of memory in the heap.
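A sketch of the two layouts (field names are hypothetical; the sizes are what unsafe.Sizeof reports on a typical 64-bit platform):

```go
// 1-byte bool + 7 bytes padding + 8-byte float64 + 4-byte int32 + 4 bytes padding = 24 bytes.
type Padded struct {
	A bool
	B float64
	C int32
}

// 8-byte float64 + 4-byte int32 + 1-byte bool + 3 bytes padding = 16 bytes.
type Packed struct {
	B float64
	C int32
	A bool
}
```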
This can sound silly, but it's
important. Even if you don't care
that much about memory, you can see that one version is
consuming about two milliseconds and
the other about 2.6 milliseconds.
So it's not only affecting the memory usage, it's also affecting the CPU
time. So another interesting
thing is function inlining.
I'm going to explain why inlining is important, but here
is an example of a function that is not inlined.
The compiler takes care of inlining, and it decides
by analyzing the complexity of the
function: if the function is simple enough, it gets
inlined automatically. In this case it's not
inlined because I explicitly asked the compiler not to inline
it. So here you have the not-inlined
function and the inlined function.
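Roughly, the pair looks like this (the //go:noinline directive is the standard way to keep the compiler from inlining a function; the function bodies are made-up examples):

```go
//go:noinline
func addNoInline(a, b int) int {
	return a + b
}

func addInline(a, b int) int {
	return a + b
}
```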
If I run a benchmark here, we are saving a whole nanosecond.
That, to be honest, is not important at all;
if you are fighting for nanoseconds you are in a very interesting
field, but normally nobody cares about that nanosecond.
So why is inlining important? We are going to see why in a minute, but first
let's talk about escape analysis. Escape analysis is
something the compiler does to decide whether something goes to
the heap or can stay on the stack.
For example, in this case, this function is not inlined.
We have a value, and I return a pointer to that value,
so the value is going to escape to the heap, because I'm returning
a pointer. val cannot live only in the stack of the function, because I have
to return it and it's going to be accessed from the outside.
So it needs to be stored somewhere that is not the stack of
the function; it needs to go to the heap.
For that reason, we now have one allocation there, one heap allocation.
This is the non-escaping function.
I take the value, the integer value, and I return it
as an integer. I'm not returning a pointer, I'm returning the whole
value, a copy of the value that is stored in the stack,
so it does not require anything in the heap.
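A minimal sketch of the two functions (the names and the //go:noinline directives are my own, to keep the example from being inlined away):

```go
//go:noinline
func escape() *int {
	val := 42
	return &val // returning a pointer forces val to be heap allocated
}

//go:noinline
func noEscape() int {
	val := 42
	return val // returning a copy: val can stay on the stack
}
```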
So let's see what the impact of that is. With the escaping function,
where we need to store something in the heap,
we can see that it's consuming eight bytes in the heap and doing
one allocation per operation, and it's taking 13 nanoseconds
per operation, while the non-escaping function
is taking only about 1.5 nanoseconds.
So that means we are almost an order of magnitude
faster with the non-escaping function, because we don't
need to touch the heap, and the heap is more expensive
than the stack. So this is important. But there is something else related
to escape analysis that matters even more: escape analysis plus inlining.
If you combine escape analysis and inlining, you get something
very interesting. For example, here we have a document that has
a path, and I have a function
where I create a new document with that path. NewDocument
is just creating the document, doing certain stuff,
and returning a pointer to the document.
Easy peasy, right? If I run that,
I get three allocations in
this function. Okay, sounds good, that's fine.
But if I change that and say: okay,
now I create a new document, and then
I call an Init function. NewDocument is a super simple function
that has almost zero complexity; it just returns a pointer
to an empty version of the document. And then I have an
Init function, and that Init function takes
care of the initialization.
Doing this, what I'm ensuring is that NewDocument gets inlined,
and because NewDocument is inlined, the d value
doesn't need to go to the heap; it goes directly into
the stack of my function. My function has its own stack and
stores that document directly there, so it never needs to go to the heap.
Then whenever I call Init, I pass a pointer that still points into
the stack, and it works well. This way I'm saving allocations.
If I look here now, instead of having three allocations,
I have two. Instead of consuming 56
bytes per operation, I consume 32, and instead of 128
nanoseconds per operation, I'm only consuming
73. So this small
optimization in the constructors can save a lot of allocations. Basically the
lesson learned here is: try to make your
constructors really simple, and if you need a complex
initialization, do that complex initialization in
a secondary method, something like an Init method.
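A sketch of the two constructor styles (the Document fields, names and initialization details are placeholders, not the code from the talk):

```go
type Document struct {
	Path string
}

// Heavier constructor: doing real work here can make it too complex to
// inline, so the document it returns is more likely to end up on the heap.
func NewDocumentHeavy(path string) *Document {
	d := &Document{}
	d.Path = path
	// ... more initialization work ...
	return d
}

// Tiny constructor: trivially inlinable, so the compiler can keep the
// document on the caller's stack when it does not escape.
func NewDocument() *Document {
	return &Document{}
}

// Init does the complex initialization as a separate step.
func (d *Document) Init(path string) {
	d.Path = path
	// ... more initialization work ...
}
```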
Okay, let's talk about concurrency. We all love concurrency
because in Go it's so easy and so cool to
do, but it has some important
implications for performance. Here we
have a fake I/O function
that simulates some I/O, some blocking
process; in this case I'm just doing a
time.Sleep. And I have
fakeIOParallel, which executes that in goroutines,
with the number of goroutines passed as a parameter.
I'm running three benchmarks: one where I do the parallel I/O
with a hundred goroutines, one with parallel I/O based on
the number of CPUs, and one with serial I/O. If we run the
benchmark on this, we are going to see that the
parallel one with one goroutine per job has way more allocations,
but the number of nanoseconds per operation is the smallest.
So we are trading memory and allocations for CPU time here.
With one goroutine per CPU you have fewer allocations and a more or
less acceptable time: it's slower, about ten
times slower, but it's way faster than executing
it serially. And the serial way does only one allocation,
but it takes a lot of time, multiple orders
of magnitude slower than the per-CPU version.
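A sketch of that kind of parallel runner (this is my own minimal version, assuming jobs are plain closures; the real code from the talk may differ):

```go
package concurrency

import "sync"

// runParallel executes n jobs across the given number of worker goroutines:
// workers == n is one goroutine per job, workers == runtime.NumCPU() is one
// goroutine per CPU, and workers == 1 is effectively serial.
func runParallel(n, workers int, job func()) {
	var wg sync.WaitGroup
	jobs := make(chan struct{})

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				job()
			}
		}()
	}

	for i := 0; i < n; i++ {
		jobs <- struct{}{}
	}
	close(jobs)
	wg.Wait()
}
```

With a helper like this, the benchmarks can compare something like runParallel(jobs, jobs, fakeIO), runParallel(jobs, runtime.NumCPU(), fakeIO) and runParallel(jobs, 1, fakeIO).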
So let's do another example.
Let's play with the CPU. I create a fake CPU function
that is going to consume CPU; basically it's doing MD5
sums. Again, a method to run that in parallel, and now I'm running it
in parallel with 1000 goroutines, with one goroutine per CPU, and serially.
The interesting thing here is that one
goroutine per job is going to be the slowest option here;
it consumes the most memory and the most CPU. Everything is wrong
with trying to run CPU-bound workloads on an
unbounded number of goroutines; it's completely inefficient.
If I use one goroutine per CPU, that gives me the best result in this
case: you can see there that it's taking 65
microseconds and only 25 allocations. The serial one
is not doing allocations at all, but it takes
way more time to execute, basically
double the time of the per-CPU parallel version.
So that's it. This is more or less the idea of how
you optimize a Go function.
There are some tricks that I shared, but I think
the most interesting part is the whole flow:
creating a benchmark, trying to optimize the code, measuring
whether your change does something interesting
or not, and trying to reduce allocations.
Reducing allocations is not only going to reduce the amount of
memory consumed and the CPU spent
getting that memory allocated; it's also going to
reduce the pressure on the garbage collector. The garbage
collector needs to collect every allocation that you make:
whenever you finish with that allocation, the garbage collector
has to collect it. So every allocation adds
pressure to the garbage collector, and reducing
allocations has an extra impact
on your program's execution.
So, some references. There's a very interesting
book called Efficient Go by Bartłomiej Płotka.
I really recommend it; it
has a very methodical approach to
testing, benchmarking and optimizing, so it's a really interesting
book. The High Performance Go workshop from Dave Cheney is
also very interesting and a very practical take on
optimizing. There is the go-perfbook by Damian Gryski.
And the Ultimate Go course from Ardan Labs
is a really good one as well,
for learning a bit more about mechanical sympathy
and optimizations.
And well, all the images here are Creative Commons,
so here are the references to
the authors. And that's
it. Thank you.