Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everybody, it's Marko here. This is my first time as a speaker and I'm glad to be here at Conf42 to bring some goodness in Go. So let me introduce myself. I am a software engineer at AIon and a master's degree student in AI at the University of Pisa, and I'm into everything related to cloud native and Kubernetes. I also participated in the CLOSER 2023 cloud conference, where fortunately I won the best paper award; the work I presented there was about how to resolve smells in a Kubernetes microservices environment. So let's go ahead: I'm going to introduce some topics that you might already know, that the more expert gophers might already know, but I think they are still interesting to cover.
As a first topic, I want to go deep down into the Go runtime scheduler and the memory model and how they impact the performance of Go applications; then how to measure your applications and find bottlenecks, if there are any; and then some best practices. But first things first: Go is a fast language, but why? What makes Go stand out in the gorgeous realm of programming languages when it comes to performance? To answer that question, we need to look under the hood and examine two fundamental aspects of Go, that is, the memory model and the runtime scheduler.
But let's switch from Go for a little bit to Java. Some of you already know that Java uses the native threads of the operating system, right? That means that every Java thread is mapped to one kernel thread. In this way, Java cannot determine which thread will occupy the core: this is completely up to the OS scheduler, so it's completely dependent on how many threads you have. A problem could be, for example: if I am executing a Java thread inside a certain OS kernel thread, I save the state, and then the Java thread is scheduled onto another OS thread. Then I would suffer from context switching. And we will see an example. Let's do a Java example here. I have a function that does something, as we can see, for x times, and we have a number of threads that are executing this function that number of times. Now, if I run that code with 100 threads, and then with 1000 threads, we have different results, right? You can see that in the picture.
Basically, when the number of threads is set to 100, about 51% of the CPU time is spent in the doSomething function, so our real function. But when you increase the number of threads to 1000, the CPU time spent in the actual function goes down to about 27%. All of these metrics are basically telling us that the cost of threads in the Java threading model suffers in highly concurrent scenarios. In standard operating systems, threads are scheduled after a certain amount of time: when a hardware timer interrupts the processor, the OS kernel suspends the currently executing thread, saves its state in the registers so that when it has to resume it, it doesn't lose anything, and then it finds, among all the threads available for execution, the next one to run. As I said, this is called the context switching process. And this process is kind of slow, also because, as I said a couple of slides before, there could be a cache miss. So this is the main reason why the Go creators built the Go runtime scheduler.
Okay, so Go doesn't totally rely on the OS scheduler, but has its own runtime scheduler. And it uses a threading model called the M:N threading model, where basically M goroutines are scheduled onto N OS threads.
So as we can see here, there are, let's say, interfaces between the goroutines and the actual kernel threads. The kernel threads in this picture are the white triangles, and the goroutines are the other shapes: a goroutine can be green, which means it is actually running, or red, which means it is waiting in the queue. But the yellow boxes act as an interface, and those yellow boxes are the contexts. They are fundamental. Once a context has run a goroutine until a scheduling point, it pops a goroutine off the queue, sets the stack and instruction pointer, and begins running the goroutine. And what can happen is that a yellow box, in this case a context, runs out of goroutines, and automatically, without calling any kind of interrupt, it steals work, it steals goroutines, from other contexts. This makes sure that there is always work to do on each of the contexts, which in turn makes sure that all the threads are working at their maximum capacity.
So, taking advantage of this runtime scheduler, Go was built upon the CSP model. CSP stands for communicating sequential processes: basically we have goroutines, and these goroutines can communicate with each other via channels, which can be buffered or unbuffered. The cool thing is that we, as developers, don't have to care about accessing shared data, so we don't have to set up mutexes or semaphores or locks or queues, sync variables and other things like that.
This is really cool, and it is something about Go that I really appreciate, because I love writing pipelines, for example. For pipelines this kind of model is really nice, also because the way two goroutines, since they are not threads, communicate is really, really fast.
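To make this concrete, here is a tiny sketch (not from the talk) of two goroutines communicating over a channel, with no locks or shared-memory bookkeeping:

```go
package main

import "fmt"

func main() {
	// An unbuffered channel: the sender blocks until the receiver is ready.
	messages := make(chan string)

	// Producer goroutine: sends two values and closes the channel when done.
	go func() {
		defer close(messages)
		messages <- "hello"
		messages <- "gophers"
	}()

	// The main goroutine simply ranges over the channel; no mutexes needed.
	for msg := range messages {
		fmt.Println(msg)
	}
}
```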
Let's go back to Java for a comparison. In this code here we are doing a sleep of ten minutes, and we are using 1000 native Java threads, right? At the end of the execution we can see that there are actually 1000 threads plus 18, and that 18 is the number of threads occupied for handling the JVM. The real difference we see when we execute the equivalent Go code is especially in the number of threads used, because the same code in Go actually needs only two threads. And this is really cool.
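As a minimal sketch of the Go side of that comparison (the exact code from the slide is not shown here), 1000 sleeping goroutines do not need 1000 OS threads, because the runtime multiplexes them:

```go
package main

import (
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// 1000 sleeping goroutines, but only a handful of OS threads:
			// the runtime scheduler multiplexes them for us.
			time.Sleep(10 * time.Minute)
		}()
	}
	wg.Wait()
}
```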
We said a couple of slides before that goroutines save their state, right? So Go saves the goroutine state. But let's compare how OS threads save their state and how goroutines save theirs. OS threads have a fixed-size stack for saving the state. This is kind of a problem, because it is a huge waste of space if we imagine a single goroutine, but at the same time it can be too strict for the hundreds of thousands of goroutines that may be created, which is not a rare event in a Go-based application. Therefore the Go memory model works in another way, mainly because we want to have a large number of goroutines. In contrast to the approach we saw before, Go creates a very small stack for each goroutine, around 2 KB. And the surprising thing is that each of these stacks grows and shrinks as needed. This is possible because these stacks can borrow memory from the heap. The fact that these stacks are dynamic allows us to have better memory management, and the memory overhead is reduced when context switching happens, because of the small state we have to save for each goroutine.
So this was a sort of introductory topic for the new gophers. But now let's assume that you know these concepts very well and you have started to take all the goodness of Go into your applications. So now, how can we measure the performance of your Go applications in a systematic way?
Before starting to explain how to benchmark the applications, I want to state some preconditions. The first precondition is that every time we run our benchmarks, we want to always keep the same environment: we don't want to be affected by the external environment. Another thing that is crucial to do is to isolate the code being benchmarked from the rest of the program.
So how do we write a benchmark? This is a practical slide. We create a _test.go file where we put all the benchmark functions, and a benchmark function has a specific signature: we have to specify the b parameter, which has type *testing.B. As we will see later on, b.N represents the number of iterations that the Go runtime dynamically decides to run for your function. So you don't have to touch that variable at all; it's totally up to the Go runtime to decide how many times to iterate your benchmark.
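As a minimal sketch of that signature (the package name, function names and data here are assumptions, not the talk's code):

```go
package pipeline_test

import (
	"strings"
	"testing"
)

// runPipeline is a stand-in for the function under test
// (runPipeline1 in the talk); it is only here to make the sketch runnable.
func runPipeline(in []string) []string {
	out := make([]string, 0, len(in))
	for _, s := range in {
		out = append(out, strings.ToLower(s))
	}
	return out
}

// Benchmark functions live in a *_test.go file, start with "Benchmark",
// and take a single *testing.B parameter.
func BenchmarkPipeline(b *testing.B) {
	data := []string{"Hello", "Conf42", "Go"}
	for i := 0; i < b.N; i++ { // b.N is chosen by the Go runtime, not by us
		runPipeline(data)
	}
}
```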
So let's take two functions. Here we have two pipelines, and these two pipelines have the same structure: there is a producer that produces some strings; then there are some stages where basically we take all the strings and lower-case them; then we merge the results from all the goroutines and send them concurrently into the other stage, which takes each string and capitalizes the first character; and then we merge all the results. So they have the same structure, but they are different under the hood.
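A rough sketch of that shape, simplified to one goroutine per stage and with hypothetical stage names (the talk's runPipeline1 and runPipeline2 differ under the hood in how the lower-casing stage is implemented):

```go
package main

import (
	"fmt"
	"strings"
)

// produce emits the input strings on a channel and closes it when done.
func produce(words []string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, w := range words {
			out <- w
		}
	}()
	return out
}

// toLowerStage reads from in, lower-cases each string, and forwards it.
func toLowerStage(in <-chan string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for s := range in {
			out <- strings.ToLower(s)
		}
	}()
	return out
}

// capitalizeStage upper-cases the first character of each string.
func capitalizeStage(in <-chan string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for s := range in {
			if s == "" {
				out <- s
				continue
			}
			out <- strings.ToUpper(s[:1]) + s[1:] // ASCII-only capitalization, for brevity
		}
	}()
	return out
}

func main() {
	for s := range capitalizeStage(toLowerStage(produce([]string{"HELLO", "conf42", "GOPHERS"}))) {
		fmt.Println(s)
	}
}
```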
So we want to benchmark these functions, and we can now do a smart thing, since we want to use a tool called benchstat. What benchstat does is compare two benchmark results. But in order to compare these benchmarks, the benchmark function must be the same. So a good strategy could be to first create your benchmark calling the function runPipeline1, produce the benchmark output as we do in the second image, and save that output as before.bench. Then we can go back to the same benchmark function and change the inner function: instead of runPipeline1 we call runPipeline2, as shown in the first image. So now we have two benchmark results.
But before running benchstat and comparing them, we can open one of them, for example the before.bench file. So how do we read a benchmark? We have four values.
And let me switch back, because an important thing is to look at those flags here. The -run=X flag matters because, without it, the benchmarking engine says: OK, I see the runPipeline function in the benchmark function, so I also need to run all the tests related to it, and we don't want that. With -run=X you avoid that, because no test name matches X. Another important flag is -benchmem, which allows you to keep track of the memory used by your functions.
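A rough sketch of that workflow; the benchmark pattern and the file names are assumptions based on what is described in the talk:

```sh
# First run: the benchmark body calls runPipeline1.
go test -run=X -bench=Pipeline -benchmem > before.bench

# Edit the benchmark to call runPipeline2 instead, then run again.
go test -run=X -bench=Pipeline -benchmem > after.bench

# Compare the two result files.
benchstat before.bench after.bench
```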
Now we can go ahead: with that command we get this result.
Okay, so we can see in the red circle that there is the number of iterations, so the b.N that we mentioned before. And the blue circle is one of the most important, because it tells you on average how fast your function is in terms of nanoseconds per operation, the number of bytes per operation, and the number of allocations per operation. The last two values come from the fact that we used the -benchmem flag.
So let's go ahead and use benchstat, which is a really great tool to compare two benchmarks. As we can see, before.bench was running runPipeline1 and after.bench was running runPipeline2, and we can see three rows: the first row is the speed, the second row is how much memory your function used, and the third row is how many allocations your function made during the benchmark. We can see that runPipeline2 is faster by about 40%, it uses about 40% less memory, and it makes almost 86% fewer allocations than runPipeline1 does.
So this is kind of a systematic way to measure your applications. But now you may be wondering why the runPipeline1 function is slow, since the structure is the same. Here comes profiling. Profiling is the process of keeping track of all the inner functions that your main function runs, and it allows you to track the CPU and memory usage of all instructions.
For doing that, we use pprof. pprof is a tool for visualizing profiling data; it's available for free, of course, as a Go tool, and it's based on protocol buffers.
So in order to exploit pprof, we now want to add two more flags, -cpuprofile and -memprofile, and we write the results of the profiling, one for the CPU and one for the memory, into two different .prof files, which are basically protobuf files.
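Roughly, the profiling run could look like this; the benchmark pattern and the file names are assumptions (cpu1.prof follows the name used later in the talk):

```sh
# Write CPU and memory profiles while running the benchmarks.
go test -run=X -bench=Pipeline -benchmem \
  -cpuprofile cpu1.prof -memprofile mem1.prof
```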
Okay, now what if we run pprof? As we can see in the line above, there is go tool pprof cpu1.prof. If we go back, unlike before, there is no 1 at the end of the -bench flag: that means that all the benchmark functions starting with that prefix run. In that case I basically ran both the benchmark of pipeline one and the benchmark of pipeline two, so that we can compare the two functions using the profiling.
Okay, so using this command I'm inside pprof, and I can run the top100 command. The top100 command gives me the 100 most CPU-expensive functions.
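A rough sketch of that interactive session (the function name passed to list, which we will use in a moment, is an assumption; list takes a regex over function names):

```sh
go tool pprof cpu1.prof
# Inside the interactive prompt:
#   (pprof) top100                  # the 100 entries where most CPU time was spent
#   (pprof) list transformToLower   # annotated source for a single function
```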
And I can see that, for example, there is a sleep, and we have the strings.ToLower function. Another important thing is that at the top of the list there are basically only stages that come from the runPipeline1 function, and we're not seeing any function from runPipeline2. That's because, of course, as we saw before with the benchmark, runPipeline1 actually runs slower, and this list is in descending order of how much time has been spent.
So we are a little bit suspicious here. And instead, if we scroll down, we can see our fast stages.
So we are a bit suspicious here. What we can do is dive into the code of runPipeline1. So we run pprof again and we can use the list command of pprof, where you can say which function you want to analyze. In this case, I want to analyze the first stage of runPipeline1, which is the transform-to-lower stage. And if I go inside, apart from the code above, we have to focus on the red box, and it seems that we are losing time here, because we see that we are doing a loop where we lower the string char by char and then return the result. This is kind of a problem, because we already know that there is a function coming from the strings package that lowers our strings in a faster way.
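A sketch of the difference (not the talk's exact code): lowering rune by rune inside our own loop versus calling strings.ToLower once, as the faster stage does.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// toLowerManual lowers the string one rune at a time, the kind of
// char-by-char loop the profile pointed at in runPipeline1.
func toLowerManual(s string) string {
	var b strings.Builder
	for _, r := range s {
		b.WriteRune(unicode.ToLower(r))
	}
	return b.String()
}

// toLowerFast just delegates to the standard library, as the faster stage does.
func toLowerFast(s string) string {
	return strings.ToLower(s)
}

func main() {
	fmt.Println(toLowerManual("GOPHERS"), toLowerFast("GOPHERS"))
}
```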
Okay, so let's analyze the faster one. As we can see, we have the strings.ToLower function coming from the standard library, and it takes 26 seconds. So overall we saved almost 5 seconds, because in the slower one we have to take into account this amount of time, 17 seconds, but we also have to take into account the time spent sending the data to the channel, which is almost 13 seconds. So here's where the difference is: it's 13 plus 17, rather than 26 overall. Okay, we found out why our function is slower. But so far we have only analyzed the CPU profile.
We also produced a memory profile, and as you can see, in the slower function we have a waste of memory and an enormous quantity of allocations.
So of course this is what I wanted to show you: a systematic way to find out where your program is slow and where your program is fast.
But before concluding, a suggestion: to be completely accurate, any benchmark should be careful to avoid the optimizations that the compiler does. For example, if we don't save the result of the pipeline, it sometimes happens that the compiler eliminates the function under test, which artificially lowers the runtime of the benchmark, and we don't want that.
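A common sketch for this, reusing the hypothetical runPipeline stand-in from the earlier benchmark sketch: store the result where the compiler can see it escape, so the call cannot be dropped as dead code.

```go
// Package-level sink; assigning to it keeps the compiler from treating
// the benchmarked call as dead code.
var sink []string

func BenchmarkPipelineKept(b *testing.B) {
	data := []string{"Hello", "Conf42", "Go"}
	var out []string
	for i := 0; i < b.N; i++ {
		out = runPipeline(data) // actually use the return value every iteration
	}
	sink = out // publish the result so the work cannot be optimized away
}
```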
When utilizing the profiler, it's important to consider that it samples both CPU usage and memory at a specified frequency. However, this sampling may not always be 100% representative, particularly if the sampling time is very low. So to enhance accuracy, I recommend increasing the -benchtime parameter accordingly. In this way you allow the profiler to collect more samples over an extended period, and then we can obtain more precise insights into how our application performs.
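For instance, something along these lines (the duration is just an illustrative value):

```sh
# Let each benchmark run for about 10 seconds instead of the default 1s,
# so the profiler can collect more samples.
go test -run=X -bench=Pipeline -benchtime=10s -cpuprofile cpu.prof
```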
So of course, try to design your applications as a pipeline of goroutines. Always keep track of the memory usage, as it can cause garbage collection to run, and that means we are potentially wasting time. And of course, as I said in the preconditions, try to execute benchmarks on a stable machine without spikes during the test.
So here I left some study references that I found really interesting: how goroutines work, how the memory model and the scheduling work, and how you can use benchmarking and profiling to improve your functions' performance.
So that's it. Thank you for your attention. Hopefully this was an important moment for you to learn something more about Go. I'm really excited to hear from you in the comments, so that I can improve my speaking skills, as this is my first time speaking in public. So, speaking from the bottom of my heart, really, thanks. And I hope that this is the first of an infinite number of talks. Okay, so thank you guys and have a great day.