Conf42 Golang 2023 - Online

Squeezing a Go function

Abstract

Performance is great, right? But sometimes, knowing what is consuming my memory or CPU is hard. Sometimes it is obvious, but sometimes it is subtle. In this talk, I want to introduce you to some examples of micro-optimizations and the reasons behind them.

Summary

  • Today we talk about how to squeeze the performance out of your Go functions. Optimizing is about finding the right option for you. One of the most important lessons about optimizing is that guessing is no good: measure everything.
  • One of the simplest optimizations for reducing CPU usage is basically doing less. Another interesting thing is reducing allocations. We also talk about packing. I want you to learn about the different micro-optimizations used here.
  • Another interesting thing is function inlining. The compiler takes care of inlining by analyzing the complexity of the function: if the function is simple enough, it inlines it automatically. If I run a benchmark here, we are saving a whole nanosecond, which is not important at all.
  • Escape analysis is something the compiler does to decide whether a value goes to the heap or the stack. If you combine escape analysis and inlining, you get something very interesting: this small optimization in constructors can save a lot of allocations.
  • One goroutine per job is going to be the slowest option here. Everything is wrong with trying to run CPU-bound workloads on an unbounded number of goroutines. This is more or less the idea of how you optimize a Go function.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Welcome to Squeezing a Go function. Today I'm going to talk about how to squeeze the performance out of your Go functions. First of all, I want to talk about what optimizing is. If I ask you what the best orange juice is, most of you will probably say that fresh-squeezed orange juice is the best one, but it depends on what you want. If you want really good taste, fresh-squeezed is probably the best option. If you want something convenient, maybe a carton of orange juice is great. If you want something that has to last forever, a tank is probably the better option. Optimizing is about that: finding the right option for you. Sometimes it's being faster, sometimes it's consuming less memory, sometimes it's doing less I/O. It isn't necessarily just being faster or just consuming less CPU, so take that into consideration whenever you are optimizing.

It's also important to optimize at the right level. You can buy an F1 car, but if your road is a dirt track, you are screwed: there's no way you are going to reach 300 kilometers per hour on it. You probably need to optimize your infrastructure first, then your architecture, until you reach the point where you need to do micro-optimizations at the Go level. Go is a really fast programming language: the runtime is really efficient and the compiler is really good, so your bottleneck is probably not in that Go function. It's probably in the architecture, in the I/O, maybe in some SQL queries or things like that. Normally you don't need to optimize at the Go level at all. Sometimes you have good reasons to optimize at the Go function level, but analyze all the other stuff first. Optimize what you need, when you need it.

If you think about optimizing a Go function, you can optimize it a lot, up to a completely absurd level. The reality is that you need to know when to optimize that function and when to stop. Sometimes you are looking for the bottleneck, the bottleneck is in that function, and you get obsessed with optimizing it to the best level possible. But at some point the bottleneck is no longer there, and if you keep over-optimizing that function, you end up wasting time on something that isn't needed. So measure where your bottleneck is, fix it, and look for the next bottleneck to fix. Don't get obsessed with one specific function.

And do not guess. One of the most important lessons about optimizing is that guessing is no good. We are not good at guessing; nobody is good at guessing these things. Think about estimations in software: nobody is good at estimating, and nobody is good at guessing where a performance penalty will appear or whether an optimization will work well. That's because these systems are really complex: I/O, hard drives, spinning disks, the CPU, other processes running on that same CPU, thread context switches, a lot of crazy stuff that the operating system and other processes are doing all the time. There are so many variables, so many things going on in your computer at the same time, that guessing is not going to help you at all. So the rule is basically: measure everything. If you want to optimize, the first thing you need to do is measure. You need to know how much time, how much memory, or how much of whatever resource you want to optimize your program is consuming, and from there you can start optimizing.

So how do you measure things? The most common way when you are doing micro-optimizations of functions in Go is writing benchmarks. A benchmark is part of the Go standard library: it belongs to the testing framework, and it lets you write functions, very similar to test functions, that measure how much time something consumes. For example, here I'm writing two benchmarks, one for MD5 and one for SHA-256 hashes, and basically checking which function consumes more time. If I execute that, I see that MD5 is twice as fast: I can do double the number of MD5 hashes in the time the SHA-256 hashes take. I can also ask the benchmark to report allocations: how many allocations are these hash functions doing? In this case the hashes are very well implemented, with zero allocations, so they are really fast and don't consume any heap memory. Let's see another example: opening a file. Whenever I open a file, it takes 120 bytes per operation and does three allocations. That's an example of something that really does allocate; later we will see why allocations matter.
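The benchmark source shown on screen isn't captured in the transcript; a minimal sketch of this kind of comparison, with an illustrative payload and names of my own choosing, could look like this:

```go
package hashbench

import (
	"crypto/md5"
	"crypto/sha256"
	"testing"
)

// payload is an arbitrary 1 KiB input, just to have something to hash.
var payload = make([]byte, 1024)

// Run with: go test -bench=. -benchmem
// Add -memprofile=mem.out or -cpuprofile=cpu.out and inspect the result
// with: go tool pprof -svg mem.out
func BenchmarkMD5(b *testing.B) {
	b.ReportAllocs() // report allocations per operation next to ns/op
	for i := 0; i < b.N; i++ {
		md5.Sum(payload)
	}
}

func BenchmarkSHA256(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sha256.Sum256(payload)
	}
}
```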
Another way of getting information about performance is profiling. Normally you want to know which part of your system is the bottleneck, and profiling is for that. Here is our benchmark from before; I can run it, but this time asking for a memory profile. Then I use go tool pprof to find out where my memory is being consumed. In this case, I can see that 75 megabytes in this benchmark are consumed by the os.NewFile function. I can also render that as an SVG to see it in a nicer way, and there is clearly a big red block saying: most of the memory consumed by this benchmark binary is due to the os.NewFile function. I can even see the exact line where that happens: it's where the File struct is created. I can also check the CPU profile. Here the time is more spread out, but 30% of it goes to the syscall6 function, and if I look at the graph, I see syscall6 consuming 30%, epoll wait 14%, and epoll ctl 22%. Let's take a look: syscall6 is basically the function that performs syscalls against the operating system, and it's written in Go assembly. You are probably not going to optimize this; it's from the standard library, so it's most likely really well optimized already. But you can dig into where that time is going, and you can use the same tools to explore your own source code. That is the interesting part.

Okay, now I'm going to show some examples of small optimizations. I want you to think of this as a process. You will learn about the different micro-optimizations I'll be using here, but I think the process is the more interesting part: how you create the benchmark, run it, optimize, run the benchmark again and look at the results. And apart from the benchmark, you should have some unit tests to verify that the code stays correct; here, though, I'm showing only the benchmarks and the micro-optimizations.

One of the simplest optimizations for reducing CPU usage is basically doing less. For example, I have here a find function that takes a needle and a haystack. It loops over the haystack, and when it finds the needle it stores the index in a result variable, and at the end it returns that result. This is a very naive approach, and it's clear that you can optimize it. Let's start with a benchmark: I create a ton of data for the haystack and search for a value that sits a bit past the middle of it. If I run that, it takes 289 nanoseconds per operation. Okay, sounds fast. Let's see if we can do better. We can simply return early: return the index as soon as the needle is found, because there's no need to keep searching after that. If I do that and run the benchmark again, it takes 172 nanoseconds per operation. That's a bit more than half, which makes sense, because the benchmark looks for a value a bit past the middle of the haystack. Basically what we are doing here is being a bit smarter and not executing code that isn't needed.
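The find function itself is only described, not shown; a sketch of the naive and early-return versions, assuming an integer haystack (with unique elements both return the same index), might be:

```go
package find

// findNaive scans the whole haystack even after a match,
// keeping the index in a result variable until the end.
func findNaive(needle int, haystack []int) int {
	result := -1
	for idx, value := range haystack {
		if value == needle {
			result = idx
		}
	}
	return result
}

// findEarly returns as soon as the needle is found, so on average
// it does half the work (or less) of the naive version.
func findEarly(needle int, haystack []int) int {
	for idx, value := range haystack {
		if value == needle {
			return idx
		}
	}
	return -1
}
```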
Another interesting thing is reducing allocations. Let's see. I have a slice, I initialize it, and I start appending integers to it, a million integers in this case, and I benchmark that. The thing is, each operation does 39 allocations and takes 10 million nanoseconds, which is about ten milliseconds per operation. Let's improve that. I know I'm going to put a million integers there, so I can create the slice upfront with capacity for a million elements. If I do that, we are now talking about less than a millisecond per execution and only one allocation. That's a huge improvement: the time consumed by the function drops by an order of magnitude. But we can do better. In this case we know at compile time that there will be a million elements, so instead of creating a slice with make and giving it the right size, I can use an array directly. Now it takes roughly a third of what we had before, with zero allocations: we are not allocating anything in the heap. That doesn't mean the data is not in memory; it means it lives on the stack instead of the heap. That's the idea.
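A sketch of the three variants just described (the size constant and names are illustrative, not from the talk):

```go
package prealloc

const n = 1_000_000

// appendGrow lets append grow the backing array as needed:
// dozens of allocations as the slice is repeatedly resized.
func appendGrow() []int {
	var xs []int
	for i := 0; i < n; i++ {
		xs = append(xs, i)
	}
	return xs
}

// appendPrealloc reserves capacity upfront: a single allocation.
func appendPrealloc() []int {
	xs := make([]int, 0, n)
	for i := 0; i < n; i++ {
		xs = append(xs, i)
	}
	return xs
}

// fillArray uses a fixed-size array that can stay on the stack:
// zero heap allocations, as long as the array doesn't escape.
func fillArray() int {
	var xs [n]int
	for i := range xs {
		xs[i] = i
	}
	return xs[n-1]
}
```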
Let's see another kind of optimization. Let's talk about packing. Whenever you have a struct, the compiler adds padding to your fields to align them for the processor, basically to make access more efficient for the CPU. In this case we have a struct with a bool, a float64 and an int32, and the compiler is going to add bytes in between those fields. If we create a slice of 1 million elements of that struct and benchmark it, we see that we are consuming 24 megabytes per operation, with one allocation. That is kind of fast, but still 24 megabytes per operation. If we just reorder the fields of that struct, the compiler can do a better job with the padding: it will say, okay, the float64 is already aligned, the int32 is already aligned, and the bool only needs three extra bytes of trailing padding because it can be packed together with the previous fields. So before we had 24 bytes per instance of the struct, and now we have 16 bytes per instance. If we run the benchmark again, it's still one allocation, but we are saving eight megabytes of memory in the heap. This can sound silly, but it matters even if you don't care that much about memory: the original version takes about 2.6 milliseconds per operation and the reordered one about 2.0 milliseconds. So it's not only affecting memory usage; it's affecting CPU time too.
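The struct isn't shown in the transcript, but a bool/float64/int32 layout matching the sizes mentioned (24 bytes before reordering, 16 after) would be:

```go
package main

import (
	"fmt"
	"unsafe"
)

// unpacked: 1 byte for a, 7 bytes of padding so b is 8-byte aligned,
// 8 bytes for b, 4 bytes for c, 4 bytes of trailing padding = 24 bytes.
type unpacked struct {
	a bool
	b float64
	c int32
}

// packed: fields ordered from largest to smallest; only 3 bytes of
// trailing padding after a are needed = 16 bytes.
type packed struct {
	b float64
	c int32
	a bool
}

func main() {
	fmt.Println(unsafe.Sizeof(unpacked{})) // 24
	fmt.Println(unsafe.Sizeof(packed{}))   // 16
}
```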
Another interesting thing is function inlining. I'm going to explain why inlining is important, but first, here is an example of a non-inlined function. The compiler takes care of inlining, and it does that by analyzing the complexity of the function: if the function is simple enough, it inlines it automatically. In this case it's not inlined because I explicitly asked the compiler not to inline it, with a go:noinline directive. So we have the non-inlined function and the inlined function, and if I run a benchmark, we are saving a whole nanosecond. To be honest, that is not important at all. If you are fighting for nanoseconds, you are in a very interesting field, but normally nobody cares about that nanosecond. Still, inlining is important, and we are going to see why in a minute.

Let's talk about escape analysis. Escape analysis is something the compiler does to decide whether a value goes to the heap or stays on the stack. For example, in this case I have a non-inlined function where I create a value and return a pointer to it. That value is going to escape to the heap because I'm returning a pointer: val can't live in the stack frame of the function, since it's going to be accessed from outside after the function returns, so it needs to be stored somewhere that is not the function's stack. It needs to go to the heap, and for that reason we now have one heap allocation there. Then there is the non-escaping version: I create the value and return it as a plain integer. I'm not returning a pointer; I'm returning a copy of the value, so nothing has to outlive the function's stack and no heap storage is required. Let's see the impact of that (both versions are sketched below). The escaping function consumes eight bytes in the heap and does one allocation per operation, which means it takes about 13 nanoseconds per operation; the non-escaping function takes only about 1.5 nanoseconds. We are almost an order of magnitude faster with the non-escaping function, because we don't need to touch the heap, and the heap is more expensive than the stack. So this is important, but there is something even more interesting related to escape analysis: escape analysis plus inlining.

If you combine escape analysis and inlining, you get something very interesting. For example, here we have a Document that has a path, and I have a function where I create a new document with that path. NewDocument creates the document, does certain initialization work, and returns a pointer to the document. That pointer escapes, right? If I run that, I get three allocations in this function. Okay, sounds good, that's fine. But I can change it: now I create the new document and then I call an init function. NewDocument becomes a super simple function with almost zero complexity: it just returns a pointer to an empty document. Then the init function takes care of the initialization. Doing this, I'm ensuring that NewDocument gets inlined, and because it's inlined, the d value doesn't need to go to the heap: the call is inlined right there, so the document goes directly onto the stack of the calling function. When I call init, I pass a pointer that still points into the stack, and it works fine. This way I save allocations: instead of three allocations I have two, instead of consuming 56 bytes per operation I consume 32, and instead of 128 nanoseconds per operation I'm only consuming 73. This small optimization in constructors can save a lot of allocations. The lesson here is: try to make your constructors really simple, and if you need complex initialization, do it in a secondary method, something like an Init method.
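Going back to the escape analysis example, a minimal sketch of the escaping and non-escaping pair (names are mine; go build -gcflags=-m shows the compiler's actual escape decisions):

```go
package escape

import "testing"

//go:noinline
func escaping() *int {
	val := 42
	return &val // val outlives the call: it escapes to the heap
}

func nonEscaping() int {
	val := 42
	return val // a copy is returned: val can stay on the stack
}

// Package-level sinks keep the compiler from optimizing the calls away.
var (
	sinkPtr *int
	sinkInt int
)

func BenchmarkEscaping(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sinkPtr = escaping() // ~1 alloc/op, 8 B/op
	}
}

func BenchmarkNonEscaping(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sinkInt = nonEscaping() // 0 allocs/op
	}
}
```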
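And a sketch of the constructor-plus-Init pattern just described. Whether d really stays on the caller's stack depends on the compiler's escape analysis of the whole call chain, so treat this as an illustration under that assumption rather than a guarantee:

```go
package doc

// Document is a sketch of the document type described in the talk.
type Document struct {
	path string
	// ... more fields set up during initialization
}

// NewDocumentEager does all the work in the constructor; the extra
// complexity can prevent inlining, so the result escapes to the heap.
func NewDocumentEager(path string) *Document {
	d := &Document{}
	d.path = path
	// ... complex initialization ...
	return d
}

// NewDocument is trivial, so the compiler can inline it at the call
// site, giving the document a chance to live on the caller's stack.
func NewDocument() *Document {
	return &Document{}
}

// Init carries the complex part of the initialization.
func (d *Document) Init(path string) {
	d.path = path
	// ... complex initialization ...
}

func caller() {
	d := NewDocument() // inlined: no forced heap allocation here
	d.Init("some/path")
	_ = d
}
```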
Okay, let's talk about concurrency. We all love concurrency, because in Go it's so easy and so cool to use, but it has some important performance implications. Here we have a fake I/O function that simulates some blocking process; in this case it's just doing a time.Sleep. And I have a parallel version that executes it in goroutines, with the number of goroutines passed as a parameter. I execute three benchmarks: parallel I/O with a hundred goroutines (one per job), parallel I/O with one goroutine per CPU, and serial I/O. If we run the benchmarks, we see that the version with one goroutine per job has way more allocations, but its nanoseconds per operation are the smallest. So we are trading memory and allocations for CPU time. With one goroutine per CPU you have fewer allocations and a more or less acceptable speed: it's about ten times slower, but still way faster than executing serially. The serial version does only one allocation, but it takes a lot of time, multiple orders of magnitude slower than the per-CPU version.

Let's do another example and play with the CPU. I create a fake CPU function that consumes CPU, basically doing MD5 sums, and again a method to run it in parallel. Now I run that with 1000 goroutines, with one goroutine per CPU, and serially. The interesting thing here is that one goroutine per job is the slowest option: the most memory-consuming and the most CPU-consuming. Everything is wrong with trying to run CPU-bound workloads on an unbounded number of goroutines; it's completely inefficient. One goroutine per CPU gives me the best result in this case: 65 microseconds per operation and only 25 allocations. The serial version does no allocations at all, but it takes way more time: basically double the time of running the same work in parallel with one goroutine per CPU.
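The benchmark code isn't in the transcript; one way to sketch a runner where the goroutine count is a parameter, so that one-per-job, one-per-CPU and serial are just different arguments, is (my own arrangement, not necessarily the talk's):

```go
package work

import (
	"runtime"
	"sync"
)

// runJobs executes jobs using nWorkers goroutines reading from a channel.
// nWorkers == len(jobs) approximates one goroutine per job,
// nWorkers == runtime.NumCPU() is one per CPU, and nWorkers == 1 is serial.
func runJobs(jobs []func(), nWorkers int) {
	ch := make(chan func())
	var wg sync.WaitGroup
	wg.Add(nWorkers)
	for i := 0; i < nWorkers; i++ {
		go func() {
			defer wg.Done()
			for job := range ch {
				job()
			}
		}()
	}
	for _, job := range jobs {
		ch <- job
	}
	close(ch)
	wg.Wait()
}

// runPerCPU is the variant the talk finds best for CPU-bound work.
func runPerCPU(jobs []func()) {
	runJobs(jobs, runtime.NumCPU())
}
```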
So that's it. This is more or less the idea of how you optimize a Go function. There are some tricks that I shared, but I think the most interesting part is the whole flow: creating a benchmark, trying to optimize, measuring whether your change actually improves something, and trying to reduce allocations. Reducing allocations not only reduces the amount of memory consumed and the CPU spent allocating it; it also reduces garbage collector pressure. The garbage collector has to collect every allocation you make once you are finished with it, so every allocation adds pressure to the garbage collector, and reducing allocations has an extra impact on your program's execution.

Some references: there's a very interesting book called Efficient Go by Bartłomiej Płotka; I really recommend it, and it has a very methodical approach to testing, benchmarking and optimizing. The High Performance Go workshop from Dave Cheney is also very interesting and a very practical take on optimizing. There's the go-perfbook by Damian Gryski. And the Ultimate Go course from Ardan Labs is a really good one for learning a bit more about mechanical sympathy and optimizations. All the images here are Creative Commons, and here are the references to the authors. That's it. Thank you.

Jesus Espino

Staff Engineer @ Mattermost



