Transcript
This transcript was autogenerated. To make changes, submit a PR.
In this talk I will give you some guidelines to develop efficient Rust code for numerical applications. Whether you are starting from scratch or improving on top of an existing application, I hope this talk will give you an overview of the general workflow for developing performant applications.
Harnessing the power of Rust. So, a little bit about me: well, I'm a scientifically minded software engineer with a background in scientific simulations. Currently I develop autonomous trading systems in Rust and Python. During my career I have been implementing numerical applications in C, Haskell, Fortran, Python and Rust. In this talk, I don't intend to compare one programming language to another. Instead, I would like to give you my distilled experience, specifically what I have learned about numerical implementations in Rust. Now, numerical implementations
are usually associated with applications that require a lot of speed.
So in fields like science,
engineering, finance, math, we need to perform a large number of
calculations and simulations. We want to understand how proteins work.
We want to know as soon as possible whether the seasonal rain discharge is going
to put us underwater. We want to know whether to buy or sell.
And we don't want that answer in a couple of months. So of course we have several constraints on the calculations that we can perform, because we have limited human and computing resources.
Our goal is then to run as fast as possible,
squeezing all the performance that we can from our hardware.
Now, why Rust for numerical applications? As I mentioned before, we are looking to run as fast as we can, and Rust is blazingly fast. Together with its ownership model, that makes it a superb language for numerical applications. But what I think is Rust's killer feature is the tooling. Having cargo taking care of compiling and linking is just amazing. If you have ever developed numerical applications in other programming languages like Fortran or C++, you probably had to deal with CMake. And if you compare CMake and cargo, well, you realize that cargo is light years ahead. So I really recommend cargo for numerical applications and, in general, Rust's whole tooling system. Another great reason to use Rust is its growing ecosystem: having a strong open source community providing the right tool for the job is a way to build better software. Now we
all know that, well, the real world is complex, and writing numerical applications to deal with the real world is difficult. So I would like to share with you my philosophical mindset for dealing with this complexity. I firmly believe that intelligence is just a robust methodology to recursively improve upon my stupidity. I think that it doesn't matter how much we prepare for a given project, whether we know what the right algorithms are, and how carefully we design the architecture: there are always going to be issues that we are going to discover a posteriori. So we can either be eternally frustrated about it, or we can try to recursively improve on our mistakes. So I think that having a very open mind is going to help us keep our peace of mind and build great software. So with this prelude, let us dive into numerical applications. The general algorithm could be as follows. We start from a correct but probably slow version of a numerical workflow.
This initial version could be handed to us by the research team who's
developing it in Python. Then we implement an MVP in Rust. We decide whether this MVP is fast enough for our use case. Then, if it's not, we would like to optimize. But before optimizing we would like to profile and benchmark. I would like to highlight those two things, and I'm going to discuss them in the next slides. So let's start with the MVP.
So I like to see the development of the MVP
as going through baby steps. So first we
would like to use Clippy. So if you don't know what Clippy is, it's a linter for Rust, and it's used to suggest changes to your code toward more performant patterns if we don't have the best code. So it will basically say: okay, please replace this function with another one, or change the code in this specific way. So I totally recommend you to use Clippy; below is a sketch of the kind of pattern it flags.
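For instance, here is a minimal, hypothetical example (the function is made up) of code that trips Clippy's `needless_range_loop` lint, with the suggested rewrite in a comment:

```rust
// Hypothetical example: Clippy flags the indexed loop below with
// `needless_range_loop` and suggests an iterator-based version.
fn sum_of_squares(values: &[f64]) -> f64 {
    let mut acc = 0.0;
    for i in 0..values.len() {
        acc += values[i] * values[i];
    }
    acc
    // Clippy's suggestion, which also lets the compiler elide bounds checks:
    //     for &v in values { acc += v * v; }
}
```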
Then, do not fight the borrow checker. We humans are terrible at knowing which part of the code is taking most of the time. So before you start throwing references left and right, or trying to remove clones from the code, just focus on writing correct and easy-to-read code. Then we can work on
optimizing the code. Finally, I totally recommend you
to use a battle-tested library, especially for performance-critical operations. I would like to come back to the third-party library subject in a bit.
So we have the MVP and it's time to decide whether it's
fast enough. Well, it is up to your application whether it's fast enough or not. So maybe it is, or maybe it is not but you can throw hardware at it. Or maybe it's okay, but you would like to squeeze the most out of it. So you would like to do some optimizations. But before doing any optimizations, we need to profile: profile, then benchmark, then optimize.
So, why would we like to profile? As we mentioned before, we humans are terrible at guessing where in the code we are spending most of the time. So unless you work professionally developing compilers, I just advise you to measure. For measuring we can use tools like perf. Perf will help us figure out where in the code we spend most of the time. It is a performance analysis tool for Linux that allows us to collect and analyze various performance-related data. As we will see next, we can visualize perf's output using flame graphs.
So let us see, what are flame graphs? Flame graphs are used to visualize where time is spent in our program. How do they work? Well, every time that our program is interrupted by the OS, the location of the current function and its parents is recorded. This is called stack sampling. These samples are then processed in such a way that common functions are added together. The metrics are then collected in a graph showing the call stacks, where each level of the stack is represented by a horizontal bar. The width of a bar corresponds to the amount of time spent in that function, and the bars are arranged vertically to create a flame-like shape. What this means is that the main of our program is going to be at the bottom, the libraries that we call are going to be in the middle, and our final functions are going to be on top.
So it's very important to realize that the x axis doesn't indicate time. What we really care about is the width of the bars: the width of a bar indicates how much of the time the CPU spent in that particular call. So perf flame graphs are going to give us an indication of which functions and which libraries are consuming most of the time.
Now, once the profiling is done, we would like to do some benchmarking. And the reason for benchmarking is that we need to know, with confidence, what is going to be the real impact of changing the code. We don't want to do random changes and hope for the best; we would like to be able to measure the impact of changing a certain part of the code. And the perfect tool for that is Criterion. Criterion is a benchmarking library that provides statistical confidence on the size of performance improvements or regressions. What it does is allow us to see, if we change something, how much it affects the overall performance of the code. First I would like to show you how it works, and then I would like to show you some metrics from the output. So in this toy example,
I'm going to use Criterion's macros. Then I'm going to use a library that implements a function called convolve, which is used to perform one-dimensional convolution. If you don't know what one-dimensional convolution is, do not worry about it. The idea is that we are going to ask Criterion to run this 1D convolution using arrays of length 100. We are going to generate these arrays, whose elements are random numbers taken from a Gaussian distribution. Criterion is going to run this function many times: first it's going to warm up, and then it's going to run the function many times. And then it sets a baseline, so that if we make some changes we can compare the actual performance against that baseline.
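A minimal sketch of what such a benchmark could look like, assuming a hand-written `convolve_1d` standing in for the library call, and the `rand` and `rand_distr` crates for the Gaussian inputs:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::Rng;
use rand_distr::StandardNormal;

// Naive O(n * k) one-dimensional convolution, standing in for the library call.
fn convolve_1d(signal: &[f64], kernel: &[f64]) -> Vec<f64> {
    let mut out = vec![0.0; signal.len() + kernel.len() - 1];
    for (i, &s) in signal.iter().enumerate() {
        for (j, &k) in kernel.iter().enumerate() {
            out[i + j] += s * k;
        }
    }
    out
}

fn bench_convolve(c: &mut Criterion) {
    let mut rng = rand::thread_rng();
    // Arrays of length 100 whose elements are drawn from a Gaussian distribution.
    let a: Vec<f64> = (0..100).map(|_| rng.sample(StandardNormal)).collect();
    let b: Vec<f64> = (0..100).map(|_| rng.sample(StandardNormal)).collect();
    c.bench_function("convolve_1d_100", |bencher| {
        bencher.iter(|| convolve_1d(black_box(&a), black_box(&b)))
    });
}

criterion_group!(benches, bench_convolve);
criterion_main!(benches);
```

Running `cargo bench` once establishes the baseline; running it again after a code change lets Criterion report the improvement or regression against that baseline.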
So now that we have an idea of how to use Criterion, let us talk a little bit about optimizations. I think that the three most important optimizations that we can perform in a code are: first, you make sure that you pick the right algorithm. Once you know that, then you make sure that you pick the right algorithm. And finally, you ask someone else to check for you whether you have picked the right algorithm. What this means is, well, you need to do your math homework. If you don't feel confident about your math, you need to chase your favorite mathematician to give you some help with it. Once you are pretty sure you have chosen the right algorithm, then you can pursue other optimizations.
For instance, let us try to preallocate the vector: there are functions like Vec::with_capacity that preallocate a vector with a certain capacity. Or you can use a non-cryptographic hash algorithm for the hash map. By default, HashMap uses a hashing algorithm that is hardened for cryptographic, attack-resistant use, which means it takes some time to compute the hashes of the keys. Because numerical applications are usually far away from user inputs, we want to run the hashing algorithm as fast as we can. So I totally recommend you to use a non-cryptographic algorithm. I will point not just to one of those algorithms, but to an actual library, in a bit.
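Here is a minimal sketch of these two micro-optimizations. It assumes the `rustc-hash` crate, which provides the non-cryptographic FxHashMap pointed to later in the talk:

```rust
use rustc_hash::FxHashMap;

fn main() {
    // Preallocate instead of letting the vector grow and reallocate repeatedly.
    let n = 1_000_000;
    let mut squares = Vec::with_capacity(n);
    for i in 0..n {
        squares.push((i as f64) * (i as f64));
    }
    assert_eq!(squares.len(), n);

    // FxHashMap swaps the standard library's DoS-resistant default hasher
    // for a much faster non-cryptographic one.
    let mut cache: FxHashMap<u64, f64> = FxHashMap::default();
    cache.insert(42, 1764.0);
    assert_eq!(cache.get(&42), Some(&1764.0));
}
```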
If you want to know more, I recommend you have a look at the Rust Performance Book. Now let's go back to benchmarking plus optimizations.
So we are trying to optimize our convolve 1D function. What we do first is run Criterion without changing any code: it's going to try this function many times, warm up, and then set a baseline. Then we make some modification in the code and run it again. We see that this new modification generates a meager 2.5% improvement. Then we say, okay, this seems to be going in the right direction, so let us try something different. Then you try yet another change in the code. But unfortunately there are no optimizations
gained. I would like to tell you something that is really important about benchmarking. Benchmarking assumes that you are running on an isolated machine. What this means is that if you try to run a benchmark on the local machine where you are developing, you will have a browser, Spotify, a lot of applications running. Then if you just stop Spotify, for instance, you will see an increase in performance of 15%. So you actually need to run on a machine that is not doing anything else; otherwise you're going to get spurious performance gains. So when you are benchmarking, use another machine, a VM, another local machine, somewhere that you can access and where you only run the calculations that you are trying to benchmark.
Okay, so now we know about benchmarking, we know
how to optimize, we know what we are targeting.
So the moment to test has arrived. I think that there are three complementary strategies that we can use for testing, so let us review them. First, you can ask your favorite large language model to generate unit tests for you. I'm not going to discuss whether AI is going to take over the world; I'm simply saying that large language models are fantastic for unit testing, so use them to generate tests for you. The other thing that you can do is just look
for edge cases. By edge cases I mean you can try to look for inputs to your model or your workflow that can generate numerical instabilities. Another thing that you can try is a property-based testing framework, something like proptest. I think proptest deserves its own explanation, so let me give you a little example of how it works.
So, proptest is a property testing framework inspired by Hypothesis from Python. What it does is allow you to test that certain properties hold for your code, for arbitrary input. And if something fails, it will try to create a minimal test case that tells you for which specific input your property does not hold and why it fails. So in this example,
I have two functions: one finds the maximum of a slice of floats, and the other the minimum of a slice of floats. Then, using proptest, we are going to set up a test that generates random vectors with dimension from zero to 1000 using floating points, and then, using the proptest macro, we are going to check that the property holds. So what is the property? Well, we want to know whether the min value is smaller than the rest of the values in the slice, and we want to know whether the max value is larger than the rest of the values in the slice. So proptest is basically going to help you set up your tests in such a way that you can generate a bunch of random inputs and test that the property of your interest holds.
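A sketch of how this could look with proptest; `min_of` and `max_of` are hypothetical stand-ins for the two functions in the talk:

```rust
use proptest::prelude::*;

fn min_of(xs: &[f64]) -> Option<f64> {
    xs.iter().copied().reduce(f64::min)
}

fn max_of(xs: &[f64]) -> Option<f64> {
    xs.iter().copied().reduce(f64::max)
}

proptest! {
    // Generate random vectors of dimension 0 to 1000 filled with floats.
    #[test]
    fn min_and_max_bound_the_slice(xs in prop::collection::vec(-1e9f64..1e9, 0..1000)) {
        if let (Some(lo), Some(hi)) = (min_of(&xs), max_of(&xs)) {
            for &x in &xs {
                // The property: min is below and max is above every element.
                prop_assert!(lo <= x);
                prop_assert!(x <= hi);
            }
        }
    }
}
```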
Great. So now we have covered testing, and we have covered the general workflow for developing and implementing a numerical algorithm. Now let us talk about some other aspects that we need to know about for numerical applications.
Now, one important thing is floating points. In computing, a round-off error, or rounding error, is the difference between the result provided by a given algorithm using exact arithmetic and the result provided by the same algorithm with the same input using finite-precision, rounded arithmetic. As we know, floating points cannot represent all real numbers accurately, so there are always rounding errors. So whether you are implementing something from scratch or taking an off-the-shelf library, you need to realize that there could be potential issues with floating-point numbers. There
is a subfield of mathematics called numerical analysis
that deals with designing methods that get approximate but accurate numerical solutions. So if you are into numerical applications, it is really important that you have a look at what numerical analysis says about the algorithms that you are going to use for your applications. Another important thing to mention is that if you are doing operations in finance, there is a library called rust_decimal that is going to help you perform calculations for finance without rounding errors.
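A minimal sketch contrasting binary floats with decimals, assuming the `rust_decimal` and `rust_decimal_macros` crates:

```rust
use rust_decimal_macros::dec;

fn main() {
    // Classic rounding surprise: 0.1 and 0.2 have no exact binary representation.
    let sum = 0.1_f64 + 0.2_f64;
    assert_ne!(sum, 0.3_f64); // sum is actually 0.30000000000000004

    // Decimal arithmetic keeps base-10 quantities like money exact.
    assert_eq!(dec!(0.1) + dec!(0.2), dec!(0.3));
}
```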
Now, once we have covered that,
let's move to the third party libraries.
Usually, when we need to decide whether we want to implement something ourselves or bring in a third-party library, we need to ask ourselves several questions. How important to us is this algorithm that we are trying to implement or to bring in from another library? How confident are we about being able to implement it? Are we willing to maintain this library? Or is there something else out there in the open source community that already implements this algorithm and is in a good state? These are the questions that we need
to ask ourselves before deciding. I think
that the general rule of thumb is as follows: for anything other than trivial, three-line algorithms, we should use a third-party library, even if it's written in a not-so-shiny language like C or C++. And the reason for that is as follows. If you want to know how an algorithm works, I think that the best way to do it is just to try to implement it yourself. This is a fantastic way of learning how algorithms work, and I do it all the time. But there is a big difference between trying to learn something and trying to come up with the faster, better, more performant and more robust general implementation that is usually what we find in open source code. In open source libraries, what happens is that there have been many cycles where different approaches have been tried; they have gone through many errors and a lot of applications. And I think that it would be really naive
to think that for really popular algorithms we can just go there and come up with something much better.
Well, if you are implementing an algorithm from scratch and you are a master
in that part of mathematics, and you are the only one who knows that,
yes, by all means, just go ahead and bring your algorithm to the community.
But if we are talking about really well known algorithms, I think it is much
better to join forces with another community, even if it's not the Rust community, and try to maintain the most efficient algorithm that is available.
Finally, I would like to speak about two things. One is some libraries that are already available for Rust, and the other is the interface between Rust and Python. First, let us talk about some libraries that I use every day and that I think are fantastic for numerical applications. The first of them is not a single library, but a family of libraries: the ndarray family. They are used for array manipulation, statistics, and linear algebra on arrays. They are fantastic, well-maintained libraries. Another well-known library is Rayon: basically, if you have a sequential calculation and you want to run it in parallel, you can use Rayon to help you do that (see the sketch after this overview). Polars is a lightning-fast DataFrame library.
rustc-hash provides you with the non-cryptographic hash algorithm I mentioned before: for caching in numerical applications we are going to use a lot of hash maps, but we don't want a cryptographic hash algorithm because it's really slow. There is also another library, approx; crates like ndarray offer a feature flag for approx to help you compare floating points. When you are doing calculations using arrays, it's really important to be able to know whether two arrays are approximately the same or not. Finally, there is this other library called ordered-float that is going to help you compare floats with a total order.
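A minimal sketch of the Rayon pattern mentioned above: the same iterator chain runs in parallel just by swapping `iter` for `par_iter`.

```rust
use rayon::prelude::*;

fn main() {
    let xs: Vec<f64> = (1..=1_000_000).map(|i| i as f64).collect();

    // Sequential version.
    let seq: f64 = xs.iter().map(|x| x.sqrt()).sum();

    // Parallel version: identical code except for `par_iter`.
    let par: f64 = xs.par_iter().map(|x| x.sqrt()).sum();

    // Floating-point addition is not associative, so allow a small tolerance.
    assert!(((seq - par) / seq).abs() < 1e-12);
}
```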
With that, I think I have covered some of the crates that I consider immensely helpful for numerical applications.
Finally, I would like to have a word about Rust and Python. Engineers, scientists, quants, many professionals from the numerical fields are proficient in Python, and they are not super proficient in Rust. So what happens is that most of the numerical algorithms are going to be implemented in Python. And, as we probably know, Python is not the most efficient language there. So for me it is natural that we would like to bring Rust to the Python community; we would like to integrate Rust
with Python. So if you don't know, there is a tool called Maturin that is perfect for this situation. Maturin is just a Rust tool that facilitates the creation of Python modules that wrap Rust code. It makes it really easy to call your Rust code from inside Python without writing the integration layer by hand.
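A sketch of the kind of module Maturin builds and packages: a PyO3 module exposing a Rust function to Python. The module name `fastmath` and the function are made up for illustration.

```rust
use pyo3::prelude::*;

/// Callable from Python as `fastmath.l2_norm([3.0, 4.0])`.
#[pyfunction]
fn l2_norm(xs: Vec<f64>) -> f64 {
    xs.iter().map(|x| x * x).sum::<f64>().sqrt()
}

#[pymodule]
fn fastmath(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(l2_norm, m)?)?;
    Ok(())
}
```

With this in place, `maturin develop` builds the extension and installs it into the current virtual environment, so Python can simply `import fastmath`.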
And I think that by exposing our APIs and our libraries to the Python community, we are not only going to accelerate their calculations, but we are also going to expose Python users to how Rust works. So I think this is a win-win situation for the Rust and Python communities. And with that, I would like to thank you for your time. This is what I had to say today. If you have any questions, you can always drop me a message, or you can reach me through LinkedIn. Thank you very much for your time.