Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, and welcome to this talk on floating point challenges and how we can overcome them in Rust. My name is Prabhat and I'm 23 years old. I'm just getting started in this industry, and I came across a lot of problems while working in Rust, specifically with floating point numbers. But before we get into the talk itself, let me share a little more about myself. So that's me here, as you can see.
Talking about some of my interests: I love Linux and I like systems programming. I also like music and DJing a lot, and for physical activity I enjoy cycling. As for my professional experience, I did multiple programming jobs during my college years, freelancing and otherwise, specifically in the domain of systems programming, which involved dealing with C and Rust directly. I've also contributed to some open source projects in Rust; they were minor contributions, but yeah, I did that as well. And recently I completed an internship as a Rust developer at a startup called Afroshoot, one of the world's most thriving AI photography startups. So those were some details about myself.
Also, if you want to reach out to me, my LinkedIn is in the top right corner along with my GitHub, so you can scan those and reach out if you have any questions about this talk. On that note, let's proceed to the next slide. So these are the things I will be covering in my talk.
While designing the content for this talk, the approach I kept in mind was not to go too deep into the details of every single thing, because you can easily find that on the web. Instead, my approach is to give you a sort of indicator of where to look if you get stuck, and of the issues that can come up along the way when you are working with floating point numbers, specifically in Rust. All of these concepts are applicable to other languages as well, but I have especially kept Rust in mind and how we can use Rust to mitigate these issues.
So, I'm sorry, proceeding to the next slide: I would first like to give you a sense of why floating point numbers matter so much. Here's one classic example of what happened in the year 1991 with the Patriot missile system. We all know the US has lots of different kinds of weapons in its arsenal, and the Patriot missile failure, which occurred during the Gulf War, was the result of an error in how a fractional number was represented in software. The Patriot's tracking radar relied on precise time calculations: the system counted time in tenths of a second, and the embedded hardware that drives the guidance needs that timing to work out where the incoming target will be. But the data type used could not hold the value one tenth of a second exactly, because 0.1 has no exact binary representation in a fixed number of bits. That caused a gradual accumulation of rounding errors over time because of the limited precision, which we'll take a look at in the next slides; rounding error propagation, which we'll also look at, came into the picture. After the system had been running for many hours, the accumulated timing error was large enough that the tracking calculation missed the incoming missile, and the Patriot failed to intercept it. The result was casualties: 28 soldiers were killed when the incoming missile hit instead of being intercepted. So that is the kind of significant impact that can happen if these things don't fall into place correctly. Implicitly we might think, okay, why not just use integers, why not use some other data type? But simply avoiding floating point isn't necessarily a solution either. So let's go to the next slide.
So these are some of the preliminaries we need to know before we get started, just as a quick note. First, real-world representation: the point of floating point numbers is to provide a machine representation of real numbers, meaning numbers that contain a decimal point. That's the fundamental purpose of floating point numbers in any language, period, and they are represented in the form of scientific notation. They also have a precision and range aspect to them: floating point offers a trade-off between precision and range, suitable for a wide spectrum of numerical calculations. We'll see in the coming slides why that trade-off between precision and range exists. We'll also cover limited precision, and we'll try to understand mitigation strategies: how developers can apply them to manage rounding errors and make sure their code produces reliable results and doesn't cause any havoc.
So let's start with how floating point numbers are standardized. The IEEE, the Institute of Electrical and Electronics Engineers, 754 standard is the widely used specification for how floating point numbers should be represented using scientific notation. All the operations and methods carried out on them are defined by this standard. The standard also defines several formats for representing numbers at different precisions, sized in powers of two: 2^5, or 32 bits, is single precision, and 2^6, or 64 bits, is double precision. We'll look at these in the next slides. Extended precision formats go beyond that; they are used more on supercomputers, where an even higher degree of precision is required.
So let's start by taking a look at precision. Single precision, or 32-bit, in the IEEE 754 format is composed of, or rather I should say divided into, different groups of bits. The first bit represents the sign: if it is zero the number is positive, and if it is one the number is negative, following the scientific-notation layout. The next eight bits represent the exponent, the power of two that is responsible for placing the point within the number. And this next one is the most critical part, the significand: the remaining 23 bits are the ones that actually hold the digits of the value. Note that we don't split the significand's range into a negative half and a positive half the way two's complement integers do; the significand is always a positive magnitude, and the separate sign bit tells us whether the whole number is negative or not. Here's the formula: the value is (-1)^sign × 1.fraction × 2^(exponent - 127), where 127 is the exponent bias. I won't get into the details of the formula, but just for completeness, I've put it here.
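To make the layout concrete, here is a small Rust sketch that pulls the three fields out of an f32 using the standard library's f32::to_bits; the example value is my own choice, not one from the slides.

```rust
fn main() {
    let x: f32 = -6.25;
    let bits = x.to_bits(); // raw IEEE 754 bit pattern as a u32

    let sign = bits >> 31;              // 1 bit: 0 = positive, 1 = negative
    let exponent = (bits >> 23) & 0xff; // 8 bits, stored with a bias of 127
    let fraction = bits & 0x7f_ffff;    // 23 bits of the significand

    println!(
        "sign = {sign}, exponent = {exponent} (unbiased {}), fraction = {fraction:023b}",
        exponent as i32 - 127
    );
}
```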
Moving on, we also need to take a look at double precision, because these two formats are the most used in any programming job. Double precision in IEEE 754 has the same structure, but with an enlarged exponent and significand. As in any floating point number, the first bit is reserved for the sign of the number, the exponent bits, 11 of them here, represent the exponent part, and the significand bits, 52 of them, represent the fractional part, the significand, which we better know as the mantissa. So here you get more range and precision.
So coming to the actual visual representation. As you can see, this format here is the single precision one, the 32-bit one. As per the previous slide, the first bit is reserved for the sign, the exponent field is used to represent the exponent part, which is the power of two by which the value is scaled, and the fraction field, composed of 23 bits, represents the significand, the actual digits of the number. Multiply the significand by two raised to that exponent and you get a scientific-notation-style form in which a real number from the decimal system is represented by a number in a base-two system. So that's exactly the structure devised by the IEEE 754 standard, and that's what we're mostly going to work with. So again, we've covered this; I'll just go through it quickly.
So first of all, there are a couple of things you need to know about floating point numbers, important things, or rather the issues that can come up while you're coding. One drawback, so to speak, is that floating point numbers have limited precision and rounding error problems. Since the number of bits available is finite, real numbers often cannot be represented exactly; they can only be represented over a certain range of values, because we don't have infinite bits, we have a finite number of them. We'll look at examples of this in the coming slides. This inherent limitation is what leads to rounding errors. So what we need to do is, when a number is too long in the decimal system, round it to the closest value representable in the given format. If we have a 32-bit single precision floating point number, then we have to round within the significand so that the value can be represented properly, and we try to minimize the loss. This is inherently present.
There are certain ways to address this limited precision issue, so let's talk about the problems, or rather the issues, one by one. Limited precision and rounding errors, right? Every floating point number, as I told you in the previous slides, has a finite number of bits to represent the significand, or mantissa. Rounding errors happen when a real number cannot be represented exactly. The term real number here should be understood to cover both rational and irrational numbers: rational numbers usually have finite, terminating representations, whereas irrational numbers are non-terminating; one prominent example is pi, which is non-terminating and non-repeating. As you can see in the illustration below, the green boxes represent the finite number of bits we have in a given precision type, and the red boxes indicate the bits that can't be accommodated. The full row of bits at the top is the actual number, but since we have a limited number of bits, we have to round: the red boxes are the bits our precision range has no room for, so we can't represent them.
Now here's one common problem with this, and I'm sure many of you have come across it already. If you add two floating point numbers like 0.1 and 0.2, the chances are the output will not be exactly 0.3; it will be 0.3 followed by a long tail of digits, something like 0.30000000000000004.
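For instance, a quick check in Rust (nothing here is from the slides, just a minimal reproduction of the problem):

```rust
fn main() {
    let sum = 0.1_f64 + 0.2_f64;
    println!("{sum}");          // prints 0.30000000000000004, not 0.3
    println!("{}", sum == 0.3); // false
}
```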
Because of this, lots of issues happen. To address it, we have fixed point arithmetic: we can represent the numbers as scaled integers to avoid precision loss, and perform the arithmetic operations on those scaled integers to maintain precision. The idea is that you don't need to use floating point numbers directly in the first place; instead you run your operations on scaled integers, which stay exact, and that helps you preserve the accuracy of your calculations, as in the sketch below.
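Here is a minimal sketch of that scaled-integer idea, using whole cents instead of fractional currency amounts; the amounts are made up for illustration.

```rust
fn main() {
    // Store money as whole cents in integers instead of fractions in floats.
    let price_cents: i64 = 10; // 0.10 in currency units
    let tax_cents: i64 = 20;   // 0.20 in currency units

    let total_cents = price_cents + tax_cents;
    assert_eq!(total_cents, 30); // exact: no 0.30000000000000004 surprises

    println!("total: {}.{:02}", total_cents / 100, total_cents % 100);
}
```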
Another option is to use big decimal libraries. This is a more advanced approach, but it is quite useful; you can google big decimal for more background, and I won't cover it in a lot of detail here. In Rust you can use the num-bigint crate and the rust_decimal crate for arbitrary precision integer and decimal arithmetic.
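A small sketch of the decimal approach, assuming the rust_decimal and rust_decimal_macros crates are added to Cargo.toml:

```rust
use rust_decimal::Decimal;
use rust_decimal_macros::dec; // dec! builds exact decimal literals

fn main() {
    let a: Decimal = dec!(0.1);
    let b: Decimal = dec!(0.2);

    // Decimal arithmetic keeps 0.1 + 0.2 exactly equal to 0.3.
    assert_eq!(a + b, dec!(0.3));
    println!("{}", a + b);
}
```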
And obviously one more choice we have left is the trade-off between memory and precision: you have to choose between f64 and f32 wisely, depending on what your requirements are. Coming to the next one, the rounding error. Since we have a limited number of bits, as we talked about, another interesting issue that comes out of that is rounding error. We try to round the value to the closest representable value. Say, for example, we have something like 0.032267..., non-terminating, but we can only keep four decimal places after the point; then we have to write it as 0.0323, with the last digit rounded up or down. That's what rounding means. Precision loss can happen because of these rounding errors, since not all real numbers can be expressed exactly with the limited precision here.
And because of this, rounding errors can accumulate as a result of the approximations made during arithmetic operations. Let's say you have a complicated expression with lots of different mathematical operations in it; what happens is that certain components of that expression accumulate more error than you might expect. The consequence is that this can lead to discrepancies between the expected mathematical result and the actual result obtained when using floating point numbers. So you can end up with noticeable error in your calculations, which is very undesirable if you have the kind of application where you need precision.
So here's one example of it. As you can see, there are two numbers, x and y, with y being tiny, represented with an exponent of about e-15, and when we sum them up you can see that the result is not the one we actually wanted; part of the small value gets lost in the addition, as the sketch below shows.
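A rough reconstruction of that situation in Rust (the exact numbers on the slide may differ; the point is only that the small term is partly lost in the addition):

```rust
fn main() {
    let x = 1.0_f64;
    let y = 1e-15_f64;

    let sum = x + y;
    // Mathematically (x + y) - x == y, but the addition had to round,
    // so what we recover is slightly different from y.
    println!("{:e}", sum - x);
    println!("{}", (sum - x) == y); // false
}
```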
To solve this, we have a few options. We can use decimal types to work with, like the rust_decimal crate, which is similar to the num-bigint crate we mentioned previously; with that we can work with decimal-based arithmetic, which minimizes rounding errors. We can also try to round only when necessary. Another thing is to avoid cumulative rounding: if a calculation is long, you can structure it in such a way that it doesn't produce cumulative errors, so that the accumulation is avoided in the first place. Also, one very good technique is to avoid divisions. It has mostly been observed that error tends to creep in more with divisions, because a division tends to generate many more digits than the operands had; where possible, the work can instead be done with multiplications, carrying the division out only at the end, in other words minimizing direct divisions if possible. Another good approach to rounding error problems is to use interval arithmetic; you can google it to learn more about it. There are interval arithmetic libraries, like the inari crate in Rust, which provides interval arithmetic operations you can apply in your calculations so that the loss becomes visible and can be controlled; the sketch below shows the basic idea.
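To give a feel for the idea (this is a hand-rolled illustration of interval arithmetic, not the inari crate's actual API):

```rust
// Track each quantity as a [lo, hi] pair that is meant to bracket the true value.
#[derive(Debug, Clone, Copy)]
struct Interval {
    lo: f64,
    hi: f64,
}

impl Interval {
    fn add(self, other: Interval) -> Interval {
        // A real library would also round the bounds outward by one ulp;
        // this sketch skips that detail.
        Interval {
            lo: self.lo + other.lo,
            hi: self.hi + other.hi,
        }
    }
}

fn main() {
    let a = Interval { lo: 0.0999999999, hi: 0.1000000001 };
    let b = Interval { lo: 0.1999999999, hi: 0.2000000001 };
    // The width of the resulting interval tells you how much uncertainty
    // has accumulated in the computation.
    println!("{:?}", a.add(b));
}
```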
Coming to the next point, the next issue is loss of significance. Loss of significance is again a consequence of the limited precision of the floating point representation, where the available bits are allocated to the most significant part of the number. The difference between rounding error and loss of significance is that here we keep only the most significant bits of a given binary number, and because of that, the lower-order information is lost. The two are quite closely related, but rounding is something we do explicitly, whereas loss of significance is something that is not directly in our hands. It can take place, for example, with nearly equal numbers: when they are subtracted, the leading digits cancel out, and the difference that remains carries very little precise information.
As you can see in the illustration on this slide, we have two different bit blocks: the first one in orange and the second in blue, and the third row is what we get as a result after subtracting them. If the result has more digits than we can hold, the trailing ones shown in the three red boxes get discarded, simply because we don't have room to accommodate them; we keep only the most significant bits of the binary number for the significand, or mantissa.
This can lead to results like the example shown here: we have two numbers that are quite close to each other, the difference between them is not huge, but what you will see in the output is something like 0.00899..., not the exact difference. So again, it won't be direct and very precise.
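A tiny Rust sketch of that kind of cancellation (my own numbers, not the ones on the slide):

```rust
fn main() {
    let a: f32 = 10_000.001;
    let b: f32 = 10_000.0;

    // Mathematically the difference is 0.001, but almost all of the
    // significant digits cancel, so what's left is dominated by the
    // representation error: roughly 0.00098 rather than 0.001.
    println!("{}", a - b);
}
```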
The solutions for mitigating this one are again quite clever. One is reordering the operations: if you have a subtraction early on inside a complex calculation, you can restructure it so that it is done later, if it cannot be eliminated entirely. You can also use alternative formulas to rewrite your whole expression. One of my personal favorites is something taught to engineering students in their first year of college: the Taylor expansion. The Taylor expansion is something you can learn more about on your own; I won't be able to cover it fully here. Basically, a Taylor expansion is a mechanism through which you can compute a function to as many decimal digits as you need; the series can go on as long as you want it to. It's kind of a formula in itself, and it can be used to represent complex calculations in a discrete fashion. Say you want to approximate sin x, you can do it with a Taylor expansion; you want to approximate cos x, you can still do it with a Taylor expansion; you want to approximate the inverse tangent of x, you can still do it with a Taylor expansion. So yeah, this is one of the very clever techniques we have for avoiding loss of significance. Excuse me. Here's a small sketch of the idea.
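As a concrete illustration of the alternative-formula idea, here is a minimal sketch of approximating sin x with the first few terms of its Taylor series; the function name and the number of terms are my own choices, not something from the slides.

```rust
/// Approximate sin(x) using the Taylor series x - x^3/3! + x^5/5! - ...
fn sin_taylor(x: f64, terms: u32) -> f64 {
    let mut sum = 0.0;
    let mut term = x; // first term: x^1 / 1!
    for n in 0..terms {
        sum += term;
        let k = (2 * n + 2) as f64;
        term *= -x * x / (k * (k + 1.0)); // step from x^(2n+1) to x^(2n+3)
    }
    sum
}

fn main() {
    let x = 0.5_f64;
    println!("taylor: {}", sin_taylor(x, 8));
    println!("libm:   {}", x.sin());
}
```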
And another option is to use arbitrary precision arithmetic, the same approach we saw for the limited precision issue, so you can still use those decimal crates here as well if you need them.
Now, coming to associativity, the issue of associativity and the order of operations. This is quite interesting. Most of us are aware of the rules of commutativity and associativity for addition and multiplication, so at first glance most of us would assume that result one and result two here must be the same. But what happens behind the scenes is that floating point operations are not always associative, precisely because of the issues we talked about previously: limited precision, rounding errors, and loss of significance. So changing the order can lead to different results. You can run this code and see for yourself: even though we are doing the exact same thing mathematically, the output we get in the result one variable is one, and for result two it's zero. This is very undesirable and something we want to avoid at all costs; a hedged reconstruction of the example follows.
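The exact numbers on the slide aren't shown in the transcript, but here is one well-known combination that reproduces the 1-versus-0 behavior described:

```rust
fn main() {
    let a = 1e16_f64;
    let b = -1e16_f64;
    let c = 1.0_f64;

    let result1 = (a + b) + c; // (1e16 - 1e16) + 1  ->  1
    let result2 = a + (b + c); // the 1 is absorbed into -1e16 first  ->  0

    println!("result1 = {result1}, result2 = {result2}");
}
```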
Now, the solution to this is pretty straightforward. We can use parentheses and explicit grouping, just like in the case of pointer dereferencing, where the scope of evaluation is decided with the help of parentheses; we can do the same here. We have to be a little cautious, because it still might not give you exact results, but if you are aware of the order of calculations in your expression, this can be a good mitigation technique. Another one is reordering operations: not just using parentheses, but actually reordering your operations so that the parts of the calculation where the loss of significance is smallest come before the parts where the loss of significance would be largest. So reordering your operations can be another good way to deal with this.
Another thing here is quite advanced, and honestly I couldn't find a lot of material about it; parts of the fast floating point math story are still unstable in the Rust compiler. You can use fused multiply-add (FMA) operations. If you don't know about them, you can google them to learn more, but an FMA computes a multiply and an add, a * b + c, as one operation with a single rounding step, which reduces the overall rounding error; on hardware with FMA support it also maps to a single instruction. You can look into this in more depth if you wish to; Rust exposes it as mul_add, sketched below.
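A small sketch using the standard library's f64::mul_add, which performs the fused multiply-add:

```rust
fn main() {
    let a = 0.1_f64;
    let b = 10.0_f64;
    let c = -1.0_f64;

    let separate = a * b + c;    // two separate roundings
    let fused = a.mul_add(b, c); // (a * b) + c with a single rounding

    // The two results differ: the separate version rounds a * b to exactly 1.0
    // and ends up at 0, while the fused version keeps the tiny residual.
    println!("separate = {separate:e}, fused = {fused:e}");
}
```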
Now, coming to another interesting issue with floating point numbers: comparison and equality testing. Trust me, this is the one most of you will run into if you ever work with floating point numbers in your code. I came across it for the first time in high school: I couldn't compare floating point numbers because the results were so arbitrary that it wasn't possible for me to write the code logic I needed at the time. Coming to the point: if you want to compare two floating point numbers that should be the same, then instead of using a direct equality comparison, it's a better idea to use an epsilon-based comparison for approximate equality. Now, what does that mean?
Basically, as you can see in the approximately_equal function, we have two distinct floating point numbers; we subtract them and take the absolute value of the result, because we only care about the magnitude of the difference, not its sign. Then we compare that magnitude against a minimal difference, epsilon, defined as a constant. By the way, the standard library in Rust provides these constants for you; I've just written it out explicitly to give you an idea of how it is represented, but it's already there in the standard library, so you can use it directly. So you compare the magnitude of the difference: if it's less than epsilon, the two values are considered approximately equal; if the difference is greater than epsilon, then they're probably not approximately equal. This technique can be employed when testing your code, I mean, when testing floating point comparisons, instead of using the equality operator in the first place; a minimal version is sketched below.
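A minimal version of that function, using the f64::EPSILON constant from the standard library (the function name mirrors the one described on the slide, but the body is my reconstruction):

```rust
fn approximately_equal(a: f64, b: f64) -> bool {
    // Compare only the magnitude of the difference against a tiny threshold.
    (a - b).abs() < f64::EPSILON
}

fn main() {
    let x = 0.1 + 0.2;
    println!("{}", x == 0.3);                    // false
    println!("{}", approximately_equal(x, 0.3)); // true
}
```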
Another good technique here is relative error comparison. This is very similar to the epsilon-based comparison, except that the tolerance is taken relative to the magnitude of the values being compared, and the acceptable relative difference between the two numbers is something we decide as developers, based on our knowledge of the system.
Another technique we can use is units in the last place (ULP) comparison. We can use ULP-based equality functions to compare values when we want near-exact equality; this comes from the float-cmp crate in Rust, so you can use that one. It's quite good as well, I've used it personally, and you compare against a specific tolerance expressed in ULPs. This is again similar to the relative error comparison, but a relative tolerance can be dynamic, whereas a ULP tolerance is fixed. So this technique can also be used to avoid errors in equality comparison.
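A short sketch of the ULP style, assuming the float-cmp crate and its approx_eq! macro are available in Cargo.toml:

```rust
use float_cmp::approx_eq;

fn main() {
    let a = 0.1_f64 + 0.2_f64;
    let b = 0.3_f64;

    // Treat the values as equal if they are within 2 units in the last place.
    println!("{}", approx_eq!(f64, a, b, ulps = 2));
}
```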
Now, another case. This is not a problem as such; rather it's part of the specification in the IEEE 754 standard: NaN (not a number) and infinity handling in floating point numbers. Let's say you want to divide two floating point numbers and both of them happen to be exactly zero; mathematically the result is undefined. For this you get certain methods: is_nan tells you whether the value is a number or not, and if it returns true the result is a NaN; similarly, is_infinite lets you check for the infinite condition as well. These two methods are given to you so that, if certain parts of your calculation might produce such undesirable conditions, you don't have to write your own technique for detecting them; you can use these built-in methods as a safety check. These conditions can arise from things like dividing zero by zero or taking the square root of a negative number, all the situations where the result can't be expressed correctly or the calculation would stray into complex numbers; we should use these APIs to test whether the value is actually not a number, as sketched below.
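A quick demonstration of those checks with the standard is_nan and is_infinite methods:

```rust
fn main() {
    let zero = 0.0_f64;

    let q = zero / zero;      // 0 / 0 produces NaN under IEEE 754
    let inf = 1.0_f64 / zero; // 1 / 0 produces positive infinity

    println!("{}", q.is_nan());                 // true
    println!("{}", inf.is_infinite());          // true
    println!("{}", (-1.0_f64).sqrt().is_nan()); // sqrt of a negative is NaN too
}
```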
Coming up to another one here: compiler optimizations. This is not an issue so much as a safety net given to you, so that you can deal with your compiler directly and instruct it explicitly not to apply certain optimizations during the compilation of your code. Compilers can optimize floating point calculations for performance; this happens quite a lot, and they may reorder or combine operations in ways that affect precision and run into the associativity and order-of-operations problem we saw in the previous slides.
To deal with that, you can use suitable compiler flags and attributes to control those optimizations. For example, in this code you can see a function, perform_calculation, which takes two distinct floating point numbers. What can end up happening is that the expression could be computed as b times 2.0 and then added, or added first and then multiplied by two. I'm not saying the order of evaluation is completely ignored, but exactly how the compiler deals with such an expression is up to it, so you can't always predict it. So it's better to use the appropriate attributes and flags to keep it under control. One of them is inline(never): you are explicitly instructing the compiler to never inline this function call and to always compute it as a call when a value is passed to it, because otherwise, in the speaker's words, the transformation can lead to undesirable output and accuracy errors.
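A hedged reconstruction of the example being described; the function name and values are assumptions on my part, not the exact code from the slide:

```rust
// Keeping the arithmetic in its own non-inlined function makes it harder for
// the optimizer to merge or reorder it with the caller's surrounding code.
#[inline(never)]
fn perform_calculation(a: f64, b: f64) -> f64 {
    a + b * 2.0
}

fn main() {
    println!("{}", perform_calculation(0.1, 0.2));
}
```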
Some of the compiler-level techniques include using compiler attributes. Rust provides attributes that let you influence how floating point operations are compiled; the one mentioned here is target_feature, with the idea of enabling fast math support for a piece of code so that the compiler emits instructions with fast math behavior for it. You can learn more about fast math on the web; I'm not going to go into a lot of detail, but fast math is basically a set of relaxed floating point rules that let the compiler optimize more aggressively at the cost of strict IEEE behavior, so it's the thing you need to understand if you want to control these optimizations.
Another thing you can do is use compiler flags directly at compilation time rather than in the code base: you can specify a target CPU, and you can pass arguments through to LLVM to enable fast math in your build configuration. Since Rust is built on LLVM, some of LLVM's own options around fast math are available that way as well.
Now, coming to accumulated errors, this one is pretty interesting. Quite often you might have to deal with some kind of series of floating point numbers, and in that scenario it becomes quite necessary to remember that, as you saw in the previous slides, there can be lots of issues: the limited number of bits, rounding error problems, loss of significance, associativity, et cetera. Because of all this, there can be an accumulation of rounding errors. It's like you're repeatedly calling a function, whether iteratively or recursively, and over a given set of calculations that error gets amplified over the entire course of the computation.
To minimize that, we have something like the Kahan summation algorithm, also known as compensated summation. The code shown on the slide is a Kahan summation: it takes in an array of floating point numbers, a series of them, and while a naive sum might not give the value you expect, with the Kahan summation algorithm we get a much more accurate result. So you can use Kahan summation whenever you're dealing with an iterative algorithm that does calculations over floating point numbers; it could be recursive as well. A minimal sketch follows.
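Here is a minimal Kahan summation sketch along the lines of what's on the slide (the test values are my own):

```rust
/// Compensated (Kahan) summation: carry a correction term that captures the
/// low-order bits lost by each addition and feed it back into the next step.
fn kahan_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0;
    let mut compensation = 0.0;
    for &v in values {
        let y = v - compensation;     // apply the correction from the last step
        let t = sum + y;              // big + small: low-order bits of y may be lost
        compensation = (t - sum) - y; // recover what was just lost
        sum = t;
    }
    sum
}

fn main() {
    let values = vec![0.1_f64; 10];
    let naive: f64 = values.iter().sum();
    println!("naive: {naive}");                // typically 0.9999999999999999
    println!("kahan: {}", kahan_sum(&values)); // typically exactly 1
}
```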
So we covered a lot of things in this talk. These last two slides summarize what we just saw. Excuse me. First of all, choose the appropriate data type depending on your precision requirements: go with f32 if you know your range of calculations won't cross that boundary; otherwise switch to f64 if you need more range and precision. Next, leverage numerical computation crates for enhanced accuracy, such as rug, num-bigint and inari. You can take a look at these crates; they are great crates, and if you need to deal with this kind of floating point work you can use them directly, they are fairly mature. Also, avoid direct equality comparison and instead use epsilon-based comparisons, relative ones, and tolerance-based ones. Also, perform error propagation analysis to estimate potential inaccuracies. This is quite advanced and doesn't require any particular library or crate, but what you can do is find some way to estimate how much error comes out when you deliberately inject some error into your calculations, so that you know how much error is going to accumulate. Now, coming to part two: you can consider using arbitrary precision libraries for the critical computations in your code. Arbitrary precision libraries give you more range and more precision, since they are not restricted to a fixed 32 or 64 bits, so they will give you more specific results. Also, minimize accumulated errors in iterative algorithms by using techniques like Kahan summation, which we saw a couple of slides back. And use suitable compiler flags to control floating point behaviors that your compiler might otherwise change for you unknowingly; you want to avoid that.
Also, thoroughly test numerical code with diverse input sets; this is more or less talking about fuzzing, specifically. Thoroughly test numerical code with diverse sets of inputs to identify inaccuracies. If you are not sure about certain edge cases, go ahead and put them in your test cases, just to be sure that when you ship something to production it has been tested, and you don't end up with something like the Patriot missile failure in case you ever work on something like that. So that's all from my side. I would like to thank all of you for listening to me all the way to the end. And yeah, in case you have some questions, you can always reach out to me; I will be more than happy to answer them and more than happy to help you as well. So yeah, take care, guys.