I’ve written in the past about how to compare floating-point numbers for the common scenario where two results should be similar but may not be identical. In that scenario it is reasonable to use an *AlmostEqual* function for comparisons. But there are actually cases where floating-point math is guaranteed to give perfect results, or at least perfectly consistent results. When programmers treat floating-point math as a dark art that can return randomly wrong results then they do themselves (and the IEEE-754 committee) a disservice.

A common example given is that in IEEE floating-point math 0.1 + 0.2 does not equal 0.3. This is true. However this “odd” behavior is then extrapolated in some ill-defined way to suggest that all floating-point math is wrong, in unpredictable ways. The linked discussion then used one of my blog posts to justify their incorrect analysis – hence this article.

In fact, IEEE floating-point math gives specific guarantees, and when you can use those guarantees you can sometimes make strong conclusions about your results. Failing to do so leads to a cascade of uncertainty in which any outcome is possible, and analysis is impossible.

One of the fundamental guarantees of IEEE math is that five basic operations will be correctly rounded. This guarantee goes back to the very first version of the IEEE standard when the committee members knew that these five operations *could* be efficiently correctly rounded, and therefore should be. The five operations are addition, subtraction, multiplication, division, and square root. Correctly rounded means that (by default) the answer will be the closest possible result (round to nearest even).

Okay then, if addition is guaranteed to be correctly rounded, then how come 0.1 + 0.2 doesn’t equal 0.3? To understand it helps to rewrite the calculation like this:

double(double(0.1) + double(0.2)) == double(0.3)

In this rendition of the problem we can see that there are *four* separate roundings going on, each of which adds error. The three floating-point constants are rounded to double precision, two of them are added (and implicitly rounded again), and then a comparison is done. Even when all four operations are correctly rounded (such guarantees for conversions showed up in more recent IEEE standards) we can’t assume that a series of correctly rounded operations will give the same results as a series of infinitely precise operations that are then rounded once.

If we print all of the relevant numbers to high enough precision then we can see what is going on. Here’s some simple code:

printf(“%1.30f\n”, 0.1);

printf(“%1.30f\n”, 0.2);

printf(“%1.30f\n”, 0.1 + 0.2);

printf(“%1.30f\n”, 0.3);

And here are the results (from VC++ 2017 – VC++ 2013 and before can’t print this many digits):

0.100000000000000005551115123126

0.200000000000000011102230246252

0.300000000000000044408920985006

0.299999999999999988897769753748

It turns out that double(0.1) and double(0.2) both get rounded up, and that gives their sum a boost upwards. And, if you look carefully you will see that their sum gets rounded up as well. Meanwhile, 0.3 gets rounded down. All of the individual roundings give the best possible result, in isolation, but the net result is that the sum is one ULP higher than the literal double(0.3), necessarily.

## Sometimes there’s error, but not always

The mistake, however, is to assume that the “failure” in this case means you should give up. If the two constants being added had been exact then there would only have been one rounding in the calculation and the result would have matched the literal on the right-hand side. So, for example, 1.0 + 2.0 will give an exact result, as will 98432341293.375 + 0.000244140625. When you calculate 1.0 / 5.0 the answer will not be exact but it will be correctly rounded and will be equal to double(0.2).

Correctly rounded results are great, but even better is when we are guaranteed *exact* results. It may be surprising to some software developers that floating-point math is sometimes guaranteed to give exact results but this falls out of the other guarantees. Here are some examples:

- Integers that are small enough will be converted to float/double with no error, and vice-versa
- Floating-point addition, subtraction, and multiplication of integral values will be exact as long as the inputs are exact and the results are small enough
- The square root of a perfect square (such as sqrt(1369.0)) will be exact
- Multiplication and division by powers of two is exact as long as the result is in normalized range
- If you subtract two numbers that are close to each other, where ‘close’ means within a factor of two of each (Sterbenz’s lemma) other then the result will be exact (although
*exact*in this case sometimes just means catastrophic cancellation) - Similarly, adding opposite signed numbers with the equivalent constraints is also exact
- floor and ceil should give perfect results. If they don’t then some blogger is likely to test your implementation for all four-billion floats and point out the error

Even sin and cos are guaranteed to be correctly rounded by some floating-point libraries, which opens up the possibility of identical results on a wider range of machines.

I’m sure there are many more examples.

Using these guarantees to get deterministic results is not easy because the floating-point standard leaves some things confusingly unspecified, such as intermediate precision. As a practical matter this is an enormous problem on any code that is compiled for an x87 coprocessor (which potentially means old 32-bit x86 code) but is often a non-issue for 64-bit code or even 32-bit code if you flip enough SSE switches.

So, don’t blindly trust floating-point math, but don’t assume that it introduces errors at random. Careful choice of algorithms can make an enormous difference.

For more ramblings about floating-point math see my ongoing series.

Reddit discussion is here.

Pingback: IEEE float division | Physics Forums - The Fusion of Science and Community

Pingback: IEEE float division | Page 2 | Physics Forums - The Fusion of Science and Community