Sometimes Floating Point Math is Perfect

Posted on June 19, 2017 by brucedawson

I’ve written in the past about how to compare floating-point numbers for the common scenario where two results should be similar but may not be identical. In that scenario it is reasonable to use an AlmostEqual function for comparisons. But there are actually cases where floating-point math is guaranteed to give perfect results, or at least perfectly consistent results. When programmers treat floating-point math as a dark art that can return randomly wrong results then they do themselves (and the IEEE-754 committee) a disservice.

A common example given is that in IEEE floating-point math 0.1 + 0.2 does not equal 0.3. This is true. However this “odd” behavior is then extrapolated in some ill-defined way to suggest that all floating-point math is wrong, in unpredictable ways. The linked discussion then used one of my blog posts to justify their incorrect analysis – hence this article.

In fact, IEEE floating-point math gives specific guarantees, and when you can use those guarantees you can sometimes make strong conclusions about your results. Failing to do so leads to a cascade of uncertainty in which any outcome is possible, and analysis is impossible.

One of the fundamental guarantees of IEEE math is that five basic operations will be correctly rounded. This guarantee goes back to the very first version of the IEEE standard when the committee members knew that these five operations could be efficiently correctly rounded, and therefore should be. The five operations are addition, subtraction, multiplication, division, and square root. Correctly rounded means that (by default) the answer will be the closest possible result (round to nearest even).

Okay then, if addition is guaranteed to be correctly rounded, then how come 0.1 + 0.2 doesn’t equal 0.3? To understand it helps to rewrite the calculation like this:

double(double(0.1) + double(0.2)) == double(0.3)

In this rendition of the problem we can see that there are four separate roundings going on, each of which adds error. The three floating-point constants are rounded to double precision, two of them are added (and implicitly rounded again), and then a comparison is done. Even when all four operations are correctly rounded (such guarantees for conversions showed up in more recent IEEE standards) we can’t assume that a series of correctly rounded operations will give the same results as a series of infinitely precise operations that are then rounded once.

If we print all of the relevant numbers to high enough precision then we can see what is going on. Here’s some simple code:

printf(“%1.30f\n”, 0.1);
printf(“%1.30f\n”, 0.2);
printf(“%1.30f\n”, 0.1 + 0.2);
printf(“%1.30f\n”, 0.3);

And here are the results (from VC++ 2017 – VC++ 2013 and before can’t print this many digits):

0.100000000000000005551115123126
0.200000000000000011102230246252
0.300000000000000044408920985006
0.299999999999999988897769753748

It turns out that double(0.1) and double(0.2) both get rounded up, and that gives their sum a boost upwards. And, if you look carefully you will see that their sum gets rounded up as well. Meanwhile, 0.3 gets rounded down. All of the individual roundings give the best possible result, in isolation, but the net result is that the sum is one ULP higher than the literal double(0.3), necessarily.

Sometimes there’s error, but not always

The mistake, however, is to assume that the “failure” in this case means you should give up. If the two constants being added had been exact then there would only have been one rounding in the calculation and the result would have matched the literal on the right-hand side. So, for example, 1.0 + 2.0 will give an exact result, as will 98432341293.375 + 0.000244140625. When you calculate 1.0 / 5.0 the answer will not be exact but it will be correctly rounded and will be equal to double(0.2).

Correctly rounded results are great, but even better is when we are guaranteed exact results. It may be surprising to some software developers that floating-point math is sometimes guaranteed to give exact results but this falls out of the other guarantees. Here are some examples:

Integers that are small enough will be converted to float/double with no error, and vice-versa
Floating-point addition, subtraction, and multiplication of integral values will be exact as long as the inputs are exact and the results are small enough
The square root of a perfect square (such as sqrt(1369.0)) will be exact
Multiplication and division by powers of two is exact as long as the result is in normalized range
If you subtract two numbers that are close to each other, where ‘close’ means within a factor of two of each (Sterbenz’s lemma) other then the result will be exact (although exact in this case sometimes just means catastrophic cancellation)
Similarly, adding opposite signed numbers with the equivalent constraints is also exact
floor and ceil should give perfect results. If they don’t then some blogger is likely to test your implementation for all four-billion floats and point out the error

Even sin and cos are guaranteed to be correctly rounded by some floating-point libraries, which opens up the possibility of identical results on a wider range of machines.

I’m sure there are many more examples.

Using these guarantees to get deterministic results is not easy because the floating-point standard leaves some things confusingly unspecified, such as intermediate precision. As a practical matter this is an enormous problem on any code that is compiled for an x87 coprocessor (which potentially means old 32-bit x86 code) but is often a non-issue for 64-bit code or even 32-bit code if you flip enough SSE switches.

So, don’t blindly trust floating-point math, but don’t assume that it introduces errors at random. Careful choice of algorithms can make an enormous difference.

For more ramblings about floating-point math see my ongoing series.

Reddit discussion is here.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048

View all posts by brucedawson →

This entry was posted in Floating Point and tagged floating point. Bookmark the permalink.

17 Responses to Sometimes Floating Point Math is Perfect

lewiscowles2015 says:

July 10, 2017 at 2:23 pm

decimal struct? or just round?
Time to process may be one factor, but @ ~1s for 100,000,000 operations on not new intel consumer Hardware
https://stackoverflow.com/questions/2550281/floating-point-vs-integer-calculations-on-modern-hardware/41570985#41570985 (my answer)
IDK, perhaps it’s time for more libfixmath in situations where rounding is unacceptable.

Reply
Zaru says:

July 12, 2017 at 5:48 am

Common source of confusion: programmer does not know numerical bases very well, and does not know that even if he types numbers in decimal, most floating-point implementations use _binary_ floating point. Which could be useful to bring up in this article.

Reply
John Campbell says:

July 12, 2017 at 8:46 pm

When you use “printf(“%1.30f\n”, 0.1);” and get “0.100000000000000005551115123126” How accurate is this decimal representation of the binary value stored as 0.1 ? ie where does the decimal reporting fail ?

Reply
- brucedawson says:
  
  July 12, 2017 at 10:40 pm
  
  That depends on your printf implementation. With VC++ 2013 and earlier the printing would switch to zeroes after about seventeen digits, IIRC. With VC++ 2015 and above and with gcc’s CRT you will get as many digits as you ask for. Eventually the digits go to zeroes (if you ask for more than thirty digits) because at some point there is nothing more to print, but that can take a few hundred digits.
  
  Reply
Malcolm McLean says:

July 13, 2017 at 2:49 am

You might be interested in my floating point portability routines. They save floats as IEEE 754, and load them back, even on non-IEE 754 architectures.

https://github.com/MalcolmMcLean/ieee754

Reply
- brucedawson says:
  
  July 13, 2017 at 9:15 pm
  
  Huh. So, what non-IEEE architectures exist these days? I know that some GPUs still play fast-and-loose with denormals, and I can see some cheating being done on rounding (but why, it’s cheap to do it right), but why would a platform not support the IEEE floating-point format? That seems… odd. Tell me more.
  
  Reply
  - Malcolm McLean says:
    
    August 3, 2017 at 4:14 am
    
    The routines future-proof your code against non-IEEE. They also support embedded processors and legacy systems. I actually write them as part of a Matlab support project, the original code reads in a Matlab matrix. Without the routines, I would have had to specify that the code could only run on IEEE-compatible machines.
    
    Reply
  - musaran2 says:
    
    October 15, 2023 at 7:49 am
    
    6 years later and AI training hardware embraces extremely compact floats down to 8 bits, some with configurable bit repartition…
    
    Reply
KLR says:

July 13, 2017 at 7:21 am

I guess the author wants to be reassuring, in that it is OK to use floating point numbers. However, if anything, he has made me more suspicious.

Reply
Alex Cohn says:

September 25, 2017 at 6:09 am

Some while ago I was programming online billiard. It was essential to have the balls propagate exactly same on all computers, but all our attempts to synchronize the end-users failed. Until we found that some Direct X drivers set the floating-point arithmetic (in CPU, not on the GPU they controlled) differently. After some try-and-error, we succeeded to force all of them to comply (we had to compromise precision, but this was not a big deal: the balls still rolled and bounced quite realistically). Luckily, we shut down the shop before the ARM processors became popular for online gaming.

Reply
Sean C. says:

November 20, 2017 at 5:45 am

While I understand why double(double(0.1) + double(0.2)) != double(0.3), doesn’t IEEE754 guarantee that the 5 basic arithmetic operations are exactly rounded, which should imply that double(double(0.1) + double(0.2)) == double(0.3) should hold.

For exact rounding to be upheld, shouldn’t the result of the floating-point operation be within 1/2 a ulp of the exact arithmetic result such that they round to the same value?

What does the IEEE754 guarantee of exact rounding for the 5 basic arithmetic operations actually mean?

Thanks.

Reply
- brucedawson says:
  
  November 20, 2017 at 11:14 pm
  
  I’m confused by your question. In most (all?) cases the casts that you are adding don’t actually do anything. Generally speaking, in programming languages, 0.1 is already a double, and 0.1 + 0.2 is rounded to double.
  
  That said, the answer is that any individual operation is correctly rounded, but if the inputs are also rounded then the multiple roundings may combine badly, as they do in this case.
  
  Reply
SC says:

November 21, 2017 at 12:53 am

They’re not casts but taken from your notation for the decimal fractional inputs being rounded to the nearest representable double, as in the passage in your article:

“Okay then, if addition is guaranteed to be correctly rounded, then how come 0.1 + 0.2 doesn’t equal 0.3? To understand it helps to rewrite the calculation like this:

double(double(0.1) + double(0.2)) == double(0.3)

In this rendition of the problem we can see that there are four separate roundings going on, each of which adds error.”

I understand this to be the nature of floating-point operations in practice but I can’t square this with the guarantee of exact rounding for the 5 basic operations as mandated by IEEE754. As referenced from your article “Floating-Point Determinism”, D. Monniaux’s paper “The pitfalls of verifying floating-point computations” referencing Goldberg states:

“IEEE-754 specifies exact rounding [Goldberg, 1991, §1.5]: the result of a floating-point operation is the same as if the operation were performed on the (infinitely precise) real numbers with the given inputs, then rounded according to the rules in the preceding section. Thus, x⊕y is defined as r(x+y)…”

As such, from the definition above, shouldn’t the following be true:
0.1 ⊕ 0.2 == double( infintely precise(0.1) + infintely precise(0.2) )
== double(0.1 + 0.2) == double(0.3)

So, while 0.1 ⊕ 0.2 != 0.3, according to the exact rounding guarantees for IEEE754 floating-point operations, 0.1 ⊕ 0.2 == double(0.3).

In your reply, you state that “… but if the inputs are also rounded then the multiple roundings may combine badly, as they do in this case.”

Does this not contradict the definition of exact rounding and imply that the IEEE754 guarantee of exact rounding for the 5 basic arithmetic operations is not a strong guarantee in practice.

Thanks.

Reply
- brucedawson says:
  
  November 21, 2017 at 8:32 am
  
  You’re misunderstanding Goldberg, I think. Floating-point math on a computer always uses rounded inputs. This is unavoidable. When it talks about “(infinitely precise) real numbers” it means that you take the infinitely precise value represented by the floating-point number (double in this case) – it doesn’t mean the infinitely precise value represented in the code. By the time the floating-point math operation happens the real-number 0.1 is lost. In its place is the nearest double.
  
  This is generally unavoidable. If the 0.1 and 0.2 are inputs from a file, for instance, then it is *unavoidable* that they be rounded to double before being added, thus giving us three roundings in that calculation. Unavoidable.
  
  Yes, the results would be more precise if 0.1 and 0.2 weren’t rounded first, but such a system would be drastically different from IEEE 754 math.
  
  This calculation is not one that IEEE math can do. Sorry:
  
  double( infintely precise(0.1) + infintely precise(0.2) )
  
  Reply
SC says:

November 21, 2017 at 9:29 am

The quote was from Monniaux’s paper quoting Goldberg. Another quote from a passage in Monniaux’s paper:

“By “strict IEEE-754 behaviour”, “strict IEEE-754 single precision” or “strict IEEE-754 double precision”, we mean that each individual basic arithmetic operation is performed as if the computation were done over the real numbers, then the result rounded to single or double
precision.”

And another from “Floating-point arithmetic” – wikipedia:

“IEEE 754 requires correct rounding: that is, the rounded result is as if infinitely precise arithmetic was used to compute the value and then rounded.”

I think the common operative phrase is “as if infinitely precise arithmetic was used…”, note the “as if…” and this has messed up my understanding because I had initially understood it the way you describe it in this article.

Just to play devil’s advocate, in your reply – “When it talks about “(infinitely precise) real numbers” it means that you take the infinitely precise value represented by the floating-point number (double in this case)”

Why would they say “infinitely precise” to mean a single or double precision value when it is understood that such floating-point values are of finite precision? Sure, they can represent some decimal fractions up to a hundred over digits as you mentioned in one of your other articles but I don’t think this is infinite precision in the general case.

Thanks again.

Reply
- brucedawson says:
  
  November 21, 2017 at 11:01 pm
  
  I agree that their phrasing could lead to confusion. But also be aware that your understanding is impossible. When it is time to do the addition the IEEE compliant processor is given two double-precision numbers. That’s it. It doesn’t know their history. It can’t know their history. Each of the input numbers could be as much as 0.5 ULP away from the infinitely precise number that it originally came from. Sorry. That is necessarily the best that the standard can do.
  
  Where “infinitely precise” is correctly used in those references is that the operation itself is done “as if to infinite precision” and then rounded.
  
  Reply
SC says:

November 21, 2017 at 6:37 pm

I reread Goldberg and I think I understand now. Goldberg’s discussion on exactly rounded operations and his definition of exactly rounded were in the context of decimal floating-point arithmetic examples that he used up to that point at least. Of course decimal floating-point inputs can be computed exactly in decimal floating-point arithmetic as opposed to binary floating-point arithmetic which might need the decimal inputs to be rounded.

The bases got me again. The stars are in alignment now! Thanks 🙂

Reply