I’ve seen a few online discussions linking to my Comparing Floating Point Numbers page for misguided reasons and I wanted to discuss those reasons to help people understand why throwing epsilons at the problem without understanding the situation is a Really Bad Idea™. In some cases, comparing floating-point numbers for exact equality is actually correct.

If you do a series of operations with floating-point numbers then, since they have finite precision, it is normal and expected that some error will creep in. If you do the same calculation in a slightly different way then it is normal and expected that you might get slightly different results. In that case a thoughtful comparison of the two results with a carefully chosen relative and/or absolute epsilon value is entirely appropriate.
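To make that concrete, here is a minimal sketch of a combined absolute and relative comparison. The name `nearly_equal` and its parameters are hypothetical, chosen only for illustration. The right epsilon values depend entirely on the calculation being checked, so treat this as a starting point under those assumptions rather than a general-purpose answer.

```c
#include <stdbool.h>

/* Hypothetical helper (not from any library): true when a and b differ by
   at most abs_eps, or by at most rel_eps times the larger magnitude.
   Infinities and NaNs get no special handling in this sketch. */
bool nearly_equal(float a, float b, float abs_eps, float rel_eps)
{
    float diff = a > b ? a - b : b - a;      /* |a - b| */
    if (diff <= abs_eps)                     /* absolute check, needed near zero */
        return true;

    float mag_a = a < 0.0f ? -a : a;         /* |a| */
    float mag_b = b < 0.0f ? -b : b;         /* |b| */
    float largest = mag_a > mag_b ? mag_a : mag_b;

    return diff <= largest * rel_eps;        /* relative check */
}
```

The absolute check handles results that should be close to zero, where a relative tolerance shrinks to almost nothing. The relative check handles everything else.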

However if you start adding epsilons carelessly – if you allow for error where there should be none – then you get a chaotic explosion of uncertainty where you can’t tell truth from fiction.

## Floating-point numbers aren’t cursed

Sometimes people think that floating-point numbers are magically error prone. There seems to be a belief that if you redo the exact same calculation with the exact same inputs then you might get a different answer. Now this can happen if you change compilers (such as when changing CPU architectures), change compiler settings (such as optimization levels), and it can happen if you use instructions (like fsin/fcos/ftan) whose value is not precisely defined by the IEEE standard (and if you run your code on a different CPU that implements them differently). But if you stick to the basic five operations (plus, minus, divide, multiply, square root) and you haven’t recompiled your code then you should absolutely expect the same results.

Update: this guarantee is mostly straightforward (if you haven’t recompiled then you’ll get the same results) but nailing it down precisely is tricky. If you change CPUs or compilers or compiler options then, as shown in this chart from this post, you can get different results from the same inputs, even on very simple code. And, it turns out that you can get different results from the same machine code – if you reconfigure your FPU. Most FPUs have a (per-thread) setting to control the rounding mode, and x87 FPUs have a setting to control register precision. If you change those then the results will change. So the guarantee is really that the same machine code will produce the same results, as long as you don’t do something wacky.

Understanding these guarantees is important. I talked to somebody who had spent weeks trying to understand how to deal with floating-point instability in his code – different results on different machines from the same machine code – when it was clear to me that the results should have been identical. Once he realized that floating-point instability was not the problem he quickly found that his code had some race conditions, and those had been the problem all along. Floating-point as the scapegoat delayed his finding of the real bug by almost a month.

## Constants compared to themselves

The other example I’ve seen where people are too quick to pull out an epsilon value is comparing a constant to itself. Here is a typical example of the code that triggers this:

float x = 1.1;

if (x != 1.1)

printf("OMG! Floats suck!\n");

On a fairly regular basis somebody will write code like this and then be shocked that the message is printed. Then somebody inevitably points them to my article and tells them to use an epsilon, and whenever that happens another angel loses their wings.

If floating-point math is incapable of getting correct results when there are *no* calculations (except for a conversion) involved then it is completely broken. And yet, other developers manage to get excellent results from it. The more logical conclusion – rather than “OMG! Floats suck!” – is that the code above is flawed in some way.

And indeed it is.

## Fatally flawed floats

The problem is that there are two main floating-point types in most C/C++ implementations. These are float (32 bits) and double (64 bits). Floating-point constants in C/C++ are double precision, so the code above is equivalent to:

if (float(1.1) != double(1.1))

printf("OMG! Floats suck!\n");

In other words, if 1.1 is not the same when stored as a float as when stored as a double then the message will be printed.

Given that there are twice as many bits in a double as there are in a float it should be obvious that there are many doubles that cannot be represented in a float. In fact, if you take a randomly selected double then the odds of it being perfectly representable in a float are about one part in 4 billion. Which is poor odds.
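One way to check whether a particular double happens to be one of the lucky ones is to round it to float and widen it back. This is a small sketch: `fits_in_float` is a name made up for this example, and it assumes the compiler honors the explicit casts, as standard C requires.

```c
#include <stdbool.h>

/* Hypothetical helper: true when d is exactly representable as a float. */
bool fits_in_float(double d)
{
    float f = (float)d;       /* rounds d to the nearest float */
    return (double)f == d;    /* widening back is exact, so any mismatch
                                 means the rounding step lost bits */
}
```

With this, `fits_in_float(1.1)` is false, while `fits_in_float(0.5)` and `fits_in_float(42.0)` are true, because 0.5 and 42 need only a handful of mantissa bits.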

## It looks so simple…

The confusion, presumably, comes from the fact that 1.1 looks like such a simple number, and therefore the naive expectation is that it can trivially be stored in a float. That expectation is incorrect – it is *impossible* to perfectly represent 1.1 in a binary float. To see why let’s see what happens when we try converting 1.1 to binary. But first let’s practice base conversion.

To convert the fractional part of a number to a particular base you just repeatedly multiply the number by the base. After each step the integer portion is the next digit. You then discard the integer portion and continue. Let’s try this by converting 1/7 to base 10:

- 1/7 (initial value)
- **10/7 = 1+3/7 (multiply by ten, first digit is one)**
- **30/7 = 4+2/7 (discard integer part, multiply by ten, next digit is four)**
- **20/7 = 2+6/7 (discard integer part, multiply by ten, next digit is two)**
- **60/7 = 8+4/7 (discard integer part, multiply by ten, next digit is eight)**
- **40/7 = 5+5/7 (discard integer part, multiply by ten, next digit is five)**
- **50/7 = 7+1/7 (discard integer part, multiply by ten, next digit is seven)**
- 10/7 = 1+3/7 (discard integer part, multiply by ten, next digit is one)

The answer is 0.**142857**142857… We can see that the bold steps (2-7) repeat endlessly so therefore we will never get to a remainder of zero. Instead those same six digits will repeat forever.
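The multiply-and-peel-off procedure above can be sketched in a few lines of C. The function name `fraction_digits` and its signature are invented for this illustration. It works on an exact fraction `num/den` using integers, and it assumes `num < den`, a base from 2 to 10, and values small enough that `num * base` does not overflow.

```c
#include <stddef.h>

/* Hypothetical helper: writes the first `count` digits of the fraction
   num/den (with num < den) in the given base (2..10) into `out`,
   followed by a terminating NUL. `out` must hold count + 1 chars. */
void fraction_digits(unsigned num, unsigned den, unsigned base,
                     char *out, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        num *= base;                        /* multiply by the base */
        out[i] = (char)('0' + num / den);   /* integer part is the next digit */
        num %= den;                         /* discard the integer part */
    }
    out[count] = '\0';
}
```

Asking for twelve digits of 1/7 in base 10 gives `142857142857`. The same routine with base 2 and the fraction 1/10 gives `000110011001`, which is the repeating pattern that causes all the trouble with 1.1.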

## 1.1 is not a binary float

Let’s try the same thing with converting 1.1 to base two. The leading ‘1’ converts straight across, and to generate subsequent binary digits we just repeatedly multiply by two. After each multiply we take the integer portion of the result as our next digit, and discard the integer portion before the next multiply:

- 0.1 (initial value)
- 0.2 (multiply by two, first digit is zero)
- **0.4 (multiply by two, next digit is zero)**
- **0.8 (multiply by two, next digit is zero)**
- **1.6 (multiply by two, next digit is one)**
- **1.2 (discard integer part then multiply by two, next digit is one)**
- 0.4 (discard integer part, then multiply by two, next digit is zero)
- 0.8 (multiply by two, next digit is zero)

Notice that the steps in bold (3-6) repeat endlessly, so the binary representation of 1.1 repeats endlessly. The net result is that the binary representations of 1.1 in float (24-bit mantissa) and double (53-bit mantissa) precision are:

float(1.1) = %1.00011001100110011001101

double(1.1) = %1.0001100110011001100110011001100110011001100110011010

The slight breaking of the pattern at the end of both numbers is because the conversion is done with round-to-nearest rather than truncation, in order to minimize the error in the approximation, and in both cases it is rounded up.
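If you want to see those mantissa bits for yourself, something like the following sketch should work. The function names are hypothetical, and the code assumes the usual IEEE 754 layouts: a 32-bit float with 23 stored fraction bits and a 64-bit double with 52. The leading one is implied rather than stored, so it does not appear in the output.

```c
#include <stdint.h>
#include <string.h>

/* Writes the low `nbits` bits of `bits` as '0'/'1' characters,
   most significant first, followed by a terminating NUL. */
static void bits_to_string(uint64_t bits, int nbits, char *out)
{
    for (int i = 0; i < nbits; i++)
        out[i] = ((bits >> (nbits - 1 - i)) & 1u) ? '1' : '0';
    out[nbits] = '\0';
}

/* Hypothetical helper: the 23 stored fraction bits of a float. */
void float_fraction_bits(float f, char out[24])
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);                /* reinterpret without aliasing issues */
    bits_to_string(u & 0x7FFFFFu, 23, out);
}

/* Hypothetical helper: the 52 stored fraction bits of a double. */
void double_fraction_bits(double d, char out[53])
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    bits_to_string(u & UINT64_C(0x000FFFFFFFFFFFFF), 52, out);
}
```

For `1.1f` the first function produces `00011001100110011001101`, matching the digits after the binary point shown above, including the rounded-up final bits.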

Every binary number with a finite number of digits can be exactly represented in decimal, but not every finite decimal number can be exactly represented in binary. Just as 1/7 cannot be represented as a finite decimal number, 1/10 cannot be represented as a finite binary number. For more details on conversions and precision see this post.

## And therefore…

float(1.1) = 1.10000002384185791015625

double(1.1) = 1.100000000000000088817841970012523233890533447265625

and these values are not equal. Don’t expect the result of double-precision calculations to equal the result of float-precision calculations, even when the only ‘calculation’ is converting from decimal to binary floating point. Two reasonable ways to fix the initial code would be:

float x = 1.1f; // Float constant

if (x != 1.1f) // Float constant

printf("OMG! Floats suck!\n");

or:

double x = 1.1;

if (x != 1.1)

printf("OMG! Floats suck!\n");

Both of these pieces of code will behave intuitively and will not print the message. No epsilon required. And no angels harmed.
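The exact decimal values quoted at the top of this section can be reproduced with printf-style formatting, at least on C libraries that do exact binary-to-decimal conversion (glibc does, while some runtimes may stop after about 17 significant digits and pad with zeros). This is a sketch under that assumption, and `format_exact` is a made-up name.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats v with `places` digits after the decimal
   point. Whether every digit is exact depends on the C library. */
int format_exact(double v, int places, char *out, size_t out_size)
{
    return snprintf(out, out_size, "%.*f", places, v);
}
```

On such a library, formatting `(double)1.1f` with 23 places gives 1.10000002384185791015625, and formatting `1.1` with 51 places gives the longer double value. Those counts are simply how many decimal digits each binary value needs.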

This is a very clear and straight explanation of how floats and doubles behave, and the extra about base conversion was very helpful to remember how it worked! As a programmer who used to avoid floats and doubles whenever possible (because I didn’t understand the “magic” behind them) and is getting to C/C++ game programming, this post and all the others about the subject taught me I was doing it wrong. Thank you!

Excellent discussion. Even knowing some of this stuff, it’s easy to forget day-to-day. The examples are extremely valuable when one’s thinking isn’t as clear as it should be.

As a case in point, all of Relic’s games have used floating-point extensively in gameplay code, but are still completely deterministic across CPUs from the Pentium II through the i7 — we know this because you can play networked games and replays on different generations of machine without desyncing.

There are caveats: the way some SSE instructions are specified is a bit of a minefield, and we’re trying to figure out now how we can make things work reliably across architectures, but the main point is valid. Floating point is subtle, but it’s not inherently nondeterministic.

Cogently written. The worst counter-example to the superstition “testing floats with equality is always bad” may be this one, where the tested value is an infinity: http://stackoverflow.com/questions/11421756/weverything-yielding-comparing-floating-point-with-or-is-unsafe

A couple more examples are in this post of mine (it can get tricky, but that does not undermine the message that floating-point is deterministic): http://blog.frama-c.com/index.php?post/2011/11/08/Floating-point-quiz

Nice — I appreciate the comment and the links.

One quibble: in the Floating-point-quiz you say “For a decimal number to be representable as a (base 2) floating-point number, its decimal expansion has to end in 5” but that is only true if there is a fractional part. All of the integers from 0 to 2^24 (about 16 million) are representable in a float, and many more are representable in a double, and most of them don’t end in 5.

An awesome reminder of how to do base conversion and why we had to learn it in the first place!

Thanks,

Donato

In one of the examples, why do you use float(1.1) and (double)1.1, and thus not double(1.1)?

There was no reason. I just changed it to make them consistent in order to avoid any possible confusion.

The problem isn’t the comparison, the problem is that people think 1.1 is a float constant. As you say at the end, the solution is to match constants and variables; either use 1.1f or use a double. You wouldn’t use 1.1f to initialize an int, and you wouldn’t expect “Hello, world” to be valid for a pointer-to-function, so why use 1.1 to initialize a float? Sure, it works (and if someone’s using 42 to initialize a double, rather than 42.0, that’s not a problem, since small integers are perfectly representable in floats), but it’s setting you up for confusion later.

Or better still, switch to a language like Pike, where the default float type is double precision, AND you get an arbitrary-precision float type (Gmp.mpf, using the GNU Multiprecision Library).

I agree that using 1.1 as a float constant is the problem. I think that most new developers assume that using 1.1 as a float constant is as benign as using 42 as a float constant. In our decimal-centric world (the US measurement system notwithstanding) 1.1 “seems” like a simpler number than 1.25, but as a binary float it is not.

I’m not sure that a default float type of double helps — after all, the default float type in C/C++ is also double, but for performance reasons many people explicitly choose float. An arbitrary precision float type is a great thing to have, but used carelessly it just gives you the same issues only with smaller epsilons.

Well, when people want absolute maximum performance, they don’t pick Pike, they go with C. But most of the time the cost is worth it – just as letting your language do garbage collection for you might have a performance and RAM cost, but it’s so much easier on the debugging. I can make a GUI program that displays a colorful Mandelbrot image in a screenful of Pike code, but the boilerplate to just create a window would be more than that in C.

With float vs double, I don’t remember ever working with floats for performance – I just always go double for accuracy. And these days, most high level languages are doing the same thing; the ‘float’ type in Python, Pike, and (I think) JavaScript/ECMAScript (where it’s just called a number) is double-precision. The advantages of single-precision floats just aren’t enough. On today’s hardware, I’m not sure there’s even any benefit left at all – in the same way that there’s no real benefit to working with 16-bit integers rather than 32-bit, because moving them around in memory usually involves just as much work. Ignore C’s float and just use double instead. Life will be better.

Floats definitely still have significant advantages. If you are dealing with large arrays of them then the memory savings (and reduced cache misses) can be critical. And GPUs still process floats a lot faster than doubles. And if you do SSE SIMD coding then you can process floats twice as fast as doubles.

It absolutely depends on the domain of course. I work in games and for most of our data we absolutely cannot use double.

Actually, that’s true. When you work with entire arrays of something, packing can help a lot. (See also PEP 393 strings.🙂 ) But for individual float variables, which is what most people work with most of the time, the performance difference is going to be negligible and the risk high.

I wouldn’t want to speak so definitely about what most people work with most of the time, but just be aware that doubles don’t actually solve all of the problems of binary floating-point math. Tom Forsyth likes to point out that sometimes switching to double just means your bugs now only occur after hours of playtime, or in distant corners of the map.

http://home.comcast.net/~tom_forsyth/blog.wiki.html#%5B%5BEven%20more%20precision%5D%5D

Then again, sometimes using double really is the right solution, or at least really does solve the problem.

Your mileage may vary.

FYI: In Pike float doesn’t default to double. It depends on how you compiled your Pike compiler and if you didn’t add some specific flags, then it depends on architecture. On 32-bit systems float is 32-bit, on 64-bit systems float is 64-bit.

Having the precision of your floating-point numbers depend on the size of your address space seems bizarre. I would guess that that is because the ‘int’ type is the size of a pointer so ‘float’ might as well be also, but still. Not a great choice for numerical analysis.

But interesting.

I’m not sure about the compilation defaults; if the default for a 32-bit build really is a 32-bit float, it’s probably just because nobody’s wanted to go in and change it, just in case it breaks stuff. But all the prebuilt Pikes I’ve used have had a 64-bit ‘float’ type. (The ‘int’ type in Pike is actually an optimized bignum; you can store integers of any size in it, but if they happen to fit inside a pointer, they’re represented more simply, for the performance benefit.)

float x = 0.5;

if (x == 0.5)

printf("IF");

else if (x == 0.5f)

printf("ELSE IF");

else

printf("ELSE");

Why is the output IF here?

When you assign a floating-point value, it is what it is. You can then compare that to another literal, no problem. In this particular case, 0.5 can be represented perfectly, so there should be no difference between float and double.

Of course, most of this trouble disappears if you use a high level language. Python has only one ‘float’ type (with at least as much precision as an IEEE double, and possibly more); Pike has one ‘float’ type plus ‘mpf’, a multi-precision float (allowing you to go to arbitrarily large precision if you wish). No more fiddling with float vs double.

Very clear and point to point reply. Thank you so much

I think the real problem is with the over-simplistic design of operator overloading and promotion. There are some situations where floats should silently promote to doubles, others where doubles should silently demote to floats, and others where the compiler should demand an explicit conversion. A good language should squawk in cases where there is genuine ambiguity about what a programmer most likely meant. Since code should include sufficient casts to ensure that a code reviewer would have no reason to doubt that the programmer meant what he wrote, I would suggest that language designers should base their type-conversion rules on that principle. Unfortunately, many languages like Java and C# fall flat.

A programmer might write “float1 == double1” when genuinely intending “(double)float1 == double1”, which is how compilers would interpret it. It’s also plausible that a programmer might accidentally write that when what was intended was “float1 == (float)double1”. I’d regard both of those as sufficiently plausible that good programmers should include casts *in both cases*, so as to make clear to code reviewers that they are aware of how the comparison is performed, and if I were designing a language, I’d forbid the version without a cast. More generally, I’d allow parameters to functions to indicate whether they should allow or disallow implicit conversions, since there are situations where double-to-float should be accepted silently (e.g. coordinates given to graphics functions) and others where it should not (e.g. inputs to serialization or comparison functions). There are likewise situations where float-to-double should be accepted silently (inputs to most math functions), and others where it should not (inputs to comparison functions). Rather than having a broad rule “float-to-double is allowed implicitly; double-to-float isn’t”, it would be more helpful to allow more case-by-case determinations.