Binary floating-point math is complex and subtle. I’ve collected here a few of my favorite oddball facts about IEEE floating-point math, based on the articles so far in my floating-point series. The focus in this list is on float but the same concepts all apply to double.

These oddities don’t make floating-point math bad, and in many cases these oddities can be ignored. But when you try to simulate the infinite expanse of the real-number line with 32-bit or 64-bit numbers then there will inevitably be places where the abstraction breaks down, and it’s good to know about them.

Some of these facts are useful, and some of them are surprising. You get to decide which is which.

- Adjacent floats (of the same sign) have adjacent integer representations, which makes generating the next (or all) floats trivial
- FLT_MIN is not the smallest positive float (FLT_MIN is the smallest positive *normalized* float)
- The smallest positive float – assuming denormals are supported, as they should be – is 8,388,608 times smaller than FLT_MIN
- FLT_MAX is not the largest positive float (it’s the largest finite float, but the special value infinity is larger)
- 0.1 cannot be exactly represented in a float
- All floats can be exactly represented in decimal
- Over a hundred decimal digits of mantissa are required to exactly show the value of some floats
- 9 decimal digits of mantissa (plus sign and exponent) are sufficient to uniquely identify any float
- The Visual C++ 2010 debugger displays floats with just 8 mantissa digits
- The integer representation of a float is a piecewise linear approximation of the base-2 logarithm of that float
- You can calculate the base-2 log of an integer by assigning it to a float
- Most float math gives inexact results due to rounding
- The basic IEEE math operations guarantee perfect rounding
- Subtraction of floats with similar values (f2 * 0.5 <= f1 <= f2 * 2.0) gives an exact result, with no rounding
- Subtraction of floats with similar values can result in a loss of virtually all significant figures (even if the result is exact)
- Minor rearrangements in a calculation can take it from catastrophic cancellation to 100% accurate
- Storing elapsed game time in a float is a bad idea
- Comparing floats requires care, especially around zero
- sin(float(pi)) calculates a very accurate approximation to pi-float(pi)
- From 2^24 to 2^31, an int32_t has more precision than a float – in that range an int32_t can hold every value that a float can hold, and millions more
- pow(2.0f, -149) should calculate the smallest denormal float, but with VC++ it generates zero. pow(0.5f, 149) works.
- IEEE float arithmetic guarantees that “if (x != y) return z / (x-y);” will never cause a divide by zero, but this guarantee only applies if denormals are supported
- Denormals have horrible performance on most hardware, which leads to some developers disabling them
- If x is a floating-point number then “x == x” may return false – if x is a NaN
- Calculations done with higher-precision intermediate values sometimes give more accurate results, sometimes less accurate results, and sometimes just inconsistent results
- Double rounding can lead to inaccurate results, even when doing something as simple as assigning a constant to a float
- You can printf and scanf every positive float in less than fifteen minutes
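
A few of the facts in this list can be checked directly from a float's bit pattern. The following is a minimal sketch, assuming 32-bit IEEE floats and that denormals are not being flushed to zero. The helper names (`float_to_bits`, `bits_to_float`, `next_float_up`, `approx_log2`) are made up for illustration and are not from any library.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Copy the raw bits of a float into a 32-bit integer.
   Assumption: float is a 32-bit IEEE value on this platform. */
uint32_t float_to_bits(float f) {
    uint32_t bits;
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Reinterpret a 32-bit integer as a float. */
float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Next float above f. Only meaningful for +0.0f and positive finite f:
   adjacent floats of the same sign have adjacent integer representations.
   Stepping up from FLT_MAX lands on +infinity. */
float next_float_up(float f) {
    return bits_to_float(float_to_bits(f) + 1);
}

/* Piecewise linear approximation of log2(f) for positive normalized f.
   The exponent field supplies the integer part and the mantissa field
   supplies a linear fraction, so the result is exact at powers of two. */
double approx_log2(float f) {
    return float_to_bits(f) / 8388608.0 - 127.0;
}
```

With these helpers, `bits_to_float(1)` is the smallest positive denormal, `next_float_up(1.0f)` is `1.0f + FLT_EPSILON`, and a NaN such as `bits_to_float(0x7FC00000u)` compares unequal to itself.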

From the comments on the altdevblogaday version of this post:

- The IEEE standard doesn’t guarantee correct rounding of transcendental functions, due to the Table Maker’s Dilemma, but research on resolving the dilemma is ongoing
- +0.0f and -0.0f have different representations that are equal but occasionally behave differently
- -0.0f is usually printed the same as +0.0f

From the comments on the gamasutra version of this post:

- Don’t use floating-point in interrupts without ensuring that FP state is saved and restored around interrupts. It usually isn’t, which can lead to horrific bugs in the code you interrupt.

Do you know of some other surprising or useful aspects of floats? Respond in the comments.

Heya, former student of yours here. Here’s a floating point problem that we ran into while trying to endianize stuff for XBox360:

http://www.runner2.com/blog/2012/3/31/fun-with-floating-points.html

Short of it is that the mantissa always has to have a leading one, otherwise when you assign one float to another, the assignment will “correct” the float for you. I guess it was our silliness for assigning the return of an endianize swap back to a float instead of a uint, but this behavior did result in some funny looking shaders and geometry (and who knows what else – that’s the scary part!) for a couple of weeks until we tag-team tracked it down.

Something seems very fishy here. The claim that a float must have a one in the most significant bit of the mantissa is incorrect, for your purposes. While it is true that all normalized floats have a leading one, that one is *implied*. Storing a bit that is always one would be unconscionably inefficient.

All 32-bit float values are valid. The only ones that could possibly be *corrected* by loading/storing are NaNs — numbers with an 0xFF exponent and a non-zero mantissa.

Depending on the FPU settings, denormalized numbers (zero exponent and non-zero mantissa) might get zeroed, but you shouldn’t generally be using these anyway.

I recommend digging deeper. I don’t think you’ve found the problem. You should seriously consider an exhaustive search. In a very reasonable time (as little as fifteen minutes) you can scan all positive floats while doing quite expensive tests, and that should help isolate the real problem.
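
An exhaustive search of this kind might look roughly like the sketch below. It is only an illustration. The check itself (`survives_float_copy`) is a hypothetical stand-in for whatever test is actually of interest, and both function names are invented here. It walks bit patterns as integers and asks whether each one survives a trip through a float variable.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical check: does a bit pattern come back unchanged after being
   stored to and loaded from a float variable? Returns 1 if it survives. */
int survives_float_copy(uint32_t bits) {
    volatile float f;   /* volatile forces a real store and load */
    float tmp;
    uint32_t out;
    memcpy(&tmp, &bits, sizeof tmp);
    f = tmp;
    tmp = f;
    memcpy(&out, &tmp, sizeof out);
    return out == bits;
}

/* Count the patterns in [first, last] that fail the check.
   The range 0x00000000 to 0x7F800000 covers every positive float
   from +0.0f through +infinity. */
uint64_t count_failures(uint32_t first, uint32_t last) {
    uint64_t failures = 0;
    for (uint32_t bits = first; ; ++bits) {
        if (!survives_float_copy(bits))
            ++failures;
        if (bits == last)   /* test before incrementing so last == UINT32_MAX is safe */
            break;
    }
    return failures;
}
```

Calling `count_failures(0x00000000u, 0x7F800000u)` scans every positive float. Widening the range to `0xFFFFFFFFu` also covers the NaN patterns and the negative floats, which is where any surprises are most likely to show up.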

One thing to be aware of is that the type punning/aliasing that you are doing is illegal/undefined. Consider using a union instead — although I doubt the type punning would cause a problem with the VC++ compiler on Xbox 360.

You say “We are assigning the byte-swapped data, which may or may not have a 1 in that position.” and if what you are doing is to take a float, byte swap it, and then treat it as a float, then yeah, that’s bad. Probably you are ending up with some NaN values, and they will not necessarily be preserved.

Don’t pretend that a byte-swapped float is a float. If that is the solution that you reached then you’re okay. But you should say what your solution was.

We alluded to our solution above, but basically it was to realize that when you are endian swapping something, you are no longer dealing with a type but a block of memory, and so you should treat it as a block of memory – we cast the memory at the address to a uint32, swapped it around, and then wrote the bitpattern directly out instead of reassigning it to a float like we were lazily doing before.
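
The approach described above could be sketched as follows. This is not the commenter's actual code. The function names are invented, and `memcpy` is used here for the reinterpretation (rather than a pointer cast) to stay clear of the aliasing problem mentioned earlier. The point is that the byte-swapped pattern only ever lives in a `uint32_t`, never in a float.

```c
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 32-bit value. */
uint32_t swap_u32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Read four bytes that hold a float in the opposite byte order and return
   the native float. The swap happens on an integer, so no invalid or NaN
   pattern is ever loaded into a floating-point register. */
float load_swapped_float(const void *src) {
    uint32_t bits;
    float result;
    memcpy(&bits, src, sizeof bits);
    bits = swap_u32(bits);
    memcpy(&result, &bits, sizeof result);
    return result;
}

/* Write a native float to dst in the opposite byte order. */
void store_swapped_float(void *dst, float value) {
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    bits = swap_u32(bits);
    memcpy(dst, &bits, sizeof bits);
}
```

The destination of `store_swapped_float` is deliberately a `void *` block of memory: once the bytes are swapped they are no longer a float and should not be typed as one.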

Looking over some of your other articles on floating point, I can see indeed points where that leading 1 isn’t present in other floats (like the representation of 1) so the explanation that we have is faulty. We were only seeing this behavior on a select few float values so it could be that we were dealing with NaNs that we hadn’t recognized as such.

That sounds good. You might want to clarify on your blog what the fix/problem was, which is as you describe above — don’t interpret a non-float bag of bits as a float.

Here’s some floating point weirdness worth considering … try computing 10,000 terms of the second order recurrence equation x[n+1] = (x[n] + alpha) / x[n-1] using initial conditions x[0] = x[1] = -alpha / 2 (for starters, try using alpha = -1 and x[0] = x[1] = 0.5).

On any floating-point machine, in single or double precision, a peculiar form of intermittent chaotic behavior appears in the corresponding numerical orbits whenever alpha doesn’t belong to the set {0,1} U [3/2, 3]. Moreover, this chaotic behavior is completely unexpected (i.e., not predicted by the dynamics of the underlying recurrence equation); it arises only when these sequences are computed numerically. More technically, one might say that it occurs due to the non-trivial interaction between the dynamics of the underlying recurrence equation and that of the floating point environment in which its orbits are embedded.

From what I can tell, the round-off errors in these sequences are accumulating and propagating in a bizarre way: there seems to be a very regular pattern where the values in 2 out of every 3 terms exhibit errors – which are growing exponentially in magnitude – while every 3rd term has no error at all. In other words, whenever the previous two terms contain numerical errors, the corresponding computation of the next term somehow cancels these errors out, resulting in an error-free, exact value.

A simpler way of saying the above might be this: there’s a rather interesting combination of error amplification and attenuation at play here.

One particularly interesting problem is trying to explain why no such chaos is observed whenever 3/2 <= alpha <= 3 (the cases when alpha = 0 or 1 are trivial) … these sequences actually converge to the 3-cycle {-1, (1 - alpha), -1} like they are supposed to.
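
For anyone who wants to try this, here is a minimal sketch of the recurrence in double precision. The function name `recurrence_orbit` is made up for illustration. It only generates the terms and makes no attempt to measure the error growth described above.

```c
#include <stddef.h>

/* Fill x[0..n-1] with the orbit of x[k+1] = (x[k] + alpha) / x[k-1],
   starting from x[0] = x[1] = -alpha / 2. Requires n >= 2.
   Returns the last term, x[n-1], for convenience. */
double recurrence_orbit(double alpha, double *x, size_t n) {
    x[0] = x[1] = -alpha / 2.0;
    for (size_t k = 1; k + 1 < n; ++k)
        x[k + 1] = (x[k] + alpha) / x[k - 1];
    return x[n - 1];
}
```

With alpha = -1 the first few terms are 0.5, 0.5, -1, -4, 5, -1, all of which are exactly representable. Printing 10,000 terms with `%.17g` and looking at every third term is one way to explore the behavior described in the comment.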

That’s some cool chaos.

Here’s one more … what is the probability that two randomly chosen floating-point numbers will yield a product whose significand will not have to be normalized? Stated another way, given two random floats, what’s the probability that the product of their respective significands will be less than 2? The answer is 2ln(2) - 1 = 0.38629…

Moreover, this probability has connections to a wide variety of other problems … see my blogpost for more details:

http://stubbornzeta.blogspot.com/2016/04/random-connections.html

For these two random floats, are you assuming that they are in 1.0 to 1.99999 range? I can’t figure out how else “not have to be normalized” and “less than 2” can be equivalent. If that is the case then clarify on your blog post? Looks interesting anyway.

Yes, I’m assuming both significands are in the interval [1,2) … thus, their product will end up somewhere in the interval [1,4) prior to normalization. If normalization is required, this implies the significand of the product ended up in the subinterval [2,4); otherwise, it ended up in the sub-interval [1,2) and so the normalization step wouldn’t be needed.
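
For readers who want to see where 2ln(2) - 1 comes from, here is a short derivation. It assumes the two significands X and Y are independent and uniformly distributed on [1,2); that distribution is an assumption made for this sketch, since the comment does not state one. For a fixed x in [1,2), the product stays below 2 exactly when y lies in [1, 2/x), an interval of length 2/x - 1.

$$
P(XY < 2) = \int_1^2 \left(\frac{2}{x} - 1\right) dx
          = \Big[\, 2\ln x - x \,\Big]_1^2
          = (2\ln 2 - 2) - (0 - 1)
          = 2\ln 2 - 1 \approx 0.38629
$$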

Thanks for the tip, I’ll clear that up.

I’ve added a diagram to the blogpost to help clear up any confusion:

http://stubbornzeta.blogspot.com/2016/04/random-connections.html

Just for the record, as far as floating-point multiplication is concerned, the significand of each operand is always assumed to be in the interval [1,2) unless the number is denormalized, in which case the significand would be in the interval [0,1).

Moreover, the nature of floating-point multiplication allows one to treat each part of the operation independently (prior to normalization): (i) the sign of the product depends only on whether the sign bits of the operands are similar or not; (ii) the significand of the product depends only on the product of the significands of the operands, and (iii) the exponent of the product depends only on the sum of the exponents of the operands.

If the significand of the product has to be normalized, this will increment the exponent of the product by 1; otherwise, the resulting exponent (i.e., the sum of the exponents of the operands) will remain unchanged.
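
The decomposition above can be sketched in a few lines. This is an illustration under stated assumptions: inputs are normalized, nonzero, finite floats; the function names are invented; and it looks only at the exact product of the significands, before rounding. Rounding can occasionally carry a product just below 2 up to 2, and overflow or underflow of the result is ignored.

```c
#include <math.h>

/* Significand of a normalized, nonzero float, scaled into [1, 2). */
double significand_1_2(float v) {
    int e;
    double m = frexpf(v, &e);   /* frexpf returns a fraction with |m| in [0.5, 1) */
    if (m < 0.0)
        m = -m;
    return 2.0 * m;
}

/* Unbiased exponent matching that significand: |v| = significand * 2^exponent. */
int exponent_1_2(float v) {
    int e;
    frexpf(v, &e);
    return e - 1;
}

/* 1 if the exact product of the two significands lands in [2, 4) and so
   needs one normalization shift, 0 if it lands in [1, 2).
   Two 24-bit significands multiply exactly in a double. */
int product_needs_normalization(float a, float b) {
    return significand_1_2(a) * significand_1_2(b) >= 2.0;
}
```

For example, 3.0f and 5.0f have significands 1.5 and 1.25, whose product 1.875 needs no normalization, so the exponent of 15.0f is just 1 + 2. For 3.0f times 3.0f the significand product is 2.25, so the exponent of 9.0f is 1 + 1 plus one more for the normalization step.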

One of my favorite weirdnesses about -0: sqrt(-0) = -0, according to IEEE-754. And IIRC, the only way to _reveal_ the sign of 0 (without examining the bitwise representation directly) is one of the following: Divide by it and look for -Inf vs. +Inf, or use the CopySign primitive to copy the sign onto another number.

There are other places -0 behaves differently. For example, negating a 0 is not the same as subtracting a 0 from 0. I forget the exact case, but there’s one case that doesn’t flip the sign due to a different rule about subtracting equal numbers. But, that doesn’t _reveal_ the sign of 0.
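
The -0 behaviors mentioned in these two comments can be demonstrated with a small sketch. It assumes the default round-to-nearest mode and a compiler that is not using fast-math style optimizations, which are allowed to ignore the sign of zero. The helper names are invented for illustration.

```c
#include <math.h>

/* Reveal the sign of a zero by dividing by it:
   +infinity for +0.0f, -infinity for -0.0f. */
float reciprocal_of_zero(float zero) {
    return 1.0f / zero;
}

/* Reveal the sign by copying it onto 1.0f: the result is +1.0f or -1.0f. */
float sign_of(float x) {
    return copysignf(1.0f, x);
}

/* 1 if x is negative zero: it compares equal to zero but carries a minus sign. */
int is_negative_zero(float x) {
    return x == 0.0f && sign_of(x) < 0.0f;
}

/* Negation flips the sign bit, so negating +0.0f gives -0.0f. */
float negate(float x) {
    return -x;
}

/* Subtracting from zero follows the rule for equal operands instead:
   in round-to-nearest, +0.0f - +0.0f is +0.0f. */
float zero_minus(float x) {
    return 0.0f - x;
}
```

So `negate(0.0f)` is a negative zero while `zero_minus(0.0f)` is a positive one, even though the two results compare equal with `==`.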