## Tricks With the Floating-Point Format

Years ago I wrote an article about how to do epsilon floating-point comparisons by using integer comparisons. That article has been quite popular (it is frequently cited, and the code samples have been used by a number of companies) and this worries me a bit, because the article has some flaws. I’m not going to link to the article because I want to replace it, not send people looking for it.

Today I am going to start setting the groundwork for explaining how and why this trick works, while also exploring the weird and wonderful world of floating-point math.

There are lots of references that explain the layout and decoding of floating-point numbers. In this post I am going to supply the layout, and then show how to reverse engineer the decoding process through experimentation.

The IEEE 754-1985 standard specifies the format for 32-bit floating-point numbers, the type known as ‘float’ in many languages. The 2008 version of the standard adds new formats but doesn’t change the existing ones, which have been standardized for over 25 years.

A 32-bit float consists of a one-bit sign field, an eight-bit exponent field, and a twenty-three-bit mantissa field. The union below shows the layout of a 32-bit float. This union is very useful for exploring and working with the internals of floating-point numbers. I don’t recommend using this union for production coding (it is a violation of the aliasing rules for some compilers, and will probably generate inefficient code), but it is useful for learning. These articles on aliasing and undefined behavior and type punning provide some more details, but be aware that this usage is probably undefined behavior in C++.

```union Float_t
{
Float_t(float num = 0.0f) : f(num) {}
// Portable extraction of components.
bool Negative() const { return (i >> 31) != 0; }
int32_t RawMantissa() const { return i & ((1 << 23) - 1); }
int32_t RawExponent() const { return (i >> 23) & 0xFF; }

int32_t i;
float f;
#ifdef _DEBUG
struct
{   // Bitfields for exploration. Do not use in production code.
uint32_t mantissa : 23;
uint32_t exponent : 8;
uint32_t sign : 1;
} parts;
#endif
};
```

The format for 32-bit float numbers was carefully designed to allow them to be reinterpreted as an integer, and the aliasing of ‘i’ and ‘f’ should work on most platforms (if, such as gcc and VC++, they allow aliasing through unions), with the sign bit of the integer and the float occupying the same location.

The layout of bitfields is compiler dependent so the bitfield struct that is also in the union may not work on all platforms. However it works on Visual C++ on x86 and x64, which is good enough for my exploratory purposes. On big endian systems like SPARC and PPC the order in the bitfield struct is reversed.

In order to really understand floats, it is important to explore and experiment. One way to explore is to write code like this, in a debug build so that the debugger doesn’t optimize it away:

```void TestFunction()
{
Float_t num(1.0f);
num.i -= 1;
printf("Float value, representation, sign, exponent, mantissa\n");
for (;;)
{
// Breakpoint here.
printf("%1.8e, 0x%08X, %d, %d, 0x%06X\n",
num.f, num.i,
num.parts.sign, num.parts.exponent, num.parts.mantissa);
}
}
```

Put a breakpoint on the ‘printf’ statement and then add the various components of num to your debugger’s watch window and examine them, like this: You can then start trying interactive experiments, such as incrementing the mantissa or exponent fields, incrementing num.i, or toggling the value of the sign field. As you do this you should watch num.f to see how it changes. Or, assign various floating-point values to num.f and see how the other fields change. You can either view the results in the debugger’s watch window, or hit ‘Run’ after each change so that the printf statement executes and prints some nicely formatted results.

Go ahead. Put Float_t and the sample code into a project and play around with it for a few minutes. Discover the minimum and maximum float values. Experiment with the minimum and maximum mantissa values in various combinations. Think about the implications. This is the best way to learn. I’ll wait. I’ve put some of the results that you might encounter during this experimentation into the table below:

 Float value Integer representation Sign Exponent field Mantissa field 0.0 0x00000000 0 0 0 1.40129846e-45 0x00000001 0 0 1 1.17549435e-38 0x00800000 0 1 0 0.2 0x3E4CCCCD 0 124 0x4CCCCD 1.0 0x3F800000 0 127 0 1.5 0x3FC00000 0 127 0x400000 1.75 0x3FE00000 0 127 0x600000 1.99999988 0x3FFFFFFF 0 127 0x7FFFFF 2.0 0x40000000 0 128 0 16,777,215 0x4B7FFFFF 0 150 0x7FFFFF 3.40282347e+38 0x7F7FFFFF 0 254 0x7FFFFF Positive infinity 0x7f800000 0 255 0

With this information we can begin to understand the decoding of floats. Floats use an base-two exponential format so we would expect the decoding to be mantissa * 2^exponent. However in the encodings for 1.0 and 2.0 the mantissa is zero, so how can this work? It works because of a clever trick. Normalized numbers in base-two scientific notation are always of the form 1.xxxx*2^exp, so storing the leading one is not necessary. By omitting the leading one we get an extra bit of precision – the 23-bit field of a float actually manages to hold 24 bits of precision because there is an implied ‘one’ bit with a value of 0x800000.

The exponent for 1.0 should be zero but the exponent field is 127. That’s because the exponent is stored in excess 127 form. To convert from the value in the exponent field to the value of the exponent you simply subtract 127.

The two exceptions to this exponent rule are when the exponent field is 255 or zero. 255 is a special exponent value that indicates that the float is either infinity or a NAN (not-a-number), with a zero mantissa indicating infinity. Zero is a special exponent value that indicates that there is no implied leading one, meaning that these numbers are not normalized. This is necessary in order to exactly represent zero. The exponent value in that case is –126, which is the same as when the exponent field is one.

To clarify the exponent rules I’ve added an “Exponent value” column which shows the actual binary exponent implied by the exponent field:

 Float value Integer representation Sign Exponent field Exponent value Mantissa field 0.0 0x00000000 0 0 -126 0 1.40129846e-45 0x00000001 0 0 -126 1 1.17549435e-38 0x00800000 0 1 -126 0 0.2 0x3E4CCCCD 0 124 -3 0x4CCCCD 1.0 0x3F800000 0 127 0 0 1.5 0x3FC00000 0 127 0 0x400000 1.75 0x3FE00000 0 127 0 0x600000 1.99999988 0x3FFFFFFF 0 127 0 0x7FFFFF 2.0 0x40000000 0 128 1 0 16,777,215 0x4B7FFFFF 0 150 23 0x7FFFFF 3.40282347e+38 0x7F7FFFFF 0 254 127 0x7FFFFF Positive infinity 0x7f800000 0 255 Infinite! 0

Although these examples don’t show it, negative numbers are dealt with by setting the sign field to 1, which is called sign-and-magnitude form. All numbers, even zero and infinity, have negative versions.

The numbers in this chart were chosen in order to demonstrate various things:

• 0.0: It’s handy that zero is represented by all zeroes. However there is also a negative zero which has the sign bit set. Negative zero is equal to positive zero.
• 1.40129846e-45: This is the smallest positive float, and its integer representation is the smallest positive integer
• 1.17549435e-38: This is the smallest float with an implied leading one, the smallest number with a non-zero exponent, the smallest normalized float. This number is also FLT_MIN. Note that FLT_MIN is not the smallest float. There are actually about 8 million positive floats smaller than FLT_MIN.
• 0.2: This is an example of one of the many decimal numbers that cannot be precisely represented with a binary floating-point format. That mantissa wants to repeat ‘C’ forever.
• 1.0: Note the exponent and the mantissa, and memorize the integer representation in case you see it in hex dumps.
• 1.5, 1.75: Just a couple of slightly larger numbers to show the mantissa changing while the exponent stays the same.
• 1.99999988: This is the largest float that has the same exponent as 1.0, and the largest float that is smaller than 2.0.
• 2.0: Notice that the exponent is one higher than for 1.0, and the integer representation and exponent are one higher than for 1.99999988.
• 16,777,215: This is the largest odd float. The next larger float has an exponent value of 24, which means the mantissa is shifted enough left that odd numbers are impossible. Note that this means that above 16,777,216 a float has less precision than an int.
• 3.40282347e+38: FLT_MAX. The largest finite float, with the maximum finite exponent and the maximum mantissa.
• Positive infinity: The papa bear of floats.

We can now describe how to decode the float format:

• If the exponent field is 255 then the number is infinity (if the mantissa is zero) or a NaN (if the mantissa is non-zero)
• If the exponent field is from 1 to 254 then the exponent is between –126 and 127, there is an implied leading one, and the float’s value is:
• (1.0 + mantissa-field / 0x800000) * 2^(exponent-field-127)
• If the exponent field is zero then the exponent is –126, there is no implied leading one, and the float’s value is:
• (mantissa-field / 0x800000) * 2^-126
• If the sign bit is set then negate the value of the float

The excess-127 exponent and the omitted leading one lead to some very convenient characteristics of floats, but I’ve rambled on too long so those must be saved for the next post. I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048
This entry was posted in AltDevBlogADay, Floating Point, Math, Programming. Bookmark the permalink.

### 28 Responses to Tricks With the Floating-Point Format

1. david.wan says:

nice work. This clarifies the float storage & operation.

2. Joe Porkka says:

I wanted to generate random numbers in the range from 0 to 1 given a random number generator that returns a 32 bit unsigned integer.
The obvious way:
double const randmax = 1.0 / 0xFFFFffff;

unsigned n = getrand(); // Returns a 32-bit unsigned value
float value = (float) (n * randmax);

Doesn’t work well because float doesn’t have enough precision, so you do not get an even distribution of unique values in the range from 0 to 1.
For those who don’t regularly read this blog 🙂 it isn’t obvious that float has more discrete values from 0 to 0.5 than from 0.5 to 1.0.
This results in up to 256 consecutive values of n producing the same float value.

The “cheat” that I wrote to make fast conversion to float (using a simple union):
union FloatParts
{
float f;
int32 i;
};

FloatParts value;
value.i = (127 << 23) | (n & 0x7FFFFF);
value.f -= 1.0;

Works fine and I think it produces a much nicer distribution of “float” values that the obvious way above – as all the adjacent values are evenly spaced with no duplicates.

0: 0.000000 (0.00000000e+000)
1: 0.000000 (1.19209290e-007) Delta= (-1.19209290e-007)
2: 0.000000 (2.38418579e-007) Delta= (-1.19209290e-007)
3: 0.000000 (3.57627869e-007) Delta= (-1.19209290e-007)
4: 0.000000 (4.76837158e-007) Delta= (-1.19209290e-007)
7ffffb: 0.999999 (9.99999404e-001)
7ffffc: 1.000000 (9.99999523e-001) Delta= (-1.19209290e-007)
7ffffd: 1.000000 (9.99999642e-001) Delta= (-1.19209290e-007)
7ffffe: 1.000000 (9.99999762e-001) Delta= (-1.19209290e-007)
7fffff: 1.000000 (9.99999881e-001) Delta= (-1.19209290e-007)

This uses an exponent value of 0 (127) to produce numbers in the range 1.0 to less than 2.0.
All values in this range are equally spaced (since they all have the same exponent).

Using an exponent of 1 (128), produces values in the range 2.0 to less than 4.0, spaced out twice as far as before.
Using an exponent of 2 (129), produces values in the range 4.0 to less than 8.0, spaced out twice again.

Bruce pointed out to me that this:
static const unsigned maxRange = 1 << 24;
n &= maxRange – 1;
float value = n * 1.0 / maxRange;
will produce twice as many equally space values in the range from 0.0 to less than 1.0

This is because in the range 0.5 to less than 1 (exponent = -1) there are the same number of values are there are in the range 1.0 to less than 2.0.
It follows then that there are the same number of values in the range 0.25 to less than 0.5 (exponent = -2)

However, to keep things equally spaced, we can only use the same number of values between 0.0 to less than 0.5 as from 0.5 to less than 1.0.

This makes use of 1 more bit of “n” in the exponent than the bit-fiddling method does.

• brucedawson says:

I decided to test it (minus the generation of the random integer) to make sure my e-mail code was correct:

// Implementation
float RandFloat(unsigned n)
{
const unsigned maxRange = 1 << 24;
n &= maxRange – 1;
float value = n * 1.0 / maxRange;
return value;
}

// Test code
void RandFloats()
{
for (unsigned n = (1 << 24) – 10; n < (1 << 24); ++n)
{
Float_t randf(RandFloat(n));
printf("RandFloat(%d) = %1.8e (%08x)\n", n, randf.f, randf.i);
}
}

Output:
RandFloat(16777206) = 9.99999404e-001 (3f7ffff6)
RandFloat(16777207) = 9.99999464e-001 (3f7ffff7)
RandFloat(16777208) = 9.99999523e-001 (3f7ffff8)
RandFloat(16777209) = 9.99999583e-001 (3f7ffff9)
RandFloat(16777210) = 9.99999642e-001 (3f7ffffa)
RandFloat(16777211) = 9.99999702e-001 (3f7ffffb)
RandFloat(16777212) = 9.99999762e-001 (3f7ffffc)
RandFloat(16777213) = 9.99999821e-001 (3f7ffffd)
RandFloat(16777214) = 9.99999881e-001 (3f7ffffe)
RandFloat(16777215) = 9.99999940e-001 (3f7fffff)

Looks good. Evenly distributed floats, as close as they can be in the 0.5 to 1.0 range, and they will be equivalently distributed in the 0.0 to 0.5 range.

So there you go, if you want to generate evenly distributed floats in the 0.0 to 1.0 range, just generate integers from 0 to 1 << 24 (inclusive if you want to generate 1.0) and call the RandFloat function above.

• plasmacel says:

Nice solution. However note that you cannot generate floats in the range [0.0, 1.0] inclusive with RandFloat because the modulo “n &= maxRange – 1” wraps any (1 << 24) input value to 0. If you extend maxRange then that weakens statistical uniformity.

• plasmacel says:

I mean it is perfect for the range [0.0, 1.0] exclusive, but not inclusive.

• Rich Delaney says:

And I thought I was the only one who thought of this. There goes my interview question.

• Pavel K. says:

Just as an addition for those that come across this comment: I think you can do this with a number generator of any precision (not just 32-bit integers) by repeating the most significant bits:

union FltRep {
float f;
unsigned int i;
};

FltRep rep;
const unsigned short r = rand();

// RAND_MAX is 0x7FFF, which offers 15 bits
// of precision. Therefore, we move the bits of r
// into the top of the 23 bit mantissa, and
// repeat the most significant bits of r in
// the least significant of the mantissa
const unsigned int m = (r <<8) | (r >> 7);

rep.i = (127 << 23) | m;
rep.f -= 1.0f;

However, there are some numbers that you will never get exactly. For example, 1.5 is encoded as 0x3fc00000. Assuming we have 15 bits of precision, the closest number would have to come from generating 0x4000 from rand(), In repeating the significant bits, you would get 0x3fc00080 (which is 1.5000153). Similarly, generating 0x3fff from rand() would get 0x3fbfff7f (which is 1.4999846).

I do believe this preserves the property that you generate just as many values below 0.5 as you do above 0.5, though.

• brucedawson says:

Why not call rand() twice to generate more bits, rather than reusing the 15 bits you initially get?

• Pavel K says:

I left the comments without signing up for a wordpress account. I think that is what’s preventing me from editing, etc.

The relevant code that I meant to post is here:
http://paste.kde.org/522362/

To answer your question: The libc rand() function implemented on the windows machine that I was using was prohibitively slow. Calling it twice seemed to almost double the cost of generating the random number. That being said, calling rand() multiple times instead of repeating bits would certainly produce similarly distributed but more diverse results.

• brucedawson says:

I updated your comment and deleted the meta-comments. Let me know if it doesn’t look correct.

3. Simon says:

Great article, this clarifies many things !
Though I didn’t understand why you’re dividing the mantissa by the “value of the implied one”, 0x800000, when decoding. Do you have any precision ? Thanks.

• brucedawson says:

The mantissa field is a fixed-point number — an integer that represents a fraction. In order to turn it into a ‘real’ number we need to divide by 0×800000. That division should be floating-point or ‘real number’ division so that it gives us a number from 0.0 to 0.99999. For instance, if the mantissa is 0x400000 that means 0.5, and dividing by 0x800000 gives us that value.

• Simon says:

I didn’t read it as a floating-point division. This is limpid now, thanks !

4. tattoo bradford says:

This is a topic that’s close to my heart… Cheers! Exactly where are your contact details though?

5. wrabbit99 says:

I am looking for a way to store 32-bit float data in compressed format. The data is for geometric locations in a scene, but it is localized to individual, relatively small, meshes. I had thought of using the ILM half library, but some sources suggest it is not well-suited to location data, and the complexity of its implementation makes me wonder whether I would be making full use of the bits.

My first thought is to determine the min and max corners of a bounding box around each mesh and express each coordinate as a ratio between the two using a double for precision. I could bump the corners slightly to ensure the ratio is always in the range (-1.0, 1.0). Then I could multiply by 2^15, say, and convert to binary leaving room for one sign bit. This means each mesh is likely stored with varying precision, but it seems relatively straightforward relative to the IEEE 754 format and provides a predictable compression ratio.

My more sober second thought is that this is a well-trodden path. If you know of a good approach it would be appreciated and might be helpful to others. I suspect it’s a FAQ and I just haven’t come across the right reference :>(.

• brucedawson says:

The key thing to remember is that a float gives higher precision near zero, where as a (scaled) integer gives uniform precision. For storing a mesh you probably don’t want higher precision near zero, so using a scaled integer is more sensible. You can play around with the number of bits to get exactly the right balance between data size and precision.

Using 16-bit floats means that you will have more precision near zero (compared to 16-bit integer) but less precision near the edges. Probably not the right tradeoff.

• wrabbit99 says:

• erwincoumans says:

>>I could bump the corners slightly to ensure the ratio is always in the range (-1.0, 1.0).
How do you determine exactly how much to “bump the corners” to be safe for very large floating point values?

6. brucedawson says:

Another promising guide, that is less technical than this series, is http://floating-point-gui.de/

7. PAOK says:

Bolches yarboclos, Batman!!!

Nice post

This site uses Akismet to reduce spam. Learn how your comment data is processed.