Stupid Float Tricks

Type Punning is Not a Joke

I left the last post with a promise to share an interesting property of the IEEE float format. There are several equivalent ways of stating this property, and here are two of them.

For floats of the same sign:

  1. Adjacent floats have adjacent integer representations
  2. Incrementing the integer representation of a float moves to the next representable float, moving away from zero

Depending on your math and mental state these claims will seem somewhere between fantastic/improbable and obvious/inevitable. I think it’s worth pointing out that these properties are certainly not inevitable. Many floating-point formats before the IEEE standard did not have these properties. These tricks only work because of the implied one in the mantissa (which avoids duplicate encodings for the same value), the use of an exponent bias, and the placement of the different fields of the float. The float format was carefully designed in order to guarantee this interesting characteristic.

I could go on at length to explain why incrementing the integer representation of a float moves to the next representable float (incrementing the mantissa increases the value of the float, and when the mantissa wraps to zero that increments the exponent, QED) but instead I recommend you either trust me or else play around with Float_t in your debugger until you see how it works.
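
If you want a quick sanity check without firing up the debugger, here is a minimal sketch of the carry in action (memcpy is used for the type punning purely to sidestep aliasing concerns):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    // 0x3F7FFFFF is the largest float below 1.0f: exponent field 126,
    // mantissa field all ones, value 0.99999994f.
    uint32_t bits = 0x3F7FFFFF;
    float f;
    memcpy(&f, &bits, sizeof f);
    printf("0x%08X -> %1.8e\n", bits, f);

    // Adding one carries out of the mantissa and into the exponent,
    // giving 0x3F800000, which is exactly 1.0f.
    bits += 1;
    memcpy(&f, &bits, sizeof f);
    printf("0x%08X -> %1.8e\n", bits, f);
}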

One thing to be aware of is the understated warning that this only applies for floats of the same sign. The representation of positive zero is adjacent to the representation for 1.40129846e-45, but the representation for negative zero is about two billion away, because its sign bit is set which means that its integer representation is the most negative 32-bit integer. That means that while positive and negative zero compare equal as floats, their integer representations have radically different values. This also means that tiny positive and negative numbers have integer representations which are about two billion apart. Beware!
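
A tiny sketch makes the gap concrete (memcpy-based punning again, nothing fancy):

#include <cstdint>
#include <cstdio>
#include <cstring>

static uint32_t RepOf(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

int main()
{
    printf("+0.0f is 0x%08X\n", RepOf(0.0f));       // 0x00000000
    printf("-0.0f is 0x%08X\n", RepOf(-0.0f));      // 0x80000000
    printf("equal as floats: %d\n", 0.0f == -0.0f); // 1
}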

Another thing to be aware of is that while incrementing the integer representation of a float normally increases the value by a modest and fairly predictable ratio (typically the larger number is at most about 1.00000012 times larger – six zeroes, one plus 1/8388608) this does not apply for very small numbers (between zero and FLT_MIN) or when going from FLT_MAX to infinity. When going from zero to the smallest positive float or from FLT_MAX to infinity the ratio is actually infinite, and when dealing with numbers between zero and FLT_MIN the ratio can be as large as 2.0. However in-between FLT_MIN and FLT_MAX the ratio is relatively predictable and consistent.

Here’s a concrete example of using this property. This code prints all 255*2^23+1 positive floats, from +0.0 to +infinity:

/* See the previous post, "Tricks With the Floating-Point Format", for the
   potential portability problems with the union and bit-fields below. */
union Float_t
{
    Float_t(float num = 0.0f) : f(num) {}
    // Portable extraction of components.
    bool Negative() const { return (i >> 31) != 0; }
    int32_t RawMantissa() const { return i & ((1 << 23) - 1); }
    int32_t RawExponent() const { return (i >> 23) & 0xFF; }

    int32_t i;
    float f;
#ifdef _DEBUG
    struct
    {   // Bitfields for exploration. Do not use in production code.
        uint32_t mantissa : 23;
        uint32_t exponent : 8;
        uint32_t sign : 1;
    } parts;
#endif
};

void IterateAllPositiveFloats()
{
    // Start at zero and print that float.
    Float_t allFloats;
    allFloats.f = 0.0f;
    printf("%1.8e\n", allFloats.f);
    // Continue through all of the floats, stopping
    // when we get to positive infinity.
    while (allFloats.RawExponent() < 255)
    {
        // Increment the integer representation
        // to move to the next float.
        allFloats.i += 1;
        printf("%1.8e\n", allFloats.f);
    }
}

The (partial) output looks like this:

0.00000000e+000
1.40129846e-045
2.80259693e-045
4.20389539e-045
5.60519386e-045
7.00649232e-045
8.40779079e-045
9.80908925e-045

3.40282306e+038
3.40282326e+038
3.40282347e+038
1.#INF0000e+000

For double precision floats you could use _nextafter() to walk through all of the available doubles, but I’m not aware of a simple and portable alternative to this technique for 32-bit floats.
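
If you do go that route for doubles, here is a rough sketch of the idea using the standard nextafter from <cmath> (walking every double would take essentially forever, so this just prints the first few doubles above zero):

#include <cmath>
#include <cstdio>

int main()
{
    double d = 0.0;
    // Repeatedly ask for the next representable double in the
    // direction of 1.0 and print the first few of them.
    for (int i = 0; i < 5; ++i)
    {
        printf("%.17g\n", d);
        d = std::nextafter(d, 1.0);
    }
}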

We can use this property and the Float_t union to find out how much precision a float variable has at a particular range. We can assign a float to Float_t::f, then increment or decrement the integer representation, and then compare the before/after float values to see how much they have changed. Here is some sample code that does this:

float TestFloatPrecisionAwayFromZero(float input)
{
    union Float_t num;
    num.f = input;
    // Incrementing infinity or a NaN would be bad!
    assert(num.parts.exponent < 255);
    // Increment the integer representation of our value
    num.i += 1;
    // Subtract the initial value to find our precision
    float delta = num.f - input;
    return delta;
}

float TestFloatPrecisionTowardsZero(float input)
{
    union Float_t num;
    num.f = input;
    // Decrementing from zero would be bad!
    assert(num.parts.exponent || num.parts.mantissa);
    // Decrementing a NaN would be bad!
    assert(num.parts.exponent != 255 || num.parts.mantissa == 0);
    // Decrement the integer representation of our value
    num.i -= 1;
    // Subtract the initial value to find our precision
    float delta = num.f - input;
    return -delta;
}

struct TwoFloats
{
    float awayDelta;
    float towardsDelta;
};

struct TwoFloats TestFloatPrecision(float input)
{
    struct TwoFloats result =
    {
        TestFloatPrecisionAwayFromZero(input),
        TestFloatPrecisionTowardsZero(input),
    };
    return result;
}

Note that the difference between the values of two adjacent floats can always be stored exactly in a (possibly subnormal) float. I have a truly marvelous proof of this theorem which the margin is too small to contain.

These functions can be called from test code to learn about the float format. Better yet, when sitting at a breakpoint in Visual Studio you can call them from the watch window. That allows dynamic exploration of precision:

[image: calling the TestFloatPrecision functions from the Visual Studio watch window]

Usually the delta is the same whether you increment the integer representation or decrement it. However if incrementing or decrementing changes the exponent then the two deltas will be different. This can be seen in the example above where the precision at 65536 is twice as good (half the delta) going towards zero compared to going away from zero.
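
Here is a quick spot check of that 65536 case using the functions above, with the expected deltas noted in comments:

void SpotCheckPrecisionAt65536()
{
    // 65536.0f is 2^16. Floats in [2^16, 2^17) are spaced 2^16 * 2^-23 = 2^-7
    // apart, while floats in [2^15, 2^16) are spaced 2^-8 apart, which is why
    // the towards-zero delta is half of the away-from-zero delta.
    struct TwoFloats deltas = TestFloatPrecision(65536.0f);
    printf("away:    %1.8e\n", deltas.awayDelta);    // 0.0078125, i.e. 2^-7
    printf("towards: %1.8e\n", deltas.towardsDelta); // 0.00390625, i.e. 2^-8
}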

Caveat Incrementor

Pay close attention to the number of caveats that you have to watch for when you start partying on the integer representation of a float. It’s safe in a controlled environment, but things can quickly go bad in the real world:

  • Some compilers may prohibit the type-punning/aliasing used by Float_t (gcc and VC++ allow it)
  • Incrementing the integer representation of infinity gives you a NaN (see the sketch after this list)
  • Decrementing the integer representation of zero gives you a NaN
  • Incrementing or decrementing some NaNs will give you zero or infinity
  • The ratio of the value of two adjacent floats is usually no more than about 1.00000012, but is sometimes much much larger
  • The representations of positive and negative zero are far removed from each other
  • The representations of FLT_MIN and -FLT_MIN are as far from each other as the representations of FLT_MAX and -FLT_MAX
  • Floating-point math is always more complicated than you expect
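
Here is a small sketch of the infinity and zero caveats, using the Float_t union from above and isnan() to show that both wrapped representations really are NaNs:

#include <cmath>
#include <cstdio>

void DemonstrateNaNCaveats()
{
    // Incrementing +infinity (0x7F800000) gives 0x7F800001: an all-ones
    // exponent with a non-zero mantissa, which is a NaN.
    Float_t pastInfinity;
    pastInfinity.f = INFINITY;
    pastInfinity.i += 1;
    printf("one past infinity is NaN: %d\n", std::isnan(pastInfinity.f));

    // Decrementing +0.0 (0x00000000) wraps to 0xFFFFFFFF: again an all-ones
    // exponent with a non-zero mantissa, so also a NaN.
    Float_t belowZero;
    belowZero.f = 0.0f;
    belowZero.i -= 1;
    printf("one below zero is NaN:    %d\n", std::isnan(belowZero.f));
}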

Log Rolling

A related property is that, for floats that are positive, finite, and non-zero, the integer representation of a float is a piecewise linear approximation of its base 2 logarithm. I just like saying that. It sounds cool.

The reason that the integer representation is (after appropriate scaling and biasing) a piecewise linear representation of the base 2 logarithm of a (positive) float is because the exponent is logically the base-2 logarithm of a float, and it is in the high bits. The mantissa linearly interpolates between power-of-2 floats. The code below demonstrates the concept:

void PlotFloatsVersusRep()
{
    // Let's plot some floats and their representations from 1.0 to 32.0
    for (float f = 1.0f; f <= 32.0f; f *= 1.01f)
    {
        Float_t num;
        num.f = f;
        // The representation of 1.0f is 0x3f800000 and the representation
        // of 2.0f is 0x40000000, so if the representation of a float is
        // an approximation of its base 2 log then 0x3f800000 must be
        // log2(1.0) == 0 and 0x40000000 must be log2(2.0) == 1.
        // Therefore we should scale the integer representation by
        // subtracting 0x3f800000 and dividing by
        // (0x40000000 - 0x3f800000)
        double log2Estimate = (num.i - 0x3f800000) /
                    double(0x40000000 - 0x3f800000);
        //printf("%1.5f,%1.5f\n", f, log2Estimate);
        double log2 = log(f) / log(2.0);
        printf("%1.5f,%1.5f,%1.5f,%1.5f\n", f, log2Estimate, log2, log2Estimate / log2);
    }
}

If we drop the results into Excel and plot them with the x-axis on a base-2 log scale then we get this lovely chart:

[image: chart of the piecewise linear log2 estimate against the true base-2 logarithm, x-axis on a base-2 log scale]

If it’s plotted linearly then the ‘linear’ part of ‘piecewise linear’ becomes obvious, but I like the slightly scalloped straight line better. The estimate is exact when the float is a power of two, and at its worst is about 0.086 too small.
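
If you are curious where the 0.086 comes from: between consecutive powers of two the scaled representation of 1+x is simply x, so the shortfall is log2(1+x) - x, which peaks at x = 1/ln(2) - 1. A quick sketch to confirm the number:

#include <cmath>
#include <cstdio>

int main()
{
    // The error log2(1+x) - x is maximized where its derivative
    // 1/((1+x)*ln2) - 1 is zero, i.e. at x = 1/ln2 - 1 (about 0.4427).
    double x = 1.0 / std::log(2.0) - 1.0;
    double maxError = std::log2(1.0 + x) - x;   // about 0.0861
    printf("worst case: the estimate is %f too small (at x = %f)\n", maxError, x);
}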

In the last of this week’s stupid float tricks I present you with this dodgy code:

int RoundedUpLogTwo(uint64_t input)
{
    assert(input > 0);
    union Float_t num;
    num.f = (float)input;
    // Increment the representation enough that for any non power-
    // of-two (FLT_MIN or greater) we will move to the next exponent.
    num.i += 0x7FFFFF;
    // Return the exponent after subtracting the bias.
    return num.parts.exponent - 127;
}

Depending on how you think about such things this is either the simplest and most elegant, or the most complicated and obtuse way of finding out how many bits it takes to represent a particular integer.
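
A few spot checks of the function above (small inputs only; for huge values the conversion to float rounds, which is part of what makes the code dodgy):

#include <cassert>

void SpotCheckRoundedUpLogTwo()
{
    // For modest inputs this computes ceil(log2(input)).
    assert(RoundedUpLogTwo(1) == 0);
    assert(RoundedUpLogTwo(7) == 3);  // 7 fits in 3 bits
    assert(RoundedUpLogTwo(8) == 3);  // exact power of two, no carry out of the mantissa
    assert(RoundedUpLogTwo(9) == 4);  // 9 needs 4 bits
}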

Random aside

The IEEE standard guarantees correct rounding for addition, multiplication, subtraction, division, and square-root. If you’ve ever wondered if this is important, then try this: using the Windows calculator (I’m using the Windows 7 version), calculate sqrt(4) – 2. The answer should, of course, be zero. However the answer that the calculator actually returns is:

Scientific mode: -8.1648465955514287168521180122928e-39
Standard mode: -1.068281969439142e-19

This is utterly fascinating. It shows that the calculator is using impressive precision (it did the calculations to 40 digits of accuracy, 20 digits in standard mode) and yet it got the wrong answer. Because the Windows calculator is using so much precision it will, for many calculations, get a more accurate answer than an IEEE float or double. However, because it fails to correctly round the answer (the last digit cannot be trusted) it sometimes gives answers that are laughably wrong.

I can just imagine the two calculator modes arguing: the standard mode says “according to my calculations sqrt(4) is 1.9999999999999999999” and the scientific mode says “according to my calculations it’s 1.999999999999999999999999999999999999992 – that is far more accurate”. Meanwhile an IEEE float says “I’ve only got about eight digits of precision but I think sqrt(4) is 2.”

Having lots of digits of precision is nice, but without correct rounding it can just end up making you look foolish.
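
For comparison, here is a one-line sketch of what correctly rounded IEEE doubles give you: 4.0 and 2.0 are exactly representable and sqrt is required to be correctly rounded, so the difference is exactly zero.

#include <cmath>
#include <cstdio>

int main()
{
    // sqrt is one of the operations that IEEE requires to be correctly
    // rounded, and sqrt(4.0) is exactly representable, so this prints 0.
    printf("%g\n", std::sqrt(4.0) - 2.0);
}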

That’s all folks

In the next post I’ll discuss some concrete examples that show the value of knowing how much precision is available around certain values. Until next time, float on.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048

36 Responses to Stupid Float Tricks

  1. Adam says:

    Hi.
    Thx for intresting tips.
    I have made similar program for c and gcc :

    #include
    #include // pow

    int PrintMemRepres(double d)
    {
    size_t i;
    size_t iMax= sizeof d;
    printf(“hex memory representation of %e is : “, d);
    for (i = iMax; i>0 ; –i) printf(“%x”, ((unsigned char *)&d)[i-1]);
    printf(” \n”);

    return 0;
    }

    int main(void)
    {
    double d1 = 0.0;
    double d2;

    while ( d2<1.0)

    {
    d2 = nextafter (d1 , 2.0); // The nextafter function returns the next representable neighbor of x in the direction towards y.
    PrintMemRepres(d2);
    d1=d2;
    }

    return 0;
    }

    Regards

    Adam

    • brucedawson says:

      You should probably use %02x rather than %x, since otherwise you will miss any leading zeroes in the bytes, which can be quite confusing and ambiguous. Also, one disadvantage to scanning the double using a char pointer is that your code is now endian dependent. Be aware.

      Finally, it looks like the code got mangled pretty badly in the process of being pasted into the comment. You might want to edit and fix it, or link to a cleaner version.

  2. Adam says:

    Thx for comments. I’m trying to improve program according to your tips but it is not so easy.
    How can I make a question related with numerical computations and fractals ?

  3. Adam says:

    I have two numbers :
    double d1 = 1.0; // fixed point for fc(z) = z*z
    double d2 = d1 + pow(2.0,-53);
    After comparing its memory representations I see that these numbers are the same for computer because d2<nextafter (d1 , 2.0);
    Am I right ?
    Now compare :
    double d1 = 0.5; // parabolic fixed point
    double z = d1 + pow(2.0,-27); // point of exterior of Julia set
    These numbers are not the same, but when I iterate d2 using fc(z) = z*z + 0.25
    then z is not escaping from d1 ?
    ???

    • brucedawson says:

      You are partially right. If my calculations are correct then nextafter(1.0, 2.0) gives you 1.0 + pow(2.0, -52). Therefore, adding pow(2.0, -53) gets you half way there. The default IEEE rounding mode is round-to-nearest, tie goes to even. So, the tie is broken by rounding to 1.0.

      Take a look at the previous post for more details:

      http://altdevblogaday.com/2012/01/05/tricks-with-the-floating-point-format/

      Also, the code linked to from this post:

      Float Precision–From Zero to 100+ Digits


      handles doubles as well, and that would let you explore the double format more easily.

      Also note that if you want to see if two numbers are the same you can just compare them. Or print them with a seventeen digit mantissa. You don’t *need* to examine their memory representations. Comparing for equality is often a bad idea, but not in this case where that is really what you want.

      • Adam says:

        OK. In second example numbers are not the same, but the effect is like in the first example.
        You are an expert programmer an I’m an amateur. What happens in second example ?

    • brucedawson says:

      I don’t know what happens in the second example. Floating-point math has finite precision. Fractals are chaotic. When you are right near the edge of the interior/exterior of the Julia/Mandelbrot set then you should expect the unexpected. Maybe you have insufficient precision, or maybe you need to iterate more. Sorry, I don’t know which.

    • Cody says:

      I know this is old but in case you’re still wondering see the following for the maths:

      https://lodev.org/cgtutor/juliamandelbrot.html

  4. avjewe says:

    Why are the bitfields for exploration only, and not for use in production code?

  5. avjewe says:

    It is a violation of the aliasing rules for some compilers, and will probably generate inefficient code. You answered this in the previous post. Sorry about that.

    • brucedawson says:

      Also the bit-fields are compiler dependent (the standard does not mandate the layout of bitfields IIRC) and tend to break on architectures with different byte-ordering rules.

  6. Regarding “same signness”: you can make the floating point representation symmetric by:

    // If the sign bit is set, invert the number.
    if (temp & 0x80000000) {
    // This maps -0 to +0. Too bad, so we add + 1.
    temp = ~(temp ^ 0x80000000);
    }

    // Make unsigned
    temp += 0x80000000;

    Now 0.0f is at 0x80000000 (0.5 ulps above half range) and -0.0f is at 0x7FFFFFFF (0.5 ulps below half range) and you can interpret the value as an unsigned long int and the distance function (a-b) works again for any fp-sign combination. You may want to introduce special code to skip distance=(0.0) – (-0.0)=1, or you consider equality for 0.0=-0.0 and don’t add the bias of 1. Signedness is still queryable: MSB set is positive, MSB cleared is negative.

    We use this trick in lossless predictive floating point compression, where the result of the distance (a-b) of the original to the predicted value should result in a smooth laplacian PDF.

    • brucedawson says:

      I suggested some variant on this in my original article but I avoided proposing it this time because in most cases the ULPs distance across (or near) the zero-chasm isn’t meaningful. It’s usually more meaningful to use an absolute epsilon to cover that range. This was brought home to me when I was asked why sin(pi) was millions of ulps from zero.

      So yes, you can unify signed and unsigned floats into one continuous system, but you usually shouldn’t.

      One thing to be careful with is that a-b where ‘a’ and ‘b’ are arbitrary ints requires 33 bits to store the result.

      • >”millions of ulps from zero”

        Well, in the case of the code I brought forward, it’s 1 ULPs (assuming a sin()-approximation for that sin(0) = 0 and sin(pi) = -0). Or maybe you’re referring to the unmodified case?

        > “requires 33 bits to store the result.”

        Yes, but it’s avoidable in some cases. You can let the delta wrap around and use it as-is: “c=a-b := b=a-c := a=c-b”; if you don’t throw away both a and b you can always restore the delta-sign by checking the wrap-around in the arithmetic (carry check). Which you can also do if you are not thinking about storing c at all. Or if you are just interested in the magnitude of the delta, and not its direction. I don’t see it as a major problem; it’s certainly manageable.

      • brucedawson says:

        This is intended as a reply to Niels’ comment below…

        You say “sin(pi) = -0” but have you tried that? Because in the world of floating-point math “-0” is emphatically not the result you will get. That’s my point. sin(pi) actually calculates sin((float)pi) and the answer is (float)pi – pi, which is millions of ulps away from zero. There are very few cases where treating the smallest positive and smallest negative numbers as being a few ULPs away from each other makes sense. There may be some cases where it is appropriate (if you have equations that converge successfully on a result of zero then it is) but sin(pi) is an example of why an ULPs based comparison between positive and negative numbers is *usually* not what you want.

        See this article for details on sin((float)pi):

        Comparing Floating Point Numbers, 2012 Edition

        I agree that handling the result of a-b is manageable (just put the result in a 64-bit number) but it is something that has to be thought about.

      • > You say “sin(pi) = -0″ but have you tried that?

        Hm, no, I just was proposing: “assuming a sin()-approximation for that sin(0) = 0 and sin(pi) = -0” for the discussion.

        > which is millions of ulps away from zero

        But that is a general problem of the approximation, not of the integer-cast. The approximation which gives you -8.742278E-8 has ~67 ULPs (if I did this about right with calc and counting bits), which is pretty horrible.

        I really don’t think it restricts the use of the symmetry-transform (from above) for . Let’s take an approximation of sin() which gives “sin(0)=8.742278E-8” and “sin(pi)=-8.742278E-8”, just to see if we get symmetric error-magnitudes, which it should:

        sin(pi) sin(0)
        -8.742278E-8 8.742278E-8 0.0 floating-point
        0xB3BBBD2E 0x33BBBD2E 0x00000000 hex (integer cast)
        0x33BBBD2E fix-up 1 (temp ^ 0x8000000)
        0xCC4442D1 fix-up 2 (~(temp ^ 0x8000000))
        0x4C4442D1 0xB3BBBD2E 0x80000000 bias (temp + 0x8000000)

        Now looking at the magnitudes of:

        0-sin(pi) sin(0)-0 abs difference via max()-min()
        0x80000000 – 0xB3BBBD2E –
        0x4C4442D1 = 0x80000000 =
        0x33BBBD2F 0x33BBBD2E

        It is about the same (just 1 difference because we gave -0.0 it’s own “integer slot”).
        The integer-math is distributive: “sin(0)-(-sin(pi)) = (0-sin(pi))+(sin(0)-0)”

        sin(0)-(-sin(pi)) (0-sin(pi))+(sin(0)-0) abs difference via max()-min()
        0xB3BBBD2E – 0x33BBBD2F +
        0x4C4442D1 = 0x33BBBD2E =
        0x67777A5D 0x67777A5D

        After the transform the nextafter() and previousbefore() you mention above holds as well: 0.0 is 0x80000000, nextafter() is 0x80000001 a bit above 0.0, previousbefore() is 0x7FFFFFFF which is -0.0, and so on. So in this representation, the tiniest positive number and the tiniest negative number have 2 int-units distance (and the original real fp-values are 2 ULPs different).

        Now I’m a bit confused, maybe I’m missing the point of your criticism. :^)

        • brucedawson says:

          I agree that if you have a calculation that converges extremely close to plus/minus zero that your technique could be valid. But what, I am asking, is this calculation? Does it happen in practice?

          sin() is not it. If sin() is 100% accurate then sin((float)pi) will not give zero, or anywhere close to it. Therefore it makes a poor real-world example for float ULPs compares near zero.

          In short, I understand that your technique works. However I don’t see many places where it is relevant. Adjacent floats at-or-near zero are not, as far as I can tell, the same thing as adjacent floats near, say 0.01. When comparing two floats that are above FLT_MIN there is a clear relative-error meaning to the ULPs comparison. The meaning near zero is totally different, and rarely as useful.

          I recommended your technique eight years ago. Now I am not convinced that it is a good idea in any common case.

        • shixingwang says:

          “The approximation which gives you -8.742278E-8 has ~67 ULPs ”
          i think -8.742278E-8 has 0xBBBD2E ULPs, how do you think it has 67 ULPs?

          • brucedawson says:

            The sentence seems unclear at best. You shouldn’t ever talk about a number “having” a certain number of ulps. Two numbers can be separated by a certain number of ulps, and I assume that you mean that -8.742278E-8 is 0xBBBD2E ULPs away from zero — that sounds about right. Anyway, I don’t think looking at how many ULPs a number is away from zero is particularly meaningful or useful, so I don’t think the question of how many ULPs away it is from zero is worthwhile.

          • shixingwang says:

            to brucedawson:
            i am interested in your blog, so i read it very carefully, do not want to miss any usefull information, thank you for the detail explanation.

      • Uh, all my formating-effort went to hell, sorry. 😀

      • > Adjacent floats at-or-near zero are not, as far as I can tell, the same thing as adjacent floats near, say 0.01.

        Right. But that’s a general problem of the log-scale of the FP-format. ULPs in my opinion are only comparable for as long as the collected ULPs values lie near each other; comparing two ULPs from two different locations, or across big differences, becomes nearly meaningless because of the log scale caused by the location (high or low or negative etc.). 1 ULP can mean almost any absolute error, depending on where it is. So it’s just not the tool for that, it’s good for checking if approximations suck badly or just a bit, or how big the rounding skew of a cascade of mul+add vs. madd is, or if your sin() is IEEE compliant and is below 0.5 ULPs.

        > But what, I am asking, is this calculation? Does it happen in practice?

        Yes, it’s in every lossless predictive floating-point number compressor, and also in JPEG2000. Be it half-float FP-images or FP-vertices, arrays of doubles etc. pp. I have it in my predictive image compressor because it gives statistically smooth and meaningful statements about two FP-values, without the problem of exploding precision issues which you get if you want to do that as FPs (that is subtract a big number, high exponent, from a small number, low exponent, the mantissa of the lower number may vanish completely and the math is sadly lossy, and unusable).

        http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.72.3069

        Very great paper. 🙂
        I agree it appears a somewhat limited area. But I think it’s also essential for any kind of lossless FP-meddling, be it for compression or not.

  7. ttt says:

    Could you please explain as to why when incrementing the integer representation of a float, the larger number is at most about 1.0000012 times larger?

    • brucedawson says:

      Sure. A float has a 24-bit mantissa (the leading one is implied, but it still counts for most numbers). If the mantissa is 0xFFFFFE and you increment it to 0xFFFFFF then you get a ratio of 1.00000006 (0xFFFFFF/0xFFFFFE). That is the smallest increase possible. The largest increase possible (for a normalized number) is if the mantissa is 0x800000 and it gets incremented to 0x800001. That gives us the ratio of 1.00000012. Which, I just realized means that the number in my post (1.0000012 — one fewer zero) is wrong. I’ve fixed it, and added a terse explanation of the calculation, which is one + 1/8388608.

      Thanks for pointing that out.

      • shixingwang says:

        When going from zero to the smallest positive float or from FLT_MAX to infinity the ratio is actually infinite,——————i am confused that:
        FLT_MAX 3.40282347e+38 0x7F7FFFFF
        Positive infinity 0x7f800000
        so the ratio is “1.0/0x7F7FFFFF ==infinite” ???

        • brucedawson says:

          The ratio of the representations is modest, but the ratio of the represented floating-point values is infinite. INF/FLT_MAX == infinity. Similarly, the smallest denormal divided by zero is infinity.

  8. Peter Cordes says:

    There is a single-precision version of ISO C’s nextafter(double x,double y). It’s called nextafterf(float x, float y). There’s also nexttowardf / nexttoward / nexttowardl, which takes the destination as a long double. So iterating x=nexttowardf(x,y) will lead to bouncing between two values when y isn’t exactly representable as a float. These are all specified in the ISO C11 standard, and they’re standard C++ functions, too: http://en.cppreference.com/w/cpp/numeric/math/nextafter.

    glibc’s implementation works essentially like your blog post, by figuring out whether to increment or decrement the bit pattern as a 2’s complement integer. (The double-precision version is especially ugly, though, since it unpacks to 32-bit integers and doesn’t use an FP compare to check for NaN.) The code was written in 1993 for Sun, and got into glibc via NetBSD. I’m working on an improved version, but it’s hard to get it to compile anywhere near as nicely as what I can do with hand-written asm.

  9. Ben Morgan says:

    Just FYI: your link at the top of the article “last post” leads to a 404.

  10. x4mmmx4m says:

    Great article!

    Some time ago I had tried to push patch to PostgreSQL with this trick, but community stood against 🙂

  11. Mark Gabby-Li says:

    Web archive version of type-punning/aliasing link, since the website is no longer available:
    https://web.archive.org/web/20120224063047/http://labs.qt.nokia.com/2011/06/10/type-punning-and-strict-aliasing/
