Comparing Floating Point Numbers, 2012 Edition

This post is a more carefully thought out and peer reviewed version of a floating-point comparison article I wrote many years ago. This one gives solid advice and some surprising observations about the tricky subject of comparing floating-point numbers.

We’ve finally reached the point in this series that I’ve been waiting for. In this post I am going to share the most crucial piece of floating-point math knowledge that I have. Here it is:

[Floating-point] math is hard.

You just won’t believe how vastly, hugely, mind-bogglingly hard it is. I mean, you may think it’s difficult to calculate when trains from Chicago and Los Angeles will collide, but that’s just peanuts to floating-point math.

Seriously. Each time I think that I’ve wrapped my head around the subtleties and implications of floating-point math I find that I’m wrong and that there is some extra confounding factor that I had failed to consider. So, the lesson to remember is that floating-point math is always more complex than you think it is. Keep that in mind through the rest of the post where we talk about the promised topic of comparing floats, and understand that this post gives some suggestions on techniques, but no silver bullets.

Previously on this channel…

This is the fifth chapter in a long series. The first couple in the series are particularly important for understanding this point. A (mostly) complete list of the other posts includes:

Comparing for equality

Floating point math is not exact. Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations or the precision of intermediates can change the result. That means that comparing two floats to see if they are equal is usually not what you want. GCC even has a warning for this: “warning: comparing floating point with == or != is unsafe”.

Here’s one example of the inexactness that can creep in:

float f = 0.1f;
float sum;
sum = 0;

for (int i = 0; i < 10; ++i)
    sum += f;
float product = f * 10;
printf("sum = %1.15f, mul = %1.15f, mul2 = %1.15f\n",
        sum, product, f * 10);

This code tries to calculate ‘one’ in three different ways: repeated adding, and two slight variants of multiplication. Naturally we get three different results, and only one of them is 1.0:

sum=1.000000119209290, mul=1.000000000000000, mul2=1.000000014901161

Disclaimer: the results you get will depend on your compiler and your compiler settings, which actually helps make the point.

So what happened, and which one is correct?

What do you mean ‘correct’?

Before we can continue I need to make clear the difference between 0.1, float(0.1), and double(0.1). In C/C++ the numbers 0.1 and double(0.1) are the same thing, but when I say “0.1” in text I mean the exact base-10 number, whereas float(0.1) and double(0.1) are rounded versions of 0.1. And, to be clear, float(0.1) and double(0.1) don’t have the same value, because float(0.1) has fewer binary digits, and therefore has more error. Here are the exact values for 0.1, float(0.1), and double(0.1):

Number Value
0.1 0.1 (of course)
float(.1) 0.100000001490116119384765625
double(.1) 0.10000000000000000555111512312578270211815834045 41015625

With that settled, let’s look at the results of the code above:

  1. sum = 1.000000119209290: this calculation starts with a rounded value and then adds it ten times with potential rounding at each add, so there is lots of room for error to creep in. The final result is not 1.0, and it is not 10 * float(0.1). However it is the next representable float above 1.0, so it is very close.
  2. mul = 1.000000000000000: this calculation starts with a rounded value and then multiplies by ten, so there are fewer opportunities for error to creep in. It turns out that the conversion from 0.1 to float(0.1) rounds up, but the multiplication by ten happens to, in this case, round down, and sometimes two rounds make a right. So we get the right answer for the wrong reasons. Or maybe it’s the wrong answer, since it isn’t actually ten times float(0.1) !
  3. mul2 = 1.000000014901161: this calculation starts with a rounded value and then does a double-precision multiply by ten, thus avoiding any subsequent rounding error. So we get a different right answer – the exact value of 10 * float(0.1) (which can be stored in a double but not in a float).

So, answer one is incorrect, but it’s only one float away. Answer two is correct (but inexact), whereas answer three is completely correct (but appears wrong).

Now what?

Now we have a couple of different answers (I’m going to ignore the double precision answer), so what do we do? What if we are looking for results that are equal to one, but we also want to count any that are plausibly equal to one – results that are “close enough”.

Epsilon comparisons

If comparing floats for equality is a bad idea then how about checking whether their difference is within some error bounds or epsilon value, like this:

bool isEqual = fabs(f1 – f2) <= epsilon;

With this calculation we can express the concept of two floats being close enough that we want to consider them to be equal. But what value should we use for epsilon?

Given our experimentation above we might be tempted to use the error in our sum, which was about 1.19e-7f. In fact, there’s even a define in float.h with that exact value, and it’s called FLT_EPSILON.

Clearly that’s it. The header file gods have spoken and FLT_EPSILON is the one true epsilon!

Except that that is rubbish. For numbers between 1.0 and 2.0 FLT_EPSILON represents the difference between adjacent floats. For numbers smaller than 1.0 an epsilon of FLT_EPSILON quickly becomes too large, and with small enough numbers FLT_EPSILON may be bigger than the numbers you are comparing!

For numbers larger than 2.0 the gap between floats grows larger and if you compare floats using FLT_EPSILON then you are just doing a more-expensive and less-obvious equality check. That is, if two floats greater than 2.0 are not the same then their difference is guaranteed to be greater than FLT_EPSILON. For numbers above 16777216 the appropriate epsilon to use for floats is actually greater than one, and a comparison using FLT_EPSILON just makes you look foolish. We don’t want that.

Relative epsilon comparisons

The idea of a relative epsilon comparison is to find the difference between the two numbers, and see how big it is compared  to their magnitudes. In order to get consistent results you should always compare the difference to the larger of the two numbers. In English:

To compare f1 and f2 calculate diff = fabs(f1-f2). If diff is smaller than n% of max(abs(f1),abs(f2)) then f1 and f2 can be considered equal.

In code:

bool AlmostEqualRelative(float A, float B,
                         float maxRelDiff = FLT_EPSILON)
    // Calculate the difference.
    float diff = fabs(A - B);
    A = fabs(A);
    B = fabs(B);
    // Find the largest
    float largest = (B > A) ? B : A;

    if (diff <= largest * maxRelDiff)
        return true;
    return false;

This function is not bad. It works. Mostly. I’ll talk about the limitations later, but first I want to get to the point of this article – the technique that I first suggested many years ago.

When doing a relative comparison of floats it works pretty well to set maxRelDiff to FLT_EPSILON, or some small multiple of FLT_EPSILON. Anything smaller than that and it risks being equivalent to no epsilon. You can certainly make it larger, if greater error is expected, but don’t go crazy. However selecting the correct value for maxRelativeError is a bit tweaky and non-obvious and the lack of a direct relationship to the floating point format being used makes me sad.

ULP, he said nervously

We already know that adjacent floats have integer representations that are adjacent. This means that if we subtract the integer representations of two numbers then the difference tells us how far apart the numbers are in float space. That brings us to:

Dawson’s obvious-in-hindsight theorem:

If the integer representations of two same-sign floats are subtracted then the absolute value of the result is equal to one plus the number of representable floats between them.

In other words, if you subtract the integer representations and get one, then the two floats are as close as they can be without being equal. If you get two then they are still really close, with just one float between them. The difference between the integer representations tells us how many Units in the Last Place the numbers differ by. This is usually shortened to ULP, as in “these two floats differ by two ULPs.”

So let’s try that concept:

/* See
for the potential portability problems with the union and bit-fields below.
#include <stdint.h> // For int32_t, etc.

union Float_t
    Float_t(float num = 0.0f) : f(num) {}
    // Portable extraction of components.
    bool Negative() const { return i < 0; }
    int32_t RawMantissa() const { return i & ((1 << 23) - 1); }
    int32_t RawExponent() const { return (i >> 23) & 0xFF; }

    int32_t i;
    float f;
#ifdef _DEBUG
    {   // Bitfields for exploration. Do not use in production code.
        uint32_t mantissa : 23;
        uint32_t exponent : 8;
        uint32_t sign : 1;
    } parts;
bool AlmostEqualUlps(float A, float B, int maxUlpsDiff)
    Float_t uA(A);
    Float_t uB(B);

    // Different signs means they do not match.
    if (uA.Negative() != uB.Negative())
        // Check for equality to make sure +0==-0
        if (A == B)
            return true;
        return false;

    // Find the difference in ULPs.
    int ulpsDiff = abs(uA.i - uB.i);
    if (ulpsDiff <= maxUlpsDiff)
        return true;

    return false;

This is tricky and perhaps profound.

The check for different signs is necessary for several reasons. Subtracting the signed-magnitude representation of floats using twos-complement math isn’t particularly meaningful, and the subtraction would produce a 33-bit result and overflow. Even if we deal with these technical issues it turns out that an ULPs based comparison of floats with different signs doesn’t even make sense.

Even a ridiculously small number, like 1e-30 has an ‘large’ integer representation of 228,737,632. So, while 1e-30 is ‘close’ to zero and ‘close’ to negative 1e-30 it is a gargantuan distance from those numbers in ULPs.

After the special cases are dealt with we simply subtract the integer representations, get the absolute value, and now we know how different the numbers are. The ‘ulpsDiff’ value gives us the number of floats between the two numbers (plus one) which is a wonderfully intuitive way of dealing with floating-point error.

One ULPs difference good (adjacent floats).

One million ULPs difference bad (kinda different).

Comparing numbers with ULPs is really just a way of doing relative comparisons. It has different characteristics at the extremes, but in the range of normal numbers it is quite well behaved. The concept is sufficiently ‘normal’ that boost has a function for calculating the difference in ULPs between two numbers.

A one ULP difference is the smallest possible difference between two numbers. One ULP between two floats is far larger than one ULP between two doubles, but the nomenclature remains terse and convenient. I like it.


It turns out checking for adjacent floats using the ULPs based comparison is quite similar to using AlmostEqualRelative with epsilon set to FLT_EPSILON. For numbers that are slightly above a power of two the results are generally the same. For numbers that are slightly below a power of two the FLT_EPSILON technique is twice as lenient. In other words, if we compare 4.0 to 4.0 plus two ULPs then a one ULPs comparison and a FLT_EPSILON relative comparison will both say they are not equal. However if you compare 4.0 to 4.0 minus two ULPs then a one ULPs comparison will say they are not equal (of course) but a FLT_EPSILON relative comparison will say that they are equal.

This makes sense. Adding two ULPs to 4.0 changes its magnitude twice as much as subtracting two ULPs, because of the exponent change. Neither technique is better or worse because of this, but they are different.

If my explanation doesn’t make sense then perhaps my programmer art will:


ULP based comparisons also have different performance characteristics. ULP based comparisons are more likely to be efficient on architectures such as SSE which encourage the reinterpreting of floats as integers. However ULPs based comparisons can cause horrible stalls on other architectures, due to the cost of moving float values to integer registers. These stalls generally happen when doing float calculations and then immediately doing ULP based comparisons on the results, so keep that in mind when benchmarking.

Normally a difference of one ULP means that the two numbers being compared have similar magnitudes – the larger one is usually no larger than 1.000000119 times larger than the smaller. But not always. Some notable exceptions are:

  • FLT_MAX to infinity – one ULP, infinite ratio
  • zero to the smallest denormal – one ULP, infinite ratio
  • smallest denormal to the next smallest denormal – one ULP, two-to-one ratio
  • NaNs – two NaNs could have very similar or even identical representations, but they are not supposed to compare as equal
  • Positive and negative zero – two billion ULPs difference, but they should compare as equal
  • As discussed above, one ULP above a power of two is twice as big a delta as one ULP below that same power of two

That’s a lot of notable exceptions. For many purposes you can ignore NaNs (you should be enabling illegal operation exceptions so that you find out when you generate them) and infinities (ditto for overflow exceptions) so that leaves denormals and zeros as the biggest possible problems. In other words, numbers at or near zero.

Infernal zero

It turns out that the entire idea of relative epsilons breaks down near zero. The reason is fairly straightforward. If you are expecting a result of zero then you are probably getting it by subtracting two numbers. In order to hit exactly zero the numbers you are subtracting need to be identical. If the numbers differ by one ULP then you will get an answer that is small compared to the numbers you are subtracting, but enormous compared to zero.

Consider the sample code at the very beginning. If we add float(0.1) ten times then we get a number that is obviously close to 1.0, and either of our relative comparisons will tell us that. However if we subtract 1.0 from the result then we get an answer of FLT_EPSILON, where we were hoping for zero. If we do a relative comparison between zero and FLT_EPSILON, or pretty much any number really, then the comparison will fail. In fact, FLT_EPSILON is 872,415,232 ULPs away from zero, despite being a number that most people would consider to be pretty small.

For another example, consider this calculation:

float someFloat = 67329.234; // arbitrarily chosen
float nextFloat = 67329.242; // exactly one ULP away from ‘someFloat’
bool equal = AlmostEqualUlps( someFloat, nextFloat, 1); // returns true, numbers 1 ULP apart

Our test shows that someFloat and nextFloat are very close – they are neighbors. All is good. But consider what happens if we subtract them:

float diff = (nextFloat – someFloat); // .0078125000
bool equal = AlmostEqualUlps( diff, 0.0f, 1 ); // returns false, diff is 1,006,632,960 ULPs away from zero

While someFloat and nextFloat are very close, and their difference is small by many standards, ‘diff’ is a vast distance away from zero, and will dramatically and emphatically fail any ULPs or relative based test that compares it to zero.

There is no easy answer to this problem. If you think you have found an easy and generic answer to this problem then you should re-read the material above.

The most generic answer to this quandary is to use a mixture of absolute and relative epsilons. If the two numbers being compared are extremely close – whatever that means – then treat them as equal, regardless of their relative values. This technique is necessary any time you are expecting an answer of zero due to subtraction. The value of the absolute epsilon should be based on the magnitude of the numbers being subtracted – it should be something like maxInput * FLT_EPSILON. Unfortunately this means that it is dependent on the algorithm and the inputs. Charming.

The ULPs based technique also breaks down near zero for the technical reasons discussed just below the definition of AlmostEqualUlps.

Doing a floating-point absolute epsilon check first, and then treating all other different-signed numbers as being non-equal is the simpler and safer thing to do. Here is some possible code for doing this, both for relative epsilon and for ULPs based comparison, with an absolute epsilon ‘safety net’ to handle the near-zero case:

bool AlmostEqualUlpsAndAbs(float A, float B,
            float maxDiff, int maxUlpsDiff)
    // Check if the numbers are really close -- needed
    // when comparing numbers near zero.
    float absDiff = fabs(A - B);
    if (absDiff <= maxDiff)
        return true;

    Float_t uA(A);
    Float_t uB(B);

    // Different signs means they do not match.
    if (uA.Negative() != uB.Negative())
        return false;

    // Find the difference in ULPs.
    int ulpsDiff = abs(uA.i - uB.i);
    if (ulpsDiff <= maxUlpsDiff)
        return true;

    return false;

bool AlmostEqualRelativeAndAbs(float A, float B,
            float maxDiff, float maxRelDiff = FLT_EPSILON)
    // Check if the numbers are really close -- needed
    // when comparing numbers near zero.
    float diff = fabs(A - B);
    if (diff <= maxDiff)
        return true;

    A = fabs(A);
    B = fabs(B);
    float largest = (B > A) ? B : A;

    if (diff <= largest * maxRelDiff)
        return true;
    return false;

Catastrophic cancellation, hiding in plain sight

If we calculate f1 – f2 and then compare the result to zero then we know that we are dealing with catastrophic cancellation, and that we will only get zero if f1 and f2 are equal. However, sometimes the subtraction is not so obvious.

Consider this code:


It’s straightforward enough. Trigonometry teaches us that the result should be zero. But that is not the answer you will get. For double-precision and float-precision values of pi the answers I get are:

sin(double(pi)) = +0.00000000000000012246467991473532
sin(float(pi))     = -0.000000087422776

If you do an ULPs or relative epsilon comparison to the ‘correct’ value of zero then this looks pretty bad. It’s a long way from zero. So what’s going on? Is the calculation of sin() really that inaccurate?

Nope. The calculation of sin() is pretty close to perfect – in recent glibc versions it is perfect. Even Intel’s fsin instruction, which gives an answer with just a few bits of precision in this case, still gets the correct order of magnitude. The problem is not with sin()/fsin – the problem lies elsewhere. But to understand what’s going on we have to invoke… calculus!

For a deeper analysis of Intel’s fsin instruction see Intel Underestimates Error Bounds by 1.3 quintillion

But first we have to acknowledge that we aren’t asking the sin function to calculate sin(pi). Instead we are asking it to calculate sin(double(pi)) or sin(float(pi)). Since pi is transcendental and irrational it should be no surprise that pi cannot be exactly represented in a float, or even in a double.

Therefore, what we are really calculating is sin(pi-theta), where theta is a small number representing the difference between ‘pi’ and float(pi) or double(pi).

Calculus teaches us (since the derivative of sin(pi) is negative one) that, for sufficiently small values of theta, sin(pi-theta) == theta. Therefore, if our sin function is sufficiently accurate we would expect sin(double(pi)) to be roughly equal to pi-double(pi). In other words, sin(double(pi)) actually calculates the error in double(pi)! This is best shown for sin(float(pi)) because then we can easily add float(pi) to sin(float(pi)) using double precision:

float(pi) +3.1415927410125732
sin(float(pi)) -0.0000000874227800
float(pi) + sin(float(pi)) +3.1415926535897966

If you haven’t memorized pi to 15+ digits then this may be lost on you, but the salient point is that float(pi) + sin(float(pi)) is a more accurate value of pi than float(pi). Or, alternately, sin(float(pi)) tells you nothing more than the error in float(pi).

Woah. Dude.

Again: sin(float(pi)) equals the error in float(pi).

I’m such a geek that I think that is the coolest thing I’ve discovered in quite a while.

Think about this. Because this is profound. Here are the results of comparing sin(‘pi’) to the error in the value of ‘pi’ passed in:

sin(double(pi)) = +0.0000000000000001224646799147353207
pi-double(pi)   = +0.0000000000000001224646799147353177
sin(float(pi))  = -0.000000087422776
pi-float(pi)    = -0.000000087422780

Wow. Our predictions were correct. sin(double(pi)) is accurate to sixteen to seventeen significant figures as a measure of the error in double(pi), and sin(float(pi)) is accurate to six to seven significant figures. The main reason the sin(float(pi)) results are less accurate is because operator overloading translates this to (float)sin(float(pi)) which rounds the result to float precision.

I think it is perversely wonderful that we can use double-precision math to precisely measure the error in a double precision constant. Visual C++ 2015 can print the value of double(pi) to arbitrarily many digits, and glibc now calculates pi() accurately, so we can use this code and some hand adding to calculate pi to over 30 digits of accuracy!

double pi = 3.14159265358979323846;


You can also see this trick in action with a calculator. Just put it in radians mode, try these calculations, and note how the input plus the result add up to pi. Nerdiest bar trick ever:

pi =3.1415926535…


The point is, that sin(float(pi)) is actually calculating pi-float(pi), which means that it is classic catastrophic cancellation. We should expect an absolute error (from zero) of up to about 3.14*FLT_EPSILON/2, and in fact we get a bit less than that.

Know what you’re doing

There is no silver bullet. You have to choose wisely.

  • If you are comparing against zero, then relative epsilons and ULPs based comparisons are usually meaningless. You’ll need to use an absolute epsilon, whose value might be some small multiple of FLT_EPSILON and the inputs to your calculation. Maybe.
  • If you are comparing against a non-zero number then relative epsilons or ULPs based comparisons are probably what you want. You’ll probably want some small multiple of FLT_EPSILON for your relative epsilon, or some small number of ULPs. An absolute epsilon could be used if you knew exactly what number you were comparing against.
  • If you are comparing two arbitrary numbers that could be zero or non-zero then you need the kitchen sink. Good luck and God speed.

Above all you need to understand what you are calculating, how stable the algorithms are, and what you should do if the error is larger than expected. Floating-point math can be stunningly accurate but you also need to understand what it is that you are actually calculating. If you want to learn more about algorithm stability you should read up on condition numbers and consider getting Michael L Overton’s excellent book Numerical Computing with IEEE Floating Point Arithmetic. If your code is giving large errors then, rather than increasing your epsilon values, you should try restructuring your code to make it more stable – this can sometimes give stunning results.

The true value of a float

In order to get the results shown above I used the same infinite precision math library I created for Fractal eXtreme to check all the math and to print numbers to high precision. I also wrote some simple code to print the true values of floats and doubles. Next time I’ll share the techniques and code used for this extended precision printing – unlocking one more piece of the floating-point puzzle, and explaining why, depending on how you define it, floats have a decimal precision of anywhere from one to over a hundred digits.

Now that VC++ 2015 can print the exact value of floats and doubles it is trivial for Windows developers to print the exact value represented by a float or double. Being able to see the exact value of numbers such as double(0.1) can help make sense of some tricky floating-point math problems.


Thanks to Florian Wobbe for his excellent error finding and many insights into this topic.

Other references:


About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
This entry was posted in AltDevBlogADay, Floating Point, Programming and tagged , , , , , , . Bookmark the permalink.

159 Responses to Comparing Floating Point Numbers, 2012 Edition

  1. cbloom says:

    I wish you would write about the practical ways to use floats robustly (in games, for example).

    This basic stuff has been covered very well already. For example, here :

    • brucedawson says:

      Don’t Store That in a Float was definitely game targeted. Any particular game specific areas you were interested in?

      • Blue King says:

        Very helpful post. I have a question regarding the <= and < you used for relative and absolute comparison in AlmostEqualRelativeAndAbs(). Is the difference in operator intentional or a typo? I know the difference is not much, but I am just curious why there is this difference.

        • brucedawson says:

          I’m not seeing the difference — I’m seeing <= used everywhere. Using < wouldn't be wrong, it would just slightly change the meaning. For instance, if maxDiff is 0.25 then it's the difference between numbers exactly 0.25 apart being considered equal or not. Same idea for maxRelDiff. I guess documenting the precise meaning would have been good…

  2. Pingback: Bruce Dawson’s update on comparing floating point numbers. « inSCIght

  3. Thanks, for an enlightening series, Bruce. By chance, is there some content missing before this paragraph?

    “If you do an ULPs or relative epsilon comparison to the correct value of zero then this looks pretty bad. It’s a long way from zero. So what’s going on? Is the calculation of sin() really that inaccurate?”

    What looks pretty bad? There seems to be some discontinuity here.

    • brucedawson says:


      Yeah, a new heading and about five short paragraphs got nuked. I have to do a manual fix-up stage on the code when I post to make it look nice and apparently I did a careless job.

      Thank you very much. It’s all fixed now.

  4. Jeff says:

    Thank you for taking the time to write your articles Bruce. I recently bumped into a float comparison issue in my own code, and found your insight very valuable.

    I’m finding it difficult to decide how many ULPs to allow between floating point numbers in different situations. There isn’t a general rule suggested, so I was wondering if you had a practical relationship to use?

    For instance, a 32-bit float accurately represents 6 significant decimal figures ( If you take this into account in your algorithms, would it always be reasonable to use a tolerance of a single ULP?

    • brucedawson says:

      One ULP will be impractical in some cases.

      The theoretically ideal plan is to analyze your algorithms carefully to see how well conditioned they are and generate reasonable error bounds based on that. That requires a lot of time and specialized knowledge. More practical is to try it and much error you get. You can determine the error either by comparing to known-good results, or compare float-precision results to double-precision results. Interval math is another way to determine what your error bounds are.

      Most people select an epsilon, and if they find failures they increase it. If you have to increase it a lot, then you probably need to rethink what you are doing.

      BTW, how many significant decimal figures a float can represent is much more complex than ‘6’. Six is one possible answer, but so is 7.22. I’ll cover that in my next post.

  5. Pingback: Software Carpentry » ULP (or, This is tricky and perhaps profound)

  6. Pingback: Software Carpentry » ULP (or, This is tricky and perhaps profound) | The Timely Custom Carpentry Blog

  7. Rick Regan says:

    “floats have a decimal precision of anywhere from one to over a hundred digits”

    You must have 2^-149 in mind, the smallest float: it has 105 significant digits. (The smallest double, 2^-1074, has 751 significant digits.) By the way, you can print all those digits with gcc, and some other languages ( (Is that what your “simple code to print the true values of floats and doubles” is?)

    Nice series of articles.

    • brucedawson says:

      2^-149 is one of the numbers I have in mind. There are actually floats that require a few more digits than that, but that is the interesting range. Thanks for the link — I’ll take a look at that. I’ll be sharing code that does the actual printing, for instructive purposes and for those using platforms that can’t print with full precision.

  8. Fab says:

    An other floating curiosity that I read in “Write Portable Code”:

    If the store operation is not optimized away, the two following functions may not return the same binary value if add is popped into a double:

    float add (float a, float b){
    return a+b;

    float add2(float a, floatb){
    float tmp = a+b;
    return b;

    Because FPU typically carry higher precison than registers (x87=80bits).

    • brucedawson says:

      Yes, this issue, and variants on it, can be quite problematic. I’ll be covering some variants on this issue in the future.

      The basic problem is that the x87 FPU doesn’t have an instruction to round a number to a particular precision. This can only be done by storing the number to memory, which is relatively expensive.

      One correction of sorts: while the x87 registers are 80 bits, there is a rounding mode and by default it is set to double precision (64-bits). Therefore the mantissa is rounded to double precision after each operation. This doesn’t affect this particular problem, but it is an important distinction in other cases. And, sometimes the precision mode is set to 32 bits, in which case this problem disappears.

      • me says:

        Actually, it isn’t. With a precision mode set to less than 80 bits, x87 calculations *may* produce less precision, or they may not. The “precision mode” is effectively a lower bound on resulting precision, the real precision being implementation defined.

        Typically nearly all operations always produce full 80-bit precision result (regardless of precision mode), exceptions being only a few slow operations that are able to save time by cutting precision to the set mode. Divisions, square roots and transcendental ops – that’s all.

        • brucedawson says:

          It is easy to experimentally determine that the Intel x87 FPU optimizes divide and square root when the precision is lowered (they are faster) whereas the performance of add/subtract/multiply is unchanged by the precision settings.

          It is also easy to experimentally determine that the Intel x87 FPU rounds the mantissa of the result to the requested precision for all operations, even when the performance is unchanged. The code below tests this for addition:

          double g_input1 = 16777216.0;
          double g_input2 = 1.125;

          void TestPrecision()
          unsigned int currentState;
          _controlfp_s(&currentState, _PC_24, MCW_PC);
          double result1 = g_input1 + g_input2;
          _controlfp_s(&currentState, _PC_53, MCW_PC);
          double result2 = g_input1 + g_input2;
          printf(“%.17g\n%.17g\n”, result1, result2);

          Compile it for x87 and it produces this result, with result1 rounded to a 24-bit mantissa:



        • keldoelykke says:

          #comment-13171 was not posted by keldoelykke. My account might have been compromised. Not sure. The avatar was wrong too.

          • keldoelykke says:

            Aha… user ‘me’ is just trolling with that user name … disregard my previous comment.
            I am getting hacker alerts in my mailbox and then I got a notification mail that ‘me’ had made a rather serious comment 🙂

    • brucedawson says:

      I just dealt with this issue in a new post, available at

      It’s a bit misleading to describe the x87 precision as 80 bits, but I’ll let my article speak for that.

  9. Pingback: Float Precision–From Zero to 100+ Digits | Random ASCII

  10. Pingback: Intermediate Floating-Point Precision | Random ASCII

  11. Pingback: Floating-point complexities | Random ASCII

  12. Pingback: Exceptional Floating Point | Random ASCII

  13. brucedawson says:

    One common variation on the floating-point comparison theme — so common that I didn’t bother mentioning it — is variations on this theme:

    float test = 4.1;
    if (test == 4.1)
    printf(“They’re equal”);

    In most cases this will not — and indeed should not — print “They’re equal”. That’s because ‘test’ is equal to 4.1f (the infinitely precise value 4.1 rounded to float precision) and ‘4.1’ is probably rounded to double precision. Since ‘test’ has been more aggressively rounded, the two values are not equal.

    In fact, if they compare as equal that probably means that ‘test’ was not properly rounded. Printing “They’re equal” is a deficiency.

  14. Pingback: That’s Not Normal–the Performance of Odd Floats | Random ASCII

  15. Pingback: Fuzzy Floating Point Comparison | Pen and Pants

  16. Lavish says:

    let say:
    int main()
    float f=0.1; //takes 4 bytes using gcc
    char *a=(char*)&f;
    int *i=(int*)&f;
    printf(“%x %x %x %x %x”,*a,*(a+1),*(a+2),*(a+3),*i);
    output : ffffffcd ffffffcc ffffffcc 3d 3dcccccd

    Problem 1: for *a,*(a+1),*(a+2), why i am getting additional ffffff as a is pointer to char which is 1 byte.
    Problem 2: for *i, why i am not getting output 3dcccccc?
    expected result:
    sign bit: 0
    exponent: 0111 1011
    mantissa: 100 1100 1100 1100 1100 1100

    observed result:
    sign bit: 0
    exponent: 0111 1011
    mantissa: 100 1100 1100 1100 1100 1101

    • brucedawson says:

      The leading ffffff when you print the bytes is because on your compiler ‘char’ is signed and it is therefore sign extended to an int when passed to print. Use unsigned char to avoid this.

      The 1 at the end is because of rounding. The first two digits that are skipped are ’11’ which means that rounding up gives a more accurate result than truncating. Your ‘expected result’ would give an error of about 0.75 ULP, whereas the observed result gives an error of about 0.25 ULP.

  17. Pingback: Doubles are not floats, so don’t compare them | Random ASCII

  18. aishraj says:

    Reblogged this on Digital Drive and commented:
    [Floating Point] Maths is tough!

  19. Ben Key says:

    First I would like to say that this is an excellent article. It is the best I have seen on the subject in a while.

    That said, there is one unanswered question. What is the best way to calculate the value that should be used for the maxDiff parameter of the AlmostEqualUlpsAndAbs and AlmostEqualRelativeAndAbs functions. You imply that a value like maxInput * FLT_EPSILON could be used. Of course you later qualify that remark by saying “Unfortunately this means that it is dependent on the algorithm and the inputs” immediately after suggesting the value of maxInput * FLT_EPSILON.

    I will refer to an example you gave in your article to explain why I am asking.

    float someFloat = 67329.234; // arbitrarily chosen
    float nextFloat = 67329.242; // exactly one ULP away from ‘someFloat’
    // The following returns true, note that it does not really matter what value
    // you choose for the maxDiff parameter.
    bool equal = AlmostEqualUlpsAndAbs(someFloat, nextFloat, FLT_EPSILON, 1);
    float diff = (nextFloat – someFloat); // .0078125000
    // The following simply does not work. Any clues as to how to properly
    // calculate this value so the diffEqualZero is set to true instead of false.
    float maxDiff = std::max(diff, 0.0f) * FLT_EPSILON;
    bool diffEqualZero = AlmostEqualUlpsAndAbs(diff, 0.0f, maxDiff, 1);

    Would knowledge of the maximum number of significant digits after the decimal point make it possible to calculate a better value for maxDiff? My thought is that if the floats are obtained from instruments taking measurements and the measurements are known to be accurate to the third decimal place, would the best value for maxDiff be ten times the smallest number with that number of digits after the decimal? For example, the smallest number with three digits after the decimal is 0.001. Ten times that would be 0.01, or the smallest number with 2 digits after the decimal.

    This seems to work but I am wondering if this is the best solution in the scenario I have described.

    I would appreciate any input you might have to offer on this.

    • brucedawson says:

      I don’t really have a good answer, I don’t think there is a good cross-domain answer, and I fear that any attempt at an answer would just add more confusion. So I punted like a coward.

      However, maxDiff is only needed for numbers that should be at or near zero. It is only needed to handle cases like sin(pi). For relative error maxUlpsDiff is the appropriate parameter to use, and this should suffice for most cases.

      So, I would set maxDiff as small as possible. Many sources suggest setting it to FLT_EPSILON, and it may be that for most ranges that is valid. I would start there and see how it works.

      I believe that the correct value of maxDiff should be affected only by expected precision of your near-zero calculations. If your inputs are from instruments with known error then you would normally use maxUlpsDiff to handle this — unless you are subtracting them and hoping to get zero, in which case sig-figs is your friend.

      Zero is a huge nuisance. Avoid it if possible.

  20. Pingback: Trouble fixing a critical bug in my WormMove algorithm! - Page 2

  21. mike says:

    I think FLT_EPSILON is 1.19e-7f, not 1.19e7f.

  22. anonUser says:

    Is this code PD, or some other licence (BSD/GPL)? I’m interested in using your ULP implementation in a GPLd program, cause you know, I’m probably going to spend my next few days catching bugs from getting this wrong…

  23. Pingback: .:[ d4 n3wS ]:. » Tout, tout tout, vous saurez tout sur les float

  24. Pingback: Game Developer Magazine Floating Point | Random ASCII

  25. Pingback: Top Eight Entertaining Blog Facts for 2012 | Random ASCII

  26. Pingback: Float Precision Revisited: Nine Digit Float Portability | Random ASCII

  27. Wenzhi XIao says:

    float has only 23 bit explitcitly for fraction part, that is, 0.00000011920928955078125 , so does it really make sense for comparing sum=1.000000119209290, mul=1.000000000000000, mul2=1.000000014901161 ? There should be no difference in their binary representation, I guess…

    • brucedawson says:

      Both ‘sum’ and ‘mul’ are valid float values with different binary representations. As the article says, they are adjacent floats. ‘mul2’ requires double precision to distinguish it from 1.0 and this is also mentioned in the article.

      The main point of the code snippet was to demonstrate that floating-point calculations are not exact and you may get some inaccuracy. This was purely to set the stage for why it is useful to have a way of seeing if two floats are almost equal.

      • Wenzhi XIao says:

        Sorry for that thoughtless question. What I want to say is, f * 10 would be treated as float, single-precision expression in C, rather than a double type, and there would be no binary difference between mul1 and mul2. I tried it on VS2008 and got the follows:
        sum = 1.000000119209290, mul = 1.000000000000000, mul2 = 1.000000000000000.

        • brucedawson says:

          That is correct that if mul2 is rounded to a float then its error disappears. However there are certainly circumstances where the variation could bubble up to the surface. All it would take would be subtracting 1.0f from the result (prior to storing to a float) and suddenly the difference would be significant. For instance “f * 10 – 1.0f” may be non-zero.

          Whether that precise problem matters depends heavily on your compiler and processor because it is ultimately a question of intermediate precision. I devoted an entire separate blog post to that complicated topic:

          However the general point was just “floating-point math can accumulate error”. No controversy there I’m sure.

          • Wenzhi XIao says:

            That‘s true. This is really a great article and helps me a lot, thank you for sharing.

          • Joe Zbiciak says:

            When you wrote “f * 10 – 1.0f”, that reminded me of an entirely different ball of yarn you didn’t touch on in this article: FMA instructions. Because they perform the add (or subtract) before rounding, they introduce a similar set of issues to x87’s sometimes-surprising extended precision behavior. Catastrophic cancelation behaves rather differently if you round before or after the subtraction.

            FMA is out of scope of this article… I just had a twitch when I saw that expression. I’m coming from the other side of things. I spent a fair bit of time last year working on writing a floating point reference implementation for verifying a processor design. We spent quite a lot of time discussing FMAs, their merits and demerits, and ended up spending many, many hours getting familiar with Dr. Kahan’s impressive body of work. (I also found an obscure rounding error in Knuth’s MMIX-ARITH implementation. Unrelated to FMAs, though.)

          • brucedawson says:

            Yes, FMA is confounding and is very thematically similar to extended precision. I think the issue you are alluding to is something like:

            a * b + c * d

            where a == c and b == -d. The mathematical answer is zero, and a non-FMA implementation will calculate that trivially. However with FMA one multiplication will be rounded and the other one will not.

            I think that the cool thing about this is that the answer you get from an FMA implementation is an *exact* measure of the rounding error in the multiplication. Therefore this is very similar to sin(double(pi)) whose result is a measure of the rounding error in double(pi).

            I also find it interesting that FMA is very difficult to emulate. It seems like double-precision without FMA should be able to emulate single-precision FMA, but in fact it can’t quite do it. Double-precision can do an exact single-precision multiply, but you also need a way to do a double-precision add-and-truncation-to-float as a single step, otherwise you get double rounding. Subtle.

  28. jenny says:

    Hey Bruce,
    thank you so much for this post. I’ve been programming for several years now and was so embarrassed to only just stumble across this issue while grading an undergrad class. I found this post through the link on your old website and I like how it’s not too much in depth like the other must read article “what every computer scientist should know about floating point arithmetic” and also not too short like the answer to that question in the c-faq website I found. I’ll definitely also browse through the other floating point articles in the future 🙂

  29. Pingback: Floating-Point Determinism | Random ASCII

  30. Pingback: Floating Point One-Pager | JerryJvL

  31. Pingback: Floating point comparison: clarification to Bruce Dawson's approach | BlogoSfera

  32. “an ULPs based comparison of floats with different signs doesn’t even make sense”

    I’m a bit unclear why. Aren’t -0 and the smallest positive float very close? Google’s googletest turns the sign-bit binary of floats into a biased integer which I think means it’s quite happy with floats of different signs. Start with AlmostEquals() in The routines it calls are within the same file.

    • brucedawson says:

      There are two issues here: does it make sense, and does it work well. I would argue it doesn’t make sense to do an ULPs based comparison of different signed numbers because this is effectively a relative comparison, and a relative comparison is meaningless with different signs. Google may have copied my previous paper, and this can work, but I don’t think it is meaningful. With different signs you should (IMHO) use a fixed epsilon.

      There’s also the practical consideration of overflow when you subtract two biased numbers — the difference between 2.0 and -2.0 won’t fit into a 32-bit signed value, so great care must be used.

      So, no obvious meaning and problematic, so I don’t recommend it.

      • So 1.0 less 0.1 ten times shouldn’t be tested against 0 with an ULP-based comparison because the result might be either less or more than +0?

        • brucedawson says:

          Correct. If you calculate 1.0 – 0.1 * 10 (or something similar) and the result does not come out to zero then, in terms of ULPs, it will be *enormously* far from zero. The difference will probably be about FLT_EPSILON or DBL_EPSILON which are millions and trillions of ULPs away from zero. This huge difference is representative of the fact that an ULPs based comparison near zero is meaningless, at least in that case.

  33. Staffan Palmroos says:

    Great article!
    I started to think about this and came up with a solution based on the fact that the quotient of two numbers that are very close is very close to 1, which works for both very small and very large numbers. The comparison then becomes “fabsf(1.0-a/b)<EPSILON".
    This has a few issues though, for one the division operation might be expensive. Also, a and b cannot be exactly zero so that have to be handled separately. In the end I come up with this function:


    /** Returns true if a and b are "equal" */
    int equal(const float a, const float b) {

    (a == b) ||
    (a == 0 && fabsf(b)<EPSILON) ||
    (b == 0 && fabsf(a)<EPSILON) ||
    (fabsf(1.0-(a/b)) < EPSILON);

    Just adjust EPSILON to the precision needed. With 2*FLT_EPSILON adjacent numbers compare as equal.

  34. brucedawson says:

    One problem, as you say, is that division can be very expensive. And, on some CPUs (Xbox 360 and PS3!) floating-point compares are fairly expensive.

    But on to correctness… This is an armchair analysis — I haven’t written test code.

    Your equal function is not symmetric — it can give different results for equal(x,y) versus equal(y,x). That’s bad. You need to sort the inputs by absolute magnitude.

    If ‘a’ and ‘b’ have different signs then they will compare equally equal as if they had the same signs. That’s bad. The fabs around 1.0-(a/b) causes this and I’m not sure what benefit it gives. Sorting by magnitude (see the previous comment) would be better than that fabs.

    Another problem is that I think your code will return true for (0, 1e-20) due to the zero check, but false for (1e-19, 1e-20), even though those numbers are closer. That’s because your zero handling means that some relatively ‘large’ numbers will compare equal to zero, but when compared against non-zero numbers they will be found to be quite noticeably different. Zero is hard.

    Comparing floats is hard. Any method that claims to automatically handle zero is probably lying.

    • Staffan Palmroos says:

      Thanks for the quick reply!
      I see the points you are making and you’re absolutely correct. I believe the main problem is that I misinterpreted what the epsilon of the floating points really is.
      Floating point is indeed really hard.

  35. Raja Bala says:

    Great article, Bruce. Was a very fun read.
    You’ve clearly shown the importance of having relative (ULPS or epsilon based) and absolute (for near zero) comparisons. Why isn’t this a part of the C++ standard though?
    Why do comparison operations on float default to bitwise compares? Why rely on user-defined functions for comparing a well-established data type?

    • brucedawson says:

      The default comparison isn’t quite bit-wise (zeroes and NaNs are handled specially) but I know what you mean.

      Fundamentally this is the concern of the IEEE floating-point standard more than the C++ standard.

      That said, the default equality comparison absolutely should be true equality. To do otherwise risks madness. I have a post almost ready that uses exact floating-point comparisons to validate math, thus proving that exact comparisons are valid.

      Fuzzy comparisons are very context sensitive. You have to decide what epsilon value to use, for instance. A standard fuzzy comparison would need user-specified epsilons.

      And, the fuzziness sometimes has to extend to ”. When, say, creating a BSP using dot products and ‘<' you would like to divide the world into two sets, but arguably what you actually do is divide the world into two definite sets plus a set of points that are so close to the cutting plane that they are ambiguous. I don't see how the floating-point math standard can fix that — although it could offer functions to help with it.

      So, standard fuzzy comparison functions would be nice, but the default should remain exact comparisons.

      • Raja Bala says:

        Ah, yes, that’s right – shouldn’t mess with true equality compares.
        Scenarios like partitioning can’t really do with the extra leverage it provides, as you pointed out.
        But, how about boolean functions like CollisionDetection? I can see AlmostEquals(..) being useful there.
        And that makes me curious – in your experience at Valve, how often does AlmostEquals(..) actually get used?

        • brucedawson says:

          I would expect most collision detection functions to return whether two objects overlap, not whether their locations are equal so AlmostEquals would not be directly relevant. It could be relevant to check, in internal builds, whether there is sufficient precision to make the collision detection meaningful.

          I don’t think our games do a lot of floating-point equality comparisons, so I don’t think AlmostEquals is relevant very often.

  36. Mike Rubin says:

    I’ve been using a ULP-based float compare that fails the catastrophic cancellation test.

    If in my Vector class, for example, I’m guarding against a potential division-by-0 (with a debug-only assert) by using this non-bitwise float compare.. which is flawed anyway (for values very close to 0).. would you recommend I do only a bitwise check for 0 in such situations (when guarding against division-by-0)?

    Thank you.

    • brucedawson says:

      I can’t say what you should do without far more information. If you want to avoid getting infinity as a result then you need to check for numbers near zero (dividing by most subnormals will give infinity even though they aren’t zero). There is no magic-bullet correct-advice I’m afraid — it really depends on the meaning of your results and their non-zeroness.

  37. Great article, thank you ! Have you considered publishing this serie of posts as an ebook/PDF guide ?

  38. thomaslgd says:

    Great articles, I learned a lot. I’m concerned about the AlmostEqualRelativeAndAbs() though. The user needs to provide two parameters for the precision. The absolute and relative threshold. It’s not user friendly for the user (must give coherent pair) and dangerous in some cases.
    Two examples :
    maxDiff = 1.4E-40
    maxRelDiff = 0.001
    Close to zero, the absolute check will return true for cases where the relative one wouldn’t. There is a discontinuity that the user may not be aware of because giving coherent pair is complicated.
    maxDiff = 1.4E-43
    maxRelDiff = 0.001
    In this case, the absolute test may fail and ask the relative one to solve a problem it can’t.

    I tried to come up with a function taking only one precision parameter, making the usage easier for the user :

    #define FLT_SUB_MIN 1.4E-45f

    //the input precision is the relativeError inverse.
    //A precision of 1000 would tolerate an error of 0.1%
    bool AlmostEqualRelative(float A, float B, float precision)
    //Check the precision desired is in between 1/2^0 and 1/2^22
    assert(precision > 1.0f && precision < 4194304.0f);
    // Check if the numbers are really close — needed
    // when comparing numbers near zero.
    float diff = fabs(A – B);
    if (diff <= FLT_SUB_MIN * precision)
    return true;

    A = fabs(A);
    B = fabs(B);

    float temp = diff * precision;
    return (temp <= A || temp <= B);

    I think it's better for three reason :
    -There is only one input parameter. A pseudo-continuity is ensured internally, making the interface easier for the user. What the absolute check and the relative one test are almost two distinct sets of inputs.
    -The relative test compare two floats bigger than "diff". In your function, "largest * maxRelDiff" may give a wrong result if both are subnormals. Would the test still succeed ? In my function, "diff * precision* generate a float at least bigger than diff so I got rid of this possible issue.
    -In your function you create "largest" out of a condition, then execute an other condition. In my function, everything is combined in one if statement. The OR will return true if almost equal to A without testing B. Knowing this fact, the user could even optimize the usage of this function if he knows which one of A or B if more often the bigger.

    Two days ago I was still thinking floats where some magical creatures that you couldn't compare properly. With your articles I think I have a better understanding of the machinery. But it's up to you to validate my theory about the efficiency/user friendliness of my function.


    • thomaslgd says:

      Oooohhh, I did not take the cases around zero into account. So what if I trigger the absolute test only if one of the input is below the achievable precision ?

      bool AlmostEqualRelative(float A, float B, float precision)
      //Check the precision desired is in between 1/2^0 and 1/2^22
      assert(precision > 1.0f && precision < 4194304.0f);

      A = fabs(A);
      B = fabs(B);

      //Return true if precision can't be achieved
      float minInputValue = FLT_SUB_MIN * precision;
      if (B <= minInputValue) //Manage case B subnormal below supported precision
      return (A + B) * precision < 1.0f;
      if (A <= minInputValue) //Manage case A subnormal below supported precision
      return (A + B) * precision < 1.0f;

      float temp = fabs(A – B) * precision;
      return (temp <= A || temp <= B);

      It tested this guy for normals and subnormals (crossing between negative and positive) and it worked like a charm.

      • brucedawson says:

        You are trying to do the impossible.

        > Two examples :
        > maxDiff = 1.4E-40
        > maxRelDiff = 0.001
        > Close to zero, the absolute check will return true for cases
        > where the relative one wouldn’t.

        Yes, that is true. That is the job of maxDiff, to return ‘equality’ when very close to zero. If it doesn’t do that for some values then it is pointless.

        This cannot be made simple and easy because it depends on where your numbers are coming from, what their expected ranges are, and what their expected errors are. If you compare sin((float)pi) to zero then the expected relative diff is infinite, and the expected absolute diff is pretty large — larger than FLT_EPSILON. Your function fails in that case, even given a perfectly rounded sin() function. Yes, sin((float)pi) is a case of catastrophic cancellation, but that is often the cause of numbers near zero.

        > what if I trigger the absolute test only if one of the input is below the achievable precision

        What do you mean by the achievable precision?

        Maybe the error in my code is that one shouldn’t try using a generic function to compare against zero. Zero is always a particularly tough case because you have lost all information about what a ‘reasonable’ error might be, so perhaps a special function for that purpose makes sense.

        Ultimately the whole comparison problem is very context sensitive, and that is unavoidable.

        • thomaslgd says:

          “Yes, that is true. That is the job of maxDiff, to return ‘equality’ when very close to zero. If it doesn’t do that for some values then it is pointless.”

          But how to provide thoughtfully maxDiff and maxRelDiff so that the two of them don’t step on each other territory ?

          “Your function fails in that case, even given a perfectly rounded sin() function.”

          The second function passes this test, I’m hopeful.

          “What do you mean by the achievable precision?”

          If I target a precision of 1/1000, I can process a relative comparison for floats higher (after fabs()) than 1000 times the minimal subnormal (1.4E-45). If one of the input is below this threshold, I must switch to an absolute comparison of the difference between the two inputs. (In the second algorithm I do “(A + B)”, it should be “diff” instead).
          “Achievable precision” is an awful term, “Input lower threshold” ?

          “…so perhaps a special function for that purpose makes sense.”

          I hope not. I will test more thoroughly.

          • brucedawson says:

            I don’t think your second function handles the case of sin((float)pi). That function gives an answer of -0.000000087422776, or about 8e-8, when the ‘expected’ answer is zero. This is about 5e37 greater than the minimal subnormal. And yet, it is almost exactly the correct answer given the input value. This sort of unexpected (without detailed analysis) behavior when dealing with results near zero makes it, I believe, impossible to automatically figure out the right error range.

            Relative errors can work nicely, but they fall down near zero, whatever ‘near zero’ means. And there’s the rub. The definition of ‘near zero’ is dependent on what calculations you did to get there, and that cannot possibly be known by a comparison function.

            > But how to provide thoughtfully maxDiff and maxRelDiff
            > so that the two of them don’t step on each other territory ?

            I don’t think this is generally possible.

  39. thomaslgd says:

    I think I got the core of the problem.
    If I had :
    X = 1E30
    Y = 1E30 + 1 ULP
    A = Y – X
    B = 0
    A perfect comparison should return true since the subtraction of Y by X is really close to zero considering the order of magnitude of X and Y. Still it will be around 1E21, which is extremely far away from zero, be it an absolute or a relative comparison.

    In this case we should have compare X and Y directly instead of comparing the subtraction to zero. So it comes back to your conclusion “Ultimately the whole comparison problem is very context sensitive, and that is unavoidable.”.

    Thanks for taking the time to explain.

  40. Pavel Celba says:

    What about following code? That’s what I deem could be better than any methods discussed in the article.
    check if two floating point numbers are almost equal.
    \param P specifies precision for comparison of the binary significand (ie. ~= number of significant figures)
    template<unsigned int P, typename T> inline bool almostEqual(const T x, const T y) {
    int ex, ey;
    const T sx=std::frexp(x, &ex);
    const T sy=std::frexp(y, &ey);
    if (ex!=ey)
    return false;
    const T p=std::pow(10.0, static_cast<int>(P));
    return floor(sx*p) == floor(sy*p);
    // Edited to HTMLize the cast so that it shows up. Note that ‘P’ is of type unsigned int.

    • brucedawson says:

      Sorry, that works poorly. The exponent check will pass for 0.5 and 0.9999999, but will fail for 0.9999999 and 1.0, even though the latter two are as close-as-can-be. So already you have rejected some number pairs that should presumably pass.

      The second check is insufficiently sensitive to be tuned well. Specifying how many decimal digits of precision to account for works poorly because the amount of decimal precision varies hugely, from about 6-9 digits. Also, because you use an equality test you hit another cusp problem where arbitrarily close numbers can be on opposite sides of the cusp.

      • Pavel Celba says:

        Yeah, have realized it shortly after post. The problem with exponent. But you are right that the second problem is even more dire as the sensitivity needs to be tuned more sometimes.

    • Ben Key says:

      You made several mistakes here? P is described in the comments but not actually defined in the code. What is P? Is it a const T or a const int? Also, static_cast(P) in wrong. I think this should be static_cast(P), but I cannot be certain.

      • brucedawson says:

        It looks the commenting system is eating angle brackets, in the original code suggestion and in your critiques — which makes them kind of amusing. I assume that you in both cases static_cast<T>(P) was intended — C style casts survive the commenting system better. I had to use HTML encoding to get the angle brackets above to survive. I can fix up the original and the reply as shown above if you’d like — let me know Pavel and Ben.

  41. strekland says:

    Reblogged this on «Memento mori» and commented:
    Плавающая запятая

  42. Keld Ølykke says:

    Great article. Thx.
    I too have played around the issues around 0.

    The result is .
    Any thoughts about the formula with the 1 in the denominator? You are way deeper into this topic, so you might have opinions against it.

    Here are some tests with various input values:

    Any comments are appreciated.

    Kind Regards,
    Keld Ølykke

    • brucedawson says:

      It looks like the code implements a relative epsilon only — I couldn’t see an absolute epsilon to handle numbers near zero. That means that this code will necessarily not handle numbers near zero correctly. That’s fine, as long as that limitation is understood.

      I also think that a default relative epsilon of 1.001 is too permissive as a default, since it lets through huge amounts of slop. I’d go with something like 1.000001 as a default relative epsilon.

      • Keld Ølykke says:


        I eliminated the absolute epsilon test. It might not be so clear 😦
        I did this in a way that corresponds to your AlmostRelativeEquals-function with the following modification:
        float largest = (B > A) ? B : A;
        if(largest < 1f) largest = 1f; // eliminates absolute epsilon test, good idea or too hacky?

        Ok. I will go for a smaller default. Thx

        • brucedawson says:

          Too hacky. You are implicitly assuming that the absolute epsilon should be FLT_EPSILON and there is no justification for this. Given my favorite test case of sin(theta) the absolute epsilon (necessary for comparing sin(pi*n) to zero) is equal to the epsilon of theta, which may be arbitrarily larger than FLT_EPSILON. For instance:

          Compare(sin(theta), 0);

          may say that the result is not equal to zero even when theta is as close to pi*n as is possible, as in 314.15926535f. Or consider this:

          Compare(sin(3.1415926535f) * .000001, 0)

          What absolute epsilon should you use here? I believe the correct answer is 3.14159 * 0.5 * FLT_EPSILON * 0.00001.

          In short, you’re injecting a magic number which makes your results range specific. This means that the same calculation done in m/s versus mm/s versus km/s may give different results.

          • Keld Ølykke says:

            I see. I think I need a another method overload that also accepts an absolute epsilon.

            If we could turn both numbers into scientific notation* could we not get back to having one epsilon value working on the mantissa* difference?

          • Keld Ølykke says:

            > If we could turn both numbers into scientific notation* could we not get back to having
            > one epsilon value working on the mantissa* difference?
            > *(e.g.

            Ah no. The absolute epsilon is an expectation of rounding errors that may have occurred the inputs. That can be less or greater than the error we accept between the inputs (if they had no rounding errors). Apples and Pears.

          • brucedawson says:

            Well, not really. The absolute epsilon is an expectation of reasonable rounding errors in the result that are not proportional the result’s magnitude. Any calculation that does subtraction can hit these. If you do subtraction then the expected error is the sum of the expected errors in the inputs. If you then multiply the result of the subtraction by 1e6 then the expected error is a million times larger

          • keldoelykke says:

            Hi again,

            I have been tinkering all week about how to create a clear view about what I want an AlmostEqual-function to do. Here is the result:

            Armed with these 2 functions – High(X) & Low(X) – we calculate them both for input X and see if input Y is within [Low(X);High(X)].

            The 2 tweak parameters for the functions are as you have deducted above absolute and relative epsilon.

            Any good?

          • brucedawson says:

            I like the idea of visualizing what the two error terms actually mean. My functions don’t apply them both simultaneously, which means their graph would have a straight line for when the absolute epsilon applies.

            I recommend reading about condition numbers as they give a mathematical basis for figuring out how inaccurate an equation can be. Most people don’t bother — they just experiment and see what they can get away with.

          • keldoelykke says:


            I made the 2 functions from the following 4 functions:

            I will have a look at condition numbers. Thx.

  43. Pingback: Adventures of a Programmer: Parser Writing Peril XXXIII | De Amentiae Mundi

  44. Pingback: Intel Underestimates Error Bounds by 1.3 quintillion | Random ASCII

  45. Pavel Celba says:

    The algorithm for nearly equal comparison does not handle correctly infinity cases:
    1) AlmostEqualRelativeAndAbs(std::numeric_limits<float>::infinity(), std::numeric_limits<float>::infinity(), 0.0f, 0.0f); // Returns false
    2) AlmostEqualRelativeAndAbs(-std::numeric_limits<float>::infinity(), std::numeric_limits<float>::infinity(), 0.0f, 0.1f); // Returns true

    Following changes must be done to fix this behavior:
    Add equal comparison test to the start of algorithm:
    if (A == B) return true;

    Add this special case test after A = abs(A); B = abs(B); :
    if (A == B && diff == A) return false;

  46. mark doherty says:

    An excellent article, wish I’d read it earlier!

    I was looking for the answer to …

    If a comparision is made with a floating point constant that is exactly representable in binary IEEE-754 single format – like 8191.875, is the result predicable, in all environents?
    In my case the other operand is received over a serial bus in IEEE-754 single format and I only perform the comparison if it is neither an NaN nor an infinity.

    I’d like to think my answer is obvious (predicable in all environments) but as “[Floating-point] math is hard.” I’d like to make sure I haven’t missed anything.

    • brucedawson says:

      I can’t think of any way that the comparison in your scenario could misbehave. Generally speaking if you compare two floats that are equal then (ignoring NaNs) they will compare equal. The only non-standard deviations from this that I am aware of are when optimizing compilers for the x87 FPU fail to round a recent calculation to float precision. That leads to the possibility that FloatFunc(f) != FloatFunc(f) because one result is rounded to float and the other is left at double or extended-double precision.

      As long as that doesn’t apply to your situation (and it sounds like it doesn’t) then you’re all good. SSE and other FPUs generally avoid the x87’s odd problems.

  47. Thanks to your serie of articles, I’m finally starting to understand 2-3 things about floating-points.

    Now I’m in the process of writing a math library for learning purposes and I’d like to provide a bunch of equality functions, one for each comparison method (absolute, relative, ULP, or a combination of them). And for the principle I’d also like to take care of the long double case.

    Depending on the implementation, and if I understood properly, the size of a long double can vary but it’s more likely to be made of 80 or 128 bits. Since there’s no standard integer type to represent a floating point that large—and since I’d like to avoid using any external library—, I’ve been wondering how to compute the ULP difference between two long doubles.

    I’ve stumbled upon the function `float_distance` from the boost library as well as a question on StackOverflow, both linked below, but I’m easily overwhelmed by their complexity so I was wondering what were your thoughts on those two implementations and/or if you already had a viable long double-compliant alternative of the technique that you’ve presented in this article?

    Note that I’m using C++11 but I’m not sure if it makes things easier here.

    Kind regards,

    • Small follow-up: I linked to the question on SO thinking that I could turn it around to get the number of ULPs.

      After thinking a bit more about it, I guess that it doesn’t make much sense—how could one deduce a correct ULP distance between two numbers by using a constant epsilon value? You’ve explained very well in your article that the epsilon between adjacent floats is nothing but constant.

      The `ulps` parameter in his function is actually used against the difference between significands rather than against the difference of the two values to compare as one would expect from the function signature—that is misleading and is a bit of a cheat.

      To confirm all this, I modified his function to get it to return an ULP difference, and even thought it’s highly possible that I might have done something terribly wrong in the process, it seems to work fine for the floats in the range [1.0f, 2.0f] (constant epsilon value in that range) while it fails otherwise.

      • brucedawson says:

        An ULPs function that only works in such a restricted range doesn’t seem very helpful.

        I haven’t tried writing an ULPs routine using floating-point math, but it would seem to me that if you divide the two numbers by each other then the numbers of ULPs that the ratio is away from 1.0 is a reasonable estimate of the ULPs between the two numbers — of by a factor of two, at most, I believe.

        Then again, at that point you might as well just use a relative method.

        If you do want the exact number of ULPs between the numbers then just write a simple high-precision math function. See the HighPrec class that I use and have a link to here as an example:

    • brucedawson says:

      I don’t have a long-double compliant technique. Since the relative and ULP methods are fundamentally the same (to some extent they are just different ways of thinking about the problem) I would probably go with the relative method.

      • I’ve actually had a click when reading on your article that the epsilon value becomes twice bigger when the exponent is being incremented—this comes with a whole lot of interesting properties which got me digging deeper into floating-points to come up with the ULP difference function linked below (broken into parts to prevent wordpress from including the code here):

        https:, //

        I don’t know how good it is but what’s sure is that I’ve accomplished my first objective—I’ve learned a lot, thanks to you!

        • brucedawson says:

          Hey, nice work.

          It’s good to see you’ve got tests, or at least asserts, but you’re only covering a limited set of numbers. A nice addition would be to generate random 64-bit patterns, reinterpret them as doubles, use your function to measure the distance, and then compare that to the actual distance measured by subtracting the 64-bit integers.

          You could excluded NaNs and insist on same-sign numbers, or whatever restrictions you want, but you’d get much wider coverage. It would be a step towards this ideal:

          • I tried to design the tests so they would work for float, double and long double types, hence why it’s not too fancy.

            I’ll try to do as you say for the float type but another time… I’d like to move on a bit for now 🙂


  48. Pingback: 判断浮点数相等的正确方法 | 张永健的技术博客

  49. Martin Zimmermann says:

    Thanks for the comprehensive and well written article!
    I have one question:
    In your algorithms you always compare some difference of floats with some epsilon using the operator “<=". This operator itself includes a comparison for equality of some float values (except the ULPs), which was your initial problem. Wouldn't it be better to use "<"?

  50. Pingback: Infinite loop heisenbug: it exits if I add a printout | 我爱源码网

  51. Alex Parf says:

    I’ve found some interesting article (written in russian):
    It has some thoughts about this problem. I apologize there is no english version link, but in my opinion, the yandex translator gives more correct translation comparing to Google:
    So, I glad to hear your opinion on that comparison technique, especially to and .

    • Alex Parf says:

      Especially to “Idiom 1: check for a bit” and “Idiom 1′: equality comparison”.

      • Alex Parf says:

        Hello Bruce.
        No one has responded to my previous question yet.
        Could you comment a version of the comparison function mentioned in the article listed above?

        fabs(x – y) < (eps + 2.0 * FLT_EPSILON) * (1.0 + fmax(fabs(x), fabs(y)))

        Is it totally wrong, has it some restrictions or it can work well? Thank you.

        • brucedawson says:

          The article seems to be trying (I only glanced at it briefly) to “automatically” generate correct epsilon values. I say this based on it saying “To protect yourself from the possibility of setting too low values eps can be as follows”, and I don’t think that automatic epsilon methods work – the correct epsilon value depends on the algorithm. But, I’m not sure. It’s a tricky subject to be sure. I’d just be worried by the magic numbers, even 1.0, 2.0, and FLT_EPSILON are usually ‘magic’ numbers if used as part of a generic solution to comparing floating-point numbers ‘correctly’.

  52. Michael Lee Finney says:

    Some of the problems can be fixed fairly easily. After converting the floating point numbers to integers, convert the signed magnitude representation to two’s complement. Then the zeros will compare equal, and positive and negative numbers near zero give reasonable results. For example if you have 0x00000001 and 0x80000001 for the smallest negative and positive denormal values (for IEEE binary32) their difference is 2. Their difference from either zero is 1. That is exactly what you would expect. The difference of very large positive and negative values is not useful (the difference between 2.0 and -2.0 is 0x80000000), but that is not the application here. This sounds expensive, but the entire comparison (on a 3770K @ 4.0Ghz) runs in 2.29ns without inlining and around 1.73ns with inlining using VC 2015. The paper “Taming the Floating Point Beast” by Chris Lomot at “” goes into some detail and examines 18 different algorithms. I have been comparing floating point numbers this way for years, but Chris’s code was about 20% faster than mine (I optimized the individual functions and he optimized the entire function and VC would not generate the optimal code so I lost a few cycles). You also need to consider other comparisons. For example isLess(x, y, ulps) {return x < y and isNotEqual(x,y,ulps); } is an appropriate extension.

  53. brucedawson says:

    Thanks for the link to Chris’ paper.

    Regarding changing to two’s complement, that does ‘solve’ the problem of tiny denormals being treated as far away, but I remain unconvinced that that is a problem worth solving. If we decide that small negative and small positive numbers should compare as close then an ULPs based comparison is the wrong way to do it. In such a system the smallest positive and negative normalized numbers should probably compare as equal, but they are 16 *million* ULPs apart. Using floating-point subtraction and comparison for the absolute difference seems more sensible and intuitive, IMHO.

  54. Pingback: Floating-Point Determinism | Random ASCII

  55. Pingback: Is operator ≤ UB for floating point comparison? - BlogoSfera

  56. Alain Weiler says:

    Hello Bruce!

    The following implementation works well in all cases.
    For explanation, please read the comment within the code.
    I have performed tests with double and find a maxDiff = 1e-15
    as suitable. For float data the limit might be aroung 1e-7…1e-8
    (not tested).


    const double doubleMaxDiff = 1.e-15;

    bool AlmostEqualRelativeAndAbs(float A, float B, float maxDiff)
    // Check if the numbers are really close — needed
    // when comparing numbers ***near zero***, but only
    // in those cases, since else we would get wrong results
    // when comparing small values whose absolute
    // difference is small but their relative difference not,
    // for example 1e-10 and 1e-10*(1+1e-7).
    // So we add an upper limit below which the absolute
    // comparison will be performed.
    A = fabs(A);
    B = fabs(B);
    float largest = (B > A) ? B : A;
    float diff = fabs(A – B);

    if (diff <= maxDiff && largest <= maxDiff)
    return true;

    if (diff <= largest * maxDiff)
    return true;

    return false;

    // in one statement:
    //return (largest <= maxDiff && diff <= maxDiff) || diff <= largest * maxDiff;

  57. Pingback: Python:floating point equality in Python and in general – IT Sprite

  58. Pingback: IEEE 754 floating point, what is the largest number < 1? | iphoneinenglish

  59. Pingback: IEEE 754 floating point, what is the largest number < 1? | iPhone Questions Answered

  60. Pingback: IEEE 754 floating point, what is the largest number < 1? | aljustiphone

  61. Pingback: IEEE 754 floating point, what is the largest number < 1? | iphonequestion

  62. Pingback: IEEE 754 floating point, what is the largest number < 1? | realiphoneanswers

  63. Pingback: Python:Function to determine if two numbers are nearly equal when rounded to n significant decimal digits – IT Sprite

  64. Pingback: Python:Python floating point [duplicate] – IT Sprite

  65. zammbi says:

    So confusing, I think I’ll just stick with AlmostEqualRelative 🙂

  66. Krishna says:

    The reason is a lot simpler than it seemed
    The following code repros


    int main() {
    double pi = 3.14159265358979323846;
    std::cout << float(pi) << std::endl;
    std::cout << float(pi) + sin(float(pi)) << std::endl;

    float(pi) is not so accurate because well…. it is lower precision.
    The key is : what is the type of the 2nd expression (which is double)

    Even better, the coincidence can be fully explained with a bit of math and no floating point magic. The math is as follows:

    float(PI) is basically PI + epsilon where epsilon is a number smaller than the 32-bit floating point can represent.

    The question is what is float(PI + epsilon) + sin(PI + epsilon)

    The taylor expansion of sin around PI is (pi – x) – 1/6(pi – x)^3 …
    i.e. sin(PI + epsilon) = -epsilon + O(epsilon^3)

    float has 24 bits mantissa.
    double has 53 bits mantissa.
    i.e. if epsilon is smaller than the float can represent, epsilon^3 must require at least 3*24 = 72 bits to represent which is more than that of double.

    i.e. for sufficiently small epsilon

    sin(PI + epsilon) = -epsilon_d (where epsilon_d is the double version of the epsilon)

    but because the type of the above is

    Thus float(PI + epsilon) + sin(PI + epsilon) = double(float(PI + epsilon)) – epsilon_d
    = PI + epsilon – epsilon_d


  67. Darek says:

    Great article. Can You give me example of pair of doubles, for which AlmostEqualRelativeAndAbs() will return true, but (diff <= maxDiff) will not be true?

  68. markus says:

    great work bruce!!

  69. What are the side effects of this naive test for near equality? that works by simply normalising values – feel free to change the end comparison to a smaller 1/2^n value 😉

    bool FloatNearEq(float first, float second) {
    if(first == 0) return second == 0;

    float norm = second / first;
    float diff = fabs(1.f – norm);
    return diff < 0.000001f;

    • brucedawson says:

      The asymmetry between first and second worries me – I’m pretty sure that there are values for which FloatNearEq(a,b) != FloatNearEq(b,a), and that’s bad form.

      Also, it doesn’t do anything special for values extremely close to zero. The result of FloatNearEq(sin(float(pi)),0) will be false. The result of FloatNearEq(FLT_MIN, 0) will be false. The result of FloatNearEq(smallestdenormal, 0) will be false.

      That last one is particularly worrisome. If the smallest positive number is not considered close to zero, then nothing is, and in many scenarios that is wrong.

      If you don’t care about zero or numbers near zero (fine in some cases) then I would just advise making sure that it is symmetrical.

  70. anishkumark says:

    Reblogged this on Trial and error and commented:
    Facing problems with floating point comparison. This post gives you some insight.

  71. Simon O says:

    “I’m such a geek that I think that is the coolest thing I’ve discovered in quite a while.”

    You’re not alone.

  72. Pingback: Merrill Aldrich : IsCloseEnough(): I Ain’t Afraid of No Float

  73. SC says:

    Great series of articles on floating-point.

    I was wondering, Michael Lee Finney posted a link to Chris Lomont’s paper in which there is also a reference to one of your comparison functions. However, that version is not included in your article above. Are there any issues with that version? Is the use of reinterpret_cast reliable in that context?

    Also in your function AlmostEqualUlpsAndAbs above, you use a union Float_t. On a SO article (, it was mentioned that accessing a union like this invokes undefined behavior.


    • Pavel Celba says:

      Accessing union is not undefined behavior. It’s compiler defined behavior – all mainstream compilers allow this as exactly defined behavior.

      • SC says:

        Actually, in addition to the SO article, Bruce also mentions it in his article ‘Tricks With the Floating-Point Format’ where he ‘doesn’t recommend using this union for production coding (it is a violation of the aliasing rules for some compilers, and will probably generate inefficient code), but it is useful for learning.’

        Another of Bruce’s comparison functions in Chris Lomont’s article uses reinterpret_cast instead of a union. Wondering if that is a better way of doing it.

        • brucedawson says:

          I believe that reinterpret_cast (from float* to int*) is a violation of aliasing rules, and the union is preferred if you must engage in type punning. Ideally we shouldn’t be type punning, but sometimes it is hard to avoid.

  74. Roland Illig says:

    The derivative of sin(pi) is –1, not 1. Otherwise sin(pi-theta)=theta wouldn’t hold. Apart fom this typo, this little insight is pretty cool.

  75. Pingback: Using Floating Point Numbers as Hash Keys – Read after Write

  76. David Binner says:

    Hi, Bruce.

    Excellent article! Very informative.

    I am re-visiting some old code and happen to be in the process of seeking a zero of a function, f. Given an interval over which f changes sign, the program bisects the interval, evaluates f at a new point within the interval, tests if f == 0, and then iterates the process until f == 0. I am also adding a few extra tests to check if–by some stroke of luck–the condition f == 0 is satisfied before too many iterations of the process. In such a case, program execution terminates.

    The location of the zero on the x-axis could be any real number. I am just trying to find a point where f(x) = 0. It could be at x = 0.003 or x = 59876644567, etc. So I don’t think using a relative error is applicable in this situation.

    Your “Infernal zero” section mentioned how things break down when comparing numbers near 0. But what about when I am actually seeking 0? (I am using variables of type double.)

    I don’t think the following statement would ever be true:

    if (f == 0) return; // Exit sub-routine

    So I am considering something like

    if (fabs(f) < DBL_EPSILON) return; // Exit sub-routine

    Or should a quantity even smaller than DBL_EPSILON be used?

    Is there some "best practices" technique for having the f == 0 condition recognized and is inexpensive computationally?

    • brucedawson says:

      DBL_EPSILON is a very small number, so it is possible that your code will never get that small. Or maybe it can get much closer to zero than that – it depends on the function. I’d be tempted to monitor the direction of the results – sampling the slope – to tell when you are getting closer or farther (assuming the function is well behaved) and try to find the closest to zero, rather than choosing an arbitrary target.

      On the other hand, if DBL_EPSILON is “close enough” (whatever that means for your purpose) then use it. But don’t treat DBL_EPSILON as magical. It has no inherent meaning in this context.

  77. John Calcote says:

    Hi Bruce. Wonderful article – even after 5 years! Your examples are all in C/C++, but I assume that concepts apply to any language that implements IEEE floating point constructs. I’ve been playing with your examples in Java and I’m running into a strange scenario – I can’t get any error in a double, no matter what I do; float, yes, but not double:

    float f1 = 0.1f;
    double d1 = 0.1d;
    System.out.printf(“0.1 as float: %1.15f\n”, f1);
    System.out.printf(“0.1 as double: %1.15f\n”, d1);


    0.1 as float: 0.100000001490116120000000000000
    0.1 as double: 0.100000000000000000000000000000

    Any thoughts as to why this might be? Is there something in the JVM that manages doubles in a magic way? I’ve googled for this topic, but it’s (not surprisingly) difficult to narrow the search enough.


  78. Hi John, Why does %1.15f print 30 digits after the decimal point?
    15 isn’t enough for double’s error, trying Run at
    Cheers, Ralph.

    • John Calcote says:

      That was my fault – I pasted in the code, then realized it was only generating 15 places, so I modified the code and reexecuted, but forgot to change the code I’d pasted earlier. It’s actually printing using %1.30f specifiers. If I could edit my comment, I would.

      • brucedawson says:

        Well, an IEEE double cannot store 0.1. No binary floating-point system can. So, you probably have a bug/limitation/quirk in the printing system of your JVM. For a long time the VC++ run-time, for instance, would only print seventeen digits of a float/double. So it was incapable of printing the non-zero digits at the end of 0.1.

        You could try experimenting with repeatedly adding 0.1 to a variable – maybe a hundred times? Then subtract 10.0. See if you get zero. The errors might cancel out – I don’t know.

        Maybe see if somebody has written a better Java print routine. I show how to write one in C++ for floats here:
        It’s just software so with sufficient knowledge and time you can accurately print double(0.1) if you want to.

      • Hi John, You could try which says “How many digits must be printed for the fractional part of m or a? There must be at least one digit to represent the fractional part, and beyond that as many, but only as many, more digits as are needed to uniquely distinguish the argument value from adjacent values of type double.”

        And then separately, could be worth a try because it doesn’t have that “adjacent” clause and has an easier task because it’s binary to hex.

        • John Calcote says:

          Hi Ralph – thanks for these pointers. I could almost guess that Double.toString wouldn’t do me much good because in Java when a primitive is printed (using System.out.println, et al), it’s auto-boxed to it’s complex type (e.g., double -> Double), and then the toString method is called on the object. However, the toHexString was interesting:

          0.1 as double (boxed/toString): 0.1
          0.1 as double (boxed/toHexString): 0x1.999999999999ap-4

          • Hi John, That hex looks correct. Re-writing it as binary it’s 1.(1001) with an exponent of -4, where the (1001) shows those digits are recurring; that’s the 9. Shifting the binary point left four places to zero out the exponent gives 0.0001(1001). The places after the binary point represent 1/2, 1/4, 1/8, and 1/16, which is the first bit we have set. That seems right so far because 1/16th is less than the 1/10th we’re aiming for. If it was 1/8th instead that would be too high. The recurring digits then keep adding to this 1/16th to try and get it closer to 1/10th.

            1/16 + 1/32 + 1/256 + 1/512 + 1/4096 + 1/8192 + 1/65536 =

            Looked at from the other end, 0.0001100110011001 is 1100110011001, 6553 decimal, 65,536ths.

            6553 / 65536 = 0.09999084472656250000

  79. Peejay says:

    I kinda hate the IEEE floats

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s