Intermediate Floating-Point Precision

Posted on March 21, 2012 by brucedawson

Riddle me this Batman: how much precision are these calculations evaluated at?

void MulDouble(double x, double y, double* pResult)
{
    *pResult = x * y;
}

void MulFloat(float x, float y, float* pResult)
{
    *pResult = x * y;
}

If you answered ‘double’ and ‘float’ then you score one point for youthful idealism, but zero points for correctness. The correct answer, for zero-idealism points and forty two correctness points is “it depends”.

Read on for more details.

It depends. It depends on the compiler, the compiler version, the compiler settings, 32-bit versus 64-bit, and the run-time state of some CPU flags. And, this ‘it depends’ behavior can affect both performance and, in some cases, the calculated results. How exciting!

Previously on this channel…

If you’re just joining us then you may find it helpful to read some of the earlier posts in this series.

1: Tricks With the Floating-Point Format – an overview of the float format
2: Stupid Float Tricks – incrementing the integer representation
3: Don’t Store That in a Float – a cautionary tale about time
3b: They sure look equal… – special bonus post (not on altdevblogaday)
4: Comparing Floating Point Numbers, 2012 Edition
5: Float Precision—From Zero to 100+ Digits
5b: C++ 11 std::async for Fast Float Format Finding – special bonus post (not on altdevblogaday)
6: Intermediate Precision – (return *this;)

Post 4 (Comparing Floating Point Numbers) is particularly relevant because its test code legitimately prints different results with different compilers.

Default x86 code

The obvious place to start is with a default VC++ project. Let’s look at the results for the release build of a Win32 (x86) project with the default compiler settings except for turning off Link Time Code Generation (LTCG). Here’s the generated code:

?MulDouble@@YAXNNPAN@Z
    push    ebp
    mov     ebp, esp
    fld     QWORD PTR _x$[ebp]
    mov     eax, DWORD PTR _pResult$[ebp]
    fmul    QWORD PTR _y$[ebp]
    fstp    QWORD PTR [eax]
    pop     ebp
    ret     0

?MulFloat@@YAXMMPAM@Z
    push    ebp
    mov     ebp, esp
    fld     DWORD PTR _x$[ebp]
    mov     eax, DWORD PTR _pResult$[ebp]
    fmul    DWORD PTR _y$[ebp]
    fstp    DWORD PTR [eax]
    pop     ebp
    ret     0

It’s interesting that the code for MulDouble and MulFloat is identical except for size specifiers on the load and store instructions. These size specifiers just indicate whether to load/store a float or double from memory. Either way the fmul is done at “register precision”.

Register precision

By default a 32-bit VC++ x86 project generates x87 code for floating-point operations, as shown above. The peculiar x87 FPU has eight registers, and people will usually tell you that ‘register precision’ on this FPU means 80-bit precision. These people are wrong. Mostly.

It turns out that the x87 FPU has a precision setting. This can be set to 24-bit (float), 53-bit (double), or 64-bit (long double). VC++ hasn’t supported long double for a while so it initializes the FPU to double precision during thread startup. This means that every floating-point operation on the x87 FPU – add, multiply, square root, etc. – is done to double precision by default, albeit in 80-bit registers.The registers are 80 bits, but every operation is rounded to double precision before being stored in the register.

Except even that is not quite true. The mantissa is rounded to a double-precision compatible 53 bits, but the exponent is not, so the precision is double compatible but the range is not.

In the x87 register diagram below blue is the sign bit, pink is the exponent, and green is the mantissa. The full exponent is always used, but when the rounding is set to 24-bit then only the light green mantissa is used. When the rounding is set to 53-bit then only the light and medium green mantissa bits are used, and when 64-bit precision is requested the entire mantissa is used.

In summary, as long as your results are between DBL_MIN and DBL_MAX (they should be) then the larger exponent doesn’t matter and we can simplify and just say that register precision on x87 means double precision.

Except when it doesn’t.

While the VC++ CRT initializes the x87 FPUs register precision to _PC_53 (53-bits of mantissa) this can be changed. By calling _controlfp_s you can increase the precision of register values to _PC_64 or lower it to _PC_24. And, D3D9 has a habit of setting the x87 FPU to _PC_24 precision unless you specify D3DCREATE_FPU_PRESERVE. This means that on the thread where you initialized D3D9 you should expect many of your calculations to give different results than on other threads. D3D9 does this to improve the performance of the x87’s floating-point divide and square root. Nothing else runs faster or slower.

So, the assembly code above does its intermediate calculations in:

64-bit precision (80-bit long double), if somehow you avoid the CRT’s initialization or if you use _controlfp_s to set the precision to _PC_64
or 53-bit precision (64-bit double), by default
or 24-bit precision (32-bit float), if you’ve initialized D3D9 or used _controlfp_s to set the precision to _PC_24

On more complicated calculations the compiler can confuse things even more by spilling temporary results to memory, which may be done at a different precision than the currently selected register precision.

The glorious thing is that you can’t tell by looking at the code. The precision flags are a per-thread runtime setting, so the same code called from different threads in the same process can give different results. Is it confusingly awesome, or awesomely confusing?

/arch:SSE/SSE2

The acronym soup which is SSE changes all of this. Instructions from the SSE family all specify what precision to use and there is no precision override flag, so you can tell exactly what you’re going to get (as long as nobody has changed the rounding flags, but let’s pretend I didn’t mention that shall we?)

If we enable /arch:sse2 and otherwise leave our compiler settings unchanged (release build, link-time-code-gen off, /fp:strict) we will see these results in our x86 build:

?MulDouble@@YAXNNPAN@Z
    push     ebp
    mov      ebp, esp
    movsd    xmm0, QWORD PTR _x$[ebp]
    mov      eax, DWORD PTR _pResult$[ebp]
    mulsd    xmm0, QWORD PTR _y$[ebp]
    movsd    QWORD PTR [eax], xmm0
    pop      ebp
    ret      0

?MulFloat@@YAXMMPAM@Z
    push     ebp
    mov      ebp, esp
    movss    xmm0, DWORD PTR _x$[ebp]
    movss    xmm1, DWORD PTR _y$[ebp]
    mov      eax, DWORD PTR _pResult$[ebp]
    cvtps2pd xmm0, xmm0
    cvtps2pd xmm1, xmm1
    mulsd    xmm0, xmm1
    cvtpd2ps xmm0, xmm0
    movss    DWORD PTR [eax], xmm0
    pop      ebp
    ret      0

The MulDouble code looks comfortably similar, with just some mnemonic and register changes on three of the instructions. Simple enough.

The MulFloat code looks… longer. Weirder. Slower.

The MulFoat code has four extra instructions. Two of these instructions convert the inputs from float to double, and a third converts the result from double back to float. The fourth is an explicit load of ‘y’ because SSE can’t load-and-multiply in one instruction when combining float and double precision. Unlike the x87 instruction set where conversions happen as part of the load/store process, SSE conversions must be explicit. This gives greater control, but it adds cost. If we optimistically assume that the load and float-to-double conversions of ‘x’ and ‘y’ happen in parallel then the dependency chain for the floating-point math varies significantly between these two functions:

MulDouble:

movsd-> mulsd-> movsd

MulFloat

movss-> cvtps2pd-> mulsd-> cvtpd2ps-> movss

That means the MulFloat dependency chain is 66% longer than the MulDouble dependency chain. The movss and movsd instructions are somewhere between cheap and free and the convert instructions have comparable cost to floating-point multiply, so the actual latency increase could be even higher. Measuring such things is tricky at best, and the measurements extrapolate poorly, but in my crude tests I found that on optimized 32-bit /arch:SSE2 builds with VC++ 2010 MulFloat takes 35% longer to run than MulDouble. On tests where there is less function call overhead I’ve seen the float code take 78% longer than the double code. Ouch.

To widen or not to widen

So why does the compiler do this? Why does it waste three instructions to widen the calculations to double precision?

Is this even legal? Is it perhaps required?

The guidance for whether to do float calculations using double-precision intermediates has gone back and forth over the years, except when when it was neither back nor forth.

The IEEE 754 floating-point math standard chose not to rule on this issue:

Annex B.1: The format of an anonymous destination is defined by language expression evaluation rules.

Okay, so it’s up to the language – let’s see what they say.

All floating arithmetic in C is carried out in double-precision; whenever a float appears in an expression it is lengthened to double by zero-padding its fraction.

However the C++ 1998 standard does not appear to contain that language. The most obvious governing language is the usual arithmetic conversions (section 5.9 in the 1998 standard) which basically says that double-precision is only used if the expression contains a double:

5.9: If the above condition is not met and either operand is of type double, the other operand is converted to type double.

But the C++ 1998 standard also appears to permit floating-point math to be done at higher precisions if a compiler feels like it:

5.10 The values of the floating operands and the results of floating expressions may be represented in greater
precision and range than that required by the type; the types are not changed thereby.55)

The C99 standard also appears to permit but not require floating-point math to be done at higher precision:

Evaluation type may be wider than semantic type – wide evaluation does not widen semantic type

And Intel’s compiler lets you choose whether intermediate results should use the precision of the source numbers, double precision, or extended precision.

/fp:double: Rounds intermediate results to 53-bit (double) precision and enables value-safe optimizations.

Apparently compilers may use greater precision for intermediates, but should they?

The classic Goldberg article points out errors that can occur from doing calculations on floats using double precision:

This suggests that computing every expression in the highest precision available is not a good rule.

On the other hand, a well reasoned article by Microsoft’s Eric Fleegal says:

It is generally more desirable to compute the intermediate results in as high as [sic] precision as is practical.

The consensus is clear: sometimes having high-precision intermediates is great, and sometimes it is horrible. The IEEE math standard defers to the C++ standard, which defers to compiler writers, which sometimes then defer to developers.

It’s a design decision and a performance bug

As Goldberg and Fleegal point out, there are no easy answers. Sometimes higher precision intermediates will preserve vital information, and sometimes they will lead to confusing discrepancies.

I think I’ve reverse engineered how the VC++ team made their decision to use double-precision intermediates with /arch:SSE2. For many years 32-bit code on Windows has used the x87 FPU, set to double precision, so for years the intermediate values have been (as long as they stay in the registers) double precision. It seems obvious that when SSE and SSE2 came along the VC++ team wanted to maintain consistency. This means explicitly coding double precision temporaries.

The compiler (spoiler alert: prior to VS 2012) also has a strong tendency to use x87 instructions on any function that returns a float or double, presumably because the result has to be returned in an x87 register anyway, and I’m sure they wanted to avoid inconsistencies within the same program.

The frustrating thing about my test functions above is that the extra instructions are provably unneeded. They will make no difference to the result.

The reason I can be so confident that the double-precision temporaries make no difference in the example above is that there is just one operation being done. The IEEE standard has always guaranteed that the basic operations (add, subtract, multiply, divide, square root, some others) give perfectly rounded results. On any single basic operation on two floats if the result is stored into a float then widening the calculation to double is completely pointless because the result is already perfect. The optimizer should recognize this and remove the four extraneous instructions. Intermediate precision matters in complex calculation, but in a single multiply, it doesn’t.

In short, even though the VC++ policy in this case is to use double-precision for float calculations, the “as-if” rule means that the compiler can (and should) use single precision if it is faster and if the results would be the same “as-if” double precision was used.

64-bit changes everything

Here’s our test code again:

void MulDouble(double x, double y, double* pResult)
{
    *pResult = x * y;
}

void MulFloat(float x, float y, float* pResult)
{
    *pResult = x * y;
}

And here’s the x64 machine code generated for our test functions with a default Release configuration (LTCG disabled) in VC++ 2010:

?MulDouble@@YAXNNPEAN@Z
    mulsd    xmm0, xmm1
    movsdx   QWORD PTR [r8], xmm0
    ret      0

?MulFloat@@YAXMMPEAM@Z
    mulss    xmm0, xmm1
    movss    DWORD PTR [r8], xmm0
    ret      0

The difference is almost comical. The shortest function for 32-bit was eight instructions, and these are three instructions. What happened?

The main differences come from the 64-bit ABI. On 32-bit we need “push ebp/mov ebp,esp/pop ebp” in order to set up and tear down a stack frame for fast stack walking. This is crucial for ETW/xperf profiling and for many other tools and, as dramatic as it looks in this example, is not actually expensive and should not be disabled. On 64-bit the stack walking is done using metadata rather than executed instructions, so we save three instructions.

The 64-bit ABI also helps because the three parameters are all passed in registers instead of on the stack, and that saves us from two instructions to load parameters into registers. And with that we’ve accounted for all of the differences between the 32-bit and 64-bit versions of MulDouble.

The differences in MulFloat are more significant. x64 implies /arch:SSE2 so there’s no need to worry about the x87 FPU. And, the math is being done at float precision. That saves three conversion instructions. It appears that the VC++ team decided that double-precision temporaries were no longer a good idea for x64. In the sample code I’ve used so far in this post this is pure win – the results will be identical, only faster.

For more complicated calculations – anything more than a single binary operator – the results may differ on 64-bit because of the move away from double-precision intermediates. The results may be better or worse, but watch out. Floating point results can be sensitive to initial conditions: it’s the butterfly effect, wherein a distant hurricane may cause a butterfly to flap its wings

If you want to control the precision of intermediates then cast some of your input values to double, and store explicit temporaries in doubles. The cost to do this can be minimal in many cases.

64-bit FTW

64-bit code-gen was a chance for the VC++ team to wipe the slate clean. The new ABI plus the new policy on intermediate precision means that the instruction count for these routines goes from eight-to-twelve instructions to just three. Plus there are twice as many registers. No wonder x64 code can (if your memory footprint doesn’t increase too much) be faster than x86 code.

VS 2012 changes everything

Visual Studio 11 beta (which was released as VS 2012) appears to have a lot of code-gen differences, and one of them is in this area. VS 2012 no longer uses double precision for intermediate values when it is generating SSE/SSE2 code for float calculations.

It’s the journey, not the destination

So far our test code has simply done one basic math operation on two floats and stored the result in a float, in which case higher precision does not affect the result. As soon as we do multiple operations, or store the result in a double, higher precision intermediates give us a different and more accurate answer.

One could argue that the compiler should use higher precision whenever the result is stored in a double. However this would diverge from the C++ policy for integer math where the size of the destination doesn’t affect the calculation. For example:

int x = 1000000;
long long y = x * x;

The multiply generates an ‘int’ result and then assigns that to the long long. The multiply overflows, but that’s not the compiler’s problem. With integer math the operands determine the precision of the calculation, and applying the same rules to floating-point math helps maintain consistency.

Comprehension test

Here’s some simple code. It just adds four numbers together. The numbers have been carefully chosen so that there are three possible results, depending on whether the precision of intermediate values is float, double, or long-double.

float g_one = 1.0f;
float g_small_1 = FLT_EPSILON * 0.5;
float g_small_2 = DBL_EPSILON * 0.5;
float g_small_3 = DBL_EPSILON * 0.5;

void AddThreeAndPrint()
{
    printf("Sum = %1.16e\n", ((g_one + g_small_1) + g_small_2) + g_small_3);
}

Take this code, add it to a project, turn off LTCG, turn on Assembler Output (so you can see the generated code without even calling the code), and play around with build settings to see what you get. You can also run the code to see the three different results, perhaps with some calls to _controlfp_s to vary the x87’s precision.

// Here's how to use _controlfp_s to set the x87 register precistion
// to 24-bits (float precision)
unsigned oldState;
_controlfp_s(&oldState, _PC_24, _MCW_PC);

To aid in your explorations, use this simple flowchart to figure out what precision is used for intermediates in all-float calculations. Click on the image for a clearer view of the flowchart:

Got it? Simple enough?

I’m sure the VC++ team didn’t plan for the flowchart to be this complicated, however much of the complexity is due to the weirdness of the x87 FPU, and then their desire for SSE/SSE2 code to be x87 compatible. The Intel compiler’s options are complex also but I glossed over the details by just assuming that SSE/SSE2 is available.

And at least Microsoft is moving towards a dramatically simpler story with VS 2012 and with x64.

Bigger is not necessarily better

I want to emphasize that more precise intermediates are not ‘better’. As we have seen they can cause performance to be worse, but they can also cause unexpected results. Consider this code:

float g_three = 3.0f;
float g_seven = 7.0f;

void DemoOfDanger()
{
    float calc = g_three / g_seven;
    if (calc == g_three / g_seven)
        printf("They're equal!\n");
    else
        printf("They're not equal!\n");
}

If intermediate values use float precision then this will print “They’re equal!”, but otherwise – by default on x86 – it will not.

Sometimes bigger is better

Imagine this code:

float g_three = 3.0f;
float g_seven = 7.0f;

double DemoOfAwesome()
{
    return g_three / g_seven;
}

If the compiler uses source-precision (float-precision in this case for intermediates) then it will calculate a float-precision result and then convert it to a double and return it. This is consistent with the rules for integer calculations (the destination of a calculation doesn’t affect how the calculation is done), but it means that maximum precision is not realized. Developers can fix this by casting g_three or g_seven to double, but this is just another example showing that there is no simple solution.

The code that got me interested in this question is shown below in a simplified form:

float g_pointOne = 0.1f;

void PrintOne()
{
    printf("One might equal %1.10f\n", g_pointOne * 10);
}

Depending on the intermediate precision, and therefore depending on your compiler and other factors, this might print 1.0000000000 or 1.0000000149:

What’s a developer to do?

Right now the best recommendation for x86 game developers using VC++ is to compile with /fp:fast and /arch:sse2.

For x64 game developers /arch:sse2 is implied, and /fp:fast may not be needed either. Your mileage may vary, but it seems quite possible that /fp:strict will provide fast enough floating-point, while still giving conformant and predictable results. It really depends on how much you depend on predictable handling of NaNs and other floating-point subtleties that /fp:fast takes away.

For x86 developers using VS 2012 /arch:sse will still be needed but, as with x64, /fp:fast may no longer be.

If you need double-precision intermediates you can always request them on a case-by-case basis by using double or by casting from float to double at key points. That’s why I like source-precision intermediates because it gives developers the option to request higher precision when they want it.

Be wary of floating-point constants. Remember that 1.0 is a double, and using it will cause your float calculations to be done at double precision, with the attendant conversion costs. Searching your generated code for “cvtps2pd” and “cvtpd2ps” can help track down significant performance problems. I recently sped up some x64 code by more than 40% just by adding an ‘f’ to some constants. You can also find double-to-float conversions by enabling warning C4244:

// Enable this to find unintended double to float conversions.
// warning C4244: 'initializing' : conversion from 'double' to 'float', possible loss of data
#pragma warning(3 : 4244)

Wrap up

I hope this information is useful, and helps to explain some of the reasons why IEEE compliant floating-point can legitimately give different results on different platforms.

Perversely enough, it appears that developers who are not using /fp:fast are at higher risk for having their floating point results change when they upgrade to VS 2012 or start building for 64-bit.

Got any tales of woe in this area? Share them in the comments.

Update

I recently learned that the problem of intermediate precision was addressed by the C99 standardization committee which specifies FLT_EVAL_METHOD which is designed to indicate what intermediate precision is used. The main values are:

0: intermediate precision is dictated by the types used
1: intermediate precision is double precision (unless the types used are higher precision)
2: intermediate precision is long double

gcc uses 2 for x86 code (due to the x87 FPUs long double registers) and 0 for x64 code (because the SSE/SSE2 FPU supports this more naturally).

It appears that VC++ uses 1 for x86 code (they set the x87 FPU precision to double precision) and uses 1 for 64-bit code prior to VS 2012, and 0 with Visual Studio 2012 and higher. I have not checked whether VC++ sets FLT_EVAL_METHOD – I just noticed the behavior in the code generation.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048

View all posts by brucedawson →

This entry was posted in AltDevBlogADay, Floating Point, Programming, Visual Studio and tagged double, float, floating point, floating point precision, intermediate precision, precision. Bookmark the permalink.

44 Responses to Intermediate Floating-Point Precision

Rick Regan says:

March 22, 2012 at 5:57 am

Here’s a “tale of woe”: http://www.exploringbinary.com/why-volatile-fixes-the-2-2250738585072011e-308-bug/ . In it I explain the reason behind the PHP floating-point conversion infinite loop bug (a perfect storm of x87 extended precision, subnormal numbers, and double rounding).

Reply
- brucedawson says:
  
  March 22, 2012 at 9:29 pm
  
  That’s a good link. A couple of interesting points from the article are:
  – When the x87 unit is set to double-precision the mantissa is always rounded to 53 bits. For numbers in the double-precision subnormal range (from -DBL_MIN to +DBL_MIN) this leads to different results because a true double would have 1 to 52 bits of precision in that range.
  
  – Double rounding can be a problem with the x87 registers. The basic idea is that if a number like 754.6 is rounded to two-digits then the answer is 750, but if it is rounded to three digits you get 755, and if that is then rounded to two digits (using round-to-nearest-even) then the answer is 760. This double rounding is a risk if you round to extended double precision and then to double precision. Luckily I believe it is not a risk when rounding to double precision and then to float precision.
  
  Reply
  - Rick Regan says:
    
    March 23, 2012 at 6:58 am
    
    I don’t think you can get a double rounding error in the case you describe (promote floats to doubles, calculate, convert answer to float), but in general, you can get double rounding errors from double to float. Consider
    
    0.50000017881393421514957253748434595763683319091796875 + 0.5.
    
    In binary, this is 0.10000000000000000000001011111111111111111111111111111 + 0.1 = 1.00000000000000000000001011111111111111111111111111111.
    
    Rounded directly to a float it becomes 1.00000000000000000000001 (because bit 25 is 0). Rounded first to a double it becomes1.000000000000000000000011 (because bit 54 is 1). Then, rounded to a float it becomes 1.0000000000000000000001 (because bit 25 is 1). This is a double rounding error from double to float.
    
    Reply
    - brucedawson says:
      
      March 23, 2012 at 8:35 am
      
      Thanks Rick — I appreciate your weighing in. I had trouble following your reasoning so I put your calculations into aligned columns, and zero padded the floats and doubles. Pasted into notepad, or otherwise viewed with a fixed-width font, I found this made it easier to see your point.
      
      .10000000000000000000001011111111111111111111111111111 double
      + .10000000000000000000000000000000000000000000000000000 double
      = 1.00000000000000000000001011111111111111111111111111111 infinite
      1.0000000000000000000000110000000000000000000000000000 double
      1.00000000000000000000010 float (double rounded)
      
      1.00000000000000000000001 float (direct)
      
      I hope you are correct that double rounding is not an issue in the simple (float)((double)float*(double)float) case, as that would make the VC++ 2010 /arch:sse2 code-gen too tragic.
      
      Reply
Rick Regan says:

March 23, 2012 at 9:36 am

(When I worked it out myself, I had things aligned in notepad. Comment formatting leaves a lot to be desired.)

Regarding your simple case, I was thinking mainly about multiplication, and that a 24×24 bit multiplication would give only 48 bits, so not enough to cause rounding at bit 54. Even with division, I couldn’t figure out how to get the “magic” bit sequence out to 54 bits. But now I’m wondering about addition and subtraction. Could we get the magic sequence after aligning the radix points? I’ll have to think about this.

Reply
- Rick Regan says:
  
  March 23, 2012 at 7:04 pm
  
  I can’t think of a scenario for addition or subtraction that would cause double rounding. We’d need a carefully crafted string of bits between bits 25 and 54 (what I called the magic sequence). That’s 30 bits, but a float, shifted to this range for alignment, falls short at 24 bits.
  
  That leaves division, which seems harder to analyze. Can division of two 24-bit values produce the magic sequence? My gut says no but I can’t tell you why.
  
  Reply
  - brucedawson says:
    
    March 23, 2012 at 7:22 pm
    
    I tried some casual research to investigate this question, but most of the results, directly or indirectly, were your site. Which is not going to help.
    
    Given enough computer time we could test all ~16 billion billion two-float possibilities, but I’d rather not.
    
    Reply
    - Rick Regan says:
      
      January 1, 2014 at 7:27 pm
      
      Bruce,
      
      I was just doing a little reading about double rounding and I remembered this old thread. I think the answer to your question is in the paper “When Is Double Rounding Innocuous” by Samuel A. Figueroa (I can’t find a copy of it online, but the content is essentially the same as chapter 6 in http://www.cs.nyu.edu/csweb/Research/Theses/figueroa_sam.pdf). I think the upshot is that since doubles have more than twice the precision as floats, there can be no double rounding in the case we discussed. (Granted I have still yet to read the details of the proofs in that paper.)
      
      Reply
      - brucedawson says:
        
        January 1, 2014 at 9:42 pm
        
        Thanks Rick. That looks like a great reference.
        
        Reply
Rick Regan says:

March 24, 2012 at 6:01 am

That’s funny. I’ve done that before too — queried something and gotten my site. Clearly Google thinks I know more than I do. 🙂

Reply
Jilles Tjoelker says:

March 31, 2012 at 3:11 pm

A further tricky thing about the x87 precision control is that a few instructions do not respect it, in particular FILD (integer->floating conversion). If an integer is pushed onto the x87 register stack, this is always exact, even though a double cannot represent all 64-bit integers exactly (and a float cannot even represent all 32-bit integers exactly). Consequences include: a conversion to double of a very large integer like LLONG_MAX in a register will never compare equal to the result of any arithmetic operation, and the difference of two integers that round to the same double, both converted to double, may be exact instead of the expected 0.

Reply
- brucedawson says:
  
  April 2, 2012 at 4:26 pm
  
  I was not aware of that ‘feature’ of fild. What makes it trickier is that the VS debugger’s register window only displays 17 digits of mantissa for the x87 registers, which is insufficient to distinguish two 80-bit long doubles that are very close. In the example below the registers window suggests that the two values are equal, but they are not.
  
  long long reallyBigInt = 100000001500000005LL;
  long long reallyBigFactor1 = 100000001LL;
  long long reallyBigFactor2 = 1000000005LL;
  
  void fildTest()
  {
  long long product = reallyBigFactor1 * reallyBigFactor2;
  if (product == reallyBigInt)
  {
  // This is executed.
  printf(“They are equal as long long integers.\n”);
  }
  
  double dProduct = (double)reallyBigFactor1 * (double)reallyBigFactor2;
  if (dProduct == reallyBigInt)
  {
  // This is not executed when compiling for x87 because
  // reallyBigInt is loaded exactly, despite not fitting
  // in a double.
  printf(“They are equal as doubles.\n”);
  }
  }
  
  Reply
Pingback: Floating-point complexities | Random ASCII
Atrix256 says:

April 9, 2012 at 8:56 pm

“Got any tales of woe in this area? Share them in the comments.”

I made a game using google’s native client (NaCl) that is a ray traced snake game. (Quick background, NaCl is natively compiled code compiled using a modified gcc chain that runs in chrome in a cross browser way, but when you compile you have to compile a 32 and 64 bit version).

Anyhow, my game has major floating point issues on various machines (especially macintosh, but some PC’s as well, especially older ones i think) where the graphics are just totally messed up and the rays miss objects that they shouldn’t.

Since the code is the same on the different machines (well, 32 and 64 bite are different of course!), it suggests that the problem could be due to the increased precision of FP registers on some chips, or since my code runs within the chrome browser process (i think…), it could be that the process itself is set to different floating point precision modes?

I’m not quite sure and it’s difficult without dev style access to a machine that can repro the problem, but this is my tale of woe and I’ve been scouring your blog for info hoping to find something to help me out.

So far I’m learning a ton about FP. Thanks for putting together all this great info, it’s way cool of you! You’ve done a great service (:

Reply
- brucedawson says:
  
  April 10, 2012 at 1:20 am
  
  That sounds like a tough problem. I would hope that Google would provide a consistent floating-point environment (consistent for 32-bit, and differently consistent for 64-bit). Is it possible that the graphics chips are different?
  
  If the code is the same then there are only a few possibilities that come to mind:
  If the 32-bit code uses the x87 FPU then the precision setting could be different.
  On SSE/SSE2 code or x87 code it is possible that somebody has altered the rounding mode — although that would be unusual.
  If you are calling transcendental functions (sin, cos, log) then be aware that different libraries and FPUs are not required to give the same answers.
  
  You should be able to query for the rounding modes and precision settings, but I don’t know how to do that in NaCL.
  
  Reply
  - Atrix256 says:
    
    April 10, 2012 at 10:32 am
    
    That’s cool to hear that you think google should be able to provide consistency. That’s what I hoped for too but wasn’t sure if it was possible to do that from their end.
    
    The ray tracer is all software rendered on the cpu (multithreaded) so luckily the video card is out of the equation.
    
    Thanks a bunch, i’ll see if i can dig deeper into the rounding mode and precision settings and see if those are different or if i can force them to be consistent on all the threads when my code launches or something (:
    
    Reply
    - brucedawson says:
      
      April 10, 2012 at 7:54 pm
      
      For a given bit of binary code (x86 or x64) that doesn’t use transcendental instructions the floating-point environment should be consistent.
      
      Reply
Atrix256 says:

April 16, 2012 at 4:24 pm

Hey Bruce,

Just to make sure i have this right, when you say the environment should be consistent, you aren’t saying that the results of calculations on different machines should all yield the same result do you?

There will still be the problem where two processors might give two different (but similar) results to the same set of floating point instructions right?

Reply
- brucedawson says:
  
  April 16, 2012 at 7:06 pm
  
  That is exactly what I am saying. If they leave the rounding mode (and precision mode, for x87), and denormal-flush setting at consistent settings then, for a sequence of machine instructions, with the possible exception of transcendental instructions, the results should be identical.
  
  If they aren’t then something is horribly wrong. If they are wrong, then where? If an IEEE add/mul/divide/subtract/sqrt instruction doesn’t give a correctly rounded result then that means your CPU is buggy.
  
  This concept — deterministic floating-point for a particular compiled binary — is relied upon by many games. These games rely on the fact that the same simulation on multiple machines will give big-identical results. You have to be careful with your inputs and a few other things, but it most definitely works.
  
  Reply
  - Atrix256 says:
    
    April 16, 2012 at 7:33 pm
    
    Interesting, myself and some fellow game devs seem to be misinformed, I’ll have to let my buddies know. Thank you very much for dispelling this pretty common (in my experience!) myth!
    
    Reply
  - brucedawson says:
    
    April 16, 2012 at 10:38 pm
    
    Also see this article:
    
    http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/
    
    In particular, from the comments: “I work at Gas Powered Games and i can tell you first hand that floating point math is deterministic.”
    
    Reply
    - Atrix256 says:
      
      April 19, 2012 at 8:44 pm
      
      I just wanted to give a quick update on this. After reading these articles and chatting with you here in the comments, you convinced me to stop barking up the floating point non determinism tree. As it turns out my problem was due to threading issues (and is now solved)!
      
      Long story short, it was trying to copy pixels on the main thread while rendering to them on the worker threads, causing pixel corruption. I had a real low repro hang bug too… Occam’s razor should have tipped me off that the two were related but I didn’t really put that together.
      
      Thanks again Bruce for all the help and the information in your articles. You are one cool dude (:
      
      Reply
Pingback: Exceptional Floating Point | Random ASCII
Pingback: That’s Not Normal–the Performance of Odd Floats | Random ASCII
Trillian says:

May 21, 2012 at 9:34 am

I also want to thank you about this series of articles. I’m currently developing a JIT compiler targetting the X64 architecture and this info is very interesting.

As an aside, I know that making a game totally deterministic when it uses floating-point values is a big challenge. I’m curious, you’ve already mentionned some factors which can modify the results of computations: the way the machine code is generated, the X87 precision flags, the SSE rounding mode flag. Are there other such factors? Will X87 and SSE give the same results if the machine code makes equivalent operations in the same order and the flags are set properly?

Reply
- brucedawson says:
  
  May 21, 2012 at 11:13 am
  
  If you set the x87 unit to float precision you should get the same results as the SSE float instructions in virtually all cases. I think the only exception is that the exponent range on the x87 unit will not be bounded to the float range. The only way to guarantee identical results in all cases is to round the results of each x87 operation to the appropriate precision by storing to memory.
  
  Other than that, if your code generator is careful then getting the same results should work. The individual operations are guaranteed to be bit identical. On the other hand, if you use SSE everywhere then that makes things much easier.
  
  Reply
Trillian says:

May 21, 2012 at 10:01 am

Oh another thing, do you see any reason why the X64 SSE MulFloat uses the packed conversion instructions (cvtps2pd) rather than the scalar ones (cvtss2sd)? Just being curious…

Reply
- brucedawson says:
  
  May 21, 2012 at 11:14 am
  
  I have no idea why VC++ would generate cvtps2pd rather than cvtss2sd.
  
  Reply
Khalid says:

June 27, 2012 at 1:27 am

Hi Bruce,
Really great explanation, though i’m still compelled to ask these silly questions due to my lack of knowledge:

1-does your article(more precisely the flowchart) apply only to ‘float’ or ‘double’, based on my testing seems all this is relevant if we use float types only

2-given current and future state of hardware and architecture, what would you recommend to go for? is it better to use doubles/floats solely or mix them up?
currently i’m leaning towards just using doubles and ignoring floats all together( seems during my profiling that its faster anyways )

many thanks!

Reply
- brucedawson says:
  
  June 27, 2012 at 9:07 pm
  
  The chart assumes that you are doing calculations involving only floats. If any of your operands are doubles then part or all of the calculations will be done at double precision, guaranteed.
  
  Except life is never that simple. Some machines have extended-double or other bigger numbers. The compiler or CPU might decide to do the calculations at this greater precision. So, *mostly* the chart only applies when you are using floats, but no guarantees.
  
  The actual calculation speed is generally identical for floats and doubles. But doubles use more space in memory so if you have large arrays of them then floats might be faster, due to fewer cache misses. In either case (but mostly with floats) if you are not careful then you might end up doing lots of float-to-double and double-to-float conversions, which slow down calculations. If doubles are faster than floats then the conversions are usually what is slowing down the floats.
  
  Floats can be faster if you are doing SIMD. A 128-bit register can hold (and process) four floats but only two doubles.
  
  Games use floats, but for many other domains it may be simpler and just as fast to use doubles.
  
  Reply
Pingback: Doubles are not floats, so don’t compare them | Random ASCII
Erik says:

August 16, 2012 at 3:02 am

Dear Bruce,

thanks for the article. I came accros it while searching for a fix for my raytracer (win32). I ported it to use sse intrinsics and got some terrible rounding errors. In the end the problem turned out to be a stupid cast, but it was cool to see how the errors changed between sse on/of and the fp settings, just like you described.

Thanks!

Reply
Pingback: Game Developer Magazine Floating Point | Random ASCII
Pingback: Top Eight Entertaining Blog Facts for 2012 | Random ASCII
Pingback: Float Precision Revisited: Nine Digit Float Portability | Random ASCII
Pingback: Comparing Floating Point Numbers, 2012 Edition | Random ASCII
Pingback: Floating-Point Determinism | Random ASCII
Richard says:

March 25, 2014 at 9:19 pm

Hi Bruce,
PrintOne()’s result also depends on rounding mode, since g_pointOne cannot actually be 0.1.

Reply
- brucedawson says:
  
  March 25, 2014 at 9:52 pm
  
  True, rounding mode can also affect the results. I was assuming (without stating it) that the rounding mode was nearest-even. Different rounding modes should be something that developers don’t normally need to worry about — they are ‘opt-in’ problems — whereas varying intermediate precision is more likely to be a problem.
  
  Reply
Pingback: Floating-Point Determinism | Random ASCII
donbright says:

November 18, 2015 at 9:17 pm

Here is an interesting story. In the boost library, you can theoretically generate deterministic pseudorandom floating point numbers. However what really happens is that you get different deterministic numbers depending on whether you use gcc on a 32 bit x86 machine or a 64 bit x86 machine.

In the end it came down to a single line of code

double result = numerator / divisor * (max_value – min_value) + min_value;

My question I guess is can you “always” workaround this by the following?

double R = numerator;
R /= divisor;
double M=max-min
R*=M
R+=min

https://github.com/openscad/openscad/issues/452#issuecomment-157563128

http://stackoverflow.com/questions/33795031/apparently-identical-math-expressions-with-different-output

Thanks.

Reply
- brucedawson says:
  
  November 18, 2015 at 10:00 pm
  
  Whether that rewriting of the code changes anything would depend on optimization settings. The only way to round something to double precision on an x87 FPU is to store it to memory, which is a waste of instructions, and thus often omitted when using high optimization levels. And, you might still end up with double rounding problems because, for instance, when you go “R /= divisor;” then on an x87 FPU set to 80-bit precision it will calculate the answer (rounded to 80-bit precision) and then when you store it in a double it will be rounded to double precision – double rounding does not always the same answer.
  
  I think that Microsoft made the right move by defaulting to setting the x87’s precision to double which *mostly* avoids all of these problems.
  
  It is amazing how many problems have been caused by the extra precision of the x87 FPU.
  
  Reply
  - donbright says:
    
    November 21, 2015 at 7:53 am
    
    thanks so much.
    
    Reply
265 993 303 says:

November 1, 2023 at 11:54 am

Integers and fixed points for speed and consistency, floating points for convenient user input! It only starts being tricky when very high dynamic ranges are necessary or many values are packed, software using arrays of millions of floating points might be hardest to optimize since switching floats to 64-bit integers is quarter dynamic range and increases memory, and integer based solution for float or double or long double dynamic range will be harder to make fast. I’m talking about 32-bit x86 software with x87 floating point as the main target since that spans a very broad range of compatibility with Digital Mars C++ working in Windows NT 3.5—Windows 11 and plenty of Intel and AMD processors supporting it

Reply