Float Precision Revisited: Nine Digit Float Portability

Last year I pointed out that float variables can be converted to text and then back to the same binary value using printf(“%1.8e”). I also supplied a test program that used C++ 11 threading to quickly prove this claim on VC++ 2012.

However what was left untested was whether a float that is converted to text using one compiler would faithfully be restored using a different compiler on a different platform. Today that question is (mostly) tested.

The summary of the results is that across a test of all two billion positive floats VC++ gets the last digit wrong four times, has legitimate disagreements with g++ 6,694,304 times, but the discrepancies probably don’t matter.

This article is part of a series of floating-point articles published mostly in 2012.

In the original float-precision article I claimed that:

  1. A 32-bit float can uniquely identify all six digit decimal numbers within its normalized range
  2. A 32-bit float can be uniquely identified by printing it with nine decimal digits of mantissa
  3. Printing the exact value of a 32-bit float can take up to 112 decimal digits of mantissa

For the purposes of this article the crucial distinction to understand is between the number of digits required to “uniquely identify” a particular float, and the number required to “exactly print” its value. The value 0.1f is a simple example of the distinction. 0.1f uniquely identifies a particular float (the float whose value is closest), but the value of this float isn’t 0.1 – it is actually 0.100000001490116119384765625. Close, but different.

Most developers rarely need to print the exact value of a float but it is useful to be able to print a float and be confident that you can retrieve the identical binary float value from the text.
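
The round trip described above can be sketched in a few lines. This is a minimal illustration, not the test program from the original article: the helper names `float_bits`, `float_to_text` and `round_trips` are hypothetical, and the sketch assumes that the C library's printf and strtof are correctly rounded.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");

// Return the raw bit pattern of a float (memcpy avoids aliasing problems).
inline std::uint32_t float_bits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Print a float with nine significant digits: one before the point, eight after.
inline std::string float_to_text(float f) {
    char buf[32];
    const int written = std::snprintf(buf, sizeof buf, "%1.8e", f);
    assert(written > 0 && written < static_cast<int>(sizeof buf));
    return std::string(buf, static_cast<std::size_t>(written));
}

// Parse the text back and report whether the bit pattern survived.
inline bool round_trips(float f) {
    const float parsed = std::strtof(float_to_text(f).c_str(), nullptr);
    return float_bits(parsed) == float_bits(f);
}
```

Calling `round_trips(0.1f)` exercises the distinction made above: the text only needs to identify the float, not spell out its exact value.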

ASCII non-equivalence: rounding rules

My first test was to write code to print all ~2 billion positive floats using VC++ 2010 (x86) and using g++ 4.6.3 on x86 Ubuntu. I had hoped to find that the ASCII representations were identical, but I was disappointed. Exactly 6,694,308 of the positive floats returned different decimal representations when printed with %1.8e with VC++ compared to g++. That’s roughly 0.3% of the total.

Here’s an example of one of the differences:

  • +6.10351563e-005: VC++
  • +6.10351562e-05: g++
  • +6.103515625e-005: full precision, using the code from this article

There’s one cosmetic difference since VC++ prints the exponents as three digit numbers and g++ uses two digits. That’s easy enough to ignore.

The next difference is that VC++ and g++ disagree about the final (ninth) digit. Looking at the full precision value we can see that the next (tenth) digit was a 5 – halfway between – and VC++ rounded up and g++ rounded down. Unfortunately it appears that there is no standard to mandate behavior in this situation. g++ appears to use the round-to-nearest-even rule for ties, and VC++ rounds away from zero for ties. I think that round-to-nearest-even is more in keeping with the spirit of the IEEE float standard, but that’s just my personal preference, so I can’t declare either one of them to be right or wrong.
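
The two tie-breaking policies can be illustrated on the digits of the example above. This is only a sketch of the rounding rules applied to a decimal digit string; it is not how either runtime implements printf, and `TiePolicy` and `round_digits` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class TiePolicy { ToEven, AwayFromZero };

// Round a string of decimal digits (most significant first) to `keep` digits.
// A carry out of the top digit makes the result one digit longer ("995" -> "100").
inline std::string round_digits(const std::string& digits, std::size_t keep,
                                TiePolicy policy) {
    assert(keep > 0);
    if (digits.size() <= keep) return digits;

    std::string result = digits.substr(0, keep);
    const char first_dropped = digits[keep];
    // Is there anything nonzero after the first dropped digit?
    const bool tail_nonzero =
        digits.find_first_not_of('0', keep + 1) != std::string::npos;

    bool round_up;
    if (first_dropped == '5' && !tail_nonzero) {
        // Exact tie: this is the only case where the policies differ.
        const bool last_is_odd = (result.back() - '0') % 2 == 1;
        round_up = (policy == TiePolicy::AwayFromZero) || last_is_odd;
    } else {
        round_up = first_dropped >= '5';
    }

    if (round_up) {
        std::size_t i = keep;
        while (i > 0 && result[i - 1] == '9') result[--i] = '0';  // propagate carry
        if (i > 0) ++result[i - 1];
        else result.insert(result.begin(), '1');
    }
    return result;
}
```

With the full-precision digits "6103515625" and nine digits kept, `TiePolicy::ToEven` gives "610351562" (the g++ result) and `TiePolicy::AwayFromZero` gives "610351563" (the VC++ result).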

ASCII non-equivalence: double rounding

Here’s another example:

  • +4.30373587e-015: VC++
  • +4.30373586e-15: g++
  • +4.30373586499999995214071901727947988547384738922119140625e-015: full precision

This one exhibits a different issue. By looking at the full precision representation of the float we can see that g++ is definitively correct in its decision to round down. The tenth digit is a four and you don’t have to look any farther to realize that rounding down is correct – VC++ is just plain wrong.

This appears to be a case of double rounding. VC++ prints a maximum of 17 digits of mantissa – if you ask for more you always get zeroes – and it looks like VC++ handles variable precision printing of floats by always printing to 17 digits and then rounding that result (or appending zeroes). The initial rounding to 17 digits rounds up to 4.3037358650000000, and when that is rounded to nine digits it is rounded up again. In a correctly rounded world the result should never be off by more than 0.5 in the last place, and VC++ fails this test, by a tiny margin.
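
The suspected double rounding can be reproduced on the full-precision digits of this float. The sketch below assumes the behaviour guessed at above (round to 17 digits, then round again, with ties going away from zero); `round_half_away` is a hypothetical helper, not VC++ library code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Round a decimal digit string to `keep` digits, rounding up whenever the first
// dropped digit is 5 or more (so exact ties go away from zero).
inline std::string round_half_away(std::string digits, std::size_t keep) {
    assert(keep > 0);
    if (digits.size() <= keep) return digits;
    const bool round_up = digits[keep] >= '5';
    digits.resize(keep);
    if (round_up) {
        std::size_t i = keep;
        while (i > 0 && digits[i - 1] == '9') digits[--i] = '0';  // propagate carry
        if (i > 0) ++digits[i - 1];
        else digits.insert(digits.begin(), '1');
    }
    return digits;
}

// Full-precision mantissa digits of the float discussed above (decimal point dropped).
inline const std::string& example_digits() {
    static const std::string digits =
        "430373586499999995214071901727947988547384738922119140625";
    return digits;
}

// One rounding step straight to nine digits.
inline std::string rounded_once() {
    return round_half_away(example_digits(), 9);
}

// Two steps: first to 17 digits, then to nine.
inline std::string rounded_twice() {
    return round_half_away(round_half_away(example_digits(), 17), 9);
}
```

Rounding once gives "430373586", matching g++. Rounding twice passes through "43037358650000000", which manufactures a tie that was not in the original value, and ends at "430373587", matching VC++.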

I did some analysis of the discrepancies and I found the following:

  • In 6,694,304 cases the actual result is exactly halfway between what VC++ and g++ print and the difference is just a printing policy difference
  • In the remaining four cases VC++ prints the wrong value due to double rounding

While checking the results I found that in all but three cases the discrepancy was that g++ had a two as the last digit and VC++ had a three. That makes sense because .25 and .75 would be common binary float endings that could lead to ambiguous rounding, and with .75 both compilers would agree to round up to .8. Most of the two versus three discrepancies were just policy disagreements, but one was a case of double rounding.

My analysis also showed that across all ~two billion positive floats VC++ only does double rounding four times – the other 6,694,304 discrepancies were just a policy difference about what rounding rules should be used. That means that in the vast majority of cases VC++ and g++ print results that are no more than 0.5 ULPs (nine-digit decimal) away from the actual float value, and in the remaining four cases VC++ is just about 0.50000001 ULPs away.

Here are the four positive floats that VC++ double rounds, printed to full precision:

  • +1.4108220249999999656938242712330543908883141092963746233057145416434942364336535547408857382833957672119140625e-037
  • +3.75243281499999999736793464889467798646013154390921155723059854381062905304133892059326171875e-031
  • +4.30373586499999995214071901727947988547384738922119140625e-015
  • +9.40798071499999999378616166723077185451984405517578125e-014

Aside: the reason VC++ prints to 17 digits is that %e has to be able to print double values, and these require a 17 digit mantissa in order to round-trip reliably. The printf code always receives a double and I guess the library writers decided that it was easiest to print to 17 digits and then adjust from there.

Binary equivalence

Luckily for software developers it is quite likely that none of this matters. The discrepancy between the g++ and VC++ results is always less than one part in 100,000,000, and the difference between the printed result and the actual float value is barely 0.5 parts in 100,000,000. Since the maximum precision of a float is one part in ~16,777,216, this means that the difference between the g++ and VC++ results is less than one sixth of the difference between adjacent floats. Therefore, as long as the conversions from text to binary (scanf) do not contain egregious errors, we should always be able to retrieve the original binary value.

In other words, the maximum printf error is normally 0.5 (nine-digit decimal) ULPs, and occasionally 0.50000001 ULPs, but either way still much less than the distance between adjacent floats (minimum 6.019 ULPs around 1e-28), so even the differently rounded results still uniquely identify the correct float.
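
The 6.019 figure can be checked with a little arithmetic. The sketch below measures the gap to the next float in units of the ninth printed digit. The function names are hypothetical, and the choice of test point is an assumption: 2^-93 is about 1.0097e-28, a power of two sitting just above a power of ten, which is where float spacing is tightest relative to the decimal digits.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Gap between f and the next larger float, measured in units of the ninth
// significant decimal digit of f (the last digit that %1.8e prints).
inline double gap_in_decimal_ulps(float f) {
    assert(f > 0.0f && f < std::numeric_limits<float>::max());
    const float next = std::nextafter(f, std::numeric_limits<float>::infinity());
    const double gap = static_cast<double>(next) - static_cast<double>(f);
    // Decimal exponent of f, then step down eight digits to reach the ninth.
    const double exponent = std::floor(std::log10(static_cast<double>(f)));
    const double decimal_ulp = std::pow(10.0, exponent - 8.0);
    return gap / decimal_ulp;
}

// The largest float below 2^-93 (assumed location of the minimum, see above).
inline float just_below_two_to_minus_93() {
    return std::nextafter(std::ldexp(1.0f, -93), 0.0f);
}
```

For that float the gap is 2^-117 and the ninth digit is worth 1e-36, so `gap_in_decimal_ulps(just_below_two_to_minus_93())` comes out a little above 6.018, in line with the figure quoted above.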

To verify this I did a scanf of each platform’s output on the other platform, for all ~2 billion positive floats, and in all cases I got back my original value. If you scanf them back into a double then you will get different results – because the ninth digit is then more significant – but scanning back to a float works.
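
The cross-platform check can be spot-tested on the strings quoted earlier in this article. This sketch uses strtof rather than scanf and assumes the C library parses decimal text with correct rounding; `parse_float_bits`, `same_float` and `same_double` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Parse text as a float and return its raw bit pattern.
inline std::uint32_t parse_float_bits(const char* text) {
    char* end = nullptr;
    const float value = std::strtof(text, &end);
    assert(end != text);  // something must have been consumed
    std::uint32_t bits;
    static_assert(sizeof bits == sizeof value, "expects a 32-bit float");
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Do two strings identify the same float?
inline bool same_float(const char* a, const char* b) {
    return parse_float_bits(a) == parse_float_bits(b);
}

// Do two strings identify the same double?
inline bool same_double(const char* a, const char* b) {
    return std::strtod(a, nullptr) == std::strtod(b, nullptr);
}
```

`same_float("+6.10351563e-005", "+6.10351562e-05")` is true because both strings are far closer to the original float than to any of its neighbours, while `same_double` on the same pair is false because a double can tell the ninth digits apart.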

Floats and debuggers

As I mentioned in the original precision post, VS 2010’s watch window prints floats with eight digits of mantissa, leading to ambiguity when debugging. I filed a bug on that and VS 2012’s watch window prints floats with nine digits. gdb (on x86 Linux) prints floats with nine digits.


Knowing that you can count on preserving the value of a float when printed with %1.8e is important. Many game developers serialize floating-point data to text files, and they often do it incorrectly. One mistake is to use fewer than nine digits of mantissa, which means that they will occasionally lose information. Another mistake is to not trust %1.8e and print with “%08x”, *(int*)&f instead of %1.8e, thus losing the readability of a text format. A few game developers even print floats both ways, which has all the fashion advantages of wearing both a belt and suspenders. I hope I have proven to everyone’s satisfaction that using %1.8e can be trusted.
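
The first mistake is easy to demonstrate. The sketch below walks upward from a starting float and looks for the first value that does not survive a print-and-parse round trip at a given number of digits. The helper names are hypothetical, and the sketch assumes correctly rounded printf and strtof.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <cstdlib>

// True if printing f with `digits` significant digits and parsing the text
// back yields f again.
inline bool survives_round_trip(float f, int digits) {
    assert(digits >= 1 && digits <= 17);
    char buf[40];
    const int written = std::snprintf(buf, sizeof buf, "%.*e", digits - 1, f);
    assert(written > 0 && written < static_cast<int>(sizeof buf));
    return std::strtof(buf, nullptr) == f;
}

// Walk upward from `start` through at most `limit` consecutive floats and return
// the first one that fails the round trip, or NaN if none of them fails.
inline float first_failure(float start, int digits, long limit) {
    float f = start;
    for (long i = 0; i < limit; ++i) {
        if (!survives_round_trip(f, digits)) return f;
        f = std::nextafter(f, INFINITY);
    }
    return NAN;
}
```

Starting at 10.0f is a good place to look: floats between 10 and 16 are spaced about 9.54e-7 apart, which is finer than the 1e-6 steps that eight significant digits provide, so some neighbouring floats must print identically. `first_failure(10.0f, 8, 1000)` finds such a float, while the same search with nine digits returns NaN.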

Other resources

For more information and different perspectives on this topic I recommend this article which points out that printf (“%.1f\n”,0.25); is enough to show gcc/VC++ differences. I find that the whole site is quite interesting.

Readers interested in deeper details might want to read How to Print Floating-Point Numbers Accurately. It’s worth mentioning that this article does not say that accurately printing floating-point numbers is hard – it’s actually quite easy and simple – but doing it both accurately and efficiently is quite subtle and tricky.

C++ programmers should take a look at Incorrect Round-Trip Conversions in Visual C++. Apparently iostreams in VC++ has a bug where it will not correctly convert some 17-digit strings to doubles. That’s pretty darned serious.

About brucedawson

I'm a programmer, working for Valve (http://www.valvesoftware.com/), focusing on optimization and reliability. Nothing's more fun than making code run 5x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
This entry was posted in Floating Point.

13 Responses to Float Precision Revisited: Nine Digit Float Portability

  1. Stephan says:

    Thanks for the interesting article!

    If you don’t want to rely on the buggy or platform-specific floating point formatting of standard library implementations, you could use a third-party library like http://code.google.com/p/double-conversion/ (which is also used in Google V8, WebKit and Facebook Folly), though that only deals with doubles.

    Writing and testing (at least somewhat exhaustively) efficient string floating-point conversion code is hard, or at least, quite tedious. A few years ago I wrote C# code for conversions into the relatively simple hexadecimal format defined in IEEE754r (https://bitbucket.org/fparsec/main/src/tip/FParsecCS/HexFloat.cs) and I found the experience quite unpleasant.

  2. Paul Miller says:

    Great timing! I’ve been updating the XML I/O precision in Silhouette. Guess I need to start reading this more often!

  3. Kdansky says:

    And if floating points weren’t enough of a problem already, never forget that users can (and will) use different notations depending on where they live, and you need to learn about locale.h::_locale_tstruct and similar when reading user input.

    Example: Storing a float-parameter (let’s say, the user specifies mouse-sensitivity) in the registry (which doesn’t support floats). In Germany: 3,2 [string] -> 3.2 [float] -> “3.2” [string] -> store in registry -> read from registry -> aaaaand everything backwards.

    • brucedawson says:

      Yes, locale bugs are a real concern. We hit one recently at work where our config files were being written with the US locale and then read back using the German locale. You always need to be sure to use the same locale to read and write floats or arbitrary badness can ensue. For serializing data you should use a fixed locale. For one project we ended up repurposing my full-precision printing code, not because we needed full precision but because it was the easiest way to have formatting control!

      • Evan M says:

        I was about to comment about the same problem. The worst part is the locale is global, so you can’t serialize floats in the same process as one that interacts with code that relies on the locale (for example, other Linux libraries that use the locale to choose user-facing strings.) I tried to fix this in WebKit by using its own float-formatting routines but was thwarted in getting all the corner cases to match the old behavior (the sorts of issues mentioned in your post).

        Now that I look again, there appear to be newlocale() / uselocale() functions that set it on a per-thread basis. That could be useful.

  4. Pingback: Comparing Floating Point Numbers, 2012 Edition | Random ASCII

  5. Fabian says:

    As long as readability isn’t an issue: How about using “%a”? It’s exact and more compact than decimal notation.

    • brucedawson says:

      %a is a good option. It’s not as compact as printing the underlying representation in hex, but it’s still quite compact and somewhat readable.

      In some ways %a is actually the most readable format. If you print 1.1f with %a you get this:

      x = 0x1.19999ap+0

      This lets you see the repeating binary pattern, and the rounding up of the last digit, and with a bit of thought it is quite possible to read off approximate values from the text.

  6. Pingback: On floating point determinism | Yosoygames

  7. Christoph says:

    Last time I looked the PDF format did not allow scientific notation. It only allowed ddddd.ddddd real numbers. The question is: Write C++ code that expresses each float/double variable in that format as short as possible such that when read back in the value matches the original one?

    My not so elegant way for Qt was http://qt.gitorious.org/qt/qt/merge_requests/415

    • brucedawson says:

      I didn’t look at your solution but your question seems well phrased, and often that is the most important thing.

      It should always be possible to print a float/double with %1.8e/%1.16e and then just shift the result (multiply/divide the mantissa by ten) in order to get rid of the exponent. Sometimes the result will be quite long, but the value should be unchanged by the shifting.

  8. Pingback: Floating-Point Determinism | Random ASCII

  9. Pingback: There’s Only Four Billion Floats–So Test Them All! | Random ASCII
