Don’t Store That in a Float

I promised in my last post to show an example of the importance of knowing how much precision a float has at a particular value. Here goes.

As a general rule this type of data should never be stored in a float:

Elapsed game time should never be stored in a float. Use a double instead. I’ll explain why below.

As an extra bonus, because switching to double is not always the best solution, this post demonstrates the dangers of unstable algorithms, and how to use the guarantees of floating-point math to improve them.

How long has this been going on?

A lot of games have some sort of GetTime() function that returns how long the game has been running. Often these return a floating-point number because it allows for convenient use of seconds as the units, while allowing sub-second precision.

GetTime() is typically implemented with some sort of high frequency timer such as QueryPerformanceCounter. This allows time resolution of a microsecond or better. However it’s worth looking at what happens to this resolution if the time is returned as a float, or stored in a float. We can do that using one of the TestFloatPrecision functions from the last post – just call them from the watch window of the debugger. In the screen shot below I tested the precision available at one minute, one hour, one day, and one week:

[Screenshot: debugger watch window showing the float precision available at one minute, one hour, one day, and one week]

It’s important to understand what these data mean. The number ‘60’, like all integers up to 16777216, can be exactly represented in a float. The watch window shows that the next value after 60 that can be represented by a float is about 60.0000038. Therefore, if we use a float to store “60 seconds” then the next time that we can represent is 3.8 microseconds past 60 seconds. If we try to store a value in-between then it will be rounded up or down.
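
For reference, here is a float analogue of the double-precision test function shown later in this post (a sketch of the idea; the exact TestFloatPrecision functions are in the previous post and may differ in detail):

#include <cassert>
#include <cstdint>

union Float_t
{
    Float_t(float val) : f(val) {}
    int32_t RawExponent() const { return (i >> 23) & 0xFF; }

    int32_t i;
    float f;
};

float TestFloatPrecisionAwayFromZero(float input)
{
    Float_t num(input);
    // Incrementing infinity or a NaN would be bad!
    assert(num.RawExponent() < 255);
    // Increment the integer representation of our value
    num.i += 1;
    // Subtract the initial value to find our precision
    float delta = num.f - input;
    return delta;
}

Calling this with 60.0f, 3600.0f, 86400.0f, and 604800.0f gives the 3.8 microsecond, 0.24 millisecond, 7.8 millisecond, and 62.5 millisecond figures discussed below.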

How long did it take?

One of the most common things to do with time values is to subtract them. For instance, we might have code like this:

double GetTime();

float TimeSomethingBadly()
{
    float fStart = GetTime();
    DoSomething();
    float elapsed = (float)GetTime() - fStart;
    return elapsed;
}

The implication of the precision calculations above is that if ‘fStart’ is around 60, then ‘elapsed’ will be a multiple of 3.8 microseconds (two to the negative eighteenth seconds). That is the most precision you can get. If less than 3.8 microseconds has elapsed then ‘elapsed’ will either be rounded down to zero, or rounded up to 3.8 microseconds.

Therefore, if our game timer starts at zero and we store time in a float then after a minute the best precision we can get from our timer is 3.8 microseconds. After our game has been running for an hour our best precision drops to 0.24 milliseconds. After our game has been running for a day our precision drops to 7.8 milliseconds, and after a week our precision drops to 62.5 milliseconds.

This is why storing time in a float is dangerous. If you use float-time to try calculating your frame rate after running for a day then the only answers above 30 fps that are possible are infinity, 128, 64, 42.6, or 32 (since the possible frame lengths are 0, 7.8, 15.6, 23.4, or 31.2 milliseconds). And it only gets worse if you run longer.
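
To make that concrete, here is a tiny sketch (mine, not from the original post) that prints the only frame lengths below ~33 ms that remain representable once the precision step has grown to 7.8 ms, and the frame rates they imply:

#include <cstdio>

int main()
{
    // After a day of float game time the precision step is 2^-7 seconds.
    const double step = 0.0078125;
    for (int n = 1; n <= 4; ++n)
    {
        double frameSeconds = n * step;
        printf("frame length %.4f s -> %.1f fps\n", frameSeconds, 1.0 / frameSeconds);
    }
    return 0;
}

This prints 128.0, 64.0, 42.7, and 32.0 fps, matching the list above.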

As another example consider this code:

double GetTime();

void ThinkBadly()
{
    float startTime = (float)GetTime();
    // Do AI stuff here
    float elapsedTime = (float)GetTime() - startTime;
    assert(elapsedTime < 0.005); // Warn if the AI takes too long
}

The purpose of this code is to warn the developers whenever the AI code takes inordinately long. However, when the game has been running for a day (actually the problem reaches this level after 65,536 seconds) every time value that passes through a float will be a multiple of 0.0078 s, so ‘elapsedTime’ will always be a multiple of that duration. In most cases ‘elapsedTime’ will be equal to zero, but every now and then, no matter how fast the AI code executes, the time will tick over to the next representable value during the AI calculations and ‘elapsedTime’ will be 0.0078 s instead of zero. The assert will then trigger even though the AI code is actually still under budget.

It’s a catastrophe for base-ten also

The general term for what is happening with these time calculations is catastrophic cancellation. In all of these examples above there are two time values that are accurate to about seven digits. However they are so close to each other that when they are subtracted the result has, in the worst case, zero significant digits.

We can see the same thing happening with decimal numbers. A float has roughly seven decimal digits of precision so the decimal equivalent would be getting a time value of 60.00000 and having the next possible time value be 60.00001. Given a seven-digit decimal float we can’t get more than a tenth of a microsecond precision when dealing with time around 60 seconds. When we subtract 60.00000 from 60.00001 then six of the seven digits cancel out and we end up with just one accurate digit. For times less than a tenth of a microsecond we have a complete catastrophe – all seven digits cancel out and we get zero digits of precision, just like with a binary float.

Double down

The solution to all of this is simple. GetTime() must return a double, and its result must always be stored in a double. The cancellation still occurs, but it is no longer catastrophic. A double has enough bits in the mantissa that even if your game runs for several millennia your double-precision timers will still have sub-millisecond precision. You can verify this by using the double-precision variation of TestFloatPrecisionAwayFromZero():

#include <cassert>
#include <cstdint>

union Double_t
{
    Double_t(double val) : f(val) {}
    // Portable extraction of components.
    bool Negative() const { return (i >> 63) != 0; }
    int64_t RawMantissa() const { return i & ((1LL << 52) - 1); }
    int64_t RawExponent() const { return (i >> 52) & 0x7FF; }

    int64_t i;
    double f;
#ifdef _DEBUG
    struct
    {   // Bitfields for exploration. Do not use in production code.
        uint64_t mantissa : 52;
        uint64_t exponent : 11;
        uint64_t sign : 1;
    } parts;
#endif
};

double TestDoublePrecisionAwayFromZero(double input)
{
    Double_t num(input);
    // Incrementing infinity or a NaN would be bad!
    assert(num.RawExponent() < 2047);
    // Increment the integer representation of our value
    num.i += 1;
    // Subtract the initial value to find our precision
    double delta = num.f - input;
    return delta;
}

You can see in the screenshot below that if you store time in doubles then after your game has been running for a week you will have sub-nanosecond precision, and after three millennia you will still have sub-millisecond precision.

[Screenshot: debugger watch window showing the double precision available at one week and at three millennia]
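
If you don’t have a debugger handy, the same numbers can be printed with a few lines of code (a sketch that reuses the function above; the session lengths are just examples):

#include <cstdio>

int main()
{
    const double week = 60.0 * 60.0 * 24.0 * 7.0;                       // one week, in seconds
    const double threeMillennia = 60.0 * 60.0 * 24.0 * 365.25 * 3000.0; // ~three millennia, in seconds
    printf("Precision after one week:        %g seconds\n",
           TestDoublePrecisionAwayFromZero(week));
    printf("Precision after three millennia: %g seconds\n",
           TestDoublePrecisionAwayFromZero(threeMillennia));
    return 0;
}

On an IEEE double this prints roughly 1.2e-10 and 1.5e-05 seconds respectively, which is where the sub-nanosecond and sub-millisecond claims come from.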

Clearly a double is overkill for storing time, but since a float is underkill a double is the right choice.

Aside: my initial calculation of the precision remaining after three millennia was wrong because the calculation of the number of seconds was done with integer math, and it overflowed and gave a completely worthless answer. Which proves that integer math can be just as tricky as floating-point math.

Changing your units doesn’t help

All along I have been assuming that you are storing your time in seconds. However, your choice of units doesn’t significantly affect the results. If you decide that your time units are milliseconds, or days, then the precision available after your game has been running for a day will be about the same. It is the ratio between the elapsed time and the time being measured that matters. I like seconds because they are intuitive and human friendly, and that does matter.
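
A quick way to check this (a sketch using std::nextafter, not code from the post) is to measure the float precision one day into the game in three different units:

#include <cmath>
#include <cstdio>

int main()
{
    const float day_s  = 86400.0f;     // one day in seconds
    const float day_ms = 86400000.0f;  // one day in milliseconds
    const float day_d  = 1.0f;         // one day in days
    printf("seconds:      %g s\n",    std::nextafter(day_s,  2.0f * day_s)  - day_s);
    printf("milliseconds: %g ms\n",   std::nextafter(day_ms, 2.0f * day_ms) - day_ms);
    printf("days:         %g days\n", std::nextafter(day_d,  2.0f * day_d)  - day_d);
    return 0;
}

Converted to a common unit the three answers are about 7.8 ms, 8 ms, and 10 ms: different in detail, but the same order of magnitude.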

Or use integers

Tom Forsyth points out that the same issues happen with world coordinates and that switching to integer types can give you greater worst-case precision, as well as consistent precision. The Windows GetTickCount() and GetTickCount64() functions use this technique, using milliseconds as the units. This alternative to using a double for time is quite reasonable, especially if you encapsulate it well. A uint32_t with milliseconds as units will overflow every 50 days or so but you can avoid that by using a uint64_t. However despite Tom’s threats to invoke his OffendOMatic rule for all who use doubles, I still prefer doubles for game time because of the combination of convenient units (seconds), more than sufficient precision, and easy calculations.
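
A minimal sketch of what “encapsulate it well” might mean (illustrative, not anyone’s actual engine code):

#include <cstdint>

class GameTime
{
public:
    static GameTime FromMilliseconds(uint64_t ms) { return GameTime(ms); }
    uint64_t Milliseconds() const { return ms_; }
    double Seconds() const { return ms_ * 0.001; } // convert only at the edges
    GameTime operator-(GameTime rhs) const { return GameTime(ms_ - rhs.ms_); }

private:
    explicit GameTime(uint64_t ms) : ms_(ms) {}
    uint64_t ms_;
};

GetTickCount64(), or any other millisecond counter, can feed FromMilliseconds(), and the conversion to double seconds happens only where floating point is actually needed.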

While Tom and I appear to disagree over whether you should use double in situations like this, we agree that ‘float’ won’t work.

Recently John Carmack said “Time should be a double of seconds” – that’s a good vote of confidence to have.

Note that while GetTickCount() and GetTickCount64() have millisecond precision, they are often less accurate than you would expect. Unless you have changed the Windows timer frequency with timeBeginPeriod() the GetTickCount functions will only return a new value every 10-20 milliseconds (insert pithy comment about precision versus accuracy here).

Four billion dollar question

Even if you use doubles for time, the precision available will still change as game time marches on from zero to the length of your game. These precision changes – while smaller with doubles than with floats – can still be dangerous. Luckily there is a convenient way to get the consistent precision of an integer, with the convenient units of a double.

If you start your game clock at about 4 billion (more precisely 2^32, or any large power of two) then your exponent, and hence your precision, will remain constant for the next ~4 billion seconds, or ~136 years.

And, when using doubles, this precision is approximately one microsecond.

So there you have it. The one true answer. Store elapsed game time in a double, starting at 2^32 seconds. You will get constant precision of better than a microsecond for over a century, and if you accidentally store time in a float you will see precision errors immediately instead of after hours of gameplay. You read it here first.
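
In code the whole recommendation boils down to one constant. A sketch (GetRawTime() is a stand-in for whatever high-resolution timer your engine already has):

// Bias the game clock by 2^32 seconds so that the double's exponent, and
// therefore its precision, stays constant for roughly the next 136 years.
double GetRawTime(); // seconds since the process started, as a double

const double kGameTimeEpoch = 4294967296.0; // 2^32

double GetGameTime()
{
    return kGameTimeEpoch + GetRawTime();
}

Differences between two such times are unaffected by the offset, and every stored time has the same precision of just under a microsecond.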

Time deltas fit in a float

It is important to understand that the limited precision of a float is only a problem if you do an unstable calculation, such as a subtraction where catastrophic cancellation wipes out most of the digits. The code below, on the other hand, is fine:

double GetTime();

float TimeSomethingWell()
{
    double dStart = GetTime(); // Store time in a double
    DoSomething();
    float elapsed = GetTime() - dStart; // Store *result* in a float
    return elapsed;
}

In TimeSomethingWell() we store the result of the subtraction in a float – after the catastrophic cancellation. Therefore our elapsed time value will have tons of precision.

Similarly, if you are using floats in your animation system to represent short times, such as the location of key-frames in a 60 second animation, then floats are fine. However when you add these to the current time you need to store the result of the addition in a double.
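
For example (illustrative names, not from the post), a key-frame offset can safely stay a float as long as the absolute time it produces is computed and stored in a double:

double GetTime();

double KeyFrameTriggerTime(double animationStartTime, float keyFrameOffset)
{
    // double plus float promotes to double; keep the result in a double.
    return animationStartTime + keyFrameOffset;
}

The float offset has plenty of precision because it is small; it is only the absolute time that needs the extra mantissa bits.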

Tables!

Forrest Smith made a pretty table showing how the precision of a float changes as the magnitude increases, and I mangled it to suit my needs. Here it is for time:

Float Value      Time Value      Float Precision   Time Precision
1                1 second        1.19E-07          119 nanoseconds
10               10 seconds      9.54E-07          0.954 microseconds
100              ~1.5 minutes    7.63E-06          7.63 microseconds
1,000            ~16 minutes     6.10E-05          61.0 microseconds
10,000           ~3 hours        0.000977          0.976 milliseconds
100,000          ~1 day          0.00781           7.81 milliseconds
1,000,000        ~11 days        0.0625            62.5 milliseconds
10,000,000       ~4 months       1                 1 second
100,000,000      ~3 years        8                 8 seconds
1,000,000,000    ~32 years       64                64 seconds

And here is the table showing how the precision of a float diminishes when you use it to measure large distances, with meters being the units in this case:

Float Value      Length Value        Float Precision   Length Precision    Precision Size
1                1 meter             1.19E-07          119 nanometers      virus
10               10 meters           9.54E-07          0.954 micrometers   e. coli bacteria
100              100 meters          7.63E-06          7.63 micrometers    red blood cell
1,000            1 kilometer         6.10E-05          61.0 micrometers    human hair width
10,000           10 kilometers       0.000977          0.976 millimeters   toenail thickness
100,000          100 kilometers      0.00781           7.81 millimeters    size of an ant
1,000,000        0.16x earth radius  0.0625            62.5 millimeters    credit card width
10,000,000       1.6x earth radius   1                 1 meter             uh… a meter
100,000,000      0.14x sun radius    8                 8 meters            4 Chewbaccas
1,000,000,000    1.4x sun radius     64                64 meters           half a football field

Stable algorithms also matter

Some time ago I investigated some asserts in a particle animation system. Values were going out of range after less than an hour of gameplay and I traced this back to an out-of-range ‘t’ value being passed to the Lerp function, which expected it to always be from 0.0 to 1.0. Clamping was one obvious solution but I first investigated why ’t’ was going out of range.

One problem with the code was that the three parameters were all floats, so over long periods of time it would inevitably have insufficient precision. However we were getting instability much earlier than expected and it felt like switching to double immediately might just mask an underlying problem.

The parameters to the function, all time values in seconds, corresponded to the end of an animation segment, the length of that segment, and the current time, which was always between the start of the segment (segmentEnd-segmentLength) and ‘segmentEnd’. Because the start time of the segment was not passed in, this code calculated it and then did a straightforward calculation to get ‘t’:

float CalcTBad(float segmentEnd, float segmentLength, float time)
{
    float segmentStart = segmentEnd - segmentLength;
    float t = (time - segmentStart) / segmentLength;
    return t;
}

Straightforward, but unstable. Because ‘segmentLength’ is presumed to be quite small compared to ‘segmentEnd’, there is some rounding during the first subtraction and the difference between ‘segmentStart’ and ‘segmentEnd’ will be a bit larger or smaller than ‘segmentLength’. The resulting difference will always be a multiple of the current precision, so it will degrade over time, but even very early in the game the result will not be perfect. Because the value for ‘segmentStart’ is slightly wrong the value of “time – segmentStart” will be slightly wrong, and occasionally ‘t’ will be outside of the 0.0 to 1.0 range.

This will happen even if you use doubles. The errors will be smaller, but ‘t’ can still go slightly outside the 0.0 to 1.0 range. As the game goes on ‘t’ will range farther outside of the correct range, but from just a few minutes into the game the results will show signs of instability.

The natural tendency is to say “floating-point math is flaky, clamp the results and move on”, but we can do better, as shown here:

float CalcTGood(float segmentEnd, float segmentLength, float time)
{
    float howLongAgo = segmentEnd - time;
    float t = (segmentLength - howLongAgo) / segmentLength;
    return t;
}

Mathematically this calculation is identical to CalcTBad, but from a stability point of view it is greatly improved.

If we assume that ‘time’ and ‘segmentEnd’ are large compared to ‘segmentLength’, then we can reasonably assume that ‘segmentEnd’ is less than twice as large as ‘time’. And it turns out that if two floats are that close then their difference will fit exactly into a float. Always (this is sometimes known as Sterbenz’s lemma). So the calculation of ‘howLongAgo’ is exact. Ponder that for a moment – given a few reasonable assumptions we have exact results for one of our floating-point math operations.

With ‘howLongAgo’ being exact, if ‘time’ is within its prescribed range then ‘howLongAgo’ will be between zero and ‘segmentLength’, and so will ‘segmentLength’ minus ‘howLongAgo’. IEEE floating-point math guarantees correct rounding so when we divide by ‘segmentLength’ we are guaranteed that ‘t’ will be from 0.0 to 1.0. No clamping needed, even with floats.
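
Here is a quick check of that claim (my test harness, not from the original post), using the two functions above and values that could plausibly occur an hour into a game:

#include <cstdio>

int main()
{
    const float segmentEnd = 3600.0f;    // one hour into the game
    const float segmentLength = 0.016f;  // a 16 ms animation segment
    const float time = segmentEnd;       // sample right at the end of the segment
    printf("CalcTBad:  %.9f\n", CalcTBad(segmentEnd, segmentLength, time));
    printf("CalcTGood: %.9f\n", CalcTGood(segmentEnd, segmentLength, time));
    return 0;
}

With IEEE single precision CalcTBad returns about 1.007, enough to trip an assert in a Lerp that insists on 0.0 to 1.0, while CalcTGood returns exactly 1.0.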

This real example demonstrates a few things:

  • Any time you add or subtract floats of widely varying magnitudes you need to watch for loss of precision
  • Sometimes using ‘double’ instead of ‘float’ is the correct solution, but often a more stable algorithm is more important
  • CalcT should probably use double (to give sufficient precision after many hours of gameplay)

Your compiler is trying to tell you something…

With Visual C++ on the default warning level you will get warning C4244 when you assign a double to a float:

warning C4244: 'initializing' : conversion from 'double' to 'float', possible loss of data

Possible loss of data is not necessarily a problem, but it can be. Suppressing warnings, with #pragma warning or with a cast, is something that should be done thoughtfully, after understanding the issue. Otherwise the compiler might say “I told you so” when your game fails after a twenty-four hour soak test.

Does it matter?

For some game types this problem may be irrelevant. Many games finish in less than an hour and a float that holds 3,600 (seconds) still has sub-millisecond accuracy, which is enough for most purposes. This means that for those game types you should be fine storing time in a float, as long as you reset the zero-point of GetTime() at the beginning of each game, and as long as the clock stops running when the game is paused.
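
Resetting the zero-point is a one-liner. A sketch (names hypothetical) of the idea:

double GetRawTime(); // seconds since the process started

static double s_gameStartTime = 0.0;

void OnGameStarted()
{
    s_gameStartTime = GetRawTime();
}

// Time since the current game started; small enough for a float to cope with.
double GetTime()
{
    return GetRawTime() - s_gameStartTime;
}

With the clock reset for each game (and stopped while paused) the values stay small enough that even a float retains sub-millisecond precision.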

For other game types – probably the majority of games – you need to do your time calculations using a double or uint64_t. I’ve seen problems on multiple games that failed to follow this rule. The problems are particularly tedious to track down and fix because they may take many hours to show up.

Store your time values in a double, starting at 2^32 seconds, and then you don’t need to worry, at least not as much, as long as you avoid unstable algorithms.

A lot of people have commented on this article and said that the justification for using double instead of 64-bit integers is not very strong. I agree that either one will work; however, I think that double has a couple of advantages. One is developer convenience. A floating point number like 1.73 is far easier to comprehend than 1730 (fixed-point with ms accuracy) and it has more precision. The more precision you give to a fixed-point integer the more unwieldy the numbers get, and there is a real cost to this.

The other reason is game industry specific. When a game does time calculations it typically uses the time values for physics, AI, and graphics, and these systems typically need floating-point numbers. So, it turns out that you cannot avoid floating-point time. Therefore, you might as well use floating point from the start, and do it right. Most games already use floating-point numbers for time – I just want to encourage them to not use ‘float’.

It’s also interesting to note that Apple uses double for time – NSTimeInterval is a double. As they say: “NSTimeInterval is always specified in seconds; it yields sub-millisecond precision over a range of 10,000 years.”

Next time…

In the next post I think it might finally be time to start jumping into the delicate subject of how to compare floating-point numbers, with the many subtleties involved. Previous articles in this series, and other posts, can be found here.


27 Responses to Don’t Store That in a Float

  1. Shane says:

    Your writing is very clean and easy to understand. I don’t have any use for most of your topics covered (currently in C# land) but I still enjoy reading it. Thanks.

  2. John Smith says:

    You are off by an order of magnitude on the diameter of the sun. It’s about 100x as wide as Earth, not 10x. So the last 2 entries in your distance table need to be updated.

  3. Dekker found some nice properties about summation and multiplication of floating point values, and how to make them accurate. Take a look at that (template versions of the Dekker algorithms published by Takeshi Ogita and S.M. Rump et al. in SIAM Journal on Scientific Computing 26 (2005), Nr. 6, S. 1995–1988):

    template <class T>
    void two_sum(T a, T b, T &x, T &y) {
        x = a + b;
        T z = x - a;
        y = (a - (x - z)) + (b - z);
    }

    template <class T>
    void split(T &x, T &y, T const &a) {
        T c = T((1UL << ((float_traits<T>::mantissa_bits >> 1)
                         + float_traits<T>::mantissa_bits % 2)) + 1) * a;
        x = c - (c - a);
        y = a - x;
    }

    template <class T>
    void two_product(T a, T b, T &x, T &y) {
        x = a * b;
        T a1, a2;
        split(a1, a2, a);
        T b1, b2;
        split(b1, b2, b);
        y = a2 * b2 - (((x - a1 * b1) - a2 * b1) - a1 * b2);
    }

    I hope it gets reasonably formatted. The first one makes x = float( a + b ) and x + y = a + b as if float had infinite precision. So y contains the error of the limited floating point operation.
    Something similar can be stated for two_product: x = float(a*b) and x+y = a*b.
    All this only works if the compiler does not optimize away the floating point operations.

    The nice thing about this is that one can easily create summations and multiplications with higher accuracy, by keeping the error term (y) and reusing it in following operations. I once implemented a matrix expression template library that executes 100% accurate scalar products, using the following algorithm (invented by S.M. Rump, Takeshi Ogita et al, “Accurate Floating Point Summation”, 2006):
    template <class T> // faster than two_sum; works only for a >= b
    void fast_two_sum(T a, T b, T &x, T &y) {
        assert(a >= b);
        x = a + b;
        T q = x - a;
        y = b - q;
    }

    template <class T>
    typename T::value_type accurate_sum(T &vec)
    {
        typedef typename T::value_type value_type;
        size_t n = num_elements(vec);
        if (n == 0) return 0;

        value_type mu = std::abs(vec(0));
        for (size_t i = 1; i != n; ++i)
            mu = std::max(std::abs(vec(i)), mu);

        value_type Ms = next_power_two(value_type(n + 2));
        value_type sigma = Ms * next_power_two(mu);
        value_type phi = std::numeric_limits<value_type>::epsilon() * Ms;
        value_type factor = value_type(2) * phi * Ms;
        if (!check_extraction_parameters(phi, sigma, factor))
            return simple_sum(vec);

        value_type t = 0;
        T q;
        while (true) {
            q = elementwise_sub(elementwise_add(sigma, vec), sigma);
            value_type tau = simple_sum(q);
            vec = vec - q;
            value_type tau1, tau2;
            fast_two_sum(t, tau, tau1, tau2);
            if (std::abs(tau1) >= factor * sigma
                || sigma <= std::numeric_limits<value_type>::denorm_min())
                return tau1 + (tau2 + simple_sum(vec));
            t = tau1;
            sigma = phi * sigma;
        }
        return 0;
    }

    The algorithm walks through the exponent until the operands no longer add relevant values. If the vector has all operands within a similar mantissa range the algorithm terminates sooner. The result is accurate for the given floating point type.

    • brucedawson says:

      Very cool.

      A similar property is that if you have a compiler that generates fmadd instructions (fused multiply add where rounding doesn’t occur until after the add) then this calculation:

      a * b + a * -b

      typically gets compiled as fmadd(a, b, a * -b). That is, the “a * -b” is done as a normal multiply, and the “a * b” is done as part of an fmadd. The net result is the error in a * b. That’s a cool property I think.
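
      You can also ask for that error term explicitly with std::fma (a quick sketch):

      #include <cmath>
      #include <cstdio>

      int main()
      {
          double a = 1.0 / 3.0, b = 10.0;
          double product = a * b;                  // rounded to the nearest double
          double error = std::fma(a, b, -product); // exact a*b minus the rounded product, with a single rounding
          printf("product = %.17g\nerror   = %.17g\n", product, error);
          return 0;
      }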

  4. My personal favorite example of this was a bug in the Patriot Missile System’s software: http://www.gao.gov/products/IMTEC-92-26

  5. Minor Nit-pick says:

    Great post. Most of us programmers are only vaguely aware of the problems with floating point variables. This post makes it clear why it is necessary to understand the problems.

    A minor nit-pick about language: “data” is plural, please use “these data” rather than “this data”

  6. Matthew Fioravante says:

    Somewhat of a tangent to the point of the article. I’m not a fan of using floating point (float or double) for time at all. One recent issue which put the nail in the coffin for me was trying to iterate over time intervals in a loop.

    double t = /* some time */;
    // Iterate over all of the 1 second intervals in the range [t, t + 30 seconds]
    for (auto i = t; i <= t + 30.0; i += 1.0) {
        // do something
    }

    This kind of loop may or may not work the way you expect; depending on how the floating point error accumulates during the addition, the last value of t + 30.0 may be skipped. We may be able to fix it with an epsilon but it's easy to forget to do that and it makes the code look ugly.

    If we use 64 bit integers we avoid this issue. We also get exact values, that is we don't have to worry about whether or not (t + (x/5) + (x/5) + (x/5) + (x/5) + (x/5)) != (t + x). Finally, by wrapping your 64 bit int inside of a time type, you can easily add seconds, milliseconds, etc. in a natural way that's just as expressive and correct as using a double normalized to seconds. One example of this is the chrono and ratio libraries in C++11.
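
    For instance, a rough sketch of the same loop with chrono's integer ticks:

    #include <chrono>

    void Example()
    {
        using namespace std::chrono;
        milliseconds t(90000); // some time
        // Iterate over all of the 1 second intervals in the range [t, t + 30 seconds].
        // Integer ticks are exact, so the final interval is never skipped.
        for (milliseconds i = t; i <= t + seconds(30); i += seconds(1))
        {
            // do something
        }
    }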

    • brucedawson says:

      Those concerns are totally legitimate. I think they can usually be avoided (or at least minimized) by clamping and not expecting to hit exactly the final time, but they are definitely a nuisance and a risk.

      Integers do avoid these issues, but still require some thought. If you want to be able to add milliseconds then you have to have that (or some fraction of that) as your base type. If you later decide you need microsecond precision you need to reduce your range and make adjustments or risk round-off errors. If somebody tries to add 1/7 s then you still get truncation error. Integers are definitely more predictable (as long as you never overflow) but do still require some thought.

  7. Great post explaining the floating-point numbers’ precision issue, with the Timer example!
    Thanks for your talent and hard work!

  8. Greg A. Woods says:

    Hmmm…. There is now a working proposal, colloquially called DEC64, for a number representation which gives you both the precision of integers, and the human friendly nature of decimal fractions, and which is inherently compatible (at the binary level) with the widely used and well understood E notation, and most importantly which also seems to be reasonably efficient to implement most operators and common math functions on modern commodity hardware as well (as opposed to the far more complex representation for decimal64 numbers given in IEEE 754-2008). Though it is a couple of years more recent than this post, and only very recently becoming more widely known, I’m surprised someone hasn’t mentioned it here yet.

    See http://dec64.com/

    The proposal is a little grandiose, like its author, and still far from perfect (no representation for infinity, too many permitted representations for zero, only fully specified in the reference implementation, etc., etc., etc.), but I think it has some merit as an alternative to using unfamiliar units (microseconds, milliseconds, etc.).

    One particular advantage little mentioned in any discussion I’ve seen of DEC64 is that its representation is also far more friendly to humans at the binary level, e.g. when viewed raw in a debugger, for example.

    I don’t want this to side-track the discussion of full floating point, but I think your arguments against using appropriate precision units, such as microseconds or milliseconds, as opposed to fractional seconds are specious. However perhaps an efficient decimal representation such as DEC64 can provide a middle ground where most of the foibles of floating point are entirely avoided.

    • brucedawson says:

      DEC64 is… interesting, but has zero hope of being “the only number type in the next generation of application programming languages”. Relevant to this article, since it has only a double-precision type then there will be no problems with developers using less than double precision when they should not, but the lack of float and half support does leave a gap.

      It’s interesting that you acknowledge the importance of the friendly binary and base-10 features of DEC64, but don’t seem to agree with me that there is value in using the most convenient units for time, whatever those might be.

  9. Greg A. Woods says:

    I think perhaps you may be confusing presentation with representation, though perhaps that started with me not being entirely clear about what I meant. Integer based sub-second quantities can, for example, trivially be presented to the user in whatever form and units that are most appropriate for the context, including as base-10 seconds with decimal fractions of a second (perhaps even with rounding where less accuracy is easier to understand). This is of course true for any representation within the limits of precision for that format given the range of values that must be handled, though as Crockford suggests the complexity of the conversion between representation and presentation formats is an important consideration in the choice of representation. (I do sometimes still get confused when reading code that uses both milliseconds and microseconds internally, especially if the comments or variable names use “ms” anywhere, but I think that’s just a personal problem where I’ve never properly learned their associated magnitudes well enough.)

    My point about the benefit of DEC64 is that it is so very easy to convert that it can be done manually/mentally and almost without thinking, at least for anyone who has experience with binary or hex notation. I.e. it is well suited to revealing a clear and easily obvious understanding of a value in its native binary representation while at the same time being very easily and efficiently translated to and from convenient and similarly clear and obvious presentation formats such as base-10 E notation.

    Since I almost never deal with IEEE754 internals I probably have less than a snowball’s chance in hell of ever manually understanding a float or double value in any of its binary representations, including decimal64 (except maybe some of the obvious constant and special values) but I think I would have a decent chance of understanding values for binary or hex representations of DEC64, perhaps with even more ease than I can understand good old 4-bit/digit BCD.

    It would be very tempting to use DEC64 as the sole internal number representation for an interpreter for any “little” language, such as an AWK implementation, or indeed for any other language that only supports one kind of “number” as a data type, such as ECMAScript. Or at least in any such implementation where something like BigNum, e.g. GNU MP or whatever isn’t more appropriate (possibly with hidden internal optimizations to suitable integer formats, and maybe also DEC64, where appropriate). In fact I’m very tempted to modify AWK itself to use DEC64 just to see how it fares.

    • brucedawson says:

      Confusion all round I think. I agree that when presenting numbers to the user it is often possible to adjust the units as desired. However, what I guess I didn’t make clear, is that this is often not the situation. Imagine that you are working on a video game and are debugging the particle system. You have a bunch of doubles holding various times and you are viewing them in the debugger. The presentation you get is the one which the debugger gives to you. If the units are inconvenient to reason with then it is inconvenient to deal with this. This (ease of understanding within the debugger) is roughly the same benefit which you point to with DEC64.

      I’m not familiar with the IEEE base-10 design so I can’t comment on how DEC64 is better or worse, but DEC64 would have to be substantially better to win out over a ratified standard, and that seems unlikely.

  10. Ben Morgan says:

    Thanks for your series on floats, I find it very elucidating.

    You have mentioned several times that you would use doubles to store duration as seconds.

    In Go, durations are stored at a nanosecond resolution in int64s. Then an enum (consts) is defined for hour, minute, second, millisecond, microsecond, and nanosecond. This lets you take a duration and represent it as a float: seconds := dur / time.Second; ms := dur / time.Millisecond; etc. Also when defining times, I can say: 1500 * time.Millisecond.

    I thought the way Go stores durations was pretty handy – in what use-cases would double still be a better choice? The only use-case I can think of is when you’re dealing with the raw numbers (e.g. in a debugger), when you need to represent durations larger than 290 years, or when nanosecond precision is not enough.

    • brucedawson says:

      The main use case where I find seconds-in-a-double better than ns-in-an-int64 is, indeed, when viewing the numbers in a debugger.

      But, to be clear, I’m not trying to convince people to eschew ns-in-an-int64 – that’s a totally reasonable and valid choice. Rather, I’m railing against the many game developers who store elapsed time in a float – that is an insane and indefensible choice (you know who you are).

  11. Greg A. Woods says:

    Sorry Bruce, but I’m still not sure I understand why you would prefer to use any kind of floating point number for sub-second precision elapsed time values, in the debugger or not, especially given one other main concern you’ve addressed here.

    (I do fully agree that trying to store elapsed time in a float, as opposed to a double, just to get sub-second precision is pure insanity, even if one could guarantee that the time span was always short enough to fit without loss of so much precision as lose discrete values for relevant event markers.)

    First off you are still talking about presentation, whether it’s in the debugger or not. In the debugger you must be relying on either its ability or your own to know the type of a value at a given location, and for it to correctly interpret and present a floating point value in a useful presentation when you examine a given memory location. When I was talking about looking at a floating point value in the debugger I was talking about looking at it in its raw form — i.e. as a binary value, perhaps even in hex or octal form. I’ve no doubt that there are many people better than I at manually interpreting binary floating-point representations (no matter which kind and size they are), but I _do_ doubt that any of those people would admit that it is as easy as interpreting a large integer value in a binary representation.

    As for the other main concern: you’ve wisely advocated basing floating point time values at some large number such as 2^32 instead of zero in order to obtain a more even distribution of precision. However from what I can tell from a few quick tests, doing that will further complicate any attempt at manual interpretation of a binary presentation of any internal floating point representation. However if you are relying on the debugger to give a human-friendly base-10 presentation of such a floating-point value, now you’re stuck with first having to subtract 2^32 from it before converting it since trying to do that mentally with the final base-10 number is more burden than I’d ever want to have to suffer! The need for good precision over wide ranges of values has just made null-and-void any conceivable benefit of using a floating-point number for sub-second resolution time values.

    Of course dividing a binary, hex, or octal number by some power of 10, i.e. when trying to interpret a timestamp given in micro or nanoseconds or such, is still a fair feat, it is I think more easily approximated for rough manual estimates by doing the same thing computer hardware might do and simply shift the value by the closest power-of-2. On the other hand if one is relying on the debugger to present the value in base-10 then it’s almost as easy as with base-10 presentation of floating point values since one can mentally put the decimal point in the correct position to get a human scale for the number.

    Also, if one’s timer source gives discrete integer values, then I really do question why one would ever consider converting that value to a floating point representation for internal use.

    Finally there’s the performance issue of dealing with floating point numbers where they are not strictly needed. I’m going to go a little bit out on a limb here and say that for almost all operations a double is still likely to cost more cycles than a wider-than-the-ALU integer (if indeed one is still targeting a 32-bit CPU).

    So I would still say storing seconds-in-a-double is still very wrong — it’s just not as insanely wrong, and sad, as knowingly using a float. 🙂

    • brucedawson says:

      You are correct that starting your time values at 2^32 makes interpreting the values more difficult. But, I still find that they are easier to interpret than an int64 of nanoseconds. Either way you’ve got a boatload of digits. The advantage with the double is that the interesting ones are easy to find – they’re all near the decimal point. When comparing two similar times it’s relatively easy to see how close they are – not so when the interesting digits are hidden nine digits from the right and a variable distance from the left.

      You can, of course, always display time_int64/1e9 in the debugger (so it displays as floating-point seconds), which does make int64 much more manageable.

      I think you are wrong about the performance of double versus int64-in-a-32-bit-process. Doubles can be added in one instruction, with three instruction latency, while adding int64 will be many instructions with longer latency. Not that that matters either way, but floating-point math is fast, and multiple dependent instructions are generally slower.

      But, I’m not trying to convince the whole world to use double, just not to use float.

  12. Yakov Galka says:

    Between doubles and integers I prefer 32.32 fixed points. They are convenient enough to read in hexadecimal, and have a good (nanosecond) precision, which is useful in areas beyond games. I am talking about absolute time values obviously, for when you take deltas you convert them to doubles or floats again.

  13. TS says:

    This reminds me of a interesting bug some weeks ago: A colleague approached me with a screen capture of our mobile app and asked me why the maps look broken in Australia but fine in Germany – and why the same map looked fine in both locations in the browser client.
    A closer look quickly pointed out that there seemed to be some rounding to meters in effect.
    On the server, data is stored as 32 bit integers, yielding a worldwide coordinate system with about 1cm precision. The difference is about 1:100, or roughly 7 bits – ending up suspiciously close to float precision. No issue for the browser as JavaScript uses double by default (*).
    Asking the mobile devs whether they were using float instead of int or double confirmed that they were, and thus the issue was quickly fixed.
    So, but why did it work properly for German maps (which was why the colleagues there were puzzled in the first place)? Well, the origin of the coordinate system is located inside Europe, thus float was sufficient there, but the larger offset to the other side of the globe ate up all the least significant bits needed for sub-meter precision…

    (*):Actually, you do not get double precision in the browser everywhere: As soon as the values are fed into CSS or markup and processed by the browser they will likely end using an API which uses 32 Bit floats. The solution is to offset the data to a new origin so that the values can be displayed precisely enough with that lower precision.
    Works for other applications – including time – as well, and can be used to reduce data size when storing many values as well: A large magnitude origin combined with a small magnitude offset can be stored as two 32 bit floats giving basically double precision. Since the origin is usually the same for all values it needs to be stored only once thus saving half the data size compared to doubles.

    Regarding time for games and animation it is good advice to use relative time wherever possible and avoid absolute time: First, most tasks are usually centered on a certain magnitude (for an animation which takes some seconds it is usually irrelevant which century it happens in, and while calculating tree growth in years there is no need for millisecond accuracy), thus the roughly 7 significant decimal digits provided by 32 bit floats are often already sufficient for relative time values.
    Not to mention that absolute time isn’t, at least not as a value linearly incremented at a constant rate: Leap seconds, daylight saving time, suspend/sleep modes, internet time updates, empty batteries etc. might all mess up the time in some hardly expected way.
    Not to mention timers which overflow after running for many days, like the Windows timeGetTime() timer or the 32bit unix time in the not-so-distant future.

    • brucedawson says:

      Thanks for sharing. Microsoft’s Streets & Trips had a similar bug. If you zoomed in “too far” and then tried to pan slowly with the mouse then the map would not move. You had to get above a critical speed (presumably a critical mouse delta) before the updated position would be different from the previous position, in the precision used.

      As in your case this behavior was location dependent. I reported this while I worked at Microsoft and it was left unfixed for several years, and then the product was cancelled. Sad.

      I assume that they were using float somewhere along the pipeline but I never got confirmation.

  14. Weasel says:

    I know this is an old post, but it was very informative.
    However you should edit the article about integers, because integers will *always* be exact for any *delta* value, as long as the interval between time taken is shorter than what it takes for it to overflow.

    You don’t need uint64_t at all, uint32_t is likely overkill already.

    When it “wraps around” due to overflow, the two’s complement basically ensures the result will be proper unless it wrapped around so much that it passed through the original “old value” you subtract. e.g.:

    uint32_t temp = GetTickCount();
    uint32_t delta = temp - LastTime; // *always* correct if it didn't take enough time to overflow since last call
    LastTime = temp;

    This way you can always know how much it took since last time you called the function. Forever, doesn’t matter if it overflows. At least on x86 and architectures using 2’s complement for integers.

    • brucedawson says:

      Integers definitely are a valid solution. The annoyance with integers (IMHO) is that they force you to choose a resolution and stick with it everywhere, and they wrap. ms precision is probably not enough, so microseconds? Something must be chosen.

      If you choose microseconds then uint32_t is suitable for many purposes. And many intervals will fit into a uint32_t. However if somebody stores the “wrong” interval in a uint32_t – something that can last arbitrarily long – then you suddenly have a bug that will show up after 4,294.97 seconds of testing. Any lengths of time that can grow arbitrarily long require a uint64_t.

      I also find that debugging with units of microseconds is less convenient. Even very short durations end up being seven-digit numbers which are unwieldy to read. The total game length can easily be a ten digit number, whereas game duration in seconds in a double would rarely exceed four digits. A small issue, perhaps, but it’s there.

      So yes, use an unsigned integer if you prefer. Just beware that you will probably have less precision than a double, wrap-around of intervals that get too long if you ever use uint32_t, and inconveniently large numbers.

  15. Max Barraclough says:

    Great post, thanks. I’d never much considered the ‘catastrophic cancellation’ problem before. You mention something interesting in passing:

    > it turns out that if two floats are that close then their difference will fit exactly into a float. Always.

    Why is this?

    Also, I see the OffendOMatic link has fallen victim to link-rot.
