I promised in my last post to show an example of the importance of knowing how much precision a float has at a particular value. Here goes.

As a general rule, this type of data should never be stored in a float:

Elapsed game time should never be stored in a float. Use a double instead. I’ll explain why below.

As an extra bonus, because switching to double is not always the best solution, this post demonstrates the dangers of unstable algorithms, and how to use the guarantees of floating-point math to improve them.

## How long has this been going on?

A lot of games have some sort of GetTime() function that returns how long the game has been running. Often these return a floating-point number because it allows for convenient use of seconds as the units, while allowing sub-second precision.

GetTime() is typically implemented with some sort of high frequency timer such as QueryPerformanceCounter. This allows time resolution of a microsecond or better. However it’s worth looking at what happens to this resolution if the time is returned as a float, or stored in a float. We can do that using one of the TestFloatPrecision functions from the last post – just call them from the watch window of the debugger. In the screen shot below I tested the precision available at one minute, one hour, one day, and one week:

It’s important to understand what this data means. The number ‘60’, like all integers up to 16777216, can be exactly represented in a float. The watch window shows that the next value after 60 that can be represented by a float is about 60.0000038. Therefore, if we use a float to store “60 seconds” then the next time that we can represent is 3.8 microseconds past 60 seconds. If we try to store a value in-between then it will be rounded up or down.
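If you don’t have the TestFloatPrecision functions from the last post handy, a rough equivalent can be sketched with the standard library. FloatGapAbove is a name made up for this sketch (it is not one of the functions from the last post), and it assumes IEEE 754 single-precision floats and a finite, non-negative input:

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper (not from the original post): returns the distance from
// 'value' to the next representable float above it. Assumes a finite,
// non-negative input and IEEE 754 single precision.
float FloatGapAbove(float value)
{
    const float next = std::nextafter(value, std::numeric_limits<float>::infinity());
    return next - value; // exact for finite neighbours: adjacent floats differ by a representable amount
}
```

Under those assumptions FloatGapAbove(60.0f) should come out at 2^-18 (about 3.8 microseconds) and FloatGapAbove(86400.0f) at 2^-7 (about 7.8 milliseconds).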

## How long did it take?

One of the most common things to do with time values is to subtract them. For instance, we might have code like this:

    double GetTime();

    float TimeSomethingBadly()
    {
        float fStart = GetTime();
        DoSomething();
        float elapsed = GetTime() - fStart;
        return elapsed;
    }

The implication of the precision calculations above is that if ‘fStart’ is around 60, then ‘elapsed’ will be a multiple of 3.8 microseconds (two to the negative eighteenth seconds). That is the most precision you can get. If less than 3.8 microseconds has elapsed then ‘elapsed’ will either be rounded down to zero, or rounded up to 3.8 microseconds.

Therefore, if our game timer starts at zero and we store time in a float then after a minute the best precision we can get from our timer is 3.8 microseconds. After our game has been running for an hour our best precision drops to 0.24 milliseconds. After our game has been running for a day our precision drops to 7.8 milliseconds, and after a week our precision drops to 62.5 milliseconds.

This is why storing time in a float is dangerous. If you use float-time to try calculating your frame rate after running for a day then the only answers above 30 fps that are possible are infinity, 128, 64, 42.7, or 32 (since the possible frame lengths are 0, 7.8, 15.6, 23.4, or 31.2 milliseconds). And it only gets worse if you run longer.
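To make the quantization concrete, here is a minimal sketch of what a float-based timer would report for a single frame. Both function names are invented for this illustration, and the sketch assumes float arithmetic is evaluated in single precision (as it is on x86-64 with SSE):

```cpp
#include <limits>

// Hypothetical helper: the frame length a float-based timer would report for a
// frame that really starts at 'startSeconds' and lasts 'frameSeconds'.
float FrameLengthSeenByFloatTimer(double startSeconds, double frameSeconds)
{
    const float start = static_cast<float>(startSeconds);              // rounded to float
    const float end = static_cast<float>(startSeconds + frameSeconds); // rounded to float
    return end - start;
}

// Frame rate implied by that reported frame length (infinity if it rounds to zero).
float FpsSeenByFloatTimer(double startSeconds, double frameSeconds)
{
    const float frameLength = FrameLengthSeenByFloatTimer(startSeconds, frameSeconds);
    return frameLength > 0.0f ? 1.0f / frameLength
                              : std::numeric_limits<float>::infinity();
}
```

With a start time of exactly 86,400 seconds, a true 60 fps frame (16.7 ms) is reported as two ticks of 7.8 ms, so the computed frame rate is 64 fps, and a 1 ms frame is reported as zero.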

As another example consider this code:

    double GetTime();

    void ThinkBadly()
    {
        float startTime = (float)GetTime();
        // Do AI stuff here
        float elapsedTime = GetTime() - startTime;
        assert(elapsedTime < 0.005);
    }

The purpose of this code is to warn the developers whenever the AI code takes inordinately long. However when the game has been running for a day (actually the problem reaches this level after 65,536 seconds) GetTime() will always be returning a multiple of 0.0078 s, and ‘elapsedTime’ will always be a multiple of that duration. In most cases ‘elapsedTime’ will be equal to zero, but every now and then, no matter how fast the AI code executes, the time will tick over to the next representation during the AI calculations and ‘elapsedTime’ will be 0.0078 s instead of zero. The assert will then trigger even though the AI code is actually still under budget.

## It’s a catastrophe for base-ten also

The general term for what is happening with these time calculations is catastrophic cancellation. In all of these examples above there are two time values that are accurate to about seven digits. However they are so close to each other that when they are subtracted the result has, in the worst case, zero significant digits.

We can see the same thing happening with decimal numbers. A float has roughly seven decimal digits of precision so the decimal equivalent would be getting a time value of 60.00000 and having the next possible time value be 60.00001. Given a seven-digit decimal float we can’t get better than ten microseconds of precision when dealing with times around 60 seconds. When we subtract 60.00000 from 60.00001 then six of the seven digits cancel out and we end up with just one accurate digit. For times less than ten microseconds we have a complete catastrophe – all seven digits cancel out and we get zero digits of precision, just like with a binary float.

## Double down

The solution to all of this is simple. GetTime() must return a double, and its result must always be stored in a double. The cancellation still occurs, but it is no longer catastrophic. A double has enough bits in the mantissa that even if your game runs for several millennia your double-precision timers will still have sub-millisecond precision. You can verify this by using the double-precision variation of TestFloatPrecisionAwayFromZero():

    #include <cassert>
    #include <cstdint>

    union Double_t
    {
        Double_t(double val) : f(val) {}

        // Portable extraction of components.
        bool Negative() const { return (i >> 63) != 0; }
        int64_t RawMantissa() const { return i & ((1LL << 52) - 1); }
        int64_t RawExponent() const { return (i >> 52) & 0x7FF; }

        int64_t i;
        double f;
    #ifdef _DEBUG
        struct
        {   // Bitfields for exploration. Do not use in production code.
            uint64_t mantissa : 52;
            uint64_t exponent : 11;
            uint64_t sign : 1;
        } parts;
    #endif
    };

    double TestDoublePrecisionAwayFromZero(double input)
    {
        union Double_t num(input);
        // Incrementing infinity or a NaN would be bad!
        assert(num.RawExponent() < 2047);
        // Increment the integer representation of our value
        num.i += 1;
        // Subtract the initial value to find our precision
        double delta = num.f - input;
        return delta;
    }

You can see in the screenshot below that if you store time in doubles then after your game has been running for a week you will have sub-nanosecond precision, and after three millennia you will still have sub-millisecond precision.

Clearly a double is overkill for storing time, but since a float is underkill a double is the right choice.

Aside: my initial calculation of the precision remaining after three millennia was wrong because the calculation of the number of seconds was done with integer math, and it overflowed and gave a completely worthless answer. Which proves that integer math can be just as tricky as floating-point math.
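For what it’s worth, the overflow is easy to sidestep by doing the seconds calculation in double from the start. The sketch below is one way to do that; DoubleGapAbove and SecondsInYears are names invented here, and a year is assumed to be 365.25 days:

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: gap between 'value' and the next double above it.
// Assumes a finite, non-negative input and IEEE 754 double precision.
double DoubleGapAbove(double value)
{
    return std::nextafter(value, std::numeric_limits<double>::infinity()) - value;
}

// Seconds in 'years' years of 365.25 days each. Every operand is a double, so
// nothing overflows; 3000 * 365 * 24 * 60 * 60 does not fit in a 32-bit int.
double SecondsInYears(double years)
{
    return years * 365.25 * 24.0 * 60.0 * 60.0;
}
```

Under those assumptions DoubleGapAbove(604800.0) is 2^-33 seconds (about 0.12 nanoseconds) and DoubleGapAbove(SecondsInYears(3000.0)) is 2^-16 seconds (about 15 microseconds).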

## Changing your units doesn’t help

All along I am assuming that you are storing your time in seconds. However your choice of units doesn’t significantly affect the results. If you decide that your time units are milliseconds, or days, then the precision available after your game has been running for a day will be about the same. It is the ratio between the elapsed time and the time being measured that matters. I like seconds because they are intuitive and human friendly, and that does matter.
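A quick way to convince yourself is to compute the float precision after one day for several choices of unit and convert each result back to seconds. This is only a sketch; the function name is made up and it assumes IEEE 754 floats:

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: float precision, converted to seconds, for a clock that
// has run for one day when time is stored in units of 'secondsPerUnit' seconds
// (1.0 for seconds, 0.001 for milliseconds, 86400.0 for days).
double FloatPrecisionAfterOneDayInSeconds(double secondsPerUnit)
{
    const float oneDay = static_cast<float>(86400.0 / secondsPerUnit);
    const float next = std::nextafter(oneDay, std::numeric_limits<float>::infinity());
    return static_cast<double>(next - oneDay) * secondsPerUnit;
}
```

Seconds give about 7.8 ms, milliseconds give about 8 ms, and days give about 10.3 ms. The exact figure depends on where the value falls between powers of two, but the order of magnitude does not change.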

## Or use integers

Tom Forsyth points out that the same issues happen with world coordinates and that switching to integer types can give you greater worst-case precision, as well as consistent precision. The Windows GetTickCount() and GetTickCount64() functions use this technique, using milliseconds as the units. This alternative to using a double for time is quite reasonable, especially if you encapsulate it well. A uint32_t with milliseconds as units will overflow every 50 days or so but you can avoid that by using a uint64_t. However despite Tom’s threats to invoke his OffendOMatic rule for all who use doubles, I still prefer doubles for game time because of the combination of convenient units (seconds), more than sufficient precision, and easy calculations.

While Tom and I appear to disagree over whether you should use double in situations like this, we agree that ‘float’ won’t work.

Recently John Carmack said “Time should be a double of seconds” – that’s a good vote of confidence to have.

Note that while GetTickCount() and GetTickCount64() are millisecond precision they are often actually less accurate than you would expect. Unless you have changed the Windows timer frequency with timeBeginPeriod() the GetTickCount functions will only return a new value every 10-20 milliseconds (*insert pithy comment about precision versus accuracy here*).
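For comparison, here is a minimal sketch of what an encapsulated integer time type might look like. GameTicks and the functions around it are hypothetical names for illustration only; they are not a Windows API and not something from Tom’s post:

```cpp
#include <cstdint>

// Hypothetical wrapper: game time as a 64-bit count of milliseconds.
struct GameTicks
{
    std::uint64_t milliseconds;
};

// Integer subtraction is exact, so the delta has the same 1 ms resolution
// whether the game has been running for a minute or a year.
std::uint64_t ElapsedMilliseconds(GameTicks start, GameTicks end)
{
    return end.milliseconds - start.milliseconds;
}

// Convert a delta to seconds only at the point where floating point is needed.
double ElapsedSeconds(GameTicks start, GameTicks end)
{
    return static_cast<double>(ElapsedMilliseconds(start, end)) / 1000.0;
}

// Days until a 32-bit millisecond counter wraps around.
constexpr double kDaysUntilUint32Wrap = 4294967296.0 / (1000.0 * 60.0 * 60.0 * 24.0);
```

The precision is fixed by the choice of unit, which is the trade-off: milliseconds are consistent but coarse, and finer units make the raw numbers harder to read.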

## Four billion dollar question

Even if you use doubles for time, the precision available will still change as game time marches on from zero to the length of your game. These precision changes – while smaller with doubles than with floats – can still be dangerous. Luckily there is a convenient way to get the consistent precision of an integer, with the convenient units of a double.

If you start your game clock at about 4 billion (more precisely 2^32, or any large power of two) then your exponent, and hence your precision, will remain constant for the next ~4 billion seconds, or ~136 years.

And, when using doubles, this precision is approximately one microsecond.

So there you have it. The one-true answer. Store elapsed game time in a double, starting at 2^32 seconds. You will get constant precision of better than a microsecond for over a century, and if you accidentally store time in a float you will see precision errors immediately instead of after hours of gameplay. You read it here first.
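A minimal sketch of the idea, assuming your platform timer already gives you seconds since startup as a double. The names kGameClockOrigin, GameTimeFromRawSeconds and GameClockPrecision are invented for this example:

```cpp
#include <cmath>
#include <limits>

// 2^32 seconds. Any large power of two would do.
constexpr double kGameClockOrigin = 4294967296.0;

// Hypothetical adapter: 'rawSeconds' is the time since startup from whatever
// high-resolution timer the game uses (for example QueryPerformanceCounter).
double GameTimeFromRawSeconds(double rawSeconds)
{
    return kGameClockOrigin + rawSeconds;
}

// Gap to the next representable double above 'gameTime', in seconds.
double GameClockPrecision(double gameTime)
{
    return std::nextafter(gameTime, std::numeric_limits<double>::infinity()) - gameTime;
}
```

GameClockPrecision() returns 2^-20 seconds (just under a microsecond) at startup and still returns 2^-20 four billion seconds later. The same origin stored in a float has a gap of 512 seconds, so a stray float shows up as an obvious bug on the first frame.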

## Time *deltas* fit in a float

It is important to understand that the limited precision of a float is only a problem if you do an unstable calculation, such as a subtraction where catastrophic cancellation wipes out most of the digits. The code below, on the other hand, is fine:

    double GetTime();

    float TimeSomethingWell()
    {
        double dStart = GetTime(); // Store time in a double
        DoSomething();
        float elapsed = GetTime() - dStart; // Store *result* in a float
        return elapsed;
    }

In TimeSomethingWell() we store the result of the subtraction in a float – *after* the catastrophic cancellation. Therefore our elapsed time value will have tons of precision.

Similarly, if you are using floats in your animation system to represent short times, such as the location of key-frames in a 60 second animation, then floats are fine. However when you add these to the current time you need to store the *result* of the addition in a double.

## Tables!

Forrest Smith made a pretty table showing how the precision of a float changes as the magnitude increases, and I mangled it to suit my needs. Here it is for time:

| Float Value | Time Value | Float Precision | Time Precision |
| --- | --- | --- | --- |
| 1 | 1 second | 1.19E-07 | 119 nanoseconds |
| 10 | 10 seconds | 9.54E-07 | .954 microseconds |
| 100 | ~1.5 minutes | 7.63E-06 | 7.63 microseconds |
| 1,000 | ~16 minutes | 6.10E-05 | 61.0 microseconds |
| 10,000 | ~3 hours | 0.000977 | .977 milliseconds |
| 100,000 | ~1 day | 0.00781 | 7.81 milliseconds |
| 1,000,000 | ~11 days | 0.0625 | 62.5 milliseconds |
| 10,000,000 | ~4 months | 1 | 1 second |
| 100,000,000 | ~3 years | 8 | 8 seconds |
| 1,000,000,000 | ~32 years | 64 | 64 seconds |

And here is the table showing how the precision of a float diminishes when you use it to measure large distances, with meters being the units in this case:

| Float Value | Length Value | Float Precision | Length Precision | Precision Size |
| --- | --- | --- | --- | --- |
| 1 | 1 meter | 1.19E-07 | 119 nanometers | virus |
| 10 | 10 meters | 9.54E-07 | .954 micrometers | e. coli bacteria |
| 100 | 100 meters | 7.63E-06 | 7.63 micrometers | red blood cell |
| 1,000 | 1 kilometer | 6.10E-05 | 61.0 micrometers | human hair width |
| 10,000 | 10 kilometers | 0.000977 | .977 millimeters | toenail thickness |
| 100,000 | 100 kilometers | 0.00781 | 7.81 millimeters | size of an ant |
| 1,000,000 | .16x earth radius | 0.0625 | 62.5 millimeters | credit card width |
| 10,000,000 | 1.6x earth radius | 1 | 1 meter | uh… a meter |
| 100,000,000 | .14x sun radius | 8 | 8 meters | 4 Chewbaccas |
| 1,000,000,000 | 1.4x sun radius | 64 | 64 meters | half a football field |

## Stable algorithms also matter

Some time ago I investigated some asserts in a particle animation system. Values were going out of range after less than an hour of gameplay and I traced this back to an out-of-range ‘t’ value being passed to the Lerp function, which expected it to always be from 0.0 to 1.0. Clamping was one obvious solution but I first investigated why ’t’ was going out of range.

One problem with the code was that the three parameters were all floats, so over long periods of time it would inevitably have insufficient precision. However we were getting instability much earlier than expected and it felt like switching to double immediately might just mask an underlying problem.

The parameters to the function, all time values in seconds, corresponded to the end of an animation segment, the length of that segment, and the current time, which was always between the start of the segment (segmentEnd-segmentLength) and ‘segmentEnd’. Because the start time of the segment was not passed in this code calculated it, and then did a straightforward calculation to get ‘t’:

    float CalcTBad(float segmentEnd, float segmentLength, float time)
    {
        float segmentStart = segmentEnd - segmentLength;
        float t = (time - segmentStart) / segmentLength;
        return t;
    }

Straightforward, but unstable. Because ‘segmentLength’ is presumed to be quite small compared to ‘segmentEnd’, there is some rounding during the first subtraction and the difference between ‘segmentStart’ and ‘segmentEnd’ will be a bit larger or smaller than ‘segmentLength’. The resulting difference will always be a multiple of the current precision, so it will degrade over time, but even very early in the game the result will not be perfect. Because the value for ‘segmentStart’ is slightly wrong the value of “time – segmentStart” will be slightly wrong, and occasionally ‘t’ will be outside of the 0.0 to 1.0 range.

This will happen even if you use doubles. The errors will be smaller, but ‘t’ can still go slightly outside the 0.0 to 1.0 range. As the game goes on ‘t’ will range farther outside of the correct range, but from just a few minutes into the game the results will show signs of instability.
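Here is one concrete input that trips the float version, worked out by hand for this write-up rather than taken from the original bug. CalcTBad is repeated from above so the snippet stands alone, IsValidLerpT is a made-up check, and the sketch assumes float arithmetic is evaluated in single precision (as on x86-64 with SSE):

```cpp
// CalcTBad from above, unchanged.
float CalcTBad(float segmentEnd, float segmentLength, float time)
{
    float segmentStart = segmentEnd - segmentLength;
    float t = (time - segmentStart) / segmentLength;
    return t;
}

// Hypothetical check used only for this illustration.
bool IsValidLerpT(float t)
{
    return t >= 0.0f && t <= 1.0f;
}
```

With segmentEnd = 1000.0f, segmentLength = 0.2f and time = 1000.0f (the very end of the segment), ‘segmentStart’ rounds down to about 999.799988. The subtraction then yields about 0.200012 instead of 0.2, and ‘t’ comes out slightly above 1.0, less than seventeen minutes into the game.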

The natural tendency is to say “floating-point math is flaky, clamp the results and move on”, but we can do better, as shown here:

    float CalcTGood(float segmentEnd, float segmentLength, float time)
    {
        float howLongAgo = segmentEnd - time;
        float t = (segmentLength - howLongAgo) / segmentLength;
        return t;
    }

Mathematically this calculation is identical to CalcTBad, but from a stability point of view it is greatly improved.

If we assume that ‘time’ and ‘segmentEnd’ are large compared to ‘segmentLength’, then we can reasonably assume that ‘segmentEnd’ is less than twice as large as time. And, it turns out that if two floats are that close then their difference will fit exactly into a float. Always. So the calculation of ‘howLongAgo’ is exact. Ponder that for a moment – given a few reasonable assumptions we have *exact* results for one of our floating-point math operations.

With ‘howLongAgo’ being exact, if ‘time’ is within its prescribed range then ‘howLongAgo’ will be between zero and ‘segmentLength’, and so will ‘segmentLength’ minus ‘howLongAgo’. IEEE floating-point math guarantees correct rounding so when we divide by ‘segmentLength’ we are guaranteed that ‘t’ will be from 0.0 to 1.0. No clamping needed, even with floats.
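If you would rather test the exactness claim than take it on faith, one way is to redo the subtraction in double and compare. FloatDifferenceIsExact is a hypothetical helper; it relies on the double subtraction itself being exact, which holds for floats whose magnitudes are not wildly different, and it assumes float arithmetic is evaluated in single precision. CalcTGood is repeated from above so the snippet stands alone:

```cpp
// CalcTGood from above, unchanged.
float CalcTGood(float segmentEnd, float segmentLength, float time)
{
    float howLongAgo = segmentEnd - time;
    float t = (segmentLength - howLongAgo) / segmentLength;
    return t;
}

// Hypothetical check: is the float subtraction a - b exact? The reference
// result is the same subtraction done in double.
bool FloatDifferenceIsExact(float a, float b)
{
    const float inFloat = a - b;
    const double inDouble = static_cast<double>(a) - static_cast<double>(b);
    return static_cast<double>(inFloat) == inDouble;
}
```

FloatDifferenceIsExact(1000.0f, 999.9f) holds because the two values are within a factor of two of each other, while FloatDifferenceIsExact(1000.0f, 0.2f), the subtraction CalcTBad starts with, does not. CalcTGood(1000.0f, 0.2f, 1000.0f) returns exactly 1.0.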

This real example demonstrates a few things:

- Any time you add or subtract floats of widely varying magnitudes you need to watch for loss of precision
- Sometimes using ‘double’ instead of ‘float’ is the correct solution, but often a more stable algorithm is more important
- CalcT should probably use double (to give sufficient precision after many hours of gameplay)

## Your compiler is trying to tell you something…

With Visual C++ on the default warning level you will get warning C4244 when you assign a double to a float:

warning C4244: ‘initializing’ : conversion from ‘double’ to ‘float’, possible loss of data

Possible loss of data is not necessarily a problem, but it can be. Suppressing warnings, with #pragma warning or with a cast, is something that should be done thoughtfully, after understanding the issue. Otherwise the compiler might say “I told you so” when your game fails after a twenty-four hour soak test.

## Does it matter?

For some game types this problem may be irrelevant. Many games finish in less than an hour and a float that holds 3,600 (seconds) still has sub-millisecond accuracy, which is enough for most purposes. This means that for those game types you should be fine storing time in a float, as long as you reset the zero-point of GetTime() at the beginning of each game, and as long as the clock stops running when the game is paused.

For other game types – probably the majority of games – you need to do your time calculations using a double or uint64_t. I’ve seen problems on multiple games that failed to follow this rule. The problems are particularly tedious to track down and fix because they may take many hours to show up.

Store your time values in a double, starting at 2^32 seconds, and then you don’t need to worry, at least not as much, as long as you avoid unstable algorithms.

A lot of people have commented on this article and said that the justification for using double instead of 64-bit integers is not very strong. I agree that either one will work, however I think that double has a couple of advantages. One is developer convenience. A floating point number like 1.73 is far easier to comprehend than 1730 (fixed-point with ms accuracy) and it has more precision. The more precision you give to a fixed-point integer the more unwieldy the numbers get, and there is a real cost to this.

The other reason is game industry specific. When a game does time calculations it typically uses the time values for physics, AI, and graphics, and these systems typically need floating-point numbers. So, it turns out that you cannot avoid floating-point time. Therefore, you might as well do it in the first place, and do it right. Most games already use floating-point numbers for time – I just want to encourage them to not use ‘float’.

It’s also interesting to note that Apple uses double for time – NSTimeInterval is a double. As they say: “NSTimeInterval is always specified in seconds; it yields sub-millisecond precision over a range of 10,000 years.”

## Next time…

On the next post I think it might finally be time to start jumping into the delicate subject of how to compare floating-point numbers, with the many subtleties involved. Previous articles in this series, and other posts, can be found here.

Your writing is very clean and easy to understand. I don’t have any use for most of your topics covered (currently in C# land) but I still enjoy reading it. Thanks.

You are off by an order of magnitude on the diameter of the sun. It’s about 100x as wide as Earth, not 10x. So the last 2 entries in your distance table need to be updated.

Good catch. I’m not sure how I missed that. Fixed.

Dekker found some nice properties about summation and multiplication of floating point values, and how to make them accurate. Take a look at these (template versions of the Dekker algorithms published by Takeshi Ogita, S.M. Rump et al. in SIAM Journal on Scientific Computing 26 (2005), no. 6, pp. 1955–1988):

    template <typename T>
    void two_sum ( T a , T b , T & x , T & y ) {
        x = a + b;
        T z = x - a;
        y = ( a - ( x - z ) ) + ( b - z );
    }

    // float_traits<T> is the commenter's own traits helper (not shown here).
    template <typename T>
    void split ( T & x , T & y , T const& a ) {
        T c = T ( ( 1UL << ( ( float_traits<T>::mantissa_bits >> 1 )
                  + float_traits<T>::mantissa_bits % 2 ) ) + 1 ) * a;
        x = c - ( c - a );
        y = a - x;
    }

    template <typename T>
    void two_product ( T a , T b , T & x , T & y ) {
        x = a * b;
        T a1 , a2;
        split ( a1 , a2 , a );
        T b1 , b2;
        split ( b1 , b2 , b );
        y = a2 * b2 - ( ( ( x - a1 * b1 ) - a2 * b1 ) - a1 * b2 );
    }

I hope it gets reasonably formatted. The first one makes x = float( a + b ) and x + y = a + b if float would have infinite precision. So y contains the error of the limited floating point operation.

Something similar can be stated for two_product. x = float(a*b) and x+y =a*b.

All this only works if the compiler does not optimize away the floating point operations.
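A small float-only demonstration of the two_sum idea, adapted from the code in this comment. The names TwoSum and SumError are chosen for this sketch, and it assumes the code is built without value-changing optimizations such as -ffast-math or /fp:fast:

```cpp
// two_sum from the comment above, specialized to float for the demo.
// With fast-math style optimizations the compiler may simplify the
// error term to zero, so this needs strict floating-point semantics.
void TwoSum(float a, float b, float& x, float& y)
{
    x = a + b;                   // rounded sum
    float z = x - a;             // the part of b that made it into x
    y = (a - (x - z)) + (b - z); // what the rounding threw away
}

// Hypothetical convenience wrapper returning only the rounding error of a + b.
float SumError(float a, float b)
{
    float x = 0.0f;
    float y = 0.0f;
    TwoSum(a, b, x, y);
    return y;
}
```

Adding 1 to 100,000,000 in float leaves the sum unchanged because floats are 8 apart at that magnitude, and SumError(100000000.0f, 1.0f) hands back the lost 1.0.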

The nice thing about this is that one can easily create summations and multiplications with higher accuracy, by keeping the error term (y) and reusing it in following operations. I once implemented a matrix expression template library that executed 100% accurate scalar products, using the following algorithm (invented by S.M. Rump, Takeshi Ogita et al., “Accurate Floating Point Summation”, 2006):

    template <typename T> // faster than two_sum, works only for a >= b
    void fast_two_sum ( T a , T b , T & x , T & y ) {
        assert ( a >= b );
        x = a + b;
        T q = x - a;
        y = b - q;
    }

    template <typename T>
    typename T::value_type accurate_sum ( T & vec )
    {
        typedef typename T::value_type value_type;
        size_t n = num_elements ( vec );
        if ( n == 0 ) return 0;

        value_type mu = std::abs ( vec ( 0 ) );
        for ( size_t i = 1; i != n; ++i )
            mu = std::max ( std::abs ( vec ( i ) ), mu );

        value_type Ms = next_power_two ( value_type ( n + 2 ) );
        value_type sigma = Ms * next_power_two ( mu );
        value_type phi = std::numeric_limits<value_type>::epsilon ( ) * Ms;
        value_type factor = value_type ( 2 ) * phi * Ms;
        if ( ! check_extraction_parameters ( phi, sigma, factor ) )
            return simple_sum ( vec );

        value_type t = 0;
        T q;
        while ( true ) {
            q = elementwise_sub ( elementwise_add ( sigma, vec ), sigma );
            value_type tau = simple_sum ( q );
            vec = vec - q;
            value_type tau1, tau2;
            fast_two_sum ( t, tau, tau1, tau2 );
            if ( std::abs ( tau1 ) >= factor * sigma
                 || sigma <= std::numeric_limits<value_type>::denorm_min ( ) )
                return tau1 + ( tau2 + simple_sum ( vec ) );
            t = tau1;
            sigma = phi * sigma;
        }
        return 0;
    }

The algorithm walks through the exponent until the operands no longer add relevant values. If the vector has all operands within a similar mantissa range the algorithm terminates sooner. The result is accurate for the given floating point type.

Very cool.

A similar property is that if you have a compiler that generates fmadd instructions (fused multiply add where rounding doesn’t occur until after the add) then this calculation:

a * b + a * -b

typically gets compiled as fmadd(a, b, a * -b). That is, the “a * -b” is done as a normal multiply, and the “a * b” is done as part of an fmadd. The net result is the rounding error in a * b. That’s a cool property I think.
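You don’t have to rely on the compiler’s instruction selection to get this effect. The sketch below asks for the fused operation explicitly with std::fma from the standard library; ProductError is a name invented here, and the result is only meaningful when nothing overflows or underflows:

```cpp
#include <cmath>

// Rounding error of the double product a * b, that is, the exact product
// minus the rounded product. std::fma computes a * b - product with a single
// rounding at the end, and the true error fits in a double exactly
// (barring overflow and underflow).
double ProductError(double a, double b)
{
    const double product = a * b;    // rounded product
    return std::fma(a, b, -product); // exact product minus rounded product
}
```

For a = 1 + 2^-27 the exact square is 1 + 2^-26 + 2^-54. The last term is too small to survive rounding, and ProductError(a, a) returns exactly that missing 2^-54.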

My personal favorite example of this was a bug in the Patriot Missile System’s software: http://www.gao.gov/products/IMTEC-92-26