A few years ago I did a lot of thinking and writing about floating-point math. It was good fun, and I learned a lot in the process, but sometimes I go a long time without actually using that hard-earned knowledge. So, I am always inordinately pleased when I end up working on a bug which requires some of that specialized knowledge. Here then is the second of (at least) three tales of floating-point bugs that I have investigated in Chromium (part one is here, part three is here). And this time I actually fixed the bug, both in Chromium, and then in googletest so that future generations will be spared some confusion.
This bug was a flaky test failure that suddenly started happening. We hates flaky test failures. They are particularly perplexing when they start happening in a test which has not changed for years. I got pulled in after a few weeks to investigate. The error messages (with some edits for line length) started something like this:
The difference between expected_microseconds and converted_microseconds is 512, which exceeds 1.0
Well. That sounds bad. This is a googletest error message saying that two floating-point values that were supposed to be within 1.0 of each other were, in fact, separated by 512.
The difference between the floating-point numbers was an immediate clue. It seemed very suspicious that the two numbers were separated by exactly 2^9. Coincidence? Nope. The rest of the message which listed the two values being compared just made me more certain of the underlying cause:
expected_microseconds evaluates to 4.2934311416234112e+18,
converted_microseconds evaluates to 4.2934311416234107e+18
If you’ve done enough time in the IEEE 754 trenches you will immediately recognize what is going on.
If you read last week’s episode then it may feel like déjà vu that the magnitude of the numbers is identical. This is pure coincidence – I’m just using the numbers I was dealt. This time they’re printed in scientific notation, so that gives some variety. Read the previous episode for more background.
The basic problem is a variation on last week’s problem: floating-point numbers in computers are not the same as the real numbers used by mathematicians. As they get bigger they get less precise, and in the range of the numbers in the test failure all of the double-precision values being dealt with were, necessarily, multiples of 512. A double has 53 bits of precision and these numbers were far bigger than 2^53, so greatly reduced precision was inevitable. And now we can understand the problem.
The test was calculating the same value using two different methods. It was then verifying that the results were close, where “close” was defined as being within 1.0. The calculation methods produced very similar answers so most of the time the results would round to the same double-precision value. But every now and then the correct answer would be near a cusp, and one calculation would round one way, and one calculation would round the other way.
To be more precise, the numbers that ended up being compared were:
- 4293431141623410688
- 4293431141623411200
With the exponents stripped away we can more easily see that they are 512 apart. The two infinitely precise results generated by the test functions were always less than 1.0 apart so when they were values like 429…10653.5 and 429…10654.3 they would both round to 429…10688. The trouble comes when the infinitely precise results are near a value like 4293431141623410944. This value is exactly half way between two doubles. If one function generates 429…10943.9 and the other function generates 429…10944.1 then these results – which are just 0.2 apart – get rounded in different directions and end up 512 apart!
This is the nature of a cusp, or a step function. You can have two results which are arbitrarily close but which are on opposite sides of a cusp – a point exactly between two doubles – and therefore they round in opposite directions. Changing the rounding mode is often suggested, but doesn’t help – it just moves the cusp.
It’s like being born near midnight – the tiniest variation can forever change the date (and maybe the year, century, or millennium) recorded for your birth.
My commit message might have been overly dramatic, but it was not wrong. I felt uniquely qualified to handle this situation:
commit 6c2427457b0c5ebaefa5c1a6003117ca8126e7bc
Author: Bruce Dawson
Date: Fri Dec 08 21:58:50 2017

    Fix epsilon calculation for large-double comparisons
My whole life has been leading up to this bug fix.
I mean, how often do I get to land a change to Chromium with a commit message that quite reasonably contains links to two (2!) of my blog posts.
The fix in this case was to calculate the difference between adjacent doubles at the magnitude of the calculated values. This was done with the rarely used nextafter function, like this:
epsilon = nextafter(expected, INFINITY) - expected;
if (epsilon < 1.0)
  epsilon = 1.0;
The nextafter function finds the next double (towards infinity in this case), and the subtraction (which is exact, how convenient) then finds the difference between doubles at that magnitude. The algorithm being tested could inherently give an error of 1.0 so epsilon had to be at least that large. This epsilon calculation made it trivial to check that the values were either within 1.0 or were adjacent doubles.
I never investigated why the test had suddenly started failing but I suspect that the timer frequency or the timer start point had changed, thus making the numbers bigger.
I just checked this. The test uses several simulated QueryPerformanceCounter (QPC) counts, including std::numeric_limits&lt;int64_t&gt;::max(), or 2^63-1. This unrealistically high number is the one that triggered the failures. Applying some math to the failure output shows that a QPC frequency of 2.148 MHz was in use when the failures were hit. When the test was reliable the QPC frequency was probably higher – perhaps the CPU frequency, so ~3 GHz or so. With a frequency that high, the test value of 2^63-1 would have been interpreted as a much shorter time, thus avoiding the flaky failure.
Therefore, the flaky failure started when a machine or OS change reduced the frequency of QueryPerformanceCounter.
Fixing googletest
It bothered me that it required esoteric floating-point knowledge to understand this problem so I wanted to fix googletest. My first attempt went badly.
I initially tried to fix googletest to make EXPECT_NEAR fail whenever a meaninglessly-small epsilon was passed, but apparently a lot of tests within Google – and presumably many more outside of Google – abuse EXPECT_NEAR on doubles. They pass an epsilon value that is too small to be useful, but the numbers they are comparing are identical so the test passes. I fixed a dozen uses of EXPECT_NEAR without making a dent in the problem and then gave up.
It wasn’t until I started writing this blog post (almost three years after the bug!) that I realized how to safely and easily fix googletest. If code is using EXPECT_NEAR with a too-small epsilon value and the test is passing (meaning that the values actually are equal) then it’s not really a problem. It only becomes a problem when the test fails, so all I needed to do was to look for too-small epsilon values in the failure case and display an informative message then.
I landed that change and now the error message for the 2017 failure looks like this:
The difference between expected_microseconds and converted_microseconds is 512, where
expected_microseconds evaluates to 4.2934311416234112e+18,
converted_microseconds evaluates to 4.2934311416234107e+18.
The abs_error parameter 1.0 evaluates to 1 which is smaller than the minimum distance between doubles for numbers of this magnitude which is 512, thus making this EXPECT_NEAR check equivalent to EXPECT_EQUAL. Consider using EXPECT_DOUBLE_EQ instead.
Note that EXPECT_DOUBLE_EQ doesn’t actually check for equality – it checks that the doubles are equal within four units in the last place (ULPs). See my Comparing Floating Point Numbers post for details on this concept.
My hope is that most software developers will see the new error message and be sent down the correct path, and I think that the googletest fix is ultimately more important than fixing the Chromium test.
Hacker news discussion is here.
The (very quiet) reddit discussion is here.
Thank you for addressing this issue in gtest. Might this also be the reason for my issue from 2010?
https://stackoverflow.com/q/3564871/108238
Your issue is certainly closely related. It is not surprising that 0.7 != 7 * 0.1. The two results will (I believe) match to within one unit in the last place (1 ULP), and ASSERT_DOUBLE_EQ counts it as a match if the numbers are within 4 ULPs. I think that it should have been named differently to acknowledge that, but it’s too late now.
My change wouldn’t have made any difference because the googletest comparison was passing.
The second answer explains this quite well. The third answer (“never-ever compare floating point numbers for equality!”) annoys me because while you _usually_ shouldn’t compare floating-point numbers for equality there absolutely are times when you should. The trick is to know the difference.