Floating Point in the Browser, Part 3: When x+y=x (y != 0)

A few years ago I did a lot of thinking and writing about floating-point math. It was good fun, and I learned a lot in the process, but sometimes I go a long time without actually using that hard-earned knowledge. So, I am always inordinately pleased when I end up working on a bug which requires some of that specialized knowledge. Here then is the third of (at least) three tales of floating-point bugs that I have investigated in Chromium (part one is here, part two is here). It’s another variant on precision problems when pushing the limits – maybe I just keep encountering the same floating-point bug?

In this post I’ll also cover some debugging techniques you can apply if you ever want to explore the Chromium source code or investigate a crash.

The title of the bug I filed when I hit this issue was “OOM (Out of Memory) crash in chrome://tracing when zoomed way in”, which does not sound like a floating-point bug.

As usual I wasn’t looking for trouble, I was just poking around in some chrome://tracing traces to try to understand some of the events when suddenly I got a sad tab – a crash.

You can view and upload recent Chrome crashes by going to chrome://crashes but I wanted to load the crash dump into a debugger so I navigated to where they are stored locally:

%localappdata%\Google\Chrome\User Data\Crashpad\reports

I loaded the most recent crash dump into WinDbg (Visual Studio also works fine) and started investigating. Because I had Chrome’s and Microsoft’s symbol servers configured and source server enabled, the debugger automatically downloaded the PDBs (debug information) and the necessary source files. Note that this setup is available to all – you don’t have to be a Google employee or Chromium developer to have this magically work. You can find Chrome/Chromium debugging setup instructions here. A Python install is required for the source code to download automatically.

Analysis of the crash showed that the out-of-memory failure happened because the V8 (JavaScript engine) function NewFixedDoubleArray tried to allocate an array with 75,209,227 elements, and the maximum size allowed in this context is 67,108,863 (0x3FFFFFF in hexadecimal).

The nice thing about investigating crashes that I triggered is that I can try to reproduce them while monitoring more closely. A bit of experimentation showed that memory remained stable as I zoomed in until I got to a critical point and then memory suddenly skyrocketed and then the tab crashed, even as I sat there doing nothing.

The problem now was that I could easily get a call stack for the crash, but only for the Chrome C++ code part of it. However the actual bug appeared to be in the chrome://tracing JavaScript code. I tried testing with Chrome’s canary (daily) build under a debugger and got this tantalizing message:

==== JS stack trace ======================================

Unfortunately this enticing line was not followed by an actual stack trace. A bit of git spelunking showed that the feature to print JS call stacks on OOM was added in 2015 and was then removed in December 2019.
I was investigating the bug in early January 2020 (remember then? Those were good times. Innocent times. Simpler times. Vote Biden. But I digress) which meant that the OOM stack-trace code had been removed from the daily build but was still in the stable build…

So my next step was to try to reproduce the bug on the stable version of Chrome. This gave me the following results (edited for clarity):

    0: ExitFrame [pc: 00007FFDCD887FBD]
    1: drawGrid_ [000016011D504859] [chrome://tracing/tracing.js:~4750]
    2: draw [000016011D504821] [chrome://tracing/tracing.js:4750]

In short, the OOM crash seemed to be from drawGrid_, which I found (using the Chromium code-search page) in x_axis_track.html. With some hacking on that file I narrowed down the problem to the call to updateMajorMarkData. That function has a loop which does a call to majorMarkWorldPositions_.push and that is the culprit.

I should mention that, despite being a browser developer, I am the world’s worst JavaScript programmer. Being a C++ systems programmer does not, in fact, give me magical “front end” skills. Hacking on the JavaScript to understand this bug was reasonably painful for me.

The loop (which you can see here) looks something like this:

for (let curX = firstMajorMark;
     curX < viewRWorld;
     curX += majorMarkDistanceWorld) {
  this.majorMarkWorldPositions_.push(
      Math.floor(MAJOR_MARK_ROUNDING_FACTOR * curX) /
      MAJOR_MARK_ROUNDING_FACTOR);
}

I added some debug-print statements before the loop and got the data below. When I was not quite zoomed in enough to cause the crash the critical numbers looked something like this:

firstMajorMark: 885.0999999642371
majorMarkDistanceWorld: 1e-13

When I had zoomed in enough to trigger the crash the numbers looked more like this:

firstMajorMark: 885.0999999642371
majorMarkDistanceWorld: 5e-14

885 divided by 5e-14 is about 1.8e16, and a double-precision floating-point number has 53 bits of precision, so it can only resolve steps of about one part in 2^53, which is about 9.0e15. Therefore the bug happens when majorMarkDistanceWorld (the distance between grid points) is so small relative to firstMajorMark (the location of the first major mark) that the addition in the loop… does nothing. That is, if you add a small number to a large number and the small number is “too small” (less than half the gap between the large number and the next representable value), then the result (in the default/sane round-to-nearest mode) rounds right back to the large number.
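To make the arithmetic concrete, here is a small sketch using the firstMajorMark value from the debug prints. The nextUp helper is my own (hypothetical, not part of chrome://tracing) and only handles positive, finite inputs; it finds the gap between a double and its next representable neighbor by bumping the raw bit pattern.

```javascript
// Hypothetical helper: the next representable double above a positive,
// finite x, found by incrementing its 64-bit pattern by one.
function nextUp(x) {
  const f64 = new Float64Array(1);
  const u64 = new BigUint64Array(f64.buffer);
  f64[0] = x;
  u64[0] += 1n;
  return f64[0];
}

const firstMajorMark = 885.0999999642371;

// Gap between firstMajorMark and the next double (one "ULP").
const ulp = nextUp(firstMajorMark) - firstMajorMark;
console.log(ulp === 2 ** -43);                                // true (about 1.14e-13)

// 5e-14 is less than half the gap, so the sum rounds back to where it started.
console.log(firstMajorMark + 5e-14 === firstMajorMark);       // true

// 1e-13 is more than half the gap, so the sum rounds up to the next double.
console.log(firstMajorMark + 1e-13 === firstMajorMark);       // false
console.log(firstMajorMark + 1e-13 === firstMajorMark + ulp); // true
```

Note that even the "working" 1e-13 step is not really adding 1e-13: each addition lands on the next representable double, so the marks advance by about 1.14e-13 at a time.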

Because of this the loop spins endlessly and the push command runs until the array hits the size limits. If there were no size limits then the push command would keep running until the entire machine ran out of memory, so, yay?
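The stall can be reproduced outside of Chrome with a cut-down version of the loop. This is only a sketch: countMajorMarks, the maxMarks cap, and the right-edge value are my own inventions for illustration, and the cap is there so the demonstration terminates instead of eating all of your memory.

```javascript
// Hypothetical stand-in for the loop in updateMajorMarkData: counts the
// marks instead of pushing them, and gives up after maxMarks iterations.
function countMajorMarks(firstMajorMark, viewRWorld, majorMarkDistanceWorld,
                         maxMarks = 1000) {
  let marks = 0;
  for (let curX = firstMajorMark;
       curX < viewRWorld;
       curX += majorMarkDistanceWorld) {
    if (marks === maxMarks) return { marks, hitCap: true };
    marks++;
  }
  return { marks, hitCap: false };
}

const viewLeft = 885.0999999642371;
const viewRight = viewLeft + 1e-12;  // assumed right edge, for illustration only

// Step of 1e-13: curX advances and the loop ends after a handful of marks.
console.log(countMajorMarks(viewLeft, viewRight, 1e-13).hitCap);  // false

// Step of 5e-14: curX never changes, so only the cap stops the loop.
console.log(countMajorMarks(viewLeft, viewRight, 5e-14).hitCap);  // true
```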

The fix was pretty simple – don’t print the grid marks if we can’t:

if (firstMajorMark / majorMarkDistanceWorld > 1e15) return;
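Here is that check pulled out into a standalone function so it can be tried against the numbers from my debug prints. The function name is hypothetical; only the comparison itself comes from the fix.

```javascript
// Hypothetical wrapper around the one-line fix: true when the marks are so
// close together, relative to their position, that the grid should be skipped.
function gridTooFineToDraw(firstMajorMark, majorMarkDistanceWorld) {
  return firstMajorMark / majorMarkDistanceWorld > 1e15;
}

console.log(gridTooFineToDraw(885.0999999642371, 5e-14));  // true  (the crashing zoom level)
console.log(gridTooFineToDraw(885.0999999642371, 1e-13));  // true  (one zoom level earlier)
console.log(gridTooFineToDraw(885.0999999642371, 1e-3));   // false (ordinary zoom level)
```

The 1e15 threshold sits a bit below 2^53 (about 9.0e15), so the check is conservative: it also skips the 1e-13 case, where the loop still terminated but the marks were already landing on adjacent doubles.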

(Image: Two non-experts collaborate to land a fix)

As is pretty common for the sort of changes that I land, my bug fix was one line of code and a six-line comment. I’m only surprised that there wasn’t a 50-line commit message in iambic pentameter, a musical score, and a blog post. Oh wait.

Unfortunately the JavaScript stack frames still don’t get printed in OOM crashes because recording call stacks requires memory and therefore isn’t safe at that point. I’m not sure how I would have investigated this bug now, after the OOM stack frames were fully removed, but I’m sure I would have found a way.

Whether you are a JavaScript developer trying to use extremely large integers, or a test writer trying to use the largest integer, or a UI implementer who wants to allow unbounded zoom, it’s important to remember that when you push the limits of floating-point math you might push through those limits.

Reddit discussion is here.

Twitter thread is here.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048
This entry was posted in Chromium, Computers and Internet, Floating Point.

5 Responses to Floating Point in the Browser, Part 3: When x+y=x (y != 0)

  1. Daniel Neve says:

    Presumably “Out of memory” in this scenario is due to hitting some predefined arbitrary limit, and not because the system itself has nothing left to give you right? Assuming this is the case, could there not be some reserved pool of memory for printing callstacks and whatnot in the event of an OOM error?

    • brucedawson says:

      Yes, in this particular case the call stacks would work fine because, as you say, the process is not at all out of memory. That is, I guess, why it worked. However in “normal” out of memory situations it was actually causing crashes, and I think undefined behavior (nested garbage collection or some-such) so the risks were not justified.

      • Richard says:

        At least one of the core library systems I use reserves “some” memory up-front purely to put crash info (inc. OOM) into.
        Usually the important part of the callstack isn’t very many frames, as it’s generally an unbounded loop or recursion.
        In the latter case you won’t get the originator of course, but it is at least a good Clue.

        This would have to be part of the v8 system, you couldn’t do it in JS.

        OS Core Dumps give you the whole shebang of course, but figuring out a JS VM state from that tends to be infeasible…

  2. akraus1 says:

    I am not (yet) working with chrome but how would one get the Javascript call stack out of a crashed Chrome instance memory dump in the “normal” not OOM case?
    Great finding how rounding can trip off really strange bugs.

    • brucedawson says:

      That’s a good question and I don’t know. I got lucky this time because the relevant JavaScript stack was printed for me (in stable builds) so I didn’t need to investigate how to get a JavaScript stack from the C++-code debugger (windbg or VS). I assume that there is some way to do it, but I don’t know what it is or if it is easy.
