Everything Old is New Again, and a Compiler Bug

“What’s an EXCEPTION_FLT_STACK_CHECK exception?” one of my coworkers said. I said “It’s a weird and rare crash. Why do you ask?”

It turns out that one of these weird and rare crashes had started showing up in Chrome (M54 branch, not the stable branch that consumers are running). We began looking at it together until I decided it made more sense to assign the bug to myself. Partly I did this because the crash analysis required some specialized knowledge that I happened to have, but mostly because I was curious and I thought that this bug was going to be interesting. I was not disappointed.

  • The crash was in a FPU that Chrome barely uses
  • The instruction that crashed Chrome was thousands of instructions away from the one that triggered the exception
  • The instruction that triggered the exception was not at fault
  • The crash only happened because of third-party code running inside of Chrome
  • The crash was ultimately found to be caused by a code-gen bug in Visual Studio 2015

In the end the investigation of this bug went very quickly, but it required multiple bits of obscure knowledge from my past, and therefore I thought it might be interesting to share.

This post describes the process of investigating this bug, and I’ve saved some crash dumps so that you can follow along in the investigation – Chrome’s symbol server makes it easy to see what’s going on. The crash dumps are all linked, with instructions, at the end of this post. A very similar crash dump is what I used to understand and ‘fix’ this bug.

Investigating

The first thing to understand is that imageEXCEPTION_FLT_STACK_CHECK has nothing to do with the CPU’s regular stack. It’s the FLT_STACK, I mean, obviously. More on that shortly.

The crash was in a function called base::MessagePumpWin::GetCurrentDelay in Chrome’s browser process. When looking at an interesting crash I always switch to assembly language (ctrl+F11 toggles this) – otherwise the information is far too vague to allow a reliable diagnosis. The crashing instruction was the fld instruction in the sequence below:

movsd  mmword ptr [ebp-8],xmm0
fld    qword ptr [ebp-8] 
fstp   qword ptr [esp]
call   ceil (11180840h)
fstp   qword ptr [ebp-8] 

The fld instruction is part of the x87 FPU and it loads a floating-point value onto the x87’s peculiar eight-register stack, so it seems initially plausible that it could have caused a FLT_STACK check.

Who designs a stack with just eight entries? Intel. It must have seemed like a good idea at the time.

Since it’s an x87 crash we need to see the state of that FPU. The Visual Studio Registers window defaults to only showing the integer registers, so I had to right-click and enable Floating Point – it’s handy to know that in VS this refers to the x87 FPU, not SSE or AVX. This is just one of many minor tricks needed to investigate this bug.

Aside: VS team, can we talk? The layout of the registers window is terrible. Putting all of the x87 registers on one line and letting word-wrap sort it out may have been good enough for Arnaud Amalric, but I find it pretty annoying. How about consistent indentation, and maybe an explicit line break after ST7? And how about not word-wrapping between ‘e’ and ‘+’ in floating-point numbers! Bug filed.

Here are the relevant registers, crash stack, and the output window from a live repro of this bug:

image

For fld to trigger a stack exception the stack must be full and I think I remembered that TAGS = FFFF means that the stack is empty but I also remembered something far more important. The x87 FPU started life as a coprocessor. Because of this (or maybe it was because of floating-point latencies – doesn’t matter) it doesn’t report exceptions synchronously. So, in the crazy world of x87 floating-point, exceptions are reported on the next x87 instruction that executes. So the fld instruction is not the problem. The problem is the previous x87 instruction executed.

These days most floating-point math is done using SSE instructions. This means that the previous x87 instruction might be a long way away. There was no sign of it in the GetCurrentDelay function. At this point this bug was looking impossible to investigate. But…

The x87 designers weren’t totally cruel and capricious. They dedicated a register to storing the address of the last x87 instruction that executed – the actual problematic instruction! The address of that instruction is referred to in the VS registers window as EIP, not to be confused with the main processor’s EIP register. And this register, it turns out, is crucial. In the screenshot above we can see that EIP equals 0x0FFD83FC but EIP is 0x10C9F30D. They are almost 14 MB away from each other, so it’s a good thing we didn’t try guess-and-go debugging.

To see the code at EIP we just have to paste 0x10C9F30D (the 0x prefix is necessary) into the address bar of the Visual Studio disassembly window. If you have source server enabled then VS will even pull down the appropriate source files, but the assembly language is the main thing we need. The bold instruction in HandleUpdateTimeout near the bottom is the ‘culprit’, but the nearby assembly language also turns out to be relevant:

ProcessPowerCollector::UpdatePowerConsumption
10C9F2B7  push  ebp
10C9F2B8  mov   ebp,esp
10C9F2BA  and   esp,0FFFFFFF8h

10C9F2F7  call  ProcessPowerCollector::RecordCpuUsageByOrigin
10C9F2FC  movsd xmm0,mmword ptr [esp+10h]
10C9F302  pop   edi
10C9F303  pop   esi
10C9F304  mov   esp,ebp
10C9F306  pop   ebp
10C9F307  ret
ProcessPowerCollector::HandleUpdateTimeout
10C9F308  call  ProcessPowerCollector::UpdatePowerConsumption
10C9F30D  fstp  st(0)
10C9F30F  ret

fstp pops a value off the x87’s eight register stack. It would trigger a FLT_STACK check if the stack was empty, so maybe the stack was empty at this time. The next thing to know is that the Windows calling conventions for 32-bit programs say that float and double results should be returned in st(0). Since UpdatePowerConsumption is supposed to return a double this means that the floating-point stack should not be empty.

So next we look at UpdatePowerConsumption and at address 10C9F2FC we see the problem. That instruction moves the return value into xmm0, an SSE register instead of into st(0). Hence the mismatch.

Now, the compiler is allowed to create its own calling conventions if it wants to – within the same binary the only rule is that everybody has to agree on what the rules are. So, the bug is not necessarily that UpdatePowerConsumption is returning the value in the wrong register set. The bug is that the compiler generated a caller and a callee that didn’t agree on how they were going to talk to each other.

Further complicating this bug is that it is difficult to reproduce. I tried a normal build, and then I tried changing from /O1 to /O2, and then I tried an LTCG build and the compiler kept refusing to generate the inconsistent code. Eventually a coworker (thanks Sébastien!) pointed out that if I set full_wpo_on_official=true then more source files would be built with LTCG, and finally I could reproduce the bug. I reported the bug and then, since the optimizer was misbehaving, worked around the issue by disabling optimizations for the two relevant functions. Microsoft was able to reproduce the bug and it sounds like they’ve made some progress in understanding it.

But there remained one mystery: this code was supposed to run every thirty seconds, so how come this crash wasn’t hitting every user of Chrome 54? Well, it turns out that floating-point exceptions are, by default, suppressed. When a floating-point instruction does something bad it usually just returns a best-effort result and continues running. I wrote about this a few years ago and recommended that developers try enabling some of these exceptions in order to flush out bugs that would otherwise be hidden.

So now we know why everyone isn’t crashing, but there remained one mystery: why is anyone crashing? If floating-point exceptions are suppressed then shouldn’t this bug have remained hidden forever?

The answer to that is that in the crazy world of Windows there are a lot of programs that think that injecting their DLLs into all processes is a good idea. Whether malware or anti-malware or something else these injected DLLs end up causing a good portion of all of Chrome’s crashes. And in this case one of these injected DLLs decided that changing the FPU exception flags in somebody else’s process was a good idea. I mean, it could be worse. In this case they exposed a genuine compiler bug, and it’s much more polite than trashing our heap or patching our code, so, thanks?

I verified this by grabbing the FPExceptionEnabler class, modifying it to enable just the _EM_INVALID exception, and then stepping over the _controlfp_s call that enables the exceptions, and seeing which floating point registers changed – they’re highlighted in red. The only one was the CTRL register which went from 0x27F to 0x27E so I knew that clearing the low bit would enable the FPU invalid exceptions. I then installed 32-bit Chrome canary, with the bug, launched it under the Visual Studio debugger, stopped on the HandleUpdateTimeout function, used the registers window to change CTRL to 0x27E, and then let it run to its crash – thousands of instructions later. Theory confirmed.

And with that our story is complete. To recap the weirdness:

  • A FLT_STACK is different from the regular stack
  • A FLT_STACK exception does not occur on the problematic instruction
  • If you enable display of Floating Point registers and navigate to the instruction indicated by the floating-point EIP you can find the real culprit
  • Then you just have to find why the FLT_STACK is messed up. Missing function prototypes used to be the usual reason, but compiler bugs seem more trendy now
  • Floating-point exceptions are normally suppressed
  • Third-party code does rude things to the processes that it visits

Crash dumps

And, as promised, here are the crash dumps. In order to use them you should load them into windbg or Visual Studio and add https://chromium-browser-symsrv.commondatastorage.googleapis.com – Chrome’s symbol server – to the list of symbol servers, so that Chrome’s PDB files are automatically downloaded, as discussed here.

imageYou should probably also enable source server, as discussed here, so that the debuggers will automatically download the appropriate source files. In windbg you type .srcfix and in VS you click a check box, shown above.

With all of the crash dumps I recommend viewing the assembly language (Ctrl+F11 in Visual Studio), opening the Registers window and right-clicking and enable the Floating Point display, so you can see the floating-point CTRL, STAT, and EIP registers.

And, here are the four crash dumps, all less than 136 kB.

  • 1 HandleUpdateTimeout before call.dmp – in this you’ll be on the call to ProcessPowerCollector::UpdatePowerConsumption, and in the Registers window you can see that Ctrl is 0x27F, which is good.
  • image2 HandleUpdateTimeout after call.dmp – in this one you’ll be on the fstp instruction which is about to mess things up. I’ve imagehelpfully changed Ctrl to 0x27E to enable the _EM_INVALID exceptions. Note that STAT is 0x20 and the floating-point EIP register is 0x10088407 – those will both change
  • image3 HandleUpdateTimeout exception triggered.dmp – now execution has moved to imagethe ret instruction and the exception has been triggered. The STAT register value indicates (somehow) that an exception is pending, and EIP indicates the fstp instruction – the last x87 instruction executed
  • 4 GetCurrentDelay crash.dmp – this crash dump captures the actual crash, in a completely different call stack. Unfortunately the crucial floating-point EIP register is zero – crash dumps saved by Visual Studio do not record it. Luckily Chrome’s crashpad does save it or else this bug would have been much more difficult to resolve

While crashpad thankfully does save the x87 state (implemented here) Visual Studio does not. So I filed a bug.

This bug last appeared in Chrome version 55.0.2858.0 canary. If you want to recreate it then instructions are in the connect bug, but get ready for long build times.

If you want more floating-point goodness then you’ll happy to know I’ve written a whole series of articles on this subject. If you want to read more about finding compiler bugs I’ve got just the post for you.

Reddit discussion is here.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048
This entry was posted in Debugging, Floating Point and tagged . Bookmark the permalink.

26 Responses to Everything Old is New Again, and a Compiler Bug

  1. Manel says:

    This looks like a classical code generation / instruction encoding bug. Callee function should return its value into MMX0, which aliases to ST0 (FPU and MMX registers are aliased). Instead, it returns into XMM0, which falls into a totally different register file. Not sure if the FP stack would still be empty, though.

    • brucedawson says:

      I doubt it’s an encoding bug – the encodings for MMX0 and XMM0 are pretty different. It’s a calling convention disagreement.

      And, in case you were wondering, if the result were written to MMX0 then the x87 stack would still be empty and the data would be in the wrong format anyway, so the crash would have been unchanged.

  2. Very interesting post. I think that you should say at the beginning that this concerns 32-bit programs. FYI, the 32-bit x86 ABI seems similar under Linux: I’ve checked that when using the -m32 -mfpmath=sse -msse2 GCC’s options (to select the 32-bit ABI and instruct GCC to use SSE for floating-point operations), there are moves between SSE and x87 registers when functions are called or return. For instance, on “double f(double x) { return sin(x); }”, one can see: the call to sin, then FP stack → memory → xmm0 → memory → FP stack (here, there are no computations on xmm0, so that this no-op chain is suppressed with optimizations enabled). With the 64-bit ABI, SSE is used everywhere for float and double, but the x87 FPU is still used for long double, as extended precision is not provided by SSE.

    Concerning the exceptions, recommending to enable some of them may not be a good idea in all cases: some libraries may not like that. The C standard says: “Certain programming conventions support the intended model of use for the floating-point environment: a function call does not alter its caller’s floating-point control modes, clear its caller’s floating-point status flags, nor depend on the state of its caller’s floating-point status flags unless the function is so documented; a function call is assumed to require default floating-point control modes, unless its documentation promises otherwise; a function call is assumed to have the potential for raising floating-point exceptions, unless its documentation promises otherwise.” Otherwise you may get problems like you saw here with some DLL.

    Also, remember the following security hole in the JavaScript engine of Firefox, which overwrote memory when some plugin had changed the x87 FPU control word to select a reduced precision: https://bugzilla.mozilla.org/show_bug.cgi?id=358569 (CVE-2006-6499).

    • brucedawson says:

      That’s a scary bug! I agree that there are risks for changing the FPU state, which is why I only recommend it for testing, and I’ve had good results from doing that. I opened crbug.com/645700 to investigate doing that for some of Chrome’s tests.

  3. Oliver says:

    Hey Bruce, thanks for this great blog entry. Very interesting.

    I just wanted to add that not just injected DLLs may be at fault, but also kernel mode drivers. It is customary to avoid FP instructions in drivers for one reason: the services a driver provides often get called in the context of a user mode thread. One can explicitly save the FP state and restore it later (KeSaveFloatingPointState, KeSaveExtendedProcessorState), but to do this all the time is considered a rather expensive operation and best avoided. So when writing a driver, one has to be very careful to not mess around with the FP state. In my domain of expertise, file system filter drivers, this is particularly important. Imagine if a file operation inside some spreadsheet application messed up FP state all the time?! Of course initially the application would be blamed. I just wanted to spread the word that this needn’t be the case. But even drivers providing access to hardware devices could be to blame. Depends entirely on the callstack down in kernel mode.

    Also, I think there are legit cases for injecting DLLs via KnownDLLs. For me the most important such case is SuRun.

    • And, in fact, we’ve seen a crash in Firefox due to a buggy driver (Cisco VPN) screwing up our floating point state: https://bugzilla.mozilla.org/show_bug.cgi?id=435756

      • Oliver says:

        Thanks Ted. I’ll bookmark this to refer to if that discussion comes up again with colleagues. Because we have also some code running in kernel mode written by people who are not kernel mode devs. I had created a presentation and handout some years back together with a colleague to inform them about some of the limitations they should be aware about.

      • brucedawson says:

        I had a bad floating-point driver clearing the x87 stack years ago, but that caused crashes that were far more random – you could fld on one instruction and fstp on the next and trigger an exception because the floating-point stack had been emptied in-between! Such severe cases should be very rare though.

        In this case 90% of the crashes contained the same third-party DLL. Correlation is not causation, but it’s very suspicious.

    • yuhong says:

      Probably worth mentioning here too is AVX-SSE transition penalties. With 64-bit drivers, one can use XMM registers without calling KeSaveFloatingPointState. AVX decided to support such drivers by preserving the high values of YMM registers, but there is a cost to do so. The alternative would be to call KeSaveExtendedProcessorState, which must be done if the driver uses any instructions with a VEX prefix.

  4. kamikaze says:

    I have experimented with floating point exceptions before and I found that doing floating point properly can be very expensive.

    I have an application that uses a gauss algorithm to find geometry intersections in 3-D space. The algorithm ignores that the linear equation matrix is not solvable if a path is parallel to the plane it should cross. That leads to a division by zero. So the calling function has to check whether the result set contains NaN/±Inf values.

    As part of my regular development process I activated the compiler’s undefined behaviour canaries, and the gauss algorithm was the only hit in testing. It turned out that the division by zero is undefined rule also applies to FP (which I didn’t know and surprises me). That led me on a bit of literature research that resulted in two findings:

    – The C++ standard does not mandate IEEE 754 FP behaviour
    – All major C++ compilers implement IEEE 754 FP behaviour (more or less, it depends on the hardware)

    I usually adhere to the C++ standard (C++11 in this case) so I preceded the divisions with a zero check to exit the function early. I thought that might actually gain me a bit of performance, but the case was so rare that the opportunity to leave the function early didn’t make up for the additional overhead. Instead of running 300 * 10^6 times/s the function only performed at 200 * 10^6 times/s. Nullptr checks are supposed to be free, FP 0 checks apparently aren’t.

    So I turned off floating point canaries and removed the check and put a «Requires IEEE 754 FP behaviour» stamp on the whole thing.

    • Note that in C, a floating-point division by 0 becomes defined if __STDC_IEC_559__ is defined by the implementation (IEEE 754 support).

      I have a similar problem with one of my codes (where very good performance is very important), but with integer division in internal loops, where the divisor can be 0 in very rare cases. Adding checks would slow the code down, while the behavior of the x86 integer division instruction, with an exception in case of division by 0, is OK for me. But in C, that’s undefined behavior.

    • Krzysiek says:

      “Instead of running 300 * 10^6 times/s the function only performed at 200 * 10^6 times/s.”
      Would it be any better if the early return branch was marked as rare (to aid branch predictor)?

    • John Payson says:

      Division by zero is neither more nor less defined than any other computation which yields a result which is beyond the range that can be represented in a floating-point type. Manually checking against division by zero won’t guard against situations where division by a really small value yields a really big value.

  5. FYI you put the URL to Microsoft’s symbol server where you intended to put Chrome’s. Force of habit? 🙂

  6. As always, great investigative work! Love those posts, thanks for taking the time.

  7. asdf says:

    During the bad old Win95 days, my parents had a scanner that stopped working after they bought a new printer. After investigating, it turned out that the scanner drivers would occasionally divide by zero, which was normally masked by the OS kernel, but that the printer drivers enabled that exception. I don’t really know which party offended me more, and of course neither were willing to fix their drivers as it was obviously the other guys’ fault.

    • brucedawson says:

      Probably an HP printer. I worked on video games back in those days and our games had a print function. If you had certain brands of HP printers then the game would crash after you printed.

      I’m going to take a stand here and say that the fault was 100% HPs. Floating-point divide by zero is usually accidental, but not always – it is a valid way of solving some equations or avoiding special-case checks. Floating-point divide-by-zero is okay. Turning it into a crash in somebody else’s process is definitely *not* okay.

      Note that the filtering of the exceptions is not done by the kernel – it’s done by the FPU. If they are masked then a result is delivered (infinity for divide-by-zero) and execution continues.

      • I work on music notation software, and we’ve been blighted in the past by crashes that were ultimately triggered (though technically not caused by) HP printer drivers that toggle the floating point mask. Being a printer driver it’s part of your process. This is definitely a case of needing to wipe your feet when you are a guest in someone else’s house.

        • brucedawson says:

          Definitely worth filing a bug against HP, and also putting in some defensive measures such as resetting the floating-point state on return from print routines, and calling _clearfp() on the way in to print routines. Good luck.

  8. Alen Ladavac says:

    Fantastic article, as always. I’d like to point out one detail… you mentioned how a lot of 3rd party software seems to inject dlls and a lot of crashes are caused by that code. I notice the same in our games’ crash logs as well. However what may not be apparent is that some of those are not deliberately and directly injecting dlls. They might be using SetWindowsHookEx() and not understanding what’s going on under the hood. Installing a hook implicitly injects the dll where the hook handler resides into all processes.
    The docs tell you that the handler must be in the dll, but doesn’t tell you that it will do the injection on your behalf. It’s bad design on Microsoft part, I’d say, but it’s ancient legacy and probably wouldn’t be possible to fix now. But the docs could be better and should offer alternatives.

  9. factormystic says:

    You should name & shame the offending third party.

  10. yuhong says:

    For fun, look up “fopcode compatibility mode” in Intel manuals.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.