“What’s an EXCEPTION_FLT_STACK_CHECK exception?” one of my coworkers said. I said “It’s a weird and rare crash. Why do you ask?”
It turns out that one of these weird and rare crashes had started showing up in Chrome (M54 branch, not the stable branch that consumers are running). We began looking at it together until I decided it made more sense to assign the bug to myself. Partly I did this because the crash analysis required some specialized knowledge that I happened to have, but mostly because I was curious and I thought that this bug was going to be interesting. I was not disappointed.
- The crash was in an FPU that Chrome barely uses
- The instruction that crashed Chrome was thousands of instructions away from the one that triggered the exception
- The instruction that triggered the exception was not at fault
- The crash only happened because of third-party code running inside of Chrome
- The crash was ultimately found to be caused by a code-gen bug in Visual Studio 2015
In the end the investigation of this bug went very quickly, but it required multiple bits of obscure knowledge from my past, and therefore I thought it might be interesting to share.
This post describes the process of investigating this bug, and I’ve saved some crash dumps so that you can follow along in the investigation – Chrome’s symbol server makes it easy to see what’s going on. The crash dumps are all linked, with instructions, at the end of this post. A very similar crash dump is what I used to understand and ‘fix’ this bug.
The crash was in a function called base::MessagePumpWin::GetCurrentDelay in Chrome’s browser process. When looking at an interesting crash I always switch to assembly language (Ctrl+F11 toggles this in Visual Studio) – otherwise the information is far too vague to allow a reliable diagnosis. The crashing instruction was the fld instruction in the sequence below:
movsd mmword ptr [ebp-8],xmm0
fld qword ptr [ebp-8]
fstp qword ptr [esp]
call ceil (11180840h)
fstp qword ptr [ebp-8]
The fld instruction is part of the x87 FPU and it loads a floating-point value onto the x87’s peculiar eight-register stack, so it seems initially plausible that it could have caused a FLT_STACK check.
Who designs a stack with just eight entries? Intel. It must have seemed like a good idea at the time.
Since it’s an x87 crash we need to see the state of that FPU. The Visual Studio Registers window defaults to only showing the integer registers, so I had to right-click and enable Floating Point – it’s handy to know that in VS this refers to the x87 FPU, not SSE or AVX. This is just one of many minor tricks needed to investigate this bug.
Aside: VS team, can we talk? The layout of the registers window is terrible. Putting all of the x87 registers on one line and letting word-wrap sort it out may have been good enough for Arnaud Amalric, but I find it pretty annoying. How about consistent indentation, and maybe an explicit line break after ST7? And how about not word-wrapping between ‘e’ and ‘+’ in floating-point numbers! Bug filed.
Here are the relevant registers, crash stack, and the output window from a live repro of this bug:
For fld to trigger a stack exception the stack must be full, and I think I remembered that TAGS = FFFF means that the stack is empty, so fld itself could not have overflowed it. But I also remembered something far more important. The x87 FPU started life as a coprocessor. Because of this (or maybe it was because of floating-point latencies – doesn’t matter) it doesn’t report exceptions synchronously. Instead, in the crazy world of x87 floating-point, an exception is reported on the next x87 instruction that executes. So the fld instruction is not the problem. The problem is the previous x87 instruction that executed.
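The TAGS reasoning above can be checked mechanically. What follows is a minimal sketch, assuming the documented x87 tag-word layout (two bits per physical register, with the bit pattern 11 meaning "empty"). The names X87Tag, TagForRegister, and CountEmptyRegisters are made up for illustration and are not from Chrome or Visual Studio.

```cpp
#include <cassert>
#include <cstdint>

// Two-bit tag values used in the x87 tag word.
enum class X87Tag : std::uint8_t { Valid = 0, Zero = 1, Special = 2, Empty = 3 };

// Returns the tag of physical register `reg` (0..7) from the 16-bit tag word.
// Note: these are physical registers, not ST(i) positions relative to the top.
X87Tag TagForRegister(std::uint16_t tag_word, int reg) {
  assert(reg >= 0 && reg < 8);
  return static_cast<X87Tag>((tag_word >> (reg * 2)) & 0x3);
}

// Counts how many of the eight registers are marked empty.
int CountEmptyRegisters(std::uint16_t tag_word) {
  int empty = 0;
  for (int reg = 0; reg < 8; ++reg) {
    if (TagForRegister(tag_word, reg) == X87Tag::Empty) ++empty;
  }
  return empty;
}
```

Calling CountEmptyRegisters(0xFFFF) reports all eight registers as empty, which is why an overflow on fld is implausible and an underflow somewhere earlier is the more likely story.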
These days most floating-point math is done using SSE instructions. This means that the previous x87 instruction might be a long way away. There was no sign of it in the GetCurrentDelay function. At this point this bug was looking impossible to investigate. But…
The x87 designers weren’t totally cruel and capricious. They dedicated a register to storing the address of the last x87 instruction that executed – the actual problematic instruction! The address of that instruction is shown in the floating-point section of the VS registers window as EIP, not to be confused with the main processor’s EIP register. And this register, it turns out, is crucial. In the screenshot above we can see that the main processor’s EIP equals 0x0FFD83FC but the floating-point EIP is 0x10C9F30D. They are more than 13 MB away from each other, so it’s a good thing we didn’t try guess-and-go debugging.
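To make the deferred reporting concrete, here is a toy model of it. This is only a sketch under simplifying assumptions: it tracks stack depth rather than values, models only the stack-fault case, and the names ToyX87 and ReplayCrash are hypothetical. The bit positions follow the documented x87 control and status word layouts, and the two addresses are the EIP values quoted above.

```cpp
#include <cstdint>
#include <stdexcept>

// Toy model only: NOT a real x87 emulator.
struct ToyX87 {
  static constexpr std::uint16_t kInvalidMask  = 0x0001;  // CTRL bit 0 (IM)
  static constexpr std::uint16_t kInvalidFlag  = 0x0001;  // STAT bit 0 (IE)
  static constexpr std::uint16_t kStackFault   = 0x0040;  // STAT bit 6 (SF)
  static constexpr std::uint16_t kErrorSummary = 0x0080;  // STAT bit 7 (ES)

  std::uint16_t ctrl = 0x027F;  // all exceptions masked
  std::uint16_t stat = 0;
  std::uint32_t last_ip = 0;    // the "floating-point EIP"
  int depth = 0;                // occupied registers, 0..8

  // Every x87 instruction first reports any pending unmasked exception,
  // and only then records its own address.
  void Begin(std::uint32_t ip) {
    if (stat & kErrorSummary) throw std::runtime_error("FLT_STACK_CHECK");
    last_ip = ip;
  }
  void Fault() {
    stat |= kInvalidFlag | kStackFault;
    if (!(ctrl & kInvalidMask)) stat |= kErrorSummary;  // pending, not yet raised
  }
  void Fld(std::uint32_t ip) {
    Begin(ip);
    if (depth == 8) Fault(); else ++depth;
  }
  void Fstp(std::uint32_t ip) {
    Begin(ip);
    if (depth == 0) Fault(); else --depth;
  }
};

// Replays the sequence from this post. Returns the floating-point EIP at the
// moment the exception is reported, or 0 if no exception is reported.
std::uint32_t ReplayCrash(std::uint16_t ctrl) {
  ToyX87 fpu;
  fpu.ctrl = ctrl;
  fpu.Fstp(0x10C9F30D);    // pop from an empty stack: the real culprit
  try {
    fpu.Fld(0x0FFD83FC);   // much later: the instruction that "crashes"
  } catch (const std::runtime_error&) {
    return fpu.last_ip;    // still points at the fstp
  }
  return 0;
}
```

In this model ReplayCrash(0x027E) reports the exception on the fld while last_ip still holds the address of the fstp, and ReplayCrash(0x027F) runs to completion because the invalid exception is masked.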
To see the code at the floating-point EIP we just have to paste 0x10C9F30D (the 0x prefix is necessary) into the address bar of the Visual Studio disassembly window. If you have source server enabled then VS will even pull down the appropriate source files, but the assembly language is the main thing we need. The final instruction in the listing, the fstp in HandleUpdateTimeout, is the ‘culprit’, but the nearby assembly language also turns out to be relevant:
10C9F2B7 push ebp
10C9F2B8 mov ebp,esp
10C9F2BA and esp,0FFFFFFF8h
10C9F2F7 call ProcessPowerCollector::RecordCpuUsageByOrigin
10C9F2FC movsd xmm0,mmword ptr [esp+10h]
10C9F302 pop edi
10C9F303 pop esi
10C9F304 mov esp,ebp
10C9F306 pop ebp
10C9F308 call ProcessPowerCollector::UpdatePowerConsumption
10C9F30D fstp st(0)
fstp pops a value off the x87’s eight-register stack. It would trigger a FLT_STACK check if the stack were empty, so maybe the stack was empty at this time. The next thing to know is that the Windows calling conventions for 32-bit programs say that float and double results should be returned in st(0). Since UpdatePowerConsumption is supposed to return a double this means that the floating-point stack should not be empty.
So next we look at UpdatePowerConsumption and at address 10C9F2FC we see the problem. That instruction moves the return value into xmm0, an SSE register, instead of into st(0). The caller’s fstp therefore pops an x87 stack that nothing was ever pushed onto. Hence the mismatch.
Now, the compiler is allowed to create its own calling conventions if it wants to – within the same binary the only rule is that everybody has to agree on what the rules are. So, the bug is not necessarily that UpdatePowerConsumption is returning the value in the wrong register set. The bug is that the compiler generated a caller and a callee that didn’t agree on how they were going to talk to each other.
Further complicating this bug is that it is difficult to reproduce. I tried a normal build, and then I tried changing from /O1 to /O2, and then I tried an LTCG build and the compiler kept refusing to generate the inconsistent code. Eventually a coworker (thanks Sébastien!) pointed out that if I set full_wpo_on_official=true then more source files would be built with LTCG, and finally I could reproduce the bug. I reported the bug and then, since the optimizer was misbehaving, worked around the issue by disabling optimizations for the two relevant functions. Microsoft was able to reproduce the bug and it sounds like they’ve made some progress in understanding it.
But there remained one mystery: this code was supposed to run every thirty seconds, so how come this crash wasn’t hitting every user of Chrome 54? Well, it turns out that floating-point exceptions are, by default, suppressed. When a floating-point instruction does something bad it usually just returns a best-effort result and continues running. I wrote about this a few years ago and recommended that developers try enabling some of these exceptions in order to flush out bugs that would otherwise be hidden.
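The default "best-effort result" behavior is easy to observe. This is a minimal sketch using the standard C++ floating-point environment header instead of the MSVC-specific control-word functions. The function name InvalidOperationIsSilent is made up for illustration, and the volatile variables are there only to keep the compiler from folding the division at compile time.

```cpp
#include <cfenv>
#include <cmath>

// Returns true if an invalid operation (0.0 / 0.0) quietly produced a NaN and
// only set the sticky FE_INVALID flag, instead of raising an exception.
bool InvalidOperationIsSilent() {
  std::feclearexcept(FE_ALL_EXCEPT);
  volatile double zero = 0.0;
  volatile double result = zero / zero;  // invalid operation
  return std::isnan(result) && std::fetestexcept(FE_INVALID) != 0;
}
```

With the default floating-point environment this is expected to return true: the program keeps running and the only trace of the problem is a status flag that almost nobody checks.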
So now we know why everyone isn’t crashing, but that raises the opposite question: why is anyone crashing? If floating-point exceptions are suppressed then shouldn’t this bug have remained hidden forever?
The answer to that is that in the crazy world of Windows there are a lot of programs that think that injecting their DLLs into all processes is a good idea. Whether malware or anti-malware or something else these injected DLLs end up causing a good portion of all of Chrome’s crashes. And in this case one of these injected DLLs decided that changing the FPU exception flags in somebody else’s process was a good idea. I mean, it could be worse. In this case they exposed a genuine compiler bug, and it’s much more polite than trashing our heap or patching our code, so, thanks?
I verified this by grabbing the FPExceptionEnabler class, modifying it to enable just the _EM_INVALID exception, and then stepping over the _controlfp_s call that enables the exceptions, and seeing which floating point registers changed – they’re highlighted in red. The only one was the CTRL register which went from 0x27F to 0x27E so I knew that clearing the low bit would enable the FPU invalid exceptions. I then installed 32-bit Chrome canary, with the bug, launched it under the Visual Studio debugger, stopped on the HandleUpdateTimeout function, used the registers window to change CTRL to 0x27E, and then let it run to its crash – thousands of instructions later. Theory confirmed.
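The 0x27F versus 0x27E difference can be decoded by hand, or with a few lines of code. This sketch assumes the documented layout of the x87 control word, where bits 0 through 5 are the exception mask bits and a cleared bit means the exception is enabled. UnmaskedExceptions is a hypothetical helper name, not an existing API.

```cpp
#include <cstdint>
#include <string>

// Returns a comma-separated list of the x87 exceptions that are unmasked
// (enabled) in the given control word. A set mask bit suppresses the exception.
std::string UnmaskedExceptions(std::uint16_t ctrl) {
  static const char* const kNames[6] = {
      "invalid", "denormal", "zero-divide", "overflow", "underflow", "precision"};
  std::string result;
  for (int bit = 0; bit < 6; ++bit) {
    if ((ctrl & (1u << bit)) == 0) {
      if (!result.empty()) result += ",";
      result += kNames[bit];
    }
  }
  return result;
}
```

UnmaskedExceptions(0x027F) returns an empty string (everything masked), while UnmaskedExceptions(0x027E) returns "invalid", matching the observation that clearing the low bit is what turns the hidden stack fault into a crash.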
And with that our story is complete. To recap the weirdness:
- A FLT_STACK is different from the regular stack
- A FLT_STACK exception does not occur on the problematic instruction
- If you enable display of Floating Point registers and navigate to the instruction indicated by the floating-point EIP you can find the real culprit
- Then you just have to find why the FLT_STACK is messed up. Missing function prototypes used to be the usual reason, but compiler bugs seem more trendy now
- Floating-point exceptions are normally suppressed
- Third-party code does rude things to the processes that it visits
And, as promised, here are the crash dumps. In order to use them you should load them into windbg or Visual Studio and add https://chromium-browser-symsrv.commondatastorage.googleapis.com – Chrome’s symbol server – to the list of symbol servers, so that Chrome’s PDB files are automatically downloaded, as discussed here.
You should probably also enable source server, as discussed here, so that the debuggers will automatically download the appropriate source files. In windbg you type .srcfix and in VS you click a check box, shown above.
With all of the crash dumps I recommend viewing the assembly language (Ctrl+F11 in Visual Studio), opening the Registers window, and right-clicking to enable the Floating Point display, so you can see the floating-point CTRL, STAT, and EIP registers.
And, here are the four crash dumps, all less than 136 kB.
- 1 HandleUpdateTimeout before call.dmp – in this you’ll be on the call to ProcessPowerCollector::UpdatePowerConsumption, and in the Registers window you can see that Ctrl is 0x27F, which is good.
- 2 HandleUpdateTimeout after call.dmp – in this one you’ll be on the fstp instruction which is about to mess things up. I’ve helpfully changed Ctrl to 0x27E to enable the _EM_INVALID exceptions. Note that STAT is 0x20 and the floating-point EIP register is 0x10088407 – those will both change
- 3 HandleUpdateTimeout exception triggered.dmp – now execution has moved to the ret instruction and the exception has been triggered. The STAT register value indicates (somehow) that an exception is pending, and EIP indicates the fstp instruction – the last x87 instruction executed
- 4 GetCurrentDelay crash.dmp – this crash dump captures the actual crash, in a completely different call stack. Unfortunately the crucial floating-point EIP register is zero – crash dumps saved by Visual Studio do not record it. Luckily Chrome’s crashpad does save it or else this bug would have been much more difficult to resolve
This bug last appeared in Chrome version 55.0.2858.0 canary. If you want to recreate it then instructions are in the Connect bug, but get ready for long build times.
If you want more floating-point goodness then you’ll be happy to know I’ve written a whole series of articles on this subject. If you want to read more about finding compiler bugs I’ve got just the post for you.
Reddit discussion is here.