Last month I wrote about an odd crash that was hitting a few Chrome users. Something was corrupting the XMM7 register and that was causing Chrome to crash. We fixed a couple of bugs in Chrome and we were able to contact the third-party company whose software was causing the problems. They released a fixed version, and I assumed that my work was done.
However, instead of a gradual decline in the rates of this crash I saw a gradual increase. Apparently enterprise software updates roll out extremely slowly, and users were installing the old buggy version faster than they were updating to the fixed version. This situation will resolve itself eventually, but a lot of crashes were going to happen in the meantime. With the proper fix moving slowly through the pipelines I decided to try to hack in a decidedly improper fix.
The crashes happened because the compiler generated code that zeroed XMM7 and then expected (reasonably enough) that the register would retain its value. I had to convince the compiler to behave differently.
The problematic line of code was this one:
The std::move was supposed to move the pointer in task_source to the function parameter, and zero out task_source. The zeroing was being done with XMM7 and was failing when XMM7 was corrupted.
Attempt #1: NOINLINE
As it turns out the moving and zeroing was actually happening in the RegisteredTaskSource move constructor. This was a separate function in a separate source file so normally it would not be able to make assumptions about the value of XMM7, but in our highly-optimized official builds we do Link Time Optimization (LTO) which can do cross-module inlining, so the body of the constructor was inserted into RunWorker which then allowed it to make these assumptions about XMM7. If I tagged this function as NOINLINE (shown in this candidate change list (CL)) then the assumption would go away, and with it the crash.
I created a CL that did this, and looked at the generated code in our official builds and realized that there was another function that was inlined that was making the same assumption about XMM7. I tagged that one as NOINLINE and looked at the generated code again and found another function that was inlined that was making the same assumption about XMM7.
Playing whack-a-mole is not a great strategy for controlling the compiler. Even if I coerced the compiler into not depending on the XMM registers retaining their values the fix would not be robust. Any little change to the source code or the compiler could cause the bug to return, with arbitrarily bad memory corruption possibilities.
Attempt #2: optimize off
My next attempt was using a larger and more robust hammer. By telling the compiler to not optimize the RunWorker function at all I could be pretty certain that it wouldn’t use any XMM registers. This method worked fine, and while shipping non-optimized code is normally undesirable it would have been fine in this case. The function in question was not performance sensitive and it would have worked. I wrote up an enormous comment block to explain why I was doing this strange and wonky thing and sent the change out for review.
Attempt #3: it’s clobbering time
Code-review can be a great process. It helps to maintain high code quality, and it gives other software developers a chance to offer suggestions. And they did. One code reviewer told me how to get what I wanted, and the other code reviewer told me where to put it.
One developer suggested that there might be a better option that “optimize off”. The root cause of the crashes was that XMM7 (and probably other XMM registers) were being corrupted. What if we told the compiler that this was happening? It turns out that the gcc/clang compilers offer a syntax for doing exactly that. These compilers allow inline assembly language, and in order for the inline code to coexist with the surrounding C/C++ code the asm statement includes an optional “clobbers” section. This is a list of registers that the assembly language may have written to (used to fix Chrome’s own XMM7 clobbering), and this is exactly what was needed. All I needed was this inline assembly block:
: “%xmm6”, “%xmm7”, “%xmm8”, “%xmm9”, “%xmm10”, “%xmm11”,
“%xmm12”, “%xmm13”, “%xmm14”, “%xmm15”);
This cryptic syntax says that the (empty) inline assembly block may have modified the XMM registers from XMM6 to XMM15 and that therefore the compiler should not assume anything about their values. Placing this right after the call to the arbitrary tasks ensures that the compiler will make no assumption about these registers. This is almost a perfect match for the problem.
The other developer who reviewed the change suggested that I move the magic asm block to a different location (base::TaskAnnotator::RunTaskImpl). The original function that was crashing was just one place where tasks (that might clobber XMM registers) were called, and he told me where to put the asm block so that it would follow shortly after all calls to the potentially problematic tasks. This new location would prevent the crashes that we were seeing, and possible future crashes that might happen elsewhere.
I looked at the assembly language generated by various versions of my fixes in order to ensure that I was getting the desired results. The easiest way to do this was to load chrome.dll into windbg and then disassemble the problematic function. This command-prompt and then windbg set of commands does the trick:
> windbg -z chrome.dll
0:000> uf chrome!base::internal::WorkerThread::RunWorker
With the NOINLINE solution I could see fewer uses of XMM registers that were assumed to still be zero, but I did not get it to zero. With the NOOPT solution I saw that all use of XMM registers went away, although the corrupt values would be retained forever. With the asm block solution (initially tested in the RunWorker function, and then moved) I could see two changes:
One change was that the pattern of zeroing an XMM register before the loop and then using it in the loop went away. That makes sense because the asm directive explicitly told the compiler that that was not going to work.
The other change was that the RunWorker function started preserving the registers from XMM6 to XMM15 in the function prologue, and then restoring them in the epilogue. This makes sense because the clobber entries in the asm block tell the compiler that the XMM registers were clobbered, and the compiler is in charge of following the platform ABI. That is, the compiler needs to ensure that the functions that call RunWorker don’t see registers being corrupted. So, once the fix was moved to RunTaskImpl the RunTaskImpl function would preserve and restore the non-volatile XMM registers and any calling functions would no longer see register corruption. Perfect!
Fixed, but should we?
After a few days of the asm block hack shipping to Chrome I can see that it has completely worked around the bug. Crashes in the canary (daily builds) channel have gone to zero even as the crash rate for the regular builds continue to climb. So, all good, right?
It’s complicated. I’m pleased that we were able to address a pain point for our customers, but working around third-party bugs is not something that we want to get in the habit of doing. Doing so creates perverse incentives. As a general policy Chromium does not add patches to work around third-party bugs. We do not support having third-party code injected into our processes and we do not “fix” the crashes which that causes because that road leads to madness. As a general rule, if some third-party code (very often security software) is causing Chrome to crash then users should request a fix from their security vendor, should uninstall that software, or should configure it to not invade Chrome’s processes.
I decided to push for this particular fix for a few reasons. The main reason was that the vendor has already pushed a fix. That is crucial. This asm hack is just a temporary workaround until their users install the updates. The secondary reason is that there was no obvious way for our mutual users to realize that Trellix disk encryption was the problem. The crashes happened even when no Trellix modules were injected into our processes – I don’t know how – which meant Chrome was going to get blamed, with no easy way for us to explain ourselves.
I hope that this was the correct decision, and I hope that we are able to remove the clever-but-horrible asm hack after not too long.
Twitter announcement is here, announcement of the previous post was here.
Thanks for a great write up, and an extremely interesting read.
Considering the WinDBG screenshot you’ve shared, with the ASM code that is now produced by the compiler — instead of simply reading and writing to and from XMM7 — did you manage to measure how significant is the performance hit produced by that change?
I’m sure that for a single function, which isn’t frequently invoked, the change is minuscule. But, considering the fact that the XMM “restriction” is enforced on a translation-unit level (task_annotator.cc) — the impact might be real, right?
PS – I’m not sure how frequently the functionality in task_annotator.cc is invoked, so excuse me in advance if the question is irrelevant.
The XMM “restriction” isn’t enforced for an entire translation unit, it is at one location in one function. Roughly speaking the cost is the time to save and then restore the ten affected registers to the stack on each call to this function which is on the order of a half-dozen nanoseconds, plus the ~160 bytes of cache-pollution. If this was in a frequently called function then it could be a concern, but it’s not.
Put another way, the cost of preserving the registers is significantly lower than the cost of allocating memory which Chrome does all the time.
So, the question is important and relevant, but I’m confident that the cost is tiny.