Please Restore Our Registers When You’re Done With Them

“Hey, you. Yes you, that function over there. When you’re cleaning up please remember to restore all of my registers. Yes, that one too – what do you think this is, Linux?”

That’s the problem I was dealing with in a nutshell. Functions are required by a platform’s ABI (Application Binary Interface) to preserve certain registers – restoring them if they were used – but the set of registers that must be restored varies between platforms, and the rules on Linux are different from those on Windows. That may be why I encountered register corruption in Chrome on Windows. But let’s take a step back.

I wasA crashed phone showing a Windows BSOD asked to look at a crash bug in Chrome. The crash was strongly correlated with the injection of third-party DLLs into Chrome’s processes (something that we cannot and do not support) so it was highly likely that the crash was caused by those third-party DLLs, but I still wanted to understand what was going on.

Some of my coworkers had investigated the bug previously and added some extra tests, so the crash had been isolated down to this pseudo-code (actual code is here):

while (StillRunning()) {
  DoLotsOfStuff();
  ImportantFunction(std::move(m_ptr));
  CHECK(!m_ptr);
}

The code above represents a long running loop that does a bunch of stuff. At the end of each iteration it calls a function and moves a smart pointer to the function parameters, which should zero out the source pointer. The crash happened when the CHECK statement noticed that the source pointer was, in fact, not zeroed – we intentionally crash then to avoid memory corruption.

The thing that made me want to investigate was the peculiarity of the behavior of the last two lines. How could we zero a pointer on one line of code, and then find it non-zero on the next line? Even if third-party DLLs are the culprit, how could they do that? I’d checked for any modification of the code bytes near the crash and found nothing, so, how? I wanted to understand.

The truth is in the crash dump

As usual I downloaded one of the crash dumps and looked at the assembly-language instructions which implemented the C++ source code. Here, in a mixture of pseudo-code and assembly language, is what I found the code had been translated to (see comment#61 in the bug for details):

xorps xmm7, xmm7   ; Zero register xmm7
while (StillRunning()) {
  DoLotsOfStuff();
  mov rax, QWORD PTR[rsp + 50h]
  movaps QWORD PTR[rsp + 50h], xmm7   ; zero m_ptr
  call ImportantFunction
  CrashIfNonZero(rsp+50h);
}

Apologies for the mixing of metaphors, but the key point is that the compiler decided to zero the XMM7 register (one of the SSE registers) before the loop started. Then, at the end of each loop iteration it would use XMM7 to zero out m_ptr (stored at the address rsp+50h). The compiler expected XMM7 to remain zero. It did not.

I looked at a large set of crash dumps to see if there was any pattern to the values in XMM7. Here are four of the values I found:

  • 96 12 54 91 ca c8 18 ef 98 e8 77 c9 6e 5d ce ee
  • c5 1e 15 13 00 a0 94 5b 37 a5 f3 55 a8 7e 8d 7d
  • 54 39 1f 15 3e bf 13 3e 58 98 fd 6d 64 a3 5a 27
  • 04 df 90 27 02 94 4c ed 73 65 1d 61 af da 33 36

If there is a pattern in these numbers then I sure can’t see it. This randomness is another clue that constrains what the problem could be caused by.

ABIs matter

Mount St. Helens crater - snow near fireThe functions DoLotsOfStuff and ImportantFunction and all of the functions that they call are required by the Windows ABI to preserve XMM7 (this is not the case on Linux). If they use it then they have to restore it. But one of them was not (or the stack location where they preserved the register got trashed, but that seems less likely). In most of the crashes third-party DLLs were present in the Chrome process. These DLLs are presumed to be hooking Chrome or OS functions, and their injected code was presumed to be corrupting XMM7.

I tweeted about this, trying to get theories for how this could happen. Amidst the various replies talking about ISRs, DPCs, and drivers I saw a reply from someone I’d never talked to before saying, basically, “What about this Chromium code?”

I saw this tweet on my home laptop and when I went to look on my work machine the author had already deleted it. My curiosity piqued I reached out to them in a DM. They said that they had thought the code looked suspicious but then realized that the problem was understood by the developers and that the code wasn’t actually compiled in to Chrome on Windows. Such is the challenge of looking for misuse of XMM7 in Chromium’s source code – there are too many references (over 17,000), and most of them are irrelevant.

They then mentioned that they were switching to binary analysis using IDA Pro and had found a couple of functions that were shipping in chrome.dll but weren’t restoring XMM7 – they sent source-code links to what looked like real bugs. This is a time when binary analysis is actually easier than “use the source,” because the machine code has had all macros and #ifdefs processed and represents what actually ships.

I decided to replicate their work using “dumpbin /disasm” and some simple Python code to scan the output. For every function in Chrome (found by looking for global symbols in the disassembly output) my script checked to see whether XMM7 was used without being saved. Initially I checked to see if it was written relative to rsp prior to its first use, but I found it being written relative to rax and rbp as well, so I eventually loosened my heuristics. My script still gives false positives, and may also give false negatives, but it works well enough to be useful.

Despite my initial assumption that this was a third-party bug, my simple script found several suspicious functions. There were roughly three categories of functions where the first use of XMM7 was not restoring it:

  1. Functions such as dav1d_iflipadst_16x8_internal_16bpc_sse4 (from here?) that are internal-use functions by the dav1d library. These functions are all called by wrappers that preserve and restore XMM7, therefore they are fine.
  2. Functions such as __longjmp_internal which by design restores all of the non-volatile registers so that it can return to a previous execution state.
  3. Buggy code that is built in to Chromium

Mauna Loa crater, snow in Hawaii. Okay, I really like volcanoes!Using this crude binary-analysis technique I ultimately found the same two buggy functions in chrome.dll that my twitter accomplice had found.

The function ScaleRowUp2_Bilinear_12_SSSE3 in WebRTC was putting the constant 0x0008000800080008 into XMM7 without first preserving. This is a bug, and could cause crashes, but I knew that this wasn’t the cause of this crash because the XMM7 values I was seeing were highly random. I reported the issue to the author and they filed a bug and fixed it within 24 hours.

DyadicBilinearQuarterDownsampler_sse in openh264 was also using XMM7 without preserving it. Video codecs often deal with values with high entropy so maybe this could have been producing the random values I was seeing (spoiler alert: it wasn’t) and it was certainly wrong. I filed a bug for this and then decided to fix it. Landing this fix came with a couple of challenges:

  1. The bug was in an assembly language file that used multiple macros to ensure cross-platform correctness. Thus I had to figure out (by examining nearby functions) the correct incantations to preserve the registers when needed. It wasn’t too bad, but it’s always weird when I’m writing code in a language that I effectively don’t know at all. Pattern recognition FTW. Anyway, the two line fix worked.
  2. Fixing the bug in openh264 does not immediately help Chrome because Chromium uses a pinned copy of third-party libraries, so I needed to “roll” in the latest version of openh264. Sometimes there is an auto-roller that does this regularly, but openh264 lacked this. It had been six months since the last roll of openh264, and in the interim somebody had moved the public header files to a new directory. Since Chromium and another third-party project (WebRTC) were both including headers from this renamed directory there was an eight step process (one, two, three, four, five, six, seven, eight) to bring in the new version without breaking anything in WebRTC or Chromium. The basic technique was to do conditional includes in WebRTC, plus waiting for auto-rolls of WebRTC into Chromium and vice-versa.

The WebRTC and openh264 issues were genuine bugs, and fixing them will probably prevent future crashes in Chromium, but it made no difference to the issue that I was investigating. The crashes continued. Third-party software continued to be the most likely explanation.

Roadside chapel in FranceThere had been multiple hints as to what type of third-party software could be the problem. It has to be something that creates highly randomized data. There was an apparent correlation with third-party disk encryption software. One customer I was investigating the crashes with was using a particular third-party disk encryption product, and Microsoft had noticed a correlation with tasks that do filesystem work. Attempts were made to contact the vendor.

Update: the vendor (McAfee/Trellix) was contacted and they have released a fix to their Drive Encryption product.

I am glad that the root cause has been fixed, but also, I would like it if developers working on product that use assembly language could audit their code to make sure they are honoring the Windows ABI. This isn’t the first instance of this class of bug, and it won’t be the last.

My motivation

I decided to write this up because I thought that the journey was interesting, but also because the journey is not over. There could be other registers that are not being properly saved and restored in Chromium. There could be other projects that are making this mistake, in some cases perhaps not aware of the differences between the Linux and Windows ABI. Any software rules that are not tested and enforced are inevitably broken and I am not aware of any consistent testing to detect ABI violations. More bugs of this type seem inevitable.

Summary

These crashes started happening around the M91 versions of Chrome. This at first looked like a Chrome bug, but it now seems more likely that it was because the compiler or the Chromium code changed to make it vulnerable to XMM7 register corruption that was already happening in the ecosystem. That is, prior to M91 Chrome wasn’t using XMM7 at all in the RunWorker function (I checked), and starting with M91 the code-gen changed (compiler change?) and the function started relying on XMM7 staying zeroed for hours at a time. So please restore our registers when you’re done with them.

And thanks again to Dougall on twitter for poking at this problem and inspiring me to dig deeper.

Note that there is now a follow-up blog post, explaining how we made the crashes stop even while the register corruption continues.

Twitter announcement

Hacker news discussion

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048
This entry was posted in Chromium, Debugging, Investigative Reporting, Symbols and tagged , , . Bookmark the permalink.

25 Responses to Please Restore Our Registers When You’re Done With Them

  1. Greg Law says:

    Interesting journey! I’m curious, did you consider WinDBG Time Travel Debugging for this? Feels like that would have been a good fit.

    • brucedawson says:

      I didn’t consider it. The main problem is that I didn’t have a local repro and the bug took a while to manifest. So, I would have had to convince the customer to install and run the tool in their environment, which I don’t think would have worked.
      For many crash investigations I don’t have access to any customers – just crash dumps.

  2. rpavlik says:

    I’m glad I’m not the only one who has used reverse engineering tools (in my case, radare2 long ago, and Ghidra recently) to investigate my own code or other code I have sources for. I do try to avoid writing in assembly though.

    • brucedawson says:

      I agree about avoiding writing assembly. It is needed sometimes (high-performance cryptography for instance) but it sure brings along a lot of headaches. I read it a lot (in crash dumps, mostly) but I’m happy to not have written any in a while.
      I think the only assembly language code that I am responsible for is the high-precision math routines used in Fractal eXtreme, and those are mostly generated by C# code, which adds an extra layer of complexity to their maintenance.

  3. George B. says:

    Do you understand that contents of m_ptr is undefined after calling std::move(m_ptr) ?
    You can only assign to m_ptr, not to dereference it.
    No need to go down assembly code.
    Just try to avoid errors when write in C++ — use compiler for error checking, static code analyzers are also helpful .
    For windows please consult:
    https://devblogs.microsoft.com/cppblog/new-code-analysis-checks-in-visual-studio-2019-use-after-move-and-coroutine/

    • david rogers says:

      yeah, foo(std::move(x)) may or may not move x, depends on callee.

      • brucedawson says:

        I was wondering if anybody would comment on the move issues – two separate issues in fact:
        1) We were assuming that m_ptr was zero after moving from it. That is definitely not something that should be done in normal code. However this was debugging-only code and we control the compiler and C++ implementation and we happened to know that it works. Not a good general practice, but I’m okay with it here.
        We do use some static code analysis, and asan and similar. None of them found this XMM7 bug, so drastic measures were needed. We don’t use the VC++ code analysis because VC++ cannot compile our code – pity.
        I’m confused by your comment about “no need to go down to assembly code” because in fact there was a tremendous need to do that. Without that the bugs would not have been found and fixed.
        2) We were assuming that m_ptr was actually being moved. The thing about that assumption is that it will either be true always or never, everywhere or nowhere. If m_ptr was not moved then the code would fail and we would fix it. So that is less of a concern than point #1.

    • neilhendo says:

      Why do you think it’s undefined?

      First, you don’t know what the type of `m_ptr` is. It’s probably their own smart pointer type. In which case, I suspect their move constructor is implemented similar to this:

      SmartPointer(SmartPointer &&other) noexcept
      : ptr(std::exchange(other.ptr, nullptr)
      {}

      Second, even if `m_ptr` was std::unique_ptr, the Standard specifies that the moved-from unique_ptr is set to nullptr. (See [unique_ptr] section.)

      I think you may be referring to this from the [lib.types.movedfrom] section:

      “Objects of types defined in the C++ standard library may be moved from (15.8). Move operations may be explicitly specified or implicitly generated. Unless otherwise specified, such moved-from objects shall be placed in a valid but unspecified state.”

      Note that “valid but unspecified” is very different to undefined. By default moved-from objects are valid but unspecified. But the behaviour for unique_ptr is specified.

  4. naesten says:

    By the way, it looks like the source for that dav1d symbol you mentioned can be found by just searching for the symbol with the dav1d_ prefix and trailing _sse4 cpuflags suffix stripped off, i.e. https://source.chromium.org/search?q=iflipadst_16x8_internal_16bpc&sq=&ss=chromium

    • brucedawson says:

      Ah yes – so it can.
      I am not a fan of symbol names that are generated with macros and there must be such a beast (that then calls the one that you found?). Code search is great but it usually can’t resolve that trickiness.

  5. WindowsSucks says:

    Why is drive encryption software injecting code in usermode? Are they hooking WriteFile or something? Why is it not a kernel filter driver?

    • brucedawson says:

      I think it was a kernel filter driver, because the corruption occurred even when we could see no third-party code in the process. So, it appears that register corruption in a kernel filter driver can filter back to the process. Maybe that’s the boundary where Microsoft should be (in CHECK builds at least) enforcing the ABI.
      But, I’m not a kernel expert or filter driver expert and I’ve never seen the crashes on my machine so I’m extrapolating a lot here.

  6. Miku AuahDark says:

    I remember also have to deal with this issue in a LOVE game framework in Linux.

    Someone filled a bug report into our issue tracker complaining crashes in ArchLinux. After some investigation, we thought it’s LuaJIT bug (Lua interpreter that LOVE uses mainly for its game scripting language, but spoiler: it’s not LuaJIT). I spin up my old laptop with ArchLinux and able to reproduce the crash.

    Some time later, the OP posted that disabling the audio subsystem in the game framework makes the issue went away, so I reported it to OpenAL-soft hoping it’s issue in their side (spoiler again: it’s not). I also cross-post the issue both in LuaJIT and OpenAL-soft repositories since I still don’t know which library to put the blame on. The OpenAL-soft maintainer is able to reproduce the issue but can’t figure out what’s going on.

    The next day, someone commented in my LuaJIT GitHub issue tracker mentions that activating DAZ bit in MXCSR register will trigger the crash in LuaJIT. This believes me that piece of code somewhere in OpenAL-soft forgot to reset those bits as according to Linux System-V ABI, DAZ bit must be preserved. I wrote a simple test DLL in Windows that set the DAZ bit and indeed it crashes LuaJIT there too.

    I decided to start GDB and place a breakpoint at alcCreateContext first then put watchpoint at the MXCSR. This apparently takes long time since GDB has to single-step every single instruction. An hour later, culprit found. It was caused by PipeWire setting the DAZ bits without resetting it back.

    I decided to send bug report to PipeWire. After some discussion with the PipeWire maintainer along with OpenAL-soft maintainer, PipeWire maintainer finally committed a fix. PipeWire 0.3.48 is released, tested it myself by updating my PipeWire, and the crash no longer reproducible.

    This issue is real, see GitHub issue love2d/love#1776 (not gonna send links here to prevent spam detection) and PipeWire 0.3.48 changelog.

    • brucedawson says:

      Uggh. Yeah, there’s a long history of drivers and other third-party code modifying the floating-point state. Changing precision or enabling floating-point exceptions are particularly fun ones. Good job tracking it down.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.