See the end of the post for an October 2018 bug fix update, or read the whole story:
Flaky failures are the worst. In this particular investigation, which spanned twenty months, we suspected hardware failure, compiler bugs, linker bugs, and other possibilities. Jumping too quickly to blaming hardware or build tools is a classic mistake, but in this case the mistake was that we weren’t thinking big enough. Yes, there was a linker bug, but we were also lucky enough to have hit a Windows kernel bug which is triggered by linkers!
In September of 2016 we started noticing random failures when building Chrome – 3 out of 200 builds of Chrome failed when protoc.exe, one of the executables that is part of the build, crashed with an access violation. That is, we would build protoc.exe, and then run it to generate header files for the next build stage, but it would crash instead.
The developers who investigated knew immediately that something weird was happening but they couldn’t reproduce the bug locally so they were forced to make guesses. A couple of speculative fixes (reordering the tool’s arguments and adding explicit dependencies) were made, and the second fix seemed to work. The bug went away for a year.
And then, a few days shy of its first birthday, the bug started happening again. A steady drumbeat of reports came in – ten separate bugs were merged into the master bug over the next few months, representing just a fraction of the crashes.
I joined the investigation when I hit the bug on my workstation. I ran the bad binary under a debugger and saw this disassembly:
00000001400010A1 00 00 add byte ptr [rax],al
00000001400010A3 00 00 add byte ptr [rax],al
00000001400010A5 00 00 add byte ptr [rax],al
00000001400010A7 00 00 add byte ptr [rax],al
Now we have a problem statement that we can reason about: why are large chunks of our code segment filled with zeroes?
I deleted the binary and relinked it and found that the zeroes were replaced with a series of five-byte jmp instructions. The long array of zeroes was in an array of thunks, used by VC++’s incremental linker so that it can more easily move functions around. It seemed quite obvious that we were hitting a bug in incremental linking. Incremental linking is an important build-time optimization for huge binaries like chrome.dll, but for tiny binaries like protoc.exe it is irrelevant, so the fix was obvious: disable incremental linking for the tiny binaries used in the build.
It turned out that this fix did work around an incremental linking bug, but it was not the bug we were looking for.
I then ignored the bug until I hit it on my workstation two weeks later. My fix had not worked. And, this time the array of zeroes was in a function, instead of in the incremental linking jump table.
I was still assuming that we were dealing with a linker bug, so when I hit the problem again another two weeks later I was confused, because I was no longer using Microsoft’s linker. I had switched to lld-link (use_lld=true in my gn args). In fact, when the bug first hit we had been using the VC++ compiler and linker, and I’d just hit it with the clang compiler and linker. If switching out your entire toolchain doesn’t fix a bug then it’s clearly not a toolchain bug – mass hysteria was starting to seem like the best explanation.
Up to this point I had been hitting this bug randomly. I was doing a lot of builds because I was doing build-speed investigations and these crashes were interfering with my ability to do measurements. It’s frustrating to leave your computer running tests overnight only to have crashes pollute the results. I decided it was time to try science.
Instead of doing a dozen builds in a night to test a new build optimization I changed my script to just build Chrome in a loop until it failed. With jumbo distributed builds and a minimal level of symbols I can, on a good day, build Chrome a dozen times in an hour. Even a rare and flaky bug like this one starts happening every single night when you do that. So do other bugs (zombies!) but that’s a different story.
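A build-until-failure loop of this kind can be sketched in a few lines of Python. This is not the script used in the investigation; the function name build_until_failure and the max_builds limit are assumptions for illustration, and the build command is whatever your own setup uses.

```python
import subprocess
import sys


def build_until_failure(build_cmd, max_builds=1000):
    """Run build_cmd repeatedly and stop at the first failure.

    Returns the 1-based index of the first failing build, or None if
    every one of the max_builds attempts succeeded.
    """
    for attempt in range(1, max_builds + 1):
        result = subprocess.run(build_cmd)
        if result.returncode != 0:
            # Keep the exit code: it is the first clue about what went wrong.
            print(
                f"build {attempt} failed with exit code {result.returncode}",
                file=sys.stderr,
            )
            return attempt
    return None
```

In practice build_cmd would be a wrapper that cleans the output directory and then rebuilds, so that every iteration relinks the binaries instead of finding them up to date.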
And then, I got lucky. I logged on to my computer in the morning, saw that genmodule.exe had crashed overnight (the crashing binary varied), and decided to run it again, to get a live crash instead of looking at crash dumps. And it didn’t crash.
The crash dump (I have Windows Error Reporting configured to save local crash dumps; all Windows developers should do this) showed lots of all-zero instructions in the critical path. It was not possible for this binary to run correctly. I ran genmodule.exe under the debugger and halted on the function that had previously crashed – that had previously been all zeroes – and it was fine.
Apologies for the strong language, and those of a more sensitive nature (and karate fans) might want to skip the rest of this paragraph, but WTF?!?
I then loaded the crash dump into windbg and typed “!chkimg”. This command compares the code bytes in the crash dump (some of them are saved in the crash dump, just in case) against those on disk. This is helpful when a crash is caused by bad RAM or bad patching and it will sometimes report that a few dozen bytes have been changed. In this case it said that 9322 bytes in the code in the crash dump were wrong. Huh!
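The idea behind that comparison can be sketched in Python. This is only an illustration of the concept, not how windbg implements !chkimg: the function name compare_code_bytes, the 4096-byte page size, and the assumption that both byte strings cover the same code range are all hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def compare_code_bytes(in_memory: bytes, on_disk: bytes, page_size: int = PAGE_SIZE):
    """Compare code bytes captured from a process with the bytes on disk.

    Returns (mismatches, zeroed_pages): the number of differing bytes, and
    the offsets of pages that are all zeroes in memory but not on disk.
    """
    if len(in_memory) != len(on_disk):
        raise ValueError("both buffers must cover the same code range")

    mismatches = sum(a != b for a, b in zip(in_memory, on_disk))

    zeroed_pages = []
    for offset in range(0, len(in_memory), page_size):
        mem_page = in_memory[offset:offset + page_size]
        disk_page = on_disk[offset:offset + page_size]
        # A page that differs from disk and holds only zero bytes in memory
        # matches the failure pattern described in this post.
        if mem_page != disk_page and not any(mem_page):
            zeroed_pages.append(offset)
    return mismatches, zeroed_pages
```

A handful of mismatched bytes points towards bad RAM or patching; whole pages of zeroes point towards the loader having mapped the wrong data.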
Now we have a new problem statement: why are we not running the code that the linker wrote to the file?
This was starting to look like a Windows file cache bug. It looked like the Windows loader was pulling in pages full of zeroes instead of the pages that we had just written. Maybe something to do with multi-socket coherency of the disk and cache or ???
My coworker Zach made the vital suggestion that I run the Sysinternals sync command after linking binaries. I resisted at first because the sync command is quite heavyweight and requires administrative privileges, but eventually I ran a weekend-long test where I built Chrome from scratch over 1,000 times, as admin, with various mitigations after running the linker:
- Normal build: 3.5% failure rate
- 7-second sleep after linking exes: 2% failure rate
- sync.exe after linking exes: 0% failure rate
Huzzah! Running sync.exe was not a feasible fix, but it was a proof of concept. The next step was a custom C++ program that opened the just-linked exe and called FlushFileBuffers on it. This is much lighter weight and doesn’t require administrative privileges, and it also stopped the bug from happening. The final step was to convert this into Python, land the change, and then make my favorite under-appreciated tweet.
Later that day – before I’d had a chance to file an official bug report – I got an email from Mehmet, an ex-coworker at Microsoft, basically saying “Hey, how’s things? What’s this I hear about a kernel bug?”
I shared my results (the crash dumps are quite convincing) and my methodology. They were unable to reproduce the bug – probably due to not being able to build Chrome as many times per hour as I can. But, they helped me enable circular-buffer ETW tracing, rigged to save the trace buffers on a build failure. After some back-and-forth I managed to record a trace which contained enough information for them to understand the bug.
The underlying bug is that if a program writes a PE file (EXE or DLL) using memory mapped file I/O, and if that program is then immediately executed (or loaded with LoadLibrary or LoadLibraryEx), and if the system is under very heavy disk I/O load, then a necessary file-buffer flush may fail. This is very rare and can realistically only happen on build machines, and even then only on monster 24-core machines like the one I use. They confirmed that my fix should reliably avoid the bug (I’d already noted that it had allowed ~600 clean builds in a row), and promised to create a proper fix in Windows.
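A post-link flush along these lines can be sketched in Python. This is a minimal illustration, not the actual change that landed in Chrome’s build: the function name flush_to_disk is an assumption, and the sketch relies on os.fsync, which the Python documentation says calls the C runtime’s _commit function on Windows (understood to flush the file via FlushFileBuffers).

```python
import os


def flush_to_disk(path: str) -> None:
    """Reopen a just-written file and force its contents out to disk."""
    # "r+b" opens for reading and writing without truncating; flushing
    # needs a writable handle on Windows.
    with open(path, "r+b") as f:
        os.fsync(f.fileno())
```

A build wrapper would call this on the linker’s output file right after the linker exits and before any later build step runs that binary.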
Play along at home
You probably won’t be able to reproduce this bug but if you want to see an example crash dump you can find one (and the .exe and .pdb files) on github. You can load them into Visual Studio and see all the zero bytes in the disassembly, or load them into windbg to run !chkimg and see the !chkimg errors:
0:000> .sympath .
Symbol search path is: .
00412d40 0000 add byte ptr [eax],al ds:002b:cbb75f7e=??
9658 errors : @$ip (00408000-00415815)
0:000> uf eip
00412d40 0000 add byte ptr [eax],al
00412d42 0000 add byte ptr [eax],al
00412d44 0000 add byte ptr [eax],al
00412d46 0000 add byte ptr [eax],al
1) Building Chrome very quickly causes CcmExec.exe to leak process handles. Each build can leak up to 1,600 process handles and about 100 MB. That becomes a problem when you do 300+ builds in a weekend – bye bye to ~32 GB of RAM, consumed by zombies. I now run a loop that periodically kills CcmExec.exe to mitigate this, and Microsoft is working on a fix (Microsoft also told me how to disable the problematic CcmExec.exe feature, which I did).
2) Most Windows developers have seen 0xC0000005 enough times to remember that it means Access Violation – it means that your program dereferenced memory that it should not have, or in a way that it should not have. But how many Windows programmers recognize the error codes 3221225477 or -1073741819? It turns out that these are the same value, printed as unsigned or signed decimal. But, not surprisingly, when developers see a number around negative one billion their eyes glaze over and the numbers all start to look the same. So when some of the crashes returned error code -1073740791 the difference was either not noticed, or was ignored.
3) That’s a shame because it turns out that there were two bugs. crbug.com/644525 is the Chromium bug for investigating what turned out to be this kernel bug. But, once I landed a workaround for that bug and reenabled incremental linking we started hitting different crashes – crbug.com/812421. Some developers were hitting error code -1073740791 which is 0xC0000409 which is STATUS_STACK_BUFFER_OVERRUN. I never saw this crash myself but I asked for a crash dump (I was terrified that crbug.com/644525 had returned) from a coworker and saw that ntdll.dll!RtlpHandleInvalidUserCallTarget was calling RtlFailFast2. I recognized this signature and knew that it had nothing to do with buffer overruns. It’s a Control Flow Guard violation, meaning that the OS thinks that your program is being exploited by bad people to do an illegal indirect function call.
It appears that if you use /incremental with /cfg then the Control Flow Guard information isn’t always updated during incremental linking. The simple fix was to update our build configurations to never use /incremental and /cfg at the same time – they aren’t a useful combination anyway.
And, for my own sanity, I landed a few changes that get us to print Windows error codes in hex. So much better.
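The conversion itself is tiny. The sketch below is not the code that landed in Chrome; format_exit_code and the KNOWN_CODES table are hypothetical names, and the table holds only the two status codes mentioned in this post.

```python
# Only the two codes discussed in this post; a real table would be longer.
KNOWN_CODES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def format_exit_code(code: int) -> str:
    """Render a process exit code as 32-bit hex, with a name if known."""
    # Masking to 32 bits maps the signed and unsigned decimal spellings
    # of the same value onto one hex string.
    unsigned = code & 0xFFFFFFFF
    text = f"0x{unsigned:08X}"
    name = KNOWN_CODES.get(unsigned)
    return f"{text} ({name})" if name else text
```

With this, -1073741819 and 3221225477 both print as 0xC0000005 (STATUS_ACCESS_VIOLATION), while -1073740791 prints as 0xC0000409 (STATUS_STACK_BUFFER_OVERRUN), so the two failures no longer look alike.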
We still don’t know what caused this bug to start showing up in the first place – maybe our switch to gn changed the ordering of build steps to make us more vulnerable?
We also don’t know why the bug disappeared for a year. Was the original bug something unrelated that was fixed by this change? Or did we just get lucky or oblivious?
Either way, whether we fixed two or three separate bugs, Chrome’s builds are much more reliable now and I can go back to doing build-performance testing without hitting failures.
The Chrome build workaround I landed is 100% reliable, and both lld-link.exe and Microsoft’s link.exe will be adding FlushFileBuffers calls as mitigations. If you work on a tool that creates binaries (Rust? I filed a bug for Go’s linker but it’s not vulnerable) using memory mapped files you should consider adding a FlushFileBuffers call just before closing the file. This bug shows up from Server 2008 R2 (Windows 7) up to the latest stable build of Windows 10 and OS fixes will take a while to propagate so you might as well be careful.
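For a tool that writes its output through a memory mapped file, the mitigation might look like the Python sketch below. It is an illustration under stated assumptions, not the code used by lld-link or link.exe: write_via_mmap is a hypothetical name, and the final flush relies on os.fsync, which the Python documentation says calls the C runtime’s _commit function on Windows.

```python
import mmap
import os


def write_via_mmap(path: str, payload: bytes) -> None:
    """Write payload to path through a memory mapping, then flush to disk."""
    if not payload:
        raise ValueError("cannot memory-map an empty file")
    with open(path, "w+b") as f:
        # The file must already have its final size before it is mapped.
        f.truncate(len(payload))
        with mmap.mmap(f.fileno(), len(payload)) as view:
            view[:] = payload
            # Push the dirty mapped pages back to the file.
            view.flush()
        # Flush the file's buffers just before the file is closed.
        os.fsync(f.fileno())
```

The order matters: the mapped view is flushed first, and the file-level flush comes last, immediately before the handle is closed.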
Russian translation is here.
A Twitter user shared a report of hitting the bug in 2015, which is not hugely shocking since this bug appears to be 8+ years old. And another Twitter user reported that they hit the bug after I reported it.
Update, October 2018
I recently found that at least two other companies (for a total of five known) were hitting this bug – once you know it’s there it becomes easier to identify it. One of the other companies that was hitting it was Microsoft.
I also just heard that Windows 10 RS5 (1809, the fall 2018 update) has a fix. So, the workaround that we added to lld-link will probably get an OS version check so that we only pay the cost when it is actually needed.