Compiler Bugs Found When Porting Chromium to VC++ 2015

WinDirStat of Chromium, no generated codeMoving a big software project to a new compiler can be a lot of work, and few projects are bigger than Chromium. In addition to the main Chromium repository, which includes all of Blink, there are over a hundred other open-source projects which Chromium incorporates by reference, totaling more than 48,000 C/C++ files and 40,000 header files. Porting Chromium on Windows to VC++ 2015 requires getting all of these projects to build and run.

As of March 11th 2016 Chromium for Windows has switched from VC++ 2013 to VC++ 2015, and it doesn’t look like it’s switching back. This will give us more C++ 11 features, new security options, much faster LTCG/PGO builds, and other advantages.

The tracking bug for this project currently has over 330 comments on it, with contributions from dozens of developers. Writing about all of those changes would require an entire book. So I’m going to focus on my favorite part of this project – compiler bugs. In particular, bugs where the compiler silently generates incorrect code.

Update: another code-gen bug in VS 2015 was found on March 31st, and worked around about two weeks later. A VC++ bug was filed. And another silent-bad-code-gen bug was found in September 2016, and worked around a few days later.

If you throw a diverse enough set of code at a compiler, build it in lots of different ways, run lots of tests, and ship the code to customers, you’re going to find some bugs.

Because Chromium is open source I will include links to Chromium bugs and fixes where possible, in addition to links to the VC++ bug reports.

I didn’t find all of these bugs, but it was my job to investigate them, come up with a minimal repro, and report them to Microsoft. And I have to say that the Microsoft team was amazing. They were very supportive, and helpful, and it was clear that they really wanted VC++ 2015 to be as good as possible, and that includes building Chromium. We had an excellent symbiotic relationship: I fed them high quality bug reports, and they fixed them incredibly quickly.

image

Anyone can file VC++ bugs at the connect website. Getting your bugs noticed is easier if you work for Google, but anyone who files quality bugs can get help. The VC++ MVP program lets ordinary developers who participate in the VC++ community get privileged access to the VC++ team. Recommended – I enjoyed being a VC++ MVP before joining Google.

Chromium is open source so Microsoft can theoretically reproduce any of these bugs if I just tell them what source file to compile and where to look. But that can be a slow process. If the repro steps for your bug includes “Download 90,000 source files and our custom build system and compile for hours” then you can’t expect priority service.

This is especially true because most reports of bugs in compilers are erroneous – most ‘compiler bugs’ are actually bugs in the code being compiled, such as invoking undefined behavior. So, I always had to prove to myself that these were compiler bugs before reporting them, which usually meant simplifying the repro case to make it trivially obvious that the compiler was at fault. I’ve discussed techniques for doing this before.

Failed Chromium

After getting all of Chromium building the next step was to start running Chromium. It would run fine once, but on subsequent runs I’d hit an assert. I suspected a code-gen bug but it was not the case. The problem ultimately turned out to be this code:

HandleWrapper h(CreateMutex(…));
if (h.HandleIsInvalid() && GetLastError() == SOME_ERROR_CODE) …

It turns out that the HandleWrapper class allocates memory, and the debug version of the VC++ 2015 CRT allocations functions call the FlsGetValue function, and FlsGetValue calls SetLastError(0) because it is so incredibly proud that it succeeded. SetLastError(0) destroys the error code from CreateMutex, and madness ensues. I filed a bug and tweeted about this and it turned into a fun little twitter discussion with opinions about evenly split between “if memory allocation affects LastError then Armageddon is imminent” and “if you assume memory allocation won’t affect LastError then you are Satan”.

This way of using HandleWrapper is very convenient and is pervasive in Chromium so fixing all of the uses seemed impractical so instead the HandleWrapper constructor now imagepreserves and then restores LastError. Some other misuses of GetLastError were also fixed.

This was not a compiler code-gen bug, but I wanted to include it because the twitter storm was wonderful.

Microsoft decided to fix the bug which I assumes means that they are choosing Satan over Armageddon.

Failed test

The next step was to start running tests. There are a lot of Chromium test executables and after a bit of experimenting I hit a failure – media_unittests failed on the AudioVideoMetadataExtractorTest.AudioWAV test, but only when built with VS 2015. This is suspicious but proves nothing by itself – it could well be a Chromium bug which the compiler exposed. However I eventually debugged the problem enough to find the failing function (ff_read_riff_info in riffdec.c), and then the failing line of code. It was this:

char key[5] = { 0 };

Simple enough – this is supposed to zero the entire array, but instead it only zeroed the first four bytes.

Clearly the compiler doesn’t make this mistake every time it is asked to zero a five-byte array, or else nothing would run. There was something about the surrounding code that triggered this odd behavior and I still don’t know exactly what it was. I enjoyed coming up with a minimized repro for this bug – just 38 lines including test code to drive it. There’s something satisfying about a small repro and it makes it much more likely that the compiler team will fix the bug – which they did.

In order to get past this bug I checked in a workaround which explicitly zeroes the last character of the array.

PGO only crash

Profile Guided Optimization uses profiling information in order to generate better code. It can give great performance improvements but it’s scary because minor changes in your training – the stage when profiling data is gathered – can affect the generated code. This means that your builds are not entirely deterministic. How exciting. When Google shipped a Chrome Dev Channel build that used PGO we hit a suspicious crash that wasn’t happening in normal builds. The crash was an access violation, which usually means a reference to an unmapped or inaccessible address, but the crash dump showed that there was memory there. The crashing instruction was:

movaps  xmm0,xmmword ptr [r14+2398h] ; Don't do this at home

The movaps instruction requires a 16-byte aligned address and this is reading from 0x2398 + r14. So, unless r14 was guaranteed to be 8-bytes greater than a multiple of 16 this would fail. In fact, r14 was guaranteed to be 16-byte aligned, so this instruction would never work.

The code that triggered this bug was actually scalar math on four adjacent integers. The compiler’s vectorizer recognized that it could do the operations using SIMD instructions – it almost did an amazing job…

The problem with investigating this bug was that while a normal Chromium build ‘only’ takes about half an hour on a 24-core/48-thread machine, a PGO build (actually two link-time-code-generation builds with a training session in-between) takes hours. Not good for reporting a bug.

One of the main features of PGO is that it can decide whether to optimize-for-size or optimize-for-speed at very fine granularity. The source file that contained the crashing function was usually optimized-for-speed so I theorized that PGO was causing it to be optimized-for-size instead. I changed the optimization settings for this file, recompiled it, and examined the generated assembly language – bingo!

As discussed in my compiler-bug-reporting post, /FAcs makes investigating compiler bugs much faster.

After figuring out how to trigger this bug I preprocessed the file, reduced it a bit, and filed a bug with instructions to look for the movaps instruction with a bad offset – see the bug for details.

To get a preprocessed file I ran this command to compile just this file with verbose output and retaining the .rsp file (response file, containing arguments to the compiler):

ninja -v -d keeprsp -C out\Release  ..\..\third_party\libvpx\source\libvpx\vp8\encoder\mcomp.c^^

Then I edited the .rsp file to add /P, re-ran the command that had been verbosely printed, and had my repro.

After reporting this bug I then ran all of our tests on a version of Chromium that was optimized-for-size everywhere, to see if any other bugs were found. This reproduced the original crash, but nothing else.

The minimized repro for this bug was still pretty big and for some reason I could not attach it to the bug – I emailed it to Microsoft and later uploaded it to github. The repro code is also not runnable, but it’s pretty easy to examine the generated code to see whether the bug is there – grepping for movaps in the generated assembly language and looking at the offset is sufficient. See the bug for details.

The generic way of reporting a link-time problem, including LTCG or PGO code-gen bug is to create a LINK_REPRO. This is a good technique to know about but on Chromium it would probably be a many GB repro.

This bug was particularly annoying because it was difficult to come up with a way to work around it. The common strategy of “change the optimization settings” doesn’t work when PGO is overriding those based on profile data. Luckily the fix will arrive soon enough that no workaround is needed.

Bad 64-bit structure read

After talking to the vp8 team about the PGO only crash bug they mentioned another problem that they had hit. Some versions of one of their structures would cause incorrect code to be generated for a read of a 64-bit element from an array. The bug wasn’t currently happening but they had the git hash from a couple of times when it had. So, I synced their repo to that point, preprocessed the file, reproduced the bug, and then started reducing the file. I knew what the bad code-gen looked like so I could minimize the imagerepro quickly. On 32-bit optimized-for-size C compiles (not C++) the compiler would read the first half of the 64-bit value from the correct offset but the second half would always be read from an offset of 4. Oops.

I filed the bug with a minimized repro that was just 22 lines, including test code to print the bad results. This bug was first noticed in VS 2013 but hadn’t previously been reported. It will be fixed in VS 2015 Update 2. No workaround was needed because the bug was not currently manifesting.

Pre-RTM

This bug actually came first chronologically, but since it was a bug in VS 2015 prior to its initial release I’m rating it as less interesting. The repro for the bug – all 44 lines of it, including test code to drive it – causes the call to a destructor to be omitted. I apologize for the confusing description – there’s no way to go back and fix typos or misunderstandings.

Other bugs

Probably my smallest compiler bug repro was an internal compiler error that I could trigger with a one line source file. I worked around that bug by not using /analyze on ffmpeg for a while.

Compared to a one-line repro my bloated 67 line source file to trigger an internal compiler error just seems weak.

One of the reasons to upgrade to VC++ 2015 is because LTCG/PGO links are up to five times faster than with VC++ 2013. Why? I know one reason.

We also hit several bugs where VC++ 2015 refused to compile or link our code:

There are other bugs but this is enough for today.

Once that last linker crash is fixed I can only assume that VC++ 2015 will be completely bug free. If it can build Chromium’s many configurations successfully then it must be!

It could be worse

I worked on one project where in addition to finding several compiler bugs I also found a flaw in the design of the processor, but that’s an NDA tale for another day…

Discussion of this article

There is a good discussion of this article on reddit, a discussion on hacker news, and a slashdot discussion.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
This entry was posted in Programming, Visual Studio and tagged , , , . Bookmark the permalink.

34 Responses to Compiler Bugs Found When Porting Chromium to VC++ 2015

  1. Pingback: 1 – Compiler Bugs Found When Porting Chromium to VC++ 2015

  2. I wonder if that Pre-RTM bug is the same one I hit and reported… (https://connect.microsoft.com/VisualStudio/feedback/details/1322217/c-compiler-bug-returning-aggregate-initialization-leaks-when-combined-with-move-construction)

    It feels good to help fix a bug, especially one in popular software.

  3. We just got Firefox Nightly builds switched over yesterday: https://bugzilla.mozilla.org/show_bug.cgi?id=1186060 . Most of the interesting bugs we had to fix are dependencies of this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1119082 .

    We were pretty close to having it working, but at some point someone had blanket-disabled all the new VC2015 warnings, and my coworker decided to actually tackle that by actually fixing them or disabling specific warnings only in certain directories, which was pretty good.

    The PGO build time wins are great! VC2015 is about a full hour faster than VC2013 for a Firefox PGO build.

    We’re historically pretty bad about filing compiler bugs, I don’t know why. PGO compiler bugs are especially hard to create testcases for. The worst PGO bug I had to debug was this one back in the Firefox 3 release timeframe: https://bugzilla.mozilla.org/show_bug.cgi?id=475178#c20 . Clearly a miscompilation, but I wasn’t ever sure how to reduce a testcase.

    • brucedawson says:

      Hey Ted, thanks for the Firefox context, especially the corresponding tracking bug. You should definitely try Update 2 as it has a lot of bug fixes.

      For PGO bugs (or LTCG bugs or linker bugs) one option – which I should mention in the article – is a LINK_REPRO. It gathers up all of the linker inputs so that you can submit them to Microsoft. It’s documented here: https://support.microsoft.com/en-us/kb/134650

  4. Tom says:

    If you’re not already aware of it, delta debugging (https://en.wikipedia.org/wiki/Delta_Debugging) can be extremely helpful in generating smaller programs that demonstrate a compiler error. I’ve isolated a compiler bug from a 30,000 line program to a ~30 line reproducer by hand, and it took days. With delta debugging, it might have run for days, but it would have only taken half an hour of my time to set up.

    • brucedawson says:

      I have heard of that but haven’t tried it. I’ve always managed to reduce the reproes quickly enough to make a tool not quite tempting, but I should take the time to learn it.

      • Kevin Choi says:

        All it boils down to is some form of binary heap search. Problem is it’s hard to tailor to a specific project since it requires close ties to building and running. I’ve had some written for Makefile projects and it was fine…. until I ran into problems which couldn’t be solved by binary search.

  5. Harland says:

    I wonder how Microsoft works have reacted had I filed these bugs with the same documentation, but I didn’t work for Chromium at Google. I get the idea the response wouldn’t have been nearly as warm and enthusiastic.

    • Resuna says:

      Heh, I’m reminded of Jerry Pournelle’s raving about the great customer support he got from these little 70s and 80s era computer companies. Well, of course he did. He was one of the top computer columnists in Byte, which was probably the top hobbyist computer magazine at that time!

    • brucedawson says:

      I think there are multiple factors. Working at Google doesn’t hurt. I also used to work at Microsoft so I developed good contacts back then. Being publicly visible on my blog is also advantageous. But I also have been supplying high-quality VC++ bug reports for at least two decades, and that history is also important.

    • When we participated in Microsoft BarCode events to port apps from iPhone/Android to Windows Phone. We worked on our application, an offline public transport journeyplanner, and divided the C code into a “component” and the frontend in .NET. We had so many issues with the Visual C++ compiler. Issues that worked fine on x86 and broke hard on ARM. Common replies of Microsoft specialist as “yes, its beta”, “you are doing extreme things”, “you should rewrite you C code in C#”. No priority support for small code examples that broke.

      I do wonder once Chromium works on x86, someone dares to give it a go on ARM. Our code was originally written in C11, with the most insane stack overflow errors we ported all our code to ANSI-C89. I was actually relieved hearing Microsoft’s support for Clang, I wondered about the reason for choosing VC++ before reading the article.

  6. Resuna says:

    Strictly speaking LastError is a horrible anti-pattern and has been ever since errno in UNIX kept telling people that the user they were trying to send mail to wasn’t a typewriter.

  7. Are you going to move to building with CLang anytime soon?

  8. A user says:

    I wonder why the VS team is not running such tests like compiling Chromium before they release.

    • brucedawson says:

      The VC++ team does have some Chromium repos which they build with VC++. However they rely on Google for some of the fixes to the code to get it compiling with a new compiler. There is also the problem of the many configurations of Chromium – gyp/gn, 32-bit/64-bit, debug/release/official/PGO, component/static – that’s more than 24 variations, each with lots of quirks and lots of test binaries. Since Google already has build machines configured to build and test these many variants it is inevitable that Google will find some bugs that Microsoft could not so easily find.

      • Jon says:

        But still, Microsoft has a huge amount of C++ software they could be testing with their new compiler. It’s surprising that outsiders can find bugs that they didn’t.

        • brucedawson says:

          They certainly do test their compiler with lots of software (purchased test suites, Windows, Office, and more). I guess the space of non-trivially different C++ programs is so vast that test coverage is still sparse, regardless.

  9. rpavlik says:

    You did embolden me to get more involved with Visual C++ dev, that’s for sure! (Well, that, and the fact that my current main codebase could build on GCC and Clang as well as 2013, but ICEd out 2015 – eventually I got jealous of the 2015 improvements and got active in trying to get the blockers fixed.) And, indeed, the ICE got fixed! https://connect.microsoft.com/VisualStudio/feedback/details/2154536

    I know it’s kind of a hazily-defined topic, especially on non-Windows, but at least on Windows, it’s a bit clearer – have you found a tool that does the preprocessing of includes for only non-system includes? I feel like that’s probably an important step in reducing test cases (getting it to a single file) but also that including copies of system header chunks (as if I just preprocessed it all the way) is both not helpful and makes the job of reducing the test case harder.

    Can’t say I’m clever enough to find any code-gen bugs yet – my current role is apparently to figure out if I’m a nightmare for the frontend or the other way ’round… https://connect.microsoft.com/VisualStudio/feedback/details/2454428/ (Yeah, trying but failing isolating a test case out of Boost Interprocess, then trying to figure out if it’s supposed to compile or not, not my best bug report. Jon Caves said he thinks it’s not supposed to work, aka, “as designed”, which just leaves me confused as to why 2013, as well as gcc on mingw64, are totally cool with it…)

    • brucedawson says:

      The last bug I reduced was made difficult by the system includes – the bug was a compiler bug that was only triggered by Update 2 RC header files (see “got out-of-context status updates about the bug on twitter” in the blog post).

      The technique that was suggested to me to reduce that bug was (vague hand waving follows):

      Use /d1showIncludes (or maybe /showIncludes) to get a full list of header files included, directly or otherwise. Include all of the top-level system headers from the .cpp file first.
      Then preprocess the whole shebang, remove the top section (which should be just system includes) and then put back the #include directives for them.

      I haven’t actually done this but it seems like it should work and should even be automatable.

      • rpavlik says:

        Ah, sure, that sounds like it might work – at least for everything except for the intentionally-reincludable files like assert.h (and as long as header order isn’t affecting your bug…) Not quite the clang-powered (or boost-wave-powered) magic wand I was hoping for, but given how much I’ve searched, I didn’t really expect it😀

  10. Harlan Rosenthal says:

    I once wrote code that exposed a firmware bug (Prime 2250 minicomputer, if I remember right). It involved a multi-byte instruction broken across a page boundary, with bad data output if a page fault occurred – which was more likely than you’d think because the minicomputer only had 4-way-hashing in the paging registers so there were a LOT of collisions in our operations.. The really scary part was how many programs might be out there generating sporadic bad data that nobody had noticed.

  11. Yasser Asmi says:

    Hey Bruce, this is great. I was recently playing with Chromium and building it is not fun and very time consuming. The article talks about bugs in VS. I am curious if you found any bugs in Chromium that only VS compiler flagged (vs GCC)?

  12. Ben says:

    When are PGO builds going to ship in beta/release? I can’t seem to find the corresponding Chrom(ium) bug.

    Is VS2015 still so reticent to inline non-trivial functions, even those that are used only once? That always bothered me.

  13. Donpedro says:

    In CTP 5 I’ve found a compile time regression (in C2; optimized builds only) that affected only a handful of files, but was quite massive. There was one file where I thought I hit an infinite loop in the compiler (file compiled in 16 seconds with VS 10, but wouldn’t finish compiling even after 10+ minutes with VS 15). Then I let the build run overnight, turned out I was wrong. The file compiled in a whopping ~17,100 seconds (that’s almost 5 hours!).

    I couldn’t produce a trivial repro from scratch, but I isolated and minimized the affected file, packed the whole translation unit (cpp file and all the headers, plus compilation options in a response file; batch file that invokes cl.exe, and reproduces the problem), and sent it to Microsoft. They fixed it in Update 1.

    The company I work for really wanted to switch to VS 15 by the time our flagship product gets a new major release, but could not, because of this somewhat slow fix. Either way, I’m happy they fixed it eventually.

  14. Ofek Shilon says:

    I wonder if the misaligned argument to SSE instruction is the same I encountered a short while ago (https://connect.microsoft.com/VisualStudio/feedback/details/1172020/vc-bad-code-generation)
    unpcklpd xmm0, qword ptr [ebp-50h]

    Haven’t checked if it was fixed.

  15. Pingback: C++ Annotated: March – May 2016 | ReSharper C++ Blog

  16. Pingback: C++ Annotated: March – May 2016 | CLion Blog

  17. Pingback: Zeroing Memory is Hard (VC++ 2015 arrays) | Random ASCII

  18. Pingback: Everything Old is New Again, and a Compiler Bug | Random ASCII

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s