I recently discovered that Microsoft’s VC++ compiler loads mshtml.dll – also known as Internet Explorer. The compiler does this whenever the /analyze option (requesting static code analysis) is used. I’m no compiler architecture expert, but a compiler that loads Internet Explorer seems peculiar.
This isn’t just a theoretical concern either. I discovered this while investigating why too much static-analysis parallelism causes my machine to become unresponsive for many minutes at a time, and the mshtml window appears to be part of the cause.
As usual I used xperf/ETW to find the culprit.
Update, March 2015. According to recent comments on the bug (which, I think, has an awesome title) this is fixed in Windows 8.1, and possibly Windows 8.0. That suggests that the bug was ultimately determined to be in urlmon.dll or msxml6.dll, not in cl.exe.
Does it matter?
I don’t spend my days looking for trouble – it finds me. In this case I was trying to use /analyze on a new codebase.
(see Two Years (and Thousands of Bugs) of Static Analysis for practical advice on using /analyze)
I have parallel compiles enabled, and also parallel project builds. Since both of these parallelization options default to numProcs-way parallelism this can lead to numProcs*numProcs compilers running, which on my work machine would mean up to 144 parallel compiles. I have a fast SSD and 32 GB of RAM so I would expect some sluggishness but not a total meltdown. And indeed with normal compiles a bit of sluggishness is what I get. But when compiling with /analyze my machine becomes unresponsive for up to half an hour! Mouse clicks take tens of seconds to be processed, and even Task Manager can only rarely update its window. This happens every time I do a highly parallelized /analyze build. I first noticed this problem with VS 2010 but I didn’t seriously investigate the problem until I hit it again when using VS 2013.
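The worst case described above is simple multiplication, sketched here in Python. The function name is made up for illustration, and the value of 12 logical processors is inferred from the 144 figure.

```python
def worst_case_compiles(project_builds: int, compiles_per_project: int) -> int:
    """Upper bound on simultaneous compiler processes.

    Each parallel project build can itself start its own set of
    parallel compiles, so the two limits multiply.
    """
    return project_builds * compiles_per_project


num_procs = 12  # assumed logical processor count (12 * 12 = 144)

# Both settings left at their numProcs-way defaults.
print(worst_case_compiles(num_procs, num_procs))

# Capping parallel project builds at 8 lowers the bound but does not remove it.
print(worst_case_compiles(8, num_procs))
```

This is only an upper bound. The real number of simultaneous compiles depends on how many projects are ready to build and how many source files each one has.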
I have ETW tracing running on my machine 24×7. Data is recorded to a 300 MB circular buffer and I can type Win+A to record a trace at any time. This was designed for ease of use when recording traces while playing a game, but it’s also handy when your desktop is locked up and you still need to record a trace. Microsoft’s wprui has a similar design with Ctrl+Win+C as its global shortcut. After recording a trace I managed to cancel the build and regain control of my desktop. Then I analyzed the trace and figured out what was going on.
(see Xperf Basics: Recording a Trace (the easy way) for how to record a trace with wprui)
The question “why is my computer unresponsive when doing dozens of simultaneous compiles” is difficult to answer – it’s too broad. With dozens of compilers all fighting for CPU time it is normal for any given thread to be CPU starved some of the time. I needed to find a more tightly scoped question to investigate.
WPA has a very helpful graph when you are investigating a UI hang. Windows keeps track of how long each application goes without checking for messages and if an application goes ‘too long’ without checking for messages then an ETW event is emitted. If you are recording data from the right provider (Microsoft-Windows-Win32k) then those events will be in your trace and a UI Delays graph will be shown in the System Activity area. Normally this graph should be empty, but on the trace of my unresponsive system it showed a sea of bars representing dozens of MsgCheck Delay events from every process on my system – here is just a small fraction of them:
Curiously enough, about three quarters of the hung programs were instances of cl.exe. Windows was reporting that the compiler was not running its message pump. This immediately raised the question of why the compiler has a message pump at all!
Let’s ignore that for now – just think of it as foreshadowing.
The next step was to choose a hung application and figure out why it was unable to pump messages. I chose dwm.exe (the Desktop Window Manager) because it runs at high priority. It should be able to run smoothly even on an overloaded system so if it is getting hung then you know things have gotten bad.
The UI Delay events include the process Id and thread Id for the thread that is failing to pump messages, and the interval of time during which no messages were pumped. This finally gives us enough information for a well-formed question:
Why is thread 19,308 of process dwm.exe(4336) failing to pump messages for 2.668 s starting at 84847.551566180 s into the trace?
That is a concrete question, and now we can figure out how to answer it.
Having this well formed question is quite important because if we naively use wait analysis then we will find that many threads go for many seconds without running. When doing wait analysis it is important to remember that an idle thread is not a problem – most threads on your computer should be idle most of the time. What matters is an unresponsive thread – a thread that does not promptly respond when it should. That’s why the UI Delay events are so important – they highlight a thread that is failing to respond.
(see Xperf Wait Analysis–Finding Idle Time for a dated look at how to follow wait chains)
The slow-to-respond dwm thread was waiting in ExAcquireResourceExclusiveLite, multiple times, sometimes waiting for hundreds of ms. The ReadyingProcess and other columns, documented here, tell us who “readied” the dwm thread – who released the lock and let it run. I manually followed this chain for a while but after a few hundred context switches I got bored.
So, I used wpaexporter to export all of the context switches for the relevant range. Then I filtered this to only show lines where ExAcquireResourceExclusiveLite was on the NewThreadStack, and then I got Excel to calculate the time difference between adjacent context switches. This was a quick-and-dirty way of finding an approximate lock chain, and who was taking a long time with the lock. These lock handoffs are usually extremely quick – often less than a ms – but a significant number of them took longer, usually about 16 or 32 ms. Even though fewer than 3% of the lock handoffs took more than 15 ms, these slow lock handoffs accounted for more than 75% of the time I surveyed.
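For readers who would rather script this than use Excel, here is a minimal Python sketch of the same filter-and-diff idea. The CSV column names and the sample rows are invented for illustration. A real wpaexporter export has different headers and far more rows, so treat this as the shape of the analysis rather than a drop-in tool.

```python
import csv
import io

# Hypothetical export: column names and values are made up for illustration.
SAMPLE_CSV = """\
SwitchInTime_s,NewProcess,NewThreadStack
1.0000,cl.exe,ntoskrnl!ExAcquireResourceExclusiveLite
1.0004,dwm.exe,ntoskrnl!ExAcquireResourceExclusiveLite
1.0005,cl.exe,ntoskrnl!KeWaitForSingleObject
1.0164,cl.exe,ntoskrnl!ExAcquireResourceExclusiveLite
1.0169,explorer.exe,ntoskrnl!ExAcquireResourceExclusiveLite
1.0489,cl.exe,ntoskrnl!ExAcquireResourceExclusiveLite
"""

LOCK_FUNCTION = "ExAcquireResourceExclusiveLite"
SLOW_THRESHOLD_S = 0.015  # the 15 ms cutoff mentioned in the text


def handoff_gaps(csv_text):
    """Keep only switches whose stack contains the lock function, then
    return the time differences between adjacent surviving rows."""
    rows = csv.DictReader(io.StringIO(csv_text))
    times = [
        float(row["SwitchInTime_s"])
        for row in rows
        if LOCK_FUNCTION in row["NewThreadStack"]
    ]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def summarize(gaps):
    """Count the slow handoffs and the share of total time they account for."""
    slow = [gap for gap in gaps if gap > SLOW_THRESHOLD_S]
    total = sum(gaps)
    return {
        "handoffs": len(gaps),
        "slow_handoffs": len(slow),
        "slow_time_fraction": sum(slow) / total if total else 0.0,
    }


gaps = handoff_gaps(SAMPLE_CSV)
print(summarize(gaps))
```

Like the Excel version, this only approximates the lock chain. It assumes that adjacent filtered context switches really are consecutive handoffs of the same lock.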
I then looked at some of the context switch data more closely and it appears that the slow lock handoffs are because sometimes a thread is given its turn to get the lock but it doesn’t get any CPU time. So, the lock sits there unused until the next OS scheduling interval gives the thread some CPU time. Or maybe some of the threads are taking the lock and then losing their time slice. It’s hard to tell. Either way, it appears to be a priority inversion. Or maybe it’s just what happens when you are hitting a lock thousands of times per second on a busy system?
The very first thread that I saw in the lock chain was part of the VC++ compiler. The lock was being acquired from inside CreateWindowExW, which seemed weird for a compiler. And CreateWindowExW was being called by mshtml, which seemed even weirder. And in fact, about 65% of the traffic on the windowing system lock was from the VC++ compiler, mostly via mshtml.dll.
So, when you run /analyze the compiler consumes lots of CPU time to analyze your code, and it also opens a window (two actually, one for COM purposes). If you run many copies of the compiler then you get many windows being opened, and over-subscribed CPUs. And madness ensues.
The compiler instances run at a slightly lower priority than ‘normal’ windowing processes such as Visual Studio and Outlook. This lower priority, together with the high CPU demand and heavy use of the windowing lock, seems to be a crucial part of the problem because it leads to priority inversions.
Technically I don’t know that mshtml and the opening of the windows is the cause of hangs. I know that the hangs only happen with /analyze compiles, I know that /analyze compiles load mshtml.dll, and I know that mshtml.dll shows up on some of the wait chains. Correlation is not causation, but it sure is suspicious.
If Microsoft removes mshtml.dll from the compiler then I’ll happily re-run my tests and report back.
But, but why?
Without access to a lot of source code I can’t tell exactly what is going on, but here’s what I know. If you run the compiler with the /analyze option then it loads mspft120.dll – the /analyze DLL. Then mspft120 loads msxml6.dll to load an XML configuration file. Then msxml6 loads urlmon.dll to open the stream, and finally urlmon loads mshtml.dll. Then mshtml.dll creates a window, because that’s what it does.
The XML files being loaded are:
- res://C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\BIN\mspft120.dll/213
I’m sure that every step in this process makes sense in some context – just not in this context. I suspect nobody ever noticed that mshtml.dll was being loaded, or else they didn’t run enough parallel compiles for it to matter.
I feel skepticism from my readers. This is good. If somebody told me that the VC++ compiler loaded a web browser I would doubt them. I hope that you trust me, but my feelings won’t be hurt if you feel it is appropriate to trust-but-verify. I can help.
If you are comfortable using windbg then verification is fairly easy. Just run these commands from a command prompt:
- set _NT_SYMBOL_PATH=SRV*c:\symbols*http://msdl.microsoft.com/download/symbols
- windbg.exe cl.exe /analyze SomeSourceFile.cpp
- In the windbg command field type either “sxe ld mshtml” or “bp mshtml!InitGlobalWindow”
- This command sets the path so that you can run cl.exe.
- This command sets up access to Microsoft’s symbol servers – c:\symbols is any local directory where you want the symbols cached.
- This command runs windbg with the compiler as the process to debug – SomeSourceFile.cpp is any source file on your machine.
- The first command says to break when mshtml.dll is loaded. The second one says to break when the InitGlobalWindow function in mshtml is hit. You can use either or both of these.
Then just hit F5 or type ‘g’ into the windbg command field to start debugging. After some delays to download symbols execution will stop at your breakpoint. Then you can type ‘kc’ to get a callstack, and then ‘q’ to quit.
You can also investigate this using Visual Studio. Run devenv.exe from the command prompt where you did the first two steps and set up the debugging session appropriately. You can’t set breakpoints on module load but you can look for module load messages in the output window. I couldn’t set a breakpoint on InitGlobalWindow but I was able to set a breakpoint on CreateWindowExW.
Until Microsoft fixes their compiler to not load their web browser it seems impossible to avoid this problem when doing lots of parallel builds. The only solution that I am aware of is to reduce the amount of parallelism. Setting the maximum number of parallel project builds to a smaller number seems to reduce the hangs to a more manageable level.
You can do this from the command line (important for build machines) with this command:
reg add "HKCU\Software\Microsoft\VisualStudio\12.0\General" /v MaxConcurrentBuilds /t REG_DWORD /d 8 /f
This sets the maximum number of parallel project builds to 8, for Visual Studio 2013 (version 12.0). Adjust the version number and amount of parallelism to your needs.
You can also pass a parameter to the compiler’s /MP option to specify the maximum compile parallelism for that set of source files.
I suspect that having fewer CPU cores would also help, since that would automatically reduce the over-subscription ratio – but I want lots of parallelism. Oh well.
If Microsoft had a global compiler scheduler then they could avoid over-subscribing CPUs.
A few people reasonably asked why the mitigation of dropping to two-way parallel project builds is not enough to close the case. One reason is, of course, that opening a web browser from a compiler is still a bad idea. Another reason is that twelve-way parallel project builds is the default for a twelve-thread CPU – default behavior should behave better than that.
The final reason why that mitigation is not ideal is that sometimes more than two-way parallel project builds is needed in order to get ideal build speeds. As I showed in my parallel compilation blog post last week, parallel compilation is off by default, and even when it is on many projects will spend a portion of their time doing serial builds. So, a combination of parallel compilation and parallel project builds is appropriate.
In particular, we use precompiled headers for most of our projects. VC++ builds the precompiled header as the first step in building these projects, and it compiles nothing else until it finishes. So, at the start of every full build the only parallelism I get is from parallel project builds.
It is unfortunate that VC++/msbuild don’t have a global scheduler that avoids running 144 compiles in parallel. A global compiler scheduler is the only way to have full intra- and inter-project parallelism without overcommitting the CPUs.
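To make the idea concrete, here is a toy Python sketch of what a global scheduler buys you. It is not how msbuild or cl.exe work. GlobalJobLimiter, fake_compile and the other names are hypothetical, and a real implementation would need a cross-process mechanism rather than an in-process semaphore. Projects still build in parallel and each project still runs parallel compiles, but every compile must first take one of a fixed number of shared slots.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class GlobalJobLimiter:
    """Hypothetical cap on simultaneous compile jobs, shared by all projects."""

    def __init__(self, max_jobs):
        self._slots = threading.Semaphore(max_jobs)
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0  # highest number of jobs seen running at once

    def run(self, job):
        with self._slots:  # block until a global slot is free
            with self._lock:
                self._running += 1
                self.peak = max(self.peak, self._running)
            try:
                return job()
            finally:
                with self._lock:
                    self._running -= 1


def fake_compile():
    time.sleep(0.01)  # stand-in for compiling one source file


def build_project(limiter, num_files, per_project_workers):
    # Intra-project parallelism: several compiles per project.
    with ThreadPoolExecutor(max_workers=per_project_workers) as pool:
        futures = [pool.submit(limiter.run, fake_compile) for _ in range(num_files)]
        for future in futures:
            future.result()


def build_solution(num_projects, files_per_project, num_procs):
    limiter = GlobalJobLimiter(num_procs)
    # Inter-project parallelism: several projects at once.
    with ThreadPoolExecutor(max_workers=num_projects) as pool:
        futures = [
            pool.submit(build_project, limiter, files_per_project, num_procs)
            for _ in range(num_projects)
        ]
        for future in futures:
            future.result()
    return limiter.peak


peak = build_solution(num_projects=4, files_per_project=8, num_procs=4)
print("peak concurrent compiles:", peak)
```

Without the shared limiter, this toy build could have 4 × 4 = 16 compiles in flight at once. With it, the peak never exceeds the number of slots, however the work is split between projects.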
Also, the parallel-project builds setting is a global setting, and if I lower it then some solutions will build more slowly.
Reddit discussion is here.