The free lunch is over and our CPUs are not getting any faster, so if you want faster builds you have to build in parallel. Visual Studio supports parallel compilation, but it is poorly understood and often not even enabled.
I want to show how, on a humble four-core laptop, enabling parallel compilation can give close to a four-times build speed improvement. I will also show how to avoid some of the easy mistakes that can significantly reduce VC++ compile parallelism and throughput. And, as a geeky side-effect, I’ll explain some details of how VC++’s parallel compilation works.
Plus, pretty pictures.
Spoiler alert: my test project started out taking about 32 s to build. After turning on parallel builds this dropped to about 22 s. After fixing some configuration errors this further dropped to 8.4 s. Read on if you want the details.
A CPU bound build
My test project consists of 21 C++ source files. It is a toy project but it was constructed to simulate real issues that I have encountered on larger projects.
Each source file calculates a Fibonacci value at compile time using a recursive constexpr function so that the 56-byte source files take a second or two to compile. I previously blogged about how to use templates to do slow compiles but the template technique is quite finicky, and the thousands of types that it creates cause other distracting issues, so all of my measurements are done using the constexpr solution (thanks Nicolas).
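For readers who have not seen the trick, here is a minimal sketch of what such a source file could look like. It is not the actual file from the test project: the names Fib and kFibValue and the argument 20 are assumptions chosen for illustration, and how long it takes to compile depends heavily on the compiler, since some compilers cache constexpr results.

```cpp
// Hypothetical stand-in for one of the slow-to-compile source files.
// The naive recursion makes the call tree grow exponentially with n,
// so the compiler's constant evaluator does all of the work and
// nothing is computed at run time.
constexpr unsigned long long Fib(unsigned n) {
    // A single return statement keeps this valid C++11 constexpr.
    return n < 2 ? n : Fib(n - 1) + Fib(n - 2);
}

// Initializing a constexpr variable forces compile-time evaluation.
// A larger argument means a slower compile, up to the compiler's
// limits on constexpr recursion depth and evaluation steps.
constexpr unsigned long long kFibValue = Fib(20);
```

Raising the argument is the knob that controls compile time, which is presumably how one file can be made to cost several times more than its siblings.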
Since constexpr is not natively supported in Visual Studio 2013 I had to install the Visual C++ Compiler November 2013 CTP which contains new C++11 and C++14 features. VS 2015 should work also. After installing it you can select this platform toolset from the project properties in the General section, as shown to the right. If you don’t feel like installing the CTP, no worries, just follow along. The lessons apply to VS 2010 to 2013. The CTP just makes it easier to demonstrate them in a small project.
Having made a slow-to-compile project the challenge is to improve the build speed without changing the source code.
If you install the CTP and build my test project and watch Task Manager you’ll see something like the image to the right. I have Task Manager configured to show one graph for all CPUs because I think it makes it easier to see how busy my system is, and the answer in this case is not very. The screenshot is from my four-core eight-thread laptop and it looks like my CPU is about 16% busy, meaning one-and-a-bit threads are in use. So, not very much parallelism is going on and the build took about 32 seconds.
It turns out there is an easy explanation for this behavior. Parallel compilation is off by default in VS 2013!
I’m not sure why parallel compilation is off by default, but clearly the first step should be to turn it on. Be sure to select All Configurations and All Platforms before making this change, shown in the screenshot below:
If we build the debug configuration now we will get an error because Enable Minimal Rebuild is on by default and it is incompatible with multi-processor compilation. We need to turn it off. Again, be sure to select All Configurations and All Platforms before making this change.
The other reason to disable minimal rebuild is that it is not (IMHO) implemented correctly. If you use ccache on Linux then when the cache detects that compilation can be skipped it still emits the same warnings. The VC++ minimal rebuild does not do this, which makes eradicating warnings more difficult.
Turning on parallel builds definitely helps. The Task Manager screenshot to the right uses the same scale as the previous one and we can see that the graph is taller (indicating parallelism) and narrower (indicating the build finished faster, about 22 seconds). But it’s far from perfect. The build spikes up to a decent level of parallelism twice and then it settles down to serial building again. Why?
Many developers will assume that this imperfect parallelism is because the compiles are I/O bound, but this is incorrect. It is incorrect in this specific case, and it is also generally incorrect. Compilation is almost never significantly I/O bound because the source files and header files quickly populate the disk cache. Repeated includes in particular will definitely be in the disk cache. Writing the object files may take a while, but the compiler doesn’t have to wait for this. Linking might be I/O bound (especially the first link after a reboot), but compilation is virtually never I/O bound, and I’ve got the xperf traces to prove it.
So what is going on?
It’s time to get all scientific. Let’s grab an xperf trace of the build so that we can analyze it more closely. I recommend using UIforETW to record a trace – the default settings will be fine, or you can turn off context switch and CPU sampling call stacks to save space. Having recorded a trace I opened it in WPA, opened a Processes view and arranged the columns appropriately to make it easy to see the lifetimes of all compiler (cl.exe) processes. Let’s first look at a graph from before parallel compilation was enabled:
Each horizontal bar represents a cl.exe process. I happen to know that cl.exe is single-threaded – any parallelism has to come from multiple cl.exe processes running simultaneously. The numbers along the bottom represent time in seconds. We can see that cl.exe is invoked 13 times to compile our 21 files and we can see that there is zero parallelism.
Next let’s look at a graph of building our project with parallel compilation enabled:
Isn’t that pretty? We can see that the build is running faster (the graph is skinnier), and we can see the places where our build is running in parallel (stacked bars). But there are mysteries. What are those green and blue spikes near the top left? And why are there so many serial compilations on the right?
Diagnosing what is going on is still challenging because while we can see the compiler processes coming to life we cannot see what file is being compiled when. This is annoying. So I fixed it, ’cause that is what programmers do!
I wrote a simple program that calls devenv.exe as a sub-process. I also modified my build-tracing batch file to add the /Bt+ flag to the compiler options. This option would frequently crash with VS 2010, but it works reliably with VS 2013 and prints things that look sort of like this:
time(c1xx.dll)=1.01458s < 5290422043 - 5292596900 > [Group3_E.cpp]
time(c2.dll)=0.00413s < 5292601054 - 5292609908 > [Group3_E.cpp]
This tells us the length of time spent in the compiler front-end and back-end for each source file. Handy. Additionally, the large numbers are the start and stop times for each stage, taken from QueryPerformanceCounter. So, my wrapper process parses the compiler output and emits ETW events that show up in my trace shortly after the compile stages finish – UIforETW automatically records those custom ETW events. It calculates the delay from when the compile stage ended to when it received the output, just in case, and puts that in the event also. This gives us an annotated trace of our build:
The pink and green diamonds at the top of the screenshot correspond to the events emitted when my wrapper program sees the /Bt+ output at the end of each compile stage. The mouse is hovering over the circled diamond, and we can read off the payload in the bottom of the tooltip:
- Source file: Group1_E.cpp
- Stage duration: 6.064306 s
- Start offset: –6.065526 s
- End offset: –0.001220375 s
That means that the event was emitted just 1.2 ms after the compile finished – close enough – and that Group1_E.cpp took over 6 s to compile, so the long green bar is the compilation of Group1_E.cpp, and it’s time to start explaining what is going on.
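The parsing step that the wrapper performs on the /Bt+ output can be sketched as follows. This is not the code from devenvwrapper; the StageTiming struct and ParseBtLine function are hypothetical names, and the pattern assumes lines shaped like the two samples above with a plain ASCII hyphen between the tick counts. Real /Bt+ output may carry full paths or extra tokens, so treat the pattern as a starting point.

```cpp
#include <cassert>
#include <cstdint>
#include <regex>
#include <string>

// One parsed /Bt+ line (hypothetical struct for this sketch).
struct StageTiming {
    std::string stage;         // "c1xx.dll" is the front end, "c2.dll" the back end
    double seconds;            // duration reported by the compiler
    std::uint64_t startTicks;  // QueryPerformanceCounter value at stage start
    std::uint64_t endTicks;    // QueryPerformanceCounter value at stage end
    std::string sourceFile;
};

// Returns true and fills *out if the line looks like /Bt+ timing output.
bool ParseBtLine(const std::string& line, StageTiming* out) {
    assert(out != nullptr);
    // Anything between '>' and '[' is skipped so extra tokens are tolerated.
    static const std::regex pattern(
        R"(time\((.+)\)=([0-9]+(?:\.[0-9]+)?)s\s*<\s*([0-9]+)\s*-\s*([0-9]+)\s*>[^\[]*\[(.+)\])");
    std::smatch m;
    if (!std::regex_search(line, m, pattern)) return false;
    out->stage = m[1].str();
    out->seconds = std::stod(m[2].str());
    out->startTicks = std::stoull(m[3].str());
    out->endTicks = std::stoull(m[4].str());
    out->sourceFile = m[5].str();
    return true;
}
```

Dividing the tick difference by the reported duration gives a rough estimate of the counter frequency, which is a handy sanity check that the two numbers really are start and stop times.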
How VC++ handles multiple files
Visual Studio doesn’t invoke the compiler once per source file. That would be inefficient. Instead VS passes as many source files as possible to the compiler and the compiler processes them as a batch. If parallel processing is disabled then the compiler just iterates through them. If parallel processing is enabled and multiple files are passed in then things get interesting.
In this situation the initial compiler process does no compilation – instead it takes on the task of being the master-control-program. It spawns MIN(numFiles, numProcs) copies of itself and each of those child processes grabs a source file and starts compiling. The child processes keep grabbing more work until there is none left, and then they exit. The master-control-program sticks around until all of the children are finished.
Now this graph makes sense. The blue bar is the master-control-program – the compiler process started by Visual Studio. It doesn’t do any compilation. The other six bars are six new compiler processes that each grab a source file and go to work. We can use the generic ETW events that my wrapper inserts to determine that the short bar is from compiling CompileParallel.cpp, a mostly empty source file. The four roughly equal-length bars are from Group1_A.cpp, Group1_B.cpp, Group1_C.cpp, and Group1_D.cpp, all of which do the same compile-time calculations. The long green bar is from Group1_E.cpp which does four times as much compile-time computation.
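The scheduling behavior described above can be modeled in a few lines. This is a sketch only: threads stand in for the child cl.exe processes, RunBatch and compileOne are hypothetical names, and nothing here reflects how the compiler is actually implemented internally.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Models one batch: the caller plays the master-control-program, which
// compiles nothing itself. It starts min(file count, processor count)
// workers, each worker keeps grabbing the next unclaimed file, and the
// caller waits until every worker has run out of work.
// Returns the number of workers that were started.
std::size_t RunBatch(const std::vector<std::string>& files,
                     std::size_t numProcs,
                     const std::function<void(const std::string&)>& compileOne) {
    assert(compileOne != nullptr);
    const std::size_t workerCount = std::min(files.size(), numProcs);
    std::atomic<std::size_t> next{0};  // index of the next unclaimed file
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < workerCount; ++i) {
        workers.emplace_back([&] {
            for (;;) {
                const std::size_t index = next.fetch_add(1);
                if (index >= files.size()) return;  // no work left, so exit
                compileOne(files[index]);
            }
        });
    }
    for (auto& worker : workers) worker.join();  // stick around until all finish
    return workerCount;
}
```

With six files in the batch and eight logical processors this model starts six workers, which matches the six bars under the blue one in the graph.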
What is limiting parallelism?
The question now is, why doesn’t Visual Studio submit all of the source files at once? What is the limiting factor? A bit of poking around reveals the answer. The batches of files that Visual Studio submits to the compiler have to all have identical compiler options. That makes sense really – send one set of compiler options and a list of source files. The problem is that the compilations of different batches don’t overlap. Visual Studio waits for the previous batch to finish before submitting a new batch. So, the more distinct sets of command-line options you have, the more parallelism is limited. In this particular project we have:
- stdafx.cpp – this file creates the precompiled header file so it necessarily has different command-line options
- Group1_*.cpp – these files all use the precompiled header file so they create another batch
- Group2_*.cpp – these files do not use the precompiled header file so they create another batch
- Group3_*.cpp – these files use the precompiled header file but they each have a different warning disabled on the command line, so each one goes in a batch by itself
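The batching rule in the list above amounts to grouping files by their exact option string. Here is a rough model of that rule; SourceFile and MakeBatches are hypothetical names, the option strings in the usage note are illustrative, and the order in which Visual Studio submits batches is not modeled.

```cpp
#include <map>
#include <string>
#include <vector>

// A source file plus the full set of compiler options that applies to it.
struct SourceFile {
    std::string name;
    std::string options;
};

// Groups files whose option strings are identical. Each map entry is one
// batch, and batches run one after another, so the number of entries is
// the number of serialization points in the build.
std::map<std::string, std::vector<std::string>> MakeBatches(
        const std::vector<SourceFile>& files) {
    std::map<std::string, std::vector<std::string>> batches;
    for (const auto& file : files) {
        batches[file.options].push_back(file.name);
    }
    return batches;
}
```

For example, a stdafx.cpp built with /Yc, two files built with /Yu, one file with no precompiled header, and two /Yu files that each add a different /wd switch produce five batches. Dropping the two /wd switches collapses that to three.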
Making it fast
Now our task is clear. We need to minimize the number of different sets of compiler options (batches).
The warning suppression differences are easy – those can be moved from the command-line to the source file. We just have to use #pragma warning(disable) instead of /wd. Using #pragma warning(disable) is much better anyway. It makes it easier to see what warnings you are suppressing and it lets you put in comments to explain why you are suppressing them. Recommended.
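As a concrete illustration, here is a sketch of moving a suppression into the source. The warning number C4100 (unreferenced formal parameter) and the OnEvent function are arbitrary examples, not taken from the test project. The _MSC_VER guards are only there so that other compilers do not complain about the unknown pragma.

```cpp
// Instead of adding /wd4100 to this one file's command line, which would
// give the file its own set of options and therefore its own batch,
// suppress the warning in the source.
#ifdef _MSC_VER
#pragma warning(push)
// C4100: unreferenced formal parameter. 'context' is required by the
// (hypothetical) callback signature but this handler does not need it.
#pragma warning(disable : 4100)
#endif

int OnEvent(int eventId, void* context) {
    return eventId * 2;
}

#ifdef _MSC_VER
#pragma warning(pop)  // restore the warning state for the rest of the file
#endif
```

The push/pop pair limits the suppression to the code that needs it, and the comment next to the pragma records why it is there, which a command-line switch cannot do.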
Precompiled header files are a trickier case. If we use precompiled headers then we need at least two batches, and possibly three. I’ve seen projects with four different precompiled header files which means that they have at least eight different batches. Precompiled header files are usually a net win, but it is important to understand their costs.
If you want to be certain that you don’t have custom compile settings for some files then the best thing to do is to load up the .vcxproj file into a text editor and look, because it is far too easy to miss customizations when looking in project properties:
With these changes made we can do a final compilation of our project, and the results are marvelous:
The screenshot to the left uses the same horizontal scale as the other ones, but it is so narrow that I can easily fit this paragraph to the right. This paragraph owes its existence to fast parallel compilation, and it would like to say thank you.
The stdafx.cpp file still compiles serially, but everything else is quite nicely parallelized, and our build times are hugely improved.
The green bar at the bottom sticks out longer than everything else because it is the compiler instance that compiles Group1_E.cpp, the most expensive source file. These long-pole source files can be a problem, especially if they start compiling late in the build. If you have a single file that takes 20-30 seconds to compile then that can really ruin build parallelism. This is particularly risky if you use unity builds. If you glom together too many files then you will reduce the opportunities for parallelism and you will increase the risk of creating a “long-pole” that will finish compiling long after everything else. Any unity file that takes more than a few seconds to compile is probably counter-productive, or at least past the point of diminishing returns.
Unity builds are when you include multiple .cpp files from one .cpp file in order to reduce the overhead of redundantly processing header files. In extreme cases developers include every .cpp file from one .cpp file, but that’s just dumb.
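The long-pole effect is easy to put numbers on with a toy model. The sketch below assumes a simple greedy scheduler in which each idle worker takes the next file in list order; the Makespan function and the durations in the usage note are made up for illustration and are not measurements from any build.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Returns the wall-clock time needed to finish every job when 'workers'
// compiler instances each take the next job in list order as soon as
// they become idle.
double Makespan(const std::vector<double>& durations, std::size_t workers) {
    assert(workers > 0);
    // Min-heap holding the time at which each worker next becomes free.
    std::priority_queue<double, std::vector<double>, std::greater<double>> freeAt;
    for (std::size_t i = 0; i < workers; ++i) freeAt.push(0.0);

    double finish = 0.0;
    for (double duration : durations) {
        const double start = freeAt.top();  // the earliest idle worker takes the job
        freeAt.pop();
        const double end = start + duration;
        finish = std::max(finish, end);
        freeAt.push(end);
    }
    return finish;
}
```

With four workers, eight one-second files followed by a six-second file finish at eight seconds, because the long file does not start until second two. Put the six-second file first and the same work finishes at six seconds. In both cases the ideal is 3.5 seconds (14 seconds of work across four workers), so the single long file is what sets the build time.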
Now that we’ve tamed parallel compilation it is rewarding to look at Task Manager’s monitoring of the three different builds. The time savings and the different shapes of the CPU usage graph are quite apparent:
It could be easier
Ideally Visual Studio (technically MSBuild I suppose) should have a better scheduler. There is no reason why the compilation of different batches can’t overlap, and that could give even more performance improvements. A global scheduler could also cooperate with parallel project builds, to avoid having dozens of compiles running simultaneously. Maybe in the next version? If you don’t want to wait you could switch to using the ninja build system – that is what Chrome uses to get perfectly parallel builds on all platforms.
Or, as I suggested in You Got Your Web Browser in my Compiler!, I think Microsoft could get global compiler scheduling quite easily:
Getting a global compiler scheduler would actually be quite easy. Just get MSBuild to create a global semaphore that is initialized to NumCores and pass in a switch to the compilers telling them to acquire that semaphore before doing any work, and release it when they’re done. Problem solved, trivially.
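To make the semaphore idea concrete, here is an in-process sketch. It is only an analogy: threads stand in for compiler processes, and the CompileSlots class and PeakConcurrency function are hypothetical names. A real cross-process version would need a named operating-system semaphore (on Windows, something created with CreateSemaphore) and a compiler switch that does not exist today.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// A counting semaphore initialized to the number of cores.
class CompileSlots {
public:
    explicit CompileSlots(int slots) : available_(slots) { assert(slots > 0); }
    void Acquire() {  // block until a slot is free, then take it
        std::unique_lock<std::mutex> lock(mutex_);
        freed_.wait(lock, [this] { return available_ > 0; });
        --available_;
    }
    void Release() {  // give the slot back and wake one waiter
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++available_;
        }
        freed_.notify_one();
    }
private:
    std::mutex mutex_;
    std::condition_variable freed_;
    int available_;
};

// Starts 'jobs' pretend compiles at once, each gated by the semaphore,
// and returns the largest number that were ever running simultaneously.
int PeakConcurrency(int slots, int jobs) {
    CompileSlots gate(slots);
    std::mutex statsMutex;
    int running = 0;
    int peak = 0;
    std::vector<std::thread> compilers;
    for (int i = 0; i < jobs; ++i) {
        compilers.emplace_back([&] {
            gate.Acquire();  // acquire before doing any work
            {
                std::lock_guard<std::mutex> lock(statsMutex);
                peak = std::max(peak, ++running);
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(5));  // "compile"
            {
                std::lock_guard<std::mutex> lock(statsMutex);
                --running;
            }
            gate.Release();  // release when done
        });
    }
    for (auto& compiler : compilers) compiler.join();
    return peak;
}
```

However many compiles are launched, the peak never exceeds the slot count, which is the property a global scheduler needs: different batches and different projects could all start freely and the semaphore would keep the machine from being oversubscribed.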
Ideally VC++ would emit ETW events at the end of each compilation and linking stage. This would avoid the need for my egregious hacks, and it would give better information, such as which process was working on each file. This would be much better than wrapper processes that use undocumented compile switches. Pretty please?
Of course, parallel compilation is not a panacea. For one thing, if our compiles are fast then when will we have sword fights?
For another thing, heavily parallel compilation can make your computer slow. A few dozen parallel project builds that are all doing parallel compilation may start up more compiler instances than you have processors, and that may make Outlook a bit less responsive. More on that next week.
In this particular case we didn’t use the xperf trace for anything particularly exotic – just process lifetimes and compile-completion markers. However the magic of ETW/xperf is that we can go arbitrarily deep. If we wanted to see all disk I/O, or all file I/O, or detailed CPU usage, or all context switches, or 8,000 CPU samples/sec/core, then grabbing that information is trivial – UIforETW records all that (just check Fast sampling). All of the information goes on the same timeline and we can dig as deep as we want. It’s amazingly powerful, and in a future article I’ll show how I used it to find some surprising behavior of Microsoft’s /analyze compiler…
In many projects just turning on parallel compilation is sufficient. If you’re not doing anything pathological with your build settings and if you aren’t going unity-mad then just turn on parallel builds and enjoy the faster builds. However even in that case you may still find it useful or interesting to use /Bt+, with or without ETW tracing, to investigate your compile-times on a per-file basis.
VS 2013 can also do parallel Link Time Code Generation (LTCG). LTCG can make for extremely slow builds so being able to parallelize it – four threads by default – is pretty wonderful. We upgraded one of our projects to VS 2013 mostly because that made LTCG linking two to three times faster. With VS 2013 Update 2 CTP 2 (a preview of VS 2013 Update 2) the /cgthreads linker option was added which lets you control the level of LTCG parallelism and get a bit more LTCG build speed – /cgthreads:8 on the linker command line means to use eight threads for code-gen instead of the default of four.
Future work could include analyzing the linker /time+ switch output to do some analysis of link times, but this data will not be as rich as the compile data.
Keeping it real
I promised at the beginning that my highly artificial test program was representative of real build-time problems that I have addressed. To make that point, here is a compiler lifetime graph from a real project, recorded before I started doing build optimization:
It was doing parallel builds but, due to seven different sets of compilation settings and some serious long-pole compiles, it was far from perfectly parallelized. By breaking up some unity files and getting rid of unique compilation settings I was able to reduce the compilation time of this project by about one third.
If this information lets you improve your build speeds then please share your stories in the comments.
I bought my laptop three years ago. It has a four-core eight-thread CPU and 8 GB of RAM. I use it for hobby programming projects. For professional programming I would consider that core-count and memory size to be insufficient. Having more cores obviously helps with parallelism, but more memory is also crucial. For one thing you need enough memory to have multiple compiles running simultaneously without paging, but that is just the beginning. You also need enough memory for Windows to use as a disk cache. If you have enough memory then builds should not be significantly I/O bound and you can get full parallelism. My work machine is a six-core twelve-thread CPU with 32 GB of RAM. Adequate. I look forward to the eight-core CPUs coming out later this year, and 64 GB of RAM is probably in my future.
My laptop has four cores and, due to hyperthreading, it has eight threads. The speedup from having those hyperthreads is highly variable but it is typically pretty small. So, on my laptop it is unreasonable to expect more than a four times speedup from parallelism. In practice the speedup is likely to be less because there are other processes running as part of compilation, such as mspdbsrv.exe. On real projects with lots of work for mspdbsrv.exe the maximum speedups are likely to be less than the number of cores.
I’ve made available my pathological project. It requires the Visual C++ Compiler November 2013 CTP to build it. The solution file contains three projects: CompileNonParallel, CompileMoreParallel, and CompileMostParallel. These projects share source files but have different build settings – I hope that the project names are sufficiently expressive.
Also included are buildall.bat, ETWTimeBuild_lowrate.bat, and the source for devenvwrapper.exe. If you run buildall.bat from an elevated command prompt then it will register the ETW provider, run the other files, and build all three projects, creating three ETW files. The trace files used in creating this blog post are also included.
You can use ETWTimeBuild_lowrate.bat as a general purpose tool for recording lightweight build traces. Just pass in the path to your solution file and the name of the project to build and it will rebuild the release configuration of your project. For deeper profiling you can enable the sampling profiler or other providers.
The devenvwrapper project also serves as a good example of a very simple ETW provider.
The ETW files can then be loaded into WPA 8.1. You can then go to the Profile menu and apply the supplied CompilerPerformance.wpaProfile file which will lay out the data in a sensible way. Drill down into the two data sets and you too can see how parallel your builds are. The table version of the Generic Events data lets you sort by compilation time and otherwise analyze the data. Note that filters are applied so that only the compiler processes and compiler events are shown.
All of these files can be found on github and can be downloaded and used thusly:
> git clone https://github.com/randomascii/main
> cd main\xperf\vc_parallel_compiles