A twitter discussion on build times and source-file sizes got me interested in doing some analysis of Chromium build times. I had some ideas about what I would find (lots of small source files causing much of the build time) but I inevitably found some other quirks as well, and I’m landing some improvements. I learned how to use d3.js to create pretty pictures and animations, and I have some great new tools.
As always, this blog is mine and I do not speak for Google. These are my opinions. I am grateful to my many brilliant coworkers for creating everything which made this possible.
The Chromium build tools make it fairly easy to do these investigations (much easier than my last build-times post), and since it’s open source anybody can replicate them. My test builds took up to 6.2 hours on my four-core laptop but I only had to do that a few times and could then just analyze the results.
I did my tests on an October 2019 version of Chromium’s code because that gave me the most flexibility about which build options to use. I used a 32-bit, debug, component (multi-DLL, for faster linking) build, with NACL disabled, full debug information, with reduced debug information for blink (Chromium’s rendering engine) as my base build. In other words, these are the build arguments I used:
target_cpu = "x86" # 32-bit build, maybe faster?
is_debug = true # Extra checks, minimal optimizations
is_component_build = true # Many different DLLs, default with debug
enable_nacl = false # Disable NACL
symbol_level = 2 # Full debug information, default on Windows
blink_symbol_level = 1 # Reduced symbol information for blink
This is a good set of options for developing Chromium as it gives full debuggability together with a fast turnaround on incremental builds (when just a few source files are modified between builds).
Enough talk, let’s see some data
Let’s start with some pretty pictures. The graph below shows the relationship between the number of lines of code in source files and how long it took to compile them. I clamped the really-big and really-slow files and zoomed in to make it easier to see the patterns and… there aren’t any (live diagrams here):
In this diagram and the ones that follow the colors represent when in the build a file was compiled, with blue files happening first, then green, and red files happening last.
The .csv files I generated from my builds have a wealth of data, easily explorable with the supplied scripts:
A bit of analysis shows that it makes sense that there is no correlation between source-file length and compile time. The 30,137 compile steps that I tracked consumed a total of 11.6 million lines of source code in the primary source files. However the header files included by these source files added an additional 3.6 billion lines of source code to be processed. That is, the main source files represent just 0.32% of the lines of code being processed.
It’s not that Chromium has 3.6 billion lines of source code. It’s that header files that are included from multiple source files get processed multiple times, and this redundant processing is where the vast majority of C++ compilation time comes from. It’s not disk I/O, it’s repeated processing (preprocessing, lexing, parsing, code-gen, etc.) of millions of lines of code (note the 100% CPU usage, maintained for most of the build, due to ninja’s excellent job scheduling):
But what about precompiled header files? Couldn’t they be used to reduce the overhead of these header files? Well, it turns out that Chromium does use precompiled header files in some areas and the 3.6 billion line number is after those savings have been factored in. More on that later.
Patterns? You want patterns?
Let’s create another chart, this one showing the relationship between the lines of code in all of the include-file dependencies and the compile times. Now we’ve got some patterns (live diagrams here):
Is it just me or does that look like a distorted version of the classic Hertzsprung-Russell diagram of stellar luminosities versus temperatures? No? Just me?
Looking at this graph I can see at least four patterns, numbered here:
1. The bottom of the chart shows a thick band of points curving up and to the right (mostly red, but all colors appear) – the main sequence. This area clearly shows that more lines in dependencies generally leads to longer compile times. The bottom of this band curves up gradually, looking like it’s almost an O(n^2) equation, at least for the minimum cost, although the average and maximum costs look closer to linear.
2. Starting at about the 4.0 second mark on the y-axis (in green) there is a significant set of files that have a minimum compile time of about four seconds, regardless of include size. There are many files in this region whose source-files plus all of their includes are less than 200 lines and yet take 4.0 seconds or longer to compile.
3. There is an odd vertical structure from 200,000 to 240,000 lines of includes (in greenish blue).
4. There is also a perfectly straight vertical line of points at 343,473 lines of includes and around 12 to 13 seconds of compile time (below the digit ‘4’, in green).
The colorful graphs of compile times (live versions available here) are interactive. This means that identifying which files are associated with each of the patterns is as simple as moving the mouse around in that area.
Structures 2 and 4 – Precompiled headers
It turns out that structures 2 and 4 are both related to precompiled header files. Structure 2 is a set of files that use precompiled header files. This leads to a minimum compile cost of about 4.0 seconds (loading precompiled header files in clang-cl is fairly expensive) and a very low dependencies line count – apparently the headers that are precompiled don’t count as dependencies in this context.
The vertical line in structure 4 is from creating precompiled header (.pch) files. It is perfectly vertical because every one of the precompilations is compiling exactly the same set of headers, all from precompile_core.cc, which includes (through command-line magic) precompile_core.h. For some reason this file gets compiled 59 times, each time creating a 76.9 MB .pch file. This got even worse for a while but has been mitigated – see below.
In short, precompiled headers can be very helpful, but they come with their own costs. In this case there is the 700+ seconds to redundantly build the .pch files (59 builds at roughly 12 seconds each), and then the ~16,000 seconds to load the large .pch file more than 4,000 times, plus the additional dependencies.
Note: if we disable precompiled headers entirely the build gets slightly slower. The main source files then go from 0.32% of the lines of code being processed to just 0.125% (previously reported as 0.22% due to omission of system header files), with the includes adding up to 9.3 billion lines of code.
By March 2020 the number of copies of the blink .pch file had grown to 67 with each one now 90 MB. I ran some experiments with shrinking the blink precompiled header files and found that if I reduced precompile_core.h dramatically I could:
- Cut out almost 90% of the cost of creating the .pch files
- Cut out almost 95% of the 5.5 GB size of the .pch files
- Slightly lower the average compile time for Blink source files
- Reduce accidental dependencies – translation units depending on headers that they don’t need
Those improvements were good enough that I was able to get a change landed to reduce what precompile_core.h includes.
I got the cost-reduction numbers above from a build-summarizing script I wrote, but when I created graph creation tools for this blog post it made sense to apply them to this change, to better visualize the improvements, so I patched the precompiled header change into my old Chromium build. And, I realized that I could animate the compile-times and the number of include lines for each target, to make a movie showing the transition from old to new. In reality the change was a jump from one state to the other, but showing it as motion makes it easier to see patterns.
In this video (which just shows source files in Blink, to reduce the noise) you can see the files which create the precompiled-header files (pattern 4) moving down and to the left, because they are including fewer files and compiling faster. You can also see the many files which consume the precompiled header files (pattern 2) moving to the right because they now have more header files to consume – recall that precompiled headers aren’t counted – and moving both up and down (both longer and shorter compiles).
The animation makes it crystal clear that this change didn’t help all blink source files. Some got slower, so I might have to try using the original precompile_core.h for some files – this blog post sure triggers a lot of work!
In addition to visualizing the savings I (and you) can use one of my scripts to measure, for instance, the before/after costs of creating the precompiled header files:
python ..\count_costs.py windows-default.csv *precompile_core.cc
59 files took 0.193 hrs to build. 0.000 M lines, 20.3 M dependent lines
python ..\count_costs.py windows-pchfix.csv *precompile_core.cc
59 files took 0.021 hrs to build. 0.000 M lines, 2.6 M dependent lines
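The real count_costs.py lives in the repo. Purely as an illustration of the kind of query it answers, here is a minimal Python sketch that filters rows by a glob pattern and totals their costs. The column names (name, seconds, lines, dependent_lines), the function name, and the sample rows are all assumptions invented for this example and may not match the real .csv files.

```python
import csv
import fnmatch
import io


def summarize(csv_text, pattern="*"):
    """Total the compile costs of rows whose file name matches a glob pattern.

    Assumed (hypothetical) columns: name, seconds, lines, dependent_lines.
    Returns (file count, hours, millions of lines, millions of dependent lines).
    """
    count = 0
    seconds = 0.0
    lines = 0
    dependent_lines = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not fnmatch.fnmatch(row["name"], pattern):
            continue
        count += 1
        seconds += float(row["seconds"])
        lines += int(row["lines"])
        dependent_lines += int(row["dependent_lines"])
    return count, seconds / 3600.0, lines / 1e6, dependent_lines / 1e6


# Made-up rows, only here to exercise the function.
SAMPLE = """name,seconds,lines,dependent_lines
gen/a/precompile_core.cc,12.0,0,343473
gen/b/precompile_core.cc,11.5,0,343473
v8/src/foo.cc,3.0,500,210000
"""

if __name__ == "__main__":
    n, hours, m_lines, m_deps = summarize(SAMPLE, "*precompile_core.cc")
    print(f"{n} files took {hours:.3f} hrs to build. "
          f"{m_lines:.3f} M lines, {m_deps:.1f} M dependent lines")
```

Running the same query against a "before" and an "after" .csv is all it takes to compare the two builds.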
Our full results with this change patched in to the old build now look like this (live diagrams here):
The two precompile-related patterns are now gone which just leaves us with the main sequence and the bluish-green tower (pattern #3) at just past 200,000 include lines. A bit of spelunking shows that the tower is mostly v8 files, especially source files generated by the build, which all include expensive-to-compile header files. In general generated source files can consume much compilation time. I hope to make some improvements there, but that will have to wait until after the blog post:
What if we had fewer source files…
My original belief was that having fewer source files would improve build times, by reducing redundant processing of header files. Wouldn’t it be nice if there was some way to test this? It turns out that there is. A classic technique for doing this is to treat your .cpp files like include files. That is, instead of compiling ten .cpp files individually, generate a .cpp file that includes them, and compile those generated files. This technique is sometimes called a unity build. The generated files look something like this:
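As a rough illustration, the Python sketch below emits wrapper files of this kind: each one is nothing but a list of #include lines naming other source files. The function name, the chunking scheme, and the unity_N.cc file names are assumptions made for this example, not what Chromium's jumbo build actually generated.

```python
def make_unity_files(sources, merge_limit):
    """Group source paths into chunks and return {wrapper_name: wrapper_text}.

    Each wrapper #includes up to merge_limit source files, so the compiler
    sees them as a single translation unit.
    """
    if merge_limit < 1:
        raise ValueError("merge_limit must be at least 1")
    unity_files = {}
    for start in range(0, len(sources), merge_limit):
        chunk = sources[start:start + merge_limit]
        name = f"unity_{start // merge_limit}.cc"  # hypothetical naming scheme
        header = "// Generated file: compiles several sources as one translation unit.\n"
        body = "".join(f'#include "{path}"\n' for path in chunk)
        unity_files[name] = header + body
    return unity_files


if __name__ == "__main__":
    for name, text in make_unity_files(["a.cc", "b.cc", "c.cc"], 2).items():
        print(f"--- {name}\n{text}")
```

With a merge limit of 2, three source files become two wrappers, one including a.cc and b.cc and one including only c.cc.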
If the included C++ files share a significant number of header files – usually the case if the files are related – then a significant amount of work can be avoided.
Some people have argued against this by pointing out that if you #include all of your source files into one then your incremental build times get worse. Well. Yeah. So don’t do that. There’s a lot of middle ground between compiling everything separately and compiling everything in one translation unit.
For a while Chromium had an option to do this, called jumbo builds, created by Daniel Bratell. This system defaulted to trying to #include 50 source files in each generated file (subject to constraints) and this was configurable. These jumbo builds significantly reduced the time to do full rebuilds of Chromium on machines with few processors. For incremental builds and massively parallel builds the benefits of jumbo builds were lower.
I decided to do three different jumbo builds with jumbo_file_merge_limit set to three different values. I would have done more small numbers but 2-4 failed due to command-line limits that I didn’t feel like addressing.
The graph below shows four points which are, left to right, merge amounts of 50, 15, 5, and the default build. The graph shows how the total number of hours of compile time goes down as the number of translation units compiled is reduced.
The downwards curve of the graph suggests that if we can reduce the number of translation units to 10,000 then the compile times will hit zero but I would advise readers not to trust that extrapolation.
One of the common complaints about jumbo/unity builds is that, by glomming many files together, they make individual compiles take longer. Let’s examine that:
python ..\count_costs.py R710480-default.csv
30137 files took 21.453 hrs to build. 11.856 M lines, 3611.6 M dependent lines
Averages: 393 lines, 2.56 seconds, 119.8 K dependent lines per file compiled
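The averages follow directly from the totals on the first line of output. Here is a quick sketch of that arithmetic, using only the rounded totals printed above (the function name is made up for this example, and the real script presumably works from unrounded numbers):

```python
def averages(files, hours, m_lines, m_dependent_lines):
    """Per-file averages from build totals (lines and dependent lines in millions)."""
    lines_per_file = m_lines * 1e6 / files
    seconds_per_file = hours * 3600 / files
    k_dependent_per_file = m_dependent_lines * 1e6 / files / 1e3
    return lines_per_file, seconds_per_file, k_dependent_per_file


lines, seconds, k_deps = averages(30137, 21.453, 11.856, 3611.6)
print(f"Averages: {lines:.0f} lines, {seconds:.2f} seconds, "
      f"{k_deps:.1f} K dependent lines per file compiled")
```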
So, our default compile takes an average of 2.56 seconds per source file. What of our jumbo build with up to five C++ files per compilation:
Uh. Wait a minute. We’ve almost cut the number of files compiled in half, and the average compile time has… dropped?
No, this isn’t an error. This happens because jumbo was applied mostly to expensive-to-compile files. Since most of their cost is header files the combining of them barely increased their compile cost. Since there are now fewer expensive files the average cost drops. It’s a miracle! It turns out you have to go to the 84th percentile before jumbo-5 files take longer to compile, and even the most expensive file only takes 42% longer to compile. There is such a thing as a free lunch. In addition jumbo linking may be faster due to reduced redundant debug information.
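A toy model makes the effect easier to see. Every number below is invented for illustration (none of them come from the Chromium measurements): six cheap files, plus four expensive files whose cost is mostly shared headers.

```python
def average(costs):
    """Mean compile time over a list of per-translation-unit costs."""
    return sum(costs) / len(costs)


HEADER_COST = 4.0  # seconds of shared header processing per translation unit (made up)
BODY_COST = 0.5    # seconds for each file's own code (made up)

cheap = [1.0] * 6
expensive = [HEADER_COST + BODY_COST] * 4

# Default build: every file is its own translation unit.
separate = cheap + expensive

# Jumbo build: the four expensive files share one translation unit, so the
# headers are processed once and only the bodies add up.
merged_unit = HEADER_COST + 4 * BODY_COST
jumbo = cheap + [merged_unit]

print(f"separate: {len(separate)} units, average {average(separate):.2f} s, "
      f"total {sum(separate):.1f} s")
print(f"jumbo:    {len(jumbo)} units, average {average(jumbo):.2f} s, "
      f"total {sum(jumbo):.1f} s")
```

In this made-up case the merged unit is slower than any single original file, yet both the average and the total go down, because three redundant passes over the headers have disappeared.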
This all looks great, but, there are problems.
The main one is that when you #include a bunch of source files together then the compiler treats them as one translation unit, with one anonymous namespace, but the programmers writing the code see each source-file as independent. This caused obscure compiler errors in Chromium as identically named “local” functions and variables conflicted, but only in the jumbo configuration. The jumbo build effectively meant we were programming in an odd dialect of C++ with surprising rules around global namespaces.
Jumbo builds (if taken to excess) can also make incremental builds slower – because the compile-times for a large batch of source files can be non-trivial – and more tweaks are needed to work around this. Even though the average is better there are a bunch of 99th percentile files which take longer to compile, especially if jumbo_file_merge_limit is set too high.
Jumbo builds also create additional coupling at link time as “unused source files” get linked in and then require all of their dependencies as well. Without care this can lead to shipping binaries that are bloated with test code, or surprising link errors.
Additionally, Google’s massively parallel goma build benefited much less from jumbo. So, jumbo builds were deemed too much of a hack and were turned off. I was only able to use it for this post by syncing to the commit just before it was disabled – and I had to fix two name conflicts in anonymous namespaces to get Chromium to compile with it.
Reductio ad absurdum
The logical extension of source files with little code and lots of header files would be source files with zero code and lots of header files. But surely such files would never exist. Right?
While working on this I noticed that compiling of source files generated by mojo (Chromium’s IPC system) took quite a while. While poking at these I found that 680 of the generated C++ files (21% of the total) contained no code. Due to the size of the header files these no-code C++ files were collectively taking about twenty minutes of CPU time to build! I landed a change that detected this situation and removed the #includes when no code was detected. This was a simple change that reduces Chromium’s build time significantly in absolute terms (four to five minutes elapsed time on a four-core laptop) but as a percentage (~1.6%) barely moves the needle.
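To give a feel for what "contains no code" means, here is a heuristic Python sketch that flags C++ text consisting only of comments, preprocessor lines, and empty namespaces. This is not the change that landed in Chromium (that change was made in the mojo generator itself); the function name and the regular expressions are assumptions for illustration, and the heuristic ignores corner cases such as comment markers inside string literals or multi-line macros.

```python
import re

# Matches // line comments and /* block comments */.
_COMMENTS = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
# Matches a namespace (named or anonymous) whose braces contain only whitespace.
_EMPTY_NAMESPACE = re.compile(r"namespace(\s+[\w:]+)?\s*\{\s*\}")


def has_no_code(cpp_text):
    """Return True if only comments, preprocessor lines and empty namespaces remain."""
    text = _COMMENTS.sub(" ", cpp_text)
    # Drop #include, #if, #define and other preprocessor lines.
    text = "\n".join(line for line in text.splitlines()
                     if not line.strip().startswith("#"))
    # Collapse nested empty namespaces from the inside out.
    while True:
        text, replaced = _EMPTY_NAMESPACE.subn(" ", text)
        if replaced == 0:
            break
    return text.strip() == ""


EMPTY = ('// Copyright notice\n#include "big.h"\n'
         'namespace a {\nnamespace b {\n}  // namespace b\n}  // namespace a\n')
WITH_CODE = '#include "big.h"\nint f() { return 1; }\n'

if __name__ == "__main__":
    print(has_no_code(EMPTY), has_no_code(WITH_CODE))
```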
Now that I have the scripts needed to create videos to show compile-time improvements I figured I might as well animate this one as well – you can see some of the source-files racing towards the origin (tested on a March 2020 repo):
Why so long?
My initial twitter guess was that build times grow roughly in proportion to the square of the number of translation units. If we assume a large project with N classes, each with a separate .cc file and .h file then the number of compile steps is N. If 10% of the header files are used (often indirectly) by each .cc file then our compile cost is N * 0.10 * N, which is O(N^2). QED.
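The same back-of-the-envelope model as a few lines of Python, a sketch under exactly the assumptions above (N source files, each pulling in a fixed fraction of the N headers); the function name is made up for this example:

```python
def headers_processed(n, fraction_included=0.10):
    """Headers processed over a full build: N sources times the headers each includes."""
    return n * (fraction_included * n)


for n in (1000, 2000, 4000):
    print(f"N = {n:5d}: {headers_processed(n):12,.0f} header inclusions")
```

Each doubling of N quadruples the amount of header processing, which is the O(N^2) growth in question.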
Other work in this area
I am far from the first person to try their hand at investigating Chromium’s build times. This is hardly surprising because Chromium’s build times have always been daunting, and different visualizations often reveal different opportunities.
Much more recently an alternate denominator – tokens instead of lines of code – was proposed in this January 2020 document. I chose to stick to lines of code because everyone knows how to measure them and the correlations in Chromium are similar.
For understanding why individual source files take a long time to compile we now have the excellent -ftime-trace option for clang which emits flame charts showing where time went. This flag can be set in Chromium’s build by setting compiler_timing = true in the gn args. When this is done a .json file will be created for each file compiled, and these can be loaded into chrome://tracing, or programmatically analyzed. In particular the author of -ftime-trace has created ClangBuildAnalyzer which can do bulk analysis of the results.
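As a hedged example of the "programmatically analyzed" route, the sketch below totals time per event name in one trace file. It assumes the usual chrome://tracing layout (a top-level traceEvents list whose complete events have "ph" set to "X", a name, and a duration "dur" in microseconds); the sample trace is hand-written for this example, not real clang output, and the function name is made up.

```python
import json
from collections import defaultdict


def total_by_event(trace_text):
    """Sum 'dur' (microseconds) per event name over complete ('X') events."""
    trace = json.loads(trace_text)
    totals = defaultdict(int)
    for event in trace.get("traceEvents", []):
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    return dict(totals)


# Hand-written sample in the assumed layout.
SAMPLE_TRACE = json.dumps({
    "traceEvents": [
        {"ph": "X", "name": "Source", "ts": 0, "dur": 1500,
         "args": {"detail": "a.h"}},
        {"ph": "X", "name": "Source", "ts": 1500, "dur": 500,
         "args": {"detail": "b.h"}},
        {"ph": "M", "name": "process_name"},  # metadata event, skipped
    ]
})

if __name__ == "__main__":
    for name, micros in sorted(total_by_event(SAMPLE_TRACE).items()):
        print(f"{name}: {micros / 1000:.1f} ms")
```

For bulk analysis across a whole build, ClangBuildAnalyzer is the better tool; a loop like this is only handy for quick one-off questions.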
This Chromium script looks for expensive header files in a Chromium repo. It could potentially be combined with -ftime-trace for more precise measurements.
ninjatracing converts the last ninja build’s build-step timing into a .json file suitable for loading into chrome://tracing to visualize the parallelism of the build.
post_build_ninja_summary.py is designed to give a twenty line summary of where time went in a ninja build. If the NINJA_SUMMARIZE_BUILD environment variable is set to 1 then autoninja will automatically run the script after each build, as well as displaying other build-performance diagnostics. This script was enhanced recently to allow summarizing by arbitrary output patterns.
There is an ongoing project to improve Chromium build times by making more use of mojom forward declarations instead of full definitions wherever possible. This is tracked by crbug.com/1001360.
There is an ACM paper (temporarily freely accessible) discussing unity/jumbo builds in WebKit in great detail.
Reproducing my work
The scripts and data that I created for this blog post are all on github. They simply leverage ninja and gn’s results to create .csv files, and allow easy queries of the .csv files. The web page source is available, with the live page itself here. Click on any point on a graph to save it as a .png, and hover over it to see details on individual source files.
But it’s not science unless other people can reproduce the results. In order to follow along you need to get a Chromium repo (not a trivial task, but possible, just follow these instructions). Then, you need to run one or both of the supplied batch files (consider running them manually since some steps may be error prone, take 5+ hours, and consume dozens of GB of disk space):
- test_old_build.bat – this batch file checks out an old version of Chromium’s source, patches it so that jumbo works, and then builds Chromium with five different settings. Expect this to take about 20 hours on a four-core machine with lots of memory and a fast drive.
- test_new_build.bat – this batch file assumes that you are synced to a recent version of Chromium’s source. It patches in a change so that system headers are tracked, builds Chromium, and then does incremental builds with and without the mojo empty-file fix. Expect this to take about 13 hours on a four-core machine with lots of memory and a fast drive.
All of my test builds use the “-j 4” option to ninja, for four-way build parallelism. My laptop has four cores and eight logical processors, and ninja would default to ten-way parallelism, but I wanted each compile process to have a core to itself, to minimize interference. You should adjust this setting based on your particular machine and scientific interests. Using all logical processors with “-j 10” instead of “-j 4” makes my builds run about 1.28 times as fast.
If you don’t want to spend hours getting a Chromium repo and building multiple variants you can still do some Chromium build-time analysis. Just clone the repo and you can run count_costs.py on various .csv files with varying filters to see which files are the worst. If you find anything interesting, let me know.
And, if you want to fix any of the issues, Chromium is open source.
Issues found while writing this
- Empty files generated by mojo – crbug.com/1054626, fixed in this change
- Precompiled files are too big, causing wasted time creating them, wasted time loading them, and 5+ GB of wasted disk space – crbug.com/1061326, fixed in this change
- Precompiled files generated redundantly – crbug.com/1060840. This was reported a long time ago. Since then the number of redundant builds has increased, but the change to shrink precompile_core.h has made this redundant building less critical.
- Clang loads precompiled files slowly – this has been known for a while and is discussed on crbug.com/672115 and there is ongoing work on a patch to improve this.
- Windows Photos can’t correctly display images with alpha. I tweeted about that here and included a link to the bad photo here.
- Precompiled header files have tricky tradeoffs that are difficult to assess.
- Chromium’s average source-file size is smaller (393 lines) than I expected.
- If you are creating a large project and build times are important to you then the most important thing you can do is to prohibit small source files.