So many possible introductions to this one:
- Windows 7: Sheesh, I sure am slow at creating processes
- Windows 10: Hold my beer…
Or how about:
- A) How long does CreateProcess take on Windows?
- B) How long would you like it to take?
- A) You mean you can make it as fast as I want?
- B) No, I can make it as *slow* as you want
O(n^2) algorithms that should be linear are the best.
Note that, despite breathless and click-baity claims to the contrary, the performance of Chrome and Chromium was never affected by this bug. Only Chromium’s tests were affected, and that slowdown has been 100% mitigated.
I often find odd performance issues all on my own, but sometimes they are given to me. So it was when I returned from vacation to find that I’d been CCed on an interesting looking bug. Vivaldi had reported “Unit test performance much worse on Win10 than Win7”. unit_tests were taking 618 seconds on Win10, but just 125 seconds on Win7.
Update, April 23, 2019: Microsoft received the initial “anomaly” report on the 15th, the repro steps on the 21st, and announced a fix today. Quick work!
By the time I looked at the bug it was suspected that CreateProcess running slowly was the problem. My first guess was that the problem was UserCrit lock contention caused by creating and destroying default GDI objects. Windows 10 made these operations far more expensive, I’d already written four blog posts about the issues that this causes, and it fit the symptoms adequately well.
But! I don’t get paid to guess at the causes of problems, so I built Chromium’s unit_tests.exe, profiled it myself, and quickly found that my initial theory was completely wrong.
unit_tests.exe runs about 10,900 tests. It creates child-process instances of itself to run them, with each child process running ten tests. That means it needs to create about 1,090 child processes. The child-process creation is multi-threaded, but serialized by a lock in order to control handle inheritance. I verified by looking at CPU Usage (Precise) data in an ETW trace that the main process was spending huge amounts of time waiting on its lock around the CreateProcess calls.
The easiest explanation for why there was so much contention for the lock around CreateProcess would be that CreateProcess was running slowly, and indeed it was taking about 320 ms to execute. The CPU Usage (Sampled) data showed exactly that. It showed that 99% of the CPU time in the main unit_tests process was inside of CreateProcess, and 98.4% of the samples were in a single function. To be clear, that 98.4% of samples was not in a function and its children – those samples were all in the same function. It was the hottest hot-spot you can possibly get. The function was called ntoskrnl!MiCopyToCfgBitMap, and here is some CPU sampling data from a worst-case scenario – notice that in this case more than 99.5% of samples are in the one function:
CFG is a security feature that uses a bitmap to track which indirect branch targets are valid and MiCopyToCfgBitMap was clearly initializing that bitmap. unit_tests.exe is a 122 MB executable and it has about 180,000 valid indirect branch targets but… computers are really fast. The CFG bitmap for unit_tests.exe is only about 1.5 MB and it’s hard to understand why it would take about 780 million clock cycles to initialize such a small amount of data.
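The back-of-the-envelope arithmetic makes the absurdity concrete. This sketch just redoes the numbers from this paragraph; the 2.4 GHz clock speed is an assumed figure used only for scale, not a measurement:

```python
# Rough arithmetic from the figures in this post. The clock speed is an
# assumption, used only to convert cycles into milliseconds.
BITMAP_BYTES = 1.5 * 2**20     # CFG bitmap for unit_tests.exe: about 1.5 MB
CYCLES = 780e6                 # ~780 million cycles spent initializing it
CPU_GHZ = 2.4                  # assumed clock speed

cycles_per_byte = CYCLES / BITMAP_BYTES
ms = CYCLES / (CPU_GHZ * 1e9) * 1000
print(f"{cycles_per_byte:.0f} cycles per bitmap byte")  # ~500 cycles/byte
print(f"{ms:.0f} ms per CreateProcess call")            # ~325 ms
```

Hundreds of cycles per byte of output is a sure sign that far more work is being done than a straightforward copy or fill.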
Step 1, mitigate
Since CFG is a security feature we want to leave it enabled for all of the executables that we ship. But we don’t actually need it for our tests. So, I landed a change that turns off CFG for our test binaries, and suddenly unit_tests.exe was running about five times faster, with CreateProcess itself now running about 20 times faster (Vivaldi measured CreateProcess as 200 times faster after the mitigation!). Our other large test binaries also run a bit faster, but I didn’t spend much time measuring their improvements.
Step 2, investigate
My work here was done but I am a very curious bear and I wanted to better understand what was going on. I wanted to find out whether the problem was specific to Chromium (perhaps caused by our use of clang-cl or lld-link) and I wanted to try to understand why CFG was causing such a slowdown.
My best theory for why MiCopyToCfgBitMap was so slow was that it was using an O(n^2) algorithm. That would explain why this performance issue had gone unnoticed by Microsoft, and it would explain why initializing such a small amount of data was taking so long. Ever since I wrote a “sooper-fast (TM)” version of std::vector (before std::vector existed) and found that it had abysmal performance due to my use of a linear growth strategy I have had a deep respect for the ability of O(n^2) algorithms to make fast computers run slowly.
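That failure mode is easy to demonstrate. This illustrative sketch (not the original vector code, which is long gone) counts element copies under a fixed-increment growth strategy versus doubling:

```python
# Illustration: count how many element copies each growth strategy
# performs when appending n elements to a vector-like container.

def copies_when_appending(n, grow):
    """Return total element copies for n appends with growth policy grow."""
    size = capacity = copies = 0
    for _ in range(n):
        if size == capacity:
            capacity = grow(capacity)
            copies += size          # reallocation copies every existing element
        size += 1
    return copies

n = 100_000
linear = copies_when_appending(n, lambda c: c + 1000)          # fixed increment
geometric = copies_when_appending(n, lambda c: max(1, c * 2))  # doubling
print(linear, geometric)  # 4950000 131071
```

The fixed-increment strategy copies O(n^2) elements in total (each element is copied dozens of times), while doubling copies each element only a constant number of times on average.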
In one trace that I grabbed I found that more than 95% of the samples in my test process were in just seventeen instructions in MiCopyToCfgBitMap, which is tough to do without an n^2 algorithm:
My first attempt at investigating was to grab the sample counts and addresses from the ETW trace, grab the disassembly of MiCopyToCfgBitMap from livekd, write a script to merge them, and then analyze the annotated disassembly. This gave a nice visualization of a small number of instructions accounting for the vast majority of the CPU time (the lea/bt/jae trio account for 23% of CPU time in the process):
    0: kd> uf nt!MiCopyToCfgBitmap
    nt!MiCopyToCfgBitMap:
    ...
    nt!MiCopyToCfgBitMap+0x3ce:
      205 fffff802`60ca812e movzx eax,byte ptr [r8]
      407 fffff802`60ca8132 inc r8
     1771 fffff802`60ca8135 dec r9
        1 fffff802`60ca8138 mov ecx,eax
      212 fffff802`60ca813a mov edx,eax
       84 fffff802`60ca813c shr edx,6
     3657 fffff802`60ca813f and ecx,3Fh
       13 fffff802`60ca8142 mov eax,dword ptr [rdi+rdx*4]
     4092 fffff802`60ca8145 imul eax,ecx
     6111 fffff802`60ca8148 add r10d,eax
     2163 fffff802`60ca814b cmp edx,3
          fffff802`60ca814e jne fffff802`60ca8125
    ...
    nt!MiCopyToCfgBitMap+0x42d:
       22 fffff802`60ca818d lea edx,[rcx+r10]
     2104 fffff802`60ca8191 bt qword ptr [r9],rdx
    10784 fffff802`60ca8195 jae fffff802`60ca81ab
    ...
    nt!MiCopyToCfgBitMap+0x437:
       43 fffff802`60ca8197 mov rdx,qword ptr [r14+30h]
     1988 fffff802`60ca819b test rdx,rdx
          fffff802`60ca819e je fffff802`60e47737
    ...
    nt!MiCopyToCfgBitMap+0x44b:
       89 fffff802`60ca81ab inc ecx
     1973 fffff802`60ca81ad cmp ecx,r8d
          fffff802`60ca81b0 jb fffff802`60ca818d
    ...
But the function was too large to be easily reverse engineered, so I moved on to another tactic. I wrote a Python script that generated classes with virtual functions. Each virtual function is necessarily an indirect branch target, so this let me create binaries with arbitrarily large numbers of CFG entries. I eventually settled on generating 16 source files, each with 50 classes, each class with 0 to 1,600 virtual functions. That gave me the following CFG entry counts:
- 45 – miscellaneous overhead
- 800 – one per class, probably related to how destructors are called
- 0 to 1,280,000 – one per virtual function
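The original script isn't reproduced in this post, but the idea can be sketched in a few lines (the class and function names here are invented for illustration, not from the original script):

```python
# Hypothetical sketch of a source generator that emits classes full of
# virtual functions, each of which becomes a CFG indirect-branch target.

def generate_source(file_index, num_classes=50, funcs_per_class=1600):
    lines = []
    for c in range(num_classes):
        lines.append(f"class C{file_index}_{c} {{")
        lines.append(" public:")
        lines.append(f"  virtual ~C{file_index}_{c}() {{}}")
        for f in range(funcs_per_class):
            # Each virtual function body is trivial but distinct.
            lines.append(f"  virtual int F{f}() {{ return {f}; }}")
        lines.append("};")
    return "\n".join(lines) + "\n"

src = generate_source(0)
print(src.count("virtual"))  # 80050: one per function plus one per destructor
```

Sixteen such files at the maximum setting give the 1,280,000 virtual functions mentioned above; dialing funcs_per_class up or down produces the variants.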
I then compiled seventeen different variants (using /MP for parallel compilation) and used this command to verify how many CFG entries I was getting:
    dumpbin gen0.exe /loadconfig | find "function count"
Finally I measured CreateProcess time of each version with a simple test harness. That gave me this beautiful graph:
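The harness itself isn't shown in this post. A minimal stand-in can be sketched in Python, where subprocess.Popen goes through CreateProcess on Windows; the child command below is a placeholder, whereas the real measurement launched the generated genN.exe binaries:

```python
# Minimal sketch of a process-creation timing harness. On Windows,
# subprocess.Popen calls CreateProcess, so timing Popen gives a rough
# proxy for CreateProcess cost (it also includes process exit and reaping,
# which is fine when CreateProcess dominates).
import subprocess
import sys
import time

def time_process_creation(args, iterations=10):
    """Return the average seconds per create/wait cycle for args."""
    start = time.perf_counter()
    for _ in range(iterations):
        proc = subprocess.Popen(args)
        proc.wait()
    return (time.perf_counter() - start) / iterations

# Use the Python interpreter itself as a stand-in child process:
avg = time_process_creation([sys.executable, "-c", "pass"], iterations=5)
print(f"average process creation time: {avg * 1000:.1f} ms")
```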
That n^2 graph is so perfect that I’d be suspicious if I hadn’t measured it myself. Plotting the same data as CFG count against sqrt(CreateProcessTime) gives a perfectly straight line (with a tiny kink at the beginning for the fixed cost of actually creating a process):
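The sanity check behind that plot is simple: if time ≈ fixed + c·n², then sqrt(time) approaches a straight line in n, with curvature only where the fixed cost dominates. A sketch with made-up coefficients shows both the straight line and the kink:

```python
# Demonstration that sqrt of quadratic-plus-fixed-cost data is nearly
# linear. The coefficients are invented for illustration only.
fixed_cost = 5.0       # ms: flat cost of creating any process
per_entry2 = 2e-10     # ms per (CFG entry)^2 - made-up coefficient

ns = range(0, 1_280_001, 80_000)
times = [fixed_cost + per_entry2 * n * n for n in ns]
roots = [t ** 0.5 for t in times]

# On a straight line, successive differences are constant. They settle
# down once n is large, but vary at the start - the "kink":
diffs = [b - a for a, b in zip(roots, roots[1:])]
print(diffs[:2], diffs[-2:])
```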
I don’t know what MiCopyToCfgBitMap is doing but it clearly has two nested loops, and can’t efficiently handle large executables.
The input to this function is, roughly speaking, a sorted array of function pointers. The output of the function is a bitmap, one byte for every 64 bytes of code. This is intrinsically a linear algorithm, which is why its O(n^2) behavior was so surprising.
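Under that simplified model (one output byte per 64 bytes of code, one pass over the sorted targets) a linear-time version is almost trivial; this is only a sketch of the model described above, not the real CFG bitmap format:

```python
# Simplified model of CFG bitmap construction: one output byte per 64
# bytes of code, marked when that 64-byte granule contains a valid
# indirect-branch target. The real bitmap encoding is more intricate;
# the point is that a single pass over the sorted targets suffices, so
# the work is O(targets + bitmap size), with no need for nested loops.

def build_cfg_bitmap(code_size, sorted_targets):
    bitmap = bytearray(code_size // 64 + 1)
    for addr in sorted_targets:        # one pass: O(n)
        bitmap[addr // 64] = 1         # mark the granule as a valid target
    return bitmap

targets = [0x40, 0x80, 0x2000]
bm = build_cfg_bitmap(0x4000, targets)
print(sum(bm))  # 3 granules marked
```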
After creating the necessary gn file (build script) changes to turn off CFG for the test binaries I did one last test, to verify that my fix worked, and to revel in the joy of watching the tests run so much more smoothly.
My revelry was cut short when several of the tests repeatedly timed out and my test run showed no signs of completing at all. It made no sense. I had done a clean run with my fix just a few minutes before, I’d made no code changes, and now I was seeing a new and different form of pathological test performance. WTF!
I grabbed another ETW trace and quickly found that some of the tests were spending 60+ seconds calculating the size of a directory. The current directory. Previously I had run the tests from the output directory, and this scan would take about 4 seconds – fast enough that I didn’t notice it. For some reason on the last test run I had run the test from the Chromium source directory and that directory takes much longer to fully scan.
With that realization I was able to get the clean test run that I wanted (I went back to running from the output directory), and I then filed a bug for the tests with the bogus directory scanning. The tests are supposed to exercise AppInfoSummaryPanel; that panel displays the size of the app, but the tests never specified which directory should be scanned. This bug costs a few seconds every time unit_tests is run, and occasionally costs a lot more, but it should be easy to fix.
Summary and references
Sometimes the most important piece of information is just the knowledge that a piece of code runs faster in one environment than in another, or some other clue that improvement is possible. As soon as Vivaldi pointed out that unit_tests.exe ran slower on Windows 10 than on Windows 7, a solution was inevitable.
Microsoft has been notified and I assume that they will investigate, and fix the issue if possible. There is no sign that this affects Chrome itself because only large .exe files are affected.
- Bug: Unit test performance much worse on Win10 than Win7
- Bug: AppInfoDialogViewsTest tests take a minute to run from the src directory
- 24-core CPU and I can’t move my mouse – first UserCrit issue found
- A Not-Called Function Can Cause a 5X Slowdown – second UserCrit issue found
- 24-core CPU and I can’t type an email (part one) – CFG, WMI, v8, and service workers
- Test code to demonstrate CreateProcess O(n^2) performance
- Vivaldi’s blog post
- Commute challenge 2018, because who doesn’t like watching me crash a hand cycle
This post is sort of a continuation of an ongoing series about making Windows slower.
- Making Windows Slower Part 0: Making VirtualAlloc arbitrarily slower – O(n^2)
- Making Windows Slower Part 1: Making file access slower
- Making Windows Slower Part 2: Process Creation – O(n^2)
- Making Windows Slower Part 3: Process creation again, O(n^2) again, this post
CFG was designed to allow efficient checking of the validity of indirect branch target addresses. However, it shipped with three significant inefficiencies in other areas, and the second one is still not fixed:
- VirtualQuery loops over CFG memory ran orders of magnitude too slow (fixed)
- CFG memory is never freed, leading to essentially unbounded memory leaks in applications that repeatedly allocate/free code at random addresses
- CFG bitmap initialization is slow