This story begins, as they so often do, when I noticed that my machine was behaving poorly. My Windows 10 work machine has 24 cores (48 hyper-threads) and they were 50% idle. It has 64 GB of RAM and that was less than half used. It has a fast SSD that was mostly idle. And yet, as I moved the mouse around it kept hitching – sometimes locking up for seconds at a time.
So I did what I always do – I grabbed an ETW trace and analyzed it. The result was the discovery of a serious process-destruction performance bug in Windows 10.
The ETW trace showed UI hangs in multiple programs. I decided to investigate a 1.125 s hang in Task Manager:
In the image below you can see CPU usage for the system during the hang, grouped by process name – notice that total CPU usage rarely goes above 50%:
The CPU Usage (Precise) table showed that Task Manager’s UI thread was repeatedly blocked on calls to functions like SendMessageW, apparently waiting on a kernel critical region (which are the kernel-mode version of critical sections), deep in the call stack in win32kbase.sys!EnterCrit (not shown):
I manually followed the wait chain through a half-dozen processes to see who was hogging the lock. My notes from the analysis look something like this:
Taskmgr.exe (72392) hung for 1.125 s (MsgCheckDelay) on thread 69,196. Longest delay was 115.6 ms on win32kbase.sys!EnterCrit, readied by conhost.exe (16264), thread 2560 at 3.273101862. conhost.exe (16264), 2560 was readied at 3.273077782 after waiting 115,640.966 ms, by mstsc.exe (79392), 71272. mstsc.exe was readied (same time, same delay) by TabTip.exe (8284), 8348, which was readied by UIforETW.exe (78120), 79584, which was readied by conhost.exe (16264), 58696, which was readied by gomacc.exe (93668), 59948, which was readied by gomacc.exe (95164), 76844.
I had to keep going because most of the processes were releasing the lock after holding it for just a few microseconds. But eventually I found several processes (the gomacc.exe processes) that looked like they were holding the lock for a few hundred microseconds. Or, at least, they were readied by somebody holding the lock and then a few hundred microseconds later they readied somebody else by releasing the lock. These processes were all releasing the lock from within NtGdiCloseProcess.
I was tired of manually following these wait chains so I decided to see if the same readying call stack was showing up a lot of times. I did that by dragging the Ready Thread Stack column to the left and searching the column for NtGdiCloseProcess. I then used WPA’s View Callers-> By Function option to show me all of the Ready Thread Stacks that went through that function – in this view the stack roots are at the bottom:
There were 5,768 context switches where NtGdiCloseProcess was on the Ready Thread Stack, each one representing a time when the critical region was released. The threads readied on these call stacks had been waiting a combined total of 63.3 seconds – not bad for a 1.125 second period! And, if each of these readying events happened after the thread had held the lock for just 200 microseconds then the 5,768 readying events would be enough to account for the 1.125 second hang.
I’m not familiar with this part of Windows but the combination of PspExitThread and NtGdiCloseProcess made it clear that this behavior was happening during process exit.
This was happening during a build of Chrome, and a build of Chrome creates a lot of processes. I was using our distributed build system which means that these processes were being created – and destroyed – quite quickly.
The next step was to find out how much time was being spent inside of NtGdiCloseProcess. So I moved to the CPU Usage (Sampled) table in WPA and got a butterfly graph, this time of callees of NtGdiCloseProcess. You can see from the screen shot below that over a 1.125 s period there was, across the entire system, about 1085 ms of time spent inside of NtGdiCloseProcess, representing 96% of the wall time:
Anytime you have a lock that is held more than 95% of the time by one function you are in a very bad place – especially if that same lock must be acquired in order to call GetMessage or update the mouse position. In order to experiment better I wrote a test program that creates 1,000 processes as quickly as possible, waits half a second, and then tells all of the processes to exit simultaneously. The CPU usage of this test program on my four-core eight-thread home laptop, grouped by process name, can be seen below:
Well, what do you know. Process creation is CPU bound, as it should be. Process shutdown, however, is CPU bound at the beginning and the end, but there is a long period in the middle (about a second) where it is serialized – using just one of the eight hyperthreads on the system, as 1,000 processes fight over a single lock inside of NtGdiCloseProcess. This is a serious problem. This period represents a time when programs will hang and mouse movements will hitch – and sometimes this serialized period is several seconds longer.
I’d noticed that this problem seems to be worse when my computer has been running for a while so I rebooted and ran the test as soon as my laptop had settled down. The process-shutdown serialization is indeed less severe, but the issue is still clearly present on the freshly rebooted machine:
I then ran the same test on an old Windows 7 machine (Intel Core 2 Q8200, circa 2008) – you can see the results here:
Process creation is slower, as you would expect from a much slower CPU, but process destruction is as fast as my new laptop at its best, and is fully parallelized.
This tells us that this serialization on process shutdown is a new issue, introduced sometime between Windows 7 and Windows 10.
48 hyper-threads, 47 of them idle
Amdahl’s law says that if you throw enough cores at your problem then the parts that cannot be parallelized will eventually dominate execution. When my work machine has been heavily used for a few days this serialization issue gets bad enough that process-shutdown becomes a significant part of my distributed build times – and more cores can’t help with that. In order to get maximum build speeds (and if I want to move my mouse while doing builds) I need to reboot my machine every few days. Even then my build speeds are not as fast as they should be, and Windows 7 starts to look tempting.
In fact, adding more cores to my workstation makes the UI less responsive. That is because Chrome’s build system is smart enough to spawn more processes if you have more cores, which means that there are more terminating processes fighting over the global lock. So it’s not just “24-core CPU and I can’t move my mouse” it’s “24-core CPU and therefore I can’t move my mouse.”
This problem has been reported to Microsoft and they are investigating.
Just one more thing…
This is what what my process create test program looks like when run on my 24-core work machine:
See that tiny horizontal red line on the bottom right? That’s Amdahl’s law visualized, as 98% of my machine’s CPU resources sit idle for almost two seconds, while process destruction hogs the lock that I need in order to move the mouse.
The ProcessCreateTests code is available here.
Discussions of this post can be found at:
If you liked this post you might like these other investigative reporting posts: