I recently hit some multi-minute delays on my workstation. After investigating I found that the problem was due to a lock being held for five minutes, and during that time the lock-holder was mostly just spinning in a nine-instruction loop.
Coming up with a good title for my blog posts is critical but I immediately realized that the obvious title of “48 processors blocked by nine instructions” was taken already by a post from less than a month earlier. The number of processors blocked is different, and the loop is slightly longer but, really, it’s just deja vu all over again. So, while I will explain the new issue that I found I first want to discuss the question of why this keeps happening.
Why does this happen?
Roughly speaking it’s due to an observation which I’m going to call Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
What happens is that a developer writes some code and it uses an O(n^2) algorithm. Maybe the developer doesn’t realize, or maybe it’s only O(n^2) because of a bug, or maybe they know it’s O(n^2) and think “that will never matter”. The code’s performance is fine in the lab, and is fine for most users in the wild, but then somebody creates 7,000 processes with App Verifier enabled, or creates a binary with 180,000 CFG entries, or builds a big enough DLL that repeatedly scanning a singly-linked list while linking it consumes all CPU time. When working on startup time for Valve’s Source 2 engine I found many O(n^2) algorithms, each contributing about 30 seconds to startup time, so the problem sneaks up on all sorts of developers.
O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
For example, App Verifier’s log filename creation is linear, for a given process launch, which is fine until you realize that makes it O(n^2) on the number of processes launched. Sometimes it’s not even obvious that there are nested loops, or it’s not technically O(n^2), or it’s not obvious that the loops can run enough to matter, and yet…
So, beware of this possibility, think about it when coding, see how performance scales on huge workloads, and look for suspicious hot spots when profiling those huge workloads. Or, leave it for me to find and blog about.
Back to our regularly scheduled rant
As usual I was just minding my own business, working away on my obsolete-but-still-powerful 48-logical-processor 96 GiB RAM workstation. I typed ninja chrome to build Chromium and… nothing happened. I watched and waited for twenty seconds and the build still hadn’t started. So, I switched to UIforETW to record an ETW trace, of course. Or, rather, I tried to. UIforETW hung while trying to start trace recording. This is the first time that I’ve seen a bug use defensive measures to stop me from investigating it!
A minute or two later my Chromium build started and UIforETW started tracing, but the trace was too late and I had no clues as to what had just happened. I hate that.
Then a few days later the same thing happened, and UIforETW failed in the same way. This time I left tracing running to circular memory buffers so that I would be ready for the third time the hang happened. However this reduces the performance of my build tools significantly so after a few hours I gave up.
Then the behavior happened again. This time I ran Microsoft’s ETW trace recording tool – wprui – and I was able to start recording. About 40 seconds later my build started running and I had the trace!
Now can I start investigating?
I’d previously noticed in Task Manager that WinMgmt.exe was running during these hangs. Looking at the CPU Usage (Precise) data in WPA confirmed that after more than forty seconds where WinMgmt.exe was almost the only thing running, my machine sprang to life with a burst of activity when WinMgmt.exe was done:
This is all quite suspicious but my discerning readers know that correlation is not causation and they want proof.
Just like last time I zoomed in on the moment of release, sorting the context switches by Switch-In Time and looking for the first one with a long Time Since Last value (indicating how long the thread had gone without running). After ignoring a dozen threads that had just had short naps I came to the first of many that had been waiting for 41.57 seconds. The long-sleeping thread wasn’t woken by WinMgmt.exe but I quickly found that it was woken by a thread which had been woken by WinMgmt.exe a fraction of a millisecond earlier.
In the screen shot below, showing context-switch data, line 17 shows thread 72,748(WinMgmt.exe) waking up thread 74,156 (svchost.exe). Then on line 19 thread 74,156 (svchost.exe) goes on to wake up thread 58,704 (svchost.exe) that had been waiting for 41.57 seconds – that is the first thread to wake from a long sleep and the chain of wakeups continues from there. You can find just woken threads in the New Thread Id (6th) column and then look down a few lines to see them in the Readying Thread Id (4th) column, waking up another thread. The process names and process IDs help with context. Line 17 connects to line 18 and 19, line 19 connects to line 20, which connects to line 23, which connects to line 27, and so on, each thread enabled to run by the previous one in the chain:
41.57 seconds is a long time for a thread to be blocked but in fact there were hundreds of threads that were blocked, and they had been blocked for much longer than that. The only reason their Time Since Last value was about 41.5 seconds is because that is how long tracing had been running when the hang resolved itself.
The results seemed to agree with the theory that WinMgmt.exe was the problem, but they didn’t prove it. I got more confident when I looked up svchost.exe (3024) in Trace-> System Configuration-> Services and found that it was the Winmgmt service, but I still wanted more certainty.
After poking around a bit more (forwards and backwards through time) I decided that the interactions were too complex to be worth analyzing in detail, especially without any thread names to give clues about what the 25 different threads in svchost.exe (3024) were doing
I then took a different tack to prove that WinMgmt.exe was the problem. Maybe I should have started here, but that would have been too easy. I grabbed the WinMgmt.exe command line from the Processes table in WPA and ran it manually. The command was:
and running it took about five minutes. While it was running (I had lots of time) I found that I couldn’t start ETW tracing from UIforETW. That was better proof of the cause than any messy analysis I could do.
Here we go again…
As is now my common practice I looked at the CPU Usage (Sampled) data to see where WinMgmt.exe was spending its time. I quickly found that 96.5% of the samples were in repdrvfs.dll!CPageCache::Read(), called on four different stacks:
Three of the full stacks to this function are shown here, mostly to help anyone at Microsoft who wants to investigate:
I added the address column and found that 95.3% of the samples were all in a single nine-instruction loop (the samples only ever hit on seven of the nine instructions (see here for why), but the debugger showed the full size of the loop):
I then ran winmgmt.exe /verifyrepository manually while gathering CPU counter data on branch instructions executed. From that I could determine roughly how many times the loop was executing. This was probably unnecessary but I wanted to verify that the loop was being executed many times rather than (for some reason) executing slowly. And, I thought it was nerdy-cool that I could so easily do this – it was a tiny change to the existing batch file. I found that WinMgmt.exe was executing roughly a branch instruction per cycle which meant that the loop (which I already knew was consuming most of the CPU time) was running extremely quickly, and the slowness was because it was executing hundreds of billions of times.
Just to be thorough I looked to see why UIforETW had been unable to start tracing during the incident. I found that UIforETW had run xperf but xperf had stalled for 41.5 seconds (actually more) on this call stack:
In short, xperf calls in to Wbem and is therefore blocked by this issue. The reason xperf is trying to stop tracing before it starts tracing is that I added that behavior in order to make starting of tracing more resilient. I suspect that the hang would have happened anyway, but I don’t know.
Graphing the computational complexity
I noticed that WinMgmt.exe seemed to be scanning the c:\windows\System32\wbem\Repository directory, which was 1.9 GB on my machine so I asked around – at work and on twitter – to get some data points on how big this directory is. I also asked people to time winmgmt.exe /verifyrepository and started graphing. Even though these tests were being done a wide range of machines, with different CPU speeds, the graph was pretty clear:
That graph of sqrt(time) versus repository size is implausibly perfect for coming from six different machines, and yet, it’s real. The VerifyRepository function clearly has O(n^2) performance. If n is the size of the Repository directory in GB then VerifyRepository takes about 1.6*n^2 minutes. This was a good estimate all the way from a fraction of a second up to ten minutes.
I’m either lucky, unlucky, or just observant, because for a couple of weeks nobody else seemed to be hitting this problem – I thought there must be something weird about my machine. But suddenly I started hearing suspiciously similar complaints from coworkers. One of them had a 2.6 GB repository that took ten minutes to verify. Some of our CI builders were affected, and various other people were hit to varying degrees. My coworkers generally know to ask me if their Windows machines have performance problems but there are probably still lots of Google employees working on Windows that are being hit by this bug without them realizing it.
Luckily I’d already started working with our IT department. I’d found the script that was running WinMgmt and learned that it was being run hourly. That meant that my machine was running WinMgmt.exe /verifyrepository 10% of the time, and some of my coworkers were running it more than 16% of the time. That’s a pretty high chance of getting up to a ten minute build delay.
By the time the reports started rolling in the fix was already on its way to production. The script wasn’t essential, and certainly wasn’t worth the problems it was causing so the fix was to stop calling it.
winmgmt.exe /verifyrepository has a nine-instruction loop whose number of iterations is proportional to the square of the size of the wbem repository. This causes the command to take up to ten minutes to run when it should really take just a few seconds. That’s pretty bad in itself.
But to make it worse the command holds a WMI (Windows Management Instrumentation) lock while doing this work, so any process that does certain WMI operations during this time period will hang.
The script that runs winmgmt.exe /verifyrepository every hour has been doing this for years, but the problematic behavior just started being noticed a month or two ago. This presumably means that the wbem repository has recently started getting much larger. The delays at 0.5 GB are easy to miss, but the delays at 1.0 GB and larger can be debilitating. Based on a twitter suggestion I ran strings.exe on the objects.data file (seems a bit crude). A lot of the most common strings contained “polmkr” in their name, but I don’t know what that means.
I’d love to see a fix to the performance issue, and I’d love it if our IT department could find and fix the issue that causes our wbem repositories to get so huge. Until then our IT department has promised to stop running the /verifyrepository command every hour, which should avoid the worst symptoms.
- Master list of ETW tutorials, investigations, and documentation is at https://tinyurl.com/etwcentral
- CPU Usage (Sampled) tutorial (for finding where CPU time is spent) is here, and documentation is here
- CPU Usage (Precise) tutorial (for finding why threads are unable to run) is here, and documentation is here
- Specific articles are linked where appropriate in the post, or you can look at my Investigate Reporting category
- For more tales of O(n^2) algorithms see Accidentally Quadratic