I recently hit some multi-minute delays on my workstation. After investigating I found that the problem was due to a lock being held for five minutes, and during that time the lock-holder was mostly just spinning in a nine-instruction loop.
Coming up with a good title for my blog posts is critical but I immediately realized that the obvious title of “48 processors blocked by nine instructions” was taken already by a post from less than a month earlier. The number of processors blocked is different, and the loop is slightly longer but, really, it’s just deja vu all over again. So, while I will explain the new issue that I found I first want to discuss the question of why this keeps happening.
Why does this happen?
Roughly speaking it’s due to an observation which I’m going to call Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
What happens is that a developer writes some code and it uses an O(n^2) algorithm. Maybe the developer doesn’t realize, or maybe it’s only O(n^2) because of a bug, or maybe they know it’s O(n^2) and think “that will never matter”. The code’s performance is fine in the lab, and is fine for most users in the wild, but then somebody creates 7,000 processes with App Verifier enabled, or creates a binary with 180,000 CFG entries, or builds a big enough DLL that repeatedly scanning a singly-linked list while linking it consumes all CPU time. When working on startup time for Valve’s Source 2 engine I found many O(n^2) algorithms, each contributing about 30 seconds to startup time, so the problem sneaks up on all sorts of developers.
O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
For example, App Verifier’s log filename creation is linear, for a given process launch, which is fine until you realize that makes it O(n^2) on the number of processes launched. Sometimes it’s not even obvious that there are nested loops, or it’s not technically O(n^2), or it’s not obvious that the loops can run enough to matter, and yet…
So, beware of this possibility, think about it when coding, see how performance scales on huge workloads, and look for suspicious hot spots when profiling those huge workloads. Or, leave it for me to find and blog about.
Back to our regularly scheduled rant
As usual I was just minding my own business, working away on my obsolete-but-still-powerful 48-logical-processor 96 GiB RAM workstation. I typed ninja chrome to build Chromium and… nothing happened. I watched and waited for twenty seconds and the build still hadn’t started. So, I switched to UIforETW to record an ETW trace, of course. Or, rather, I tried to. UIforETW hung while trying to start trace recording. This is the first time that I’ve seen a bug use defensive measures to stop me from investigating it!
A minute or two later my Chromium build started and UIforETW started tracing, but the trace was too late and I had no clues as to what had just happened. I hate that.
Then a few days later the same thing happened, and UIforETW failed in the same way. This time I left tracing running to circular memory buffers so that I would be ready for the third time the hang happened. However this reduces the performance of my build tools significantly so after a few hours I gave up.
Then the behavior happened again. This time I ran Microsoft’s ETW trace recording tool – wprui – and I was able to start recording. About 40 seconds later my build started running and I had the trace!
Now can I start investigating?
I’d previously noticed in Task Manager that WinMgmt.exe was running during these hangs. Looking at the CPU Usage (Precise) data in WPA confirmed that after more than forty seconds where WinMgmt.exe was almost the only thing running, my machine sprang to life with a burst of activity when WinMgmt.exe was done:
This is all quite suspicious but my discerning readers know that correlation is not causation and they want proof.
Just like last time I zoomed in on the moment of release, sorting the context switches by Switch-In Time and looking for the first one with a long Time Since Last value (indicating how long the thread had gone without running). After ignoring a dozen threads that had just had short naps I came to the first of many that had been waiting for 41.57 seconds. The long-sleeping thread wasn’t woken by WinMgmt.exe but I quickly found that it was woken by a thread which had been woken by WinMgmt.exe a fraction of a millisecond earlier.
For explanations of the CPU Usage (Precise) graphs and the readying-thread/new-thread concepts see this tutorial or this documentation
In the screen shot below, showing context-switch data, line 17 shows thread 72,748(WinMgmt.exe) waking up thread 74,156 (svchost.exe). Then on line 19 thread 74,156 (svchost.exe) goes on to wake up thread 58,704 (svchost.exe) that had been waiting for 41.57 seconds – that is the first thread to wake from a long sleep and the chain of wakeups continues from there. You can find just woken threads in the New Thread Id (6th) column and then look down a few lines to see them in the Readying Thread Id (4th) column, waking up another thread. The process names and process IDs help with context. Line 17 connects to line 18 and 19, line 19 connects to line 20, which connects to line 23, which connects to line 27, and so on, each thread enabled to run by the previous one in the chain:
41.57 seconds is a long time for a thread to be blocked but in fact there were hundreds of threads that were blocked, and they had been blocked for much longer than that. The only reason their Time Since Last value was about 41.5 seconds is because that is how long tracing had been running when the hang resolved itself.
The results seemed to agree with the theory that WinMgmt.exe was the problem, but they didn’t prove it. I got more confident when I looked up svchost.exe (3024) in Trace-> System Configuration-> Services and found that it was the Winmgmt service, but I still wanted more certainty.
After poking around a bit more (forwards and backwards through time) I decided that the interactions were too complex to be worth analyzing in detail, especially without any thread names to give clues about what the 25 different threads in svchost.exe (3024) were doing
I then took a different tack to prove that WinMgmt.exe was the problem. Maybe I should have started here, but that would have been too easy. I grabbed the WinMgmt.exe command line from the Processes table in WPA and ran it manually. The command was:
and running it took about five minutes. While it was running (I had lots of time) I found that I couldn’t start ETW tracing from UIforETW. That was better proof of the cause than any messy analysis I could do.
I then did the repro again with tracing already running and when I analyzed the trace I found over one hundred processes that had one or more threads blocked for more than five minutes!
Here we go again…
As is now my common practice I looked at the CPU Usage (Sampled) data to see where WinMgmt.exe was spending its time. I quickly found that 96.5% of the samples were in repdrvfs.dll!CPageCache::Read(), called on four different stacks:
Three of the full stacks to this function are shown here, mostly to help anyone at Microsoft who wants to investigate:
I added the address column and found that 95.3% of the samples were all in a single nine-instruction loop (the samples only ever hit on seven of the nine instructions (see here for why), but the debugger showed the full size of the loop):
I then ran winmgmt.exe /verifyrepository manually while gathering CPU counter data on branch instructions executed. From that I could determine roughly how many times the loop was executing. This was probably unnecessary but I wanted to verify that the loop was being executed many times rather than (for some reason) executing slowly. And, I thought it was nerdy-cool that I could so easily do this – it was a tiny change to the existing batch file. I found that WinMgmt.exe was executing roughly a branch instruction per cycle which meant that the loop (which I already knew was consuming most of the CPU time) was running extremely quickly, and the slowness was because it was executing hundreds of billions of times.
Just to be thorough I looked to see why UIforETW had been unable to start tracing during the incident. I found that UIforETW had run xperf but xperf had stalled for 41.5 seconds (actually more) on this call stack:
In short, xperf calls in to Wbem and is therefore blocked by this issue. The reason xperf is trying to stop tracing before it starts tracing is that I added that behavior in order to make starting of tracing more resilient. I suspect that the hang would have happened anyway, but I don’t know.
Graphing the computational complexity
I noticed that WinMgmt.exe seemed to be scanning the c:\windows\System32\wbem\Repository directory, which was 1.9 GB on my machine so I asked around – at work and on twitter – to get some data points on how big this directory is. I also asked people to time winmgmt.exe /verifyrepository and started graphing. Even though these tests were being done a wide range of machines, with different CPU speeds, the graph was pretty clear:
That graph of sqrt(time) versus repository size is implausibly perfect for coming from six different machines, and yet, it’s real. The VerifyRepository function clearly has O(n^2) performance. If n is the size of the Repository directory in GB then VerifyRepository takes about 1.6*n^2 minutes. This was a good estimate all the way from a fraction of a second up to ten minutes.
I’m either lucky, unlucky, or just observant, because for a couple of weeks nobody else seemed to be hitting this problem – I thought there must be something weird about my machine. But suddenly I started hearing suspiciously similar complaints from coworkers. One of them had a 2.6 GB repository that took ten minutes to verify. Some of our CI builders were affected, and various other people were hit to varying degrees. My coworkers generally know to ask me if their Windows machines have performance problems but there are probably still lots of Google employees working on Windows that are being hit by this bug without them realizing it.
Luckily I’d already started working with our IT department. I’d found the script that was running WinMgmt and learned that it was being run hourly. That meant that my machine was running WinMgmt.exe /verifyrepository 10% of the time, and some of my coworkers were running it more than 16% of the time. That’s a pretty high chance of getting up to a ten minute build delay.
By the time the reports started rolling in the fix was already on its way to production. The script wasn’t essential, and certainly wasn’t worth the problems it was causing so the fix was to stop calling it.
winmgmt.exe /verifyrepository has a nine-instruction loop whose number of iterations is proportional to the square of the size of the wbem repository. This causes the command to take up to ten minutes to run when it should really take just a few seconds. That’s pretty bad in itself.
But to make it worse the command holds a WMI (Windows Management Instrumentation) lock while doing this work, so any process that does certain WMI operations during this time period will hang.
The script that runs winmgmt.exe /verifyrepository every hour has been doing this for years, but the problematic behavior just started being noticed a month or two ago. This presumably means that the wbem repository has recently started getting much larger. The delays at 0.5 GB are easy to miss, but the delays at 1.0 GB and larger can be debilitating. Based on a twitter suggestion I ran strings.exe on the objects.data file (seems a bit crude). A lot of the most common strings contained “polmkr” in their name. That eventually led to us realizing that the problematic command was:
gpresult /scope:computer /x results.txt /F
On my home laptop this command increases the size of the wbem\Repository directory by about 340,000 bytes per run (measured over 50 runs). So that seems like a bug.
I reported the verifyrepository bug on twitter and it initially seemed to get some traction with the WMI team, but then I stopped getting responses so I don’t know what the status is. I formally reported the bug in November 2020 – watch that spot.
I’d love to see a fix to the performance issue. I’d love to see a fix to the WMI growth issue. And I’d love to see a way to shrink our multi-GB repositories. Until then our IT department has stopped running the /verifyrepository command every hour, and stopped running gpresult every hour. This means that the repository growth has stopped, and so have the hourly problems.
- Master list of ETW tutorials, investigations, and documentation is at https://tinyurl.com/etwcentral
- CPU Usage (Sampled) tutorial (for finding where CPU time is spent) is here, and documentation is here
- CPU Usage (Precise) tutorial (for finding why threads are unable to run) is here, and documentation is here
- Specific articles are linked where appropriate in the post, or you can look at my Investigate Reporting category
- For more tales of O(n^2) algorithms see Accidentally Quadratic
- This tale of WMI investigation could be helpful the next time I do battle with WMI.
Reddit discussion is here, hacker news discussion is here, twitter thread is here, and maybe here. Russian translation is here.
The kicker here is that WMI is just a proprietary means of giving remote control and monitoring your system for your admin (and Microsoft). Not sure who owns your workstation, but if it’s you, probably a great time to consider Linux.
Google owns my workstation. I agree that it is important to remember who owns the computing equipment that you use.
I will say that I appreciate that Google is happy to spend enough money to buy me good equipment.
this right here tends to be forgotten, often from different perspectives.
if the business is required to add these checks, the system hardware / specs should be higher. My employer is also in a regulated industry, so my computer is scanned – frequently – by multiple products / systems for various activities (and i assume that at least a few checks will overlap among the products).
that said, they supply me with sufficient hardware that I usually don’t notice (though I too find that WMI based checks seem to be the most intrusive).
if you’re on the other side – the business / IT side that requires compliance… if you’re adding a 10% CPU load, you better start to justify providing 10% higher CPU specs, otherwise people will get frustrated and leave… for those of us who are required, and have sufficiently strong machines to handle it, we usually don’t care (“usually” being based on whether it’s disruptive to us / aka working correctly).
clearly some of the comments demonstrate that people are quick to assume that IT isn’t paying for the regulation (in fact, the business is, since it’ll increase turnover rates, decrease performance, reduce morale, etc)… the comments are the lesson – if you need to do this stuff (detailed scanning), make sure the machines can handle it or the business
will (probably) suffer.
Yep, that makes sense. More memory and CPU power is needed to handle all of the background scanning.
The thing is, with an O(n^2) algorithm while holding a lock the amount of CPU power doesn’t matter. The time the lock is held becomes excessive (8+ seconds) as soon as the wbem/WMI database exceeds 300 MB, regardless of your CPU speed, core count, memory amount, or SSD speed. The corporation then either needs to ensure that the WMI database never grows, or ensure that VerifyRepository() is never called, or pressure Microsoft to fix the algorithm.
TL;DR – more CPU power can solve some problems, but not this one.
Also consider restricting WMI processes to a single core affinity, and put them at low priority. That should significantly free up resources.
In this case that might make things worse – remember, the problematic code is holding a lock this whole time – I need it to finish as quickly as possible.
I hate to be cynical, but I think there are two reasons problems like this are more common.
(1) Many developers are less committed to writing quality code. Too many people just write code that’s good enough so that they can avoid getting fired.
(2) Too many developers have the mindset that computing power is unlimited and will solve all of their problems. Along with this mindset is a decreased emphasis on understanding algorithmic complexity.
Do you know what’s even more expensive than CPU power and RAM? Dev hours. I don’t assume developers are lazy, or stupid or even have bad intentions. Instead I am aware that you either ship the feature or you/your team/your product/your company is replaced by something or someone that can deliver things faster with more or less the same quality. In the long run, it’s probably worth releasing the feature with O(n^2) performance today and fixing it when someone figures out what is the use case that breaks it.
I am not saying that developers are evil, just subject to the same human weaknesses as everyone else. Saying “I need to do x to keep my job” is a rationalization that’s used by many professionals. Part of being a good professional is caring about more than just keeping your job. Most managers are reasonable and are willing to listen to an argument about why a product is not ready for release yet.
Operating system software should be held to a higher standard than other software, because so many people depend on it. Most people on the Windows platform don’t care about extra bells and whistles. They just want an o/s that’s rock solid. Operating systems have traditionally had longer product development cycles for a reason – because the software needs to be perfect.
The reference to “polmkr” is, more than likely, related to the Group Policy Preferences (GPP) tools (which, prior to Microsoft’s acquisition, were developed by “PolicyMaker”).
The FireEye FLARE team reverse-engineered the WMI repository format and they have a tool that can dump the repository. That might be useful to determine what’s happening. Their tools are here: https://github.com/fireeye/flare-wmi
I might try those. Thanks.
That’s why latency is always measured in p99:
Yes WMI is my worst enemy in corporate environments. See https://aloiskraus.wordpress.com/2019/11/17/the-wmi-query-from-hell/. I have written a .NET Core 3.0 Windows Service which writes all WMI Queries of the system into a CSV file so you can analyze them later with ease in Excel. Very handy to find which query was slow. For that I use powershell which gets the job nicely done. Who issues the veryifyrepository call? Is it some system management tool?
The verifyrepository call came from a custom powershell script, now disabled.
Ironically that same powershell script was responsible for the growth of the WMI repository, and (doubly ironically) that script was intended to diagnose machine problems.
Oops. Software is hard.
Nice investigation. I’ll have to try some of those techniques the next time WMI is being mean to me.
Pingback: O(n^2), again, now in WMI – GeekWay
Is this the season for WMI-related problems? I also bumped into one recently: https://peteronprogramming.wordpress.com/2019/12/13/be-careful-what-you-ask-for-especially-in-wmi-queries/
Thanks for the writeup. I’m going to consult it the next time I have to do some WMI investigations.
Investigating WMI problems is always made more difficult by the uncertainty in how to track what is going on. I received third-hand a .wprp file that lists (without comments) a pair of providers to use for WMI tracing. To enable these just paste this into the UIforETW setting dialog’s “extra user mode providers” field:
Interpreting the generated data is an exercise left for whoever tries this first.
I’m commenting here to share this information and in hopes that I can find it when I need it.
Pingback: O(n^2), again, now in WMI | Random ASCII – Programming Blog