Near the end of January I was pointed to a Twitter thread where a Windows user with a powerful machine was hitting random hangs in explorer. Lots of unscientific theories were being proposed. I don’t generally do random analysis of strangers’ performance problems, but the case sounded interesting so I thought I’d take a look.
Freya shared an ETW trace of what was happening on her machine and I took a look using Windows Performance Analyzer (WPA). The first thing I noticed was that the UI Delays graph showed that, as promised, explorer.exe’s thread 7,888 was failing to check for messages for 20.531 seconds. It was hung.
Now, explorer.exe has many UI threads so it’s not like the entire process was hung, but one of its windows was definitely hung, it was causing hangs elsewhere, and this is bad.
If a thread is failing to pump messages then that is either because it is busy doing something else (consuming CPU) or it is waiting on something else (idle CPU). After zooming in to the 20.531 second MsgCheck Delay time period I checked the CPU Usage (Precise) data (derived from context switch instrumentation and 100% accurate) and saw that thread 9,228 was running 99.2% of the time – it was consuming lots of CPU.
The next task was to figure out what it was doing. The CPU Usage (Sampled) data (derived from a 1 kHz sampling profiler) told me that thread 9,228 was spending roughly 99.7% of its time (26994 out of 27074 samples) in the BatchPositionChangesHelper destructor (line 21) and its children (lines 23-25). That is a very expensive destructor.
I don’t have access to this source code but I looked through the stacks for a bit and they seemed to suggest that explorer.exe was spending more than 20 seconds doing a lot of tasks related to… arranging icon positions.
Yup, that’s it
Arranging icons on the desktop is pretty simple. You just stack them in columns, overflow to the next column, and stop when the screens are full. So, 20 seconds arranging icons didn’t seem plausible and I assumed that the root cause would be some weird shell extension or other third-party software, but eventually I tried to reproduce the bug in the simplest way possible. I thought to myself, what if I just make a thousand copies of a tiny .jpg image on my desktop and see if explorer.exe misbehaves. It seemed too dumb to be sufficient, and yet:
import os
import shutil

src = os.path.join(script_path, 'SunsetWhales.jpg')
dst = os.path.join(desktop_path, 'TestFiles%04d.jpg')
for i in range(file_count):
    shutil.copy(src, dst % i)
I ran this simple script with a file_count of 1,000 and suddenly explorer.exe was spinning like mad for more than twenty seconds. It really was that simple.
Computers today are really fast. The CPU of the original reporter (OP) was running at 4.6 GHz and they had approximately 950 GIF files on their desktop. In 20 seconds their CPU would tick through 92 billion cycles, or 97 million cycles per image. That is a lot.
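The back-of-the-envelope arithmetic, using the numbers above, works out like this:

```python
# Rough cycle budget of the hang, from the numbers in the OP's trace.
clock_hz = 4.6e9       # CPU clock: 4.6 GHz
hang_seconds = 20      # duration of the hang
image_count = 950      # roughly 950 GIF files on the desktop

total_cycles = clock_hz * hang_seconds          # ~92 billion cycles
cycles_per_image = total_cycles / image_count   # ~97 million cycles per image
print(f"{total_cycles:.3g} cycles total, {cycles_per_image:.3g} per image")
```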
My guess was that once again this was due to an observation which I have taken to calling Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
That is, the most likely explanation for why arranging icons was taking so long is that the icon rearranging code used an O(n^2) (aka quadratic) algorithm such that as the number of icons doubled the time to arrange them quadrupled. This sort of performance scaling can take an algorithm that works fine for ten items and make it fail miserably with just 1,000 items.
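To make that scaling concrete, here is a toy model (not explorer’s actual code): if arranging icons ends up comparing every icon against every other icon, the comparison count grows quadratically, so doubling the icon count roughly quadruples the work:

```python
def pairwise_comparisons(n):
    """Comparisons performed by an all-pairs (O(n^2)) algorithm on n items."""
    return n * (n - 1) // 2

# Doubling the item count roughly quadruples the work.
for n in (10, 100, 1000):
    ratio = pairwise_comparisons(2 * n) / pairwise_comparisons(n)
    print(f"n={n}: {pairwise_comparisons(n)} comparisons, doubling ratio {ratio:.2f}")
```

Ten items cost 45 comparisons, which nobody notices; 1,000 items cost 499,500, which everybody notices.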
Good theory, but how can I prove it?
I started by writing a script that would populate my desktop with a specified number of images. I ran that repeatedly with increasingly large numbers of images, and recorded an ETW trace so that I could measure the performance. I also monitored explorer.exe using Task Manager so that I could tell when it had finished one job and was ready for the next.
My first test gave messy results – it looked like a non-linear increase but any attempt at line fitting was going to be more about hope-and-magic than following the data. I needed to understand what was going on in order to better test my theory.
While looking at the traces I realized that the BatchPositionChangesHelper destructor was running most of the time (blue region) but not all of the time that explorer was running (green region).
I realized that, among other things, the layout work was being interrupted by display work, and then I understood the cause of the variability.
When my Python script started creating images the explorer.exe process would notice and immediately start trying to lay out icons. It might end up doing this multiple times while I was creating the images and this was creating unpredictable results. It was a race condition which made the total cost inconsistent. Since I didn’t have access to the explorer.exe source code I had to hack up a way to make it wait until all of the images were created before doing any layout. I did this by using psutil to suspend the explorer.exe process while I was creating the images. Then when I resumed the process it would do all of the work. The code looks something like this:
explorer = psutil.Process(explorer_pid)
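Expanded slightly, the suspend/create/resume pattern looks roughly like this (a sketch — the function and the `work` callable are my own stand-ins for the copy loop shown earlier):

```python
def run_while_suspended(proc, work):
    """Suspend a process, run work(), then resume it - even if work() raises.

    proc is expected to expose psutil.Process-style suspend()/resume().
    """
    proc.suspend()     # the process stops noticing filesystem changes
    try:
        work()         # e.g. copy all of the test images in one batch
    finally:
        proc.resume()  # now it does all of the layout work in one burst

# Usage (sketch):
#   import psutil
#   run_while_suspended(psutil.Process(explorer_pid), create_test_images)
```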
With that in place I ran my test batch file while recording an ETW trace. To minimize noise and trace size I disabled context switch call stacks (unneeded) and I turned off indexing for the desktop folder. I monitored explorer.exe CPU usage with Task Manager and hit enter to go to the next test whenever it had gone to zero. That gave me this very nice graph of the CPU Usage of explorer.exe:
The individual blocks represent CPU usage for 100, 200, 300, and so on images up to 1,000. If you have a keen eye then you will see that the CPU usage increases faster than linear but slower than quadratic. That is, the initial data suggests that the layout algorithm is not-quite-O(n^2).
However, explorer does more work than just icon layout. If some of its tasks are O(n) – linear – then they will dilute the impact of the O(n^2) tasks. As ‘n’ increases the O(n^2) tasks will eventually dominate, but I didn’t want my test harness to run even longer than the 160 seconds it was already taking.
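A toy cost model shows the dilution effect (the coefficients are made up purely for illustration): when total time mixes linear and quadratic terms, doubling ‘n’ gives a ratio somewhere between 2x (pure linear) and 4x (pure quadratic), and the quadratic term wins as ‘n’ grows:

```python
def total_cost(n, linear=1000.0, quadratic=1.0):
    """Made-up cost model: some O(n) work plus some O(n^2) work."""
    return linear * n + quadratic * n * n

# The doubling ratio climbs from ~2x toward 4x as n grows.
for n in (100, 1000, 10000):
    ratio = total_cost(2 * n) / total_cost(n)
    print(f"n={n}: doubling ratio = {ratio:.2f}")
```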
Therefore my next task was to isolate the time spent in the BatchPositionChangesHelper destructor. It represented 78.4% of the time spent in explorer.exe in my test trace, and 92.3% of the time spent in the busy thread, and if I could prove that it was quadratic then I would have proved that as ‘n’ increased it would increasingly dominate.
To do that I looked at the CPU Usage (Sampled) data and filtered it down to just show the samples in the BatchPositionChangesHelper destructor and its children. I then looked at the ten different areas of the graph, and graphed the sample counts. The curve is so smooth that it looks fake, but this is the actual data.
If you look at key points on the graph such as when the image count is 500 and then 1,000 you can see that the performance scaling is slightly worse than O(n^2). That is, laying out 1,000 icons takes more than four times longer than laying out 500 icons.
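One way to make “slightly worse than O(n^2)” precise is to fit an exponent: if the work at two icon counts is t1 and t2, the effective exponent is log(t2/t1)/log(n2/n1). A sketch (the sample counts below are hypothetical, not my measured data):

```python
import math

def scaling_exponent(n1, t1, n2, t2):
    """Effective k such that t ~ n^k between two (count, cost) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)

# Exactly quadratic data gives k = 2:
print(scaling_exponent(500, 25.0, 1000, 100.0))
# If 1000 icons cost more than 4x what 500 icons cost, k is above 2:
print(scaling_exponent(500, 25.0, 1000, 110.0))
```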
The coup de grâce
I don’t tend to have many icons on my desktop, so I am mostly immune to this bug. However I’ve seen people with their desktop completely full of icons and they are probably hitting lesser versions of this.
The OP used their desktop to store GIF files. They treated it like a folder (which it is) where you can easily store images. They rarely used the icons on the desktop. So, when the number of icons eventually became excessive they decided to uncheck “Show desktop icons” to reduce the clutter. The icons were hidden and they could continue to store images in that folder.
The hangs that they saw, where explorer was repeatedly spending 20+ seconds arranging the icons on their desktop, where explorer was burning 92 billion CPU cycles to get the icons positioned just right… were happening with the icons hidden.
That’s next level amazing.
Laying out icons on a grid should be an inherently linear operation, but somehow it was written as quadratic and was executed whether the icons were being displayed or not.
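For reference, a single pass over the icons is all a column-based layout needs. Here is a minimal O(n) sketch (the function name and grid parameters are my own invention, not explorer’s code, and it ignores the multi-monitor/screen-full overflow case):

```python
def layout_icons(count, rows_per_column=10, cell_w=96, cell_h=96):
    """Assign grid positions in one O(n) pass: fill a column top to bottom,
    then overflow to the next column. Returns a list of (x, y) positions."""
    positions = []
    for i in range(count):
        col, row = divmod(i, rows_per_column)
        positions.append((col * cell_w, row * cell_h))
    return positions

print(layout_icons(3))  # [(0, 0), (0, 96), (0, 192)]
```

Each icon is visited exactly once, so doubling the icon count merely doubles the work.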
That’s it. If you write code that is going to be run by others then make sure that it scales well enough to handle any conceivable data set – reasonable or not. Quadratic algorithms usually fail that test.
The original bug seemed to be related to rearranging multi-monitor setups (a job hazard for streamers, so I’m told) so for a while I was testing by plugging and unplugging my external monitor. That works poorly for efficient testing, and it also seems to have worn out the external monitor connection on my personal laptop. My laptop can no longer see my external monitor. Oops.
Yes, with symbols
When I analyzed the OP’s trace I just loaded it into Windows Performance Analyzer (WPA) and waited. I didn’t have to check to see what version of Windows they were running or what patches they had installed. WPA just looked at the debug information for all of the EXEs and PE files and downloaded symbol files from Microsoft’s symbol servers (and Chrome’s because I have that configured as well). Symbol servers are good. If you are on Windows then make sure you are using symbol servers. If you are not on Windows – I’m very sorry (and by that I mean that it is a shame that debugging and profiling – especially of issues on other users’ machines – is so much harder for you).
Passing it along
I don’t know how many people this bug affects (anyone with 200-300 icons is hitting a modest version of this, and it gets progressively worse with more) and I have no power to fix it. So, I filed a bug. I am not hopeful that it will be fixed. My last quadratic-in-Windows bug has had zero comments since it was filed a few months ago.
The raw measurements from my tests are here and the tests themselves are here on github. This bug is extremely easy to reproduce. If somebody wants a Feedback Hub entry they should create one. I recommend using UIforETW’s Browse Folder option while the desktop is hung – the operation will be blocked for the duration.
Lessons for software developers
I’ve gone through quite a few interview loops during my career. I have often been asked to come up with an algorithm to do some artificial task. The obvious “brute-force” algorithm would usually be quadratic (O(n^2)) or, occasionally, exponential (O(2^n)). This would then generally lead to a discussion of:
- Why quadratic and exponential are unacceptably slow for most real-world problems
- How to improve the algorithm so that it is O(n log n) or better.
Despite the obvious awareness of this issue we, as an industry, keep shipping code that is quadratic. Code that is fast enough to make it into production, but slow enough to make things fall down once it gets there. See for example this, this, this, this, this, and many more. We really need to stop.
Twitter announcement is here.
Hacker news discussion is here.
Reddit discussion is here.