Near the end of January I was pointed to a twitter thread where a Windows user with a powerful machine was hitting random hangs in explorer. Lots of unscientific theories were being proposed. I don’t generally do random analysis of strangers’ performance problems but the case sounded interesting so I thought I’d take a look.
Freya shared an ETW trace of what was happening on her machine and I took a look using Windows Performance Analyzer (WPA). The first thing I noticed was that the UI Delays graph showed that, as promised, explorer.exe’s thread 7,888 was failing to check for messages for 20.531 seconds. It was hung.
Now, explorer.exe has many UI threads so it’s not like the entire process was hung, but one of its windows was definitely hung, it was causing hangs elsewhere, and this is bad.
If a thread is failing to pump messages then that is either because it is busy doing something else (consuming CPU) or it is waiting on something else (idle CPU). After zooming in to the 20.531 second MsgCheck Delay time period I checked the CPU Usage (Precise) data (derived from context switch instrumentation and 100% accurate) and saw that thread 9,228 was running 99.2% of the time – it was consuming lots of CPU.
The next task was to figure out what it was doing. The CPU Usage (Sampled) data (derived from a 1 kHz sampling profiler) told me that thread 9,228 was spending roughly 99.7% of its time (26994 out of 27074 samples) in the BatchPositionChangesHelper destructor (line 21) and its children (lines 23-25). That is a very expensive destructor.
I don’t have access to this source code but I looked through the stacks for a bit and they seemed to suggest that explorer.exe was spending more than 20 seconds doing a lot of tasks related to… arranging icon positions.
Yup, that’s it
Arranging icons on the desktop is pretty simple. You just stack then in columns, overflow to the next column, and stop when the screens are full. So, 20 seconds arranging icons didn’t seem plausible and I assumed that the root cause would be some weird shell extension or other third-party software, but eventually I tried to reproduce the bug in the simplest way possible. I thought to myself, what if I just make a thousand copies of a tiny .jpg image on my desktop and see if explorer.exe misbehaves. It was too dumb to be sufficient, and yet:
src = os.path.join(script_path, ‘SunsetWhales.jpg’)
dst = os.path.join(desktop_path, ‘TestFiles%04d.jpg’)
for i in range(file_count):
shutil.copy(src, dst % i)
I ran this simple script with a file_count of 1,000 and suddenly explorer.exe was spinning like mad for more than twenty seconds. It really was that simple.
Computers today are really fast. The CPU of the original reporter (OP) was running at 4.6 GHz and they had approximately 950 GIF files on their desktop. In 20 seconds their CPU would tick through 92 billion cycles, or 97 million cycles per image. That is a lot.
My guess was that once again this was due to an observation which I have taken to calling Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
That is, the most likely explanation for why arranging icons was taking so long is that the icon rearranging code used an O(n^2) (aka quadratic) algorithm such that as the number of icons doubled the time to arrange them quadrupled. This sort of performance scaling can take an algorithm that works fine for ten items and make it fail miserably with just 1,000 items.
Good theory, but how can I prove it?
I started by writing a script that would populate my desktop with a specified number of images. I ran that repeatedly with increasingly large numbers of images, and recorded an ETW trace so that I could measure the performance. I also monitored explorer.exe using Task Manager so that I could tell when it had finished one job and was ready for the next.
My first test gave messy results – it looked like a non-linear increase but any attempt at line fitting was going to be more about hope-and-magic than following the data. I needed to understand what was going on in order to better test my theory.
While looking at the traces I realized that the BatchPositionChangesHelper destructor was running most of the time (blue region) but not all of the time that explorer was running (green region):
I realize that, among other things, the layout work was being interrupted by display work, and then I understood the cause of the variability.
When my Python script started creating images the explorer.exe process would notice and immediately start trying to lay out icons. It might end up doing this multiple times while I was creating the images and this was creating unpredictable results. It was a race condition which made the total cost inconsistent. Since I didn’t have access to the explorer.exe source code I had to hack up a way to make it wait until all of the images were created before doing any layout. I did this by using psutil to suspend the explorer.exe process while I was creating the images. Then when I resumed the process it would do all of the work. The code looks something like this:
explorer = p = psutil.Process(explorer_pid)
With that in place I ran my test batch file while recording an ETW trace. To minimize noise and trace size I disabled context switch call stacks (unneeded) and I turned off indexing for the desktop folder. I monitored explorer.exe CPU usage with Task Manager and hit enter to go to the next test whenever it had gone to zero. That gave me this very nice graph of the CPU Usage of explorer.exe:
The individual blocks represent CPU usage for 100, 200, 300, and so on images up to 1,000. If you have a keen eye then you will see that the CPU usage increases faster than linear but slower than quadratic. That is, the initial data suggests that the layout algorithm is not-quite-O(n^2).
However, explorer does more work that just icon layout. If some of its tasks are O(n) – linear – then they will diffuse the impact of the O(n^2) tasks. As ‘n’ increases the O(n^2) tasks will eventually dominate, but I didn’t want my test harness to run even longer than the 160 seconds it was already taking.
Therefore my next task was to isolate out the time spent in the BatchPositionChangesHelper destructor. It represented 78.4% of the time spent in explorer.exe in my test trace, and 92.3% of the time spent in the busy thread, and if I could prove that it was quadratic then I would have proved that as ‘n’ increased it would dominate evermore.
To do that I looked at the CPU Usage (Sampled) data and filtered it down to just show the samples in the BatchPositionChangesHelper destructor and its children. I then looked at the ten different areas of the graph, and graphed the sample counts. The curve is so smooth that it looks fake, but this is the actual data.
If you look at key points on the graph such as when the image count is 500 and then 1,000 you can see that the performance scaling is slightly worse than O(n^2). That is, laying out 1,000 icons takes more than four times longer than laying out 500 icons.
The coup de grace
I don’t tend to have many icons on my desktop, so I am mostly immune to this bug. However I’ve seen people with their desktop completely full of icons and they are probably hitting lesser versions of this.
The OP used their desktop to store GIF files. They treated it like a folder (which it is) where you can easily store images. They rarely used the icons on the desktop. So, when the number of icons eventually became excessive they decided to uncheck “Show desktop icons” to reduce the clutter. The icons were hidden and they could continue to store images in that folder.
The hangs that they saw, where explorer was repeatedly spending 20+ seconds arranging the icons on their desktop, where explorer was burning 92 billion CPU cycles to get the icons positioned just right… were happening with the icons hidden.
That’s next level amazing.
Laying out icons on a grid should be an inherently linear operation, but somehow it was written as quadratic and was executed whether the icons were being displayed or not.
That’s it. If you write code that is going to be run by others then make sure that it scales well enough to handle any conceivable data set – reasonable or not. Quadratic algorithms usually fail that test.
The original bug seemed to be related to rearranging multi-monitor setups (a job hazard for streamers, so I’m told) so for a while I was testing by plugging and unplugging my external monitor. That works poorly for efficient testing, and it also seems to have worn out the external monitor connection on my personal laptop. My laptop can no longer see my external monitor. Oops.
Yes, with symbols
When I analyzed the OP’s trace I just loaded it into Windows Performance Analyzer (WPA) and waited. I didn’t have to check to see what version of Windows they were running or what patches they had installed. WPA just looked at the debug information for all of the EXEs and PE files and downloaded symbol files from Microsoft’s symbol servers (and Chrome’s because I have that configured as well). Symbol servers are good. If you are on Windows then make sure you are using symbol servers. If you are not on Windows – I’m very sorry (and by that I mean that it is a shame that debugging and profiling – especially of issues on other users’ machines – is so much harder for you).
Passing it along
I don’t know how many people this bug affects (anyone with 200-300 icons is hitting a modest version of this, and it gets progressively worse with more) and I have no power to fix it. So, I filed a bug. I am not hopeful that it will be fixed. My last quadratic-in-Windows bug has had zero comments since it was filed a few months ago.
The raw measurements from my tests are here and the tests themselves are here on github. This bug is extremely easy to reproduce. If somebody wants a Feedback Hub entry they should create one. I recommend using UIforETW’s Browse Folder option while the desktop is hung – the operation will be blocked for the duration.
Lessons for software developers
I’ve gone through quite a few interview loops during my career. I have often been asked to come up with an algorithm to do some artificial task. The obvious “brute-force” algorithm would usually be quadratic (O(n^2)) or, occasionally, exponential (O(2^n)). This would then generally lead to a discussion of:
- Why quadratic and exponential are unacceptably slow for most real-world problems
- How to improve the algorithm so that it was O(n log n) or better.
Despite the obvious awareness of this issue we, as an industry, keep shipping code that is quadratic. Code that is fast enough to make it into production, but slow enough to make things fall down once it gets there. See for example this, this, this, this, this, and many more. We really need to stop.
Tired of reading boring performance analysis? Instead you can read about how I used 19 different commute methods in September 2018, or 20 different commute methods in April 2017.
Twitter announcement is here.
Hacker news discussion is here.
Reddit discussion is here.
Another great analysis, Bruce! Thanks!
What browser extension is that with the ‘[Normal] 0%’ on the Twitter post?
The desktop has a few options when you right click -> View, one is “Align icons to grid.” If this is unchecked, it’s possible that the layout manager needs to do a lot more work displaying icons, since it might use a combination of existing positions and, when adding new icons, might need to dynamically adjust positions to make room for more icons.
I haven’t actually tested out if this could be the culprit–anyone else want to test?
Oh wow, even when icons are disabled it still runs that. I wonder if one *disables* “Auto arrange icons” and “Align icons to grid”, and then sets the “Show desktop icons” to false, one would hope at least then it won’t try to re-arrange them like broken record.
“CPU usage increases faster than linear but slower than quadratic. That is, the initial data suggests that the layout algorithm is not-quite-O(n^2)” … “…some of its tasks are O(n) – linear – then they will diffuse the impact of the O(n^2) tasks. As ‘n’ increases the O(n^2) tasks will eventually dominate…”
That *is* O(n^2). In fact, I think you’ve found the longest and most roundabout possible way to give the definition of O(n^2).
Can’t you just say “for every x > x_0…” like every intro CS textbook?…
I was trying to define O(n^2). In the first sentence I was trying to decide whether I could prove that the algorithm was O(n^2) – I could not at that point. In the second sentence I was saying that *if* it is O(n^2) then it will eventually dominate, but I need a better way to tell if it is O(n^2). I am (mostly) assuming that the reader knows what O(n^2) is.
“as the number of icons doubled the time to arrange them quadrupled” – is that clear enough?
I bet this is basically the same problem with any folder with a large number of photos.
I have over 14,000 files in my Camera Uploads folder with it set to Large icons.
Sometimes this works fine, but if I start moving files in and out and deleting files, often explorer slows to a stop for long periods of time.
Note that photos are of various random dimensions so layout has to look at the size of every photo to adjust their row height and column width.
This folder is also in Dropbox so it’s checking sync status for every file.
Just now went to the folder and without even changing any files it is spinning scanning the files and showing a green percent complete bar that is not even progressing at all.
I would bet $100 it’s the same code using the wrong algorithm.
I will split this folder into multiple smaller folder but yes this is just bad algorithm choice on code that affects a lot of people
If it is the same algorithm then they layout code would take roughly 14*14*20 seconds (20 seconds for 1,000, quadratically worse for 14,000) which is more than an hour, so I doubt the Camera Uploads folder is using the same algorithm. It is probably using a different inefficient algorithm.
Bryan, I suspect your problem is related to folder templates rather than icon layout. Explorer in windows 8/10 tries to intelligently decide whether a folder is “General Items,” “Documents,” “Pictures,” “Music,” or “Videos.”
In my experience, the specialized templates perform like crap. Even a folder with a couple dozen video files loads noticeably slower, probably because it’s reading the tags. And the extra info it loads does not appear to be cached; you can switch to another folder and go back and get the same lag. And it is slow even with icons turned off and even with Details or List views (no grid layout).
You’ll likely get better performance if you switch to using the General template for all folders.
Same problem in Chrome form Vivaldi with large number of Tabs. Problems start around 100 Tabs, with over 400 Tabs resulting in UI interactions having a 1 second delay.
Vivaldi developers seem to have doubled down on bad code with new builds slowing down to ~6 second delay with over 200 Tabs open (even hibernated ones).
I can say from my experience that Vivaldi is without a doubt the most poorly performing of all Chromium browsers.
Thank you for another interesting analysis – it *never* gets boring reading your blog!
You mentioned the Commute Challenge 2017 and 2018. I’m curious – what happened to the 2019 and 2020 edition?
2019 I ran out of interesting ideas, although I could-have/should-have repeated many with my new office location. 2020… well…
I might do a mini/modified commute challenge this summer, pandemic permitting
> If you are not on Windows – I’m very sorry
debuginfod does have promise, but I have yet to see that promise turned into a reality. I can load a profiler trace or a crash dump from any Windows user’s machine and know that the symbols for Microsoft and Chrome binaries (and the binaries themselves) will load automatically. It has worked that way for a long time. I have never seen that happen on Linux, still.
If you are in a non-Windows environment where symbols for profile traces and crashes from other people’s machines will automatically show up, then please tell me more.
Linux has its own advantages of course – source code – but for this specific scenario I think Windows wins.
>If you are not on Windows – I’m very sorry (and by that I mean that it is a shame that debugging and profiling – especially of issues on other users’ machines – is so much harder for you).
FYI there are debug symbol servers on Linux too, e.g. at least RHEL and Void Linux have them and Debian has announced it recently too: https://lwn.net/Articles/847256/
Also of note is that there hasn’t been too much of a need of debug symbol servers on Linux in the first place because distros, e.g. Debian, provide corresponding debug symbols packages for each package (e.g. chromium-dbgsym for chromium) since forever, even before Windows had its debug symbol servers.
debuginfod does indeed sound great. When I worked on Linux (about seven years ago) I hoped and wished for such a thing. Back then I still had to track down the debug symbols packages for each set of .so files I was interested and install them. This was particularly challenging if I was looking at crashes from another Linux distro. I hope that Linux can catch up to Microsoft on widespread support for debuginfod/symbol-servers.
Next I would love to see Linux support source indexing. Source indexing is like a symbol server, but for source code. I hacked up a version of that when I was at Valve so that the right version of source files would be retrieved from Perforce when debugging crashes of Valve’s Linux games. I made some brief efforts to get native gdb support for it but that project didn’t get very far.
I think that Linux users are less interested in these two concepts because most of the time using local symbols and local source is fine. But that means that as soon as you need to look at perf or crash data from another machine there is a hella-steep learning curve, and most developers don’t bother. That what I meant by my cheeky comment.