Performance improvements can come from many directions including better algorithms or micro-optimized inner loops. In this case I got a worthwhile speedup by working better with the Windows thread scheduler.
Noticing the problem
One of the tests I like to do when I get a new and presumably faster computer is to load up Fractal eXtreme (trial version available) and start zooming. The challenge is to hold down the zoom-in key (‘I’) and let it auto-repeat so that I am zooming into the Mandelbrot set as fast as possible, and see how far in I can go before being unable to ‘steer’ accurately due to the recalculations taking too long. I noticed that with my fancy new 4-core 8-thread laptop I wasn’t able to zoom in significantly further than with my old 2-core laptop, so I investigated and found a trivial optimization to make this scenario much faster.
This is the second performance improvement in the latest version of Fractal eXtreme. The first one is documented here.
When trying to investigate a performance problem my first step is always to record an ETW trace. I did that and looked at the CPU Usage for Fractal eXtreme and saw this:
My shiny new laptop was spending a significant amount of time sitting mostly idle!
The way that zooming in Fractal eXtreme (FX) works is that when you double-click or type ‘I’ then FX does an animated zoom of the bitmap. This visual feedback makes it much easier to follow what is happening, compared to just snapping to the new magnification. By default this animated zoom takes 300 milliseconds (ms).
FX has always had worker threads that make use of all available CPUs, but when FX was first written it was rare to have a multi-core machine and the animated zoom didn’t leave much processing power for other tasks, so fractal calculations stopped during the animated zoom, and resumed afterwards.
A few years ago, as multi-core machines became main-stream, FX was improved so that at the start of the zoom the worker threads stopped working on the old image, and then started working on the new image while the animated zoom was happening. This meant that when the animated zoom stopped the new image would already be partially calculated – maybe even completely calculated in some cases.
When I looked at the trace I was expecting to see a brief blip where the CPU usage dropped from eight threads (100% on the CPU Usage graph) to one thread (12.5% on the CPU usage graph) as the worker threads were all stopped and then restarted. The drop in CPU usage in the graph above, however, can hardly be called ‘brief’.
- From about 24.870 s to 24.920 s (50 ms) there is only one thread running – seven threads are idle.
- From 24.920 s to 25.000 s (80 ms) there are only four threads running – four threads are idle.
- From 25.000 s on CPU usage ramps up to seven threads (87.5%) but it isn’t until around 25.100 s (1oo ms of seven threads) that all eight threads are running more or less steadily.
This averages out to about 50% utilization over this 230 ms time period, and about 65% utilization over the full 300 ms that is spent on each zoom.
The main cause of this odd behavior was that, in an example of very poor programming, the worker threads were doing a Sleep(50) between chores. I don’t know what I was thinking when I wrote this, as Sleep should rarely (never?) be needed, but that explains the initial 50 ms delay. It can easily get worse than 50 ms because according to clockres the timer resolution on my laptop defaults to 15.6 ms. Since Sleep timers on Windows trigger on the next multiple of the timer resolution this means that my worker threads could actually end up sleeping for up to 62.4 ms.
But what about the gradual ramping up from four threads to seven and finally to eight?
At this point I’m going to indulge in some extrapolation and guessing. The inner details of the Windows scheduler are not fully documented, but this behavior appears to reveal one interesting aspect of the scheduler.
My new laptop has four cores and eight threads. In Intel’s parlance, it is hyperthreaded. That means that two hardware threads (each containing its own register set) share one core (L1 cache, execution engine, with the exact details of what is shared varying between different models and manufacturers). Because these two threads on one core are sharing some resources, if the OS schedules tasks (where by ‘task’ I mean a Windows thread) on both hardware threads then the tasks will both run somewhat slower than if they were running on separate cores. The actual slowdown varies but tasks can easily take 50% longer to complete. Therefore, the operating system tries to use only one hardware thread on each core, whenever possible. This can be seen in the CPU Scheduling screen shown below:
Each horizontal line represents one CPU thread. Four of the lines have so many scheduling events on them that they are solid lines of diamonds, whereas the other four are much more lightly used (disclaimer: frequency of scheduling events is a poor proxy for CPU activity, but in this case it appears to be representative).
On a lightly loaded system the rule appears to be straightforward – schedule only one thread per core. The quandary appears when there are more threads ready to run than cores available to run them – when should the operating system use those hyperthreads?
If there are two tasks available to run, each of which will take 100 ms to run on an exclusively owned core, then running them serially on one core will take 200 ms. If the OS schedules them simultaneously, running on two threads of the same core, and if because of that they each take 50% longer to run, then they will both finish in 150 ms – which is 50 ms faster than running them serially. So, on a heavily loaded system the OS should use all available hardware threads.
However if another core becomes available then the operating system can run the two tasks on separate cores and they will both finish in 100 ms, which is 50 ms faster than putting them on one core.
Moving tasks around from one core to another is expensive because of cache effects, so once the operating system makes a decision it can’t change it lightly.
So, the task of the operating system is simple: in order to get best performance it must predict the future to decide how heavy the CPU load will be. If the magic eight-ball says “CPU Usage looks light” then it should wait, and postpone scheduling code on hyperthreads. If the magic eight-ball says “CPU Usage looks heavy” then it should make use of all resources immediately.
The best way to predict the future is to look at the past, so there’s our answer. When all of the FX worker threads went to sleep for 50 ms the operating system interpreted that as a sign that CPU usage was light, and was likely to stay light. When all of the worker threads woke up simultaneously it looked at the CPU usage history and decided that this burst of activity was likely to be temporary. After 80 ms (five time slices) it decided that maybe this demand was real, so it decided to use most of the hyperthreads.
The one remaining question is why CPU usage went from four to seven cores, instead of from four to eight. My theory on that is that the operating system decided that the one thread that was running the whole time was the most important, so it let it have its CPU core all to itself, for a little bit longer.
I hit two variations of this exact same issue within the same month, on two unrelated projects. In the other case there was a job system that would wake up all worker threads simultaneously, even if there wasn’t work for all of them. This led to a temporary priority inversion, due in part to this scheduler behavior and in part due to a lock implementation that did busy waiting.
The Fix is In
With the Sleep removed the OS scheduler behaved the way that I wanted it to and FX was able to use almost 100% of CPU time when it needed. This can be seen in the CPU usage graph below:
There is a brief blip each zoom – apparently due to lock contention when calling InvalidateRect – but it is no longer enough to have a significant effect on performance.
If we zoom out on the CPU Usage graph we can see how the CPU usage changes as we start at a magnification of 1.0x and zoom in, doubling each time, to a magnification of about a thousand million million:
Initially the CPU has opportunities to go idle because the next image (at 800×600 resolution) is calculated in less time than the 300 ms spent doing the animated zoom. That means that in these cases FX can explore the Mandelbrot set in real time.
As the magnification increases the load increases, which can be seen in the increasing height and width of the activity spikes, but only when FX switches from double-precision to 128-bit fixed-point precision does it fail to be real time, and even then the performance drops off very smoothly.
The reason my new machine wasn’t as much faster as I had been hoping was because of this scheduling characteristic, which only affects hyperthreaded machines.
The details of the Windows scheduler are likely to change, so designing around this behavior would be foolish. However avoiding Sleep is certainly a good idea, not waking up more threads than needed is best, locks that busy wait are very dangerous, and true real-time behavior is hard on Windows.
I suspect that setting Windows to the High Performance power plan would avoid this behavior, but this also significantly increases electricity use and reduces battery life, so I cannot recommend this.
On my old laptop with the old version of FX I could hold down the zoom key and double the magnification about 600 times before the updates were too slow for me to see where I was going. With the new laptop and the new version of FX (which also includes a modest improvement in calculation speed) I am able to double the magnification about 1,300 times before running out of speed!
AMD’s upcoming BullDozer processors will blur the distinction between ‘full’ cores and hyperthreads by having two hardware threads per core but with less sharing of resources.
If you double the size of a subatomic particle about 140 times then it will be approximately the size of the visible universe. This puts the magnifications that Fractal eXtreme routinely does in an interesting perspective, and actually manages to make the universe seem small.
In 1986 I wrote my first fractal program. MandFXP was well optimized and was the fastest fractal explorer available for the Commodore Amiga. But, computers were a lot slower then. At the default resolution of 320×200 and an average of less than 20 iterations per pixel it took about 15 seconds to render the first, non-zoomed, fractal image, and it had no processing power left over to do calculations while doing animated zooms.
My test resolution today has 7.5 times more pixels. Even with 50 times more iterations per pixel I can now render the Mandelbrot set in ~150 ms and get effective real-time feedback at magnifications up to 8 million million.
7.5 * 50 * (15.0 / .150) = 37,500 times faster with greater precision – ain’t progress fun?