I was calculating a long fractal zoom movie with Fractal eXtreme when I noticed that one of the phases seemed to be taking too long on my four-core, eight-thread processor. A bit of investigation showed that this phase was effectively being done serially. All eight threads were calculating the same points and overwriting each other's results.
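To make the failure mode concrete, here is a minimal sketch of that kind of bug. It is not Fractal eXtreme's actual code: the function names, the row-based split, and the image height are all assumptions made up for illustration. The point is only that if the worker index never reaches the work-assignment logic, every thread ends up with the whole job.

```python
def rows_buggy(worker_id, num_workers, height):
    # Hypothetical bug: the worker id is ignored, so every worker
    # is handed every row and they all overwrite the same results.
    return range(height)


def rows_interleaved(worker_id, num_workers, height):
    # One possible fix: each worker takes every num_workers-th row,
    # starting at its own id, so no row is computed twice.
    return range(worker_id, height, num_workers)


def total_rows_computed(assign, num_workers, height):
    # Total rows calculated across all workers for a given assignment scheme.
    return sum(len(assign(w, num_workers, height)) for w in range(num_workers))


NUM_WORKERS = 8   # eight hardware threads, as in the post
HEIGHT = 480      # hypothetical image height in rows

print(total_rows_computed(rows_buggy, NUM_WORKERS, HEIGHT))        # 3840
print(total_rows_computed(rows_interleaved, NUM_WORKERS, HEIGHT))  # 480
```

With the buggy assignment the eight workers do eight times the necessary work and finish no sooner than a single thread would, which is what "effectively serial" means here.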
Amdahl’s law teaches us that the more you parallelize code, the more the serial portions dominate, and this unnecessary serialization was making this particular zoom movie take more than 25% longer to calculate! It was going to add about two days of calculation time to the movie.
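For reference, Amdahl's law says that if a fraction p of the work can be spread across n workers, the overall speedup is 1 / ((1 - p) + p / n). The sketch below just evaluates that formula. The 90% parallel fraction is an illustrative assumption, not a measurement from Fractal eXtreme.

```python
def amdahl_speedup(parallel_fraction, workers):
    # Amdahl's law: the serial part runs at full cost, the parallel
    # part is divided evenly among the workers.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical workload in which 90% of the time is parallelizable.
for workers in (1, 4, 8):
    print(workers, f"{amdahl_speedup(0.9, workers):.2f}")
# 1 1.00
# 4 3.08
# 8 4.71
```

Even with only 10% of the work serial, eight workers deliver well under a 5x speedup in this example, which is why an accidentally serial phase costs so much on a machine with many threads.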
I fixed the bug and restarted the zoom movie calculation. The fixed version will be released soon, along with some other tweaks. More details later.
Give the Fractal eXtreme demo version a try – even with this bug it makes great use of all of your cores to explore fractals with the spin of a mouse wheel.
4-core, 8-thread: does the hyper-threading make much difference? My understanding is that hyper-threading duplicates the registers so that another thread can use the compute resources when the first one stalls. Given the simple loop of Mandelbrot code, the only unpredicted branch would be when the iterations reach the maximum. Hmm, as long as you don’t have to hit the L1 cache (would that cause a stall?). I presume Fractal eXtreme doesn’t use the AVX instructions (8 256-bit registers!) yet.
The two hardware threads share the execution resources of a core, and this is most beneficial when one thread is stalled waiting on a cache miss. However, the two threads of execution are constantly interleaving their execution. For instance, a modern Intel processor can do three integer adds per cycle, but it is more likely that code will be doing dependent adds, which means only one executes per cycle. This leaves two of the ALUs idle. Thus, most multithreaded code will be sped up by hyperthreads even if there are no cache misses or mispredicted branches. When I measured the speedup from using 8 threads instead of 4 threads I found a 22% speedup (https://randomascii.wordpress.com/2011/05/27/sandybridge-and-scaling-of-performance/).
It’s hugely complicated because running 8 threads increases the power draw, which probably causes a lower Turbo Boost frequency.
FX does not use the AVX instructions and it’s not clear that they are a good match for deep zooming, which is where performance matters most.
“FX does not use the AVX instructions and it’s not clear that they are good match for deep zooming”
Because the integer instructions aren’t widened to 256 bits?
I infer this from the Wikipedia page, which says of the forthcoming AVX2:
“Expansion of most integer AVX instructions to 256 bits”
So investigating at http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/index.htm#intref_cls/common/intref_avx_details.htm
It seems there aren’t any arithmetic operations that work with __m256i (the 256-bit integer vector type). The most you could do is load the unused upper 128 bits and use them as a register cache. That’s not going to increase performance the way doubling the number of multiplications would.
Now I understand.
Having a super wide integer multiply would be ideal. At the very least I’d need AVX to be able to do more 64×64 integer multiplies (giving a 128-bit result) per second than the regular integer unit can do. Larrabee (now Knights Corner) had this ability, but AVX does not appear to.
The floating-point instructions are difficult to repurpose for extended precision.
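To illustrate why the 64×64 multiply with a 128-bit result matters, here is a rough sketch of what has to happen when only narrower multiplies are available: the product is assembled from four 32×32 partial products plus carry handling. This is the generic schoolbook method, not necessarily how FX does its math, and the function name is made up for the example. Python integers are arbitrary precision, so the masks are only there to emulate fixed-width registers.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def mul_64x64_to_128(a, b):
    # Split each 64-bit operand into 32-bit halves.
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32

    # Four 32x32 -> 64-bit partial products.
    lo_lo = a_lo * b_lo
    hi_lo = a_hi * b_lo
    lo_hi = a_lo * b_hi
    hi_hi = a_hi * b_hi

    # Sum the column covering bits 32..63; its overflow carries upward.
    mid = (lo_lo >> 32) + (hi_lo & MASK32) + (lo_hi & MASK32)

    low = (lo_lo & MASK32) | ((mid & MASK32) << 32)
    high = hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + (mid >> 32)
    return high, low


# Check against Python's built-in big-integer multiply.
a, b = 0xFEDCBA9876543210, 0x0123456789ABCDEF
high, low = mul_64x64_to_128(a, b)
assert (high << 64) | low == a * b
```

A single hardware instruction that produces both halves replaces all of this, which is why the throughput of that one instruction dominates deep-zoom performance.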