## Faster Fractals Again

I was calculating a long fractal zoom movie with Fractal eXtreme when I noticed that one of the phases seemed to be taking too long on my four-core eight-thread processor. A bit of investigation showed that this phase was effectively being done serially: all eight threads were calculating the same points and overwriting each other's results.

Amdahl's law teaches us that the more you parallelize code, the more the serial portions dominate, and this unnecessary serialization was making this particular zoom movie take more than 25% longer to calculate, adding about two days of calculation time to the movie.

I fixed the bug and restarted the zoom movie calculation. The fixed version will be released soon, along with some other tweaks. More details later.

Give the Fractal eXtreme demo version a try – even with this bug it makes great use of all of your cores to explore fractals with the spin of a mouse wheel.


### 6 Responses to Faster Fractals Again

1. 4-core 8-thread, does the hyperthreading make much difference? My understanding is that hyperthreading duplicates the registers so that another thread can use the compute resources when one stalls. Given the simple loop of Mandelbrot code, the only unpredicted branch would be when the iterations reach the maximum. Hmm, as long as you don't have to hit the L1 cache (would that cause a stall?). I presume Fractal eXtreme doesn't use the AVX instructions (8 256-bit registers!) yet.

• brucedawson says:

The two hardware threads share the execution resources of a core, and this is most beneficial when one thread is stalled waiting on a cache miss. However, the two threads of execution are constantly interleaving their execution. For instance, a modern Intel processor can do three integer adds per cycle, but it is more likely that code will be doing dependent adds, which means only one executes per cycle, leaving two of the ALUs idle. Thus, most multithreaded code will be sped up by hyperthreads even if there are no cache misses or mispredicted branches. When I measured the speedup from using 8 threads instead of 4 threads I found a 22% speedup (https://randomascii.wordpress.com/2011/05/27/sandybridge-and-scaling-of-performance/).

It's hugely complicated because running 8 threads increases the power draw, which probably causes a lower Turbo Boost frequency.

FX does not use the AVX instructions, and it's not clear that they are a good match for deep zooming, which is where performance matters most.

2. brucedawson says:

Having a super-wide integer multiply would be ideal. At the very least I'd need AVX to be able to do more 64×64 integer multiplies (giving a 128-bit result) per second than the regular integer unit can do. Larrabee (now Knights Corner) had this ability, but AVX does not appear to.
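For readers unfamiliar with the primitive being discussed: a 64×64→128-bit multiply is the basic building block of software extended precision. A portable sketch built from 32-bit halves, when no hardware widening multiply is available (this is a generic textbook construction, not Fractal eXtreme's inner loop):

```c
#include <stdint.h>

/* Multiply two 64-bit values into a full 128-bit result (hi:lo),
   using only 64-bit arithmetic on 32-bit halves. */
static void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* contributes to bits   0..63  */
    uint64_t p1 = a_lo * b_hi;   /* contributes to bits  32..95  */
    uint64_t p2 = a_hi * b_lo;   /* contributes to bits  32..95  */
    uint64_t p3 = a_hi * b_hi;   /* contributes to bits  64..127 */

    /* Sum the middle column to find the carry into the high word. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    *lo = p0 + (p1 << 32) + (p2 << 32);
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```

Four multiplies plus adds per limb pair is exactly why throughput of the widening multiply dominates deep-zoom performance: a wider or faster hardware multiply cuts the whole pyramid of partial products.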

The floating-point instructions are difficult to repurpose for extended precision.
