I’ve been working on Fractal eXtreme on-and-off for years, and an important (and fun) part of working on it is optimizing the calculation of the Mandelbrot set, especially the high-precision math routines that allow deep-zooming.

After working on this over the course of a decade I assumed that all the easiest optimizations were done, but recently I found a couple of trivial optimizations that I had missed. This article discusses a simple algebraic improvement that gives up to a 33% speedup, and a follow-up article discusses a thread-scheduling improvement.

## Mandelbrot Set Math Primer

The Mandelbrot set is calculated by setting the complex number ‘c’ to the location of the point you want to calculate and then executing the following loop:

```c
z = 0;
while (abs(z) <= 2.0)
    z = z * z + c;
```

If the loop never exits then the point is in the Mandelbrot set. Simple enough. Practical considerations necessitate adding a limit to how many times the loop runs before giving up the search, but I’m going to ignore that for the sake of elegance and simplicity.

##### The Unzoomed Mandelbrot set:

Since z and c are complex numbers we have to store each one as two components and expand out operations like multiplication to multiple operations on those components. I’m not going to review complex math, but I will point out that abs(complex) is equal to the square root of (real * real + imaginary * imaginary) and if you square both sides you can avoid the square root. With that in mind a more detailed version would be:

```c
z.r = 0;
z.i = 0;
while (z.r * z.r + z.i * z.i < 4.0)
{
    temp = z.r * z.r - z.i * z.i + c.r;
    z.i = z.r * z.i * 2 + c.i;
    z.r = temp;
}
```

That’s correct but it’s not very efficient. The code does six multiplications per iteration, and at high precisions those are very expensive. It squares z.r and z.i twice each, and it multiplies by two. All of those can be avoided:

```c
z.r = 0;
z.i = 0;
zrsqr = z.r * z.r;
zisqr = z.i * z.i;
while (zrsqr + zisqr < 4.0)
{
    z.i = z.r * z.i;
    z.i += z.i; // Multiply by two
    z.i += c.i;
    z.r = zrsqr - zisqr + c.r;
    zrsqr = square(z.r);
    zisqr = square(z.i);
}
```

Now the code is doing three multiplies per iteration, which is the minimum. So it’s perfect, and all we have to worry about is making sure that all of our basic operations – especially squaring and multiplication – are as fast as possible.

When using floating-point math the speeds of addition, subtraction, and multiplication are generally identical. However floating-point math has only about 52 bits of precision, which is woefully inadequate for serious fractal exploration. All of my fractal programs have featured high-precision math routines to allow virtually unlimited zooming. Fractal eXtreme supports up to 7,680 bits of precision. While this isn’t truly unlimited it is close enough, because fractal calculation time generally increases as the cube of the zoom level, and even on the fastest PCs available it exceeds most people’s patience somewhere between 500 and 2,000 bits of precision.

The speed of squaring and multiplication is critical because in high-precision math they are O(n^2) operations and everything else is O(n). As ‘n’ (the number of 32-bit or 64-bit words) increases, the time to do the multiplications becomes the only thing that matters. Okay, at extremely high precisions multiplication doesn’t have to be O(n^2), but the alternatives are a lot of work for uncertain gain, so I’m ignoring them.

While multiplication and squaring are both O(n^2) it turns out that squaring can be done roughly twice as fast as multiplication. Half of the partial results when squaring are needed twice, but only need to be calculated once.

All the information necessary to find the algebraic optimization is now available.

The observation (which came from somebody I know only through e-mail) was that the Mandelbrot calculations can be done using three squaring operations rather than two squares and a multiply. At the limit this gives approximately a one-third speedup!

The one multiply was being used to calculate zr * zi. If we instead calculate (zr + zi)^2 then we get zr^2 + 2*zr*zi + zi^2. Since we’ve already calculated zr^2 and zi^2 we can just subtract them off and, voila, multiplication through squaring:

```c
z.r = 0;
z.i = 0;
zrsqr = z.r * z.r;
zisqr = z.i * z.i;
while (zrsqr + zisqr < 4.0)
{
    z.i = square(z.r + z.i) - zrsqr - zisqr;
    z.i += c.i;
    z.r = zrsqr - zisqr + c.r;
    zrsqr = square(z.r);
    zisqr = square(z.i);
}
```

At low precisions this is inefficient because we are doing an extra add and an extra subtract. However as ‘n’ grows larger this change should give us a 33% speedup. At Fractal eXtreme’s peak precision of 7,680 bits I’ve measured a speedup of about 25.6% so the theoretical predictions appear to be reasonably accurate.

Charts are fun – let’s paste one in (higher is slower):

The 32-bit version has to do four times as much work, and has fewer registers, so it is significantly slower than the 64-bit version. Fitting the performance of the 32-bit version on the chart makes it hard to see the difference between the two 64-bit versions, so let’s rescale:

This post was supposed to show the benefits of optimization-through-algebra, but the charts really make the case for the importance of 64-bit software when doing high-precision math. At the highest precisions the fastest 64-bit version is running 7.75x faster than the 32-bit version. Yikes.

##### The Mandelbrot set magnified 6.3 x 10^{408} times, calculated with 1472 bit precision

A side benefit of this optimization is that the multiplication routine isn’t needed at all, which is a nice code-size saving. The Mandelbrot DLL drops from 2,572,288 bytes to 1,814,528 bytes.

## Summary

In practice this optimization isn’t really a lot of use. In the 64-bit version of Fractal eXtreme (which all smart consumers choose because it typically runs at least 4x faster when doing deep zooms) this optimization doesn’t become worthwhile until 512 bits of precision and it doesn’t give even a 10% speedup until 1,024 bits of precision. It’s cool, but virtually unnoticeable for even the hard-core deep-zoom fractal explorer. But, given the minimal effort required to implement the optimization it has an excellent cost/benefit ratio, and an excellent cost/geeky-fun ratio.

Aside: Fractal eXtreme uses Lattice Multiplication to avoid ever having to store intermediate results to memory in its multiply routines. I independently invented it years ago and only just found out what it is called.


Since you say your multiplication routine takes up 740K of code, I guess it is completely unrolled and the code for each size of argument is separate. So you avoid some loop overhead by paying in instruction-cache misses. Have you benchmarked this on recent hardware to make sure this is worthwhile?

I am reminded of performance optimizations in x264 which shrunk code size of functions to make sure the next function was still in cache to avoid the latency of loading code every iteration.

The 740K saving is for every precision from 128-bit to 7680-bit. In 64-bit increments that is 119 different routines, so the actual code size (1.8 MB now for all of the code) only averages out to about 15 KB per routine, so in general they fit in the i-cache. Even when they don’t, the execution patterns are so straightforward (straight through, fully unrolled, virtually no branches) that I suspect the prefetcher can keep up.

i-cache optimizations definitely can be important, but I think they are at their most important when you are branching around a lot. Branching makes prefetching harder, and it means that only part of some i-cache lines will be used. Measurements show that i-cache misses aren’t a huge cost for the Fractal eXtreme calculation code, and they might be zero cost, but I’m not sure.


Hi, I am confused by the differentiation you make between ‘squaring’ and multiplying by self in order to square. Are you using gmp or another high-precision library which has a squaring function that performs better than multiplying by self? I compared gmp’s multiply function to the pow_ui function, and found multiplying to be faster. Just wondering about this ‘square’ you mention.

I am not using gmp — I am using a math library that I wrote myself.

In general, however, squaring any multi-digit number can be done faster than multiplying two arbitrary numbers. If you square ABCD then A is multiplied by B, by C, and by D twice each. You can recognize this and just do those multiplications once and add the results in twice, thus saving time. A generic multiply routine could detect when squaring is being done and jump to an optimized path, but even better is to know when squaring is being done and not have the code for generic multiplying at all.

A generic pow_ui function could detect when it was being asked to square a number and do it faster.

With this observation in hand I removed my generic multiplication routine. Squaring is almost twice as fast and it is particularly good when you always know that you are squaring.

How exactly do you recognize the multiplications? With an associative array? Isn’t this very memory intensive? The search of the multiplication takes also time.

Ideally (such as in my case when coding up the Mandelbrot set calculations) you know ahead of time that you are squaring a number. In this case I restructured the algorithm to exclusively do squaring and I discarded the multiplication routine completely.

In general you could check to see if the addresses of the two inputs match (in which case it is obviously squaring) or you could do a simple memory compare of the two inputs to see if the two inputs are identical. This testing adds some extra cost but it might be worth it in some cases.

But inside the square() function you still make a multiplication, right? But with the square() function it’s obvious that it’s a multiplication. With z.r * z.i it’s also obvious that you don’t need a square function. So why use a generic pow_ui function? That function brings no additional benefit because you still need to do a multiplication. You exclusively do squaring and removed the multiplication routine? But isn’t squaring a multiplication? When I square(x) it’s a multiplication inside: x * x. I coded a simple Mandelbrot program in C. I used your trick with zrsqr and zisqr but it brings no speed improvement. Which language do you use?

I glossed over one critical detail. What I said was “While multiplication and squaring are both O(n^2) it turns out that squaring can be done roughly twice as fast as multiplication.” This is true when doing high-precision multiplication. If you are multiplying two 512-bit numbers, each consisting of 16 32-bit integers, then multiplying them requires 256 multiplications (each one multiplying two 32-bit numbers). If you are squaring them then almost half of these multiplications are redundant so it only takes a bit more than 128 multiplications. If you draw out a 16×16 grid showing all of the multiplications then you should be able to convince yourself that the multiplication results in the top-right corner are the same as those in the lower-left corner. That is why/when squaring is better than multiplication.

If you aren’t doing high-precision math then this technique does not apply because squaring is then not faster.

I’d like to see an O(1) computation optimization for deep zoom fractal computation :) OK, how about O(log N)? :) I was wondering, during the iterations do the lower bits stay the same? There must be a magical optimization to account for the fact that at deep zoom some lower bits may be less “used” or not “used” at all (as in, they stay the same) (perhaps that’s not the case, I’m just guessing). I imagine, though this is rather simplistic (and possibly incorrect), some sort of “shift” of the bits that do not change during iterations, so that the calculation is done only for the bits in the high-precision range (which DO change during the iterations); since they are shifted into the normal-precision range they don’t require high-precision calculation. It’s like multiplying: .20003 * .20002 = 0.0400100006 vs .20003 * .20003 = 0.0400120009. The first 5 digits do not change, so the computation could somehow be performed only on the higher digits and “appended” to a fixed-digit calculation. Surely, any such optimization would have to be backed up by a mathematical proof that it still generates the same fractal. Nonsense? Perhaps. It should be at least some food for thought.

It appears that your intuition is correct. If we have two numbers, N and N+e (where N is many digits and e is assumed to be very small) then once we calculate N^2 we can calculate (N+e)^2 more cheaply, as N^2 + 2*N*e + e^2. If N is 30 words long (1920 bits with 64-bit words) and e is a single word then N^2 requires roughly 30*30/2 multiplies (assuming we discard the bottom half) whereas N*e requires just 30 multiplies (e has to be multiplied by each word of N). No complex mathematical proof is needed, you just have to be sure that ‘e’ is small enough.

You aren’t limited to calculating just one neighbor in this manner, so this method can potentially give huge speedups — up to 15:1 if your numbers are 30 words long.

Except that the speedup would actually be less because squaring a number (as opposed to multiplying two arbitrary numbers) can be done with N^2/2 multiplications, or N^2/4 if you are discarding the bottom half of the result. So, a maximum 7.5:1 speedup at that deep zoom level.

The only question is how quickly do the numbers diverge under typical circumstances, and how messy is it to handle the cases where they diverge.

Apparently this is called perturbation theory and it has been implemented for fractal rendering:

http://www.fractalforums.com/announcements-and-news/superfractalthing-arbitrary-precision-mandelbrot-set-rendering-in-java/

I’ll take even a 5x speedup, thank you :) Thanks for your insight, Bruce, and for the reference. If it’s not already explored I hope someone can squeeze some juice out of such a technique. A 2x speedup would still be pretty good, but honestly, for serious deep zoom exploration we may have to wait for some wicked quantum computers (assuming that there is a way to compute such fractals in a quantum fashion).

Divergence is not really messy. When you compute the Mandelbrot iteration for a convex polygon (such as a square or a rectangle), the minimum number of iterations appears on the border of that polygon. You can keep an underestimate of it while zooming in. The interesting implementation is to split the iteration loop into two parts: the first stage computes a minimum number of iterations without even testing divergence, taking full advantage of neighbor computation (none of them would diverge). Then, for every neighbor, finish the last few iterations as you suggest.

```c
while (count++ < min_iter) // most of iterations
{
    (…)
}

while (zrsqr + zisqr < 4.0 && count++ < max_iter) // a few iterations
{
    (…)
}
```

Advantages ?

#1 most of the iterations are done in the first loop, where there is no conditional branch depending on numeric output. With out-of-order hardware this is critical: the next loop iteration does not depend on Zr and Zi, but on a separate counter that can be evaluated separately and extremely early. Hardware like x86 processors can accumulate hundreds of instructions in their OOO pipeline. This amounts to hundreds of cycles per iteration if all your intermediate values can be contained in processor registers (no spills to memory).

#2 http://www.science.eclipse.co.uk/sft_maths.pdf – you can compute the values Zr and Zi for one point and an approximation polynomial with coefficients A, B, C for its neighboring points (equations 3, 4, 5), and exploit the fact that you have small values in the multiplications in equation (2).

While avoiding conditional branches does help, the conditional branches are a relatively trivial cost when doing deep zooming. There the multiplies dominate the cost. Nothing else matters. Multiplying a couple of 512-bit numbers takes a lot of cycles, even on a 64-bit processor, whereas the comparisons are comparatively cheap.

So, while it may be true that skipping the iteration checks helps when dealing with low precision (double precision, or 128-bit precision) I find that fractal calculations at those low precisions are already fast, so optimizing them is not the priority.

I’m skeptical about the claim about the minimum iterations being on the border. It may be strictly mathematically true that the minimum iteration count over a convex polygon in the Mandelbrot set occurs somewhere on its border. But where? Even if you sample every pixel on the border you could easily miss the true minimum, and find a pixel inside which is lower than any of the pixels that you saw on the border.

Agreed, the cost of the branch itself is negligible when branch prediction is available. But in an OOO architecture which can accumulate hundreds of micro-ops, this can end up costing hundreds of cycles. Since you focus on multiplications, you can verify that the first loop has only 2 multiplications per iteration, and processing multiple points with vectorized instructions is extremely easy in that loop. You mention a speedup of up to 7x, while the link is about a 100x speedup. Agreed also about missing the true minimum, and that is also true for any pixelized picture where the important details could be between the chosen pixels.