I’ve been working on Fractal eXtreme on-and-off for years, and an important (and fun) part of working on it is optimizing the calculation of the Mandelbrot set, especially the high-precision math routines that allow deep-zooming.
After working on this over the course of a decade I assumed that all the easiest optimizations were done, but recently I found a couple of trivial optimizations that I had missed. Here’s one of them.
Mandelbrot Set Math Primer
The Mandelbrot set is calculated by setting the complex number ‘c’ to the location of the point you want to calculate and then executing the following loop:
z = 0;
while (abs(z) <= 2.0)
z = z * z +c;
If the loop never exits then the point is in the Mandelbrot set. Simple enough. Practical considerations necessitate adding a limit to how many times the loop runs before giving up the search, but I’m going to ignore that for the sake of elegance and simplicity.
The Unzoomed Mandelbrot set:
Since z and c are complex numbers we have to store each one as two components and expand out operations like multiplication to multiple operations on those components. I’m not going to review complex math, but I will point out that abs(complex) is equal to the square root of (real * real + imaginary * imaginary) and if you square both sides you can avoid the square root. With that in mind a more detailed version would be:
z.r = 0;
z.i = 0;
while (z.r * z.r + z.i * z.i < 4.0)
{
temp = z.r * z.r – z.i * z.i + c.r;
z.i = z.r * z.i * 2 + c.i;
z.r = temp;
}
That’s correct but it’s not very efficient. The code does six multiplications and at high-precisions those are very expensive. It squares z.r and z.i twice each, and it multiplies by two. All of those can be avoided:
z.r = 0;
z.i = 0;
zrsqr = z.r * z.r;
zisqr = z.i * z.i;
while (zrsqr + zisqr < 4.0)
{
z.i = z.r * z.i;
z.i += z.i; // Multiply by two
z.i += c.i;
z.r = zrsqr – zisqr + c.r;
zrsqr = square(z.r);
zisqr = square(z.i);
}
Now the code is doing three multiplies per iteration, which is the minimum. So it’s perfect, and all we have to worry about is making sure that all of our basic operations – especially squaring and multiplication – are as fast as possible.
When using floating-point math the speed of addition, subtraction, and multiplication are generally identical. However floating-point math has only about 52 bits of precision which is woefully inadequate for serious fractal exploration. All of my fractal programs have featured high-precision math routines to allow virtually unlimited zooming. Fractal eXtreme supports up to 7,680 bits of precision. While this isn’t truly unlimited it is close enough because fractal calculation time generally increases as the cube of the zoom level, and even on the fastest PCs available it exceeds most people’s patience between 500 and 2,000 bits of precision.
The speed of squaring and multiplication is critical because they are O(n^2) operations and everything else is O(n). As ‘n’ (the number of 32-bit or 64-bit words) increases, the time to do the multiplications becomes the only thing that matters. Okay, at high precisions multiplication doesn’t have to be O(n^2), but the alternatives are a lot of work for uncertain gain, so I’m ignoring them.
While multiplication and squaring are both O(n^2) it turns out that squaring can be done roughly twice as fast as multiplication. Half of the partial results when squaring are needed twice, but only need to be calculated once.
All the information necessary to find the algebraic optimization is now available.
The observation (which came from somebody I know only through e-mail) was that the Mandelbrot calculations can be done using three squaring operations rather than two squares and a multiply.
The one multiply is used to calculate zr * zi. If we instead calculate (zr + zi)^2 then we get zr^2 + 2*zr*zi + zi^2. Since we’ve already calculated zr^2 and zi^2 we can just subtract them off and, voila, multiplication through squaring:
z.r = 0;
z.i = 0;
zrsqr = z.r * z.r;
zisqr = z.i * z.i;
while (zrsqr + zisqr < 4.0)
{
z.i = square(z.r + z.i) – zrsqr – zisqr;
z.i += c.i;
z.r = zrsqr – zisqr + c.r;
zrsqr = square(z.r);
zisqr = square(z.i);
}
At low precisions this is inefficient because we are doing an extra add and an extra subtract. However as ‘n’ grows larger this change should give us a 33% speedup. At Fractal eXtreme’s peak precision of 7,680 bits I’ve measured a speedup of about 25.6% so the theoretical predictions appear to be reasonably accurate.
Charts are fun – let’s paste one in (higher is slower):
Fitting the performance of the 32-bit version on the chart makes it hard to see the difference between the two 64-bit versions, so let’s rescale:
This post was supposed to show the benefits of optimization-through-algebra, but the charts really make the case for the importance of 64-bit software when doing high-precision math. At the highest precisions the fastest 64-bit version is running 7.75x faster than the 32-bit version. Yikes.
The Mandelbrot set magnified 6.3 x 10408 times, calculated with 1472 bit precision
A side benefit of this optimization is that the multiplication routine isn’t needed at all, which is a nice code-size saving. The Mandelbrot DLL drops from 2,572,288 bytes to 1,814,528 bytes.
Summary
In practice this optimization isn’t really a lot of use. In the 64-bit version of Fractal eXtreme (which all smart consumers choose because it typically runs at least 4x faster when doing deep zooms) this optimization doesn’t become worthwhile until 512 bits of precision and it doesn’t give even a 10% speedup until 1,024 bits of precision. It’s cool, but virtually unnoticeable for even the hard-core deep-zoom fractal explorer. But, given the minimal effort require to implement the optimization it has an excellent cost/benefit ratio, and an excellent cost/geeky-fun ratio.
Aside: Fractal eXtreme uses Lattice Multiplication to avoid ever having to store intermediate results to memory. I independently invented it years ago and only just found out what it is called.
Pingback: Faster Fractals Through Better Scheduling | Random ASCII
Pingback: Then and Now–Performance Improvements | Random ASCII
By saying your multiplication routine takes up 740K of code, I guess it is completely unrolled and maybe that code for each size of argument is separate. So you avoid some loop overhead by paying in instruction cache misses. Have you benchmarked this on recent hardware to make sure this is worthwhile?
I am reminded of performance optimizations in x264 which shrunk code size of functions to make sure the next function was still in cache to avoid the latency of loading code every iteration.
The 740K saving is for every precision from 128-bit to 7680-bit. At 64-bit precision that is 119 different routines, so the actual code size (1.8 MB now for all of the code) only averages out to about 15 KB per routine, so in general they fit in the i-cache. Even when they don’t the execution patterns are so straightforward (straight through, fully unrolled, virtually no branches) that I suspect the prefetcher can keep up.
i-cache optimizations definitely can be important, but I think they are at their most important when you are branching around a lot. Branching makes prefetching harder, and it means that only part of some i-cache lines will be used. Measurements show that i-cache misses aren’t a huge cost for the Fractal eXtreme calculation code, and they might be zero cost, but I’m not sure.
Pingback: Fractal eXtreme, now cheaper | Random ASCII