In a recent post I made an offhand comment that in one fractal calculation scenario my new laptop is about 37,500 times faster than my old Amiga 1000. I suspected that in other fractal calculation scenarios the improvement is much greater and decided to do some calculations and experiments.

## Deep zooming

As you zoom deeper and deeper into the Mandelbrot set the calculations have to be carried out with ever greater precision. Soon the ~53 bits of double-precision floating point is insufficient and hand-written fixed-point routines are required.

Calculating the Mandelbrot set requires addition, subtraction, squaring and comparison. The cost of most of these operations increases linearly with the precision, but the cost of squaring (unless you are doing fancy techniques that don’t become applicable until thousands of digits) increases as the square of the precision. So, in the limit the only thing that matters is the time to square numbers.

In order to get absolute maximum speed Fractal eXtreme uses fully unwound math routines – there is close to zero loop overhead because the code to square a number is written as a linear sequence of instructions. In the 64-bit version of Fractal eXtreme the basic building block is:

mov rax, [rsi+2*PtrSize]

mul QWORD PTR [rsi+8*PtrSize]

add r11, rax

adc r12, rdx

adc r13, 0

This code loads a 64-bit value, multiplies it by another 64-bit value, and then adds the 128-bit result to a 192-bit accumulator (r13:r12:r11). This building block, with different offsets each time, is repeated hundreds of times, with occasional interruptions to write a register to memory when a diagonal is finished.

The 68000 processor in the Amiga 1000 didn’t have a 64-bit multiply instruction, or even a 32-bit multiply instruction. Its multiply instruction took two 16-bit inputs and produced a 32-bit result. The Amiga 1000 building block for high-precision squaring is:

MOVE.W (A4)+,D0

MULU -(A5),D0

ADD.L D0,D2

ADDX.W D3,D1

In this case D3 is a register that contains zero and the accumulator is D1:D2. The 68000 version is shorter than the x64 version because the 68000 registers are 32-bit and can hold the entire result of the 16-bit multiplication. The use of post-increment and pre-decrement addressing modes means that the exact same block of code can be reused, pasted as many times as necessary.

Now let’s try counting cycles:

MOVE.W (A4)+,D0 ; 8 cycles

MULU -(A5),D0 ; 44-76 cycles

ADD.L D0,D2 ; 8 cycles

ADDX.W D3,D1 ; 4 cycles

Back in the old days processors didn’t execute instructions out of order, and memory could keep up with processors, so instruction timings were typically quite straightforward and predictable. The MOVE.W takes 8 cycles because there are two memory accesses (one for the instruction, one for the data) and each one takes 4 cycles. The MULU takes a variable amount of time depending on how many bits are set in the first operand (38+n*2) which averages out to 54, plus six cycles for the pre-decrement addressing mode. The ADD.L takes 8 cycles because the 68000 didn’t have a 32-bit ALU so it had to run the 16-bit adder twice, and the ADDX.W instruction takes 4 cycles because there is one memory access, for the instruction.

The average number of cycles, therefore, is 8+60+8+4=80 cycles.

The timing on the x64 code is trickier to calculate. The timings below assume a Sandy Bridge processor, available as of 2011, and are all latency timings:

mov rax, [rsi+2*PtrSize] ; 1 cycle

mul QWORD PTR [rsi+8*PtrSize] ; 3 cycles

add r11, rax ; 1 cycle

adc r12, rdx ; 2 cycles

adc r13, 0 ; 2 cycles

The timing for the 64×64 mul instruction isn’t documented but we can measure it reasonably well and it has been measured well by others. The 3 cycles for a 64-bit multiply is pretty impressive, and a significant drop from the 7 cycles for the Core 2 line.

If we add the instruction latencies then we find that this block of code has a total latency of 9 cycles. That means that code that depends on the last result in r13 can’t start executing until 9 cycles after the ‘mov’ instruction starts executing.

However latency is not necessarily the correct measure when you are dealing with an out-of-order processor. The naïve assumption is that since each block of x64 code is accumulating to the same registers that the second block can’t start until the first block is finished. But this is not the case. Each instruction has its own non-obvious dependencies:

- The mov instruction has no dependencies, except for rsi which doesn’t change. Due to the magic of register renaming it is unconstrained by previous blocks.
- The mul instruction is dependent only on rsi (which doesn’t change) and on the mov instruction that precedes it. It is unconstrained by previous blocks and can execute one cycle after the mov.
- The add instruction is dependent on r11 from the previous block, and on the result of the multiply instruction that precedes it. It can execute one cycle after the previous block’s add, or three cycles after the multiply instruction that precedes it, whichever comes first.
- Similarly the adc instructions depend on the result of the immediately previous instruction, and the same instruction from the previous block.

All this leaves us with this non-obvious result – clock cycles are along the top, and each row is a new multiplication block:

This shows that if we look only at latency then there is nothing preventing all of the mov instructions from executing simultaneously. And similarly for the mul instructions. However the latency of the adc instructions, together with the dependencies between r11/r12/r13, restrict our performance to one block every 2 cycles.

Realistically there are limitations on the number of physical execution units available. Sandy Bridge processors can start a maximum of 3 mov instructions per cycle, 1 mul instruction per cycle, 3 add instructions per cycle, and 1 adc instruction per cycle. That leads us to this diagram:

The maximum of 3 mov instructions started per cycle, 1 mul instruction started per cycle, and 1 adc instruction started per cycle all force instructions to be pushed out. In cycle 9, for instance, we can’t do the adc r13 from the second block or the adc r12 from the third block because the adc r12 from the second block hasn’t completed yet, and they both depend on it. The adc r12 from the fourth block is right out because it depends on the adc r12 from the third block.

When we combine the 2 cycle latency of adc with the restriction that we can only issue one adc per cycle then we find that our performance drops to one block executed every 3 cycles. I had not realized that until I started creating the scheduling spreadsheet and it was a surprise to me that the performance of the multiply block is constrained by a boring instruction like adc.

That’s the theoretical speed, but out-of-order processors are famously complicated, and we could be limited by some other factor (decode bandwidth, reservation stations, technical mumbo jumbo), so let’s do some measurements. According to CPU-Z my Sandy Bridge processor will crank itself up to 3.193 GHz (there are some indications it actually goes up to 3.293 GHz), however my measurements show that rdtsc always ticks along at 2.195 GHz, so if I measure with rdtsc I would expect a result of 3.0 * 2.195 / 3.193, or 2.06 cycles. And, across a wide range of test runs and loop lengths, that’s basically exactly what I measured. More than 95% of the results are between 2.01 and 2.10 cycles. It’s a triumph of theory predicting reality.

For it to count as good science I’d have to be able to predict timings for variations on the functions. According to the chart above, if I commented out the ‘mul’ instructions it shouldn’t make any difference to the performance. Unfortunately that’s where theory diverged from reality, because it actually clocked in at about 1.47 rdtsc cycles, or 2.1 ‘real’ clock cycles. I can’t explain it. I suspect that the latencies for the carry flag output and the register output are different, but this is pure speculation. Oh well.

My Sandy Bridge laptop has 4 hyperthreaded cores, so 8 logical processors. Ideally this would give an 8 times speedup, but measurements I did here suggest that I actually get about a 3.88 times speedup on this workload. Your mileage may vary.

So…

26.67 times fewer cycles per block, 446 times higher clock frequency, 16 times more work done per block, and a 3.88 times multiplier from the extra cores. This gives us a final speedup of:

If we went back further, to my trusty Apple ][ with its anemic 6502 processor that didn’t even have a multiply instruction, then we’d see a significantly larger jump.

## Conclusion

Whadya know? Computers are faster than they used to be. Over the last 16 years they have, by this measure, doubled in speed more than 19 times, thus outpacing even the optimistic projections of Moore’s law. Nice.

And, out-of-order processors can execute code amazingly quickly, but be limited by unexpected factors. It looks like the best thing that Intel and AMD could do to speed up this workload (which has a lot in common with encryption) would be to decrease the latency of adc, or add a 128-bit add instruction.

Very informative for us computer architecture geeks! I am mostly into PowerPC and MIPS where I sometimes have to do manual pipeline scheduling to get the best performance for packet processing chores. But it looks like Intel will bring the x86 architecture to this market as well.

Pingback: Fractal and Crypto Performance | Random ASCII

Time to report it to Microsoft. Share the repro, or a crash dump. That’s the only way things get improved.