There was some recent interest in what type of computer would run Fractal eXtreme’s deep zoom calculations fastest (and cheapest) so I wrote a benchmarking tool that isolates the inner calculations for more accurate measurements and arranged to have it run on a few different processors. Now I know how quickly many processors can execute the core block of high-precision math code. The results probably apply to cryptography as well.
Clock speed matters, hyperthreads are less helpful than I thought, and microarchitectural efficiency matters a lot.
Fractal calculations at floating-point precision (up to ~53 bits) aren’t very interesting anymore because they are already very fast. However when you zoom further into the Mandelbrot set and other fractals the computational load increases, roughly as the cube of the precision (up to over 7,000 bits in Fractal eXtreme). It eventually gets annoyingly slow.
As discussed on a previous post on this topic the core high-precision calculation in 64-bit Fractal eXtreme is this same block repeated again and again, with no loop overhead:
mov rax, [rsi+2*PtrSize]
mul QWORD PTR [rsi+8*PtrSize]
add r11, rax
adc r12, rdx
adc r13, 0
This code loads a 64-bit value, multiplies it by another 64-bit value, and then adds the 128-bit result to a 192-bit accumulator (r13:r12:r11). This building block, with different offsets each time, is repeated hundreds of times, with occasional interruptions to write a register to memory when a diagonal is finished. It takes 272 of these five-instruction sequences to square a 1,024 bit number and get a 1,024 bit result.
My test code has a hundred copies of this pasted into a function which it calls in a loop. It starts calling it on one thread at a time, and then gradually increases the number of threads up to the number of logical threads on the machine. The test takes about one second per logical thread, so typically four to twelve seconds on modern machines. Here are some results from an eight core FX-8150 AMD Bulldozer machine:
VendorID = ‘AuthenticAMD’
Stepping = 0x2, model = 0x1, family = 0xF, type = 0
Signature is 0xF12 (0F_01H)
CPU’s rdtsc speed is 4.228 GHz (peak speed may be higher).
Running tests on 8 thread system.
Performance with 1 threads is 0.864 GBlocks/sec
Performance with 2 threads is 1.694 GBlocks/sec
Performance with 3 threads is 2.592 GBlocks/sec
Performance with 4 threads is 3.311 GBlocks/sec
Performance with 5 threads is 4.123 GBlocks/sec
Performance with 6 threads is 4.758 GBlocks/sec
Performance with 7 threads is 5.445 GBlocks/sec
Performance with 8 threads is 6.123 GBlocks/sec
Because the test program cycles through different numbers of threads, and because it can easily be used to run lots of tests, it was easy to tease out the results of the microarchitecture, hyperthreads, Turboboost, etc.
Note that these results are not particularly applicable to anything else. And since 32-bit high-precision math is at least four times slower I didn’t even bother measuring that. This is 64-bit only.
This one was easiest. By looking at the maximum single-threaded results on a range of different processors, sometimes monitoring their CPU frequencies with CPU-Z, I was able to figure out the throughput for the block of code, in cycles. For CPUs like Bulldozer it was easy – my test machine was locked to 4.2 GHz and the blocks-per-second results divided in quite neatly. It can process a block every five cycles.
The Core 2 processors managed one block every four cycles.
For my Sandybridge laptop it was more difficult because the rdtsc frequency is 2.2 GHz, but the processor can TurboBoost to 2.6 to 3.2 GHz depending on unpredictable factors. My Sandybridge desktop machine is more predictable and that helped to nail down the results at one block every three cycles.
This is really quite amazing. The latency of the Sandybridge 64×64 multiply instruction alone is three cycles, yet it manages to overlap subsequent blocks so well that it completes an entire new block every three cycles. Remarkable.
Here are the results in table form – the lower the “Clocks per block” the better:
|Processor||Clocks per block|
|Intel Core 2||4.0|
And now, some analysis of other factors.
They don’t help. Much. In this context. Except on Atom.
Hyperthreads are threads of execution running simultaneously on the same core. They share execution resources so if one of them is making full use of a set of resources then running the same code on the other hyperthread won’t increase throughput, because they will be contending for the same finite resources. In ‘normal’ situations they help fill in time wasted waiting on cache misses, they intermingle integer and FPU code, or they fill in gaps in code with long latencies. However deep-zoom fractal code has no cache misses, no FPU code, and no latency gaps that can’t be filled in by the out-of-order engine running multiple blocks in parallel.
On a dedicated fractal machine running nothing else there would be no advantage to hyperthreads. On a real computer, running occasional other random tasks, they help let the fractal calculations fill in the gaps left by other tasks, thus keeping the computer closer to 100% productivity. They don’t increase peak throughput, but they make it easier to keep a computer fully busy, and thereby avoid wasting ~10-25% of CPU time.
The one exception is the Intel Atom. Because this is an in-order machine it is unable to run adjacent blocks in parallel, so it ends up being instruction latency bound. Therefore, as Chad’s results in the comments section show, using two threads on one Atom core gives almost a 75% speedup. The hyperthreading allows pipeline bubbles to filled in that on other processors are filled in by out-of-order instruction execution. The net result, however, is still the worst per-core performance, which is not surprising given Atom’s design goal of low-power draw. An Atom core can process one block every 20 cycles, or one block every 11.5 cycles if two hyperthreads are running one one core.
Lots of cores
They help. If they are real cores with independent execution resources of the type that you need (Bulldozer cores count) then they should give linear performance gains.
Turboboost is complicated because it is very hard to tell what frequency your CPU will actually run at. You can monitor the frequency with various tools and try to correlate that to performance but it will vary. Laptop CPUs appear to be the worst. My laptop CPU has a nominal frequency of 2.2 GHz but can Turboboost up to 3.3 GHz. More commonly it seems to peak, under heavy load, at around 2.9 GHz at first and then drops to 2.6 GHz. This makes benchmarking and estimating very tricky. The nominal and maximum rates don’t matter – only the fully loaded rate matters.
Decoding CPU ID
Making sense of the results requires translating the CPU ID information obtained by the benchmarking program into real CPU model numbers. I did that with help from the people running the tests and various Internet resources. This is what I have. Let me know if it is wrong:
- 0F_03H, 0F_04H, 0F_06H, 06_0EH: NetBurst (Pentium 4) architecture
- 06_0EH: Intel Core Solo and Intel Core Duo
- 06_0FH: 65nm Intel Core microarchitecture (Core 2?)
- 06_17H and 06_1DH: Enhanced Intel Core microarchitecture (Core 2?)
- 06_1AH, 06_1EH, 06_1FH, 06_2EH: Nehalem
- 06_25H, 06_2CH, 06_2FH: Westmere
- 06_0AH, 06_0DH, 06_2AH: Sandybridge
I don’t have results for all of these CPU types. If you have one, with a 64-bit OS installed, then consider running this. If you send me the results, perhaps with a screenshot of this part of Computer Properties:
that would be fabulous.
Given that hyperthreads don’t help much (for this task), that number of cores (not logical threads) matters, that CPU frequency matters (allowing for typical TurboBoost frequency) all you have to do is calculate:
#cores * typical maximum frequency / Clocks per block.
It’s important to note that the two-cores-per-module in Bulldozer CPUs count as separate cores for these purposes, but the two hyperthreads per core in Intel CPUs do not. Also, in the list below note that I will list a 2.2 GHz CPU and then show its frequency as 2.6 GHz. This is the difference between nominal frequencies and observed typical Turboboost frequencies.
Here are some specific examples of CPU performance, from highest to lowest.
- 3.2 GHz desktop Sandybridge i7-3930K : 6 * 3.5 GHz / 3 = 7.0 Gblocks/sec
- 4.2 GHz AMD Bulldozer FX-8150: 8 * 4.2 GHz / 5 = 6.72 Gblocks/sec
- 2.2 GHz laptop Sandybridge i7-2720QM: 4 * 2.6 GHz / 3 = 3.47 Gblocks/sec
- 2.67 GHz 65 nm Intel Core Q6700: 4 * 2.67 GHz / 4 = 2.67 Gblocks/sec
- 2.6 GHz AMD Opteron: 2 * 2.6 GHz / 3.33 = 1.56 Gblocks/sec
- 2.4 GHz laptop Core 2 Duo T7700: 2 * 2.4 GHz / 4 = 1.2 Gblocks/sec
- 1.6 GHz Atom 230: 1 * 1.6 GHz / 11.5 = 0.139 Gblocks/sec
In reality I never saw the AMD Bulldozer hit those numbers – it peaked at about 6.12 Gblocks/sec. This may indicate OS overhead, but it looks more like some modest overhead from the shared components of the two-core modules.
The Atom numbers are the aggregate cycles-per-block-per-core assuming two threads per core.
You get the idea. Number of cores times frequency divided by cycles-per-block. AMD does well on the first two, but falls slightly behind on the third and is unable to claim the top spot.
I used to love that two core T7700 laptop. It was about fifteen times faster at deep-zooms than the 32-bit laptop it replaced. Now it just feels slow. My new laptop is almost three times faster, but I still need more cores.
Thanks for the help from the people at fractalforums.
But what about GPUs?
Inevitably people ask whether a GPU would be even faster. In general the answer appears to be no.
GPUs are extremely fast for float, and perhaps even double-precision fractal calculations. However these levels of precision are already so fast on CPUs that speeding them up is not, I think, particularly interesting.
So, the GPUs then need to compete in the high-precision math field, and they are not particularly well designed for that purpose. High-precision multiplies and add-with-carry are not key features of GPUs. I found a site that did some benchmarks of fractal calculations on a few different GPUs. They found that for 128-bit precision the GPUs were up to almost four times faster than the CPUs. That sounds like a victory, but the CPU code was doing 32-bit math (4x slower than 64-bit math), written using loops in C (at least 4x slower than fully unwound assembly language).
So, it seems safe to assume that the GPU would be four times slower than Fractal eXtreme is on the CPU, and therefore not of interest.
Yes, GPUs are continuing to grow faster, but so are CPUs, so I don’t expect things to change in the near future.
Knights Corner (which has an x86 instruction set and add-with-carry and many many cores) has the potential to be very fast at deep-zoom fractals. My back of the envelope calculations suggest that each Knights Corner core should be about 50% faster than a Sandybridge CPU at deep-zoom fractal calculations. This would make a 50 core Knights Corner quite intriguing, if the price is right.