There was some recent interest in what type of computer would run Fractal eXtreme’s deep zoom calculations fastest (and cheapest) so I wrote a benchmarking tool that isolates the inner calculations for more accurate measurements and arranged to have it run on a few different processors. Now I know how quickly many processors can execute the core block of high-precision math code. The results probably apply to cryptography as well.
Clock speed matters, hyperthreads are less helpful than I thought, and microarchitectural efficiency matters a lot.
Fractal calculations at floating-point precision (up to ~53 bits) aren’t very interesting anymore because they are already very fast. However when you zoom further into the Mandelbrot set and other fractals the computational load increases, roughly as the cube of the precision (up to over 7,000 bits in Fractal eXtreme). It eventually gets annoyingly slow.
As discussed on a previous post on this topic the core high-precision calculation in 64-bit Fractal eXtreme is this same block repeated again and again, with no loop overhead:
mov rax, [rsi+2*PtrSize]
mul QWORD PTR [rsi+8*PtrSize]
add r11, rax
adc r12, rdx
adc r13, 0
This code loads a 64-bit value, multiplies it by another 64-bit value, and then adds the 128-bit result to a 192-bit accumulator (r13:r12:r11). This building block, with different offsets each time, is repeated hundreds of times, with occasional interruptions to write a register to memory when a diagonal is finished. It takes 272 of these five-instruction sequences to square a 1,024 bit number and get a 1,024 bit result.
My test code has a hundred copies of this pasted into a function which it calls in a loop. It starts calling it on one thread at a time, and then gradually increases the number of threads up to the number of logical threads on the machine. The test takes about one second per logical thread, so typically four to twelve seconds on modern machines. Here are some results from an eight core FX-8150 AMD Bulldozer machine:
CPU information:
VendorID = ‘AuthenticAMD’
Stepping = 0x2, model = 0x1, family = 0xF, type = 0
Signature is 0xF12 (0F_01H)
CPU’s rdtsc speed is 4.228 GHz (peak speed may be higher).Running tests on 8 thread system.
Performance with 1 threads is 0.864 GBlocks/sec
Performance with 2 threads is 1.694 GBlocks/sec
Performance with 3 threads is 2.592 GBlocks/sec
Performance with 4 threads is 3.311 GBlocks/sec
Performance with 5 threads is 4.123 GBlocks/sec
Performance with 6 threads is 4.758 GBlocks/sec
Performance with 7 threads is 5.445 GBlocks/sec
Performance with 8 threads is 6.123 GBlocks/sec
Because the test program cycles through different numbers of threads, and because it can easily be used to run lots of tests, it was easy to tease out the results of the microarchitecture, hyperthreads, Turboboost, etc.
Note that these results are not particularly applicable to anything else. And since 32-bit high-precision math is at least four times slower I didn’t even bother measuring that. This is 64-bit only.
Microarchitecture
This one was easiest. By looking at the maximum single-threaded results on a range of different processors, sometimes monitoring their CPU frequencies with CPU-Z, I was able to figure out the throughput for the block of code, in cycles. For CPUs like Bulldozer it was easy – my test machine was locked to 4.2 GHz and the blocks-per-second results divided in quite neatly. It can process a block every five cycles.
The Core 2 processors managed one block every four cycles.
For my Sandybridge laptop it was more difficult because the rdtsc frequency is 2.2 GHz, but the processor can TurboBoost to 2.6 to 3.2 GHz depending on unpredictable factors. My Sandybridge desktop machine is more predictable and that helped to nail down the results at one block every three cycles.
This is really quite amazing. The latency of the Sandybridge 64×64 multiply instruction alone is three cycles, yet it manages to overlap subsequent blocks so well that it completes an entire new block every three cycles. Remarkable.
Here are the results in table form – the lower the “Clocks per block” the better:
Processor | Clocks per block |
Intel Sandybridge | 3.0 |
AMD Opteron | 3.33 |
Intel Core 2 | 4.0 |
AMD Bulldozer | 5.0 |
Intel Atom | 20.0/11.5 |
And now, some analysis of other factors.
Hyperthreads
They don’t help. Much. In this context. Except on Atom.
Hyperthreads are threads of execution running simultaneously on the same core. They share execution resources so if one of them is making full use of a set of resources then running the same code on the other hyperthread won’t increase throughput, because they will be contending for the same finite resources. In ‘normal’ situations they help fill in time wasted waiting on cache misses, they intermingle integer and FPU code, or they fill in gaps in code with long latencies. However deep-zoom fractal code has no cache misses, no FPU code, and no latency gaps that can’t be filled in by the out-of-order engine running multiple blocks in parallel.
On a dedicated fractal machine running nothing else there would be no advantage to hyperthreads. On a real computer, running occasional other random tasks, they help let the fractal calculations fill in the gaps left by other tasks, thus keeping the computer closer to 100% productivity. They don’t increase peak throughput, but they make it easier to keep a computer fully busy, and thereby avoid wasting ~10-25% of CPU time.
The one exception is the Intel Atom. Because this is an in-order machine it is unable to run adjacent blocks in parallel, so it ends up being instruction latency bound. Therefore, as Chad’s results in the comments section show, using two threads on one Atom core gives almost a 75% speedup. The hyperthreading allows pipeline bubbles to filled in that on other processors are filled in by out-of-order instruction execution. The net result, however, is still the worst per-core performance, which is not surprising given Atom’s design goal of low-power draw. An Atom core can process one block every 20 cycles, or one block every 11.5 cycles if two hyperthreads are running one one core.
Lots of cores
They help. If they are real cores with independent execution resources of the type that you need (Bulldozer cores count) then they should give linear performance gains.
Turboboost
Turboboost is complicated because it is very hard to tell what frequency your CPU will actually run at. You can monitor the frequency with various tools and try to correlate that to performance but it will vary. Laptop CPUs appear to be the worst. My laptop CPU has a nominal frequency of 2.2 GHz but can Turboboost up to 3.3 GHz. More commonly it seems to peak, under heavy load, at around 2.9 GHz at first and then drops to 2.6 GHz. This makes benchmarking and estimating very tricky. The nominal and maximum rates don’t matter – only the fully loaded rate matters.
Decoding CPU ID
Making sense of the results requires translating the CPU ID information obtained by the benchmarking program into real CPU model numbers. I did that with help from the people running the tests and various Internet resources. This is what I have. Let me know if it is wrong:
- 0F_03H, 0F_04H, 0F_06H, 06_0EH: NetBurst (Pentium 4) architecture
- 06_0EH: Intel Core Solo and Intel Core Duo
- 06_0FH: 65nm Intel Core microarchitecture (Core 2?)
- 06_17H and 06_1DH: Enhanced Intel Core microarchitecture (Core 2?)
- 06_1AH, 06_1EH, 06_1FH, 06_2EH: Nehalem
- 06_25H, 06_2CH, 06_2FH: Westmere
- 06_0AH, 06_0DH, 06_2AH: Sandybridge
I don’t have results for all of these CPU types. If you have one, with a 64-bit OS installed, then consider running infprecperf.exe. If you send me the results, perhaps with a screenshot of this part of Computer Properties:
that would be fabulous.
Raw Performance
Given that hyperthreads don’t help much (for this task), that number of cores (not logical threads) matters, that CPU frequency matters (allowing for typical TurboBoost frequency) all you have to do is calculate:
#cores * typical maximum frequency / Clocks per block.
It’s important to note that the two-cores-per-module in Bulldozer CPUs count as separate cores for these purposes, but the two hyperthreads per core in Intel CPUs do not. Also, in the list below note that I will list a 2.2 GHz CPU and then show its frequency as 2.6 GHz. This is the difference between nominal frequencies and observed typical Turboboost frequencies.
Here are some specific examples of CPU performance, from highest to lowest.
- 3.2 GHz desktop Sandybridge i7-3930K : 6 * 3.5 GHz / 3 = 7.0 Gblocks/sec
- 4.2 GHz AMD Bulldozer FX-8150: 8 * 4.2 GHz / 5 = 6.72 Gblocks/sec
- 2.2 GHz laptop Sandybridge i7-2720QM: 4 * 2.6 GHz / 3 = 3.47 Gblocks/sec
- 2.67 GHz 65 nm Intel Core Q6700: 4 * 2.67 GHz / 4 = 2.67 Gblocks/sec
- 2.6 GHz AMD Opteron: 2 * 2.6 GHz / 3.33 = 1.56 Gblocks/sec
- 2.4 GHz laptop Core 2 Duo T7700: 2 * 2.4 GHz / 4 = 1.2 Gblocks/sec
- 1.6 GHz Atom 230: 1 * 1.6 GHz / 11.5 = 0.139 Gblocks/sec
In reality I never saw the AMD Bulldozer hit those numbers – it peaked at about 6.12 Gblocks/sec. This may indicate OS overhead, but it looks more like some modest overhead from the shared components of the two-core modules.
The Atom numbers are the aggregate cycles-per-block-per-core assuming two threads per core.
You get the idea. Number of cores times frequency divided by cycles-per-block. AMD does well on the first two, but falls slightly behind on the third and is unable to claim the top spot.
I used to love that two core T7700 laptop. It was about fifteen times faster at deep-zooms than the 32-bit laptop it replaced. Now it just feels slow. My new laptop is almost three times faster, but I still need more cores.
Thanks for the help from the people at fractalforums.
But what about GPUs?
Inevitably people ask whether a GPU would be even faster. In general the answer appears to be no.
GPUs are extremely fast for float, and perhaps even double-precision fractal calculations. However these levels of precision are already so fast on CPUs that speeding them up is not, I think, particularly interesting.
So, the GPUs then need to compete in the high-precision math field, and they are not particularly well designed for that purpose. High-precision multiplies and add-with-carry are not key features of GPUs. I found a site that did some benchmarks of fractal calculations on a few different GPUs. They found that for 128-bit precision the GPUs were up to almost four times faster than the CPUs. That sounds like a victory, but the CPU code was doing 32-bit math (4x slower than 64-bit math), written using loops in C (at least 4x slower than fully unwound assembly language).
So, it seems safe to assume that the GPU would be four times slower than Fractal eXtreme is on the CPU, and therefore not of interest.
Yes, GPUs are continuing to grow faster, but so are CPUs, so I don’t expect things to change in the near future.
Knights Corner (which has an x86 instruction set and add-with-carry and many many cores) has the potential to be very fast at deep-zoom fractals. My back of the envelope calculations suggest that each Knights Corner core should be about 50% faster than a Sandybridge CPU at deep-zoom fractal calculations. This would make a 50 core Knights Corner quite intriguing, if the price is right.
I noticed you made a thumbnail out of my bench file. Neat. For benchmarking purposes, you could manually set the multiplier in BIOS (if the processor is unlocked) or optionally disable turbo mode altogether, to get a stable clock for measuring purposes. This should apply to both Intel and AMD architectures.
I have a couple more observations I made with regards to the AMD Bulldozer platform. The older Phenom II architecture apparently got more work done per clock than the newer Bulldozer. The bench file I procured, rendered in 4:55 on my 4-core Phenom II @ 3.2Ghz. The same file rendered in 2:41 on my newer 8-core Bulldozer system, with the clocks locked in at 4.2Ghz and a new BIOS update to disable CPU throttling. That’s twice as many cores, 1Ghz faster clock speed, and less than twice the performance. 😐 Because of this, I have a strong speculation that the clock cycles per block is lower on the older Phenom II platform than on the newer Bulldozer. I did not get to test the new benchmark tool on the old Phenom II processor since it was released some time after completion of my new build, and to test it now would require me to do a processor swap, which would be a pain. At stock speeds, I predict the 6-core Phenom II x6 1090T would compete on par with the FX-8150 (they both run at a stock non-turbo speed of 3.6Ghz) due to lower latency of the Phenom II platform.
One more aspect of the Bulldozer architecture, there is a very small performance penalty for running two integer threads together on the same module compared to running on separate modules. This likely accounts for the This likely accounts for the 9% difference in performance between the real world 6.123 and theoretical 6.72 values. For floating point code, however, that performance penalty would be significantly greater due to the shared FPU resources.
I hope you don’t mind using your benchmark image to decorate the article — I tried to find the creator but couldn’t track it down.
If somebody does manage to run some tests on a 64-bit Phenom II that would be great — I’ll update the article if I get them.
You could be right about the sharing of resources (instruction decode mostly) causing a mild performance penalty when all cores are used. It could also be overhead from other processes running but I like your theory.
FWIW decoding the CPU ID – I ran into a useful catalogue of documented cpuid -> microarch in LLVM – check out llvm\lib\Support\Host.cpp.
for 06_0Fh we have:
case 15: // Intel Core 2 Duo processor, Intel Core 2 Duo mobile
// processor, Intel Core 2 Quad processor, Intel Core 2 Quad
// mobile processor, Intel Core 2 Extreme processor, Intel
// Pentium Dual-Core processor, Intel Xeon processor, model
// 0Fh. All processors are manufactured using the 65 nm process.
06_17h:
case 23: // Intel Core 2 Extreme processor, Intel Xeon processor, model
// 17h. All processors are manufactured using the 45 nm process.
//
// 45nm: Penryn , Wolfdale, Yorkfield (XE)
06_1DH:
case 29: // Intel Xeon processor MP. All processors are manufactured using
// the 45 nm process.
Here are my results – Intel Core2 Quad Q9450 (Yorkfield – Penryn) @ 2.66Ghz.
VendorID = ‘GenuineIntel’
Stepping = 0x7, model = 0x7, family = 0x6, type = 0
Signature is 0x677 (06_07H)
CPU’s rdtsc speed is 2.666 GHz (peak speed may be higher).
Running tests on 4 thread system.
Performance with 1 threads is 0.655 GBlocks/sec
Performance with 2 threads is 1.302 GBlocks/sec
Performance with 3 threads is 1.943 GBlocks/sec
Performance with 4 threads is 2.524 GBlocks/sec
.. time to go shopping for a Sandybridge!
Been really enjoying reading about this stuff. After reading the earlier article on out-of-order-execution, I had a thought.
What if, instead of a second ADC, you had a SETC and an ordinary ADD from the setc’d register?
Could the SETC coissue with the remaining ADC, getting you closer to the theoretical 2 cycle MUL throughput those docs claim on Sandy Bridge?
I suppose I should try it out. But if you happen to know offhand why it wouldn’t work, that would be interesting too!
I gave it a try. I was expecting a significant drop in performance but in reality there was either a slight drop or no change. Here’s my modified MASM code:
ByteOffset = 0
Repeat 1000
mov rax, [rsi+ByteOffset+0*PtrSize]
mul QWORD PTR [rsi+ByteOffset+1*PtrSize]
add r9, rax
adc r10, rdx
;;;;adc r8, 0
setc al
add r8, rax
ByteOffset = ByteOffset + PtrSize
ENDM
The SETC can’t co-issue with anything in the same block. Every instruction is dependent on the previous one, but now there are six instructions in the chain instead of just five.
This alteration requires an extra instruction, a longer dependency change, and it causes the code to write to al (the bottom eight-bits of rax) and then read from all of rax, so I really thought it would be a lot slower. Count me surprised that it worked as well as it did.
Yep, it’s counter-intuitive. But, according to the Agner Fog docs, ADC is two microops, so it seemed like substituting two similar ‘real’ ops wasn’t a loss, as long as it didn’t hit some kind of decode limit.
Now that I’ve tried it (in 32-bit, since I don’t have MASM) I got the same results (no improvement) Thanks for giving it a shot.
Looking at the second timing chart in your “then and now” performance article, I was hoping the change could, in essence, get the ADC in column 8 to happen in column 7, by having the first of the two broken-up ops co-issue with the ADC in column 8. (Same for 11 and 10, which would now be 10 and 9.) The SETC would coissue with a _different_ block’s ADC.
The Agner doc says SETC from a register uses either port0 or port5 and has a reciprocal throughput of 0.5 (so two per clock.) But, the doc isn’t too clear exactly what’s going on in with ADC (it says it has 2 p015 ops, with ‘x’ under all three.) I was hoping that ADC would leave port 0 or 5 open for a SETC to use. Guess not (or there’s some other reason I’m missing.) Wish the doc were a bit clearer in that regard. (While I’m wishing for things, wish Intel would just document this info!)
Something else that’s fun is when I added an XOR eax, eax above the setc (again, 32 bit,) it didn’t slow down at all. Another fun thing I tried was changing the _first_ adc to a setc and two adds, and the code only slowed down a slight bit.
I haven’t really played too much with Sandy Bridge, and I’ll keep playing with it!
For some reason I can’t reply to your reply, so I’m replying to mine.
Don’t have MASM? You can always get it. It’s been free for years.
http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=12654
The interactions between the different parts of the CPU pipelines — and with three or more blocks going through the execution units simultaneously, with many more in various stages of fetch/decode/register-read/retirement — is daunting. At some point the interactions may be so complex and chaotic that it is not even possible to document their behavior.
Hey, I can’t reply to it either. So I’ve FINALLY got MASM/x64 working under VS2010 Express. Did some playing, and this is the best I came up with:
mov rax, [rsi+ByteOffset+0*PtrSize]
mul QWORD PTR [rsi+ByteOffset+1*PtrSize]
add r9, rax
adc ecx, 0
add r10, rdx
adc ebx, 0
On average, it seems to run about 5% faster than the original code on my Sandy Bridge CPU– so definitely not earth shattering. It unchains the high add from the low, by collecting the carries from low into another register, to be added immediately prior to the final step of the diagonal. [The adcs are 32-bit to avoid the REX prefix, but that isn’t critical.] I’ve probably done all I can here, and this isn’t that great, or likely portable across microarchitectures, but I thought it was fun and you might want to see it.
That’s cool Chad. Thinking outside the box. I’ll have to investigate that idea. I suspect that the modest returns are because you’re increasing the instruction count, and therefore hitting other limits (instruction decode, for instance).
Avoiding the REX prefix is a good idea independently — at some point making the code smaller is its own benefit.
This is an Intel E6550: http://ark.intel.com/products/30783/Intel-Core2-Duo-Processor-E6550-(4M-Cache-2_33-GHz-1333-MHz-FSB)
VendorID = ‘GenuineIntel’
Stepping = 0xB, model = 0xF, family = 0x6, type = 0
Signature is 0x6FB (06_0FH)
CPU’s rdtsc speed is 2.333 GHz (peak speed may be higher)
Running tests on 2 thread system.
Performance with 1 threads is 0.577 GBlocks/sec
Performance with 2 threads is 1.153 GBlocks/sec
The results are consistent with your old laptop. It’s nice to see wine does not mess up cpu-bound apps.
I’d really be interested in seeing result on a 64-bit capable Atom. I suspect it will suck, but how much?
Tests on an Atom would be fascinating. Anybody got a 64-bit netbook they can run this on? The in-order nature of the Atom, plus its general slowness, would definitely lead to poor results. I’d guess 10+ cycles per block.
A 64-bit Pentium 4 and an older AMD processor would also be fascinating.
Anyone?
Here you go:
Intel(R) Atom(TM) CPU 230 @ 1.60GHz
—————————————-
VendorID = ‘GenuineIntel’
Stepping = 0x2, model = 0xC, family = 0x6, type = 0
Signature is 0x6C2 (06_0CH)
CPU’s rdtsc speed is 1.600 GHz (peak speed may be higher).
Running tests on 2 thread system.
Performance with 1 threads is 0.080 GBlocks/sec
Performance with 2 threads is 0.139 GBlocks/sec
Running tests on 2 thread system.
Performance with 1 threads is 0.080 GBlocks/sec
Performance with 2 threads is 0.138 GBlocks/sec
Running tests on 2 thread system.
Performance with 1 threads is 0.080 GBlocks/sec
Performance with 2 threads is 0.138 GBlocks/sec
And also,
Dual Core AMD Opteron(tm) Processor 285 2.59 GHz
—————————————————–
CPU information:
VendorID = ‘AuthenticAMD’
Stepping = 0x2, model = 0x1, family = 0xF, type = 0
Signature is 0xF12 (0F_01H)
CPU’s rdtsc speed is 2.593 GHz (peak speed may be higher).
Running tests on 2 thread system.
Performance with 1 threads is 0.776 GBlocks/sec
Performance with 2 threads is 1.546 GBlocks/sec
Running tests on 2 thread system.
Performance with 1 threads is 0.777 GBlocks/sec
Performance with 2 threads is 1.546 GBlocks/sec
Running tests on 2 thread system.
Performance with 1 threads is 0.776 GBlocks/sec
Performance with 2 threads is 1.541 GBlocks/sec
Thanks Chad. That’s awesome! The Atom clearly takes 20 clocks per block. The AMD chip is not so clear. The obvious ratio is 3.34:1, which seems unlikely. More likely the CPU is overclocking to 3.1 GHz, which gives 4 cycles per block.
Is the Atom chip hyperthreaded, or two cores? I’m guessing hyperthreaded. The AMD chip (what model is it?) appears to be two genuine cores.
I’ll update the chart when I get a chance.
Yessir– the Atom is a “230”, codename “Diamondville”, from 2008, which is hyperthreaded. The AMD Opteron is a “285”, codename “Italy” from 2006, with two actual cores. (It’s a champ– I used that computer up until last year.) I believe the AMD predates any kind of dynamic clock rate adjustment technology, but I’m not sure.
Thanks — very helpful. I’ve updated the post.
Pingback: Fractal eXtreme, now cheaper | Random ASCII
Pingback: Out of Order Benefits | Random ASCII
So i came across this today realized i could possibly help so here is my run on my desktop
Its a weird chip quad core unlocked into a 6 core then overclocked to 3.8ghz. If wanted i could always go down to stock clocks. The cpu is an AMD Phenom II x4 960T when unlocked into a six core it becomes a AMD Phenom II x6 1600T in CPU-Z. TurboCore is disabled in the bios and should not be changing cpu frequency at all.
Anyways here are my results:
CPU information:
VendorID = ‘AuthenticAMD’
Stepping = 0x0, model = 0xA, family = 0xF, type = 0
Signature is 0xFA0 (0F_0AH)
CPU’s rdtsc speed is 3.813 GHz (peak speed may be higher).
Running tests on 6 thread system.
Performance with 1 threads is 1.118 GBlocks/sec
Performance with 2 threads is 2.285 GBlocks/sec
Performance with 3 threads is 3.402 GBlocks/sec
Performance with 4 threads is 4.531 GBlocks/sec
Performance with 5 threads is 5.680 GBlocks/sec
Performance with 6 threads is 6.807 GBlocks/sec
and to confirm the first reply i tried to get the closest to your FX-8150. at 3.4 ghz i was able to get 6.073 GBlocks/sec vs the 6.123 the FX-8150 got.
CPU information:
VendorID = ‘AuthenticAMD’
Stepping = 0x0, model = 0xA, family = 0xF, type = 0
Signature is 0xFA0 (0F_0AH)
CPU’s rdtsc speed is 3.411 GHz (peak speed may be higher).
Running tests on 6 thread system.
Performance with 1 threads is 1.022 GBlocks/sec
Performance with 2 threads is 2.035 GBlocks/sec
Performance with 3 threads is 3.043 GBlocks/sec
Performance with 4 threads is 4.071 GBlocks/sec
Performance with 5 threads is 5.074 GBlocks/sec
Performance with 6 threads is 6.073 GBlocks/sec
So i would say that StarDust was quite correct that a 1090t would be similar to the performance of the fx-8150 but in my tests running at 3.6 got more around 6.4 GBlocks/sec.
Pingback: Floating-Point Determinism | Random ASCII
Pingback: Floating-Point Determinism | Random ASCII
Have GPUs surpassed CPUs since this was written?
I have heard that modern GPUs have some instructions (add with carry, most critically) that probably makes GPUs faster than CPUs for raw fractal calculations now. I don’t have any immediate plans to test this unfortunately – it would probably be quite tricky since a naive algorithm would probably cause most of the GPUs threads to be doing useless work most of the time.