A few months ago I wrote about some experiments to see how fast modern CPUs are at doing high-precision math such as is used in cryptography and high-precision fractal calculations.
I recently read an article that claimed that out-of-order processors give only about a 20% performance improvement and I decided to use my high-precision math code to test this claim.
Before proceeding you might want to glance at the original article and guess how much of the code’s performance is dependent on out-of-order execution. The inner block was the five instructions below unrolled such that there is virtually no loop overhead:
mov rax, [rsi+Constant1]
mul QWORD PTR [rsi+Constant2]
add r11, rax
adc r12, rdx
adc r13, 0
AMD Bulldozer took 5.0 cycles per block, Intel Core 2 took 4.0, and Intel Sandybridge took 3.0 – impressive since the latency for its 64-bit multiply instruction is 3.0 cycles. The question to ask yourself, is how many cycles per block would the Sandybridge processor take if it couldn’t use out-of-order execution?
The answer isn’t here
There was one in-order processor in the test so you might think that you should compare its results to the Sandybridge results. But that would measure something else. The in-order processor was an Atom processor which is slow in many other ways. It took 20.0 cycles per block but that is because it contains multiple design decisions that optimize it for reduced power consumption rather than performance, and out-of-order is just one of those.
What we really need is a processor that is just like Sandybridge, but with out-of-order disabled. That CPU does not exist. But we can fake it.
Reordering is magic
Reordering works really well on this code because, while each instruction in a block is dependent on the previous one, adjacent blocks are almost independent. As an example, all of the mov instructions are independent of each other and, due to the magic of register renaming, can theoretically all execute in parallel, subject only to the look-ahead abilities of the out-of-order processor. On a magical out-of-order processor with infinite resources, with one block of five instructions per row and clock cycles labeled along the top, execution could look like this:
Real out-of-order processors have limitations, especially on how frequently they can do 64-bit multiplies, but this diagram makes the point about the cool craziness that can happen with instruction reordering.
In-order is more predictable
An in-order processor can’t reach ahead, but it can probably issue multiple instructions per cycle if they are adjacent and independent. The only instructions that fit this description in my sample code are the final adc of one block and the first mov of the next block. That means that, once it gets going, it should be executing three instructions serially, and then two instructions in parallel – something like this:
The clock cycles in both cases assume one cycle latency for mov, add, and adc, and three cycle latency for mul. However, the actual latencies don’t particularly matter – it’s the topology of the dependencies that we’re trying to recreate.
In order to handicap our Sandybridge to make it behave like an in-order processor we just need to stop the mul instruction from executing until after the adc in the previous block. This is easily done by having the mul instruction multiply by r13 instead of by a value from memory so that it depends on the result of the last adc. The mov instruction then becomes irrelevant – and since all of our data is in the cache it doesn’t much matter whether we delete it or retain it.
Results go here
And now we get to our conclusion. Well, the first of two conclusions. The reordering magic on a Sandybridge processor lets this code run roughly 2.4 times faster than without reordering. The second conclusion is that reordering lets the code run 1.26 times faster. You get to choose!
Performance with 1 threads is 1.075 GBlocks/sec
Performance with 2 threads is 2.085 GBlocks/sec
Performance with 3 threads is 2.587 GBlocks/sec
Performance with 4 threads is 3.060 GBlocks/sec
Performance with 5 threads is 3.636 GBlocks/sec
Performance with 6 threads is 3.761 GBlocks/sec
Performance with 7 threads is 3.791 GBlocks/sec
Performance with 8 threads is 3.801 GBlocks/sec
Here are the results with reordering inhibited through extra dependencies:
Performance with 1 threads is 0.445 GBlocks/sec
Performance with 2 threads is 0.870 GBlocks/sec
Performance with 3 threads is 1.230 GBlocks/sec
Performance with 4 threads is 1.609 GBlocks/sec
Performance with 5 threads is 1.961 GBlocks/sec
Performance with 6 threads is 2.321 GBlocks/sec
Performance with 7 threads is 2.664 GBlocks/sec
Performance with 8 threads is 2.998 GBlocks/sec
The single-core performance increase from reordering is impressive. With one to four threads running the reordering appears to increase performance by 2.4 to 1.9 times. However once all the hyperthreads show up the gap narrows significantly, to just 1.26 times. It would appear that an in-order Sandybridge would take 7-8 cycles per block, but hyperthreading lets it squeeze out more performance.
There’s more than one way to handle data hazards
In hindsight it all makes perfect sense. Reordering lets multiple loops run in parallel – enough that the limiting factor is how many 64-bit mul instructions can be executed per cycle. Using hyperthreading doesn’t help much in this case because the four cores are (mostly) already maxed out.
When we inhibit reordering the code’s performance is limited by instruction latency and data dependencies. The multiply unit is sitting idle much of the time so there are execution resources available. These idle resources are exactly what hyperthreading is designed to consume, and so it does.
Maybe we should write better code
One could argue that single-threaded code gets better performance if you code for it. If a programmer or compiler schedules instructions then data dependencies can be minimized. Yes, this is true. Sometimes. But there are tradeoffs.
Scheduling the instructions works badly on x86 and x64 because instructions like mul have fixed input and output registers. And, in general if you manually schedule instructions you need to use more architectural registers, which may hit instruction set limits, or may require extra time to save and restore at function boundaries. In addition to giving faster code, reordering lets us write simpler and more straightforward code, that doesn’t waste as much time on setup, or potentially wasteful instructions. Sometimes runtime optimization really is the most efficient option.
This comparison necessarily looked to see how a Sandybridge processor would compare to an identical processor with the out-of-order hardware disabled. Of course nobody would just disable this hardware. Instead they would remove it and use the space for something else. Perhaps deleting the out-of-order hardware would allow for higher clock speeds, or an additional processor core. More likely, however, the extra die area would just be used for larger caches, because one of the main uses of out-of-order execution is covering the horrendous latencies of cache misses.
Get it? Prologue at the end? It’s out of order. Hilarious, right?
The actual performance of out-of-order processors is highly dependent on what code is being executed, the details of the processor implementation, and how the code and processor would differ in an alternate universe. Thus, I don’t claim that my results are representative of anything else, but this is real code doing real work (cryptography, or fractals, your choice), so take it for what it is.
For single-threaded code, or code that can’t make use of every thread on a modern CPU, reordering is an enormous boon. For embarrassingly parallel code that can use every thread available, the benefit may be less significant.
In many cases the ability of out-of-order execution to use more registers than are architecturally available can lead to performance improvements that are difficult or impossible to realize with in-order execution.
It’s interesting to see how much of a speedup the hyperthreads gave in the in-order case. Because there were lots of execution resources available the eight-thread case (using all the hyperthreads) ran 1.86 times faster than the four-thread case. It’s good to see hyperthreading coming close to its theoretical maximum of doubling performance.
On the original code out-of-order execution gave a 2.4 times speedup, or 1.26 times if you’re using hyperthreads. On the modified code – where every instruction depends on the previous instruction – out-of-order execution of course gives no speedup at all.