Zeroing out a CPU register seems like the simplest and most basic operation imaginable, but in fact x86 CPUs contain a surprising amount of special logic to make this operation run smoothly. The most obvious way of zeroing an x86 CPU register turns out to not be the best, and the alternative has some surprising characteristics.
The curious result of this investigation is a mathematical asymmetry where subtraction is, in some cases, faster than addition. This analysis was inspired by comments on Comparing Memory is Still Tricky.
Tabula rasa
The x86 instruction set does not have a special purpose instruction for zeroing a register. An obvious way of dealing with this would be to move a constant zero into the register, like this:
mov eax, 0
That works, and it is fast. Benchmarking this will typically show that it has a latency of one cycle – the result can be used in a subsequent instruction on the next cycle. Benchmarking will also show that this has a throughput of three-per-cycle. The Sandybridge documentation says that this is the maximum integer throughput possible, and yet we can do better.
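To illustrate what three-per-cycle means, here is a minimal sketch (not from the original benchmark): these three moves have no dependencies between them, so on Sandybridge all of them can begin executing on the same cycle.

mov eax, 0   ; independent of the two below...
mov ebx, 0   ; ...so all three can issue together...
mov ecx, 0   ; ...giving three instructions per cycle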
It’s too big
The x86 instruction used to load a constant value such as zero into eax consists of a one-byte opcode (0xB8) and the constant to be loaded. The problem, in this scenario, is that eax is a 32-bit register, so the constant is 32 bits, and we end up with a five-byte instruction:
B8 00 00 00 00 mov eax, 0
Instruction size does not directly affect performance – you can create lots of benchmarks that will prove that it is harmless – but in most real programs the size of the code does have an effect on performance. The cost is extremely difficult to measure, but it appears that instruction-cache misses cost 10% or more of performance on many real programs. All else being equal, reducing instruction sizes will reduce i-cache misses, and therefore improve performance to some unknown degree.
Smaller alternatives
Many RISC architectures have a zero register in order to optimize this particular case, but x86 does not. The recommended alternative for years has been to use xor eax, eax. Any register exclusive-ored with itself gives zero, and this instruction is just two bytes long:
33 C0 xor eax, eax
Careful micro-benchmarking will show that this instruction has the same one-cycle latency and three-per-cycle throughput as mov eax, 0, and it is 60% smaller (and recommended by Intel), so all is well.
Suspicious minds
If you really understand how CPUs work then you should be concerned about possible problems with using xor eax, eax to zero the eax register. One of the main limitations on CPU performance is data dependencies. While a Sandybridge processor can potentially execute three integer instructions on each cycle, in practice its performance tends to be lower because most instructions depend on the results of previous instructions, and are therefore serialized. The xor eax, eax instruction is at risk of such serialization because it uses eax as an input. Therefore it cannot (in theory) execute until the last instruction that wrote to eax completes. For example, consider the code fragment below:
1: add eax, 1
2: mov ebx, eax
3: xor eax, eax
4: add eax, ecx
Ideally we would like our awesome out-of-order processor to execute instructions 1 and 3 in parallel. There is a literal data dependency between them, but a sufficiently advanced processor could detect that this dependency is artificial. The result of the xor instruction doesn’t depend on the value of eax; it will always be zero.
It turns out that x86 processors have for years handled xor of a register with itself specially. Every out-of-order Intel and AMD processor that I am aware of can detect that there is not really a data dependency and it can execute instructions 1 and 3 in parallel. Which is great. The CPUs use register renaming to ‘create’ a new eax for the sequence of instructions starting with instruction 3.
It gets better
On Sandybridge this gets even better. The register renamer detects certain instructions (xor reg, reg and sub reg, reg and various others) that always zero a register. In addition to realizing that these instructions do not really have data dependencies, the register renamer also knows how to execute these instructions – it can zero the registers itself. It doesn’t even bother sending the instructions to the execution engine, meaning that these instructions use zero execution resources, and have zero latency! See section 2.1.3.1 of Intel’s optimization manual where it talks about dependency breaking idioms. It turns out that the only thing faster than executing an instruction is not executing it.
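For illustration, here is a small sample of the zeroing idioms the renamer recognizes – an illustrative subset, not the manual’s full list:

xor eax, eax      ; always zero, regardless of eax’s old value – handled by the renamer
sub eax, eax      ; equally always zero – also handled by the renamer
pxor xmm0, xmm0   ; the vector registers have equivalent zeroing idioms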
Show us the measurements!
A full measurement of the performance of instructions on out-of-order processors is impractical, but I did come up with a micro-benchmark which can show this particular difference. The following eight instructions have no data dependencies between them so their performance is limited only by the integer throughput of the processor. By repeating this block of code many times (I found that seventy worked best) and carefully timing it I find that, as promised, my Sandybridge processor can execute three integer add instructions per cycle.
add r8, r8
add r9, r9
add r10, r10
add r11, r11
add r12, r12
add r13, r13
add r14, r14
add r15, r15
If I change the add opcode to sub or xor then the performance on many processors is unchanged. But on Sandybridge the throughput increases to four instructions per cycle.
IPC of 2.9943 for independent adds
IPC of 3.9703 for independent subs
IPC of 3.9713 for independent xors
An even more dramatic result is found if you repeat “add r8, r8” hundreds of times. Because every instruction is dependent on the previous one this code executes at a rate of one instruction per cycle. However if you change it to “sub r8, r8” then the dependency is recognized as being spurious and the code executes at a rate of four instructions per cycle – a four times speedup from a seemingly trivial opcode change.
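To make the contrast concrete, here is a sketch of the two chains, with the measured rates in the comments:

sub r8, r8   ; the renamer knows each result is zero: four per cycle
sub r8, r8
sub r8, r8   ; ...and so on, hundreds of times

add r8, r8   ; each add must wait for the previous one: one per cycle
add r8, r8
add r8, r8   ; ...and so on, hundreds of times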
I haven’t figured out what is limiting performance to four instructions per cycle. It could be instruction decode or some other pathway in the processor that cannot sustain more than four instructions per cycle except in short bursts. Whatever the limitation is, it is unlikely to be relevant to normal code.
So there you have it – subtraction is faster than addition, as long as you are subtracting a number from itself.
64-bit versus 32-bit
The test code above runs as 64-bit code because the extra registers make it easier to have long runs of instructions with no data dependencies. The 64-bit instructions have different sizes (seven bytes for mov r8, 0 and three bytes for xor r8, r8) but this doesn’t affect the conclusions. Using the 32-bit forms for the extended registers (r8d instead of r8) gives the same code size and performance as using the 64-bit forms. In general 64-bit code is larger than 32-bit code, except when the extra registers, 64-bit registers, or cleaner ABI lead to smaller code. In general you shouldn’t expect a big performance change from using 64-bit code, but in some cases, such as Fractal eXtreme, there can be huge wins.
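For reference, the encodings look like this (the exact bytes can vary slightly depending on which instruction form your assembler chooses):

49 C7 C0 00 00 00 00   mov r8, 0      ; seven bytes
4D 31 C0               xor r8, r8     ; three bytes
45 31 C0               xor r8d, r8d   ; also three bytes, and it still zeroes all of r8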
Update: January 7, 2013
Section 2.1.3.1 of the Intel optimization manual actually explains where the four instructions per cycle limitation comes from. It says that the renamer “moves up to four micro-ops every cycle from the micro-op queue to the out-of-order engine.”
A coworker pointed out how this zeroing feature of the renamer probably works. Most RISC architectures have a “zero register” which always reads as zero and cannot be written to. While the x86/x64 architectures do not have an architectural zero register, it seems likely that the Sandybridge processor has a physical zero register. When the renamer detects one of these special instructions it just renames the architectural register to point at the zero register. This is supremely elegant because no register clearing is actually needed. It is in the nature of register renaming that every write to a register is preceded by a rename, and therefore the CPU will naturally never attempt to write to this register. This theory also explains why the dependency breaking idiom PCMPEQD xmm1, xmm1 still has to be executed. This instruction sets all bits to one, and I guess the Sandybridge processor doesn’t have an all-ones register – yet.
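Here is a sketch of how that renaming might play out on the earlier fragment – the physical register bookkeeping shown in the comments is invented for illustration:

xor eax, eax   ; renamer: point the architectural eax at the physical zero register
add eax, ecx   ; renamer: allocate a fresh physical register for eax, as with any write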
Price versus cost (update March 18, 2013)
A few people suggested that xor should be cheaper than sub because xor is much cheaper to implement, at the transistor level. It is true that xor is cheaper at that level, but it is not relevant. When executing simple instructions like xor and sub the execution cost is dominated by fetching and decoding instructions, and reading and writing registers. The cost of the actual operation is so tiny that it is irrelevant.
CPU designers could create a design where an xor instruction was faster than a sub instruction. However that would mean that sub would take at least two clock cycles, since instruction latencies are whole numbers of cycles. Unless this modification let them at least double the clock speed it would be a net loss. It’s better to make xor take just as long as sub. You can’t make an xor instruction take, say, 0.6 cycles. But feel free to try.
This issue is discussed in the “cost vs price” section of the article titled “It’s done in hardware so it’s cheap”.
From the 8088 to the Intel Pentium, the recommended way to zero a register was xor. These processors were all in-order, so dependencies were not a problem, and xor is indeed shorter.
The designs from the Pentium II to the current ones are mostly out-of-order, but they recognize xor reg, reg as a special case, at least from the point of view of dependencies.
But there was at least one out-of-order design that did not recognize xor reg, reg as a special case: the Pentium Pro. The Intel Optimization manuals for the Pentium Pro recommended “mov” to zero a register.
> I haven’t figured out what is limiting performance to four instructions per cycle.
It’s probably the retire path in this case, though since so much of the Sandybridge machine is 4-wide (decode, dispatch, allocate, retire) it’s hard to say that any part of it is “the” bottleneck.
It looks like the renamer itself is limited to four instructions per cycle. As you say, that’s totally reasonable.
Also, a coworker figured out how the trick is implemented in the renamer. It’s deliciously elegant so I updated the post.
The ability to recognise the false dependence when zeroing a register is pretty nifty, particularly from a timing perspective on the output of the decode unit. Also, assuming that the machine has 3 integer ALUs, to attain that average issue rate it must be able to dispatch such instructions to the adder in the load/store units when there is a structural hazard on the other units. It’s pretty surprising that there would be any bypass path to the ROB from the adder in the load/store unit, but it would save a couple of cycles when performing auto-increment/-decrement memory operations.
As pointed out, the upper limit on the retirement rate will be determined by the number of write ports on the ROB/register file. Even in a speculative, superscalar machine the IPC would typically be expected to be around 3-4 for most programs, so it is unlikely that Intel would design for more just to cover the unlikely case where 4 integer instructions are attempting to retire alongside a returning load/store unit result. It could also be determined by the scheduler, whereby for a given issue queue 4 is the maximum dispatch rate. Even for very high end processors, POWERx for example, a dispatch rate of 4 seems about right.
“CPU designers could create a design where xor was faster than sub”
A subtracter has a carry chain while XORing is independent at the bit level, so XOR is inherently faster: a single gate delay, as opposed to 34 gate delays in a 32-bit ripple subtractor, or possibly as low as 7 or 8 gate delays if a lookahead adder is used (log2(32) = 5 levels).
It would not be possible to make an XOR gate that’s slower, unless you put a huge chain of inverters on the output of the XOR to delay it, which wouldn’t make sense.
I think we are agreeing, but I’m not sure. Yes, XOR is cheaper at a hardware level, like I said. However the XOR *instruction* does not execute faster than sub, on any CPU that I have ever programmed on, because (as I said) the instruction decode, instruction scheduling, register reads, register writes, and other bookkeeping are the limiting factor. Making XOR instructions run faster than sub instructions turns out to be a bad idea.
So yes, the output of XOR is ready sooner than the output of SUB, but it is then delayed until the end of the clock cycle, so it ends up taking exactly the same amount of time to *execute* as SUB, ADD, etc.
Just as a topic for later study if you’re interested – XOR is faster than ADD (and would be faster than subtract too, if these CPUs had it) on GreenArrays CPUs, because of how they work. They pre-compute (using combinatorial logic) every possible next state, and the opcode just selects which of those states becomes official. They do this to minimize power consumption (they pack 144 cores on a die 1mm^2 in size). ADDs take between 1 and 1.5 instruction times (roughly 1.5ns or so; it’s an estimate because the CPUs run asynchronously as well), depending on how many ‘1’ bits are in the operands (think ripple carry). So, unless you can statically prove your additions involve 9 or fewer 1 bits for a carry to propagate through, you should always add with “NOP; ADD”. The NOP lets the ripple happen, and ADD finally makes the added numbers official.
I’ve worked with these chips extensively, and they’re loads of fun to hack with. 🙂
That’s pretty nuts, but with clockless it does make sense that XOR would actually be faster.
It was hard to find more information about GreenArrays CPUs. I found the site easily enough but their documentation is pretty scattered.
BTW, their site says that they pack 144 cores on a die 1 cm^2 in size — 100x bigger.
See PB001-100503-GA144-1-10.pdf.
The GreenArrays website says that they pack 144 cores on the GA144 on a silicon die “with a die area of 21.3 square millimeters.”
That die, like several other chips in a variety of die sizes, is then packaged in “an 88-pin package (1 x 1 cm)”, which is a standard QFN-88 plastic package used by lots of manufacturers.
http://www.greenarraychips.com/home/about/
“The GA144 … The chip measures 4.7 x 4.5 mm in a 180 nm process. That size was chosen because it is the largest chip we can presently package in the 10×10 mm QFN-88.”
http://www.greenarraychips.com/home/documents/greg/GA144.htm
Perhaps you are thinking of the GreenArrays GA4. The GA4 has “a die area of 0.875 square millimeters” of silicon, and is available in either “a 3mmx3mm or 2mmx2mm” plastic package, but the GA4 has “only” 4 cores.
http://www.cpushack.com/2013/03/02/chuck-moore-part-2-from-space-to-greenarrays/
It may be interesting to note that, in theory, this is where the energy is used. Reversible computing can (in theory) be made as energy-efficient as you like (but slow), but irreversible operations will cost you kT ln 2 per bit forgotten, IIRC. (Of course, you can turn down the temperature T.)
This is documented in http://www.agner.org/optimize/microarchitecture.pdf page 137 and in Intel’s optimization manuals. Zeroing a register with one of XOR, SUB, PXOR, XORPS, XORPD, PSUB* or PCMPGT* doesn’t issue past the register renamer *and* doesn’t wait for the register’s previous contents. Similar optimizations exist for copying a register to another register or using FXCH to swap registers. The fetch-decode unit is limited to 4 instructions per cycle per thread.
BTW: Haswell probably improves some of the other zeroing methods that were 3-per-cycle to 4-per-cycle, because of the new execution ports 6 and 7.
Some other parts of the microarchitecture report are worth reading too — they discuss the rare case where Conroe and newer CPUs can execute *five* instructions per cycle per thread.
This is a great explanation of what’s going on under the hood.
Note that while
xor r8, r8
xor r8d, r8d
xor rax, rax
are 3 byte instructions,
xor eax, eax
is a 2 byte instruction:
4d31c0 xorq %r8, %r8
4531c0 xorl %r8d, %r8d
4831c0 xorq %rax, %rax
31c0 xorl %eax, %eax
So I guess one could improve code density slightly by avoiding the extended registers, in some cases.
I made a post on stackoverflow about zeroing registers recently, http://stackoverflow.com/q/33666617/224132. Someone referenced this blog post, so I addressed the 4 uops per clock “mystery” you found. It’s the issue width of the CPU. Making it wider would take geometrically more transistors in several places, for a small gain. I also found and listed many subtle advantages of xor-zeroing over mov-zeroing.