Zeroing out a CPU register seems like the simplest and most basic operation imaginable, but in fact x86 CPUs contain a surprising amount of special logic to make this operation run smoothly. The most obvious way of zeroing an x86 CPU register turns out to not be the best, and the alternative has some surprising characteristics.
The curious result of this investigation is a mathematical asymmetry where subtraction is, in some cases, faster than addition. This analysis was inspired by comments on Comparing Memory is Still Tricky.
The x86 instruction set does not have a special purpose instruction for zeroing a register. An obvious way of dealing with this would be to move a constant zero into the register, like this:
mov eax, 0
That works, and it is fast. Benchmarking this will typically show that it has a latency of one cycle – the result can be used in a subsequent instruction on the next cycle. Benchmarking will also show that this has a throughput of three-per-cycle. The Sandybridge documentation says that this is the maximum integer throughput possible, and yet we can do better.
It’s too big
The x86 instruction used to load a constant value such as zero into eax consists of a one-byte opcode (0xB8) and the constant to be loaded. The problem, in this scenario, is that eax is a 32-bit register, so the constant is 32-bits, so we end up with a five-byte instruction:
B8 00 00 00 00 mov eax, 0
Instruction size does not directly affect performance – you can create lots of benchmarks that will prove that it is harmless – but in most real programs the size of the code does have an effect on performance. The cost is extremely difficult to measure, but it appears that instruction-cache misses cost 10% or more of performance on many real programs. All else being equal, reducing instruction sizes will reduce i-cache misses, and therefore improve performance to some unknown degree.
Many RISC architectures have a zero register in order to optimize this particular case, but x86 does not. The recommended alternative for years has been to use xor eax, eax. Any register exclusive ored with itself gives zero, and this instruction is just two bytes long:
33 C0 xor eax, eax
Careful micro-benchmarking will show that this instruction has the same one-cycle latency and three-per-cycle throughput of mov eax, 0 and it is 60% smaller (and recommended by Intel), so all is well.
If you really understand how CPUs work then you should be concerned with possible problems with using xor eax, eax to zero the eax register. One of the main limitations on CPU performance is data dependencies. While a Sandybridge processor can potentially execute three integer instructions on each cycle, in practice its performance tends to be lower because most instructions depend on the results of previous instructions, and are therefore serialized. The xor eax, eax instruction is at risk for such serialization because it uses eax as an input. Therefore it cannot (in theory) execute until the last instruction that wrote to eax completes. For example, consider this code fragment below:
1: add eax, 1
2: mov ebx, eax
3: xor eax, eax
4: add eax, ecx
Ideally we would like our awesome out-of-order processor to execute instructions 1 and 3 in parallel. There is a literal data dependency between them, but a sufficiently advanced processor could detect that this dependency is artificial. The result of the xor instruction doesn’t depend on the value of eax, it will always be zero.
It turns out that for x86 processors have for years handled xor of a register with itself specially. Every out-of-order Intel and AMD processor that I am aware of can detect that there is not really a data dependency and it can execute instructions 1 and 3 in parallel. Which is great. The CPUs use register renaming to ‘create’ a new eax for the sequence of instructions starting with instruction 3.
It gets better
On Sandybridge this gets even better. The register renamer detects certain instructions (xor reg, reg and sub reg, reg and various others) that always zero a register. In addition to realizing that these instructions do not really have data dependencies, the register renamer also knows how to execute these instructions – it can zero the registers itself. It doesn’t even bother sending the instructions to the execution engine, meaning that these instructions use zero execution resources, and have zero latency! See section 126.96.36.199 of Intel’s optimization manual where it talks about dependency breaking idioms. It turns out that the only thing faster than executing an instruction is not executing it.
Show us the measurements!
A full measurement of the performance of instructions on out-of-order processors is impractical, but I did come up with a micro-benchmark which can show this particular difference. The following eight instructions have no data dependencies between them so their performance is limited only by the integer throughput of the processor. By repeating this block of code many times (I found that seventy worked best) and carefully timing it I find that, as promised, my Sandybridge processor can execute three integer add instructions per cycle.
add r8, r8
add r9, r9
add r10, r10
add r11, r11
add r12, r12
add r13, r13
add r14, r14
add r15, r15
If I change the add opcode to sub or xor then the performance on many processors would be unchanged. But on Sandybridge the throughput increases to four instructions per cycle.
IPC of 2.9943 for independent adds
IPC of 3.9703 for independent subs
IPC of 3.9713 for independent xors
An even more dramatic result is found if you repeat “add r8, r8” hundreds of times. Because every instruction is dependent on the previous one this code executes at a rate of one instruction per cycle. However if you change it to “sub r8, r8” then the dependency is recognized as being spurious and the code executes at a rate of four instructions per cycle – a four times speedup from a seemingly trivial opcode change.
I haven’t figured out what is limiting performance to four instructions per cycle. It could be instruction decode or some other pathway in the processor which cannot sustain four instructions per cycle beyond short bursts. Whatever the limitation is it is unlikely to be relevant to normal code.
So there you have it – subtraction is faster than addition, as long as you are subtracting a number from itself.
64-bit versus 32-bit
The test code above runs as 64-bit code because the extra registers make it easier to have long runs of instructions with no data dependencies. The 64-bit instructions have different sizes (seven bytes for mov r8, 0 and three bytes for xor r8, r8) but this doesn’t affect the conclusions. Using the 32-bit forms for the extended registers (r8d instead of r8) gives the same code size and performance as using the 64-bit forms. In general 64-bit code is larger than 32-bit code, except when the extra registers, 64-bit registers, or cleaner ABI lead to smaller code. In general you shouldn’t expect a big performance change from using 64-bit code, but in some cases, such as Fractal eXtreme, there can be huge wins.
Update: January 7, 2013
Section 2.1.31 of the Intel optimization manual actually explains where the four instructions per cycle limitation comes from It says that the renamer “moves up to four micro-ops every cycle from the micro-op queue to the out-of-order engine.”
A coworker pointed out how this zeroing feature of the renamer probably works. Most RISC architectures have a “zero register” which always reads as zero and cannot be written to. While the x86/x64 architectures do not have an architectural zero register it seems likely that the Sandybridge processor has a physical zero register. When the renamer detects one of these special instructions it just renames the architectural register to point at the zero register. This is supremely elegant because no register clearing is actually needed. It is in the nature of register renaming that every write to a register is preceded by a rename and therefore the CPU will naturally never attempt to write to this register. This theory also explains why the dependency breaking idiom instruction CMPEQ XMM1, XMM1 still has to be executed. This instruction sets all bits to one, and I guess the Sandybridge processor doesn’t have a one’s register – yet.
Price versus cost (update March 18, 2013)
A few people suggested that xor should be cheaper than sub because xor is much cheaper to implement, at the transistor level. It is true that xor is cheaper at that level, but it is not relevant. When executing simple instructions like xor and sub the execution cost is dominated by fetching and decoding instructions, and reading and writing registers. The cost of the actual operation is so tiny that it is irrelevant.
CPU designers could create a design where an xor instruction was faster than a sub instruction. However that would mean that sub would take at least two clock cycles, since instruction lengths are integer cycle counts. Unless this modification let them at least double the clock speed it would be a net loss. It’s better to make xor take just as long as sub. You can’t make an xor instruction take, say, 0.6 cycles. But feel free to try.
This issue is discussed in the “cost vs price” section of this article titled “It’s done in hardware so it’s cheap”