Comparing Memory is Still Tricky

One of my “favorite” code-gen peculiarities in Visual Studio (VS) 2010 is its inconsistent handling of memcmp. VS 2010 had a really nice optimization that could generate faster and smaller code, but it only used the optimization if you lucked on to the ‘correct’ ways of calling memcmp, and if you were optimizing for size.

This bug was reported long ago and I wanted to check and see whether VS 2012 handled this task any better. The results, I am disappointed to say, are decidedly mixed.

The optimization that VS 2010 sometimes does is that if you are only interested in whether two buffers are the same – if you don’t care about their sort order – then the generated code can cheaply compare four bytes at a time instead of one byte at a time. This is faster, and smaller. See the original post for details – section 1.

The past: Visual Studio 2010

In the chart below there are three semantically identical statements that compare two buffers and return true if they are equal. I compiled each statement as a function, optimized for size (/Os) and then optimized for time (/Ot), for VS 2010, and looked at the function sizes:

The first two functions were close to perfect when optimized for size, but the third variation missed the optimization. Its code was not only a lot larger, it also ran about four times slower. Oops.

When optimizing for time you would always get suboptimal code. Visual Studio 2010 would generate a whopping 108 bytes of code for all three of these expressions, and most of this code was completely superfluous.

In short, your odds of getting optimal code were pretty poor.

The present: Visual Studio 2012

VS 2012 changes all this. The byte-by-byte comparisons are gone in all six cases. However a new problem has cropped up, one which threatens to make VS 2012 code slower in many cases. Here is the chart with VS 2012 results added:

The results when optimizing for size are consistent – but consistently worse!

The optimized for time functions are significantly better but, it turns out, not as good as they should be.

Boolean logic made complicated

All of these functions ultimately return true or false in the al register. That should be a simple matter of loading one or zero into al. A few bytes of code, and no memory accesses. Instead VS 2012 generates variations of this rather verbose code for setting the return value when optimizing for size:

Equal:
    mov    DWORD PTR tv79[ebp], 1
    jmp    BothCases
NotEqual:
    and    DWORD PTR tv79[ebp], 0
BothCases:
    mov    al, BYTE PTR tv79[ebp]

Ouch. The worst of these is the first instruction which is seven bytes long because the ‘one’ is embedded in the instruction as a 32-bit constant. In total these instructions add up to sixteen bytes, and they read and write memory! At the very least they could trivially simplify the code to this:

Equal:
    mov    al,1
    jmp BothCases
NotEqual:
    mov    al,0
BothCases:

A little setne trickery could probably do even better than this, but the optimized-for-time versions actually do worse, using two instructions totaling ten bytes to set al to zero, and an identically expensive sequence to set it to one. Peculiar.

NotEqual:
    mov    DWORD PTR tv79[ebp], 0
    mov    al, BYTE PTR tv79[ebp]
    pop    esi
    mov    esp, ebp
    pop    ebp
    ret    0
Equal:
    mov    DWORD PTR tv79[ebp], 1
    mov    al, BYTE PTR tv79[ebp]
    pop    esi
    mov    esp, ebp
    pop    ebp
    ret    0

Code size and instruction count are not accurate measures of performance, but additional instructions with no advantages are going to be neutral at best. Unnecessary code bloat, especially when optimizing for size, is clearly a defect, and this bug – which I have only seen when using memcmp – wastes at least ten bytes per function.

To repe or not repe

The other reason that the optimized-for-size versions got bigger is because they no longer use “repe cmpsd”. The rep prefix is a very compact way of expressing things like moving or filling memory. However, my understanding (section 16.10 String instructions) is that using the rep prefix with cmpsd is too slow. Even when optimizing for size the compiler cannot completely forget about performance. I am sure that this decision was intentional and was carefully made.

Summary

With VS 2012 the results are at least more consistent. Sort of.

All three versions are 47 bytes when optimized for time
All three versions are 46 bytes when optimized for size
All three versions are at least ten bytes larger than they should be.

The time-optimized versions of this comparison all went from 108 bytes to 47 bytes. That’s a great improvement, and it probably helped to mask the fact that these versions should have been about 37 bytes. Whatever template churns out the inline code for memcmp apparently does not cooperate with VS 2012’s code generator.

I tested with g++ and clang++ on Linux but I could not convince either one of them to inline memcmp. It looks like they got burned by the cases when an inlined memcmp could end up slower, and decided to abandon the whole experiment.

Update

I wanted to address a couple of comments that I heard about this post:

Why are there no timing measurements? I previously verified that comparing memory byte-by-byte is four times slower than comparing a DWORD at a time, all else being equal. The main defect remaining is that the code is unnecessarily large. This will probably have no effect on any benchmark, but larger code always increases i-cache miss rates, which are significant on many large projects. Since the bloated code to set the return value has no advantages, it must cost some amount of performance in some cases, and there is no useful way to measure this.
Why am I measuring memcmp given that std::equal is what all the cool kids use? std::equal can end up hitting many of the same issues, and in fact in some cases (on POD types in VC++) it ends up calling memcmp, which gets us back to where we started. The bloated code in VS 2010 with /Ot and VS 2012 with all options is then a real code-size cost even if you don’t call memcmp directly.

10 Responses to Comparing Memory is Still Tricky

Stephan says:

November 1, 2012 at 12:18 pm

Interesting post! Have you tried __builtin_memcmp with g++ and clang?

- brucedawson says:
  
  November 1, 2012 at 1:12 pm
  
  I tried __builtin_memcmp and it made no difference — I always got a function call. That means that for comparing short amounts of memory VC++ will sometimes be better — at least if they fix the setting of the return code.
  
Ben Voigt says:

November 7, 2012 at 3:15 pm

I had learned (possibly incorrectly) that a faster (than “mov al,0”) way to set a zero return value is “xor eax, eax”. If that’s true, I’m very surprised that the optimizer doesn’t use it.

- Dor says:
  
  December 26, 2012 at 1:05 am
  
  Looking at Intel IA32 spec: “xor eax, eax” instruction gives shorter assembly, but I don’t think it’s faster.
  
  - brucedawson says:
    
    December 26, 2012 at 9:41 am
    
    All of the simple integer instructions (add, xor, sub, and, etc., but not including mul or div) should have the same latency and throughput. Therefore, all else being equal you should prefer the shortest instruction, in order to avoid blowing out the instruction cache.
    
    For many years “xor eax, eax” was the recommended way of zeroing eax, presumably because it is short.
    
    One theoretical risk of using “xor eax, eax” is that it technically has a data dependency — it technically depends on the previous value of eax, and therefore it technically inhibits reordering. However an xor of a register with itself has long been special-cased by x86 processors — they know that it doesn’t *really* depend on the previous value. In fact (see 2.1.3.1 in Intel’s optimization manual) instructions like “xor eax, eax” are recognized by the register renamer and processed by it, consuming no execution resources. How cool is that?
    
    So, as of Sandybridge it appears that “xor eax, eax” is still the preferred way to zero a register. Demonstrating this is left as an exercise for the reader.
    
    - brucedawson says:
      
      December 29, 2012 at 2:19 pm
      
      I tested the performance of “xor eax, eax” and confirmed that it is both smaller and faster than “mov eax, 0”, for surprising reasons:
      
      The Surprising Subtleties of Zeroing a Register
      
Gregory Pakosz (@gpakosz) says:

November 9, 2012 at 4:49 am

Bruce, is there an easy way to report the size of each function in a large project using Visual Studio?

- brucedawson says:
  
  November 9, 2012 at 8:30 am
  
  Take a look at dia2dump. You can find its source code in “%VS100COMNTOOLS%..\..\DIA SDK\Samples\DIA2Dump”. It can be used to dump the size of every function in a DLL or EXE based on information in the PDB file. If you load the output into Excel you can do all sorts of great data mining.
  
Pingback: Visual C++ Code-gen Deficiencies | Random ASCII
Pingback: The Surprising Subtleties of Zeroing a Register | Random ASCII