Visual C++ Code-gen Deficiencies

Sensible programmers never (well, hardly ever) write assembly language anymore, which means that we are at the mercy of our compilers and must trust them to wisely translate our high-level directives into machine code. Most compilers produce good code most of the time, but when they don’t it is frustrating because there is often nothing you can practically do about it. Sometimes you can rearrange your code to make the compiler happier, but that always feels hacky, and it is often difficult or impossible.

This post is planned to be an ongoing catalog of code-gen deficiencies in Microsoft Visual C++. I’m not picking on Visual C++ – I’m sure all compilers have similar problems – but it is the compiler I use the most and I’ve had years to build up a list of irritations.

This post was updated for VS 2012. Read below for details.

The Rules

This posting is intended to be a living document. As I find (or hear about) new problems I’ll add them at the bottom. As new versions of Visual C++ (VC++ to its friends) fix problems I’ll update those entries. Thus, the issue numbers should remain constant.

The type of code-gen problems I will be cataloging are not technically bugs – in all of these cases the code behaves correctly. Instead these are cases where in my considered opinion the compiler could reasonably be expected to do a better job. Optimizations that require psychic powers or would violate the C++ standard are not reasonable and I will try to avoid listing them. I’ve never written a compiler, so my assessment of implementation detail may be off in some cases, but I think most readers will agree that these ‘defects’ could reasonably be corrected.

Assessing the speed of a code fragment is tricky, and on modern out-of-order processors often the most that you can say is ‘it depends’. Therefore, in most cases I will be compiling code in /O1 mode (also known as /Os, optimize for size) which means I am asking the compiler to generate the smallest possible code. This allows for easy and unambiguous comparison of code fragments. Smaller is better, and larger is worse. Simple.

/O1 is often a useful mode because by reducing code size it reduces i-cache misses, so sometimes /O1 actually generates the fastest code anyway. And, most of these issues also exist in /O2 mode (also known as /Ot, optimize for time).

I investigate these problems by creating tiny functions and inspecting the generated code. This is best done by going to Output Files and setting Assembler Output to /FAcs. I highly recommend this technique for anyone interested in investigating compiler quality.
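If you prefer to work outside of the IDE then the same setup can be reproduced from a Visual Studio command prompt (the file name below is just a placeholder):

cl /c /O1 /FAcs CodeGenTests.cpp

The /c flag compiles without linking, /O1 optimizes for size, and /FAcs writes a .cod listing file with the generated assembly interleaved with source code and machine code.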

This post covers code-gen issues in optimized builds. The main issues in non-optimized builds have already been covered.

As of August 2011 I am doing all of these tests with Visual C++ 2010, known to its friends as Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86.

History

This post was first uploaded on August 15, 2011 with issues 1-4.

This post was updated August 16, 2011 with issue 5, a table of contents, and fixes for a few typos.

This post was updated in November 2012 with an analysis of the code-generation in VS 2012. One issue was fixed, one issue was improved, and three others were unchanged.

Table of contents

  1. The finicky memcmp equality optimization
  2. Equality for bools
  3. Structure initialization done too literally
  4. Bit-field reading done too literally
  5. Function forwarding funkiness

1) The finicky memcmp equality optimization

Update: this issue has been fixed in VS 2012, but a new issue now occurs.

This issue takes a while to explain, but it is probably the most significant code-gen quality issue that I have found. It can lead to code that is four times slower and twice as large. But skip it if you’re looking for lighter reading.

memcmp is defined as returning 0 for equality, <0 if “buf1 less than buf2”, and >0 if “buf1 greater than buf2”. This is actually a very vague description as it doesn’t say whether the bytes are interpreted as signed or unsigned, but that isn’t relevant for this discussion. What is relevant is that if the buffers don’t match then the result is based on the first byte that is different.
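In other words, memcmp’s contract behaves as if it were implemented byte-by-byte, something like this sketch (a sketch of the contract, not the actual library code):

#include <cstddef>

int memcmp_contract(const void* buf1, const void* buf2, size_t count) {
    const unsigned char* p1 = (const unsigned char*)buf1;
    const unsigned char* p2 = (const unsigned char*)buf2;
    for (size_t i = 0; i < count; ++i) {
        // The result is decided entirely by the first byte that differs.
        if (p1[i] != p2[i])
            return p1[i] < p2[i] ? -1 : 1;
    }
    return 0; // Equal.
}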

And that makes life complicated.

For maximum performance the compiler really wants to compare four bytes at a time. That works well for finding whether or not the buffers are the same, but because of the x86 CPU’s little-endian byte order it won’t give the right result (<0 or >0) if the buffers differ. This is because a DWORD-based compare gives precedence to the high byte, which is the last byte of the DWORD in memory. That sucks. That means that the default small-code implementation of memcmp has to be “repe cmpsb”, which means “repeat-while-equal compare-string-byte”.
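To make the little-endian problem concrete, here is a small demonstration (assuming a little-endian x86 target) where a byte-wise compare and a DWORD compare disagree about which buffer is ‘greater’:

#include <cstdio>
#include <cstring>

int main() {
    unsigned char buf1[4] = { 0x01, 0x00, 0x00, 0x02 };
    unsigned char buf2[4] = { 0x02, 0x00, 0x00, 0x01 };
    unsigned d1, d2;
    memcpy(&d1, buf1, 4); // On little-endian x86 this is 0x02000001.
    memcpy(&d2, buf2, 4); // And this is 0x01000002.
    // memcmp looks at the first differing byte (0x01 < 0x02) and says buf1 < buf2...
    printf("memcmp says: %d\n", memcmp(buf1, buf2, 4));
    // ...but a DWORD compare gives precedence to the high byte and says buf1 > buf2.
    printf("dword compare says: %d\n", d1 > d2 ? 1 : (d1 < d2 ? -1 : 0));
    return 0;
}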

Comparing one byte at a time is slow. Comparing four bytes at a time with “repe cmpsd” is, you guessed it, about four times as fast. Which is nice. The cool thing is that if the compiler can tell that you only care about equal versus not-equal – if it can tell that you don’t care if “buf1 greater/less than buf2” – then it can do the compares four bytes at a time. And, VC++ contains this optimization! Sweet!

You can see this if you compile these two functions with /O1:

bool IsEqual1Small( void* p1, void* p2 ) {
    return memcmp(p1, p2, 200) == 0; }

bool IsEqual2Small( void* p1, void* p2 ) {
    return !memcmp(p1, p2, 200); }

The generated code compares four bytes at a time and is both small and fast. You may then be wondering what it is I am complaining about. This is what I am complaining about:

bool IsEqual3Small( void* p1, void* p2 ) {
    return 0 == memcmp(p1, p2, 200); }

While the first two functions generate 25 bytes of code, this third one – which is comically similar to IsEqual1Small, and semantically identical to IsEqual2Small – generates 38 bytes of code. This code also runs about four times slower because it is doing byte compares instead of DWORD compares. Oops.

The code-gen when compiling with /O2 is worse. It suffers from multiple defects. The main problem is that, for all three versions of the compare code, it fails to realize that only buffer equality/inequality is needed. It compares the buffers a DWORD at a time, but when it finds a difference it wastes about 57 bytes of code trying to determine which was the first of the bytes to differ, and what the difference is. That code is pure waste.

The other problem is that in the case where the buffers are found to be identical it generates this odd piece of code:

xor     eax, eax
xor     edx, edx
test    eax, eax
sete    al

It zeroes eax, and then tests what value is in eax. Hint: it’s zero! The zeroing of edx is also unnecessary – only the “xor eax, eax” is needed.

The fast (/O2) versions of these functions all weigh in at 108 bytes. All three could trivially be reduced to less than half of their size when the equality optimization is detected. This would also make them faster. This is not just a theoretical problem. This issue was found in a large code base because some memcmp overhead was showing up on a profile. Changing from “0 == memcmp” to “memcmp == 0” was enough to make the memcmp overhead disappear into irrelevance.

I assume that there is a template in the compiler for stamping out the implementations of these intrinsics. Clearly that template needs updating, along with the detection of memcmp==0 cases.
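Until the compiler catches up, a simple defensive measure (my suggestion, not an official workaround – the name BuffersEqual is mine) is to funnel equality-only comparisons through a wrapper that uses the form the optimizer currently recognizes:

#include <cstring>

// Always use the 'memcmp(...) == 0' form, which VC++ 2010 recognizes as an
// equality-only compare and turns into the fast DWORD-at-a-time code.
inline bool BuffersEqual(const void* p1, const void* p2, size_t count) {
    return memcmp(p1, p2, count) == 0;
}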

2) Equality for bools

Update: this deficiency is unchanged in VS 2012 – disappointing.

The code-gen for this function should be pretty simple. Variables of type ‘bool’ can only have two values so the truth table is pretty small:

bool IsNotEqualbool1(bool a, bool b) {
    return (a != b) ? true : false; }

The generated code isn’t terrible, but it’s not ideal. With prologue and epilogue stripped away the compiler generates these nine bytes of code:

mov     al, BYTE PTR _a$[ebp]
cmp     al, BYTE PTR _b$[ebp]
setne   al

This is straightforward, but not as clever as the compiler could be. A ‘bool’ is just a bit, and it turns out that another name for a not-equal comparison of bits is ‘xor’. That allows for this rather elegant implementation:

mov     al, BYTE PTR _a$[ebp]
xor     al, BYTE PTR _b$[ebp]

If you are implementing equality instead of inequality then you need an extra ‘not’ instruction but that is only two bytes and is therefore still victorious for size, saving one byte over the current code-gen.
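If you want to convince yourself that xor really is the same thing as not-equal for bools, a short exhaustive check does the job:

#include <cassert>

int main() {
    for (int i = 0; i < 4; ++i) {
        bool a = (i & 1) != 0;
        bool b = (i & 2) != 0;
        assert((a != b) == (a ^ b)); // xor of two bools is exactly 'not equal'
    }
    return 0;
}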

So the compiler is wasting one or three bytes on bool comparisons. That’s a shame. But things are about to get worse. I find the source code for IsNotEqualbool1 to be inelegant and it is something I would never write. The comparison of ‘a’ and ‘b’ generates a bool – true or false – and using the conditional operator to convert from true/false to true/false seems… odd. Stupid. Obfuscationness.

So I would write the code as shown below:

bool IsNotEqualbool2(bool a, bool b) {
    return a != b; }

Unfortunately anybody using the pattern in IsNotEqualbool2 will be punished for their ideals. An extra instruction is inserted:

mov     cl, BYTE PTR _a$[ebp]
xor     eax, eax
cmp     cl, BYTE PTR _b$[ebp]
setne   al

That two-byte xor instruction clearly isn’t needed. I would like it to go away.

3) Structure initialization done too literally

Update: this deficiency is unchanged in VS 2012 – disappointing.

It is common to have an array or structure initialized through one of these syntaxes:

char buffer[1024] = {0}; // Works for structure or array
char buffer[1024] = ""; // Works for char array

In both cases the syntax literally means “Set the first element to the value specified by the literal (zero), and then value initialize the remaining elements.” Value initialization for POD types means initialization to zero, so the net effect is that the entire array or structure is set to zero.

Note: it surprises some people that value initialization for POD types means initialization to zero, because we all know that when we forget to initialize locals and class member variables they do not get zeroed. The subtle explanation of this is that POD types are, by default, not value initialized. But in the example above they are. Experimenting with the sample code here and reading the discussion here and here can be instructive for the curious.
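A tiny example makes the distinction visible (the struct and function names are just for illustration):

struct Pod { int x; int y; };

void InitDemo() {
    Pod a;         // default initialized: a.x and a.y are indeterminate
    Pod b = {};    // value initialized: both members are zero
    Pod c = { 0 }; // x is explicitly 0, y is value initialized to zero
}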

Anyway, the net result is that the array or structure is zeroed. But the VC++ compiler has an unfortunate tendency to be extremely literal about this code. In both cases it quite literally zeroes the first byte, and then does a memset(buffer + 1, 0, 1023). Here’s the code that it generates (with prologue and epilogue removed for clarity):

push    1023
lea     eax, DWORD PTR _buffer$[ebp+1]
push    0
push    eax
mov     BYTE PTR _buffer$[ebp], 0
call    _memset

That ‘mov BYTE PTR’ instruction is a whopping seven bytes long which is a lot of space to waste when compiling for size with /O1.

The other problem with the generated code is that it virtually guarantees that a misaligned pointer will be passed to memset, which will probably make it take a bit longer.

In some cases zeroing the entire buffer is overkill. In that case you may want to declare the array and then zero the first byte as the next statement. If you do want the entire buffer zeroed then consider this syntax, which encourages VC++ to generate ideal code:

char buffer[1024] = {};

VC++ should not be so literal. The ‘as-if’ rule says that it is allowed to just generate a memset(buffer, 0, 1024) in every case, and it should do that. This would save seven bytes each time this very common pattern is used.

4) Bit-field reading done too literally

Update: this deficiency is unchanged in VS 2012 – disappointing.

Consider the following artificial use of bit-fields:

struct CBitFieldTest
{
    int zeroOffset; // Offset 0.0

    bool b1 : 1; // Offset 4.0
    bool b2 : 1; // Offset 4.1
    unsigned char b3 : 5; // Offset 4.2
    bool b4 : 1; // Offset 4.7
    bool b5 : 1; // Offset 5.0

    unsigned u1 : 1; // Offset 8.0
    unsigned u2 : 1; // Offset 8.1
    unsigned u3 : 5; // Offset 8.2
    unsigned u4 : 1; // Offset 8.7
    unsigned u5 : 1; // Offset 9.0
};

As explained in my bit-field packing article the byte-sized bit-fields will be packed together, and the int-sized bit-fields will be packed together. This leads to the offsets (shown in byte.bit form) listed in the comments.

Bit-fields can, if used carefully, allow for smaller data structures. However they require longer and more complex code in order to read and write the values. The basic algorithm for reading an unsigned bit-field is:

  • Load the enclosing byte/word/dword
  • Shift it to the correct position
  • Mask off any unwanted high bits

If the bit-field is at the beginning of an element then the shift step can be skipped. If the bit-field is at the end of an element then the masking step can be skipped. So, if we are the obsessive compulsive type (and all of the best programmers are) then we should align our frequently accessed fields at the beginning and ending of elements.
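Written out by hand for the struct above, reading the 5-bit field u3 (offset 8.2) follows exactly that load/shift/mask recipe (a sketch of the algorithm, not the compiler’s actual output):

#include <cstring>

unsigned ReadU3ByHand(const CBitFieldTest& bft) {
    unsigned dword;
    memcpy(&dword, (const char*)&bft + 8, 4); // load the enclosing dword
    dword >>= 2;                              // shift the field down to bit zero
    return dword & 0x1F;                      // mask off the unwanted high bits
}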

In the example above, b1, b4, and b5 can all be read more efficiently than b2, which requires both a shift and a mask. The bit layout of u1 to u5 is identical to b1 to b5, but the element size and the type are both different which makes applying this optimization more difficult. But, it isn’t impossible. In the sample code below the one-bit bit-fields are read as ‘bool’, whether being read from the ‘bool’ type or the ‘unsigned’ type:

bool ReadB1(CBitFieldTest& bft) {
    return bft.b1; }

bool ReadB2(CBitFieldTest& bft) {
    return bft.b2; }

bool ReadB4(CBitFieldTest& bft) {
    return bft.b4; }

bool ReadB5(CBitFieldTest& bft) {
    return bft.b5; }

bool ReadU1(CBitFieldTest& bft) {
    return bft.u1; }

bool ReadU2(CBitFieldTest& bft) {
    return bft.u2; }

bool ReadU4(CBitFieldTest& bft) {
    return bft.u4; }

bool ReadU5(CBitFieldTest& bft) {
    return bft.u5; }

There are good reasons to put boolean values in a 32-bit unsigned bit-field. The main reason is that it allows packing of boolean values and larger values into one bit-field, with more flexibility than if an 8-bit base type was used. However there is a small penalty. The code generated for ReadU5(), which is logically and provably identical to ReadB5(), is one instruction longer. The compiler does a 32-bit read, and then a shift and mask, whereas if it was feeling more creative (and less literal) it could do an 8-bit read and then just a shift.

The same issue applies at the beginning and ending of every byte so it can potentially affect the reads of fully one quarter of your one-bit bit-fields.

Similar issues affect writing of bit-fields. Consider the code generated by these two functions:

void WriteB5(CBitFieldTest& bft) {
    bft.b5 = true; }

void WriteU5(CBitFieldTest& bft) {
    bft.u5 = true; }

The WriteB5 function does byte operations and ORs in a one-byte constant. The WriteU5 function does DWORD operations and ORs in a four-byte constant. Therefore WriteU5 is three bytes larger. The data-flow analysis for writing constants to bit-fields is particularly straightforward, and the savings particularly large. For projects that make heavy use of bit-fields the savings could be significant.
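For comparison, here is what the smaller byte-based store for WriteU5 could look like if written by hand (a sketch that assumes the layout above, where u5 is bit 0 of byte 9):

void WriteU5ByHand(CBitFieldTest& bft) {
    // OR in a one-byte constant instead of a four-byte constant.
    ((unsigned char*)&bft)[9] |= 0x01;
}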

5) Function forwarding funkiness

Update: this deficiency is fixed in VS 2012 – hurray!

Consider this trivial little function:

void memcpyWrapper( void* dst, const void* src, size_t count) {
    memcpy( dst, src, count ); }

It doesn’t do much. It calls memcpy without altering any of the parameters. This is pretty useless but it does occur occasionally for valid or mysterious reasons. The ideal code for this function would be this:

jmp     _memcpy

The tail call optimization seems like a no-brainer in this case. But VC++ does something strange:

push    ebp
mov     ebp, esp
pop     ebp
jmp     _memcpy

It sets up a stack frame, and then tears it down. Then it does the tail call. The obvious philosophical question is, if you set up a stack frame and nobody can see it, then does it really exist?

It only wastes four bytes, and minimal execution time, so I don’t really care. But I’ve seen it enough and it’s obviously easy enough to fix that it has earned its place as issue number 5.

Conclusion

Many of these issues (with the exception of the memcmp-equality code-gen) are trivial. I don’t claim otherwise. But these changes also seem easy to implement in the compiler, and would benefit virtually all C++ programmers, so the cost/benefit ratio seems pretty good. Smaller code will generally increase performance by doing less work and by reducing i-cache misses, and it ultimately reduces memory pressure as well. I look forward to improvements in future versions of the compiler.


7 Responses to Visual C++ Code-gen Deficiencies

  1. Interesting article. Let’s hope VC++ Dev Team reads your blog !

    I wonder if the code generated for x86, x64, and X360 have the same problems.

  2. Tramb says:

    Sensible *x86* programmers you mean🙂

    • brucedawson says:

      I think these days most sensible programmers on all platforms avoid writing assembly language. I know that the amount of assembly language written on Xbox 360 is very close to zero.

  3. Pingback: Comparing Memory is Still Tricky | Random ASCII

  4. Ben Voigt says:

    In Part 3, you’re looking for the term “value initialization”, NOT “default initialization”. Default initialization is in fact a no-op for POD types.
