Zeroing Memory is Hard (VC++ 2015 arrays)

Quick, what’s the difference between these two C/C++ definitions of initialized local variables?

char buffer[32] = { 0 };
char buffer[32] = {};

One difference is that the first is legal in C and C++, whereas the second is only legal in C++.

Okay, so let’s focus our attention on C++. What do these two definitions mean?

The first one says that the compiler should set the first element of the array to zero, and then (roughly speaking) zero initialize the rest of the array. The second one says that the compiler should zero initialize the entire array.

The descriptions are slightly different, but the net result is the same thing – the entire array should be zero initialized. Therefore, given the “as-if” rule in C++, they are the same. So, any sufficiently advanced optimizer should generate identical code for the two constructs. Right?

But sometimes those descriptions matter. If, hypothetically, a compiler took those descriptions extremely literally then it might generate code like this for the first case:

algorithm one: buffer[0] = 0; memset(buffer + 1, 0, 31);

while generating code like this for the second case:

algorithm two: memset(buffer, 0, 32);

And, if the optimizer didn’t notice that the two statements could be folded together, then the compiler might end up generating less efficient code for the first definition than for the second one.

If a compiler literally implemented algorithm one then it might end up writing a zero to the first byte and then (assuming a 64-bit CPU) doing three eight-byte writes. Then, to fill in the remaining seven bytes, it might do a four-byte write, a two-byte write, and a one-byte write.

You know, hypothetically.

And this is exactly what VC++ does. For 64-bit builds its typical code-gen for “= { 0 };” is:

xor eax, eax
mov BYTE PTR buffer$[rsp+0], 0
mov QWORD PTR buffer$[rsp+1], rax
mov QWORD PTR buffer$[rsp+9], rax
mov QWORD PTR buffer$[rsp+17], rax
mov DWORD PTR buffer$[rsp+25], eax
mov WORD PTR buffer$[rsp+29], ax
mov BYTE PTR buffer$[rsp+31], al

Graphically it looks like this, with practically every write unaligned:

[figure: diagram of the eight writes generated for “= { 0 };”, practically every one unaligned]

But if you omit the zero then VC++ does this:

xor eax, eax
mov QWORD PTR buffer$[rsp], rax
mov QWORD PTR buffer$[rsp+8], rax
mov QWORD PTR buffer$[rsp+16], rax
mov QWORD PTR buffer$[rsp+24], rax

Which looks something like this:

[figure: diagram of the four aligned eight-byte writes generated for “= {};”]

The second code sequence is smaller, and it executes faster. The speed difference is often immeasurable, but anytime you can get smaller code that is never slower you should prefer it. Code size affects performance on all levels (network, disk, cache) so extra code bytes are sloppy.

It’s not a big deal – it probably doesn’t noticeably affect the size of any real programs. But I just think the code generated for “= { 0 };” is kinda hilarious. It’s the code-gen equivalent of saying ‘um’ too much when giving a speech.

I first noticed and reported this behavior six years ago, and I recently noticed that it’s still an issue in VC++ 2015 Update 3. So I got curious and wrote a little Python script to try compiling the code below with different buffer sizes and different optimization options for x86 and x64 targets:

void ZeroArray1()
{
    char buffer[BUF_SIZE] = { 0 };
    printf("Don't optimize away my empty buffer.%s\n", buffer);
}

void ZeroArray2()
{
    char buffer[BUF_SIZE] = {};
    printf("Don't optimize away my empty buffer.%s\n", buffer);
}

The graph below shows the size of the two functions in one particular build configuration – optimize for size for a 64-bit compile – across values of BUF_SIZE ranging from one to thirty two (when BUF_SIZE is greater than 32 then the code sizes are identical):

[graph: size of ZeroArray1 vs ZeroArray2, 64-bit optimize-for-size build, for BUF_SIZE from one to thirty-two]

The savings when BUF_SIZE is equal to four, eight, and thirty-two are particularly impressive – size reductions of 23.8%, 17.6%, and 20.5% respectively. The average saving is 5.4%, which is pretty significant considering that the functions all have their prologue, epilogue, and the call to printf in common.

What I want to do at this point is to recommend that all C++ programmers prefer “= {};” over “= { 0 };” when initializing structures and arrays. I find it aesthetically superior, and it looks like it almost always generates smaller code.

But the catch is in the word almost. The results above show that there are a few sizes where “= { 0 };” generates better code. For the one- and two-byte cases “= { 0 };” writes an immediate zero (embedded in the instruction) to the array, while “= {};” zeroes a register and then writes that. For the sixteen-byte case “= { 0 };” uses an SSE register to zero all bytes at once – I don’t know why the compiler doesn’t use that technique more often.

So, before giving a recommendation I felt duty bound to try multiple optimization settings, on 32-bit and 64-bit. The summary of the results is:

32-bit with /O1 /Oy-: Average saving from 1 to 32 is 3.125 bytes, 5.42%
32-bit with /O2 /Oy-: Average saving from 1 to 40 is -2.075 bytes, -3.29%
32-bit with /O2: Average saving from 1 to 40 is 1.150 bytes, 1.79%
64-bit with /O1: Average saving from 1 to 32 is 3.844 bytes, 5.45%
64-bit with /O2: Average saving from 1 to 32 is 3.688 bytes, 5.21%

The problem is with the 32-bit /O2 /Oy- results, where “= {};” is, on average, 2.075 bytes larger than “= { 0 };”. This comes from sizes 32 to 40 where the “= {};” code is usually 22 bytes larger! This is because the “= {};” code uses movaps instead of movups to zero the array, which means it has to waste a ton of instructions on making sure the stack is 16-byte aligned. Oops.

[graph: 32-bit /O2 /Oy- code sizes, showing the large “= {};” penalty for sizes 32 to 40]

Conclusions

I still recommend that C++ programmers prefer “= {};”, but it’s a weak preference, given the slightly conflicting results.

It would be nice if the VC++ optimizer would generate identical code for the two constructs, and it would sure be super if that code was always the ideal code. Please?

I would like to know why the VC++ optimizer is so inconsistent about when it decides to use 16-byte SSE registers to zero memory. On 64-bit builds it only does this for 16-byte buffers initialized with “= { 0 };” despite the fact that using SSE often seems to generate smaller code.

I think this code-gen issue is symptomatic of a larger issue where adjacent initializers in aggregates are not merged. However I’ve spent too much time on this already so I’m going to leave this as a theory.

A connect bug was filed here, and the Python script can be found here.

Note that this code, which should also be equivalent, generates even worse code than ZeroArray1 and ZeroArray2, in all cases.

char buffer[32] = "";

Although I have not run the tests myself, I hear that gcc and clang are not fooled by “= { 0 };”.

On early versions of VC++ 2010 the problem was more severe. In some cases a call to memset would be used, and = { 0 }; ensured that the address would always be misaligned. In early versions of the VC++ 2010 CRT the last 128 bytes would be written four times slower (stosb instead of stosd) when misaligned. That got fixed quickly.

Tweets start here, hacker news discussion is here, and reddit discussion is here.


About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
This entry was posted in Performance, Programming, Visual Studio. Bookmark the permalink.

22 Responses to Zeroing Memory is Hard (VC++ 2015 arrays)

  1. fdominicus says:

    Just curious
    what yields
    arr1[10] = {0};
    arr2[10];

    arr2 = arr1?

    Regards

    • brucedawson says:

      Well, you can’t assign an array to an array. But if you wrapped the array in a struct then I would expect a memcpy (probably inlined). If you get something else then you should add that to the report?

  2. Saleel Kudchadker says:

    Nice. Thanks for the elaborate explanation. Surely prompts me to check code I’ve written.

  3. Leszek says:

    xor eax, eax
    mov QWORD PTR buffer$[rsp], rax

    Is this a typo (and should be xor rax, rax), or is there some clever way to avoid having garbage in the upper half of the QWORD that I’m not seeing?

  4. frank says:

    I like this approach char buf[2048]{};

  5. Xi Yang says:

    We evaluated the overheads of Java zeroing.
    “Why Nothing Matters: The Impact of Zeroing”,
    https://github.com/yangxi/papers/raw/master/zero-oopsla-2011.pdf

  6. Sivaprasad says:

    Nice, liked the approach 🙂

  7. Wolfram says:

    I could not recreate the behavior for ‘char buffer[32] = “”;’. In all scenarios I tried (O1 and O2, both with and without Oy-), the assembly code generated for this approach was identical to the code generated for ZeroArray1 (that is ‘char buffer[32] = { 0 };’), rather than “worse code than ZeroArray1 and ZeroArray2, in all cases”, as written in the article. (I used VS Community 2015, ver. 14.0.23107.0 D14REL, if that is of any relevance)
    Granted, I did all my tests from inside the IDE–which does add some additional options to the CL call–rather than calling the compiler manually or using the Python scripts, but I believe this shouldn’t impact the code generation in such a way.

    • brucedawson says:

      You should test with the latest version of the compiler – I used VS Community 2015 Update 3. But, that won’t change the results.

      If you run the Python script then you will be able to reproduce my results. You can then compare the command line options used by the IDE to find out which ones are causing your results to not match mine.

      The effect is real – the VC++ team has acknowledged it (see the next comment). Try some of the more dramatically different sizes, 32-bit and 64-bit.

  8. andrewpardoe says:

    Bruce, thanks for the kick! Six years is a long time to wait for a bug resolution.

    It turns out that our compiler has a minimum size limit in a memset optimization. I’m sure the size limit was there for a Very Good Reason (TM) at one point in time, but we are investigating whether we can remove it. Step one is understanding why it was there in the first place.

    • brucedawson says:

      Interesting – so once the size gets to 64 this optimization kicks in and merges adjacent memory-clear requests? That would explain why I saw no differences beyond there.

      The joys of working on an old compiler.

      • andrewpardoe says:

        Apparently, once the size gets to 33 the optimization kicks in. It’s both a blessing and a curse to have a 30+ year old codebase and tons of users who never want to change their code🙂

  9. el0j says:

    GCC 6.1 seems to treat both cases the same, using moves up to 34 bytes, then rep stosq up to 8192 bytes, then a call to memset after that. As tested on godbolt.org; it can of course vary depending on arch+flags.

  10. red1939 says:

    Also clang (3.8) doesn’t seem to care that much. Zeroing 32-element array is always 2 * SIMD move (movaps with constant 0).

  11. Denis Frolov says:

    Dear @brucedawson,
    Do you think we could translate this article to Russian and publish on our corporate blog (http://habrahabr.ru/company/abbyy/) — of course, with the name of the author clearly indicated and a link to the original text? We are a language software development company and our developers will certainly appreciate this great article. Thank you!
