Visual C++ Debug Builds – “Fast Checks” Cause 5x Slowdowns

Introduction

Visual Studio’s default Debug builds run painfully slowly. With one small change to your compile settings you may find that your Debug builds run up to five times faster.

Update: the situation with Visual C++ 2013 is fundamentally unchanged – years later, Debug builds are still slow by default.

Update 2, Sep 2013: the slowdown is triggered by a combination of Runtime Checks and Edit and Continue – disabling Edit and Continue can drastically improve debug performance. In particular, it avoids the allocation of 64+ bytes of stack space in empty functions. The VC++ team is aware of the problem.

The Gory Details

Debug and Release Builds

Visual Studio has two build types by default – Debug and Release. The Debug builds have optimizations disabled and asserts enabled in order to make debugging easier. The lack of optimization makes it easy to step through your code and see what is going on.

The Release builds are fully optimized. While they still have debug information (if they don’t then you should fix that right away) the many optimizations needed to get maximum performance make debugging challenging. Inlining, transformation of expressions, keeping variables in registers (and the debugger’s unfortunate inability to track the location of those variables) all make debugging of Release builds a tedious job.

Therefore, in an ideal world Visual C++ developers would do their development and debugging using the Debug build, and only use the Release build for profiling, late-cycle testing, and releasing to customers.

In the real world, however, especially when working on video games, the debug builds are often too slow. Debug builds are often ten to thirty times slower than release builds, which renders them impractical for many purposes.

Investigating Slow Debug Builds

Our slow Debug builds were a problem so I decided to find out why they were so slow, and what could be done about it. I fired up xperf and profiled one particularly slow scenario, which was taking 34 seconds. What I noticed was that the majority of the time was being spent in tiny little accessor functions. These functions do little more than return a private value from a class and in release builds they are inlined and typically map to a single instruction. In a debug build these functions aren’t inlined, which hurts performance, and they also expand out to more instructions. A lot more.

To investigate this I wrote the simplest possible accessor function and looked to see what code the compiler generated. Here is the simple function:

size_t Container::GetCount() const
{
return m_count;
}

And here is the code that Visual C++ generates with the default Debug build settings:

; Debug version with default build settings
; Prologue
push   ebp
mov    ebp, esp
sub    esp, 204
push   ebx
push   esi
push   edi
push   ecx
lea    edi, DWORD PTR [ebp-204]
mov    ecx, 51
mov    eax, 0xCCCCCCCC
rep stosd
pop    ecx
mov    DWORD PTR _this$[ebp], ecx

; Function body
mov    eax, DWORD PTR _this$[ebp]
mov    eax, DWORD PTR [eax]

; Epilogue
pop    edi
pop    esi
pop    ebx
mov    esp, ebp
pop    ebp
ret    0

This simple function, which in a Release build would be 2 instructions, and more likely would be inlined to 1 instruction, now has 13 instructions of prologue, 2 instructions of function body, and 6 instructions of epilogue, for a total of 21 instructions. What’s worse is that one of those instructions is a “rep stosd” with a count (in the ecx register) of 51. That means that we are writing to 51 consecutive DWORDs on the stack, filling them with 0xCCCCCCCC – 51 DWORDs at four bytes each is exactly the 204 bytes reserved by “sub esp, 204”.

Having 21 instructions is bad enough, but executing one of them 51 times makes it far worse. That means we are effectively executing 71 instructions – to implement a function that just returns a number.

That was why these tiny accessor functions were now dominating the profile. They are called extremely frequently and in Debug builds their cost is about 70 times greater than in Release builds. Meanwhile the big functions which do the majority of the real work are only slightly slower than before because their extra overhead is small in comparison to the work that they do.

The “rep stosd” and its supporting instructions (the “push ecx” / “lea” / “mov ecx, 51” / “mov eax, 0xCCCCCCCC” / “rep stosd” / “pop ecx” sequence in the listing above) exist to support the “Uninitialized local usage checks” and “Stack Frame runtime checking”. These two checks are enabled by the “/RTC1” flag, which is described as “Enable fast checks”. In particular, the stack filling appears to be associated with “Stack Frame runtime checking”.

Faster Debug Builds

Luckily these checks can easily be disabled. In your project properties go to C/C++, Code Generation, and set Basic Runtime Checks to Default. Your code will get slightly smaller, and a lot faster. The GetCount() function above goes from 21 instructions in memory, 71 instructions executed, to 15 instructions in memory, 15 instructions executed:

; Debug version with “fast checks” disabled
; Prologue
push   ebp
mov    ebp, esp
sub    esp, 68
push   ebx
push   esi
push   edi
mov    DWORD PTR _this$[ebp], ecx

; Function body
mov    eax, DWORD PTR _this$[ebp]
mov    eax, DWORD PTR [eax]

; Epilogue
pop    edi
pop    esi
pop    ebx
mov    esp, ebp
pop    ebp
ret    0
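The Basic Runtime Checks setting can also be changed by editing the project file directly. A sketch of the relevant .vcxproj fragment is below (element names are from the MSBuild C++ schema; “Default” means no /RTC flag is passed to the compiler, and – per the Sep 2013 update above – switching the debug information format from edit-and-continue (/ZI) to /Zi also helps):

```xml
<ClCompile>
  <!-- Default = no /RTC flag; EnableFastChecks would mean /RTC1 -->
  <BasicRuntimeChecks>Default</BasicRuntimeChecks>
  <!-- ProgramDatabase = /Zi instead of the edit-and-continue /ZI -->
  <DebugInformationFormat>ProgramDatabase</DebugInformationFormat>
</ClCompile>
```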

Optimizations are disabled in debug builds for two reasons: to improve compile speeds, and to improve debuggability. The Visual C++ debug builds are, unfortunately, bloating the code in ways that serve neither of those goals, leading to debug builds that are far slower than necessary.

Measuring the Speedup

I wrote some simple test code that called GetCount() in a loop to see how much of a performance difference this change makes. The results will vary depending on what type of CPU you have, and the results will be different in the context of a real program instead of a benchmark, but there is still some value in measuring the speed difference.

The results were clear and dramatic. With the ironically named “fast checks” disabled the benchmark runs just over five times faster than with the checks enabled. Since the measurement code adds some overhead this means that the actual speedup of the GetCount() function is even greater – perhaps as much as a six times speedup.

Lost Error Checking

The Visual C++ runtime checks do have some value. The problematic check appears to be controlled by “Stack Frames (/RTCs)” which checks for incorrect adjustments of the stack pointer. Here is an example of some incorrect code that this check would detect:

#include <stdio.h>

__declspec(noinline) int __cdecl AddNumbers(int x, int y)
{
return x + y;
}

typedef int(__stdcall* AddNumbers_t)(int x, int y);

void TestBadCalling()
{
AddNumbers_t pAddNumbers = (AddNumbers_t)&AddNumbers;
int sum = 0;
for (int i = 0; i < 1000000; ++i)
sum += pAddNumbers(1, 2);
printf("1 + 2 = %d\n", sum);
}

The AddNumbers function is declared as __cdecl which means that when it returns it is supposed to leave its arguments on the stack. However the function pointer is declared as __stdcall, which indicates a function that will clear its arguments from the stack on return. This mismatch (only possible because of the cast when assigning the function pointer) means that each time we call AddNumbers through the function pointer our stack pointer is decremented by the size of the two integer parameters. Eventually the stack will overflow and we crash, but other bad side-effects are certainly possible. This demonstrates the danger of using casts to paper over incompatibilities. Unfortunately casts are sometimes unavoidable, such as when using the Windows function GetProcAddress.

With “Stack Frames (/RTCs)” enabled Visual C++ catches this bug and brings up this impressively informative dialog:

[Image: Run-Time Check Failure #0 dialog – “The value of ESP was not properly saved across a function call”]

A bit of poking around in the disassembly finds the straightforward code that implements this check:

mov     esi,esp
push    2
push    1
call    dword ptr [pAddNumbers]
cmp     esi,esp
call    @ILT+700(__RTC_CheckEsp)

In other words, when this check is enabled Visual C++ wraps every function call with code that makes a copy of the stack pointer (esp) and then checks after the call that esp has returned to its correct value.

What is fascinating about this is that this (quite useful) runtime check is totally unrelated to the expensive stack filling that we are trying to get rid of, yet they are controlled by the same compiler flag. I suspect that this is a compiler bug and that the stack filling is actually supposed to be used to implement the separate “Uninitialized local usage checks” feature.

Possible Future Improvements

Disabling the “fast checks” is a huge improvement but Visual C++ could go much further. Right now it saves and restores registers that it never uses, reloads values that are already in registers, and does many other wasteful things. If I ran the world then the Debug build version of GetCount() would be eight instructions, as shown below:

; Hypothetical improved Debug version (“fast checks” disabled)
; Prologue
push ebp
mov ebp, esp
sub esp, 68
mov DWORD PTR _this$[ebp], ecx

; Function body
mov eax, DWORD PTR [ecx]

; Epilogue
mov esp, ebp
pop ebp
ret 0

Making these additional changes could potentially double the speed of these small accessor functions, thus significantly improving the speed of C++ programs that depend on them.

Summary

With this one change to our build settings my 34 second scenario was sped up to 14 seconds. A few other changes to optimize some of our assert checks got it down to 12.5 seconds. If you too want dramatically smaller and faster debug builds then all you need to do is set Basic Runtime Checks to Default.

Having the compiler help me find bugs is a great thing but given that this feature only very rarely finds bugs, and is extremely expensive, I don’t think that the cost/benefit ratio is acceptable.

At the very least Visual C++ should fix these flags so that we can get the value of the “Stack Frames” checks without the excessive cost of the “Uninitialized local usage checks”. I also think they should fix the compiler so that the “Uninitialized local usage checks” don’t generate extra code in functions that don’t even have local variables.

Timing Details

My test program consists of a function that calls GetCount() in a tight loop, 10 million to 50 million times (more for the faster version, so that the total time is fairly consistent), and times the outer function with __rdtsc().

With “fast checks” enabled the GetCount() function typically took an average of 54.7 cycles.

With “fast checks” disabled the GetCount() function typically took an average of 10.4 cycles, or about 5.25 times faster.

Performance measurement on modern CPUs is a black and fuzzy art. My CPU is rated at 2.2 GHz and __rdtsc() runs steadily at that rate, but under load it can (and does in these tests) ramp up to 3.3 GHz. Thus the clock ticks measured by __rdtsc() should really be multiplied by 1.5, however that doesn’t change the speedup. The tests seem to be fair and representative, and consistently show a significant speed boost from turning off “fast checks”.

The full test code – which is unlikely to work well on optimized builds – is shown below:

#include <stdio.h>
#include <stdlib.h>    // _countof
#include <intrin.h>    // __rdtsc
#include <algorithm>   // std::sort

class Container
{
public:
size_t GetCount() const;

private:
size_t m_count;
};

size_t Container::GetCount() const
{
return m_count;
}

Container g_container;

const size_t kCount = 50000000;

__int64 TimeGetCount()
{
int sum = 0;
__int64 start = __rdtsc();
for (size_t i = 0; i < kCount; ++i)
{
sum += g_container.GetCount();
}
__int64 elapsed = __rdtsc() - start;
printf("%lld kticks for %u iterations, sum is %d\n", elapsed / 1000, (unsigned)kCount, sum);
return elapsed;
}

void TestSimpleAccessor()
{
__int64 results[21];
for (int iter = 0; iter < _countof(results); ++iter)
{
results[iter] = TimeGetCount();
}

std::sort(results, results + _countof(results));

printf("Results are %1.1f/%1.1f/%1.1f cycles (min/median/max)\n",
results[0] / double(kCount),
results[_countof(results) / 2] / double(kCount),
results[_countof(results) - 1] / double(kCount));
}

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048

19 Responses to Visual C++ Debug Builds – “Fast Checks” Cause 5x Slowdowns

  1. Pingback: Making VirtualAlloc Arbitrarily Slower | Random ASCII

  2. Billco says:

    I dare ask: since you know the generated code is still wasteful (those 3 useless push/pop pairs), would you bother writing a preprocessor or macro to replace these trivial accessors with hand-tuned assembly code ? I haven’t done any game programming in well over a decade, just curious how aggressive you have to be these days with modern hardware.

  3. brucedawson says:

    Processing the object files (after compilation, but before linking) would allow us to remove some of the wasteful instructions. It would have to be done extremely carefully to make sure it didn’t change the meaning of the code, but I think it is possible.

    Since this is for debug builds this wouldn’t affect the performance our customers would see. It would just affect game performance during development. Therefore we’d have to be very careful that our post-processing of the object files was very fast, or else it would be a poor tradeoff.

    For now I’m hoping that the VC++ team will deal with this issue. My friends on that team seemed interested in improving the code-gen for debug builds.

  4. Pingback: Visual C++ Code-gen Deficiencies | Random ASCII

  5. Lewis says:

    thanks for that post it was an interesting read.

  6. Cedric says:

    It seems that setting Basic Runtime Checks to Default in debug in an application using /arch:SSE2 is causing the temporary __m128 values to be aligned on 8 bytes instead of 16. There was a similar bug reported and fixed in 2006 in VS2005 and VS2008. Any decent vector class should reproduce the issue.

    • brucedawson says:

      Cedric – what version of Visual Studio are you using that shows this bug?

      The Visual Studio team is pretty responsive about fixing genuine bugs. To improve the odds of getting *your* bug fixed I recommend creating a simple project that demonstrates the bug — the simpler the better. A cpp file and .vcxproj file, or a .cpp file and command line to build it is ideal. Post it here, or at connect.microsoft.com.

      • Cedric says:

        2010 SP1, x64. I’ll see if I can get it to happen on a simple case. it’s happening on a big video game code base. The debug config is pretty much unusable, so it sounded like a good thing to try. I’ll post something if I can get to it next week..thanks again for investigating the performance differences between debug and release, that’s a great find!

  7. pg says:

    Thank you for explanation of that rep stosd stuff. I lost about 2 hours with investigating that, but google sent me to your blog. Again:)

  8. Pingback: Vote for the VC++ Improvements That Matter | Random ASCII

  9. Pingback: How to Report a VC++ Code-Gen Bug | Random ASCII

  10. Hey Bruce, do you happen to know if the Stack Filling has been separated from the Stack Frames checks in any newer versions of the compiler? I’m working on setting up a current project’s prebuilt system, and want to know if I should have /RTC enabled or not.

    • brucedawson says:

      I don’t know if the stack filling has changed. Note that if you turn off /ZI (edit-and-continue) then the problem should go away.

      Also, I saw a CppCon talk where they recommended using /Ob to allow one level of inlining in debug builds with /Ob1 — I’m not sure how this works. I saw it described in Nicolas Fleury’s C++ in AAA Games talk:

  11. Andy Wyatt says:

    I’ve just seen your post and tried a similar code example in VS2013… and it seems that the issue described persists – the ‘heavy’ assembly generated for a simple accessor function is almost identical to that shown here. How sad! (I hoped this issue was a mere history note in Visual Studio… alas…).

  12. Pingback: ETW Central | Random ASCII

  13. Thanks so much for this post! I was given this link from a co-worker, and it helps get a handle on Release vs Debug performance issues. We are migrating from VS6.0 to VS2013, and these runtime checks are new coming from VS6.0 so the performance hit can be significant for Debug builds.

    • brucedawson says:

      In some cases it is necessary to disable the debug checks. There are the “fast checks” discussed here, but also various levels of STL checks, and the debug CRT adds extra heap checking (including locks that serialize allocations). Plus, when you run your application (debug or release) under a debugger that enables the Windows debug heap which can cause further slowdowns.

      So your compile options, defines to control the STL, CRT choice, and whether you run under a debugger can all affect your performance.
