A Crash of Great Opportunity

It was a fairly straightforward bug. A wide-character string function was called with a byte count instead of a character count, leading to a buffer overrun. After finding the problem the fix was as simple as changing sizeof to _countof. Easy.

But bugs like this waste time. A playtest was cancelled because of the crashes, and because the buffer-overrun had trashed the stack it was not trivial to find the bad code. I knew that this type of bug was avoidable, and I knew that there was a lot of work to be done.

The work that I did included:

Diagnose it earlier

If the program had crashed inside the function that had made the bad wide-character function call then finding the bug would have been trivial – code inspection would have quickly revealed it. But as soon as execution returned from that function the trashed stack obscured the location of the bug, making its investigation trickier.

It turns out, there’s a VC++ compiler switch to prevent returning after trashing the stack. So, the first thing I did was to turn on that switch, /GS. This switch tells VC++ to add a canary on the stack and check it before returning. With this switch turned on the crash was caught in the buggy function and locating it took seconds.

The /GS switch is intended as a security feature, to protect against malicious buffer overruns, but it also works well as a developer productivity tool. It does have some runtime cost, but the tradeoff is usually worth it, especially on internal builds. Recommended.

Cure the bug

Once I’d turned on /GS and reproed the bug it was trivial to find the bug so the next step was to fix it, as described earlier.

Don’t destroy the evidence

When the buffer overrun trashed the stack the buggy function returned to a garbage address, which happened to be in the heap. Wiping out evidence of the bug’s location was bad enough, but it was made worse because after returning to an address in the heap the game executed the data in the heap as instructions. This scrambled the registers and the stack and generally confused things.

And also, an executable heap? Really? That is a security hole of the first order. Therefore the next order of business was to change the linker settings to add /NXCOMPAT. This tells Windows to make the heap and the stack non-executable. This significantly improves security and it can also simplify debugging. And, this option has no run-time cost. Recommended. Actually, this should be considered required.

While I was in there I also turned on the /DYNAMICBASE linker switch in our release branches to further increase security, also with no run-time cost.

Avoid crashes from the future

At this point I’d fixed the bug, made future bugs of this type much easier to investigate, and improved security. But there was still much left to do. It turns out that this type of mistake is easy to make. When a developer passes a buffer and a size there are at least a half dozen different ways of passing the wrong size. The best way to avoid these bugs in the future is to avoid the need to pass the size. It is dead easy to create template functions that take an array as a parameter and infer the size with 100% accuracy. Instead of writing this:

mywprintf(buffer, _countof(buffer), …); // Verbose and dangerous

you can then write this:

mywprintf_safe(buffer, …); // Compact and safe

It’s less typing, less reading, and it’s guaranteed correct. The syntax for a templated function that takes an array reference is a bit gnarly, but you only need to get it right in a few places. I covered this technique in a previous blog post.

The template technique doesn’t work if you have a raw pointer, but for any other target you should be able to create an overload to handle it. Manually passing the size should be rare.

Therefore my next task was to add template overrides for all of our string functions, and encourage everyone to use them in all new code, thus making it trivial to avoid ever writing this type of bug again.

Avoid crashes from the past

While the safe template functions would let us avoid writing this type of bug in the future they did nothing about the millions of lines of existing code. It was safe to assume that there were more size mismatches waiting to bite us. So I added SAL annotations to our string functions, fired up Visual Studio’s /analyze, and started compiling. This was, by far, the biggest task. Any large code base that has not had static analysis run on it will have lots of detectable bugs. Buffer overruns, logic errors, format-string mismatches, use-after-free, and many more. I ended up running /analyze on five major projects, fixing thousands of bugs, and setting up build machines to report new problems. This was a few months of work spread out over several years, but time well worth spending. It’s still finding new coding errors today. I discussed this experience in a previous blog post.

Back to the future

/GS, /NXCOMPAT, /DYNAMICBASE, /analyze and all its associated labors – plus actually fixing the bug – took a lot of time, but it was definitely worth it. Most of these changes were trivial and had huge payoffs. Running /analyze was by far the biggest task but, like other smart programmers, I am convinced that it was invaluable. Entire classes of bugs – serious crashing bugs and crazy logic errors – that use to show up quite frequently are now entirely extinct. It is not often that you get to entirely eradicate dozens of types of bugs and doing this definitely increased developer productivity and product reliability.

Of course, crashes that no longer happen are invisible and it is impossible to know whether this work prevented a serious black swan event. The invisibility of the benefits can make it difficult to get recognition for this type of preventative bug fixing, so keep that in mind.

Simple bugs shouldn’t always trigger months of work, but it’s important to recognize when they should.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048

View all posts by brucedawson →

11 Responses to A Crash of Great Opportunity

Ben Hymers says:

September 25, 2014 at 3:36 am

/NXCOMPAT and /DYNAMICBASE are on by default in VS2013, VS2012 and VS2010 (and on for most uses in versions before then).

/GS is on by default too, though I think VS default project settings may turn it off for release builds so it’s worth checking. IMO it should only be disabled for performance profiling and for final release builds.

- brucedawson says:
  
  September 25, 2014 at 7:38 am
  
  The defaults for those settings are a bit more complicated than that. For a long time VS would set them on new projects but would not enable them on upgrades, which means that many projects, on VS 2010 at least, would not have these settings unless users manually changed them. I complained about this several times and I think VS 2013 now sets them on project upgrade in most cases, which is great. This was a VS 2010 project.
  
  But, everyone should check to be sure, because of the importance. Ideally /GS should be on for release builds as well, although that is a slight performance/security tradeoff.
  
Adrian says:

September 25, 2014 at 9:20 am

All great advice. Whenever I fix a bug, I try to ask myself how I could have caught it earlier and where similar bugs might be lurking. I just found a missing break in a switch case in _ancient_ code. With a little searching, I found three copy-paste clones of the faulty code in other places in the project. So rather than wait for the next person to find those missing break statements, I’m refactoring to eliminate the duplication and thus fixing it in just one place.

In your case, instead of making template versions of all the string functions, I would have eliminated the pattern where the caller passes in a buffer to be filled. Output buffers are an old, dangerous pattern the used to be necessary because of the language (e.g., C) or because of efficiency (e.g., C++ before RVO, rvalue references, and quality STL implementations). Embrace std::string and std::wstring, and eliminate these problems. When you have to interface with platform APIs that require separate pointers and counts, do it in one place so you only have to get it right once.

- brucedawson says:
  
  September 25, 2014 at 9:29 am
  
  I do wish more developers would look for more instances of the same bug or “how could I have caught it earlier.”
  
  Eliminating static sized buffers is definitely an option, and C++11 makes it more practical, but std::string still usually requires an allocation — an unacceptable cost in some domains. And, drastic changes in patterns can be very expensive to implement.
  
  The great thing is that with template overrides and static-analysis (to find format-string mismatches), the “old, dangerous pattern” of output buffers can actually be completely safe.
  
  - gast128 says:
    
    August 6, 2015 at 1:59 am
    
    Be aware that most string implementations use small buffer optimization, which only allocate above a certain character count. Using C++ correctly should solve almost all buffer overruns or other C problems. And if you have to stick to C, please invent your own string wrapper library around strlen, strcat, strcpy etc.
    
Matthew Fioravante says:

September 26, 2014 at 2:02 pm

This is why I’m so eagerly waiting for string_view and array_view which are coming in the next revision of the C++ standard. Passing around pointer/int pairs (or equivalently pointer/pointer) is asking for trouble and should be banned. I try to avoid it at all costs. Instead its better to wrap the invariant relationship within a type. Its pretty easy to write simple versions of these yourself.

template
class ArrayView {
public:
template
ArrayView(T (&a)[N]) : _begin(a), _end(a + N) {}
//….
private:
T* _begin = nullptr;
T* _end = nullptr;
};

int safe_sprintf(ArrayView buf, const char* fmt, …) { snprintf(buf.data(), buf.size(), /* va stuff */); }

char buf[4096];
safe_sprintf(buf, “Hello World”);

- brucedawson says:
  
  September 26, 2014 at 7:38 pm
  
  The comment system ate your less-than and greater-than markers, but I get the idea. That is a nice implementation and it arguably makes it easier to implement functions that use the template magic. Either way, the crucial point is to not have the calls to safe_printf responsible for passing in the size. Shudder.
  
  As one of the linked blog posts mentions there were many different types of fixes — format string errors and logic errors mostly — but getting rid of the buffer-size mismatches was particularly satisfying.
  
  - Matthew Fioravante says:
    
    September 30, 2014 at 2:06 pm
    
    It also automagically works with std::array, std::vector, std::string and pretty much anything else that is a contiguous container, which is pretty nice.
    
    - brucedawson says:
      
      September 30, 2014 at 7:07 pm
      
      Yeah, it’s the ultimate adapter, and it means you can add support for other container types and all functions using ArrayView will automatically gain that support. I like. You should add the same comment to the strncpy rant.
      
- Bas says:
  
  October 9, 2014 at 9:04 am
  
  This is exactly what I made for my own hobby codebase. I made a ‘Range’ class, mainly for pointers which accepts conversions from all contigeous containers (array, vector, string).
  
Pingback: Сентябрьская лента: лучшее за месяц | Сообщество Аналитиков UML2.ru