Increased Reliability Through More Crashes

Shipping games that don’t crash is hard, and it’s important to use every tool available to try to find bugs. Static code analysis is one technique that I’ve discussed in the past and for some classes of bugs it is delightfully effective.

Another strategy is to stress your game at runtime by adding additional validation, and by making the runtime environment more hostile so that rare bugs become frequent.

App Verifier is a free tool from Microsoft that adds additional checks to handles and locks, and allocates memory in a way that makes bugs more likely to lead to crashes.

Any Windows developers that are listening to this: if you’re not using App Verifier, you are making a mistake.

This post discusses App Verifier’s heap features.

This article was originally posted on #AltDevBlogADay.

Memory Stressing with Page Heap

One of the main features of App Verifier is Page Heap. This is a feature that puts every allocation on its own page in order to flush out buffer overruns and use-after-free errors.

Buffer overruns

Normally if you write beyond the end of an allocated buffer you will corrupt the heap data structures or some other allocation. This will often cause no initial problems, and then a catastrophic failure later on. This delayed failure makes it difficult to track down the problem. You might know which buffer overflowed, but not which code overflowed it.

Page Heap puts each allocation on its own 4-KB page, with the allocated memory aligned to the end of the page. Therefore if you overrun the buffer you will touch the next page. Page Heap ensures that the next page will be unmapped memory so you get a guaranteed access violation at the exact moment that you overrun the buffer.

Buffer overrun crashes with page heap are usually on the first byte of a page. That means that the last three digits of the hex address will be zero – watch for that signature in order to categorize the access violations you see.

In the awesomely buggy code below you can see that we crashed when we tried to write to 0x06D8D000 (EDI), and the memory window shows the ‘??’ pattern that indicates a fresh page of non-existent memory.

By default Page Heap keeps your allocations aligned to 8 or 16 byte boundaries so if, for instance, you allocate 4 bytes of memory there will be 4 or 12 bytes of mapped memory before the end of the page, which means that overruns will not be instantly caught. Memory corruption in the unused bytes at the end of the page will be checked for when the memory is freed.
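The placement arithmetic can be sketched in a few lines. This is only a model of the layout described above, not App Verifier's actual code: the names `PageHeapPlacement` and `PlaceAtEndOfPage` are made up for illustration, and the 4-KB page size and power-of-two alignment are assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative model only: none of these names are part of a Microsoft API.
constexpr std::size_t kPageSize = 4096;  // assumed 4-KB pages

struct PageHeapPlacement {
    std::size_t offsetInPage;  // where the user pointer lands within its page
    std::size_t slackBytes;    // mapped bytes between the end of the buffer and the next page
};

// Models an allocation pushed as close to the end of its page as the
// alignment allows. 'size' must be 1..kPageSize, 'alignment' a power of two.
inline PageHeapPlacement PlaceAtEndOfPage(std::size_t size, std::size_t alignment) {
    assert(size > 0 && size <= kPageSize);
    assert(alignment != 0 && alignment <= kPageSize);
    assert((alignment & (alignment - 1)) == 0);
    // Round the size up to the alignment; the buffer starts that far from the page end.
    const std::size_t rounded = (size + alignment - 1) & ~(alignment - 1);
    return { kPageSize - rounded, rounded - size };
}
```

Under this model a 4-byte allocation leaves 4 bytes of slack with 8-byte alignment and 12 bytes with 16-byte alignment, and an 8-byte allocation starts at page offset 0xFF8.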

Use after free

If you write to memory after freeing it then this will usually corrupt memory. Occasionally you will get lucky and crash immediately, but more often you will merely set the stage for a crash far in the future. Use-after-free memory corruption is usually much harder to investigate than buffer overruns.

Since Page Heap puts each allocation on its own page it can ensure that the memory will be unmapped when it is freed. That means that use-after-free will reliably give an instant access violation. A nightmare memory corruption bug becomes a tame kitten.

Whereas buffer overruns with Page Heap usually cause access violations near the beginning of a page, use-after-free with Page Heap usually causes access violations near the end of a page (assuming small allocations). Watch for the last three digits of the hex address to be near 0xFFF.

With use-after-free bugs the challenge may be to figure out who freed the memory. Sometimes, on good days, App Verifier will help with that. The process is a bit convoluted and arcane, but worth knowing about.

Page Heap records call stacks when you allocate and free memory, and WinDbg has an extension that will look up that information for a Page Heap address. If you are debugging with WinDbg then you can just type in “!heap -p -a Address” and see if a call stack for when the memory was freed is available. If you are debugging with Visual Studio then you can save a Minidump with Heap (Debug->Save Dump As) then load it into WinDbg and type the intuitive !heap command. Whether the free stack is available depends on how long ago the memory was freed. I find that it has worked about half the time for me, and when it works it feels quite magical. If it doesn’t give an answer within ten to twenty seconds then it probably never will and ctrl+break is your friend. In the screen shot below you can see that we crashed when accessing 06decff8 (ESI), saved a crash dump, loaded it into WinDbg, and then typed in “!heap -p -a 06decff8”. Our reward for this effort was the call stack of when this memory was freed. Shazam!

Illegal Reads

Page Heap will detect illegal reads (buffer overreads and read-after-free) just as easily as illegal writes. These bugs are less serious, but worth finding and fixing since they can still lead to unpredictable behavior and crashes in the field.

True Tales

Normally I use App Verifier in a proactive mode – I use it to hunt for bugs that we don’t know about, and to make our unit tests more likely to find problems.

However it also works brilliantly as a reactive tool. A few months ago I got a call from a coworker because our game was hanging – spinning in a busy loop in the gnarliest lockless code that we have. I spent a while (too long) staring at the data structures trying to figure them out – when I realized that the problem was probably memory corruption. I turned on App Verifier and instantly hit the crash, in code that was miles away from where the symptoms appeared. An object that owned memory was being returned by value but didn’t have a copy constructor (rule of three violation). The memory owned by the object was freed, but the object’s copy still had pointers to it and we were writing through them. Without App Verifier I’m not sure how we would have found the bug, and with App Verifier the bug was trivial.
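The bug pattern is easy to reproduce in miniature. The class below is a hypothetical stand-in (the real object is not shown in this post): it owns a heap buffer, and the copy constructor and copy assignment operator are the pieces that a rule-of-three violation leaves out. Without them the compiler-generated copy duplicates only the pointer, so the copy keeps pointing at memory that the original's destructor frees.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>

// Hypothetical example class; 'Buffer' is not taken from the original code.
class Buffer {
public:
    explicit Buffer(std::size_t size) : size_(size), data_(new char[size]()) {}

    // Rule of three, part 1: the destructor frees the owned memory.
    ~Buffer() { delete[] data_; }

    // Part 2: the copy constructor makes a deep copy. If this were missing,
    // the copy would share data_ with the original and dangle once the
    // original was destroyed.
    Buffer(const Buffer& other)
        : size_(other.size_), data_(new char[other.size_]) {
        std::memcpy(data_, other.data_, other.size_);
    }

    // Part 3: copy assignment, written with copy-and-swap.
    Buffer& operator=(Buffer other) {
        std::swap(size_, other.size_);
        std::swap(data_, other.data_);
        return *this;
    }

    char* data() { return data_; }
    std::size_t size() const { return size_; }

private:
    std::size_t size_;
    char* data_;
};

// Returning by value is safe here only because Buffer copies deeply.
inline Buffer MakeGreeting() {
    Buffer local(6);
    std::memcpy(local.data(), "hello", 6);
    return local;
}
```

With the two copy operations deleted from this sketch, writing through the object returned by `MakeGreeting()` could be a write to freed memory, which is exactly the kind of access that Page Heap turns into an immediate access violation.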

In less happy news, in some cases App Verifier will perturb the timing of your game so much that certain race conditions no longer occur – so it isn’t guaranteed to find everything.

Memory Consumption and Performance

When a 32-bit process on Windows allocates one byte it actually uses up 16 bytes of heap space – the heap granularity plus bookkeeping overhead adds the extra bytes. When using Page Heap a one byte allocation actually uses up 4 KB of memory, and 8 KB of address space. That means that fewer than 256 K allocations will exhaust the default 2 GB address space of a 32-bit program.
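The 256 K figure falls straight out of the numbers above. A minimal sketch of the arithmetic, assuming every allocation is small enough to cost exactly 8 KB of address space and ignoring everything else that lives in the address space (code, stacks, DLLs), which is why the real limit is lower; the function name is made up for illustration:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kKiB = 1024;
constexpr std::uint64_t kGiB = 1024ull * 1024 * 1024;

// Upper bound on the number of allocations that fit in an address space
// when each allocation consumes a fixed amount of it.
inline std::uint64_t MaxAllocations(std::uint64_t addressSpaceBytes,
                                    std::uint64_t addressSpacePerAllocation) {
    assert(addressSpacePerAllocation != 0);
    return addressSpaceBytes / addressSpacePerAllocation;
}
```

`MaxAllocations(2 * kGiB, 8 * kKiB)` gives 262,144 (256 K) for the default 32-bit address space, and a 4 GB address space only doubles that to 524,288.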

Marking your program as large address aware will give you (when running on 64-bit Windows) a 4 GB address space, which will postpone address space exhaustion, but will probably not avoid it entirely. Porting to 64-bit is the ideal solution. In many cases the more expedient solution is to adjust the heap settings so that only allocations within a certain size range go on their own pages, or else adjust the RandRate so that some percentage of your allocations go on their own pages.

Page heap significantly reduces your game’s performance. In addition to the greater cost of allocating and freeing memory, individual memory accesses are now more expensive, due to less efficient cache usage. Your mileage may vary, but these are the changes I’m seeing on one recent project:

	Normal	With App Verifier
Frame rate	170 fps	3.7 fps
Memory usage	0.8 GB	2.9 GB
Address space usage	1.0 GB	5.5 GB

If I reduced the number of memory allocations per frame then I could probably get performance up to about 10 fps, but even at 3.7 fps it’s okay for running tests.

Note that while we are using less than 4 GB of RAM, we are using more than 4 GB of address space, so we are only able to run in full App Verifier mode because we have a 64-bit build.

Hooking up to the process heap

Most game developers don’t use the system heap directly, for all sorts of good reasons. However Page Heap is powerful enough to justify making a conditional exception. On the projects I have worked on it has been relatively straightforward to add code that checks for a command line option on the first allocation, and when it is detected redirects all allocations to the process heap.
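A rough sketch of that kind of redirection follows. Everything named here is hypothetical: the `-systemheap` switch, `GameAlloc`/`GameFree`, and the `CustomHeapAlloc`/`CustomHeapFree` stand-ins for a game's own allocator are invented for illustration. The only real APIs used are `GetCommandLineA`, `GetProcessHeap`, `HeapAlloc` and `HeapFree`, and the non-Windows branch exists only so that the sketch compiles elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical command line switch that requests the system heap.
static const char kSystemHeapSwitch[] = "-systemheap";

// Simple substring match; a real parser would tokenize the command line.
inline bool HasSwitch(const char* commandLine, const char* name) {
    return commandLine != nullptr && std::strstr(commandLine, name) != nullptr;
}

// Stand-ins for the game's custom allocator (assumed, not a real engine API).
inline void* CustomHeapAlloc(std::size_t size) { return std::malloc(size); }
inline void CustomHeapFree(void* p) { std::free(p); }

// The command line is inspected once, the first time anything allocates.
inline bool UseSystemHeap() {
#ifdef _WIN32
    static const bool enabled = HasSwitch(GetCommandLineA(), kSystemHeapSwitch);
#else
    static const bool enabled = false;  // Page Heap only exists on Windows
#endif
    return enabled;
}

inline void* GameAlloc(std::size_t size) {
#ifdef _WIN32
    if (UseSystemHeap())
        return HeapAlloc(GetProcessHeap(), 0, size);  // visible to Page Heap
#endif
    return CustomHeapAlloc(size);
}

inline void GameFree(void* p) {
    if (p == nullptr)
        return;
#ifdef _WIN32
    if (UseSystemHeap()) {
        HeapFree(GetProcessHeap(), 0, p);
        return;
    }
#endif
    CustomHeapFree(p);
}
```

Making the decision on the first allocation and never changing it matters: every block must be freed by the same heap that allocated it.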

It’s worth pointing out that many other components within your game may already be using the Windows heap. D3D, for instance, does a lot of heap allocations which will be redirected to Page Heap by App Verifier.

Technical details

After installing App Verifier just run it, add your executable name to the list, and click Save. Don’t forget to click Save after any changes that you make. That’s it. Your game will now be stress tested any time it runs on that machine. Don’t forget to clear the list and hit Save when you are done or you may find your game (or tool) will be running noticeably slower. The settings are stored in the registry (HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options) and continue to be in force even when App Verifier isn’t running – think of the potential for practical jokes!

In the App Verifier window right-click on Basics->Heaps to edit the Page Heap settings, and check View->Property Window to see descriptions of the settings. You can specify what allocations go to Page Heap, put allocations at the beginning of pages in order to watch for buffer underruns, and configure other settings.

You should probably uncheck “Leak” since otherwise memory leaks will be considered a fatal error, which is a bit too dramatic for my tastes.

App Verifier prints debug output that explains some of the problems that it detects, and attaching a debugger after it finds a bug means you will miss this valuable information. App Verifier assumes that you will be using WinDbg, but it’s okay to use Visual Studio. App Verifier also prints a message at process startup to let you know that Page Heap is enabled.

The Visual C++ debug CRT puts padding around allocations in debug builds. This makes Page Heap less effective, so you should prefer using Page Heap with the release CRT.

App Verifier tries to ‘add value’ to access violations by catching them with an exception handler and printing out helpful information. That’s redonkulous! All this does is complicate the diagnosis by putting you six levels deeper in the stack. You can disable this on a per-solution basis by going to Debug->Exceptions->Win32 Exceptions->Access violation and checking the ‘Thrown’ box so that your game halts on the offending instruction.

Getting App Verifier

App Verifier and WinDbg are both available for free as part of the “Microsoft Windows SDK for Windows 7 and .Net Framework 4” (get the Windows 8 or 8.1 SDK now) – don’t you love Microsoft product names? You should already have the Windows SDK installed for xperf and /analyze and source indexing. How many reasons do you need to install this thing?

See this post for details on getting the Windows SDK.

Summary

App Verifier and Page Heap are free goodness
Access violation address that ends near 0x000? Buffer overrun.
Access violation address that ends near 0xFFF? Use after free.
Access violation address of zero? Page Heap induced address space exhaustion (port to 64-bit or configure the Page Heap settings to avoid this)
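These rules of thumb are simple enough to encode, for instance in a crash-triage script. The sketch below is a heuristic only: the 64-byte "near" threshold and all of the names are arbitrary choices made for illustration, and it assumes 4-KB pages and an access violation that occurred with Page Heap enabled.

```cpp
#include <cassert>
#include <cstdint>

enum class LikelyCause {
    AddressSpaceExhaustion,  // faulting address of zero
    BufferOverrun,           // address just past the start of a page
    UseAfterFree,            // address just before the end of a page
    Unknown
};

// Heuristic classification of a faulting address under Page Heap.
inline LikelyCause ClassifyFaultAddress(std::uint64_t address) {
    constexpr std::uint64_t kPageSize = 0x1000;  // assumed 4-KB pages
    constexpr std::uint64_t kNear = 64;          // arbitrary threshold
    if (address == 0)
        return LikelyCause::AddressSpaceExhaustion;
    const std::uint64_t offset = address & (kPageSize - 1);
    if (offset < kNear)
        return LikelyCause::BufferOverrun;
    if (offset >= kPageSize - kNear)
        return LikelyCause::UseAfterFree;
    return LikelyCause::Unknown;
}
```

The two addresses from the examples earlier in the post classify as expected: 0x06D8D000 as a buffer overrun and 0x06DECFF8 as a use after free.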

I’ve found dozens of serious bugs using App Verifier: buffer overruns, use after free, invalid handle usage caused by race conditions, and more. Our build machines now use App Verifier on some of the nightly unit tests and it continues to protect us and save us from wasting time.

P.S. The day after posting this on AltDevBlogADay I found a long-standing read-overrun bug in some decade old code, thus showing the value of having App Verifier on as you exercise all code paths.

Updates, June 2013

We exclusively use the RandRate feature of App Verifier to put a portion of allocations on individual pages without running out of address space. We use percentages varying from 8% to 15%. Because not all allocations are going to pageheap it can take a few runs to flush out bugs, but it actually works really well.
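A back-of-the-envelope model shows why a few runs are usually enough. It assumes that each run independently puts the offending allocation on its own page with probability equal to the RandRate percentage, and that the bug is always hit once that happens; both are simplifications, and the function name is made up.

```cpp
#include <cassert>
#include <cmath>

// Probability that a particular allocation lands in Page Heap in at least
// one of 'runs' runs, given a per-run probability of 'rate' (0.0 to 1.0).
inline double ChanceOfCatching(double rate, int runs) {
    assert(rate >= 0.0 && rate <= 1.0);
    assert(runs >= 0);
    return 1.0 - std::pow(1.0 - rate, runs);
}
```

Under these assumptions ten runs at 8% catch a given bad allocation roughly 57% of the time, and ten runs at 15% roughly 80% of the time.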

We use batch files to enable/disable App Verifier. These batch files are customized with the appropriate settings for each game. The settings are created in the App Verifier UI and then saved using regedit. Look in HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\yourgame.exe. Note that you still need to install App Verifier before enabling these settings, but you no longer need to use the UI.

We have our games set up to automatically detect when App Verifier is enabled (by looking in the registry) so that they can automatically switch to using the system heap.
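One way such a check could look is sketched below. The post does not say which registry value the games inspect, so this is an assumption: it reads `GlobalFlag` under the executable's Image File Execution Options key and tests the application verifier bit (0x100), and it assumes the value is stored as a DWORD. The helper names are hypothetical; `RegGetValueA` is the real Win32 call (link with advapi32), and the non-Windows branch exists only so that the sketch compiles elsewhere.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// The "enable application verifier" bit in the GlobalFlag value.
constexpr std::uint32_t kFlgApplicationVerifier = 0x00000100;

inline bool GlobalFlagEnablesVerifier(std::uint32_t globalFlag) {
    return (globalFlag & kFlgApplicationVerifier) != 0;
}

// Builds the per-executable subkey under HKEY_LOCAL_MACHINE.
inline std::string IfeoSubkey(const std::string& exeName) {
    return "SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion\\"
           "Image File Execution Options\\" + exeName;
}

// Returns true if the registry says App Verifier is on for exeName.
// Assumption: GlobalFlag is a REG_DWORD; any other type is treated as "off".
inline bool IsAppVerifierEnabled(const std::string& exeName) {
#ifdef _WIN32
    DWORD flag = 0;
    DWORD size = sizeof(flag);
    const LSTATUS status = RegGetValueA(
        HKEY_LOCAL_MACHINE, IfeoSubkey(exeName).c_str(), "GlobalFlag",
        RRF_RT_REG_DWORD, nullptr, &flag, &size);
    return status == ERROR_SUCCESS && GlobalFlagEnablesVerifier(flag);
#else
    (void)exeName;
    return false;  // no registry, no App Verifier
#endif
}
```

A game could call `IsAppVerifierEnabled("yourgame.exe")` before its first allocation and use the result to decide whether to route allocations to the system heap.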

In one case we shared our batch files for enabling App Verifier with a game server operator in order to get better data on a crash that he was seeing.

I wish that App Verifier would record more call stacks. It is brilliant when App Verifier finds a use-after-free, but when it can’t display the call stack that freed the memory it makes me sad.

Update, October 2018

The undocumented ‘Cuzz’ checkbox in the App Verifier settings apparently enables randomization of some aspects of the OS scheduler, which can help to flush out race condition bugs.

App Verifier defaults to creating log files to track each run. This makes launching of processes about three times slower initially, and arbitrarily slower over time due to a linear search to find the next unused log-file name. These log files can also grow to consume many GB of disk space. You should disable App Verifier’s log files with this command:

18 Responses to Increased Reliability Through More Crashes

Ted Mielczarek (@TedMielczarek) says:

January 2, 2012 at 5:28 pm

A lot of these sound like the kinds of errors that Valgrind finds pretty easily. Have you ever tried Purify or any similar tools? I haven’t heard of any of my coworkers having great success with Purify. I know some of them have run Firefox under Valgrind via Wine, which is crazy but apparently works and produces useful data. (It probably helps when you have Valgrind developers on staff.)

- brucedawson says:
  
  January 2, 2012 at 6:32 pm
  
  I tried Purify years ago and had good results with it. I should probably try it again, although it’s not practical for many developers because of the price. I would like to use Valgrind. We sometimes use it on the Linux or MacOS versions of our software.
  
louiz’ says:

July 6, 2012 at 5:46 pm

Since Valgrind was mentioned as a Linux “equivalent”, I think some people might be interested in the libefence, which also provides a way to help you detect buffer overruns and use-after-free errors.
http://linux.die.net/man/3/libefence (also available on some other UNIX-like, I used it on freebsd as well).
You just need to link your program with the library, and efence will surround your memory allocations with inaccessible memory pages which will trigger a segmentation fault when used.

In the hope that it will help someone looking for these features under something other than Windows.

Ted Mielczarek (@TedMielczarek) says:

July 24, 2012 at 8:03 am

Related: there’s also Address Sanitizer nowadays: http://code.google.com/p/address-sanitizer/ which does some similar checking, but runs way faster than Valgrind. The only limitation is that your code has to compile with clang, since ASan is part of LLVM. We’ve got Firefox builds working with ASan on Linux/Mac, but not on Windows yet.

Pete says:

June 15, 2013 at 12:31 am

I’ve long desired a valgrind equivalent on windows, but was never able to rationalize the cost of Purify. But recently ran across Dr Memory, which has worked pretty well so far: https://code.google.com/p/drmemory/ — only downside is that it is 32-bit only.

Michael says:

September 15, 2014 at 10:42 am

Do you still use Application Verifier 4.1, or are you using 6.x? It looks like Microsoft has stopped shipping the private symbols that make the !avrf windbg extension work for 6.x.

- brucedawson says:
  
  September 16, 2014 at 2:38 pm
  
  We use 6.3. I’ve never used the !avrf extension — we mostly just use pageheap (plus a bit of handle verification and miscellaneous other stuff).
  
Alexander Riccio says:

June 15, 2015 at 12:07 am

This is the 87th time I’ve linked to this article, and I’ve *just* noticed the John Carmack (“you are making a mistake”@GDC) reference.

+10 points for you.

ericlaw says:

May 17, 2018 at 12:59 pm

Typo: “If it doesn’t give an answer within ten to twenty seconds then it probably will and ctrl+break is your friend”

Probably [never] will?

- brucedawson says:
  
  May 17, 2018 at 4:11 pm
  
  Thanks. Fixed! I need you to proof-read all of my articles in a more timely manner.
  
Alexander Riccio says:

February 10, 2021 at 5:12 pm

While I’m toying around with running ruby under application verifier during web dev, I’m thinking, that yeah, maybe they should’ve tried a binary search for the lowest unused log file name. I bet they could do something clever with a std::binary_search just trying to open the filename 🤷

- brucedawson says:
  
  February 11, 2021 at 1:10 am
  
  Using binary search when you don’t know the size of the set is a bit weird. You need to use exponential exploration to find an unused log file name, and then binary search to find the earliest unused log file name, and all that fancy coding is worthless because by the time you finish your search there may have been a dozen other log file names created (if, for example, you’re testing gomacc.exe which may be launched 1,000+ times in parallel)
  
  Simpler options would include storing the last log file name in the registry or using a date/time or just time stamp to order the log files.