In last week’s episode we discussed how 32-bit processes on 64-bit Windows might corrupt the exception state after a crash, and how any process on 64-bit Windows might actually continue running after a crash. Serious stuff.
This week’s installment of “Failing to Fail” is less dramatic, but still important for developers who want robust software, as we cover failure to terminate and failures to record a crash dump.
Update: a technique for handling abort() was added to the post and to the sample code, July 22, 2012.
As a special bonus I also mention how to record crash dumps from all crashing processes on your machine, to make debugging easier than ever before.
What we have here is a failure to terminate
This post was originally posted to altdevblogaday.
Crashes happen. Any program more complicated than “Hello world” probably has some bugs. One measure of professional software development is how you deal with these crashes. What should happen is that the program should save a crash dump and then commit suicide (TerminateProcess() or _exit(), not ExitProcess() or exit()).
What you don’t want is for the doomed process to put up a dialog saying “Hey, I’m a doomed process”. But unfortunately that is what the Visual C++ C Run Time (VC++ CRT) does in some cases.
If you accidentally call a pure virtual function (see the sample code for one possible way this can happen) then the handler for this brings up a dialog. If you’re a developer then you can attach a debugger and get a call stack, but most of the world is not developers. They don’t know what a pure virtual function call is, and they don’t care. Displaying this dialog just slows down the crash recovery process, while confusing your users.
But it’s worse than that. If you have a bevy of exception handlers ready to catch Win32 exceptions (access violations, etc.) then you will be disappointed because they won’t catch pure-call errors, even after someone presses OK. So, your in-house crash-dump recording system is helpless against this bug, which means it takes longer to get it fixed.
Worse yet, if this error happens on a server (I’ve seen it happen) then your headless server now has a hung process that is waiting for someone to click OK. Unit tests will time out eventually, and servers may time out if you have a watchdog, but the whole process is delayed by this dialog.
I wouldn’t be writing about this unless I had a solution to offer. The dialog above is the default behavior, but changing the default is simple enough once you know that you should. All you have to do is call _set_purecall_handler() with a function that intentionally crashes. My preferred implementation does a __debugbreak() followed by TerminateProcess(). If I’m running under the debugger this drops me into it quite neatly, and if I’m not then my unhandled exception filter will catch the exception and write out a minidump. The TerminateProcess() is there to discourage people who catch the exception in the debugger from trying to continue.
See the sample code for a concrete example of setting this up. You can use the menu options to try triggering pure-call errors with and without installing the error handler.
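As a rough illustration of the approach described above, a pure-call handler might look something like the following. This is a minimal sketch, not the code from the sample project: the names PureCallHandler and InstallPureCallHandler and the exit code 1 are assumptions made for illustration, and the MSVC-specific parts are guarded so the file still compiles (as a no-op) with other toolchains.

```cpp
#include <cstdlib>

#ifdef _MSC_VER
#include <windows.h>
#include <intrin.h>

// Hypothetical handler name. The CRT calls this instead of showing its dialog.
static void __cdecl PureCallHandler() {
    // Raise a breakpoint exception. Under a debugger this stops right here;
    // otherwise an unhandled exception filter gets a chance to write a minidump.
    __debugbreak();
    // Only reached if someone continues past the breakpoint in a debugger.
    // The exit code 1 is an arbitrary choice.
    TerminateProcess(GetCurrentProcess(), 1);
}
#endif

// Returns true if a handler was installed, false when not built with MSVC.
bool InstallPureCallHandler() {
#ifdef _MSC_VER
    _set_purecall_handler(PureCallHandler);
    return true;
#else
    return false;
#endif
}
```

Calling InstallPureCallHandler() once near the start of main() should be enough when the CRT is used as a DLL.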
Invalid parameters aren’t technically crashes
The VC++ CRT detects a few types of invalid parameters to CRT functions and it treats them as fatal errors. This includes buffer overflow detection if you use the safer CRT functions (and you haven’t requested truncation), but the simplest way to trigger these checks is with “printf(NULL);”.
No dialog pops up – at least not in release builds – and the process is terminated, but it isn’t terminated through calling your carefully crafted exception handlers. Windows Error Reporting (WER) will be notified of the problem, which is good, but I want these invalid parameters treated like a crash so that my exception handlers get invoked. Luckily there is an easy solution for this problem as well. If you call _set_invalid_parameter_handler() then you can give it the same code (just with a different signature) as for your pure-call handler so that your exception handlers will notice something has gone wrong. And now your programs will be crashier than ever before. Which is a good thing. This technique is also demonstrated in the sample code.
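The invalid-parameter handler can be sketched the same way; only the signature differs. Again the names InvalidParameterHandler and InstallInvalidParameterHandler and the exit code are hypothetical, and the MSVC-only code is guarded so the sketch compiles elsewhere as a no-op. The arguments are left unused because the CRT passes NULL for them unless the debug CRT is in use.

```cpp
#include <cstdint>
#include <cstdlib>

#ifdef _MSC_VER
#include <windows.h>
#include <intrin.h>

// Hypothetical handler name. Same body as a pure-call handler, different signature.
static void __cdecl InvalidParameterHandler(const wchar_t* /*expression*/,
                                            const wchar_t* /*function*/,
                                            const wchar_t* /*file*/,
                                            unsigned int /*line*/,
                                            uintptr_t /*reserved*/) {
    // Turn the invalid parameter into an exception that crash handlers can see.
    __debugbreak();
    // Only reached if a debugger user continues past the breakpoint.
    TerminateProcess(GetCurrentProcess(), 1);
}
#endif

// Returns true if a handler was installed, false when not built with MSVC.
bool InstallInvalidParameterHandler() {
#ifdef _MSC_VER
    _set_invalid_parameter_handler(InvalidParameterHandler);
    return true;
#else
    return false;
#endif
}
```

With this installed, a call such as printf(NULL) should end up in the handler rather than terminating the process behind the back of your exception handlers.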
WER is your friend
Windows Error Reporting (WER) is a handy feature built in to Windows. Most developers know that WER records crash dumps on millions of users’ machines and stores them, and most developers know that it is possible to get access to the crash dumps for your software. This is a fabulous way of finding out where your software is actually crashing on actual customers’ actual machines. There are a few hoops to jump through, but it’s worth getting it set up. However, I have no special knowledge of how to arrange such access so I will say no more.
A lesser known feature of WER is that you can get it to record crashes on your own machines. All you have to do is set a few registry keys. I’m gonna go out on a limb here and say that every C++ developer on Windows should configure this. It’s trivially simple and WER will sometimes catch crashes that your other systems do not. WER is great at catching process startup and shutdown crashes, crashes in processes you forgot to add minidump handling to, and it even records minidumps for pure-virtual function calls and invalid CRT parameters.
The full documentation is available here. If you spend two minutes configuring this (I have the last 30 crashes saved as full dumps in c:\temp\crashdumps) then you will be better able to investigate crashes on your machine, regardless of what process is crashing.
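For reference, the registry values involved live under the LocalDumps key, and most people set them by hand or with a .reg file. The sketch below does the same thing from code, mostly to show which values exist. The struct and function names are hypothetical, the process must be running elevated to write under HKEY_LOCAL_MACHINE, and the Windows calls are guarded so the sketch compiles elsewhere as a no-op.

```cpp
#include <cwchar>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical settings bundle mirroring the three LocalDumps values.
struct LocalDumpSettings {
    const wchar_t* dumpFolder;  // DumpFolder: where WER writes the .dmp files
    unsigned long dumpCount;    // DumpCount: how many dumps to keep
    unsigned long dumpType;     // DumpType: 1 = minidump, 2 = full dump
};

// Writes the settings to the registry. Returns false on failure
// (for example when not elevated) or when not built for Windows.
bool ApplyLocalDumpSettings(const LocalDumpSettings& s) {
#ifdef _WIN32
    HKEY key = nullptr;
    // KEY_WOW64_64KEY asks for the 64-bit registry view, so a 32-bit
    // build is not redirected to the 32-bit view on 64-bit Windows.
    if (RegCreateKeyExW(HKEY_LOCAL_MACHINE,
            L"SOFTWARE\\Microsoft\\Windows\\Windows Error Reporting\\LocalDumps",
            0, nullptr, REG_OPTION_NON_VOLATILE,
            KEY_SET_VALUE | KEY_WOW64_64KEY, nullptr, &key, nullptr) != ERROR_SUCCESS) {
        return false;
    }
    const DWORD folderBytes =
        static_cast<DWORD>((std::wcslen(s.dumpFolder) + 1) * sizeof(wchar_t));
    const DWORD count = s.dumpCount;
    const DWORD type = s.dumpType;
    bool ok = RegSetValueExW(key, L"DumpFolder", 0, REG_EXPAND_SZ,
                  reinterpret_cast<const BYTE*>(s.dumpFolder), folderBytes) == ERROR_SUCCESS;
    ok = ok && RegSetValueExW(key, L"DumpCount", 0, REG_DWORD,
                  reinterpret_cast<const BYTE*>(&count), sizeof(count)) == ERROR_SUCCESS;
    ok = ok && RegSetValueExW(key, L"DumpType", 0, REG_DWORD,
                  reinterpret_cast<const BYTE*>(&type), sizeof(type)) == ERROR_SUCCESS;
    RegCloseKey(key);
    return ok;
#else
    (void)s;
    return false;
#endif
}
```

The configuration mentioned above (the last 30 crashes, saved as full dumps in c:\temp\crashdumps) would correspond to LocalDumpSettings{L"c:\\temp\\crashdumps", 30, 2}.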
Update – one more missed failure type
Stefan Reinalter pointed out that some libraries will handle errors by calling abort(), and this can be another way for a process to fail without your crash handler being called. He also supplied the fix, which is to call signal(SIGABRT, &AbortHandler); to install a handler which will be called if abort() is called. signal() can also be used to install handlers for other types of failures.
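A sketch of what such an AbortHandler could look like is shown here. The handler body follows the same deliberate-crash pattern as the other handlers; InstallAbortHandler is a hypothetical helper name, and the non-MSVC branch (an immediate std::_Exit) is an assumption added only so the sketch builds on other toolchains.

```cpp
#include <csignal>
#include <cstdlib>

#ifdef _MSC_VER
#include <windows.h>
#include <intrin.h>
#endif

// Called when abort() raises SIGABRT.
extern "C" void AbortHandler(int /*signalNumber*/) {
#ifdef _MSC_VER
    // Convert the abort into an exception that crash handlers can record.
    __debugbreak();
    // Only reached if a debugger user continues past the breakpoint.
    TerminateProcess(GetCurrentProcess(), 1);
#else
    // Assumed fallback for other toolchains: exit at once, skipping atexit handlers.
    std::_Exit(1);
#endif
}

// Hypothetical helper. Returns true if the handler was installed.
bool InstallAbortHandler() {
    return std::signal(SIGABRT, &AbortHandler) != SIG_ERR;
}
```

Since signal() returns the previously installed handler, it can also be used to check that nothing else has replaced yours.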
It’s not enough to read about this; you have to actually do a tiny bit of coding and registry work to get things crashing smoothly. Here are your tasks.
- Be sure to call _set_purecall_handler, _set_invalid_parameter_handler, and signal. If you use the DLL version of the CRT then once per process is fine. If you use the static-link version of the CRT then you need to call them once for each copy of the CRT – once for each DLL that statically links the CRT. The sample code available here should help.
- Configure the registry to save crash dumps on all of your machines, by following the simple directions here.
- If you haven’t already then be sure to follow the instructions in last week’s post, including configuring VS to halt on first-chance exceptions, calling EnableCrashingOnCrashes(), and using SetUnhandledExceptionFilter() to catch crashes.
- Set up a system for recording and uploading minidumps, using MiniDumpWriteDump or Breakpad or the Steamworks APIs.
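For the MiniDumpWriteDump route in the last task, a bare-bones starting point might look like the sketch below. It is an illustration under several assumptions rather than a production design: the names WriteDumpFilter and InstallDumpWriter and the fixed file name crash.dmp are hypothetical, there is no upload step, and writing the dump from inside the crashing process is less robust than doing it from a separate process. The Windows code is guarded so the sketch compiles elsewhere as a no-op.

```cpp
#ifdef _MSC_VER
#include <windows.h>
#include <dbghelp.h>
#pragma comment(lib, "dbghelp.lib")

// Hypothetical filter: writes a small minidump, then ends the process.
static LONG WINAPI WriteDumpFilter(EXCEPTION_POINTERS* exceptionInfo) {
    // "crash.dmp" in the current directory is a placeholder location.
    HANDLE file = CreateFileW(L"crash.dmp", GENERIC_WRITE, 0, nullptr,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file != INVALID_HANDLE_VALUE) {
        MINIDUMP_EXCEPTION_INFORMATION info = {};
        info.ThreadId = GetCurrentThreadId();
        info.ExceptionPointers = exceptionInfo;
        info.ClientPointers = FALSE;
        // MiniDumpNormal keeps the dump small: stacks and basic module info.
        MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), file,
                          MiniDumpNormal, &info, nullptr, nullptr);
        CloseHandle(file);
    }
    // Terminate rather than exit, so no more code runs in the doomed process.
    TerminateProcess(GetCurrentProcess(), 1);
    return EXCEPTION_EXECUTE_HANDLER;
}
#endif

// Returns true if the filter was installed, false when not built with MSVC.
bool InstallDumpWriter() {
#ifdef _MSC_VER
    SetUnhandledExceptionFilter(WriteDumpFilter);
    return true;
#else
    return false;
#endif
}
```

The breakpoint exceptions raised by the pure-call, invalid-parameter, and abort handlers would reach this filter when no debugger is attached.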
That’s it. Good luck with the goal of more stable software through crashing vigorously.