It was a fairly straightforward bug. A wide-character string function was called with a byte count instead of a character count, leading to a buffer overrun. After finding the problem the fix was as simple as changing sizeof to _countof. Easy.
But bugs like this waste time. A playtest was cancelled because of the crashes, and because the buffer-overrun had trashed the stack it was not trivial to find the bad code. I knew that this type of bug was avoidable, and I knew that there was a lot of work to be done.
The work that I did included:
Diagnose it earlier
If the program had crashed inside the function that had made the bad wide-character function call then finding the bug would have been trivial – code inspection would have quickly revealed it. But as soon as execution returned from that function the trashed stack obscured the location of the bug, making its investigation trickier.
It turns out, there’s a VC++ compiler switch to prevent returning after trashing the stack. So, the first thing I did was to turn on that switch, /GS. This switch tells VC++ to add a canary on the stack and check it before returning. With this switch turned on the crash was caught in the buggy function and locating it took seconds.
The /GS switch is intended as a security feature, to protect against malicious buffer overruns, but it also works well as a developer productivity tool. It does have some runtime cost, but the tradeoff is usually worth it, especially on internal builds. Recommended.
Cure the bug
Once I’d turned on /GS and reproed the bug it was trivial to find the bug so the next step was to fix it, as described earlier.
Don’t destroy the evidence
When the buffer overrun trashed the stack the buggy function returned to a garbage address, which happened to be in the heap. Wiping out evidence of the bug’s location was bad enough, but it was made worse because after returning to an address in the heap the game executed the data in the heap as instructions. This scrambled the registers and the stack and generally confused things.
And also, an executable heap? Really? That is a security hole of the first order. Therefore the next order of business was to change the linker settings to add /NXCOMPAT. This tells Windows to make the heap and the stack non-executable. This significantly improves security and it can also simplify debugging. And, this option has no run-time cost. Recommended. Actually, this should be considered required.
While I was in there I also turned on the /DYNAMICBASE linker switch in our release branches to further increase security, also with no run-time cost.
Avoid crashes from the future
At this point I’d fixed the bug, made future bugs of this type much easier to investigate, and improved security. But there was still much left to do. It turns out that this type of mistake is easy to make. When a developer passes a buffer and a size there are at least a half dozen different ways of passing the wrong size. The best way to avoid these bugs in the future is to avoid the need to pass the size. It is dead easy to create template functions that take an array as a parameter and infer the size with 100% accuracy. Instead of writing this:
mywprintf(buffer, _countof(buffer), …); // Verbose and dangerous
you can then write this:
mywprintf_safe(buffer, …); // Compact and safe
It’s less typing, less reading, and it’s guaranteed correct. The syntax for a templated function that takes an array reference is a bit gnarly, but you only need to get it right in a few places. I covered this technique in a previous blog post.
The template technique doesn’t work if you have a raw pointer, but for any other target you should be able to create an overload to handle it. Manually passing the size should be rare.
Therefore my next task was to add template overrides for all of our string functions, and encourage everyone to use them in all new code, thus making it trivial to avoid ever writing this type of bug again.
Avoid crashes from the past
While the safe template functions would let us avoid writing this type of bug in the future they did nothing about the millions of lines of existing code. It was safe to assume that there were more size mismatches waiting to bite us. So I added SAL annotations to our string functions, fired up Visual Studio’s /analyze, and started compiling. This was, by far, the biggest task. Any large code base that has not had static analysis run on it will have lots of detectable bugs. Buffer overruns, logic errors, format-string mismatches, use-after-free, and many more. I ended up running /analyze on five major projects, fixing thousands of bugs, and setting up build machines to report new problems. This was a few months of work spread out over several years, but time well worth spending. It’s still finding new coding errors today. I discussed this experience in a previous blog post.
Back to the future
/GS, /NXCOMPAT, /DYNAMICBASE, /analyze and all its associated labors – plus actually fixing the bug – took a lot of time, but it was definitely worth it. Most of these changes were trivial and had huge payoffs. Running /analyze was by far the biggest task but, like other smart programmers, I am convinced that it was invaluable. Entire classes of bugs – serious crashing bugs and crazy logic errors – that use to show up quite frequently are now entirely extinct. It is not often that you get to entirely eradicate dozens of types of bugs and doing this definitely increased developer productivity and product reliability.
Of course, crashes that no longer happen are invisible and it is impossible to know whether this work prevented a serious black swan event. The invisibility of the benefits can make it difficult to get recognition for this type of preventative bug fixing, so keep that in mind.
Simple bugs shouldn’t always trigger months of work, but it’s important to recognize when they should.