64-bit Windows 7 SP1 has a stack corruption bug that affects developers. Any developer with an AVX-capable processor who is writing 32-bit code on 64-bit Windows 7 SP1 is vulnerable. That sounds like a lot of conditions but I could summarize it by saying that most developers are vulnerable to this bug.
This bug corrupts the stack when you are debugging a 32-bit program and it crashes, leaving you with a garbage call stack that doesn’t even show where the crash happened. It has been reported on here twice before.
A hot fix is available now. If you are reading this then you probably need it.
Update: it appears that the fix was simultaneously rolled into a security update, thus making the hot-fix unnecessary. If you have KB2859537 (part of the August 2013 set of security patches) then you should have the fix and don’t need the hot-fix. Odd, but great!
The bug is easy to reproduce. Create a Win32 console project, 32-bit, and paste in this code:
int main(int argc, char* argv[])
{
    char* p = 0;
    p[0] = 0;
    return 0;
}
Debug or release, it doesn’t matter. Run it under the VS debugger. What you want to see when it crashes is this call stack:
What you actually get is this:
That’s a pretty broken call stack. The function that crashed is missing. Think about that. The location of the crash has been lost! That means that this trivial bug (can you see the mistake in my code?) is now challenging to find.
I initially wrote about this bug (along with another crash-related peculiarity in 64-bit Windows) in July 2012. I wrote about this bug a second time in March 2013, asking whether Microsoft should fix it.
This bug has been known for well over a year, but a fix is now available. There are several steps to the process of installing the hot fix, but I think the result makes it well worth it.
And there was much rejoicing, and some people just finding out the cause of their woes.
Should I install the fix?
Are you a C++ programmer, working on 64-bit Windows 7, debugging 32-bit programs, with a recent processor that has AVX support? If you answered yes to this question then hell-yeah you should install this patch. I installed it the morning it came out. I’ve had no problems with it and I’ve advised all of my coworkers to install it.
If you are running Windows 8, or Windows Vista, or 32-bit Windows, or you don’t have an AVX capable processor, then don’t bother.
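If you aren’t sure whether your machine counts as “AVX capable”, a check along these lines can tell you. This is a minimal sketch, not anything from Microsoft’s fix: the function and constant names (AvxEnabled, AvxEnabledOnThisMachine, kOsXsaveBit, kAvxBit, kXcr0AvxState) are made up for illustration. It follows the usual AVX detection recipe: CPUID leaf 1 must report both AVX and OSXSAVE, and XCR0 must show that the OS saves the XMM and YMM registers. I would expect it to report false when the xsavedisable workaround is active, but that is an assumption rather than something verified here.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>      // __get_cpuid
#endif

// CPUID leaf 1, ECX feature bits.
constexpr std::uint32_t kOsXsaveBit = 1u << 27;  // OS has enabled XSAVE/XGETBV
constexpr std::uint32_t kAvxBit     = 1u << 28;  // CPU implements AVX
// XCR0 bits 1 and 2: the OS saves and restores XMM and YMM state.
constexpr std::uint64_t kXcr0AvxState = 0x6;

// Pure decision logic, kept separate from the hardware queries so that it
// can be checked with fixed register values.
constexpr bool AvxEnabled(std::uint32_t cpuid1Ecx, std::uint64_t xcr0)
{
    return (cpuid1Ecx & kOsXsaveBit) != 0 &&
           (cpuid1Ecx & kAvxBit) != 0 &&
           (xcr0 & kXcr0AvxState) == kXcr0AvxState;
}

// Queries the current machine. Returns false on non-x86 targets or
// compilers that this sketch does not know how to handle.
inline bool AvxEnabledOnThisMachine()
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = {};  // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    const std::uint32_t ecx = static_cast<std::uint32_t>(regs[2]);
    // XGETBV faults if OSXSAVE is clear, so only read XCR0 when it is set.
    const std::uint64_t xcr0 = (ecx & kOsXsaveBit) ? _xgetbv(0) : 0;
    return AvxEnabled(ecx, xcr0);
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    std::uint64_t xcr0 = 0;
    if (ecx & kOsXsaveBit) {
        unsigned lo = 0, hi = 0;
        __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
        xcr0 = (static_cast<std::uint64_t>(hi) << 32) | lo;
    }
    return AvxEnabled(ecx, xcr0);
#else
    return false;
#endif
}
```

Call AvxEnabledOnThisMachine() from any small test program and print the result. A true result only tells you that the AVX condition is met. You still have to be on 64-bit Windows 7 SP1, debugging 32-bit code, for the bug to apply.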
Prior to the availability of this fix the best workaround was to disable AVX support in Windows. This was done by running “bcdedit /set xsavedisable 1” from an administrator command prompt. This workaround is still recommended if you can’t install the hot fix for some reason. If you previously applied this workaround and you want to remove it just run this command from an administrator command prompt:
bcdedit /deletevalue xsavedisable
You can see the state of the xsavedisable flag by running bcdedit with no parameters from an administrator command prompt. If xsavedisable is not listed or is set to zero then the workaround is not in place.
Why so long?
I think there are a few reasons why it took a while for this bug to get fixed. Developers who hit this bug had no idea what the problem was and therefore no idea to whom they should report it. Many developers actually got used to this broken behavior quite quickly and forgot that call stacks on crashes used to work. Unlike application crashes there is no instrumentation that automatically counts occurrences of this bug, so Microsoft had no visibility into the severity of the problem. And most people at Microsoft upgraded to Windows 8, where this bug is fixed, so they never saw it. Relying on measurements to decide what bugs to fix is smart, but you also have to consider what to do for bugs that your measurements don’t detect.
In this case the appropriate calculation would be to estimate the number of developers affected by the bug (I’m gonna ballpark that at ‘millions’) and then multiply by the severity of the bug (‘bloody annoying’) and that gives you an overall impact of ‘millions of bloody annoyed developers, developers, developers’. See? Math is easy. Then you just compare that to the risk and make a decision. But if you wait for complaints or other measurements then you may underestimate the seriousness of bugs like this.
A hot fix isn’t ideal because many developers who are affected by this bug won’t know to install it, but it’s a good start. Maybe it will get rolled into a service pack or a Windows Update at some point. If you think that’s a good idea then be sure to mention that to your Microsoft contacts.
The reddit discussion can be found here.
Thanks for that info. I had already given up expecting a fix for this bug. My coworkers and I were all affected by it. Luckily we are writing portable code and could resolve most issues using Linux and GDB.
Do you know if this affects walking the stack with dbghelp.dll or generating minidumps? If this will fix our self-generated crash reports, we should start installing this hotfix on our systems that still run 32-bit applications.
Our experience has been that this does not affect the generation of crash dumps. It’s not that dbghelp.dll is immune to the bug in any way (once the stack is corrupted, information is lost and cannot be retrieved); it’s that crash dumps are usually generated before the stack corruption occurs.
The order of events, as I understand it, is: a crash happens, the first chance exception handler is called, in many products this triggers code that saves a minidump, then some Wow64 debugging code runs and corrupts the stack, and then the debugger gains control. I’m not clear on the details but our observations suggest something like this.
So, the hot-fix should only be needed on developer machines.
It almost feels like it’s too late. While we are still on Windows 7, the amount of 32-bit debugging that we do now is minimal.
We have 64-bit versions of much of our code, but we still have enough customers running 32-bit Windows that 32-bit code is what we ship, so we mostly debug 32-bit code.
But yeah, it is pretty late.
Interesting. I got used to crashes due to null pointer dereferences requiring careful stepping and printf-like statements to fix. This may be the end of that!
Thank you so much for this! I’ve spent far too many cycles trying to figure out how _unlock() could call Lua’s garbage collector…
I’m wondering… does this happen only if you attach a debugger after the app has already crashed? I’ve seen this quite a lot, but not always. I thought that I didn’t see it when running directly from the debugger, but I might be wrong. I didn’t know what actually caused it, so I adapted. I guess that, over the years, I learned that when there’s a “noncritical” bug like that in MS code there’s not much hope that it will ever be fixed, so I’d better just learn to live with it. 🙂
It also happens if you start the application under the debugger. I think that if the first-chance exception handler runs then the stack is corrupted.
Microsoft *will* fix bugs, but sometimes you have to insist. A few people worked hard to convince Microsoft to fix this one.
They checked the fix into the GDR branch so that KB2859537 aka MS13-063 includes this fix already.
Could you interpret your comment for those of us who aren’t experienced in the details of Windows patching? That is, can you please decode GDR, and make explicit the implications of this fix being in KB2859537 instead of just in a hot-fix?
It sounds like you are saying that the fix will be pushed as part of Windows Update now, but my contacts have mentioned nothing about that.
Thanks. I checked on my laptop and found that KB2859537 was installed on August 13th as part of patch Tuesday and, as you said, seems to have fixed the bug, without my needing to install the hot fix. This is quite different from what my Microsoft contacts told me, but is great news. This means that virtually all developers will get the fix, and the bug will just “go away”.