Analyzing a Confusing Crash–Stack Walks Gone Bad

Part of my job always seems to include crash analysis. A program crashes on a customer’s machine, a minidump is uploaded to the cloud, and it might be my desk that it appears on when Monday morning rolls around. The expectation is that I can make sense of it so that we can ship more reliable software.

The Windows ecosystem makes this as easy as possible because the debuggers will automatically download the binaries and the debug symbols, and if source indexing has been applied then they will also automatically download the correct source files as needed.

Chrome has a public symbol server and uses source indexing so debugging Chrome is accessible for any developer.

But even with all of this information automatically appearing the crash analysis can be tricky. I recently came across a crash that showed some interesting quirks which I want to share. I used a mixture of windbg and Visual Studio for this crash analysis and they were completely quirk-for-quirk compatible.

Update, September 2019: VS 2019 seems to have fixed the annoying off-by-one bug. windbg is as confusing as ever.

I’ll start with an (edited for clarity and security) call stack from windbg:

0:000> k
# ChildEBP RetAddr
00 04daeff0 12243227 `anonymous namespace’::InvalidParameter
01 04daf014 12243784 _invalid_parameter
02 04daf030 116cd82f _invalid_parameter_noinfo_noreturn
03 (Inline) ——– std::_Deallocate
04 (Inline) ——– std::allocator<float, …>::deallocate
05 04daf03c 1148d3f9 std::vector<float, …>::_Tidy
06 (Inline) ——– std::vector<unsigned int, …>::~vector
07 04daf040 11481b34 FooBar::~Foobar

Explicable inlining

We have a FooBar destructor that calls a vector<unsigned int> destructor and then, ultimately, an invalid parameter handler kills the process.

One nice thing with recent toolchains is that the call stacks shown by both debuggers don’t just include the functions associated with the stack frames found by the debugger – they also include functions that were inlined, and that therefore share a stack frame. This can make understanding the call stack much easier. For instance, if you just look at the stack frames and the call instructions you would see that std::vector<>::_Tidy makes a call to invalid_parameter_noinfo_noreturn, but its source shows no such call. However once the inlining is shown it becomes clear how the call appears – the invalid parameter function is called by _Deallocate after two levels of inlining.

I Comprehend Further

The first mystery comes from looking at the std:: functions in frames 4 to 6. Frame 6 is the destructor for a vector of unsigned ints, which is correct. The vector destructor calls _Tidy which calls deallocate, which is fine, except that those functions are templated by float, not by unsigned int. And that makes no sense –a vector templated by one type can’t call functions templated by an unrelated type.

Origami-crane.jpgThe answer is ICF – Identical COMDAT Folding. If you compile with /Gy and link with /OPT:ICF then two functions that compile down to the same machine code will be merged in order to save space. In many cases the code for vector<unsigned> and vector<float> will be the same because the types are the same size, and they will be merged. You now have two different functions with the same address. The debugger finds multiple function names for the same address and it has no way to know which one it would be most sensible to display. so it often gets it wrong. Understanding and watching for this is crucial when looking at optimized code.

Aside: inconsistent selection of symbols can lead to inconsistencies in automated crash analysis – breakpad is working on that.

Off by one handled badly

The second mystery is the weirdest one and it happens when trying to figure out why the invalid parameter function was called. When I navigated to stack frames 3, 4, or 5 (_Deallocate, deallocate, or _Tidy) and viewed the assembly language then the debuggers got very confused, and unhelpful. This is what they displayed:

Visual Studio

116CD82D ??        ?? ??
116CD82E ??        ?? ??
116CD82F 00 55 89  add     byte ptr [ebp-77h],dl
116CD832 E5 53     in      eax,53h
116CD834 57        push    edi 

Windbg

No prior disassembly possible
116cd82f 005589    add     byte ptr [ebp-77h],dl
116cd832 e553      in      eax,53h
116cd834 57        push    edi

The instruction at the return address of 116CD82F looks suspicious, the next instruction is definitely bogus, and both debuggers stubbornly refused to show what came before. What is up, and what do we do with it?

To find out what is going on we need to go up the stack a bit farther, to frame 6 or 7 (~vector or ~FooBar) where we find this disassembly:

11481b2f  call std::vector<float, …>::_Tidy (116cd7d8)
11481b34  mov  ecx,esi

The return address is an innocuous mov instruction but it is the call instruction before that we are interested in. It is a call to 116cd7d8 and it is there that we must go, because that instruction is the call to _Tidy, and we want to see the disassembly of _Tidy. If you paste that address (perhaps with an 0x prefix for Visual Studio) into the disassembly window address bar you will see something like this:

  Function std::vector<int,std::allocator<int> >::_Tidy:
116CD7D8 55             push ebp
116CD7D9 89 E5          mov  ebp,esp
116CD7DB 56             push esi
116CD7DC 89 CE          mov  esi,ecx
116CD7DE 8B 06          mov  eax,dword ptr [esi]
116CD7E0 85 C0          test eax,eax
116CD7E2 74 44          je   116CD828h

116CD828 5E             pop  esi
116CD829 5D             pop  ebp
116CD82A C3             ret
116CD82B E8 25 5A B7 00 call _invalid_parameter_noinfo_noreturn
  Function std::list<std::pair<…>::clear:
116CD830 55             push ebp
116CD831 89 E5          mov  ebp,esp
116CD833 53             push ebx
116CD834 57             push edi

Our problematic return address (as shown in windbg call stacks, and in the failed disassembly of both debuggers) is 116cd82f. If you look for that address in the disassembly you will find – no instruction at that address. That address is the last byte of the call instruction at the end of the _Tidy function.

Luckily the root cause is something I have dealt with before – I even fixed it in breakpad years ago (reviewed here).

It turns out that a common trick when displaying an x86/x64 call stack is to subtract one from return addresses before looking them up in the symbol tables. The return address is the instruction after the function call which could be from an arbitrarily different line of code (thanks to optimizers), but subtracting one from the return address gets an address that is guaranteed to be inside the call instruction, and therefore will let the debugger show the line number of the call instead of the return. This is such a clever and seamless trick that we normally don’t even notice it is happening – until it fails.

imageIn the case of breakpad the problem was that subtracted addresses were printed out as part of the stack walk report. I would look at those addresses and get confused, so I fixed the stack walker to do symbol lookups with decremented addresses, but always print out the original return addresses – simple enough.

Visual Studio and windbg appear to do a similar thing where they decrement the address before looking up symbols but then print the original address in windbg stacks. Most crucially both debuggers go to the unmodified address when walking the stack in assembly language.

Except for this one time. This time the debuggers showed a decremented return address that points to the middle of an instruction and this confused themselves so much that they couldn’t display anything reasonable in the disassembly window. Why did both debuggers mess up in this case?

I don’t have the debugger’s source code at hand but it seems fairly clear what is going on. Both debuggers clearly use the trick where they subtract one from the return address before doing symbol lookup. They normally add one back to the address before displaying or using it when stack walking, but in this case that would have pushed the return address into a different function, because this was a bizarre (and usually impossible) example of a call instruction as the last instruction of a function.

Personally I think they should have just used the correct return address. The confusion caused by having it in the wrong function cannot possibly be greater than the confusion caused by not showing disassembly at all.

I also think that refusing to show prior disassembly, with no explanation whatsoever, is a bizarre decision. I’ve seen this unwillingness to show disassembly a few times in the past and I wonder if it was sometimes this same issue.

Another issue that can prevent disassembly from being displayed is if the code bytes actually aren’t available, as in this case. That’s what happens if you don’t push your binaries to your symbol server as well as your symbols.

Resolution

The ultimate cause of this crash is, well, I’m not sure. All this analysis makes it clear that the vector got corrupted somehow, but it’s not clear how. The minidumps that we get from customers only contain the contents of the stack (keeping them to less than a MB most of the time), and the object being destroyed was on the heap, so we don’t even know what type of corruption occurred. For now I’m going to land a change to record a bit more data in the FooBar destructor in hopes of figuring out what is going on.

I uploaded a repro (source code, build.bat, .exe, .pdb, and .dmp file) of the problematic stack walking to my blog. Just load the .dmp file into windbg and go up the stack to the Parent function – not that the disassembly window refuses to show disassembly (you had one job…)

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048
This entry was posted in Debugging and tagged , . Bookmark the permalink.

8 Responses to Analyzing a Confusing Crash–Stack Walks Gone Bad

  1. JCG says:

    Hmm, that was a bit anti-climactic.

    Hopefully will be a follow-up at some point!

  2. “For now I’m going to land a change to record a bit more data in the FooBar destructor in hopes of figuring out what is going on.”

    For a moment, I read “in hopes that the additional data padding will avoid the problem” 🙂

  3. Martin Ba says:

    “The ultimate cause of this crash is, well, I’m not sure. All this analysis makes it clear that the vector got corrupted somehow, but it’s not clear how.” … 😀

    That really made me feel better. Must’ve been a dozen times now where I had to conclude the very same after some (similarly involved) Crash Dump Analysis. Good’ to see other people don’t have some magic wand either. 😉

  4. John Payson says:

    Showing disassembly prior to a particular address is difficult on the x86 because there will often be many valid sequences of instructions that could precede any particular address. On something like the ARM Cortex-M3 there won’t be more than two, but on the x86 there could be more than a half-dozen. Perhaps starting disassembly from the previous thing that seems like a symbol and then assuming a contiguous sequence of valid instructions might work, but there’s no guarantee that there won’t be some data between the last exported symbol and a particular piece of code.

    • brucedawson says:

      It’s true that backwards disassembly on x86/x64 has some challenges. But those challenges have, in general, been met. Both debuggers have no trouble disassembling backwards from the *actual* return address in tihs crash. They just get confused when asked to disassemble from the offset-by-one return address.

      Thus, their disassembly failure is an unforced error.

      In most cases the windbg ‘uf’ command can be used to disassemble an entire function. If the windbg disassembly used this as its source of information instead of ub/u then that would also avoid this issue.

  5. Richard Nutman says:

    I was looking at some disassembly in gdb recently to see which vector instructions used less bytes, and it amazingly got all the size of instructions completely wrong! Visual Studio showed the correct sizes. Didn’t know if it was an x64 thing, but gdb said it was in x64 mode.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.