I’ve been a big fan of symbol servers for years. They are a part of the Microsoft/Windows ecosystem that is far better than anything I have seen for other operating systems. With Microsoft’s and Chrome’s symbol servers configured I can download a user’s Chrome crash and start analyzing it immediately – with code bytes and symbols – without knowing or caring what OS or Chrome version that user was running. Since Chrome’s symbols are source-indexed I will even get source-code popping up automatically, and all of this is available to anyone who is interested – no special Google privileges required.
ETW traces record a wealth of information about how a Windows system is behaving. When analyzing a new and unknown problem there is no replacement for loading the trace into WPA and following the clues to a solution. The thrill of the hunt and the creative challenge of finding a visualization that will reveal the root cause never gets old (too nerdy? Sorry – I do enjoy this).
But sometimes you want to extract some commonly found piece of information from multiple traces, and doing this manually is tedious and error prone.
I recently hit some multi-minute delays on my workstation. After investigating I found that the problem was due to a lock being held for five minutes, and during that time the lock-holder was mostly just spinning in a nine-instruction loop.
Coming up with a good title for my blog posts is critical but I immediately realized that the obvious title of “48 processors blocked by nine instructions” was taken already by a post from less than a month earlier. The number of processors blocked is different, and the loop is slightly longer but, really, it’s just deja vu all over again. So, while I will explain the new issue that I found I first want to discuss the question of why this keeps happening.
Why does this happen?
Roughly speaking it’s due to an observation which I’m going to call Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
I’ve recently started using heap snapshots on Windows to track heap allocations. I was able to use heap snapshots to record call stacks for all outstanding allocations in Chrome’s browser process over a full two weeks, letting me account for pretty much every byte of memory consumed.
Since then I have used heap snapshots to find wasteful memory usage in the Windows heap, a memory leak in a security tool injected into Chrome, and many details of Chrome’s memory usage that I was not previously aware of.
I first read about heap snapshots here. This page gives the mechanics of how to record a heap snapshot but it spends very little time explaining what heap snapshots are or how to use them effectively.
I seem to have a habit of writing about super powerful machines whose many cores are laid low by misuse of locks. So. Yeah. It’s that again.
But this one seems particularly impressive. I mean, how often do you have one thread spinning for several seconds in a seven-instruction loop while holding a lock that stops sixty-three other processors from running. That’s just awesome, in a horrible sort of way.
Contrary to popular belief I don’t actually have a machine with 64 logical processors, and I’ve never seen this particular problem. But a friend hit this problem,
nerd-sniped me asked for help, and I decided it was interesting enough to look at. They sent me an ETW trace that contained enough information for me to craft a tweet-for-help which resolved the issue swiftly.