I recently noticed that my Windows 7 home laptop was running slower than usual. It seemed to be spending a lot of time waiting on disk I/O. When I looked at Task Manager it showed that my laptop was caching about one fifth as much disk data as normal, but the cause of this was not obvious.
Part of my job at Valve is to be a performance expert – to investigate both mundane and esoteric performance problems. So when I realized that I had a serious and systemic problem I knuckled down and brought all the tools of my trade to bear, and ultimately I found the culprit.
The story you are about to hear is true. Only the precise values of cached, available, and free memory have been changed to protect the innocent.
Spoiler alert: the Western Digital My Book Live backup drive that I’d purchased from Costco was the culprit.
Even better spoiler alert: Western Digital contacted me nine days after I posted this entry, and took the bug quite seriously. About four months later they published a fix. The story has a happy ending, and you can change the world through blogging.
In order to understand this mystery it is necessary to understand the importance of the Windows disk cache. Having lots of memory in your computer is important to allow running lots of programs without hitting the page file, but also to allow lots of data to be cached in memory. Hard drives, with their spinning physical platters, are among the slowest components in a computer, and laptop drives are the slowest type of hard drives. Windows caches data from the hard drive so that it will be available instantly, instead of requiring many milliseconds for each read. The more data that is cached, the less frequently the disk drive will be needed, and the faster your computer will be. Because of the Windows disk cache there is no such thing as too much memory. My work machine has 32 GiB which means that I typically have over 24 GiB of disk data cached, and this allows me to bounce around between different programs and different projects with minimal disk I/O.
Windows (starting with Vista) has systems in place whose only purpose is to fill up the disk cache with data that its logging shows you are likely to need. SuperFetch watches your usage patterns and knows how to prefetch data when you launch an application. It also prefetches data when your system is idle.
These features were confusing when Windows Vista first came out. Users would complain that Vista was a pig because it used so much memory, because Task Manager would show very little free memory. But it turns out that this is a good thing. Free memory is a total waste. Free memory is not doing anything useful. A well running Windows machine should have free memory near zero, and cached memory (the Windows disk cache) nice and high. Cached memory is fulfilling a useful purpose, it makes your computer run faster, and it can be repurposed (turned into free memory to satisfy allocations) the instant that it is needed.
A healthy machine
When Windows is running well Task Manager (available by right-clicking on the taskbar) should show Physical Memory statistics similar to those shown to the right: lots of cached memory, lots of available memory, and very little free memory. If available memory is low (less than a few thousand MiB) then it is worth purchasing more memory so that Windows will have more space for running programs and caching data. If free memory is high (more than a few hundred MiB) then either Windows has not had time to populate it with useful data or else something is wrong.
An unhealthy machine
On my laptop the performance tab of Task Manager looked more like this:
Looking at the area circled in purple shows that I’ve got lots of available memory, but my disk cache is smaller than expected for a machine with 8 GiB of RAM, and free memory is unusually high. On my 8 GiB laptop I normally have about 4-5 GiB of data cached, but when I noticed this problem my laptop was typically only caching about 1-2 GiB of data.
Looking at the history of memory usage shows why cache memory is low and free memory is high. The area circled in red shows three brief spikes of memory usage. Each of those spikes represent a transient 4 GiB memory allocation, and Windows satisfied the first of those allocations by evicting 4 GiB of useful data from the file cache, leaving it as wasteful freed memory at the end.
Task Manager helped me to understand the structure of the problem, but it was no use at finding the culprit. Its Peak Working Set column showed nothing that could explain the spikes. It was time to invoke my favorite system profiling tool.
Investigating with xperf
I turned to the most powerful performance investigation tool that I know of: xperf (also known as the Windows Performance Toolkit). This is a free suite of tools created by Microsoft that lets you profile all aspects of a Windows machine’s performance, from CPU usage, to disk I/O, to memory allocations, and more. Xperf is not an easy tool to use, but it rewards expert users with almost unlimited insight into the performance of Windows.
I decided to use the new Windows Performance Recorder UI (wprui) to record a trace. I selected all of the options that were related to memory allocations and told it to log to memory. In this mode it continually monitors your system, writing detailed performance information to circular buffers in memory. When you notice a performance problem you click the Save button and it saves the buffers to disk. Thus, it is a retroactive profiler that can at any given time record the last ~30 seconds of system activity. I generally leave in-memory tracing running continuously so that I can always catch rare performance problems, without having to worry about reproducing them. When this screenshot was taken I’d had wprui running for most of a week, but the ‘Time’ field wraps after one day.
Caught in the act
The memory spikes were mostly happening at random times and were easy to miss, so I wasn’t having any luck recording a trace. However I eventually realized that I was getting a memory spike every time that my laptop resumed from standby. That made things much easier. All I had to do was have wprui recording to memory, then suspend and resume my laptop, log in, and click ‘Save’. Done. Then I loaded the trace (into xperfview or wpa) and looked for memory related anomalies. This can be tricky because you often don’t know what you are looking for or where you will find it, but when you look at the right graph it is sometimes quite obvious. Can you see the oh-so-subtle memory spike in the graph below?
So far we are seeing similar data to what Task Manager showed us, but with xperf we can dig deeper. The trace contains call stacks for, among other things, every pool memory allocation. After selecting the appropriate range (from the middle of the spike to the left of it) we can get a summary table. By arranging the summary table columns appropriately, sorting by size, and drilling down into the AIFO type (Allocated Inside the selected time range, Freed Outside), we quickly find the 4,294,967,296 byte allocation that is causing all the pain. And we have a call stack.
My first assumption upon seeing this call stack was that this was a Windows bug. I’d recently installed some Windows updates and I guessed that one of them was causing this bug. I was wrong, but it will take a bit more digging to see that.
The call stack shows that FstubAllocateDiskInformation is making two calls to ExAllocatePoolWithTag, and one of them is for a lot of memory. It’s time to look at FstubAllocateDiskInformation.
The truth is in the code
The simplest way to look at FstubAllocateDiskInformation is to use Local Kernel Debugging. This lets me use windbg to grab a snapshot of the kernel’s memory without actually breaking in to the kernel. Local kernel debugging is limited in that you can’t set breakpoints or step through code, but it lets you view disassembly, and it is simpler than setting up cables and a second machine to allow live kernel debugging. It was good enough.
Obviously the ‘uf’ command is what we need to dump the disassembly of this function. I’ve reformatted the results and put the crucial assembly language below, with a few lines bold faced and a few comments added:
lkd> uf ntkrnlmp!FstubAllocateDiskInformation
mov qword ptr [rsp+8],rbx
mov qword ptr [rsp+10h],rbp
mov qword ptr [rsp+18h],rsi
mov r8d,42747346h ; pool tag of FstB
mov edx,40h ; allocation size
xor ecx,ecx ; flags
mov ecx,dword ptr [rbx+24h]
test ecx,ecx ; compare SectorSize to zero
mov rax,qword ptr [rbx+28h]
mov dword ptr [rbx+8],ecx
mov r8d,42747346h ; pool tag of FstB
mov rdx,rcx ; SectorSize/allocation size
mov ecx,4 ; flags
mov qword ptr [rbx+38h],rax
mov qword ptr [rbx],rbp
A careful reading of the code above isn’t required in order to follow along. The code contains two calls to ExAllocatePoolWithTag, one for 0×40 bytes and one for a variable amount. Both use a pool tag of 42747346h, which is a four-byte code for ‘FstB’, the file system pool. A full analysis required looking at the input parameter types and their layouts, but I also went to the reactos site and looked at their version of the source code. It was quite helpful for confirming my analysis of the assembly language. The crucial lines can be paraphrased as:
if (DiskInformation->SectorSize != 0)
In other words, FstubAllocateDiskInformation asks for the geometry of the disk (using FstubGetDiskGeometry), and then allocates however many bytes the driver says is the sector size.
I briefly stopped to verify the sector size of my main disk:
C:\>fsutil fsinfo ntfsinfo c:
NTFS Volume Serial Number : 0x00b8457cb8457168
Version : 3.1
Number Sectors : 0x0000000048ad8ae2
Total Clusters : 0x000000000915b15c
Free Clusters : 0x00000000016ac4a5
Total Reserved : 0×0000000000000780
Bytes Per Sector : 512
Bytes Per Physical Sector : 512
and then, in a long-delayed flash of insight, I realized what the problem was. A few weeks earlier, around the time the performance problems first started, and around the time of the Windows Updates that I had suspected, I had purchased a backup drive. When I went to use this backup drive it had installed a driver. And this driver was the cause of all of my problems:
I fired up device manager, found the problematic driver, uninstalled it, and without even a reboot my computer was fixed. I could suspend and resume my laptop with no memory spikes and without my disk cache getting purged. Bliss! Now my laptop was acting like a computer with 8 GiB, instead of like a computer with 4 GiB.
I’m not a device driver expert – user-mode is my domain – but I can imagine what must be going on. The device driver appears to be responsible for filling in a DISK_GEOMETRY structure when responding to IOCTL_DISK_GET_DRIVE_GEOMETRY queries. It’s not clear to me what the driver is supposed to do if this information is not available. It could return a failure code, or it could return a structure initialized with zeroes, or it could initialize the fields – including BytesPerSector – with 0xFFFFFFFF. And that’s what this driver does. Since SectorSize is an unsigned value and since pool allocations are rounded up this results in a pool allocation of exactly 4 GiB. This is an easy mistake to make, but the Western Digital developers should have noticed that the driver was causing problems.
The Windows kernel carefully tests for BytesPerSector being equal to zero, but it doesn’t do any other sanity checking to see if BytesPerSector is one of the small number of reasonable values. Thus, the kernel missed an opportunity to avoid this problem.
If you own this device I recommend returning it. However that may not be sufficient. I was hitting this performance problem even when I was miles away from – and not connected to – the backup drive. It may be necessary to run device manager and manually uninstall the driver, but I’m not sure.
I also recommend contacting Western Digital and demanding that they release a fix.
When I found the problem I mostly just wanted to return the drive, but first I made sure I had the latest drivers on my machine and all the latest updates on the NAS, to make sure my results were valid. I then deleted my data and wiped the drive. The Full Factory Restore option securely erases the drive, which is important when returning a drive that had several full backups of my laptop on it. And, of course I uninstalled the drivers, to make sure the problem went away.
Once again xperf let me investigate the mysterious and find the root cause in a truly scientific manner. I want more developers to use evidence-based-techniques to investigate performance problems rather than guessing.
I feel sorry for regular consumers who naively purchased this drive and have little hope of avoiding this problem. If they notice the performance problem they are unlikely to guess that it is correlated to the new backup drive.
I contacted Microsoft through some personal contacts and they have made sure that the appropriate teams know about this issue. I contacted Western Digital through regular support channels on August 29th, 2012
but have not heard back from them yet – they are now aware of the issue because one of their engineers saw this article. Support replied to my request on September 7th, 2012 – nine days later.
If you liked this article then you might like my other forensic performance investigations with xperf:
- Xperf and Visual Studio: The Case of the Breakpoint Hangs
- PowerPoint Poor Performance Problem (and Xperf for Excess CPU Consumption for a more detailed tutorial that investigates the same issue)
- Faster Fractals Through Better Scheduling – investigating and optimizing some Fractal eXtreme inefficiencies
- Xperf Wait Analysis–Finding Idle Time – investigating three second hangs in Visual Studio in tutorial form
- Windows Live Photo Gallery—Poor Performance Peculiarities – investigating and working around brief hangs when zooming in on images
- Fixing another Photo Gallery performance bug – investigating and working around startup hangs
For more information on using xperf to do your own investigations, look at category/xperf on this blog.