I recently noticed that my Windows 7 home laptop was running slower than usual. It seemed to be spending a lot of time waiting on disk I/O. When I looked at Task Manager it showed that my laptop was caching about one fifth as much disk data as normal, but the cause of this was not obvious.
Part of my job is to be a performance expert – to investigate both mundane and esoteric performance problems. So when I realized that I had a serious and systemic problem I knuckled down and brought all the tools of my trade to bear, and ultimately I found the culprit.
The story you are about to hear is true. Only the precise values of cached, available, and free memory have been changed to protect the innocent.
Spoiler alert: the Western Digital My Book Live backup drive that I’d purchased from Costco was the culprit.
Even better spoiler alert: Western Digital contacted me nine days after I posted this entry, and took the bug quite seriously. About four months later they published a fix. The story has a happy ending, and you can change the world through blogging.
Caching counts
In order to understand this mystery it is necessary to understand the importance of the Windows disk cache. Having lots of memory in your computer is important to allow running lots of programs without hitting the page file, but also to allow lots of data to be cached in memory. Hard drives, with their spinning physical platters, are among the slowest components in a computer, and laptop drives are the slowest type of hard drives. Windows caches data from the hard drive so that it will be available instantly, instead of requiring many milliseconds for each read. The more data that is cached, the less frequently the disk drive will be needed, and the faster your computer will be. Because of the Windows disk cache there is no such thing as too much memory. My work machine has 32 GiB which means that I typically have over 24 GiB of disk data cached, and this allows me to bounce around between different programs and different projects with minimal disk I/O.
Windows (starting with Vista) has systems in place whose only purpose is to fill up the disk cache with data that its logging shows you are likely to need. SuperFetch watches your usage patterns and knows how to prefetch data when you launch an application. It also prefetches data when your system is idle.
These features were confusing when Windows Vista first came out. Users would complain that Vista was a memory pig because Task Manager showed very little free memory. But it turns out that this is a good thing. Free memory is a total waste. Free memory is not doing anything useful. A well-running Windows machine should have free memory near zero and cached memory (the Windows disk cache) nice and high. Cached memory is fulfilling a useful purpose: it makes your computer run faster, and it can be repurposed (turned into free memory to satisfy allocations) the instant that it is needed.
A healthy machine
When Windows is running well Task Manager (available by right-clicking on the taskbar) should show Physical Memory statistics similar to those shown to the right: lots of cached memory, lots of available memory, and very little free memory. If available memory is low (less than a few thousand MiB) then it is worth purchasing more memory so that Windows will have more space for running programs and caching data. If free memory is high (more than a few hundred MiB) then either Windows has not had time to populate the disk cache with useful data or else something is wrong.
An unhealthy machine
On my laptop the performance tab of Task Manager looked more like this:
Looking at the area circled in purple shows that I’ve got lots of available memory, but my disk cache is smaller than expected for a machine with 8 GiB of RAM, and free memory is unusually high. On my 8 GiB laptop I normally have about 4-5 GiB of data cached, but when I noticed this problem my laptop was typically only caching about 1-2 GiB of data.
Looking at the history of memory usage shows why cache memory is low and free memory is high. The area circled in red shows three brief spikes of memory usage. Each of those spikes represents a transient 4 GiB memory allocation, and Windows satisfied the first of those allocations by evicting 4 GiB of useful data from the file cache, leaving it behind as wasted free memory once the allocation was released.
Task Manager helped me to understand the structure of the problem, but it was no use at finding the culprit. Its Peak Working Set column showed nothing that could explain the spikes. It was time to invoke my favorite system profiling tool.
Investigating with xperf
I turned to the most powerful performance investigation tool that I know of: ETW tracing (aka xperf) using the Windows Performance Toolkit. This is a free suite of tools created by Microsoft that lets you profile all aspects of a Windows machine’s performance, from CPU usage, to disk I/O, to memory allocations, and more. Xperf is not an easy tool to use, but it rewards expert users with almost unlimited insight into the performance of Windows.
I decided to use the new Windows Performance Recorder UI (wprui) to record a trace. I selected all of the options that were related to memory allocations and told it to log to memory (now I would use UIforETW with VirtualAlloc stacks always checked in the Settings dialog, POOL in Extra kernel flags, and PoolAlloc in Extra kernel stackwalks). In this mode wprui or UIforETW continually monitors your system, writing detailed performance information to circular buffers in memory. When you notice a performance problem you click the Save button and it saves the buffers to disk. Thus, it is a retroactive profiler that can at any given time record the last ~30 seconds of system activity. I generally leave in-memory tracing running continuously so that I can always catch rare performance problems, without having to worry about reproducing them. When this screenshot was taken I’d had wprui running for most of a week, but the ‘Time’ field wraps after one day.
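For those who prefer the command line, roughly equivalent in-memory pool tracing can be started with xperf itself. This sketch is from memory and the exact flags (particularly for saving the buffers) may vary between Windows Performance Toolkit versions, so treat it as a starting point rather than gospel:
C:\>xperf -on PROC_THREAD+LOADER+POOL -stackwalk PoolAlloc -buffering
(reproduce the problem, e.g. suspend and resume the laptop)
C:\>xperf -flush -f pool_spike.etl
C:\>xperf -stop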
Caught in the act
The memory spikes were mostly happening at random times and were easy to miss, so I wasn’t having any luck recording a trace. However I eventually realized that I was getting a memory spike every time that my laptop resumed from standby. That made things much easier. All I had to do was have wprui recording to memory, then suspend and resume my laptop, log in, and click ‘Save’. Done. Then I loaded the trace into WPA (Windows Performance Analyzer) and looked for memory related anomalies. This can be tricky because you often don’t know what you are looking for or where you will find it, but when you look at the right graph it is sometimes quite obvious. Can you see the oh-so-subtle memory spike in the graph below?
So far we are seeing similar data to what Task Manager showed us, but with xperf we can dig deeper. The trace contains call stacks for, among other things, every pool memory allocation. After selecting the appropriate range (from the middle of the spike to the left of it) we can get a summary table. By arranging the summary table columns appropriately, sorting by size, and drilling down into the AIFO type (Allocated Inside the selected time range, Freed Outside), we quickly find the 4,294,967,296 byte allocation that is causing all the pain. And we have a call stack.
My first assumption upon seeing this call stack was that this was a Windows bug. I’d recently installed some Windows updates and I guessed that one of them was causing this bug. I was wrong, but it would take a bit more digging to see that.
The call stack shows that FstubAllocateDiskInformation is making two calls to ExAllocatePoolWithTag, and one of them is for a lot of memory. It’s time to look at FstubAllocateDiskInformation.
The truth is in the code
The simplest way to look at FstubAllocateDiskInformation is to use Local Kernel Debugging. This lets me use windbg to grab a snapshot of the kernel’s memory without actually breaking in to the kernel. Local kernel debugging is limited in that you can’t set breakpoints or step through code, but it lets you view disassembly, and it is simpler than setting up cables and a second machine to allow live kernel debugging. It was good enough.
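If you want to follow along at home then, as best I recall, enabling local kernel debugging is just a matter of turning on debugging in the boot configuration, rebooting, and then starting windbg (elevated) in local kernel mode:
C:\>bcdedit /debug on
(reboot, then from an elevated command prompt)
C:\>windbg -kl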
Obviously the ‘uf’ command is what we need to dump the disassembly of this function. I’ve reformatted the results and put the crucial assembly language below, with a few lines bold faced and a few comments added:
lkd> uf ntkrnlmp!FstubAllocateDiskInformation
nt!FstubAllocateDiskInformation:
mov qword ptr [rsp+8],rbx
mov qword ptr [rsp+10h],rbp
mov qword ptr [rsp+18h],rsi
push rdi
sub rsp,20h
mov rsi,rdx
mov rbp,rcx
mov r8d,42747346h ; pool tag of FstB
mov edx,40h ; allocation size
xor ecx,ecx ; flags
call nt!ExAllocatePoolWithTag
test rax,rax
je Exit1
nt!FstubAllocateDiskInformation+0x35:
lea rdx,[rax+10h]
mov rcx,rbp
mov rbx,rax
call nt!FstubGetDiskGeometry
test eax,eax
js Exit2
nt!FstubAllocateDiskInformation+0x4c:
mov ecx,dword ptr [rbx+24h]
test ecx,ecx ; compare SectorSize to zero
je Exit3
nt!FstubAllocateDiskInformation+0x57:
mov rax,qword ptr [rbx+28h]
test rax,rax
je Exit3
nt!FstubAllocateDiskInformation+0x64:
xor edx,edx
mov dword ptr [rbx+8],ecx
mov r8d,42747346h ; pool tag of FstB
div rax,rcx
mov rdx,rcx ; SectorSize is the allocation size
mov ecx,4 ; flags
mov qword ptr [rbx+38h],rax
mov qword ptr [rbx],rbp
call nt!ExAllocatePoolWithTag
test rax,rax
je Exit4
A careful reading of the code above isn’t required in order to follow along. The code contains two calls to ExAllocatePoolWithTag, one for 0x40 bytes and one for a variable amount. Both use a pool tag of 42747346h, which is the four-character code ‘FstB’ used by the kernel’s fstub (disk partition support) code. A full analysis required looking at the input parameter types and their layouts, but I also went to the ReactOS site and looked at their version of the source code. It was quite helpful for confirming my analysis of the assembly language. The crucial lines can be paraphrased as:
FstubGetDiskGeometry(DiskInformation);
…
if (DiskInformation->SectorSize != 0)
ExAllocatePoolWithTag(DiskInformation->SectorSize);
In other words, FstubAllocateDiskInformation asks for the geometry of the disk (using FstubGetDiskGeometry), and then if the sector size is non-zero it allocates however many bytes the driver says is the sector size.
I briefly stopped to verify the sector size of my main disk:
C:\>fsutil fsinfo ntfsinfo c:
NTFS Volume Serial Number : 0x00b8457cb8457168
Version : 3.1
Number Sectors : 0x0000000048ad8ae2
Total Clusters : 0x000000000915b15c
Free Clusters : 0x00000000016ac4a5
Total Reserved : 0x0000000000000780
Bytes Per Sector : 512
Bytes Per Physical Sector : 512
…
and then, in a long-delayed flash of insight, I realized what the problem was. A few weeks earlier, around the time the performance problems first started, and around the time of the Windows Updates that I had suspected, I had purchased a backup drive. When I first used this backup drive it installed a driver. And this driver was the cause of all of my problems:
Problem solved!
I fired up device manager, found the problematic driver, uninstalled it, and without even a reboot my computer was fixed. I could suspend and resume my laptop with no memory spikes and without my disk cache getting purged. Bliss! Now my laptop was acting like a computer with 8 GiB, instead of like a computer with 4 GiB.
I’m not a device driver expert – user-mode is my domain – but I can imagine what must be going on. The device driver appears to be responsible for filling in a DISK_GEOMETRY structure when responding to IOCTL_DISK_GET_DRIVE_GEOMETRY queries. It’s not clear to me what the driver is supposed to do if this information is not available. It could return a failure code, or it could return a structure initialized with zeroes, or it could initialize the fields – including BytesPerSector – with 0xFFFFFFFF. And that’s what this driver does. Since SectorSize is an unsigned value, and since pool allocations are rounded up to whole pages, asking for 0xFFFFFFFF bytes results in a pool allocation of exactly 4 GiB (0x100000000 bytes). This is an easy mistake to make, but the Western Digital developers should have noticed that the driver was causing problems.
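Here is a minimal user-mode sketch of the arithmetic (illustrative C only, not Windows or Western Digital source) showing how a SectorSize of 0xFFFFFFFF, treated as an unsigned allocation size and rounded up to a page boundary, comes out to exactly 4 GiB:
/* Illustrative only -- not actual kernel or driver code. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t SectorSize = 0xFFFFFFFFu;   /* the driver "initializes" the field to -1 */
    if (SectorSize != 0) {               /* the only check the kernel makes */
        uint64_t requested = SectorSize; /* passed as the pool allocation size */
        /* Large pool allocations are rounded up to whole 4 KiB pages. */
        uint64_t rounded = (requested + 0xFFFu) & ~(uint64_t)0xFFFu;
        printf("requested=%#llx rounded=%#llx = %llu GiB\n",
               (unsigned long long)requested,
               (unsigned long long)rounded,
               (unsigned long long)(rounded >> 30));
    }
    return 0;
}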
The Windows kernel carefully tests for BytesPerSector being equal to zero, but it doesn’t do any other sanity checking to see if BytesPerSector is one of the small number of reasonable values. Thus, the kernel missed an opportunity to avoid this problem.
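For what it’s worth, the plausibility check would be tiny. Something along these lines (hypothetical code, not anything from the actual kernel) would have rejected the bogus value:
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: real drives report a small power-of-two sector size. */
static bool PlausibleSectorSize(uint32_t bytesPerSector) {
    if (bytesPerSector < 512 || bytesPerSector > 4096)
        return false;
    return (bytesPerSector & (bytesPerSector - 1)) == 0; /* power of two */
}

int main(void) {
    assert(PlausibleSectorSize(512));
    assert(PlausibleSectorSize(4096));
    assert(!PlausibleSectorSize(0));
    assert(!PlausibleSectorSize(0xFFFFFFFFu));
    return 0;
}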
If you own this device I recommend returning it. However that may not be sufficient. I was hitting this performance problem even when I was miles away from – and not connected to – the backup drive. It may be necessary to run device manager and manually uninstall the driver, but I’m not sure.
I also recommend contacting Western Digital and demanding that they release a fix.
Postscript
When I found the problem I mostly just wanted to return the drive, but first I made sure I had the latest drivers on my machine and all the latest updates on the NAS, to make sure my results were valid. I then deleted my data and wiped the drive. The Full Factory Restore option securely erases the drive, which is important when returning a drive that had several full backups of my laptop on it. And, of course I uninstalled the drivers, to make sure the problem went away.
Once again ETW let me investigate the mysterious and find the root cause in a truly scientific manner. I want more developers to use evidence-based techniques to investigate performance problems rather than guessing.
I feel sorry for regular consumers who naively purchased this drive and have little hope of avoiding this problem. If they notice the performance problem they are unlikely to guess that it is correlated to the new backup drive.
I contacted Microsoft through some personal contacts and they have made sure that the appropriate teams know about this issue. I contacted Western Digital through regular support channels on August 29th, 2012; they became aware of the issue when one of their engineers saw this article, and their support team replied to my request on September 7th, 2012 – nine days later.
Other stories
If you liked this article then you might like my other forensic performance investigations with xperf:
- Xperf and Visual Studio: The Case of the Breakpoint Hangs
- PowerPoint Poor Performance Problem (and Xperf for Excess CPU Consumption for a more detailed tutorial that investigates the same issue)
- Faster Fractals Through Better Scheduling – investigating and optimizing some Fractal eXtreme inefficiencies
- Xperf Wait Analysis–Finding Idle Time – investigating three second hangs in Visual Studio in tutorial form
- Windows Live Photo Gallery—Poor Performance Peculiarities – investigating and working around brief hangs when zooming in on images
- Fixing another Photo Gallery performance bug – investigating and working around startup hangs
For more information on using xperf to do your own investigations, look at ETW central.
Interesting read as always. I love your performance whodunits. Keep them coming.
You are slowly giving me xperf fever 🙂 Sadly, I do very little actual Windows programming these days, but I can think of a few frustrating apps and drivers I wouldn’t mind profiling, just to prove myself right.
Crap. I was having the same problems with the brand new desktop computer, and I’d chalked everything up to the motherboard going bad. To find out that it was just the external drive my dad gave me as a hand-me-down… well, this pisses me off.
nice catch.
If you had not known about the recent installation of the network storage device, how would you have continued the investigation to discover that the problem was the driver? I loved the article, but the leap you made at the end was an educated guess that ended up being correct.
Excellent question. I think I would have set up real kernel debugging. By rebooting Windows in debug mode and hooking up a second machine to run windbg I could have stepped through the problematic code to find the culprit.
Or, since I knew it was a storage driver I could have looked through device manager or other lists to find all of the storage drivers and remove them one at a time. I believe there were only two.
Bruce,
Hmmm, I work at Western Digital but not on the external drive side. However, I’ve pointed this out to the guy that runs the driver group. Let’s see if we can find out what went wrong on our side (or if there is a combo issue with something else in the system).
Do you have a personal email that we can touch base with you at? I’m on the development side, and we’re always looking for competent people who spot problems.
Bruce,
This “driver” is not required for the NAS to work. Most of what it does is provide a WD specific icon in Windows Explorer and add two items to a right-click popup menu. Well, that and apparently cause your speed issues.
Leave out the “driver” and keep the NAS. It will continue to work fine.
Another WD guy
That’s good to know. My recollection was that I tried to avoid installing any special software (I refused to put the DVD in) but there were some driver-install popups that seemed to be necessary, or at least something was pushed on me. I *think* that the driver was installed when I went to the Network area and double-clicked on the device there. Unfortunately I took the drive back already — I hadn’t allowed for the advice I might get in the comments.
Nice article. Thanks. I have to look at xperf me thinks 🙂
Great post, thanks!
Can we call you the “House” of the computer software diagnostics world?
Pingback: Backin’ Up is Hard to Do | Random ASCII
If you haven’t already, you may wish to visit http://technet.microsoft.com/en-US/sysinternals and download the Sysinternals Suite of tools. They are excellent weapons for troubleshooting Windows. The Process Explorer tool is especially useful and is basically Task Manager on steroids. I’ve replaced Task Manager with this now and it’s saved me countless hours of troubleshooting time thanks to its many features above and beyond what the basic Task Manager provides.
Autoruns is also another fabulous tool in the Suite which helps you check exactly what is firing up during every boot, allowing you to trim down boot times.
The sysinternals tools are indeed awesome, and I use them frequently. Every Windows developer should have them installed.
The sysinternals tools are, in general, easy to use and are real-time. Xperf, on the other hand, is harder to use and requires recording a trace and then analyzing it (non-real-time) but in return you get richer detail about the recorded time interval.
Forgot to also add this link to a Channel 9 series on the top tools available in the Suite and how to use them effectively.
http://channel9.msdn.com/Shows/Defrag-Tools/Defrag-Tools-2-Process-Explorer
Thanks. This helped me troubleshoot a performance problem with my PC.
Pingback: Windows Slowdown, Investigated, Identified, and Now Fixed | Random ASCII
Pingback: Top Eight Entertaining Blog Facts for 2012 | Random ASCII
Pingback: Defective Heat Sinks Causing Garbage Gaming | Random ASCII
Pingback: Bugs I Got Other Companies to Fix in 2013 | Random ASCII
Pingback: Self Inflicted Denial of Service in Visual Studio Search | Random ASCII
Pingback: Knowing Where to Type ‘Zero’ | Random ASCII
I’m not surprised that WD manage to introduce a bug into apparently-simple software…
It’s a bit worrying that they can’t code Windows software, though. Considering Western Digital (WD) FORCE users to install Windows just to be able to RMA the drives / test them properly. NO Mac option, NO Linux option, NO boot disc / .iso. Which they used to provide, a good method.
That’s right, YET ANOTHER breaking of best-practice to suit the bean counters running YET ANOTHER lame capitalist entity…
What else does one describe this as? “Imperfect, but beneficial to humanity”? I don’t think so.
“An incompetently-flawed system and modus operandi” – YES.
I contacted WD with a complaint about this. (What if one’s Windows died due to a corrupt HDD and one needs to troubleshoot whether it’s the drive or not, without wasting time installing Windows again for the sake of it? To do this now requires 1x EXTRA HDD and 1x EXTRA TIME TO INSTALL WINDOWS.) Multiply this by how many consumers need to do it daily – and how the hell can it be more ‘economical’ to NOT develop the software in a bootable .iso format? It appears that there’s NO good reason why this can’t be done, apart from ‘capitalism makes things too cheaply’ syndrome. That, and today’s ‘coders’ have no clue about ‘bare metal’ hardware support, and/or if they DO, are too expensive for cheapskate WD, it seems.
If I were working for WD, I’d be completely apologetic, regardless of whether it was my department or not. Rather than the typical corporate egotism.
Reblogged this on Jarek Przygódzki. Blog programisty.
Pingback: ETW Central | Random ASCII