Windows Slowdown, Investigated and Identified

I recently noticed that my Windows 7 home laptop was running slower than usual. It seemed to be spending a lot of time waiting on disk I/O. When I looked at Task Manager it showed that my laptop was caching about one fifth as much disk data as normal, but the cause of this was not obvious.

Part of my job at Valve is to be a performance expert – to investigate both mundane and esoteric performance problems. So when I realized that I had a serious and systemic problem I knuckled down and brought all the tools of my trade to bear, and ultimately I found the culprit.

The story you are about to hear is true. Only the precise values of cached, available, and free memory have been changed to protect the innocent.

Spoiler alert: the Western Digital My Book Live backup drive that I’d purchased from Costco was the culprit.

Even better spoiler alert: Western Digital contacted me nine days after I posted this entry, and took the bug quite seriously. About four months later they published a fix. The story has a happy ending, and you can change the world through blogging.

Caching counts

In order to understand this mystery it is necessary to understand the importance of the Windows disk cache. Having lots of memory in your computer is important to allow running lots of programs without hitting the page file, but also to allow lots of data to be cached in memory. Hard drives, with their spinning physical platters, are among the slowest components in a computer, and laptop drives are the slowest type of hard drives. Windows caches data from the hard drive so that it will be available instantly, instead of requiring many milliseconds for each read. The more data that is cached, the less frequently the disk drive will be needed, and the faster your computer will be. Because of the Windows disk cache there is no such thing as too much memory. My work machine has 32 GiB which means that I typically have over 24 GiB of disk data cached, and this allows me to bounce around between different programs and different projects with minimal disk I/O.

Windows (starting with Vista) has systems in place whose only purpose is to fill up the disk cache with data that its logging shows you are likely to need. SuperFetch watches your usage patterns and knows how to prefetch data when you launch an application. It also prefetches data when your system is idle.

These features were confusing when Windows Vista first came out. Users would complain that Vista was a pig because it used so much memory, because Task Manager would show very little free memory. But it turns out that this is a good thing. Free memory is a total waste. Free memory is not doing anything useful. A well running Windows machine should have free memory near zero, and cached memory (the Windows disk cache) nice and high. Cached memory is fulfilling a useful purpose, it makes your computer run faster, and it can be repurposed (turned into free memory to satisfy allocations) the instant that it is needed.

A healthy machine

imageWhen Windows is running well Task Manager (available by right-clicking on the taskbar) should show Physical Memory statistics similar to those shown to the right: lots of cached memory, lots of available memory, and very little free memory. If available memory is low (less than a few thousand MiB) then it is worth purchasing more memory so that Windows will have more space for running programs and caching data. If free memory is high (more than a few hundred MiB) then either Windows has not had time to populate it with useful data or else something is wrong.

An unhealthy machine

On my laptop the performance tab of Task Manager looked more like this:

image

Looking at the area circled in purple shows that I’ve got lots of available memory, but my disk cache is smaller than expected for a machine with 8 GiB of RAM, and free memory is unusually high. On my 8 GiB laptop I normally have about 4-5 GiB of data cached, but when I noticed this problem my laptop was typically only caching about 1-2 GiB of data.

Looking at the history of memory usage shows why cache memory is low and free memory is high. imageThe area circled in red shows three brief spikes of memory usage. Each of those spikes represent a transient 4 GiB memory allocation, and Windows satisfied the first of those allocations by evicting 4 GiB of useful data from the file cache, leaving it as wasteful freed memory at the end.

Task Manager helped me to understand the structure of the problem, but it was no use at finding the culprit. Its Peak Working Set column showed nothing that could explain the spikes. It was time to invoke my favorite system profiling tool.

Investigating with xperf

I turned to the most powerful performance investigation tool that I know of: xperf (also known as the Windows Performance Toolkit). This is a free suite of tools created by Microsoft that lets you profile all aspects of a Windows machine’s performance, from CPU usage, to disk I/O, to memory allocations, and more. Xperf is not an easy tool to use, but it rewards expert users with almost unlimited insight into the performance of Windows.

I decided to use the new Windows Performance Recorder UI (wprui) to record a trace. I selected all of the options that were related to memory allocations and told it to log to memory. In this mode it continually monitors your system, writing detailed performance information to circular buffers in memory. When you notice a performance problem you click the Save button and it saves the buffers to disk. Thus, it is a retroactive profiler that can at any given time record the last ~30 seconds of system activity. I generally leave in-memory tracing running continuously so that I can always catch rare performance problems, without having to worry about reproducing them. When this screenshot was taken I’d had wprui running for most of a week, but the ‘Time’ field wraps after one day.

image

Caught in the act

The memory spikes were mostly happening at random times and were easy to miss, so I wasn’t having any luck recording a trace. However I eventually realized that I was getting a memory spike every time that my laptop resumed from standby. That made things much easier. All I had to do was have wprui recording to memory, then suspend and resume my laptop, log in, and click ‘Save’. Done. Then I loaded the trace (into xperfview or wpa) and looked for memory related anomalies. This can be tricky because you often don’t know what you are looking for or where you will find it, but when you look at the right graph it is sometimes quite obvious. Can you see the oh-so-subtle memory spike in the graph below?

image

So far we are seeing similar data to what Task Manager showed us, but with xperf we can dig deeper. The trace contains call stacks for, among other things, every pool memory allocation. After selecting the appropriate range (from the middle of the spike to the left of it) we can get a summary table. By arranging the summary table columns appropriately, sorting by size, and drilling down into the AIFO type (Allocated Inside the selected time range, Freed Outside), we quickly find the 4,294,967,296 byte allocation that is causing all the pain. And we have a call stack.

image

My first assumption upon seeing this call stack was that this was a Windows bug. I’d recently installed some Windows updates and I guessed that one of them was causing this bug. I was wrong, but it will take a bit more digging to see that.

The call stack shows that FstubAllocateDiskInformation is making two calls to ExAllocatePoolWithTag, and one of them is for a lot of memory. It’s time to look at FstubAllocateDiskInformation.

The truth is in the code

The simplest way to look at FstubAllocateDiskInformation is to use Local Kernel Debugging. This lets me use windbg to grab a snapshot of the kernel’s memory without actually breaking in to the kernel. Local kernel debugging is limited in that you can’t set breakpoints or step through code, but it lets you view disassembly, and it is simpler than setting up cables and a second machine to allow live kernel debugging. It was good enough.

Obviously the ‘uf’ command is what we need to dump the disassembly of this function. I’ve reformatted the results and put the crucial assembly language below, with a few lines bold faced and a few comments added:

lkd> uf ntkrnlmp!FstubAllocateDiskInformation
nt!FstubAllocateDiskInformation:
    mov     qword ptr [rsp+8],rbx
    mov     qword ptr [rsp+10h],rbp
    mov     qword ptr [rsp+18h],rsi
    push    rdi
    sub     rsp,20h
    mov     rsi,rdx
    mov     rbp,rcx
    mov     r8d,42747346h ; pool tag of FstB
    mov     edx,40h       ; allocation size
    xor     ecx,ecx       ; flags
   call    nt!ExAllocatePoolWithTag
    test    rax,rax
    je      Exit1

nt!FstubAllocateDiskInformation+0×35:
    lea     rdx,[rax+10h]
    mov     rcx,rbp
    mov     rbx,rax
    call    nt!FstubGetDiskGeometry
    test    eax,eax
    js      Exit2

nt!FstubAllocateDiskInformation+0x4c:
    mov     ecx,dword ptr [rbx+24h]
    test    ecx,ecx       ; compare SectorSize to zero
    je      Exit3

nt!FstubAllocateDiskInformation+0×57:
    mov     rax,qword ptr [rbx+28h]
    test    rax,rax
    je      Exit3

nt!FstubAllocateDiskInformation+0×64:
    xor     edx,edx
    mov     dword ptr [rbx+8],ecx
    mov     r8d,42747346h ; pool tag of FstB
    div     rax,rcx
    mov     rdx,rcx       ; SectorSize/allocation size
    mov     ecx,4         ; flags
    mov     qword ptr [rbx+38h],rax
    mov     qword ptr [rbx],rbp
    call    nt!ExAllocatePoolWithTag
    test    rax,rax
    je      Exit4

A careful reading of the code above isn’t required in order to follow along. The code contains two calls to ExAllocatePoolWithTag, one for 0×40 bytes and one for a variable amount. Both use a pool tag of 42747346h, which is a four-byte code for ‘FstB’, the file system pool. A full analysis required looking at the input parameter types and their layouts, but I also went to the reactos site and looked at their version of the source code. It was quite helpful for confirming my analysis of the assembly language. The crucial lines can be paraphrased as:

FstubGetDiskGeometry(DiskInformation);

if (DiskInformation->SectorSize != 0)
    ExAllocatePoolWithTag(DiskInformation->SectorSize);

In other words, FstubAllocateDiskInformation asks for the geometry of the disk (using FstubGetDiskGeometry), and then allocates however many bytes the driver says is the sector size.

I briefly stopped to verify the sector size of my main disk:

C:\>fsutil fsinfo ntfsinfo c:
NTFS Volume Serial Number :       0x00b8457cb8457168
Version :                         3.1
Number Sectors :                  0x0000000048ad8ae2
Total Clusters :                  0x000000000915b15c
Free Clusters  :                  0x00000000016ac4a5
Total Reserved :                  0×0000000000000780
Bytes Per Sector  :               512
Bytes Per Physical Sector :       512

and then, in a long-delayed flash of insight, I realized what the problem was. A few weeks earlier, around the time the performance problems first started, and around the time of the Windows Updates that I had suspected, I had purchased a backup drive. When I went to use this backup drive it had installed a driver. And this driver was the cause of all of my problems:

image

Problem solved!

imageI fired up device manager, found the problematic driver, uninstalled it, and without even a reboot my computer was fixed. I could suspend and resume my laptop with no memory spikes and without my disk cache getting purged. Bliss! Now my laptop was acting like a computer with 8 GiB, instead of like a computer with 4 GiB.

I’m not a device driver expert – user-mode is my domain – but I can imagine what must be going on. The device driver appears to be responsible for filling in a DISK_GEOMETRY structure when responding to IOCTL_DISK_GET_DRIVE_GEOMETRY queries. It’s not clear to me what the driver is supposed to do if this information is not available. It could return a failure code, or it could return a structure initialized with zeroes, or it could initialize the fields – including BytesPerSector – with 0xFFFFFFFF. And that’s what this driver does. Since SectorSize is an unsigned value and since pool allocations are rounded up this results in a pool allocation of exactly 4 GiB. This is an easy mistake to make, but the Western Digital developers should have noticed that the driver was causing problems.

The Windows kernel carefully tests for BytesPerSector being equal to zero, but it doesn’t do any other sanity checking to see if BytesPerSector is one of the small number of reasonable values. Thus, the kernel missed an opportunity to avoid this problem.

If you own this device I recommend returning it. However that may not be sufficient. I was hitting this performance problem even when I was miles away from – and not connected to – the backup drive. It may be necessary to run device manager and manually uninstall the driver, but I’m not sure.

I also recommend contacting Western Digital and demanding that they release a fix.

Postscript

imageWhen I found the problem I mostly just wanted to return the drive, but first I made sure I had the latest drivers on my machine and all the latest updates on the NAS, to make sure my results were valid. I then deleted my data and wiped the drive. The Full Factory Restore option securely erases the drive, which is important when returning a drive that had several full backups of my laptop on it. And, of course I uninstalled the drivers, to make sure the problem went away.

Once again xperf let me investigate the mysterious and find the root cause in a truly scientific manner. I want more developers to use evidence-based-techniques to investigate performance problems rather than guessing.

I feel sorry for regular consumers who naively purchased this drive and have little hope of avoiding this problem. If they notice the performance problem they are unlikely to guess that it is correlated to the new backup drive.

I contacted Microsoft through some personal contacts and they have made sure that the appropriate teams know about this issue. I contacted Western Digital through regular support channels on August 29th, 2012 but have not heard back from them yet – they are now aware of the issue because one of their engineers saw this article. Support replied to my request on September 7th, 2012 – nine days later.

Other stories

If you liked this article then you might like my other forensic performance investigations with xperf:

  1. Xperf and Visual Studio: The Case of the Breakpoint Hangs
  2. PowerPoint Poor Performance Problem (and Xperf for Excess CPU Consumption for a more detailed tutorial that investigates the same issue)
  3. Faster Fractals Through Better Scheduling – investigating and optimizing some Fractal eXtreme inefficiencies
  4. Xperf Wait Analysis–Finding Idle Time – investigating three second hangs in Visual Studio in tutorial form
  5. Windows Live Photo Gallery—Poor Performance Peculiarities – investigating and working around brief hangs when zooming in on images
  6. Fixing another Photo Gallery performance bug – investigating and working around startup hangs

For more information on using xperf to do your own investigations, look at category/xperf on this blog.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
This entry was posted in Investigative Reporting, Performance, Programming, xperf and tagged . Bookmark the permalink.

22 Responses to Windows Slowdown, Investigated and Identified

  1. MST says:

    Interesting read as always. I love your performance whodunits. Keep them coming.

  2. Billco says:

    You are slowly giving me xperf fever :) Sadly, I do very little actual Windows programming these days, but I can think of a few frustrating apps and drivers I wouldn’t mind profiling, just to prove myself right.

  3. Sean says:

    Crap. I was having the same problems with the brand new desktop computer, and everything chalked up to the motherboard going bad. To find out that it was just the external drive my dad gave me as a hand-me-down… well this pisses me off.

  4. dc says:

    nice catch.

  5. Felipe says:

    If you had not known about a recent installation of the network storage device, how would you have continued further to discover that the problem was the driver. I loved the article, but the leap you made at the end was an educated guess that ended up being correct.

    • brucedawson says:

      Excellent question. I think I would have set up real kernel debugging. By rebooting Windows in debug mode and hooking up a second machine to run windbg I could have stepped through the problematic code to find the culprit.

      Or, since I knew it was a storage driver I could have looked through device manager or other lists to find all of the storage drivers and remove them one at a time. I believe there were only two.

  6. Ted says:

    Bruce,

    Hmmm, I work at Western Digital but not on the external drive side. However, I’ve pointed this out to the guy that runs the driver group. Let’s see if we can find out what wrong on our side (or if there is a combo issue with something else in the system).

    Do you have a personal email that we can touch bases with you at? I’m on the development side, and we’re always looking for competent people that spot problems.

  7. Chris says:

    Bruce,

    This “driver” is not required for the NAS to work. Most of what it does is provide a WD specific icon in Windows Explorer and add two items to a right-click popup menu. Well, that and apparently cause your speed issues.

    Leave out the “driver” and keep the NAS. It will continue to work fine.

    Another WD guy

    • brucedawson says:

      That’s good to know. My recollection was that I tried to avoid installing any special software (I refused to put the DVD in) but there were some driver-install popups that seemed to be necessary, or at least something was pushed on me. I *think* that the driver was installed when I went to the Network area and double-clicked on the device there. Unfortunately I took the drive back already — I hadn’t allowed for the advice I might get in the comments.

  8. Elwood says:

    Nice article. Thanks. I have to look at xperf me thinks :-)

  9. Russ LaValle says:

    Can we call you the “House” of the computer software diagnostics world?

  10. Pingback: Backin’ Up is Hard to Do | Random ASCII

  11. Darren Evans says:

    If you haven’t already, you may wish to visit http://technet.microsoft.com/en-US/sysinternals and download the Sysinternals Suite of tools. They are excellent weapons for troubleshooting Windows. The Process Explorer tool is especially useful and is basically Task Manager on steroids. I’ve replaced Task Manager with this now and it’s saved me countless hours of troubleshooting time thanks to its many features above and beyond what the basic Task Manager provides

    Autoruns is also another fabulous tool in the Suite which helps you check exactly what is firing up during every boot, allowing to you to trim down boot times.

    • brucedawson says:

      The sysinternals tools are indeed awesome, and I used them frequently. Every Windows developer should have them installed.

      The sysinternals tools are, in general, easy to use and are real-time. Xperf, on the other hand, is harder to use and requires recording a trace and then analyzing it (non-real-time) but in return you get richer detail about the recorded time interval.

  12. Darren Evans says:

    Forgot to also add this link to a Channel 9 series on the top tools available in the Suite and how to use them effectively.

    http://channel9.msdn.com/Shows/Defrag-Tools/Defrag-Tools-2-Process-Explorer

  13. Fibonacci says:

    Thanks This help to troubleshoot performance problem with my PC.

  14. Pingback: Windows Slowdown, Investigated, Identified, and Now Fixed | Random ASCII

  15. Pingback: Top Eight Entertaining Blog Facts for 2012 | Random ASCII

  16. Pingback: Defective Heat Sinks Causing Garbage Gaming | Random ASCII

  17. Pingback: Bugs I Got Other Companies to Fix in 2013 | Random ASCII

  18. Pingback: Self Inflicted Denial of Service in Visual Studio Search | Random ASCII

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s