Anytime my computer is a little bit slow I’m likely to record a trace and take a quick look. If I’m lucky I’ll find an easy workaround, in which case I’ve got a potential blog post and a smoother running computer. And, recording a trace to my uber-fast SSD only takes a few seconds, so the cost is minimal.
But, a few months ago I noticed that it was taking six minutes to record a simple trace, whether I used wprui or my batch files. This sort of delay really takes all the fun and spontaneity out of random performance investigations.
Spoiler alert: I profiled xperf and found a workaround, and the workaround is built-in to UIforETW, my open-source ETW trace recording tool.
At first I assumed that I wouldn’t be able to investigate this problem very well – how was I supposed to record a profile of a profile being recorded? But while working on the previous blog post I made a crucial observation: the six minute hang was during the merge step. The merge step happens after the profile has been recorded so it is actually easy to profile. I just put a “pause” command in my batch file, right before the merge step, and then used wprui to record a trace. I hit Ctrl+C to halt xperf after the merge step had been running for a while (it ran for 44.58 s according to the trace) and I was sure that because I’d caught the slowdown right from the start I’d be able to understand it.
I’ve profiled WPA to investigate slow symbol loading a couple of times (Xperf Symbol Loading Pitfalls and Slow Symbol Loading in Microsoft’s Profiler, Take Two) but this is the first time I’ve actually profiled the profiler.
The CPU usage showed the xperf process (red) and system process (green) playing ping-pong table tennis:
Well that’s odd.
The next step was to figure out what they were doing. For that I switched to CPU Usage (Sampling) with a stack view. The stack swung through perfctrl.dll (no symbols), advapi32.dll, through RegLoadAppKey, and then spent ~11.35 s (a quarter of the total time) in ntoskrnl.exe!KeFlushMultipleRangeTb <itself>. Huh. Weird.
The system process stack looked surprising similar, for reasons that I don’t understand:
xperf.exe was spending a lot of its time idle so I thought an idle-thread analysis with the CPU Usage (Precise) graph might be useful. It showed the thread waiting for 7.0 s in CmpFileRead, which is a registry hive reading function:
Previous traces had made me suspect the registry so I had recorded my final trace with Registry I/O activity, as well as Disk I/O activity and File I/O activity: I didn’t want to miss anything. The registry data showed that 94 keys were opened by xperf.exe. 93 were opened at the start of the trace, and 1 at the very end. The last one opened before the madness struck was \Registry\MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AppCompatFlags. Hmmm. AppCompat. Foreshadowing?
Finally I looked at the Disk I/O table. I wasn’t expecting to see anything interesting because it would take a crazy amount of disk I/O to make my SSD blink, especially with 8 GB of RAM to cache reads. And whadya know? I found a crazy amount of disk I/O. Look at this – over the 44.5 s of the trace there were 10,375 reads from Amcache.hve, totaling over 2.6 GB of data!
This is even more remarkable when you look at the size of the file:
The 2.6 GB of reads is equivalent to reading this 21.5 MB file 124 times, while never using the disk cache. If left to run for the usual six minutes this would mean reading the entire Amcache.hve file a thousand times. The purpose of Amcache.hve is unclear, but it is clearly related to AppCompat, and is presumably related to the last registry key read before the burst of activity.
Tests on a second machine showed that the length of the merge step was proportional to the size of Amcache.hve – curious.
Later tests showed that the “xperf -flush” command, which is needed when doing circular-buffer tracing, also triggers this painful hang.
The disk I/O is not the problem – my SSD is fast enough to read this file a thousand times in much less than six minutes – but it points to a possible solution. If the merge step is being held up by processing this file, then let’s temporarily rename the file during the merge step. And this works! When recording the simplest possible traces the merge step now takes about a second, and on longer traces it takes a reasonable time, proportional to the trace length.
I reported this to Microsoft. The Windows Performance Toolkit team is aware of the problem and is working on a fix.
I don’t know how prevalent this problem is. It affected both of my Windows 7 machines, but Windows 8 seems to be immune. If you are hitting this problem then consider running this batch file and sharing the results. It measures the merge time for a minimal trace, and displays the size of amcache.hve:
xperf -on PROC_THREAD+LOADER -f kernel.etl
@echo About to merge the trace at %time%
xperf -merge kernel.etl trace.etl
@echo Finished saving the trace at %time%
@rem List the size of the trace file and amcache.hve
@dir trace.etl | find /i “trace.etl”
@dir c:\Windows\AppCompat\Programs\amcache.hve | find /i “.hve”
I’d paste the renaming code into this blog post but the environment variables and error checking make it obscure and verbose. Instead, I’ve created an open source tool for recording and managing ETW traces that always tries to workaround this bug. See the UIforETW announcement for more details.