Profiling the profiler: working around a six minute xperf hang

Posted on March 2, 2015 by brucedawson

Anytime my computer is a little bit slow I’m likely to record a trace and take a quick look. If I’m lucky I’ll find an easy workaround, in which case I’ve got a potential blog post and a smoother running computer. And, recording a trace to my uber-fast SSD only takes a few seconds, so the cost is minimal.

But, a few months ago I noticed that it was taking six minutes to record a simple trace, whether I used wprui or my batch files. This sort of delay really takes all the fun and spontaneity out of random performance investigations.

Spoiler alert: I profiled xperf and found a workaround, and the workaround is built-in to UIforETW, my open-source ETW trace recording tool.

At first I assumed that I wouldn’t be able to investigate this problem very well – how was I supposed to record a profile of a profile being recorded? But while working on the previous blog post I made a crucial observation: the six minute hang was during the merge step. The merge step happens after the profile has been recorded so it is actually easy to profile. I just put a “pause” command in my batch file, right before the merge step, and then used wprui to record a trace. I hit Ctrl+C to halt xperf after the merge step had been running for a while (it ran for 44.58 s according to the trace) and I was sure that because I’d caught the slowdown right from the start I’d be able to understand it.

I’ve profiled WPA to investigate slow symbol loading a couple of times (Xperf Symbol Loading Pitfalls and Slow Symbol Loading in Microsoft’s Profiler, Take Two) but this is the first time I’ve actually profiled the profiler.

Nice rally!

The CPU usage showed the xperf process (red) and system process (green) playing ping-pong table tennis:

Well that’s odd.

The next step was to figure out what they were doing. For that I switched to CPU Usage (Sampling) with a stack view. The stack swung through perfctrl.dll (no symbols), advapi32.dll, through RegLoadAppKey, and then spent ~11.35 s (a quarter of the total time) in ntoskrnl.exe!KeFlushMultipleRangeTb <itself>. Huh. Weird.

The system process stack looked surprising similar, for reasons that I don’t understand:

xperf.exe was spending a lot of its time idle so I thought an idle-thread analysis with the CPU Usage (Precise) graph might be useful. It showed the thread waiting for 7.0 s in CmpFileRead, which is a registry hive reading function:

Previous traces had made me suspect the registry so I had recorded my final trace with Registry I/O activity, as well as Disk I/O activity and File I/O activity: I didn’t want to miss anything. The registry data showed that 94 keys were opened by xperf.exe. 93 were opened at the start of the trace, and 1 at the very end. The last one opened before the madness struck was \Registry\MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AppCompatFlags. Hmmm. AppCompat. Foreshadowing?

Finally I looked at the Disk I/O table. I wasn’t expecting to see anything interesting because it would take a crazy amount of disk I/O to make my SSD blink, especially with 8 GB of RAM to cache reads. And whadya know? I found a crazy amount of disk I/O. Look at this – over the 44.5 s of the trace there were 10,375 reads from Amcache.hve, totaling over 2.6 GB of data!

This is even more remarkable when you look at the size of the file:

The 2.6 GB of reads is equivalent to reading this 21.5 MB file 124 times, while never using the disk cache. If left to run for the usual six minutes this would mean reading the entire Amcache.hve file a thousand times. The purpose of Amcache.hve is unclear, but it is clearly related to AppCompat, and is presumably related to the last registry key read before the burst of activity.

Tests on a second machine showed that the length of the merge step was proportional to the size of Amcache.hve – curious.

Later tests showed that the “xperf -flush” command, which is needed when doing circular-buffer tracing, also triggers this painful hang.

Workaround time!

The disk I/O is not the problem – my SSD is fast enough to read this file a thousand times in much less than six minutes – but it points to a possible solution. If the merge step is being held up by processing this file, then let’s temporarily rename the file during the merge step. And this works! When recording the simplest possible traces the merge step now takes about a second, and on longer traces it takes a reasonable time, proportional to the trace length.

I reported this to Microsoft. The Windows Performance Toolkit team is aware of the problem and is working on a fix.

I don’t know how prevalent this problem is. It affected both of my Windows 7 machines, but Windows 8 seems to be immune. If you are hitting this problem then consider running this batch file and sharing the results. It measures the merge time for a minimal trace, and displays the size of amcache.hve:

xperf -on PROC_THREAD+LOADER -f kernel.etl
xperf -stop
@echo About to merge the trace at %time%
xperf -merge kernel.etl trace.etl
@echo Finished saving the trace at %time%
@rem List the size of the trace file and amcache.hve
@dir trace.etl | find /i “trace.etl”
@dir c:\Windows\AppCompat\Programs\amcache.hve | find /i “.hve”

I’d paste the renaming code into this blog post but the environment variables and error checking make it obscure and verbose. Instead, I’ve created an open source tool for recording and managing ETW traces that always tries to workaround this bug. See the UIforETW announcement for more details.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048

View all posts by brucedawson →

This entry was posted in Investigative Reporting, Performance, Programming, xperf and tagged metaprofiling, xperf. Bookmark the permalink.

13 Responses to Profiling the profiler: working around a six minute xperf hang

Pingback: ETW Trace Compression (and xperf syntax refresher) | Random ASCII
Jon says:

March 3, 2015 at 4:01 pm

Thanks for another great ETW article. I have some questions:

1) How did you report this to Microsoft? Did you open a Connect item?

2) Do you have advice for someone wanting to record less information but over a long period of time? Eg say I want to record a few days’ worth of disk and file I/O only, eg to prove disk is a bottleneck, where performance counters don’t give enough info.

3) Are there scenarios where you’d recommend Process Monitor over wprui/xperf/wpa?

4) Is it possible your workaround of renaming the file will break something?

Reply
- brucedawson says:
  
  March 3, 2015 at 6:45 pm
  
  1) I have not found a public bug database for WPT. This is actually quite frustrating. I am lucky to know some team members from my time at Microsoft, but even with that access it is annoying to not have any ability to track bugs. I know that some bugs I have reported got dropped between the cracks during team changes.
  
  2) There are many things you can do to lower the data rate, but a few days’ worth of ETW data is impractical. What I have done is leave tracing to a circular buffer enabled for weeks at a time, and then saved the buffers to disk when some condition is detected.
  
  3) Sysinternal’s procmon is great for real-time monitoring. I definitely use it. But for hard-core investigations of mysteries I really like that a single ETW trace contains virtually everything I could ever need.
  
  4) I’d hate to guarantee that renaming the file is totally safe, but I got the impression from my Microsoft contacts that it wasn’t particularly risky.
  
  Reply
  - Jon says:
    
    March 4, 2015 at 4:27 pm
    
    Thanks for your response.
    
    Re (2): I’m looking for a tool I can run on production servers to collect this data and retain for days rather than minutes/hours. It’s mainly file I/O that I’m interested in (CPU tools are easier to come by). If ETW tools could summarize top file I/O for the last 1 second, it’d be possible to retain a lot more data. Do you know if that’s possible?
    
    Reply
    - brucedawson says:
      
      March 5, 2015 at 11:12 am
      
      I’m not aware of any built-in support for what you want to do. This is exactly the sort of thing that dtrace does — Brendan Gregg’s book talks about the advantage of summarizing performance data kernel-side to avoid sending huge volumes of data across the kernel/user divide.
      
      You could write an ETW sink that would received this data and summarize it. There is some documentation about the ETW APIs here:
      http://mollyrocket.com/casey/stream_0029.html
      Or, more positively, you could look at Vance Morrison’s PerfView:
      http://blogs.msdn.com/b/vancem/archive/2015/01/28/links-to-recent-perfview-videos-defrag-tools-nov-2014-jan-2015.aspx
      
      Reply
Jon says:

March 9, 2015 at 11:25 pm

“summarizing performance data kernel-side” – I assume that’s not possible with ETW currently?

That Casey Muratori blog post was amusing. I imagine it would be possible to wrap MS’s API and expose it as his ideal API to make it easier for everyone.

It seems like the info I’m after is collected by Resource Monitor (resmon.exe); the only thing missing is storing the data historically. So it shouldn’t be too hard to achieve.

Thanks for the perfview links, I had a quick look and its UI is ..interesting. I’ll check out the videos.

One thing that is unfortunate about ETW tracing/analyzing is how slow everything is, especially on my spinning disks. Saving traces, loading them in WPA, loading symbols is all very slow. Sasha Goldstein said it best: http://blogs.microsoft.co.il/sasha/2014/12/02/diagnosing-native-memory-leaks-etw-wpa/
“Now, load symbols (Trace > Load Symbols) and grab a few cups of coffee. Maybe have lunch and visit some old friends. You should even have time for a quick romantic getaway, perhaps over a weekend or something.”

I’d be happy to forego some data (on occasion) in order to make things faster.

Reply
- brucedawson says:
  
  March 10, 2015 at 9:20 am
  
  I actually don’t find xperf/WPA particularly slow. Some suggestions:
  1) Get an SSD. I got a 1 TB SSD last year for $450 and it has made me more productive.
  2) Be careful not to create overly long traces. Most should be a few hundred MB.
  3) Symbol loading is slowest the first time because all OS .pdb files need translating to .symcache files. It is *much* faster next time.
  4) You can filter out .pdb files that you don’t care about with https://twitter.com/JohnWintellect/status/542102993009516545 – but I don’t.
  5) You can then filter out .symcache files that you don’t care about by deleting them – but I don’t.
  6) Maybe disable symbol server after your first analysis so subsequent trace analyses don’t go to the network to look for (still) missing PDB files?
  7) Make sure you have enough memory. My laptop has 8 GB, and 16+ GB would be better.
  
  You can also disable things like context-switch call stacks or file I/O to reduce the data rate, but I generally only do that when collecting that data is affecting the performance of the code I am profiling.
  
  If symbol loading is still slow then you are probably hitting one of the pathological cases that I’ve blogged about. See https://randomascii.wordpress.com/2014/11/04/slow-symbol-loading-in-microsofts-profiler-take-two/ (and the earlier article it references) and perhaps do a quick profile of symbol loading to see if dbghelp.dll or msdia*.dll are using all the time. With symbol stripping I can translate and load Chrome’s *enormous* symbols in 30-40 seconds, much less if they are already translated.
  
  Reply
Pingback: UIforETW – Windows Performance Made Easier | Random ASCII
Alois Kraus says:

July 6, 2015 at 6:36 am

Great post. I hit this issue on a Winows 7 x86 machine with a 70MB amcache.hve file. Stopping a trace did take forever (more than 30minutes). Is there a fix from MS already? What happens if I forget to rename the file back because the script is stopped with Ctrl-Break?

Reply
- brucedawson says:
  
  July 6, 2015 at 9:18 am
  
  Thanks for commenting. I’m still not sure how prevalent this issue is and it is good to hear from other people who have also hit it.
  
  One answer to your question is that I recommend using UIforETW for recording traces and this makes it harder to get into the situation where the file doesn’t get renamed back.
  
  A better answer is that if you don’t rename the file back then Windows will eventually regenerate it, with no apparent ill consequences.
  
  If Windows regenerates the file then you can potentially get into a situation where you have Amcache.hve *and* Amcache_temp.hve. That will then make the rename fail which gets you back into it taking 30 minutes to record a trace. UIforETW handles this by deleting Amcache_temp.hve first. If you are using batch files then be sure that they do that. Here is where UIforETW deletes the old temp file if it exists: https://github.com/google/UIforETW/blob/master/UIforETW/UIforETWDlg.cpp#L826
  
  I’ve been using this workaround for four months now with no problems.
  
  Reply
Alois Kraus says:

July 7, 2015 at 12:20 am

Ahh I see you do this all the time even for Win8? If there are no grave consequences I will also add it and see how it works out. If stopping becomes faster then I hope more people will use it. UIforETW is nice but I have also a whole bunch of workarounds for .NET stuff added which UIForETW currently will not take care of. My main use case are managed .NET applications where the pdb Generation on the fly is very important. Xperf does a bad job there. It works but it is slooow. Have you considered to add tracelog to get more possibilties to filter out specific Events or to suppress stackwalking for uninteresting high volume evnents?

Reply
- brucedawson says:
  
  July 7, 2015 at 8:45 am
  
  Yep, I try the rename unilaterally. On my one Windows 8 machine it always fails (file locked?) but luckily that machine doesn’t hit the slowdown.
  
  That machine does hit a *different* slowdown when tracing to in-memory circular buffers, which renders that mode useless. That’s quite annoying. I don’t know if it is Windows 8.1, the 64 GB of RAM, the dual socket motherboard, or ??? Microsoft has not been forthcoming.
  
  If you ever feel like adding your .NET workarounds to UIforETW that would probably be great. Something to consider…
  
  Reply
Pingback: ETW Central | Random ASCII