Xperf Wait Analysis–Finding Idle Time

The Windows Performance Toolkit, also known as xperf, is a powerful (and free!) system-wide Windows profiler. In the past I’ve talked about using xperf to identify slowdowns in PowerPoint (tutorial version is here) by using xperf’s built-in sampling profiler, but that actually understates the true value of Xperf. While I think xperf is a better sampling profiler than most of the alternatives (higher frequency, lower overhead, kernel and user mode), xperf is really at its best when it reveals information that other profilers cannot measure at all.

This article was updated September 2015 to reflect WPA 10 and UIforETW

In short, lots of profilers can tell you what your program is doing, but few profilers are excellent at telling you why your program is doing nothing.

More details about this particular investigation can be found in the second ETW Training Video.

Our story so far

Xperf has a high learning curve. Therefore I highly recommend that you start by reading some previous articles from this series. The best place to start is at ETW Central which has a well organized list of resources.

The rest of this post assumes that you have installed UIforETW and familiarized yourself with some of the basics of WPA trace analysis.

Wait Analysis Victories

I’ve had good luck using Wait Analysis to find many performance problems. Some of these delays were short enough to be difficult to notice, yet long enough to matter. Others were debilitating. All were difficult or impossible to analyze through CPU sampling or other ‘normal’ CPU profilers. Some examples include:

  • Finding the cause of brief startup hangs in Internet Explorer and various games
  • Profiling Luke Stackwalker to find out why it caused frame rate glitches in the game it was profiling
  • Finding the cause of a 10x perf-reduction when upgrading to a newer version of Windows
  • Finding the cause of frame rate hitches during fraps recording
  • Finding the chain of lock contention that caused frame rate hitches on a heavily loaded system
  • Finding the cause of (and a workaround for) repeated 2-6 second hangs in Visual Studio’s output window

The last investigation is the one I want to cover today. It is sufficiently simple and self-contained that I can cover it end-to-end in a single (long) post.

Finding the hang

imageWhen profiling a transient problem such as a frame-rate glitch or a temporary hang the first challenge is to locate the hang in the trace. A trace might cover 60 seconds, and a hang might last for 2 seconds or less, so knowing where to look is crucial. There are a number of ways to do this:

  • Find the key stroke that triggered the hang, through logging of input events
  • Use instrumentation in the functions of interest
  • Look for patterns in the CPU usage or other data
  • Look at the Window in Focus graph for clues 
  • Use OS hang-detection events
  • Remember when the hang happened relative to when you stopped and started tracing

I’ve used all of these techniques. To help with the first option UIforETW has an input event logger which puts all keyboard and mouse input into the trace. If a hang is triggered by a particular key press or mouse click then finding its start point in the trace is trivial.

Custom instrumentation (emitting ETW events at key points in your game, see the ETWEventDemo source) is also a common technique. Emitting an event every frame, or using CETWScope around slow operations, makes hitches plainly visible. However this doesn’t work when investigating performance problems in other people’s code, such as in Visual Studio.

In some cases a hang will be plainly visible in the CPU consumption. One recent hang showed a significant hole in the otherwise consistent CPU usage, plain as day.

The Window in Focus doesn’t show up for this trace in WPA 10, but it does show up in WPA 8.1. This happens quite frequently and is one reason why the Traces context menu in UIforETW has an option to load traces into WPA 8.1.

A specific event that indicates the time and duration of a hang would be ideal, and Windows 7 and above actually have such an event. The Microsoft-WindowsWin32k ETW user provider will emit an event whenever a thread resumes pumping messages after a significant delay. Windows Vista and earlier users are out of luck, but on Windows 7+ this is often exactly what is needed, and this provider is enabled by UIforETW.

It’s hands on time

I’ve uploaded a .zip file of a sample trace (2015-09-25_20-56-25 VS F8 short hang.zip) to my github bigfiles repo. If you download this trace, unzip it to UIforETW’s trace directory (default is documents\etwtraces) then you can load the trace into WPA 10 by double-clicking it in UIforETW. Following along is by far the best way to learn wait analysis. You should also download the .txt file with the matching name and copy it to the etwtraces directory. UIforETW will display it when the trace is selected, and you can type your findings into the Trace information area.

Once the trace is loaded be sure to Load Symbols from the Trace menu, unless you have Load symbols automatically after trace load configured.

This trace records a hang where I was looking through VC++ build warnings by repeatedly pressing F8. The hang was about six seconds long and happened after the third time that I pressed F8. These sorts of details can be very helpful when investigating performance problems so I always try to record them in the Trace information area.

Finding the hang start

With the trace loaded into WPA 10 you should see “Generic Events  Randomascii Chrome and Multi Events” as the top graph and table. If not then you should use UIforETW’s Settings dialog to Copy startup profiles and then reload the trace. Having the right startup profile will make this process far easier.

In the Generic Events table you can drill down into UIforETW, Multi-Input, Keyboard, Keydown, and you will see 20 keyboard events. The F7 key press built the project and the three F8 presses navigated through the warnings. Since we know that the hang started on the third F8 press (selected) we can tell that the hang started at 6.5 s into the trace, shown by the blue vertical line:

image

Since I mentioned that the hang was about six seconds long we already have a rough estimate of when it ends, but we can do better than that.

Finding the hang length

By opening the UI Delays graph from System Activity in the Graph Explorer we can get even better information. Whenever a UI thread stops pumping messages for more than 200 ms the Microsoft-Windows-Win32k provider records events and the UI Delays graph displays them. Here we can see that devenv.exe went for 5,469 ms without checking for messages, and for 3,172 ms of that time there was input in the queue that was being ignored.

image

The table also shows us the process ID (9104) and the UI thread’s thread ID (10364). When investigating an idle-time hang it is crucial to know which process and thread to investigate so this information is vital.

Zooming in to the cause

When analyzing an idle hang you should select the entire region of the hang, and it is particularly important to select the end of the hang. It is better to select a few extra tens or hundreds of milliseconds at the end rather than risk missing the crucial events that end the hang. This selection can be done by dragging with the mouse or by selecting the MsgCheck Delay bar in the UI Delays graph. Then just right-click on the selected region and select Zoom from the context menu.

Who woke whom?

If a thread is not running, and it then starts running, then there was a context switch that started it (the new thread) running. That context switch is recorded in our ETW trace and contains all sorts of useful information. Include in this information is (for the traces recorded with UIforETW) the new process name and thread ID, the call stack which the thread woke up on (which is the same one it went to sleep on), the length of time it was not running and, for threads that were waiting on some synchronization primitive, the thread that woke it up.

Ponder that, because it’s crucial. An ETW trace tells you, for each context switch, how long the thread was not running, and who woke it up. That’s why it is important to have the end of the hang selected, because that is (presumably) the time of the context switch that gets the thread running again.

The CPU Usage (Precise) table contains the context switch information. With UIforETW’s startup profiles this data defaults to being displayed as a graph only. To do wait analysis we need to display the table, by clicking on Display Graph and Table:

image

We know need to drill down into New Process devenv.exe (9104), New Thread Id 10,364. The CPU Usage column shows us that this thread only consumed 111.7 ms of CPU time over the 5.469 s time range shown, so we immediately know that this is an idle hang. Knowing when to do wait analysis (to find idle delays) and CPU Usage analysis (for busy delays) is crucial. Doing CPU Usage analysis when a thread is idle will lead to nonsense results.

It is important to remember that there is nothing wrong with a thread being idle. Most thread’s should be idle. We are interested in this particular idleness because it is the UI thread and because it failed to pump messages for 5.469 s. This idleness is problematic because it caused a hang.

The devenv UI thread did 47 context switches during this time period and we want to find the one that ended the longest period of idle time, so we need to sort the table by Time Since Last which, as I explained here, represents how long a thread had not been running when it was context-switched in again. The UIforETW startup profile configures two Time Since Last columns, one aggregated by Sum and the other by Max. Aggregating by sum is appropriate when you want to find a call stack whose context switches collectively account for the most time. Aggregating by max is appropriate when you want to find a single context switch that represents the longest idle time. In this case it doesn’t matter which one we use, but I recommend experimenting.

If we drill down (select the first entry in the thread’s call stack and repeatedly press right-arrow) then we’ll find that there is a single context switch (see the Count column) that restarted the UI thread after it was idle for 5.341 s. That’s our hang, clear as day.

The NewThreadStack for this 5.202 s call stack starts at _RtlUserThreadStart and winds through a lot of Visual Studio code. Microsoft is kind enough to publish symbols for much of VS, as well as for Windows and about fifty rows down we get to the interesting details:

image

It’s a single context switch (‘count’ is 1) that put the Visual Studio UI thread to sleep for 5.202 s. It doesn’t get much clearer than that.

Farther down in the call stack we can see that ntoskrnl.exe calls FLTMGR.SYS which calls mup.sys!MupiCallUncProvider (and the eventual call to WaitForSingleObject). Ponder that for a bit.

image

If we go down to the bottom of the stack and expand the next three columns then we can see the Readying Process/Thread Id/Stack, which is the thread that woke us, which can also be described as “who we were waiting for”:

image

In this case it was the System process (thread 5880) in an IopfCompleteRequest call that goes through MUP.SYS. If we know what MUP.SYS is then that gives us another clue as to the root cause, but even without that we know that Visual Studio called CreateFileW and it took a long time to return.

What about the other threads?

In our selected region their are context switch events for many threads in devenv.exe that spent most of the time range idle. So why aren’t we looking at them?

It’s important to understand that Sum:Time Since Last just measures how long a thread was idle, and there is nothing wrong with a thread being idle. A thread being idle is only a problem if it is supposed to be doing something and isn’t. In fact, if devenv.exe has lots of threads then they had better be idle most of the time or else my four-core machine is going to be constantly busy.

The reason we are looking at thread 10,364 is because (according to the MsgCheckDelay event) it is the UI thread and it went for 5.469 s without pumping messages. It kept me waiting, and that makes me angry. You wouldn’t like me when I’m angry.

File I/O summary table

Since we know that the hang is related to file I/O we should look at what file I/O is happening during this time period. Luckily this information is also recorded by UIforETW.

In the WPA Graph Explorer go to the Storage section and open up the File I/O graph and bring up a summary table. If you change the View Preset to Count by Process, Thread, Type and drill down into Process devenv.exe (9104), Thread 10,364 and look for an event with a long duration. If we sort by duration then we can just keep expanding with right-arrow. It doesn’t take long to find the event in question.

The IRP and File Object columns aren’t much use, but you can hide those or scroll over and find the File Path column. I’m sure this screen shot won’t show up very well, but I can’t shorten it any more. It contains too much glorious information. Click on the image for deeper details:

image

We can see here that a Create file event, from devenv.exe, thread 10,364, took 5.342 s, trying to open \\Perforce\main, and that ultimately the network path was not found.

Wow.

The hang is because Visual Studio tried to open a non-existent network file, and sometimes that causes a 5.342 s network timeout. Hence the hang.

The remainder of the hang is from a few other context switches and CPU time that account for the rest of the 5.5.469 s, but the one long bit of idle time is all that matters in this case.

What’s the cause?

The file name associated with this hang is quite peculiar. The full name is:

\\Perforce\main\src\Lib\Public\sharedstuff.lib#127 – opened for edit

That doesn’t look like a file name. That looks more like the output from Perforce. And that’s exactly what it is. At Valve (where I was working when I hit this problem) build results are stored in Perforce so we have pre-build steps to check these files out. The checkout commands print their results to the Visual Studio output window.

Visual Studio ‘helpfully’ reverses the slashes and decides that this represents a file name on \\perforce\main. Since this whole thing started with me pressing F8 this means that Visual Studio was trying desperately to treat this potential file name as a source-file name associated with an error or warning message.

Oops.

That explains the CResultList::AttemptToNavigate entry on the hang call stack – everything makes more sense once you understand the problem.

Conclusion

Once the cause of the hang was understood I modified our pre-build step to pipe the output through sed.exe and had it rewrite the output so that Visual Studio would no longer find it interesting. This avoids the hang, and also made it so that F8 would take the selection to interesting errors and warnings, instead of to these mundane progress messages. A little sed magic replaces “//” with the empty string, and “…” with “—“ :

sed -e s!//!! -e s!\.\.\.!—!

This changes the hang-prone results before:

image

to the hang-proof benign text after:

image

I also reported the bug to the Visual Studio team. Having a trace is very powerful for this because it meant that I could tell them definitively what the problem was, and I could share the trace in order to let them confirm my findings. Just like minidump files are a powerful way to report crash bugs, xperf traces are a powerful way to report performance bugs. The bug was fixed in Visual Studio 2012 – UNC paths are now ignored by the output window’s parser.

Mup.sys is the driver used for network file I/O. Therefore its presence on the New Thread Stack and the Ready Thread stack was a clue that a network delay was the problem. Doing file I/O on the UI thread is always a bit dodgy if you want to avoid hangs, and doing network file I/O is particularly problematic, so watching for mup.sys is a good idea.

Wait chains

Some wait analysis investigations are more complicated than this one. In several investigations I have found that the main thread of a game was idle for a few hundred milliseconds waiting on a semaphore, critical section, or other synchronization object. In that case the readying thread is critical because that is the thread that released the synchronization object. Once I find out who was holding up the main thread I can move the analysis to that thread and apply either busy-thread analysis (CPU sampling) or idle thread analysis (finding what that thread was waiting on). Usually just one or two levels of hunting is needed to find the culprit, but I did recently trace back across six context switches in four different processes in order to track down an OS scheduling problem.

When following wait chains it is important to understand the order of events. If thread 1234 is readied by thread 5678 at time 10.5 s, then any context switches or CPU activity that happen to thread 5678 after that point are not relevant to the wait chain. Since they happened after thread 1234 was woken they can’t be part of its wait chain.

For CPU activity this is dealt with by selecting the region of interest. For context switches this is dealt with by drilling down all the way and then looking at the SwitchInTime (s) column. This column records the time of the context switch.

It’s worth pointing out that if you busy wait (spinning on some global variable flag) or use your own custom synchronization primitives (CSuperFastCriticalSection) then these techniques will not work. The OS synchronization primitives are instrumented with ETW events that allow, in almost all cases, perfect following of wait chains. Even if your custom synchronization code is faster (and it probably isn’t) it isn’t enough faster to make up for the loss of wait analysis. The ability to profile your code trumps any small performance improvement.

Can’t any profiler do this?

Sampling profilers and instrumented profilers might be able to tell you that your program is idle, and they might even be able to tell you where your program is idle, but they generally can’t tell you why your program is idle. Only by following the chain of readying threads and looking at other information can you be sure to find the cause of your idle stalls.

It’s also convenient that you can leave xperf running in continuous-capture mode, where it is constantly recording all system activity to a circular buffer. When you notice a problem you can just record the buffer to disk, and do some post-mortem profiling.

Wait analysis is not baking

Baking is all about precisely following a recipe – improvisation tends to lead to failure. Wait analysis, on the other hand, is all about creativity, thinking outside the box, and understanding the entire system. You have to understand context switches, you have to think about what idle time is good and what is bad, you have to think about when to look at CPU usage and when to look at idle time, and you often have to invent some new type of analysis or summary table ordering in order to identify the root cause. It’s not easy, but if you master this skill then you can solve problems that most developers cannot.

Second opinion

For another discussion of analyzing hangs with xperf see this wait analysis article by Trey Nash.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
This entry was posted in AltDevBlogADay, Investigative Reporting, Performance, Programming, xperf and tagged , , , , , , , . Bookmark the permalink.

46 Responses to Xperf Wait Analysis–Finding Idle Time

  1. Pingback: The New Xperf is Here! | Random ASCII

  2. Very interesting article, thank you so much for taking the time to write this up! Lots to learn and that can only be good🙂

  3. Pingback: The Lost Xperf Documentation–CPU Scheduling | Random ASCII

  4. This link appears to be dead. Either that or my corp’s proxy server does not like you.
    ftp://ftp.cygnus-software.com/pub/MultiProvider.zip

  5. Pingback: Windows Live Photo Gallery—Poor Performance Peculiarities | Random ASCII

  6. Pingback: In Praise of Idleness | Random ASCII

  7. Aaron says:

    Do you have any idea where Wait Analysis went in the new WPA 6.2 GUI?

    • brucedawson says:

      Yes, I do know where it went. Under Computation find CPU Usage (Precise) and that drag over to the analysis window. Then set the table to “Display graph and table” (right-hand corner of the table’s title bar) to make the summary table show up. Then use the View Editor to set up the columns correctly. You’re all set.

      Oh — one final step. Profiles->Save Startup Profile, so you only have to do this once.

      I’ll blog about this one of these days.

      WPA is better in most ways than xperfview. There are a couple of annoying bugs, but those should be fixed in the final release, and then I assume I will be able to recommend it with no misgivings.

      • Aaron says:

        OK, it looks like this works. It looks like there’s a predefined column corresponding to the old sum column, but its label is indistinguishable from the max Waits column.

        Slightly unrelated, but on the topic of finding missing things: Do you know where the Disk I/O Detail graph? Specifically, xperfview graph has disk offset on the vertical access, along with some general descriptive information about the disks and partitions.

        • brucedawson says:

          The Disk I/O Detail graph is well hidden, but it is still around. Crack open the Storage section, then right-click on the Disk Usage section, and select Disk Offset Graph.

          I guess it’s traditional for that graph to be well hidden.

  8. Pingback: WPA–Xperf Trace Analysis Reimagined | Random ASCII

  9. Eric says:

    > “However this doesn’t work when investigating performance problems in other people’s code, such as in Visual Studio.”

    It sounds like you have an equally good method but there is actually an event that does this in VS 2010. WPF has a frame present event documented in ‘\Windows\Microsoft.NET\Framework\v4.0.30319\WPF\wpf-etw.man’ event id 11061 – WClientUceNotifyPresent signifies when WPF presents a frame to the screen.

    • brucedawson says:

      Interesting. I’ll have to try that. Unfortunately they don’t make it easy to use because that provider is not registered by default. I had to do this to register it:

      wevtutil im C:\Windows\Microsoft.NET\Framework\v4.0.30319\WPF\wpf-etw.man

      Then it showed up in the list of providers given by “xperf -providers” as “Microsoft-Windows-WPF” so now I should be able to enable it, next time I need that.

  10. Pingback: Windows Slowdown, Investigated and Identified | Random ASCII

  11. Pingback: The Lost Xperf Documentation–Disk Usage | Random ASCII

  12. Pingback: Xperf for Excess CPU Consumption | Random ASCII

  13. Pingback: Xperf Basics: Recording a Trace (the easy way) | Random ASCII

  14. Pingback: Xperf for Excess CPU Consumption: WPA edition | Random ASCII

  15. Pingback: Defective Heat Sinks Causing Garbage Gaming | Random ASCII

  16. Pingback: Coherent Labs » Timestamps for performance measurements

  17. Pingback: New Version of Xperf–Upgrade Now | Random ASCII

  18. Hi Bruce,

    this turned out to be just what I needed. Steam recently occasionally hangs for a good minute or so. This is especially annoying since it also pauses any game trying to communicate with Steam.

    So being a software engineer I gave profiling a try. I had only briefly looked into xperf after reading about it here so the most time was spent trying to understand the program and experimenting. But in the end it was mostly a matter of looking at the maximum Time Since Last of various threads.

    The problem turned out to be a blocking call to WinHttpGetProxyForUrl in the steamclient.dll in one thread that takes 115 seconds. I don’t know why it takes so long*, I just know that about 30 seconds later a couple of other Steam threads start waiting for the blocked one, at which point the games stop reacting as well.

    I should probably experiment with WinHttpGetProxyForUrl and see if I can reproduce the problem…

    Do you think that’s something that should be fixed in Steam? I suppose so, it may be a rare thing to happen (and I’m still curious why it does) but such a blocking call should probably not be done in a time critical thread anyway, right? In case you want to fix it, do you need the trace? It’s pretty heavy at 1.7GB and I suppose the function name should narrow it down by a lot so probably not.

    Thanks for the articles and have a nice day.

    Willi Schinmeyer

    * The MSDN has this to say: “The auto-detection process can be slow, possibly as long as several seconds.” That sounds more like “up to 10s or so” to me.

    • brucedawson says:

      It seems terribly appropriate that you would turn xperf against Steam. Isn’t it wonderful that it can be used to profile *any* program, whether you wrote it or not.

      Proxy resolution is a tricky problem, I remember that much from my days working on Internet Explorer, but I don’t remember much more than that. However I think that if proxy resolution takes that long it is hard to avoid hangs. Asynchronous programming is all well and good, but at some point somebody *needs* the answer.

      I’ll e-mail you and get some traces — we have a system for getting traces when needed, with Valve’s instrumentation and my preferred providers.

      • brucedawson says:

        Thanks for the trace. I confirmed your analysis. It’s annoying that WinHttpGetProxyForUrl sometimes takes so long to return, but we’ll fix the code to ensure it doesn’t cause hangs in games launched from Steam.

        Good job.

  19. tama says:

    Number 6 is ‘ReadyThreadStack’ not ‘ReadyingThreadStack’, right?

    Unfortunately the link to the other wait analysis article at windowsitpro doesn’t work for me.

    Google brought up this one: http://devproconnections.com/development/use-xperfs-wait-analysis-application-performance-troubleshooting

    • brucedawson says:

      You are correct. “ReadyingThreadStack” would be more consistent with the other column names, but “ReadyThreadStack” is what it is actually called.

      I’ll update the article link — I think you found the right one, although I can’t be certain.

  20. tama says:

    You can’t use Microsoft-WindowsWin32k with xbootmgr, can you?

  21. Pingback: Timestamps for performance measurements | Prog stuff

  22. Pingback: Bugs I Got Other Companies to Fix in 2013 | Random ASCII

  23. Pingback: You Got Your Web Browser in my Compiler! | Random ASCII

  24. Pingback: Self Inflicted Denial of Service in Visual Studio Search | Random ASCII

  25. Hey there Bruce, somebody did look closely at the data – smart choice; and thanks but I neither know someone living in Kirkland, nor do I drink. So buy one for yourself! You deserve it for keeping up the “dying art” of easter eggs.

  26. ralphcorderoy says:

    Educational article, thanks. Minor inconsistency: “My system has eight hardware threads (four hyperthreaded cores)” v. “or else my six-core machine is going to be constantly busy”.

  27. Pingback: Fix Wpf-etw.man Errors - Windows XP, Vista, 7 & 8

  28. Pingback: UIforETW – Windows Performance Made Easier | Random ASCII

  29. Pingback: The Lost Xperf Documentation–CPU Usage (Precise) | Random ASCII

  30. Trass3r says:

    Nice article. Could you also post one about finding the cause of deadlocks? In that case there is no readying thread, the app just hangs forever.

  31. Pingback: ETW Central | Random ASCII

  32. Pingback: You Got Your Web Browser in my Compiler! | Random ASCII

  33. Pingback: Windows Live Photo Gallery—Poor Performance Peculiarities | Random ASCII

  34. Pingback: PowerPoint Poor Performance Problem | Random ASCII

  35. Pingback: How to get the UI Delay graph to appear in ETL traces. Particularly for Black Box ETW flight recorders. | Libro Magica

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s