The Windows Performance Toolkit, also known as xperf, is a powerful (and free!) system-wide Windows profiler. In the past I’ve talked about using xperf to identify slowdowns in PowerPoint (tutorial version is here) by using xperf’s built-in sampling profiler, but that actually understates the true value of Xperf. While I think xperf is a better sampling profiler than most of the alternatives (higher frequency, lower overhead, kernel and user mode), xperf is really at its best when it reveals information that other profilers cannot measure at all.
This article was updated September 2015 to reflect WPA 10 and UIforETW
In short, lots of profilers can tell you what your program is doing, but few profilers are excellent at telling you why your program is doing nothing.
More details about this particular investigation can be found in the second ETW Training Video.
Our story so far
Xperf has a high learning curve. Therefore I highly recommend that you start by reading some previous articles from this series. The best place to start is at ETW Central which has a well organized list of resources.
The rest of this post assumes that you have installed UIforETW and familiarized yourself with some of the basics of WPA trace analysis.
Wait Analysis Victories
I’ve had good luck using Wait Analysis to find many performance problems. Some of these delays were short enough to be difficult to notice, yet long enough to matter. Others were debilitating. All were difficult or impossible to analyze through CPU sampling or other ‘normal’ CPU profilers. Some examples include:
- Finding the cause of brief startup hangs in Internet Explorer and various games
- Profiling Luke Stackwalker to find out why it caused frame rate glitches in the game it was profiling
- Finding the cause of a 10x perf-reduction when upgrading to a newer version of Windows
- Finding the cause of frame rate hitches during fraps recording
- Finding the chain of lock contention that caused frame rate hitches on a heavily loaded system
- Finding the cause of (and a workaround for) repeated 2-6 second hangs in Visual Studio’s output window
The last investigation is the one I want to cover today. It is sufficiently simple and self-contained that I can cover it end-to-end in a single (long) post.
Finding the hang
When profiling a transient problem such as a frame-rate glitch or a temporary hang the first challenge is to locate the hang in the trace. A trace might cover 60 seconds, and a hang might last for 2 seconds or less, so knowing where to look is crucial. There are a number of ways to do this:
- Find the key stroke that triggered the hang, through logging of input events
- Use instrumentation in the functions of interest
- Look for patterns in the CPU usage or other data
- Look at the Window in Focus graph for clues
- Use OS hang-detection events
- Remember when the hang happened relative to when you stopped and started tracing
I’ve used all of these techniques. To help with the first option UIforETW has an input event logger which puts all keyboard and mouse input into the trace. If a hang is triggered by a particular key press or mouse click then finding its start point in the trace is trivial.
Custom instrumentation (emitting ETW events at key points in your game, see the ETWEventDemo source) is also a common technique. Emitting an event every frame, or using CETWScope around slow operations, makes hitches plainly visible. However this doesn’t work when investigating performance problems in other people’s code, such as in Visual Studio.
In some cases a hang will be plainly visible in the CPU consumption. One recent hang showed a significant hole in the otherwise consistent CPU usage, plain as day.
The Window in Focus graph tracks the programs you switch between and can given vital temporal clues.
A specific event that indicates the time and duration of a hang would be ideal, and Windows 7 and above actually have such an event. The Microsoft-WindowsWin32k ETW user provider will emit an event whenever a thread resumes pumping messages after a significant delay. Windows Vista and earlier users are out of luck, but on Windows 7+ this is often exactly what is needed, and this provider is enabled by UIforETW.
It’s hands on time
I’ve uploaded a .zip file of a sample trace (2015-09-25_20-56-25 VS F8 short hang.zip) to my github bigfiles repo. If you download this trace, unzip it to UIforETW’s trace directory (default is documents\etwtraces) then you can load the trace into WPA 10 by double-clicking it in UIforETW. Following along is by far the best way to learn wait analysis. You should also download the .txt file with the matching name and copy it to the etwtraces directory. UIforETW will display it when the trace is selected, and you can type your findings into the Trace information area.
Once the trace is loaded be sure to Load Symbols from the Trace menu, unless you have Load symbols automatically after trace load configured.
This trace records a hang where I was looking through VC++ build warnings by repeatedly pressing F8. The hang was about six seconds long and happened after the third time that I pressed F8. These sorts of details can be very helpful when investigating performance problems so I always try to record them in the Trace information area.
Finding the hang start
With the trace loaded into WPA 10 you should see “Generic Events Randomascii Chrome and Multi Events” as the top graph and table. If not then you should use UIforETW’s Settings dialog to Copy startup profiles and then reload the trace. Having the right startup profile will make this process far easier.
In the Generic Events table you can drill down into UIforETW, Multi-Input, Keyboard, Keydown, and you will see 20 keyboard events. The F7 key press built the project and the three F8 presses navigated through the warnings. Since we know that the hang started on the third F8 press (selected) we can tell that the hang started at 6.5 s into the trace, shown by the blue vertical line:
Since I mentioned that the hang was about six seconds long we already have a rough estimate of when it ends, but we can do better than that.
Finding the hang length
By opening the UI Delays graph from System Activity in the Graph Explorer we can get even better information. Whenever a UI thread stops pumping messages for more than 200 ms the Microsoft-Windows-Win32k provider records events and the UI Delays graph displays them. Here we can see that devenv.exe went for 5,469 ms without checking for messages, and for 3,172 ms of that time there was input in the queue that was being ignored.
The table also shows us the process ID (9104) and the UI thread’s thread ID (10364). When investigating an idle-time hang it is crucial to know which process and thread to investigate so this information is vital.
Zooming in to the cause
When analyzing an idle hang you should select the entire region of the hang, and it is particularly important to select the end of the hang. It is better to select a few extra tens or hundreds of milliseconds at the end rather than risk missing the crucial events that end the hang. This selection can be done by dragging with the mouse or by selecting the MsgCheck Delay bar in the UI Delays graph. Then just right-click on the selected region and select Zoom from the context menu.
Who woke whom?
If a thread is not running, and it then starts running, then there was a context switch that started it (the new thread) running. That context switch is recorded in our ETW trace and contains all sorts of useful information. Include in this information is (for the traces recorded with UIforETW) the new process name and thread ID, the call stack which the thread woke up on (which is the same one it went to sleep on), the length of time it was not running and, for threads that were waiting on some synchronization primitive, the thread that woke it up.
Ponder that, because it’s crucial. An ETW trace tells you, for each context switch, how long the thread was not running, and who woke it up. That’s why it is important to have the end of the hang selected, because that is (presumably) the time of the context switch that gets the thread running again.
The CPU Usage (Precise) table contains the context switch information. With UIforETW’s startup profiles this data defaults to being displayed as a graph only. To do wait analysis we need to display the table, by clicking on Display Graph and Table:
We know need to drill down into New Process devenv.exe (9104), New Thread Id 10,364. The CPU Usage column shows us that this thread only consumed 111.7 ms of CPU time over the 5.469 s time range shown, so we immediately know that this is an idle hang. Knowing when to do wait analysis (to find idle delays) and CPU Usage analysis (for busy delays) is crucial. Doing CPU Usage analysis when a thread is idle will lead to nonsense results.
It is important to remember that there is nothing wrong with a thread being idle. Most thread’s should be idle. We are interested in this particular idleness because it is the UI thread and because it failed to pump messages for 5.469 s. This idleness is problematic because it caused a hang.
The devenv UI thread did 47 context switches during this time period and we want to find the one that ended the longest period of idle time, so we need to sort the table by Time Since Last which, as I explained here, represents how long a thread had not been running when it was context-switched in again. The UIforETW startup profile configures two Time Since Last columns, one aggregated by Sum and the other by Max. Aggregating by sum is appropriate when you want to find a call stack whose context switches collectively account for the most time. Aggregating by max is appropriate when you want to find a single context switch that represents the longest idle time. In this case it doesn’t matter which one we use, but I recommend experimenting.
If we drill down (select the first entry in the thread’s call stack and repeatedly press right-arrow) then we’ll find that there is a single context switch (see the Count column) that restarted the UI thread after it was idle for 5.341 s. That’s our hang, clear as day.
The NewThreadStack for this 5.202 s call stack starts at _RtlUserThreadStart and winds through a lot of Visual Studio code. Microsoft is kind enough to publish symbols for much of VS, as well as for Windows and about fifty rows down we get to the interesting details:
It’s a single context switch (‘count’ is 1) that put the Visual Studio UI thread to sleep for 5.202 s. It doesn’t get much clearer than that.
Farther down in the call stack we can see that ntoskrnl.exe calls FLTMGR.SYS which calls mup.sys!MupiCallUncProvider (and the eventual call to WaitForSingleObject). Ponder that for a bit.
If we go down to the bottom of the stack and expand the next three columns then we can see the Readying Process/Thread Id/Stack, which is the thread that woke us, which can also be described as “who we were waiting for”:
The Readying Process/Thread Id/Stack is the thread that “readied” the thread that was waiting. That is, it satisfies the condition that the thread was waiting on. That might be a lock, an event, a thread, a process, or something else. However if a thread is readied by the expiration of a timer then the Readying Process/Thread Id is irrelevant. Timer expiration is implemented by having a interrupt fire, the interrupt service routine schedules a deferred procedure call (DPC) and that DPC executes in the context of an existing thread. The DPC is like a body snatcher that assumes the identity of its host process/thread, readies a bunch of other threads, and then goes away. When doing wait analysis it is a good idea to have the Ready Thread in DPC enabled, and if it says True then ignore the Readying Process/Thread completely, as discussed in this random twitter thread.
In this case it was the System process (thread 5880) in an IopfCompleteRequest call that goes through MUP.SYS. If we know what MUP.SYS is then that gives us another clue as to the root cause, but even without that we know that Visual Studio called CreateFileW and it took a long time to return.
What about the other threads?
In our selected region their are context switch events for many threads in devenv.exe that spent most of the time range idle. So why aren’t we looking at them?
It’s important to understand that Sum:Time Since Last just measures how long a thread was idle, and there is nothing wrong with a thread being idle. A thread being idle is only a problem if it is supposed to be doing something and isn’t. In fact, if devenv.exe has lots of threads then they had better be idle most of the time or else my four-core machine is going to be constantly busy.
The reason we are looking at thread 10,364 is because (according to the MsgCheckDelay event) it is the UI thread and it went for 5.469 s without pumping messages. It kept me waiting, and that makes me angry. You wouldn’t like me when I’m angry.
File I/O summary table
Since we know that the hang is related to file I/O we should look at what file I/O is happening during this time period. Luckily this information is also recorded by UIforETW.
In the WPA Graph Explorer go to the Storage section and open up the File I/O graph and bring up a summary table. If you change the View Preset to Count by Process, Thread, Type and drill down into Process devenv.exe (9104), Thread 10,364 and look for an event with a long duration. If we sort by duration then we can just keep expanding with right-arrow. It doesn’t take long to find the event in question.
The IRP and File Object columns aren’t much use, but you can hide those or scroll over and find the File Path column. I’m sure this screen shot won’t show up very well, but I can’t shorten it any more. It contains too much glorious information. Click on the image for deeper details:
We can see here that a Create file event, from devenv.exe, thread 10,364, took 5.342 s, trying to open \\Perforce\main, and that ultimately the network path was not found.
The hang is because Visual Studio tried to open a non-existent network file, and sometimes that causes a 5.342 s network timeout. Hence the hang.
The remainder of the hang is from a few other context switches and CPU time that account for the rest of the 5.469 s, but the one long bit of idle time is all that matters in this case.
What’s the cause?
The file name associated with this hang is quite peculiar. The full name is:
\\Perforce\main\src\Lib\Public\sharedstuff.lib#127 – opened for edit
That doesn’t look like a file name. That looks more like the output from Perforce, only with back slashes instead of the forward slashes that Perforce uses. And that’s exactly what it is. At Valve (where I was working when I hit this problem) build results are stored in Perforce so we have pre-build steps to check these files out. The checkout commands print their results to the Visual Studio output window.
Visual Studio ‘helpfully’ reverses the slashes and decides that this represents a file name on \\perforce\main. Since this whole thing started with me pressing F8 this means that Visual Studio was trying desperately to treat this potential file name as a source-file name associated with an error or warning message.
That explains the CResultList::AttemptToNavigate entry on the hang call stack – everything makes more sense once you understand the problem.
Once the cause of the hang was understood I modified our pre-build step to pipe the output through sed.exe and had it rewrite the output so that Visual Studio would no longer find it interesting. This avoids the hang, and also made it so that F8 would take the selection to interesting errors and warnings, instead of to these mundane progress messages. A little sed magic replaces “//” with the empty string, and “…” with “—“ :
sed -e s!//!! -e s!\.\.\.!—!
This changes the hang-prone results before:
to the hang-proof benign text after:
I also reported the bug to the Visual Studio team. Having a trace is very powerful for this because it meant that I could tell them definitively what the problem was, and I could share the trace in order to let them confirm my findings. Just like minidump files are a powerful way to report crash bugs, xperf traces are a powerful way to report performance bugs. The bug was fixed in Visual Studio 2012 – UNC paths are now ignored by the output window’s parser.
Mup.sys is the driver used for network file I/O. Therefore its presence on the New Thread Stack and the Ready Thread stack was a clue that a network delay was the problem. Doing file I/O on the UI thread is always a bit dodgy if you want to avoid hangs, and doing network file I/O is particularly problematic, so watching for mup.sys is a good idea.
Some wait analysis investigations are more complicated than this one. In several investigations I have found that the main thread of a game was idle for a few hundred milliseconds waiting on a semaphore, critical section, or other synchronization object. In that case the readying thread is critical because that is the thread that released the synchronization object. Once I find out who was holding up the main thread I can move the analysis to that thread and apply either busy-thread analysis (CPU sampling) or idle thread analysis (finding what that thread was waiting on). Usually just one or two levels of hunting is needed to find the culprit, but I did recently trace back across six context switches in four different processes in order to track down an OS scheduling problem.
When following wait chains it is important to understand the order of events. If thread 1234 is readied by thread 5678 at time 10.5 s, then any context switches or CPU activity that happen to thread 5678 after that point are not relevant to the wait chain. Since they happened after thread 1234 was woken they can’t be part of its wait chain.
For CPU activity this is dealt with by selecting the region of interest. For context switches this is dealt with by drilling down all the way and then looking at the SwitchInTime (s) column. This column records the time of the context switch.
It’s worth pointing out that if you busy wait (spinning on some global variable flag) or use your own custom synchronization primitives (CSuperFastCriticalSection) then these techniques will not work. The OS synchronization primitives are instrumented with ETW events that allow, in almost all cases, perfect following of wait chains. Even if your custom synchronization code is faster (and it probably isn’t) it isn’t enough faster to make up for the loss of wait analysis. The ability to profile your code trumps any small performance improvement.
Can’t any profiler do this?
Sampling profilers and instrumented profilers might be able to tell you that your program is idle, and they might even be able to tell you where your program is idle, but they generally can’t tell you why your program is idle. Only by following the chain of readying threads and looking at other information can you be sure to find the cause of your idle stalls.
It’s also convenient that you can leave xperf running in continuous-capture mode, where it is constantly recording all system activity to a circular buffer. When you notice a problem you can just record the buffer to disk, and do some post-mortem profiling.
Wait analysis is not baking
Baking is all about precisely following a recipe – improvisation tends to lead to failure. Wait analysis, on the other hand, is all about creativity, thinking outside the box, and understanding the entire system. You have to understand context switches, you have to think about what idle time is good and what is bad, you have to think about when to look at CPU usage and when to look at idle time, and you often have to invent some new type of analysis or summary table ordering in order to identify the root cause. It’s not easy, but if you master this skill then you can solve problems that most developers cannot.
For another discussion of analyzing hangs with xperf see this wait analysis article by Trey Nash. For instructions on doing the same type of analysis on Linux see Brendan Gregg’s post on Off-CPU Analysis.