For many years xperfview.exe has been the main tool for analyzing xperf/ETW traces. However starting in fall 2011 the Windows Performance Toolkit started including wpa.exe as an alternative. While the early versions had some significant rough edges, the latest version (10.0.10240.16384, released in tandem with Windows 10) is now superior to xperfview in basically all ways.
In this post I’ll briefly explain how to switch from using xperfview to WPA, and why this is worthwhile.
Updated October 2013 and September 2015.
Before proceeding you should be sure to get the latest version of WPA. The simplest way to do that is by getting UIforETW. There are details at ETW Central, and this should make recording a sample trace easy.
It may be useful to understand xperf analysis basics, xperf for excess CPU consumption (WPA edition), and xperf wait analysis, but much of the UI flow is quite different with WPA, so some of the details will be different.
The two articles in the xperf documentation series, CPU sampling and CPU scheduling, are still relevant – understanding the column names is as important as ever. Those articles are timeless classics. The UI for selecting columns has changed, but that is easy enough to adjust to.
When you first launch WPA it may be dauntingly austere and blank. All of the graphs are collapsed and hidden:
If you see this then you’re not using UIforETW and you’ll have to drag graphs on by yourself. UIforETW copies over a startup profile which includes all of the graphs below. Whenever you install a new version of UIforETW you should probably use the Settings dialog to copy over a new startup profile, in case it has been improved.
App-specific generic events can be crucial for navigating a trace. I use them to identify frame boundaries, user input, and other key events. With them I can see when the frame-rate drops, and I can see what events the drop correlates to. UIforETW inserts lots of custom events for input, timer frequency, and more, and Generic Events is where they will be found.
The display of generic events in xperfview, the old trace viewing tool, left much to be desired:
If I looked in the ProviderIds drop-down then I could see that these blue diamonds corresponded to my Valve-FrameRate provider, but that provider emits events for simulation ticks as well as render ticks, and the distinction is invisible in xperfview.
The display of generic events in WPA is much improved. The events are displayed hierarchically so that I can drill down and find all my events. The layout is configurable (just show the table and rearrange columns) so you can group in different ways:
By default the UIforETW startup profile filters out all unknown generic events. If you want to see all of the events (there are many from Windows) then you can use the View Editor to change the filter or else use a different view preset.
The Generic Events graph can be found in Graph Explorer under System Activity.
Window in Focus
This graph shows what process’ window is active. This can be helpful to choose where to analyze. Most games lower their frame rate when they are not active and investigating a poor frame rate when you’re intentionally going idle is pointless.
In the graph shown below the user switched from explorer to WLXPhotoGallery, and then DWM owned the active window for a while.
Aside: DWM becomes active whenever Windows detects that a program has hung – DWM takes over window management so that the user can still move the hung window around.
The Windows in Focus graph can be found in the System Activity section.
CPU Usage (Precise)
CPU Usage (Precise) is the graph formerly known as CPU Usage (there were three variants for grouping by thread/process/CPU). This graph is constructed from context switch data (and interrupt and DPC data) which means it is a sub-microsecond accurate measurement of what thread is running when and why, as well as exactly how much CPU time each thread and process is consuming. That seems pretty important so you probably want this one available. In Graph Explorer under Computation you can find CPU Usage (Precise). UIforETW defaults to putting this graph in a form that is optimized for doing wait analysis – display the table if you want to do this.
The “Timeline by Process, Thread” view, available from the View Presets dropdown, also looks interesting, and makes some patterns more obvious. In particular, if two threads or processes are ping-ponging (taking turns executing) this behavior is much easier to follow on the timeline graph. However the timeline graph devolves into solid bars of color when zoomed out – it really only works when examining fine details.
The screen shot above shows the utilization and timeline graphs over a 20 ms time period.
CPU Usage (Sampled)
CPU Usage (Sampled) is the graph formerly known as CPU Sampling (there were three variants for grouping by thread/process/CPU). This data is constructed using data from the profile provider, which periodically interrupts all CPUs to see what they are doing, with a default rate of 1 KHz.
In UIforETW if you click Fast sampling then the sampling frequency is raised to 8 KHz
Because this data is sampled it cannot tell you what is going on between samples. However with enough samples and with well behaved code the sampling data can be extremely useful. In particular, if you have call stack collection enabled for the profile provider then each sample includes a call stack.
In general I would say that the main purpose of the sampled CPU data is looking at the call stacks, because this tells you (subject to sampling artifacts) what your code is actually doing. Therefore it seems passing strange that WPA defaults to not showing you call stacks.
Since most programs are CPU bound at least part of the time I think this data should always be enabled, which is why UIforETW’s startup profile defaults to showing this data, in table form (the CPU Usage (Precise) data is better for graphing). The UIforETW startup profile also enables stacks. If you want to see the data grouped by module and function then change the View Preset using the drop-down – look for presets that start with randomascii.
If you look in Graph Explorer in the Computation section you’ll find CPU Usage (Sampled section).
If you drag this graph over you now have two graphs showing CPU usage – one precise and one sampled. That’s a waste of space. The precise CPU usage is, well, more precise so we want its graph, and all we really want from the sampled data is call stacks, and those are on the summary table. That means that we now make our first use of the buttons on the right side of the graph’s title bar. The three left buttons let you control whether to display “graph and table”, “graph only”, or “table only”. The default is graph only, but for CPU Usage (Sampled) I think we want table only, as shown above. This gives us a table of precise CPU usage, and a graph of sample CPU usage. Perfect.
If you want to customize the columns for any table click on the View Editor (Ctrl+E) button that is circled in red in the screen shot below:
When you drill in to the sampled data call stacks the graphs are highlighted to show where the samples came from, which can be invaluable data once you learn how to interpret it.
At this point you should have something beautiful like this:
What about wait analysis?
Wait analysis is crucial for solving some types of performance problems and the table necessary for doing this type of analysis is attached to CPU Usage (Precise), since wait analysis uses context switch data. The first step is to enable the graph by clicking “Display graph and table”. If you are using the UIforETW startup profile then the columns will be arranged appropriately for wait analysis, as described here.
I normally have the CPU usage (precise) summary table turned off and I make it visible when I need it. However I always have my favorite column orderings set up, ready and waiting for me.
Profiles-> Save Startup Profile
Once you get everything perfectly arranged, don’t forget to save your startup profile. If you create a profile that is well suited to the type of analysis that your company does then you can share it with your coworkers. WPA 8.1 stores its startup profile in an XML file in “Documents\WPA Files\Startup.wpaProfile”. WPA 10 stores its startup profile in %”localappdata%\Windows Performance Analyzer\Startup.wpaProfile”.
As with xperfview you can zoom in/out with ctrl+mouse-wheel. You can also select a region and then zoom to that region, either in the main window (right-click, Zoom), in a new tab (right-click, Zoom all in new view), or just the current graph in a new tab (right-click, Zoom graph in new view). The workflow is a bit different from xperfview but ultimately the ability to do deep analysis in a single window makes it worthwhile.
In addition to mouse and mouse-wheel based zooming you can use ctrl+”+” and ctrl+”-“ for keyboard based zooming. They don’t zoom around the mouse, but at least I can finally zoom in/out a bit when I don’t have a mouse-wheel available.
With WPA 10 the navigation (zooming and panning) comes with an undo feature, which is quite wonderful.
WPA continues the fine xperfview tradition of hiding some of the disk analysis graphs. In particular the cool chart showing disk-head movements is now found by opening up the Storage section, right-clicking on Disk Usage, and selecting Disk Offset Graph. There’s something amazing and terrifying about seeing how far the physical read/write head on your poor disk sometimes ends up moving.
WPA has improved the trace analysis experience in several ways. Some of these include:
- Asynchronous symbol loading – keep working during the (sometimes slow) symbol loading process
- Easy saving and sharing of startup profiles
- Deep analysis in one window
- Better display of Generic events
- Ctrl+”+” and ctrl+”-“ for keyboard based zooming
- Highlighting of relevant graph sections – makes viewing profile sample locations easier (more on this later)
I keep finding that ETW/xperf/WPA give me access to information that most other developers don’t have. This lets me find and fix performance problems that are otherwise invisible or intractable. I continue to enjoy having x-ray glasses that actually work.
Thanks for this post. I follow your blog specifically for information about xperf and have been playing around with WPA lately as well.
In case you’re looking for suggestions on content for future posts, I’d be interested in a post about GPUView if you have any information about that. The presentations and info on the web show how to see really high-level information like whether your app is GPU bound or CPU bound, if it’s waiting on sync objects in the case of multi-GPU configs, etc. But they don’t talk that much about in-depth analysis like how to know which objects specifically are causing slowdowns, how to make the correspondence between what you see in GPUView and your own engine (like how to identify which resource is causing a sync between GPUs, etc.) and so on. So if you have information to share about GPUView it would be very useful.
I am always interested in suggestions, but unfortunately my GPUView skills are weak. I can distinguish between GPU busy and GPU idle, and that’s about it. Once I think I determined that a slowdown was caused by V-RAM thrashing (huge numbers of DMA packets) but I didn’t feel particularly confident in my answer. I think you should play around with GPUView and write some articles once you’re feeling expert.
Are you sure you couldn’t ask for a “guest post” from someone you know who has some GPUView expertise? 🙂
Seriously though, I will obviously continue tinkering with it, I would have done that anyways. I was just hoping for some “insider info” 🙂
Take a look at http://graphics.stanford.edu/~mdfisher/GPUView.html. It seems to contain some useful information, and the author might be somebody worth talking to in order to get more information.
Unfortunately since I’m no longer at Microsoft I don’t have reliable access to “insider info”.
Bruce, thanks alot for this post – I’ve been consuming windows trace data in various forms and with various tools and didnt realize that there was a new kid on the block in the new windows SDK.
I appreciate the time you spent to share this.
Pingback: “Performance Analyzer doesn’t process traces while elevated” | Random ASCII
Pingback: Xperf Symbol Loading Pitfalls | Random ASCII
Pingback: The Lost Xperf Documentation–Disk Usage | Random ASCII
Pingback: Visual Studio Single Step Performance Fixes | Random ASCII
Pingback: The Lost Xperf Documentation–CPU sampling | Random ASCII
Pingback: The Lost Xperf Documentation–CPU Scheduling | Random ASCII
Pingback: Techniques for Zooming in Xperf Trace Viewers | Random ASCII
Pingback: Xperf Basics: Recording a Trace (the easy way) | Random ASCII
Pingback: New Version of Xperf–Upgrade Now | Random ASCII
Pingback: Bugs I Got Other Companies to Fix in 2013 | Random ASCII
Pingback: Process Tree from an Xperf Trace | Random ASCII
Pingback: ETW Central | Random ASCII
Pingback: Xperf for Excess CPU Consumption: WPA edition | Random ASCII
At what thread priority is CPU Usage (Sampled) collected? Does this sample threads that are running at elevated thread priority? I have an application that’s actually causing Windows itself to stutter (mouse movement becomes jerky), and it looks like my main thread is running at elevated priority some of the time(seen in GPUView), but I don’t see this activity in Windows Performance Analyzer (I don’t see the functions that GPUView samples during the period when the stutter occurs).
The CPU samples are collected by an interrupt service routine, so thread priorities are not relevant.
If GPUView shows your thread running the WPA will show your thread running – they show the same data. Probably you are seeing the difference between CPU Usage (Precise) which is based on context switches and shows exactly when your thread is running and CPU Usage (Sampled) which is based on samples and is therefore statistical, and may miss your thread due to aliasing of the run times with the sample times.
You could try increasing the sampling rate (Fast sampling in UIforETW) but I think that is solving the wrong problem. You probably don’t care what your thread is doing, you care about why other processes (the ones that are unresponsive) are *not* running, and the CPU Usage (Precise) data and wait analysis is therefore more relevant. See some of my hang investigations, such as:
OK, I did eventually find the function that’s hogging thread time in WPA using CPU Usage (Sampled) as well. The function call (GreSelectRedirectionBitmap) was buried quite deeply in the call stack, under BeginPaint. By the end of my 24 hour run, it was using almost 90% of the time consumed by my main application thread. I will try using CPU Usage (Precise). Thank you for your suggestion.
The “delays by process” graph that you reference in your mouse hang blog post does appear to be available when I open my events file in WPA. Possibly my events file does not include events from a necessary trace provider. Do you know what events are required for this graph?
The graph is called UI Delays, and Delays by Process, Type is a particular view of the UI Delays graph. If the UI Delays graph is present it will be in Graph Explorer under System Activity.
The provider that records that data is Microsoft-Windows-Win32k and this provider is automatically recorded by UIforETW. However, the graph will not appear if there are no UI hangs.
Pingback: When Debug Symbols Get Large | Random ASCII – tech blog of Bruce Dawson