Xperf (Windows Performance Toolkit, also known as ETW) is a powerful tool for investigating performance issues; however, it is a challenging tool to use. Some of this difficulty comes from intrinsic complexity – in order to fully investigate thread scheduling issues, for instance, you need to fully understand the Windows thread scheduler.
This post was updated September 2015 to include information about UIforETW, WPA, and new columns.
However, much of the difficulty comes from ‘accidental’ or unnecessary complexity, and included in this accidental complexity is the total lack of documentation of summary table columns. The data represented by these columns is very specific and subtle; however, the lack of documentation means that a user of xperf must first reverse engineer the column meanings, build up a mental model, and then start analyzing the performance question at hand. In some cases the aspiring performance expert must even figure out the units for the columns!
Other posts in this series include:
In this post I take on the challenge of documenting the columns for the CPU Usage (Sampled) summary table which displays CPU samples to let you determine where CPU time is spent.
This post assumes that you are recording traces with UIforETW and that the WPA (Windows Performance Analyzer) startup profiles which ship with UIforETW have been installed. You can do this from the Settings dialog by clicking on Copy startup profiles. This post also assumes that you are using WPA 10 to view the traces – this is the default when you double-click on a trace in UIforETW’s Traces list.
CPU Usage (Sampled)
Xperf has an excellent sampling-based profiler built in. Running at a default rate of 1 kHz (but capable of running at up to 8 kHz – just check Fast sampling in UIforETW) the xperf sampler grabs call stacks from across the system, both kernel and user-mode stacks, with minimal overhead. It is far superior to the many SuspendThread/ResumeThread based systems in various open source projects.
UIforETW’s default startup profile shows the CPU Usage (Sampled) data in table form. There is a graph of sampled data available but the CPU Usage (Precise) graph is more accurate than the CPU Usage (Sampled) graph, so I chose to show the precise graph (based on context switches) instead.
The default set of columns in UIforETW’s startup profile is labeled “Randomascii inclusive (stack)”. Its columns are:
- Thread ID
- Orange bar (columns to the left are used for grouping)
- Weight (in view) (ms)
If you want per-process grouping then hide the Thread ID column. If you want grouping by process name instead of process ID then hide the Process column and show the Process Name column instead. As long as you sort by the Count or Weight (in view) columns the call stacks that consume the most CPU time will be near the top.
Another useful View Preset is labeled “Randomascii exclusive (module and function)”. Its columns are:
- Thread ID
- Orange bar (columns to the left are used for grouping)
- Weight (in view) (ms)
The inclusive view is most useful for attributing samples by call stacks (who called the expensive function) whereas the exclusive view is most useful for seeing what functions consumed the most CPU time all by themselves. In addition to the column modifications suggested for the inclusive view, it can sometimes be worth adding the Address column to the exclusive view, to see where in a function individual samples hit. As long as you sort by Count or Weight the thread with the most samples will be at the top, with the module with the most samples within that thread at the top, and so on.
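To make the inclusive/exclusive distinction concrete, here is a minimal Python sketch. It is not how WPA is implemented; the sample stacks and function names are made up for illustration, and the two helper functions are hypothetical. Exclusive counts credit only the function at the bottom of each stack, while inclusive counts credit every call path that leads to the sample.

```python
from collections import Counter

# Hypothetical samples: each one is a call stack, outermost caller first,
# ending with the function that was executing when the sample was taken.
samples = [
    ["main", "Render", "DrawMesh"],
    ["main", "Render", "DrawMesh"],
    ["main", "Render", "SortObjects"],
    ["main", "Update"],
]

def exclusive_counts(stacks):
    """Credit each sample only to the function where it actually hit."""
    return Counter(stack[-1] for stack in stacks)

def inclusive_counts(stacks):
    """Credit each sample to every call path (stack prefix) that contains it."""
    counts = Counter()
    for stack in stacks:
        for depth in range(1, len(stack) + 1):
            counts[tuple(stack[:depth])] += 1
    return counts

exclusive = exclusive_counts(samples)
inclusive = inclusive_counts(samples)

print(exclusive["DrawMesh"])            # samples that hit in DrawMesh itself
print(exclusive["main"])                # main never appears as a leaf here
print(inclusive[("main",)])             # every sample passes through main
print(inclusive[("main", "Render")])    # samples in Render or its descendants
```

In this made-up data main has an exclusive count of zero but an inclusive count equal to the total number of samples, which is the pattern you typically see for top-level functions in the Stack column.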
Note that the concept of “what instruction was executing when the timer interrupt fired” is not clearly defined for super-scalar out-of-order processors that can have many dozens of instructions in flight. The best analysis of how to interpret the Address column I have seen is found here.
There are other variations that can be useful – most summary table questions are answered by saying “arrange the columns like this…” – but they quickly get esoteric. Once you understand the Zen of summary tables you will know when to use them.
On to the column docs
It is assumed that you have already read the posts explaining how to record a trace and that you have a trace to investigate.
All of the columns in the sampling summary table show data that is related to a particular sample or samples, and the grouping (the columns to the left of the orange bar) affects how many samples and how many rows are displayed. With the default “Randomascii inclusive” view there will be one row for each process, and as you drill down more rows will be revealed. The grouping by stack is particularly important since it lets you see the big picture more easily.
Any threads or processes that were not sampled during the visible time period will not show up, even if they did briefly run.
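The effect of the grouping columns can be sketched in a few lines of Python. This is only an illustration under assumptions: the record fields (process, thread_id, weight_ms) and the group_samples helper are hypothetical and do not correspond to any WPA export format. The point is that the columns to the left of the orange bar act as a grouping key, so removing a column from the key merges rows, and anything with no samples never gets a row at all.

```python
from collections import defaultdict

# Hypothetical sample records; the field names are illustrative only.
samples = [
    {"process": "chrome.exe (1234)", "thread_id": 10, "weight_ms": 1.0},
    {"process": "chrome.exe (1234)", "thread_id": 10, "weight_ms": 1.0},
    {"process": "chrome.exe (1234)", "thread_id": 10, "weight_ms": 1.0},
    {"process": "chrome.exe (1234)", "thread_id": 11, "weight_ms": 1.0},
    {"process": "devenv.exe (5678)", "thread_id": 20, "weight_ms": 1.0},
    {"process": "devenv.exe (5678)", "thread_id": 20, "weight_ms": 1.0},
]

def group_samples(records, grouping_columns):
    """Build one row per distinct combination of the grouping columns."""
    totals = defaultdict(lambda: {"count": 0, "weight_ms": 0.0})
    for record in records:
        key = tuple(record[column] for column in grouping_columns)
        totals[key]["count"] += 1
        totals[key]["weight_ms"] += record["weight_ms"]
    # Sort by weight, descending, so the heaviest rows come first.
    return sorted(totals.items(),
                  key=lambda item: item[1]["weight_ms"],
                  reverse=True)

by_thread = group_samples(samples, ["process", "thread_id"])
by_process = group_samples(samples, ["process"])

for key, row in by_thread:
    print(key, row)
for key, row in by_process:
    print(key, row)
```

Grouping by process and thread gives three rows here, while dropping thread_id from the key collapses them into two per-process rows, much like hiding the Thread ID column.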
Here are the important columns:
- Process Name – this is the name of the process associated with a sample (e.g., devenv.exe) without the process ID. This column is useful when you want to group all processes from one executable together
- Process – this is the name of the process plus its process ID. This column is useful when you want each process to be grouped separately
- Stack – this is a dynamically explorable view of the call stacks for all the samples in the current group. This column is only useful when it is to the left of the orange column. This column displays inclusive time – when a function shows up in the stack data it means that a sample was collected in that function or one of its descendants, along that particular path – see below for more details
- Module – this is the module in which a particular sample hit. This column displays exclusive time and should rarely be used at the same time as the Stack column
- Function – this is the function in which a particular sample hit. This column displays exclusive time and is often used with the Module column
- Address – this column specifies the raw instruction pointer address of a sample. This column displays exclusive time and is often used with the Module and/or Function columns
- Thread ID – this is the ID of the thread associated with a sample. If you enable this column and position it between Process and Stack then you will be grouping by thread (within each process) which lets you analyze threads separately
- Weight (in view) – this is an estimate of how much CPU time was spent in the current grouped row. It is a statistical estimate because a sampling profiler, by its nature, has no idea what is happening between samples. If you have enough samples and if your code is not synchronized to a timer then it will be reasonably accurate. The units are milliseconds
- %Weight – this is an estimate of what percentage of the total available CPU time was spent in the current grouped row. If you have an eight core (or four-core eight-thread) CPU then 100% represents all threads running all the time, and 12.5% represents one thread running all the time. On multi-core machines this column really isn’t very useful
- Count – this is a count of how many samples were recorded. At the default sampling rate of 1 kHz this number will be very similar to Weight (perturbed by minor variations in sample times) but if the sampling rate is changed then this will no longer map to milliseconds, so you should prefer the Weight column for estimating CPU usage. The main value of this column is to let you know when you have a statistically significant number of samples
- TimeStamp – this is the nanosecond precision time at which a sample was recorded. The units are seconds. Occasionally it is valuable to sort samples by time (you’ll usually have to group by samples) – but not often
- Annotation – this column, which is available on most or all WPA tables, can be filled in during trace analysis. See this post for more details.
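The relationship between Count, Weight, and %Weight described above can be checked with some simple arithmetic. The Python sketch below is an approximation under stated assumptions: it treats every sample as worth exactly one sampling interval, whereas the real Weight values vary slightly with the actual time between samples. The function names and the one-second, eight-logical-processor scenario are hypothetical.

```python
def estimated_weight_ms(sample_count, sampling_rate_hz):
    """Approximate CPU time, assuming each sample stands for one interval."""
    return sample_count * 1000.0 / sampling_rate_hz

def percent_weight(weight_ms, view_duration_ms, logical_processors):
    """Share of the CPU time available across all logical processors."""
    return 100.0 * weight_ms / (view_duration_ms * logical_processors)

# One thread that is busy for a full second:
weight_default = estimated_weight_ms(1000, 1000)   # 1000 samples at 1 kHz
weight_fast = estimated_weight_ms(8000, 8000)      # 8000 samples at 8 kHz

# The same second of work on a machine with eight logical processors:
one_thread_percent = percent_weight(weight_default, 1000.0, 8)

print(weight_default)      # 1000.0
print(weight_fast)         # 1000.0
print(one_thread_percent)  # 12.5
```

The count changes by a factor of eight between the two sampling rates while the estimated weight stays the same, which is why Weight is the better column for estimating CPU usage.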
The most important columns are typically available by right-clicking on any column header and checking the column of interest. Some of the more esoteric columns (Image RVA, Is PGO’ed, Compiler Optimization) are only available by going to the View Editor in WPA and dragging from the Available Columns along the left to the column on the right.
Here are the generally unimportant columns, documented for completeness:
- Display Name – this is normally the process name but for services it appears to be the lpDisplayName parameter to CreateService
- Thread Start Module – the module where the function passed to CreateThread is located
- Thread Start Function – the name of the function passed to CreateThread
- DPC/ISR – this column specifies whether a particular sample occurred during Regular CPU Usage, DPC, or ISR time. If you suspect that excessive device-driver time is a problem then you can enable this column and drag it to the left to see the grouping, but normally this information is not relevant to those focused on regular CPU Usage
- CPU – this is the CPU that a thread was running on when a sample was recorded. You should rarely care, unless you are working on the OS thread scheduler. Note that WPA 10’s tooltip for this column is incorrect
- Priority – this is the priority of the thread running when a sample was recorded
- Table – this contains the hard-coded value “CPU Usage (Sampled)”
- Section Name – this is the Portable Executable section where the sample hit, typically “.text” for user-mode programs
- Section RVA – this is the Relative Virtual Address of the sample, relative to the beginning of its section
- Image RVA – this is the Relative Virtual Address of the sample, relative to the beginning of the DLL or EXE
- Compiler Optimization – this shows what level of optimization the code was compiled to. Take it with a grain of salt, as hand-tuned assembly language showed up as optimized for Size while debug-mode non-optimized C++ showed up as Speed. Potentially you can use this to make sure your hot functions are optimized for speed
- Inlined Functions – this shows what functions were inlined into another function, which could be important, but since WPA processes Chrome’s private symbols so slowly I always strip them and therefore miss this information. Is it useful? Let me know
- Trace # – this just appears to represent which trace (WPA can load multiple traces simultaneously) you are viewing
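The relationship between the Address, Image RVA, and Section RVA columns is plain subtraction, as the following Python sketch shows. The addresses, the image base, and the section offset are made-up values, and the helper functions are hypothetical; in a real module the section's starting RVA comes from the Portable Executable section headers.

```python
def image_rva(address, image_base):
    """Offset of a sample from the start of its DLL or EXE in memory."""
    return address - image_base

def section_rva(address, image_base, section_start_rva):
    """Offset of a sample from the start of the section it landed in."""
    return image_rva(address, image_base) - section_start_rva

# Hypothetical values for illustration only.
sample_address = 0x7FF612345678   # raw instruction pointer (Address column)
module_base = 0x7FF612300000      # where the module was loaded
text_start_rva = 0x1000           # where .text begins within the module

sample_image_rva = image_rva(sample_address, module_base)
sample_section_rva = section_rva(sample_address, module_base, text_start_rva)

print(hex(sample_image_rva))    # 0x45678
print(hex(sample_section_rva))  # 0x44678
```

Because the Image RVA does not depend on where the module happened to be loaded, it is presumably the more useful value for comparing samples across runs or traces.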
Unknown or unclear columns:
- Stack Tag and Stack (Frame Tags) are ways of adding additional meaning to stack frames. They are (lightly) documented here. WPA ships with some stack tags – see “C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\Catalog\default.stacktags” for an example
- Thread Name and Thread Activity Tag have unknown meaning and seem to be unused
- The Source File name and Source Line Number columns look promising, but never seem to show any data
- Rank has unknown meaning but is used, showing values from –1 to 2 in my sample traces on Windows 10
- There are also PGO related columns, sometimes only available in the View Editor. You can find out more about them in this article at Kyle Sabo’s web site
See also this article which explains how to interpret the sample locations on an out-of-order super-scalar processor.