Last year I wrote an article about how to use xperf to analyze a CPU bound program, using the xperfview UI. Since then the WPA UI has come out and replaced xperfview. This updated version of the article explains how to do the same analysis steps using WPA.
It’s an encouraging sign that this article is shorter and simpler than the original.
The original article on using xperfview to analyze excess CPU consumption can be found here, but consider it deprecated. This article has been updated for UIforETW and WPA 10.
To see the techniques discussed here demonstrated take a look at the first video at ETW Training Videos Available Now.
Xperf for excess CPU consumption
When an application is slow you can generally categorize the problem as CPU bound or CPU idle. Either the CPU is spending an annoyingly long time on computing some result, or the CPU is idle, waiting on some external resource such as the disk, the network, or the user.
The reality can be more complicated since you can end up with your process idle while waiting on another process that is CPU bound, or you can have a busy loop that is consuming CPU time while waiting on the disk (bad programmer – busy loops are for kids), but as a rough categorization it works pretty well.
(see my post In Praise of Idleness for many reasons not to busy wait)
This distinction is important because the WPA techniques for investigating these two problems are quite different. For a CPU bound process or thread the question is “where is my code spending all those execution cycles” and for a CPU idle process the question is “where is my code spending all its idle time waiting?” Sometimes a slowdown will be a mixture of idle time and CPU time, in which case you can either subdivide the hang until you’re looking at a region that is mostly one or the other, or else apply both investigative techniques.
Today we cover the easy case: the CPU is spinning in your process and you’d like to know where.
For investigating why your CPU or thread is idle see Xperf Wait Analysis – Finding Idle Time.
For instructions on how to install the latest version of xperf, how to record traces, how to use custom user providers to insert custom ETW events, and how to configure WPA for easier trace analysis read my Xperf Basics: Recording a Trace (the ultimate easy way) post. As the title suggests, it’s really easy, finally.
PowerPoint Performance Problem
If you already have a trace showing excess CPU consumption then congratulations. However in order to follow along I recommend downloading an example trace showing PowerPoint being slow, as discussed here.
Make sure you have the WPA startup profile set up (see the Recording a Trace (the easy way) post) then load your trace into WPA. Be sure you’ve read The New Xperf is Here and make sure your symbol paths are configured.
When you run UIforETW it will copy over a good WPA startup profile if you don’t have one already. If you want to be sure you have the latest/greatest version then go to the UIforETW Settings dialog and click Copy startup profiles.
If you run UIforETW and use it to record a trace, or copy a previously recorded trace to its documents\etwtraces directory, you can then launch WPA from UIforETW and the symbol path will be set up automatically.
Finding the right place in the trace is crucial so that you know which time period to analyze. Sometimes this is trivial – perhaps your process does just one thing – but other times it takes more work to find the right area. The best way to find the area of interest in your trace is to insert user events – look for ETWEventDemo in the UIforETW announcement post. The source code for ETWEventDemo and UIforETW are both good examples of how to emit custom ETW events using ETWProviders*.dll.
If you’re trying to profile somebody else’s program (why do I keep doing that?) you can’t insert user events into that program, but you can use input events (recorded by UIforETW) to orient yourself.
The trace below shows KeyDown events for every key press, including virtual key codes, but for this investigation we just need to know when the events occurred, and that can be plainly seen in the screen shot below:
We can see that on my torture-test PowerPoint file it takes about a second of CPU time to process each key stroke (the two CPU usage spikes at the left). We can also see that if we type many characters at once (on the right) then they get batched up and processed more efficiently – that’s actually a clue as to where the problem lies.
If you don’t have user events then there are various tricks that can be used. When I type slowly at the beginning of the trace the CPU spikes are obvious. Note that the CPU Usage (Precise) graph is the best choice for finding spikes because it is based on context switch data, which makes it the most accurate way of determining when CPU usage begins and ends.
Also note that for the screenshot above I clicked on POWERPOINT.EXE in the CPU Usage graph, which caused all of PowerPoint’s CPU usage ranges to be highlighted. WPA frequently does this type of helpful highlighting – watch for it.
Other orientation tricks
Another trick for finding the right time range in a trace is to use the Window in Focus graph. If your application hangs when you perform a particular action then do that action and then immediately switch to another program. Or vice-versa. The Window in Focus graph shows which window is selected when, as shown here where I alt+tabbed from WPRUI to POWERPNT.EXE.
Once you have found the relevant section you can use the mouse to select that section, then right-click it and select one of the zoom options from the context menu.
If you are following along with the example PowerPoint trace linked above then right-click on one of the graphs, choose Select Time Range…, and set the start time and end time to 5.65 s and 6.40 s.
Scroll down to the CPU Usage (Sampled) table. This table shows data gathered by the sampling profiler. With the default UIforETW settings a call stack is recorded with each sample and it is these call stacks that let us answer the question of where the code is spending all those execution cycles. In the UIforETW startup profile for WPA the CPU Usage (Sampled) table shows call stacks grouped by process, thread ID, and stack, with the data fields being Count, Weight, and %Weight. The table should look something like this:
Before going any further it’s worth explaining the Count and Weight columns. Count is simply a count of how many samples were accumulated for the current row. In the screen shot above this shows how many samples occurred in each process, with PowerPoint getting the most (the idle process is hidden by default). Weight is an estimate of how many milliseconds of execution time those samples represent. With the default 1 KHz sampling rate the Weight and Count fields are very close – the deviation is because the samples aren’t perfectly evenly spaced. If you change the sampling frequency (“xperf -setprofint 1221 cached” sets the sampling frequency to about 8 KHz) then the Count field will be about eight times larger than the Weight field. Since milliseconds are more intrinsically meaningful than sampling count I tend to focus on the Weight column, and sometimes hide the Count column when I need more space. However it’s important to remember that a sampling profiler only gives you a statistical impression of CPU consumption. If a particular area just has a few samples then it could just be a statistical anomaly. You need hundreds or thousands of samples before you can draw conclusions.
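The arithmetic behind those two columns is simple enough to sketch. In the sketch below the profile interval passed to “xperf -setprofint” is interpreted as 100 ns units (so 1221 gives roughly 8 KHz); the function names are my own, just for illustration:

```python
# Sketch of the Count/Weight relationship for a sampling profiler.
# The xperf profile interval is specified in 100 ns units.
def sample_rate_hz(profint_100ns):
    # 10,000,000 ticks of 100 ns per second
    return 1e7 / profint_100ns

default_hz = sample_rate_hz(10000)  # default interval: 10000 * 100 ns = 1 ms -> 1 KHz
fast_hz = sample_rate_hz(1221)      # "xperf -setprofint 1221 cached" -> ~8.19 KHz

# At 1 KHz each sample represents ~1 ms, so Count is close to Weight (in ms).
# At ~8 KHz each sample represents ~0.122 ms, so Count is about 8x Weight.
def expected_count(weight_ms, rate_hz):
    return weight_ms * rate_hz / 1000.0

print(round(default_hz))  # 1000
print(round(fast_hz))     # 8190
print(round(expected_count(100, fast_hz) / expected_count(100, default_hz)))  # 8
```

This is why Weight stays meaningful across sampling frequencies while Count does not.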
Choosing what column to sort by is important when using summary tables, so that the most important data floats to the top. For a CPU Sampling Summary Table you can sort by Weight or Count. They are proportional to each other and they both represent CPU consumption, which is what we care about.
For more details on the CPU sampling columns see The Lost Xperf Documentation–CPU sampling.
It’s time to drill into the sampling data. Go to the Trace menu and make sure that Load Symbols is checked. Click the arrow beside POWERPNT to drill into that process, and then drill into thread 11780 – the thread consuming the most CPU time. You should see something like this:
[Root] is the imaginary root of all call stacks, and ?!? is a sample with no call stack. When we expand [Root] we see the actual roots of all of the call stacks. The most common root is ntdll.dll!_RtlUserThreadStart and in this example virtually all of the Weight (all but one of the samples) managed to stack walk all the way back to that function, which is excellent. If you have a lot of stacks (by Weight) that don’t go back to _RtlUserThreadStart or some similarly plausible start function then either your stack depth is too great (96 for Vista, 192 for Windows 7) for the stack walker to handle, or else some of the code on your stack has frame pointers omitted and the EBP chain was broken. Make sure you compile with /Oy- to avoid this.
Note that all of the numbers shown when grouping by Stack (Weight, %Weight, and Count) are inclusive. They include all of the descendant functions of the call stack.
Here’s what WPA looks like after you click on the plus sign beside [Root] to expand it:
When you get several function names above each other with no ‘-‘ at the left then this means that this portion of the stack is shared by all functions along this branch of the stack. This is the case for all of the functions visible above, even those in MSO.DLL and ppcore.dll, for which we have no symbols. Here we have scrolled down a bit, descending deeper into the call stacks:
When you get several function names above each other with a ‘-‘ at the left then that means that the stack splits at this point – in the screen shot above most of the samples were from call stacks that went into GFX.DLL and some went into OART.DLL (no symbols for either).
If you are sorting by Weight, %Weight, or Count then you don’t need to think about this too much. Just click on Root and repeatedly press right arrow. This will automatically expand the call stack along the hottest path. After doing this for a bit you should see something like this:
784 samples along this path go through GdipDrawImageRectRect, which calls GpGraphics::DrawImage, which calls itself (or some overload with the same name), which calls GpGraphics::DrvDrawImage. Then we have records of three other functions being called: DrawImage, PipeLockBits, and PipeUnlockBits. All of the samples that have DrvDrawImage in their stack are somewhere deep inside one of these three functions. You can verify that by adding the counts:
445 + 324 + 15 = 784
In other words, the relationship between GdipDrawImageRectRect/ GpGraphics::DrawImage/ GpGraphics::DrawImage/ GpGraphics::DrvDrawImage is a great-grandparent/ grandparent/ parent/ child relationship. WPA (like xperfview before it) doesn’t indent child functions if there is only one child – this avoids wasting horizontal space. Meanwhile the relationships between GpGraphics::DrvDrawImage and the three indented functions are parent/child relationships, and the three indented functions are siblings.
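The inclusive numbers obey a simple invariant: a frame’s weight equals its own (exclusive) samples plus the inclusive weights of its children. Here’s a minimal sketch using the counts from the DrvDrawImage example; note that I’ve collapsed each child’s subtree into a single own-sample count for simplicity, which is not how the real data is shaped:

```python
# Sketch: inclusive sample counts in a call tree. A node's inclusive count
# is its own (exclusive) samples plus the inclusive counts of its children.
class Frame:
    def __init__(self, name, own=0, children=()):
        self.name = name
        self.own = own
        self.children = list(children)

    def inclusive(self):
        return self.own + sum(c.inclusive() for c in self.children)

drv_draw_image = Frame("GpGraphics::DrvDrawImage", own=0, children=[
    Frame("DrawImage", own=445),
    Frame("PipeLockBits", own=324),
    Frame("PipeUnlockBits", own=15),
])

print(drv_draw_image.inclusive())  # 445 + 324 + 15 = 784
```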
A bit more poking around reveals that PowerPoint is spending an impressive amount of time decoding the background image almost every time it redraws the screen – a lot of time is spent in GdiPlus.dll!GpWicDecoder::Decode. A more detailed analysis of this weirdness can be found at PowerPoint Poor Performance Problem.
Effective use of summary tables is all about deciding what columns to show, where to put them, and what to sort by. The columns to the left of the orange bar are used for hierarchical grouping, and the columns to the right of the orange bar display data, often accumulated across many rows that are collapsed together. Any column can be used for sorting.
The default set of columns for WPA, as for xperfview before it, is not ideal for most users. It hides the stack, and shows other data columns that are less important. That’s why you should use my startup profile which defaults to showing CPU sampled data with its “Stack based – inclusive” view.
However there are other useful ways to group data. It is sometimes useful to group by Module and Function instead of stack. You can do this manually – by right-clicking a column header and checking, unchecking, and rearranging, or you can make use of the Module and function based – exclusive preset which comes with my startup profile – the drop down for switching views is shown above.
With that view selected we can expand POWERPNT.EXE by clicking on it and see all the modules within that process where samples were recorded. Then we can expand, say, GdiPlus.dll and see all of the functions within this module within POWERPNT.EXE where samples were recorded. By looking at the Weight column we can now see that, for the selected time period, POWERPNT.EXE used about 763 ms of CPU time, and within that process GdiPlus.dll used about 518 ms of CPU time:
It is important to understand that what is counted when using the Module and Function columns is what instruction was being executed when the sampler interrupt fired. Call stacks are ignored when using these columns, so these columns represent exclusive timing information. If you’re doing micro-optimization then you can go one level further and add the Address column between the Function column and the orange bar. This lets you see which instructions in the hot functions in the hot modules are using the most CPU time. The meaning of this is a bit fluffy given the parallel nature of out-of-order processors, but occasionally this will help you find an unexpectedly-hot loop or an integer divide which is slowing your code.
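The bookkeeping difference between the two groupings can be sketched directly from a list of sampled stacks. This is not WPA’s code and the stacks are invented – it just illustrates inclusive (function appears anywhere on the stack) versus exclusive (function was executing when the sample fired) attribution:

```python
# Sketch: inclusive vs exclusive attribution from sampled call stacks.
# Each sample is a stack from root (first) to the executing function (last).
from collections import Counter

samples = [
    ["RtlUserThreadStart", "WinMain", "DrawImage", "PipeLockBits"],
    ["RtlUserThreadStart", "WinMain", "DrawImage", "PipeLockBits"],
    ["RtlUserThreadStart", "WinMain", "DrawImage"],
    ["RtlUserThreadStart", "WinMain", "Layout"],
]

inclusive = Counter()  # function appears anywhere on the stack
exclusive = Counter()  # function was executing when the interrupt fired

for stack in samples:
    for func in set(stack):  # count each function at most once per sample
        inclusive[func] += 1
    exclusive[stack[-1]] += 1

print(inclusive["DrawImage"], exclusive["DrawImage"])        # 3 1
print(inclusive["PipeLockBits"], exclusive["PipeLockBits"])  # 2 2
```

Stack grouping shows inclusive numbers; Module/Function grouping shows exclusive numbers, because only the leaf of each sample is counted.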
Sampling limitations to remember
There are many ways of misinterpreting the summary table above. You might be tempted to conclude that DrvDrawImage only called three functions, but in fact we don’t know that. We only know that when we took samples and walked the stack we never caught it in the act of calling any other functions, at least not when it was called along this call stack.
It can be tempting to assume that the order of the child functions indicates something about the order in which they were called. It does not. Since we are sorting by Weight it merely indicates which child functions were sampled the most.
It’s also important (and annoying) to remember that a sampling profiler, by its very nature, tells us nothing about how many times a function was called. We can’t tell if PipeLockBits was called once and took ~291 milliseconds that time, or if it was called 1,000 times and took ~0.291 milliseconds each time.
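You can see why by working the numbers: at a 1 KHz sampling rate, 291 ms of CPU time inside a function yields roughly 291 samples whether that time came from one call or a thousand. A toy illustration (the call patterns are invented):

```python
# Sketch: a sampling profiler sees only total time, not call counts.
# At 1 KHz, expected samples ~= total CPU milliseconds spent in the function.
def expected_samples(calls, ms_per_call, rate_hz=1000):
    total_ms = calls * ms_per_call
    return total_ms * rate_hz / 1000.0

one_big_call = expected_samples(calls=1, ms_per_call=291.0)
many_small_calls = expected_samples(calls=1000, ms_per_call=0.291)
print(round(one_big_call), round(many_small_calls))  # 291 291 -- indistinguishable
```

If you need call counts you need instrumentation (such as custom ETW events), not sampling.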
Digging to the bottom
If you dig in deep enough then you’ll get to the bottom of the call stacks – the leaf functions where most of the samples are usually taken. Sometimes this is indicated by having a function shown as its own child but with <itself> appended, to indicate that the samples were taken from it:
In the example above we can see that Convert_XYZW_ZYXW_32bpp_SSSE3_Impl<0,1> took about 42 milliseconds of this second segment, and about 27 milliseconds of that was spent in KiPageFault. Interesting…
Seeing a function on the call stack doesn’t tell you how much total time was spent in that function. It tells you how much time was spent in that function (and its children) along that call stack. If the function was called through other paths (totally different sections of the code, different levels of recursion, etc.) then those call stacks will be separate. WPA has an option to display all callers or callees of a particular function. Just right-click on the function in the call stack, go to Butterfly, and select View Callers or View Callees:
For instance, we might notice KiPageFault on one of the call stacks and wonder if this was consuming a lot of CPU time along other call stacks. A quick trip to View Callers and we get this view:
This tells us that on this thread in this process KiPageFault only showed up in 30 call stacks. Note that when viewing a Callers butterfly view the call stacks are reversed.
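Conceptually the callers view is a simple transformation of the sampled stacks: keep only the samples that contain the pivot function, truncate each stack at the pivot, and reverse it so the pivot becomes the new root. This is just an illustration of the idea, not WPA’s implementation, and the stacks are made up:

```python
# Sketch: building a "View Callers" butterfly from sampled stacks.
# Stacks run root-first; the butterfly keeps the portion from the pivot
# back toward the root, reversed so the pivot becomes the new root.
from collections import Counter

samples = [
    ["RtlUserThreadStart", "DrawImage", "memcpy", "KiPageFault"],
    ["RtlUserThreadStart", "DrawImage", "memcpy", "KiPageFault"],
    ["RtlUserThreadStart", "Layout", "KiPageFault"],
    ["RtlUserThreadStart", "Layout"],  # no page fault: excluded from the butterfly
]

def callers(samples, pivot):
    stacks = Counter()
    for stack in samples:
        if pivot in stack:
            idx = stack.index(pivot)
            stacks[tuple(reversed(stack[: idx + 1]))] += 1
    return stacks

result = callers(samples, "KiPageFault")
for stack, count in sorted(result.items(), key=lambda kv: -kv[1]):
    print(count, " <- ".join(stack))
# 2 KiPageFault <- memcpy <- DrawImage <- RtlUserThreadStart
# 1 KiPageFault <- Layout <- RtlUserThreadStart
```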
It’s worth pointing out that the butterfly view is just another summary table. It is viewing a filtered set of the data, but the same techniques of adding, removing, and rearranging columns in order to examine the data are just as relevant as for ‘regular’ summary tables.
You can return to viewing the regular call stacks by going to the Butterfly menu and selecting Clear.
Not that kind of page fault
The presence of KiPageFault on the call stack is interesting. The CPU Sampling Summary Table shows CPU consumption. When we think of page faults we normally think of pages being faulted in from disk, during which time our process is waiting on the disk. When our process is waiting on disk it isn’t running. Therefore, page faults that hit the disk do not show up as CPU consumption. This can be verified by looking at the Hard Faults graph (in the Memory section) for the selected region and noting that there are none.
These instances of KiPageFault must therefore be soft faults. This could mean that some pages were removed from the working set but were still in memory, and when the pages were touched they had to be brought in. More likely it represents a large block of memory that was allocated (perhaps with VirtualAlloc). When you request that memory be committed the operating system actually does very little. The memory doesn’t get added to your working set until you first touch it. So, it seems reasonable to guess that this code allocated a block of memory and when writing to it triggered many soft page faults. This issue was discussed in Hidden Costs of Memory Allocation.
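This commit-then-first-touch behavior can be demonstrated on any demand-paged OS. The sketch below is a Unix/Linux stand-in (I’m using an anonymous mmap rather than VirtualAlloc, and ru_minflt as the soft fault counter); the exact fault counts vary by platform and configuration:

```python
# Sketch: committing memory is cheap; soft page faults happen on first touch.
# An anonymous mmap stands in for VirtualAlloc(MEM_COMMIT); ru_minflt is the
# soft (minor) page fault count on Unix-like systems.
import mmap
import resource

PAGE = mmap.PAGESIZE
NPAGES = 1024

buf = mmap.mmap(-1, NPAGES * PAGE)  # "commit" ~4 MB; no pages touched yet

before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
for page in range(NPAGES):
    buf[page * PAGE] = 1  # first touch: each write can soft-fault a page in
after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt

print(f"soft faults while touching {NPAGES} pages: {after - before}")
buf.close()
```

The allocation itself faults nothing; the faults show up in the touch loop, which is exactly where the samples landed in the PowerPoint trace.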
Understanding how to explore the CPU Sampling Summary tables is a valuable skill and allows investigation of many types of performance problems. The summary table skills learned are applicable to all other summary tables and should make our next investigation, CPU Scheduling, go more smoothly.
By using custom user events you can easily identify slow frames and then drill in to those specific areas. The importance of adding a bit of domain specific ETW instrumentation to your code cannot be overstated. ETW instrumentation is very light weight so you can keep it in your release builds, which means you can investigate performance anomalies in your shipping code.
If the CPU Usage (Sampled) graph is unavailable or empty then that may mean that some software has taken over the interrupt which the ETW sampling profiler needs to generate this data. VTune Amplifier 2019 is one program that is known for doing this – it breaks ETW sampling after you have used it, until you reboot.
- Xperf Basics: Recording a Trace (the ultimate easy way) – getting xperf, using custom providers, recording a trace, and configuring wpa
- WPA–Xperf Trace Analysis Reimagined – overview of how to use wpa
- The Lost Xperf Documentation – CPU Sampling – detailed explanation of all of the CPU sampling columns
- Stack Walking in Xperf – an old and slightly archaic description of recording and analyzing call stacks with xperf
- The meaning of interrupt-based sampling on super-scalar out-of-order processors is not obvious, so I’m glad that this article does some analysis of it