Xperf for Excess CPU Consumption

When an application is slow you can, in my experience, generally categorize the problem as CPU bound or CPU idle. Either the CPU is spending an annoyingly long time on computing some result, or the CPU is idle, waiting on some external resource such as the disk, the network, or the user.

This article has been superseded by Xperf for Excess CPU Consumption: WPA Edition. The new recording and analysis tools make ETW traces much easier to use.

The reality can be more complicated since you can end up with your process idle while waiting on a CPU bound process, or you can have a busy loop that is consuming CPU time while waiting on the disk (bad programmer – busy loops are for kids), but as a rough categorization it works pretty well.

This distinction is important because the xperf techniques for investigating these two problems are quite different. For a CPU bound process or thread the question is “where is my code spending all those execution cycles” and for a CPU idle process the question is “where is my code spending all its idle time waiting?” Sometimes a hang will be a mixture of idle time and CPU time, in which case you can either subdivide the hang until you’re looking at a region that is mostly one or the other, or else apply both investigative techniques.

Today we cover the easy case: the CPU is spinning in your process and you’d like to know where.

For investigating why your CPU or thread is idle see Xperf Wait Analysis – Finding Idle time.

Update, WPRUI recorder, WPA viewer

Newer versions of xperf come with wprui, a UI for recording traces, an WPA, a new viewer for traces. Both tools are recommended. WPA’s summary tables are the same as xperfview’s, and The New Xperf is Here will help explain how to get to the summary tables.

PowerPoint Performance Problem

If you already have a trace showing excess CPU consumption then congratulations. If not then you can always just download an example trace. Or, use the batch files from my Recording a Trace post to record a trace of some CPU bound activity (your slow program, Visual Studio building, or whatever).

Load your trace into xperfview. Be sure you’ve read Xperf Analysis Basics, make sure your symbol paths are configured, and make sure you have the right set of graphs enabled in the main xperfview window.

Getting Oriented

Finding the right place in the trace is crucial so that you know which time period to analyze. Sometimes this is trivial – perhaps your process does just one thing – but other times it takes more work to find the right area. The best way to find the area of interest in your trace is to insert user events (again, see the Recording a Trace post), find those in the Generic Events graph, and use them to select the region of interest.

If you don’t have user events then there are various tricks that can be used. When profiling PowerPoint I type a single character, then wait several seconds, and then type another character. On the CPU Usage by Process graph this creates lovely peaks, as shown here:

image

The peak that is selected is about 1.27 seconds long so I now know how long it takes to type a character in PowerPoint on my laptop if the background is absurdly large.

There are tricks that can be used to identify a particular spot in a trace.If your application hangs when you perform a particular action then do that action and then immediately switch to another program. The Window Thread in Focus graph shows which window is selected and can be used to orient yourself in the trace.

Once you have found the relevant section highlight that section, then right-click it and select Clone Selection to copy it to all of the other graphs.

If you are using the example PowerPoint trace linked above then right-click on one of the graphs, choose Select Interval, and set the start time and end time to 10.793881273s and 13.378331726s. Don’t forget to use Clone Selection to copy the selected area to all of the graphs.

Digging Deep

Scroll up to the CPU Sampling by Process graph. The CPU Sampling graphs show a smoothed out estimate of CPU usage based on the (default 1 KHz) sampling profiler. With my preferred tracing batch files the CPU Sampling graphs grab a call stack with each sample and it is these call stacks that let us answer the question of where the code is spending all those execution cycles.

Right-click and make sure that Load Symbols is checked. Then right-click on the selected region of interest and select Summary Table. Summary Tables are xperfview’s way of allowing deeper investigation into many different types of data. The default CPU Sampling Summary Table (regardless of which CPU Sampling graph you invoked it from) looks like this:

image

Effective use of summary tables is all about deciding what columns to show, where to put them, and what to sort by. The columns to the left of the orange bar are used for hierarchical grouping, and the columns to the right of the orange bar display data, often accumulated across many rows that are collapsed together.

The default set of rows is not ideal for most users. The DPC/ISR category is particularly unimportant for most users. It records whether each sample was Regular CPU Usage, ISR (Interrupt Service Routine), or DPC (Deferred Procedure Call). You can see this most easily by dragging the DPC/ISR column to the left, so that it is the first level of grouping. You should see that the Regular CPU Usage group is the vast majority of the samples (the highest Count). If you expand the DPC or ISR categories you’ll probably see that they only apply to the System and Idle processes and… I’ve already spent too much time on this column. I’ve used xperfview for years without having to know anything about it, and I’m sure you can too. Click on the flyout at the left, uncheck DPC/ISR, and move on.

Weight what?

Before going any further it’s worth explaining the four columns at the right. Count is simply a count of how many samples were accumulated for the current row. In the screen shot above this shows how many samples occurred in each process, with the Idle process getting the most. Weight is an estimate of how many milliseconds of execution time those samples represent. With the default 1 KHz sampling rate the Weight and Count fields are very close – the deviation is because the samples aren’t perfectly evenly spaced. If you change the sampling frequency (“xperf -setprofint 1221 cached” sets the sampling frequency to 8 KHz) then the Count field will be about eight times larger than the Weight field. Since milliseconds are more intrinsically meaningful than sampling count I tend to focus on the Weight column, and sometimes hide the Count column when I need more space. It’s important to remember that a sampling profiler only gives you a statistical impression of CPU consumption. If a particular area just has a few samples then it could just be a statistical anomaly. You need hundreds or thousands of samples before you can draw conclusions.

%Weight is simply weight divided by the total CPU time available. The formula is simply 100 * (Weight / 1000) / (duration * numProcessors). The multiplication by 100 is to convert to a percentage, and the division by 1000 is to convert from milliseconds to seconds. The duration is shown at the top of the window, and the number of processors can be found in Trace, System Configuration. On the screen shot above the calculation is 100 * (1389.056618 / 1000) / (1.401404407 * 8 ) = 12.39%. On an eight processor system a %Weight of 12.5% represents one thread running continuously.

TimeStamp is the time each sample was taken. Since this isn’t meaningful for a group of samples this column is blank until you drill down to a sample per row.

For many purposes the only data columns that are useful in this summary table are Weight and %Weight.

Choosing what column to sort by is important when using summary tables, so that the most important data floats to the top. For a CPU Sampling Summary Table you can sort by Weight, %Weight, or Count. They are all proportional to each other and they all represent CPU consumption, which is what we care about.

Grouping for fun and profit

The default first column is Process. That is usually a good choice because when profiling we are usually interested in focusing on a particular process so grouping the samples by process is perfect.  If you were injecting a DLL into every process in the system then you would probably want to drag the Module column to the left so that you could see your code across all processes in the system. That’s an unusual scenario, but it’s cool that xperfview supports it. A more common modification would be if you were working on a system that spawned multiple processes, all using the same executable. It might be useful to group all processes with the same name together, instead of grouping by process ID. To do that just click on the flyout, check Process Name, and uncheck Process. I use this most commonly when profiling a Visual Studio build so that the dozens of cl.exe processes are grouped together.

The next two columns are Module and Function. This can be a useful column ordering. What it means is that we can expand POWERPNT.EXE by clicking on its plus sign and see all the modules within that process where samples were recorded. Then we can expand, say, GdiPlus.dll and see all of the functions within this module within POWERPNT.EXE where samples were recorded. By looking at the % Weight column we can now see that, for the selected time period, POWERPNT.EXE had 12.39% of the sample weight, and within that process WindowsCodecs.dll had 5.74% of the total sample weight, and GdiPlus.dll had 4.31%.

image

It is important to understand that what is counted when using the Module and Function columns is what instruction was being executed when the sampler interrupt fired. Call stacks are ignored when using these columns, so these columns represent exclusive timing information. If you’re doing micro-optimization then you can go one level further and add the Address column between the Function column and the orange bar. This lets you see which instructions in the hot functions in the hot modules are using the most CPU time. The meaning of this is a bit fluffy given the parallel nature of out-of-order processors, but occasionally this will help you find an unexpectedly-hot loop or an integer divide which is slowing your code.

image

You can also group by Function and then Module, to find the hottest (exclusively speaking) function across all modules. This is also useful for seeing functions that may exist in multiple modules. The important thing is to understand that most summary table questions can be answered by rearranging the columns.

Call stacks

It can be interesting to know what functions are being executed, but it is usually more useful to know what code is responsible for the instructions being executed. For that we need a call stack. My xperf batch files always enable the Profile provider to get samples, and the Profile stackwalk option to get call stacks on each sample.

Click on the flyout on the left, uncheck Module, Function, and Address, and check Stack. Make sure that the Stack column is between the Process column and the orange column so that grouping is done by process and then by call stack.

After expanding POWERPNT.EXE and the first two plus signs in the Stack column we get something like this:

image

Root is the imaginary root of all call stacks. When we expand it we see the actual roots of all of the call stacks. The most common root is ntdll.dll!_RtlUserThreadStart and on this example virtually all of the Weight (all but one of the samples) managed to stack walk all the way back that function, which is excellent. If you have a lot of stacks (by Weight) that don’t go back to _RtlUserThreadStart or some similarly plausible start function then either your stack depth is too great (96 for Vista, 192 for Windows 7) for the stack walker to handle, or else some of the code on your stack has frame pointers omitted and the EBP chain was broken. Make sure you compile with /Oy- to avoid this.

Note that all of the numbers shown when grouping by Stack (Weight, %Weight, and Count) are inclusive. They include all of the descendant functions of the call stack.

If you click on the plus sign beside _RtlUserThreadStart you get this:

image

When you get several function names above each other with no ‘-‘ at the left (__RtlUserThreadStart and BaseThreadInitThunk) then this means that this portion of the stack is shared by all functions along this branch of the stack. When you get several function names above each other with a ‘-‘ at the left then that means that the stack splits at this point – some code went into MSO.DLL and some went into POWERPNT.EXE (no symbols for either).

If you are sorting by Weight, %Weight, or Count then you don’t need to think about this too much. Just click on Root and repeatedly press right arrow. This will automatically expand the call stack along the hottest path. After doing this for a bit you should see something like this:

image

All the samples along this path go through GdipDrawImageRect, which calls GpGraphics::DrawImage, which calls itself (or some overload with the same name), which calls gpGraphics::DrvDrawImage. Then we have records of three other functions being called: PipeLockBits, DrawImage, and PipeUnlockBits. All of the samples are somewhere deep inside one of these three functions. You can verify that by adding the weights:

989.939505 + 360.972406 + 14.000538 = 1364.912449

A bit more poking around reveals that PowerPoint is spending an impressive amount of time decoding the background image almost every time it redraws the screen. A more detailed analysis of this weirdness can be found at PowerPoint Poor Performance Problem.

Sampling limitations to remember

There are many ways of misinterpreting the summary table above. You might be tempted to conclude that DrvDrawImage only called three functions, but in fact we don’t know that. We only know that when we took samples and walked the stack we never caught it in the act of calling any other functions, at least not when it was called along this call stack.

It can be tempting to assume that the order of the child functions indicates something about the order in which they were called. It does not. Since we are sorting by Weight it merely indicates which child functions were most expensive.

It’s also important (and annoying) to remember that a sampling profiler, by its very nature, tells us nothing about how many times a function was called. We can’t tell if PipeLockBits was called once and took 990 milliseconds that time, or if it was called 400 times and took an average of about 2.5 milliseconds each time.

Digging to the bottom

If you dig in deep enough then you’ll get to the bottom of the call stacks – the leaf functions where most of the samples are usually taken. Sometimes this is indicated by having a function shown as its own child but with <itself> appended, to indicate that the samples were taken from it.

image

In the example above we can see that Convert32ARGBto32ABGR took about 263 milliseconds of this 1.4 second segment, and about 163 milliseconds of that was spent in KiPageFault. Interesting.

Butterflies

When you see a function on the call stack that doesn’t tell you how much time was spent in that function. It tells you how much time was spent in that function (and its children) along that call stack. If the function was called through other paths (totally different sections of the code, different levels of recursion, etc.) then those call stacks will be separate. Xperfview has an option to display all callers or callees of a particular function. Just right-click on the function in the call stack, go to View Callers or View Callees, and select Innermost or Outermost. This is sometimes called a butterfly view. The distinction between Innermost and Outermost is only relevant for recursive functions.

image

For instance, we might notice KiPageFault on one of the call stacks and wonder if this was consuming a lot of CPU time along other call stacks. A quick trip to View Callers and we get this view:

image

In addition to the 163 milliseconds of time we had noticed previously, KiPageFault is also hit within _VEC_memcpy where it consumes 35 milliseconds, and we can follow the call stack back to the root to find more about how it ended up being triggered. When viewing a Callers butterfly view the call stacks are reversed, with the root now at the bottom, so that the stack in the picture above indicates that HrReadScanline calls _VEC_memzero which calls _VEC_memcpy which then triggers KiPageFault.

It’s worth pointing out that the butterfly view is just another summary table. It is viewing a filtered set of the data, but the same techniques of adding, removing, and rearranging columns in order to examine the data are just as relevant as for ‘regular’ summary tables.

Not that kind of page fault

The presence of KiPageFault on the call stack is interesting. The CPU Sampling Summary Table shows CPU consumption. When we think of page faults we normally think of pages being faulted in from disk, during which time our process is waiting on the disk. When our process is waiting on disk it isn’t running. Therefore, these ‘classic’ page faults do not show up as CPU consumption. This can be verified by looking at the Hard Faults graph for the selected region and noting that there are none.

These instances of KiPageFault must therefore be soft faults. This could mean that some pages were removed from the working but were still in memory, and when the pages were touched they had to be brought in. More likely it represents a large block of memory that was allocated (perhaps with VirtualAlloc). When you request that memory be committed the operating system actually does very little. The memory doesn’t get added to your working set until you first touch it. So, it seems reasonable to guess that this code allocated a block of memory and when writing to it triggered many soft page faults.

Conclusion

Understanding how to explore the CPU Sampling Summary tables is a valuable skill and allows investigation of many types of performance problems. The summary table skills learned are applicable to all other summary tables and should make our next investigate, CPU Scheduling, go more smoothly.

Documentation about all of the CPU Sampling columns can be found at The Lost Xperf Documentation – CPU Sampling.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
This entry was posted in Investigative Reporting, Performance, Programming, xperf and tagged , , , . Bookmark the permalink.

6 Responses to Xperf for Excess CPU Consumption

  1. Pingback: WPA–Xperf Trace Analysis Reimagined | Random ASCII

  2. Pingback: Windows Slowdown, Investigated and Identified | Random ASCII

  3. Pingback: Summarizing Xperf CPU Usage with Flame Graphs | Random ASCII

  4. Pingback: Xperf Basics: Recording a Trace (the easy way) | Random ASCII

  5. Pingback: Xperf for Excess CPU Consumption: WPA edition | Random ASCII

  6. Pingback: ETW Central | Random ASCII

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s