Xperf is a powerful (and free) Windows profiler that provides amazing insights into what is constraining the performance of Windows programs. Xperf includes a sampling profiler and a very capable viewer of the sampled call stacks (see below for UIforETW instructions).
Good news! As of summer 2016 WPA includes the ability to display flame graphs natively! This is far more flexible and faster than the cumbersome export/generate process discussed here. See this article for details, and then stop reading this post.
However, the numerical and text-based nature of the xperf stack viewer is sometimes not ideal. Sometimes a graphical summary would be better. In this post I show how to create Flame Graphs (like the one to the right) that graphically visualize xperf sampling data.
The normal way to explore the sampling data in an xperf capture is to use a summary table with the Stack column to the left of the orange grouping bar. This makes it easy to drill down along the hottest stack.
(See my article “Xperf for Excess CPU Consumption: WPA edition” for details.)
In the screenshot below, showing Visual Studio Single Step performance problems, we can see the call stacks which account for the majority of the CPU time:
Unfortunately, while this does nicely highlight the stack which is the largest contributor to CPU consumption, the main highlighted stack only contains 27 samples out of a total of 637 on this thread. The siblings contribute another 23 samples, but it’s still hard to see the big picture. Such is the nature of displaying data primarily through text and numbers.
Flame Graphs are an alternate way of displaying call stacks, popularized by Brendan Gregg on his blog, in posts such as this one.
Any type of weighted call stack can be converted to a Flame Graph so I decided to write a Python script that would convert ETW traces to collapsed stack files that could be fed to Brendan’s flamegraph.pl script. A Flame Graph of the sampling data above gives an alternate way of visualizing what is going on:
The vertical axis is the call stack, and the horizontal axis represents how much time was spent along each call stack.
Creating flame graphs from ETW data is easier than ever before with UIforETW. Starting with v1.19 the UIforETW releases contain basic flame graph support – just select a trace and choose Scripts-> Create flame graph from the context menu. Just make sure that Python and Perl are installed and that Brendan’s flamegraph.pl script is in the UIforETW bin directory. Additional features (selecting what to flame graph) may come later.
It is immediately and viscerally obvious that the majority of the time is spent along two call stacks – the largest one rooted in XapiWorkerThread::ProcessSyncTask and a smaller one rooted in XapiWorkerThread::ProcessASyncTask. The full name of the second one isn’t visible in the screenshot above, but if you view the VisualStudio.svg file and hover over an entry, its full name is shown. Firefox, Chrome, and IE 10 can all display the .svg files.
How does it work?
The collapsed stack format that is used as an input to flamegraph.pl is easy enough to understand. Each line represents a call stack with individual entries separated by semicolons, and each line ends with a space and a number representing the count or weight of the line. This text file can even be used with grep to do some basic exploring of the call stacks, although the length of the lines usually makes this unwieldy.
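As a minimal sketch of the format, here is a hypothetical Python helper (the function name and the example frames are mine, not from the actual script) that splits one collapsed-stack line into its frames and its weight:

```python
# Minimal sketch: parse one line of the collapsed stack format.
# A line is semicolon-separated stack frames followed by a space and a count.

def parse_collapsed_line(line):
    """Split a collapsed-stack line into (list of frames, sample count)."""
    # rpartition splits on the *last* space, so only the trailing count
    # is separated from the stack portion of the line.
    stack, _, count = line.rstrip("\n").rpartition(" ")
    return stack.split(";"), int(count)

frames, count = parse_collapsed_line(
    "XapiWorkerThread::ProcessSyncTask;Foo::Bar 27")
```
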
The script to convert xperf traces to collapsed stack files has been rewritten to use wpaexporter which makes the process much simpler and cleaner. The export process is slightly faster, and the processing of the exported data is far faster as there is an order of magnitude less data to parse.
The next section of this post is now obsolete and is only retained for historical interest. The new script is clean enough that, with the wpaexporter post linked above, it should be easy to understand.
To convert xperf files to the collapsed stack format I made use of a few ‘lightly documented’ and unsupported features of xperf.
You can get some hints about how they work by typing xperf -help processing. None of the specific processing commands gave me what I wanted so I used the generic dumper command. The syntax for this, expressed in Python, is:
'xperf -i %s -symbols -o %s -a dumper -stacktimeshifting -range %d %d'
The arguments are the input ETL file, the output text file, and the beginning and ending of the time range in microseconds.
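A sketch of how a script might fill in that command template in Python (build_dumper_command is a hypothetical name, not code from the actual script):

```python
# Sketch: fill in the xperf dumper command template described above.
# Arguments: input ETL file, output text file, and the time range
# in microseconds.

def build_dumper_command(etl_file, txt_file, start_us, end_us):
    return ('xperf -i %s -symbols -o %s -a dumper '
            '-stacktimeshifting -range %d %d' % (
                etl_file, txt_file, start_us, end_us))

cmd = build_dumper_command("trace.etl", "dump.txt", 0, 5000000)
# The resulting string can then be handed to subprocess for execution.
```
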
The format of the output is only documented through text lines at the top of the output file which list the names of the fields for each of the types of output data. The first type of data that we care about is SampledProfile:
SampledProfile, TimeStamp, Process Name (PID), ThreadID, PrgrmCtr, CPU, ThreadStartImage!Function, Image!Function, Count, SampledProfile type
And the other important data type is Stack:
Stack, TimeStamp, ThreadID, No., Address, Image!Function
Unfortunately these blocks of data are disjoint and there is no documentation on how to associate them. It appears that the only way to do this is to use the common information which is the TimeStamp and ThreadID. So, my Python parser scans the xperf results looking for SampledProfile and Stack data. When it sees a SampledProfile event it says “Hey, this TimeStamp and ThreadID is associated with this Process Name(PID). Watch for it”. When a blob of stack data shows up the script uses the TimeStamp and ThreadID to associate the data with a process. It’s crude, and doesn’t produce identical results to the WPA UI, but it appears to be close enough to be useful.
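The pairing logic can be sketched in Python roughly as follows (an illustrative simplification, not the actual parser):

```python
from collections import defaultdict

def associate_stacks(lines):
    """Pair Stack blobs with SampledProfile events via (TimeStamp, ThreadID)."""
    pending = {}                 # (timestamp, tid) -> process name
    stacks = defaultdict(list)   # process name -> collected Image!Function frames
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        if fields[0] == "SampledProfile":
            # SampledProfile, TimeStamp, Process Name (PID), ThreadID, ...
            pending[(fields[1], fields[3])] = fields[2]
        elif fields[0] == "Stack":
            # Stack, TimeStamp, ThreadID, No., Address, Image!Function
            process = pending.get((fields[1], fields[2]))
            if process is not None:
                stacks[process].append(fields[5])
    return stacks
```

A real parser also has to accumulate the numbered Stack lines into complete call stacks, but the (TimeStamp, ThreadID) lookup is the key idea.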
The rest of the script is just data manipulation – dictionary lookups, sorting, spindling, and other boring details.
Typical usage looks like this:
C:> xperf_to_collapsedstacks.py trace.etl
> wpaexporter.EXE trace.etl --profile ExportCPUUsageSampled.wpaProfile --symbols
Elapsed time for wpaexporter: 58.033 s
Found 10456 samples from 98 threads.
Writing 8320 samples to temporary file collapsed_stacks_0.txt
> perl flamegraph.pl collapsed_stacks_0.txt
Results are in POWERPNT.EXE_5100_3308.svg – they should be auto-opened in the default SVG viewer.
I currently have my Python script configured to summarize just the busiest thread as a collapsed call stack file, but that can easily be changed, eventually through UIforETW.
Summary tables versus Flame Graphs
When all of the CPU consumption is along a single stack then drilling down into the WPA summary table is straightforward. However if the samples are spread out over a couple of call stacks then it can take a while to drill down and find the significant parts of both call stacks. If there are a half-dozen significant call stacks then summary tables are quite unwieldy.
In the example below, from Outlook when my entire computer was having a bad day, there were four thousand samples on the main thread, almost all of them in ClearSendMessages. However they are on many different call stacks. In the screen shot below we can see two of the dozen or so call stacks, accounting for fewer than a thousand of the four thousand samples. It took several minutes of exploring to realize that this one function was accounting for almost all of the CPU time:
The Flame Graph makes it obvious. The many irrelevant functions are squeezed horizontally and are easy to ignore. The expensive functions are wide horizontal lines and you can instantly look at the top of the widest columns and realize that the same function is always at the top:
The three wide areas at the top are all ClearSendMessages, and so is the one at the left, and this is plainly obvious, instead of being hidden amongst a sea of text. You can see more details in the original Outlook.svg file.
Initial versions of this graph had severely truncated call stacks due to a rarely seen detail of how call stack collection happens during kernel callbacks. Stack walking cannot cross the kernel boundary (user mode and kernel mode use different stacks, and may even have different pointer sizes), so the collected stacks are cut off at the kernel boundary. The xperf trace viewers hide this by merging in the other portions of the stack, and to get the dumper action to do this you have to specify the -stacktimeshifting option (thanks to Kyle in the comments). Since WindowProcs are kernel callbacks, this call stack coalescing is needed more often than you might think.
(See my article on When Even Crashing Doesn’t Work for other implications of crossing the kernel divide in WindowProcs)
It appears that stack collection on Linux using perf behaves identically in the kernel, with [VDSO] entries showing up as a separate stack root.
Flame Graphs are clearly a useful alternate way of visualizing stack data. If this sort of graphical summary were built into WPA it would make some performance issues easier to understand. Currently if you want to use Flame Graphs you have to find an area of interest where CPU consumption is the issue and then run the scripts, passing the time range along to them.
The most likely use cases I can see for this are reports – visual summaries of performance problems found – or help in finding regressions. I can certainly imagine a nightly performance test that creates Flame Graphs for key parts of a process so that the causes of regressions can be seen visually.
Other things that would need to be done to make these truly useful would include:
- Create scripts for dumping idle time stacks, perhaps in the same graph as the CPU consumption but in a different color
- Create scripts for summarizing memory allocations
- See if any of the trace parsing C# code that Vance Morrison blogs about would provide better ways of parsing traces
- Integrate Flame Graphs into WPA to make it easy to get a graph of the current time range being explored
In most cases the ability to interactively explore the xperf trace data in WPA – zooming in and out and seeing many types of data on the timeline – will outweigh the benefits of generating Flame Graphs outside of WPA. However I anticipate using them on some investigations in order to get a better high-level overview of what is happening, and to more easily share findings with those who aren’t as accustomed to reading xperf’s call stacks.
I hope that some day Microsoft integrates this type of visualization, either directly or through allowing plugins to the WPA UI. Done!
The script that converts xperf traces to collapsed call stacks can be found in the UIforETW repo on github at xperf_to_collapsedstacks.py and is available from the Traces context menu. You’ll need to download flamegraph.pl separately and put it in the same directory. The two sample Flame Graphs are VisualStudio.svg and Outlook.svg.
Alois Kraus wrote a managed tool that uses the TraceEvent Library to convert ETL files to flame graphs – you can find more details here.