A bit over three years ago I wrote about how to use flame graphs to visualize CPU Usage (Sampled) data from ETW, and a year ago I added flame graph support to UIforETW. However these techniques are clumsy and slow and what I really wanted – what I asked for – was flame graph support in Windows Performance Analyzer (WPA), Microsoft’s ETW trace viewer.
And with the 10.0.14393 (Windows 10 Anniversary Edition) version of Windows Performance Toolkit (WPT) I finally got my wish! WPA can now natively display data as flame graphs, and it is good.
Disclaimer: WPT 10.0.14393 requires Windows 8 or above. If you install it on Windows 7 or below it will crash. UIforETW v1.42 will install the latest WPT on Windows8 or above, but will install the previous version on Windows 7. If you want flame graphs on ‘ancient’ operating systems you’ll need to stick to the old-fashioned methods.
The easiest way to get started with flame graphs is to use the built-in CPU Usage (Sampled) preset for Flame by Process, Stack
Make sure the graph is visible and you’ll see the classic flame-graph towers, one for each process, with the widths of the towers representing how much CPU time was recorded. The selected process will be in blue and the rest are in forest-fire reds.
As usual WPA likes to show data for all processes on the system, which is particularly distracting with flame graphs, so you may want to select the process(es) of interest, right-click, and then select Filter to Selection.
Then, because the flame graphs are always scaled to fit in the graph window you will usually want to stretch the graph window vertically in order to be able to read the function names, and you’ll get something like this (truncated) view:
The towers of red represent call stacks, with the root way down off the bottom of the screenshot. The width of each row indicates how many samples were recorded for that particular call stack.
The graph above is taken from PowerPoint Poor Performance Problem and the flame graph makes it immediately obvious, based on row widths, that GpBitmap::PipeLockBits (which ultimately calls GpWicDecoder::Decode) and DpDriver::DrawImage (which ultimately calls DpOutputSpanStretch<1>::StretchScanline) are where the time is going, with KiPageFault consuming a lot of the Decode time. In cases like these the flame graph makes it far easier to get the big picture of what is going on, because the insignificant details become sub-pixel columns that you can’t even see, and when I say that “PowerPoint was slow in this case because it decodes and scales the background image on every refresh” it’s not hard to see that yeah, that’s the problem.
So that’s it – you can now use flame graphs to visualize where CPU time is going. But WPA is infinitely configurable and can be used to graph many types of data in arbitrarily many ways. Graphing CPU Usage (Sampled) data with the default flame graph is just scratching the surface.
Let’s say that instead of graphing CPU Usage (Sampled) we wanted to graph context switches. And let’s say that instead of viewing call stacks we wanted to view the process name and thread ID of the thread being switched in, and the process name and thread ID of the thread that readied that thread. And, finally, let’s say that we wanted the width of our towers to represent the number of context switches (other candidates might include the amount of CPU time consumed, the amount of time the thread was idle, or any other numeric data column).
The startup profile which UIforETW prefers to use (go to Settings, Copy startup profiles to get a fresh copy of it, recommended) has a CPU Usage (Precise) graph open so we can start with that. Click the Select Chart Type button and select Flame, as shown here:
The default grouping columns in the Randomascii Wait Analysis preset are these:
- New Process
- New Thread
- New Thread Stack
- Readying Process
- Readying Thread Id
- Ready Thread Stack
and the graphing column (far right) defaults to % CPU Usage. When WPA makes a flame graph it takes all of the data in all of the grouping columns (left of the gold bar) and stacks them together. That means that each context switch makes a flame graph column with two call stacks and four other rows of data. This is a daunting amount of data and isn’t what we want so the first thing to do is to hide the New Thread Stack and Ready Thread Stack columns. That can be done with the View Editor (Ctrl+E) or by showing the table and then right-clicking on the column headers.
At this point we have a flame graph, but the width of the rows is proportional to CPU Usage, not context switch counts. To fix that we need to drag the Count column to the right of the blue bar, and move the % CPU Usage column to the left of the blue bar. At this point WPA will say:
Flame graph requires one graphing column with Sum aggregation
Okay, sounds reasonable. To fix this we need to open up the View Editor, scroll to the bottom of the right section, and change the Aggregation type for Count to Sum. With that configured we have the flame graph that we wanted!
It turns out that this flame graph is quite helpful in understanding some annoying Windows 10 behavior that I’ve been hitting. RuntimeBroker.exe likes to spring to life spontaneously on my laptop and it wasn’t clear what was triggering it. I had a trace of this activity handy and our context switch flame graph makes it clear that there are two processes that “Ready” RuntimeBroker.exe – itself and Music.UI.exe:
Processes often ready their own threads (one thread readying another) so the really important news is that Music.UI.exe is talking to RuntimeBroker.exe and, because RuntimeBroker always does work on behalf of other processes, we can conclude that Music.UI.exe is ultimately the reason why my CPU gets very busy at random times. Unfortunately I can’t find a way to turn off this behavior (the available settings in Music.UI.exe are minimalist to the extreme) so this knowledge is quite useless so far.
It’s an easy change to have the flame graph widths proportional to % CPU Usage or to display New Thread Stack as one of the grouping columns – it all depends on what sort of problem you are trying to investigate. Unfortunately some numeric data columns (Time Since Last, for instance) cannot be graphed, for unknown reasons.
The Annotation column can be used as another level of the flame graph hierarchy. In the example below I selected all context switches that were readied by RuntimeBroker.exe (by temporarily rearranging the columns), right-clicked, and used Annotate Selection to tag them as “Self-readied” (and the others as “Readied by other process”). I then moved Annotation to the second column so that the context switches were grouped by this new attribute, making it easier to see all the IPC related context switches.
The Annotation column is a powerful way of doing arbitrarily filtering and grouping while analyzing a trace, typically to separate out the areas that are understood from those that still need investigation.
Operators are standing by
A new startup profile that contains a Randomascii flame by Process, thread, count has been committed to UIforETW and the next release will contain this startup profile, to make flame graphs even easier to use. If you find other flame graph arrangements useful, let me know and I might add them.
The official WPA flame graph documentation is here.
For more ETW/WPA documentation, tutorials, and case studies see ETW Central.