ETW Flame Graphs Made Easy

A bit over three years ago I wrote about how to use flame graphs to visualize CPU Usage (Sampled) data from ETW, and a year ago I added flame graph support to UIforETW. However these techniques are clumsy and slow and what I really wanted – what I asked for – was flame graph support in Windows Performance Analyzer (WPA), Microsoft’s ETW trace viewer.

And with the 10.0.14393 (Windows 10 Anniversary Edition) version of Windows Performance Toolkit (WPT) I finally got my wish! WPA can now natively display data as flame graphs, and it is good.

Disclaimer: WPT 10.0.14393 requires Windows 8 or above. If you install it on Windows 7 or below it will crash. UIforETW v1.42 will install the latest WPT on Windows8 or above, but will install the previous version on Windows 7. If you want flame graphs on ‘ancient’ operating systems you’ll need to stick to the old-fashioned methods.

The easiest way to get started with flame graphs is to use the built-in CPU Usage (Sampled) preset for Flame by Process, Stack

image

Make sure the graph is visible and you’ll see the classic flame-graph towers, one for each process, with the widths of the towers representing how much CPU time was recorded. The selected process will be in blue and the rest are in forest-fire reds.

As usual WPA likes to show data for all processes on the system, which is particularly distracting with flame graphs, so you may want to select the process(es) of interest, right-click, and then select Filter to Selection.

Then, because the flame graphs are always scaled to fit in the graph window you will usually want to stretch the graph window vertically in order to be able to read the function names, and you’ll get something like this (truncated) view:

image

The towers of red represent call stacks, with the root way down off the bottom of the screenshot. The width of each row indicates how many samples were recorded for that particular call stack.

The graph above is taken from PowerPoint Poor Performance Problem and the flame graph makes it immediately obvious, based on row widths, that GpBitmap::PipeLockBits (which ultimately calls GpWicDecoder::Decode) and DpDriver::DrawImage (which ultimately calls DpOutputSpanStretch<1>::StretchScanline) are where the time is going, with KiPageFault consuming a lot of the Decode time. In cases like these the flame graph makes it far easier to get the big picture of what is going on, because the insignificant details become sub-pixel columns that you can’t even see, and when I say that “PowerPoint was slow in this case because it decodes and scales the background image on every refresh” it’s not hard to see that yeah, that’s the problem.

So that’s it – you can now use flame graphs to visualize where CPU time is going. But WPA is infinitely configurable and can be used to graph many types of data in arbitrarily many ways. Graphing CPU Usage (Sampled) data with the default flame graph is just scratching the surface.

Challenge accepted

Let’s say that instead of graphing CPU Usage (Sampled) we wanted to graph context switches. And let’s say that instead of viewing call stacks we wanted to view the process name and thread ID of the thread being switched in, and the process name and thread ID of the thread that readied that thread. And, finally, let’s say that we wanted the width of our towers to represent the number of context switches (other candidates might include the amount of CPU time consumed, the amount of time the thread was idle, or any other numeric data column).

The startup profile which UIforETW prefers to use (go to Settings, Copy startup profiles to get a fresh copy of it, recommended) has a CPU Usage (Precise) graph open so we can start with that. Click the Select Chart Type button and select Flame, as shown here:

image

The default grouping columns in the Randomascii Wait Analysis preset are these:

  • New Process
  • New Thread
  • New Thread Stack
  • Readying Process
  • Readying Thread Id
  • Ready Thread Stack

and the graphing column (far right) defaults to % CPU Usage. When WPA makes a flame graph it takes all of the data in all of the grouping columns (left of the gold bar) and stacks them together. That means that each context switch makes a flame graph column with two call stacks and four other rows of data. This is a daunting amount of data and isn’t what we want so the first thing to do is to hide the New Thread Stack and Ready Thread Stack columns. That can be done with the View Editor (Ctrl+E) or by showing the table and then right-clicking on the column headers.

At this point we have a flame graph, but the width of the rows is proportional to CPU Usage, not context switch counts. To fix that we need to drag the Count column to the right of the blue bar, and move the % CPU Usage column to the left of the blue bar. At this point WPA will say:

Flame graph requires one graphing column with Sum aggregation

Okay, sounds reasonable. To fix this we need to open up the View Editor, scroll to the bottom of the right section, and change the Aggregation type for Count to Sum. With that configured we have the flame graph that we wanted!

It turns out that this flame graph is quite helpful in understanding some annoying Windows 10 behavior that I’ve been hitting. RuntimeBroker.exe likes to spring to life spontaneously on my laptop and it wasn’t clear what was triggering it. I had a trace of this activity handy and our context switch flame graph makes it clear that there are two processes that “Ready” RuntimeBroker.exe – itself and Music.UI.exe:

image

Processes often ready their own threads (one thread readying another) so the really important news is that Music.UI.exe is talking to RuntimeBroker.exe and, because RuntimeBroker always does work on behalf of other processes, we can conclude that Music.UI.exe is ultimately the reason why my CPU gets very busy at random times. Unfortunately I can’t find a way to turn off this behavior (the available settings in Music.UI.exe are minimalist to the extreme) so this knowledge is quite useless so far.

It’s an easy change to have the flame graph widths proportional to % CPU Usage or to display New Thread Stack as one of the grouping columns – it all depends on what sort of problem you are trying to investigate. Unfortunately some numeric data columns (Time Since Last, for instance) cannot be graphed, for unknown reasons.

Annotation column

The Annotation column can be used as another level of the flame graph hierarchy. In the example below I selected all context switches that were readied by RuntimeBroker.exe (by temporarily rearranging the columns), right-clicked, and used Annotate Selection to tag them as “Self-readied” (and the others as “Readied by other process”). I then moved Annotation to the second column so that the context switches were grouped by this new attribute, making it easier to see all the IPC related context switches.

image

The Annotation column is a powerful way of doing arbitrarily filtering and grouping while analyzing a trace, typically to separate out the areas that are understood from those that still need investigation.

Operators are standing by

A new startup profile that contains a Randomascii flame by Process, thread, count has been committed to UIforETW and the next release will contain this startup profile, to make flame graphs even easier to use. If you find other flame graph arrangements useful, let me know and I might add them.

The official WPA flame graph documentation is here.

For more ETW/WPA documentation, tutorials, and case studies see ETW Central.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
This entry was posted in Performance, uiforetw, xperf and tagged , , . Bookmark the permalink.

21 Responses to ETW Flame Graphs Made Easy

  1. Pingback: The Lost Xperf Documentation–CPU Usage (Precise) | Random ASCII

  2. Pingback: ETW Central | Random ASCII

  3. Marc C says:

    This is just Plain Freakin’ Awesome™ !

  4. Benjamin Michael Craig says:

    Funny, I was just plugging flame graphs on reddit.

    • brucedawson says:

      Cool. Good to see ETW getting a mention or two. Could be good to have some filtering by platform since perf is Linux specific and ETW is Windows specific.

      I’d love to see ETW profiling get CPU performance counters like perf has. Maybe some day.

      • Kyle Sabo says:

        tracelog.exe -profilesources Help
        tracelog.exe -eflags Help

        tracelog.exe -start counters -f counters.etl -eflag PROFILE+CSWITCH -PMC TotalIssues,BranchInstructions:CSWITCH+PROFILE

        xperf -i counters.etl -o counters.txt

        If you have Hyper-V enabled you won’t be able to collect any counters, otherwise you should have counter information attached to the selected events.

        In the XPerf text dump you should see “Pmc” events with the counter values after every instance of the event(s) that you collected them on. No WPA support at this time.

        • brucedawson says:

          Hey, thanks Kyle. That’s very helpful. I’ll give that a try.

        • brucedawson says:

          Thanks Kyle. I finally tried this but got mixed results. I added “tracelog.exe -stop counters” to the recipe in order to actually record the trace. But in the resulting counters.txt file I find that every New Process Name for the CSwitch events is “”Unknown” ( -1)”, which isn’t much use. I thought that might mean that PROC_THREAD+LOADER was missing so I added -img (it seemed equivalent) but that didn’t help. I’m way more familiar with using xperf.exe to control tracing so an xperf.exe method would be great.

          Also, it looks like the counters are total event counts from the beginning of the trace, and therefore need to have the previous counter subtracted in order to attribute counts to a scheduled time-slice. Since the counters are per-CPU this suggests that I need to keep track of previous counters per-CPU, but my initial attempts at using the CPU column from the CSwitch events to do this seemed to show counters moving backwards. Any samples code that interprets the data would be appreciated.

          • akraus1 says:

            Try
            tracelog.exe -start counters -f counters.etl -eflag PROFILE+CSWITCH+PROC_THREAD+LOADER -PMC TotalIssues,BranchInstructions:CSWITCH+PROFILE -stackwalk PROFILE+CSWITCH+DISPATCHER

            then use xperf to stop Profiling
            xperf -stop counters
            xperf -merge counters.etl counters_WithImageId.etl

            If I am not mistaken that should configure the CPU sample stacks to the configured CPU specific sample counters. I have not extensively tried them out but at a first glance it looks reasonable.
            tracelog is cool to start ETW tracing, or to reconfigure it and xperf is the best one to stop an ETW session which does all the necessary things for us user mode guys.

          • brucedawson says:

            Ah – that definitely helps. I ditched the -stackwalk modifier for now, to keep things simple, but now I get process names. Excellent!

            The one remaining glitch is that Kyle’s original instructions said that the Pmc events would be after every instance of the event(s) that you collected them on, but I’m actually seeing them before. With that corrected I am seeing reasonable results – no more numbers walking backwards. The data suggests a very steady rate of 18.7% to 20.9% of instructions being branches. After changing to measuring branches and branch mispredicts I’m seeing that 1.2 to 13.7% of branches are mispredicts. Much experimentation remains, and there should be a blog post in the future.

  5. akraus1 says:

    Flame Graphs are really nice. I like them most in conjunction with stacktags to get a good overview where most cycles are spent. I have tried to flame CSwitch events wait times as well but that seems unfortunately not to work. That would allow me to see which stack tags wait most or consume most CPU which would be interesting on its own.

    • brucedawson says:

      It should work, if configured correctly, unless it doesn’t make logical sense. Maybe share your profile and a trace?

      I have found the flame graphs to work as intended in every job I’ve given them so far.

      • akraus1 says:

        E.g. Wait Time us Sum or TimeSinceLast us Sum columns cannot be graphed at all. Sometimes it would be useful to see in a Wait Stack tag how much e.g. waiting per thread all of your WCF queries did do. WPA does currently not allow that. I think it is not the fault of Flame Graphs but WPA not allowing to graph them. Sometimes it would also be interesting to graph Wait and CPU time at once per thread in different colors as Flame Graphs. Since Stacktags allow me to select specific regions of my application I can tag only the actually executing e.g. query and not the thread waiting for work.
        Then such combined wait/CPU graphs would become very useful.

        • brucedawson says:

          Huh. You’re right. No graphing of Time Since Last Sum. How disappointing. That’s a surprising limitation. I’ll have to ask why. Thanks for pointing it out. I fixed the post.

          • brucedawson says:

            So, good news, I was given advice on how to flame graph Time Since Last. Set up the flame graph (maybe starting with my preset and then putting TSL (sum) in place of Count (sum). Then open the View Editor, click Advanced, and go to Graph Configuration. Delete the Start Time and Duration rows. Then add in Start Time/Last Switch-Out Time and Duration/Time Since Last, click OK twice, and Boom! Flame graph of Time Since Last!

            I’ll add that as a preset for the startup profile.

  6. akraus1 says:

    Thanks Bruce! That did the trick. The advanced view editor is one of the least documented things in WPA.

  7. Onduril says:

    Thank for the tips, flame graphs are very useful. I think I could finish to use wpa more often…
    I mainly find a few practical usage issues:
    Didn’t found away to zoom in the graph but not in the analysis view timeline.
    The font could be smaller to get the names printed sooner.
    The names are cut on the right, i would personally prefer a left cut.
    Also getting different color per module would be great.

  8. Cristian Vasile says:

    I have the idea is to complement flame graphs with a visualization based on
    nodes (aka functions names) and relations between them (aka edges) in
    order to see the flow of stack calls. The thickness of links encodes the
    number of calls of that functions in source file.

    It’s the first iteration of this old idea i had it from last year, I
    am sure that needs more work.

    I placed three files (.pdf, .png and .svg) on my google drive here –
    https://drive.google.com/open?id=0B8gOkKS32Rz2dFVVNXZTVnczcmc
    Let me know what do you think.

    Regards,
    Cristian.

    • brucedawson says:

      I couldn’t tell for sure how useful the data was because I’m not familiar with the code being run, and I couldn’t zoom in far enough to clearly see the function names.

      In general I find that different visualizations expose different types of issues.

      You may find it helpful to have a way to (optionally) trim the rarely paths, to avoid clutter. The WPA flame graphs have such an option, and it sometimes helps quite a bit.

  9. Pingback: CPU Performance Counters on Windows | Random ASCII

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s