Bulk ETW Trace Analysis in C#

WPA CPU usage graphETW traces record a wealth of information about how a Windows system is behaving. When analyzing a new and unknown problem there is no replacement for loading the trace into WPA and following the clues to a solution. The thrill of the hunt and the creative challenge of finding a visualization that will reveal the root cause never gets old (too nerdy? Sorry – I do enjoy this).

But sometimes you want to extract some commonly found piece of information from multiple traces, and doing this manually is tedious and error prone.

Initially the only solution was to run xperf actions (xperf -help processing) and parse the output. This works well enough, but only if there is an action which maps well to the information that you want.

The next option was wpaexporter, which I wrote about here. With wpaexporter you can set up a summary table the way that you want it and export the data to a .csv file. This is infinitely flexible, but setting up the WPA profiles is a bit fussy, setting up .json files (needed if you want to efficiently export multiple profiles) is even worse, and .csv processing is something I’d rather avoid.

Finally, we have something better.

Trace Processor

In May of 2019 Microsoft announced Trace Processor Preview 0.1.0 (latest is currently TraceProcessor 0.3.0). This is a NuGet package (Microsoft.Windows.EventTracing.Processing.All) that makes it easy to write robust and efficient ETW trace processing code in C#. I’ve used this for experimental projects and real work and found it to be an easy way to do programmatic analysis of ETW traces.

Chrome processes identified by a Python scriptUIforETW ships with some trace analysis scripts that use the old-school xperf actions so one of my first experiments was to rewrite one of these using the Trace Processor Preview. I chose to rewrite IdentifyChromeProcesses.py in C#.

I had to relearn C# (let me know if I’m doing anything wrong) and learn the trace processing API (quite easy due to Intellisense, despite minimal documentation) but overall the project went smoothly.

The basic pattern for analyzing a trace (after using Project-> Manage NuGet Packages to install the latest EventTracing.Processing.All) is that you start by calling TraceProcessor.Create:

using (ITraceProcessor trace = TraceProcessor.Create(traceName))

Then you call trace.UseXXX() once for each type of data you want to use, which returns an IPendingResult<XXX> type. There are dozens of types of data available, all listed on the announcement page and available through Intellisense. Then you call trace.Process() to scan the trace and read in all of the requested data, and then you grab the results from all  the IPendingResult objects – like this:

var pendingSchedulingData = trace.UseCpuSchedulingData();
var pendingSamplingData = trace.UseCpuSamplingData();

trace.Process();

// Convert from pending to actual data.
var schedulingData = pendingSchedulingData.Result;
var samplingData = pendingSamplingData.Result;

Then you iterate through the data in whatever way you want, something like this:

var execTimes = new Dictionary<IProcess, CPUUsageDetails>();
// Scan through the context-switch data to build up how much time
// was spent by each process.
foreach (var slice in schedulingData.CpuTimeSlices)
{
// Yep, this happens.
if (slice.Process == null)
continue;
execTimes.TryGetValue(slice.Process, out CPUUsageDetails last);
last.ns += slice.Duration.Nanoseconds;
last.contextSwitches += 1;
execTimes[slice.Process] = last;
}

Stacks

Many ETW events can contain call stacks. This includes the context switch events in the CPU Usage (Precise) data, the samples in the CPU Usage (Sampled) data, the allocations in heap snapshots, and potentially many more events. These stacks can be crucial for understanding the data, and Trace Processor supports handling these call stacks. Following the pattern of other data types supported by Trace Processor you invoke trace.UseSymbols() to indicate that you want to load the symbols. After calling trace.Process() and retrieving the pending symbols results you then need to load the symbols with LoadSymbolsAsync followed immediately by waiting for the results (so it’s not really async), like this:

var pendingSnapshotData = trace.UseHeapSnapshots();
var pendingSymbols = trace.UseSymbols();

trace.Process();

var snapshotData = pendingSnapshotData.Result;
var symbols = pendingSymbols.Result;
symbols.LoadSymbolsAsync(new SymCachePath(@”c:\symcache”),
new SymbolPath(@”SRV*C:\symbols*https://msdl.microsoft.com/download/symbols”))
.GetAwaiter().GetResult();

The SymCachePath argument is the path where the cache of processed symbols is stored (default is c:\symcache) – this cache allows much faster symbol loading, and if you are analyzing traces that you have previously loaded into WPA then this is all that you need. The SymbolPath constructor takes a list of symbol paths, equivalent to the semi-colon separated entries in _NT_SYMBOL_PATH.

After you have done this then you can traverse the stacks of events and call GetAnalyzerString() or similar on the stack or on individual entries. I’ve published sample code for comparing heap snapshot traces. A pair of heap snapshots have also been made available in order to test the code.

Setup steps

I only rarely create new Trace Processor and each time I forgot the steps so I’m going to list them here, purely for my own convenience:

  1. Create a new C# Console App project
  2. Go to Project-> Manage NuGet Packages…, then to the Browse tab, then search for Microsoft.Windows.EventTracing.Processing.All and click Install
  3. Start writing code. “using Microsoft.Windows.EventTracing;” will be required and may be inserted by Intellisense. The code snippets in this post can be used, or more complete examples can be found here.

Performance

When doing the simplest type of report – just listing a Chrome process tree – the Python script was consistently about 20% faster. However for the more complex reports – listing the CPU usage for each process – the C# tool was typically three to eight times faster! The simple reports are very quick to generate either way, so the minor Trace Processor performance regression doesn’t much matter, but the significant Trace Processor speedup on the more expensive reports is quite welcome.

In the CPU-usage case the Python script is forced to process the trace twice – once for command-line information and once for CPU usage data – whereas the C# code only has to process it once. However that isn’t enough to explain all of the speedup. It may be that the time to generate and then parse large quantities of text is the difference, but it ultimately doesn’t really matter.

Accuracy

I tested the new tool on all 600 ETW traces that I have squirreled away to make sure I was handling all of the edge cases. The process-tree results that I get are identical (except in some truly weird traces that somehow ended up with impossible process hierarchies) but in the CPU Usage case there were small discrepancies. After a bit of discussion with the team it was determined that this was due to threads that were running at the very beginning and end of the trace. This difference was probably avoidable but it was so tiny I didn’t bother.

Flexibility

The ultimate benefit to using the trace processing API is that you can explore the data anyway that you want, without having to parse multiple .csv files. You can even process traces with lost events or time inversions (but be wary of weird results) which wpaexporter won’t let you do – see the sample code link below for examples.

Things to watch for

Be aware that the first time you use Trace Processor there will be a message printed out. In my case this message was ending up in my output, which I don’t like, but the Trace Processor team has added a property to TraceProcessorSettings called SuppressFirstTimeSetupMessage that can be used to stop it from printing.

Some of my traces are huge and analyzing these traces – whether with WPA or with the trace processing system – can consume a lot of memory, especially during loading. Making sure you have lots of memory is crucial – my laptop has 32 GB – but you also need to make sure that your C# code is running as a 64-bit process. Unfortunately the default with Visual Studio’s .Net Framework is to run C# code as a 32-bit process, with a maximum address space of 4 GiB. Therefore when you create a new trace processing project you should go to the project properties, Build tab, and uncheck Prefer 32-bit.

Conclusion

The CPU usage summaries of IdentifyChromeProcesses.cs are enough faster that I’m tempted to switch to the C# tool just to get that benefit. Switching to C# would also avoid having a dependency on Python being installed, which is far from guaranteed on Windows. And, the trace processing API feels like a more powerful way to do bulk trace processing.

Now when I want to do some repetitive analysis of ETW traces I reach for the Trace Processor package and I can usually create a custom analysis tool extremely quickly.

Some day I want to add a feature to UIforETW that will run a tool or script on multiple traces, to look for newly discovered patterns in old traces, and the performance and flexibility of the trace processing API make that more tempting.

References:

The best way to learn the Trace Processor API is to look at a few samples for the basic patterns, look at the available types of data, and then start writing code.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048
This entry was posted in uiforetw, xperf and tagged , . Bookmark the permalink.

11 Responses to Bulk ETW Trace Analysis in C#

  1. Christian Sauer says:

    Hi,
    a small nitpick: .Result looks like it should really be async code? If you can leverage async/await there could possible be an even larger performance gain, e.g. waiting for all Results at once, vs. waiting for them one after another.

    • brucedawson says:

      My understanding is that all of the result generation is done in the Process() call and getting the .Result values is just copying a value. So, I don’t think there is any potential for increased parallelism there, but let me know if I’m wrong.

      The scanning of the results could be done in parallel – I haven’t thought about that.

      • maiak says:

        The current version of TraceProcessor is not thread safe, and there can be some unexpected behaviors when trying to process results in parallel. We are working on a future version that will be thread safe, and which can be processed in parallel. (I work on the team that develops the TraceProcessor library at Microsoft.)

  2. akraus1 says:

    With chrome being a generic container process you might also want to employ my old trick to rewrite the ETL file and rename the processes based on their command line. That makes it possible to view in WPA the correctly named chrome processes which makes analysis of generic container processes much much easier. I am doing this since years. The TraceEvent library contains a sample to rewrite ETL files.

  3. Zhentar says:

    Although it’s missing from the help header, wpaexporter actually does support -tti and -tle parameters for time inversions & lost events.

    I took a look at your C#, and made it a bit more idiomatic (here: https://gist.github.com/Zhentar/c9a7869dc844cfd4c6acab8179807d05 ), mostly by using newer features like ValueTuple instead of Tuple, string interpolation instead of string.Format, and using linq instead of list manipulation+for loops.

  4. brucedawson says:

    Good to know about the -tti and -tle parameters for wpaexporter. I guess that’s a good reason to use a structured argument parser since then it’s typically impossible to omit arguments from the help string.

    Thanks for the C# pointers – I’ll take a look.

  5. Ahmed Mahdy says:

    Glad the Trace Processor tool has gotten better. Last I used it (when it was first announced) it could not really load symbols at all. I wonder if it can give improvements for analyzing big heap captures (which takes ages in WPA)

  6. coding says:

    Does anyone know how to get the current thread(New ThreadId) of the awakened thread?
    Now the problem is focused on getting this piece of ETW event, keyWords:0x800: dispatcher: Dispatcher operation. in Microsoft open source code PerfView,This event source is called DispatcherReadyThreadTraceData,the code show as below:

    public sealed class DispatcherReadyThreadTraceData : TraceEvent
    {
    public int AwakenedThreadID { get; }
    public int AwakenedProcessID { get; }
    public AdjustReasonEnum AdjustReason { get; }
    public int AdjustIncrement { get; }
    public ReadyThreadFlags Flags { get; }
    public override int ProcessID { get; }
    public override string[] PayloadNames { get; }
    protected internal override Delegate Target { get; set; }
    public event Action Action;
    public override object PayloadValue(int index);
    public override StringBuilder ToXml(StringBuilder sb);
    protected internal override void Dispatch();
    protected internal override void SetState(object newState);
    public enum AdjustReasonEnum
    {
    None = 0,
    Unwait = 1,
    Boost = 2
    }
    [Flags]
    public enum ReadyThreadFlags : byte
    {
    ReadiedFromDPC = 1,
    KernelSwappedOut = 2,
    ProcessSwappedOut = 4
    }
    }

    I can’t get the current thread id (New ThreadId), and only can get the AwakenedThreadID, no good solution.

  7. coding says:

    when I getting this piece of ETW event, keyWords:0x800: dispatcher: Dispatcher operation. in Microsoft open source code PerfView,This event source is called DispatcherReadyThreadTraceData,the code show as below:

    public sealed class DispatcherReadyThreadTraceData : TraceEvent
    {
    public int AwakenedThreadID { get; }
    public int AwakenedProcessID { get; }
    public AdjustReasonEnum AdjustReason { get; }
    public int AdjustIncrement { get; }
    public ReadyThreadFlags Flags { get; }
    public override int ProcessID { get; }
    public override string[] PayloadNames { get; }
    protected internal override Delegate Target { get; set; }
    public event Action Action;
    public override object PayloadValue(int index);
    public override StringBuilder ToXml(StringBuilder sb);
    protected internal override void Dispatch();
    protected internal override void SetState(object newState);
    public enum AdjustReasonEnum
    {
    None = 0,
    Unwait = 1,
    Boost = 2
    }
    [Flags]
    public enum ReadyThreadFlags : byte
    {
    ReadiedFromDPC = 1,
    KernelSwappedOut = 2,
    ProcessSwappedOut = 4
    }
    }

    I can’t get the current thread id (New ThreadId), and only can get the AwakenedThreadID, no good solution.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.