ETW Heap Tracing–Every Allocation Recorded

Posted on April 27, 2015 by brucedawson

Event Tracing for Windows (ETW, aka xperf) is usually used to monitor CPU usage, through its sampling profiler and its ability to record detailed information about context switches. Well, ETW is also used to monitor file I/O, and disk I/O, and sometimes registry accesses, and of course GPU activity, window-in-focus, UI Delays, process lifetimes, and a few other things. Okay, so ETW gets used for a lot of different things, normally configured in the same way and recorded across the entire system.

Heap profiling is different. Even to those who are used to the odd incantations needed to record an ETW trace it can be daunting trying to figure out how to do heap profiling. I’ve postponed writing about it because it was too much work to explain how to record heap traces and analyze them. But now that UIforETW has been released it is almost trivial to record ETW heap traces, and even analyzing them is made easier. Plus, recent (circa 2016/2017) ETW changes have made recording heap traces even easier.

And, as a bonus, system memory-lists and VirtualAlloc calls for all processes are recorded in all UIforETW traces and lightly documented at the bottom of this post, with optional VirtualAlloc call stacks for lightweight memory profiling.

In short, ETW heap tracing makes it straightforward to record and analyze every allocation and free by any application that uses the Windows heap, which is the default memory allocator for VC++ applications. This lets you investigate memory leaks, allocation churn, and other heap issues.

This post was updated in June, 2015 because there is now support for heap profiling of multiple process names simultaneously.

This post was updated in October, 2018 to cover new heap tracing (PID and launch-and-trace) and analysis options.

Note that if you just want to record the size and call stack of all outstanding allocations then heap snapshots are a much more efficient option, allowing weeks of allocations to be recorded. They are documented here.

(See the UIforETW announcement post for details on UIforETW)

Heap tracing is different

ETW heap tracing is not enabled globally because it is expensive. Recording a call stack on every heap allocation for just one process can be a huge volume of data, and doing this for all processes would be prohibitive. So, ETW heap tracing is enabled for specific processes. There are three ways to do this:

1) Tracing processes by name

The first way to enable heap tracing (the one supported by UIforETW for the longest) works by specifying the name of the process to be traced, and then recording heap events. It’s a two-step dance.

Step 1: A TracingFlags registry entry is created and set to ‘1’ in the Image File Execution Options for each process name that will be traced to tell the Windows heap to configure itself for tracing when a process with that name is launched. As is always the case with Image File Execution Options the options don’t affect already running processes – only processes launched after the registry key is set are affected.

Step 2: An extra ETW session is created using the “-heap -Pids 0” incantation. This session will record information from processes that had a TracingFlags registry entry of ‘1’ when they started.

To do this with UIforETW you need to go to the settings dialog and specify the name of your processes. It defaults to Chrome.exe because that’s what I originally wrote UIforETW for, but you can change it to anything else. The .exe suffix is mandatory, and you can have multiple process names listed if you want, just separate them with semi-colons:

Then, you need to set the tracing mode to Heap tracing to file. Once you do this a TracingFlags entry is created for the specified process and set to 1. It will be zeroed when you exit UIforETW or when you change tracing modes. This means that as soon as Heap tracing to file is selected you can launch the process that you want to heap trace.

In order to actually start recording the heap tracing information you need to click Start Tracing.

If you understand this two-step dance then you will now know how to use it. You need to set the Heap-profiled process name and select Heap tracing to file before launching the process that you want to trace. And you need to start tracing before any data will actually be recorded. Whether you start tracing before or after you start your process(es) determines whether or not you will record allocations made at process startup.

And yes, this does work for multiple processes. When I profile Chrome I get heap data from all Chrome processes – as long as they are started after the registry key is set. It even works for multiple different process names.
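As a rough sketch of the two-step dance, here are the two commands that the steps above boil down to, built as argument lists in Python rather than executed. The Image File Execution Options location and the "-heap -Pids 0" flags come from the description above. The session name, the -stackwalk event list, and the helper names are assumptions for illustration, and UIforETW's actual command lines may well differ.

```python
# Sketch only: builds the command lines, does not run them.
# Assumed (not from the post): session name, -stackwalk events, helper names.

IFEO_ROOT = (r"HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion"
             r"\Image File Execution Options")


def tracing_flags_command(process_name, enabled):
    """Step 1: 'reg add' arguments that set TracingFlags for one process name.

    Only processes launched after the value is set to 1 are affected.
    """
    return [
        "reg", "add", f"{IFEO_ROOT}\\{process_name}",
        "/v", "TracingFlags",
        "/t", "REG_DWORD",
        "/d", "1" if enabled else "0",
        "/f",  # overwrite without prompting
    ]


def heap_session_command(etl_path, session_name="HeapSession"):
    """Step 2: xperf arguments for the extra heap session.

    '-Pids 0' means "any process that had TracingFlags set when it started".
    """
    return [
        "xperf", "-start", session_name,
        "-heap", "-Pids", "0",
        "-stackwalk", "HeapAlloc+HeapRealloc",  # assumed event list
        "-f", etl_path,
    ]


if __name__ == "__main__":
    for name in ["chrome.exe", "notepad.exe"]:
        print(" ".join(tracing_flags_command(name, enabled=True)))
    print(" ".join(heap_session_command("heap.etl")))
```

Both commands would need an elevated prompt if you ran them by hand, and the TracingFlags value should be set back to 0 afterwards, which is what UIforETW does when you exit or change tracing modes.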

2) Tracing processes by PID

The second way to enable heap tracing is to specify the PID of a process (or processes, up to two) that is already running. In the UIforETW settings dialog type in one or two PIDs, separated by semi-colons, then set the heap tracing type to Heap tracing to file and when you start tracing you will get heap data from the processes specified. It’s quite simple, with the disadvantages being that you can’t use it to profile startup, and you need to manually adjust whenever the PIDs you are interested in change. Here’s what it looks like in the settings dialog:

3) Launching and tracing a process

The third option is to get ETW to start up the process that you want to trace, to ensure that you get heap tracing from startup of that process and no others. To use this method you need to put a fully qualified path to the binary to launch in the heap-profiled processes field. Then when you start heap tracing this executable will be launched and traced.
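The three modes can be told apart just from what is typed into the settings field. Here is a minimal sketch of that dispatch, assuming (as the descriptions above suggest) that process names, PIDs, and a launch path all go in the same semicolon-separated field. The function name and the exact validation rules are hypothetical, not UIforETW's actual parsing code.

```python
import ntpath  # Windows path rules, usable on any OS


def classify_heap_targets(setting):
    """Return (mode, entries) where mode is 'names', 'pids', or 'launch'."""
    entries = [e.strip() for e in setting.split(";") if e.strip()]
    if not entries:
        raise ValueError("no heap-profiled process specified")

    # Mode 2: one or two PIDs of already-running processes.
    if all(e.isdigit() for e in entries):
        if len(entries) > 2:
            raise ValueError("at most two PIDs can be traced")
        return ("pids", [int(e) for e in entries])

    # Mode 3: a single fully qualified path to launch and trace.
    if len(entries) == 1 and ntpath.isabs(entries[0]):
        return ("launch", entries)

    # Mode 1: bare process names; the .exe suffix is mandatory.
    if all(e.lower().endswith(".exe") and not ntpath.dirname(e)
           for e in entries):
        return ("names", entries)

    raise ValueError(f"cannot interpret heap-profiled setting: {setting!r}")


if __name__ == "__main__":
    print(classify_heap_targets("chrome.exe;notepad.exe"))
    print(classify_heap_targets("1432;1732"))
    print(classify_heap_targets(r"C:\Windows\System32\notepad.exe"))
```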

Analyzing heap traces

After you have recorded your scenario (keep it short to avoid generating traces that are large and unwieldy) you save the trace buffers as usual. The trace name will be something like “date_time_bruced_notepad_heap.etl” which helps to remind you what was recorded. As usual you should double-click the trace to load it into Windows Performance Analyzer (WPA).

The UIforETW startup profile does not initially show any graphs for viewing heap or memory data so you will need to add them (and optionally save a new startup profile). The memory graphs are (surprise!) found under the Memory section in Graph Explorer and the first one you should add is probably Heap Allocations. If you drag this over you will get the WPA default settings for viewing the heap which, as usual, I think are wrong. The default view shows columns that are usually not needed such as Address and AllocTime, and omits vital columns such as Stack and Type. To get the usual-best-defaults (TM) select the “Randomascii Heap Analysis” preset from the View Preset drop-down, giving you something like this:

Now you can start drilling down into your process or processes. If you want to group by heap you can add the Handle column but it is usually of minimal interest.

The next column is the most crucial and least obvious. In the world of ETW memory explorations there are four Types of allocations, defined by where the allocation time and free time occur in relation to the displayed timespan. These types are:

AIFO – Allocated Inside Freed Outside: These are blocks of memory that were allocated in the displayed timespan but were not freed in the displayed timespan. These are the most important type of allocations if you are looking for memory consumed in a time region. They may have been freed after the displayed timespan, or never. When you drill down into the call stacks for this type of memory the Count represents unfreed allocations, and Size represents unfreed memory.
AOFI – Allocated Outside Freed Inside: These are blocks of memory that were allocated before the displayed timespan and were then freed in the displayed timespan. These are the mirror images of AIFO, and when calculating how many non-freed allocations have occurred during a timespan you usually want to subtract AOFI from AIFO.
AOFO – Allocated Outside Freed Outside: These are blocks of memory whose lifespan is infinite, at least compared to the displayed timespan. These are the immortals, existing outside of time itself – at least until you zoom out and make their birth or death visible.
AIFI – Allocated Inside Freed Inside: These blocks of memory are the fruit flies of heap analysis – they live fast, burn brightly, and are freed before the unerring hands of time reach the right edge of the WPA screen. This type of allocation is most interesting if you are investigating high allocation churn.
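If it helps to see the four Types as logic rather than prose, here is a small sketch that classifies one allocation against a displayed timespan. The function name, the use of None for "never freed", and the treatment of events exactly on the view boundary as "inside" are my assumptions; WPA's own boundary handling may differ.

```python
def allocation_type(alloc_time, free_time, view_start, view_end):
    """Classify one allocation relative to the displayed timespan.

    free_time is None for a block that was never freed in the trace.
    Returns None for blocks that were not alive at any point in the view.
    """
    # Not alive in the view: allocated after it ends or freed before it starts.
    if alloc_time > view_end:
        return None
    if free_time is not None and free_time < view_start:
        return None

    allocated_inside = alloc_time >= view_start
    freed_inside = free_time is not None and free_time <= view_end

    return ("AI" if allocated_inside else "AO") + \
           ("FI" if freed_inside else "FO")


if __name__ == "__main__":
    view = (10.0, 20.0)
    print(allocation_type(12.0, None, *view))  # allocated in view, never freed
    print(allocation_type(5.0, 15.0, *view))   # allocated before, freed in view
    print(allocation_type(5.0, 30.0, *view))   # spans the whole view
    print(allocation_type(12.0, 13.0, *view))  # short-lived inside the view
```

Note that the same block changes Type as you zoom: widening the view far enough turns an AOFO block into AIFO, AOFI, or AIFI.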

After the Type column is the Stack column, and it behaves as usual. With the default Randomascii table layout you will be drilling into the stacks for a particular type of allocation, sorted by whatever column you choose.

Then we have the orange bar, and with it the end of the columns used for grouping.

Next we have Count which, as usual, counts how many allocations are summarized by each line. In the Heap Allocations graph the Count for a process includes every allocation that was alive for any part of the visible timespan. That’s such a broad category that I find the count column impossible to meaningfully interpret unless I am grouping by Type.

Impacting Size is “the change in heap usage in the current viewport”. What that means is that it summarizes, for each row, the total change in how much memory was requested from the heap over the displayed timespan. For AOFO and AIFI this is always zero because AOFO means no allocation or free in this time range, and AIFI means matched allocation and free in this time range. For AIFO this number is how many bytes were requested, and it will match Size. For AOFI this number is always the negative of how many bytes were requested; that is, it is -Size. Impacting Size is quite useful once you understand it. It is meaningful even if the Type column is not used for grouping.

The Size column is simply the total of how many bytes were requested by the allocations summarized on that line, regardless of when the allocations occurred. Both Size and Impacting Size are measured in terms of how many bytes were requested so a one-byte allocation will show up as ‘1’, not as its much greater total heap overhead.
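The relationship between Type, Size, and Impacting Size can be written down in a few lines. This is only a sketch of the rules described above, with made-up sample rows and a hypothetical function name, not how WPA computes the columns internally.

```python
def impacting_size(alloc_type, size):
    """Change in requested heap bytes within the view for one allocation."""
    if alloc_type == "AIFO":
        return size       # allocated in the view and still outstanding
    if alloc_type == "AOFI":
        return -size      # older memory that was released in the view
    if alloc_type in ("AOFO", "AIFI"):
        return 0          # no net change within the view
    raise ValueError(f"unknown allocation type: {alloc_type!r}")


# Made-up rows of (Type, requested bytes) for illustration.
rows = [("AIFO", 4096), ("AOFI", 1024), ("AIFI", 512), ("AOFO", 65536)]

# Size: everything requested by the allocations visible in the view.
total_size = sum(size for _, size in rows)

# Impacting Size: net growth of requested memory across the view,
# which is AIFO minus AOFI.
net_change = sum(impacting_size(t, size) for t, size in rows)

if __name__ == "__main__":
    print("Size:", total_size)
    print("Impacting Size:", net_change)
```

Because the AOFO and AIFI rows contribute zero, summing Impacting Size over an ungrouped table gives the same net figure as subtracting the AOFI total from the AIFO total.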

Other heap graphs

When doing a heap trace there are a couple of other graphs that are available:

The Low Fragmentation Heap graphs expose some detailed heap internals which you are welcome to spelunk through.

The Heap Extents graph presumably shows how much committed memory is used by the processes being monitored, which should allow an astute user to infer heap efficiency and other useful information. But I lack enough experience with this graph to say any more.

Other memory graphs

There is some memory information recorded in all UIforETW traces, because the VIRT_ALLOC and MEM_INFO providers are recorded:

The Virtual Alloc graph shows similar information to the heap graph, including having the same Type column as the heap graph. As usual the defaults are not great, so use the Randomascii configuration to clarify. Note that UIforETW always records VirtualAlloc information for all processes, ’cause you never can tell when it might be useful. When a heap trace is being recorded it also records a call stack on each VirtualAlloc call, which makes exploring the data more fruitful. If you are viewing VirtualAlloc on a non-heap trace then you might want to hide the [empty] Stack column.

It’s worth pointing out that you can only use the VirtualAlloc data to draw conclusions about how much memory a process has allocated if you trace from when the process starts – otherwise you will have missed many VirtualAlloc calls and you will hugely underestimate the memory consumption.

The Memory Utilization graph shows, sampled about every 0.5 s, how much memory is in various memory lists such as the Active List, Zero and Free Lists, and Standby Lists. There are books and lectures that explain these lists in more detail but the main thing to remember is that the Standby Lists plus the Zero and Free Lists are the available memory and if this ever gets low (less than ~800 MB) then Windows will start trimming working sets and this will cause poor performance at some indeterminate time in the future.

Here we can see the Zero and Free Lists being temporarily drained to satisfy some short-term memory needs before bouncing back to excessive levels when the temporary memory was released.
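The available-memory rule of thumb above is simple enough to sketch. The sample tuples and names here are invented for illustration, and the 800 MB figure is just the rough threshold mentioned in the text, not a documented Windows constant.

```python
LOW_AVAILABLE_MB = 800  # rough threshold from the text, not an exact limit


def available_mb(standby_mb, zero_mb, free_mb):
    """Available memory is the Standby Lists plus the Zero and Free Lists."""
    return standby_mb + zero_mb + free_mb


def low_memory_times(samples):
    """Return sample times at which available memory dropped below the threshold.

    samples: iterable of (time_s, standby_mb, zero_mb, free_mb) tuples,
    for example one per ~0.5 s Memory Utilization sample.
    """
    return [t for t, standby, zero, free in samples
            if available_mb(standby, zero, free) < LOW_AVAILABLE_MB]


if __name__ == "__main__":
    samples = [
        (0.0, 3000, 500, 200),  # plenty available
        (0.5, 600, 100, 50),    # 750 MB available: trimming is likely
        (1.0, 2500, 400, 300),  # recovered
    ]
    print(low_memory_times(samples))
```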

Documentation == good

While writing this blog post I realized that I couldn’t remember exactly how I’d designed UIforETW’s heap tracing. Next time that I forget I’ll just read this post instead of staring at the source code to figure it out. And, while relearning how it worked I realized it had a bug, fixed in change 7c8e56d, and tweaked a few other details as well.

So make sure you get the very latest version of UIforETW!

References

Other takes on heap profiling can be found here and here.

UIforETW is available at https://github.com/google/UIforETW, with pre-built UIforETW binaries available in the releases section.

UIforETW announcement is available at: https://randomascii.wordpress.com/2015/04/14/uiforetw-windows-performance-made-easier/

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048

This entry was posted in memory, xperf and tagged heap, memory, virtualalloc standby lists.

35 Responses to ETW Heap Tracing–Every Allocation Recorded

K. Gadd says:

April 27, 2015 at 8:22 am

A while back I wrote a frontend that used UMDH and the kernel mode heap instrumentation flags to record allocation stacks, automatically take heap snapshots, and diff them. I hadn’t even realized you could use ETW to do the same stuff, that’s neat. I’ll have to try ETW + WPA out the next time I’m looking into heap usage on Windows.

https://github.com/kg/HeapProfiler

FWIW my tool might actually have some analysis mechanisms that ETW doesn’t (treemaps, text-searching in stacktraces, etc), so maybe worth fiddling with. UMDH is a pretty easy (in my experience) way to get at most of this data if you want to process it automatically, like in automated tests. My profiler codebase actually has a simple parser for it you can repurpose to that end: https://github.com/kg/HeapProfiler/blob/master/HeapSnapshot.cs#L814

- brucedawson says:
  
  April 27, 2015 at 9:19 am
  
  Interesting. The ETW heap profiling’s main weakness is that it can’t handle long-running applications because it records separate call stacks for each allocation, and it doesn’t discard data when memory is freed. Does your heap profiler coalesce matching call stacks and/or discard data on freed memory? That could make it particularly interesting.
  
  - K. Gadd says:
    
    April 27, 2015 at 9:28 am
    
    It coalesces matching call stacks – it creates a database of code offset -> stack frame mappings, and a database of tracebacks, and then the heap allocation data references those. It’s relatively dense considering the information it stores.
    
    It doesn’t discard information on memory that ends up freed, but you could probably modify it to do that. It aims to record a full history of allocations/frees. I tuned it to be able to handle relatively-large traces, like leaving Firefox open on a website for an hour or so. The main limitation in terms of recording size is that the key-value store it uses has a maximum size of around 4GB per stream (32-bit offsets…).
    
    Part of the trick is that it just periodically takes UMDH snapshots, then aggressively processes them. That would mean that an allocation + free pair between snapshots won’t be tracked, I think. I imagine it’s less detailed than ETW that way, but if the snapshot interval is good enough it will suffice to find almost every kind of leak I can think of. I think if your goal is to identify performance hits from allocation you want full allocation tracing like ETW can provide.
    
    I found treemaps to be the most useful thing.
    
    - brucedawson says:
      
      April 27, 2015 at 2:33 pm
      
      Treemaps are nice. I like flame-graphs also — I wonder if they would work for summarizing memory?
      
      - Alexander Riccio says:
        
        April 28, 2015 at 1:59 am
        
        Flamegraphs? Interesting – never heard of them before, but I bet they’d be great for visualizing heap churn.
        
      - brucedawson says:
        
        April 28, 2015 at 9:07 am
        
        Here’s an example of using flame graphs to summarize CPU usage from ETW:
        
        Summarizing Xperf CPU Usage with Flame Graphs
        
      - K. Gadd says:
        
        April 28, 2015 at 10:35 am
        
        It has graphs for various statistics (like private heap size, address space size, etc) and I found them pretty useful. The sampling rate is linked to the interval of heap layout samples, though, which isn’t quite so good – it probably would be ideal to take high-level statistics samples every second or so.
        
        One upside to the heap layout snapshot approach is that you can filter the allocations by who allocated them and get a flamegraph just charting the amount of live allocated memory from that source. Helps for narrowing down usage to a particular subset of the application.
        
      - brucedawson says:
        
        October 5, 2018 at 12:15 am
        
        I checked, and WPA’s flame graphs work just fine for visualizing memory.
        
    - Alexander Riccio says:
      
      April 28, 2015 at 1:51 am
      
      Treemaps are awesome….but you’re running python 2.6??!?
      
Kyle Sabo says:

April 27, 2015 at 5:12 pm

You can avoid all the mucking around with registry keys if you’re able to start the process in a suspended state and grab its PID and start the tracing before resuming it. Perhaps not too useful for something like Chrome that starts child processes, but it does avoid accidentally leaving the IFEO key in place after you’re done your investigation, as well as avoiding having XPerf launch the process for you as an Administrator.

- brucedawson says:
  
  April 27, 2015 at 7:49 pm
  
  Hmmm — the current setup seems ideal for profiling process startup (especially with multiple processes) but it’s a poor fit for grabbing a heap trace from running processes. I might add a new mode that scans for all currently running processes that fit the specified name and pass those process IDs to the -Pids command.
  
  What would be ideal is if I could pass zero as well. So -heap -pids 0,1432,1732 in order to trace already running processes 1432 and 1732, and any new processes that match the IFEO key. Would this work? Ooh, I hope so.
  
- brucedawson says:
  
  April 27, 2015 at 7:50 pm
  
  > it does avoid accidentally leaving the IFEO key in place
  
  How bad is that anyway? I get the impression that it just sets a flag in the process structure that tells the heap tracing to turn on when needed, so it’s basically free otherwise. Is that correct?
  
  - Kyle Sabo says:
    
    April 28, 2015 at 9:40 am
    
    I think the only problem is if you go to trace another program and it turns on heap events for an unrelated process because the key was still set. If heap tracing is turned off, whether the key is set or not shouldn’t affect anything.
    
    XPerf only lets you pass 2 PIDs to the -pids command. Passing 0 is useful to work around that limitation.
    
    - brucedawson says:
      
      April 28, 2015 at 1:28 pm
      
      Thanks Kyle. It would be nice if xperf would allow an arbitrary number of PIDs, but oh well.
      
Jon says:

April 27, 2015 at 11:37 pm

UI for ETW is great. What do you think about a “minimize to tray” feature?
Also the “Close” button can probably be removed as the “X” does that.

- brucedawson says:
  
  April 28, 2015 at 12:30 am
  
  I think a minimize to tray feature would be appreciated by some people, although I don’t know if I would use it. I suppose if the tray icon displayed useful information that would be a nice bonus.
  
  Indeed the close button could be removed — I think it showed up there for some vestigial reason.
  
Michael says:

April 28, 2015 at 4:30 pm

In the “Impacting Size” paragraph, I think there are two misuses of the Types. “AIFO means matched allocation and free in this time range”. It seems clear that this should be AIFI, since you started the sentence discussing AOFO and AIFI. “For AIFO this number is always the negative of how many bytes were requested, it is -Size.” That seems like it should be AOFI?

- brucedawson says:
  
  April 28, 2015 at 5:31 pm
  
  Good catches. I’ve fixed both of these. It’s hard to find a good editor whose eyes won’t glaze over when reading this stuff.
  
Milos Tosic says:

April 28, 2015 at 8:06 pm

I’ve developed a memory profiler targeting native apps that records full allocation history, supporting profiling on multiple platforms. Each allocation/free is recorded with a call stack and no information is discarded. The captured data can be large but this is offset by using LZ4 compression internally. It has a number of visualizations: memory usage/allocation count timeline, stack trees, tree maps, allocation histograms, etc.

It’s a commercial tool with a fully functional 30 day trial period: http://mtuner.net

It comes with an SDK that allows adding custom event markers, naming heaps and having full control over what exactly is being profiled.

One feature that separates it from the rest of memory analysis tools is filtering mode. For example, a user could select a time range between two markers (begin/end data load for example), a specific allocation size range, and a heap to get a list of all memory allocations that satisfy those criteria. Any combination of filters can be applied simultaneously. This makes it very quick and easy to find allocations / leaks of interest.

Alois Kraus says:

May 7, 2015 at 9:57 am

I agree that realtime parsing of ETW events would massively help to reduce the very high data rate. On the other hand if you are not searching for small leaks which build up during hours but for random spikes then recording Heap allocation data into a ring buffer paired with a Performance Counter trigger to stop profiling when one or more memory thresholds are satisfied is also a good way to just get the data at the point where a spike occurs. You can e.g. create in the stop trigger another trigger which fires at +100MB again and you start the recording again … or you build this into UIForETW?

- brucedawson says:
  
  May 8, 2015 at 3:32 pm
  
  Yep, that technique can definitely work. I’ve used a variant on that, recording VirtualAlloc stacks only, which is much lower overhead than recording heap stacks, recording to a circular-memory buffer, and having code to trigger the saving of the trace when a memory spike occurs. This was sufficient to let us solve a memory spike that happened after a server had been running for hours.
  
  I’d be happy to have this in UIforETW. I think it’s mostly a matter of figuring out the design — what sort of triggers should be supported? One way to handle this would be to have an external program monitoring whatever it thinks is important and then sending a ‘record trace’ message to UIforETW.
  
Alois Kraus says:

May 10, 2015 at 1:45 am

Some basic stuff like working set, Private Bytes thresholds, (e.g. 500,600,1000,3000) with automatic restarting of the recording would be nice. To decouple the retrieval you could call a script or whatever which returns as errorlevel the data point I am after.
Then you can configure in the UI the logic what should happen if the threshold is reached (from above or below) and how to react upon it (stop recording, restart again, disable this threshold or activate it again after 100s).
Since I am mostly doing managed code I would also need support for the .NET Runtime Providers to be able to match my call stacks. See Record.cmd http://geekswithblogs.net/akraus1/archive/2013/06/02/153041.aspx for what would be necessary to get at least the data collection right. NGen pdb creation could be left to the TraceEvent library like PerfView does it but that could also be folded in as a custom data collection step in your UI with some good Defaults which assume that xperf is in your path.

Cam H says:

December 6, 2015 at 2:50 pm

I was wondering if there was any special magic to track heap allocations when using a custom allocator, specifically tcmalloc which I believe is the allocator used in Chrome? Or are you only going to see occasional VirtualAllocs when the custom allocator needs more space?

- brucedawson says:
  
  December 6, 2015 at 2:55 pm
  
  I’m not aware of a way to hook up custom allocators so that they are tracked by ETW heap tracing, so custom allocators can cause many allocations to be missed.
  
  However Chrome moved off of tcmalloc about a year ago, so heap tracing works reasonably well with Chrome.
  
Sam says:

February 16, 2017 at 12:23 am

@Bruce do you experience issues with WPA? I have to profile large applications where the recorded ETL is about 9gb and when I open it in WPA 10.0.10586.15 on a x64 Win 7 with 16 gb ram it crashes, after I add “Stacks” to the view for Heap Allocs. Event Viewer says it crashes with OutOfMemory exception even if WPA does not blow up in process explorer in terms of commit.
On smaller traces (~400mb) there are no Heap Allocs, which is strange, but even in the VirtualAlloc Commit LifeTimes graph, when I add Commit Stack, I see symbols disabled. When I load symbols it instantly crashes again, and Event Viewer says WPA choked on a System.InvalidOperationException. Older WPA 6.x behaves similarly.

- brucedawson says:
  
  February 16, 2017 at 4:11 am
  
  I occasionally find traces that will crash WPA. If you have a small (less than 1 GB) trace that crashes the latest WPA then upload it somewhere with public visibility and send me the link. I can try to let the WPA team know.
  
akraus1 says:

February 16, 2017 at 12:19 pm

I also have this issue but usually it goes away if I delete the SymCache folder (not to be confused with the Symbols folder). It seems that from time to time it gets out of sync, or different versions of dbghelp.dll create slightly different versions of the cached pdbs, which then cause a new WPA version to crash frequently until things settle down.

Aaron says:

January 3, 2019 at 2:24 pm

The new-ish “heap snapshot” functionality (see https://docs.microsoft.com/en-us/windows-hardware/test/wpt/record-heap-snapshot) can greatly reduce the overhead of this type of heap analysis. When enabled, it keeps efficient track of the allocation stacks for each live block, and when you snapshot this information gets dumped to the ETL. Newer versions of WPA have support for viewing this infosource.

- brucedawson says:
  
  January 26, 2019 at 6:02 pm
  
  Thanks for the information – sounds interesting. Maybe somebody will add support for it to UIforETW…
  
  - akraus1 says:
    
    January 27, 2019 at 2:49 am
    
    @Aaron: With which Windows 10 version does this work? WPA had this on its command line for years but it did not work for a long time.
    
yujui says:

December 6, 2023 at 12:24 am

I’m trying to trace heap allocations of a kernel mode driver. I’ve tried to start tracing with PID 4 (System) but I didn’t get the “Randomascii Heap Analysis” graph.

Can I use the techniques here to trace heap allocations of a kernel mode driver?

- brucedawson says:
  
  December 6, 2023 at 11:46 pm
  
  I do not know but I wouldn’t be surprised if it doesn’t work. The system process would be a protected process and has other extra security measures so it wouldn’t be surprising if infiltration of that sort was prohibited. Still a pity of course.
  
- akraus1 says:
  
  December 7, 2023 at 3:24 am
  
  How do you allocate memory in the kernel? See https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/_kernel/#memory-allocation-and-buffer-management. There are no heaps in the kernel. The C/C++ heaps are pure user mode libraries implemented by a common method (RtlAllocateHeap) which is instrumented with ETW Tracing. The closest you can get as a device driver developer is to enable ETW pool tracing (for xperf .. POOL and for stackwalk PoolAlloc+PoolFree). As a device driver you usually allocate private memory from the pool allocation methods.
  