ETW Heap Tracing–Every Allocation Recorded

Event Tracing for Windows (ETW, aka xperf) is usually used to monitor CPU usage, through its sampling profiler and its ability to record detailed information about context switches. Well, ETW is also used to monitor file I/O, and disk I/O, and sometimes registry accesses, and of course GPU activity, window-in-focus, UI Delays, process lifetimes, and a few other things. Okay, so ETW gets used for a lot of different things, normally configured in the same way and recorded across the entire system.

Heap profiling is different. Even to those who are used to the odd incantations needed to record an ETW trace it can be daunting trying to figure out how to do heap profiling. I’ve postponed writing about it because it was too much work to explain how to record heap traces and analyze them. But now that UIforETW has been released it is almost trivial to record ETW heap traces, and even analyzing them is made easier.

And, as a bonus, system memory-lists and VirtualAlloc calls for all processes are recorded in all UIforETW traces and lightly documented at the bottom of this post, with optional VirtualAlloc call stacks for lightweight memory profiling.

In short, ETW heap tracing makes it straightforward to record and analyze every allocation and free by any application that uses the Windows heap, which is the default memory allocator for VC++ applications. This lets you investigate memory leaks, allocation churn, and other heap issues.

This post was updated in June, 2015 because there is now support for heap profiling of multiple process names simultaneously!

(See the UIforETW announcement post for details on UIforETW)

Heap tracing is different

ETW heap tracing is not enabled globally because it is expensive. Recording a call stack on every heap allocation for just one process can be a huge volume of data, and doing this for all processes would be prohibitive. So, ETW heap tracing (the way that I use it at least) works by specifying the name of the process to be traced, and then recording heap events. It’s a two-step dance.

Step 1: A TracingFlags registry entry is created and set to ‘1’ in the Image File Execution Options for each process name that will be trace to tell the Windows heap to configure itself for tracing when a process with that name is launched. As is always the case with Image File Execution Options the options don’t affect already running processes – only processes launched when the registry key is set are affected.

image

Step 2: An extra ETW session is created using the “-heap -Pids 0” incantation. This session will record information from processes that had a TracingFlags registry entry of ‘1’ when they started.

The details are a bit messy but now that UIforETW is written I don’t have to bother explaining the details, and you don’t have to pretend to listen. If you want to record a heap trace use UIforETW, and if you want to know how it works then look at the code, or click the Show commands button to see most of the dirty laundry.

Recording heap traces with UIforETW

First you need to go to the settings dialog and specify the name of your processes. It defaults to Chrome.exe because that’s what I originally wrote UIforETW for, but you can change it to anything else. The .exe suffix is mandatory, and you can have multiple process names listed if you want, just separate them with semi-colons.

imageThen, you need to set the tracing mode to Heap tracing to file. Once you do this a TracingFlags entry is created for the specified process and set to 1. It will be zeroed when you exit UIforETW or when you change tracing modes. This means that as soon as Heap tracing to file is selected you can launch the process that you want to heap trace.

In order to actually start recording the heap tracing information you need to click Start Tracing.

If you understand this two-step dance then you will now know how to use it. You need to set the Heap-profiled process name and select Heap tracing to file before launching the process that you want to trace. And you need to start tracing before any data will actually be recorded. Whether you start tracing before or after you start your process(es) determines whether or not you will record allocations made at process startup.

And yes, this does work for multiple processes. When I profile Chrome I get heap data from all Chrome processes – as long as they are started after the registry key is set. It even works for multiple different process names.

Recording heap traces can generate a very high data rate. If you hit problems with lost events then you will need to increase the size of the buffers used for the heap session. If the buffers are too large then you may be unable to start tracing on small-memory machines. UIforETW currently lacks any UI for configuring this, so use the source Luke.

Because of the high data rate for heap traces you may want to turn off context switch call stacks and CPU sampling call stacks. These call stacks are often not needed when doing heap tracing and they just waste disk bandwidth and file space if they are not going to be used.

Analyzing heap traces

After you have recorded your scenario (keep it short to avoid generating traces that are large and unwieldy) you save the trace buffers as usual. The trace name will be something like “date_time_bruced_notepad_heap.etl” which helps to remind you what was recorded. As usual you should double-click the trace to load it into Windows Performance Analyzer (WPA).

The UIforETW startup profile does not initially show any graphs for viewing heap or memory data so you will need to add them (and optionally save a new startup profile). The memory graphs are (surprise!) found under the Memory section in Graph Explorer and the first one you should add is probably Heap Allocations. If you drag this over you will get the WPA default settings for viewing the heap which, as usual, I think are wrong. The default view shows columns that are usually not needed such as Address and AllocTime, and omits vital columns such as Stack and Type. To get the usual-best-defaults (TM) select the verbosely named “Randomascii Outstanding Size by Process, Handle, Type, Stack” from the View Preset drop-down, giving you something like this:

image

Now you can start drilling down into your process or processes. Most processes have many heaps and I find it handy to group by the heap handle, since it sometimes provides a high-level categorization of allocations. For instance, msvcrt.dll clients such as DWrite.dll typically used a separate heap from the VC++ C runtime. But, if the Handle column is of no interest then feel free to close it.

The next column is the most crucial and least obvious. In the world of ETW memory explorations there are four Types of allocations, defined by where the allocation time and free time occur in relation to the displayed timespan. These types are:

  • AIFO – Allocated Inside Freed Outside: These are blocks of memory that were allocated in the displayed timespan but were not freed in the displayed timespan. These are the most important type of allocations if you are looking for memory consumed in a time region. They may have been freed after the displayed timespan, or never. When you drill down into the call stacks for this type of memory the Count represents unfreed allocations, and Size represents unfreed memory.
  • AOFI – Allocated Outside Freed Inside: These are blocks of memory that were allocated before the displayed timespan and were then freed in the displayed timespan. These are the mirror images of AIFO, and when calculating how many non-freed allocations have occurred during a timespan you usually want to subtract AOFI from AIFO.
  • AOFO – Allocated Outside Freed Outside: These are blocks of memory whose lifespan is infinite, at least compared to the displayed timespan. These are the immortals, existing outside of time itself – at least until you zoom out and make their birth or death visible.
  • AIFI – Allocated Inside Freed Inside: These blocks of memory are the fruit flies of heap analysis – they live fast, burn brightly, and are freed before the unerring hands of time reach the right edge of the WPA screen. This type of allocation is most interesting if you are investigating high allocation churn.

After the Type column is the Stack column, and it behaves as usual. With the default Randomascii table layout you will be drilling into the stacks for a particular type of allocation, sorted by whatever column you choose.

Then we have the orange bar, and with it the end of the columns used for grouping.

Next we have Count which, as usual, counts how many allocations are summarized by each line. In the Heap Allocations graph the Count for a process includes every allocation that was alive for any part of the visible timespan. That’s such a broad category that I find the count column impossible to meaningfully interpret unless I am grouping by Type.

Impacting Size is “the change in heap usage in the current viewport”. What that means is that it summarizes, for each row, the total change in how much memory was requested from the heap over the displayed timespan. For AOFO and AIFI this is always zero because AOFO means no allocation or free in this time range, and AIFI means matched allocation and free in this time range. For AIFO this number is how many bytes were requested, and it will match Size. For AOFI this number is always the negative of how many bytes were requested, it is -Size. Impacting Size is quite useful once you understand it. It is meaningful even if the Type column is not used for grouping.

The Size column is simply the total of how many bytes were requested by the allocations summarized on that line, regardless of when the allocations occurred. Both Size and Impacting Size are measured in terms of how many bytes were requested so a one-byte allocation will show up as ‘1’, not as its much greater total heap overhead.

Other heap graphs

When doing a heap trace there are a couple of other graphs that are available:

The Low Fragmentation Heap graphs expose some detailed heap internals which you are welcome to spelunk through.

The Heap Extents graph presumably shows how much committed memory is used by the processes being monitored, which should allow an astute user to infer heap efficiency and other useful information. But I lack enough experience with this graph to say anymore.

Other memory graphs

There is some memory information recorded in all UIforETW graphs, because the VIRT_ALLOC and MEM_INFO providers are recorded:

The Virtual Alloc graph shows similar information to the heap graph, including having the same Type column as the heap graph. As usual the defaults are not great, so use the RandomAscii configuration to clarify. Note that UIforETW always records VirtualAlloc information for all processes, ’cause you never can tell when it might be useful. When a heap trace is being recorded it also records a call stack on each VirtualAlloc call, which makes exploring the data more fruitful. If you are viewing VirtualAlloc on a non-heap trace then you might want to hide the [empty] Stack column.

image

It’s worth pointing out that you can only use the VirtualAlloc data to draw conclusions about how much memory a process has allocated if you trace from when the process starts – otherwise you will have missed many VirtualAlloc calls and you will hugely underestimate the memory consumption.

The Memory Utilization graph shows, sampled about every 0.5 s, how much memory is in various memory lists such as the Active List, Zero and Free Lists, and Standby Lists. There are books and lectures that explain these lists in more detail but the main thing to remember is that the Standby Lists plus the Zero and Free Lists are the available memory and if this ever gets low (less than ~800 MB) then Windows will start trimming working sets and this will cause poor performance at some indeterminate time in the future.

imageHere we can see the Zero and Free Lists being temporarily drained to satisfy some short-term memory needs before bouncing back to excessive levels when the temporary memory was released.

Documentation == good

While writing this blog post I realized that I couldn’t remember exactly how I’d designed UIforETW’s heap tracing. Next time that I forget I’ll just read this post instead of staring at the source code to figure it out. And, while relearning how it worked I realized it had a bug, fixed in change 7c8e56d, and tweaked a few other details as well.

So make sure you get the very latest version of UIforETW!

References

Other takes on heap profiling can be found here and here.

UIforETW is available at https://github.com/google/UIforETW, with pre-built UIforETW binaries available in the releases section.

UIforETW announcement is available at: https://randomascii.wordpress.com/2015/04/14/uiforetw-windows-performance-made-easier/

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
This entry was posted in memory, xperf and tagged , , . Bookmark the permalink.

26 Responses to ETW Heap Tracing–Every Allocation Recorded

  1. K. Gadd says:

    A while back I wrote a frontend that used UMDH and the kernel mode heap instrumentation flags to record allocation stacks, automatically take heap snapshots, and diff them. I hadn’t even realized you could use ETW to do the same stuff, that’s neat. I’ll have to try ETW + WPA out the next time I’m looking into heap usage on Windows.

    https://github.com/kg/HeapProfiler

    FWIW my tool might actually have some analysis mechanisms that ETW doesn’t (treemaps, text-searching in stacktraces, etc), so maybe worth fiddling with. UMDH is a pretty easy (in my experience) way to get at most of this data if you want to process it automatically, like in automated tests. My profiler codebase actually has a simple parser for it you can repurpose to that end: https://github.com/kg/HeapProfiler/blob/master/HeapSnapshot.cs#L814

    • brucedawson says:

      Interesting. The ETW heap profiling’s main weakness is that it can’t handle long-running applications because it records separate call stacks for each allocation, and it doesn’t discard data when memory is freed. Does your heap profiler coalesce matching call stacks and/or discard data on freed memory? That could make it particularly interesting.

      • K. Gadd says:

        It coalesces matching call stacks – it creates a database of code offset -> stack frame mappings, and a database of tracebacks, and then the heap allocation data references those. It’s relatively dense considering the information it stores.

        It doesn’t discard information on memory that ends up freed, but you could probably modify to do that. It aims to record a full history of allocations/frees. I tuned it to be able to handle relatively-large traces, like leaving Firefox open on a website for an hour or so. The main limitation in terms of recording size is that the key-value store it uses has a maximum size of around 4GB per stream (32-bit offsets…).

        Part of the trick is that it just periodically takes UMDH snapshots, then aggressively processes them. That would mean that an allocation + free pair between snapshots won’t be tracked, I think. I imagine it’s less detailed than ETW that way, but if the snapshot interval is good enough it will suffice to find almost every kind of leak I can think of. I think if your goal is to identify performance hits from allocation you want full allocation tracing like ETW can provide.

        I found treemaps to be the most useful thing.

  2. Kyle Sabo says:

    You can avoid all the mucking around with registry keys if you’re able to start the process in a suspended state and grab its PID and start the tracing before resuming it. Perhaps not too useful for something like Chrome that starts child processes, but it does avoid accidentally leaving the IFEO key in place after you’re done your investigation, as well as avoiding having XPerf launch the process for you as an Administrator.

    • brucedawson says:

      Hmmm — the current setup seems ideal for profiling process startup (especially with multiple processes) but it’s a poor fit for grabbing a heap trace from running processes. I might add a new mode that scans for all currently running processes that fit the specified name and pass those process IDs to the -Pids command.

      What would be ideal is if I could pass zero as well. So -heap -pids 0,14321,1732 in order to trace already running processes 1432 and 1732, and any new process that match the IFEO key. Would this work? Ooh, I hope so.

    • brucedawson says:

      > it does avoid accidentally leaving the IFEO key in place

      How bad is that anyway? I get the impression that it just sets a flag in the process structure that tells the heap tracing to turn on when needed, so it’s basically free otherwise. Is that correct?

      • Kyle Sabo says:

        I think the only problem is if you go to trace another program and it turns on heap events for an unrelated process because the key was still set. If heap tracing is turned off, whether the key is set or not shouldn’t affect anything.

        XPerf only lets you pass 2 PIDs to the -pids command. Passing 0 is useful to work around that limitation.

  3. Jon says:

    UI for ETW is great. What do you think about a “minimize to tray” feature?
    Also the “Close” button can be probably removed as the “X” does that.

    • brucedawson says:

      I think a minimize to tray feature would be appreciated by some people, although I don’t know if I would use it. I suppose if the tray icon displayed useful information that would be a nice bonus.

      Indeed the close button could be removed — I think it showed up there for some vestigial reason.

  4. Michael says:

    In the “Impacting Size” paragraph, I think there are two misuses of the Types. “AIFO means matched allocation and free in this time range”. That seems most clearly that that should be AIFI since you started the sentence discussing AOFO and AIFI. “For AIFO this number is always the negative of how many bytes were requested, it is -Size.” That seems like it should be AOFI?

  5. Milos Tosic says:

    I’ve developed a memory profiler targeting native apps that records full allocation history, supporting profiling on multiple platforms. Each allocation/free is recorded with a call stack and no information is discarded. The captured data can be large but this is offsetted by using LZ4 compression internally. It has a number of visualizations: memory usage/allocation count timeline, stack trees, tree maps, allocation histograms, etc.

    It’s a commercial tool with a fully functional 30 day trial period: http://mtuner.net

    It comes with an SDK that allows adding custom event markers, naming heaps and having full control over what exactly is being profiled.

    One feature that separates is from the rest of memory analysis tools is filtering mode. For example, user could select a time range between two markers (begin/end data load for example), specific allocation size range and heap to get a list of all memory allocations that satisfy that criteria. Any combination of filters can be applied simultaneously. This makes it very quick and easy to find allocations / leaks of interest.

  6. Alois Kraus says:

    I agree that realtime parsing of ETW events would massively help to reduce the very high data rate. On the other hand if you are not searching for small leaks which build up during hours but for random spikes then recording Heap allocation data into a ring buffer paired with a Performance Counter trigger to stop profiling when one or more memory thresholds are satisfied is also a good way to just get the data at the point where a spike occurs. You can e.g. reate in the stop trigger another trigger which fires at +100MB again and you start the recording again … or you build this into UIForETW?

    • brucedawson says:

      Yep, that technique can definitely work. I’ve used a variant on that, recording VirtualAlloc stacks only, which is much lower overhead than recording heap stacks, recording to a circular-memory buffer, and having code to trigger the saving of the trace when a memory spike occurs. This was sufficient to let us solve a memory spike that happened after a server had been running for hours.

      I’d be happy to have this in UIforETW. I think it’s mostly a matter of figuring out the design — what sort of triggers should be supported? One way to handle this would be to have an external program monitoring whatever it thinks is important and then sending a ‘record trace’ message to UIforETW.

  7. Alois Kraus says:

    Some basic stuff like working set, Private Bytes thresholds, (e.g. 500,600,1000,3000) with autmatic restarting of the recording would be nice. To decouple the retrieval you could call a script or whatever which returns as errorlevel the data point I am after.
    Then you can configure in the UI the logic what should happen if the threshold is reached (from above or below) and how to react upon it (stop recording, restart again, disable this threshold or activate it again after 100s).
    Since I am mostly doing managed code I would need also support for the .NET Runtime Providers as well to be able to match my call stacks. See Record.cmd http://geekswithblogs.net/akraus1/archive/2013/06/02/153041.aspx what would be necessary to get at least the data collection right. NGen pdb creation could be left to TraceEvent library like PerfView does it but that could also be folded as custom data collection step in you UI with some good Defaults which assume that xperf is your path.

  8. Pingback: UIforETW – Windows Performance Made Easier | Random ASCII

  9. Pingback: ETW Central | Random ASCII

  10. Pingback: 50 Bytes of Code That Took 4 GB to Compile | Random ASCII

  11. Cam H says:

    I was wondering if there was any special magic to track heap allocations when using a custom allocator, specifically tcmalloc which I believe is the allocator used in Chrome? Or are you only going to see occasional VirtualAllocs when the custom allocator needs more space?

    • brucedawson says:

      I’m not aware of a way to hook up custom allocators so that they are tracked by ETW heap tracing, so custom allocators can cause many allocations to be missed.

      However Chrome moved off of tcmalloc about a year ago, so heap tracing works reasonably well with Chrome.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s