Event Tracing for Windows (ETW, aka xperf) is usually used to monitor CPU usage, through its sampling profiler and its ability to record detailed information about context switches. Well, ETW is also used to monitor file I/O, and disk I/O, and sometimes registry accesses, and of course GPU activity, window-in-focus, UI Delays, process lifetimes, and a few other things. Okay, so ETW gets used for a lot of different things, normally configured in the same way and recorded across the entire system.
Heap profiling is different. Even to those who are used to the odd incantations needed to record an ETW trace it can be daunting trying to figure out how to do heap profiling. I’ve postponed writing about it because it was too much work to explain how to record heap traces and analyze them. But now that UIforETW has been released it is almost trivial to record ETW heap traces, and even analyzing them is made easier.
And, as a bonus, system memory-lists and VirtualAlloc calls for all processes are recorded in all UIforETW traces and lightly documented at the bottom of this post, with optional VirtualAlloc call stacks for lightweight memory profiling.
In short, ETW heap tracing makes it straightforward to record and analyze every allocation and free by any application that uses the Windows heap, which is the default memory allocator for VC++ applications. This lets you investigate memory leaks, allocation churn, and other heap issues.
This post was updated in June, 2015 because there is now support for heap profiling of multiple process names simultaneously!
(See the UIforETW announcement post for details on UIforETW)
Heap tracing is different
ETW heap tracing is not enabled globally because it is expensive. Recording a call stack on every heap allocation for just one process can be a huge volume of data, and doing this for all processes would be prohibitive. So, ETW heap tracing (the way that I use it at least) works by specifying the name of the process to be traced, and then recording heap events. It’s a two-step dance.
Step 1: A TracingFlags registry entry is created and set to ‘1’ in the Image File Execution Options for each process name that will be trace to tell the Windows heap to configure itself for tracing when a process with that name is launched. As is always the case with Image File Execution Options the options don’t affect already running processes – only processes launched when the registry key is set are affected.
Step 2: An extra ETW session is created using the “-heap -Pids 0” incantation. This session will record information from processes that had a TracingFlags registry entry of ‘1’ when they started.
The details are a bit messy but now that UIforETW is written I don’t have to bother explaining the details, and you don’t have to pretend to listen. If you want to record a heap trace use UIforETW, and if you want to know how it works then look at the code, or click the Show commands button to see most of the dirty laundry.
Recording heap traces with UIforETW
First you need to go to the settings dialog and specify the name of your processes. It defaults to Chrome.exe because that’s what I originally wrote UIforETW for, but you can change it to anything else. The .exe suffix is mandatory, and you can have multiple process names listed if you want, just separate them with semi-colons.
Then, you need to set the tracing mode to Heap tracing to file. Once you do this a TracingFlags entry is created for the specified process and set to 1. It will be zeroed when you exit UIforETW or when you change tracing modes. This means that as soon as Heap tracing to file is selected you can launch the process that you want to heap trace.
In order to actually start recording the heap tracing information you need to click Start Tracing.
If you understand this two-step dance then you will now know how to use it. You need to set the Heap-profiled process name and select Heap tracing to file before launching the process that you want to trace. And you need to start tracing before any data will actually be recorded. Whether you start tracing before or after you start your process(es) determines whether or not you will record allocations made at process startup.
And yes, this does work for multiple processes. When I profile Chrome I get heap data from all Chrome processes – as long as they are started after the registry key is set. It even works for multiple different process names.
Recording heap traces can generate a very high data rate. If you hit problems with lost events then you will need to increase the size of the buffers used for the heap session. If the buffers are too large then you may be unable to start tracing on small-memory machines. UIforETW currently lacks any UI for configuring this, so use the source Luke.
Because of the high data rate for heap traces you may want to turn off context switch call stacks and CPU sampling call stacks. These call stacks are often not needed when doing heap tracing and they just waste disk bandwidth and file space if they are not going to be used.
Analyzing heap traces
After you have recorded your scenario (keep it short to avoid generating traces that are large and unwieldy) you save the trace buffers as usual. The trace name will be something like “date_time_bruced_notepad_heap.etl” which helps to remind you what was recorded. As usual you should double-click the trace to load it into Windows Performance Analyzer (WPA).
The UIforETW startup profile does not initially show any graphs for viewing heap or memory data so you will need to add them (and optionally save a new startup profile). The memory graphs are (surprise!) found under the Memory section in Graph Explorer and the first one you should add is probably Heap Allocations. If you drag this over you will get the WPA default settings for viewing the heap which, as usual, I think are wrong. The default view shows columns that are usually not needed such as Address and AllocTime, and omits vital columns such as Stack and Type. To get the usual-best-defaults (TM) select the verbosely named “Randomascii Outstanding Size by Process, Handle, Type, Stack” from the View Preset drop-down, giving you something like this:
Now you can start drilling down into your process or processes. Most processes have many heaps and I find it handy to group by the heap handle, since it sometimes provides a high-level categorization of allocations. For instance, msvcrt.dll clients such as DWrite.dll typically used a separate heap from the VC++ C runtime. But, if the Handle column is of no interest then feel free to close it.
The next column is the most crucial and least obvious. In the world of ETW memory explorations there are four Types of allocations, defined by where the allocation time and free time occur in relation to the displayed timespan. These types are:
- AIFO – Allocated Inside Freed Outside: These are blocks of memory that were allocated in the displayed timespan but were not freed in the displayed timespan. These are the most important type of allocations if you are looking for memory consumed in a time region. They may have been freed after the displayed timespan, or never. When you drill down into the call stacks for this type of memory the Count represents unfreed allocations, and Size represents unfreed memory.
- AOFI – Allocated Outside Freed Inside: These are blocks of memory that were allocated before the displayed timespan and were then freed in the displayed timespan. These are the mirror images of AIFO, and when calculating how many non-freed allocations have occurred during a timespan you usually want to subtract AOFI from AIFO.
- AOFO – Allocated Outside Freed Outside: These are blocks of memory whose lifespan is infinite, at least compared to the displayed timespan. These are the immortals, existing outside of time itself – at least until you zoom out and make their birth or death visible.
- AIFI – Allocated Inside Freed Inside: These blocks of memory are the fruit flies of heap analysis – they live fast, burn brightly, and are freed before the unerring hands of time reach the right edge of the WPA screen. This type of allocation is most interesting if you are investigating high allocation churn.
After the Type column is the Stack column, and it behaves as usual. With the default Randomascii table layout you will be drilling into the stacks for a particular type of allocation, sorted by whatever column you choose.
Then we have the orange bar, and with it the end of the columns used for grouping.
Next we have Count which, as usual, counts how many allocations are summarized by each line. In the Heap Allocations graph the Count for a process includes every allocation that was alive for any part of the visible timespan. That’s such a broad category that I find the count column impossible to meaningfully interpret unless I am grouping by Type.
Impacting Size is “the change in heap usage in the current viewport”. What that means is that it summarizes, for each row, the total change in how much memory was requested from the heap over the displayed timespan. For AOFO and AIFI this is always zero because AOFO means no allocation or free in this time range, and AIFI means matched allocation and free in this time range. For AIFO this number is how many bytes were requested, and it will match Size. For AOFI this number is always the negative of how many bytes were requested, it is -Size. Impacting Size is quite useful once you understand it. It is meaningful even if the Type column is not used for grouping.
The Size column is simply the total of how many bytes were requested by the allocations summarized on that line, regardless of when the allocations occurred. Both Size and Impacting Size are measured in terms of how many bytes were requested so a one-byte allocation will show up as ‘1’, not as its much greater total heap overhead.
Other heap graphs
When doing a heap trace there are a couple of other graphs that are available:
The Low Fragmentation Heap graphs expose some detailed heap internals which you are welcome to spelunk through.
The Heap Extents graph presumably shows how much committed memory is used by the processes being monitored, which should allow an astute user to infer heap efficiency and other useful information. But I lack enough experience with this graph to say anymore.
Other memory graphs
There is some memory information recorded in all UIforETW graphs, because the VIRT_ALLOC and MEM_INFO providers are recorded:
The Virtual Alloc graph shows similar information to the heap graph, including having the same Type column as the heap graph. As usual the defaults are not great, so use the RandomAscii configuration to clarify. Note that UIforETW always records VirtualAlloc information for all processes, ’cause you never can tell when it might be useful. When a heap trace is being recorded it also records a call stack on each VirtualAlloc call, which makes exploring the data more fruitful. If you are viewing VirtualAlloc on a non-heap trace then you might want to hide the [empty] Stack column.
It’s worth pointing out that you can only use the VirtualAlloc data to draw conclusions about how much memory a process has allocated if you trace from when the process starts – otherwise you will have missed many VirtualAlloc calls and you will hugely underestimate the memory consumption.
The Memory Utilization graph shows, sampled about every 0.5 s, how much memory is in various memory lists such as the Active List, Zero and Free Lists, and Standby Lists. There are books and lectures that explain these lists in more detail but the main thing to remember is that the Standby Lists plus the Zero and Free Lists are the available memory and if this ever gets low (less than ~800 MB) then Windows will start trimming working sets and this will cause poor performance at some indeterminate time in the future.
Documentation == good
While writing this blog post I realized that I couldn’t remember exactly how I’d designed UIforETW’s heap tracing. Next time that I forget I’ll just read this post instead of staring at the source code to figure it out. And, while relearning how it worked I realized it had a bug, fixed in change 7c8e56d, and tweaked a few other details as well.
So make sure you get the very latest version of UIforETW!
UIforETW announcement is available at: https://randomascii.wordpress.com/2015/04/14/uiforetw-windows-performance-made-easier/