Three years ago I found a 32 GB memory leak caused by CcmExec.exe failing to close process handles. That bug is fixed, but ever since then I have had the handles column in Windows Task Manager enabled, just in case I hit another handle leak.
Because of this routine checking I noticed, in February of 2021, that one of Chrome’s processes had more than 20,000 handles open!
This Chrome bug is fixed now but I wanted to share how to investigate handle leaks because there are other leaky programs out there. I also wanted to share my process of learning.
A bit of investigation with Chrome’s Task manager showed that the process in question was the renderer process for gmail, and a bit more looking around showed that most Chrome renderer processes had fewer than 1,000 kernel handles.
20,000 seemed anomalous and after monitoring for a few hours I could see that the handle count was going up indefinitely. Mercy sakes alive, looks like we’ve got us a handle leak.
Slow steps to a solution
The first thing I wanted to know was what type of handles these were. Windows handles can refer to files, processes, threads, events, semaphores, and many other kernel objects. I turned to the Sysinternals handle tool to see which type it was, but it said there were only a few hundred handles open in that process. It turns out that handle.exe defaults to only displaying information about file handles. To get full information you can pass -a to dump information about all handles, or -s to summarize by type. The -s option is the most helpful for this investigation. Typical output for gmail looked like this (running it from an admin command prompt would have resolved the <Unknown type> handles):
>handle -s -p 13360
Nthandle v4.22 – Handle viewer
Copyright (C) 1997-2019 Mark Russinovich
Sysinternals – http://www.sysinternals.com
Handle type summary:
<Unknown type> : 4
<Unknown type> : 77
<Unknown type> : 48
ALPC Port : 1
Directory : 2
Event : 20858
File : 147
IoCompletion : 4
IRTimer : 6
Key : 8
Semaphore : 8
Thread : 31
TpWorkerFactory : 3
Total handles: 21207
So. Event handles it is. It could be worse. Leaked process handles keep around expensive kernel structures which seem to add up to about 64 KB per handle. Event handles, in comparison, are pretty cheap – perhaps sixteen bytes or so (it’s hard to measure). It was possible that the impact of this leak was minuscule. But, Chrome uses RAII objects to manage all of its handles so we shouldn’t have leaks – I wanted to understand what was going wrong.
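When you are checking handle counts across many processes, summaries like the one above are easier to compare mechanically than by eye. Here is a hypothetical Python sketch (the parsing is just my reading of handle.exe's output format, not anything official) that tallies a `handle -s` summary and picks out the dominant type:

```python
import re

def parse_handle_summary(text):
    """Map handle type -> total count from "handle -s" output lines."""
    counts = {}
    for line in text.splitlines():
        # Lines look like "Event : 20858"; skip the "Total handles" footer.
        m = re.match(r"\s*(<?[\w ]+?>?)\s*:\s*(\d+)\s*$", line)
        if m and not m.group(1).startswith("Total"):
            name = m.group(1).strip()
            # Duplicate <Unknown type> rows get summed together.
            counts[name] = counts.get(name, 0) + int(m.group(2))
    return counts

sample = """Handle type summary:
  <Unknown type>  : 4
  <Unknown type>  : 77
  Event           : 20858
  File            : 147
Total handles: 21207
"""
counts = parse_handle_summary(sample)
suspect = max(counts, key=counts.get)  # the anomalously common type
```

Running this over the gmail summary immediately fingers Event as the type worth chasing.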
After a bit of asking around I found that none of my coworkers knew how to investigate handle leaks, so I had to figure it out. I attached a debugger to the gmail process and tried setting a breakpoint on the functions that create events, but I got too much noise. Hundreds of events were being created and more than 99% of them were being cleanly deleted. I needed something better.
Some might say that I am too obsessed with using Event Tracing for Windows to try to solve all problems, but as long as it keeps working I’m going to keep using it. I suspected that ETW might be able to track handle creation and destruction and find leaks, but I had no idea how.
A bit of Google searching found this article. It got me started, but it has a few flaws:
- The article shows dozens of flags being passed to xperf.exe, but no explanation of what they are
- The flags specified led to a very high data rate, which made long-running tracing impractical
- The post promised to explain how to analyze the data in “the next post”, but there was no “next post” and a call for help led to nothing
Flaws or not, this let me crack the case. I started by using it mostly as-is. I wrote a test program that leaked 10,000 handles and I recorded a trace to see if I could see the intentional leak. It worked.
However, it was recording data at many GB per hour. Because the handle leak was fairly slow – it took weeks to leak 20,000 handles – I needed a many-hour trace to be sure I could catch the leak. I focused on two strategies:
Reduce the amount of data being recorded
The original trace recorded data from these providers:
You can find more about what some of these mean by looking at the output of xperf -providers but the short version is that the recommended command line doesn’t just record information about handles, it also records information about every context switch and all disk I/O (see Latency), enables the sampling profiler (also in Latency), and so much more. My first step was to trim it down to just this:
PROC_THREAD+LOADER is required to make sense of any ETW trace, and OB_HANDLE+OB_OBJECT seemed to be the critical things for handle tracing. A Microsoft contact then told me that OB_OBJECT was not needed so I removed that. The recommended command also had six different parameters to the -stackwalk flag, when all I really needed was the stacks for when handles are allocated. I deleted five of those and then added HandleDuplicate just in case.
With that done, the bare-minimum command for tracing handle leaks is this:
xperf.exe -start -on PROC_THREAD+LOADER+OB_HANDLE -stackwalk HandleCreate+HandleDuplicate
You can see a more full-featured command line in the batch file on GitHub, but this shows the minimal idea. It starts tracing, records enough information to attribute events to the right thread and module (PROC_THREAD+LOADER), records information on handle operations (OB_HANDLE), and records call stacks on handle creation and handle duplication. Perfect!
Reduce the amount of data being generated
I then loaded the (still-too-large) traces into WPA and looked to see where all of the data was coming from. The processes that were generating the most events were Task Manager and Chrome Remote Desktop. The high amount of traffic meant that these two processes were creating and destroying a lot of handles. This was not normally a problem (they weren’t leaking handles) but it was making my trace larger, so the solution was clear – shut them down while tracing. Closing Task Manager was easy enough, but since I was doing this investigation from home on my workstation in the office it was less convenient to close Chrome Remote Desktop. But, no worries. I made sure everything was running correctly and then disconnected (to reduce the amount of data being generated) and then reconnected twelve hours later. This sort of tactic – shutting down programs that are generating excessive data – is often useful with ETW because it generally records information about the entire system, including all running processes.
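Some rough arithmetic shows why a twelve-hour trace was worth the trouble. This is a back-of-the-envelope sketch; the "three weeks" figure is my assumption from "it took weeks to leak 20,000 handles", and the leak is assumed to be roughly linear:

```python
# Rough, illustrative arithmetic - the exact leak period is an assumption.
LEAKED_HANDLES = 20_000
LEAK_PERIOD_HOURS = 3 * 7 * 24   # assume ~three weeks

rate_per_hour = LEAKED_HANDLES / LEAK_PERIOD_HOURS

def expected_leaks(trace_hours):
    """How many leaked handles a trace of the given length should capture."""
    return rate_per_hour * trace_hours

# A one-hour trace catches ~40 leaks - easy to lose in the noise - while a
# twelve-hour trace catches several hundred, enough for a clear signal.
```

At roughly forty leaked handles per hour, a short trace would be dominated by the hundreds of legitimately created-and-destroyed events, which is why a long, low-data-rate trace was the right tradeoff.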
Now I had some large-but-manageable traces (~970 MB, compressed) and I could start the analysis. I loaded the trace into Windows Performance Analyzer and eventually found the handle tracking graph. It’s in Graph Explorer, under Memory (???), Handles, Outstanding Count by Process. Perfect!
Opening this gives a graph showing handle counts over time for every process. There’s a lot of activity over twelve hours. Zooming in to a ten-minute time span shows a flurry of activity. Processes are created, allocate some handles, and then die, creating a view like a digital Monument Valley. Other processes allocate and free handles with no clear pattern, and others seem to increase their handle count indefinitely:
The graph was pretty, but I decided I wanted raw numbers so I looked at the table. The default view looked pretty damning for Chrome’s browser process with almost 1.5 million handles allocated. But wait…
The handle leak wasn’t in the browser process, and it was nowhere near 1.5 million handles. Well, it turns out that the default view is misleading. While the graph/table are called “Outstanding Count by Process” and that is indeed what the graph shows, the table actually shows cumulative count by process. That is, the browser process allocated 1.5 million handles over twelve hours, but it freed pretty much all of them, so who cares?
What I wanted to see was outstanding handles – allocated but not freed. The table can show that data, but it sure doesn’t make it easy.
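The cumulative-versus-outstanding distinction is easy to model. Here is a toy Python sketch (the event format is invented for illustration, not WPA's actual data model) showing why a cumulative count can be enormous while the outstanding count stays tiny:

```python
# Toy model of cumulative vs. outstanding handle counts.
def handle_counts(events):
    """events: list of ("alloc"|"free", handle_id) in time order."""
    cumulative = 0
    live = set()
    for op, handle in events:
        if op == "alloc":
            cumulative += 1     # every allocation counts, forever
            live.add(handle)
        else:
            live.discard(handle)  # outstanding = allocated but not freed
    return cumulative, len(live)

# A busy process that frees almost everything it allocates:
events = [("alloc", 1), ("free", 1), ("alloc", 2), ("free", 2), ("alloc", 3)]
cumulative, outstanding = handle_counts(events)
# cumulative is 3, but only one handle is actually outstanding.
```

A browser process with 1.5 million cumulative allocations and near-zero outstanding handles is busy, not leaky.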
In the heap tracing tables there is a column labeled Type. Now that is a uniquely poor choice for a column name, but it is a very important column. Crucially, in the context of the Type column “Inside” means inside the time range recorded by the trace and “Outside” means outside the time range recorded by the trace. The available types are:
- AIFI (Allocated Inside Freed Inside – not leaked)
- AOFO (Allocated Outside Freed Outside – who knows!)
- AOFI (Allocated Outside Freed Inside – not leaked)
- AIFO (Allocated Inside Freed Outside – potentially leaked!)
AIFO are the interesting events because they were allocated while the trace was being recorded and then not freed before trace recording stopped. They might have been freed a few seconds later, but if you see enough of them on the same call stack you get suspicious…
That’s how heap tracing works. Handle tracing took that concept and changed it just enough to be confusing.
- They renamed the column from Type to Lifetime. Lifetime is definitely a better name, but some confusion inevitably results from having two names for the same concept
- They made the column off by default. This is consistent with heap tracing, which has the Type column off by default, but it is inexplicable. This is the single most important column in the table so I don’t understand why it isn’t front and center.
- In addition to having the column off by default they actually hide it! Normally you can right-click on any column header, select More Columns…, and see a list of columns that you can enable. Many users (me for a long time) probably assume that the list of columns in the menu is complete. It is not. For some columns – including the vitally important Lifetime column – you have to invoke the View Editor (Ctrl+E or click the gear box to the right of “Outstanding Count by Process”) and then drag the Lifetime column from the list of Available Columns on the left to the list on the right. Sigh…
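Mechanically, the classification is simple. Here is a hypothetical Python sketch (the per-handle record format and stack names are invented for illustration) that labels handles the way the Type/Lifetime column does and then counts AIFO handles per allocation stack – many AIFO handles sharing one stack is the smoking gun:

```python
from collections import Counter

def lifetime(allocated_in_trace, freed_in_trace):
    """Label a handle the way WPA's Type/Lifetime column does."""
    return ("AI" if allocated_in_trace else "AO") + \
           ("FI" if freed_in_trace else "FO")

# Per-handle records: (allocated_in_trace, freed_in_trace, alloc_stack).
handles = [
    (True,  True,  "stack A"),  # AIFI - not leaked
    (True,  False, "stack B"),  # AIFO - suspicious
    (True,  False, "stack B"),  # AIFO, same stack - more suspicious
    (False, True,  "stack A"),  # AOFI - not leaked
    (False, False, "stack C"),  # AOFO - who knows!
]
suspects = Counter(stack for a, f, stack in handles
                   if lifetime(a, f) == "AIFO")
top_stack, count = suspects.most_common(1)[0]
```

Sorting the AIFO rows by allocation stack is exactly the view that eventually cracked this bug.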
I reported these problems to the WPA lead and they agreed that improvements were needed. Now that WPA (Preview) is available in the Microsoft store we should expect that fixes like this will ship faster than ever before.
A final issue to be aware of with ETW’s handle tracing is that it seems to have some accounting errors. It will report leaks that are impossible given the “handle -s” output, so be sure to cross-check your results before spending too much time.
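A minimal cross-check can be sketched like this, under the assumption that a trace's outstanding count for a handle type can never legitimately exceed the live count that `handle -s` reports when tracing stops:

```python
# Hypothetical sanity check for ETW handle-leak results: a reported
# outstanding count that exceeds the live count from "handle -s" is an
# accounting error in the trace, not a real leak.
def plausible(etw_outstanding, handle_s_live):
    return 0 <= etw_outstanding <= handle_s_live

ok = plausible(527, 20_858)        # consistent with the gmail numbers
bogus = plausible(30_000, 20_858)  # impossible - distrust the trace
```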
Finding the bug
With all of that learning and discovery behind me I arrived at this view. It showed 527 event handles leaked by the gmail process in twelve hours, with 510 of them leaked on the same call stack, ending in the WaitableEvent constructor:
A text version of the key call stack is here:
The event object was allocated by and stored in Chrome’s WaitableEvent class, managed by its ScopedHandle member variable. That meant that if we were leaking these handles we were also leaking the WaitableEvent objects. If we were leaking those then what else were we leaking?
I used heap snapshot tracing to monitor the gmail process. This was familiar territory for me so it went smoothly. I then looked for potential leaks with call stacks similar to the handle leak and found some. I then looked around for other potential leaks with counts that were similar and that looked plausibly related and I found a set that seemed related to leaks of IDBFactory and OperationsController objects.
I used Chrome’s developer tools to see if gmail was leaking IDB related objects. It was not, so I wasn’t able to blame this on the gmail team – the leak had to be in Chrome itself.
Further investigation supported this theory – the increases in leaked object counts were correlated with service worker threads exiting. I then realized that the service workers were associated with Google Drive’s offline mode. If I disabled offline mode then the leaks went away.
It took months to understand the bug, but then a fix was created fairly promptly (not by me – I still don’t understand that part of the code) and landed in the Chromium repo a few days later.
The fix landed in May in M92 which started rolling out to regular users around July 21st 2021. If you don’t have the fix already then you will soon.
Once the bug was understood I could do stress tests. Using a version of Chrome with the bug I left gmail, sheets, docs, and drive open simultaneously. All of these use offline mode so all of them leaked handles. My Chrome window looked like this:
After leaving buggy Chrome running like this for a few days Task Manager looked like this when sorted by handles:
Five of the top eleven processes are chrome.exe (four renderers plus the browser process). Once I updated my browser the four renderers no longer show up in the top list. But what about the other processes?
- WPA.exe is Windows Performance Analyzer – the tool I use to analyze the handle-leak traces. It leaves about 1,700 .symcache files open, and seems to have an Event handle leak, just like Chrome. I’ve reported these issues to the developer and they are investigating.
- System has a mixture of Event handles, File handles, IoCompletion handles, Process handles, WaitCompletionPacket handles, and others. I don’t know if these represent leaks. I’m guessing they do not.
- This copy of dllhost.exe hosts thumbcache.dll. A bit of experimentation showed that if I create new .mp4 files then this process leaks 12+ WaitCompletionPacket handles per file when explorer creates the thumbnails. When I’m cleaning up vacation videos this can easily leak hundreds or thousands of handles. I mentioned this on twitter, twice. I also reported it on Feedback Hub (only visible to non-corporate users on Windows 10). An ETW trace showing me creating 20 .mp4 files and leaking hundreds of handles can be found here. This should be fixed.
- The last one is IntelTechnologyAccessService.exe, which leaks Event handles. Without symbols or any idea of what it does I can only tell that the leaks are from their core.dll. I tweeted about the leak and got a swift response from an Intel developer who wants to fix the leak.
I did the same stress test both with offline mode disabled and with fixed versions of Chrome (with offline mode re-enabled). This made it clear that disabling offline mode stopped the bug from appearing, and it proved that the fix was working. The handle leaks were gone, as were the associated memory leaks.
When running the fixed version of Chrome the Windows Task Manager looked more like this. Chrome’s browser and GPU processes (not shown) still have more than 1,000 handles open, due to how they work, but the renderer processes are using few enough handles that none of them make it onto the first page of results anymore – problem solved. dllhost.exe has moved up the rankings because I stress-tested it by creating 512 new .mp4 files, and explorer showed up because apparently my .mp4 testing revealed an event handle leak in it. Sigh…
Handle leaks are not the worst thing ever. Process handle leaks are quite expensive, but event handle leaks are not. However, in some cases – like Chrome’s – a handle leak may correlate with other leaks, so handle leaks are usually worth investigating.
Pay attention to Task Manager, especially for your own processes, so that you can avoid this problem.
I still don’t know how to investigate GDI handle leaks – if somebody knows then please let me know.
You can read the multi-month investigation – missteps and all – in the Chromium bug.