Finding Windows HANDLE leaks, in Chromium and others

Three years ago I found a 32 GB memory leak caused by CcmExec.exe failing to close process handles. That bug is fixed, but ever since then I have had the handles column in Windows Task Manager enabled, just in case I hit another handle leak.

Because of this routine checking I noticed, in February of 2021, that one of Chrome’s processes had more than 20,000 handles open!

This Chrome bug is fixed now but I wanted to share how to investigate handle leaks because there are other leaky programs out there. I also wanted to share my process of learning.

A bit of investigation with Chrome’s Task Manager showed that the process in question was the renderer process for gmail, and a bit more looking around showed that most Chrome renderer processes had fewer than 1,000 kernel handles.

20,000 seemed anomalous and after monitoring for a few hours I could see that the handle count was going up indefinitely. Mercy sakes alive, looks like we’ve got us a handle leak.

Slow steps to a solution

The first thing I wanted to know was what type of handles these were. Windows handles can refer to files, processes, threads, events, semaphores, and many other kernel objects. I turned to the Sysinternals handle tool to see which type it was, but it said there were only a few hundred handles open in that process. It turns out that handle.exe defaults to only displaying information about file handles. To get full information you can pass -a to dump information about all handles, or -s to summarize by type. The -s option is the most helpful for this investigation. Typical output for gmail looked like this (running it from an admin command prompt would have resolved the &lt;Unknown type&gt; handles):

>handle -s -p 13360

Nthandle v4.22 – Handle viewer
Copyright (C) 1997-2019 Mark Russinovich
Sysinternals – http://www.sysinternals.com

Handle type summary:
  <Unknown type>  : 4
  <Unknown type>  : 77
  <Unknown type>  : 48
  ALPC Port       : 1
  Directory       : 2
  Event           : 20858
  File            : 147
  IoCompletion    : 4
  IRTimer         : 6
  Key             : 8
  Semaphore       : 8
  Thread          : 31
  TpWorkerFactory : 3
  WaitCompletionPacket: 10
Total handles: 21207

So. Event handles it is. It could be worse. Leaked process handles keep around expensive kernel structures which seem to add up to about 64 KB per handle. Event handles, in comparison, are pretty cheap – perhaps sixteen bytes or so (it’s hard to measure). It was possible that the impact of this leak was minuscule. But, Chrome uses RAII objects to manage all of its handles so we shouldn’t have leaks – I wanted to understand what was going wrong.

After a bit of asking around I found that none of my coworkers knew how to investigate handle leaks, so I had to figure it out. I attached a debugger to the gmail process and tried setting a breakpoint on the functions that create events but I got too much noise. Hundreds of events were being created and more than 99% of them were being cleanly deleted. I needed something better.

ETW!

Some might say that I am too obsessed with using Event Tracing for Windows to try to solve all problems, but as long as it keeps working I’m going to keep using it. I suspected that ETW might be able to track handle creation and destruction and find leaks, but I had no idea how.

A bit of Google searching found this article. It got me started, but it has a few flaws:

  • The article shows dozens of flags being passed to xperf.exe, but no explanation of what they are
  • The flags specified led to a very high data rate, which made long-running tracing impractical
  • The post promised to explain how to analyze the data in “the next post”, but there was no “next post” and a call for help led to nothing

Flaws or not, this let me crack the case. I started by using it mostly as-is. I wrote a test program that leaked 10,000 handles and I recorded a trace to see if I could see the intentional leak. It worked.

However it was recording data at many GB per hour. Because the handle leak was fairly slow – it took weeks to leak 20,000 handles – I needed a many-hour trace to be sure I could catch the leak. I focused on two strategies:

Reduce the amount of data being recorded

The original trace recorded data from these providers:

Proc_Thread+Loader+Latency+DISPATCHER+ob_handle+ob_object

You can find more about what some of these mean by looking at the output of xperf -providers but the short version is that the recommended command line doesn’t just record information about handles, it also records information about every context switch and all disk I/O (see Latency), enables the sampling profiler (also in Latency), and so much more. My first step was to trim it down to just this:

PROC_THREAD+LOADER+OB_HANDLE+OB_OBJECT

PROC_THREAD+LOADER is required to make sense of any ETW trace, and OB_HANDLE+OB_OBJECT seemed to be the critical things for handle tracing. A Microsoft contact then told me that OB_OBJECT was not needed so I removed that. The recommended command also had six different parameters to the -stackwalk flag, when all I really need is the stacks for when handles are allocated. I deleted five of those and then added HandleDuplicate just in case.

With that done, the bare-minimum command for tracing handle leaks is this:

xperf.exe -start -on PROC_THREAD+LOADER+OB_HANDLE -stackwalk HandleCreate+HandleDuplicate

You can see a more full-featured command line in the batch file on github but this shows the minimal idea. It starts tracing, records enough information to attribute events to the right thread and module (PROC_THREAD+LOADER), records information on handle operations (OB_HANDLE) and records call stacks on handle creation and handle duplication. Perfect!

Reduce the amount of data being generated

I then loaded the (still-too-large) traces into WPA and looked to see where all of the data was coming from. The processes that were generating the most events were Task Manager and Chrome Remote Desktop. The high amount of traffic meant that these two processes were creating and destroying a lot of handles. This was not normally a problem (they weren’t leaking handles) but it was making my trace larger, so the solution was clear – shut them down while tracing. Closing Task Manager was easy enough, but since I was doing this investigation from home on my workstation in the office it was less convenient to close Chrome Remote Desktop. But, no worries. I made sure everything was running correctly and then disconnected (to reduce the amount of data being generated) and then reconnected twelve hours later. This sort of tactic – shutting down programs that are generating excessive data – is often useful with ETW because it generally records information about the entire system, including all running processes.

Analysis

Now I had some large-but-manageable traces (~970 MB, compressed) and I could start the analysis. I loaded the trace into Windows Performance Analyzer and eventually found the handle tracking graph. It’s in Graph Explorer, under Memory (???), Handles, Outstanding Count by Process. Perfect!

Opening this gives a graph showing handle counts over time for every process. There’s a lot of activity over twelve hours. Zooming in to a ten-minute time span shows a flurry of activity. Processes are created, allocate some handles, and then die, creating a view like a digital Monument Valley. Other processes allocate and free handles with no clear pattern, and others seem to increase their handle count indefinitely:

Handles-per-process over time, lots of chaotic lines

The graph was pretty, but I decided I wanted raw numbers so I looked at the table. The default view looked pretty damning for Chrome’s browser process with almost 1.5 million handles allocated. But wait…

WPA table that shows 1.5 million handles allocated by one Chrome process

The handle leak wasn’t in the browser process, and it was nowhere near 1.5 million handles. Well, it turns out that the default view is misleading. While the graph/table are called “Outstanding Count by Process” and that is indeed what the graph shows, the table actually shows cumulative count by process. That is, the browser process allocated 1.5 million handles over twelve hours, but it freed pretty much all of them, so who cares?

What I wanted to see was outstanding handles – allocated but not freed. The table can show that data, but it sure doesn’t make it easy.

In the heap tracing tables there is a column labeled Type. Now that is a uniquely poor choice for a column name, but it is a very important column. Crucially, in the context of the Type column “Inside” means inside the time range recorded by the trace and “Outside” means outside the time range recorded by the trace. The available types are:

  • AIFI (Allocated Inside Freed Inside – not leaked)
  • AOFO (Allocated Outside Freed Outside – who knows!)
  • AOFI (Allocated Outside Freed Inside – not leaked)
  • AIFO (Allocated Inside Freed Outside – potentially leaked!)

AIFO are the interesting events because they were allocated while the trace was being recorded and then not freed before trace recording stopped. They might have been freed a few seconds later, but if you see enough of them on the same call stack you get suspicious…

That’s how heap tracing works. Handle tracing took that concept and changed it just enough to be confusing.

  1. They renamed the column from Type to Lifetime. Lifetime is definitely a better name, but there is inevitably some confusion that comes from having two names for the same concept.
  2. They made the column off by default. This is consistent with heap tracing, which has the Type column off by default, but it is inexplicable. This is the single most important column in the table so I don’t understand why it isn’t front and center.
  3. In addition to having the column off by default they actually hide it! Normally you can right-click on any column header, select More Columns…, and see a list of columns that you can enable. Many users (me for a long time) probably assume that the list of columns in the menu is complete. It is not. For some columns – including the vitally important Lifetime column – you have to invoke the View Editor (Ctrl+E or click the gear box to the right of “Outstanding Count by Process”) and then drag the Lifetime column from the list of Available Columns on the left to the list on the right. Sigh…

I reported these problems to the WPA lead and they agreed that improvements were needed. Now that WPA (Preview) is available in the Microsoft store we should expect that fixes like this will ship faster than ever before.

A final issue to be aware of with ETW’s handle tracing is that it seems to have some accounting errors. It will report leaks that are impossible given the “handle -s” output, so be sure to cross-check your results before spending too much time.

Finding the bug

With all of that learning and discovery behind me I arrived at this view. It showed 527 event handles leaked by the gmail process in twelve hours, with 510 of them leaked on the same call stack, ending in the WaitableEvent constructor:

Call stack for handle leak, going through WaitableEvent

A text version of the key call stack is here:

base::`anonymous namespace'::ThreadFunc
blink::scheduler::WorkerThread::SimpleThreadImpl::Run
base::RunLoop::Run
base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::Run
base::MessagePumpDefault::Run
base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::DoWork
base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::DoWorkImpl
base::TaskAnnotator::RunTask
blink::WorkerThread::InitializeSchedulerOnWorkerThread
blink::scheduler::WorkerScheduler::WorkerScheduler
blink::scheduler::NonMainThreadSchedulerImpl::CreateTaskQueue
blink::scheduler::NonMainThreadSchedulerHelper::NewTaskQueue
base::sequence_manager::internal::SequenceManagerImpl::CreateTaskQueueImpl
base::sequence_manager::internal::TaskQueueImpl::TaskQueueImpl
base::internal::OperationsController::OperationsController
base::WaitableEvent::WaitableEvent
KernelBase.dll!CreateEventW

The event object was allocated by and stored in Chrome’s WaitableEvent class, managed by its ScopedHandle member variable. That meant that if we were leaking these handles we were also leaking the WaitableEvent objects. If we were leaking those then what else were we leaking?

I used heap snapshot tracing to monitor the gmail process. This was familiar territory for me so it went smoothly. I then looked for potential leaks with call stacks similar to the handle leak and found some. I then looked around for other potential leaks with counts that were similar and that looked plausibly related and I found a set that seemed related to leaks of IDBFactory and OperationsController objects.

I used Chrome’s developer tools to see if gmail was leaking IDB related objects. It was not, so I wasn’t able to blame this on the gmail team.

Collaboration

Several people ended up contributing to the analysis. I understood how to interpret the traces and I could reproduce the leaks on custom-instrumented versions of Chrome, but I am terrible at JavaScript and I don’t understand our IDB architecture. One of my coworkers understood the architecture of the leaking objects. This coworker realized that the leaked allocations were happening on service worker threads and were being leaked whenever service worker threads went away while an IndexedDB connection was open. It turns out this happens quite regularly on some websites, and Chrome was handling this poorly. That’s the bug!

Further investigation supported this theory – the increases in leaked object counts were correlated with service worker threads exiting. I then realized that the service workers were associated with Google Drive’s offline mode. If I disabled offline mode then the leaks went away.

It took months to understand the bug, but then a fix was created fairly promptly (not by me – I still don’t understand that part of the code) and landed in the Chromium repo a few days later.

The fix landed in May in M92 which started rolling out to regular users around July 21st 2021. If you don’t have the fix already then you will soon.

Stress testing

Once the bug was understood I could do stress tests. Using a version of Chrome with the bug I left gmail, sheets, docs, and drive open simultaneously. All of these use offline mode so all of them leaked handles. My Chrome window looked like this:

Chrome tab bar showing gmail, sheets, docs, and drive

After leaving buggy Chrome running like this for a few days Task Manager looked like this when sorted by handles:

Task Manager details tab sorted by handles, with WPA and chrome dominating

Five of the top eleven processes are chrome.exe (four renderers plus the browser process). Once I updated my browser the four renderers no longer show up in the top list. But what about the other processes?

  • WPA.exe is Windows Performance Analyzer – the tool I use to analyze the handle-leak traces. It leaves about 1,700 .symcache files open, and seems to have an Event handle leak, just like Chrome. I’ve reported these issues to the developer and they are investigating.
  • System has a mixture of Event handles, File handles, IoCompletion handles, Process handles, WaitCompletionPacket handles, and others. I don’t know if these represent leaks. I’m guessing they do not.
  • This copy of dllhost.exe hosts thumbcache.dll. A bit of experimentation showed that if I create new .mp4 files then this process leaks 12+ WaitCompletionPacket handles per file when explorer creates the thumbnails. When I’m cleaning up vacation videos this can easily leak hundreds or thousands of handles. I mentioned this on twitter, twice. I also reported it on Feedback Hub (only visible to non-corporate users on Windows 10). An ETW trace showing me creating 20 .mp4 files and leaking hundreds of handles can be found here. This should be fixed.
  • The last one is IntelTechnologyAccessService.exe, which leaks Event handles. Without symbols or any idea of what it does I can only tell that the leaks are from their core.dll. I tweeted about the leak and got a swift response from an Intel developer who wants to fix the leak.

I did the same stress test both with offline mode disabled and with fixed versions of Chrome (with offline mode re-enabled). This made it clear that disabling offline mode stopped the bug from appearing, and it proved that the fix was working. The handle leaks were gone, as were the associated memory leaks.

Task Manager details tab sorted by handles, with chrome gone due to the bug fix

When running the fixed version of Chrome the Windows Task Manager looked more like this. Chrome’s browser and GPU processes (not shown) still have more than 1,000 handles open, due to how they work, but the renderer processes are using few enough handles that none of them make it onto the first page of results anymore – problem solved. dllhost.exe has moved up the rankings because I stress-tested it by creating 512 new .mp4 files, and explorer showed up because apparently my .mp4 testing revealed an event handle leak in it. Sigh…

Lessons

Handle leaks are not the worst thing ever. Process handle leaks are quite expensive, but event handle leaks are not. Except that in some cases – like Chrome’s – the handle leak may correlate with other leaks, so they are usually worth investigating.

Pay attention to Task Manager, especially for your own processes, so that you can avoid this problem.

I still don’t know how to investigate GDI handle leaks – if somebody knows then please let me know.

You can read the multi-month investigation – missteps and all – in the Chromium bug.

Here is the twitter announcement, hacker news discussion, and reddit discussion.


14 Responses to Finding Windows HANDLE leaks, in Chromium and others

  1. Jeff Stokes says:

    Nice post sir, was worth the wait, as they say 🙂

    I suspect Yong had dispatcher+latency due to it being required as a bare minimum for wait analysis, back in the day, using the internal WPA and xperfview. Probably habit more than anything I’d guess.

    Anyway thanks for the post. Very helpful.

  2. David Millington says:

    Re GDI handle leaks, this won’t answer “how” to solve them, but an engineer on our team recently pointed me to GDIView: https://www.nirsoft.net/utils/gdi_handles.html One problem with GDI leaks is that you just get a count of GDI handles, but you don’t know what they’re handles to (fonts? bitmaps? etc.) This tool tells you what kind of handles you have.

    For us, we have the source to our apps and our source control history, so spotting (for example) a bitmap leak is relatively straightforward to solve because we can look at code that uses bitmaps that was changed recently. For your kind of (very impressive) debugging, it may not be as much use. I don’t believe it lets you get stack traces, for example: it’s a monitoring, not debugging/tracing, tool.

  3. Alastair says:

    “and eventually found the handle tracking graph. It’s in Graph Explorer, under Memory (???), Handles, Outstanding Count by Process. Perfect!”

    Any more info on this as I must be missing something as I can’t understand what I need to do! I can see memory in the Graph Explorer but then what is the action?

    Thanks!

    • brucedawson says:

      You can double-click on entries in Graph Explorer to add them to the analysis view, or drag them over.

      I wrote a bit about how to use WPA here:

      WPA–Xperf Trace Analysis Reimagined

      I try not to get too much into the details in these investigative reporting pieces because otherwise they get way too long. This page has a lot of tutorials for learning trace analysis:

      ETW Central

      • Alastair says:

        Thank you so much for this article. I tracked down a leak in my code (in some Direct2D stuff) using the techniques you mention in the article. It had been bugging us for ages and now all sorted. I owe you a beer.

  4. akraus1 says:

    Are you aware of Performance HUD? It can record GDI/User object leaks, support for which was added to Windows 10 1909. See https://aloiskraus.wordpress.com/2020/07/20/ms-performance-hud-analyze-eventlog-reading-performance-in-realtime/. This one comes from the Office Team which is great and allows you to record Handle/GDI/User object leaks for days or weeks because it filters down to the outstanding allocations. It is also able to process a many-GB trace for a given process which would otherwise be too large for WPA to load. I have tried with an 80 GB trace and it did work. It also has an extensibility interface where you can add your own tabs, but the parsing of ETW events then needs to be done all by yourself. For leaks this is the goto tool which solves practically every leak problem with the least amount of hassle.

  5. Zhentar says:

    > I still don’t know how to investigate GDI handle leaks – if somebody knows then please let me know.

    Hmmm, last year I noticed a new table was added to the WPA SDK previews, and now you tell me there’s a working version on the store. So if I try enabling the events it was looking at…

    -on Microsoft-Windows-Win32k:0x30000000000::'stack'

    And….. by the gods! https://i.imgur.com/RAURrA5.png

    I’m pretty sure these events are a recent addition. I refuse to believe these events were there all along and I just missed them.

  6. Alexander Bessonov says:

    I’m a bit late, however there is another technique that can be used to find basically any kind of resource leak, including GDI leaks:

    It uses the Microsoft Time Travel Debugger and might not be feasible for long-running sessions (due to large trace files generated), but might still be useful.

    As a bonus, you not only get a stack trace at allocation, but can also start debugging right away, including the ability to “go backwards”.
