Zombies probably won’t consume 32 GB of your memory like they did to me, but zombie processes do exist, and I can help you find them and make sure that developers fix them. Tool source link is at the bottom.
Is it just me, or do Windows machines that have been up for a while seem to lose memory? After a few weeks of use (or a long weekend of building Chrome over 300 times) I kept noticing that Task Manager showed me running low on memory, but it didn’t show the memory being used by anything. In the example below task manager shows 49.8 GB of RAM in use, plus 4.4 GB of compressed memory, and yet only 5.8 GB of page/non-paged pool, few processes running, and no process using anywhere near enough to explain where the memory had gone:
My machine has 96 GB of RAM – lucky me – and when I don’t have any programs running I think it’s reasonable to hope that I’d have at least half of it available.
Sometimes I have dealt with this by rebooting but that should never be necessary. The Windows kernel is robust and well implemented so this memory disappearing shouldn’t happen, and yet…
The first clue came when I remembered that a coworker of mine had complained of zombie processes being left behind – processes that had shut down but not been cleaned up by the kernel. He’d even written a tool that would dump a list of zombie processes – their names and counts. His original complaint was of hundreds of zombies. I ran his tool and it showed 506,000 zombie processes!
Update, November 2020: the original FindZombieHandles.exe tool only tracked process handles that had not been closed. After hitting a thread handle leak in an internal tool it was updated to detect and report on thread handle leaks as well.
It occurred to me that one cause of zombie processes could be one process failing to close the handles to other processes. And the great thing about having a huge number of zombies is that they are harder to hide. So, I went to Task Manager’s Details tab, added the Handles column, and sorted by it. Voila. I immediately saw that CcmExec.exe (part of Microsoft’s System Management Server) had 508,000 handles open, which is both a lot and also amazingly close to my zombie count.
I held my breath and killed CcmExec.exe, unsure of what would happen:
The results were as dramatic as I could imagine. As I said earlier, the Windows kernel is well written and when a process is killed then all of its resources are freed. So, those 508,000 handles that were owned by CcmExec.exe were abruptly closed and my available memory went up by 32 GB! Mystery solved!
What is a zombie process?
Until this point we weren’t entirely sure what was causing these processes to hang around. In hindsight it’s obvious that these zombies were caused by a trivial user-space coding bug. The rule is that when you create a process you need to call CloseHandle on its process handle and its thread handle. If you don’t care about the process then you should close the handles immediately. If you do care – if you want to wait for the process to quit – WaitForSingleObject(hProcess, INFINITE); – or query its exit code – GetExitCodeProcess(hProcess, &exitCode); – then you need to remember to close the handles after that. Similarly, if you open an existing process with OpenProcess you need to close that handle when you are done.
If the process that holds on to the handles is a system process then it will even continue holding those handles after you log out and log back in – another source of confusion during our investigation last year.
So, a zombie process is a process that has shut down but is kept around because some other still-running process holds a handle to it. It’s okay for a process to do this briefly, but it is bad form to leave a handle unclosed for long.
Where is that memory going?
Another thing I’d done during the investigation was to run RamMap. This tool attempts to account for every page of memory in use. Its Process Memory tab had shown hundreds of thousands of processes that were each using 32 KB of RAM and presumably those were the zombies. But ~500,000 times 32 KB only equals ~16 GB – where did the rest of the freed up memory come from? Comparing the before and after Use Counts pages in RamMap explained it:
We can plainly see the ~16 GB drop in Process Private memory. We can also see a 16 GB drop in Page Table memory. Apparently a zombie process consumes ~32 KB of page tables, in addition to its ~32 KB of process private memory, for a total cost of ~64 KB. I don’t know why zombie processes consume that much RAM, but it’s probably because there should never be enough of them for that to matter.
A few types of memory actually increased after killing CcmExec.exe, mostly Mapped File and Metafile. I don’t know what that means but my guess would be that that indicates more data being cached, which would be a good thing. I don’t necessarily want memory to be unused, but I do want it to be available.
Trivia: rammap opens all processes, including zombies, so it needs to be closed before zombies will go away
I tweeted about my discovery and the investigation was picked up by another software developer and they reproed the bug using my ProcessCreateTests tool. They also passed the information to a developer at Microsoft who said it was a known issue that “happens when many processes are opened and closed very quickly”.
Windows has a reputation for not handling process creation as well as Linux and this investigation, and one of my previous ones, suggest that that reputation is well earned. I hope that Microsoft fixes this bug – it’s sloppy.
Why do I hit so many crazy problems?
I work on the Windows version of Chrome, and one of my tasks is optimizing its build system, which requires doing a lot of test builds. Building chrome involves creating between 28,000 and 37,000 processes, depending on build settings. When using our distributed build system (goma) these processes are created and destroyed very quickly – my fastest full build ever took about 200 seconds. This aggressive process creation has revealed a number of interesting bugs, mostly in Windows or its components:
- Fast process destruction led to system input hangs
- Synaptics driver leaked memory whenever a process was created
- O(n^2) log-file creation in App Verifier – my next blog post?
- A Windows kernel file buffer bug from Server 2008 R2 to Windows 10
- Windows Defender delaying each goma compiler launch by 250 ms
If you aren’t on a corporate managed machine then you probably don’t run CcmExec.exe and you will avoid this particular bug. And if you don’t build Chrome or something equivalent then you will probably avoid this bug. But!
CcmExec is not the only program that leaks process handles. I have found many others leaking modest numbers of handles and there are certainly more.
The bitter reality, as all experienced programmers know, is that any mistake that is not explicitly prevented will be made. Simply writing “This handle must be closed” in the documentation is insufficient. So, here is my contribution towards making this something detectable, and therefore preventable. FindZombieHandles is a tool, based on NtApiDotNet and sample code from @tiraniddo, that prints a list of zombies and who is keeping them alive. Here is sample output from my home laptop:
274 total zombie processes.
249 zombies held by IntelCpHeciSvc.exe(9428)
249 zombies of Video.UI.exe
14 zombies held by RuntimeBroker.exe(10784)
11 zombies of MicrosoftEdgeCP.exe
3 zombies of MicrosoftEdge.exe
8 zombies held by svchost.exe(8012)
4 zombies of ServiceHub.IdentityHost.exe
2 zombies of cmd.exe
2 zombies of vs_installerservice.exe
3 zombies held by explorer.exe(7908)
3 zombies of MicrosoftEdge.exe
1 zombie held by devenv.exe(24284)
1 zombie of MSBuild.exe
1 zombie held by SynTPEnh.exe(10220)
1 zombie of SynTPEnh.exe
1 zombie held by tphkload.exe(5068)
1 zombie of tpnumlkd.exe
1 zombie held by svchost.exe(1872)
1 zombie of userinit.exe
274 zombies isn’t too bad, but it represents some bugs that should be fixed. The IntelCpHeciSvc.exe one is the worst, as it seems to leak a process handle every time I play a video from Windows Explorer.
Visual Studio leaks handles to at least two processes and one of these is easy to reproduce. Just fire off a build and wait ~15 minutes for MSBuild.exe to go away. Or, if you “set MSBUILDDISABLENODEREUSE=1” then MSBuild.exe goes away immediately and every build leaks a process handle. Unfortunately some jerk at Microsoft fixed this bug the moment I reported it, and the fix may ship in VS 15.6, so you’ll have to act quickly to see this (and no, I don’t really think he’s a jerk).
You can also see leaked processes using Process Explorer, by configuring the lower pane to show handles, as shown here (note that both the process and thread handles are leaked in this case):
Just a few of the bugs found, not all reported
- CcmExec.exe leak, over 500,000 zombies leaked (fixes in progress)
- Program Compatibility Assistant Service leaks random processes (being investigated)
- devenv.exe leak of MSBuild.exe (fixed)
- devenv.exe leak of ServiceHub.Host.Node.x86.exe (bug filed)
- IntelCpHeciSvc.exe leaks Video.UI.exe for each video played (Intel passes the buck to Lenovo)
- RuntimeBroker.exe leak of MicrosoftEdge and Video.UI.exe (perhaps related to other bugs in RuntimeBroker.exe)
- AudioSrv service leak of Video.UI.exe
- Handle leak in Google internal tool due to old version of psutil
- Lenovo’s tphkload.exe leaks one handle, their SUService.exe leaks three handles
- Synaptic’s SynTPEnh.exe leaks one handle
- googledrivesync.exe leaks one handle (reported internally)
Process handles aren’t the only kind that can be leaked. For instance, the “Intel(R) Online Connect Access service” (IntelTechnologyAccessService.exe) only uses 4 MB of RAM, but after 30 days of uptime had created 27,504 (!!!) handles. I diagnosed this leak using just Task Manager and reported it here. I also used the awesome !htrace command in windbg to get stacks for the CreateEventW calls from Intel’s code. Think they’ll fix this?
Using Processs Explorer I could see that NVDisplay.Container.exe from NVIDIA has ~5,000 handles to \BaseNamedObjects\NvXDSyncStop-61F8EBFF-D414-46A7-90AE-98DD58E4BC99 event, creating a new one about every two minutes? I guess they want to be really sure that they can stop NvXDSync? Reported, and a fix has been checked in.
Apparently ETDCtrl.exe (11.x), some app associated with ELANTech/Synaptics trackpads, leaks handles to shared memory. The process accumulated about 16,000 handles and when the process was killed about 3 GB of missing RAM was returned to the system – quite noticeable on an 8 GB laptop with no swap.
Apparently the netprofm leaks process handles, leading to over 1,110 zombie processes on my work machine – issue filed here.
Apparently nobody has been paying attention to this for a while – hey Microsoft, maybe start watching for handle leaks so that Windows runs better? And Intel and NVIDIA? Take a look at your code. I’ll be watching you.
So, grab FindZombieHandles, run it on your machine, and report or fix what you find, and use Task Manager and Process Explorer as well.
Updates: Microsoft recommended disabling the feature that leaks handles and doing so has resolved the issue for me (and they are fixing the leaks). It’s an expensive feature and it turns out we were ignoring the data anyway! Also, all Windows 10 PIDs are multiples of four which explains why ~500,000 zombies led to PIDs in the 2,000,000+ range.