TL;DR – be wary of buying a Lenovo laptop or any other laptop that uses a Synaptics touch pad until Synaptics ships a fixed driver. Their driver has a memory leak and they have a battery-life bug that causes Windows to repeat the same system-scan once a second. So far Synaptics has failed to respond so the wait for a fix must be assumed to be infinite.
Update: October 15, 2017: an updated driver for my laptop was released September 11, 2017, six days after I published this post. However I have yet to see the driver on Windows Update, Lenovo’s System Update, or in any communications from Lenovo or Synaptics. So, I don’t know if or how ‘normal’ customers are supposed to get this fix. I only found out through the kindness of a stranger.
Now for more details to back up my claims…
My new laptop has great battery life and I want to keep it that way so if I notice something consuming CPU time for no good reason I investigate. If it’s a web page with battery-wasting ads I close it, and if it’s a process that keeps doing background work I’ll kill it.
So, when I noticed that Task Manager said that WmiPrvSE.exe was consuming ~0.7% of my CPU time (~5.6% of a core) continuously I investigated.
I shut down all visible processes and a few invisible ones but the CPU usage continued, so I grabbed an ETW trace. It showed WmiPrvSE.exe was doing about 200 ms of work, then sleeping for one second, and then doing the work again. This means that WmiPrvSE.exe was actually using more CPU time than Task Manager had indicated – a full 16% of a core (200 ms every 1200 ms). The image below shows five seconds of activity on my system which is blissfully idle except for WmiPrvSE.exe:
The next thing I did was look at the CPU Usage (Sampled) data to see where the time was going. A flame graph gave me the best visualization. I know practically nothing about WMI but it seems to me that if Process::LoadCheapPropertiesNT is consuming most of your CPU time then maybe ‘cheap’ is the wrong adjective to use. And if SetDateTime is a major cost then it might be that you are repeatedly loading libraries and strings and, I dunno, maybe this code should be ten times less costly:
Poking around at the trace a bit more I also noticed that tzrez.dll is opened 676 times whenever WmiPrvSE.exe springs to life and tzrez.dll.mui is opened 338 times, which would explain the LoadLibrary and LoadString costs which make SetDateTime so expensive. It looks like there are some easy optimization opportunities here for Microsoft. But, why is this code springing to life every 1.2 seconds?
As somebody with no WMI expertise I started sharing information on twitter and it worked brilliantly. Koby Kahane gave me several crucial hints including his interpretation of the flame graph above:
Who is doing this query?
Normally I use context switch data (CPU Usage (Precise)) to to track interactions between processes, as one process readies the next, but this time that just told me that WmiPrvSE.exe was being readied (woken up) by svchost.exe (3228) running Windows Management Instrumentation. That merely let’s me restate the question as “who is asking the Windows Management Instrumentation svchost to request this query?” The next suggestion from twitter was this:
Unfortunately that didn’t help. The type of query that had been set up was a fire-and-forget query where the WMI service automatically repeats it indefinitely. This seems like a dangerous option to give developers, especially since there are better alternatives:
Then I got two crucial hints – one is that there was probably yet another proxy process (unsecapp.exe) and the other was that results were probably only being reported when new processes were created:
That explained why I could see no signs of results being sent to the client process, because I was recording my traces on an idle machine with no processes being created.
So, I recorded another trace, this time creating new processes during the trace. The details were messy but they showed that:
- svchost.exe (3228), deep inside CCoreQueue::Execute, sends a command to WmiPrvSE.exe which (presumably) triggers its work
- When WmiPrvSE.exe is finished its work it lets svchost.exe (3228) know
- svchost.exe (3228), deep inside CQueueingEventSink::DeliverAll, then wakes up unsecapp.exe
- unsecapp.exe readies four different processes: itself, svchost.exe (3228), System, and SynTPEnh.exe. One of this must be the process that created the polling WMI query, and the only one it can be is SynTPEnh.exe, signed by Synaptics Incorporated.
I killed the SynTPEnh.exe process but it came back to life like a digital zombie. So I killed its friends (SynTPEnhService.exe and SynTPHelper.exe) and then killed it and this time it stayed dead, and WmiPrvSE.exe finally stopped its work!
SynTPEnh.exe is the Synaptics TouchPad 64-bit Enhancements, and this is not the first time I’d noticed it misbehaving. Just one month earlier I had complained that it was using 662 MiB of private working set, and when investigating its new bad behavior it was consuming 942 MiB:
The search for an easier way
The whole investigation process for this CPU consumption seemed unreasonably messy. It bothers me that a process can create a polling query like this without having to leave a clearer calling card. It was pointed out that you can open up Event Viewer, go to Applications and Services Logs, Microsoft, Windows, WMI-Activity, Operational, then look for likely queries around the time that you rebooted. Here’s what I found:
The ClientProcessID identifies which process made the query. I’m not sure I could have found the right query if I didn’t already know what process ID I was looking for. If this is the easy way of tracking down who made these battery-depleting queries then maybe it needs to be made easier.
How long, has this been going on
I’d had my laptop for almost three months when I noticed this so naturally I wondered why I hadn’t noticed it before. I’m enough of a pack rat that I had a trace from when I first got the laptop (from investigating resume-from-standby hangs caused by a buggy Intel driver, sigh…) and I could see that this bad behavior was not showing up in that trace. I then looked at the version number of SynTPEnh.exe in that trace (ETW traces contain a ton of useful information) and it was 220.127.116.11. The badly behaved version was 18.104.22.168, so clearly a driver upgrade had happened and that caused this problem. I then found the driver install entry that showed exactly when the bad driver arrived – apparently it took me more than six weeks to notice:
What about that memory leak?
When I’d first noticed that SynTPEnh.exe was leaking memory I couldn’t reproduce it, but this investigation gave me a new idea: maybe the new-process notifications were related to the memory leak. I happened to have a process creation stress test lying around from investigating a Windows 10 performance problem so I ran it. To nobody’s surprise running it caused the SynTPEnh.exe private working set to increase by about 20 MiB – about 20 KiB per process (technically ~20.5 KiB per process because base-2 is so awesome).
Since the memory leak is proportional to how many processes are created I can safely assume that most users will not see as much memory leaked as I have. You really have to build Chrome or run ProcessCreateTests.exe if you want to leak a huge amount of memory. On the other hand, most consumers don’t have 32 GiB laptops and can’t afford to leak large amounts of memory. Here are some more leak statistics:
- Restarting Chrome (26 tabs opened): ~1 MiB leaked
- Building Chrome ‘base’ target (a tiny portion of Chrome): ~12 MiB leaked
- Running ProcessCreateTests.exe: ~20 MiB leaked
In other words, the memory leak was exacerbated by being a Chrome developer and by investigating the Windows 10 performance problem. It’s safe to say that I was hit worse than anyone. But it still needs fixing.
I also contacted Lenovo since they are the maker of my laptop. They responded but I’m not sure how seriously they are taking the issue and I don’t feel like experimenting with driver versions that “might” fix the issue.
Microsoft suggested that they could use shimming to fix the issue. And they encouraged me to file a feedback report, which I did. I’m not thrilled with shimming as a solution, but unresponsive developers is the perfect use-case for it.
I’m not the first person to notice this. The bug was reported to Lenovo on May 30th and May 31st. Various suggestions were made but the issue can’t be considered fixed until an automatic update goes out to all customers.
For my own sanity I used the Roll Back Driver functionality to go back to a better version:
So, my laptop battery life is now back to normal, until the next update comes along. But a few million unwary customers are still wasting power and memory until this issue is fixed.
Apparently Lenovo has fixed this for at least some Yoga machines.
One reader suggested that uninstalling the Synaptics driver and using the Microsoft Precision Touch Pad drivers might work better.
An alternative WMI investigative technique was suggested here.
Hacker news discussion is here.
Reddit discussion is here.