Hey Synaptics, Can You Please Stop Polling!

TL;DR – be wary of buying a Lenovo laptop or any other laptop that uses a Synaptics touch pad until Synaptics ships a fixed driver. Their driver has a memory leak and they have a battery-life bug that causes Windows to repeat the same system-scan once a second. So far Synaptics has failed to respond so the wait for a fix must be assumed to be infinite.

Now for more details to back up my claims…

My new laptop has great battery life and I want to keep it that way so if I notice something consuming CPU time for no good reason I investigate. If it’s a web page with battery-wasting ads I close it, and if it’s a process that keeps doing background work I’ll kill it.

So, when I noticed that Task Manager said that WmiPrvSE.exe was consuming ~0.7% of my CPU time (~5.6% of a core) continuously I investigated.

I shut down all visible processes and a few invisible ones but the CPU usage continued, so I grabbed an ETW trace. It showed WmiPrvSE.exe was doing about 200 ms of work, then sleeping for one second, and then doing the work again. This means that WmiPrvSE.exe was actually using more CPU time than Task Manager had indicated – a full 16% of a core (200 ms every 1200 ms). The image below shows five seconds of activity on my system which is blissfully idle except for WmiPrvSE.exe:

CPU Usage (Precise) data is derived from context switch data and is therefore very, umm, precise

The next thing I did was look at the CPU Usage (Sampled) data to see where the time was going. A flame graph gave me the best visualization. I know practically nothing about WMI but it seems to me that if Process::LoadCheapPropertiesNT is consuming most of your CPU time then maybe ‘cheap’ is the wrong adjective to use. And if SetDateTime is a major cost then it might be that you are repeatedly loading libraries and strings and, I dunno, maybe this code should be ten times less costly:

Bar widths represent number of samples, stacked bars represent call stacks

Poking around at the trace a bit more I also noticed that tzrez.dll is opened 676 times whenever WmiPrvSE.exe springs to life and tzrez.dll.mui is opened 338 times, which would explain the LoadLibrary and LoadString costs which make SetDateTime so expensive. It looks like there are some easy optimization opportunities here for Microsoft. But, why is this code springing to life every 1.2 seconds?

As somebody with no WMI expertise I started sharing information on twitter and it worked brilliantly. Koby Kahane gave me several crucial hints including his interpretation of the flame graph above:

This is just the WMI service enumerating Win32_Process instances. You need to find the client process doing the queries to find the culprit.

Who is doing this query?

Normally I use context switch data (CPU Usage (Precise)) to to track interactions between processes, as one process readies the next, but this time that just told me that WmiPrvSE.exe was being readied (woken up) by svchost.exe (3228) running Windows Management Instrumentation. That merely let’s me restate the question as “who is asking the Windows Management Instrumentation svchost to request this query?” The next suggestion from twitter was this:

I suggest recording Microsoft-Windows-WMI-Activity, it should have events for the client issuing the queries.

Unfortunately that didn’t help. The type of query that had been set up was a fire-and-forget query where the WMI service automatically repeats it indefinitely. This seems like a dangerous option to give developers, especially since there are better alternatives:

A polling query is particularly tragic in the case of Win32_Process, since Win32_ProcessStartTrace does the same thing without polling.

Then I got two crucial hints – one is that there was probably yet another proxy process (unsecapp.exe) and the other was that results were probably only being reported when new processes were created:

But I think unsecapp.exe would be readying the client, if you rapidly create processes to make the notification query have results.

That explained why I could see no signs of results being sent to the client process, because I was recording my traces on an idle machine with no processes being created.

So, I recorded another trace, this time creating new processes during the trace. The details were messy but they showed that:

  1. svchost.exe (3228), deep inside CCoreQueue::Execute, sends a command to WmiPrvSE.exe which (presumably) triggers its work
  2. When WmiPrvSE.exe is finished its work it lets svchost.exe (3228) know
  3. svchost.exe (3228), deep inside CQueueingEventSink::DeliverAll, then wakes up unsecapp.exe
  4. unsecapp.exe readies four different processes: itself, svchost.exe (3228), System, and SynTPEnh.exe. One of this must be the process that created the polling WMI query, and the only one it can be is SynTPEnh.exe, signed by Synaptics Incorporated.

I killed the SynTPEnh.exe process but it came back to life like a digital zombie. So I killed its friends (SynTPEnhService.exe and SynTPHelper.exe) and then killed it and this time it stayed dead, and WmiPrvSE.exe finally stopped its work!

SynTPEnh.exe is the Synaptics TouchPad 64-bit Enhancements, and this is not the first time I’d noticed it misbehaving. Just one month earlier I had complained that it was using 662 MiB of private working set, and when investigating its new bad behavior it was consuming 942 MiB:

964,104 KiB / 1024 = 941.507 MiB

The search for an easier way

The whole investigation process for this CPU consumption seemed unreasonably messy. It bothers me that a process can create a polling query like this without having to leave a clearer calling card. It was pointed out that you can open up Event Viewer, go to Applications and Services Logs, Microsoft, Windows, WMI-Activity, Operational, then look for likely queries around the time that you rebooted. Here’s what I found:

Namespace = ROOT\CIMV2; NotificationQuery = SELECT * FROM __InstanceCreationEvent WITHIN 1 WHERE TargetInstance ISA 'Win32_Process'; UserName = LAPTOP-EAODEGEE\Bruced; ClientProcessID = 2764, ClientMachine = LAPTOP-EAODEGEE; PossibleCause = Temporary

The ClientProcessID identifies which process made the query. I’m not sure I could have found the right query if I didn’t already know what process ID I was looking for. If this is the easy way of tracking down who made these battery-depleting queries then maybe it needs to be made easier.

How long, has this been going on

I’d had my laptop for almost three months when I noticed this so naturally I wondered why I hadn’t noticed it before. I’m enough of a pack rat that I had a trace from when I first got the laptop (from investigating resume-from-standby hangs caused by a buggy Intel driver, sigh…) and I could see that this bad behavior was not showing up in that trace. I then looked at the version number of SynTPEnh.exe in that trace (ETW traces contain a ton of useful information) and it was 19.3.4.75. The badly behaved version was 19.3.4.83, so clearly a driver upgrade had happened and that caused this problem. I then found the driver install entry that showed exactly when the bad driver arrived – apparently it took me more than six weeks to notice:

image

What about that memory leak?

When I’d first noticed that SynTPEnh.exe was leaking memory I couldn’t reproduce it, but this investigation gave me a new idea: maybe the new-process notifications were related to the memory leak. I happened to have a process creation stress test lying around from investigating a Windows 10 performance problem so I ran it. To nobody’s surprise running it caused the SynTPEnh.exe private working set to increase by about 20 MiB – about 20 KiB per process (technically ~20.5 KiB per process because base-2 is so awesome).

Since the memory leak is proportional to how many processes are created I can safely assume that most users will not see as much memory leaked as I have. You really have to build Chrome or run ProcessCreateTests.exe if you want to leak a huge amount of memory. On the other hand, most consumers don’t have 32 GiB laptops and can’t afford to leak large amounts of memory. Here are some more leak statistics:

  1. Restarting Chrome (26 tabs opened): ~1 MiB leaked
  2. Building Chrome ‘base’ target (a tiny portion of Chrome): ~12 MiB leaked
  3. Running ProcessCreateTests.exe: ~20 MiB leaked

In other words, the memory leak was exacerbated by being a Chrome developer and by investigating the Windows 10 performance problem. It’s safe to say that I was hit worse than anyone. But it still needs fixing.

Next steps

I used Synaptics’ “Contact Us” form to report these bugs and I tweeted information to them but I have heard no response.

I also contacted Lenovo since they are the maker of my laptop. They responded but I’m not sure how seriously they are taking the issue and I don’t feel like experimenting with driver versions that “might” fix the issue.

Microsoft suggested that they could use shimming to fix the issue. And they encouraged me to file a feedback report, which I did. I’m not thrilled with shimming as a solution, but unresponsive developers is the perfect use-case for it.

I’m not the first person to notice this. The bug was reported to Lenovo on May 30th and May 31st. Various suggestions were made but the issue can’t be considered fixed until an automatic update goes out to all customers.

For my own sanity I used the Roll Back Driver functionality to go back to a better version:

image

So, my laptop battery life is now back to normal, until the next update comes along. But a few million unwary customers are still wasting power and memory until this issue is fixed.

Apparently Lenovo has fixed this for at least some Yoga machines.

One reader suggested that uninstalling the Synaptics driver and using the Microsoft Precision Touch Pad drivers might work better.

An alternative WMI investigative technique was suggested here.

Hacker news discussion is here.

Reddit discussion is here.

Advertisements

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
This entry was posted in Investigative Reporting, Uncategorized. Bookmark the permalink.

27 Responses to Hey Synaptics, Can You Please Stop Polling!

  1. mohammed imran says:

    Are all users using the synaptic pad with this driver affected?

  2. ericlaw says:

    Curious: Does ‘powercfg /energy’ give any useful output here, or is the wasted CPU hidden from its analysis?

  3. Mohammed Imran says:

    19.2.17.55/ 2/7/2017/Synaptics

  4. Is it still behaving that way if mouse is plugged in and touchpad option is set to disable when external pointing device is used?

  5. Koby Kahane says:

    I think you mean tzres.dll when you say trez.dll.

    What would you suggest, event wise, be added to make tracking down the initiators of WMI notification queries easier? I guess adding the PID of the notification query initiator to the event for the “polling query” done every interval would go a long way.

    • brucedawson says:

      D’oh – fixed. Writing is hard.

      I’d like to see a WMI event issued every time that the query happens. That would make finding the culprit in an ETW trace or in event viewer much simpler. Only emitting these events when the query is initiated seems perverse, at least in this context.

  6. Pingback: New top story on Hacker News: Hey Synaptics, Can You Please Stop Polling – The Internet Yard

  7. Very good analysis and use of Windows Performance Toolkit to troubleshoot performance issues.

  8. poiru says:

    Have you tried searching e.g. “synaptics engineer touchpad” on LinkedIn? This is what I usually end up doing when I’m unable to reach the relevant people through normal support channels. I messaged someone (Perry) on your behalf, I’ll let you know if they respond!

    • poiru says:

      Reply from Perry: “I got a report back from the driver team that the issue was fixed a while ago in version 19.3.4.99 of the driver. I’m not sure if it is currently available on Windows Update or not, but users should be able to get an updated driver by contacting Lenovo.”

      • brucedawson says:

        Thanks for the update. But…

        The issue isn’t resolved until the fixed driver is on Windows Update for everyone. Contacting Lenovo/Dell/whomever is a non-starter because the vast majority of users have no way to realize that they have a problem that requires contacting Lenovo.

  9. Carlos says:

    Thanks for this blog post- I knew I wasn’t going crazy when my laptop suddenly started having awful battery life. If I had a dollar for every time I ran powercfg /batteryreport to try and figure out what the heck was going on, I’d be rich.

  10. Very interesting, I had a look and found that Dell’s quickset is hammering my laptop with WMI queries too

    • brucedawson says:

      What model laptop do you have? Dell wants to know:

      • brucedawson says:

        Dell also says: “Hi Bruce, Can you please ask the user to reach out to us and DM the service tag of the system. ^IR”

        So, yeah, can you do that? That will help Dell help you, and perhaps help them help lots of other people.

  11. Haofei Wu says:

    Thank you for the information you posted! I have forwarded this information to our BIOS team to check.

    • brucedawson says:

      It’s still not fixed for my Lenovo P51, but the fixes for other systems are an encouraging sign (as long as they go out through an auto-update system). It is also frustrating that Lenovo couldn’t tell me the state of this bug when I asked, and Synaptics never even responded.

  12. Mohammed Imran says:

    Lenovo brand and its products should be avoided at all cost.

  13. jdrch says:

    The badly behaved version was 19.3.4.83, so clearly a driver upgrade had happened and that caused this problem.

    This is exactly what I was afraid of when Microsoft announced their plans to have Windows 10 update drivers automatically. Microsoft Update has a long history of delivering (obviously) bad drivers; I once had Windows 8 try to update my Bluetooth driver with one that didn’t even have “Atheros” spelled correctly.

    I also had a similar problem earlier this year with Sony Vaio laptop running Windows 10 Insider Slow Ring. It would run out of memory when left idle. The problem turned out to be a bad Realtek driver: https://jdrch.wordpress.com/2016/12/28/how-to-fix-the-your-computer-is-low-on-memory-windows-10-error-message/

  14. My HP Envy 17 laptop (3 years old now…) has a Synaptics trackpad and the same rubbish 64-bit extension thing. I too found over the years that occasionally it’d go bad for a while and then not. I just got used to killing it in Task Manager, and if it was particularly bad, finding the scheduled task that started it on logon and disabling that for a while.
    Rubbish 😦

    My new Thinkpad has a Synaptics trackpad but it’s a precision one so it doesn’t seem to need the same “helpers”.

    From what I could tell, the “helper” thing was enabling two finger scroll, pinch-zoom, etc. That seems like something pretty essential to the driver. I can only imagine it wanted to know about all processes so it could hook sending them windows messages, etc. Surely there’s a more efficient way though 😦

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s