UIforETW is No Longer a CPU Hog

A few months ago I wrote about how many processes on my system were waking up and wasting CPU time for no good reason, thus wasting battery power, electricity, and CPU power. I was surprised that nobody called me out on the hypocrisy of my complaints because UIforETW, my open-source ETW trace recording and management tool, was not well behaved when idle.

This has now been fixed.

UIforETW creates numerous threads that monitor performance related aspects of the system, such as battery drain, CPU power consumption, timer status, user input, etc. These data are emitted as ETW events that show up in the recorded traces and which can then be used to help understand and analyze the ETW traces.

This is all good, and the overhead of these monitoring threads was designed to be small enough that it would not perturb the processes being profiled.

However “small enough overhead” on a busy system is not the same as “small enough overhead” on an idle system. Originally when UIforETW was open but not currently tracing these threads were still running – waking up many times per second, calculating data, and emitting events that were effectively directed to /dev/null.

As of change 72c71dd I’ve rewritten the monitors so that they start and stop with tracing. This dropped the number of context switches and the CPU usage (monitored with sysinternals’ procexp) of an idle UIforETW significantly. However I was still seeing about ten context switches per second. Hmmm…

I used UIforETW\bin\metatrace.bat to record a trace of UIforETW sitting idle. metatrace.bat is useful because it uses a different kernel provider and can therefore be used to profile UIforETW even during trace startup and shutdown. The trace showed that the context switches were on a windows message call stack. I tried using Spy++ to see what type of message was coming in but couldn’t get that to work. I then noticed that the bursts of activity were happening exactly once a second, as can be seen on the WPA screenshot below which graphs context switch counts for UIforETW against elapsed time in seconds:

A search for SetTimer in the UIforETW source showed a call that I had added months ago. Each time a timer message is received UIforETW checks to see if tracing to a file has been running too long. This is a good feature but doesn’t need to be running when tracing is not, so I changed the code so that now the timer only runs during tracing. I also changed it to wake up every thirty seconds instead of every second – precise timing is not relevant for this feature.

The net result of these changes is that now if you have UIforETW running and you aren’t recording a trace then the CPU overhead is zero. UIforETW will go for long periods of time without a single context switch. This means that UIforETW has gone from being a poster-child for a poorly behaved background application to proof that staying perfectly idle is actually quite easy. This blog post was easy to write but I still spent longer writing it than I did making the changes to UIforETW.

It’s worth noting that UIforETW didn’t lose any functionality from these changes. And it’s also worth noting that I did not try to reduce the number of threads, just the number of threads that were waking up for no reason. So, for instance, the DirectoryMonitorThread – which waits on file notification events so that UIforETW can update its trace list automatically – is still running all the time. It wakes up when it needs to do something and is otherwise just sitting on the kernel’s list of threads that don’t need to run.

So, I repeat my entreaty: write your software so that it doesn’t wake up periodically “just in case”. Wake up only when necessary, and be particularly careful about this when your application is not active. If you are wasteful then I will uninstall your software, and encourage others to do likewise.

And you should grab the latest version of UIforETW, for this improvement and a few others.

Aside: What is UIforETW? It’s an open source tool for recording and managing ETW traces to allow investigation of performance problems on Windows in incredible detail. You can find more details here (or at this more memorable url: https://tinyurl.com/etwcentral).

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048

View all posts by brucedawson →

6 Responses to UIforETW is No Longer a CPU Hog

WindowsSucks says:

May 15, 2016 at 7:44 am

Great work! If only MS could do the same, every Windows version since at least 2000 will do registry reads and whatnot every now and then and it is pretty pathetic that after 15+ years you can still see registry activity on a idle system in Process Monitor and some of it is simple stuff like polling to see if the computer name has changed!

JJ says:

May 16, 2016 at 4:42 pm

I keep seeing people avoid xperf as a profiling tool because it takes so long to load symbols. Any way to speed that up? Visual Studio loads symbols quicker so it must be doing something silly?

- brucedawson says:
  
  May 16, 2016 at 7:18 pm
  
  To clarify: xperf doesn’t actually load symbols, that doesn’t happen until trace analysis time (generally). So WPA is what is loading symbols. But, anyway…
  
  There are a few issues that make WPA seem slow, especially on first use. One is that WPA is a whole-system profiler so WPA tries to load symbols for every binary loaded on your system during the trace. On my machine that typically includes Outlook, Visual Studio, Chrome, Excel, and huge chunks of Windows.
  
  The next problem is that when WPA sees a binary for the first time it downloads its PDB file (if it isn’t in the local symbol cache) and then translates it to its own proprietary symcache format, which is much smaller and faster to load. You can see these files in c:\symcache. So the *second* time you load a trace into WPA will be much faster.
  
  Additionally, WPA/xperf have a performance bug in their handling of some symbols. This hits Chrome symbols quite hard and makes these (admittedly enormous) symbols take about twenty five minutes to load (for chrome.dll and chrome_child.dll). This was such a serious problem that I wrote https://github.com/google/UIforETW/blob/master/bin/StripChromeSymbols.py in order to strip off the private symbol data before feeding the PDBs to xperf to generate the .symcache files. On a fast machine I can now download/strip/convert the symbols for Chrome in 44 s, and it only has to do that when there is a new version. Maybe you are hitting some version of this bug, in which case you just need to modify StripChromeSymbols.py.
  
  So, I don’t find WPA to be too slow at loading symbols. As a test I just recorded a trace on my quite busy machine. The symbols took ~115 seconds to load, which is a long time, but WPA was responsive the entire time so I didn’t have to wait, so it didn’t matter. Having symbols set to auto load (Trace-> Configure Symbol Paths-> Load Settings) also helps. Also note that during that time WPA loaded symbols for 5,000!!! binaries so if symbol loading seems slow, just realize how much it is actually doing.
  
  But, it is true that WPA is not the type of profiler that shows you results in just a few seconds. It’s a tradeoff.
  
  I hope that helps.
  
JJ says:

May 20, 2016 at 11:13 am

Thanks for the info. You’re right that it’s much faster to open a trace after caching the symbols. It still took ~5 minutes to open the same trace I opened on the 16th though, which is off-puttingly slow when you want a quick analysis.

It would be great if you could tell WPA which symbols to load. Scoping it to the filtered process for example.

- brucedawson says:
  
  May 20, 2016 at 6:34 pm
  
  You can filter symbol loading to a particular process. It’s in the symbol load settings mentioned above. I’ve never used it, but it should work.
  
  If your traces are taking five minutes to load then you probably either need a faster computer or smaller traces. Especially if that five minutes is just *opening* the trace – that should be time to load a trace, load the symbols, and then make a tasty cocktail.
  
Ofek Shilon says:

June 15, 2016 at 11:29 am

What I do is *after the initial download* disable the MS symbol server, thereby forcing WPA to use just SymCache. This initial loading indeed takes a few seconds. I can then re-enable the MS symbol server – it typically grinds for many minutes in the background, mostly on symbols that are missing on the server anyway. (as is shown in the diagnostic console)