TL;DR – I can finally record CPU performance counters for processes on Windows.
I’m mostly a Windows developer but I’ll occasionally fire up my Linux box to use the perf tool to examine CPU performance counters. Sometimes you really need to see how many cache misses or branch mispredicts your code is causing, and Windows has been curiously hostile to this endeavor.
Some time ago Windows gained the ability to record CPU performance counters from within ETW events, but (so the story goes) there was no way to enable it. Then the ability to enable this feature was added, but there was virtually no documentation.
So when a comment on my blog gave instructions on how to use this feature I knew that I had to give it a try. It took some clarification from another commenter, and some experimentation, but I’ve figured it out. It’s clumsy, and it will require some work if you want to use it differently, but it seems to work. Finally I can understand why, at a CPU efficiency level, some processes end up running slowly.
The functionality provided allows recording an ETW trace with CPU performance counters attached to particular events. When the trace is dumped to text (there is no WPA support) there will be special “Pmc” (aka PCM) records that are implicitly associated with the following event. These events record the state of the CPU performance counters at that point on that CPU. My experimentation has strictly been with context switch events, since that makes it easy to associate performance counters with a particular process.
The CSwitch event indicates which process and thread was being switched in (which turns out not to matter), what process and thread was being switched out, and what CPU this was happening on. The performance counters are per-CPU so the script that parses this data has to keep track of the previously seen value for each CPU so that it can calculate the delta. Then it ignores the thread that is being switched in and instead attributes the delta to the thread being switched out, since it is the one that is responsible for that counter delta.
I’ve created a demo to show the basic idea. It consists of a batch file to drive the process, a test program that can either cause many branch mispredicts or very few, and a Python script that dumps the ETW file to text and extracts the counters in a usable manner.
Intentionally slow code
When investigating a feature like this it is important to have a way to sanity check the results. What I needed was a program that would cause some predictable change in performance counters so that I could tell if my results made sense. There’s a classic performance puzzle that asks why conditional summing of an array goes much faster if the array is sorted. The answer is that if the data is random then the branches that decide which values to sum are random, and frequently mispredicted. With sorted data the branches are predictable, they are correctly predicted by the CPU, and the adding goes much faster (more than four times faster in my tests).
So, I wrote a program that generates random data and then either sorts it or not, and then conditionally adds it multiple times. My batch file starts ETW tracing with CPU performance counters that measure the number of branches and number of branch mispredicts. Then it runs my test program twice, saves the ETW trace to disk, converts it to text, and parses the data. The summary prints the performance counters, and their ratios, along with other snippets of data.
The results are dramatic. When processing sorted data the percentage of mispredicted branches is around 0.06%. When processing random data the percentage of mispredicted branches is around 18%. That’s a lot higher!
I actually would have expected a 25% mispredict rate. The inner loop contains two branches – one to control the loop, and one to decide whether to sum the data. The loop branch should be correctly predicted roughly 100% of the time, and the other branch should be correctly predicted roughly 50% of the time. So, a 75% prediction rate – aka a 25% misprediction rate – would seem normal. I don’t know what the discrepancy indicates but I’m going to call 18% close enough.
Note that IRCs and DPCs run in the context of a process, which means that they dirty the performance counters. Adjusting for this or estimating its effect is left as an exercise for the reader.
For more information you should look at the sample code and the batch file that runs it. Modifying the batch file to suit your needs is left as an exercise for the reader.
One obvious modification to the batch file would be changing what performance counters it records – maybe you are more concerned with last-level cache misses. If you run:
tracelog.exe -profilesources Help
then you will get a list of the available performance counters. There aren’t many, and I discuss them briefly in my batch file.
I’ve tested this on Windows 10 and it sounds like it works back to Windows 8. The sample code can be found here. Let me know if you find anything interesting, extend this to Profile events, or otherwise improve the process. A more flexible version of the batch file (that simply waits while you run your scenario and then summarizes all processes) can be found here. It defaults to measuring instructions, cycles, and cache misses, and it is interesting to see how good (or bad) IPC (Instructions Per Cycle) can be. A handy tool that can run at 1.0, 2.0, or ~2.85 IPC can be found here – useful for more testing.
For more general information on ETW profiling see my collection of blog posts at https://tinyurl.com/etwcentral.
Reddit discussion goes here.
For a different perspective on this topic see this blog post.