CPU Performance Counters on Windows

TL;DR – I can finally record CPU performance counters for processes on Windows.

I’m mostly a Windows developer but I’ll occasionally fire up my Linux box to use the perf tool to examine CPU performance counters. Sometimes you really need to see how many cache misses or branch mispredicts your code is causing, and Windows has been curiously hostile to this endeavor.

Some time ago Windows gained the ability to record CPU performance counters from within ETW events, but (so the story goes) there was no way to enable it. Then the ability to enable this feature was added, but there was virtually no documentation.

So when a comment on my blog gave instructions on how to use this feature I knew that I had to give it a try. It took some clarification from another commenter, and some experimentation, but I’ve figured it out. It’s clumsy, and it will require some work if you want to use it differently, but it seems to work. Finally I can understand why, at a CPU efficiency level, some processes end up running slowly.

The functionality provided allows recording an ETW trace with CPU performance counters attached to particular events. When the trace is dumped to text (there is no WPA support) there will be special “Pmc” (Performance Monitoring Counter) records that are implicitly associated with the following event. These records capture the state of the CPU performance counters at that point on that CPU. My experimentation has been strictly with context switch events, since that makes it easy to associate performance counters with a particular process.

The CSwitch event indicates which process and thread were being switched in (which turns out not to matter), which process and thread were being switched out, and which CPU this was happening on. The performance counters are per-CPU, so the script that parses this data has to keep track of the previously seen value for each CPU in order to calculate the delta. It then ignores the thread being switched in and attributes the delta to the thread being switched out, since that is the thread responsible for the counted work.
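
The bookkeeping for this is small. Here is a minimal sketch of the idea in Python, assuming the trace has already been dumped to text and reduced to (cpu, switched-out process, counter values) tuples; the tuple shape and names here are illustrative rather than the actual xperf text format:

from collections import defaultdict

def attribute_counter_deltas(records, num_counters=2):
    # records: iterable of (cpu, old_process, counters) tuples in trace order,
    # one per Pmc/CSwitch pair. 'counters' holds the running totals of the
    # configured counters for that CPU at the moment of the context switch.
    last_seen = {}                                    # cpu -> previous counter values
    totals = defaultdict(lambda: [0] * num_counters)  # process -> accumulated deltas

    for cpu, old_process, counters in records:
        previous = last_seen.get(cpu)
        if previous is not None:
            # The counters advanced while old_process was running on this CPU,
            # so the delta is charged to the process being switched out.
            for i in range(num_counters):
                totals[old_process][i] += counters[i] - previous[i]
        last_seen[cpu] = counters
    return totals

# With branches and branch mispredicts configured, usage might look like:
# for process, (branches, mispredicts) in attribute_counter_deltas(records).items():
#     print(process, branches, mispredicts, 100.0 * mispredicts / max(branches, 1))

The demo’s script does this after first pairing each Pmc record in the text dump with the CSwitch event that follows it.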

I’ve created a demo to show the basic idea. It consists of a batch file to drive the process, a test program that can either cause many branch mispredicts or very few, and a Python script that dumps the ETW file to text and extracts the counters in a usable manner.

Intentionally slow code

When investigating a feature like this it is important to have a way to sanity check the results. What I needed was a program that would cause some predictable change in performance counters so that I could tell if my results made sense. There’s a classic performance puzzle that asks why conditional summing of an array goes much faster if the array is sorted. The answer is that if the data is random then the branches that decide which values to sum are random, and frequently mispredicted. With sorted data the branches are predictable, they are correctly predicted by the CPU, and the adding goes much faster (more than four times faster in my tests).
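
The test program itself is native code, but the loop it times boils down to the following sketch (in Python purely to show the structure; the function name, threshold, and sizes are mine, and it is the native version where the branch behavior actually shows up in the counters):

import random

def conditional_sum(data, passes=10):
    # Sum only the values at or above the threshold, several times over.
    # On random data the 'if' is taken unpredictably and the branch predictor
    # fares poorly; on sorted data the branch outcome changes only once per
    # pass and prediction is nearly perfect.
    total = 0
    for _ in range(passes):
        for value in data:        # the (predictable) loop branch
            if value >= 128:      # the data-dependent branch
                total += value
    return total

data = [random.randrange(256) for _ in range(64 * 1024)]

# Same work either way; only the branch predictability differs.
random_order_total = conditional_sum(data)
sorted_order_total = conditional_sum(sorted(data))
assert random_order_total == sorted_order_total

In an interpreted language the misprediction cost is buried in interpreter overhead, so this sketch only shows the shape of the loop; it is the performance counters that make the effect visible in the native test program.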

So, I wrote a program that generates random data and then either sorts it or not, and then conditionally adds it multiple times. My batch file starts ETW tracing with CPU performance counters that measure the number of branches and number of branch mispredicts. Then it runs my test program twice, saves the ETW trace to disk, converts it to text, and parses the data. The summary prints the performance counters, and their ratios, along with other snippets of data.

The results are dramatic. When processing sorted data the percentage of mispredicted branches is around 0.06%. When processing random data the percentage of mispredicted branches is around 18%. That’s a lot higher!

[Screenshot: summary output showing the branch and branch-mispredict counts and their ratio for the sorted and random runs]

I actually would have expected a 25% mispredict rate. The inner loop contains two branches – one to control the loop, and one to decide whether to sum the data. The loop branch should be correctly predicted roughly 100% of the time, and the other branch should be correctly predicted roughly 50% of the time. So, a 75% prediction rate – aka a 25% misprediction rate – would seem normal. I don’t know what the discrepancy indicates but I’m going to call 18% close enough.

Note that ISRs (interrupt service routines) and DPCs (deferred procedure calls) run in the context of whatever process happens to be current, which means that they dirty the performance counters. Adjusting for this or estimating its effect is left as an exercise for the reader.

For more information you should look at the sample code and the batch file that runs it. Modifying the batch file to suit your needs is left as an exercise for the reader.

One obvious modification to the batch file would be changing what performance counters it records – maybe you are more concerned with last-level cache misses. If you run:

tracelog.exe -profilesources Help

then you will get a list of the available performance counters. There aren’t many, and I discuss them briefly in my batch file.

I’ve tested this on Windows 10 and it sounds like it works back to Windows 8. The sample code can be found here. Let me know if you find anything interesting, extend this to Profile events, or otherwise improve the process. A more flexible version of the batch file (that simply waits while you run your scenario and then summarizes all processes) can be found here. It defaults to measuring instructions, cycles, and cache misses, and it is interesting to see how good (or bad) IPC (Instructions Per Cycle) can be. A handy tool that can run at 1.0, 2.0, or ~2.85 IPC can be found here – useful for more testing.

For more general information on ETW profiling see my collection of blog posts at  https://tinyurl.com/etwcentral.

Reddit discussion goes here.

For a different perspective on this topic see this blog post.


7 Responses to CPU Performance Counters on Windows

  1. lvps1000vm says:

    If the samples are drawn from a memoryless distribution (or, more generally, if they are in random order), a smart branch predictor can predict the direction of a[i+2] ≷ a[i+1] with probability 2/3 if it already knows the direction of a[i+1] ≷ a[i]: just predict that the new comparison will go the other way around. Along with ~0% misprediction of the loop branch, that gives a theoretical best of a 16.6% misprediction rate, close to 18%. (This heuristic fails horribly for sorted data, so the predictor needs to identify in which context to use it; a sketch of the 2/3 figure is below.)

    If instead of only having memory of the previous comparison you are allowed to use the information of all previous samples you can even predict the new comparison with probability 3/4. If a sample lies in the p-th percentile in relation to all previous samples, predict that the following sample will be greater if p < 50% and lower otherwise.

    The case where you can't do better than 50% is when the array contains the sampling of a random walk. For this case the information from previous samples is useless.
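
    Sketch of the 2/3 figure, assuming i.i.d. continuous samples (so ties can be ignored and every relative ordering of a[i], a[i+1], a[i+2] is equally likely):

    P(a[i+2] > a[i+1] | a[i+1] > a[i]) = (1/6) / (1/2) = 1/3, so P(direction reverses | a[i+1] > a[i]) = 2/3.

    Averaged with the essentially-never-mispredicted loop branch, the best achievable rate for the “predict a reversal” heuristic is (1/3)/2 = 1/6 ≈ 16.7%, i.e. the 16.6% quoted above.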

  2. Microsoft does indeed seem to have kept a lot of the relevant documentation to itself. This gets my attention because, aside from my own long-time interest in ETW, even though I would likely not use third-party diagnostic tools myself I very much want them to exist. Running TRACELOG, etc., to get your counters isn’t quite like calling the CRT’s system function to run the Command Prompt’s DIR command for a directory listing that you then parse, but it’s closer to that than having your own program control the trace session and consume the events for presentation as your results.

    Yet I can’t see how anyone can get these counters programmatically except by using several undocumented APIs, or at least under-documented cases of documented API functions. With one information class of TraceSetInformation you specify which counters you want. With another you specify the events for which you want the kernel to collect the counters for inclusion as extended data. The event that traces each thread switch (as opposed to the more efficient one that batches them) is of the most use to you, since it lets you keep track of which changes to the counters were caused while your thread ran on that processor. Fortunately, this event can be enabled via the EnableFlags in the EVENT_TRACE_PROPERTIES that you give when starting an NT Kernel Logger session, else you might need yet another information class (to enable the right group mask).

    The event data, with its counters, can be interpreted programmatically too, given that you know the structures. So I expect that a programmer armed with sufficient (or sufficiently good) documentation could write a single program that sets up the ETW machinery to collect counters, runs the algorithm that it wants to test, and then extracts the counters for that thread, all without having the events go to a file, let alone getting them formatted to text by some separate tool and parsed somehow.

    Whether that’s a thing that would be widely desired, I don’t know, but it dissatisfies me that under-documentation means the thing goes not just begging but unrealised. I can do my part: documentation of TraceSetInformation is in preparation at http://www.geoffchappell.com/studies/windows/win32/advapi32/api/etw/logapi/set.htm, with links to what I’ve already written about the kernel-mode stuff that supports it all, such as http://www.geoffchappell.com/studies/windows/km/ntoskrnl/api/etw/callouts/wmi_contextswap.htm for the layout of the event data. Inevitably, though, there’s a long way to go.

    As for your results, well, of course you know that what you write for your thread is not all that runs for the thread. It may be all that runs in user mode while you run the algorithm with no calls outside, but especially if you run the algorithm repeatedly enough to get smoother results, quite a lot of other people’s code will run in kernel mode: first because of hardware interrupts (even those that occur during user-mode execution) and then because of deferred procedure calls, and, as a special case of these, the code that switches your thread in and out or to other processors – which is the very code that traces these thread switches as the events you want to see! That the observed mis-prediction rate is so much lower than you expect from your own code may be a significant credit to Profile-Guided Optimisation in the kernel.

    • brucedawson says:

      I’m glad that you’re contributing to the documenting of ETW, from a completely different angle from my documentation. We’ll get everything covered eventually.

      Although, the task would be a lot quicker if Microsoft would offer a bit more information…

  3. Adam Sitnik says:

    Using ETW to collect PMC counters is easy when you know how.

    Recently I added support for this to BenchmarkDotNet. I have shared my results of some reverse engineering here:
    http://adamsitnik.com/Hardware-Counters-ETW/

    and also added it to BenchmarkDotNet:
    http://adamsitnik.com/Hardware-Counters-Diagnoser/

    Maybe somebody will find this description useful and build another great tool 😉

  4. Denis says:

    Hi Bruce,
    Can you please share a step-by-step process for how I can run the equivalent of “perf stat” on Windows? Or maybe you have some good resources?
    I just want to collect PMCs for my app.

    • brucedawson says:

      The only instructions for recording PMCs for a process are in this blog post. I used the instructions recently to record PMCs for Chrome and they worked. There are links to a batch file and demo programs. Try those and then adjust as needed for your app.

  5. akraus1 says:

    I have added PMC/LBR tracing support to ETWAnalyzer. See https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/DumpPMCCommand.md and https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/DumpLBRCommand.md, which should make working with these low-level CPU features much easier. The only thing I miss is that the OS Hyper-V features need to be uninstalled, which is a no-go for today’s Kubernetes and related development. The documentation is hopefully correct, but I could have missed some parts because until a few weeks ago this was all new to me.
