The Lost Xperf Documentation–CPU Usage (Precise)

As I’ve mentioned previously, the documentation for xperf (Windows Performance Toolkit, also known as ETW) is pretty weak. In this post I’m going to attempt to explain the meaning of the extremely subtle and non-obvious columns in the CPU Usage (Precise) Tables, which display every context switch recorded in the trace. Understanding these columns is important when analyzing idle time, which is one of xperf’s most valuable features.

This post was updated September 2015 to include information about UIforETW, WPA, and new columns.

imageThe CPU Usage (Precise) table is one of the worst offenders for having confusingly named columns. Quick – what’s the difference between Switch-In Time and Last Switch-Out Time? And is New Prev Out Pri a real column name, or is it a made-up compound word? And is ProcessOutOfMemory really a column in the context switch table? Stick around and find out. WPA 10 adds some spaces and hyphens to the names, and tooltips, but documentation is still needed.

Here are the other articles in this series:

image_thumb8This post assumes that you are recording traces with UIforETW and that the WPA (Windows Performance Analyzer) startup profiles which ship with UIforETW have been installed. You can do this from the Settings dialog by clicking on Copy startup profiles. This post also assumes that you are using WPA 10 to view the traces – this is the default when you double-click on a trace in UIforETW‘s Traces list.

CPU Usage (Precise)

This table is labeled CPU Usage (Precise) to distinguish it from the CPU Usage (Sampled) data. The sampled data comes from interrupting the CPU frequently – sampling – and seeing what it is doing. The precise data that is discussed here comes from instrumenting the context switch code in the kernel. Every time a thread starts or stops running the precise time of this activity is recorded, with optional call stacks. This means that the CPU Usage (Precise) data can tell exactly how much CPU time each thread consumes, and what the call stack looked like when a thread lost/gained the CPU. However the CPU Usage (Precise) data can’t tell you what a thread is doing between context switches – that is the job of the CPU Usage (Sampled) data. Understanding this difference is crucial.

UIforETW’s default startup profile shows the CPU Usage (Precise) data in graph form, because this is the most accurate graph for showing how much CPU time is used by each process. If you want to see details of the context switches you can click on the Display graph and table button and the table will appear below the graph.

image

Default columns for CPU Usage (Precise)

The available View Presets for the CPU Usage (Precise) data are good for getting different graphs (Timeline by Process, Thread and Timeline by CPU can be interesting)…

image

…but I don’t find any of the built-in presets particularly useful for a detailed investigation of why my program is idle. So, the default View Preset with the UIforETW startup profile is Randomascii Wait Analysis and it is optimized for figuring out when, where, and why your threads go idle. The columns used in this preset are:

  • New Process
  • New Thread Id
  • New Thread Stack
  • Readying Process
  • Readying Thread Id
  • Ready Thread Stack
  • Orange bar (columns to the left are used for grouping)
  • Count
  • Sum: Time Since Last (μs)
  • Max: Time Since Last (μs)
  • Count: Waits
  • Switch-In Time (s)

With these columns available you can easily drill down to where your threads go idle, see how often they went idle and for how long, see what call stack they went idle on, see what thread from what process woke them (and that call stack), and more. It’s pretty amazing.

CPU scheduling shows switches

All of the columns in the CPU Scheduling summary table show data that is related to a particular context switch (or multiple context switches grouped together), and the grouping (by the columns to the left of the orange bar) affects how many context switches are displayed on each row. With the recommended default view there will initially be one row for each process, and as you drill in more rows will be revealed. The grouping by stack is particularly important since it lets you see the big picture more easily.

Any threads or processes that did not have context switches during the visible time period will not show up.

Before diving into the columns (there are over 60 of them!) some concepts and prefixes need to be covered.

Context switch concepts

The CPU Scheduling summary tables are summarizing context switches. They record the act of one logical or physical CPU changing from executing one thread to executing another thread. In order to understand context switches it is important to understand the different states in which a thread can be – if you don’t understand Windows thread scheduling and its nomenclature then you can’t fully understand the CPU scheduling summary table. A Windows thread can be in several states:

  • Waiting – not ready to run because the thread is waiting on something through a call to WaitForMultipleObjects, EnterCriticalSection, ReadFile, etc.
  • Ready – the thread is ready to run but has not yet been scheduled to run. Threads move from waiting to ready when their wait is satisfied or times out. If a thread is in the Ready state for any non-trivial amount of time then it indicates some level of CPU starvation
  • Running – the thread is running on one of the CPUs, executing instructions and getting work done. Threads move from ready to running when a CPU is available. They can move back to ready (if they lose the CPU) or back to waiting (if they wait on something).
  • There are other intermediate states used by the scheduler to improve scalability such as DeferredReady and Standby, and states such as Initialized and Terminated for new and dead threads, but those shouldn’t normally be relevant for performance purposes.

There are up to three processes involved in a context switch:

  • New Process/Thread: this is the process/thread that we are switching to – the one that is starting execution at this point. This may be the idle thread – when a CPU is idle it is considered to be ‘executing’ the idle thread. The New Thread is moving from Ready to Running.
  • Old Process/Thread: this is the process/thread that we are switching away from – the one that has lost (or relinquished) the CPU. This may be the idle thread. The Old Thread is moving from Running to Waiting or to Ready.
  • Readying Process/Thread: this is the process/thread that ‘readied’ the new process – that satisfied whatever it was waiting on. This could be a thread that released a critical section, or set an event, or something else. This readying could represent keyboard input, completion of file I/O, receipt of a network packet, or many other possibilities. Sometimes there is no readying thread – this happens when a thread wasn’t waiting for anything except more CPU time. Sometimes the readying thread is a DPC, which means it comes from a random thread that was hijacked to run the DPC code – watch for it. The Readying Thread’s state doesn’t change during this context switch, and at the time of the context switch it can be in any one of the thread states.

In short, a context switch, as recorded by ETW, is the act of moving the New Thread from Ready to Running, and moving the Old Thread from running to some other state, on a particular CPU.

Some documents suggest that Windows always ensures that the highest priority threads are running, but this is a half truth at best, and usually irrelevant. This priority guarantee is a half truth because current versions of Windows actually only guarantee that a particular CPU is running the highest priority thread in its per-CPU queue. It is quite possible for a high-priority thread to stay in the Ready state for a few quanta while a low-priority thread continues to run on another CPU. In addition, the priority guarantee is usually not relevant because threads are constantly having their priorities adjusted, so you can’t usefully know what priority a thread will have at the time (although you can see it in the ETW trace).

See this presentation for more details on Windows thread scheduling.

Compound words

Many of the column names are compound words and if you understand the meaning of the constituent words then you can more easily remember the meaning of the columns. Memorizing these words is far more valuable than memorizing individual column names. Some of the most important words used are:

  • New – this refers to the thread/process that is being switched to – the thread that will be running after the context switch
  • Old – this refers to the thread/process that is being switched away from – the thread that was running before the context switch
  • Ready/Readying – this refers to the thread/process, if any, that readied the new thread
  • New Prev – this appears to refer to the previous context switch involving the new thread – the time that it stopped running, which may have been arbitrarily many context switches earlier
  • In – this apparently refers to a thread’s state at the moment when it starts running, when it switched “in” to the Running state – its state at the beginning of a period of running
  • Out – this apparently refers to a thread’s state when it stops running, when it is switched “out” of the Running state
  • Max – when rows are collapsed together a Max column is the maximum of some value across all of the collapsed rows, which is useful when searching for a single context switch with a very high value
  • Sum – when rows are collapsed together a Sum column is the sum of some value across all of the collapsed rows, which is useful when finding what collection of context switches are responsible for a high total. There are many columns with no Sum or Max prefix which behave as Sum columns, such as Count, and the time duration columns
  • Time has two different meanings in this summary table. Time (s) refers to a point in time in the trace (with units of seconds). Time (us) refers to a duration – a length of time (with units of microseconds). Thus, the units of time are crucial clues to the meaning of a column.

That is how we can end up with column names such as New Prev Out Pri which is the Priority of the New thread at the end (Out) of the Previous time that it ran. Phew.

Documentation of important columns

  • New/Readying/Old Process Name – this is the name of the process (i.e.; devenv.exe) without the process ID. This column is useful when you want to group all processes from one executable together.
  • New/Readying/Old Process – this is the name of the process plus its process ID. This column is useful when you want each process to be grouped separately
  • New/Readying/Old ThreadId – the ID of the thread, or –1 if there is no thread
  • New Thread Stack – the stack of the new thread, which is both where it resumes execution after the context switch, and where it was when its execution was suspended on an earlier context switch. This represents where the thread was waiting
  • Ready Thread Stack – the stack of the thread, if any, which readied (made eligible to run) the new thread
  • Old Thread Stack – there is no column that records the call stack of the old thread (the thread being switched out). The New Thread Stack for the next context switch involving the Old Thread contains that data
  • Ready Time (s) – this is the time at which a thread was readied – the time at which it became eligible to run
  • Count – this is the number of context switches summarized by the current row. When the first column is NewProcess and all of the data is collapsed this represents the number of context switches per process over the selected time. As you drill down into the data you will eventually get down to individual context switches and the count will be one.
  • Sum/Max:Time Since Last (μs) – this is the time since the thread last ran – i.e.; the time the thread was in the Waiting and Ready states
  • Sum/Max:Waits (μs) – this is the length of time that the thread was in the Waiting state before it was moved to Ready
  • Sum/Max:Ready (μs) – this is the length of time that the thread was in the Ready state before it was moved to Running. This is often just a few dozen microseconds but can be arbitrarily long on a CPU starved system. The sum of Sum:Waits and Sum:Ready should be Sum:Time Since Last, and a little Excel spot checking shows that it is
  • Count: Waits – this counts how many of the context switches have a non-zero Wait time – i.e.; how many of the context switches were after the new thread waited. If a thread is preempted then it can go from the Running state to the Ready state and back to the Running state with no waiting. If you drill down to an individual context switch then Count:Waits count will be zero or one. If it is zero then Sum/Max:Waits will also be zero, and ReadyingProcess, ReadyingThreadId, and ReadyThreadStack will all be empty
  • New/Old InSwitchTime (μs) – these represent how long the new and old threads will/did run – these are the elapsed time between the current context switch and the next/prev context switches – see the Learning through exploring section below for more details – these columns no longer exist in WPA 10. CPU Usage (ms) is the closest substitute for NewInSwitchTime (μs).
  • Switch-In Time (s) – this is the time of the context switch.
  • Annotation – this, column, which is available on most or all WPA tables, can be filled in during trace analysis. See below for details.

My editor tells me that I need more diagrams, so here is a lovely Visio diagram showing some of the times and durations in a context switch – pairs of parentheses indicate time durations (us), and arrows point to particular times (s). The data is all for the Context Switch pointed to by the text with the purple background:

image

For those of you who prefer math with your diagrams:

  • Switch-In Time (s) + CPU Usage (ms) * 1e-3 = Next Switch-Out Time (s)
  • Ready Time (s) + Ready (μs) * 1e-6 = Switch-In Time (s)
  • Waiting (μs) + Ready (μs) = Time Since Last (μs)

Annotation

The annotation column can be useful during analysis as a way to add an arbitrary tag to a set of data. When the Annotation column is enabled it should typically be one of the left-most columns. Then, when a set of data is understood the trace analyst should select that data, right-click and select Annotate Selection…, and enter an appropriate description. Then the trace analysis can continue on the data that is under the <Not Annotated> node in the Annotation column, undistracted by the data that is already understood. In the screen show below the Annotation column was used to tag context switches as “Self-readied” or “Readied by other process” in order to better visualize flame graph data:

image

Unimportant columns

Some of these might be more important if Microsoft explained them enough to be useful, but most of them are really only of interest to kernel developers:

  • Last Switch-Out Time (s) – this is the time when the old thread last stopped running, which could be arbitrarily many context switches ago. However this value appears to be buggy in regards to the idle thread where it is always the same as SwitchInTime (s) 
  • Next Switch-Out Time (s) – this is the time of the next context switch. Or, as the name wants you to think about it, this is when the new thread will be switched out.
  • New Prev Out Pri – I believe that this is the priority of the new thread at the end of the last time that it ran
  • New In Pri – I believe that this is the priority of the new thread when it moves into the Running state because of this context switch. This is often elevated by one or two levels because the Windows scheduler boosts the priority of threads that have just been readied
  • New Out Pri – I believe that this is what the priority of the new thread will be when it ends its current running state – at the next context switch. This often decays back to the same level of New Prev Out Pri
  • Old Out Pri – I believe this is the priority of the old thread at the time that it stopped running. Note that the idle thread runs at priority zero
  • Cpu – this is the CPU on which the context switch is happening. A thread can move across CPUs (although Windows tries to minimize that) but for any given context switch, a particular CPU goes from running one thread to running another thread
  • New/Readying/Old ThreadStartModule – the module where the thread originally started
  • New/Readying/Old ThreadStartFunction – the function where the thread originally started
  • Ideal Cpu – every thread has an ideal CPU that Windows tries to schedule it on. This field presumably represents this data for the New Thread. Grouping by Cpu and then Ideal Cpu shows that some CPUs have far more context switches than others, and the majority of context switches have the new thread running on its ideal CPU. But I can’t find any time where this matters
  • New/Old Qnt – these are always -1, even on Windows 10. This appears to be an old and unfixed bug.
  • New Prev Wait Mode, New/Old Wait Mode – the available wait modes seem to be Unknown, NonSwap, and Swapable. None of these seem relevant to non-kernel developers
  • New Pri Decr – this number varies from –112 to +112, with zero and two being the most common values. But I don’t know what it means
  • New Prev State, New/Old State – these refer to the Waiting, Ready, Running, etc. states that a thread can be in, but I can’t make any sense of the data
  • New Prev Wait Reason, New/Old Wait Reason – these presumably refer to why the thread was waiting or otherwise not running, but the number of reasons and the diversity of call stacks makes understanding them seem hopeless
  • Prev CState – Cstates represent how active your CPU is, with zero being fully active and larger numbers representing various degrees of reduced power. The largest number I’ve seen on a trace is ‘2’
  • AdjustIncrement – unknown
  • AdjustReason – unknown
  • DirectSwitchAttempt – unknown
  • KernelStackNotResident – unknown
  • NewThreadPrevRemainingQuantum – unknown
  • OldThreadRank – unknown
  • OldThreadRemainingQuantum – presumably this records how much of the old (switched out) threads’ quantum was remaining
  • ProcessOutOfMemory – unknown, and doesn’t this seem out-of-place in context-switch data?
  • ReadyThreadInDPC – unknown

Learning through exploring

I have no secret source of knowledge for what most of these columns are – I didn’t call up someone at Microsoft and ask, and my web searches for documentation were generally fruitless. Instead, I figured out the meanings of the columns by rearranging the summary table to figure out patterns. For instance, in order to figure out OldInSwitchTime (μs) (now deleted) and NewInSwitchTime (μs) (now deleted, use CPU Usage (ms) instead) I rearranged the columns in the order:

  • Cpu – the CPU the context switch occurs on
  • Orange bar – columns to the left are used for grouping
  • New Process – the process we are switching to
  • NewThreadId – the thread we are switching to
  • SwitchInTime (s) – the time of the context switch
  • OldInSwitchTime (μs)
  • NewInSwitchTime (μs)

Once I did this and sorted by Switch In Time (s) the patterns became obvious:

image

InSwitchTime (μs) is the length of time that a thread is running, from the context switch where it starts running to the context switch where another thread starts running. This can be verified by taking SwitchInTime (s) on the selected row, adding NewInSwitchTime (μs) (remembering to divide by 1,000,000 to convert to seconds), and noting that the result is the SwitchInTime (s) for the next row – the next context switch. By selecting a few thousand rows, copying to a spreadsheet and doing some Excel-fu you can quickly verify this.

The OldInSwitchTime (μs) is how long the old thread ran, and the NewInSwitchTime (μs) is how long the new thread will run. The NewInSwitchTime (μs) is not known at the time that the context switch happens, but xperf helpfully calculates it at analysis time. You can see that the OldInSwitchTime (μs) for a given row (when sorting by SwitchInTime (s) and when there is one context switch per row) is just the NewInSwitchTime (μs) from the previous row (the previous context switch).

Aside: I don’t know what Outlook is doing in the screenshot above but it appears to be suffering from Attention Deficit Hyperactivity Disorder. In total it flips back and forth between it and the idle thread seven times (not all shown) and it never runs for more than 0.02 ms. That can’t be good.

Grouping variations

It can be interesting to see all of the context switches on a single CPU – or depressing if you think about the lost efficiency that comes from thousands of context switches per second per CPU. To do this just arrange the columns like this:

  • Cpu
  • Switch-In Time (s)
  • Orange bar (columns to the left are used for grouping)
  • New Process
  • New Thread Id
  • Sum: Time Since Last (μs)
  • Sum: Ready (μs)
  • Sum: Waits (μs)
  • Whatever else you want summarized…

Timer expiration

When tracing through chains of readying threads (thread A waiting on thread B which was waiting on thread C…) the identify of the readying thread is crucial, but sometimes it is irrelevant. If a thread was not actually waiting (if it was in the Standby state due to CPU starvation) then there will be no readying thread. If it was waiting on a timer (calling Sleep or a call to WaitForMultipleObjects that timed out) then the readying thread appears to be whatever thread happened to be running on that CPU – watch for KiProcessExpiredTimerList in the ReadyThreadStack for random threads. That’s because DPCs effectively ‘hijack’ other threads and run in their context, and WPA doesn’t expose that at a high level.

Go forth and rearrange summary tables

If you understand how summary tables work and you understand the data that they are presenting you with then there are many rich explorations of CPU Scheduling that you can do. If you figure out a cool technique, let me know. If you find any errors, or new or better column explanations, please share.

P.S.

I didn’t figure out the significance of Time (s) versus Time (μs) (point in time versus duration) until I was researching for this post. In fact, I didn’t figure it out until about half an hour before publishing this post. Knowing that distinction is going to make this summary table much easier to use, and justifies the time I spent writing this post.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
This entry was posted in Documentation, Performance, Programming, xperf and tagged , , , . Bookmark the permalink.

31 Responses to The Lost Xperf Documentation–CPU Usage (Precise)

  1. Pingback: The Lost Xperf Documentation–CPU sampling | Random ASCII

  2. Pingback: Xperf Wait Analysis–Finding Idle Time | Random ASCII

  3. Pingback: WPA–Xperf Trace Analysis Reimagined | Random ASCII

  4. Neeraj Singh (windows kernel dev) says:

    Great post on my most oft-used summary table. Two minor obscurities:
    1) IdealCpu factors into where you’ll see cases of high priority threads ready while lower pri guys are running on alternate cpus… the scheduler chooses the ideal cpu if there is no better idle cpu to run the thread and then uses priority to determine whether to preempt or queue. The ideal cpu column actually refers to the ‘old’ ideal cpu, which even trips up scheduler experts from time to time.

    2) previous c-state refers to the last sleep state of the cpu when the oldthread was the idle thread. It has no useful meaning in other cases. It may be instructive to look at if you have patterns where you distribute lots of tiny pieces of work to threads and join on the results. It’s possible to get in a regime where more time is spent waking up from deep c-states than doing work. This would show up in 100us – 800us ready times where the target core was actually idle.

    • brucedawson says:

      Thanks Neeraj. I’ve recently been noticing a lot of traces where threads spend a lot of time in the Ready state even though there are CPUs available. The causes that I am aware of are:

      1) Scheduler doesn’t want to move the thread to another CPU
      2) Waking up CPU from deep c-states (thanks, I didn’t know about that one)
      3) Thread is blocked by an interrupt or DPC
      4) Scheduler is unwilling to unlock more logical CPUs (typically second hyperthreads but could be another module)

      Am I missing anything? I’m not sure that accounts for all that I am seeing, or else perhaps I just need to look more.

      I could never get c-state to display what looked like meaningful data, but I’ll look more closely.

      BTW, you should ask the xperf guys to document this. It’s tragic that it’s so hidden.

  5. Martin Weißhaupt says:

    Thank you for your posts about xperf, they have been a great help for me.

    There is one thing I didn’t get to work yet. Do you know how to get the name of the thread to be shown in xperfview or wpa? Or is this not possible?

  6. Pingback: The Lost Xperf Documentation–Disk Usage | Random ASCII

  7. Bobi says:

    Hello,

    It still not clear to me what the “Count” represent.
    o.k. so it saves the number of context switch,is it the number of context switch of the new process?
    and if it is, in what duration?
    or maybe the context switch of other processes while the “new process” is running.
    It is not very clear what this field represent.

    • brucedawson says:

      Count is the number of context switches over the time being analyzed (in WPA this is the currently visible time range) for that row. If you drill down all the way then Count will always go to 1, because at that point each row represents a single context switch.

      If your leftmost column is Process and you haven’t drilled in at all then Count is how many context switches for each process over the time being analyzed. If you have drilled down a bit then the meaning will change, depending on what rows you have to the left of the orange bar and how far you have drilled in.

      Note that in xperfview/WPA a context switch is counted for the New Process/Thread, not the one that is being switched out. I hope that makes sense. If not, let me know.

      • Bobi says:

        Hi Brucedownson,

        Thanks for the quick answer,
        O.k. so from what I understand now, the count field is the sum of all the context switch of new process in the duration time specified in the NewInSwitchTime, is that correct?
        lets say the NewThreadStack is a nieghbor to Count field, and I drill into the call stack (NewThreadStack field).
        from looking at the call stack, what the Count field represent?
        Is it the number of context switch while this function/thread was starting to execute until it finish its duty(what I am actually saying is that while this function/thread was executing it was interrupted by #sum of context switches)?

        Thanks,
        Bobi

        • brucedawson says:

          Are you using xperfview or wpa to view the traces?

          In xperfview you select a time range and then view a summary table. Within that summary table the count columns tells you how many context switches there are per-row, and your grouping tells you what that row actually contains. NewInSwitchTime is not relevant.

          In WPA you are always viewing a summary table and the relevant time range is the entire visible time range — zooming in and out affects what is being displayed, and hence the count field.

          If you are looking at a particular thread and drilling down in the stack then at a particular point in the call stack the count field tells you how many times the OS switched to that thread on that sub-callstack.

          Ultimately the count field tells you how many times a thread or set of threads were given a chance to run. Your grouping affects what thread or threads are being summarized, and what call stacks are being summarized, and you are in charge of what time period is covered (see above).

          Play around with it. Zoom in to a small time range in WPA and experiment. Try some experiments. That’s how I figured out most of the columns. You’ll figure it out.

  8. Pingback: Sleep Variation Investigated | Random ASCII

  9. Pingback: Use Windows Performance Toolkit to Trackdown Slow Service Start Problem | Radec

  10. Finn Wong says:

    Hi Bruce,
    I begin to use WPA to analyze power issue of applications recently and your great posts help me a lot. Thanks you!
    Here is a good document from Microsoft about WPA:
    http://msdn.microsoft.com/en-us/library/windows/desktop/jj679884.aspx
    I think it may help in identifying some of the unknowns you described.

    BR,
    Finn

  11. Pingback: Defective Heat Sinks Causing Garbage Gaming | Random ASCII

  12. Pingback: New Version of Xperf–Upgrade Now | Random ASCII

  13. Pingback: Exporting Arbitrary Data from xperf ETL files | Random ASCII

  14. Pingback: Bugs I Got Other Companies to Fix in 2013 | Random ASCII

  15. Pingback: You Got Your Web Browser in my Compiler! | Random ASCII

  16. Pingback: Self Inflicted Denial of Service in Visual Studio Search | Random ASCII

  17. Pingback: UIforETW – Windows Performance Made Easier | Random ASCII

  18. Pingback: ETW Central | Random ASCII

  19. Pingback: Xperf Wait Analysis–Finding Idle Time | Random ASCII

  20. Pingback: The Lost Xperf Documentation–CPU sampling | Random ASCII

  21. Pingback: The Lost Xperf Documentation–Disk Usage | Random ASCII

  22. Pingback: ETW Flame Graphs Made Easy | Random ASCII

  23. Pingback: CPU Performance Counters on Windows | Random ASCII

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s