As I’ve mentioned previously, the documentation for xperf (Windows Performance Toolkit, also known as ETW) is pretty weak. In this post I’m going to attempt to explain the meaning of the extremely subtle and non-obvious columns in the CPU Usage (Precise) Tables, which display every context switch recorded in the trace. Understanding these columns is important when analyzing idle time, which is one of xperf’s most valuable features.
This post was updated September 2015 to include information about UIforETW, WPA, and new columns.
The CPU Usage (Precise) table is one of the worst offenders for having confusingly named columns. Quick – what’s the difference between Switch-In Time and Last Switch-Out Time? And is New Prev Out Pri a real column name, or is it a made-up compound word? And is ProcessOutOfMemory really a column in the context switch table? Stick around and find out. WPA 10 adds some spaces and hyphens to the names, and tooltips, but documentation is still needed.
Here are the other articles in this series:
This post assumes that you are recording traces with UIforETW and that the WPA (Windows Performance Analyzer) startup profiles which ship with UIforETW have been installed. You can do this from the Settings dialog by clicking on Copy startup profiles. This post also assumes that you are using WPA 10 to view the traces – this is the default when you double-click on a trace in UIforETW‘s Traces list.
CPU Usage (Precise)
This table is labeled CPU Usage (Precise) to distinguish it from the CPU Usage (Sampled) data. The sampled data comes from interrupting the CPU frequently – sampling – and seeing what it is doing. The precise data that is discussed here comes from instrumenting the context switch code in the kernel. Every time a thread starts or stops running the precise time of this activity is recorded, with optional call stacks. This means that the CPU Usage (Precise) data can tell exactly how much CPU time each thread consumes, and what the call stack looked like when a thread lost/gained the CPU. However the CPU Usage (Precise) data can’t tell you what a thread is doing between context switches – that is the job of the CPU Usage (Sampled) data. Understanding this difference is crucial.
UIforETW’s default startup profile shows the CPU Usage (Precise) data in graph form, because this is the most accurate graph for showing how much CPU time is used by each process. If you want to see details of the context switches you can click on the Display graph and table button and the table will appear below the graph.
Default columns for CPU Usage (Precise)
The available View Presets for the CPU Usage (Precise) data are good for getting different graphs (Timeline by Process, Thread and Timeline by CPU can be interesting)…
…but I don’t find any of the built-in presets particularly useful for a detailed investigation of why my program is idle. So, the default View Preset with the UIforETW startup profile is Randomascii Wait Analysis and it is optimized for figuring out when, where, and why your threads go idle. The columns used in this preset are:
- New Process
- New Thread Id
- New Thread Stack
- Readying Process
- Readying Thread Id
- Ready Thread Stack
- Orange bar (columns to the left are used for grouping)
- Sum: Time Since Last (μs)
- Max: Time Since Last (μs)
- Count: Waits
- Switch-In Time (s)
With these columns available you can easily drill down to where your threads go idle, see how often they went idle and for how long, see what call stack they went idle on, see what thread from what process woke them (and that call stack), and more. It’s pretty amazing.
CPU scheduling shows switches
All of the columns in the CPU Scheduling summary table show data that is related to a particular context switch (or multiple context switches grouped together), and the grouping (by the columns to the left of the orange bar) affects how many context switches are displayed on each row. With the recommended default view there will initially be one row for each process, and as you drill in more rows will be revealed. The grouping by stack is particularly important since it lets you see the big picture more easily.
Any threads or processes that did not have context switches during the visible time period will not show up.
Before diving into the columns (there are over 60 of them!) some concepts and prefixes need to be covered.
Context switch concepts
The CPU Scheduling summary tables are summarizing context switches. They record the act of one logical or physical CPU changing from executing one thread to executing another thread. In order to understand context switches it is important to understand the different states in which a thread can be – if you don’t understand Windows thread scheduling and its nomenclature then you can’t fully understand the CPU scheduling summary table. A Windows thread can be in several states:
- Waiting – not ready to run because the thread is waiting on something through a call to WaitForMultipleObjects, EnterCriticalSection, ReadFile, etc.
- Ready – the thread is ready to run but has not yet been scheduled to run. Threads move from waiting to ready when their wait is satisfied or times out. If a thread is in the Ready state for any non-trivial amount of time then it indicates some level of CPU starvation
- Running – the thread is running on one of the CPUs, executing instructions and getting work done. Threads move from ready to running when a CPU is available. They can move back to ready (if they lose the CPU) or back to waiting (if they wait on something).
- There are other intermediate states used by the scheduler to improve scalability such as DeferredReady and Standby, and states such as Initialized and Terminated for new and dead threads, but those shouldn’t normally be relevant for performance purposes.
There are up to three processes involved in a context switch:
- New Process/Thread: this is the process/thread that we are switching to – the one that is starting execution at this point. This may be the idle thread – when a CPU is idle it is considered to be ‘executing’ the idle thread. The New Thread is moving from Ready to Running.
- Old Process/Thread: this is the process/thread that we are switching away from – the one that has lost (or relinquished) the CPU. This may be the idle thread. The Old Thread is moving from Running to Waiting or to Ready.
- Readying Process/Thread: this is the process/thread that ‘readied’ the new process – that satisfied whatever it was waiting on. This could be a thread that released a critical section, or set an event, or something else. This readying could represent keyboard input, completion of file I/O, receipt of a network packet, or many other possibilities. Sometimes there is no readying thread – this happens when a thread wasn’t waiting for anything except more CPU time. Sometimes the readying thread is a DPC, which means it comes from a random thread that was hijacked to run the DPC code – watch for it. The Readying Thread’s state doesn’t change during this context switch, and at the time of the context switch it can be in any one of the thread states.
In short, a context switch, as recorded by ETW, is the act of moving the New Thread from Ready to Running, and moving the Old Thread from running to some other state, on a particular CPU.
Some documents suggest that Windows always ensures that the highest priority threads are running, but this is a half truth at best, and usually irrelevant. This priority guarantee is a half truth because current versions of Windows actually only guarantee that a particular CPU is running the highest priority thread in its per-CPU queue. It is quite possible for a high-priority thread to stay in the Ready state for a few quanta while a low-priority thread continues to run on another CPU. In addition, the priority guarantee is usually not relevant because threads are constantly having their priorities adjusted, so you can’t usefully know what priority a thread will have at the time (although you can see it in the ETW trace).
Many of the column names are compound words and if you understand the meaning of the constituent words then you can more easily remember the meaning of the columns. Memorizing these words is far more valuable than memorizing individual column names. Some of the most important words used are:
- New – this refers to the thread/process that is being switched to – the thread that will be running after the context switch
- Old – this refers to the thread/process that is being switched away from – the thread that was running before the context switch
- Ready/Readying – this refers to the thread/process, if any, that readied the new thread
- New Prev – this appears to refer to the previous context switch involving the new thread – the time that it stopped running, which may have been arbitrarily many context switches earlier
- In – this apparently refers to a thread’s state at the moment when it starts running, when it switched “in” to the Running state – its state at the beginning of a period of running
- Out – this apparently refers to a thread’s state when it stops running, when it is switched “out” of the Running state
- Max – when rows are collapsed together a Max column is the maximum of some value across all of the collapsed rows, which is useful when searching for a single context switch with a very high value
- Sum – when rows are collapsed together a Sum column is the sum of some value across all of the collapsed rows, which is useful when finding what collection of context switches are responsible for a high total. There are many columns with no Sum or Max prefix which behave as Sum columns, such as Count, and the time duration columns
- Time has two different meanings in this summary table. Time (s) refers to a point in time in the trace (with units of seconds). Time (us) refers to a duration – a length of time (with units of microseconds). Thus, the units of time are crucial clues to the meaning of a column.
That is how we can end up with column names such as New Prev Out Pri which is the Priority of the New thread at the end (Out) of the Previous time that it ran. Phew.
Documentation of important columns
- New/Readying/Old Process Name – this is the name of the process (i.e.; devenv.exe) without the process ID. This column is useful when you want to group all processes from one executable together.
- New/Readying/Old Process – this is the name of the process plus its process ID. This column is useful when you want each process to be grouped separately
- New/Readying/Old ThreadId – the ID of the thread, or –1 if there is no thread
- New Thread Stack – the stack of the new thread, which is both where it resumes execution after the context switch, and where it was when its execution was suspended on an earlier context switch. This represents where the thread was waiting
- Ready Thread in DPC – when the value in this column is True it means that the thread was not readied by a process. It was readied by a deferred procedure call (doing work on behalf of an interrupt) that has hijacked a process temporarily. When this happens it means that the Readying Process and Readying Thread Id are meaningless – they are just random victims. The Ready Thread Stack is also mostly meaningless but it will usually have ntoskrnl.exe!KiProcessExpiredTimerList on it because Ready Thread in DPC almost means that a thread was readied due to a timer expiring, such as Sleep(1) or a timeout being reached on WaitForSingleObject.
- Ready Thread Stack – the stack of the thread, if any, which readied (made eligible to run) the new thread
- Old Thread Stack – there is no column that records the call stack of the old thread (the thread being switched out). The New Thread Stack for the next context switch involving the Old Thread contains that data
- Ready Time (s) – this is the time at which a thread was readied – the time at which it became eligible to run
- Count – this is the number of context switches summarized by the current row. When the first column is NewProcess and all of the data is collapsed this represents the number of context switches per process over the selected time. As you drill down into the data you will eventually get down to individual context switches and the count will be one.
- Sum/Max:Time Since Last (μs) – this is the time since the thread last ran – i.e.; the time the thread was in the Waiting and Ready states
- Sum/Max:Waits (μs) – this is the length of time that the thread was in the Waiting state before it was moved to Ready
- Sum/Max:Ready (μs) – this is the length of time that the thread was in the Ready state before it was moved to Running. This is often just a few dozen microseconds but can be arbitrarily long on a CPU starved system. The sum of Sum:Waits and Sum:Ready should be Sum:Time Since Last, and a little Excel spot checking shows that it is
- Count: Waits – this counts how many of the context switches have a non-zero Wait time – i.e.; how many of the context switches were after the new thread waited. If a thread is preempted then it can go from the Running state to the Ready state and back to the Running state with no waiting. If you drill down to an individual context switch then Count:Waits count will be zero or one. If it is zero then Sum/Max:Waits will also be zero, and ReadyingProcess, ReadyingThreadId, and ReadyThreadStack will all be empty
- New/Old InSwitchTime (μs) – these represent how long the new and old threads will/did run – these are the elapsed time between the current context switch and the next/prev context switches – see the Learning through exploring section below for more details – these columns no longer exist in WPA 10. CPU Usage (ms) is the closest substitute for NewInSwitchTime (μs).
- Switch-In Time (s) – this is the time of the context switch.
- Annotation – this, column, which is available on most or all WPA tables, can be filled in during trace analysis. See below for details.
My editor tells me that I need more diagrams, so here is a lovely Visio diagram showing some of the times and durations in a context switch – pairs of parentheses indicate time durations (us), and arrows point to particular times (s). The data is all for the Context Switch pointed to by the text with the purple background:
For those of you who prefer math with your diagrams:
- Switch-In Time (s) + CPU Usage (ms) * 1e-3 = Next Switch-Out Time (s)
- Ready Time (s) + Ready (μs) * 1e-6 = Switch-In Time (s)
- Waiting (μs) + Ready (μs) = Time Since Last (μs)
The annotation column can be useful during analysis as a way to add an arbitrary tag to a set of data. When the Annotation column is enabled it should typically be one of the left-most columns. Then, when a set of data is understood the trace analyst should select that data, right-click and select Annotate Selection…, and enter an appropriate description. Then the trace analysis can continue on the data that is under the <Not Annotated> node in the Annotation column, undistracted by the data that is already understood. In the screen show below the Annotation column was used to tag context switches as “Self-readied” or “Readied by other process” in order to better visualize flame graph data:
Some of these might be more important if Microsoft explained them enough to be useful, but most of them are really only of interest to kernel developers (note that some have been renamed slightly with spaces added):
- Last Switch-Out Time (s) – this is the time when the old thread last stopped running, which could be arbitrarily many context switches ago. However this value appears to be buggy in regards to the idle thread where it is always the same as SwitchInTime (s)
- Next Switch-Out Time (s) – this is the time of the next context switch. Or, as the name wants you to think about it, this is when the new thread will be switched out.
- New Prev Out Pri – I believe that this is the priority of the new thread at the end of the last time that it ran
- New In Pri – this is the priority of the new thread when it moves into the Running state because of this context switch. This is often elevated by one or two levels because the Windows scheduler boosts the priority of threads that have just been readied
- New Out Pri – I believe that this is what the priority of the new thread will be when it ends its current running state – at the next context switch. This often decays back to the same level of New Prev Out Pri
- Old Out Pri – I believe this is the priority of the old thread at the time that it stopped running. Note that the idle thread runs at priority zero
- Cpu – this is the CPU on which the context switch is happening. A thread can move across CPUs (although Windows tries to minimize that) but for any given context switch, a particular CPU goes from running one thread to running another thread
- New/Readying/Old ThreadStartModule – the module where the thread originally started
- New/Readying/Old ThreadStartFunction – the function where the thread originally started
- Ideal Cpu – every thread has an ideal CPU that Windows tries to schedule it on. This field presumably represents this data for the New Thread. Grouping by Cpu and then Ideal Cpu shows that some CPUs have far more context switches than others, and the majority of context switches have the new thread running on its ideal CPU. But I can’t find any time where this matters
- New/Old Qnt – these are always -1, even on Windows 10. This appears to be an old and unfixed bug.
- New Prev Wait Mode, New/Old Wait Mode – the available wait modes seem to be Unknown, NonSwap, and Swapable. None of these seem relevant to non-kernel developers
- New Pri Decr – this number varies from –112 to +112, with zero and two being the most common values. But I don’t know what it means
- New Prev State, New/Old State – these refer to the Waiting, Ready, Running, etc. states that a thread can be in, but I can’t make any sense of the data
- New Prev Wait Reason, New/Old Wait Reason – these presumably refer to why the thread was waiting or otherwise not running, but the number of reasons and the diversity of call stacks makes understanding them seem hopeless
- Prev CState – Cstates represent how active your CPU is, with zero being fully active and larger numbers representing various degrees of reduced power. The largest number I’ve seen on a trace is ‘2’
- AdjustIncrement – unknown
- AdjustReason – unknown
- DirectSwitchAttempt – unknown
- KernelStackNotResident – unknown
- NewThreadPrevRemainingQuantum – unknown
- OldThreadRank – unknown
- OldThreadRemainingQuantum – presumably this records how much of the old (switched out) threads’ quantum was remaining
- ProcessOutOfMemory – unknown, and doesn’t this seem out-of-place in context-switch data?
Learning through exploring
I have no secret source of knowledge for what most of these columns are – I didn’t call up someone at Microsoft and ask, and my web searches for documentation were generally fruitless. Instead, I figured out the meanings of the columns by rearranging the summary table to figure out patterns. For instance, in order to figure out OldInSwitchTime (μs) (now deleted) and NewInSwitchTime (μs) (now deleted, use CPU Usage (ms) instead) I rearranged the columns in the order:
- Cpu – the CPU the context switch occurs on
- Orange bar – columns to the left are used for grouping
- New Process – the process we are switching to
- NewThreadId – the thread we are switching to
- SwitchInTime (s) – the time of the context switch
- OldInSwitchTime (μs)
- NewInSwitchTime (μs)
Once I did this and sorted by Switch In Time (s) the patterns became obvious:
InSwitchTime (μs) is the length of time that a thread is running, from the context switch where it starts running to the context switch where another thread starts running. This can be verified by taking SwitchInTime (s) on the selected row, adding NewInSwitchTime (μs) (remembering to divide by 1,000,000 to convert to seconds), and noting that the result is the SwitchInTime (s) for the next row – the next context switch. By selecting a few thousand rows, copying to a spreadsheet and doing some Excel-fu you can quickly verify this.
The OldInSwitchTime (μs) is how long the old thread ran, and the NewInSwitchTime (μs) is how long the new thread will run. The NewInSwitchTime (μs) is not known at the time that the context switch happens, but xperf helpfully calculates it at analysis time. You can see that the OldInSwitchTime (μs) for a given row (when sorting by SwitchInTime (s) and when there is one context switch per row) is just the NewInSwitchTime (μs) from the previous row (the previous context switch).
Aside: I don’t know what Outlook is doing in the screenshot above but it appears to be suffering from Attention Deficit Hyperactivity Disorder. In total it flips back and forth between it and the idle thread seven times (not all shown) and it never runs for more than 0.02 ms. That can’t be good.
It can be interesting to see all of the context switches on a single CPU – or depressing if you think about the lost efficiency that comes from thousands of context switches per second per CPU. To do this just arrange the columns like this:
- Switch-In Time (s)
- Orange bar (columns to the left are used for grouping)
- New Process
- New Thread Id
- Sum: Time Since Last (μs)
- Sum: Ready (μs)
- Sum: Waits (μs)
- Whatever else you want summarized…
When tracing through chains of readying threads (thread A waiting on thread B which was waiting on thread C…) the identify of the readying thread is crucial, but sometimes it is irrelevant. If a thread was not actually waiting (if it was in the Standby state due to CPU starvation) then there will be no readying thread. If it was waiting on a timer (calling Sleep or a call to WaitForMultipleObjects that timed out) then the readying thread appears to be whatever thread happened to be running on that CPU – watch for KiProcessExpiredTimerList in the ReadyThreadStack for random threads. That’s because DPCs effectively ‘hijack’ other threads and run in their context, and WPA doesn’t expose that at a high level.
Go forth and rearrange summary tables
If you understand how summary tables work and you understand the data that they are presenting you with then there are many rich explorations of CPU Scheduling that you can do. If you figure out a cool technique, let me know. If you find any errors, or new or better column explanations, please share.
I didn’t figure out the significance of Time (s) versus Time (μs) (point in time versus duration) until I was researching for this post. In fact, I didn’t figure it out until about half an hour before publishing this post. Knowing that distinction is going to make this summary table much easier to use, and justifies the time I spent writing this post.
Microsoft’s documentation has improved a lot. See this page for a different perspective.