As I’ve mentioned previously, the documentation for xperf (Windows Performance Toolkit) is pretty weak. In this post I’m going to attempt to explain the meaning of the extremely subtle and non-obvious columns in the CPU Scheduling Summary Tables. Understanding these columns is important when analyzing idle time, which is one of xperf’s most valuable features.
The CPU Scheduling summary table is one of the worst offenders for having confusingly named columns. Quick – what’s the difference between SwitchInTime and InSwitchTime? And is NewPrevOutPri a real column name, or is it a made-up compound word? Stick around and find out.
Here are the other articles in this series:
It is assumed that you have already read the posts explaining how to install xperf and record traces using some variation of my recommended settings and that you have read how to use summary tables, including grouping and sorting.
CPU Scheduling graph
Note: the explanations below discuss xperfview, but WPA is now the recommended viewer for xperf traces. Getting to a summary table and adding and removing columns is now different, but the grouping, sorting, and the actual column contents are unchanged.
The graph is mostly not very useful, but there are a couple of patterns that are worth pointing out. The first pattern is that it is fairly obvious that four of the eight lines in my screenshot have many context switches – so many that the diamonds overlap into solid thick lines – whereas the others have fairly few. This is because Windows normally tries to use only one hyperthread per core, so the second hyperthread per core sits mostly idle. The other hyperthread is used only when CPU demand justifies it, as I discussed in this classic post.
The other thing to notice is that there are a lot of context switches. Thousands per second. I guess that’s normal on a system with many processes running, but it seems inefficient. When it happens on an ‘idle’ system it means that the battery is being drained a lot faster than is ideal because the processor never has a chance to go deep into its power-saving idle states. Maybe I have too many processes running on my laptop.
If you right-click on the CPU Scheduling graph you will see menu options for four different summary tables. They are:
- Summary Table
- Summary Table with Ready Thread
- Wait Classification
- Wait Classification with Ready Thread
The Ready Thread (discussed here) is a big part of the value of the CPU scheduling data, and xperf in general, so I’m going to ignore the two summary tables that lack Ready Thread data. I’m also going to ignore the Wait Classification summary table because I haven’t used it enough to say anything useful.
So, Summary Table with Ready Thread it is.
Default columns for Summary Table with Ready Thread
The default set of columns for the Summary Table with Ready Thread is annoyingly far from what I want. I find that I have to enable/disable about a dozen columns in order to get this summary table into a usable state. I don’t know why the default columns include Cpu, IdealCPU, and OldOutPri, but don’t include NewThreadStack or ReadyThreadStack, but I can only assume that the settings are relevant for NT kernel developers. Anyway, I’ll tell you which columns I have enabled, and I recommend that when you bring up this summary table you immediately switch to this set:
- NewProcess (or New Process Name, if you want more grouping)
- Orange bar (columns to the left are used for grouping)
- Sum:TimeSinceLast (us)
- Max:TimeSinceLast (us) (often optional)
- SwitchInTime (s) (often optional)
It should look something like this:
Advanced users can go beyond this, but new users should start with these. Luckily xperf will remember your column preferences, and you can even use the View menu to save your favorite column layouts with an appropriate name.
CPU scheduling shows switches
All of the columns in the CPU Scheduling summary table show data that is related to a particular context switch (or multiple context switches grouped together), and the grouping (by the columns to the left of the orange bar) affects how many context switches are displayed on each row. With the recommended default state there will initially be one row for each process for which context switches were recorded during the selected time period, and as you drill in more rows will be revealed. The grouping by stack is particularly important since it lets you see the big picture more easily.
Any threads or processes that did not have context switches during the selected time period will not show up.
Before diving into the columns (there are 50 of them!) some concepts and prefixes need to be covered.
Context switch concepts
The CPU Scheduling summary tables are summarizing context switches. They record the act of one logical CPU changing from executing one thread to executing another thread. In order to understand context switches it is important to understand the different states in which a thread can be – if you don’t understand Windows thread scheduling and its nomenclature then you can’t fully understand the CPU scheduling summary table. A Windows thread can be in several states:
- Waiting – not ready to run because the thread is waiting on something through a call to WaitForMultipleObjects, EnterCriticalSection, ReadFile, etc.
- Ready – the thread is ready to run but has not yet been scheduled to run. Threads move from waiting to ready when their wait is satisfied or times out. If a thread is in the Ready state for any non-trivial amount of time then it indicates some level of CPU starvation
- Running – the thread is running on one of the logical CPUs, executing instructions and getting work done. Threads move from ready to running when a CPU is available. They can move back to ready (if they lose the CPU) or back to waiting (if they wait on something).
- There are other intermediate states used by the scheduler to improve scalability such as DeferredReady and Standby, and states such as Initialized and Terminated for new and dead threads, but those shouldn’t normally be relevant for performance purposes.
There are up to three processes involved in a context switch:
- New Process/Thread: this is the process/thread that we are switching to – the one that is starting execution at this point. This may be the idle thread – when a CPU is idle it is considered to be ‘executing’ the idle thread. The New thread is moving from Ready to Running.
- Old Process/Thread: this is the process/thread that we are switching away from – the one that has lost (or relinquished) the CPU. This may be the idle thread. The Old thread is moving from Running to Waiting or to Ready.
- Readying Process/Thread: this is the process/thread that ‘readied’ the new process – that satisfied whatever it was waiting on. This could be a thread that released a critical section, or set an event, or something else. This readying could represent keyboard input, completion of file I/O, receipt of a network packet, or many other possibilities. Sometimes there is no readying thread – this happens when a thread wasn’t waiting for anything except more CPU time. The Readying thread’s state doesn’t change during this context switch, and at the time of the context switch it can be in any one of the thread states.
A context switch, as recorded by ETW, is the act of moving the New thread from Ready to Running, and moving the Old thread from running to some other state, on a particular CPU.
Some documents suggest that Windows always ensures that the highest priority threads are running, but this is a half truth at best, and usually irrelevant. This priority guarantee is a half truth because current versions of Windows actually only guarantee that a particular CPU is running the highest priority thread in its per-CPU queue. It is quite possible for a high-priority thread to stay in the Ready state for a few quanta while a low-priority thread continues to run on another CPU. In addition, the priority guarantee is usually not relevant because threads are constantly having their priorities adjusted, so you can’t usefully know what priority a thread will have.
See this presentation for more details on Windows thread scheduling.
Many of the column names are compound words and if you understand the meaning of the constituent words then you can more easily remember the meaning of the columns. Memorizing these words is far more valuable than memorizing individual column names. Some of the most important words used are:
- New – this refers to the thread/process that is being switched to – the thread that will be running after the context switch
- Old – this refers to the thread/process that is being switched away from – the thread that was running before the context switch
- Ready/Readying – this refers to the thread/process, if any, that readied the new thread
- NewPrev – this appears to refer to the previous context switch involving the new thread – the time that it stopped running, which may have been arbitrarily many context switches earlier
- In – this apparently refers to a thread’s state at the moment when it starts running, when it switched “in” to the Running state – its state at the beginning of a period of running
- Out – this apparently refers to a thread’s state when it stops running, when it is switched “out” of the Running state
- Max – when rows are collapsed together a Max column is the maximum of some value across all of the collapsed rows, which is useful when searching for a single context switch with a very high value
- Time has two different meanings in this summary table. Time (s) refers to a point in time in the trace (with units of seconds). Time (us) refers to a duration – a length of time (with units of microseconds). Thus, the units of time are crucial clues to the meaning of a column.Sum – when rows are collapsed together a Sum column is the sum of some value across all of the collapsed rows, which is useful when finding what collection of context switches are responsible for a high total. There are many columns with no Sum or Max prefix which behave as Sum columns, such as Count, and the time duration columns
That is how we can end up with column names such as NewPrevOutPri which is the Priority of the New thread at the end (Out) of the Previous time that it ran. Phew.
Documentation of important columns
- New/Readying/Old Process Name – this is the name of the process (i.e.; devenv.exe) without the process ID. This column is useful when you want to group all processes from one executable together.
- New/Readying/Old Process – this is the name of the process plus its process ID. This column is useful when you want each process to be grouped separately
- New/Readying/Old ThreadStartModule – the module where the thread originally started
- New/Readying/Old ThreadStartFunction – the function where the thread originally started
- New/Readying/Old ThreadId – the ID of the thread, or –1 if there is no thread
- NewThreadStack – the stack of the new thread, which is both where it resumes execution after the context switch, and where it was when its execution was suspended on an earlier context switch. This represents where the thread was waiting
- ReadyThreadStack – the stack of the thread, if any, which readied (made eligible to run) the new thread
- OldThreadStack – there is no old thread stack column. The NewThreadStack for the next context switch involving the OldThread contains that data
- ReadyTime (s) – this is the time at which a thread was readied – the time at which it became eligible to run
- Count – this is the number of context switches summarized by the current row. When the first column is NewProcess and all of the data is collapsed this represents the number of context switches per process over the selected time. As you drill down into the data you will eventually get down to individual context switches and the count will be one.
- Sum/Max:TimeSinceLast (us) – TimeSinceLast is the time since the thread last ran – i.e.; the time the thread was in the Waiting and Ready states
- Sum/Max:Waits (us) – this is the length of time that the thread was in the Waiting state before it was moved to Ready
- Sum/Max:Ready (us) – this is the length of time that the thread was in the Ready state before it was moved to Running. This is often just a few dozen microseconds but can be arbitrarily long on a CPU starved system. The sum of Sum:Waits and Sum:Ready should be Sum:TimeSinceLast, and a little Excel spot checking shows that it is
- Count:Waits – this counts how many of the context switches have a non-zero Wait time – i.e.; how many of the context switches were after the new thread waited. If a thread is preempted then it can go from the Running state to the Ready state and back to the Running state with no waiting. If you drill down to an individual context switch then Count:Waits count will be zero or one. If it is zero then Sum/Max:Waits will also be zero, and ReadyingProcess, ReadyingThreadId, and ReadyThreadStack will all be empty
- New/Old InSwitchTime (us) – these represent how long the new and old threads will/did run – these are the elapsed time between the current context switch and the next/prev context switches – see the Learning through exploring section below for more details
- SwitchInTime (s) – this is the time of the context switch. Don’t confuse SwitchInTime (s) with InSwitchTime (us). SwitchInTime (s) is a point in time, whereas InSwitchTime (us) is actually a duration
- LastSwitchOutTime (s) – this is the time when the old thread last stopped running, which could be arbitrarily many context switches ago. However this value appears to be buggy in regards to the idle thread where it is always the same as SwitchInTime (s)
- NextSwitchOutTime (s) – this is the time of the next context switch. Or, as the name wants you to think about it, this is when the new thread will be switched out. It is SwitchInTime (s) plus NewInSwitchTime (us)
- NewPrevOutPri – I believe that this is the priority of the new thread at the end of the last time that it ran
- NewInPri – I believe that this is the priority of the new thread when it moves into the Running state because of this context switch. This is often elevated by one or two levels because the Windows scheduler boosts the priority of threads that have just been readied
- NewOutPri – I believe that this is what the priority of the new thread will be when it ends its current running state – at the next context switch. This often decays back to the same level of NewPrevOutPri
- OldOutPri – I believe this is the priority of the old thread at the time that it stopped running. Note that the idle thread runs at priority zero
- Cpu – this is the CPU on which the context switch is happening. A thread can move across CPUs (although Windows tries to minimize that) but for any given context switch, a particular CPU goes from running one thread to running another thread
My editor tells me that I need more diagrams, so here is a lovely Visio diagram showing some of the times and durations in a context switch – pairs of parentheses indicate time durations (us), and arrows point to particular times (s). The data is all for the context switch in purple:
For those of you who prefer math with your diagrams:
- SwitchInTime (s) + NewInSwitchTime (us) * 1e-6 = NextSwitchOutTime (s)
- ReadyTime (s) + Ready (us) * 1e-6 = SwitchInTime (s)
- Waiting (us) + Ready (us) = TimeSinceLast (us)
Some of these might be more important if Microsoft explained them enough to be useful, but most of them are really only of interest to kernel developers:
- IdealCpu – every thread has an ideal CPU that Windows tries to schedule it on. This field presumably represents this data for the NewThread. Grouping by Cpu and then IdealCpu shows that some CPUs have far more context switches than others, and the majority of context switches have the new thread running on its ideal CPU. But I can’t find any time where this matters
- New/Old Qnt – these are always -1. This appears to be an old and unfixed bug.
- NewPrevWaitMode, New/Old WaitMode – the available wait modes seem to be Unknown, NonSwap, and Swapable. None of these seem relevant to non-kernel developers
- NewPriDecr – this number varies from –112 to +112, with zero and two being the most common values. But I don’t know what it means
- NewPrevState, New/Old State – these refer to the Waiting, Ready, Running, etc. states that a thread can be in, but I can’t make any sense of the data
- NewPrevWaitReason, New/Old WaitReason – these presumably refer to why the thread was waiting or otherwise not running, but the number of reasons and the diversity of call stacks makes understanding them seem hopeless
- PrevCState – Cstates represent how active your CPU is, with zero being fully active and larger numbers representing various degrees of reduced power. The largest number I’ve seen on a trace is ‘2’
Learning through exploring
I have no secret source of knowledge for what most of these columns are – I didn’t call up someone at Microsoft and ask, and my web searches for documentation were generally fruitless. Instead, I figured out the meanings of the columns by rearranging the summary table to figure out patterns. For instance, in order to figure out OldInSwitchTime (us) and NewInSwitchTime (us) I rearranged the columns in the order:
- Cpu – the CPU the context switch occurs on
- Orange bar – columns to the left are used for grouping
- New Process – the process we are switching to
- NewThreadId – the thread we are switching to
- SwitchInTime (s) – the time of the context switch
- OldInSwitchTime (us)
- NewInSwitchTime (us)
Once I did this and sorted by SwitchInTime (s) the patterns became obvious:
InSwitchTime (us) is the length of time that a thread is running, from the context switch where it starts running to the context switch where another thread starts running. This can be verified by taking SwitchInTime (s) on the selected row, adding NewInSwitchTime (us) (remembering to divide by 1,000,000 to convert to seconds), and noting that the result is the SwitchInTime (s) for the next row – the next context switch. By selecting a few thousand rows, copying to a spreadsheet and doing some Excel-fu you can quickly verify this.
The OldInSwitchTime (us) is how long the old thread ran, and the NewInSwitchTime (us) is how long the new thread will run. The NewInSwitchTime (us) is not known at the time that the context switch happens, but xperf helpfully calculates it at analysis time. You can see that the OldInSwitchTime (us) for a given row (when sorting by SwitchInTime (s) and when there is one context switch per row) is just the NewInSwitchTime (us) from the previous row (the previous context switch).
Aside: I don’t know what Outlook is doing in the screenshot above but it appears to be suffering from Attention Deficit Hyperactivity Disorder. In total it flips back and forth between it and the idle thread seven times (not all shown) and it never runs for more than 0.02 ms. That can’t be good.
It can be interesting to see all of the context switches on a single CPU – or depressing if you think about the lost efficiency that comes from thousands of context switches per second per CPU. To do this just arrange the columns like this:
- SwitchInTime (s)
- Orange bar (columns to the left are used for grouping)
- Sum:TimeSinceLast (us)
- Sum:Ready (us)
- Sum:Waits (us)
- Whatever you want summarized…
When tracing through chains of readying threads (thread A waiting on thread B which was waiting on thread C…) the identify of the readying thread is crucial, but sometimes it is irrelevant. If a thread was not actually waiting (if it was in the Standby state due to CPU starvation) then there will be no readying thread. If it was waiting on a timer (calling Sleep or a call to WaitForMultipleObjects that timed out) then the readying thread appears to be whatever thread happened to be running on that CPU – watch for KiProcessExpiredTimerList in the ReadyThreadStack for random threads. Keep in mind, however, that I am documenting observed behavior and I am probably missing some of the details, so pay attention and think about what you are seeing. Don’t expect that this documentation will be complete, or completely accurate.
Go forth and rearrange summary tables
If you understand how summary tables work and you understand the data that they are presenting you with then there are many rich explorations of CPU Scheduling that you can do. If you figure out a cool technique, let me know. If you find any errors, or new or better column explanations, please share.
If you’re from Microsoft – tooltips would be nice.
I didn’t figure out the significance of Time (s) versus Time (us) until I was researching for this post. In fact, I didn’t figure it out until about half an hour before publishing this post. Knowing that distinction is going to make this summary table much easier to use, and justifies the time I spent writing this post.