The behavior of the Windows scheduler changed significantly in Windows 10 2004 (aka, the April 2020 version of Windows), in a way that will break a few applications, and there appears to have been no announcement, and the documentation has not been updated. This isn’t the first time this has happened, but this change seems bigger than last time. So far I have found three programs that hit problems because of this silent change.
The short version is that calls to timeBeginPeriod from one process now affect other processes less than they used to. There is still an effect, and thread delays from Sleep and other functions may be less consistent than they used to be (see [updated] section below), but in general processes are no longer affected by other processes calling timeBeginPeriod.
I think the new behavior is an improvement, but it’s weird, and it deserves to be documented. Fair warning – all I have are the results of experiments I have run, so I can only speculate about the quirks and goals of this change.
Update, 2021: this change is acknowledged now in the timeBeginPeriod documentation.
If any of my conclusions are wrong then please let me know and I will update this.
Timer interrupts and their raison d’être
First, a bit of operating-system design context. It is desirable for a program to be able to go to sleep and then wake up a little while later. This actually shouldn’t be done very often – threads should normally be waiting on events rather than timers – but it is sometimes necessary. And so we have the Windows Sleep function – pass it the desired length of your nap in milliseconds and it wakes you up later, like this:
It’s worth pausing for a moment to think about how this is implemented. Ideally the CPU goes to sleep when Sleep(1) is called, in order to save power, so how does the operating system (OS) wake your thread if the CPU is sleeping? The answer is hardware interrupts. The OS programs a timer chip that then triggers an interrupt that wakes up the CPU and the OS can then schedule your thread.
The WaitForSingleObject and WaitForMultipleObjects functions also have timeout values and those timeouts are implemented using the same mechanism.
If there are many threads all waiting on timers then the OS could program the timer chip with individual wakeup times for each thread, but this tends to result in threads waking up at random times and the CPU never getting to have a long nap. CPU power efficiency is strongly tied to how long the CPU can stay asleep (8+ ms is apparently a good number), and random wakeups work against that. If multiple threads can synchronize or coalesce their timer waits then the system becomes more power efficient.
There are lots of ways to coalesce wakeups but the main mechanism used by Windows is to have a global timer interrupt that ticks at a steady rate. When a thread calls Sleep(n) then the OS will schedule the thread to run when the first timer interrupt fires after the time has elapsed. This means that the thread may end up waking up a bit late, but Windows is not a real-time OS and it actually cannot guarantee a specific wakeup time (there may not be a CPU core available at that time anyway) so waking up a bit late should be fine.
The interval between timer interrupts depends on the Windows version and on your hardware but on every machine I have used recently the default interval has been 15.625 ms (1,000 ms divided by 64). That means that if you call Sleep(1) at some random time then you will probably be woken sometime between 1.0 ms and 16.625 ms in the future, whenever the next interrupt fires (or the one after that if the next interrupt is too soon).
In short, it is the nature of timer delays that (unless a busy wait is used, and please don’t busy wait) the OS can only wake up threads at a specific time by using timer interrupts, and a regular timer interrupt is what Windows uses.
Some programs (WPF, SQL Server, Quartz, PowerDirector, Chrome, the Go Runtime, many games, etc.) find this much variance in wait delays hard to deal with but luckily there is a function that lets them control this. timeBeginPeriod lets a program request a smaller timer interrupt interval by passing in a requested timer interrupt interval. There is also NtSetTimerResolution which allows setting the interval with sub-millisecond precision but that is rarely used and never needed so I won’t mention it again.
Decades of madness
Here’s the crazy thing: timeBeginPeriod can be called by any program and it changes the timer interrupt interval, and the timer interrupt is a global resource.
Let’s imagine that Process A is sitting in a loop calling Sleep(1). It shouldn’t be doing this, but it is, and by default it is waking up every 15.625 ms, or 64 times a second. Then Process B comes along and calls timeBeginPeriod(2). This makes the timer interrupt fire more frequently and suddenly Process A is waking up 500 times a second instead of 64 times a second. That’s crazy! But that’s how Windows has always worked.
At this point if Process C came along and called timeBeginPeriod(4) this wouldn’t change anything – Process A would continue to wake up 500 times a second. It’s not last-call-sets-the-rules, it’s lowest-request-sets-the-rules.
To be more specific, whatever still running program has specified the smallest timer interrupt duration in an outstanding call to timeBeginPeriod gets to set the global timer interrupt interval. If that program exits or calls timeEndPeriod then the new minimum takes over. If a single program called timeBeginPeriod(1) then that is the timer interrupt interval for the entire system. If one program called timeBeginPeriod(1) and another program then called timeBeginPeriod(4) then the one ms timer interrupt interval would be the law of the land.
This matters because a high timer interrupt frequency – and the associated high-frequency of thread scheduling – can waste significant power, as discussed here.
timeGetTime (not to be confused with GetTickCount) is a function that returns the current time, as updated by the timer interrupt. CPUs have historically not been good at keeping accurate time (their clocks intentionally fluctuate to avoid being FM transmitters, and for other reasons) so they often rely on separate clock chips to keep accurate time. Reading from these clock chips is expensive so Windows maintains a 64-bit counter of the time, in milliseconds, as updated by the timer interrupt. This timer is stored in shared memory so any process can cheaply read the current time from there, without having to talk to the timer chip. timeGetTime calls ReadInterruptTick which at its core just reads this 64-bit counter. Simple!
Since this counter is updated by the timer interrupt we can monitor it and find the timer interrupt frequency.
The new undocumented reality
With the Windows 10 2004 (April 2020 release) some of this quietly changed, but in a very confusing way. I first heard about this through reports that timeBeginPeriod didn’t work anymore. The reality was more complicated than this.
A bit of experimentation gave confusing results. When I ran a program that called timeBeginPeriod(2) then clockres showed that the timer interval was 2.0 ms, but a separate test program with a Sleep(1) loop was only waking up about 64 times a second instead of the 500 times a second that it would have woken up under previous versions of Windows.
It’s time to do science
I then wrote a pair of programs which revealed what was going on. One program (change_interval.cpp) just sits in a loop calling timeBeginPeriod with intervals ranging from 1 to 15 ms. It holds each timer interval request for four seconds, and then goes to the next one, wrapping around when it is done. It’s fifteen lines of code. Easy.
The other program (measure_interval.cpp) runs some tests to see how much its behavior is altered by the behavior of change_interval.cpp. It does this by gathering three pieces of information.
- It asks the OS what the current global timer resolution is, using NtQueryTimerResolution.
- It measures the precision of timeGetTime by calling it in a loop until its return value changes. When it changes then the amount it changed by is its precision.
- It measures the delay of Sleep(1) by calling it in a loop for a second and counting how many calls it can make. The average delay is just the reciprocal of the number of iterations.
@FelixPetriconi ran the tests for me on Windows 10 1909 and I ran the tests on Windows 10 2004. The results (cleaned up to remove randomness) are shown here:
What this means is that timeBeginPeriod still sets the global timer interrupt interval, on all versions of Window. We can tell from the results of timeGetTime() that the interrupt fires on at least one CPU core at that rate, and the time is updated. Note also that the 2.0 on row one for 1909 was 2.0 on Windows XP, then 1.0 on Windows 7/8, and is apparently back to 2.0? I guess?
However the scheduler behavior changes dramatically in Windows 10 2004. Previously the delay for Sleep(1) in any process was simply the same as the timer interrupt interval (with an exception for timeBeginPeriod(1)), giving a graph like this:
In Windows 10 2004 the mapping between timeBeginPeriod and the sleep delay in another process (one that didn’t call timeBeginPeriod) is peculiar:
[Updated] The section below was added after publishing and then updated several times.
As was pointed out in the reddit discussion, the left half of the graph seems to be an attempt to simulate the “normal” 15.625 ms delay as closely as possible given the available precision of the global timer interrupt. That is, with a 6 millisecond interrupt interval they delay for ~12 ms (two cycles) and with a 7 millisecond interrupt interval they delay for ~14 ms (two cycles) – that matches the data fairly well. However what about with an 8 millisecond interrupt interval? They could sleep for two cycles but that would give an average delay of 16 ms, and the measured value is more like 14.5 ms.
Closer analysis shows that Sleep(1) when another process has called timeBeginPeriod(8) returns after one interval about 20% of the time and after two intervals the rest. Therefore three calls to Sleep(1) resulting in a average delay of 14.5 ms. This variation in the handling of Sleep(1) happens sometimes at other timer interrupt intervals but is most consistent when it is set to 8 ms.
This is all very weird, and I don’t understand the rationale, or the implementation. The intentional inconsistency in the Sleep(1) delays is particularly worrisome. Maybe it is a bug, but I doubt it. I think that there is complex backwards compatibility logic behind this. But, the most powerful way to avoid compatibility problems is to document your changes, preferably in advance, and this seems to have been slipped in without anyone being notified.
This behavior also seems to apply to CreateWaitableTimerEx and its so-far-undocumented CREATE_WAITABLE_TIMER_HIGH_RESOLUTION flag, based on the quick-and-dirty waitable timer tests that you can find here (requires Windows 10 1803 or higher).
Most programs will be unaffected. If a process wants a faster timer interrupt then it should be calling timeBeginPeriod itself. That said, here are the problems that this could cause:
- A program might accidentally assume that Sleep(1) and timeGetTime have similar resolutions, and that assumption is broken now. But, such an assumption seems unlikely.
- A program might depend on a fast timer resolution and fail to request it. There have been multiple claims that some games have this problem and there is a tool called Windows System Timer Tool and another called TimerResolution 1.2 that “fix” these games by raising the timer interrupt frequency. Those fixes presumably won’t work anymore, or at least not as well. Maybe this will force those games to do a proper fix, but until then this change is a backwards compatibility problem.
- A multi-process program might have its master control program raise the timer interrupt frequency and then expect that this would affect the scheduling of its child processes. This used to be a reasonable design choice, and now it doesn’t work. This is how I was alerted to this problem. The product in question now calls timeBeginPeriod in all of their processes so they are fine, thanks for asking, but their software was misbehaving for several months with no explanation. Since I wrote this article I have received reports of three other multi-process programs (two within Google, two outside) with the same problem (one mentioned here). It’s an easy fix, but a painful investigation to understand what has gone wrong.
The change_interval.cpp test program only works if nothing has requested a higher timer interrupt frequency. Since both Chrome and Visual Studio have a habit of doing this I had to do most of my experimentation with no access to the web while writing code in notepad. Somebody suggested Emacs but wading into that debate is more than I’m willing to do.
I’d love to hear more about this from Microsoft, including any corrections to my analysis. Discussions: