“I want to say, in all seriousness, that a great deal of harm is being done in the modern world by belief in the virtuousness of WORK, and that the road to happiness and prosperity lies in an organised diminution of work.”1
In the past few years I have encountered many examples of code that spins in a busy loop, doing nothing useful for dozens of milliseconds. I’ve looked at dozens of xperf traces from customer machines that show both busy waiting and CPU starvation. There are many misconceptions about the costs and benefits of busy waiting, and I want to explain emphatically why busy waiting is rarely a good idea, is a worse idea now than in the past, and will become an even worse idea in the future.
The temptation to busy wait
A simple way of providing mutual exclusion is to have flag which a thread can atomically test-and-set, using something like InterlockedCompareExchange. If the flag is zero and you are able to atomically set it to one then you ‘own’ that critical region and you can safely execute your code. If you write your own lightweight mutex you have to be careful to avoid reordering problems but the basics of guaranteeing mutual exclusion are actually quite straightforward.
The problem is, what happens when somebody else owns the lock? You need to wait for them to release the lock – but how?
One option is to wait on a semaphore. That puts you to sleep until the lock is released. That is good. That is appropriate. That is the whole freakin’ point of this article. So do that. But first I want to explain why developers sometimes don’t like waiting on a semaphore.
Waiting on a semaphore means calling out to the kernel (expensive), context switching away from your thread (expensive) and then context switching back to your thread. If the lock is released quickly then it can be more efficient to busy wait for “a while” in order to avoid the costs of the kernel call and a pair of context switches.
So, busy waiting on a mutual exclusion flag can be more efficient in some cases. But can we gain that efficiency without paying an unreasonable price?
The simple case
Imagine that you are running on a N-core system and there are N threads running. Each thread runs on a different core. In this scenario it seems reasonable for a thread that wants a lock to busy wait until it acquires the lock, because (by definition) there are enough CPU resources for all, so we might as well keep spinning, waiting for the lock to be freed.
That all sounds very reasonable, but even in this extremely simple case there may be reasons why busy waiting is a bad idea.
No CPU is an island
- In these days of TurboBoost, the total power draw of a chip is capped. If one core draws more power – because it is spinning in a busy loop – then there is less power available for the other cores, and the whole chip may have to run at a lower frequency.
- In these days of integrated graphics (and the future is increasingly integrated), the total power draw of a chip is capped. If the CPU draws more power – because it is spinning in a busy loop – then there is less power available for the GPU, and it may have to run at a lower frequency.
- Busy waiting wastes energy which can reduce battery life or just waste electricity, increase fan noise, and heat my office or my lap.
- Hyperthreading means that two logical cores residing on one physical core may be sharing resources, such as execution units or L1 caches. The exact details of the sharing vary dramatically (Bulldozer modules versus Intel hyperthreading) but the fundamental problem remains the same – code executed on one logical core can slow down the other. Calling Sleep(0) repeatedly is particularly bad because it uses a lot of cache and execution resources. However any sort of spinning is likely to have some negative effect on the other logical core.
- InterlockedCompareExchange is, by its very nature, a global operation. In order to do an atomic read-modify-write it needs to use bus locking or cache line reservation which requires communicating with – and potentially blocking – every other core in the system.
- Even just a normal read (it turns out that a good optimization when doing a spin lock is to do a normal read before attempting the expensive InterlockedCompareExchange) has to bring in the associated cache line. If it is owned exclusively by another thread (if another thread is writing to the same cache line) then there is some cost. This should be minor, and if this is a problem then the false sharing which it implies should really be fixed, but this is another potential risk.
Because of all of these issues it is likely that busy waiting will reduce overall performance on at least some machines, and waste energy on many others. Even if busy waiting improves performance by a few percent on your machine, there are undoubtedly some customer machines (now, or in the future) where it will cause a much more significant slowdown.
Our idealized example of N-cores and N-threads trying to run is very tidy. On Xbox 360 or other single-purpose machines it may actually be realistic (some of the “No CPU is an island” warnings still apply), but on a PC it is completely unrealistic. Even on a PC where your game is the “only application running” there is, I’m afraid to say, a huge amount of background CPU activity. This could be anti-virus, services, the system process, various application update mechanisms, or any one of dozens of other processes. This load varies, but can quite easily add up to several simultaneous threads trying to run, at least part of the time.
In addition, on a PC you don’t really know what core your code is running on. It changes constantly and, while you can set thread affinity, on a PC you usually shouldn’t because it just constrains the operating system and means that your code will be affected even more by other processes. As usual, Xbox 360 is an exception here because it requires thread affinities to be set – this works because there are no other processes running.
It turns out that on Windows you can’t even tell how many logical cores are currently available. Sure, you can ask the OS for the core count, and you can ask about logical versus physical cores (although Bulldozer blurs the distinction), but the actual number of logical cores that the OS considers eligible for thread scheduling varies over time. When demand is low Windows tries to keep the secondary logical cores on each physical core idle. When demand is high the OS makes those secondary logical cores available, and this can change multiple times per second.
You also don’t know what priority your threads are running at. You can set your thread’s priority, but Windows dynamically adjusts threads’ priorities in order to avoid live-lock and to improve responsiveness. These algorithms are complicated and change over time, so don’t even think about trying to predict them. Other operating systems will have different algorithms, so any attempt at ensuring that you are not interfering with useful work is doomed to failure.
The net result of this is that sometimes when you busy wait you are causing CPU starvation, and on a PC this is fundamentally unavoidable and unpredictable.
A lot of people decide that they are going to sleep in a polite way. They are going to repeatedly call Sleep(0) in some vague attempt to let other threads run. When you call Sleep(0) on Windows it will either return immediately or it will switch to another thread and I am of the opinion that both of these results are bad.
If Sleep(0) returns immediately then what we have is a very complicated and expensive busy wait that is touching lots of cache lines and is probably hitting the OS scheduler lock, for no good reason.
If Sleep(0) switches to another thread then we have lost the CPU, paid all the cost of context switches, but we have failed to tell the operating system what we are waiting for. Thus, there is no way for the operating system to wake us up promptly when the lock is released. This means that we will likely sleep for an entire quantum, which could be 10-20 ms, and then be given a chance to try again. Meanwhile the lock we wanted will probably have been released and acquired hundreds of times and we might have to repeat the process again!
The one theoretical case when Sleep(0) might help is if the thread that owns the lock is waiting to run, in which case Sleep(0) might give it CPU time so that it can finish its task and release the lock. However Sleep(0) will only relinquish the CPU to a thread of equal priority, and due to dynamic thread priority adjustments there is no way to ensure that your worker thread has a matching thread priority. And, even if thread priorities match, Windows tries to avoid moving threads between cores, so if the IdealCPU for this other thread is not the core you are running on then your attempt at generosity may be rejected.
It’s worth pointing out that the problems with Sleep(0) are not implementation defects – it is just the wrong function for the job. Sleep(0) is a poor match for waiting on a lock because what you really want to do is to give CPU time to the owner of the lock, if it needs it, but Sleep(0) has no way of knowing which thread you care about! On the other hand, if you wait on a critical section then Windows knows exactly what thread you are waiting on, and it can (potentially, I don’t know if it does) give that thread a boost in order to avoid priority inversions or otherwise shorten the delay.
Sleep(0) is, in my opinion, more about faith than science. It is simply not a reliable or efficient way to do anything on Windows. It will occasionally do something helpful due to sheer luck, but you can never count on that and the odds are against you.
Sleep(1) puts your thread to sleep for about a millisecond. This avoids the problems of busy waiting, but in video games a millisecond is often too long to be napping. Also, the behavior of Sleep(1) depends on global system settings, and on what version of Windows you are running. Therefore events and semaphores should usually be preferred over Sleep(1) in a loop – they will wake you immediately upon being signaled.
The YieldProcessor macro (which generally maps to _mm_pause) can help prevent a busy wait from completely overwhelming the system, by inserting pauses in the instruction stream that prevent the busy loop from overwhelming the processor. This is particularly important on hyperthreaded systems since it gives the other logical core time to run. If you must busy wait then be sure to use YieldProcessor(). But I’d rather you didn’t busy wait. YieldProcessor will reduce your load on the system, but it doesn’t change the fact that your idle loop is occupying a logical core that might be needed for something more important.
Complicated enough for you?
That’s a lot of complexity. And there is probably more that I have forgotten or I’m not aware of. Writing your own lock requires you to be an expert in all of the issues mentioned above.
Real world example – waiting for thread termination
One method to detect thread termination is to spin in a loop waiting on a volatile variable that the thread sets as it exits. This is a terrible idea for two reasons. One reason is that it is a busy wait, and therefore bad. The other reason is that it doesn’t actually tell you when the thread has exited. It tells you when the thread is about to exit. This makes it a recipe for race conditions.
Another method to detect thread termination is to spin in a loop and repeatedly call GetExitCodeThread until it no longer gives an exit code of STILL_ACTIVE. This is a terrible idea because this is a busy wait, and therefore bad. In some cases the child thread will be starved for CPU time – and unable to exit – because the parent thread is consuming all available CPU time in its wait-for-exit loop. And heaven help you if your thread’s exit code happens to be STILL_ACTIVE…
The one-true-way to detect thread termination is to call WaitForSingleObject on the thread’s handle. When it returns you can be sure that the thread is terminated. Then, and only then, you should close the thread handle. Simple. And no busy waiting required.
Other platforms should offer equivalent ways of waiting for a thread to terminate, such as pthread_join for pthreads. Or in C++ 11 you can wait for a std::thread to terminate by calling std::thread::join.
Real word example – thread pool
I worked with a thread pool a few years ago that contained several busy waits. These busy waits may have made sense when the code was written, but as core-counts and hyperthreading increased, some architectural problems were revealed.
This system was based around using an event to signal when a new job was available. The thread that added the job (we’ll call it the master) would call SetEvent(). The nature of events is that this would wake up all of the worker threads, and the nature of using events for a thread pool is that the thread pool lock had to be held by the master when calling SetEvent().
It turns out that when a thread is woken up after waiting Windows typically gives it a small priority boost. This means that the worker threads would probably be running at a higher priority than the master thread. Therefore, if there were enough threads in the thread pool (and on hyperthreaded systems ‘enough’ is equal to the core count or less) then the master thread would be switched out so that all of the worker threads could run. However the master thread still holds the lock, so the worker threads would all try to acquire the lock and fail. In the best case scenario this means there are a huge number of context switches before any work can get done.
In one version of this system the lock was a pure spinlock. So, the worker threads would keep trying to acquire the lock, and keep failing. In the purest form of this flaw we had a six-core twelve-thread CPU with eleven worker threads. Six of the worker threads would get scheduled (Windows tries to leave the hyperthreads idle) and would try to acquire the lock, thus starving out the master thread. After a few quanta had expired Windows would either allow more threads to run, or would decay the priority of the worker threads so that the master thread could finally run, and release its lock. It was common to see 20-40 ms sequences where six cores were spinning in a lock-acquire busy wait, and no work was done!
The xperfview capture below shows a call to SetEvent that took 12.434 ms due to this issue, during which time all cores were busy waiting:
Using a semaphore instead of an event ensures that only one worker thread is woken up at a time, and also allows the worker thread to be triggered when the lock isn’t held. This avoids all unnecessary context switches. Switching to a lock that doesn’t spin endlessly also avoids the worst-case scenario.
Real world example – folly::MicroSpinLock
This lock uses a one-byte flag and no semaphore. It spins for 4,000 iterations (using ‘pause’ – excellent) before trying the lock again, and there is no apparent limit to how long it will busy wait. Therefore this lock can lead to live-lock, wasted power, CPU starvation, reduced CPU frequency, and reduced GPU frequency. The only advantage I can see of this lock over a futex is that it is very small (there is also a one-bit version) and it is properly constructed through zero initialization.
The comment block says “Note: these locks are for use when you aren’t likely to contend on the critical section”. Famous last words. Caveat lockor. Sorry Facebook, too risky, not recommended.
Real world example – KeStallExecutionProcessor
KeStallExecutionProcessor is documented as being appropriate for very short busy waits – up to a maximum of 50 microseconds. And yet, when profiling on a SurfaceBook Pro I found that this function was running for ~22 milliseconds at a time. The “good news” is that it was only being told to spin for one millisecond at a time, but was being called ~22 times in a row, but somehow that doesn’t make me feel better. I reported this power bug and it has apparently been fixed. Please don’t use KeStallExecutionProcessor to spin for 440 times longer than recommended.
Busy waiting means that wait analysis doesn’t work, which means that a valuable profiling tool is lost.
Spin a little bit
Spinning is not entirely morally bankrupt. If there is a lock that is typically only held for a few hundred cycles and if the threads in question are usually simultaneously running on different cores, then spinning for a little while is just fine. The key is understanding the odds of the two threads being on different cores, and on a reasonable definition of “for a little while”. You should spin long enough to give yourself a reasonable chance of acquiring the lock, but no longer. And, if you never succeed in acquiring the lock by spinning then you should give up and never spin again.
Windows critical sections provide an option to spin, and they even appear to have an (undocumented) flag that will dynamically adjust the spin count:
Java 6 uses adaptive spinning (in a paper called “Java SE 6 Performance White Paper”) to balance the cost and benefits of spinning.
If you do your own spinning then you should monitor the results to make sure that it is genuinely helping, and you should absolutely cap the spinning at some small time, probably a few thousand cycles. Remember, the only reason to spin is to avoid the cost of a context switch. Unless you measure how long you are spinning and how often the spinning works then you can’t be sure that you are getting any benefit. An unmeasured optimization is no optimization at all.
If you write your own spin loop then you should definitely use YieldProcessor() and you should do normal reads in your loop to see if there is any chance of acquiring the lock before trying the heavyweight InterlockedCompareExchange.
However I am not impressed with the value of spinning. Spinning while trying to acquire a lock only makes sense if the lock is lightly contended for. If that is the case then the benefit is likely to be modest, due to the light contention. However the risks remain high, should you get it wrong.
If you do spin, keep it short. Spinning for dozens of milliseconds just makes me angry. You wouldn’t like me when I’m angry.
Please don’t busy wait. It will often harm performance, while offering modest gain. It will reduce battery life and waste power. It has never been a worse idea than right now, but it will become an even worse idea. There are better ways of improving multi-threaded performance.
If you do busy wait, be sure to cap how long you do it, and do normal reads and YieldProcessor() in the wait loop.
Don’t use Sleep(0).
Don’t use this lock.
For best and simplest results use a CRITICAL_SECTION, futex, Benaphore, or the equivalent on your platform. User-mode mutexes like these are extremely fast in the uncontended case, and if you really want to you can specify a spin count. If you use a CRITICAL_SECTION then it will normally be acquired instantly, at very low cost, and in those rare cases when you do have to wait you will be rewarded on reawakening with a priority boost and a smug sense of satisfaction at not wasting CPU time. If a lock is heavily contended then you should address that problem (separate memory pools anyone?) instead of wasting CPU cycles for “efficiency”.
To paraphrase mothers everywhere, if you don’t have anything useful to do, don’t do anything.
- Spinning in a loop
- Can be simple and tempting
- But mother says no
1. Bertrand Russell, In Praise of Idleness