In Praise of Idleness

Posted on June 5, 2012 by brucedawson

“I want to say, in all seriousness, that a great deal of harm is being done in the modern world by belief in the virtuousness of WORK, and that the road to happiness and prosperity lies in an organised diminution of work.”¹

In the past few years I have encountered many examples of code that spins in a busy loop, doing nothing useful for dozens of milliseconds. I’ve looked at dozens of xperf traces from customer machines that show both busy waiting and CPU starvation. There are many misconceptions about the costs and benefits of busy waiting, and I want to explain emphatically why busy waiting is rarely a good idea, is a worse idea now than in the past, and will become an even worse idea in the future.

The temptation to busy wait

A simple way of providing mutual exclusion is to have flag which a thread can atomically test-and-set, using something like InterlockedCompareExchange. If the flag is zero and you are able to atomically set it to one then you ‘own’ that critical region and you can safely execute your code. If you write your own lightweight mutex you have to be careful to avoid reordering problems but the basics of guaranteeing mutual exclusion are actually quite straightforward.

The problem is, what happens when somebody else owns the lock? You need to wait for them to release the lock – but how?

One option is to wait on a semaphore. That puts you to sleep until the lock is released. That is good. That is appropriate. That is the whole freakin’ point of this article. So do that. But first I want to explain why developers sometimes don’t like waiting on a semaphore.

Waiting on a semaphore means calling out to the kernel (expensive), context switching away from your thread (expensive) and then context switching back to your thread. If the lock is released quickly then it can be more efficient to busy wait for “a while” in order to avoid the costs of the kernel call and a pair of context switches.

So, busy waiting on a mutual exclusion flag can be more efficient in some cases. But can we gain that efficiency without paying an unreasonable price?

The simple case

Imagine that you are running on a N-core system and there are N threads running. Each thread runs on a different core. In this scenario it seems reasonable for a thread that wants a lock to busy wait until it acquires the lock, because (by definition) there are enough CPU resources for all, so we might as well keep spinning, waiting for the lock to be freed.

That all sounds very reasonable, but even in this extremely simple case there may be reasons why busy waiting is a bad idea.

No CPU is an island

In these days of TurboBoost, the total power draw of a chip is capped. If one core draws more power – because it is spinning in a busy loop – then there is less power available for the other cores, and the whole chip may have to run at a lower frequency.
In these days of integrated graphics (and the future is increasingly integrated), the total power draw of a chip is capped. If the CPU draws more power – because it is spinning in a busy loop – then there is less power available for the GPU, and it may have to run at a lower frequency.
Busy waiting wastes energy which can reduce battery life or just waste electricity, increase fan noise, and heat my office or my lap.
Hyperthreading means that two logical cores residing on one physical core may be sharing resources, such as execution units or L1 caches. The exact details of the sharing vary dramatically (Bulldozer modules versus Intel hyperthreading) but the fundamental problem remains the same – code executed on one logical core can slow down the other. Calling Sleep(0) repeatedly is particularly bad because it uses a lot of cache and execution resources. However any sort of spinning is likely to have some negative effect on the other logical core.
InterlockedCompareExchange is, by its very nature, a global operation. In order to do an atomic read-modify-write it needs to use bus locking or cache line reservation which requires communicating with – and potentially blocking – every other core in the system.
Even just a normal read (it turns out that a good optimization when doing a spin lock is to do a normal read before attempting the expensive InterlockedCompareExchange) has to bring in the associated cache line. If it is owned exclusively by another thread (if another thread is writing to the same cache line) then there is some cost. This should be minor, and if this is a problem then the false sharing which it implies should really be fixed, but this is another potential risk.

Because of all of these issues it is likely that busy waiting will reduce overall performance on at least some machines, and waste energy on many others. Even if busy waiting improves performance by a few percent on your machine, there are undoubtedly some customer machines (now, or in the future) where it will cause a much more significant slowdown.

CPU Starvation

200px-The_Hunger_Games Our idealized example of N-cores and N-threads trying to run is very tidy. On Xbox 360 or other single-purpose machines it may actually be realistic (some of the “No CPU is an island” warnings still apply), but on a PC it is completely unrealistic. Even on a PC where your game is the “only application running” there is, I’m afraid to say, a huge amount of background CPU activity. This could be anti-virus, services, the system process, various application update mechanisms, or any one of dozens of other processes. This load varies, but can quite easily add up to several simultaneous threads trying to run, at least part of the time.

In addition, on a PC you don’t really know what core your code is running on. It changes constantly and, while you can set thread affinity, on a PC you usually shouldn’t because it just constrains the operating system and means that your code will be affected even more by other processes. As usual, Xbox 360 is an exception here because it requires thread affinities to be set – this works because there are no other processes running.

It turns out that on Windows you can’t even tell how many logical cores are currently available. Sure, you can ask the OS for the core count, and you can ask about logical versus physical cores (although Bulldozer blurs the distinction), but the actual number of logical cores that the OS considers eligible for thread scheduling varies over time. When demand is low Windows tries to keep the secondary logical cores on each physical core idle. When demand is high the OS makes those secondary logical cores available, and this can change multiple times per second.

You also don’t know what priority your threads are running at. You can set your thread’s priority, but Windows dynamically adjusts threads’ priorities in order to avoid live-lock and to improve responsiveness. These algorithms are complicated and change over time, so don’t even think about trying to predict them. Other operating systems will have different algorithms, so any attempt at ensuring that you are not interfering with useful work is doomed to failure.

The net result of this is that sometimes when you busy wait you are causing CPU starvation, and on a PC this is fundamentally unavoidable and unpredictable.

Sleep(0)

A lot of people decide that they are going to sleep in a polite way. They are going to repeatedly call Sleep(0) in some vague attempt to let other threads run. When you call Sleep(0) on Windows it will either return immediately or it will switch to another thread and I am of the opinion that both of these results are bad.

If Sleep(0) returns immediately then what we have is a very complicated and expensive busy wait that is touching lots of cache lines and is probably hitting the OS scheduler lock, for no good reason.

If Sleep(0) switches to another thread then we have lost the CPU, paid all the cost of context switches, but we have failed to tell the operating system what we are waiting for. Thus, there is no way for the operating system to wake us up promptly when the lock is released. This means that we will likely sleep for an entire quantum, which could be 10-20 ms, and then be given a chance to try again. Meanwhile the lock we wanted will probably have been released and acquired hundreds of times and we might have to repeat the process again!

The one theoretical case when Sleep(0) might help is if the thread that owns the lock is waiting to run, in which case Sleep(0) might give it CPU time so that it can finish its task and release the lock. However Sleep(0) will only relinquish the CPU to a thread of equal priority, and due to dynamic thread priority adjustments there is no way to ensure that your worker thread has a matching thread priority. And, even if thread priorities match, Windows tries to avoid moving threads between cores, so if the IdealCPU for this other thread is not the core you are running on then your attempt at generosity may be rejected.

It’s worth pointing out that the problems with Sleep(0) are not implementation defects – it is just the wrong function for the job. Sleep(0) is a poor match for waiting on a lock because what you really want to do is to give CPU time to the owner of the lock, if it needs it, but Sleep(0) has no way of knowing which thread you care about! On the other hand, if you wait on a critical section then Windows knows exactly what thread you are waiting on, and it can (potentially, I don’t know if it does) give that thread a boost in order to avoid priority inversions or otherwise shorten the delay.

Sleep(0) is, in my opinion, more about faith than science. It is simply not a reliable or efficient way to do anything on Windows. It will occasionally do something helpful due to sheer luck, but you can never count on that and the odds are against you.

Sleep(1)

Sleep(1) puts your thread to sleep for about a millisecond. This avoids the problems of busy waiting, but in video games a millisecond is often too long to be napping. Also, the behavior of Sleep(1) depends on global system settings, and on what version of Windows you are running. Therefore events and semaphores should usually be preferred over Sleep(1) in a loop – they will wake you immediately upon being signaled.

YieldProcessor

The YieldProcessor macro (which generally maps to _mm_pause) can help prevent a busy wait from completely overwhelming the system, by inserting pauses in the instruction stream that prevent the busy loop from overwhelming the processor. This is particularly important on hyperthreaded systems since it gives the other logical core time to run. If you must busy wait then be sure to use YieldProcessor(). But I’d rather you didn’t busy wait. YieldProcessor will reduce your load on the system, but it doesn’t change the fact that your idle loop is occupying a logical core that might be needed for something more important.

Complicated enough for you?

That’s a lot of complexity. And there is probably more that I have forgotten or I’m not aware of. Writing your own lock requires you to be an expert in all of the issues mentioned above.

Real world example – waiting for thread termination

converted PNM file

There are three popular ways of waiting for a Windows thread to terminate. Two of them should never be used.

One method to detect thread termination is to spin in a loop waiting on a volatile variable that the thread sets as it exits. This is a terrible idea for two reasons. One reason is that it is a busy wait, and therefore bad. The other reason is that it doesn’t actually tell you when the thread has exited. It tells you when the thread is about to exit. This makes it a recipe for race conditions.

Another method to detect thread termination is to spin in a loop and repeatedly call GetExitCodeThread until it no longer gives an exit code of STILL_ACTIVE. This is a terrible idea because this is a busy wait, and therefore bad. In some cases the child thread will be starved for CPU time – and unable to exit – because the parent thread is consuming all available CPU time in its wait-for-exit loop. And heaven help you if your thread’s exit code happens to be STILL_ACTIVE…

The one-true-way to detect thread termination is to call WaitForSingleObject on the thread’s handle. When it returns you can be sure that the thread is terminated. Then, and only then, you should close the thread handle. Simple. And no busy waiting required.

Other platforms should offer equivalent ways of waiting for a thread to terminate, such as pthread_join for pthreads. Or in C++ 11 you can wait for a std::thread to terminate by calling std::thread::join.

Real word example – thread pool

KONICA MINOLTA DIGITAL CAMERA

I worked with a thread pool a few years ago that contained several busy waits. These busy waits may have made sense when the code was written, but as core-counts and hyperthreading increased, some architectural problems were revealed.

This system was based around using an event to signal when a new job was available. The thread that added the job (we’ll call it the master) would call SetEvent(). The nature of events is that this would wake up all of the worker threads, and the nature of using events for a thread pool is that the thread pool lock had to be held by the master when calling SetEvent().

It turns out that when a thread is woken up after waiting Windows typically gives it a small priority boost. This means that the worker threads would probably be running at a higher priority than the master thread. Therefore, if there were enough threads in the thread pool (and on hyperthreaded systems ‘enough’ is equal to the core count or less) then the master thread would be switched out so that all of the worker threads could run. However the master thread still holds the lock, so the worker threads would all try to acquire the lock and fail. In the best case scenario this means there are a huge number of context switches before any work can get done.

In one version of this system the lock was a pure spinlock. So, the worker threads would keep trying to acquire the lock, and keep failing. In the purest form of this flaw we had a six-core twelve-thread CPU with eleven worker threads. Six of the worker threads would get scheduled (Windows tries to leave the hyperthreads idle) and would try to acquire the lock, thus starving out the master thread. After a few quanta had expired Windows would either allow more threads to run, or would decay the priority of the worker threads so that the master thread could finally run, and release its lock. It was common to see 20-40 ms sequences where six cores were spinning in a lock-acquire busy wait, and no work was done!

The xperfview capture below shows a call to SetEvent that took 12.434 ms due to this issue, during which time all cores were busy waiting:

Using a semaphore instead of an event ensures that only one worker thread is woken up at a time, and also allows the worker thread to be triggered when the lock isn’t held. This avoids all unnecessary context switches. Switching to a lock that doesn’t spin endlessly also avoids the worst-case scenario.

Real world example – folly::MicroSpinLock

This lock uses a one-byte flag and no semaphore. It spins for 4,000 iterations (using ‘pause’ – excellent) before trying the lock again, and there is no apparent limit to how long it will busy wait. Therefore this lock can lead to live-lock, wasted power, CPU starvation, reduced CPU frequency, and reduced GPU frequency. The only advantage I can see of this lock over a futex is that it is very small (there is also a one-bit version) and it is properly constructed through zero initialization.

The comment block says “Note: these locks are for use when you aren’t likely to contend on the critical section”. Famous last words. Caveat lockor. Sorry Facebook, too risky, not recommended.

Real world example – KeStallExecutionProcessor

KeStallExecutionProcessor is documented as being appropriate for very short busy waits – up to a maximum of 50 microseconds. And yet, when profiling on a SurfaceBook Pro I found that this function was running for ~22 milliseconds at a time. The “good news” is that it was only being told to spin for one millisecond at a time, but was being called ~22 times in a row, but somehow that doesn’t make me feel better. I reported this power bug and it has apparently been fixed. Please don’t use KeStallExecutionProcessor to spin for 440 times longer than recommended.

Profiling challenges

Busy waiting means that wait analysis doesn’t work, which means that a valuable profiling tool is lost.

Spin a little bit

Spinning is not entirely morally bankrupt. If there is a lock that is typically only held for a few hundred cycles and if the threads in question are usually simultaneously running on different cores, then spinning for a little while is just fine. The key is understanding the odds of the two threads being on different cores, and on a reasonable definition of “for a little while”. You should spin long enough to give yourself a reasonable chance of acquiring the lock, but no longer. And, if you never succeed in acquiring the lock by spinning then you should give up and never spin again.

Windows critical sections provide an option to spin, and they even appear to have an (undocumented) flag that will dynamically adjust the spin count:

RTL_CRITICAL_SECTION_FLAG_DYNAMIC_SPIN

Java 6 uses adaptive spinning (in a paper called “Java SE 6 Performance White Paper”) to balance the cost and benefits of spinning.

If you do your own spinning then you should monitor the results to make sure that it is genuinely helping, and you should absolutely cap the spinning at some small time, probably a few thousand cycles. Remember, the only reason to spin is to avoid the cost of a context switch. Unless you measure how long you are spinning and how often the spinning works then you can’t be sure that you are getting any benefit. An unmeasured optimization is no optimization at all.

If you write your own spin loop then you should definitely use YieldProcessor() and you should do normal reads in your loop to see if there is any chance of acquiring the lock before trying the heavyweight InterlockedCompareExchange.

However I am not impressed with the value of spinning. Spinning while trying to acquire a lock only makes sense if the lock is lightly contended for. If that is the case then the benefit is likely to be modest, due to the light contention. However the risks remain high, should you get it wrong.

If you do spin, keep it short. Spinning for dozens of milliseconds just makes me angry. You wouldn’t like me when I’m angry.

In conclusion…

Please don’t busy wait. It will often harm performance, while offering modest gain. It will reduce battery life and waste power. It has never been a worse idea than right now, but it will become an even worse idea. There are better ways of improving multi-threaded performance.

If you do busy wait, be sure to cap how long you do it, and do normal reads and YieldProcessor() in the wait loop.

Don’t use Sleep(0).

Don’t use this lock.

For best and simplest results use a CRITICAL_SECTION, futex, Benaphore, or the equivalent on your platform. User-mode mutexes like these are extremely fast in the uncontended case, and if you really want to you can specify a spin count. If you use a CRITICAL_SECTION then it will normally be acquired instantly, at very low cost, and in those rare cases when you do have to wait you will be rewarded on reawakening with a priority boost and a smug sense of satisfaction at not wasting CPU time. If a lock is heavily contended then you should address that problem (separate memory pools anyone?) instead of wasting CPU cycles for “efficiency”.

To paraphrase mothers everywhere, if you don’t have anything useful to do, don’t do anything.

Spinning in a loop
Can be simple and tempting
But mother says no

1. Bertrand Russell, In Praise of Idleness

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048

View all posts by brucedawson →

This entry was posted in AltDevBlogADay, Performance, Programming and tagged busy wait, critical section, hyperthreads, interlockedcompareexchange, sleep, turboboost. Bookmark the permalink.

30 Responses to In Praise of Idleness

Terrence Andrew Davis says:

June 5, 2012 at 10:56 pm

LoseThos can do 3,000,000 single core context switches a second. It doesn’t mess with address maps, just stores regs and restores regs. The funny thing is an IN/OUT instruction takes longer than a context switch.

Reply
Terrence Andrew Davis says:

June 5, 2012 at 11:11 pm

With fine grained locks, spinning is not a problem. (Sorry for sloppy typing in previous comment.)

Reply
- Terrence Andrew Davis says:
  
  June 5, 2012 at 11:14 pm
  
  Say you have a thousand locks. What are the odds of the same lock being used at the same time? It obviously depends, but frequently it never happens.
  
  Reply
Terrence Andrew Davis says:

June 5, 2012 at 11:37 pm

LoseThos does synchronous ATA/ATAPI polled I/O hard drive/CDDVD . It reads a status 1uS and swaps-out 0.5uS for context change. Performance is bad if lots of other tasks get swapped-in before it checks status again, but I’m targeting HOME computers, not servers. On a home system, the focus task is all-important and anything in the background rarely happens.

The reason disk activity cannot take place with two tasks is because requests for 1000 blocks are not split to allow sharing, so other tasks must wait for the big request to complete.

Disk activity pegs CPU load numbers, but the CPU is not sluggish. You have to think about what it means to read a I/O and swap-out in 2 uS, over and over. The CPU is not sluggish.

Reply
- brucedawson says:
  
  June 6, 2012 at 6:30 am
  
  That’s great that LoseThos has such fast context switches (no per-process memory mapping will do that for you), but it’s not really relevant to any production OS.
  
  You mention that disk activity pegs the CPU but then dismiss that, but again, that’s not helpful for any production OS. Pegging the CPU on a production OS will cause all of the problems that I mention above with busy waiting. There is no magic way to make these problems go away.
  
  Reply
Terrence Andrew Davis says:

June 5, 2012 at 11:46 pm

I wrote a friggin compiler and operating system by myself. I can’t believe all the know-it-alls who have never done a multithreaded application on multicores, let alone done an operating system.

By the way, LoseThos is master/slave multuicore not SMP. I would rather have one task execute twice as fast than two.

Reply
Aivars says:

June 6, 2012 at 1:55 am

For those who were wondering – Sleep(1) will sleep for 15-16 ms on a regular Windows PC, which makes ~65fps, that’s why it’s not enough for games. I’m sure the next article will give more awesome details about this.

Reply
- brucedawson says:
  
  June 6, 2012 at 6:24 am
  
  timeBeginPeriod(1) makes Sleep(1) much better behaved. That’s the summary of the next article.
  
  Reply
sundropdev says:

June 6, 2012 at 2:20 am

This is all well and good. Busy waiting is generally bad, the underlying OS issues of windows, well who cares thats Microsoft’s problem.

What is slightly more useful is a dissection of asynchronous coding methods and responding to events. This is largely a solved problem unless your dealing with OS primitives yourself in which case you should stop.

Also, given coarse grained service implementations which are prevalent in real world application stacks, its not always clear what might block when you are writing client code to an API.

In an ideal world every statement execution would be async, except context switching screws all that up, but do you really know how long anything is going to take given the resource constraints of your hardware? You dont even know if your work is schedulable much less how long it will take, thats the real problem. Unfortunately virtualization not only doesn’t solve this but makes it much worse.

Reply
kotek.net says:

June 6, 2012 at 3:29 am

Interesting is that similar stuff apply to Java as well. Spin locks can improve performance by one third or even more.

Reply
- brucedawson says:
  
  June 6, 2012 at 6:22 am
  
  Pure spin locks? A third or more improvement on a pure spin lock seems unlikely, and less likely in the future. And, it leaves you vulnerable to the risk of a 10x slowdown, or more.
  
  Reply
brucedawson says:

June 6, 2012 at 6:39 am

Two things from out-of-band comments:

The post title and the opening quote are from Bertrand Russell’s “In Praise of Idleness”

Java 6 has adaptive spinning on their locks. That is excellent. Spin for a bit, then wait, and adjust the spin (up or down) based on the rate of success. That’s perfect.

http://java.sun.com/performance/reference/whitepapers/6_performance.html#2.1.3

Reply
Sirmabus says:

June 7, 2012 at 1:46 pm

Realize you are talking about the whole scope including additional application threads and what not, but while on the subject in the Windows world for a lot of applications (not just games) you should be so lucky if the developer even thinks to put a bloody “Sleep(n)” in some place while their app is idle.

As if their application is the only thing that ever could or ever should run on your machine; the center of the universe, to the blazes with anything else.
There is more badly written applications and even worse resident services then I can shake my fist angrily at. The worse using 100% of one or more cores all the time even while minimized.

Admittedly, It makes some sense with games though. At least large beast like typical FPS clients and so on that are not expected to scale and are going to be taking up gigabytes of memory et al.
But even then a few easy things would make them more amiable.

If theses “devs” would only handle these dumb wait conditions and in their main loop (typically the windows message thread, and usually where the “game loop” tick runs) add tracking of WM_SIZE messages to know when the application is minimized or not.
While minimized a game could even pause almost entirely by suspending all other of it’s threads in, and, or, otherwise at least throwing a big “Sleep(50)” or so in the main thread.

Also if the application need not run while it is out of focus this can be easily tracked via WM_ACTIVATEAPP messages. Again if nothing else conditionally throw a big sleep so that the application is taking only a few % CPU while it idle in the background..

Reply
- brucedawson says:
  
  June 7, 2012 at 7:19 pm
  
  Agreed. If an application is inactive or minimized it should slow down updating, and perhaps stop. Pausing for a few hundred ms, if possible, will do wonders for battery life. Sometimes this is tricky, or may not be developer’s top priority, but it is a good thing to do.
  
  One relevant article on this topic is this one:
  
  http://prog21.dadgum.com/61.html
  
  As I said in the conclusion “if you don’t have anything useful to do, don’t do anything.”
  
  Reply
- Billco says:
  
  June 12, 2012 at 1:08 am
  
  I haven’t touched game programming in a very long time, but I remember one of the first things I was taught about Windows game programming was to sleep and yield whenever the game was paused or minimized. Mind you, this was back in the single-core days, where on a Pentium 200, if a background task was wasting CPU time you could FEEL it.
  
  Reply
Franklin says:

July 2, 2012 at 11:57 am

This may not be precisely games-related, but there is one reason to wait for which there is no clear solution: to run at a specific frequency (think of a realtime system with a direct feedback loop that requires low latency and consistent behavior). In this case, you are waiting for time, not any other particular resource. Is there any other mechanism but Sleep() that accomplishes this (aside from the reviled busy wait)?

Reply
- brucedawson says:
  
  July 2, 2012 at 12:51 pm
  
  Probably Sleep(n) (with timeBeginPeriod(1) to make Sleep(n) have sufficient resolution) is the best solution on Windows. Even then there is no guarantee — there may not be CPU time available when the timer expires — but it should be good enough, and it should be better than a busy wait.
  
  Reply
Olivier says:

September 30, 2012 at 9:30 am

The URL of the reference on adaptive spinning in Java6 has changed: it is now http://www.oracle.com/us/technologies/java/6-performance-137236.html

Reply
- brucedawson says:
  
  October 1, 2012 at 8:14 am
  
  Thanks. Updated.
  
  Reply
Alexander Safronov says:

February 21, 2019 at 1:56 pm

Hi Bruce!
In your investigations, did you ever encounter that in high contention scenario, entering critical section became significantly more expensive on Windows Server 2012, comparing to Windows Server 2008 (equivalent of Windows 7 vs Windows 10 difference I guess)? I am profiling with xperf, and seemingly same scenario places method RtlpEnterCriticalSectionContended disproportionally high in Windows 2012 (don’t think Windows 2008 has same exact method, but anything related to critical sections is nowhere near the top for similar scenario on Windows 2008). Both systems are similar hardware wise – single CPU, multicore, hyperthreading disabled.
Some random guy on internet observed something similar, albeit for completely artificial scenario – see https://stackoverflow.com/questions/52170665/why-did-critical-section-performance-become-worse-on-win8. I am considering to switch to SRWLock, but the performance difference is such mind-boggling that I am pretty sure there;s gotta be scientific explanation. Sure, I should just try to reduce rate of contention, but I am somewhat reluctant(and it is very expensive change), since same rate of contention does not seem like a problem on Windows 2008
Thanks for your input!
Alexander Safronov

Reply
- brucedawson says:
  
  February 21, 2019 at 10:06 pm
  
  I have not noticed that. An uncontended critical section should be extremely cheap – just an interlocked operation. A contended critical section means a kernel transition and then waiting on an event so the costs depend heavily on system load, contention level, and many other factors. So, there could well be a difference but I can’t say. Avoiding heavy contention is definitely desirable.
  
  Reply
  - Alexander Safronov says:
    
    February 22, 2019 at 11:30 am
    
    Not applicable in my case (we are seeing issue in Broadwell, which is pre-Skylake), but could be amusing read for you: https://aloiskraus.wordpress.com/2018/06/16/why-skylakex-cpus-are-sometimes-50-slower-how-intel-has-broken-existing-code/
    
    Reply
    - brucedawson says:
      
      February 22, 2019 at 10:13 pm
      
      Oh *that* problem. That’s not a CPU problem, that’s a poorly implemented delay loop. “Oh no, our delay loop is running slower” – uh, yeah. You might be doing something wrong.
      
      Reply
      - Alexander Safronov says:
        
        February 23, 2019 at 9:46 am
        
        Easy mistake to make. Note this Microsoft’s comment for InitializeCriticalSectionAndSpinCount function: “You can improve performance significantly by choosing a small spin count for a critical section of short duration. For example, the heap manager uses a spin count of roughly 4,000 for its per-heap critical sections”. Does not say anything about “oh, but if it is Skylake, then make it less than 4000”. Of course, I have no idea if Microsoft’s implementation of spins for critical sections actually translates to those “pause” calls
        
        Reply
        
        brucedawson says:
        
        February 23, 2019 at 10:37 am
        
        The article you most recently linked talked about .Net spin loops, not critical sections. There the problem is that the spin loop invoked ‘pause’ multiple times without checking the condition. Microsoft was warned about (I think they asked for) the change in the ‘pause’ implementation so they should have changed their code in preparation. Really, I’m not sure why they *ever* did multiple spin loops without trying to read the flag they were monitoring.
        
        The critical section issue that you initially linked to is separate and (I think) unrelated. It is triggered by an OS change, not a CPU change. I don’t know what the deal is with it.
        
        Reply
        
        Alexander Safronov says:
        
        February 28, 2019 at 1:37 pm
        
        I did confirm it was OS change. Windows 2012 (along with Windows 8 and up) implement Critical Sections very differently. Was staring at the disassembler for some time, and I can clearly see undocumented behavior (or omission in documentation) when SpinCount is 0 (i.e. calling InitializeCriticalSection(&cs) or InitializeCriticalSectionAndSpinCount(&cs,0)). Documentation makes one believe (or, at least, does not imply any reason not to believe) that if SpinCount is 0 (or not set), then there will be no spin-wait (even if multi-“processor”). But that is not the case. If you look at SpinCount member of “CRITICAL_SECTION cs” after calls to InitializeCriticalSection(&cs) or InitializeCriticalSectionAndSpinCount(&cs,0), you will see that spin count is set to 0x00000000020007d0.
        0x7D0 is 2000 and 0x2000000 is suspiciously the same as RTL_CRITICAL_SECTION_FLAG_DYNAMIC_SPIN. According to disassembler, 0x7D0 is a hard-coded constant in InitializeCriticalSection and in InitializeCriticalSectionAndSpinCount.
        I am sure “Microsoft knows best”, but I think this should be clearly documented, especially since it affects default behavior. Knowing the Skylake change above, I would sure like to know if their “dynamic spin” strategy accounts for the fact that “pause” becomes 14 times slower…
        This dude found the same thing
        https://stackoverflow.com/questions/54765280/windows-critical-section-how-to-disable-spinning-completely
        Anyway, I am going to see if I like the behavior of our system more if I use spin-count 1, and maybe will open MSDN support case to force Microsoft clarify this aspect in documentation
        
        brucedawson says:
        
        March 2, 2019 at 10:43 am
        
        Microsoft should clarify this in the documentation, certainly. I think what you are encountering is adaptive spinning which is supposed to adjust to the usage patterns of the critical section. Theoretically this should result in critical sections spinning the “right” amount (i.e.; zero if they are usually held for a long time, and longer if they are usually held briefly), and Microsoft is indeed responsible for making sure this actually works. Setting a spin count of 1 to disable this behavior sounds sensible, along with reporting cases where it works poorly.
        
        Alexander Safronov says:
        
        March 2, 2019 at 5:08 pm
        
        From my perspective, it means that same exact scenario is more CPU intensive in Windows Server 2012 (or windows 8 and higher), than in windows server 2008 (or windows 7), at least according to xperf.
        What is the cost of setting up the wait (and being awoken) in critical section these days? Feels like spin loop is more expensive…
        
        brucedawson says:
        
        March 2, 2019 at 7:59 pm
        
        The kernel transition for the wait and wake is definitely quite expensive – 500-100 cycles? And, it often means that data gets flushed out of the cache so that the thread runs slower when it resumes.
        
        As for the cost of a spin, well, it depends how long it spins for. If it spins for one iteration and then gets the lock then that is a clear win. For longer spins the tradeoff probably depends on whether you have enough work to keep the CPU busy on other threads.
        
        Anyway, I understand your concern, and it sounds like for your particular workload the performance is worse. All I can recommend is putting a repro of the problem on github and trying to get it noticed and addressed. Or set a spin count of 1.
        
        Alexander Safronov says:
        
        March 2, 2019 at 8:14 pm
        
        Heh, well, I can completely remove critical section in that place by switching to lockless queue. I am just being pissed at Microsoft for forcing our hand in the scenario that was not even remotely a concern, even at peak loads