ARM and Lock-Free Programming

I was inspired by the release of Apple’s M1 ARM processor to tweet about the perils of lock-free programming which led to some robust discussion. The discussion went pretty well given the inanity of trying to discuss something as complicated as CPU memory models in the constraints of tweets, but it still left me wanting to expand slightly on the topic in blog form.

This is intended to be a casual introduction to the perils of lock-free programming (which I last wrote about some fifteen years ago), but also some explanation of why ARM’s weak memory model breaks some code, and why that code was probably broken already. I also want to explain why C++11 made the lock-free situation strictly better (objections to the contrary notwithstanding).

Mandatory disclaimer: if your code always uses locks An actual lock (with link to an eleven second lock-picking video)when sharing data between threads then you don’t need to worry about any of this – properly implemented locks avoid all of these data races (unlike the horrible lock in this video).

The basic problems of lock-free programming are best explained through the example of a lock-free producer/consumer pattern, with the producer thread looking like this (C++ pseudo-code with data/function boundaries omitted):

// Producer thread:
Data_t g_data1, g_data2, g_data3;
bool g_flag

g_data1 = calc1();
g_data2 = calc2();
g_data3 = calc3();
g_flag = true; // Indicate that the data is ready for consumption

And here is the consumer thread which retrieves and uses the data:

// Consumer thread:
if (g_flag) {
  DoSomething(g_data1, g_data1, g_data2, g_data3);

Details? More like duck tails, am I right (thousands of ducks being dumped into the Thames, circa 2006)This omits a ton of details (when does g_flag get cleared? How do the threads avoid spinning?) but it suffices for my purposes. The question is, what is wrong with this code, in particular the producer thread?

The basic problem is that the code relies on the three data variables being written to before the flag, but it does not enforce that. If the compiler rearranges the writes then g_flag may be set to true before all of the data is written and the consumer thread may see incorrect values.

Optimizing compilers can be very aggressive about rearranging operations – it’s one of the ways they make their generated code run fast. They may do this in order to reduce the use of registers, to improve the use of CPU pipelines, or just because of some random heuristic added to make Office XP load slightly faster. It’s not worth thinking too much about why a compiler might rearrange things, it’s just important to realize that they can and do.

Compilers are “allowed” to rearrange the writes because of the “as-if” rule which says that they have done their job as long as the program that they generate behaves “as-if” they hadn’t optimized it. Since the C/C++ abstract machine has long assumed a single-thread of execution – with no external observers – all this rearrangement of writes has been correct and reasonable, and has been done for decades.

The question then, is what must be done to stop the compiler from breaking our beautiful code? Let’s pretend for a moment that we are a circa 2005 programmer trying to make this work. Here are some bad ideas:

  1. Declare g_flag as volatile. That prevents the compiler from omitting the reads/writes of g_flag but, to the surprise of many, it does not prevent the problematic rearrangement. Compilers are not allowed to reorder volatile reads/writes with respect to each other, but they are allowed to rearrange them relative to “normal” reads/writes. Adding volatile does nothing to solve our reordering problem (/volatile:ms on VC++ does, but it is a non-standard extension to the language that may generate slower code).
  2. If declaring g_flag as volatile is insufficient then let’s try declaring all four variables as volatile! Then the compiler can’t rearrange the writes and our code will work… on some computers.

It turns out that compilers aren’t the only things that like to rearrange reads and writes. CPUs like to do this as well. This is separate from out-of-order execution (always invisible to your code), and in fact there are in-order CPUs that reorder reads/writes (Xbox 360 CPU) and there are out-of-order CPUs that mostly do not reorder reads/writes (x86/x64 CPUs).

So, if you declare all four variables as volatile then you have code that will only run correctly on x86/x64. And, this code is potentially inefficient because no reads/writes to those variables can be optimized away, potentially leading to redundant work (as when g_data1 is passed twice to DoSomething).

A big barrier (Hadrian's wall?)If you are satisfied with inefficient non-portable code then feel free to stop here, but I think we can do better. But let’s continue to constrain ourselves to the options available in 2005. We now have to make use of… memory barriers.

On x86/x64 we need a compiler memory barrier to prevent reordering. This does the trick:

g_data1 = calc1();
g_data2 = calc2();
g_data3 = calc3();
_ReadWriteBarrier(); // VC++ only and deprecated, but okay in 2005
g_flag = true; // Indicate that the data is ready for consumption.

This tells the compiler not to rearrange the writes across that barrier which is exactly what we need. Another barrier may be needed after the write to g_flag to ensure that value gets written but the details are too uncertain for me to want to discuss. A similar barrier should be used in the consumer thread, but I’m ignoring that thread for now to keep things simple.

The problem is that this code is still broken on CPUs with a weak memory model. A “weak” memory model indicates CPUs that can reorder reads and writes (for greater efficiency or simplicity of implementation) and this includes ARM, PowerPC, MIPS, and basically every in-use CPU except for x86/x64. The solution to this is also a memory barrier, but this time it needs to be a CPU instruction which tells the CPU not to reorder. Something like this:

g_data1 = calc1();
g_data2 = calc2();
g_data3 = calc3();
MemoryBarrier(); // Windows only, and an expensive full memory barrier.
g_flag = true; // Indicate that the data is ready for consumption.

The actual implementation of MemoryBarrier depends on the CPU. In fact, as the comment suggests, MemoryBarrier is not really the ideal choice here because we just want a write/write barrier instead of a much more expensive full memory barrier (which makes reads wait for writes to fully complete) but this is good enough for our purposes today.

I assume that the MemoryBarrier intrinsic is also a compiler memory barrier, so we only need one or the other, so our awesome/efficient producer thread now becomes:

#ifdef X86_OR_X64
#define GenericBarrier _ReadWriteBarrier
#else
#define GenericBarrier MemoryBarrier
#endif
g_data1 = calc1();
g_data2 = calc2();
g_data3 = calc3();
GenericBarrier(); // Why did I have to define this myself?
g_flag = true; // Indicate that the data is ready for consumption.

More barriers (Cusco, Peru) - these should do the trickIf you have circa-2005 code without these memory barriers then your code is broken, and has always been broken, on all CPUs, because compilers have always been allowed to rearrange writes. With these memory barriers (implemented as needed for different compilers and platforms) your code is beautiful and portable.

It turns out that ARM’s weak memory model really doesn’t make things any more complicated. If you are writing lock-free code and not using any sort of memory barriers then your code is potentially broken everywhere due to compiler reordering. If you are using memory barriers then it should be easy to extend them to include hardware memory barriers.

The code above is error prone (where do the barriers go?), verbose, and inefficient. Luckily when C++11 came along we got better options. Prior to C++11 the language didn’t really have a memory model, there was just the implicit assumption that all code was single threaded and if you touched shared data outside of locks then god have mercy on your soul. C++ 11 added a memory model that acknowledged the existence of threads. This made it more explicit that the no-barrier code above was broken, but also gave new options to fix it, like this:

// Producer thread:
Data_t g_data1, g_data2, g_data3;
std::atomic<bool> g_flag // Look at this!

g_data1 = calc1();
g_data2 = calc2();
g_data3 = calc3();
g_flag = true; // Indicate that the data is ready for consumption.

The change is subtle and easy to miss. All I did was change the type of g_flag from bool to std::atomic<bool>. This tells the compiler not to elide reads and writes of this variable (well, mostly), not to rearrange reads and writes across reads and writes to this variable, and to add appropriate CPU memory barriers as needed.

We can even optimize this code slightly:

// Producer thread:
Data_t g_data1, g_data2, g_data3;
std::atomic<bool> g_flag

g_data1 = calc1();
g_data2 = calc2();
g_data3 = calc3();
g_flag.store(true, std::memory_order_release);

By using memory_order_release we are telling the compiler exactly what we are doing so that it can use the appropriate (cheaper) type of memory barrier instruction, or no memory barrier instruction in the case of x86/x64. Our code is now relatively clean and perfectly efficient.

At this point writing the consumer thread is easy. In fact, with the new declaration of g_flag the original version of the consumer thread is now correct! But, we can optimize it slightly:

// Consumer thread:
if (g_flag.load(std::memory_order_acquire)) {
  DoSomething(g_data1, g_data1, g_data2, g_data3);

The std::memory_order_acquire flag tells the compiler that we don’t need a full memory barrier – a read-acquire barrier just ensures that the data values don’t come from shared storage before g_flag without blocking other reordering.

Finishing the code so that the threads can avoid busy-waits and other problems is left as an exercise for the reader.

If you want to learn these techniques then start by carefully reading Jeff Preshing’s introduction to lock-free programming or This is Why They Call It a Weakly-Ordered CPU, and then consider joining a monastery or nunnery instead. Lock-free programming is the most dangerous hammer in the C++ toolkit and that is saying a lot, and it is rarely appropriate.

Note: writing an x86 emulator on ARM forces you to deal with this in a particularly messy way because you never know when reordering will be a problem so you typically have to insert a lot of memory barriers. Or you can follow Apple’s strategy and add a CPU mode that enforces x86/x64 memory ordering, turning that on when emulating.

References:

Most discussion is on reddit but there is some discussion on twitter and some discussion on hacker news.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048
This entry was posted in Uncategorized. Bookmark the permalink.

6 Responses to ARM and Lock-Free Programming

  1. 1st_C_Lord says:

    I often find the best use of lock free programming is when writing some piece of hot code that could exhibit expensive synchronisation congestion if written for readability. Instead write a lock free version as a baseline and then add locks and measure in order to both improve performance and readability. It’s easier to debug lock free congestion because it’s more ordered, adding locks increases the entropy of the system and makes reasoning about how it is behaving harder.

    I’ve never had a use in production for a pure lock free algorithm I’ve written myself, It should be incredibly rare both because it’s often unperformant .and unreadable/unmaintainable.

  2. akraus1 says:

    Lock free is nice but many times overdone. Except for graphics rendering/raytracing you seldom have problems which need 30 or more cores with some really hot locks. Most often I see way to much CPU spinning in while a producer consumer queue wakes up 30 threads for one work item which is hugely inefficient. The Intel IPP Library uses by default 200 ms active (https://community.intel.com/t5/Intel-C-Compiler/OMP-WAIT-POLICY-and-KMP-BLOCKTIME/td-p/1066367) CPU spinning just to avoid taking a lock. If you have one work item per 200 ms you just have maxed out 64 cores running full speed waiting form more (but never arriving) work. But it scales as hell …

  3. Anonymous says:

    You forgot one thing here with your C examples:

    https://en.wikipedia.org/wiki/Comma_operator

    • brucedawson says:

      What about the comma operator? The critical underlying feature from the compiler point of view is the “as-if” rule and I don’t think the comma operator makes any difference.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.