Microsoft’s VC++ compiler has an option to generate instructions for new instruction sets such as AVX and AVX2, which can lead to more efficient code when running on compatible CPUs. So, an obvious tactic is to compile critical math-heavy functions twice, once with and once without /arch:AVX (or whatever instruction set you want to optionally support).
It seems like a good idea, and it’s been used in various forms for years, but it’s devilishly difficult to do safely. It usually works, but guaranteeing that is trickier than I had realized.
Let’s say that we have a function called NonAVXMath. This function works great but we know that it would be faster if compiled with /arch:AVX. So we copy our function to another source file (or use the pre-processor to give the same effect), rename the copy to AVXMath, compile the new source file with /arch:AVX, and then at runtime choose the appropriate function to call.
This seems simple enough, but it isn’t. Let’s imagine that NonAVXMath calls some helper functions. Those functions are probably in header files so we don’t need to copy them – they will be pulled in as needed by the preprocessor. They will be compiled once with /arch:AVX and once without, and will be inlined into the functions, giving ideal code. And indeed, this is what happens most of the time.
But what happens if the inline functions aren’t inlined? For each translation unit the compiler will generate a copy of the inline functions that were not inlined. It is then the linker’s job to discard all but one copy of these functions. This is supposed to be safe because the function bodies are supposed to be identical. But they’re not because some use AVX instructions and some don’t.
This cannot end well
If the linker chooses the copy of the inline function that was compiled without AVX then your code will run everywhere but will run more slowly, because it is switching back and forth between AVX and SSE math.
If the linker chooses the copy that was compiled with AVX then your code will crash on machines that don’t support AVX! This includes older CPUs that don’t support the AVX instruction set, older operating systems that don’t support AVX, or computers that have had AVX support disabled (on Windows you can do this with “bcdedit /set xsavedisable 1” and doing this used to be the recommended way of working around an old Windows 7 bug). In short, your program will crash for some customers.
I created a sample project that demonstrates this. While the issue can happen in fully optimized LTCG builds (and indeed it did happen recently to Chrome) it is easier to demonstrate in a debug build. My test project contains two source files which both call floorf, one of which is compiled with /arch:AVX. The build.bat file compiles both and links them twice, once with the AVX file first and once with the AVX file last. Then it disassembles the floorf function in both executables to demonstrate that it varies. Here are the results when the AVX source file is linked last:
cvtss2sd xmm0,dword ptr [ebp+8]
movsd mmword ptr [esp],xmm0
call avx_last!floor (001b3b60)
fstp dword ptr [ebp-4]
fld dword ptr [ebp-4]
And here are the results when the AVX source file is linked first:
vcvtss2sd xmm0,xmm0,dword ptr [ebp+8]
vmovsd qword ptr [esp],xmm0
call avx_first!floor (00bb3b60)
fstp dword ptr [ebp-4]
fld dword ptr [ebp-4]
The difference is subtle but important – instead of cvtss2sd the second version uses vcvtss2sd – the AVX variant of this instruction. In both cases the same floorf function will be called by both the AVX and non-AVX functions.
Now the problem is clear – but what is the solution?
Careful link ordering
If you are careful to link the AVX files last then the compiler should grab the non-AVX versions. This seems like a terrible solution to me. It relies on undefined behavior in the linker, it won’t work reliably with code that is in static link libraries, it is probably flaky in the face of LTCG, and it guarantees that your AVX code will be a mixture of SSE and AVX code that then runs slower than it should.
If you mark all of the relevant functions as __forceinline then the compiler is more likely to inline the functions. Your debug builds will probably still be broken, but maybe that’s okay. However even __forceinline doesn’t guarantee inlining (some functions cannot be inlined) and it feels a bit sketchy to use __forceinline for correctness.
If you include all of the inline function definitions from an anonymous namespace or AVX-specific namespace then the functions are no longer considered the same and the linker will not collapse them. This technique has the advantage of actually guaranteeing correctness. You can either use an anonymous namespace or an AVX specific namespace. Using an AVX specific namespace is probably a better idea because it avoids the risk of ending up with multiple copies of functions that aren’t inlined – one per translation unit. The problem with this solution is that many header files don’t like being added to an unexpected namespace – C/C++ standard headers are particularly unlikely to tolerate this.
Marking all of your inline functions as static works similarly to using an anonymous namespace. This means that it comes with the risk of getting multiple copies of non-inlined inline functions. However most linkers can automatically discard duplicate functions if the code bytes are identical – the /OPT:ICF option in the Visual C++ linker does this. Using static also guarantees correctness, as long as you tag every inline function in this manner.
But what about system header files such as math.h? This is the file that I used in my example and it is the one that has twice caused problems for Google’s Chrome web browser. The current VC++ version of this file includes 49 __inline functions, including floorf which is our culprit today. Well, when there aren’t any elegant solutions you have to go with inelegant. The solution that Chrome went with when we hit this problem was essentially:
#define __inline static __inline
Look, we’re not proud of this solution, but it works. The ideal solution would be for Microsoft to modify math.h – and other header files – to mark inline functions as static. This is what gcc does. Otherwise /arch:AVX cannot be used safely without extraordinary measures. I’ve filed a bug to request this.
A separate DLL
There actually is one way to use /arch:AVX without gross hackery and that is to put all of the AVX code into a separate DLL, compiled entirely with /arch:AVX. Whether this works for you depends on your build system and method of distribution.
Having VC++ tag the inline functions that it ships with static, like gcc/clang do, would avoid the specific problem of floorf and friends. But what about template functions such as std::min, or inline functions written by random developers. A toolchain fix that defuse this landmine once and for all would be much better. A tempting option was suggested on twitter. If all non-inlined inline functions had their name mangling altered to include a /arch: prefix then this problem would be resolved. My test binary would end up with _floorf and _floorf:avx and the linker would trivially resolve the correct functions. The programmer’s intent would be preserved, without the linker inefficiencies of marking every inline function as static (which isn’t even possible for template member functions).
Insert credits here
This problem was previously encountered a while ago by some other developers who use Chromium. They reported their internal bug here, and filed a VC++ bug here. They also contacted me to share their findings, which I appreciate.
Thanks to those on the Chrome team who came up with the (ugly but effective) static __inline solution, thus fixing Chrome’s canary builds for non-AVX capable customers, without having to disable /arch:AVX.