Microsoft’s VC++ compiler has an option to generate instructions for new instruction sets such as AVX and AVX2, which can lead to more efficient code when running on compatible CPUs. So, an obvious tactic is to compile critical math-heavy functions twice, once with and once without /arch:AVX (or whatever instruction set you want to optionally support).
It seems like a good idea, and it’s been used in various forms for years, but it’s devilishly difficult to do safely. It usually works, but guaranteeing that is trickier than I had realized.
Let’s say that we have a function called NonAVXMath. This function works great but we know that it would be faster if compiled with /arch:AVX. So we copy our function to another source file (or use the pre-processor to give the same effect), rename the copy to AVXMath, compile the new source file with /arch:AVX, and then at runtime choose the appropriate function to call.
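Here is a minimal sketch of that pattern – the function signatures are hypothetical, and the AVX check is the usual CPUID/XGETBV dance (the OS part of the check matters, as we'll see below):

#include <intrin.h>

float NonAVXMath(float x); // defined in a file compiled without /arch:AVX
float AVXMath(float x);    // defined in a file compiled with /arch:AVX

static bool CpuAndOsSupportAVX() {
    int regs[4];
    __cpuid(regs, 1);
    const bool osxsave = (regs[2] & (1 << 27)) != 0; // OS uses XSAVE
    const bool avx = (regs[2] & (1 << 28)) != 0;     // CPU supports AVX
    if (!osxsave || !avx)
        return false;
    // Check that the OS actually saves/restores the YMM registers.
    return (_xgetbv(0) & 6) == 6;
}

float Math(float x) {
    static const bool useAVX = CpuAndOsSupportAVX();
    return useAVX ? AVXMath(x) : NonAVXMath(x);
}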
This seems simple enough, but it isn’t. Let’s imagine that NonAVXMath calls some helper functions. Those functions are probably in header files so we don’t need to copy them – they will be pulled in as needed by the preprocessor. They will be compiled once with /arch:AVX and once without, and will be inlined into the functions, giving ideal code. And indeed, this is what happens most of the time.
But what happens if the inline functions aren’t inlined? For each translation unit the compiler will generate a copy of the inline functions that were not inlined. It is then the linker’s job to discard all but one copy of these functions. This is supposed to be safe because the function bodies are supposed to be identical. But they’re not because some use AVX instructions and some don’t.
It’s an ODR violation, essentially.
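To make the failure mode concrete, here is a hedged sketch of the kind of header-defined helper that can trigger it (shared_math.h and Scale are hypothetical names):

// shared_math.h – included by both the AVX and the non-AVX source files.
#include <math.h>

// If Scale is not inlined (in a debug build, for instance) then each
// translation unit emits its own copy – one encoded with AVX
// instructions, one without – and the linker keeps just one of them
// for the entire program.
inline float Scale(float x) {
    return floorf(x) * 0.5f;
}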
This cannot end well
If the linker chooses the copy of the inline function that was compiled without AVX then your code will run everywhere but will run more slowly, because it is switching back and forth between AVX and SSE math.
If the linker chooses the copy that was compiled with AVX then your code will crash on machines that don’t support AVX! This includes machines with older CPUs that lack the AVX instruction set, machines running older operating systems that don’t support AVX, and computers that have had AVX support disabled (on Windows you can do this with “bcdedit /set xsavedisable 1”, and doing this used to be the recommended way of working around an old Windows 7 bug). In short, your program will crash for some customers.
Oops.
I created a sample project that demonstrates this. While the issue can happen in fully optimized LTCG builds (and indeed it did happen recently to Chrome) it is easier to demonstrate in a debug build. My test project contains two source files which both call floorf, one of which is compiled with /arch:AVX. The build.bat file compiles both and links them twice, once with the AVX file first and once with the AVX file last. Then it disassembles the floorf function in both executables to demonstrate that it varies. Here are the results when the AVX source file is linked last:
avx_last!floorf:
push ebp
mov ebp,esp
push ecx
cvtss2sd xmm0,dword ptr [ebp+8]
sub esp,8
movsd mmword ptr [esp],xmm0
call avx_last!floor (001b3b60)
add esp,8
fstp dword ptr [ebp-4]
fld dword ptr [ebp-4]
mov esp,ebp
pop ebp
ret
And here are the results when the AVX source file is linked first:
avx_first!floorf:
push ebp
mov ebp,esp
push ecx
vcvtss2sd xmm0,xmm0,dword ptr [ebp+8]
sub esp,8
vmovsd qword ptr [esp],xmm0
call avx_first!floor (00bb3b60)
add esp,8
fstp dword ptr [ebp-4]
fld dword ptr [ebp-4]
mov esp,ebp
pop ebp
ret
The difference is subtle but important – instead of cvtss2sd the second version uses vcvtss2sd – the AVX variant of this instruction. In each executable the single retained copy of floorf is called by both the AVX and the non-AVX functions.
Now the problem is clear – but what is the solution?
Careful link ordering
If you are careful to link the AVX files last then the linker should grab the non-AVX versions. This seems like a terrible solution to me. It relies on unspecified linker behavior, it won’t work reliably with code that is in static link libraries, it is probably flaky in the face of LTCG, and it guarantees that your AVX code will be a mixture of SSE and AVX code that runs slower than it should.
__forceinline
If you mark all of the relevant functions as __forceinline then the compiler is more likely to inline them. Your debug builds will probably still be broken, but maybe that’s okay. However, even __forceinline doesn’t guarantee inlining (some functions cannot be inlined), and it feels a bit sketchy to use __forceinline for correctness. As a sketch, using the hypothetical Scale helper from earlier:
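// shared_math.h, revised – plead with the compiler to inline.
// __forceinline raises the odds of inlining but is still not a
// guarantee; functions that cannot be inlined stay out-of-line.
__forceinline float Scale(float x) {
    return floorf(x) * 0.5f;
}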
Namespaces
If you wrap all of the inline function definitions in an anonymous namespace or an AVX-specific namespace then the functions are no longer considered the same and the linker will not collapse them. This technique has the advantage of actually guaranteeing correctness. An AVX-specific namespace is probably the better choice because it avoids the risk of ending up with multiple copies of functions that aren’t inlined – one per translation unit. The problem with this solution is that many header files don’t like being included inside an unexpected namespace – C/C++ standard headers are particularly unlikely to tolerate this. A hedged sketch, reusing the hypothetical shared_math.h from earlier (note that this protects your own helpers but not the inline functions in <math.h> itself, which is exactly the limitation just described):
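// avx_math.cpp – compiled with /arch:AVX.
#include <math.h> // standard headers must stay outside the namespace;
                  // the include guard then keeps <math.h> from being
                  // re-included inside it below

namespace avx {
#include "shared_math.h" // Scale’s mangled name now includes avx::
}

float AVXMath(float x) {
    return avx::Scale(x);
}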
static
Marking all of your inline functions as static works similarly to using an anonymous namespace, so it comes with the same risk of getting multiple copies of non-inlined inline functions. However, most linkers can automatically discard duplicate functions if their code bytes are identical – the /OPT:ICF option in the Visual C++ linker does this. Using static also guarantees correctness, as long as you tag every inline function in this manner. Again with the hypothetical helper, the sketch is just:
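// shared_math.h, revised – internal linkage per translation unit.
// Each .cpp file now gets its own private copy of Scale, so the
// linker can never substitute an AVX copy for a non-AVX one;
// /OPT:ICF can still fold any copies whose bytes are identical.
static inline float Scale(float x) {
    return floorf(x) * 0.5f;
}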
math.h
But what about system header files such as math.h? This is the file that I used in my example and it is the one that has twice caused problems for Google’s Chrome web browser. The current VC++ version of this file includes 49 __inline functions, including floorf, which is our culprit today. Well, when there aren’t any elegant solutions you have to go with an inelegant one. The solution that Chrome went with when we hit this problem was essentially:
#define __inline static __inline
#include <math.h>
#undef __inline
Look, we’re not proud of this solution, but it works. The ideal solution would be for Microsoft to modify math.h – and other header files – to mark inline functions as static. This is what gcc does. Otherwise /arch:AVX cannot be used safely without extraordinary measures. I’ve filed a bug to request this.
A separate DLL
There actually is one way to use /arch:AVX without gross hackery and that is to put all of the AVX code into a separate DLL, compiled entirely with /arch:AVX. Whether this works for you depends on your build system and method of distribution.
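A minimal sketch of the loading side, assuming a hypothetical avx_math.dll built entirely with /arch:AVX and exporting a Math function (CpuAndOsSupportAVX is the check sketched earlier):

#include <windows.h>

typedef float (*MathFn)(float);

MathFn LoadMath() {
    // Only load the AVX DLL when the CPU and OS both support AVX.
    // Math must be exported with extern "C" to avoid name mangling.
    HMODULE module = LoadLibraryA(CpuAndOsSupportAVX() ? "avx_math.dll"
                                                       : "math.dll");
    return module ? (MathFn)GetProcAddress(module, "Math") : NULL;
}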
Toolchain fixes
Having VC++ tag the inline functions that it ships with as static, like gcc/clang do, would avoid the specific problem of floorf and friends. But what about template functions such as std::min, or inline functions written by random developers? A toolchain fix that defuses this landmine once and for all would be much better. A tempting option was suggested on Twitter. If all non-inlined inline functions had an /arch: tag added to their name mangling then this problem would be resolved. My test binary would end up with _floorf and _floorf:avx and the linker would trivially resolve the correct functions. The programmer’s intent would be preserved, without the inefficiencies of marking every inline function as static (which isn’t even possible for template member functions).
Insert credits here
This problem was encountered a while ago by some other developers who use Chromium. They reported their internal bug here, and filed a VC++ bug here. They also contacted me to share their findings, which I appreciate.
Thanks to those on the Chrome team who came up with the (ugly but effective) static __inline solution, thus fixing Chrome’s canary builds for non-AVX capable customers, without having to disable /arch:AVX.
Reddit discussion is here, the announcement tweet is here, and the Hacker News discussion is here.
If I remember well, the Intel compiler does automatic dispatch at runtime, which would make it useful to enable ALL the arch options. To me, it would be nice to have a declspec which enables the runtime dispatch (and maybe says for which architectures) only for specific functions, to avoid code bloat. The cost of this would be paid only once per function.
“your code will run everywhere but will run more slowly, because it is switching back and forth between AVX and SSE math”
Pictures or it didn’t happen…
Are you saying here that you don’t get all the code you wanted as AVX? Or are you saying there is a cost to switching between SSE and AVX? Or, to be pedantic, switching between 256-bit wide and 128-bit wide operands – for on some Intel microarchitectures there is a switchover cost. And, further, I believe this cost does not exist on AMD (citation or it didn’t happen, ok ok).
Anyway we are men of science and I hereby demand you make this sentence more accurate.
https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties
I did not measure a switching cost, but there will definitely be switching as the AVX code calls a non-AVX floorf function, and my understanding is that this switching between AVX and SSE has a cost, on some CPUs.
The switching can be seen by stepping through the AVX function into the non-AVX floorf, in one of the variants in my sample.
I am happy with this clarification, it meets your own exacting standards. You may proceed.
SSE/AVX transition penalties are unlikely. When compilers don’t inline a function, they normally use vzeroupper before calling it, if any ymm uppers are potentially dirty. (And before returning). i.e. AVX usage is kept “private” to the scope of a function.
I could *imagine* that problem if the compiler was expecting a call between two AVX-using functions and doing inter-procedural optimization – it might skip vzeroupper. Just like it might make some assumptions about calling convention, or even use a *custom* calling convention.
GCC would make a `foo.clone123` version of a function if it wanted to propagate one constant arg into the function, or do other kinds of IPO, but I don’t know about MSVC.
QbProg: yes, but does the Intel compiler still only dispatch fully optimized code for GenuineIntel (TM) processors?
>older machines
Ha ha, I wish. All the latest Pentiums, Celerons, and, of course, Atoms, do not support AVX at all.
I’m curious why you feel separate dlls are too complex? That’s the solution I prefer. It’s often nice to have all your important performance critical math bits in a separate project/dll for other reasons as well (namely, it’s easier for the perf-minded people to work on it while making it more difficult for the non-perf-minded people to break it).
Also interested; suspect combinatorial explosion may feature in the answer.
Separate DLLs are a perfect solution in many cases. One extreme version would be recompiling the entire project with /arch:AVX and distributing two or more versions. That is ‘easy’ except for the distribution changes.
Having key functions in separate DLLs is a nice compromise but it can get a bit messy because those separate DLLs cannot share *any* libraries with the rest of the code. How tricky this is really depends on the build system used. If it works then go for it. I know that in Chrome we don’t want to lightly add more DLLs.
Maybe “too much work in some circumstances” would be a better statement. I would be interested to hear reports from developers who have used this technique.
This is slightly offtopic but it made me think of separate images, for example the way Process Explorer packages both 32 and 64 bit images and decides which one to run at launch. Why does Chrome still default to 32 bit?
Chrome defaults to 64 bit now, if you install it freshly on a 64-bit OS. However existing 32-bit installs are not (yet) being upgraded to 64-bit. Sysinternals does indeed do this very well, but downloading both versions of Chrome to all consumer machines (which is effectively what sysinternals does) would be too much of an increase in download size, I think.
Ok, but you could make it a function of the installer. It would download the build best suited to that machine. And put a check in Chrome that can detect whether the current CPU supports all the features it expects. (People may switch to new CPUs when changing computers and restore from a backup.)
Wouldn’t an inline namespace be a better solution?
An anonymous namespace could work, but is less targeted than changing inline functions to static inline. If you try an anonymous namespace (perhaps starting with my test project) then let me know how it goes.
No, I mean inline namespace
namespace math {
#ifdef AVX
    inline
#endif
    namespace AVX {
        float floorf(float);
    }
#ifdef SSE2
    inline
#endif
    namespace SSE2 {
        float floorf(float);
    }
}
Ah – gotcha. Well, just gotta try it with math.h I guess and see if it works.
Shouldn’t the AVX version of floor be calling vroundps?
Er, vroundss, as this is scalar in the example.
vroundss only works if the rounding mode is set to a non-default value, and changing the rounding mode is frequently expensive. floorf ends up calling floor and I didn’t bother looking at its implementation – it presumably comes with a whole host of new problems. I used floorf because that was what caused the problem in Chrome and because it made for an easy demo repro. The issue could occur with any inline math function.
I don’t think roundss requires you to change the default rounding mode – if you look at immintrin.h, the rounding mode is passed in as a compile-time constant.
#define _mm256_floor_ps(val) _mm256_round_ps((val), _MM_FROUND_FLOOR)
Ya, this is more of an aside about MS’s implementation of floor (and trunc/ceil, which have the same problem).
Is this something MS could fix with a linker enhancement? If the linker was /arch:AVX aware, then it could choose from an AVX and a non AVX version of the function, depending on which translation unit it was called from.
On twitter it was suggested – https://twitter.com/dascandy42/status/806175509130477577 – that name mangling should include modifiers for /arch: options, so /arch:AVX would produce a floorf:AVX symbol instead of floorf. If this was done for all non-inlined inline functions then the entire problem would disappear. No linker changes would be needed, just a simple compiler change.
thaaaanx
Thank goodness the C pre-processor still exists, despite the last 20 years of C++ changes attempting to kill it. This is what it’s there for – the kind of meta-language hack that you shouldn’t need to do in a perfect world, but do in reality.
A colleague of mine encountered a similar problem a couple of years ago. At that time we could not figure out why this happened but was able to quickly work around it by moving some math function calls to another source file for which we did not use /arch:AVX. Having read this article, we now understand the problem more clearly. Thanks.
Given that we have potentially the same problem of ODR violation for any inline function or function template such as std::min(), we cannot but conclude that there is no perfect solution other than building a separate DLL for each architecture, though this seems to be a little cumbersome to me because dispatching inside a DLL is just so handy.
Finally, I would like to point out that this problem is definitely not restricted to floating-point arithmetic. It also affects some integer arithmetic because, since Visual Studio 2015, the /arch:AVX2 option allows the compiler to emit BMI1/BMI2 bit-manipulation instructions such as SHLX (shift logical left without affecting flags), rendering virtually any code unsafe.
When floorf is NOT inlined, why does the linker consider the resulting functions as duplicates? Is it just going by name? The generated assembly isn’t identical after all, as is evidenced by that one instruction.
When floorf is not inlined the linker is *required* to treat the resulting functions as duplicates, and those duplicates are required to be identical. A failure of those functions to be identical is an ODR violation. The big question is whether this ODR violation is the fault of the compiler, or of the author of the inline function.
A number of ABIs could benefit from name mangling. On the ARM, for example, it would be helpful to have a naming convention for functions that accept floating-point values in FPU registers and for those which accept them in integer registers; code which defines either form could generate a strong symbol for the form it’s expecting and a weak symbol with a stub that would translate the arguments and chain to the other. Code which expects to call a function with one convention could then define a weak symbol with a stub for the form it’s calling which would chain to the other.
Using such a pattern, if caller and callee expect the same convention, both would weakly define stubs which never get called and could thus be omitted; if they expect different conventions, the weakly-defined stubs would bridge between the caller and the called function.
One difficulty with using naming mangling to bridge such issues, however, is that C allows function pointers to be passed between modules, and there’s no nice way that code with a pointer to a function of one style could convert it into a pointer to a function of the other style. Perhaps that could be resolved by having the linker produce tables of the entry points of all function whose addresses are taken, using a separate table for each calling convention, and then have “function pointers” actually be offsets into the table. Such a design would require support from linkers and compilers, but would facilitate interoperation among code which is compiled with different “preferred CPU” options.
I think that it’s okay for function pointers to not be handled. The real problem (as far as I can tell) is when there is cross-talk between two domains, as discussed and demonstrated here. When a developer is using a function pointer they are presumed to be in control and have understanding of what they are doing, so they need to take responsibility for when they cross architecture domains. I think that’s better than having compiler magic to make arbitrarily ‘fat’ function pointers.
But name mangling for non-inlined inline functions seems crucial.
Do you like my suggested approach for using name mangling to allow clean inter-operation between domains? The performance of code using that approach wouldn’t be quite as good as that of code which was all compiled the same way (e.g. if “foo()” is compiled to use FPU registers and “bar()” isn’t, having “foo” load values into FPU registers and then call a stub that reads them out would be less efficient than having “foo” put the values where they’re expected), but the code would work *correctly* in any case. Code which uses function pointers will be a problem, but otherwise I don’t see why things shouldn’t be able to interact a lot more smoothly than they do.
The idea of using name mangling and thunks for allowing cleaner inter-operation between domains with different calling conventions could work nicely. Although, in many cases it would be better to generate two versions of the functions, or else use the name mangling to indicate the calling convention such that calling the wrong one becomes a linker error.
I’ve implemented dynamic codepaths between AVX and SSE including the appropriate CPU checks, but I’ve found that AVX actually hurts performance unless you’ve got significant intensive bursts of very heavily vectorised code.
The problem is that when the upper half of the YMM registers and ALU/FPU are not used, the chip powers them down, and in fact the “base clock” and “turbo clock” speeds quoted for your CPU are with these units powered down.
When the chip sees them used, it powers up the circuitry (this takes about 60 microseconds, during which it actually implements AVX ops by doing two 128-bit ops internally). This extra circuitry consumes electricity, increasing the power consumption and heat produced by the CPU, so the base and turbo clock speeds are actually reduced. If you’re using “low power” AVX instructions this base clock reduction is only about 10%, but for “high power” AVX instructions it’s more like 20%.
Only when the wider registers haven’t been used for 700 microseconds (0.7 ms) are these circuits powered down again.
So if you’re going to be intensively crunching numbers for a few milliseconds, all well and good – the 256-bit registers will more than make up for a 10 or 20% reduction in clock speed. But if (as happens with our maths library) the use of wide registers comes in short bursts of a few microseconds at a time, maybe 4 or 5 times a millisecond, then the 90% of the code that is NOT benefitting from the wider registers runs 10% slower, and the net overall result is a slowdown in performance.
See https://stackoverflow.com/questions/35663635/why-do-processors-with-only-avx-out-perform-avx2-processors-for-many-simd-algori/44353690#44353690 for a longer explanation including links to Intel’s notes about this base clock reset.
Would /fp:fast run into the same dangers as /arch:AVX?
As in, two cpp files with different options, one /fp:precise and another /fp:fast, both sharing some template function that gets instantiated twice, but each instantiation is compiled with different floating-point optimization options! E.g. one implementation with /fp:precise will treat NaN correctly, but the other with /fp:fast will not, right? In general they would just produce different results, and you never know which one the linker will choose…
I think that the same dangers would apply. As you say, the consequences would be different (different results versus crashes) but the unpredictability would still be there.