std::min Causing Three-Times Slowdown on VC++

Using macros to implement Min and Max has numerous well-known pitfalls that can be avoided by using template functions. However, the definitions of min and max that are mandated by the C++ standard cause VC++ to generate inefficient code. Until the VC++ compiler fixes this it may be best to use custom Min/Max template functions to avoid the problem.

Update, July 2015: VS 2015 fixes this bug such that StdMaxTest now generates perfect code.

I recommend writing template Min/Max functions that return by value instead of by const reference. In most cases these can be used as drop-in replacements for std::min and std::max, and they cause VC++ to generate much better code. Replacing std::min and std::max won’t always cause a large improvement, but I don’t think it ever hurts.

One of the advantages of working at Valve is having really smart coworkers. This particular code-gen problem was found by Michael Abrash. When he said that the compiler was being difficult I figured it was probably worth investigating.

We don’t need no stinkin’ macros

Macro implementations of Min and Max have four main problems. The first is that, unless they are written with liberal parentheses, you can hit non-obvious precedence problems. This one, at least, can be dealt with by careful placement of parentheses in the macro definitions.

The second problem with macros is multiple evaluations of the arguments. If one of the parameters to a min/max macro has side effects then these multiple evaluations will cause incorrect behavior. In other cases the multiple evaluations might cause bloated code or reduced performance. This problem can be avoided by never passing arguments that have side effects to a macro.
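To make the hazard concrete, here is a sketch; the MAX macro below is the classic textbook definition, not code from any particular project:

```cpp
// Classic macro max (illustrative; any real codebase's MAX looks similar).
#define MAX(a, b) (((a) > (b)) ? (a) : (b))

// Because each argument appears twice in the expansion, side effects run twice.
inline int DoubleEvaluationDemo()
{
    int i = 0;
    int biggest = MAX(i++, -1);  // expands to (((i++) > (-1)) ? (i++) : (-1))
    (void)biggest;               // biggest is 1: the value of the *second* i++
    return i;                    // i was incremented twice, so this returns 2
}
```

The comparison evaluates the first i++, and the selected branch evaluates it again, so i ends up at 2 even though the call looks like a single use.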

The third problem is that macros don’t care whether your types match. You can pass a float and an int to MIN, or a signed and an unsigned, and the compiler will happily do conversions, sometimes leading to unexpected behavior.
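A small sketch of the conversion hazard, again using the textbook macro definition:

```cpp
// Classic macro min (illustrative definition).
#define MIN(a, b) (((a) < (b)) ? (a) : (b))

// With mixed signed/unsigned arguments the usual arithmetic conversions turn
// -1 into a huge unsigned value before the comparison runs.
inline unsigned SignedUnsignedDemo()
{
    int      a = -1;
    unsigned b = 1;
    return MIN(a, b);  // -1 converts to 0xFFFFFFFF, so the "minimum" is 1
}
```

A template min would reject this call at compile time because T cannot be deduced as both int and unsigned; that refusal is exactly the protection the macro lacks.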

The fourth problem with macros is that they pollute all namespaces. Yucch.

It might be possible to ensure that all of your macros are implemented correctly to avoid precedence problems, but it is very difficult to ensure that all uses of all macros avoid side effects and unplanned conversions, and namespace pollution is unavoidable with macros.

Functions it is

Inline functions avoid these problems, and inline template functions offer the promise of perfectly efficient min/max for all types with none of the problems associated with macros. Here’s the general form for a template max function:

template<class T> inline const T& max(const T& a, const T& b)
{
    return b < a ? a : b;
}

This function is designed to work with any type that is comparable with operator<, so T can be an int, double, or your class FooBar. Because the function needs to handle arbitrary types the parameters and return type need to be declared as const references. This allows finding the maximum of objects that are expensive to copy or uncopyable:

auto& largest = std::max(NonCopyable1, NonCopyable2);

However that const reference return type gives Visual C++’s optimizer heartburn when std::max is used with built-in types.

In order to see what sort of code template min/max generate we need to call them. Below we have the world’s simplest test of std::max:

int StdMaxTest(int a, int b)
{
    return std::max(a, b);
}

By compiling this in a release build with link-time-code-generation (LTCG) disabled and /FAcs enabled we can get a sense of the code-gen in a simple scenario, without even having to call the function. This technique was described in more detail in How to Report a VC++ Code-Gen Bug. Here’s what the assembly language from the .cod file looks like:

?StdMaxTest@@YAHHH@Z
push   ebp
mov    ebp, esp
mov    ecx, DWORD PTR _a$[ebp]
lea    eax, DWORD PTR _b$[ebp]
cmp    ecx, DWORD PTR _b$[ebp]
lea    edx, DWORD PTR _a$[ebp]
cmovge eax, edx
mov    eax, DWORD PTR [eax]
pop    ebp
ret    0

The first two and last two instructions are boilerplate prologue and epilogue. The six instructions in between do the actual work: roughly speaking, they conditionally select the address of the larger value, then load the winner from that address.

Now let’s consider an alternative definition of a template max function. This function is identical to std::max except that its return type is a value instead of a reference. Here’s the function and a test caller:

template <class T>
T FastMax(const T& left, const T& right)
{
    return left > right ? left : right;
}

int FastMaxTest(int a, int b)
{
    return FastMax(a, b);
}

And here’s the generated code:

?FastMaxTest@@YAHHH@Z
push  ebp
mov   ebp, esp
mov   eax, DWORD PTR _b$[ebp]
cmp   DWORD PTR _a$[ebp], eax
cmovg eax, DWORD PTR _a$[ebp]
pop   ebp
ret   0

The inner section of the function – the part that does the actual work – is three instructions instead of six. Instead of selecting the winning address and then loading the value it just selects the winning value.

All else being equal, smaller and shorter code is better: a shorter dependency chain means higher peak speed, and a smaller footprint means fewer i-cache misses. Lower instruction counts don't always mean higher speed, but when no expensive instructions are involved the shorter sequence should be at least as fast, and being smaller is a real advantage.

Timing differences

Reliably measuring the performance of three to six instructions is impossible. Given that modern processors can have far more instructions than that in flight at one time, it isn't even well defined. So, we need a better test.

The simplest timing test I could think of was a loop that scans an array to find the largest value. To compare FastMax and std::max I just need to call each of them a bunch of times on a moderately large array and see which one is faster. To avoid distortions from context switches and interrupts I print both the fastest and slowest times, but I only pay attention to the fastest. Here's one version of the test code:

int MaxManySlow(const int* p, size_t c)
{
    int result = p[0];

    for (size_t i = 1; i < c; ++i)
        result = std::max(result, p[i]);

    return result;
}
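The fast version of the loop differs only in which function it calls. Here it is with FastMax repeated so the snippet stands alone:

```cpp
#include <cstddef>

// FastMax as defined earlier in the article: std::max except for the
// by-value return type.
template <class T>
T FastMax(const T& left, const T& right)
{
    return left > right ? left : right;
}

// The fast counterpart to MaxManySlow; the only source-level difference
// is which max function the loop calls.
int MaxManyFast(const int* p, size_t c)
{
    int result = p[0];

    for (size_t i = 1; i < c; ++i)
        result = FastMax(result, p[i]);

    return result;
}
```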

The results are dramatic, way more extreme than I had expected. The FastMax code runs three times faster than the std::max code! The code using FastMax takes two cycles per iteration, whereas the code using std::max takes six cycles per iteration.

Here is the inner loop generated when using std::max:

SlowLoopTop:
1: cmp      ecx,dword ptr [edx]
2: lea      eax,[result]
3: cmovl    eax,edx
4: add      edx,4
5: mov      ecx,dword ptr [eax]
6: mov      dword ptr [result],ecx
7: dec      esi
8: jne      SlowLoopTop

Here is the inner loop generated when using FastMax:

FastLoopTop:
1: cmp      eax,dword ptr [esi+edx*4]
2: cmovle   eax,dword ptr [esi+edx*4]
3: inc      edx
4: cmp      edx,edi
5: jb       FastLoopTop

Remember that the only difference between the source code of these two functions is the return type of the Max function called. If I change FastMax to return a const reference then it generates code identical to std::max.

The problem handling the std::max return type is apparently an optimizer weakness in VC++. I've filed a bug demonstrating the problem and I'm hopeful that the VC++ team will address it. Until then I recommend using FastMax instead of std::max (and FastMin instead of std::min).
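For completeness, a FastMin to match; the article names FastMin but doesn't list it, so this definition is my sketch following the same pattern as FastMax:

```cpp
// A by-value FastMin to go with FastMax: identical to std::min except
// that it returns by value rather than by const reference.
template <class T>
T FastMin(const T& left, const T& right)
{
    return left < right ? left : right;
}
```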

All testing was done with VC++ 2013, release builds, with the /O2 optimization setting. Testing was done on an Intel Sandy Bridge CPU on Windows 7. I tested on one other Intel CPU and got similar but not identical results. I saw similar results with VC++ 2010, so this is not a new problem.

One final glitch

In order to protect programs against exploits of overruns of stack-based buffers, VC++ inserts calls to __security_check_cookie() at the end of vulnerable functions if you compile with /GS. A vulnerable function is one that VC++ thinks could have a buffer overrun. Unfortunately, using std::max in our test function triggers the /GS code, so MaxManySlow is further penalized by having code to check for impossible buffer overruns. This doesn't affect the performance of our test because I passed in a large enough array, but with a small array it would be an additional bit of overhead. And the extra code wastes more space in the instruction cache: MaxManySlow is 35 bytes larger than MaxManyFast – 73 bytes versus 38 bytes.

Other types

I’m not interested in testing this with all of the built-in types, but I did test with float. The minimal test functions showed different code-gen for the two template functions, but it wasn’t obvious which was better. When I ran tests on arrays of floats FastMax was about 4.8x faster than std::max. The loop with std::max took 7.33 cycles per iteration, while the loop with FastMax took an impressive 1.5 cycles per iteration.
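For reference, the float micro-test has the same shape as the int one; this is a sketch, not the exact code from the attached project:

```cpp
// Float variant of the micro-test; FastMax is repeated from the article so
// this compiles on its own.
template <class T>
T FastMax(const T& left, const T& right)
{
    return left > right ? left : right;
}

float FastMaxTestFloat(float a, float b)
{
    return FastMax(a, b);
}
```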

Test code is attached to the bug and is also (newer version) available here. Run the release build to see the array results, compile the release build and look at MaxTests.cod to see the code-gen of the micro test functions.

I’m surprised I never noticed this before. I guess I got lucky because when I pushed my coworkers from macro MIN/MAX to template Min/Max I accidentally did return by value.

Update

I've tested this lightly with gcc and it seems to handle the references without any problems. I hear clang handles it fine as well. However, using a single micro-test to compare compiler quality is clearly meaningless, so don't over-extrapolate.

deniskravtsov suggested using overloads of min/max for built-in types to avoid this problem, which is a fascinating idea. However, if you put the overloads of min/max in the global namespace then they will not always be used, and if you put them in the std namespace then you are probably breaking the language rules. I like using FastMax better.
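A sketch of the namespace problem (the function names here are hypothetical): a global overload is only found by unqualified calls, so any code that writes std::max explicitly never sees it.

```cpp
#include <algorithm>

// Hypothetical global-namespace overload of max for a built-in type.
inline int max(int a, int b)
{
    return a > b ? a : b;
}

inline int UsesGlobal(int a, int b)
{
    return max(a, b);       // unqualified: finds the global overload
}

inline int UsesStd(int a, int b)
{
    return std::max(a, b);  // qualified: the global overload is ignored
}
```

Both functions compute the same result, but only UsesGlobal gets the by-value overload; the qualified call still instantiates the standard const-reference template.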

While testing the overload suggestion I found that if the overload takes its parameters by value then the performance is slightly lower than if it takes them by reference. I don’t know if that is a general trend or if it is peculiar to this one test. It does seem quite odd that the parameter type and return type of a function that is inlined would cause so much trouble! The performance difference from the parameter types is slight so I wouldn’t read too much into it.

Reddit discussion is here and here.

A few comments to my readers…

A surprising number of people said that the problem would go away if std::max was inlined by the compiler. Uhhh – in both of my examples std::max was inlined by the compiler.

There were also several people saying that this would never happen in real code. Well, it was originally found in real code. That code needed high performance, and it took quite a while to find out what was making the optimizer generate imperfect code. The slowdown in that case wasn’t three times, but it was enough to matter. Also, I think MaxManySlow looks like real code – I’m sure I’m not the first person to write that exact loop.

FastMax may indeed generate slower code for heavyweight types – but maybe not. That test will have to wait for another day.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048
This entry was posted in Performance, Programming, Visual Studio.

44 Responses to std::min Causing Three-Times Slowdown on VC++

  1. Nathan Reed says:

    Good find! This makes me wonder if VS is weak in optimizing away references in general – what other opportunities are being missed?

    • brucedawson says:

      That would be fascinating to investigate. The one factor that probably reduces the frequency of the problems is that references are not frequently used to return built-in types, and that seems to be what triggers this.

    • Mark says:

      Well as an example: VS2010 fails to inline some embarrassingly inlinable functions marked as inline if we separate the definition from the declaration. We compile with full optimisation and link time code generation, so no excuses.

      It even fails to inline with the MS specific __forceinline keyword! I know it’s only a suggestion still… but seriously, if you’ve gone to the effort to use __forceinline you really are expecting a function to be inlined.

      That being said, it does seem like Microsoft has significantly picked up their game since VS2010 and it’s only getting better each release. However, this is only based on my personal code and not a 2+ million line code base like my VS2010 experience.

      I wouldn’t complain though. Each compiler has its own little idiosyncrasies and if you need total performance then you profile and special case for each target architecture.

      In the end Bjarne sums it up pretty well in that C++ has become too “expert friendly”. For me, this doesn’t just include the core language but the difficulty in writing a decent optimising compiler as well.

      • brucedawson says:

        I strongly recommend creating a repro of the missed optimization opportunities and filing a bug. Either that will help find a workaround, or it will give Microsoft a chance to fix an issue that you care about.

        If nothing else it would let us have a concrete discussion.

  2. Sidney Just says:

    Thanks for the tip, we use std::min and std::max quite extensively instead of macros, for the reasons you’ve mentioned at the beginning, and porting our code to Windows is on the list for the near future.
    One thing I would like to point out though: more instructions don’t always mean longer runtime! Just looking at the instruction count will rarely give a clue about the runtime of something. I know that you know that, but just in case someone sees this post and is now going through the assembly their compiler generated and trying to get the instruction count down.

    • brucedawson says:

      Yep, there are definitely cases where more instructions are better — loop unwinding being one obvious example, or replacing a multiply or divide with multiple cheaper instructions. In this case the extra instructions are not offering those benefits, so the best we can hope for is identical performance with more bloat.

  3. denis says:

    Ah, not sure my comment went through. I’m normally using an overload for POD types in the std:: namespace:

    namespace std
    {
        int min(int i1, int i2)
        {
            return i1 < i2 ? i1 : i2;
        }
    }

    This lets you use generic code while “specialising” for specific cases.

    BTW the /GS option terminates the process when things go wrong and you don’t get a crash dump, so you need to keep guessing unless you turn it off. MS needs to change this default behaviour or allow customising it.

  4. Kat Marsen says:

    That’s outrageous. I can confirm this is true for 64-bit code generation under VS2010 as well:

    ?DoubleMaxTest@test@@YANNN@Z PROC ; test::DoubleMaxTest

    ; 42 : return test::Max(a, b);

    comisd xmm0, xmm1
    movsdx QWORD PTR [rsp+16], xmm1
    movsdx QWORD PTR [rsp+8], xmm0
    lea rax, QWORD PTR a$[rsp]
    ja SHORT $LN7@DoubleMaxT@2
    lea rax, QWORD PTR b$[rsp]
    $LN7@DoubleMaxT@2:
    movsdx xmm0, QWORD PTR [rax]

    ?DoubleMaxTestByValue@test@@YANNN@Z PROC ; test::DoubleMaxTestByValue

    ; 58 : return b < a ? a : b;

    comisd xmm0, xmm1
    ja SHORT $LN4@DoubleMaxT
    movapd xmm0, xmm1
    $LN4@DoubleMaxT:

    If you want to see something really silly, take a look at the generated code when one or both arguments are immediates. For example for std::cout << std::max(3,6) I get:

    000000013F802D1C mov dword ptr [rbp+88h],6
    000000013F802D26 mov dword ptr [rbp+0C8h],3
    000000013F802D30 mov dword ptr [rbp+0D8h],6
    000000013F802D3A mov edx,6
    000000013F802D3F mov rcx,qword ptr [__imp_std::cout (13F9D33E0h)]
    000000013F802D46 call qword ptr [__imp_std::basic_ostream<char,std::char_traits >::operator<< (13F9D33C8h)]

    I have no idea why it's storing 6 twice here. Using the by-value version as expected doesn't store anything.

    • brucedawson says:

      Make sure you are testing an optimized (release) build. When I test std::max(3,6) on a VS 2013 optimized 64-bit build I get “mov eax, 6\nret 0”, which is perfect.

      The other tests seem to behave similarly with the 64-bit compiler, so it has the same weakness as the 32-bit compiler.

      • Kat Marsen says:

        Indeed, /Ox. We already always use our own version of min/max (templated though and defined pretty much exactly like the standard versions). Some calls are optimized just fine, many though are not. After switching to overloads (but still using the reference return) I don’t see any extraneous stores. So there’s something about the const& return plus the template that confuses the thing.

        Unfortunately it doesn’t seem like you can have overloads for common types plus a template version to pick up everything else– not only does the type safety go out the window but it picks the (double,double) overload in far more cases than I’d expect, including a nasty one where someone provided an operator bool() (sigh).

        I briefly flirted with using some sort of SFINAE monstrosity to have the template change the return value based on the arguments (I used has_trivial_destructor) but could not get past the need to explicitly specify the type of the argument in order to use the specialization… adding a helper method to do it for me doesn’t help obviously because then that itself has a return value that needs to be defined (and taking a const& to a temporary return value isn’t a great idea). I guess it’s better to not fight this one, and just wait for a fix.

        • brucedawson says:

          I’m not sure why you are getting different results from me. In my tests I find that the only thing that matters is the return type — template versus not makes no difference. I would be surprised if being a template mattered since a template is just a way of stamping out code and shouldn’t affect code generation.

          I’m also unclear if you saying that you used /Ox on the std::max(3,6) test because in my tests that construct produces perfect code in optimized builds. If it is not for you then you should check your build settings.

          Anyway, I recommend changing your min/max return type. The reference return type is rarely needed — you can always use std::max or MaxRef() when you need a reference return type.

          • Kat Marsen says:

            Note I’m using 2010, not 2013, but other than that I used your setup above with the wonderful /FAcs trick to iterate. Since I expect further optimization to occur during linking I just skimmed around a production binary for the silly store pattern in functions I know make heavy use of min/max on UDTs that essentially just wrap a double.

            I think it’s good advice to just flat out change the return type to a value, after all, how often does a “complex” UDT get used in min/max that is also performance-critical?

  5. Seth says:

    Certainly a good find. Hope this will improve the compiler in the future.

    I got a bit worried with your early recommendation of template min/max as a “drop-in replacement”.
    I think casual readers may need a bit more explicit warning about the cases where the behaviour changes from standard C++.

    (I’d have preferred std::min(a,b), std::min(std::ref(a), std::ref(b)) and std::min(std::cref(a), std::cref(b)) etc. so users would have had control. Alas, that’s history).

    I’m also a bit sceptical about this being significant in any real code base. It won’t be noticeable unless you’re executing a large number of similar operations in an algorithm, in which case I’d argue that it is always going to be trivial to optimize your higher-level algorithm (like in the case of your benchmark, a “naive max_element”).

    And that’s exactly what the programmer of said algorithm should already be doing, IMO.

    Cheers

    • brucedawson says:

      When would FastMax not be a good replacement? Obviously for noncopyable objects it won’t work, and for expensive to copy objects it may be worse (it depends on what the optimizer does and how you use it) but I’m trusting my readers to realize that.

      This issue definitely was significant in a real code base. Michael Abrash was trying to optimize some very performance sensitive code and his (entirely reasonable) use of std::max was causing enough slowdown to matter. It wasn’t causing a 3x slowdown, but it was noticeable. The trouble is that it wasn’t at all obvious that std::max was what was making the compiler get stupid, so it took a while to find the fix.

      • Seth says:

        Thanks for adding that. I gave an example of a semi-contrived case where
        behaviour changes in response to @denis. I just said that it’s not a _general_
        drop-in replacement (standards conformance wise).

        Anyways, in my experience: first I write _intentional_ code (i.e. highlevel
        code, using standard abstractions as much as possible) and aim for 1.
        correctness 2. expressiveness (in that order; they often jibe really well).
        This would be the phase that _might_ sport a `std::max` call or two.

        In the next (optional) phase, I optimize when the profiler tells me. This phase
        often sees me routinely dissolving “minor” abstractions used (like
        std::min/max) because it sacrifices little in terms of expressiveness.

        Of course, in practice the “big wins” come from other types of changes:

        * break the abstraction and use intermediate state or “inside information” to
        make the code do the same, while writing more specific detailed steps, e.g.

        – “Removing The Varnish”: e.g. fill a vector manually, and sort/index
        (lower_/upper_bound ranges) instead of relying on boost’s
        bimap/flat_[multi]map/etc.).
        – “Crusting on Special Cases”: add manual administrative overhead to take
        advantage of special cases

        Very little surprises me anymore in that phase. I admit that I don’t often go
        down to the generated assembly anymore. I think that’s mainly because I can
        interact with profiler data on the source level just fine.

        The other side of it perhaps just means this: I’m not a compiler or
        (proprietary) library developer 🙂

        🙂

        • Seth says:

          heh my irony tag got eaten [/end unsolicited rant]

        • brucedawson says:

          I agree with your workflow, with one possible exception. If I have to remove std::max and replace it with FastMax or with manually going if/else or ?: then I get annoyed. std::max should be a zero-cost abstraction. Until a few months ago I thought that std::max *was* a zero-cost abstraction.

  6. Guillaume says:

    Nice catch and great article Bruce!

  7. I’m surprised by this. It seems like such a simple optimization. gcc and clang seem to handle std::max just fine:
    http://bit.ly/IefWtH

    I think I prefer the others’ suggestion of overloading std::max for built-in types. That way whenever MS gets around to fixing this you can just remove your overloads and not have to change any code.

    • brucedawson says:

      The problems with overloading std::max are you can’t overload it in the std namespace without changing the meaning of some programs, and if you overload it outside of the std namespace it will not always get used. Hence, I prefer FastMax, and it’s easy enough to switch back to std::max in the future.

  8. Adrian says:

    Note that, when the arguments are equal, std::max returns the left argument (25.4.7). Your first example of a max implementation under the “Functions it is” header does not behave exactly like std::max. To make them match, I believe you’d want the meat of the function to be:

    return a < b ? b : a;

    FastMax returns the value of the right argument, but since it is returning by value, this shouldn't matter. Nonetheless, I thought it was worth pointing out to avoid confusion.

    Personally, I would use FastMax only in places where performance matters, and only until the compiler vendor fixes the optimizer for std::max. For all the places where performance isn't critical, I'd still use std::max.

    • brucedawson says:

      Thanks for the point about the subtleties of the std::max contract. I agree that FastMax should be discarded as soon as possible. Luckily s/FastMax/std::max/ is an easy change to make.

  9. semmy13 says:

    Thanks for the article. I was reading the article and @denis’s comment and I thought I’d try to let the compiler decide what type to return.
    @seth’s comment about the standard is still valid and my version of std::max (AutoMax) misbehaves for every scalar type, but that’s, as you said, a temporary solution.
    Also, regarding @adrian’s comment, it looks from my tests like the implementation using operator> instead of operator< generates slower code.
    I hope the code is correct; it's a quick test I wrote at night and I'm not a template expert 🙂

    template
    struct return_type_impl {
    typedef const T& value_type;
    };

    template
    struct return_type_impl {
    typedef T value_type;
    };

    template
    struct return_type : public return_type_impl< std::is_scalar::value, T > {
    };

    template
    inline typename return_type::value_type AutoMax(const T& left, const T& right) {
    return left < right ? right : left;
    }

    This code compiles and seems to behave correctly in VS2012 and 2013. It can be further improved using more type_traits (is_trivial) and eventually considering the size of the type (although this would be platform-specific and I don't think useful too often).
    In the end I decided to only use is_scalar because it is easy to implement in VS2010/12/13.

    • semmy13 says:

      Ops. Always the same mistake. The code should display the < > part now

      template < bool B, class T >
      struct return_type_impl {
      typedef const T& value_type;
      };

      template < class T >
      struct return_type_impl < true, T > {
      typedef T value_type;
      };

      template < class T >
      struct return_type : public return_type_impl < std::is_scalar<T>::value, T > {
      };

      template < class T >
      inline typename return_type < T > ::value_type AutoMax(const T& left, const T& right) {
      return left < right ? right : left;
      }

      • brucedawson says:

        I am reminded of this famous quote:

        Some people, when confronted with a problem, think
        “I know, I’ll use regular expressions.” Now they have two problems.

        For “regular expressions” substitute “template metaprogramming”. I’m intrigued by your solution, but I’m not sure there was ever enough of a problem to justify it. I think the programmer should choose the return type. It’s unfortunate that the default return value for ‘max’ is const-ref and that (for a while) VC++ developers should prefer to return by-value, but I’d rather they learned to call FastMax than AutoMax.

        Reference: http://regex.info/blog/2006-09-15/247

  10. Anteru says:

    For the record, both Clang & GCC generate optimal code here.

    anteru@computer:/tmp/mintest$ cat test.cpp
    #include <algorithm>
    
    int min (const int a, const int b)
    {
            return std::min (a, b);
    }
    anteru@computer:/tmp/mintest$ clang++ -c -S test.cpp -O3 -std=c++11
    anteru@computer:/tmp/mintest$ cat test.s
            .file   "test.cpp"
            .text
            .globl  _Z3minii
            .align  16, 0x90
            .type   _Z3minii,@function
    _Z3minii:                               # @_Z3minii
            .cfi_startproc
    # BB#0:
            cmpl    %edi, %esi
            cmovlel %esi, %edi
            movl    %edi, %eax
            ret
    .Ltmp0:
            .size   _Z3minii, .Ltmp0-_Z3minii
            .cfi_endproc
    
    
            .section        ".note.GNU-stack","",@progbits
    anteru@computer:/tmp/mintest$ g++ -c -S test.cpp -O3
    anteru@computer:/tmp/mintest$ cat test.s
            .file   "test.cpp"
            .text
            .p2align 4,,15
            .globl  _Z3minii
            .type   _Z3minii, @function
    _Z3minii:
    .LFB414:
            .cfi_startproc
            cmpl    %edi, %esi
            movl    %edi, %eax
            cmovle  %esi, %eax
            ret
            .cfi_endproc
    .LFE414:
            .size   _Z3minii, .-_Z3minii
            .ident  "GCC: (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1"
            .section        .note.GNU-stack,"",@progbits
  11. Hi Bruce,

    I have not read the entire article, so I apologize in advance if I’m making a mistake.

    According to the C++ Standard, max must return the first argument when the arguments are equivalent.
    See N3797 [alg.min.max] p9

    Your following code snippet returns the second argument when the arguments are equivalent, so it is not standard compliant.

    template <class T> inline const T& max(const T& a, const T& b)
    {
        return b < a ? a : b;
    }

    According to Alex Stepanov in his book Elements of Programming, he acknowledges that he made a mistake, and that mistake has been propagated to the Standard and now is difficult to fix.

    So your version of max is more accurate than the Standard version; unfortunately, implementations have to be conformant with the Standard 🙂

    Regards,
    Fernando Pelliccioni

  12. Sorry about my last comment.
    I don’t know why I thought you were a developer on the C++ Standard Library team at Microsoft.
    Ignore my comment.

  13. Pingback: Self Inflicted Denial of Service in Visual Studio Search | Random ASCII

  14. Hi Bruce,

    Long time reader of your blog here.

    I came across your article because I have a similar, but opposite issue regarding std::max(). I’m using Visual Studio 2013 Update 4. I’m compiling for x64 with full optimizations on (got the same results with and without LTCG).

    Here’s my test program:

    #include <cstdio>

    const float& max_ref(const float& lhs, const float& rhs) {
        return lhs < rhs ? rhs : lhs;
    }

    float max_val(const float lhs, const float rhs) {
        return lhs < rhs ? rhs : lhs;
    }

    int main(int argc, char* argv[]) {
        float x;
        std::scanf("%f", &x);
        std::printf("%f\n", max_ref(x, 0.0f));
        std::printf("%f\n", max_val(x, 0.0f));
        return 0;
    }

    max_ref(), which mimics the implementation of std::max(), compiles to

    movss xmm0, DWORD PTR [edx]
    comiss xmm0, DWORD PTR [ecx]
    cmova ecx, edx
    mov eax, ecx
    ret 0

    while max_val() compiles to

    comiss xmm1, xmm0
    jbe SHORT $LN4@max_val
    movaps xmm0, xmm1
    $LN4@max_val:
    ret 0

    That seems to contradict your conclusions, but I suspect we are using slightly different versions of the compiler. My intuition is that VC++ now recognizes the exact signature of std::max() and generates the proper code for it.

    Any thought?

    Franz

    • brucedawson says:

      The code-gen has changed, and the results also vary depending on whether you test with int or float, and whether you look at std::max or a function that calls (and inlines) it.

      The best test is not the code generated for std::max, because that should always be inlined. Instead the best test is a simple wrapper function that calls either std::max or FastMax. With that test, for integers, I see shorter code with a branch for 64-bit FastMaxTest versus StdMaxTest and longer code with a branch for 32-bit FastMaxTest versus StdMaxTest.

      So, perhaps different, but certainly not optimal yet.

      My one recommendation would be to look at the code-gen for a wrapper function such as FastMaxTest or StdMaxTest, not std::max itself. The code-gen for std::max may be better, but it isn't relevant since it will always be inlined.

  15. Any news on VS2015’s handling of std::max?
