Making VirtualAlloc Arbitrarily Slower

As part of my continuing exploration of things that are slower than they should be I recently came across a problem with VirtualAlloc. If you use the wrong flag with this function then, depending on how much memory you allocate, you may find it running 1,000 times slower, or perhaps even far worse.


The gory details

VirtualAlloc is a Windows function used for low-level memory allocation. It lets you reserve regions of virtual address space and commit pages within those regions.

I recently received a report that an application running on one of our servers was starting up significantly more slowly on machines with more memory. By ‘more slowly’ I mean ridiculously slowly. When we upgraded to 288 GB (yep, that’s a metric boatload of memory) the startup time went from 3 minutes to 25 minutes.

Ouch.

This application reserves most of the machine's memory for a cache, and it does so in 1 MB chunks, so on a 288 GB machine it was making roughly 250,000 calls to VirtualAlloc. It is these memory allocations that were taking 25 minutes. And honestly, even 3 minutes is far too long.

A bit of work with ETW tracing together with an inspired guess quickly led to the solution:

Don’t use the VirtualAlloc MEM_TOP_DOWN flag.

The MEM_TOP_DOWN flag tells VirtualAlloc to allocate from the top of the address space. This is helpful when looking for bugs in an application that has just been ported to 64-bit because it ensures that the returned addresses will not fit in 32 bits, which exposes pointer-truncation bugs.

Nowhere is there any warning that VirtualAlloc with this flag does not scale well. It appears to use an O(n^2) algorithm, where 'n' is related to the number of allocations you have made with this flag. From profiling it looks like it is slowly and methodically scanning a list of reserved memory regions, trying to ensure that it finds the hole with the highest address.

When this flag was removed the time went from 25 minutes to instantaneous. Okay, it probably wasn't instantaneous, but it was so fast that it wasn't worth measuring.

Testing

I decided to run some tests. I don't have a 288 GB server, so I ran them on my underpowered 8 GB laptop. After some experimentation I ended up with code that would VirtualAlloc a batch of 64 KB address ranges, free them, and then do it again, with each iteration allocating more ranges, going from 1,000 up to 64,000. All allocations were done with the flags MEM_COMMIT | MEM_RESERVE | MEM_TOP_DOWN. I made my test program a 64-bit app so that it could easily handle the ~4 GB of address space this required.

I dropped the data into Excel and it made the most perfect O(n^2) graph I have ever seen:

[Chart: allocation time versus number of MEM_TOP_DOWN allocations, a smooth O(n^2) curve]

I then tried it without the MEM_TOP_DOWN flag and the results were strikingly different:

[Chart: allocation time versus number of allocations without MEM_TOP_DOWN, linear and far faster]

Not only has the graph gone from quadratic to linear, but check out the change in the scale. With MEM_TOP_DOWN VirtualAlloc is, for large numbers of allocations, over a thousand times slower, and getting slower all the time!

It’s amusing to put both results on one chart just to see the blue graph (VirtualAlloc without MEM_TOP_DOWN) being indistinguishable from a horizontal line at zero.

[Chart: both series on one graph; the non-MEM_TOP_DOWN line hugs zero]

Just to be thorough I grabbed an ETW trace. ETW is great for this kind of stuff because it can grab full call stacks all the way from the implementation of VirtualAlloc down to my test code. Here’s the relevant excerpt of the CPU Usage (Sampling) Table:

[Screenshot: CPU Usage (Sampling) table with call stacks from inside VirtualAlloc down to the test code]

Over the 8.23 seconds that I looked at, the sampling shows that approximately 8.155 seconds were spent trying to find the next allocation address, rather than actually adjusting page tables and allocating memory.

Conclusion

Don’t use MEM_TOP_DOWN, except for testing. That’s pretty obvious.

Here’s the test code. No warranty. Don’t remove the copyright notice. Feel free to otherwise use and modify. Compile as 64-bit.

// Copyright Bruce Dawson

#include <windows.h>
#include <assert.h>
#include <stdio.h>
#include <tchar.h>

const size_t MaxAllocs = 64000;

__int64 GetQPC()
{
    LARGE_INTEGER time;
    QueryPerformanceCounter( &time );
    return time.QuadPart;
}

double GetQPF()
{
    LARGE_INTEGER frequency;
    QueryPerformanceFrequency( &frequency );
    return (double)frequency.QuadPart;
}

void* pMem[MaxAllocs];

void TimeAllocTopDown(bool topDown, const size_t pageSizeBytes, size_t count)
{
    DWORD flags = MEM_COMMIT | MEM_RESERVE;
    if (topDown)
        flags |= MEM_TOP_DOWN;
    assert(count <= _countof(pMem));
    __int64 start = GetQPC();
    for (size_t i = 0; i < count; ++i)
    {
        pMem[i] = VirtualAlloc( NULL, pageSizeBytes, flags, PAGE_READWRITE );
        assert(pMem[i]);
    }
    __int64 elapsed = GetQPC() - start;
    if (topDown)
        printf("Topdown");
    else
        printf("Normal ");
    printf(" allocs took %7.4f s for %5lld allocs of %lld KB blocks (total is %lld MB).\n",
           elapsed / GetQPF(), (long long)count, (long long)pageSizeBytes / 1024,
           (long long)(count * pageSizeBytes) / (1024 * 1024));

    // Free the memory.
    for (size_t i = 0; i < _countof(pMem); ++i)
    {
        if (pMem[i])
            VirtualFree( pMem[i], 0, MEM_RELEASE );
        pMem[i] = 0;
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    bool topDown = true;
    for (size_t blockSize = 64 * 1024; blockSize <= 64 * 1024; blockSize += blockSize)
    {
        for (int i = 0; i <= 64; ++i)
        {
            TimeAllocTopDown(topDown, blockSize, i * 1000);
        }
    }
    return 0;
}

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
This entry was posted in Programming, xperf.

19 Responses to Making VirtualAlloc Arbitrarily Slower

  1. Karl says:

    I think the purpose of the MEM_TOP_DOWN flag is for when you allocate a region of space that is going to be used for a long time, not for situations where it needs to be reallocated lots of times. It allows you to allocate a region of address space that is located far away from where more frequent allocation and deallocation happen.

    For example: in a program I'm currently writing I have a facility to edit a bunch of text documents. These documents are backed by memory-mapped files (when saved), but the edit history is backed by pages in a virtual memory region. Since this virtual memory region is supposed to exist for as long as the application is running, it is useful to put it at the far end of the process's user-mode address space, away from all the main action of the application. So I use the MEM_TOP_DOWN flag to put a region (large enough to support extremely heavy use of my editing facility) at the far end of the address space. I don't use this flag when committing pages from within this region. So far I use this flag once in my application's lifetime and I can't see any bad impact on my application.

    So I think in your situation you are probably right in avoiding this flag. But it is a useful flag when creating long lived regions.

    Regards
    Karl

  2. brucedawson says:

    The performance problems with MEM_TOP_DOWN seem to be related to how many outstanding allocations of this type there are. That’s the ‘n’ in O(n^2). So, with just a few allocations you shouldn’t hit any performance problems.

    I’m not sure why a long-lived region should be allocated at the top of memory. I don’t see any association between the two. It’s harmless, but I don’t see what the benefit is.

    The only good use I’ve heard of for this flag is in debugging 64-bit compatibility problems. If you use MEM_TOP_DOWN then you are guaranteed to get addresses that will not fit in 32 bits, and thus expose any pointer truncation bugs.

    • Karl says:

      It is true that these addresses won't fit in 32 bits. The purpose in my case is to avoid external fragmentation of the address space. This is usually more of an issue on 32-bit systems. And although the 64-bit virtual address space is absolutely huge, the user-mode partition is limited to about 8192 GB on x64 Windows. If you use a lot of your address space then reserving a long-lived region in the middle of it can cause external fragmentation: things would have to be allocated/reserved around this region, and some things might not fit "under" the region, which may waste that address space. It is generally not a big problem for 64-bit applications as the address space is so big, but if you can live with 64-bit pointers, using this flag is not a bad thing in my opinion.

      Note that on 64-bit Windows you are not guaranteed to get '32-bit sized pointers' by not using this flag; it entirely depends on how much of your address space is already in use and the size of the region that you are reserving.

      • Karl says:

        I think your point is still very valid for the situation that you describe, and for 64-bit Windows this flag is not really necessary unless you are using extreme amounts of address space.

  3. The performance degradation is duly noted on MSDN, which says this about the use of MEM_TOP_DOWN in VirtualAlloc(Ex) calls: "Allocates memory at the highest possible address. This can be slower than regular allocations, especially when there are many allocations."

    • brucedawson says:

      That’s encouraging progress. In the VS 2010 documentation it just says “Allocates memory at the highest possible address” so it’s good to see that they have improved it. The acknowledgement that allocations slow down when there are more allocations is particularly good as that hints at the precise problem.

      So, good news.

  4. L.C. says:

    Thanks very much for bringing this to your readers’ attention!

    FYI dlmalloc (which is a commonly used memory allocator replacement) calls VirtualAlloc with MEM_TOP_DOWN for any large allocations (by default anything >= 256k). Having such a large threshold puts a very reasonable upper bound on the number of these allocations you can have in 32-bit, which thankfully limits the penalty of doing N+1 large allocs. In my testing with 1 GB worth of 256k top-down allocations, the penalty for doing another was less than 0.2 ms.

    However, if you’re in 64-bit you may want to increase dlmalloc’s DEFAULT_MMAP_THRESHOLD to compensate for the likely increase in the number of simultaneous large allocations due to the increased address space.

  5. dithermaster says:

    Two things:
    1. I think you mean “does it in 1 MB chunks” instead of “does it in 1 GB chunks” because the latter would only make 288 calls instead of 250,000.
    2. For 64-bit pointer truncation testing we use the “AllocationPreference” registry setting (described here http://msdn.microsoft.com/en-us/library/windows/desktop/bb613473(v=vs.85).aspx) instead of the MEM_TOP_DOWN flag. I haven’t tested it to see if it gets the same bad performance, but now I’m curious.

    • brucedawson says:

      Oops — you’re right. I fixed that. I also fixed your comment🙂

      It would help if Microsoft would actually document the meaning of the allocation preference flag. Is it a bit flag? Is it an address (with probably 4 KB page units) for where allocations should start? My guess is that it is the latter, but forcing us to guess at both the interpretation and the units seems cruel and unhelpful.

      My guess would be that performance will be good and that allocations will start at the value of AllocationPreference times 4 KB. Please let me know what you find. In my brief testing I can’t make this flag do anything. I created a REG_DWORD of the requested name and value and it made no difference to allocation patterns of my 64-bit Windows 7 application.

  6. dithermaster says:

    Worked for me; it’s what we used for 64-bit development to detect pointer truncation.
    Note: Changing the AllocationPreference requires a restart to take effect.

    • brucedawson says:

      Ah — that explains why it didn’t work in my tests.

      That requirement also makes the lack of proper documentation for this flag even more frustrating. Experimenting to figure out how that setting works when it requires a reboot each time is a bit much. But I may have to do it anyway because the setting would be handy. It’s too bad it’s global instead of per-process.

      • brucedawson says:

        Sorry, no good. AllocationPreference is not a starting address for allocations. Instead it appears to be a bit flag that sets MEM_TOP_DOWN for all allocations. Thus, it hits the O(n^2) behavior of MEM_TOP_DOWN.

        I confirmed the top-down allocation behavior by calling VirtualAlloc in a loop and noting that the addresses returned started with 000007FFFFDF0000 and headed down from there.

        I confirmed the slowdown by running my test code with the topDown flag set to false and verifying that the allocation times still showed perfect O(n^2) behavior. Here are some typical results:

        Normal allocs took 0.0000 s for 0 allocs
        Normal allocs took 0.0039 s for 1000 allocs
        Normal allocs took 0.0163 s for 2000 allocs
        Normal allocs took 0.0412 s for 3000 allocs
        Normal allocs took 0.0816 s for 4000 allocs
        Normal allocs took 0.1313 s for 5000 allocs
        Normal allocs took 0.1920 s for 6000 allocs
        Normal allocs took 0.2775 s for 7000 allocs
        Normal allocs took 0.3707 s for 8000 allocs
        Normal allocs took 0.4648 s for 9000 allocs
        Normal allocs took 0.5757 s for 10000 allocs
        Normal allocs took 0.6803 s for 11000 allocs
        Normal allocs took 0.8080 s for 12000 allocs
        Normal allocs took 0.9536 s for 13000 allocs
        Normal allocs took 1.0996 s for 14000 allocs
        Normal allocs took 1.2588 s for 15000 allocs
        Normal allocs took 1.4367 s for 16000 allocs
        Normal allocs took 1.6244 s for 17000 allocs

        It is frustrating that Windows doesn’t bother to offer a per-process flag for doing allocations that start at the 4 GB line. That would be incredibly valuable to anybody doing 64-bit development. Until then my advice from last year is the best option:

        https://randomascii.wordpress.com/2012/02/14/64-bit-made-easy/

        • dithermaster says:

          Thanks for testing this. I’m also sad to see that it is also O(N^2)! Your technique in the linked post ensures high-bits-set pointers but as you’ve said, it sure would be nice if there was a non-performance-robbing simple flag that did it. Cheers.

  7. keithinhanoi says:

    What happens if MEM_TOP_DOWN is used, say, instead of MEM_RESERVE in a 32-bit LAA application? Would the same kinds of slowdowns be expected?

    More specifically, what would the effect be of injecting code in a graphically intensive (ie., real-time 1st/3rd-person perspective 3D) game to replace MEM_RESERVE with MEM_TOP_DOWN, with the conditions of flProtect must be set to PAGE_READWRITE, lpAddress is NULL, and dwSize is at least 1MB?

    • brucedawson says:

      You don't use MEM_TOP_DOWN *instead* of another flag, you use it *in addition to* another flag.

      We hit serious problems because we were allocating 250,000 blocks of address space. A 32-bit application can, at most, allocate 65536 blocks of address space (address space is given out in minimum 64 KB blocks) so the worst of the O(n^2) will be avoided. So, MEM_TOP_DOWN can probably be used in a 32-bit app without severe repercussions.

      Then again, even a ms or two of delay could be noticeable in a game, so I wouldn’t ship it that way.

      It really depends on how many blocks are allocated that way. A few hundred? Probably not a problem.

      • keithinhanoi says:

        Whoops – so MEM_TOP_DOWN is an optional flag. Well that shows my naivete on this topic. I really appreciate the explanation, though.

        The game code example I gave actually comes from some source code that an author of a set of shader enhancements / memory optimization has made public. The claim associated with the use of the MEM_TOP_DOWN flag is that memory allocation fragmentation is reduced and some users claim they see fewer exception error crashes with it in use.

        It sounds as though the slowdowns would likely not be as much of an issue in this situation since, the MEM_TOP_DOWN flag is added only to certain allocations over 1MB. Even with the potential ms slowdowns I guess I’d still be worried about possible race conditions cropping up – but I’m too new to this to know for sure.

        Still – what is actually gained by making some allocations at the top of the memory space of a 32-bit LAA (game) app?

        • brucedawson says:

          I can believe that using MEM_TOP_DOWN might reduce fragmentation, but a more careful analysis would be wise. There are tools such as sysinternals vmmap (http://technet.microsoft.com/en-us/sysinternals/dd535533.aspx) that will help with this. In general if you can avoid intermingling allocations of different sizes in the same range then you will avoid fragmentation. If this technique puts all the 1 MB allocations at the top, away from the rest, then that might be helpful.

          But at some point you have to consider either reducing memory consumption or going to 64-bit. The main obstacle to that is stubborn 32-bit Windows XP users.

          Race conditions should not be an issue, I don’t think.

  8. Pingback: Hidden Costs of Memory Allocation | Random ASCII

  9. Pingback: ETW Central | Random ASCII
