As part of my continuing exploration of things that are slower than they should be I recently came across a problem with VirtualAlloc. If you use the wrong flag with this function then, depending on how much memory you allocate, you may find it running 1,000 times slower, or perhaps even far worse.
The gory details
VirtualAlloc is a Windows function used for low-level memory allocation. It lets you reserve regions of virtual address space and commit pages within those regions.
I recently received a report that an application running on one of our servers was starting up significantly more slowly on machines with more memory. By ‘more slowly’ I mean ridiculously slowly. When we upgraded to 288 GB (yep, that’s a metric boatload of memory) the startup time went from 3 minutes to 25 minutes.
This application reserves most of memory for a cache and it does it in 1 MB chunks, so on a 288 GB machine it was doing 250,000 calls to VirtualAlloc. It is these memory allocations that were taking 25 minutes. And honestly, even 3 minutes is far too long.
A bit of work with ETW tracing together with an inspired guess quickly led to the solution:
Don’t use the VirtualAlloc MEM_TOP_DOWN flag.
The MEM_TOP_DOWN flag tells VirtualAlloc to allocate from the top of the address space. This is helpful when looking for bugs in an application that has just been ported to 64-bit because it ensures that the top 32-bits of addresses are necessary.
Nowhere is there any warning that VirtualAlloc with this flag does not scale well. It appears to use an O(n^2) algorithm where ‘n’ is related to the number of allocations you have made with this flag. From profiling it looks like it is slowly and methodically scanning a list of the reserved memory trying to ensure that it finds the hole with the highest address.
When this flag is removed the time went from 25 minutes to instantaneous. Okay, it probably wasn’t instantaneous but it was so fast that it wasn’t worth measuring.
I decided to run some tests. I don’t have a 288 GB server so I had to run tests on my underpowered 8 GB laptop. After some experimentation I ended up with code that would VirtualAlloc a bunch of 64-KB address ranges, then free them, and then do it again. All allocations were done with the flags MEM_COMMIT | MEM_RESERVE | MEM_TOP_DOWN. Each time it allocated more, going from 1,000 to 64,000. I made my test program a 64-bit app so that it could easily handle the ~4 GB of address space this required.
I dropped the data into Excel and it made the most perfect O(n^2) graph I have ever seen:
I then tried it without the MEM_TOP_DOWN flag and the results were strikingly different:
Not only has the graph gone from quadratic to linear, but check out the change in the scale. With MEM_TOP_DOWN VirtualAlloc is, for large numbers of allocations, over a thousand times slower, and getting slower all the time!
It’s amusing to put both results on one chart just to see the blue graph (VirtualAlloc without MEM_TOP_DOWN) being indistinguishable from a horizontal line at zero.
Just to be thorough I grabbed an ETW trace. ETW is great for this kind of stuff because it can grab full call stacks all the way from the implementation of VirtualAlloc down to my test code. Here’s the relevant excerpt of the CPU Usage (Sampling) Table:
Over the 8.23 seconds that I looked at the sampling shows that approximately 8,155 milliseconds (8.155 seconds) was spent trying to find the next address, instead of actually adjusting page tables and allocating memory.
Don’t use MEM_TOP_DOWN, except for testing. That’s pretty obvious.
Here’s the test code. No warranty. Don’t remove the copyright notice. Feel free to otherwise use and modify. Compile as 64-bit.
// Copyright Bruce Dawson
const size_t MaxAllocs = 64000;
QueryPerformanceCounter( &time );
QueryPerformanceFrequency( &frequency );
void TimeAllocTopDown(bool topDown, const size_t pageSizeBytes, size_t count)
DWORD flags = MEM_COMMIT | MEM_RESERVE;
flags |= MEM_TOP_DOWN;
assert(count <= _countof(pMem));
__int64 start = GetQPC();
for (size_t i = 0; i < count; ++i)
pMem[i] = VirtualAlloc( NULL, pageSizeBytes, flags, PAGE_READWRITE );
__int64 elapsed = GetQPC() – start;
printf(” allocs took %7.4f s for %5lld allocs of %lld KB blocks (total is %lld MB).\n”, elapsed / GetQPF(), (long long)count, (long long)pageSizeBytes / 1024, (long long)(count * pageSizeBytes) / (1024 * 1024));
// Free the memory.
for (size_t i = 0; i < _countof(pMem); ++i)
VirtualFree( pMem[i], 0, MEM_RELEASE );
pMem[i] = 0;
int _tmain(int argc, _TCHAR* argv)
bool topDown = true;
for (size_t blockSize = 64 * 1024; blockSize <= 64 * 1024; blockSize += blockSize)
for (int i = 0; i <= 64; ++i)
TimeAllocTopDown(topDown, blockSize, i * 1000);