It’s important to understand the cost of memory allocations, but this cost can be surprisingly tricky to measure. It seems reasonable to measure this cost by wrapping calls to new and delete with timers. However, for large buffers these timers may miss over 99% of the true cost of these operations, and these hidden costs are larger than I had expected.
Further complicating these measurements, it turns out that some of the cost may be charged to another process and will therefore not show up in any timings that you might plausibly make.
In this (Windows specific) post I’m going to explain what these hidden costs are, how they hide, how to measure them, and what you should do about it.
Which of these do you need to measure to fully record the cost of freeing and reallocating memory rather than reusing it?
- Time spent allocating and freeing the memory
- Time spent using the memory
- CPU time consumed in the system process
- All of the above
Spoiler alert, the answer is #4. For sufficiently large blocks of memory the time to allocate and free memory is only part of the CPU cost associated with the allocation. Details to follow.
I like big blocks and I cannot lie
In this post I am exclusively focusing on allocating large blocks of memory – those that are large enough that the heap delegates the allocation to the operating system. I am ignoring the low-fragmentation heap, heap contention, and heap fragmentation in order to focus on the operating system issues of allocating large blocks of memory. This is a real issue that I have encountered multiple times, and fixing it can give an easy performance win, but I don’t want to try to account for all memory allocation issues in one post.
Measuring allocating and freeing
Measuring the cost of allocating and freeing memory seems easy. Just call new/delete in a loop for various sized blocks of memory and measure how long they take. In my tests, for sizes ranging from 8 MB to 32 MB, the cost for a new/delete pair averaged about 7.5 μs (microseconds), split into ~5.0 μs for the allocation and ~2.5 μs for the free. For the large allocations I was testing the size of the allocation did not seem to significantly affect the results.
Freeing is about to get slower
Allocating memory without using it is a bit artificial, so the next thing to do is to write to the entire block of memory before freeing it. Pause for a moment to think about what effect this should have on the cost of allocating and freeing memory. I’ll wait.
The cost of allocating the memory goes up slightly in this scenario, presumably because writing to the memory flushes all sorts of useful data from the CPU’s caches, making subsequent allocations slightly slower. This effect adds about 8.0 μs to the allocation cost.
Freeing the memory gets a lot more expensive. When the memory is written before freeing the cost to free the memory is about 75 μs per MB – a huge jump from the ~2.5 μs to free unused memory. This means that it takes about 2,400 μs (2.4 ms) to free 32 MB of memory that has been used!
Memory, on demand
The reason for this seemingly peculiar behavior has to do with the laziness of the Windows operating system. For large allocations (a MB or larger I believe) the Windows heap (new/new, malloc/HeapAlloc) will use the system VirtualAlloc function (with MEM_COMMIT | MEM_RESERVE) to request memory. This will reserve address space and allocate commit charges, but doesn’t actually commit any pages. The pointer returned by VirtualAlloc is basically just a promissory note – the operating system promises that when pages are touched there will be memory there but, as the documentation says “Actual physical pages are not allocated unless/until the virtual addresses are actually accessed.” This is often a good thing because applications that allocate more memory than they need may end up being less wasteful than intended, and Linux behaves similarly.
So, when memory is allocated and freed without being touched the free operation has to do very little. However if the memory has been touched then the pages have been faulted in and when the memory is freed they actually have to be removed. Removing those pages from the process’ address space is what takes 75 μs per MB.
If removing the pages from the process’ address space takes 75 μs per MB then presumably it also takes some time to bring the pages in to the address space. This is more difficult to measure because it happens on demand, whenever a page is first accessed. The easiest way to measure this is to measure how long it takes to write to a freshly allocated block of memory, and compare it to how long it takes to write to a previously written block of memory.
This is a perilous measurement. There are many different effects that you can end up measuring. If the memory blocks are too small then you may mostly measure cache effects. Beyond that you may be measuring TLB effects. If you don’t have enough memory then pages may be removed from your working set and then need to be paged back in, and other processes can easily interfere with the results. CPU frequency changes can distort the results.
But, I did my best. My CPU’s L3 cache is 6 MB so I tested with buffers from 8 MB to 32 MB to avoid cache effects. I have lots of memory, and I shut down as many processes as possible to avoid interference. I ran the tests multiple times. I used the high-performance power profile and ran a background busy thread to keep the CPU frequencies consistently high. I also recorded ETW (xperf) profiles and examined them to identify sources of error.
The result was pretty clear. Faulting in the pages for first-time use costs a minimum of ~175 μs per MB. In some situations the cost is a lot more, for reasons that I don’t understand, but for allocations of 8+ MB you should assume a minimum cost of ~175 μs per MB when memory is first used.
Page faults, soft versus hard
There is a difference between a soft/minor page fault and a hard/major page fault. A soft page fault is when a page is faulted in from memory. This happens when a freshly allocated page is first referenced and it also happens when a page is faulted in from the standby list – perhaps a page that was trimmed from the working set, or perhaps a memory mapped file that was in the system disk cache. Soft page faults, while expensive in this context, are still quite cheap.
A hard page fault is a page fault that requires going to disk, perhaps to page in a memory-mapped file or to retrieve data from the swap file. These are more expensive because, you know, disks are slow. Hard faults can easily cost over a second per MB. But those are hard faults and this post is entirely about soft faults, which are much cheaper.
(for more details see http://en.wikipedia.org/wiki/Page_fault)
But wait, there is yet another cost of allocating memory that we have not accounted for.
For security reasons it is critical that freshly mapped pages be zeroed, since otherwise information would leak between processes – all modern operating systems do this. Zeroing pages isn’t super expensive, but it isn’t free.
On Windows the OS tries to keep a pool of zeroed pages available. These pages can then be quickly faulted in to the processes that need them. But this pool must be replenished by taking free pages and zeroing them. What this means is the System process has a low-priority thread that zeroes pages when they are freed. If it is able to keep up then all of the zeroing of memory after freeing will be done outside of your process, and the true cost is thoroughly hidden.
(see http://blogs.msdn.com/b/tims/archive/2010/10/29/pdc10-mysteries-of-windows-memory-management-revealed-part-two.aspx for more information on the life-cycle of a memory page)
However nothing can hide from the all-seeing eye of ETW (xperf) traces. The CPU Usage (Precise) graph shows my test process in red and the system process in green. At the end of each test loop when my test process frees memory you can see the system process spring to life:
This means that the only way to measure the true cost of your inexpensive memory allocations is to record a trace with WPT and look both in your process (for KiPageFault) and in the system process (for time spent in the zero page thread in KeZeroMemory – KeZeroPages on Windows 8.1).
If the zero-page thread can’t keep up then the pages may be zeroed on demand in the context of your process, in which case KeZeroMemory/KeZeroPages may show up. This is more likely to happen on machines with fewer CPUs.
The CPU Usage (Sampled) data tells you what the zero page thread is doing. Then, once you know that thread 8 is always the zero-page thread, you can use the CPU Usage (Precise) data to measure exactly how long that thread wakes up each time it finds work. This works out to ~150 μs per MB of memory that is touched and then freed.
A naive measurement of the cost of allocating and freeing large blocks of memory would conclude that it costs about 7.5 μs for each alloc/free pair. However there are three separate per-MB costs for large allocations. As the chart shows these add up to approximately 400 μs per MB. So, allocating an 8 MB buffer every frame (perhaps to hold a 1080p RGBA image) can easily waste 3.2 ms (3,200 μs) of CPU time per frame. Some of this will be hidden in the system process where it won’t hurt performance on many-core developer machines, but will harm frame-rate on dual-core customer machines.
If you don’t like paying the cost of page faults when you use memory then you can force it to happen at allocation time by using VirtualLock. However, as discussed by Raymond Chen you should be sure that doing so is actually appropriate – it often isn’t. PrefetchVirtualMemory is the equivalent for memory-mapped files, and similar cautions apply.
It’s worth mentioning that most of these numbers are minimum costs. In particular the time to fault in pages is sometimes much higher, for reasons that I don’t understand.
Real-time zero-page monitoring
Recording an ETW trace is the most accurate and reliable way to find out what is going on, performance-wise, in your process. I record a lot of traces and I learn a lot. Lately I’ve made a habit of looking for KiPageFault in my processes, and looking for heavy activity in the zero-page thread.
But this is tedious and time-consuming. It can be very helpful to monitor the zero-page thread’s activity in real-time. Activity in this thread is a proxy for heavy usage of freshly allocated memory and makes it easy to correlate this with the actions I am taking.
It turns out that this monitoring is easy.
The information I am about to supply makes use of undocumented Windows details that may change in future, present, or past versions of Windows. This information should only be used for diagnostic purposes. It works for me, for now, but if you ship a product that uses this information then I will ask Raymond Chen to form a vigilante group with me to hunt you down and corrupt random bytes in your compiler.
Note that Windows 10 does, indeed, change the assumption below. You were warned.
I have observed (see disclaimer above) that in my tests the zero-page thread always has thread ID 8. Therefore we can easily monitor the activity level of the zero-page thread by:
- Using OpenThread to get a handle to thread ID 8
- Using QueryThreadCycleTime to get the cycles used by this thread
- Sitting in a loop calling Sleep(1000) and printing the zero-page cycles consumed in the last second.
That’s it. The program has to run as administrator and it is clearly a horrible hack, but it has helped me identify code that is reallocating memory far too frequently. If the zero-page thread gets significantly busier whenever your code is running then it might be worth recording a trace and looking for KiPageFault.
All tests were done on my four-core eight-thread Intel(R) Core(TM) i7-2720QM CPU @ 2.20 GHz (Turboboost up to 3.3 GHz) with 8 GB RAM, running Windows 7 SP1. Spot tests were run on Windows 8.1 and the results were similar.
Crudely designed test code available here.
I did some spot tests on Linux but I was unable to cleanly measure the cost of mapping pages in and out of memory. I did see the kernel zeroing memory as it was mapped in after first being written – watch for clear_page_c_e or similar functions when you run perf top. A bit of searching shows that developers have noticed the cost of clear_page_c_e before. Linux seems to clear pages on-demand instead of in a dedicated thread, which has pros and cons.
If you want to know more about how to use ETW to do this type of deep investigation of Windows performance then check out ETW Central.
Arseny Kapoulkine wrote an awesome blog post that explains other issues with soft page faults, such as apparent serialization in the kernel which can make them even more expensive!