It’s important to understand the cost of memory allocations, but this cost can be surprisingly tricky to measure. It seems reasonable to measure this cost by wrapping calls to new and delete with timers. However, for large buffers these timers may miss over 99% of the true cost of these operations, and these hidden costs are larger than I had expected.
Further complicating these measurements, it turns out that some of the cost may be charged to another process and will therefore not show up in any timings that you might plausibly make.
In this (Windows specific) post I’m going to explain what these hidden costs are, how they hide, how to measure them, and what you should do about it.
Which of these do you need to measure to fully record the cost of freeing and reallocating memory rather than reusing it?
- Time spent allocating and freeing the memory
- Time spent using the memory
- CPU time consumed in the system process
- All of the above
Spoiler alert, the answer is #4. For sufficiently large blocks of memory the time to allocate and free memory is only part of the CPU cost associated with the allocation. Details to follow.
I like big blocks and I cannot lie
In this post I am exclusively focusing on allocating large blocks of memory – those that are large enough that the heap delegates the allocation to the operating system. I am ignoring the low-fragmentation heap, heap contention, and heap fragmentation in order to focus on the operating system issues of allocating large blocks of memory. This is a real issue that I have encountered multiple times, and fixing it can give an easy performance win, but I don’t want to try to account for all memory allocation issues in one post.
Measuring allocating and freeing
Measuring the cost of allocating and freeing memory seems easy. Just call new/delete in a loop for various sized blocks of memory and measure how long they take. In my tests, for sizes ranging from 8 MB to 32 MB, the cost for a new/delete pair averaged about 7.5 μs (microseconds), split into ~5.0 μs for the allocation and ~2.5 μs for the free. For the large allocations I was testing the size of the allocation did not seem to significantly affect the results.
Freeing is about to get slower
Allocating memory without using it is a bit artificial, so the next thing to do is to write to the entire block of memory before freeing it. Pause for a moment to think about what effect this should have on the cost of allocating and freeing memory. I’ll wait.
The cost of allocating the memory goes up slightly in this scenario, presumably because writing to the memory flushes all sorts of useful data from the CPU’s caches, making subsequent allocations slightly slower. This effect adds about 8.0 μs to the allocation cost.
Freeing the memory gets a lot more expensive. When the memory is written before freeing the cost to free the memory is about 75 μs per MB – a huge jump from the ~2.5 μs to free unused memory. This means that it takes about 2,400 μs (2.4 ms) to free 32 MB of memory that has been used!
Memory, on demand
The reason for this seemingly peculiar behavior has to do with the laziness of the Windows operating system. For large allocations (a MB or larger I believe) the Windows heap (new/new, malloc/HeapAlloc) will use the system VirtualAlloc function (with MEM_COMMIT | MEM_RESERVE) to request memory. This will reserve address space and allocate commit charges, but doesn’t actually commit any pages. The pointer returned by VirtualAlloc is basically just a promissory note – the operating system promises that when pages are touched there will be memory there but, as the documentation says “Actual physical pages are not allocated unless/until the virtual addresses are actually accessed.” This is often a good thing because applications that allocate more memory than they need may end up being less wasteful than intended, and Linux behaves similarly.
So, when memory is allocated and freed without being touched the free operation has to do very little. However if the memory has been touched then the pages have been faulted in and when the memory is freed they actually have to be removed. Removing those pages from the process’ address space is what takes 75 μs per MB.
If removing the pages from the process’ address space takes 75 μs per MB then presumably it also takes some time to bring the pages in to the address space. This is more difficult to measure because it happens on demand, whenever a page is first accessed. The easiest way to measure this is to measure how long it takes to write to a freshly allocated block of memory, and compare it to how long it takes to write to a previously written block of memory.
This is a perilous measurement. There are many different effects that you can end up measuring. If the memory blocks are too small then you may mostly measure cache effects. Beyond that you may be measuring TLB effects. If you don’t have enough memory then pages may be removed from your working set and then need to be paged back in, and other processes can easily interfere with the results. CPU frequency changes can distort the results.
But, I did my best. My CPU’s L3 cache is 6 MB so I tested with buffers from 8 MB to 32 MB to avoid cache effects. I have lots of memory, and I shut down as many processes as possible to avoid interference. I ran the tests multiple times. I used the high-performance power profile and ran a background busy thread to keep the CPU frequencies consistently high. I also recorded ETW (xperf) profiles and examined them to identify sources of error.
The result was pretty clear. Faulting in the pages for first-time use costs a minimum of ~175 μs per MB. In some situations the cost is a lot more, for reasons that I don’t understand, but for allocations of 8+ MB you should assume a minimum cost of ~175 μs per MB when memory is first used.
Page faults, soft versus hard
There is a difference between a soft/minor page fault and a hard/major page fault. A soft page fault is when a page is faulted in from memory. This happens when a freshly allocated page is first referenced and it also happens when a page is faulted in from the standby list – perhaps a page that was trimmed from the working set, or perhaps a memory mapped file that was in the system disk cache. Soft page faults, while expensive in this context, are still quite cheap.
A hard page fault is a page fault that requires going to disk, perhaps to page in a memory-mapped file or to retrieve data from the swap file. These are more expensive because, you know, disks are slow. Hard faults can easily cost over a second per MB. But those are hard faults and this post is entirely about soft faults, which are much cheaper.
(for more details see http://en.wikipedia.org/wiki/Page_fault)
But wait, there is yet another cost of allocating memory that we have not accounted for.
For security reasons it is critical that freshly mapped pages be zeroed, since otherwise information would leak between processes – all modern operating systems do this. Zeroing pages isn’t super expensive, but it isn’t free.
On Windows the OS tries to keep a pool of zeroed pages available. These pages can then be quickly faulted in to the processes that need them. But this pool must be replenished by taking free pages and zeroing them. What this means is the System process has a low-priority thread that zeroes pages when they are freed. If it is able to keep up then all of the zeroing of memory after freeing will be done outside of your process, and the true cost is thoroughly hidden.
(see http://blogs.msdn.com/b/tims/archive/2010/10/29/pdc10-mysteries-of-windows-memory-management-revealed-part-two.aspx for more information on the life-cycle of a memory page)
However nothing can hide from the all-seeing eye of ETW (xperf) traces. The CPU Usage (Precise) graph shows my test process in red and the system process in green. At the end of each test loop when my test process frees memory you can see the system process spring to life:
This means that the only way to measure the true cost of your inexpensive memory allocations is to record a trace with WPT and look both in your process (for KiPageFault) and in the system process (for time spent in the zero page thread in KeZeroMemory – KeZeroPages on Windows 8.1).
If the zero-page thread can’t keep up then the pages may be zeroed on demand in the context of your process, in which case KeZeroMemory/KeZeroPages may show up. This is more likely to happen on machines with fewer CPUs.
The CPU Usage (Sampled) data tells you what the zero page thread is doing. Then, once you know that thread 8 is always the zero-page thread, you can use the CPU Usage (Precise) data to measure exactly how long that thread wakes up each time it finds work. This works out to ~150 μs per MB of memory that is touched and then freed.
A naive measurement of the cost of allocating and freeing large blocks of memory would conclude that it costs about 7.5 μs for each alloc/free pair. However there are three separate per-MB costs for large allocations. As the chart shows these add up to approximately 400 μs per MB. So, allocating an 8 MB buffer every frame (perhaps to hold a 1080p RGBA image) can easily waste 3.2 ms (3,200 μs) of CPU time per frame. Some of this will be hidden in the system process where it won’t hurt performance on many-core developer machines, but will harm frame-rate on dual-core customer machines.
If you don’t like paying the cost of page faults when you use memory then you can force it to happen at allocation time by using VirtualLock. However, as discussed by Raymond Chen you should be sure that doing so is actually appropriate – it often isn’t. PrefetchVirtualMemory is the equivalent for memory-mapped files, and similar cautions apply.
It’s worth mentioning that most of these numbers are minimum costs. In particular the time to fault in pages is sometimes much higher, for reasons that I don’t understand.
Real-time zero-page monitoring
Recording an ETW trace is the most accurate and reliable way to find out what is going on, performance-wise, in your process. I record a lot of traces and I learn a lot. Lately I’ve made a habit of looking for KiPageFault in my processes, and looking for heavy activity in the zero-page thread.
But this is tedious and time-consuming. It can be very helpful to monitor the zero-page thread’s activity in real-time. Activity in this thread is a proxy for heavy usage of freshly allocated memory and makes it easy to correlate this with the actions I am taking.
It turns out that this monitoring is easy.
The information I am about to supply makes use of undocumented Windows details that may change in future, present, or past versions of Windows. This information should only be used for diagnostic purposes. It works for me, for now, but if you ship a product that uses this information then I will ask Raymond Chen to form a vigilante group with me to hunt you down and corrupt random bytes in your compiler.
Note that Windows 10 does, indeed, change the assumption below. You were warned.
I have observed (see disclaimer above) that in my tests the zero-page thread always has thread ID 8. Therefore we can easily monitor the activity level of the zero-page thread by:
- Using OpenThread to get a handle to thread ID 8
- Using QueryThreadCycleTime to get the cycles used by this thread
- Sitting in a loop calling Sleep(1000) and printing the zero-page cycles consumed in the last second.
That’s it. The program has to run as administrator and it is clearly a horrible hack, but it has helped me identify code that is reallocating memory far too frequently. If the zero-page thread gets significantly busier whenever your code is running then it might be worth recording a trace and looking for KiPageFault.
All tests were done on my four-core eight-thread Intel(R) Core(TM) i7-2720QM CPU @ 2.20 GHz (Turboboost up to 3.3 GHz) with 8 GB RAM, running Windows 7 SP1. Spot tests were run on Windows 8.1 and the results were similar.
Crudely designed test code available here.
I did some spot tests on Linux but I was unable to cleanly measure the cost of mapping pages in and out of memory. I did see the kernel zeroing memory as it was mapped in after first being written – watch for clear_page_c_e or similar functions when you run perf top. A bit of searching shows that developers have noticed the cost of clear_page_c_e before. Linux seems to clear pages on-demand instead of in a dedicated thread, which has pros and cons.
Making VirtualAlloc Arbitrarily Slower
If you want to know more about how to use ETW to do this type of deep investigation of Windows performance then check out ETW Central.
Arseny Kapoulkine wrote an awesome blog post that explains other issues with soft page faults, such as apparent serialization in the kernel which can make them even more expensive!
Nice article. Time unit rant: My eyes hurt when I see us as unit of time, the right prefix is µ – Greek small letter Mu and time unit is µs. We live in 21st century and computers are able to understand and render UNICODE quite well.
Fair enough — I do get lazy sometimes and I appreciate when readers point out mistakes (including spelling, grammar, time units). Fixed.
but in the 21st century we still can’t type µ – well you can if you remember (on Windows) to hold ALT and press 0181 on the number pad (if you have one). Well this comes up in searching. Easier to search for us than µs
Hmmm, hurting eyes, you are either a nitpicking whiner, or you need your eyes examined by a doctor. This is not an academic paper or something, we live in the 21st century, where people should be able to get the obvious.
Reblogged this on Bits and Pieces of Code and commented:
Interesting post on memory allocation. Useful and insightful.
You may like to read this paper “why nothing matters, the impact of zeroing” http://users.cecs.anu.edu.au/~steveb/downloads/pdf/zero-oopsla-2011.pdf
Fascinating — thanks for sharing.
I didn’t know Windows zeroed pages in a separate thread.
I Love the disclaimer, I would so like to see that vigilante combo 🙂
Very informative, thanks!
Great article Bruce!
Thank you for the writeup. I have a relevant anecdote to share 🙂
I have implemented and am maintaining qgrep (https://github.com/zeux/qgrep) – a fast full-text regular expression search designed to quickly grep through source code. The source is stored in one file as large (~0.5 Mb) LZ4-compressed chunks. Whenever you do a search you have to read all chunks from the file, for any chunk that has a potential match (there’s a Bloom filter for every chunk that can force you to stop processing it… I really should blog a bit about this tool, there are lots of interesting pieces there) you have to LZ4 decompress it, and scan the result using RE2 (+ fast SSE2 matchers that can find a potential match quickly).
Turns out that on Windows the impact of allocating and deallocating every chunk was very significant – not because of the allocation cost itself, but because of the soft page faults during the actual commit. This was a while ago but https://github.com/zeux/qgrep/commit/4885f8a88dec0d8329b11d12692bd018f7051232 cites 10-50% performance gains… IIRC one of the problems was that the grepping process is parallel but page faults are serialized in the kernel (or at least profiling results seemed to suggest that? not quite sure) so the impact in terms of wall time is significant.
Thanks for sharing. Did you find that the overhead of mapping pages in/out was lower on other operating systems? Measuring that is my next task and I’m looking for pointers.
If you still have a version that does the allocations for each chunk then I’d be interested in looking at an ETW trace. I could probably decode what’s going on and prove/disprove the serialization theory. That would be a shame.
Would it be possible to turn of zeroing of pages that was previously owned by the same process? For most programs that would not really pose a security risk as far as I see it (say for games). That is, a pool of pages for each app that aren’t zero but has been zeroed at an earlier point in time? Just a thought!
It seems that would make bugs which allow attackers to read past the end of buffers (heartbleed?) more dangerous.
If programs are willing to reuse their old unzeroed pages, and they’ll actually receive a performance benefit from doing so, they can always manage some of their own memory.
Correct me if I’m wrong, but I was under the impression that malloc() and friends did exactly that: they reuse the pages of the process without zeroing them, unless done explicitly (commonly by using cmalloc(), which is the only one mentioning zeroing pages in my man page here). It’s only new pages that are zeroed unconditionally, they come from outside.
You are correct that malloc and new often reuse mapped memory and this mapped memory is not guaranteed to be zeroed. However when malloc/new are asked for large blocks of memory they typically go to the operating system and request new pages. The OS functions that malloc/new use guarantee that they will return zeroed pages, and many applications depend on this. So…
To make your idea work you’d have to have a new OS function (or flag) that returns pages that may or may not be zeroed. Then new/malloc would have to use this new function.
This still wouldn’t help much in Windows becomes the memory is usually zeroed long before the OS knows what process will ask for it.
But, your idea is quite possible. The MEM_RESET/MEM_RESET_UNDO flags in VirtualAlloc supports a variant of what you want where it potentially gives back exactly the same pages that you previously freed.
As I’ve written as an answer to kominek before: I’m pretty sure that already happens.
One thing keeps puzzling me, your findings say that freeing 32MB of memory costs about 2.4ms if that memory was used. That seems to me like A LOT. If we pretend that the cost is the same for bigger sizes, in the context of gamedev, that means that when you load 7 textures of 32MB, freeing them right away would cost around 16ms which is the time of one entire frame!
The question would be: what exactly happens, or should I better say, what exactly needs to happen when the memory is freed? Where is all that time being spent? It seems to me like removing memory pages from process’s address space should be something simple/fast? What are your comments on this?
I agree that the cost seems very high. I had to check it many times before believing it.
Not all of the cost is incurred when freeing it — that is only 75 μs per MB. The rest is in faulting in pages, and zeroing pages that have been freed.
As to “where is all that time being spent?” In addition to zeroing memory it is being spent adjusting the process’ page mappings. That isn’t free. Maybe it should be faster, but it could never be free.
“If we pretend that the cost is the same for bigger sizes” — it is.
I encourage you to do your own measurements (I provided my source code) on Windows and other platforms.
When I was still spoiled by Java and wrote my first C++ code, I wondered why it was so ridiculously slow. The Java VM grabs a large chunk of memory and holds on to it. All the garbage collection etc. is an internal affair.
So I just commented out all delete/delete calls and it improved performance by orders of magnitude. The lessons I learned:
– Reuse memory (placement new is your friend)
– Make buffers as long living as you can, e.g. reserve the worst case amount of memory and run with it, the real cost comes with use, so allocating more memory than needed is cheap
– If you have dynamic data structures like trees, you may want to overload new/new and delete/delete to hold on to your memory and reuse freed chunks, since the OS already (sometimes) does that transparently you might want to do some profiling before investing the time
– Keep things on the stack whenever possible
Memory allocation costs in C/C++ code do tend to be much higher than in Java/C#, and the hidden costs that I document are only part of the reason.
In C# a memory allocation is little more than incrementing the free pointer and then using the memory, although there is overhead for running the garbage collector. In C/C++ the allocator has to scan the heap structures looking for free blocks which can be time consuming and may require acquiring locks.
Modern allocators (the Windows 7+ heap) are more efficient but avoiding ‘excessive’ allocation is still important.
As always, profiling is important. It doesn’t make sense to corrupt your design to avoid memory allocation unless memory allocation has been shown to be a problem.
In modern C++ we tend to use higher approaches like memory pools (like the one proposed by Boost). The std::allocator approach was a failure, although some members at the committee are still working hard to fix them.
GC are indeed very useful / impressive, but have limitations has well. More specifically, the GC thread can become a bottleneck on computers having a high number of CPUs. And to be fair, you need to take into account the hidden cost of moving your objects in memory (while your app is paused), which can become a serious issue with large applications.
Something I would like to try one day is to replace the (global) delete operators in order to defer the request in a separate (dedicated) thread so it speeds up memory deallocation on other threads (while still benefiting of cache locality to run the destructors).
Pingback: ETW Central | Random ASCII
Pingback: Xperf for Excess CPU Consumption: WPA edition | Random ASCII
Very informative article thanks for publishing it.
After reading I have reproduced this on my Windows Vista x64 machine with 8GB memory and quad core processor. If I allocate a block of memory (I’ve tried between 8MB and 32MB sized blocks), memset the entire block, and then memset again, I find that consistently the first memset takes 10-15% longer than the second.
So learning from this from now on when allocating dynamic memory I shall try to do so outside of critical loops, and zero initialise the memory to begin with to avoid this extra overhead.
Here is a (unverified) report of these costs (unclear which ones) increasing faster on Windows 10 than on Windows 7 when multiple threads are running:
If I find time I’ll modify the sample code to measure which part of the process is scaling poorly on Windows 10.
Reputable sources have reported that this scaling bug is *fixed* in Fall Creators Update:
This may be addressed by one of the words you used that I didn’t understand. Isn’t testing in your dev environment on a machine that doesn’t also have spotify, facebook, and sixteen HP tasklets competing for resources also make the measurement faulty?
I’m not sure I understand the question. Testing on a machine with minimal other processes running makes it easier to see the overhead. The overhead would still be there if I was running lots of other processes but it would be harder to see and measure.
I assume users exhaust available RAM. Doesn’t this affect the jobs of allocation and freeing?
Some users do indeed exhaust available RAM, which may add additional costs, in some circumstances. Those costs often come at different times, such as when memory is referenced, rather than when it is allocated. Smart users buy computers with enough memory to avoid significant paging.
Are those data still accurate in 2022? Does some soft & hard ware improve allocation speed?
I would hope that on a modern CPU all of these costs would be lower. On the other hand, on a many-core CPU the page-zeroing costs may actually be significantly _higher_ now due to a Windows performance bug that I blogged about this summer:
I’m not inclined to redo my measurements because the answer, ultimately, is “it depends”. I think they are close enough for estimation purposes.