Some code optimizations requires complex data structures and thousands lines of code. But, in a surprising number of cases, significant improvements can be made by simple changes – sometimes as simple as typing a single zero. It’s like the old story of the boilermaker who knows the right place to tap with his hammer – he sends an itemized bill for $0.50 for tapping the valve, and $999.50 for knowing where to tap.
I’ve been personally involved in several performance bugs that were fixed with the typing of a single zero and in this post I want to share two of them.
The importance of measurement
Back in the days of the original Xbox I helped optimize a lot of games. While working on one of these games the profiler led me to a matrix transformation function that was consuming 7% of CPU time – the biggest spike on the graph. So, I dutifully went to work to optimize this function.
I was not the first person who had been down this path. The function had already been rewritten in assembly language. I found a few potential improvements to the assembly language and tried to measure the improvements. This is a crucial step – otherwise I could easily have ended up checking in an ‘optimization’ that made no difference, or perhaps even made things worse.
However measuring the improvements was difficult. I’d launch the game, play for a bit while profiling, and then examine the profile to see whether the code was faster. It looked like there might be some modest progress, but there was so much randomness that it was hard to be sure.
So I got all scientific. I wrote a test harness that could drive the old and new versions of the code so that I could precisely measure the performance differences. This didn’t take long and it let me see that, as predicted, the new code was about 10% faster than the old.
But it turns out that that 10% speedup was not interesting.
What was more interesting was that the code inside of the test harness was running about 10x faster (900%!!!) than the same code inside of the game. That was an exciting discovery.
After checking my results, and starting into space for a while I realized what must be happening.
In order to give game developers full control and maximum performance, video game consoles let game developers allocate memory with different attributes. In particular, the original Xbox would let game developers allocate non-cacheable memory. This type of memory (actually, this type of tag in the page tables) is useful when writing data that will be used by the GPU. Because the memory is non-cacheable the writes will go almost straight to RAM, avoiding the delays and cache-pollution that would happen with ‘normal’ memory mappings.
So non-cacheable memory is an important optimization, but it must be used carefully. In particular, it is crucial that games never try to read from non-cacheable memory, or their performance will be severely compromised. Even the relatively slow 733 MHz CPU of the original Xbox needed its caches to give adequate performance when reading data.
With this knowledge in hand I realized what must be happening. The data used by this function must have been allocated in non-cacheable memory, and that was why the performance was poor. A bit of investigation confirmed this hypothesis and the stage was set for the fix. I located the line of code that allocated the memory, double-clicked on the flag value that was erroneously requesting non-cacheable memory, and typed zero.
The cost of this function went from ~7% of CPU time down to about 0.7% of CPU time, and was no longer of interest.
My status report at the end of that week was something like “39.999 hours of investigation, 0.001 hours of coding – huge success!”
Most developers don’t need to worry about accidentally allocating non-cacheable memory – that option isn’t easily available in user space in most operating systems. But, if you want to see how much non-cacheable memory can slow down your code, trying using the PAGE_NOCACHE or PAGE_WRITECOMBINE flags with VirtualAlloc.
Zero GiB is better than four GiB
The other tale I want to share is of a bug that I found, but which somebody else fixed. A couple of years ago I noticed that the disk cache on my laptop was getting purged quite frequently. I tracked this down to a transient 4 GiB allocation, and I eventually discovered that the device driver for my new backup drive was setting SectorSize to 0xFFFFFFFF (or –1) to indicate an unknown sector size. The Windows kernel interpreted this value as as 4 GiB, allocated that big a block of memory, and that was the cause of the problem.
I don’t have any contacts at Western Digital but it is pretty safe to assume that they fixed this bug by selecting the 0xFFFFFFFF (or -1) constant and then typing zero. A single character typed, and a significant performance regression fixed.
(full details of this investigation can be found at Windows Slowdown, Investigated and Identified)
- In both cases the problem was related to caching
- Using a profiler to accurately identify the problem is crucial
- A fix that is not verified through measurements is not necessarily a fix
- I could write about many instances of this but the other examples are either too secret or too boring
- The right fix needn’t be complicated. Sometimes a huge improvement can be made by a tiny change, but you’ve got to know where to tap
I’ve also optimized code by commenting out a #define, and other trivial changes. Share your similar stories in the comments.
Like this article? I wrote another Xbox (360) cache-related post here.
Very motivational and educational post. Thanks a lot for sharing these stories with us.
Also note that running your code in an artificial situation, such as a test harness or massive loop, may remove it from the actual problem.
Yep, exactly. It is very important to watch for performance changes when you move to a test harness, since understanding those can (as in this case) be crucial.
If “read from non-cacheable memory” is a common type of bug, it might be worth modifying Address Sanitizer to catch it.
I don’t think it’s a common pattern because I don’t even know if/how user-land processes can allocate non-cacheable memory. It could be a problem for device drivers, so maybe asan could check for it.
Writing to write-combined memory is also a bit finicky — the Xbox 360 tools would check those rules as well.
At least on Windows, PAGE_NOCACHE is a valid parameter for the page protection parameter of VirtualAlloc.
Non-cacheable memory isn’t something I’m actually familiar with (I’ve just had to look up that page often to remember that it exists), so I’m uncertain if it’s similar enough, but I was able to reproduce terrible performance on Win7 when passing in PAGE_NOCACHE.
I didn’t realize those flags were exposed. Well, be sure not to use them or you will probably regret it. They could be handy for various synthetic tests though.
The “read from non-cacheable memory” pattern sounds a bit like “read from non-aligned pointers”, which has caused some performance bugs in Firefox that were intermittent and hard to track down. Luckily, alignment bugs can be caught by running regression tests under -fsanitize=alignment, which can be easier than profiling.
What platform were these unaligned-pointer-read bugs on? The penalties for unaligned reads on x86/x64 are so small (often zero) that misaligned memory is often worthwhile. On ARM the penalties can be severe if the OS silently handles misalignment (which I think is a poor idea):
In your first story, at first I thought it was going to be a denormal 🙂
That would have been a good guess, yes. Also tricky to track down.
Tons of stories similar to this one though not many that would add much to your examples. Avoiding useless copies, unique data initialization to be taken out of the loop, minimising cache faults…
Classic issues which I often find annoying as hell to debug are those involving data access in multi-threaded processes. Those ones could be worth a couple of articles, there are tons of different issues to be aware of, and some can be pretty tricky to identify and isolate (heck, I’m still occasionally learning some by making the mistake and having to track it myself :p).
In the end, I find many deelopers could do with a bit more knowledge of how hardware works, even being aware of some aspects can lead to at least asking the right questions if not knowing directly the right answer.
I’ve had huge speed-ups in toolchain processes just by adding a call to ‘resize’ to pre-allocate space in a std::vector where people were making a large number of calls to ‘push_back’. In one case reducing level export times from ~45 minutes down to ~5 minutes by introducing a single line of code.
That’s a good one. I’ve also made huge improvements by *removing* a call to std::vector::resize() — calling it inside a loop can defeat exponential growth and make growth costs go quadratic.
I had similar experiences with concurrent collections in .NET: http://geekswithblogs.net/akraus1/archive/2015/01/26/161276.aspx They are fine when much thread contention is going on. But one major drawback is that they allocate memory proportional to the number of cores leading to dramatic higher allocations on high end machines. A simple dictionary for example needs 45MB for 300K allocs. On a 8 core machine with a ConcurrentDictionary it needs 600MB. On a 24 core machine 1,4GB ….
Needless to say things got much slower the bigger (and more expensive) the machine was were the code was running on. By going back to a simple dictionary with a lock (contention was not an issue at all) things got much better.
About 10 Years ago I was optimizing a part of a DirectX game. We had some code that was writing vertex data into video memory. The code run *much* slower than expected, so I started to investigate.
Turned out the writes have not been done sequentially. It wasn’t fully random. Instead the writes have always been nearby. That’s fine for CPU memory because the cache will take care of that, but for uncached GPU memory that was a killer. Fortunately the CPUs had a write queue which accelerates such write requests, but they only do so if the writes are strictly ascending.
The fix: Do the writes to some temporary chunk of CPU memory first, then copy all that stuff over to video memory using memcpy. Performance increased by 10 or so.
Yep! One of the tools I wrote for Xbox 360 would examine write patterns and look for badly ordered writes to non-cacheable memory, to make it easier to find issues like this.
The usual fix we used was to declare the write pointer as volatile so that the compiler wouldn’t reorder the writes (and also be careful).