Some code optimizations require complex data structures and thousands of lines of code. But, in a surprising number of cases, significant improvements can be made by simple changes – sometimes as simple as typing a single zero. It’s like the old story of the boilermaker who knows the right place to tap with his hammer – he sends an itemized bill for $0.50 for tapping the valve, and $999.50 for knowing where to tap.
I’ve been personally involved in several performance bugs that were fixed with the typing of a single zero, and in this post I want to share two of them.
The importance of measurement
Back in the days of the original Xbox I helped optimize a lot of games. While working on one of these games the profiler led me to a matrix transformation function that was consuming 7% of CPU time – the biggest spike on the graph. So, I dutifully went to work to optimize this function.
I was not the first person who had been down this path. The function had already been rewritten in assembly language. I found a few potential improvements to the assembly language and tried to measure the improvements. This is a crucial step – otherwise I could easily have ended up checking in an ‘optimization’ that made no difference, or perhaps even made things worse.
However, measuring the improvements was difficult. I’d launch the game, play for a bit while profiling, and then examine the profile to see whether the code was faster. It looked like there might be some modest progress, but there was so much randomness that it was hard to be sure.
So I got all scientific. I wrote a test harness that could drive the old and new versions of the code so that I could precisely measure the performance differences. This didn’t take long and it let me see that, as predicted, the new code was about 10% faster than the old.
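A minimal sketch of what such a harness can look like, written in C. The two `transform_*` routines and the `time_transform` helper are hypothetical stand-ins invented for illustration, not the code from the game. The idea is simply to run both versions on identical input, outside the noise of the full application, and keep the best of several timings.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature for the routine under test. */
typedef void (*transform_fn)(const float *in, float *out, size_t count);

/* Hypothetical "old" version: one element per iteration. */
void transform_old(const float *in, float *out, size_t count) {
    for (size_t i = 0; i < count; ++i)
        out[i] = in[i] * 2.0f + 1.0f;
}

/* Hypothetical "new" version: same arithmetic, unrolled by two. */
void transform_new(const float *in, float *out, size_t count) {
    size_t i = 0;
    for (; i + 1 < count; i += 2) {
        out[i] = in[i] * 2.0f + 1.0f;
        out[i + 1] = in[i + 1] * 2.0f + 1.0f;
    }
    if (i < count) /* handle an odd trailing element */
        out[i] = in[i] * 2.0f + 1.0f;
}

/* Call fn `repeats` times on the same data and return the smallest time
   reported by clock(), in seconds. Taking the minimum filters out runs
   that were disturbed by other activity on the machine. */
double time_transform(transform_fn fn, const float *in, float *out,
                      size_t count, int repeats) {
    assert(repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        clock_t start = clock();
        fn(in, out, count);
        clock_t end = clock();
        double seconds = (double)(end - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || seconds < best)
            best = seconds;
    }
    return best;
}
```

Call `time_transform` once with each version on a large buffer and compare the two results. `clock()` is coarse on many platforms, so the buffer needs to be big enough that a single call takes a measurable amount of time. It is also worth comparing the two output buffers, since a faster function that gives different answers is not an optimization.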
What was more interesting was that the code inside of the test harness was running about 10x faster (900%!!!) than the same code inside of the game. That was an exciting discovery.
After checking my results, and staring into space for a while, I realized what must be happening.
In order to give game developers full control and maximum performance, video game consoles let game developers allocate memory with different attributes. In particular, the original Xbox would let game developers allocate non-cacheable memory. This type of memory (actually, this type of tag in the page tables) is useful when writing data that will be used by the GPU. Because the memory is non-cacheable the writes will go almost straight to RAM, avoiding the delays and cache-pollution that would happen with ‘normal’ memory mappings.
So non-cacheable memory is an important optimization, but it must be used carefully. In particular, it is crucial that games never try to read from non-cacheable memory, or their performance will be severely compromised. Even the relatively slow 733 MHz CPU of the original Xbox needed its caches to give adequate performance when reading data.
With this knowledge in hand the explanation was clear. The data used by this function must have been allocated in non-cacheable memory, and that was why the performance was poor. A bit of investigation confirmed this hypothesis and the stage was set for the fix. I located the line of code that allocated the memory, double-clicked on the flag value that was erroneously requesting non-cacheable memory, and typed zero.
The cost of this function went from ~7% of CPU time down to about 0.7% of CPU time, and was no longer of interest.
My status report at the end of that week was something like “39.999 hours of investigation, 0.001 hours of coding – huge success!”
Most developers don’t need to worry about accidentally allocating non-cacheable memory – that option isn’t easily available in user space in most operating systems. But, if you want to see how much non-cacheable memory can slow down your code, try using the PAGE_NOCACHE or PAGE_WRITECOMBINE flags with VirtualAlloc.
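Here is a rough sketch of that experiment in C. The helper names (`alloc_buffer`, `free_buffer`, `time_reads`) are made up for this example. On Windows it passes `PAGE_READWRITE | PAGE_NOCACHE` to `VirtualAlloc`, since `PAGE_NOCACHE` is a modifier that has to be combined with a regular protection flag. On other platforms the sketch assumes no equivalent is available and falls back to `malloc`, so the flag has no effect there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <stdlib.h>
#endif

/* Allocate `size` bytes. On Windows, optionally request non-cacheable pages. */
void *alloc_buffer(size_t size, int non_cacheable) {
#ifdef _WIN32
    DWORD protect = PAGE_READWRITE;
    if (non_cacheable)
        protect |= PAGE_NOCACHE; /* modifier on top of PAGE_READWRITE */
    return VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, protect);
#else
    (void)non_cacheable; /* assumption: no portable equivalent, use malloc */
    return malloc(size);
#endif
}

void free_buffer(void *p) {
#ifdef _WIN32
    VirtualFree(p, 0, MEM_RELEASE);
#else
    free(p);
#endif
}

/* Fill a buffer, then time reading every byte back. Returns the time
   reported by clock(), in seconds, and stores the byte sum in *checksum
   so the read loop has a visible result. */
double time_reads(size_t size, int non_cacheable, uint64_t *checksum) {
    unsigned char *buf = alloc_buffer(size, non_cacheable);
    assert(buf != NULL);
    for (size_t i = 0; i < size; ++i)
        buf[i] = (unsigned char)(i & 0xFF);

    uint64_t sum = 0;
    clock_t start = clock();
    for (size_t i = 0; i < size; ++i)
        sum += buf[i];
    clock_t end = clock();

    free_buffer(buf);
    *checksum = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

On Windows, calling `time_reads` with a buffer of a few tens of megabytes, once with `non_cacheable` set to 0 and once with it set to 1, should give two timings to compare. How large the gap is will depend on the hardware, so measure rather than assume.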
Zero GiB is better than four GiB
The other tale I want to share is of a bug that I found, but which somebody else fixed. A couple of years ago I noticed that the disk cache on my laptop was getting purged quite frequently. I tracked this down to a transient 4 GiB allocation, and I eventually discovered that the device driver for my new backup drive was setting SectorSize to 0xFFFFFFFF (or –1) to indicate an unknown sector size. The Windows kernel interpreted this value as 4 GiB, allocated that big a block of memory, and that was the cause of the problem.
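The arithmetic behind that is easy to check. The sketch below is not the driver or kernel code, and the function names are hypothetical. It only shows that –1 stored in an unsigned 32-bit field has every bit set, and that treating the field as a byte count gives a value one byte short of 4 GiB.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "unknown" sentinel: -1 stored in an unsigned 32-bit field. */
uint32_t unknown_sector_size(void) {
    return (uint32_t)-1; /* wraps to 0xFFFFFFFF */
}

/* Hypothetical consumer that takes the field at face value as a size. */
uint64_t bytes_requested(uint32_t sector_size) {
    return (uint64_t)sector_size; /* zero-extended, so no sign to recover */
}

/* Usage: the sentinel turns into a request one byte short of 4 GiB. */
void demo_sector_size(void) {
    uint64_t bytes = bytes_requested(unknown_sector_size());
    assert(bytes == 4294967295ULL);
    assert(bytes + 1 == 4ULL * 1024 * 1024 * 1024);
}
```

A sentinel of zero avoids the problem because a consumer that takes it at face value asks for nothing at all.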
I don’t have any contacts at Western Digital but it is pretty safe to assume that they fixed this bug by selecting the 0xFFFFFFFF (or -1) constant and then typing zero. A single character typed, and a significant performance regression fixed.
(Full details of this investigation can be found at Windows Slowdown, Investigated and Identified.)
- In both cases the problem was related to caching
- Using a profiler to accurately identify the problem is crucial
- A fix that is not verified through measurements is not necessarily a fix
- I could write about many instances of this but the other examples are either too secret or too boring
- The right fix needn’t be complicated. Sometimes a huge improvement can be made by a tiny change, but you’ve got to know where to tap
I’ve also optimized code by commenting out a #define, and other trivial changes. Share your similar stories in the comments.