Knowing Where to Type ‘Zero’

Some code optimizations require complex data structures and thousands of lines of code. But, in a surprising number of cases, significant improvements can be made by simple changes – sometimes as simple as typing a single zero. It’s like the old story of the boilermaker who knows the right place to tap with his hammer – he sends an itemized bill for $0.50 for tapping the valve, and $999.50 for knowing where to tap.

I’ve been personally involved in several performance bugs that were fixed by typing a single zero, and in this post I want to share two of them.

The importance of measurement

Back in the days of the original Xbox I helped optimize a lot of games. While working on one of these games the profiler led me to a matrix transformation function that was consuming 7% of CPU time – the biggest spike on the graph. So, I dutifully went to work to optimize this function.

I was not the first person who had been down this path. The function had already been rewritten in assembly language. I found a few potential improvements to the assembly language and tried to measure the improvements. This is a crucial step – otherwise I could easily have ended up checking in an ‘optimization’ that made no difference, or perhaps even made things worse.

However, measuring the improvements was difficult. I’d launch the game, play for a bit while profiling, and then examine the profile to see whether the code was faster. It looked like there might be some modest progress, but there was so much randomness that it was hard to be sure.

So I got all scientific. I wrote a test harness that could drive the old and new versions of the code so that I could precisely measure the performance differences. This didn’t take long and it let me see that, as predicted, the new code was about 10% faster than the old.

But it turns out that the 10% speedup was not the interesting part.

What was more interesting was that the code inside of the test harness was running about 10x faster (900%!!!) than the same code inside of the game. That was an exciting discovery.

After checking my results, and staring into space for a while, I realized what must be happening.

Caching counts

In order to give game developers full control and maximum performance, video game consoles let game developers allocate memory with different attributes. In particular, the original Xbox would let game developers allocate non-cacheable memory. This type of memory (actually, this type of tag in the page tables) is useful when writing data that will be used by the GPU. Because the memory is non-cacheable the writes will go almost straight to RAM, avoiding the delays and cache-pollution that would happen with ‘normal’ memory mappings.

So non-cacheable memory is an important optimization, but it must be used carefully. In particular, it is crucial that games never try to read from non-cacheable memory, or their performance will be severely compromised. Even the relatively slow 733 MHz CPU of the original Xbox needed its caches to give adequate performance when reading data.

With this knowledge in hand I realized what must be happening. The data used by this function must have been allocated in non-cacheable memory, and that was why the performance was poor. A bit of investigation confirmed this hypothesis and the stage was set for the fix. I located the line of code that allocated the memory, double-clicked on the flag value that was erroneously requesting non-cacheable memory, and typed zero.

The cost of this function went from ~7% of CPU time down to about 0.7% of CPU time, and was no longer of interest.

My status report at the end of that week was something like “39.999 hours of investigation, 0.001 hours of coding – huge success!”

Most developers don’t need to worry about accidentally allocating non-cacheable memory – that option isn’t easily available in user space in most operating systems. But, if you want to see how much non-cacheable memory can slow down your code, try using the PAGE_NOCACHE or PAGE_WRITECOMBINE flags with VirtualAlloc.

Zero GiB is better than four GiB

The other tale I want to share is of a bug that I found, but which somebody else fixed. A couple of years ago I noticed that the disk cache on my laptop was getting purged quite frequently. I tracked this down to a transient 4 GiB allocation, and I eventually discovered that the device driver for my new backup drive was setting SectorSize to 0xFFFFFFFF (or –1) to indicate an unknown sector size. The Windows kernel interpreted this value as 4 GiB, allocated that big a block of memory, and that was the cause of the problem.

I don’t have any contacts at Western Digital but it is pretty safe to assume that they fixed this bug by selecting the 0xFFFFFFFF (or -1) constant and then typing zero. A single character typed, and a significant performance regression fixed.

(full details of this investigation can be found at Windows Slowdown, Investigated and Identified)

Observations

  • In both cases the problem was related to caching
  • Using a profiler to accurately identify the problem is crucial
  • A fix that is not verified through measurements is not necessarily a fix
  • I could write about many instances of this but the other examples are either too secret or too boring
  • The right fix needn’t be complicated. Sometimes a huge improvement can be made by a tiny change, but you’ve got to know where to tap

I’ve also optimized code by commenting out a #define, and other trivial changes. Share your similar stories in the comments.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
This entry was posted in Performance. Bookmark the permalink.

18 Responses to Knowing Where to Type ‘Zero’


  3. Very motivational and educational post. Thanks a lot for sharing these stories with us.

  4. CdrJameson says:

    Also note that running your code in an artificial situation, such as a test harness or massive loop, may remove it from the actual problem.

  5. If “read from non-cacheable memory” is a common type of bug, it might be worth modifying Address Sanitizer to catch it.

    • brucedawson says:

      I don’t think it’s a common pattern because I don’t even know if/how user-land processes can allocate non-cacheable memory. It could be a problem for device drivers, so maybe asan could check for it.

      Writing to write-combined memory is also a bit finicky — the Xbox 360 tools would check those rules as well.

  6. The “read from non-cacheable memory” pattern sounds a bit like “read from non-aligned pointers”, which has caused some performance bugs in Firefox that were intermittent and hard to track down. Luckily, alignment bugs can be caught by running regression tests under -fsanitize=alignment, which can be easier than profiling.

  7. panino says:

    In your first story, at first I thought it was going to be a denormal🙂

  8. Godji says:

    Tons of stories similar to this one though not many that would add much to your examples. Avoiding useless copies, unique data initialization to be taken out of the loop, minimising cache faults…

    Classic issues which I often find annoying as hell to debug are those involving data access in multi-threaded processes. Those ones could be worth a couple of articles, there are tons of different issues to be aware of, and some can be pretty tricky to identify and isolate (heck, I’m still occasionally learning some by making the mistake and having to track it myself :p).

    In the end, I find many developers could do with a bit more knowledge of how hardware works; even being aware of some aspects can lead to at least asking the right questions, if not knowing the right answer directly.

  9. CW says:

    I’ve had huge speed-ups in toolchain processes just by adding a call to ‘resize’ to pre-allocate space in a std::vector where people were making a large number of calls to ‘push_back’. In one case reducing level export times from ~45 minutes down to ~5 minutes by introducing a single line of code.

    • brucedawson says:

      That’s a good one. I’ve also made huge improvements by *removing* a call to std::vector::resize() — calling it inside a loop can defeat exponential growth and make growth costs go quadratic.
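      A minimal sketch of both patterns from this thread (illustrative, not the commenters’ actual code):

      ```cpp
      #include <cassert>
      #include <vector>

      // Growing element by element: with reserve() up front there is a single
      // allocation; without it, push_back reallocates O(log n) times. Calling
      // resize(n) inside the loop instead would defeat exponential growth and
      // make the total growth cost quadratic.
      std::vector<int> build(size_t n, bool preallocate) {
          std::vector<int> v;
          if (preallocate)
              v.reserve(n);  // one allocation, no copies during growth
          for (size_t i = 0; i < n; ++i)
              v.push_back(static_cast<int>(i));
          return v;
      }

      int main() {
          std::vector<int> v = build(1000, true);
          assert(v.size() == 1000 && v.capacity() >= 1000);
      }
      ```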

  10. Alois Kraus says:

    I had similar experiences with concurrent collections in .NET: http://geekswithblogs.net/akraus1/archive/2015/01/26/161276.aspx They are fine when a lot of thread contention is going on, but one major drawback is that they allocate memory proportional to the number of cores, leading to dramatically higher allocations on high-end machines. A simple Dictionary, for example, needs 45 MB for 300K allocs. On an 8-core machine a ConcurrentDictionary needs 600 MB; on a 24-core machine, 1.4 GB…
    Needless to say, things got much slower the bigger (and more expensive) the machine the code was running on. By going back to a simple Dictionary with a lock (contention was not an issue at all) things got much better.

  11. Pingback: Zeroing Memory is Hard (VC++ 2015 arrays) | Random ASCII
