I’m planning a post on how to maximize build parallelism in VC++, but first I needed to create a simple project that was slow to compile. Creating such a program was an interesting exercise in its own right and I decided that it deserved a separate blog post.

I started with a recursive algorithm that ran in exponential time. When I made it a compile-time algorithm I found that it compiled in linear time – too fast to be useful for my peculiar purposes. In order to get a slow-to-compile file I had to understand and then prevent the optimization that was allowing my result to be calculated so efficiently at compile time.

Is it wrong that I get so much satisfaction from defeating compilers’ optimizations?

## Fibonacci, calculated slowly

Calculating a Fibonacci number using recursion is a great way to waste some CPU time. Here’s what a simple recursive Fibonacci calculation of F(n) looks like:

int fib(int n) { if (n <= 2) return 1; return fib(n - 1) + fib(n - 2); }

Most calls to *fib(n)* result in two calls to *fib(n)* so one might guess that the expected run-time cost is *O(2^n)*. In fact *fib(n)* is *O(fib(n))* – the runtime cost is proportional to the result. More usefully, this works out to about O(1.618^n). This is exponential which means that relatively small values of *n* can cause *fib(n)* to take a long time to run.

Yes, I know that this is a foolish way to implement fib(n). It is trivial to structure the code so that it is O(n). I’m trying to make something slow.

Executing *fib(45)* with all optimizations on takes about 4 – 5 s on my laptop for the 2.3 billion function calls.

Pop quiz: if we assume 32-bit ints then what, according to the C++ standard, happens if we calculate fib(46)?

## Better living through templates

My goal, however, was to calculate F(n) at compile time, in order to get exponential compile times for my parallel compilation tests. So I translated the recursive function calls into recursive template metaprogramming:

template <int N> struct Fib_t { enum { value = Fib_t<N-1>::value + Fib_t<N-2>::value }; }; // Explicitly specialized for N==2 template <> struct Fib_t<2> { enum { value = 1 }; }; // Explicitly specialized for N==1 template <> struct Fib_t<1> { enum { value = 1 }; }; ... printf("Fib_t<45> is %d.\n", Fib_t<45>::value);

This is a simple, albeit not particularly useful, example of template metaprogramming. Types are used for recursion, and explicit template instantiations halt the recursion. On the face of it this should give the same exponential performance as *fib(int n)*, only at compile time instead of at run-time. I expected it to be slower than the run-time solution because creating a new type is more expensive than a simple function call.

But it wasn’t.

*Fibonacci<45>::Val* compiled with no perceptible delay. After turning off warnings about overflows in constant integer arithmetic I compiled *Fibonacci<500>::Val* in a fraction of a second. I could not measure any cost for this template madness. Clearly the exponential cost was being completely avoided.

## Recursion in linear time

Before analyzing how the compiler is able to handle this case so efficiently let’s try optimizing our recursive *fib(n)* function, while still keeping it recursive.

The key observation is that *fib(n)* gets called exponentially many times, but only actually calculates *n* different values. If we cache the results of *fib(n)* for each value of *n* then we can avoid the replicated calculations and the exponential explosion. Here’s some simple code that caches the results:

int fibfast(int n) { // Non-zero entries in this vector contain the // value of fib(n). Not thread safe! static std::map<int, int> s_fibValues; // Check the cache. int& result = s_fibValues[n]; if (result == 0) { // Calculate F(n) and store it in the cache. result = 1; if (n > 2) result = fibfast(n - 1) + fibfast(n - 2); } return result; }

The code simply maintains a map of results and it consults this map before doing any calculations. This technique is known as memoization.

Below I have a diagram showing the process of calculating *fib(7)* using the original *fib(n)* function. Each arrow represents a function call, radiating out from the bottom left corner. We can see that *fib(2)* is called eight times to calculate *fib(7)*, and this gets exponentially worse as *n* increases.

Without caching we visit each diamond. If we cache values, as shown in *fibfast(n)*, then we only need to visit each column once. The cells with the blue backgrounds are the only ones we have to visit, and the cost drops from exponential to linear. It almost feels magical.

The exact set of cells visited actually depends on expression evaluation order and is likely to be more than just the blue cells but we still only do O(n) function calls and n calculations.

Again, I’m not recommending this way of calculating Fibonacci numbers. Starting from F(1) and building up is much simpler. Recursion simply doesn’t make sense unless you are trying to be slow.

## Compile-time caching

When you use a class template the first time then the compiler instantiates it – it stamps out a class of the appropriate type. However if you use that template a second time with the same template parameters then the compiler typically *doesn’t* instantiate it. That would be wasteful. Instead the compiler typically just reuses the previous instantiation.

This means that the compiler must be keeping a record of what types it has instantiated – a cache if you will. This cache behaves exactly like the cache in *fibfast(n)* and gives exactly the same performance benefit – it makes our exponential algorithm run in linear time.

Cool!

## Defeating optimizations for fun and profit

Now that we know why the compiler is working so quickly we can figure out a way to defeat its optimizations. The basic idea is to avoid reuse of types. We can do this by adding another template parameter whose only purpose is to make the types unique. At each level of recursion we can simply pass this value along when going on the top branch, and set a bit (corresponding to our distance from the leaf nodes) when going down the bottom branch. Simple!

Here’s the code, with the explicit instantiations omitted for simplicity:

template <int TreePos, int N> struct FibSlow_t { enum { value = FibSlow_t<TreePos, N - 1>::value + FibSlow_t<TreePos + (1 << N), N - 2>::value, }; };

*TreePos* is the unique number, and when going down the top branch we just pass it along – it will be zero all along the top of our diagram. When going down the bottom branch (the N – 2 branch) we set a bit. The combination of *TreePos* and *N* ensures that we never instantiate the same type twice. Optimization defeated!

## How slow can we go?

With *Fib_t* I could measure no slowdown, but with *FibSlow_t* I finally got the exponential compile times that I so desperately wanted. Compiling *FibSlow_t<0,18>* takes about 1.7 seconds and every increase in *n* increases this by about 2.8 times. I’m not sure why the slowdown is greater than expected, but I’m just happy to have slain the efficient compilation dragon. It takes *FibSlow_t<0,21>* about 40 seconds to compile which is more than slow enough for my purposes, and the memory usage (about 356 MB for *FibSlow_t<0,21>*) is modest enough to allow lots of compiler parallelism.

Look for this code being put to ‘practical’ use in a future post that will cover how to get maximum parallelism from the VC++ compiler.

Thanks to STL expert Stephan T. Lavavej for reviewing this and suggesting many improvements.

When doing parallell builds it seems as if PDB-generation takes a lot of time and effort, especially on non-SSD-drives. I’m not sure if this is actually a bottleneck but I hope that this slow program of yours generates a lot of PDB-data so I can get a number on this gut feeling of mine from the next article! Otherwise it might be a good thing to tweak it in order to make it realistic!

Sorry to disappoint but my next article will strictly be covering compilation speed. However I’ll probably cover some aspects of linking and PDBs in a subsequent article.

Until then remember that incremental linking is faster than regular linking which is faster than LTCG linking. Choose wisely.

No worry, now I’ve got two post to look forward to 🙂 Thanks for sharing!

Did you try with a constexpr function ? I am curious to know if a compiler is able to do the same caching, as it would be the modern way to write that kind of compile time computation with a C++11 compliant compiler. It had at least the good bonus to not blow in type instantiation.

That sounds like a great test to run — you should totally do that, and let me know what you find. My guess is that it would natively have the exponential slowdown.

The result of the constexpr try on clang is exponential, i had first to push the limits that prevent long constexpr evaluation “-fconstexpr-depth=2147483647 -fconstexpr-steps=2147483647” 🙂

Pairs are value and time on my computer :

{32, 4s}, {33,6.4s}, {34,10.2s}, {35,16.4s}, {36,26.4s}, {37,42.7s}.

It shows a steady 1.6 scale to reach the next step. It then scales linearly on the number of invocation, so there is no caching. The memory is stable at 14MB.

By looking at the code, it looks like it evaluates the AST instead of compiling it, not sure if it would be possible while keeping track of the restrictions and behaviors of such functions anyway.

Thanks for the suggestion. The crazy number of types created by FibSlow_t had unintended side-effects but the constexpr technique (with the VS 2013 CTP) works like a charm. On my computer I get about 3.8s for 32, and memory consumption goes nuts with anything much larger. But no settings changes were needed.

BTW, code is:

constexpr int const_fib(int n)

{

return n <= 2 ? 1 : const_fib(n – 1) + const_fib(n – 2);

}

constexpr int x = const_fib(30);

Get the VS 2013 CTP to compile it.

And somebody suggested using lambdas for slow compilation:

The caching technique is good old memoization : http://en.wikipedia.org/wiki/Memoization

Useful in many contexts, though obviously not for computing Fibs 🙂

Great read, and im looking forward to the next article Bruce. Especially on the subject of link times, if you have a section on that (:

Very nice article! Waiting for the next one.

Cheers.

Usually what I do is just include boost::signals2 and I’m done :p

My hypothesis is that the slow compilation comes from either

a) preprocessor (front end ?)

b) code generation (back end?)

As for a) in these times of ssd’s and precompiled header I don’t think (but I need to test) that this is an issue anymore. It;s interesting to use the Preprocess to file option in VS to see how large is the preprocessed file … but this is slower then usual.

For b VS has an undocumented /Bt option (have I read this on your blog?) that you can pass to the compiler and it shows how much it took to compile but nothing more …

I usually see a correlation between the compile time and the resulting obj file size …

A nice tool to inspect obj files is this http://timtabor.com/dumpbinGUI/index.htm

Also some interesting stuff here http://gameangst.com/?tag=code-bloat

About linking (because for sure this will be very requested) I remeber this very old article

http://nedbatchelder.com/blog/200401/speeding_c_links.html

As everybody …really Really REALLY … looking forward to read the article …

The constexpr Fibonacci is better than boost::signals2 because it produces very small object files and PDBs and it allows fine-grained control over how long you want each source file to take — perfect my needs.

I forgot about /Bt I should have mentioned that.

I think large amounts of memory are more important than SSDs for avoiding disk I/O costs — more on that later.

Thanks for the other links.

Pingback: Make VC++ Compiles Fast Through Parallel Compilation | Random ASCII

What a terribly inefficient method of calculating a Fibonacci series! I am sorry, but you have lost all credibility.

I hope you are joking, the horrible inefficiency is the whole point of the article! Part of knowing how to make compile times faster is knowing what makes them slower. If you are serious you need to re-read because you’ve missed the point!

If Fardon is not joking, I would bet he just read some select words from the entire article. Which, BTW, makes us lose credibility of his literacy.

Another solution would be to change the function being computed, to something inherently more costly, even with memoization, such as the Ackermann funciton.

For example: http://coliru.stacked-crooked.com/a/d8624892fb7ebf97