Making Compiles Slow Through Abuse of Templates

I’m planning a post on how to maximize build parallelism in VC++, but first I needed to create a simple project that was slow to compile. Creating such a program was an interesting exercise in its own right and I decided that it deserved a separate blog post.

I started with a recursive algorithm that ran in exponential time. When I made it a compile-time algorithm I found that it compiled in linear time – too fast to be useful for my peculiar purposes. In order to get a slow-to-compile file I had to understand and then prevent the optimization that was allowing my result to be calculated so efficiently at compile time.

Is it wrong that I get so much satisfaction from defeating compilers’ optimizations?

Fibonacci, calculated slowly

Calculating a Fibonacci number using recursion is a great way to waste some CPU time. Here’s what a simple recursive Fibonacci calculation of F(n) looks like:

int fib(int n)
{
	if (n <= 2)
		return 1;
	return fib(n - 1) + fib(n - 2);
}

Most calls to fib(n) result in two calls to fib(n) so one might guess that the expected run-time cost is O(2^n). In fact fib(n) is O(fib(n)) – the runtime cost is proportional to the result. More usefully, this works out to about O(1.618^n). This is exponential which means that relatively small values of n can cause fib(n) to take a long time to run.

Yes, I know that this is a foolish way to implement fib(n). It is trivial to structure the code so that it is O(n). I’m trying to make something slow.

Executing fib(45) with all optimizations on takes about 4 – 5 s on my laptop for the 2.3 billion function calls.

Pop quiz: if we assume 32-bit ints then what, according to the C++ standard, happens if we calculate fib(46)?

Better living through templates

My goal, however, was to calculate F(n) at compile time, in order to get exponential compile times for my parallel compilation tests. So I translated the recursive function calls into recursive template metaprogramming:

template <int N> struct Fib_t {
	enum { value = Fib_t<N-1>::value + Fib_t<N-2>::value };
};

// Explicitly specialized for N==2
template <> struct Fib_t<2> {
	enum { value = 1 };
};

// Explicitly specialized for N==1
template <> struct Fib_t<1> {
	enum { value = 1 };
};
...
printf("Fib_t<45> is %d.\n", Fib_t<45>::value);

This is a simple, albeit not particularly useful, example of template metaprogramming. Types are used for recursion, and explicit template instantiations halt the recursion. On the face of it this should give the same exponential performance as fib(int n), only at compile time instead of at run-time. I expected it to be slower than the run-time solution because creating a new type is more expensive than a simple function call.

But it wasn’t.

Fibonacci<45>::Val compiled with no perceptible delay. After turning off warnings about overflows in constant integer arithmetic I compiled Fibonacci<500>::Val in a fraction of a second. I could not measure any cost for this template madness. Clearly the exponential cost was being completely avoided.

Recursion in linear time

Before analyzing how the compiler is able to handle this case so efficiently let’s try optimizing our recursive fib(n) function, while still keeping it recursive.

The key observation is that fib(n) gets called exponentially many times, but only actually calculates n different values. If we cache the results of fib(n) for each value of n then we can avoid the replicated calculations and the exponential explosion. Here’s some simple code that caches the results:

int fibfast(int n)
{
	// Non-zero entries in this vector contain the
	// value of fib(n). Not thread safe!
	static std::map<int, int> s_fibValues;

	// Check the cache.
	int& result = s_fibValues[n];
	if (result == 0)
	{
		// Calculate F(n) and store it in the cache.
		result = 1;
		if (n > 2)
			result = fibfast(n - 1) + fibfast(n - 2);
	}
	return result;
}

The code simply maintains a map of results and it consults this map before doing any calculations. This technique is known as memoization.

Below I have a diagram showing the process of calculating fib(7) using the original fib(n) function. Each arrow represents a function call, radiating out from the bottom left corner. We can see that fib(2) is called eight times to calculate fib(7), and this gets exponentially worse as n increases.

image

Without caching we visit each diamond. If we cache values, as shown in fibfast(n), then we only need to visit each column once. The cells with the blue backgrounds are the only ones we have to visit, and the cost drops from exponential to linear. It almost feels magical.

The exact set of cells visited actually depends on expression evaluation order and is likely to be more than just the blue cells but we still only do O(n) function calls and n calculations.

Again, I’m not recommending this way of calculating Fibonacci numbers. Starting from F(1) and building up is much simpler. Recursion simply doesn’t make sense unless you are trying to be slow.

Compile-time caching

When you use a class template the first time then the compiler instantiates it – it stamps out a class of the appropriate type. However if you use that template a second time with the same template parameters then the compiler typically doesn’t instantiate it. That would be wasteful. Instead the compiler typically just reuses the previous instantiation.

This means that the compiler must be keeping a record of what types it has instantiated – a cache if you will. This cache behaves exactly like the cache in fibfast(n) and gives exactly the same performance benefit – it makes our exponential algorithm run in linear time.

Cool!

Defeating optimizations for fun and profit

Now that we know why the compiler is working so quickly we can figure out a way to defeat its optimizations. The basic idea is to avoid reuse of types. We can do this by adding another template parameter whose only purpose is to make the types unique. At each level of recursion we can simply pass this value along when going on the top branch, and set a bit (corresponding to our distance from the leaf nodes) when going down the bottom branch. Simple!

Here’s the code, with the explicit instantiations omitted for simplicity:

template <int TreePos, int N>
struct FibSlow_t {
	enum { value = FibSlow_t<TreePos, N - 1>::value + 
			FibSlow_t<TreePos + (1 << N), N - 2>::value, };
};

TreePos is the unique number, and when going down the top branch we just pass it along – it will be zero all along the top of our diagram. When going down the bottom branch (the N – 2 branch) we set a bit. The combination of TreePos and N ensures that we never instantiate the same type twice. Optimization defeated!

How slow can we go?

With Fib_t I could measure no slowdown, but with FibSlow_t I finally got the exponential compile times that I so desperately wanted. Compiling FibSlow_t<0,18> takes about 1.7 seconds and every increase in n increases this by about 2.8 times. I’m not sure why the slowdown is greater than expected, but I’m just happy to have slain the efficient compilation dragon. It takes FibSlow_t<0,21> about 40 seconds to compile which is more than slow enough for my purposes, and the memory usage (about 356 MB for FibSlow_t<0,21>) is modest enough to allow lots of compiler parallelism.

Look for this code being put to ‘practical’ use in a future post that will cover how to get maximum parallelism from the VC++ compiler.

Thanks to STL expert Stephan T. Lavavej for reviewing this and suggesting many improvements.

Reddit discussion is here and here.

About these ads

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.
This entry was posted in Performance, Programming, Visual Studio and tagged , , . Bookmark the permalink.

17 Responses to Making Compiles Slow Through Abuse of Templates

  1. When doing parallell builds it seems as if PDB-generation takes a lot of time and effort, especially on non-SSD-drives. I’m not sure if this is actually a bottleneck but I hope that this slow program of yours generates a lot of PDB-data so I can get a number on this gut feeling of mine from the next article! Otherwise it might be a good thing to tweak it in order to make it realistic!

    • brucedawson says:

      Sorry to disappoint but my next article will strictly be covering compilation speed. However I’ll probably cover some aspects of linking and PDBs in a subsequent article.

      Until then remember that incremental linking is faster than regular linking which is faster than LTCG linking. Choose wisely.

  2. Nicolas Silvagni says:

    Did you try with a constexpr function ? I am curious to know if a compiler is able to do the same caching, as it would be the modern way to write that kind of compile time computation with a C++11 compliant compiler. It had at least the good bonus to not blow in type instantiation.

    • brucedawson says:

      That sounds like a great test to run — you should totally do that, and let me know what you find. My guess is that it would natively have the exponential slowdown.

      • Nicolas Silvagni says:

        The result of the constexpr try on clang is exponential, i had first to push the limits that prevent long constexpr evaluation “-fconstexpr-depth=2147483647 -fconstexpr-steps=2147483647″ :)

        Pairs are value and time on my computer :
        {32, 4s}, {33,6.4s}, {34,10.2s}, {35,16.4s}, {36,26.4s}, {37,42.7s}.

        It shows a steady 1.6 scale to reach the next step. It then scales linearly on the number of invocation, so there is no caching. The memory is stable at 14MB.

        By looking at the code, it looks like it evaluates the AST instead of compiling it, not sure if it would be possible while keeping track of the restrictions and behaviors of such functions anyway.

        • brucedawson says:

          Thanks for the suggestion. The crazy number of types created by FibSlow_t had unintended side-effects but the constexpr technique (with the VS 2013 CTP) works like a charm. On my computer I get about 3.8s for 32, and memory consumption goes nuts with anything much larger. But no settings changes were needed.

          • brucedawson says:

            BTW, code is:

            constexpr int const_fib(int n)
            {
            return n <= 2 ? 1 : const_fib(n – 1) + const_fib(n – 2);
            }
            constexpr int x = const_fib(30);

            Get the VS 2013 CTP to compile it.

            And somebody suggested using lambdas for slow compilation:

  3. AL says:

    The caching technique is good old memoization : http://en.wikipedia.org/wiki/Memoization
    Useful in many contexts, though obviously not for computing Fibs :-)

  4. Alan Wolfe says:

    Great read, and im looking forward to the next article Bruce. Especially on the subject of link times, if you have a section on that (:

  5. mxmadera says:

    Very nice article! Waiting for the next one.

    Cheers.

  6. Mihai Sebea says:

    Usually what I do is just include boost::signals2 and I’m done :p
    My hypothesis is that the slow compilation comes from either
    a) preprocessor (front end ?)
    b) code generation (back end?)

    As for a) in these times of ssd’s and precompiled header I don’t think (but I need to test) that this is an issue anymore. It;s interesting to use the Preprocess to file option in VS to see how large is the preprocessed file … but this is slower then usual.

    For b VS has an undocumented /Bt option (have I read this on your blog?) that you can pass to the compiler and it shows how much it took to compile but nothing more …

    I usually see a correlation between the compile time and the resulting obj file size …
    A nice tool to inspect obj files is this http://timtabor.com/dumpbinGUI/index.htm
    Also some interesting stuff here http://gameangst.com/?tag=code-bloat

    About linking (because for sure this will be very requested) I remeber this very old article
    http://nedbatchelder.com/blog/200401/speeding_c_links.html

    As everybody …really Really REALLY … looking forward to read the article …

    • brucedawson says:

      The constexpr Fibonacci is better than boost::signals2 because it produces very small object files and PDBs and it allows fine-grained control over how long you want each source file to take — perfect my needs.

      I forgot about /Bt I should have mentioned that.

      I think large amounts of memory are more important than SSDs for avoiding disk I/O costs — more on that later.

      Thanks for the other links.

  7. Pingback: Make VC++ Compiles Fast Through Parallel Compilation | Random ASCII

  8. Fardon Heli says:

    What a terribly inefficient method of calculating a Fibonacci series! I am sorry, but you have lost all credibility.

    • Alan Wolfe says:

      I hope you are joking, the horrible inefficiency is the whole point of the article! Part of knowing how to make compile times faster is knowing what makes them slower. If you are serious you need to re-read because you’ve missed the point!

  9. mxmadera says:

    If Fardon is not joking, I would bet he just read some select words from the entire article. Which, BTW, makes us lose credibility of his literacy.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s