Loop unrolling for speed

I recently saw a posting where someone was considering a great deal of loop unrolling.  I can imagine some exotic case where this is a good idea however in general it's more likely to be bad that good.

For those that aren't familiar with the term, loop unrolling is where you change code that looks like this:

for (i=0;i<100;) { a[i] = i++; }

into something like this

for (i=0;i<100;) { a[i] = i++; a[i] = i++;a[i] = i++;a[i] = i++; }

(only usually the loop body is a little more complicated, and the unrolling is smarter)

Remember the purpose of loop unrolling is to reduce the cost of testing the loop variables on each iteration and the associated control flow.  That's all the overhead there is.

Generally when you unroll a loop you take several iterations and put it directly into the body.  Hopefully you can do this easily because you know (for instance) that the number of iterations is always a multiple of 10, or something like that  (there's of course other ways too).

OK, so far all is goodness and joy.  Here comes the but.

In the world of modern processors, bigger code is often slower code.  At some point the savings that you gained by having less control flow are not sufficient to overcome the costs of extra cache misses due to having to load in more code for the bigger (unrolled) algorithm.

So, the sweet spot is in between, you need to unroll enough to reduce the cost (remember if you only unroll 10 iterations that's still at 90%
reduction in control flow cost, you'll have to unroll 100 iterations to get to 99% and then 1000 to get to 99.9% -- diminishing returns if ever I  saw them) however you mustn't bloat the code to the point where you're taking an undue number of cache misses in the course of running the program.  I find it unlikely that more than 10 unrolls would ever really be worth it, but of course you'd have to measure to be sure for your case.