Considering the performance implications of converting recursion to an explicit stack

Suppose you have a recursive algorithm, such as in-order tree walking.

void add_to_each_node_recursive(Node* root, int extra)
{
    if (!root) return;
    root->value += extra;
    add_to_each_node_recursive(root->left, extra);
    add_to_each_node_recursive(root->right, extra);
}

Suppose further that the stack is large enough and the data set small enough that stack overflow is not a problem. Will converting the algorithm to a non-recursive algorithm with an explicit stack provide any performance improvement?

For expository convenience, I will call this second version "iterative".

void add_to_each_node_iterative(Node* root, int extra)
{
    std::stack<Node*> stack;
    for (;;) {
        while (root) {
            root->value += extra;
            if (root->right) stack.push(root->right);
            root = root->left;
        }
        if (stack.empty()) return;
        root = stack.top();
        stack.pop();
    }
}

Determining which version will have better performance is complicated.

The recursive version will take advantage of a hardware stack, assuming the processor has one. So pushing and popping activation frames may be faster than manually manipulating a stack data structure.

On the other hand, you have function call overhead and ABI constraints. The compiler has to build a proper ABI-compliant stack frame, and it will almost certainly save and restore additional registers as part of that stack frame. Parameters that are constant through the entire operation, such as the extra parameter, will still need to be passed to each recursive call. The ABI may require things like home space and stack pointer alignment, which will add to the size of the frame.

The iterative version reuses the same extra variable, so it doesn't have to push it onto the explicit stack. All that goes onto the explicit stack is a single pointer, namely the point at which to resume the tree walk after exploring the left branch.

Lower memory requirements point toward the iterative version, because a smaller stack footprint means that more useful data will fit inside the L1 cache.

The recursive version uses the hardware procedure call mechanism, which means that it gets to take advantage of any return address predictor. However, those predictors tend to max out at around 16 to 32 calls, so if your structure is deeper than that, you are going to overflow the return address predictor.

The iterative version doesn't use the hardware procedure call mechanism at all. Instead, it relies on the branch predictor. So now it depends on how you structured your code: Did you set things up so that the branches usually go the same way? For example, maybe the branch is always taken, except for the final iteration.

The recursive version may consume a lot of stack, although we're assuming that the stack is large enough to accommodate the worst case. As the function runs, the stack grows to accommodate the recursion, and when the function returns, that stack growth is not rolled back. The extra stack has been allocated, and it won't go away until the thread exits. Sure, it'll get paged out eventually, and will even be reused if you call another deeply recursive function. But that reuse is going to have to page the stack back in.

The iterative version creates a temporary stack, uses it for the purpose of the algorithm, and then destroys it. So there is no lingering effect of the iterative version.

Okay, well, you did have to allocate the memory from the heap, and the heap may have had to expand, but when you destroy the stack, that memory becomes available to the rest of the program, so it's not lost forever.

But that introduces a sticking point: There's a lot of code hiding inside the stack.push and stack.pop operations. You have the potential for heap allocations and deallocations, which means incursions into your precious branch prediction and cache (both code and data). It also reduces optimization opportunities for the compiler because a call into the heap is going to destroy all the volatile registers.
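One way to keep the allocator out of the steady state (my sketch, not from the article) is to use a std::vector with reserved capacity as the explicit stack. std::stack has no reserve, but vector's push_back and pop_back serve the same role:

```cpp
#include <vector>

struct Node { int value; Node* left; Node* right; };

void add_to_each_node_reserved(Node* root, int extra)
{
    std::vector<Node*> stack;
    stack.reserve(64); // assumed depth bound; the vector still grows if exceeded
    for (;;) {
        while (root) {
            root->value += extra;
            if (root->right) stack.push_back(root->right);
            root = root->left;
        }
        if (stack.empty()) return;
        root = stack.back();
        stack.pop_back();
    }
}
```

After the initial reserve, a push or pop is just a pointer store and a size adjustment, so the heap (and its effect on the caches and branch predictor) stays out of the picture for trees up to the reserved depth.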

So which way do the scales tip?

As with most issues of performance, the only way to know is to measure. There's no single answer. You'll have to run tests on your specific workload to determine which works better for you.
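To put numbers behind "measure", a minimal timing harness might look like this (a sketch of mine; building the tree and supplying the two functions under test are up to you):

```cpp
#include <chrono>

// Run a callable many times and report the average cost per call in
// nanoseconds. Crude, but enough to compare the two tree walkers on
// the same tree. (Real measurements should also worry about warm-up,
// inlining, and the optimizer discarding unobserved work.)
template <typename F>
double average_ns_per_call(F f, int iterations)
{
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count()
         / iterations;
}
```

Usage would be something like `average_ns_per_call([&]{ add_to_each_node_iterative(root, 1); }, 10000)` for each version, on the same tree.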

Bonus chatter: Don't forget the engineering cost. The recursive version is usually easier to write, understand, and debug.

Comments (18)
  1. ZLB says:

    There is another trick that works well if you frequently perform operations on all nodes of your tree:

    Also store a list of pointers to your tree nodes. When you need to perform an operation on all nodes, you can just iterate the node list rather than walking the tree. (This also assumes the order of node visits doesn’t matter and you can cope with the double bookkeeping cost.)
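A sketch of ZLB's idea (mine, with an assumed Node layout; maintaining the side list on insert and delete is left out):

```cpp
#include <vector>

struct Node { int value; Node* left; Node* right; };

// Visit every node via a flat side list instead of walking the tree.
// No stack, no recursion, and the list accesses are sequential.
void add_to_each_node_flat(const std::vector<Node*>& all_nodes, int extra)
{
    for (Node* node : all_nodes)
        node->value += extra;
}
```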

    1. Joshua says:

      Well I suppose you could do it old school like the old school.

      One function. Explicit stack operations. Only the stack in question is the runtime stack. alloca() is your friend. If you’re coding in assembly you can do it even better yet. It’s one of the last places left where the compiler can’t optimize anywhere near as well as a human. In order to make exception unwind work you will need a frame pointer, and therefore sacrifice a register to it. Totally worth it if you’re not a leaf function.

    2. Dave Bacher says:

      If you’re talking performance – separate the index out into an array. Your in order traversal becomes a for loop, your binary search isn’t swapping in entire class instances in order to access a single variable, and the binary search does not need a stack at all and needs only three index variables.
      As a rule, an object “belongs to” an index, or the index “contains” an object – however you want to think about it – and so having the object “have” an index only makes conceptual sense if the object is a container and the index is its children. Separating off the index creates a reusable brick you can pull in to containers that need it. That brick can be easily unit tested and, depending on how clever you are with the design, ends up paging in only the data that you need: the index for the search case, and only the list of objects for the in-order traversal case.
      Usually there are lower storage requirements and higher performance from separating off the indexes. Which is, of course, why we’re teaching kids to keep 20 separate indexes inline in classes – same reason we can’t spend ten minutes teaching useful debugging in school.

  2. Antonio Rodríguez says:

    The “Bonus chatter” gives us the key. Unless you are optimizing a time-critical section of an inner loop, or dealing with a task that takes more than a few seconds on current hardware, gaining 10% in performance doesn’t justify making the code more complicated. It not only makes the code harder to read and understand; it also makes it easier for bugs to slip in and harder to debug them. It doesn’t matter how fast a program is when it’s buggy.

  3. Nathan says:

    This reminds me of a change I made to an open-source project which used tree walking to generate an SQL query.
    The problem was that it included a long list of binary operators (thousands of arguments) and was unbalanced (basically only the right subtree was filled), resulting in a stack overflow.

    Rewriting the AST constructor to generate a balanced tree solved the issue (log₂(n) depth vs. n depth).

  4. Timothy Byrd says:

    Nitpick, but isn’t your recursive example pre-order, not in-order?
    I thought in-order would look like this:

    void add_to_each_node_recursive(Node* root, int extra)
    {
        if (!root) return;
        add_to_each_node_recursive(root->left, extra);
        root->value += extra;
        add_to_each_node_recursive(root->right, extra);
    }

    1. xtal256 says:

      No, that’s doing the left node first.

      1. Timothy Byrd says:

        Isn’t that what an in-order traversal is? Processing the left sub tree, the current node and then the right sub tree?


  5. David says:

    Personally, I find recursive code is usually harder to reason about, beyond simple examples such as tree traversal. Especially the condition under which the recursion stops.

  6. Jon Scharff says:

    Then of course there’s the hybrid version which recurses on one child and iterates on the other. I’m guessing that this is only explicitly implementing a tail-call recursion optimization, so it would likely compile to machine code very similar to add_to_each_node_recursive.

    void add_to_each_node_hybrid(Node* root, int extra)
    {
        while (root) {
            root->value += extra;
            add_to_each_node_hybrid(root->right, extra);
            root = root->left;
        }
    }

  7. Martin Ba. _ says:

    There’s another angle to engineering cost: The recursive version is easier to write and understand (and yes, to debug), but there lurks a potential stack-overflow UB depending on the runtime data set. So there is another tradeoff to consider that has nothing to do with performance: easy to write and understand, with potential UB; or waterproof (at least with respect to UB), but harder to write and maintain.

    Personally, I’d go with option #1 for low risk code, and if I want waterproof, I’d factor the ‘iterative’ tree walking itself out into a generic function that I can test in isolation.

  8. alegr1 says:

    Recently I wrote a C++ template to implement AVL trees with node insertion and deletion support, and with iterator access semantics. To support iterating through the nodes, I just added a back pointer to the parent node. Thus, no concerns about recursion depth.

  9. osxpert says:

    “recursive version is usually easier to write, understand, and debug”
    I disagree on the last. A recursive call with a deep stack is almost impossible to debug.

    1. BC says:

      -“Recursive call with deep stack is almost impossible to debug”
      Sure, if you wait until the stack is blown! If the reason the stack is that deep is because of a bug, the bug probably triggered long, long before the stack was blown. I think I would enable the Guard Page event in WinDbg, and it would break whenever the stack needs to grow – a handy “auto-break” feature for when recursion is headed to infinity.

  10. Pseudonym says:

    The recursive version is usually easier to write and understand, but the explicit stack version can be much easier to debug. I’ve wasted hours digging through stack traces that are 200 levels deep.

  11. squelart says:

    I’m late to the party, but another solution is for each node to store a pointer to its parent; an iterative algorithm then only needs two pointers to walk the tree. Harder to write correctly, of course, but with a bit of templating it only needs to be written and tested once. And it’s much more code with a lot of branching, so it could still end up being slower; always measure!
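A sketch of how such a parent-pointer walk might go (mine, not squelart’s code; the Node layout with a parent field is an assumption):

```cpp
struct Node { int value; Node* left; Node* right; Node* parent; };

// Walk the whole tree with O(1) extra state by remembering which node
// we just came from: the parent (going down), the left child (coming
// back up after the left subtree), or the right child (done with both).
void add_to_each_node_parent(Node* root, int extra)
{
    Node* stop = root ? root->parent : nullptr; // one above the subtree
    Node* node = root;
    Node* prev = stop;
    while (node != stop) {
        if (prev == node->parent) {         // arrived from above: visit
            node->value += extra;
            prev = node;
            if (node->left)       node = node->left;
            else if (node->right) node = node->right;
            else                  node = node->parent;
        } else if (prev == node->left) {    // back from the left subtree
            prev = node;
            node = node->right ? node->right : node->parent;
        } else {                            // back from the right subtree
            prev = node;
            node = node->parent;
        }
    }
}
```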

    1. Tree walking was just an example. Let’s assume that the recursion is inherent in the problem. The issue is comparing the pure recursive version with the recursion-converted-to-iteration version.

      1. squelart says:

        I guess I couldn’t see the forest for the … tree! *badum tish*
        In the end, there may be more than one way to convert to non-recursive, with different pros and cons, all to be evaluated along different axes (which your article shows is not that obvious) to decide which one is preferred. Much fun!

Comments are closed.
