Tail call performance on x86

MSIL includes the tail. prefix, which can be applied to the call instructions (call, callvirt, calli), and also a specialized jmp instruction. The tail. prefix is a hint that the current method's stack frame should be popped off the stack before control is transferred to the target of the call. This matters for functional languages like Scheme, which rely heavily on recursion; without tail calls they would run into stack overflows on a routine basis.
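To make the hint concrete, here is a minimal IL sketch of a self-recursive method that requests a tail call. The method names and constants are made up for illustration, and the snippet assumes it is assembled with ilasm into its own module:

    .assembly extern mscorlib {}
    .assembly TailCallDemo {}

    // Counts down from n to 0, asking the JIT to reuse the frame on each call.
    .method static int32 CountDown(int32 n) cil managed
    {
        .maxstack 2
        ldarg.0
        brfalse.s  done            // if (n == 0) fall through and return 0
        ldarg.0
        ldc.i4.1
        sub                        // push n - 1 as the argument for the recursive call
        tail.                      // hint: pop CountDown's frame before making the call
        call       int32 CountDown(int32)
        ret                        // a tail. call must be followed immediately by ret
    done:
        ldc.i4.0
        ret
    }

    .method static void Main() cil managed
    {
        .entrypoint
        ldc.i4     100000          // a depth at which honoring the tail call matters
        call       int32 CountDown(int32)
        pop
        ret
    }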

The CLR implementation of tail calls on x86 is not as efficient as it could be. We have improved the performance of tail calls in Whidbey (V2): most tail calls are about 50% faster. However, there is still work to be done. This post explains why implementing tail calls is not as straightforward as it might seem.

The CLR has an accurate (precise) tracing garbage collector, as opposed to a conservative one like the Boehm collector. This puts some requirements on the native code compiled from the IL, and on the tracking information that goes with it (similar to pdata on RISC platforms). One of these requirements is that the CLR be able to determine exactly where a method's return address lives on the stack. This is needed so that the CLR can “hijack” the return address by temporarily overwriting it with the address of one of the hijacking functions in the CLR. That way, the CLR gets control when the method returns and can perform a garbage collection.

This requirement makes it difficult to move the return address around. Moving the return address is necessary when tail calling a method with a different number of stack-based arguments, because the return address is pushed after the stack arguments. The following diagram shows how the stack looks at various points when a method foo is called normally, and it then does a tail call to bar, which expects a different number of arguments on the stack. Note that foo needs to move the return address before it makes the tail call.

  As soon as foo          While foo                When foo is about to
    is called             is executing             tail call to bar

                        +--------------+
                        |              |
                        |              |
                        | foo's frame  |
                        |              |
                        |              |          +--------------+
                        |              |          |return address|
+--------------+        +--------------+          +--------------+
|return address|        |return address|          |              |
+--------------+        +--------------+          |              |
|              |        |              |          |              |
|  stack args  |        |  stack args  |          |  stack args  |
|    of foo    |        |    of foo    |          |    of bar    |
|              |        |              |          |              |
+--------------+        +--------------+          +--------------+
|              |        |              |          |              |
|              |        |              |          |              |
|   Frame of   |        |   Frame of   |          |   Frame of   |
| foo's caller |        | foo's caller |          | foo's caller |
|              |        |              |          |              |
|              |        |              |          |              |
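As a hypothetical IL sketch of the situation in the diagram, foo below takes one argument and tail calls a bar that takes four. Exactly which of bar's arguments land in registers versus on the stack is a calling-convention detail of the x86 JIT, but with enough arguments bar needs stack slots that foo's caller never reserved, so the return address has to be slid to make room:

    // bar needs more argument space than foo was called with.
    .method static int32 bar(int32 a, int32 b, int32 c, int32 d) cil managed
    {
        .maxstack 2
        ldarg.0
        ldarg.3
        add
        ret
    }

    .method static int32 foo(int32 n) cil managed
    {
        .maxstack 4
        ldarg.0
        ldc.i4.1
        ldc.i4.2
        ldc.i4.3                   // push all four arguments for bar
        tail.                      // foo's frame (and return address) must be rearranged
        call       int32 bar(int32, int32, int32, int32)
        ret
    }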

Furthermore, we tend to be really careful when changing the hijacking code, as any bugs introduced there are not easily caught; instead they show up as GC holes (corrupted GC references). Most programs will seem to work fine, but if a GC is triggered at just the right spot, some poor object reference gets corrupted.

Another (secondary) concern is GC starvation. If two methods recursively tail call each other, they form an infinite loop, and since neither method ever returns, the return-address hijacking mechanism never gets a chance to kick in. This reduces the opportunities for the GC to suspend the thread and do a collection, which can cause other threads to block longer while they wait for the GC to free up heap memory.
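A sketch of that shape, with made-up method names and a counter added so the example terminates (a real starvation case would simply loop forever): ping and pong tail call each other, so while the loop runs neither ever executes a ret where a hijacked return address could take effect:

    // ping and pong tail call each other; while the loop is running, neither
    // method returns, so return-address hijacking never gets to run.
    .method static int32 ping(int32 n) cil managed
    {
        .maxstack 2
        ldarg.0
        brfalse.s  quit
        ldarg.0
        ldc.i4.1
        sub
        tail.
        call       int32 pong(int32)
        ret
    quit:
        ldc.i4.0
        ret
    }

    .method static int32 pong(int32 n) cil managed
    {
        .maxstack 2
        ldarg.0
        tail.
        call       int32 ping(int32)
        ret
    }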

We do know how to make this even faster, and would have loved to fix it. However, because of schedule pressure and the associated risk, further improvements will have to wait until after Whidbey (V2.0).

FWIW, the jmp IL instruction is faster, since the return address does not have to be shuffled in that case. Also, tail calls will be much faster on 64-bit platforms, since the arguments of most methods fit in registers (and so the return address does not need to move around). In fact, on 64-bit platforms the CLR tries to do tail calls even when the tail. prefix is not specified. This is because a platform-independent IL program gets the same amount of stack space reserved for it by the OS on either platform, even though the 64-bit process consumes stack at a faster rate. Hence, we try to reduce stack usage by doing tail calls.
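For completeness, here is a small sketch of the jmp form, again with made-up method names (it reuses the CountDown method from the earlier sketch). jmp hands the current method's arguments through unchanged, so the target must have exactly the same signature and the evaluation stack must be empty:

    // Forward has the same signature as CountDown, so its arguments pass
    // straight through; no argument or return-address shuffling is needed.
    .method static int32 Forward(int32 n) cil managed
    {
        jmp        int32 CountDown(int32)
    }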