Enter, Leave, Tailcall Hooks Part 2: Tall tales of tail calls


For most people the idea of entering or returning from a function seems straightforward. Your profiler’s Enter hook is called at the beginning of a function, and its Leave hook is called just before the function returns. But the idea of a tail call and exactly what that means for the Profiling API is less straightforward.


In Part 1 I talked about the basics of the Enter / Leave / Tailcall hooks and generally how they work. You may want to review that post first if you haven’t seen it yet. This post builds on that one by talking exclusively about the Tailcall hook, how it works, and what profilers should do inside their Tailcall hooks.


Tail calling in general


Tail calling is a compiler optimization that saves execution of instructions and saves reads and writes of stack memory. When the last thing a function does is call another function (and other conditions are favorable), the compiler may consider implementing that call as a tail call, instead of a regular call.

    static public void Main()
{
Helper();
}

static public void Helper()
{
One();
Two();
Three();
}

static public void Three()
{

}


In the code above, the compiler may consider implementing the call from Helper() to Three() as a tail call. What does that mean, and why would that optimize anything? Well, imagine this is compiled without a tail call optimization. By the time Three() is called, the stack looks like this (my stacks grow UP):

Three
Helper
Main

Each of those functions causes a separate frame to be allocated on the stack. All the usual contents of a frame, including locals, parameters, the return address, saved registers, etc., get stored in each of those frames. And when each of those functions returns, the return address is read from the frame, and the stack pointer is adjusted to collapse the frame of the returning function. That’s just the usual overhead associated with making a function call.


Now, if the call from Helper() to Three() were implemented as a tail call, we’d avoid that overhead, and Three() would just “reuse” the stack frame that had been set up for Helper(). While Three() is executing, the call stack would look like this:

Three
Main

And when Three() returns, it returns directly to Main() without popping back through Helper() first.


Folks who live in functional programming languages like Scheme use recursion at least as often as C++ or C# folks use while and for loops. Such functional programming languages depend on tail call optimizations (in particular tail recursion) to avoid overflowing the stack. While imperative languages like C++ or C# don’t have such a vital need for tail call optimizations, it’s still pretty handy as it reduces the number of instructions executed and the writes to the stack.  Also, it’s worth noting that the amount of stack space used for a single frame can be more than you’d expect.  For example, in CLR x64, each regular call (without the tail call optimization) uses a minimum of 48 bytes of stack space, even if it takes no arguments, has no locals, and returns nothing.  So for small functions, the tail call optimization can provide a significant overhead reduction in terms of stack space.


The CLR and tail calls


When you’re dealing with languages managed by the CLR, there are two kinds of compilers in play. There’s the compiler that goes from your language’s source code down to IL (C# developers know this as csc.exe), and then there’s the compiler that goes from IL to native code (the JIT 32/64 bit compilers that are invoked at run time or NGEN time). Both the source->IL and IL->native compilers understand the tail call optimization. But the IL->native compiler–which I’ll just refer to as JIT–has the final say on whether the tail call optimization will ultimately be used. The source->IL compiler can help to generate IL that is conducive to making tail calls, including the use of the “tail.” IL prefix (more on that later). In this way, the source->IL compiler can structure the IL it generates to persuade the JIT into making a tail call. But the JIT always has the option to do whatever it wants.


When does the JIT make tail calls?


I asked Fei Chen and Grant Richins, neighbors down the hall from me who happen to work on the JIT, under what conditions the various JITs will employ the tail call optimization. The full answer is rather detailed. The quick summary is that the JITs try to use the tail call optimization whenever they can, but there are lots of reasons why the tail call optimization can’t be used. Some reasons why tail calling is a non-option:



  • Caller doesn’t return immediately after the call (duh :-))

  • Stack arguments between caller and callee are incompatible in a way that would require shifting things around in the caller’s frame before the callee could execute

  • Caller and callee return different types

  • We inline the call instead (inlining is way better than tail calling, and opens the door to many more optimizations)

  • Security gets in the way

  • The debugger / profiler turned off JIT optimizations

Here are their full, detailed answers.


Note that how the JIT decides whether to use the tail calling optimization is an implementation detail that is prone to change at whim.  You must not take dependencies on this behavior. Use this information for your own personal entertainment only.


Your Profiler’s Tailcall hook


I’m assuming you’ve already read through Part 1 and are familiar with how your profiler sets up its Enter/Leave/Tailcall hooks, so I’m not repeating any of those details here. I will focus on what kind of code you will typically want to place inside your Tailcall hook:


typedef void FunctionTailcall2(


                FunctionID funcId,


                UINT_PTR clientData,


                COR_PRF_FRAME_INFO func);


Tip: More than once I’ve seen profiler writers make the following mistake. They will take their naked assembly-language wrapper for their Enter2 and Leave2 hooks, and paste it again to use as the Tailcall2 assembly-language wrapper. The problem is they forget that the Tailcall2 hook takes a different number of parameters than the Enter2 / Leave2 hooks (or, more to the point, a different number of bytes is passed on the stack to invoke the Tailcall2 hook). So, they’ll take the “ret 16” at the end of their Enter2/Leave2 hook wrappers and stick that into their Tailcall2 hook wrapper, forgetting to change it to a “ret 12“. Don’t make the same mistake!

It’s worth noting what these parameters mean. With the Enter and Leave hooks it’s pretty obvious that the parameters your hook is given (e.g., funcId) apply to the function being Entered or Left. But what about the Tailcall hook? Do the Tailcall hook’s parameters describe the caller (function making the tail call) or the callee (function being tail called into)?


Answer: the parameters refer to tail caller.


The way I remember it is that the Tailcall hook is like an “Alternative Leave” hook. A function ends either by returning (in which case the CLR invokes your Leave hook) or a function ends by tail calling out to somewhere else (in which case the CLR invokes your Tailcall hook). In either case (Leave hook or Tailcall hook) the hook’s parameters tell you about the function that’s ending. If a function happens to end by making a tail call, your profiler is not told the target of that tail call. (The astute reader will realize that actually your profiler is told what the target of the tail call is–you need only wait until your Enter hook is called next, and that function will be the tail call target, or “tail callee”. (Well, actually, this is true most of the time, but not all! (More on that later, but consider this confusing, nested series of afterthoughts a hint to a question I pose further down in this post.)))


Did you just count the number of closing parentheses to ensure I got it right? If so, I’d like to make fun of you but I won’t–I’d have counted the parentheses, too. My house is glass.


Ok, enough dilly-dallying. What should your profiler do in its Tailcall hook? Two of the more common reasons profilers use Enter/Leave/Tailcall hooks in the first place is to keep shadow stacks or to maintain call traces (sometimes with timing information).


Shadow stacks


The CLRProfiler is a great example of using Enter/Leave/Tailcall hooks to maintain shadow stacks. A shadow stack is your profiler’s own copy of the current stack of function calls on a given thread at any given time. Upon Enter of a function, you push that FunctionID (and whatever other info interests you, such as arguments) onto your data structure that represents that thread’s stack. Upon Leave of a function, you pop that FunctionID. This gives you a live list of managed calls in play on the thread. The CLRProfiler uses shadow stacks so that whenever the managed app being profiled chooses to allocate a new object, the CLRProfiler can know the managed call stack that led to the allocation. (Note that an alternate way of accomplishing this would be to call DoStackSnapshot at every allocation point instead of maintaining a shadow stack. Since objects are allocated so frequently, however, you’d end up calling DoStackSnapshot extremely frequently and will often see worse performance than if you had been maintaining shadow stacks in the first place.)


 


OK, so when your profiler maintains a shadow stack, it’s clear what your profiler should do on Enter or Leave, but what should it do on Tailcall? There are a couple ways one could imagine answering that question, but only one of them will work! Taking the example from the top of this post, imagine the stack looks like this:

Helper
Main

and Helper is about to make a tail call into Three().  What should your profiler do?


Method 1: On tailcall, pop the last FunctionID.  (In other words, treat Tailcall just like Leave.)


So, in this example, when Helper() calls Three(), we’d pop Helper().  As soon as Three() is called, our profiler would receive an Enter for Three(), and our shadow stack would look like this:

Three
Main

This approach mirrors reality, because this is what the actual physical stack will look like.  Indeed, if one attaches a debugger to a live process, and breaks in while the process is inside a tail call, the debugger will show a call stack just like this, where you see the tail callee, but not the tail caller.  However, it might be a little confusing to a user of your profiler who looks at his source code and sees that Helper() (not Main()) calls Three().  He may have no idea that when Helper() called Three(), the JIT chose to turn that into a tail call. In fact, your user may not even know what a tail call is. You might therefore be tempted to try this instead:


Method 2: On tailcall, “mark” the FunctionID at the top of your stack as needing a “deferred pop” when its callee is popped, but don’t pop yet.


With this strategy, for the duration of the call to Three(), the shadow stack will look like this:

Three
Helper
(marked for deferred pop)
Main

which some might consider more user-friendly. And as soon as Three() returns, your profiler will sneakily do a double-pop leaving just this:

Main

So which should your profiler use: Method 1 or Method 2? Before I answer, take some time to think about this, invoking that hint I cryptically placed above in nested parentheses. And no, the fact that the parentheses were nested is not part of the actual hint.

The answer: Method 1. In principle, either method should be fine. However, the behavior of the CLR under certain circumstances will break Method 2. Those “certain circumstances” are what I alluded to when I mentioned “this is true most of the time, but not all” above.  These mysterious “certain circumstances” involve a managed function tail calling into a native helper function inside the runtime. Here’s an example:
static public void Main()
{
Thread.Sleep(44);
Helper();
}

It just so happens that the implementation of Thread.Sleep makes a call into a native helper function in the bowels of the runtime. And that call happens to be the last thing Thread.Sleep does. So the JIT may helpfully optimize that call into a tail call. Here are the hook calls your profiler will see in this case:

(1) Enter (into Main)
(2) Enter (into Thread.Sleep)
(3) Tailcall (from Thread.Sleep)
(4) Enter (into Helper)
(5) Leave (from Helper)
(6) Leave (from Main)

Note that after you get a Tailcall telling you that Thread.Sleep is done, (in (3)), the very next Enter you get (in (4)) is NOT the Enter for the function being tail called. This is because the CLR only provides Enter/Leave/Tailcall hooks for managed functions, and the very next managed function being entered is Helper().  So, how will Method 1 and Method 2 fare in this example?


Method 1: Shadow stack works


By popping on every Tailcall hook, your shadow stack stays up to date.


Method 2: Shadow stack fails


At stage (4), the shadow stack looks like this:


Helper
Thread.Sleep
(marked for “deferred pop”)
Main


If you think it might be complicated to explain tail calls to your users so they can understand the Method 1 form of shadow stack presentation, just try explaining why it makes sense to present to them that Thread.Sleep() is calling Helper()!


And of course, this can get arbitrarily nasty:

static public void Main()
{
Thread.Sleep(44);
Thread.Sleep(44);
Thread.Sleep(44);
Thread.Sleep(44);
Helper();
}

would yield:


Helper
Thread.Sleep
(marked for “deferred pop”)
Thread.Sleep (marked for “deferred pop”)
Thread.Sleep (marked for “deferred pop”)
Thread.Sleep (marked for “deferred pop”)
Main


And things get more complicated if you start to think about when you actually pop a frame marked for “deferred pop”. In all the above examples, you would do so as soon as the frame above it gets popped. So once Helper() is popped (due to Leave()), you’d cascade-pop all the Thread.Sleeps. But what if there is no frame above the frames marked for “deferred pop”?  To wit:

static public void Main()
{
Helper()
}

static public void Helper()
{
Thread.Sleep(44);
Thread.Sleep(44);
Thread.Sleep(44);
Thread.Sleep(44);
}


would yield:


Thread.Sleep (marked for “deferred pop”)
Thread.Sleep (marked for “deferred pop”)
Thread.Sleep (marked for “deferred pop”)
Thread.Sleep (marked for “deferred pop”)
Helper
Main


until you get a Leave hook for Helper(). At this point, you need to pop Helper() from your shadow stack, but he’s not at the top– he’s buried under all your “deferred pop” frames. So your profiler would need to perform the deferred pops if a frame above OR below them gets popped. Hopefully, the yuckiness of this implementation will scare you straight.  But the confusion of presenting crazy stacks to the user is the real reason to abandon Method 2 and go with Method 1.


Call tracing


The important lesson to learn from the above section is that sometimes a Tailcall hook will match up with the next Enter hook (i.e., the tail call you’re notified of in your Tailcall hook will have as its callee the very function you’re notified of in the next Enter hook), and sometimes the Tailcall hook will NOT match with the next Enter hook (in particular when the Tailcall hook refers to a tail call into a native helper in the runtime). And the sad fact is that the Enter/Leave/Tailcall hook design does not currently allow you to predict whether a Tailcall will match the next Enter.


As an illustration, consider two simple tail call examples:


Matching Example

    static public void Main()
{
One();
…(other code here)…
}

static public void One()
{
Two();
}


Non-matching Example

    static public void Main()
{
Thread.Sleep(44);
Two();
}


In either case, your profiler will see the following hook calls

(1) Enter (into Main)
(2) Enter (into One / Thread.Sleep)
(3) Tailcall (from One / Thread.Sleep)
(4) Enter (into Two)


In the first example, (3) and (4) match (i.e., the tail call really does call into Two()). But in the second example, they do not (the tail call does NOT call into Two()).


Since you don’t know when Tailcall will match the next Enter, your implementation of call tracing, like shadow stack maintenance, must treat a Tailcall hook just like a Leave. If you’re logging when functions begin and end, potentially with the amount of time spent inside the function, then your Tailcall hook should basically do the same thing as your Leave hook. A call to your Tailcall hook indicates that the specified function is over and done with, just like a call to your Leave hook.


As with shadow stacks, this will sometimes lead to call graphs that could be confusing. “Matching Example” had One tail call Two, but your graph will look like this:

Main
|
|– One
|– Two

But at least this effect is explainable to your users, and is self-correcting after the tail call is complete, while yielding graphs that are consistent with your timing measurements. If instead you try to outsmart this situation and assume Tailcalls match the following Enter, the errors can snowball into incomprehensible graphs (see the nasty examples from the shadow stack section above).


How often does this happen?


So when does a managed function in the .NET Framework tail call into a native helper function inside the CLR? In the grand scheme of things, not a lot. But it’s a pretty random and fragile list that depends on which JIT is in use (x86, x64, ia64), and can easily change as parts of the runtime are rev’d, or even as JIT compilation flags are modified by debuggers, profilers, and other environmental factors while a process is active. So you should not try to guess this list and make dependencies on it.


Can’t I just turn tail calling off?!


If all this confusion is getting you down, you might be tempted to just avoid the problem in the first place. And yes, there is a way to do so, but I wouldn’t recommend it in general. If you call SetEventMask, specifying COR_PRF_DISABLE_OPTIMIZATIONS inside your mask, that will tell the JIT to turn off the tail call optimization. But the JIT will also turn off ALL optimizations. Profilers that shouldn’t perturb the behavior of the app should definitely not do this, as the code generation will be very different.


Watching CLR tail calls in action


If you’re writing a profiler with Enter/Leave/Tailcall hooks, you’ll want to make sure you exercise all your hooks so they’re properly tested. It’s easy enough to make sure your Enter/Leave hooks are called–just make sure the test app your profiler runs against has a Main()! But how to make sure your Tailcall hook is called?


The surest way is to have a simple managed app that includes an obvious tail call candidate, and make sure the “tail.” IL prefix is in place. You can use ilasm / ildasm to help build such an assembly. Here’s an example I tried on x86 using C#.


Start with some simple code that makes a call that should easily be optimized into a tail call:

using System;
class Class1
{
static int Main(string[] args)
{
return Helper(4);
}

static int Helper(int i)
{
Random r = new Random();
i = (i / 1000) + r.Next();
i = (i / 1000) + r.Next();
return MakeThisATailcall(i);
}

static int MakeThisATailcall(int i)
{
Random r = new Random();
i = (i / 1000) + r.Next();
i = (i / 1000) + r.Next();
return i;
}
}


You’ll notice there’s some extra gunk, like calls to Random.Next(), to make the functions big enough that the JIT won’t inline them. There are other ways to avoid inlining (including from the profiling API itself), but padding your test functions is one of the easier ways to get started without impacting the code generation of the entire process. Now, compile that C# code into an IL assembly:

    csc /o+ Class1.cs

(If you’re wondering why I specified /o+, I’ve found that if I don’t, then suboptimal IL gets generated, and some extraneous instructions appear inside Helper between the call to MakeThisATailcall and the return from Helper. Those extra instructions would prevent the JIT from making a tail call.)


Run ildasm to get at the generated IL

    ildasm Class1.exe

Inside ildasm, use File.Dump to generate a text file that contains a textual representation of the IL from Class1.exe.  Call it Class1WithTail.il.  Open up that file and add the tail. prefix just before the call you want optimized into a tail call (see highlighted yellow for changes):

  .method private hidebysig static int32
Helper(int32 i) cil managed
{
// Code size 45 (0x2d)
// Code size 46 (0x2e)
.maxstack 2
.locals init (class [mscorlib]System.Random V_0)
IL_0000: newobj instance void [mscorlib]System.Random::.ctor()
IL_0005: stloc.0
IL_0006: ldarg.0
IL_0007: ldc.i4 0x3e8
IL_000c: div
IL_000d: ldloc.0
IL_000e: callvirt instance int32 [mscorlib]System.Random::Next()
IL_0013: add
IL_0014: starg.s i
IL_0016: ldarg.0
IL_0017: ldc.i4 0x3e8
IL_001c: div
IL_001d: ldloc.0
IL_001e: callvirt instance int32 [mscorlib]System.Random::Next()
IL_0023: add
IL_0024: starg.s i
IL_0026: ldarg.0
IL_0027: call int32 Class1::MakeThisATailcall(int32)
IL_002c: ret
IL_0027: tail.
IL_0028: call int32 Class1::MakeThisATailcall(int32)
IL_002d: ret
} // end of method Class1::Helper

Now you can use ilasm to recompile your modified textual IL back into an executable assembly.

    ilasm /debug=opt Class1WithTail.il

You now have Class1WithTail.exe that you can run!  Hook up your profiler and step through your Tailcall hook.


You Can Wake Up Now


If you didn’t learn anything, I hope you at least got some refreshing sleep thanks to this post. Here’s a quick recap of what I wrote while you were napping:



  • If the last thing a function does is call another function, that call may be optimized into a simple jump (i.e., “tail call”).  Tail calling is an optimization to save the time of stack manipulation and the space of generating an extra call frame.

  • In the CLR, the JIT has the final say on when it employs the tail call optimization. The JIT does this whenever it can, except for a huge list of exceptions. Note that the x86, x64, and ia64 JITs are all different, and you’ll see different behavior on when they’ll use the tail call optimizations.

  • Since some managed functions may tail call into native helper functions inside the CLR (for which you won’t get an Enter hook notification), your Tailcall hook should treat the tail call as if it were a Leave, and not depend on the next Enter hook correlating to the target of the last tail call.  With shadow stacks, for example, this means you should simply pop the calling function off your shadow stack in your Tailcall hook.

  • Since tail calls can be elusive to find in practice, it’s well worth your while to use ildasm/ilasm to manufacture explicit tail calls so you can step through your Tailcall hook and test its logic.

David has been a developer at Microsoft for over 70 years (allowing for his upcoming time-displacement correction). He joined Microsoft in 2079, first starting in the experimental time-travel group. His current assignment is to apply his knowledge of the future to eliminate the “Wait for V3” effect customers commonly experience in his source universe. By using Retroactive Hindsight-ellisenseTM his goal is to “get it right the first time, this time” in a variety of product groups.


Comments (6)

  1. (off topic, sorry) Given the huge number of conditions that prevent tail calls, it’s rather absurd that there is *no* supported way of preventing tail calls. When I originally discovered that the x64 JIT tail calls even when a method has MethodImplOptions.NoInlining and filed a bug, the bug was closed as "by design". Well let me be honest the design sucks!

    For some languages, reliable stack walking is a requirement. Tail call optimizations are nice, but there MUST be a way to prevent them.

    Oh, and the "recommended" workaround of doing a volatile memory read before the return is obviously nonsense. I might as well write an interpreter in JavaScript.

  2. davbr says:

    Hi, Jeroen, thanks for the feedback!  Judging by your comment, I’m not sure if you’re coming from the viewpoint of trying to write a profiler that requires reliable call stacks of the code it’s profiling, or if you’re trying to implement language features that intrinsically require the ability to perform stack walks.  Would you clarify your context?

    If you’re writing a profiler, I agree that the CLR profiling API could do a better job of making stack reporting better.  Being able to turn off just those optimizations that affect the stack would indeed be useful (including inlining, tail calling, and any new JIT optimizations we might dream up that could obfuscate the stack).  

    If you’re implementing language features, however, I’d be curious to understand better your underlying need to be able to walk the stack.  I discussed this situation with Grant.  He feels (and I agree) that while having full control over all the JIT optimizations may be one way to solve your problem, it’s probably better to attack that problem by serving the underlying need.  For example, perhaps there’s a way we could allow the language->IL compiler to make explicit in the IL whatever it’s trying to accomplish by walking the stack.  By going that route, the CLR could still continue to do advanced optimizations while also preserving the behavior required by the compiler.  Tail calls and inlining are just 2 well known examples of optimizations that ‘mess up’ the call chain or stack walking.  In the future there might be others.  If the IL is explicit in what it really needs, then those future optimizations will not break proper code.  And the proper code may even benefit from the optimization.

    Finally, to address the workaround you mentioned, I agree that in spirit adding an extra volatile read before the call to prevent the JIT from making it a tail call does feel hacky.  It is a work around, after all :-).  But it is very lightweight.  The only reason a volatile read or write is any more expensive than a regular read or write is that it prevents re-ordering of reads or writes (volatile and non-volatile).  It’s a way of telling the compiler (and sometimes even the CPU) that despite what it (the compiler or CPU) thinks is going on, reordering of a specific read or write is not safe, so it must perform them in the original program order.  I’d doubt this is akin to writing an interpreter in JavaScript.  🙂  I do believe there are more direct ways we could go about implementing solutions to your problems, as I mentioned above.  But as a present-day work around for compiler writers, an extra volatile read seems reasonable.

  3. As I said, my comments are off topic and I again apologize for that. I’ve written a Java VM for .NET and for various reasons reliable stack walking is pretty much essential when you want to fully implement all the JVM semantics. I don’t think that providing any higher level IL semantics would be helpful to me (well, it would in theory be possible to model all the Java requirements, but that wouldn’t be practical).

    BTW, I think that I fully understand the effect of a volatile read, and I disagree that it’s reasonable, but with the "JavaScript interpreter" I was obviously exaggerating a little for effect 😉

    BTW2, here’s a quote from the ECMA CLI specification:

    >>>

    noinlining specifies that the body of this method should not be included into the code of any caller methods, by a  CIL-to-native-code compiler; it shall be kept as a separate routine.  

    [Rationale: specifying that a method not be inlined ensures that it remains ‘visible’ for debugging (e.g., displaying stack traces) and profiling.  It also provides a mechanism for the programmer to override the default

    heuristics a CIL-to-native-code compiler uses for inlining. end rationale]

    <<<

    Given the rationale (in particular the "displaying stack traces" part), I think it is reasonable to expect that NoInlining implies no-tail-call.

    Thanks.

  4. First a little background reading before going into tail call improvements in CLR 4 – David Broman did

  5. News says:

    First a little background reading before going into tail call improvements in CLR 4 – David Broman did