The PowerPC has a concept of a "glue routine": a little block of code that assists with control transfer, usually by allowing a caller in one module to call a function in another module. Two things make glue routines tricky: jumping to the final target, and juggling two tables of contents (the caller's and the callee's).
Registers r11 and r12 are available to glue routines as scratch registers. You can use them in your code, but be aware that they may be trashed by a glue routine, which means in practice that they are good only until the next taken jump instruction. (We saw earlier that r12 is used by prologues, but since prologues run at the start of a function, and you must have jumped there, prologues are welcome to use r12 as a scratch register because any valid caller must have assumed that r12 could have been trashed by a glue routine anyway.)
Let's take care of the easy case first: Suppose the routines share the same table of contents. This is usually the case if the caller and callee are in the same module. A glue routine may become necessary if a branch target ends up being too far away to be reached by the original branch, and the linker needs to insert a glue routine near the caller that in turn jumps to the callee. (On the Alpha AXP, this is called a trampoline.)
bl toofar_glue
...
toofar_glue:
lwz r11, n(r2) ; r11 = original jump target (toofar)
mtctr r11 ; ctr = original jump target (toofar)
bctr ; and jump to toofar
Exercise: We had two choices for the register to use for the indirect jump. We could have used ctr or lr. Why did we choose ctr?
Next is the hard part: A glue routine that needs to connect functions that may have different tables of contents. This sort of thing happens if you naïvely import a function.
bl toofar_glue
...
toofar_glue:
lwz r11, n(r2) ; r11 = function pointer
lwz r12, 0(r11) ; r12 = code pointer
stw r2, 4(r1) ; save caller's table of contents
mtctr r12 ; ctr = code for target
lwz r2, 4(r11) ; load callee's table of contents
bctr ; and jump to toofar
The inter-module glue function sets up both the code pointer and the table of contents for the destination function. But there's the question of what to do with the old table of contents. For now, we save it in one of the reserved words on the stack, but we're still in trouble because the callee will return back to the caller with the wrong table of contents. Oh no!
The solution is to have the compiler leave a nop
after every call that might be to a glue routine
that jumps to another module.
If the linker determines that the call target is indeed a glue routine,
then it patches the nop to lwz r2, 4(r1) to reload
the caller's table of contents.
So from the caller's perspective, calling a glue routine looks
like this:
; before
bl toofar ; not sure if this is a glue routine or not
nop ; so let's drop a nop here just in case
; after the linker inserts the glue routine
bl toofar_glue ; turns out this was a glue routine after all
lwz r2, 4(r1) ; reload caller's table of contents
The system also leaves the word at 8(r1) available
for the runtime,
but I don't see any code actually using it.¹
The remaining three reserved words in the stack frame
have not been assigned a purpose yet; they remain reserved.
If the compiler can prove² that the call destination uses the same
table of contents as the caller, then it can omit the nop.
The glue code saves the table of contents at 4(r1),
but the calling function may have already saved its table of contents
on the stack,
in which case saving the table of contents again
is redundant.
On the other hand,
if a function does not call through any function pointers,
then it doesn't explicitly manage its table of contents because
it figures the table of contents will never need to be restored.
So there's a trade-off here:
Do you force every function to save its table of contents on the stack
just in case it calls a glue routine (and teach the linker how to fish
the table of contents back out, so it can replace the nop
with the correct reload instruction)?
Or do you incur an extra store at every call to a glue routine?
Windows chose the latter.
My guess is that glue routines are already a bit expensive,
so making them marginally more expensive is better than penalizing
every non-leaf function with extra work that might end up not needed
after all.³
Exercise: Discuss the impact of glue routines on tail call elimination.
¹ My guess is that intrusive code coverage/profiling tools may use it as a place to save the r11 register, thereby making r11 available to increment the coverage count. But I haven't found any PowerPC code coverage instrumented binaries to know for sure.
² Microsoft compilers in the early 1990's did not support link-time code generation, so the compiler can prove this only if the function being called resides in the same translation unit as the caller.
³
It's possible to eliminate most glue routines
with sufficient diligence:
Explicitly mark your imported functions as
__declspec(dllimport) so that they aren't
naïvely imported any more.
The only glue routines remaining would be the ones for calls
to functions that are too far away.
…And yet there are people who call IA-32 “outright bizarre compared to contemporary RISC-processors”. Holy fatcats. What is “sane processor architecture and coding conventions” then?
Sane is what you do, insane is what the other guy does :)
Oh, IA-32 does a very similar rigmarole for position-independent code (PIC) on platforms that use the System V ABI (Linux, most other Unices). Namely, the GOT passed in ebx is equivalent to the PPC TOC passed in r2. See e.g. https://ewontfix.com/18/.
This doesn’t have much to do with PPC per se; it’s just that without PIC you have frequent fix-ups all over the place that force you to keep distinct copies of fixed-up pages for every process if a DLL gets relocated (the solution Windows chose on 32-bit x86), whereas with PIC, and in the absence of a nice way to do PC/RIP-relative addressing, you get stuff like the above. Okay, the second half of the post anyway – the “branch too distant” glue functions/trampolines, not so much, because 32- and 64-bit x86 have 32-bit jump distances.
Newer architecture revisions (x86-64 and ARM AArch64) have learned their lesson and provide direct mechanisms for PC-relative addressing to make PIC easier. This is a blind spot for many RISC ISAs because most of them were designed before OSes did dynamic linking, which is when this suddenly started to matter: SunOS 4, which is I believe the first commercial Unix with dynamic linking support, came out in Dec 1988. Most of the major RISCs were already out by then – although nothing POWER-derived; PPC wasn’t a thing yet and the POWER1 shipped early in 1990.
By the way, do you or someone else know why “System V ABI” is called so? AFAIK, System V has never been ported to x64, so why call it “System V”?
I’m not sure!
It goes back to before POSIX and the SUS, so at the time the System V Interface Definition (SVID) was the primary standard for Unices. That might be the reason why the ABI is branded “System V”.
ELF was introduced in System V Release 4. The processor-specific definitions for ELF binaries on new Linux architectures can still be considered as extensions of the System V ABI.
First: ‘lr’ is already holding something we need (the return address). Even disregarding that, using ‘lr’ instead of ‘ctr’ sounds like a great way to completely desynchronize the return address predictor, killing your performance.
Second: Seeing as how the entire point of the glue function appears to be for saving and restoring the caller’s TOC, then because a tail call never returns to the caller, it seems that you can just skip the glue entirely, and simply loading the target’s TOC into r2 and then branching (without link).
Ah, but what if the target turns out to be a glue routine?
I expect that people who complain about the madness of IA-32 (*especially* IA-32; IA-64 has a great deal more sanity) are referring to the fallout from the following: Going from 16-bit to 32-bit, and the increasing sophistication of CPUs.
The x86 instruction set is actually a pretty nice instruction set. It has a lot of specialty instructions, especially for supporting loops, string handling, and function calls.
However, a number of things conspired to turn the elegance of the architecture into ugliness:
1. Going from 16-bit to 32-bit: The instruction set stayed pretty much the same, but immediate values in machine-word-sized instructions grew from 16 bits to 32 bits. Since a typical instruction with an immediate was 3-4 bytes in 16-bit mode, it became 5-6 bytes in 32-bit mode: a 50-67% increase in the size of instructions that require an immediate.
This was especially noticeable for instructions that load small constants into registers.
2. Because there were so few registers, and because most useful instructions can directly reference memory, compilers (and assembly language programmers) would be forced to emit code that directly referenced memory addresses.
When combined with #1, this means that in 32-bit mode, many instructions that are only 4 bytes on a RISC system are *more* than 4 bytes on IA-32. Strike one.
Furthermore, temporaries that would be stored in a register on a RISC system must be written to main memory, with all the latency and cache effects that implies. Strike two.
3. Out-of-order execution was a death knell for a lot of the more fancy instructions.
OOE is crucial for a modern CPU to achieve anywhere near maximum performance; conversely, anything that prevents OOE must be avoided as much as possible.
A great example of this is PUSH. Because PUSH writes to SP, a series of PUSH instructions would create a dependency chain, preventing OOE and thus forcing the CPU to serialize the parameter setup phase.
The choice became: small but slow code, or fast but big code. Compilers would often choose the latter, and write MOV DWORD PTR [SP+x], y, for a whopping 6-byte instruction (for loading from a register) or a *10-byte* instruction (for a constant).
At this point, we’re in “crazy” territory: a large number of specialty instructions that aren’t used, and a common step (setting up a single argument for a subroutine call) that is a single 32-bit CPU instruction on a RISC machine, is more than twice as large on an x86. And, on x86, *must* hit main RAM eventually, even if all the subroutine does is read the value in, add 8, and return it in EAX.
Dependency chains aren’t that bad and certainly don’t prevent OOE! Specifically, it’s totally fine to have a few long dependency chains as long as you have other stuff to do in between – like, say, the rest of the function.
PUSHes and POPs are an interesting case because they are quite common and have been getting special hardware acceleration for over 15 years now.
Even in a direct implementation of PUSH/POP (i.e. actual SP modified every instruction), that means a sequence of PUSH/POPs each depends on the previous one in the chain, but that only affects exactly those instructions (and others that access RSP explicitly or implicitly), it doesn’t prevent you from reordering anything else. Said direct implementation boils down to a SP decrement followed by a store for each PUSH, with subsequent PUSHes depending on the prior PUSH’s SP decrement. Assuming the SP decrement takes one clock cycle (i.e. same as a regular integer add), that limits you to one PUSH executed per clock cycle; but all currently shipping x86s execute at most one store per clock cycle anyway (because there’s only one, pipelined, store unit) so that is not a throughput hit by itself.
POPs are a bit more annoying in the direct implementation because current x86s can do two loads per cycle, so there the 1/cycle limit if you have to wait for the SP update hurts.
That’s where the aforementioned hardware acceleration helps: starting with the Pentium M, Intel put a “Stack Engine” in the renamer that internally turns sequences of stack-modifying operations into more efficient ones. Conceptually, the CPU frontend rewrites say
PUSH rax
PUSH rcx
PUSH rdx
into something like the internal equivalent of
MOV [rsp-8], rax
MOV [rsp-16], rcx
MOV [rsp-24], rdx
SUB rsp, 24
getting rid of most of the serial stack pointer updates entirely (similar for POPs). For PUSHes, this doesn’t decrease the total latency for the pushes to complete, but it does reduce the number of uops executed (leaving the integer execution units free to do more useful work). For POPs on the more recent CPUs supporting two loads/cycle (for Intel: Sandy Bridge and later), this also has the side effect of allowing two POPs to complete per cycle, but that’s more recent (~7 years ago).
(A bit more detail: what the stack engines do is make instructions such as PUSH, POP, CALL and RET use a virtual SP that has the form “virtual_SP = base_SP + offset”, where offset is a small integer, and base_SP is stored in a register. Most stack-using instructions only update the offset. Only when the offset gets too small, too large, an instruction wants to use RSP directly, or an interrupt/exception occurs, is an actual stack-pointer update inserted that makes the virtual SP agree with the “base SP” stored in a register and zeroes the offset. This occurs fairly rarely.)
>The x86 instruction set is actually a pretty nice instruction set. It has a lot of specialty instructions,
>especially for supporting loops, string handling, and function calls.
You left off the rest of the sentence:
… as implemented by the Pascal programming language.
It's a pity IBM didn't go with the Motorola 68000 as the CPU in the PC. Having dabbled in 68000 assembler, I find it a nice instruction set to work with (in part because it gives you that nice flat address space and doesn't even have the idea of segments).