x64 ABI vs. x86 ABI (aka Calling Conventions for AMD64 & EM64T)

(This is an older post, with some mild cleanup, and fixed links)

Before I start: ABI = Application Binary Interface – this is the spec that describes how to call functions, pass parameters, unwind the stack, handle exceptions, etc... It’s also sometimes call the ‘Calling Convention’

There is a persistent misconception among people who are implementing x64 compilers/code generators, and folks that write ASM code for x64, who have a functioning solution for x86. People too frequently assume they can just keep most of their code, while changing ESP to RSP, and things should 'just work'. This is fundamentally not true. When initially working on the x64 ABI, it was decided that we wanted to clean up the way that exception handling & general function invocation worked. We had a brand new architecture, we wanted to cut out all of the legacy junk that continually prevents, or at least overcomplicates, achieving great performance on the x86 platform. With this in mind, x64 was given a single calling convention – no __cdecl / __stdcall / __fastcall / __thiscall mess. There was also a dramatic change in the way that x64 unwinds the stack, compared to how x86 does it.

Unwinding the stack is used in a wide variety of places, including handling exceptions, garbage collection, and displaying the call stack from a debugger. On x86, every function that needs some sort of attention due to an exception must add an element to a thread-global linked list upon entry, and remove it upon exit. For non-exception unwinds, the thing doing the unwind must grok through some nasty meta-data that tries to describe what & when the compiler is setting up the stack frame. This meta-data was only implemented after about 1999, and is primarily supported in debuggers. I’ll be ignoring this junk, and instead focus on how exception handling works, since this is the primary way your code will break if you don’t get this right (undebuggable code is still broken, but you won’t notice until you try to get a stack trace).

So, the x86 thread-global linked list contains a list of structures, each element of such contains a function pointer to call in the event of an exception, and then some data that said function will consume. Thus, you’ll see fs:[0] references scattered throughout C++ code:every function that contains a destructor that must be invoked if an exception occurs must have one of these things. When your code creates a new object, there is a small bit of code executed to update the data on this thread-global list element. When that object is destroyed, more code is executed. Because of this, x86 can catch hundreds of thousands of exceptions per second, but if you don’t actually take any exceptions, your CPU executes lots of blobs of code that have no purpose except to be sure that the rare exception is handled properly. Finding the first function that needs to handle an exception, or destroy an object is an O(1) operation: it’s a single lookup of FS:[0]. VERY fast. Unfortunately, that should not be a common scenario. Exceptions should be exceptional, not common!In addition, this linked list really resides on the stack, thus there is a function pointer sitting right below the return address on your stack!Buffer overruns, anyone?There is now an extra blob of information that indicates what functions are valid to be invoked as exception handlers (link /SAFESEH, if you’re curious), but this is only used if every contribution to your .exe or .dll has this information.

With this in mind, on x64 every function has a very strict structure that must be properly described in static data. The prolog is the only place in which you can adjust your stack frame pointer. The prolog can only be up to 255 bytes long. All modifications of your stack frame pointer, as well as all saves of nonvolatile registers (RBP, RSI, RDI, R12-R15, XMM6-XMM15) must be described in this static data, so that they can be restored correctly if an exception occurs. If you function is missing this static data, and an exception is raised, the thread will be terminated by the OS.

In an effort to get stuff into peoples hands, here’s an excerpt that I’ve prepended to the ABI document that has not yet seen the light of day. I’ve updated the links to point you

at the sections on MSDN. Rather than posting the whole thing, I've just put in my description, with links back to the slightly older version on MSDN. The current version has some more detail, but nothing you couldn't figure out through, ahem, trial & error...

<snip>

Overview

The in depth nature of an ABI document doesn’t lend itself to ‘easy reading’. However, it is the case that a detailed knowledge of the entire ABI is rarely necessary to accomplish most programming tasks. This section is simply a quick overview of the ABI, with pointers to the sections that describe the various aspects in more detail. It also tries to point out particular ‘gotchas’ that must be strictly adhered to, in an effort to minimize the problems encountered.

Calling convention

The x64 Windows ABI is a 4 register ‘fast-call’ calling convention, with stack-backing for those registers. There is a strict one-to-one correspondence between arguments in a function, and the registers for those arguments. Any argument that doesn’t fit in 8 bytes, or is not 1, 2, 4, or 8 bytes, must be passed by reference. There is no attempt to spread a single argument across multiple registers. The x87 register stack is unused. It may be used, but must be considered volatile across function calls. All floating point operations are done using the 16 XMM registers. The arguments are passed in registers RCX, RDX, R8, and R9. If the arguments are float/double, they are passed in XMM0L, XMM1L, XMM2L, and XMM3L. 16 byte arguments are passed by reference. Parameter passing is described in detail at MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Calling Convention: Parameter Passing. In addition to these registers, RAX, R10, R11, XMM4, and XMM5 are volatile. All other registers are non-volatile. Register usage is documented in detail at MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Register Usage and MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Calling Convention: Caller/Callee Saved Registers.

The caller is responsible for allocating space for parameters to the callee, and must always allocate sufficient space for the 4 register parameters, even if the callee doesn’t have that many parameters. This aids in the simplicity of supporting K&R C’s unprototyped functions, and ‘vararg’ C/C++ functions. For vararg/unprototyped functions any float values must be duplicated in the corresponding general-purpose register. Any parameters above the first 4 must be stored on the stack, above the backing-store for the first 4, prior to the call. Vararg function details can be found at MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Calling Convention: Varargs. Unprototyped function information is detailed at MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Calling Convention: Unprototypesd Functions.

Alignment

Most structures are aligned to their natural alignment. The primary exceptions are the stack pointer and malloc/alloca memory, which are aligned to 16 byte, in order to aid performance. Alignment above 16 bytes must be done manually, but since 16 bytes is a common alignment size for XMM operations, this should suffice for most code. For more information about structure layout and alignment see MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Types and Storage. For information about the stack layout, see MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Stack Usage.

Unwindability

All non-leaf functions [functions that neither call a function, nor allocate any stack space themselves] must be annotated with data [referred to as xdata or ehdata, which is pointed to from pdata] that describes to the operating system how to properly unwind them, to recover non-volatile registers. Prologs & epilogs are highly restricted, so that they can be properly described in xdata. The stack pointer must be aligned to 16 bytes, except for leaf functions, in any region of code that isn’t part of an epilog or prolog. For details about the proper structure of function prolog & epilogs, see MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Prolog and Epilog. For more information about exception handling, and the exception handling/unwinding pdata & xdata see MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Exception Handling (x64).

</snip>

Anyway, I hope that helps. I could not believe how long it took to find this stuff on MSDN. For some (lame) reason, blogs are better indexed than MSDN in both MSN search, and Google. Hopefully this entry will make finding this information easier.

Here’s a link to the official x64 ABI documentation, which goes into excruciating detail about this stuff. Sorry if it’s not very readable – I wrote a few parts, along with a few other people, and we’re primarily engineers, not writers. We have had a great UE write take over on this document, and it should be slightly improved when it sees the light of day as part of the VS 2005 documentation, but until then, this is it:

MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions

Hopefully, that link will continue to work for a while. MSDN likes to occasionally restructure the way it references information, just to keep us all on our toes. I've now updated the links 3 times. If they're broken, please ping me and I'll update them!