The Alpha AXP, part 1: Initial plunge


Since the Itanium series was such a smash hit (two whole people read it!), here's another series for a now-defunct processor architecture which Windows once supported. The next who-knows-how-many days will be devoted to an introduction to the Alpha AXP processor, as employed by Win32.

The Alpha AXP follows the traditional RISC philosophy of having a relatively small and uniform instruction set. The first Alpha AXP chip was dual-issue, and later chips eventually reached quad-issue. (There was an eight-issue processor under development when the Alpha AXP project was cancelled.) This series will focus on the original Alpha AXP architecture because that's what Windows NT for Alpha AXP ran on, and it will largely ignore features added later.

The Alpha AXP is a 64-bit processor. It does not have "32-bit mode"; the processor is always running in 64-bit mode. If the destination of a 32-bit operation is a register, the answer is always sign-extended to a 64-bit value. (This is known as the "canonical form" for a 32-bit value in a 64-bit register.) This one weird trick lets you close one eye and sort of pretend that it's a 32-bit processor. An Alpha AXP program running on 32-bit Windows NT still has full access to the 64-bit registers and can use them to perform 64-bit computations. It could even use the full 64-bit address space, if you were willing to jump through some hoops.
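To make the canonical form concrete, here is a minimal sketch in Python (not Alpha code) of what sign-extending a 32-bit result into a 64-bit register looks like. The function names are made up for this illustration.

```python
MASK64 = (1 << 64) - 1

def canonical32(value):
    """Return the 64-bit register image of a 32-bit value (sign-extended)."""
    low = value & 0xFFFFFFFF
    if low & 0x80000000:
        # Bit 31 is set, so copy it into bits 32 through 63.
        low |= 0xFFFFFFFF00000000
    return low

def is_canonical32(reg):
    """True if a 64-bit register value is a sign-extended 32-bit value."""
    return canonical32(reg) == (reg & MASK64)
```

For example, the 32-bit value c00ea124 has its top bit set, so its canonical form is ffffffff c00ea124, which is the shape of the t11 value in the register dump later in this article.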

Each instruction is a 32-bit word, aligned on a 4-byte boundary. Unlike other RISC processors of its era, the Alpha AXP does not have branch delay slots. If you don't know what branch delay slots are, then consider yourself lucky.

Memory size terms in the Alpha AXP instruction set are byte, word (two bytes), longword (four bytes), and quadword (eight bytes).¹ In casual conversation, longword and quadword are usually shortened to long and quad.

The Alpha AXP defines certain groups of instructions which are optional, such as floating point. If you execute an instruction which is not implemented by the processor, the instruction traps into the kernel, and the kernel is expected to emulate the missing instruction and then resume execution.

Registers

There are 32 integer registers, all 64 bits wide. Formally, they are known by the names r0 through r31, but Win32 assigns them the following mnemonics which correspond to their use in the Win32 calling convention.

Register   Mnemonic  Meaning              Preserved?    Notes
r0         v0        value                No            On function exit, contains the return value.
r1–r8      t0–t7     temporary            No
r9–r14     s0–s5     saved                Yes
r15        fp        frame pointer        Yes           For functions with variable-sized stacks.
r16–r21    a0–a5     argument             No            On function entry, contains function parameters.
r22–r25    t8–t11    temporary            No
r26        ra        return address       Not normally
r27        t12       temporary            No
r28        at        assembler temporary  Volatile      Long jump assist.
r29        gp        global pointer       Special       Not used by 32-bit code.
r30        sp        stack pointer        Yes
r31        zero      reads as zero        N/A           Writes are ignored.
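If you find yourself translating between the formal register numbers and the Win32 mnemonics, the table can be written down as a small lookup. This is only a convenience sketch in Python; the function name is invented for the example, and the mapping is taken from the table above.

```python
def build_integer_register_names():
    """Map formal register numbers (0..31) to their Win32 mnemonics."""
    names = {0: "v0", 15: "fp", 26: "ra", 27: "t12",
             28: "at", 29: "gp", 30: "sp", 31: "zero"}
    for i in range(8):
        names[1 + i] = f"t{i}"        # r1..r8   are t0..t7
    for i in range(6):
        names[9 + i] = f"s{i}"        # r9..r14  are s0..s5
    for i in range(6):
        names[16 + i] = f"a{i}"       # r16..r21 are a0..a5
    for i in range(4):
        names[22 + i] = f"t{8 + i}"   # r22..r25 are t8..t11
    return names
```

Note that the temporaries are not contiguous: t0 through t7 sit below the saved registers, t8 through t11 sit above the argument registers, and t12 is off on its own at r27.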

The zero register reads as zero, and writes to it are ignored. But it goes further than that: If you specify zero as the destination register for an instruction, the entire instruction may be optimized out by the processor! This means that any side effects may or may not occur. There are a few exceptions to this rule:

  • Branch instructions are never optimized out. If a branch instruction specifies zero as the register to receive the return address, the branch is still taken, but the return address is thrown away.

  • Load instructions are always optimized out. If a load instruction specifies zero as the destination register, the processor will never raise an exception. Instead, these "phantom loads" are used as prefetch hints to the processor.
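The basic read and write behavior of the zero register can be modeled in a few lines. This is a toy Python model for illustration only: it captures "reads as zero, writes are ignored" but makes no attempt to model the processor discarding whole instructions or treating loads as prefetch hints. The class name is hypothetical.

```python
class IntegerRegisterFile:
    """Toy model of the 32 integer registers; r31 is the zero register."""
    ZERO = 31

    def __init__(self):
        self._regs = [0] * 32

    def read(self, n):
        # The zero register always reads as zero.
        return 0 if n == self.ZERO else self._regs[n]

    def write(self, n, value):
        if n == self.ZERO:
            return  # Writes to the zero register are discarded.
        self._regs[n] = value & ((1 << 64) - 1)  # Registers are 64 bits wide.
```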

Whereas the behavior of the zero register is architectural, the behavior of the other registers is established by convention.

Win32 requires that the gp, sp, and fp registers be used for their stated purpose throughout the entire function. (If a function does not have a variable-sized stack frame, then it can use fp for any purpose.) Some registers have stated purposes only at entry to a function or exit from a function. When not at the function boundary, those registers may be used for any purpose.

Registers marked with "Yes" in the "Preserved" column must be preserved across the call; those marked "No" need not be.

The ra register is marked "Not normally" because you don't normally need to preserve it. However, if you are a leaf function that uses no stack space and modifies no preserved registers, then you can skip the generation of unwind codes for the leaf function, but you must keep the return address in ra for the duration of your function so that the operating system can unwind out of the function should an exception occur. (Special rules for lightweight leaf functions also exist for Itanium and x64.)

What does it mean when I say that the at register is volatile?

Direct branch instructions can reach destinations up to 4MB from the current instruction. When the compiler generates a bsr instruction (branch to subroutine), it typically doesn't know how far away the destination is. The compiler just generates a bsr instruction with a fixup and hopes for the best. It is the linker who knows how far away the destination actually is, and if it turns out the destination is too far away, the linker changes

        ....
        BSR     toofaraway
        ....

to

        ....
        BSR     trampoline
        ....

trampoline:
        ... set the "at" register equal to the
        ... address of "toofaraway."
        JMP     (at)            ; register indirect jump

The linker inserts the generated trampoline code between functions, which also has as a consequence that a single function cannot be larger than 8MB.
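Here is a rough sketch, in Python, of the range check the linker has to make. It relies on a detail of the Alpha branch instruction format that this article doesn't spell out: the displacement is a signed 21-bit count of instructions, relative to the instruction after the branch. The function names are invented for the example, and a real linker obviously does much more than this.

```python
BRANCH_DISP_BITS = 21     # signed displacement field, counted in instructions
INSTRUCTION_BYTES = 4     # every instruction is a 32-bit word

def branch_displacement(pc, target):
    """Displacement in instructions, relative to the instruction after the branch."""
    delta = target - (pc + INSTRUCTION_BYTES)
    if delta % INSTRUCTION_BYTES:
        raise ValueError("branch target must be 4-byte aligned")
    return delta // INSTRUCTION_BYTES

def needs_trampoline(pc, target):
    """True if a direct branch at pc cannot reach target."""
    limit = 1 << (BRANCH_DISP_BITS - 1)
    return not (-limit <= branch_displacement(pc, target) < limit)
```

A signed 21-bit count of 4-byte instructions works out to roughly 4MB in either direction, which is consistent with the 4MB reach and the 8MB function size limit mentioned above.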

Anyway, this secret rewriting means that any branch instruction can potentially modify the at register. In between branches, you can use at, but you cannot rely on its value remaining the same once a branch is taken. In practice, the compiler just avoids using the at register altogether.

The gp register is not used by 32-bit code. I don't know for sure, but I'm guessing that in 64-bit code, it serves the same purpose as the Itanium gp register.

Note that some register names, like a0, look like hex digits. The Windows debugger resolves them in favor of hex values, so if you do ? a0 thinking that you're getting the value of the a0 register, you're going to be disappointed. To force a symbol to be interpreted as a register name, put an at-sign in front: ? @a0.

Even more confusing is that the Windows debugger's disassembler does not put the 0x prefix in front of numbers, so when you see an a0, you have to use the context to determine whether it is a number or a register. For example,

    LDA     a0, a0(a0)
            ^^  ^^ ^^
      register  |  register
              number

The first parameter to LDA and the parameter inside the parentheses must be a register, so the outer a0's refer to the register. The thing just outside the parentheses must be a constant, so the middle a0 is the number 160. Yes, it's confusing at first, but the uniform instruction set means that these rules are quickly learned, and you don't really notice it once you get used to it.
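The positional rule can be sketched as a tiny Python helper that picks apart a disassembled line of the form "LDA Ra, disp(Rb)". The helper is hypothetical and only handles this one operand shape; the point is that position, not spelling, decides whether a0 is a register or a number.

```python
def describe_lda_operands(text):
    """Classify the operands of 'LDA Ra, disp(Rb)' as printed without 0x prefixes."""
    mnemonic, rest = text.split(None, 1)
    dest, memory = [part.strip() for part in rest.split(",")]
    disp, base = memory.rstrip(")").split("(")
    return {
        "dest_register": dest,          # first operand: always a register
        "displacement": int(disp, 16),  # outside the parentheses: always a hex number
        "base_register": base,          # inside the parentheses: always a register
    }
```

Applied to the example above, the outer a0's come back as register names and the middle a0 comes back as the number 160.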

Another point of confusion is that the conventional placeholder names for registers in instructions are Ra, Rb and Rc. These should not be confused with the ra register.

There are thirty-two floating point registers. Formally, they are known as f0 through f31, but Win32 assigns the following mnemonics:

Register   Mnemonic  Preserved?  Meaning
f0                   No          Return value
f1                   No          Second return value (for complex numbers)
f2–f9                Yes
f10–f15              No
f16–f21              No          First six parameters
f22–f30              No
f31        fzero     N/A         Reads as zero. Writes are ignored.

There are four floating point formats supported. Two are the usual IEEE single and double precision formats. Two are special formats for backward compatibility with the DEC VAX. That's about all I'm going to say about floating point.

Finally, there are some special registers.

Register     Mnemonic  Meaning
pc           fir       program counter
lock_flag              For interlocked memory access
phys_locked            For interlocked memory access
fpcr                   Floating point control register

Why is the program counter called fir? Because that stands for "faulting instruction register".

Clearly named by somebody wearing kernel-colored glasses.

These special registers are not directly accessible. To retrieve the program counter, you can issue a branch instruction and save the "return address" into the desired destination register. We'll learn more about the lock_flag and phys_locked when we study interlocked memory access.

Note that there is no flags register.

I repeat: There is no flags register.

Here's what a register dump looks like in the Windows debugger:

  v0=00000000 00000016   t0=00000000 00000000   t1=00000000 00000000
  t2=00000000 00000000   t3=00000000 00000009   t4=00000000 00000001
  t5=00000000 0006f9d0   t6=00000000 00000008   t7=00000000 00000000
  s0=00000000 00000001   s1=00000000 00000000   s2=00000000 00081eb0
  s3=00000000 77fc0000   s4=00000000 00081dec   s5=00000000 77fc0000
  fp=00000000 7ffde000   a0=00000000 750900c8   a1=00000000 00000001
  a2=00000000 00000009   a3=00000000 0006f9d0   a4=00000000 00000001
  a5=00000000 00000001   t8=00000000 0000004c   t9=00000000 00000001
 t10=00000000 0000004c  t11=ffffffff c00ea124   ra=00000000 77f4df08
 t12=00000000 00000001   at=00000000 77f548f0   gp=00000000 00000000
  sp=00000000 0006f9e0 zero=00000000 00000000 fpcr=08000000 00000000
softfpcr=00000000 00000000  fir=77f63bf4
 psr=00000003
mode=1 ie=1 irql=0

I never needed to know what softfpcr is. The psr is the processor status register, the mode is 1 for user mode and 0 for kernel mode, ie is the interrupt enable flag, and irql is the interrupt request level.
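The debugger prints each 64-bit register as two groups of eight hex digits, high half first. If you want to work with a dump programmatically, a minimal Python sketch like the following can recombine the halves. The regular expression and function name are assumptions made for this example; they only recognize the two-group entries and skip the single-group fir and psr values.

```python
import re

# name=HHHHHHHH LLLLLLLL (high 32 bits, one space, low 32 bits)
PAIR = re.compile(r"(\w+)=([0-9a-f]{8}) ([0-9a-f]{8})")

def parse_register_dump(text):
    """Return {register name: 64-bit integer value} for two-group entries."""
    return {name: int(high + low, 16) for name, high, low in PAIR.findall(text)}
```

Feeding it the t10/t11/ra line from the dump above yields 0x4c, 0xffffffffc00ea124, and 0x77f4df08 respectively.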

The calling convention is simple. As noted in the tables above, parameters are passed in registers, with excess parameters spilled onto the stack. There is no home space. The return address is passed in the ra register, and the stack must be kept aligned on a 16-byte boundary. Exception dispatch is done by unwind tables stored in a separate section of the image.
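As a simplified picture of parameter passing, the following Python sketch assigns integer parameters to a0 through a5 and reports whatever is left over as spilled to the stack. It deliberately ignores floating point parameters and says nothing about the layout or order of the stack portion; the function name is invented for the illustration.

```python
ARGUMENT_REGISTERS = ["a0", "a1", "a2", "a3", "a4", "a5"]

def assign_integer_arguments(args):
    """Split integer arguments into register-passed and stack-passed groups."""
    # zip stops at the shorter sequence, so at most six arguments get registers.
    in_registers = dict(zip(ARGUMENT_REGISTERS, args))
    on_stack = list(args[len(ARGUMENT_REGISTERS):])
    return in_registers, on_stack
```

A function taking eight integer parameters would therefore receive the first six in a0 through a5 and the remaining two on the stack.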

Okay, that's the register set and calling convention. Next time, we'll look at integer operations.

Exercise: The x64 calling convention reserves home space so that the register-based parameters can be spilled onto the stack and remain contiguous with the other stack-based parameters, so that the entire parameter pack can be enumerated with the va_start family of macros. Why doesn't this requirement apply to the Alpha AXP?

¹ The term octaword was introduced later, but we are focusing on the Alpha AXP classic architecture.

Comments (40)
  1. pc says:

    Both of us who enjoyed the Itanium writeup will be enjoying this series as well. Thank you.

    1. Martin Bonner says:

      Ah-ha! I was wondering who the other guy was.

      1. jnm2 says:

        He’s gotta be joking. I read it and I’m sure many others did as well.

        1. cheong00 says:

          Yup. Not replying does not necessarily mean not enjoying it.

        2. mikeb says:

          He was certainly joking – look at the comments to the Itanium articles. They were quite active.

        3. Grey Hodge says:

          There are literally dozens of us!

    2. I’d love to see Raymond do one of these for each architecture he’s worked with. The Itanium one was deeply interesting, and I expect this one to be good, too.

      1. mikeb says:

        I’d love to see a series on ARM even if he hasn’t worked on it. I’d bet he would still teach me some interesting stuff.

          1. mikeb says:

            Raymond comes through – courtesy of Marion Cole. Thanks! (I’m surprised I don’t remember that at all)

    3. ZLB says:

      Well, I enjoyed the Itanium series too. What a beast of an architecture!

      On the other hand, batch week. Err…yeah that OK I guess!

  2. DWalker07 says:

    It’s interesting to follow the development of different processor architectures. I’ll bet at least THREE people read the Itanium articles.

  3. Muzer says:

    Hooray! I’ve been waiting for this for a while! I loved your Itanium series; I must be the third out of two!

    As for the exercise, I’m a little stumped, but would it make sense for the first six arguments to be added to registers left-to-right, then the remaining arguments pushed onto the stack right-to-left, such that arguments can be taken out of registers and added to the “correct” end of the stack without issue as if they were local variables?

  4. kantos says:

    Alpha AXP was always the great “What could have been.” I’d have been very curious to see benchmarks between early x64 and AXP using windows.

  5. Not as crazy as the Itanium architecture to be sure, but a fun read regardless. Reminds me a lot of the MIPS V architecture, except I know MIPS did have a flags register along with some other internal registers that are still an enigma to me. MIPS did have a 32-bit operating mode, but all it basically meant was that it ignored the top 32 bits when it came to branch, jump, and load/store operations.

  6. CarlD says:

    “If you don’t know what branch delay slots are, then consider yourself lucky. ” Amen to that, brother.

    1. Joshua says:

      In class the instructor kept placing work instructions in the delay load slot for the hypothetical processor; neglecting the one real processor we dealt with said in the manual you couldn’t do that because interrupt save/restore wouldn’t restore the state to actually run the delay load instruction if an interrupt happened at the right time.

      1. smf says:

        @Joshua I love branch delay slots & load delay slots. Having that level of implementation visible to the programmer appeals to my hacker instinct. Yeah you have to be careful, but you can pull some tricks with it. Putting a branch in a branch delay slot for fun and profit.

        “In class the instructor kept placing work instructions in the delay load slot for the hypothetical processor; neglecting the one real processor we dealt with said in the manual you couldn’t do that because interrupt save/restore wouldn’t restore the state to actually run the delay load instruction if an interrupt happened at the right time.”

        1. Alex Elsayed says:

          The problem arises when “that level of implementation” – which may have made sense for the original silicon – is made obsolete by subsequent advances.

          The branch delay slot in MIPS is a wart these days, hindering microarchitectural optimizations that could result in faster/lower-power cores. There’s a good reason RISC-V left it out :P.

    2. Ted Spence says:

      I was fond of “skip-if-test” instructions. Those were the ones where each test instruction would skip the next instruction if the test was successful, and not skip it if the test failed. So each “If x = y then z” code would compile to a load, a compare, and an unconditional branch that was skipped if the test caused it.

      1. Alex Elsayed says:

        Yeah, predication has some nice uses. It’s fallen out of favor for modern application CPUs in favor of branch prediction, but it’s seeing use in new vector stuff, such as the Hwacha project (and the RISC-V ISA’s vector extension, V).

  7. IanBoyd says:

    Hey, I loved the Itanium series. I’ve gone back to re-read it a couple times, and I’ve shared links to it.

    It’s made me wonder what Intel could do today with everything they’ve learned about CPU design. With all the cache, speculative execution, and out of order execution: 95% of the CPU die is dedicated to the concept of rewriting your program to make it run faster. Which gave rise to the premise of Itanium: let the compiler do the rewriting of the code; rather than dedicating precious transistors on the CPU die. And it eventually worked – although regular Xeon eventually caught up.

    I wonder if we’re at the point where RAM is the fundamental limitation, and any further advancements in compilers or silicon gives only tiny marginal improvements, and there’s no other CPU design that can fundamentally help execution.

    But the Itanium series was great.

    1. Ted Spence says:

      Problem is that even if the compiler understands everything perfectly, the next processor in the series will come out and the optimizations will have to be redone. So putting a limited amount of rewriting logic in the chip is the best way to do it because you can re-optimize for each generation of the architecture.

  8. skSdnW says:

    I must be the fourth out of two. (Did you check the view stats Raymond?) It might not be that relevant and people might not make that many comments but I think most people still enjoy these posts!

    I’ll take a stab at the exercise; I don’t know about the win32 ABI but in C the va_* functions are free to do whatever they want and the underlying platform might not store them in a contiguous array which is why va_copy was added later on.

  9. Zan Lynx' says:

    I am a big fan of the Itanium design and I read your series on it too. I am sad that it didn’t take off.

  10. camhusmj38 says:

    Attempt at the exercise: Is it because the return address is not stored on the stack?
    Therefore you can spill the register parameters and still access the whole thing [as in also the stack parameters] as an array.
    In x64, you would have the return address in the middle.

    1. Yup, that’s the answer I came up with.

      1. Yuri Khan says:

        int printf(const char *format, …);
        printf(“%d %f %d %f”, 42, 3.14, 17, 2.72);

        If integer arguments are passed in a0 and a1, and floating arguments in f16 and f17, how does the callee know beforehand which registers to spill, and in which order?

        1. It spills all of them! I’ll make a note to discuss this in a future installment.

  11. Karellen says:

    Memory size terms in the [64-bit] Alpha AXP instruction set are byte, word (two bytes), longword (four bytes), and quadword (eight bytes).

    That seemed weird. Wikipedia both agrees (“In the Alpha architecture, a byte was defined as an 8-bit datum (octet), a word as a 16-bit datum, a longword as a 32-bit datum, a quadword as a 64-bit datum, and an octaword as a 128-bit datum.”[0]) and disagrees (Under “Word size w” for entry “Alpha”, the value is “64b”[1])

    However, looking back at DEC’s history, the Alpha is the successor to the 32-bit VAX, which is in turn the successor to the 16-bit PDP. By analogy with Windows, where WORD and DWORD have fixed meanings irrespective of the natural word size of the CPU, inherited from Win16 through Win32 to Win64, it would appear that the “word” of Alpha chips has been similarly immutable.

    So much for technical terms with precise meanings, when backwards-compatibility rears its ugly (but oh so pragmatic) head!

    [0] https://en.wikipedia.org/wiki/Alpha_AXP#Data_types
    [1] https://en.wikipedia.org/wiki/Word_(computer_architecture)#Table_of_word_sizes

    1. The term “word” is being used in two different senses, which is why you see an apparent contradiction. One is as an architecture-defined terminology. The other is as a general computer industry term, where it means the natural operand size. For the Alpha AXP, the architecture-defined terminology for “word” is 16-bit, and the natural operand size is 64-bit.

  12. JoeD says:

    Not only did I enjoy reading it, I actually have both an Alpha (DEC Multia UDB) system and an Itanium system in my basement at home. I haven’t fired them up in a while though.

    1. DonH says:

      Of course not, it’s summer. You shouldn’t need them until about November.

      1. ErikF says:

        They also double as counter-surveillance units: nobody can hear what you’re yelling at them outside of 5 feet! ;-)

  13. Matteo Italia says:

    Feels like a remarkably clear, well designed architecture, especially compared to the weird memories I have of the Itanium series. I can’t wait to see if it will live up to these premises.

    1. Thiago Macieira says:

      Famous last words…

      There’s a reason there’s a section of multiprocessor documentation of the Linux kernel that starts with “And then there was Alpha”

  14. Simon Clarkstone says:

    In answer to the exercise, I can make an educated guess:

    The top-of-stack at time of function entry is the parameters N to 6. This means that the callee is free to push parameters 5 to 0 onto the stack and they will be contiguous with the parameters N to 6. This doesn’t work in (e.g.) x64 because on the stack the return address and frame pointer(?) would be sandwiched between parameter 6 and parameter 5, but it’s fine in AXP where those things are kept in registers and only spilled onto the stack if+where the callee wants them to be.

    More advanced: In functions like foo(FILE f, int i, char* fmt, …) you can spill parameters 3-5 onto the stack and needn’t spill parameters 0-2.

    I liked the Itanium series too.

  15. Count me as another that thoroughly enjoyed the arcana of the IA64 series. Looking forward to the rest of the Alpha AXP series!

  16. Simon says:

    Heh… Alpha was the model architecture we studied back in university. Never as popular as x86, but far cleaner, and so much better suited for teaching the concepts. I think I still have my old textbook around somewhere…

  17. anonymous says:

    Exercise: no need for home space because return address is saved in a register rather than pushed to the stack. If parameters are pushed from right to left, the function is free to spill the first 6 by pushing them on the stack.
