The Itanium processor, part 1: Warming up


The Itanium may not have been much of a commercial success, but it is interesting as a processor architecture because it is different from anything else commonly seen today. It's like learning a foreign language: It gives you an insight into how others view the world.

The next two weeks will be devoted to an introduction to the Itanium processor architecture, as employed by Win32. (Depending on the reaction to this series, I might also do a series on the Alpha AXP.)

I originally learned this information in order to be able to debug user-mode code as part of the massive port of several million lines of code from 32-bit to 64-bit Windows, so the focus will be on being able to read, understand, and debug user-mode code. I won't cover kernel-mode features since I never had to learn them.

Introduction

The Itanium is a 64-bit EPIC architecture. EPIC stands for Explicitly Parallel Instruction Computing, a design in which work is offloaded from the processor to the compiler. For example, the compiler decides which operations can be safely performed in parallel and which memory fetches can be productively speculated. This relieves the processor from having to make these decisions on the fly, thereby allowing it to focus on the real work of processing.
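To make this concrete, here is a rough sketch of what explicit parallelism looks like in Itanium assembly (we'll cover the instruction format properly later; the register numbers are chosen arbitrarily for illustration). The double semicolon is a stop, which marks the boundary between groups of instructions the compiler has declared to be independent.

add r32 = r33, r34     // these two additions are independent,
add r35 = r36, r37 ;;  // so the compiler places them in the same group
add r38 = r32, r35     // this one consumes both results, so it must come
                       // after the stop (the ;;)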

Registers overview

There are a lot of registers.

  • 128 general-purpose integer registers r0 through r127, each carrying 64 value bits and a trap bit. We'll learn more about the trap bit later.

  • 128 floating point registers f0 through f127.
  • 64 predicate registers p0 through p63.
  • 8 branch registers b0 through b7.
  • An instruction pointer, which the Windows debugging engine for some reason calls iip. (The extra "i" is for "insane"?)

  • 128 special-purpose registers, not all of which have been given meanings. These are called "application registers" (ar) for some reason. I will cover selected registers as they arise during the discussion.

  • Other miscellaneous registers we will not cover in this series.

Some of these registers are further subdivided into categories like static, stacked, and rotating.

Note that if you want to retrieve the value of a register with the Windows debugging engine, you need to prefix it with an at-sign. For example, ? @r32 will print the contents of the r32 register. If you omit the at-sign, then the debugger will look for a variable called r32.

A notational note: I am using the register names assigned by the Windows debugging engine. The formal names for the registers are gr# for integer registers, fr# for floating point registers, pr# for predicate registers, and br# for branch registers.

Static, stacked, and rotating registers

These terms describe how the registers participate in register renumbering.

Static registers are never renumbered.

Stacked registers are pushed onto a register stack when control transfers into a function, and they pop off the register stack when control transfers out. We'll see more about this when we study the calling convention.

Rotating registers can be cyclically renumbered during the execution of a function. They revert to being stacked when the function ends (and are then popped off the register stack). We'll see more about this when we study register rotation.

Integer registers

Of the 128 integer registers, registers r0 through r31 are static, and r32 through r127 are stacked (but they can be converted to rotating).

Win32 assigns some of the static registers the following mnemonics, which correspond to their use in the Win32 calling convention.

Register   Mnemonic    Meaning
r0                     Reads as zero (writes will fault)
r1         gp          Global pointer
r8…r11     ret0…ret3   Return values
r12        sp          Stack pointer
r13                    TEB

Registers r4 through r7 are preserved across function calls. Well, okay, you should also preserve the stack pointer and the TEB if you know what's good for you, and there are special rules for gp which we will discuss later. The other static registers are scratch (may be modified by the function).

Register r0 always contains the value zero. Writes to r0 trigger a processor exception.

The gp register points to the current function's global variables. The Itanium has no absolute addressing mode. In order to access a global variable, you need to load it indirectly through a register, and the gp register points to the global variables associated with the current function. The gp register is kept up to date when code transfers between DLLs by means we'll discuss later. (This is sort of a throwback to the old days of MAKEPROCINSTANCE.)
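As a hedged illustration (my_global is a made-up name, and the exact sequence depends on where the linker places the variable), loading a global from the small data area might look something like this:

addl r30 = @gprel(my_global), gp ;; // compute the variable's address from gp
ld4  r31 = [r30]                    // then load the 32-bit value through it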

Every integer register contains 64 value bits and one trap bit, known as not-a-thing, or NaT. The NaT bit is used by speculative execution to indicate that the register values are not valid. We learned a little about NaT some time ago; we'll discuss it further when we reach the topic of control speculation. The important thing to know about NaT right now is that if you take a register which is tagged as NaT and try to do arithmetic with it, then the NaT bit is set on the output register. Most other operations on registers tagged as NaT will raise an exception.

The NaT bit means that accessing an uninitialized variable can crash.

void bad_idea(int *p)
{
    int uninitialized;
    *p = uninitialized; // can crash here!
}

Since the variable uninitialized is uninitialized, the register assigned to it might happen to have the NaT bit set, left over from previous execution, at which point trying to save it into memory raises an exception.

You may have noticed that there are four return value registers, which means that you can return up to 32 bytes of data in registers.
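For illustration, a function returning a 32-byte structure might load the pieces into the return registers something like this (a sketch using the Win32 mnemonics, not actual compiler output; assume r30 points at the structure):

ld8 ret0 = [r30], 8 ;; // load the first 8 bytes, advance the pointer
ld8 ret1 = [r30], 8 ;;
ld8 ret2 = [r30], 8 ;;
ld8 ret3 = [r30]       // all 32 bytes are now in the return registers
br.ret.sptk.few rp     // return to the caller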

Floating point registers

Register   Meaning
f0         Reads as 0.0 (writes will fault)
f1         Reads as 1.0 (writes will fault)

Registers f0 through f31 are static, and f32 through f127 are rotating.

By convention, registers f0 through f5 and f16 through f31 are preserved across calls. The others are scratch.

That's about all I'm going to say about floating point registers, since they aren't really where the Itanium architecture is exciting.

Predicate registers

Instead of a flags register, the Itanium records the state of previous comparison operations in dedicated registers known as predicates. Each comparison operation indicates which predicates should hold the comparison result, and future instructions can test the predicate.

Register   Meaning
p0         Reads as true (writes are ignored)

Predicate registers p0 through p15 are static, and p16 through p63 are rotating.

You can predicate almost any instruction, and the instruction will execute only if the predicate register is true. For example:

(p1) add ret0 = r32, r33

means, "If predicate p1 is true, then set register ret0 equal to the sum of r32 and r33. If not, then do nothing." The thing inside the parentheses is called the qualifying predicate (abbreviated qp).

Instructions which execute unconditionally are internally represented as being conditional upon predicate register p0, since that register is always true.
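A compare instruction typically sets a pair of complementary predicates, which lets you predicate both arms of what would otherwise be a branch. A hedged sketch (the register choices are arbitrary; p6 and p7 are scratch under the Win32 convention):

cmp.eq p6, p7 = r32, r33 ;; // p6 = (r32 == r33), p7 = the complement
(p6) add ret0 = r34, r35    // executes only if the values were equal
(p7) mov ret0 = r0          // executes only if they were not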

Actually, I lied when I said that the instruction will execute only if the qualifying predicate is true. There is one class of instructions which execute regardless of the state of the qualifying predicate; we'll learn about that when we get to them.

The Win32 calling convention specifies that predicate registers p0 through p5 are preserved across calls, and p6 through p63 are scratch.

There is a special pseudo-register called preds by the Windows debugging engine which consists of the 64 predicate registers combined into a single 64-bit value. This pseudo-register is used when code needs to save and restore the state of the predicate registers.
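A minimal sketch of that save/restore idiom (assuming r31 is free to serve as scratch):

mov r31 = pr ;;     // copy all 64 predicate registers into r31
// ... code that clobbers predicates ...
mov pr = r31, -1 ;; // put them all back (-1 = restore every predicate)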

Branch registers

The branch registers are used for indirect jump instructions. The only things you can do with branch registers are load them from an integer register, copy them to an integer register, and jump to them. In particular, you cannot load them directly from memory or do arithmetic on them. If you want to do any of those things, you need to do it with an integer register, then transfer it to a branch register.
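For example, an indirect call through a function pointer might go something like this (a sketch; assume r30 holds the address of the code to call, and note that real code also has to juggle gp, which we'll get to later):

ld8 r31 = [r30] ;;       // load the code address into an integer register
mov b6 = r31 ;;          // transfer it to a scratch branch register
br.call.sptk.few rp = b6 // call through b6; the return address lands in rp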

The Win32 calling convention assigns the following meanings to the branch registers:

Register   Mnemonic   Meaning
b0         rp         Return address

The return address register is sometimes called br, but the disassembler calls it rp, so that's what we'll call it.

The return address register is set automatically by the processor when a br.call instruction is executed.

By convention, registers b1 through b5 are preserved across calls, while b6 and b7 are scratch. (Exercise: Is b0 preserved across calls?)

Application registers

There are a large number of application registers, most of which are not useful to user-mode code. We'll introduce the interesting ones as they arise. I've already mentioned one of them: bsp is the ia64's second stack pointer.

Break

Okay, this was a whirlwind tour of the Itanium register set. I bet your head hurts already, and we haven't even started coding yet!

In fact, we're not going to be coding for quite some time. Next time, we'll look at the instruction format.

Comments (62)
  1. anonymouscommenter says:

    Thanks for all of the effort that obviously went into writing all of this up.  I hope you don't get too many comments along the lines of "we don't care about the Itanium". Processor architecture is fascinating to many of us!

  2. anonymouscommenter says:

    Is b0 preserved across calls?

    Only recursive calls :-)

  3. anonymouscommenter says:

    So you think that is a lot of registers? I'm working on a project involving a CPU w/ 512 32bit registers.

  4. anonymouscommenter says:

    So you think that is a lot of registers? I'm working on a project involving a CPU w/ 512 32bit registers.

  5. anonymouscommenter says:

    This series is also of interest to me, as would be a series on the AXP (particularly if there is information about the unfinished 64-bit port).

  6. kantos says:

    I also vote for the AXP follow-on. Alpha was a very interesting architecture, killed by salespeople who couldn't sell it and by its price point.

  7. Charles H. Stevens says:

    Thanks for writing this up. I think it will be interesting.

    I will also vote for the AXP series for the same reason.

    Also I am going to guess that the extra i in iip is for Intel :)  [and yes I remember that this processor is related to the HP PA-RISC]

  8. anonymouscommenter says:

    And this is your processor on drugs…

  9. I love it!  I've always been curious about the Itanium architecture.  I heard/read that it had plenty of registers, but didn't realize exactly how many until you listed them (in excellent detail, might I add) above.  No wonder those processors were so expensive; it's surprising they didn't release a scaled back version with half the number of registers.  That still would have been an improvement over x86.

  10. anonymouscommenter says:

    Yes, my head does hurt already. This will be interesting though, looking forward to the rest of it.

    I almost expected there to be a "register p1 always reads as false". But maybe that would be useless??

  11. anonymouscommenter says:

    I wouldn't be so rude as to suggest that nobody cares about a topic so Raymond shouldn't spend a lot of time on it.  Different strokes for different folks.  But as I used to say to my Wife, "Just wake me when it's over."

  12. anonymouscommenter says:

    The first "i" in "iip" is for "Insanium", never to be confused with the instruction pointer of other architectures, emulated or not.

    Oops, I meant "Itanium"!

    Actually, it stands for something else, I think…

  13. anonymouscommenter says:

    Looking forward to this series, and hoping for the AXP one as well.  It's always interesting to take a look at what might have been.  Not that IA64 and AXP didn't happen, but if things had turned out differently I might be developing for them today instead of x86 or x86_64.

  14. anonymouscommenter says:

    The irony of course being that as transistors shrunk, it wound up being possible to do all that stuff they wanted to offload onto the compiler in the chip itself. ;)

  15. Evan says:

    I'm also looking forward to this series (and hoping you do one on the Alpha). There was some discussion of Itanium features in a grad compiler class I took several years ago, and a lot of it is pretty neat. I think it was the first time I learned about the idea of generic instruction predicates, for instance, which seem like a neat and useful idea. (OTOH, ARM 32-bit supports predicates but ARM64 doesn't, so maybe they *aren't* actually that useful. Or at least are less useful than having those instruction bits for other stuff.)

  16. anonymouscommenter says:

    You say writing to r0, f0, f1 and p0 is “not allowed”. How and on which level is this enforced? Does the processor ignore writes to them, or does it raise an exception, or is it just a Windows convention “thou shalt not write to r0 lest thou changest the value of integer zero”?

    Re exercise: There is not enough information in this post in order to give an answer.

    The contract as stated (“rp contains the return address”) consists of two clauses: 1. Whenever the Callee is invoked, rp shall initially contain the return address, and the Callee may return by jumping to that address. 2. Before invoking any function, the Caller shall set rp to the address it wishes control returned to (including, but not limited to, the address immediately following the instruction that transfers control to the callee — see tail call optimization).

    The callee may want to copy its initial rp value to a different register, then modify rp (perhaps to call its sub-callees), then restore the return address into a scratch register and jump via that. In this case, rp will be trashed.

    I can see one case where the caller cares for rp to be preserved: When it calls a single function in a tight loop. In this case, relying on rp keeping its value across the call allows its initial assignment to be hoisted out of the loop.

    Preserving the return address also simplifies compiler and debugger implementation.

    [Good points. I'll clarify in the article. -Raymond]
  17. anonymouscommenter says:

    This looks like a very interesting series, looking forward to the next parts!

  18. anonymouscommenter says:

    I can't wait till Microsoft engineers develop unreliable, insecure software for this platform too! How about, instead of securing it, you can just darken the screen and give the user a threatening warning box that if they want to USE their Itanium server, they risk destroying it.

    How about forcing them to reboot every 2-3 days because "updates" ?

    How about leaving the system vulnerable to privilege-escalation bugs so that someone can write a few lines of malformed javascript, root the system, encrypt all your data, and hold it ransom ? That's a great idea, isn't it?

    How about creating all these problems, letting the system get polluted and trashed, like you always do, but then offer them a NEW VERSION with a higher version number (and fewer features and less functionality), to sell them a solution to those problems that you created in the first place ?

    Is that how you become a billion dollar corporation ?

  19. anonymouscommenter says:

    @Mike I'm curious, do trolls read articles and formulate their nonsense based on that, or do you just have a few different ones in a shared document somewhere? If the latter, I'd like to take a look!

  20. anonymouscommenter says:

    There was a famous (at the time) virus for Windows on Itanium.  I can't even understand its code anymore.  Slots and allocs and sptk.few… and that predicate abuse.  Who puts 11 predicated lines in a row?

  21. anonymouscommenter says:

    Nice that you're covering the underdog. I suggest you mention the features that make up for its drawbacks: the security features. It has stack protection, memory key technology like IBM mainframes, and per-page read/write/execute permissions. Secure64's SourceT OS was built on these from the ground up and received quite a review from Matasano. Other appliance vendors could use it for a combo of security & reliability. Details on Itanium security below:

    http://www.intel.com/…/intel-itanium-secure-white-paper.pdf

    I also second a look at the Alpha. Unlike the rest, I encourage you to explore its coolest feature: PALcode. It was like having abilities of microcode with regular assembler. It was used for performance enhancements, firmware mods, security techniques, and so on. Very applicable to modern stuff is the fact that a PAL instruction block executed atomically. Modern concurrency handling relies on tricks such as compare-and-swap. Alpha could do a whole transaction in the CPU. Further, it was simple enough that the DARPA-funded crash-safe.org converted a simple implementation into a highly secure processor for HLL languages.

  22. mikeb says:

    Great article. I briefly worked on Itanium back when MS was porting Windows to it, but I had forgotten almost all of this stuff.

    This article also reminds me that I wish vendors could come up with a universal, *documented* set of aliases for registers (even if registers might have a couple different names). Occasionally I'll read some MIPS assembly, and there seems to be a few different conventions for naming special registers on that chip (similar to how certain registers on the Itanium might have different names depending on the tool being used).  Usually these register aliases are poorly documented – you just have to "know the lore".

    Also, I'd like to point out that the help file for the Debugging Tools for Windows package has (or maybe "had") really, really good Itanium information in the Debugging Techniques/Processor Architecture section.

    Also: a vote for the AXP series.  Hell, I'm pretty sure you'd write a really good ARM series if you put your mind to it (even if you haven't worked on that target yet).

  23. anonymouscommenter says:

    >> but then offer them a NEW VERSION with a higher version number

    Clearly, this is the crux of the problem.  New versioning scheme: Decrement!

  24. anonymouscommenter says:

    Also, I'd guess that the instruction pointer is named `iip` in the debugger because of this (from the Itanium Architecture Developer's Manual):

    "The processor maintains two instruction pointers for IA-32 instruction set references, EIP (32-bit effective address) and IP (a 64-bit virtual address equivalent to the Itanium instruction set IP)"

  25. anonymouscommenter says:

    Liked the article. Looking forward to part 2. Also another vote for the AXP series.

  26. anonymouscommenter says:

    > [Good points. I'll clarify in the article. -Raymond]

    Thanks.

    After the clarification, it is still unclear how using br0 for the return address is a Windows (as opposed to Itanium) convention, without delving further into Itanium-specific br.call and br.ret semantics. [I looked in a book and now I understand.] But the fact that the return address is automatically stored into the register removes loop hoisting as a valid reason for the callee to preserve this register.

    Also, from the caller’s point of view, the mere act of calling a subroutine trashes the return address register by setting it to a known value. The caller is therefore responsible for saving the original value if it will eventually return to its grand-caller. The caller is also likely to have a better way of knowing addresses in its own code than by analyzing the contents of the return address register after calling a different function.

    Thus, I see no reason to require that the callee preserve rp.

  27. anonymouscommenter says:

    Related reading: blogs.msdn.com/…/58199.aspx

  28. anonymouscommenter says:

    "r0 Reads as zero (writes will fault)"

    An interesting design decision. When I went to school we learned in such an architecture that has a zero register, writing to it is usually how you throw away the result. So "test r1, r2" would be expressed as "and r0, r1, r2". Not in this case though.

  29. anonymouscommenter says:

    The fact that Windows NT's first 64-bit architecture was Itanium and not x86-64 has left its mark on the code.  WOW64, the engine behind running x86-32 programs on 64-bit Windows NT, was originally written for Itanium.  WOW64 was thus originally written to run x86-32 programs on Itanium.

    Older Itanium chips had built-in hardware assistance for x86 emulation, but later ones did not, after it was found that plain software emulation could be faster.  WOW64 for Itanium works by having a software emulation layer to run x86-32 programs.  Unlike other operating systems, however, the thunking from 32-bit system calls to 64-bit in Windows NT is done almost entirely in user mode.  Linux and Darwin (Mac OS X) do it in kernel mode, for example.

    As a result of this, WOW64 is written to have a CPU emulation layer.  WOW64 consists of three DLLs: wow64.dll, wow64win.dll, and wow64cpu.dll.  wow64.dll is the portion of WOW64 that is primarily responsible for translating 32-bit system calls into 64-bit system calls.  wow64win.dll is the portion that does the same for windowing and GDI (the NtUser* system calls).  wow64cpu.dll is the CPU emulator, which is tasked with actually running the 32-bit applications.

    Due to the history, wow64cpu.dll still exists in x86-64.  It is still responsible for emulating the x86-32 software, like it was on Itanium.  Obviously, on x86-64, it has quite a hardware assist with running x86-32 code: it runs directly.  So x86-64 wow64cpu.dll's calls to do the x86-32 emulation directly jump to the x86-32 code.  To exit emulation upon a system call, the x86-32 ntdll.dll jumps directly to x86-64 code inside wow64cpu.dll that suspends the "emulator" and returns to wow64.dll for system call dispatch.

    Just wanted to give some WOW64 context. =^-^=

  30. JM says:

    The story of Itanium is a sad one. I can only imagine the terrible disappointment on the part of the engineers who designed it when their 2.0 processor, with a neat, clean architecture free of all the mistakes of the past, was overtaken by a boring, predictable 64-bit extension to the original instruction set — designed by the *competitor*, no less. "Good enough" wins again.

    With the benefit of hindsight, it's easy to say the Itanium was the victim of its own ambition: a completely new architecture that left it to compiler writers to fully exploit, and that didn't come with a great story for leveraging your existing software (since the architecture itself was incompatible and emulation was hopelessly slow). It had to hit it out of the park on the first go, and it didn't. Still, as far as failures go, Itanium was a courageous one. I look forward to learning more about it. Also, another vote for a similar series on the Alpha; Alpha machines have only ever been mythical creatures for me. :-)

  31. anonymouscommenter says:

    I used to work at intel, and specifically on Itanium validation. I swear I used to know what iip meant.

  32. anonymouscommenter says:

    mikeb: IIP stands for Interruption Instruction Pointer. It points to a 16-byte-aligned bundle of 3 instructions. The value of the IP gets copied to IIP whenever execution is being interrupted by something like an exception, break point, etc.

    I think the debugger wanted to reserve IP as a pseudo-register for the actual instruction being executed, so it uses the IIP.

  33. anonymouscommenter says:

    Thank you for this article. I've done very little asm in the past, and this is quite over my head, but all that means is that I'll have to read it a couple times before it starts to make sense, which I don't see as a problem. I also vote for the AXP series.

  34. anonymouscommenter says:

    I had a small hand in the Itanium back end for GCC, lo these many years ago, working under contract for HP.

    The lesson I took away from it, as a compiler guy, was that for general-purpose integer work, ahead-of-time scheduling *cannot* hope to compete with out-of-order execution in the CPU, because memory access latencies and branch destinations are just too unpredictable.  All of that lovely explicit parallelism was being blown on NOPs and load stalls.

    (Version 1 of the silicon was also ridiculously power-hungry for the performance.  My little contracting shop had to move all its computers to a beefier data center because the HVAC at the place we started couldn't handle the Itaniums.)

    It's my understanding that modern GPU architectures look suspiciously similar to the Itanium — which works for them, because the programmer understands they have to think about what can execute efficiently on something like that.

  35. anonymouscommenter says:

    Can't wait for your overview of the Alpha.

  36. anonymouscommenter says:

    Just reading this post's title got me excited. Looking forward to more!

  37. http://www.intel.com/…/itanium-architecture-vol-1-2-3-4-reference-set-manual.pdf section 3.3.5.3 says that iip stands for Interruption Instruction Bundle Pointer.

  38. anonymouscommenter says:

    @JM: Doubtless there were some iAPX-432 engineers left on whose shoulders the Itanium engineers could shed their tears. en.wikipedia.org/…/Intel_iAPX_432

    @Raymond: Thank you for this series.

  39. anonymouscommenter says:

    @McBucket: but, frankly, the 432 sounds terrible even on paper — bit-aligned variable length instructions? Ada as your primary programming language, complete with your own OS? There's ambition and then there's going overboard. There are definitely parallels in that the 432 too relied on the compiler to work magic. In the case of the 432 the compiler was perversely inefficient. In the case of Itanium, it just proved too hard to make the compiler do all the optimization. But the Itanium at least sounds like something you could get to work in theory, while the 432 sounds like a lead balloon.

    Thanks for reminding me about this shambling horror, though. I know I read about it once, but I'd completely forgotten about it.

  40. anonymouscommenter says:

    I'm excited for this series! Keep em coming!

  41. IanBoyd says:

    I remember seeing the (hindsight) complaint that the Itanium was difficult to create compilers for. It wasn't until now that i understood why that would be.

    Only after watching Eric Brumer's excellent talks on compilers and modern processors, as well as Herb Sutter's atomic<> Weapons talk, did i appreciate how much magic the x86/x64 CPU does.

  42. anonymouscommenter says:

    +1 on excited for this and excited for a series on the Alpha!

  43. cheong00 says:

    I'm interested to hear whether Windows has built special code to handle what is executed with IA32-EL differently, or is that also transparent to the operating system.

  44. anonymouscommenter says:

    IA64 was a big deal for NT… there was just no software, not even from Microsoft.  Around the same time (NT4 SP3), NT was dropped for the Alpha. With enough RAM, it would smoke a P4, IMHO. Looking forward to the rest of the article. Ty

  45. anonymouscommenter says:

    @ Mr Bucket, JM

    i432 did go overboard but System/38 (later AS/400) did survive aiming for similar capabilities. They should've made it compatible with popular stuff, though, instead of Ada among other things. I think they did a good job with their i960 that trimmed i432:

    en.wikipedia.org/…/Intel_i960

    Take out the weird BiiN stuff, make it C/UNIX compatible, and what you have left is a sophisticated RISC chip. Its error handling, high availability, and object protection would be useful today. Matter of fact, there were designs then (eg NonStop) and now (eg HISC, CHERI) with some but not all properties. Had they marketed it right, the i960 might have made it and would today be 8-16 cores with blah blah acceleration. I'd be coding on it. :)

    It's still in production minus my favorite features, as far as I can see. I suggested they open source and license it for a tiny fraction of MIPS or ARM. Independents or academics might do something with it. No action on that. I'd like the MX model, myself.

  46. anonymouscommenter says:

    Thank you for the article – I look forward to the follow-ons!

    You mentioned you don't intend to cover the floating point side of things as they aren't exciting, but that's an area I would find very interesting.  At the time, I remember reading that the Itanium opcodes could shove three parallel instructions into 128 bits with a tag bit of some sort to indicate following sets of 3 instructions could also be executed in parallel.  I think the idea was that as they could afford to stuff more ALUs onto the chip, parallel code could scale transparently (without recompilation on a newer compiler).  For numerical (floating point) work, this seems brilliant.  SSE and AVX can mostly fill that niche now, but you mostly have to do the exact same operations across the vector, and in order to scale, Intel has to create bigger SIMD registers (AVX-512).

    I always thought it was a shame Itanium died in the market, largely from (what I believed was) trash-talking articles drawing attention to how poorly it emulated legacy x86.  Regardless, the public glommed on to that and rejected it like sheep (people still call it the Itanic without even knowing much about it).

    I wonder whether, if Intel marketing had not tried to keep it backwards compatible, they could've dodged all the complaints about running it in that mode.  My guess is that there aren't enough people doing scientific programming to justify the architecture.  Integer ops and 3D vectors rule the day in web-serving and gaming markets…

    I also look forward to your Alpha articles.  I used plenty of those in the day, and when everything switched to x86, a lot of our (naive) software ran slower until we tailored it for x86 (simple code ran fast enough on the Alphas).  Price per performance ended up winning in the end, and now we use a desktop architecture for numerical algorithms.

  47. anonymouscommenter says:

    This is fascinating.  And with 20/20 hindsight, I can see why they thought this was a good way forward, and also why x86 beat it.  Keep it coming!

  48. anonymouscommenter says:

    @Zack wrote:

    > I had a small hand in the Itanium back end for GCC

    > The lesson I took away from it, as a compiler guy, was that for general-purpose integer work, ahead-of-time scheduling *cannot* hope to compete with out-of-order execution in the CPU, because memory access latencies and branch destinations are just too unpredictable. All of that lovely explicit parallelism was being blown on NOPs and load stalls.

    As I'm sure you know, around the same time, several GCC forks were consolidated into EGCS to try and get good performance out of regular old x86 – eventually replacing the old GCC.  It is easy for me to believe that GCC's original design wasn't very accommodating to anything other than traditional RISC processors, and that EPIC might have required a similar effort from GCC developers to be viable.

    Besides, who's to say that Itanium couldn't have evolved to do speculative/out-of-order execution (and register renaming, etc…) as well?  Just because the compiler has more freedom to be explicit doesn't mean the CPU can't infer additional optimizations.  In some imaginary world (one we don't live in), those no-ops would've been discarded upon translation to the icache.  It's hard for me to see how an instruction set with more registers and the ability to indicate more information to the CPU "*cannot* hope to compete" with a CPU which has neither.

  49. anonymouscommenter says:

    Wow…. THAT many registers?

    How in the world do you handle context switches for that many registers? I thought the original i860 was eventually rejected for NT ("N-Ten") because it just couldn't switch its many registers fast enough. But this thing has a lot more than the 860 did! How does an Itanium context switch even work?

  50. anonymouscommenter says:

    @Mark: the reason people took Itanium to task for not doing x86 well was also because Intel itself envisioned Itanium as the replacement for x86 everywhere. When it became clear the global domination wasn't happening any time soon Intel backpedaled quickly, positioning the Itanium as purely a server architecture instead with no designs on the desktop market. Even then it could have worked, but Intel failed to make Itanium compelling quickly enough. x86 performance aside, even native Itanium code wasn't outperforming the competition. Intel did improve it, but they just couldn't catch up. When AMD64 came out and gave people what they wanted, rather than what Intel thought they needed, it was basically game over. Itanium had some interesting ideas, but it dramatically failed to capitalize on them.

  51. anonymouscommenter says:

    Please keep going with this series. I vote +1 for the Alpa AXP, too!

  52. Azarien says:

    This is insane…

    And I'll be boring, but I'm more interested in ARM than AXP…

  53. anonymouscommenter says:

    @Andre

    Relative to the predicate registers, my first thought upon seeing "p0 Reads as true" was to ask why wasn't p0 set up to always read as false and p1 always read as true. Further reflection revealed this wouldn't make any sense if there isn't a direct way to "test false and do/branch". But as I know virtually nothing about the Itanium instruction set, this is just guessing on my part.

  54. anonymouscommenter says:

    I'm going to greatly enjoy this series. I often wish I had a job that requires knowing information like this.

    Another +1 for AXP too

  55. anonymouscommenter says:

    @Mark:

    I don't remember which version of GCC first shipped an ia64 back end. I worked on it a few years after the branches were reunified — circa 2002, so that would've been 3.1 or 3.2. Someone else had already done most of the heavy lifting, I just did fine-tuning and debugging. (For clarity, also, I haven't been involved with GCC development since about 2006.)

    It is true that the architecture GCC had at the time was a handicap; in particular, the ia64 back end disabled the machine-independent scheduler and reimplemented it from scratch, because the MI scheduler (at the time) was not capable of bundle-packing correctly. However, that was not the biggest hindrance to performance. Rather, it was that there just isn't enough _statically visible_ ILP in "general purpose integer code" to fill up the bundles efficiently. (Think SPECint rather than SPECfp; alternatively, think "gigantic white-elephant C++ application, in which a supermajority of all instructions executed are virtual method calls".) Out-of-order hardware will always be able to squeeze more ILP out of this kind of code … I think probably 75% because of dynamic branch prediction and 25% because it doesn't have to conservatively estimate memory access latencies. Both involve information that isn't available to an ahead-of-time compiler, no matter how clever it is.

    I can imagine an architecture not unlike what Transmeta (remember Transmeta?) was going for, in which out-of-order execution is implemented in software. I suspect that couldn't beat dedicated OOO hardware on speed or power efficiency, but it might make the hardware simpler enough to be worth the hit. (Disclaimer: I know bupkis about hardware design.) I can also imagine an evolved Itanium with out-of-order execution that's competitive on generic integer code, but I think you could get there easier if you started from something like Alpha or MMIX. Which HP had the opportunity to do, come to think of it, didn't they buy the Alpha design IP when they bought Compaq? Perhaps that is what AArch64 will become — I haven't looked at it in detail.

  56. anonymouscommenter says:

    @JM:

    Absolutely agree: the inclusion of an x86 compatibility mode in Itanium was 100% a mistake, as was attempting to position it from day one as The One Architecture To Rule Them All.  Better to have introduced it as a high-end server CPU with no backward compatibility goop, and moved downmarket after it had demonstrated enough momentum to warrant ongoing R&D resources.

  57. anonymouscommenter says:

    @IanBoyd Not only in hindsight. I remember being at an Itanium session at JavaOne in 1998 and thinking it very ironic that a processor so actively hostile to JIT compilation was being presented at a Java conference.

  58. anonymouscommenter says:

    Very interesting article and please do a series on the Alpha too.

  59. anonymouscommenter says:

    JeroenFrijters: My first thought upon seeing the architecture was also how painful it must be to JIT. All that analysis the compiler must do would surely make any JIT horribly slow to start up.

    But then as I read more about the features of IA64, I thought many of them seem impossible for a compiler to use because they require runtime analysis. And then I realized — those features are custom-made for a JIT compiler!

  60. anonymouscommenter says:

    @JD: (regarding i860)

    Word of mouth is that the problem was interrupt processing.  Interrupt processing on the i860 required significant, complex code to save the state that was interrupted and then resume it when the interrupt was completed.

    It's possible that the memory bandwidth wasn't adequate for context switching but I don't remember ever hearing about that.

    Interestingly in very high performance workloads, interrupt processing is again too slow so many techniques are applied to mitigate high interrupt rates which probably would have made the ghastly amount of decode required for i860 interrupts tolerable.

  61. anonymouscommenter says:

    I love this sort of stuff.  Also interested in the Alpha AXP if you decide to do it.

  62. anonymouscommenter says:

    I would like to see a PA-RISC overview. Commodore were going to put PA-RISCs into the next-gen Amigas back in the day, the ones that never came out and were going to be Windows NT based :/

Comments are closed.