The Itanium processor, part 2: Instruction encoding, templates, and stops

Instructions on Itanium are grouped into chunks of three, known as bundles, and each of the three positions in a bundle is known as a slot. A bundle is 128 bits long (16 bytes) and always resides on a 16-byte boundary, so the last hexadecimal digit of its address is always zero. The Windows debugging engine disassembler shows the three slots as if they were at offsets 0, 4, and 8 in the bundle, but in reality they are all crammed together into one bundle.

You cannot jump into the middle of a bundle.
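To make the addressing concrete, here is a minimal Python sketch that splits an address, as the disassembler displays it, into the bundle's base address and a slot number. The function names are hypothetical, and the mapping of offsets 0, 4, and 8 to slots 0, 1, and 2 is simply the display convention described above, not a debugger API.

```python
BUNDLE_SIZE = 16  # a bundle is 128 bits = 16 bytes


def split_display_address(addr):
    """Return (bundle_base, slot) for an address as the disassembler shows it.

    Assumes the display convention from the text: slots appear at
    offsets 0, 4, and 8 within the 16-byte bundle.
    """
    base = addr & ~(BUNDLE_SIZE - 1)   # clear the low four bits
    offset = addr - base
    if offset not in (0, 4, 8):
        raise ValueError(f"offset {offset} does not name a slot")
    return base, offset // 4


def is_valid_jump_target(addr):
    """A jump must land on the start of a bundle, never in the middle."""
    return addr % BUNDLE_SIZE == 0
```

For example, `split_display_address(0x1008)` yields base `0x1000` and slot 2, while `is_valid_jump_target(0x1008)` is false because that address is in the middle of a bundle.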

Now, you can't just put any old instruction into any old slot. There are 32 bundle templates, and each has different rules about what types of instructions it can accept and the dependencies between the slots. For example, the bundle template MII allows a memory access instruction in slot 0, an integer instruction in slot 1, and another integer instruction in slot 2.

(Math: Each slot is 41 bits wide, so 123 bits are required to encode the slots. Add five bits for encoding the template, and you get 128 bits for the entire bundle.)¹
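The arithmetic can be sketched in Python by packing and unpacking a bundle as a 128-bit integer. The field order used here (template in the low five bits, followed by slots 0, 1, and 2, with the bundle stored little-endian in memory) is an assumption that is not stated in this article, and the function names are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
assert TEMPLATE_BITS + 3 * SLOT_BITS == 128  # 5 + 123 = 128


def encode_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into 16 bytes.

    Assumed layout: template in bits 0-4, then slot 0, slot 1, slot 2.
    """
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle has exactly three slots")
    value = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
        value |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return value.to_bytes(16, "little")


def decode_bundle(raw):
    """Split 16 bytes into (template, [slot0, slot1, slot2])."""
    if len(raw) != 16:
        raise ValueError("a bundle is exactly 16 bytes")
    value = int.from_bytes(raw, "little")
    template = value & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (value >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(3)
    ]
    return template, slots
```

Round-tripping a bundle through `encode_bundle` and `decode_bundle` returns the original template and slot values, which confirms that the three 41-bit fields and the 5-bit template exactly fill the 16 bytes.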

The slot types are

  • M = memory or move
  • I = complex integer or multimedia
  • A = simple arithmetic, bit logic, or multimedia
  • F = floating point or SIMD
  • B = branch

Some instructions can be used in multiple slot types, and the disassembler will stick a suffix (known as a completer) to disambiguate them. For example, there are five different nop instructions, one for each slot type: nop.m, nop.i, nop.a, nop.f, and nop.b. When reading code, you don't need to worry too much about slotting. You can assume that the compiler did it correctly; otherwise it wouldn't have disassembled properly! (For the remainder of this series, I will tend to omit completers if their sole purpose is to disambiguate a slot type.)

If you are debugging unoptimized code, you may very well see a lot of nops because the compiler didn't bother trying to optimize slot usage.

Another thing that bundles encode is the placement of what are known as stops. A stop is used to indicate that the instructions after the stop depend on instructions before the stop. For example, if you had the following sequence of instructions

    mov r3 = r2
    add r1 = r2, r4 ;;
    add r2 = r1, r3

there is no dependency between the first two instructions; they can execute in parallel. However, the third instruction cannot execute until the first two have completed. The compiler therefore inserts a stop after the second instruction, which is represented by a double-semicolon.

A sequence of instructions without any stops is known as an instruction group. (There are other things that can end an instruction group, but they aren't important here.) As noted above, the instructions in an instruction group may not have any dependencies among them. This allows the processor to execute them in parallel. (This is an example of how the processor relies on the compiler: By making it the compiler's responsibility to ensure that there are no dependencies within an instruction group, the processor can avoid having to do its own dependency analysis.)
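As a rough illustration of the bookkeeping the compiler has to do, the following Python sketch splits a listing into instruction groups at each `;;` and reports any instruction that reads a register written earlier in the same group. It is a toy: the parser is hypothetical, handles only the simple `op dst = src, src` form used in the example above, and knows nothing about the other things that can end an instruction group.

```python
import re


def parse(instr):
    """Parse 'op dst = src1, src2' into (dst, [srcs])."""
    lhs, rhs = instr.split("=")
    dst = lhs.split()[-1]
    srcs = re.findall(r"r\d+", rhs)
    return dst, srcs


def split_groups(lines):
    """Split a listing into instruction groups, ending a group at each ';;'."""
    groups, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        has_stop = text.endswith(";;")
        if has_stop:
            text = text[:-2].rstrip()
        current.append(text)
        if has_stop:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def raw_violations(lines):
    """Return (producer, consumer) pairs that depend on each other
    within a single instruction group."""
    violations = []
    for group in split_groups(lines):
        written = {}  # register -> instruction that wrote it in this group
        for instr in group:
            dst, srcs = parse(instr)
            for src in srcs:
                if src in written:
                    violations.append((written[src], instr))
            written[dst] = instr
    return violations
```

Run on the three-instruction example with its stop, `raw_violations` finds nothing. Remove the `;;` and it reports that the final `add` depends on both earlier instructions, through `r1` and `r3`.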

There are some exceptions to the rule against having dependencies within an instruction group:

  • A branch instruction is allowed to depend on a predicate register and/or branch register set up earlier in the group.
  • You are allowed to use the result of a successful ld.c without an intervening stop. We'll learn more about ld.c when we discuss explicit speculation.
  • Comparison instructions .and, .andcm, .or, and .orcm are allowed to combine with others of the same type into the same targets. (In other words, you can combine two .ands, but not an .and and an .or.)
  • You are allowed to write to a register after a previous instruction reads it. (With rare exceptions.)
  • On the other hand, two instructions in the same group cannot write to the same register. (With the exception of the combined comparisons noted above.)
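The register rules in this list can be summarized with a small Python sketch that classifies the relationship between two instructions in the same group. It is deliberately simplified: it models only plain register reads and writes, and it ignores the special cases for branches, ld.c, and combined comparisons, as well as the rare exceptions mentioned above. The dictionary shape and function names are hypothetical.

```python
from itertools import combinations

# Within a group, writing a register after an earlier instruction
# has read it is permitted; the other two hazards are not.
ALLOWED = {"WAR"}


def hazards(earlier, later):
    """Return the hazards between two instructions.

    Each instruction is a dict with 'reads' and 'writes' sets of
    register names.
    """
    found = set()
    if earlier["writes"] & later["reads"]:
        found.add("RAW")  # read after write: a true dependency
    if earlier["reads"] & later["writes"]:
        found.add("WAR")  # write after read: allowed
    if earlier["writes"] & later["writes"]:
        found.add("WAW")  # two writes to the same register
    return found


def group_is_legal(group):
    """Check every ordered pair of instructions in one group."""
    for earlier, later in combinations(group, 2):
        if hazards(earlier, later) - ALLOWED:
            return False
    return True
```

For instance, `mov r3 = r2` followed by `add r2 = r1, r4` is only a write-after-read on `r2`, so the pair can share a group, whereas two instructions that both write `r1` cannot.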

There are a lot of fine details in the rules, but I'm ignoring them because they are of interest primarily to compiler-writers. The above rules are to give you a general idea of the sorts of dependencies that can exist within an instruction group. (Answer: Not much.)

It does highlight that writing ia64 assembly by hand is exceedingly difficult because you have to make sure every triplet of instructions you write matches a valid template in terms of slots and stops, and you have to ensure that the instruction groups do not break the rules.

Next time, we'll look at the calling convention.

¹ There are two templates which are special in that they encode only two slots rather than three. The first slot is the normal 41 bits, but the second slot is a double-wide 82 bits. The double-wide slot is used by a few special-purpose instructions we will not get into.

Comments (27)
  1. Brian_EE says:

    >•You are allowed to write to a register after a previous instruction reads it. (With rare exceptions.)

    I read that and found it curious that you would explicitly state that, it seems obvious to me. But then thinking about a pure software person, he/she might wonder how that is possible if the instructions are executed in parallel. (I.e. how does it read and write at the same time?)

    It has to do with the setup/hold times on the flip-flops that make up the register bits. If the hold time is sufficiently shorter than the propagation delay through the flip-flop, then you can read the previous value of the bit at the same clock edge as you are writing a new value to the bit.

  2. Brian_EE says:

    Just to clarify… "If the hold time *of the destination register* is sufficiently shorter than the propagation delay through the *source register* flip-flop…."

  3. anonymouscommenter says:

    Ah, this sounds more like the Intel I know: lots of seemingly arbitrary restrictions on things to minimize orthogonality. :-)

    (I'm being deliberately mean, of course. I don't design hardware.)

    [News flash: Hardware is not orthogonal. You can't do floating point arithmetic in a MMU, or integer multiplication in a branch predictor. Remember, the idea here is to make scheduling decisions at compile time, not run time. -Raymond]
  4. Kevin says:

    So now, 11 years later, we *finally* learn how a processor could possibly have 41-bit instructions. …/229387.aspx

  5. anonymouscommenter says:

    Those 41-bit instructions mean it's nearly impossible for a human to read machine code directly (either in hex code or ASCII mojibake). In x86 and x86-64 many instructions are easily recognizable (with training).

  6. anonymouscommenter says:

    I'm sure if anybody had actually bothered to come up with all these rules and think about how to implement an efficient compiler for it before spending billions of dollars on implementing it, we could've avoided this train wreck and the whole industry would be better off for it.

  7. Eric says:

    @Voo – The compilers *are* efficient, given an appropriate workload. The problem with Itanium is that (outside of scientific computing) there aren't enough instructions per basic block to effectively fill the bundles. Besides, the history of computing is littered with failures. Lisp Machines, 5th Generation computers, Connection Machines, Gallium Arsenide – all took billions of dollars of investment and produced little or nothing in return.

  8. anonymouscommenter says:

    > It does highlight that writing ia64 assembly by hand is exceedingly difficult because you have to make sure every triplet of instructions you write matches a valid template in terms of slots and stops, and you have to ensure that the instruction groups do not break the rules.

    How did the Windows team handle this for stuff that is usually written in assembly like the bootloader and JIT compilers?

  9. anonymouscommenter says:

    @Raymond: an orthogonal design would be bundles of five instructions, one per unit (actually, as I understand it the Itanium has four units, but let's say a virtual fifth one is necessary for the ALU/non-ALU distinction), each with their own instruction set, and dependency rules as a separate layer. Instead there are 32 templates that specify some combination of units and dependencies.

    Alright, so "orthogonal" isn't the right word for that. And please don't explain how the templates make sense and how a design with 5 slots would be even less efficient and produce even more nops and stops than the current design, not to mention the effort the chip would have to expend to check validity — I get that. The templates effectively act as a form of compression to cut down on the enormous amount of impossible/unproductive combinations you could otherwise encode, like, I don't know, an instruction set is supposed to do.

    Jeez, it's like you can't even make ignorant, baseless accusations in blog comments anymore! What's the world coming to. :-)

  10. anonymouscommenter says:

    Ah, this is one of those rare occasions where we can see when Raymond actually wrote this blog post. The last revision of that linked Wikipedia article having the anchor target "Sizes" is from 2013:…/index.php :)

  11. anonymouscommenter says:

    Out of curiosity, what would happen if you omitted the stop in the example, like this:

        mov r3 = r2
        add r1 = r2, r4
        add r2 = r1, r3

    That's the sort of thing I'd test out for myself, if I had an IA64 box to play with.

  12. anonymouscommenter says:

    @Eric Last time I checked (admittedly that was when Itanium was still relevant, so about a decade ago), compilers produced code that was a far, far cry from optimal due to all the complexity of the ISA. Sure with the given workloads and highly optimized code Itanium could be tremendously efficient, problem is those two conditions were hardly ever true (and don't get me started on power efficiency).

    And while LISP machines might not have been much of a mainstream success (sad, but not unexpected), they gave the industry lots of new, innovative ideas to play around with – I just don't see which cool features of Itanium we'll look back on in 20 years and miss.

    @Sebastiaan Dammann: Speaking from experience, nobody would ever write a JIT compiler – which is in the end nothing but an optimizing compiler with some additional information and tighter time limits for compilation – in assembly. Yes you might have to write some small parts in assembly or at least use assembly fragments, but that's only a tiny amount of code.

    No idea about operating systems (I only ever wrote toy ones), but I wager you can get quite far with using C and inline asm instructions when needing to access special registers.

  13. anonymouscommenter says:

    @mokomull: I'm led to believe that your code now behaves differently depending on when interrupts fire.

    @Sebastiaan Damman: I don't know what the Windows team did, but my solution would be to make the assembler validate it.

  14. anonymouscommenter says:

    I guess this would be great, if one could trust the compiler to never make a mistake!  And with something this complicated, it would be a long time before that trust was earned, and a long time after that before the generated code was optimal enough to beat good ole x86!

  15. anonymouscommenter says:

    To me, it's plausible that read-before-write is allowed within an instruction group simply because I know that processors are pipelined: Computing the result of an operation takes several clock cycles, so even with parallel or slightly delayed execution, the read operations will easily occur before the write operations.

    The other possibility, write-before-read, is the one that causes problems (stalls or expensive feedback path logic) with pipelined designs.

  16. anonymouscommenter says:

    Just pencilling it out, there are five instruction types and three slots, and in principle each slot _could_ need a stop after it, so there are exactly 1000 possible combinations; if the instruction encoding actually permitted all of them it would've required ten bits, not five, to encode that.  Conveniently, 24 encodings would be left over for those double-wide bundles.  And one extra bit, because 118 is not a multiple of 3, but 117 is.  And now the instructions themselves are only 39 bits wide.  I have no recollection of how much leftover encoding space there was in the instructions themselves.

    (Wasn't one of the double-wide instructions a 64-bit load immediate? That's not all that special purpose.)

  17. anonymouscommenter says:

    Discussions have been going on since at least the late 50's/early 60's regarding how much work developers should depend on compilers to do on their behalf. I remember when business programs migrated away from ASM to a high level language. The old ASM guys would look at the code and laugh. In today's world, there are optimizing compilers that emit fantastic code.

    If one looks back in time, we had $5,000 per year programmers working on machines that cost millions of dollars. It made sense to optimize the machine and use as much highly skilled labor as was needed. In today's world, those economics are turned completely upside down. Often, the fully loaded cost of a developer for just 2 hours' worth of work exceeds the cost of the device the developer is targeting.

    From a pure cost/benefit analysis, today's economics argues strongly for putting as much "intelligence" into a compiler as is possible.

    The "gotcha" re: the Itanium processor is that it was caught in the middle of the economic shift from cheap developer/expensive machine to expensive developer/cheap machine.  In some ways, it was ahead of its time in terms of where we have landed today — not so much from an architectural standpoint — rather from an economic standpoint.

  18. anonymouscommenter says:

    Yesterday's post made me think ia64 would be fun to write assembly in, this post has convinced me otherwise.

    It is sort of sad, so many registers, so many fun features, locked behind a system that requires too many degrees of thinking to allow for ease of use.

    Even ISAs have a UX.

  19. Macrosofter says:

    >Next time, we'll look at the calling convention.

    While we wait for the next time, we can read previous Raymond's post about it:…/58199.aspx

  20. anonymouscommenter says:

    @12BitSlab huh? Itanium hardly was living in a world of cheap programmers. And for actual application developers it doesn't matter whether the compiler does extra work or the hardware. Heck since both compiler writers and hardware engineers earn rather similar wages the argument doesn't hold much water. Sure you want to get the machine that requires the least amount of application developer involvement to get the best performance out of it, but with all those high level languages there really aren't many (any? Can't think of one) specific things that apply to only one architecture either.

    The problem is that Itanium was – is – so complicated that no compiler produces anything close to optimal code, while die space has exploded, with hardware engineers trying to find good uses for it anyhow apart from putting more and more caches into the chip.

  21. anonymouscommenter says:

    Not on topic, but I've always wondered what Intel's own compiler writers thought of the chip. The company has a reputation for writing good optimising compilers for numerical applications on their own chips. I came across an article written in 2001 saying Intel was going to fund research into compilers to improve the performance of software running on the Itanium, and that the research would take a couple of years or so to yield results. This was around the same time as the chip was supposed to be on the market.…/0,,s2085832,00.html. In the meantime the AMD64 specification had already been released (according to Wikipedia), though actual chips had not yet been delivered.

  22. anonymouscommenter says:

    When you have a series of posts, please maintain links between them, eg jump from this page back to part 1.

    [I usually do it after the fact, because all of these items are autoposted, and the URL of part 1 isn't known at the time I write part 2. -Raymond]
  23. anonymouscommenter says:

    Zack: Raymond made a slight mistake — M, I, A, F, and B are really instruction types. Slot types are actually based on execution units. Since A-type (Integer ALU) instructions can be executed on both M-units and I-units, there is no A-slot.

    With only 4 different types of execution units, there are 512 possible combinations of units and stops. Once you remove the duplicates (MII is the same as IMI and IIM) you get 328 different templates. Adding in the double-wide instructions gives 344 different templates you could possibly define.

    Of those 344 possible templates, they chose 24 to implement, leaving 8 reserved for future use.

  24. anonymouscommenter says:

    This is so tightly integrated with the hardware design.

    I wonder what they can do to maintain binary compatibility across generations of processors.

    x86 uses microcode to translate instructions.

    But microcode will essentially nullify all benefit of IA64.

  25. anonymouscommenter says:

    Why were there three slots, and not two or ten? Was it tested experimentally (e.g. that MIII is far less common than MII, as is MI) or was this an early design decision ("128 bits is a nice round number, let's use this size") that had to be "coded around"?

    [You don't want too many slots, because that results in slot waste. You can't jump into the middle of a bundle, so all jump targets must be in slot zero, and all call instructions must be in the last slot. -Raymond]
  26. anonymouscommenter says:

    @Voo, "cheap programmers" is a relative term. The relation is cost_of_machine / cost_of_programmer. At the time the Itanium was in the initial planning stages, that ratio was still rather high for systems that could do real work. By the time the Itanium was an afterthought, that ratio had dropped dramatically.

  27. anonymouscommenter says:

    @DanielCheng: It is my understanding that this is one of the several reasons why Nvidia refuses to publish an architecture manual for their GPUs — when the next generation comes around they only want to have to rev the shader compiler.

Comments are closed.
