The Itanium processor, part 6: Calculating conditionals


The Itanium does not have a flags register. A flags register creates implicit dependencies between instructions, which runs contrary to the highly parallel model the Itanium was designed for. Instead of implicitly setting a register after computations, the Itanium has explicit comparison operations that put the comparison result into dedicated predicate registers.

Here's a simple fragment that performs some operation if two registers are equal.

        cmp.eq p6, p7 = r32, r33 ;;
(p6)    something

The cmp instruction compares two values and sets the two specified predicate registers as follows:

  • p6 is true if the values satisfy the condition, or false if they do not satisfy the condition.

  • p7 is set to the opposite of p6

The comparison operation generates two results, one which holds the nominal result and one which holds the opposite. This lets you conditionalize both sides of a branch.

        cmp.eq p6, p7 = r32, r33 ;;
(p6)    something // executes if they are equal
(p7)    something // executes if they are not equal

There is also a cmp4 instruction which compares two 32-bit values, in which case only the least-significant 32 bits participate in the comparison.

The comparands can be either two registers or an immediate and a register. The immediate is an 8-bit sign-extended value, though the final value may be interpreted as unsigned depending on the comparison type.

There are three comparison types:

type meaning
eq equality
lt signed less than
ltu unsigned less than

The first destination predicate register receives result of the test, and the second gets the opposite of the result.

These are the only comparisons you will see in disassembly, but the compiler can manufacture other types of comparisons. For example, if the compiler wants to perform a ge comparison, it can just do a lt comparison and flip the order of the two predicates.

More generally, the compiler can synthesize the other integer comparisons as follows:

imaginary opcode meaning synthesized as
cmp.ne p, q = a, b not equal cmp.eq q, p = a, b
cmp.ge p, q = a, b signed greater than or equal cmp.lt q, p = a, b
cmp.gt p, q = a, b signed greater than cmp.lt p, q = b, a if a is a register
cmp.lt q, p = a − 1, b if a is an immediate
cmp.le p, q = a, b signed less than or equal cmp.lt q, p = b, a if a is a register
cmp.lt p, q = a − 1, b if a is an immediate
cmp.geu p, q = a, b unsigned greater than or equal cmp.ltu q, p = a, b
cmp.gtu p, q = a, b unsigned greater than cmp.ltu p, q = b, a if a is a register
cmp.ltu q, p = a − 1, b if a is an immediate
cmp.leu p, q = a, b unsigned less than or equal cmp.ltu q, p = b, a if a is a register
cmp.ltu p, q = a − 1, b if a is an immediate

These syntheses rely on the identities

x > y  ⇔  y < x
xy  ⇔  ¬(x > y)
xy  ⇔  x − 1 < y for integers x and y, assuming no overflow
xy  ⇔  yx

The next level of complexity is the parallel comparisons. These perform a comparison and combine the result with the values already in the destination predicates.

opcode meaning really
cmp.xx.or p, q = a, b p = p || (a xx b)
q = q || (a xx b)
if (a xx b) then p = q = true
cmp.xx.orcm p, q = a, b p = p || ¬(a xx b)
q = q || ¬(a xx b)
if ¬(a xx b) then p = q = true
cmp.xx.and p, q = a, b p = p && (a xx b)
q = q && (a xx b)
if ¬(a xx b) then p = q = false
cmp.xx.andcm p, q = a, b p = p && ¬(a xx b)
q = q && ¬(a xx b)
if (a xx b) then p = q = false
cmp.xx.or.andcm p, q = a, b p = p || (a xx b)
q = q && ¬(a xx b)
if (a xx b) then p = true, q = false
cmp.xx.and.orcm p, q = a, b p = p && (a xx b)
q = q || ¬(a xx b)
if ¬(a xx b) then p = false, q = true

The meaning column describes how it is convenient to think of the operations, but the really column describes how they actually work.

The orcm and andcm versions take the complement of the comparison, which is handy because some of the synthesized comparisons involve taking the opposite of the specified result.

These parallel comparisons get their name because they are designed to have multiple copies executed in parallel. Consequently, they are an exception to the general rule that you can write to a register only once per instruction group. If all writes to a predicate register are AND-like (i.e., and or andcm) or all writes are OR-like (i.e., or or orcm), then the writes are allowed to coexist within a single instruction group. (This is where the actually column comes in handy. You can see that all AND-like operations either do nothing or set the predicate to false, and that all OR-like operations either do nothing or set the predicate to true. That's why they can run in parallel: If multiple conditions pass, they all do the same thing, so it doesn't matter which one goes first.)

Executing them in parallel lets you perform multiple tests in a single cycle. For example:

 x =  ... calculate x ...;
 y =  ... calculate y ...;
 z =  ... calculate z ...;
 if (x == 0 || y == 0 || z == 0) {
   something_is_zero;
 } else {
   all_are_nonzero;
 }

could be compiled as

        ... calculate x in r29 ...
        ... calculate y in r30 ...
        ... calculate z in r31 ...
        cmp.eq p6, p7 = +1, r0 ;; // set p6 = false, p7 = true

        cmp.eq.or.andcm p6, p7 = r29, r0 // p6 = p6 || x == 0
                                         // p7 = p7 && x != 0
        cmp.eq.or.andcm p6, p7 = r30, r0 // p6 = p6 || y == 0
                                         // p7 = p7 && y != 0
        cmp.eq.or.andcm p6, p7 = r31, r0 ;; // p6 = p6 || z == 0
                                         // p7 = p7 && z != 0

(p6)    something_is_zero
(p7)    all_are_nonzero

First, we calculate the values of x, y and x. At the same time, we prime the parallel comparison: we compare the constant +1 against register r0, which is the hard-coded zero register. This comparison always fails, so we set p6 to false and p7 to true.

Now we perform the three comparisons in parallel. We check if r29, r30, and r31 are zero. If any of them is zero, then p6 becomes true and p7 becomes false. If all are nonzero, then nothing changes, so p6 stays false and p7 stays true.

Finally, we act on the calculated predicates.

Notice that the parallel comparison lets us calculate and combine all the parts of the test in a single cycle. In a flags-based architecture, we would have to perform a comparison, test the result, then perform another comparison, test the result, then perform the last comparison, and test the result one last time. That's a sequence of six dependent operations, which is difficult to parallelize. (And most likely consume three branch prediction slots instead of just one.)

The last wrinkle in the comparison instructions is the so-called unconditional comparison. This special instruction violates the rule that a predicated instruction has no effect if the predicate is false.

(qp)    cmp.xx.unc p, q = r, s

Even though there is a qualifying predicate, this comparison is executed unconditionally (as indicated by the unc suffix). The behavior of an unconditional comparison is

p = qp && (r xx s)
p = qp && ¬(r xx s)

In other words, if the qualifying predicate is true, then the instruction behaves as normal. But if the qualifying predicate is false, then the result of the comparison is considered false for all branches, regardless of the actual test.

This formulation is handy when you are nesting predicates. Consider:

 x =  ... calculate x ...;
 y =  ... calculate y ...;
 if (x == 0) {
  x_is_zero;
 } else {
  x_is_nonzero;
  if (y == 0) {
   x_is_nonzero_and_y_is_zero;
  } else {
   both_are_nonzero;
  }
 }

This can be compiled like this:

        ... calculate x in r30 ...
        ... calculate y in r31 ...

        cmp.eq p6, p7 = r30, r0 ;;
(p6)    x_is_zero
(p7)    x_is_nonzero
(p7)    cmp.eq.unc p8, p9 = r31, r0 ;;
(p8)    x_is_nonzero_and_y_is_zero
(p9)    both_are_nonzero

After calculating x and y, we check whether x is zero. If it is, then we execute x_is_zero. If not, then we execute x_is_nonzero. Next, we check whether y is zero, and we do so via an unconditional comparison. That way, if we are in the case that x is zero, then both p8 and p9 are set to false. Now we can use p8 and p9 to select between the final two branches. (Or if x is zero, neither gets selected.)

We'll see later that the unconditional comparison is also useful in register rotation.

So that's a quick tour of the Itanium conditional instructions. Next time, we'll start looking at speculation.

Comments (10)
  1. Gabe says:

    Typo: "First, we calculate the values of x, y and x"

    Should be "First, we calculate the values of x, y and z".

  2. Oxford says:

    Actually, it should be "First, we calculate the values of x, y, and z".

  3. kantos says:

    I get the general feeling that if they hadn't bothered with the compiler driven OOO execution the concepts of Itanium make for an awesome architechture

  4. Medinoc says:

    So the "parallel" comparisons are exceptions to the rule that both predicates are set to opposite values?

  5. Anonymous says:

    In the cmp.xx.unc  table is "p = qp && ¬(r xx s)" a typo?  I think this should be q = ...

  6. Julie Koubova says:

    I love this series, I've never really used anything else than x86 and it really puts things into perspective, seeing how different other platforms can be :) Thank you, Raymond!

  7. boogalooo says:

    @kantos You probably wouldn't have been able to implement OOO in hardware and include everything else that Itanium had. Itanium was one of Intels more successful follies. Hopefully the better ideas turned up in their x64 backend

  8. Neil says:

    The "really" column is the same as the "meaning" column with operands reversed i.e. p = (a xx b) || p etc.

  9. boogaloo says:

    @Neil What is your point? The difference between the "meaning" and the "really" is that the "meaning" column states that p & q are always updated. The "really" column states that p & q are only updated in one outcome of the evaluation, in the other the register is not updated at all. This means that as long as instructions are only going to set or clear the registers then they can be run at the same time. If one could set and the next could clear then they cannot be run at the same time, they have to be run sequentially. It makes no difference to you as the programmer if p is clear and the comparison is not going to set it, but to hardware it makes a difference.

  10. Zack says:

    I was pretty harsh on the memory addressing last week, so I feel I ought to say that the comparison instructions and predicate registers are IMNSHO something Itanium got *right* and it's too bad no one else has picked up on them.  (I only know about two greenfield ISA designs since: AArch64 and RISC-V.  AArch64 appears to have a traditional, single flags register, and RISC-V has fused compare-and-branch instructions.  If there's a design I've missed I'd be glad to hear about it.)

    If you have an OOO core, conditional *execution* of most non-branch instructions turns out to be a net lose, because it wastes fetch and decode bandwidth, and it doesn't give the branch predictor a chance to work its magic.  However, I think a hybrid design with Itanium-like comparisons and predicates, but a limited set of conditional instructions (branch, reg-reg move, set-to-immediate, and maybe load/store) would work out well.

    (Also, unconditional load/store of single predicate registers to RAM would be convenient.  I don't remember if Itanium had that.)

Comments are closed.

Skip to main content