The Itanium does not have a flags register. A flags register creates implicit dependencies between instructions, which runs contrary to the highly parallel model the Itanium was designed for. Instead of implicitly setting a register after computations, the Itanium has explicit comparison operations that put the comparison result into dedicated predicate registers.
Here's a simple fragment that performs some operation if two registers are equal.
cmp.eq p6, p7 = r32, r33 ;; (p6) something
The cmp instruction compares two values and sets the two specified predicate registers as follows:
- p6 is true if the values satisfy the condition, or false if they do not satisfy the condition.
- p7 is set to the opposite of p6
The comparison operation generates two results, one which holds the nominal result and one which holds the opposite. This lets you conditionalize both sides of a branch.
cmp.eq p6, p7 = r32, r33 ;; (p6) something // executes if they are equal (p7) something // executes if they are not equal
There is also a cmp4 instruction which compares two 32-bit values, in which case only the least-significant 32 bits participate in the comparison.
The comparands can be either two registers or an immediate and a register. The immediate is an 8-bit sign-extended value, though the final value may be interpreted as unsigned depending on the comparison type.
There are three comparison types:
type | meaning |
---|---|
eq | equality |
lt | signed less than |
ltu | unsigned less than |
The first destination predicate register receives result of the test, and the second gets the opposite of the result.
These are the only comparisons you will see in disassembly, but the compiler can manufacture other types of comparisons. For example, if the compiler wants to perform a ge comparison, it can just do a lt comparison and flip the order of the two predicates.
More generally, the compiler can synthesize the other integer comparisons as follows:
imaginary opcode | meaning | synthesized as | |
---|---|---|---|
cmp.ne p, q = a, b | not equal | cmp.eq q, p = a, b | |
cmp.ge p, q = a, b | signed greater than or equal | cmp.lt q, p = a, b | |
cmp.gt p, q = a, b | signed greater than | cmp.lt p, q = b, a | if a is a register |
cmp.lt q, p = a − 1, b | if a is an immediate | ||
cmp.le p, q = a, b | signed less than or equal | cmp.lt q, p = b, a | if a is a register |
cmp.lt p, q = a − 1, b | if a is an immediate | ||
cmp.geu p, q = a, b | unsigned greater than or equal | cmp.ltu q, p = a, b | |
cmp.gtu p, q = a, b | unsigned greater than | cmp.ltu p, q = b, a | if a is a register |
cmp.ltu q, p = a − 1, b | if a is an immediate | ||
cmp.leu p, q = a, b | unsigned less than or equal | cmp.ltu q, p = b, a | if a is a register |
cmp.ltu p, q = a − 1, b | if a is an immediate |
These syntheses rely on the identities
x > y | ⇔ | y < x | |
x ≤ y | ⇔ | ¬(x > y) | |
x ≤ y | ⇔ | x − 1 < y | for integers x and y, assuming no overflow |
x ≥ y | ⇔ | y ≤ x |
The next level of complexity is the parallel comparisons. These perform a comparison and combine the result with the values already in the destination predicates.
opcode | meaning | really |
---|---|---|
cmp.xx.or p, q = a, b | p = p || (a xx b) q = q || (a xx b) |
if (a xx b) then p = q = true |
cmp.xx.orcm p, q = a, b | p = p || ¬(a xx b) q = q || ¬(a xx b) |
if ¬(a xx b) then p = q = true |
cmp.xx.and p, q = a, b | p = p && (a xx b) q = q && (a xx b) |
if ¬(a xx b) then p = q = false |
cmp.xx.andcm p, q = a, b | p = p && ¬(a xx b) q = q && ¬(a xx b) |
if (a xx b) then p = q = false |
cmp.xx.or.andcm p, q = a, b | p = p || (a xx b) q = q && ¬(a xx b) |
if (a xx b) then p = true, q = false |
cmp.xx.and.orcm p, q = a, b | p = p && (a xx b) q = q || ¬(a xx b) |
if ¬(a xx b) then p = false, q = true |
The meaning column describes how it is convenient to think of the operations, but the really column describes how they actually work.
The orcm
and andcm
versions take the
complement of the comparison,
which is handy because some of the synthesized comparisons involve
taking the opposite of the specified result.
These parallel comparisons get their name because they are
designed to have multiple copies executed in parallel.
Consequently, they are
an exception to the general rule that you can write to a register
only once per instruction group.
If all writes to a predicate register are AND-like
(i.e., and
or andcm
)
or all writes are OR-like
(i.e., or
or orcm
),
then the writes are allowed to coexist within a single instruction group.
(This is where the actually column comes in handy.
You can see that all AND-like operations either do nothing
or set the predicate to false,
and that all OR-like operations either do nothing
or set the predicate to true.
That's why they can run in parallel:
If multiple conditions pass, they all do the same thing,
so it doesn't matter which one goes first.)
Executing them in parallel lets you perform multiple tests in a single cycle. For example:
x = ... calculate x ...; y = ... calculate y ...; z = ... calculate z ...; if (x == 0 || y == 0 || z == 0) { something_is_zero; } else { all_are_nonzero; }
could be compiled as
... calculate x in r29 ... ... calculate y in r30 ... ... calculate z in r31 ... cmp.eq p6, p7 = +1, r0 ;; // set p6 = false, p7 = true cmp.eq.or.andcm p6, p7 = r29, r0 // p6 = p6 || x == 0 // p7 = p7 && x != 0 cmp.eq.or.andcm p6, p7 = r30, r0 // p6 = p6 || y == 0 // p7 = p7 && y != 0 cmp.eq.or.andcm p6, p7 = r31, r0 ;; // p6 = p6 || z == 0 // p7 = p7 && z != 0 (p6) something_is_zero (p7) all_are_nonzero
First, we calculate the values of x, y and x. At the same time, we prime the parallel comparison: we compare the constant +1 against register r0, which is the hard-coded zero register. This comparison always fails, so we set p6 to false and p7 to true.
Now we perform the three comparisons in parallel. We check if r29, r30, and r31 are zero. If any of them is zero, then p6 becomes true and p7 becomes false. If all are nonzero, then nothing changes, so p6 stays false and p7 stays true.
Finally, we act on the calculated predicates.
Notice that the parallel comparison lets us calculate and combine all the parts of the test in a single cycle. In a flags-based architecture, we would have to perform a comparison, test the result, then perform another comparison, test the result, then perform the last comparison, and test the result one last time. That's a sequence of six dependent operations, which is difficult to parallelize. (And most likely consume three branch prediction slots instead of just one.)
The last wrinkle in the comparison instructions is the
so-called
(qp) cmp.xx.unc p, q = r, s
Even though there is a qualifying predicate,
this comparison is executed unconditionally
(as indicated by the unc
suffix).
The behavior of an unconditional comparison is
p = qp && (r xx s) |
p = qp && ¬(r xx s) |
In other words, if the qualifying predicate is true, then the instruction behaves as normal. But if the qualifying predicate is false, then the result of the comparison is considered false for all branches, regardless of the actual test.
This formulation is handy when you are nesting predicates. Consider:
x = ... calculate x ...; y = ... calculate y ...; if (x == 0) { x_is_zero; } else { x_is_nonzero; if (y == 0) { x_is_nonzero_and_y_is_zero; } else { both_are_nonzero; } }
This can be compiled like this:
... calculate x in r30 ... ... calculate y in r31 ... cmp.eq p6, p7 = r30, r0 ;; (p6) x_is_zero (p7) x_is_nonzero (p7) cmp.eq.unc p8, p9 = r31, r0 ;; (p8) x_is_nonzero_and_y_is_zero (p9) both_are_nonzero
After calculating x and y,
we check whether x is zero.
If it is, then we execute
x_
We'll see later that the unconditional comparison is also useful in register rotation.
So that's a quick tour of the Itanium conditional instructions. Next time, we'll start looking at speculation.
Typo: "First, we calculate the values of x, y and x"
Should be "First, we calculate the values of x, y and z".
Actually, it should be "First, we calculate the values of x, y, and z".
I get the general feeling that if they hadn't bothered with the compiler driven OOO execution the concepts of Itanium make for an awesome architechture
So the "parallel" comparisons are exceptions to the rule that both predicates are set to opposite values?
In the cmp.xx.unc table is "p = qp && ¬(r xx s)" a typo? I think this should be q = ...
I love this series, I've never really used anything else than x86 and it really puts things into perspective, seeing how different other platforms can be :) Thank you, Raymond!
@kantos You probably wouldn't have been able to implement OOO in hardware and include everything else that Itanium had. Itanium was one of Intels more successful follies. Hopefully the better ideas turned up in their x64 backend
The "really" column is the same as the "meaning" column with operands reversed i.e. p = (a xx b) || p etc.
@Neil What is your point? The difference between the "meaning" and the "really" is that the "meaning" column states that p & q are always updated. The "really" column states that p & q are only updated in one outcome of the evaluation, in the other the register is not updated at all. This means that as long as instructions are only going to set or clear the registers then they can be run at the same time. If one could set and the next could clear then they cannot be run at the same time, they have to be run sequentially. It makes no difference to you as the programmer if p is clear and the comparison is not going to set it, but to hardware it makes a difference.
I was pretty harsh on the memory addressing last week, so I feel I ought to say that the comparison instructions and predicate registers are IMNSHO something Itanium got *right* and it's too bad no one else has picked up on them. (I only know about two greenfield ISA designs since: AArch64 and RISC-V. AArch64 appears to have a traditional, single flags register, and RISC-V has fused compare-and-branch instructions. If there's a design I've missed I'd be glad to hear about it.)
If you have an OOO core, conditional *execution* of most non-branch instructions turns out to be a net lose, because it wastes fetch and decode bandwidth, and it doesn't give the branch predictor a chance to work its magic. However, I think a hybrid design with Itanium-like comparisons and predicates, but a limited set of conditional instructions (branch, reg-reg move, set-to-immediate, and maybe load/store) would work out well.
(Also, unconditional load/store of single predicate registers to RAM would be convenient. I don't remember if Itanium had that.)