The Itanium processor, part 7: Speculative loads


Today we'll look at speculative loads, which is when you load a value before you even know whether the value should be loaded. (A related but distinct concept is advanced loading, which we'll look at next time.)

Consider the following code:

int32_t SomeClass::calculateSomething(int32_t *a, int32_t *b)
{
 int32_t result;
 if (m_p > m_q + 1) {
  result = a[*b];
 } else {
  result = defaultValue;
 }
 return result;
}

The naïve way to compile this function would go something like this:

        // we are a leaf function, so no need to use "alloc" or to save rp.
        // on entry: r32 = this, r33 = a, r34 = b

        // calculate complex condition, putting the result in p6
        // and the opposite of the result in p7.

        addl    r31 = 08h, r32          // r31 = &this->m_p
        addl    r30 = 10h, r32 ;;       // r30 = &this->m_q
        ld8     r31 = [r31]             // r31 = this->m_p
        ld8     r30 = [r30] ;;          // r30 = this->m_q
        addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
        cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6

        // Now take action based on the result.

(p6)    ld4     r29 = [r34] ;;          // if true: load *b
(p6)    shladd  r29 = r29, 2, r33 ;;    // if true: calculate &a[*b]
(p6)    ld4     ret0 = [r29]            // if true: load a[*b]
(p7)    or      ret0 = 2ah, r0          // if false: return default value
        br.ret.sptk.many rp             // return

First we decide whether the condition is true or not. Once we know whether the branch is taken, we execute the appropriate branch. If the condition is true, then we load *b, then use it to index the array a. If the condition is false, then we just set the result to the default value.

Now, in reality, the encoded instructions aren't that neat due to template restrictions. For example, we cannot put the first three instructions into a bundle because they consist of two arithmetic instructions, a stop, and a memory instruction, but there is no II|M template. The actual encoding would be more like this:

        // we are a leaf function, so no need to use "alloc" or to save rp.
        // on entry: r32 = this, r33 = a, r34 = b

        // calculate complex condition, putting the result in p6
        // and the opposite of the result in p7.

{      // MII| template
        nop.m
        addl    r31 = 08h, r32          // r31 = &this->m_p
        addl    r30 = 10h, r32 ;;       // r30 = &this->m_q
}
{       // MM|I| template (M port can also execute integer operations)
        ld8     r31 = [r31]             // r31 = this->m_p
        ld8     r30 = [r30] ;;          // r30 = this->m_q
        addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
}
{       // M|MI| template
        cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6

        // Now take action based on the result.

(p6)    ld4     r29 = [r34]             // if true: load *b
        nop.i ;;                        // move the stop here
}
{       //  M|MI template
(p6)    shladd  r29 = r29, 2, r33 ;;    // if true: calculate &a[*b]
(p6)    ld4     ret0 = [r29]            // if true: load a[*b]
(p7)    or      ret0 = 2ah, r0          // if false: return default value
}
{       // BBB template
        br.ret.sptk.many rp             // return
        nop.b
        nop.b
}

Anyway, let's go back to the original version. What you might notice is that there is a long dependency chain:

addl r31 = 08h, r32 addl r30 = 10h, r32
ld8 r31 = [r31] ld8 r30 = [r30]
addl r30 = 08h, r30
cmp.gt p6, p7 = r30, r31
(p7) or ret0 = 2Ah, r0 (p6) ld4 r29 = [r34]
(p6) shladd r29 = r29, 2, r33
(p6) ld4 ret0 = [r29]
br.ret.sptk.many rp

It would be great if we could parallelize more of this computation. For example, we could precalculate the true branch:

        // we are a leaf function, so no need to use "alloc" or to save rp.
        // on entry: r32 = this, r33 = a, r34 = b

        ld4     r29 = [r34]             // assume true: load *b
        addl    r31 = 08h, r32
        addl    r30 = 10h, r32 ;;
        shladd  r29 = r29, 2, r33       // assume true: calculate &a[*b]
        ld8     r31 = [r31]
        ld8     r30 = [r30] ;;
        ld4     r29 = [r29]             // assume true: load a[*b]
        addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
        cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6

        // Now take action based on the result.

(p6)    or      ret0 = r0, r29          // if true: use the value we precalculated
(p7)    or      ret0 = 2ah, r0          // if false: return the default value
        br.ret.sptk.many rp             // return

After applying template restrictions, the code looks more like this:

{       // MII| template
        ld4     r29 = [r34]             // assume true: load *b
        addl    r31 = 08h, r32
        addl    r30 = 10h, r32 ;;
}
{       // MMI| template
        ld8     r31 = [r31]
        ld8     r30 = [r30]
        shladd  r29 = r29, 2, r33 ;;    // assume true: calculate &a[*b]
}
{       // MI|I| template
        ld4     r29 = [r29]             // assume true: load a[*b]
        addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
        cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6
}
{       // MIB template (recalling that the M port can also execute integer instructions)
(p6)    or      ret0 = r0, r29          // if true: use the value we precalculated
(p7)    or      ret0 = 2ah, r0          // if false: return the default value
        br.ret.sptk.many rp             // return
}

Note that we managed to shrink the code because we were able to use the speculative instructions to "fill in the holes" left by the template-mandated nop instructions in the original. We managed to squeeze out all the nops!

Okay, enough with the template restrictions digressions.

This alternate version assumes that the complex condition is true and speculatively evaluates the true branch. If the test turns out to be true, then it just uses the precalculated result. if it turns out to be false, then the precalculated result is thrown away.

In other words, we rewrote the code like this:

int32_t SomeClass::calculateSomething(int32_t *a, int32_t *b)
{
 int32_t speculativeResult = a[*b];
 int32_t result;
 if (m_p > m_q + 1) {
  result = speculativeResult;
 } else {
  result = defaultValue;
 }
 return result;
}

The dependency chain is now much shorter.

addl r31 = 08h, r32 addl r30 = 10h, r32 ld4 r29 = [r34]
ld8 r31 = [r31] ld8 r30 = [r30] shladd r29 = r29, 2, r33
addl r30 = 08h, r30 ld4 r29 = [r29]
cmp.gt p6, p7 = r30, r31
(p7) or ret0 = 2ah, r0 (p6) or ret0 = r0, r29
br.ret.sptk.many rp

This rewrite is generally beneficial if profiling feedback suggests that the conditional is normally true, since it hides the memory access latency inside the calculation of the conditional.

Until somebody does this:

    // Get the default value. We know that m_p == m_q.
    p->calculateSomething(nullptr, nullptr);

In this case, the speculative execution will take an access violation because it tries to deference the null pointer as part of calculating the speculative result.

Well, that sucks.

To solve this problem, the Itanium lets you explicitly tag memory read operations as speculative. If you try to load a value speculatively, the instruction will read the value if possible, but if doing so would normally raise an exception, the processor says, "Sorry, I couldn't read it. But since this is part of speculative execution, I shouldn't raise an exception. Instead, I will set the NaT bit on the value so that you will know that the speculative load failed."

The NaT bit (short for Not a Thing) is a 65th bit associated with each 64-bit general-purpose integer register that says whether the register holds a valid value. (Floating point registers do not have a NaT bit; instead there is a special value called NaTVal which serves the same purpose.) Arithmetic operations on NaT simply result in another NaT, but if you try to do something interesting with a NaT, you incur a STATUS_REG_NAT_CONSUMPTION exception.

So let's take advantage of explicit speculative execution:

        // we are a leaf function, so no need to use "alloc" or to save rp.
        // on entry: r32 = this, r33 = a, r34 = b

        ld4.s   r29 = [r34]             // speculatively load *b
        addl    r31 = 08h, r32
        addl    r30 = 10h, r32 ;;
        ld8     r31 = [r31]
        ld8     r30 = [r30]
        shladd  r29 = r29, 2, r33 ;;    // speculatively calculate &a[*b]
        ld4.s   r29 = [r29]             // speculatively load a[*b]
        addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
        cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6

        // Now take action based on the result.

(p6)    or      ret0 = r0, r29          // if true: use the value we precalculated
(p7)    or      ret0 = 2ah, r0          // if false: return the default value
        br.ret.sptk.many rp             // return

We changed the two load operations from ld4 to ld4.s. The trailing .s means that the load is being performed speculatively, and that if an error occurs or if the address is itself NaT, then set the result to NaT rather than raising an exception.

Okay, so this prevents the exception from being raised during the speculative execution, but what if an exception really occurred? How do we turn the NaT back into the original exception? As written, we return NaT back to our caller, which is definitely not what the caller expects!

You might put on your language lawyer hat at this point and say that dereferencing a null pointer invokes undefined behavior, so returning NaT is standard-conforming (because undefined behavior allows anything to be standard-conforming). That's true, if the exception was due to an access violation. But the exception might have been a page not present exception because the memory was paged out. In that case, we really do want to raise the exception so that the kernel can handle it by paging the memory back in, and then we want to read the value and resume our calculations. The caller definitely does not expect that passing valid parameters will result in a NaT just because the memory happens to be paged out.

What we need to do is convert the deferred exception back into the original exception so that it can be raised as if no speculation had occurred. The instruction that lets us know that an exception got converted to a NaT is chk.s. This means "Check if the register contains NaT. If so, then jump to recovery code." The recovery code re-executes the instructions non-speculatively so that all the exceptions can be raised in the standard way, and any exception handlers can do their work in the standard way. Since NaT infects future computations, we don't need to check every speculative step; we need only check the final speculated result.

        // we are a leaf function, so no need to use "alloc" or to save rp.
        // on entry: r32 = this, r33 = a, r34 = b

        ld4.s   r29 = [r34]             // speculatively load *b
        addl    r31 = 08h, r32
        addl    r30 = 10h, r32 ;;
        ld8     r31 = [r31]
        ld8     r30 = [r30]
        shladd  r29 = r29, 2, r33 ;;    // speculatively calculate &a[*b]
        ld4.s   r29 = [r29]             // speculatively load a[*b]
        addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
        cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6

        // Now take action based on the result.

(p6)    chk.s   r29, recovery           // if true: recover r29 if not valid
recovered:
(p6)    or      ret0 = r0, r29          // if true: use the value we precalculated
(p7)    or      ret0 = 2ah, r0          // if false: return the default value
        br.ret.sptk.many rp             // return

recovery:
        ld4     r29 = [r34] ;;          // load *b
        shladd  r29 = r29, 2, r33 ;;    // calculate &a[*b]
        ld4     r29 = [r29]             // load a[*b]
        br      recovered               // resume with recovered value

The chk.s instruction checks the specified register to see if the NaT bit is set. If not, the instruction allows execution to continue normally. But if the register is invalid, then control transfers to the specified label. Our recovery code re-executes the instructions that led to the invalid value, but this time we execute them non-speculatively so that exceptions can be raised and handled. Once the value has been recovered, we jump back to the instruction after the chk.s so that normal execution can resume.

In this case, we can make an additional optimization. Since the only things happening after recovery are copying r29 to ret0 and returning, we can inline those two instructions then perform peephole optimization to combine the ld4 r29 = [r29] and or ret0 = r0, r29 into ld4 ret0 = [r29].

Note that optimizing the recovery code is not really that important from an execution speed standpoint, since the recovery code runs only if an exception occurred, and the cost of raising and handling the exception will drown out any cycle squeezing effects. The real benefit of optimizing the recovery code is to avoid the jump back into the mainline code, because that allows the mainline code to be more compact: Recall that all jump targets must be the start of a bundle. If we had the recovery code jump back to the mainline code, we would have to insert some nops so that the recovered label is at the start of a bundle. (In practice, what the compiler will do is repeat the trailing instructions in the bundle containing the chk.s then jump to the start of the next bundle.)

The final compiled function now looks like this:

        // we are a leaf function, so no need to use "alloc" or to save rp.
        // on entry: r32 = this, r33 = a, r34 = b

        ld4.s   r29 = [r34]             // speculatively load *b
        addl    r31 = 08h, r32
        addl    r30 = 10h, r32 ;;
        ld8     r31 = [r31]
        ld8     r30 = [r30]
        shladd  r29 = r29, 2, r33 ;;    // speculatively calculate &a[*b]
        ld4.s   r29 = [r29]             // speculatively load a[*b]
        addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1
        cmp.gt  p6, p7 = r30, r31 ;;    // p6 = m_p > m_q + 1; p7 = !p6

        // Now take action based on the result.

(p6)    chk.s   r29, recovery           // if true: recover r29 if not valid
(p6)    or      ret0 = r0, r29          // if true: use the value we precalculated
(p7)    or      ret0 = 2ah, r0          // if false: return the default value
        br.ret.sptk.many rp             // return

recovery:
        ld4     r29 = [r34] ;;          // load *b
        shladd  r29 = r29, 2, r33 ;;    // calculate &a[*b]
        ld4     ret0 = [r29]            // load a[*b] as return value
        br.ret.sptk.many rp             // return

Next time, we'll look at advanced loading, which is a different type of speculative execution.

Comments (15)
  1. Muzer says:

    To my mind, this seems so hacky, and yet quite clever.

  2. Andre says:

    What happens when one instruction of the bundle incurs, say, a page fault? Does the whole bundle get discarded before calling into the trap handler, or only a single instruction?

    If the whole bundle gets re-executed: Could this lead to a situation where a (pretty naive) page handler runs in circles paging in one page for another, and two instructions of the bundle alternatingly raise page faults?

    [This is a non-problem because bundles are 16 bytes long and are aligned on on 16-byte boundaries, so they can never straddle a page boundary. -Raymond]
  3. Jared S says:

    @Andre: You could get into that situation on x86 if you have an instruction that crosses a page boundary.

    [Also, unaligned data. -Raymond]
  4. Medinoc says:

    I'm not sure Andre's question was perfectly understood here.

    The way I read it, it's about what can happen if a bundle contains two instructions that each load a memory value from two different pages.

  5. Andre says:

    @Medinoc:

    Yes, that's what I meant.

    [Sorry, I misunderstood. Fault handlers have access to a privileged rfi instruction, which has the power to resume execution at a specific instruction within the bundle. -Raymond]
  6. Kirchner says:

    Somebody care to help me understand how this instruction makes sense?

    addl    r30 = 08h, r30 ;;       // r30 = this->m_q + 1

    I'd say it adds 8 to r30. What am I missing?

  7. Medinoc says:

    @Kirchner: The declaration of m_q isn't shown, but if it's a pointer to int64_t, adding 1 to it translates as adding 8 to the actual address it contains.

  8. Typo says:

    It says "it tries to deference the null pointer".

    It should be 'dereference' instead of 'deference'.

  9. Kirchner says:

    @Medinoc, thanks a lot. I missed that.

  10. I am a pure .NET developer but this strategy sounds very clever to me. A really interesting Strategy

    Thanks for the article ;)

  11. asdf says:

    At least as reported in the press at the time, one of the biggest problems of the Itanium was that the compilers never managed to deliver the promised performance. Was that just an early problem, or something that persisted over the lifetime of the architecture? Has the state of compiler research advanced enough that VLIW would be a viable model for general-purpose computing today?

  12. JM says:

    "The recovery code re-executes the instructions non-speculatively so that all the exceptions can be raised in the standard way, and any exception handlers can do their work in the standard way."

    Presumably you need to make sure the code is correct under multithreading/interrupts, given that "the same" instructions can now execute more than once? (The general pattern of "load the register again non-speculatively given that we already know this branch will be taken" is safe, of course.)

    [The non-speculative version is the "real" execution. The speculative version is the "Gosh, I wonder if I can get away with this " version. So any exceptions raised in the non-speculative version are the ones that correspond to the code as-written. -Raymond]
  13. NOR says:

    >(p6)    chk.s   r29, recovery           // if true: recover r29 if not valid

    >(p6)    or      ret0 = r0, r29          // if true: use the value we precalculated

    >(p7)    or      ret0 = 2ah, r0          // if false: return the default value

    >        br.ret.sptk.many rp             // return

    >

    >recovery:

    >        ld4     r29 = [r34] ;;          // load *b

    Q1: Are all five really in the same block? Is there an implicit ';;' after an unconditional jump?

    Q2: Why are there no ';;' in the first four instructions?

       If NAT is set for r29:

         1) Is there a guarantee that we jump to recovery first and don't return?

         2) Is the content of ret0 defined?

    [There is an implicit ;; after any taken jump. If r29 has NaT set, then the jump at chk.s is taken, and a ;; is auto-inserted, thereby preventing ret0 from being set and the br.ret from being taken. -Raymond]
  14. Andre says:

    Very interesting.

    Thanks a lot, Raymond, for the posts and the answer.

  15. Josh Bowman says:

    @asdf:

    35 years on from the introduction of x86, compilers are still improving with multiple-percent gains each major version. Small, but a non-negligible increase. While the science of compilers has evolved in tandem, it was already pretty solid when Itanic was introduced, and especially since the architecture throws out many x86 (and Power, Arm, Sparc) assumptions, there's still decades of theory and engineering work to truly take advantage of its uniqueness. The market didn't give it more than a half-decade, so here we are.

    (It would probably have gained an on-chip recompiler anyway. Intel's the only one who ever understand it well enough.)

Comments are closed.

Skip to main content