Quick info about a great SIMD writeup

Hi, folks. I wanted to put together a more coherent version of a random Twitter conversation I had with Sasha Goldshtein.

First, you should go read Sasha’s excellent write-up on SIMD: http://blogs.microsoft.co.il/sasha/2014/04/22/c-vectorization-microsoft-bcl-simd/. He’s done an excellent job of talking about how scalar & vector operations compare, and really dug in deep to performance of the generated code.

Okay, now let me explain a few things. Sasha’s looking at the code quality coming out of the preview: we know this code quality is less than optimal. Here’s what I’m seeing from the same code with a compiler I built about 30 minutes ago (Siva, one of the JIT devs, just got CopyTo recognized as an intrinsic function):

; parameters: RCX = a, RDX = b (as before)
    sub    rsp,28h
    xor    eax,eax                          ;rax = i (init to zero)
    mov    r8d,dword ptr [rcx+8]            ;r8d = a.Length
    test   r8d,r8d
    jle    bail_out                         ;r8d <= 0?
    mov    r9d,dword ptr [rdx+8]            ;r9d = b.Length
    lea    r10d,[rax+3]                     ;r10d = i + 3
    cmp    r10d,r8d                         ;we still have to range check the read from the array of &a[i+3]
    jae    range_check_fail
    mov    r11,rcx
    movupd xmm0,xmmword ptr [r11+rax*4+10h] ;xmm0 = {a[i], a[i+1], a[i+2], a[i+3]}
    cmp    r10d,r9d                         ;and we have to range check b[i+3], too...
    jae    range_check_fail                 ;this one too :-(
    movupd xmm1,xmmword ptr [rdx+rax*4+10h] ;xmm1 = {b[i], b[i+1], b[i+2], b[i+3]}
    addps  xmm0,xmm1                        ;xmm0 += xmm1
    movupd xmmword ptr [r11+rax*4+10h],xmm0 ;{a[i], a[i+1], a[i+2], a[i+3]} = xmm0
    add    eax,4                            ;i += 1 * sizeof(float)
    cmp    r8d,eax                          ;a.Length > i?
    jg     loop
    add    rsp,28h
    call   clr!JIT_RngChkFail (00007ff9`a6d46590)
    int    3

So the most egregious code quality issue is gone (we’re going to try and get an update out quickly: I’m saying mid-May, but things could change :-/). That gives me numbers like this:

Scalar adding numbers: 765msec
SIMD adding numbers: 210msec

MUCH better. But Sasha makes another interesting point: why did we create two versions of the scalar loop, but left those pesky bounds checks in the SIMD loop? Well, for starters, the two functions aren’t identical. The SIMD version will actually throw a bounds check failure exception. To fix that, we actually need to change (some might say “fix” :-)) the code:

static void AddPointwiseSimd(float[] a, float[] b)
int simdLength = Vector<float>.Length;
int i = 0;
for (i = 0; i < a.Length - simdLength; i += simdLength)
Vector<float> va = new Vector<float>(a, i);
Vector<float> vb = new Vector<float>(b, i);
va += vb;
va.CopyTo(a, i);
for (; i < a.Length; ++i)
a[i] += b[i];

And now you might be expecting some great revelation. Nope. Sasha’s right: we’re missing some fairly straight-forward bound checks here. And we’re not dual-versioning the loop, like RyuJIT does for normal scalar loops. Both items are on our to-do list. But the original code Sasha wrote would probably never have those optimizations. Hopefully, we’ll get our bounds-check elimination work strengthened enough to make this code truly impressive. Until then, you’ll have to limp along with only a 3.6x performance boost.

Comments (10)

  1. Mike Danes says:

    movupd/addps, hmm… shouldn't it be movups instead? Granted, both movupd and movups load a xmm register from memory and they don't really care about the actual values. But movups is shorter by one byte, it doesn't require a 66 prefix.

  2. xor88 says:

    The dual-versioning of the loop is a very nice solution that should eliminate 99% of the dynamic range checks executed at runtime. Provided, that people don't run many short loops which I don't know about.

    The HotSpot version is still far more impressive but I'd rather want you to spend your resources on things that matter. I believe this is the right trade-off. Good job!

    One thing caught my eye: We need to duplicate the loop body so that the code properly deals with not cleanly divisible trip counts. Do you have an idea on how to avoid that? I don't, unfortunately. As the vector length cannot be assumed to be a certain value by the programmer this means that all programmer-written loops will need this. After all, the vector length could suddenly be 1024 on a future version of the CLR and very few arrays are 1024-aligned in length. A programmer forgetting the 2nd loop would then receive incorrect results.

    Btw, (a.Length – simdLength) should be (a.Length – (simdLength – 1)) I think. Hope, the JIT will be able to recognize both as a safe upper bound and remove all range checks alltogether.

  3. LKeene says:

    Nice work,well done! The new emphasis on runtime performance is very encouraging. The production version of RyuJIT can't come soon enough.

  4. FreiK says:

    @Mike: Yes, you're right. That should be fixed before the next public preview.

    @xor88: How to deal with the last "Vector<T>.Length – 1" elements of a loop is a real usability problem, I completely agree. We've been toying with a lot of potential alternatives, but haven't found anything that really looks elegant and is performant. We're open to suggestions!

  5. LKeene says:

    @xor88: To deal with the final elements that don't fill up the vector, I usually just resorted to increasing the buffer length holding the original values to an integer multiple of the SIMD vector size, processed the whole buffer, then discarded the dummy results. I guess it would be nice if the JIT could determine the number of elements that can't be vectorized at runtime, duplicate the code body and run the final values through the scalar unit at the same time the SIMD units are working on the main body of data, then insert them into the final result.

    @Kevin: Are there instructions for generating conditional bitmask vectors included in this new .NET SIMD implementation?

  6. Mike Danes says:

    @Kevin Frei: "That should be fixed…"

    Cool! But here's another nit to pick :)

    What's up with "mov r11, rcx" in the middle of the loop? rcx isn't modified anywhere so it can very well be used to access array a.

    @LKeene: "it would be nice if the JIT could determine the number of elements that can't be vectorized at runtime, duplicate the code body and run the final values through the scalar unit"

    That's what an auto vectorizer would do. Problem is, JIT isn't doing any kind vectorization. You, the author of the code, are doing it.

  7. FreiK says:


    #1: Automatically un-vectorizing the last bits of the loop seems much easier than vectorizing a scalar loop, but both are pretty complicated. It's a rather interesting idea, though, given that the API is inherently size-agnostic: we could turn all those vector loops into scalar loops by making Vector<T>.Length 1, right? (okay, not quite: Vector<largestTyoureUsing>.Length could be 1, though)

    #2: Vector.LessThan/Vector.GreaterThan/Vector.Equal… all generate bitmast vectors. You can find a really simplistic usage in the Mandelbrot demo.

    @Mike: Nice catch on the unnecessary use of R11. I'll talk to our high priestess of register allocation about it and see what she comes back with. The JIT sometimes gets into trouble because we don't have the time to iterate on some optimizations until they hit an ideal fix-point.

  8. harrydev says:

    @Kevin: You say "we don't have the time to iterate on some optimizations until they hit an ideal fix-point" does this refer to the trade-off between jit-time and code generated? And in this case I would assume users of the Simd API would generally prefer longer jit times for better code generation, at least I would. Startup costs are for us not an issue.

    But CTP4 all sounds great, does this include the ability to use pointers when making Vectors?

  9. David Hewson says:

    I'd vote for longer startup for more performance too. I'd you're using SIMD you must really care about performance.

  10. David Hewson says:

    In your loop that uses the Vectors could it not use <= ? Just using < would miss out the last vector.

    ie if your int32 array is length 8 with < you get once around the vector and 4 times around the remainder loop, but with <= you get twice in the vector loop and 0 times around the remainder loop.

    But i tried changing my tests to use <= and it went slower, does this introduce more bounds checks on the vector loop?