RyuJIT CTP4: Now with more SIMD types, and better OS support!


Hi, folks. It’s been a busy month around here. We’ve been working on all sorts of stuff that I can’t talk about right now, but in the meantime, we’ve also been responding to feedback on the SIMD types. So, since it’s busy, I’m just going to list off the details, and link to other places for more information.

  1. Probably the biggest news is that if you install the 4.5.2 runtime (check the .NET blog for details on that), you can use RyuJIT CTP4 on Windows Vista, 7, and 8, as well as Windows Server 2008, 2008 R2, and 2012. In the CTP1 FAQ, I made mention that 4.5.1 on “downlevel” OS’es looked different from a code generation perspective. Well, that’s been addressed in the 4.5.2 update, so we’re happy to support RyuJIT CTP4 across all platforms that support 4.5.2.
  2. Nearly all the available Vector<T> types are now accelerated! The only ones missing are Vector<uint> and Vector<ulong>. In addition, there are a handful of other methods that we are now accelerating, including the CopyTo() method, which means any performance you have measured is now completely invalidated! Wait, no, I mean, any performance you measured could potentially be faster!
  3. The fixed-size vector types are all mutable, now. This was the single biggest piece of feedback we received, so we took it.
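
As a rough illustration of the accelerated Vector<T> operations and CopyTo() mentioned in the list above, here is a minimal sketch of an element-wise array add. It assumes the System.Numerics API as it later shipped, where the lane count is `Vector<T>.Count` (comments on this post refer to it as `Vector<T>.Length` in the preview package). The class and method names are made up for the example.

```csharp
using System;
using System.Numerics;

static class VectorAddSketch
{
    // Adds a and b element-wise into result, one Vector<float> at a time.
    public static void Add(float[] a, float[] b, float[] result)
    {
        if (a.Length != b.Length || a.Length != result.Length)
            throw new ArgumentException("Arrays must have equal length.");

        int width = Vector<float>.Count; // lanes per vector; depends on hardware
        int i = 0;

        // Main loop: load two vectors, add them, store with CopyTo.
        for (; i <= a.Length - width; i += width)
        {
            var sum = new Vector<float>(a, i) + new Vector<float>(b, i);
            sum.CopyTo(result, i);
        }

        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < a.Length; i++)
            result[i] = a[i] + b[i];
    }

    static void Main()
    {
        float[] a = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
        float[] b = { 10, 20, 30, 40, 50, 60, 70, 80, 90 };
        var r = new float[a.Length];
        Add(a, b, r);
        Console.WriteLine(string.Join(", ", r));
    }
}
```

The scalar tail loop is there because the array length is rarely a multiple of the vector width, and the width itself can differ from machine to machine.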

There you have it. For now, you can download the CTP4 bits from here. The BCL SIMD NuGet package has also been updated, so update that, and you should be good to go. Same directions as before for how to use the types, enable RyuJIT, all that stuff. As always, send feedback to ryujit@microsoft.com. Happy RyuJIT-ing!

-Kev

Comments (55)

  1. Craig Johnson says:

    Looking forward to testing out CTP4.

  2. David Hewson says:

    I installed CTP4 on Windows 7

    My application runs fine normally. After enabling RyuJit with set COMPLUS_AltJit=* when my application runs it crashes with System.AccessViolationException

    🙁

  3. AzureSky says:

    VS Update 2 = Blue Screen on restart / breaks my Win8, breaks VS 2013.  Suggestion(s)?

  4. Kevin Frei says:

    @David, can you get us a repro case? That seems like a bad thing! If you can get a repro, please send it to ryujit@microsoft.com.

  5. Kevin Frei says:

    @AzureSky: Uninstall VS Update 2? Upgrade to Win 8.1 (it's better than Win8 in every possible way)? Insert additional general purpose troubleshooting guidance, here 🙂

  6. AzureSky says:

    Yeah, thanks for the tips, Kevin.  I've been planning on upgrading to Win 8.1; will be doing so at some point.  In the meantime, back to VS 2013.0 and completing my current project.  After Blue Screen hell I'm not willing to risk any more downtime at the moment.

  7. AzureSky says:

    By the way, how might I uninstall VS Update 2?  There is nothing listed in the Control Panel Programs and Features list.  There is nothing added to my start menu, etc, no new program groups or anything.  That might be because VS 2013 Update Setup failed to complete?  VS (in 'About') says Update 2 is installed, but quite apparently it's not installed properly.  I think my VS is in some strange zombie land.  Anyway, what I did earlier was run the VS 2013.0 Setup Repair and that at least got me a Visual Studio that works again.  To get my Desktop back in the first place (to get past the Blue Screens) I had to use a System Restore Point (created by the VS Update 2 installer); so I used that to fix Windows first, of course, followed by the VS 2013.0 Setup Repair.  Anyways, VS still says Update 2 "is installed", so things are not 100% rolled back, which is why I might want to try Update 2 Uninstall; but where?  how?

  8. Kevin Frei says:

    @AzureSky: Lemme bark up a couple trees to see what's possible, here. I had problems with VS Update 2 preview, and I wound up fully uninstalling VS 2013, then reinstalling it, which took care of the problem for me…

  9. AzureSky says:

    Any clues would be much appreciated, thanks Kevin.

  10. leppie says:

    Where can we report bugs?

    Just tested this on Windows 7 64-bit with IronScheme, NullReferenceException.

    You can repro by downloading latest version from ironscheme.codeplex.com and running the IronScheme.Console-v4.exe (that targets 64-bit .NET 4+).

  11. leppie says:

    Good news is when I NGEN my app, and run tests which involve a bit of codegen, I see a very decent speedup o/

    ryujit: 6.8 secs

    normal jit: 9.6 secs

  12. Azarien says:

    still waiting for 32-bit version…

  13. Kevin Frei says:

    @leppie: send e-mail to RyuJIT@microsoft.com.

    @Azarien: A 32-bit version won't provide the dramatic improvement in JIT times that 64-bit does, though it should provide improved code quality (and SIMD capabilities). That's not happening in the short term, however.

  14. David Hewson says:

    Have sent a repro to that email. Not heard anything. You get it :)?

  15. Novox says:

    Are you planning to implement support for further SIMD intrinsics like reciprocal and reciprocal square root?

  16. Kevin Frei says:

    @David: Yes, I got that repro, and I think a dev has already fixed the issue. I'll make sure that we ping you with status.

    @Novox: The idea for SIMD is to get stuff in customers' hands and listen to what people want/need. I'll talk to the BCL folks (Immo & co.) to see how they're collecting customer feedback.

  17. LKeene says:

    What is the time frame for RyuJIT to go into production? Four tech previews is pretty comprehensive.

  18. Kevin Frei says:

    @LKeene: It's planned to go into the next "major" release of the .NET framework. It's already in the new ASP.NET thing that was announced at TechEd. (I don't know the official name of that!) To be clear, CTP4, and this is probably going to be true of any additional previews before we get a .NET beta that includes RyuJIT, was primarily to address SIMD feedback.

  19. Mike Danes says:

    @Kevin Frei: "The idea for SIMD is to get stuff in customers hands and listen to what people want/need."

    I'm not sure if this was mentioned or not already but one thing that's certainly missing is shuffling. Without that some more advanced SIMD stuff can't be implemented efficiently, for example matrix multiplication.

  20. CodesInChaos says:

    I agree with Mike Danes that shuffles/permutations would be very useful. I'm also missing bitshift operations. Both of these are needed for a proper SIMD implementation of ciphers like ChaCha, Salsa or Blake.

  21. Sivarv says:

    @Mike Danes, @CodesInChaos: Both shuffle and shift operations are on our ToDo list and under consideration.

  22. AzureSky says:

    I agree.  Would like to see bit shift and (at least some sort of) shuffle operators (needn't be the full SSE equivalent but at least something that would make doing matrix functions doable.)  Please do implement bit shifting and at least some type of shuffle operator.  Thank you in advance for any efforts in this direction.

  23. Kevin Frei says:

    @leppie, Siva has tracked down the IronScheme bug, and will have a fix pretty quickly. The issue is with tail-calling a delegate invoke. Maybe we'll turn around a CTP4b with a few fixes next week some time. I've gotten embarrassingly good at turning the "release an updated CTP" crank…

  24. AzureSky says:

    Also: Please add SIMD 'ROR' and 'ROL' (bit rotate) operators.

  25. _zenith says:

    Yes, all the above please! Bit shifting, rotations, shuffles, and unsigned type (uint, ulong) support.

    These are essential for fast cryptography implementations, and much else.

  26. _zenith says:

    Also please add a new constructor to the Vector<T> type : ( [T] values, int index, int length)

    Otherwise to make a certain length vector requires making a new array and copying over the requisite section, completely negating performance advantages.

  27. OnurG says:

    Check these three programs in C#, Java and C, all the same loop. C# runs magnitudes slower on my computer (release build, 64-bit), even if RyuJIT improves the performance 100%.

    C# : https://ideone.com/1EpmBw#

    Java: http://ideone.com/hxNJz2

    C: http://ideone.com/IghNwL

  28. AzureSky says:

    @OnurG I tested it on VB.NET x64 .NET 4.5.1 and Visual C++ x64 (with full optimizations selected.)

    Results:

    17 seconds in VB.NET (WinForms)

    16 seconds with C++ (console application.)

    A .NET console app might perform slightly better, possibly delivering the same result as the C++ console app.

    I'm still using VS 2013.0 (not RyuJIT.)

  29. Mike Danes says:

    @AzureSky: I'm not sure how you got 17 seconds in VB.NET and 16 seconds in VC++. Are you sure your conversion to VB.NET is correct?

    @OnurG: I get 20 seconds with RyuJIT and 15 seconds with VC++ on a Core i7 @ 2.50GHz machine. That's not exactly "magnitudes slower". In general, there's not much that can be optimized in that code. VC++ does some loop invariant code motion and some other stuff, but the impact of that is relatively low; it's measurable in your example only due to the very high number of iterations.

  30. AzureSky says:

    @Mike Yes, I am sure.  Ensure you have set 'remove integer overflow checks' and 'enable optimizations' in Release build.  BTW, I've always noticed VB.NET being on par with C++ when it comes to straightforward loops, math, etc (which of course it should do, because it should emit similar ASM) and which is why I found OnurG's assertion a bit odd; therefore I verified it for myself.  The results I achieved were expected and thus unsurprising to me.

  31. Mike Danes says:

    @AzureSky: Well, I did my own conversion to VB.NET and I get exactly the same result I get with the C# version, 20 seconds. And of course, optimizations are enabled and overflow checks are disabled.

    Anyway, the claim that C# runs magnitudes slower is indeed odd, at least for the presented code.

  32. AzureSky says:

    @Mike  Interesting.  Perhaps it's something inherent to RyuJIT, or perhaps the way code is generated for your hardware. (I'm running an old AMD CPU.)

  33. Rick Minerich says:

    Any word on TCO support? It's a key feature for F#.

  34. Jaun P says:

    Is this still beta? It has no TCE

  35. Yes, any word on support for the "tail." instruction?  Uniformity with the other existing JITs is critical for this to be safe to install. [ Don Syme, for the Visual F# Tools team ]

  36. Kevin Frei says:

    The tail prefix is fully supported, and has been since CTP2. One of our devs actually ran the whole of the F# test suite against RyuJIT a few weeks after CTP2, because it was such a different set of IL than the C#/VB stuff that most of our tests consist of. Oh, and the F# self build, too 🙂

  37. Kevin Frei says:

    And Tail Calls are optimized. I wonder, however, if we're doing something goofy with tail-prefixed calls. Sample code that's not working right, anyway? Send it to RyuJIT@microsoft.com 🙂

  38. David Hewson says:

    Kevin, any chances of another release with those bug fixes? I'd love to try this out to see if it can improve our run time but no such luck with that JIT level crash.

  39. David Hewson says:

    Ah, I managed to figure out which bits were breaking the JIT and correct them. Replaced a few hot spots with the vector version. Sadly they only managed to make up for the cost of using the 64-bit JIT. I.e., an 18-second run time on 32-bit went up to 26 seconds just by using RyuJIT and turning off "prefer 32-bit". Vectorized my hot spots and it came back down to 18 seconds. We're using doubles though, so maybe when AVX support comes along that will improve without any changes 🙂

  40. Michael Burbea says:

    Add me as another vote to a desire for bitshift operators.

    Is there any plan to optimize built-in algorithms to use SIMD instructions? Things like Array.IndexOf<byte> have an optimized path, but because of the indirection they use are terribly slow. A SIMDified version could probably gain quite a bit of benefit.

    A really nice benefit of the SIMD types is that they avoid bound checking. When you visit the elements inside a vector it does not emit a bound check which is awesome.

  41. Mike Danes says:

    "A really nice benefit of the SIMD types is that they avoid bound checking. When you visit the elements inside a vector it does not emit a bound check which is awesome."

    That's true only when the index is a constant; if the index is a variable then a bound check must be performed. And beware that accessing a vector element via a variable index will result in the vector register being written to memory.
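
To make the constant-versus-variable index distinction concrete, here is a small sketch. It assumes the shipped System.Numerics API (`Vector<T>.Count`, `CopyTo`), and the class and method names are invented for the example. One possible way to avoid a variable index on the vector itself is to copy the lanes to an array once and index the array. Whether that actually beats `v[i]` on a given JIT is something to measure, not something this sketch proves.

```csharp
using System;
using System.Numerics;

static class LaneAccessSketch
{
    // Constant index: per the comment above, no bounds check is needed here.
    public static long FirstLane(Vector<long> v)
    {
        return v[0];
    }

    // Variable index: copy the lanes out once, then index the array,
    // rather than calling v[i] with a changing i inside the loop.
    // Note the array allocation; a real hot path would reuse a buffer.
    public static long SumLanes(Vector<long> v)
    {
        var lanes = new long[Vector<long>.Count];
        v.CopyTo(lanes);

        long sum = 0;
        for (int i = 0; i < lanes.Length; i++)
            sum += lanes[i];
        return sum;
    }

    static void Main()
    {
        var v = new Vector<long>(3); // every lane holds 3
        Console.WriteLine(FirstLane(v));
        Console.WriteLine(SumLanes(v));
    }
}
```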

  42. David Hewson says:

    What are the chances of supporting things like FMA3/4 and Gather-Scatter in the full release?

  43. Michael Burbea says:

    Hmm, you are probably right, but I was talking more about the intrinsic that pulls them out; it seems to emit very efficient code to pull them out of the array.

    I didn't realize that the vector ops could only copy elements into memory. I was hoping that

    long v=vector[0]; would be smart enough to copy it straight to an available register if need be. (setting up debugging for ryujit is annoying).

    I can also say we are missing things like the opcode to copy the high bits in a SSE register to a short, which can make it much easier to analyze the result of the vector comparison. Since we lack that, my SIMD optimized indexOf is required to switch to longs (which probably also requires copying to another register) and then find the first set bit.

    Here is a gist with me trying to translate some SSE2 optimized memchr implementations to see if this is worthwhile. I think it is for the most part. There are some cases where SIMD is not faster, like when you have to do short runs. I actually used this test harness to test code for SQLCLR, so it can be quite nice if ryuJIT works for SQLCLR as it seems faster. (Note: the SIMD code is awful if you run it without hardware optimization. Like orders of magnitude slower).

    gist.github.com/…/a54d1528c5557c2af1b6

  44. Mike Danes says:

    "long v=vector[0]; would be smart enough to copy it straight to an available register if need be. (setting up debugging for ryujit is annoying)."

    That one is fast, it generates a single instruction. What's slow is something like vector[i] where i is a non const variable.

  45. Michael Burbea says:

    So I finally got around to looking at the generated code in the debugger, and was somewhat disappointed by some of the output. When you take the result of a comparison and then want to test it against zero or all ones, it generates another full equality comparison: it does a full equals operation against zero again (which requires loading zero and everything). So I had to update my code, and removing the test against zero was quite a bit of a gain. So now, I just test both halves of the SIMD register at once.

    (e.g res= Vector.Equals(v1,v2) if(v3==Vector<int>.Zero)…, should only be one actual pcmpeqb but its actually two and it has to put zero into a SSE register. It also doesn't do it smart like pxor xmm3,xmm3 instead it writes zero to a long and then does a 64bit shuffle.)

  46. Mike Danes says:

    "e.g res= Vector.Equals(v1,v2) if(v3==Vector<int>.Zero)…, should only be one actual pcmpeqb"

    The code generated for Vector.Equals<byte> is dubious:

    var vect = new Vector<byte>(array, startIndex);
       lea         eax,[r8+0Fh]
       cmp         eax,r10d
       jae         000007FE8E1E201D
       movups      xmm1,xmmword ptr [rcx+r8+10h]

    var res = Vector.Equals<byte>(vect, comparer);
       movaps      xmm2,xmm0
       mov         eax,80808080h
       movd        xmm3,eax
       pshufd      xmm3,xmm3,0
       psubb       xmm1,xmm3
       psubb       xmm2,xmm3
       pcmpeqb     xmm1,xmm2

    The compiler appears to try to compensate for the fact that SSE integer comparison instructions work  with signed values but Equals doesn't need this, pcmpeqb is all that's needed.

    "So now, I just test both halfs of the simd register at once."

    Also try simplifying the branches, they're too many:

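    var vl = Vector<byte>.AsVectorInt64(res);
    long v0 = vl[0];
    long v1 = vl[1];
    // No match in either half: move on to the next block.
    if ((v0 | v1) == 0)
        continue;
    // Lower half is empty, so the first match is in the upper 8 bytes.
    if (v0 == 0)
    {
        startIndex += 8;
        v0 = v1;
    }
    return startIndex + DebruijnFindByte(v0);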

    Anyway, this is kind of pointless because as soon as AVX2 support is added the code no longer works correctly.
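
The snippet above calls `DebruijnFindByte`, which is not shown anywhere in this thread. As a plain stand-in (deliberately without the De Bruijn multiplication trick), the sketch below assumes that helper is meant to return the index of the lowest-order non-zero byte of a 64-bit value, which on a little-endian machine corresponds to the first matching array position. The name `FindFirstNonZeroByte` is hypothetical.

```csharp
using System;

static class FirstByteSketch
{
    // Returns the index (0..7) of the lowest-order non-zero byte in value,
    // or -1 if value is zero. A simple loop, not a De Bruijn lookup.
    public static int FindFirstNonZeroByte(long value)
    {
        if (value == 0)
            return -1;

        ulong bits = (ulong)value; // unsigned so the shift fills with zeros
        int index = 0;
        while ((bits & 0xFF) == 0)
        {
            bits >>= 8;
            index++;
        }
        return index;
    }

    static void Main()
    {
        // Only byte 5 (bits 40..47) is set, so this prints 5.
        Console.WriteLine(FindFirstNonZeroByte(0x0000FF0000000000L));
    }
}
```

A De Bruijn or hardware bit-scan version would replace the loop with a constant-time lookup; the loop here only pins down the expected behavior.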

  47. Jack P says:

    Has the RyuJIT team tested CTP4 on a Haswell-based machine yet? I've installed CTP4 on a few different machines for testing, and it generally seems to work quite well, save for the NRE bug (which Kevin already said was fixed). However, I installed RyuJIT on my new Haswell-based desktop at home and once I enabled RyuJIT, every .NET program I tried to run crashes immediately. Disabling RyuJIT allows the programs to run normally again, so this seems like a code-generation bug in RyuJIT. Haswell has AVX2 support, so perhaps that has something to do with it (as Mike Danes mentioned in his comment).

    Also — any word on when the next release will be out with the fix for the NRE bug?

  48. Michael Burbea says:

    True, that version only works with SSE2 registers, but it wouldn't be hard to add an

    if(Vector<byte>.Length == 32) { /* test the last 2 spots */} Since that branch would never be taken or always be taken, branch prediction would optimize the check well.

    Ideally, the operator that collects the high bit of each byte in the register would be added to the language; that way you can test it in one swoop with your favorite version of bsf. (Most likely a Debruijn sequence).

  49. Mike Danes says:

    "Ideally, the operator that collects the high bit of each byte in the register…"

    In theory there's already a method that does almost what memchr needs – EqualsAny. It doesn't give you the index but it's enough to stop the main loop without doing any other comparisons. Unfortunately the code generated by EqualsAny is currently messy enough that it erases any performance advantage.
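
The idea of using EqualsAny to stop the main loop can be sketched roughly as follows. This assumes the shipped System.Numerics API (`Vector.EqualsAny`, `Vector<T>.Count`); the class and method names are made up, and nothing here says anything about how fast the generated code is on a particular JIT build.

```csharp
using System;
using System.Numerics;

static class IndexOfSketch
{
    // Returns the index of the first occurrence of value in data, or -1.
    public static int IndexOf(byte[] data, byte value)
    {
        int width = Vector<byte>.Count;
        var needle = new Vector<byte>(value); // value broadcast to every lane
        int i = 0;

        // Skip whole blocks that contain no match at all.
        for (; i <= data.Length - width; i += width)
        {
            if (Vector.EqualsAny(new Vector<byte>(data, i), needle))
                break; // the first match is somewhere in this block
        }

        // Scalar scan: pins down the position inside the matching block,
        // and also covers the tail that does not fill a whole vector.
        for (; i < data.Length; i++)
        {
            if (data[i] == value)
                return i;
        }
        return -1;
    }

    static void Main()
    {
        var data = new byte[100];
        data[70] = 42;
        Console.WriteLine(IndexOf(data, 42));
    }
}
```

The scalar scan after the break costs at most one vector width of byte comparisons, which keeps the sketch independent of the register size.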

  50. JP says:

    Installed and tested on our huge project, where our application startup time is around 21 sec in debug mode. .NET 4.5.2 is installed, as well as RyuJIT with the environment variable set. Win 7 x64 used for testing. I don't see any improvements whatsoever in startup time. Any idea if there is any problem with the configuration?

  51. Mike Danes says:

    I doubt you'll see much difference in the startup time if you're running the application in debug mode. The startup improvements are mainly related to optimizations which are performed only in release mode.

  52. Ben Adams says:

    We have to send a lot of data over the network so lots of copying to and from byte arrays occur; we control the network stack so don't need to worry about endianness. But, with the accelerated CopyTo being to float[] what would the preferred method be to take advantage of hardware accel/intrinsics? (happy to do unsafe casting and fixed blocks)

  53. Charles Ambrye says:

    I was trying to benchmark the difference between SIMD and Non-SIMD code by writing my own Vector4f and comparing to the speed of System.Numerics.Vector4f.  I found that Vector4f I implemented was in most cases faster than the System.Numerics, but to my astonishment it was also almost 4x faster than the same code compiled using VS2010 and not using protojit.dll.  Is RyuJIT clever enough to realize it can match my own version of Vector4f to SIMD instructions?  Will all of my current codebase instantly benefit from SIMD when RyuJIT sees certain patterns and/or operations?

  54. svanbodegraven says:

    Anyone else getting DEBUG: Error 2203:  Database: C:\windows\Installer\inprogressinstallinfo.ipi. Cannot open database file. System error -2147287037 when installing?

  55. Expected performance on Intel Haswell? says:

    Hi Kevin – thanks for Ryujit!

    I have an Intel Haswell computer with AVX2 (256-bit-SIMD, I believe) and have downloaded your Mandelbrot sample.

    In the debugger, I see that Vector<int>.Length returns "4".

    Do we expect to work with 8 ints at a time on AVX2? Does that mean that I have not configured CTP4 correctly?

    Cheers

    Jiri
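
For anyone wanting to check this on their own machine, a minimal sketch like the one below reports whether vectors are hardware accelerated and how wide they are. It assumes the shipped System.Numerics API, where the lane count is `Vector<T>.Count` (called `Length` in the comment above) and `Vector.IsHardwareAccelerated` is available. Earlier comments in this thread suggest AVX support had not yet been added, which would be consistent with seeing 4 int lanes (128 bits); that reading is an inference from the thread, not a confirmed answer.

```csharp
using System;
using System.Numerics;

static class VectorWidthSketch
{
    // Summarizes acceleration status and vector width for the current JIT/CPU.
    public static string Describe()
    {
        return "accelerated=" + Vector.IsHardwareAccelerated +
               ", int lanes=" + Vector<int>.Count +
               ", width in bytes=" + Vector<byte>.Count;
    }

    static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```

Since `Vector<byte>` has one lane per byte, its count doubles as the register width in bytes: 16 would indicate 128-bit vectors and 32 would indicate 256-bit vectors.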