SIMD/ARM-NEON support in Windows Phone Mango

This is an announcement-only post. Do subscribe to this blog feed or follow https://twitter.com/abhinaba, as I’ll be making more detailed posts on these topics as we get closer to handing over these bits to our developer customers.

ARM processors support SIMD (Single Instruction, Multiple Data) instructions through the ARM® NEON™ technology that is available on the ARMv7 ISA. SIMD lets the hardware parallelize and accelerate certain operations, which yields performance gains. Since the Windows Phone 7 chassis specification requires ARMv7-A, NEON is available by default on all WP7 devices. However, the CLR on Windows Phone 7 (NETCF) did not use this hardware capability, so it was not available to managed application developers. We just announced at MIX11 that in the next release of Windows Phone the NETCF runtime JIT will utilize the SIMD capabilities of the phones.

What it means to the developers

Certain operations on some XNA types will be accelerated using the NEON/SIMD extensions available on the phone. Examples include operations on Vector2, Vector3, Vector4 and Matrix from the Microsoft.Xna.Framework namespace. NOTE: At this point the exact types and the exact operations on them are not finalized and are subject to change. Do note that user types will not get this acceleration. E.g. if you have rolled your own vector type and use, say, dot operations on it, the CLR will not accelerate them. This is a targeted acceleration for some XNA types and not a vectorizing JIT compiler feature.
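To make the distinction concrete, here is a minimal sketch of the kind of hand-rolled type that would not be accelerated. MyVector2 and its Dot method are hypothetical names made up for illustration. The closing comment names the XNA counterpart, Vector2.Dot, as the sort of operation that could be a candidate for acceleration, though whether it makes the final list is an assumption, since that list is not closed yet.

```csharp
// Hypothetical hand-rolled vector type (illustration only, not part of XNA).
public struct MyVector2
{
    public float X;
    public float Y;

    public MyVector2(float x, float y)
    {
        X = x;
        Y = y;
    }

    // Ordinary scalar code: the JIT compiles this as sequential
    // float multiplies and an add, with no NEON involvement.
    public static float Dot(MyVector2 a, MyVector2 b)
    {
        return a.X * b.X + a.Y * b.Y;
    }
}

// The XNA counterpart would be Microsoft.Xna.Framework.Vector2.Dot(v1, v2),
// called on XNA Vector2 values instead of MyVector2.
```

Moving from a type like MyVector2 to the XNA type is the change that would let an app pick up the acceleration, for whichever operations end up being supported.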

Apps that use these XNA types heavily (our research shows a lot of games do) will see good performance gains. For example, we took a fluid simulation sample from another team (note that it was not written specifically for us to demo) and saw huge gains, because it relies heavily on Matrix and Vector operations to simulate fluid particles and the forces acting between them. Frame rates shot up from 18fps to 29fps on the very same device.

The gains you see will vary with how much your app uses these types and operations. However, this feature should be a good motivation to move to these XNA types.

How does SIMD work

SIMD, as the name suggests, applies the same operation to multiple data elements in parallel.

Consider the following Vector addition:

public static Vector2 Add(Vector2 value1, Vector2 value2)
{
    Vector2 vector;
    vector.X = value1.X + value2.X;
    vector.Y = value1.Y + value2.Y;
    return vector;
}

The two assignment lines do essentially the same addition on two sets of data and put the results in two different locations. Today this is performed sequentially: the JITer emits processor instructions to load the X values into registers, add them, and store the result, and then does the same thing again for the Y values. However, this is inherently parallelizable. ARM NEON provides an easy way to load such values in a single instruction (vpop, vldr) and to add them in parallel with a single VADD NEON instruction.

One way to visualize this:

A single instruction loads both X1 and Y1, another instruction loads both X2 and Y2, and a third adds the pairs together in parallel.
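The three steps can be modeled in managed code roughly as follows. This is only a conceptual sketch: Lanes2, SimdModel, Load, Add and AddPairs are hypothetical names, and real NEON instructions operate on hardware registers, not on C# structs. The comments note which instruction each step loosely corresponds to.

```csharp
// Hypothetical model of a 64-bit NEON register holding two float lanes.
public struct Lanes2
{
    public float Lane0;
    public float Lane1;

    // Models one load that fills both lanes at once
    // (loosely, a vldr into a D register).
    public static Lanes2 Load(float[] memory, int offset)
    {
        Lanes2 r;
        r.Lane0 = memory[offset];
        r.Lane1 = memory[offset + 1];
        return r;
    }

    // Models one lane-wise add (loosely, a single VADD).
    // On hardware both sums come out of one instruction;
    // here they are necessarily written as two statements.
    public static Lanes2 Add(Lanes2 a, Lanes2 b)
    {
        Lanes2 r;
        r.Lane0 = a.Lane0 + b.Lane0;
        r.Lane1 = a.Lane1 + b.Lane1;
        return r;
    }
}

public static class SimdModel
{
    // Assumed memory layout for this sketch: { X1, Y1, X2, Y2 }.
    public static Lanes2 AddPairs(float[] memory)
    {
        Lanes2 v1 = Lanes2.Load(memory, 0); // step 1: X1 and Y1 together
        Lanes2 v2 = Lanes2.Load(memory, 2); // step 2: X2 and Y2 together
        return Lanes2.Add(v1, v2);          // step 3: both sums together
    }
}
```

Calling SimdModel.AddPairs(new float[] { 1f, 2f, 3f, 4f }) gives lanes 4 and 6, the same result the scalar Add produces, but in three modeled steps instead of a separate load, add and store per component.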

For an easy sample of how that works, head over to https://www.arm.com/files/pdf/NEON_Support_in_the_ARM_Compiler.pdf