Microsoft recently released a preview of the Accelerator V2 GPU and x64 multicore programming system on Microsoft Connect. This system provides a civilized level of abstraction for writing data-parallel programs that execute on GPUs and multicore processors. An experimental FPGA target is under development.
Even on my low end graphics card I get pretty impressive performance results for the 2D convolver that is described in this blog. All 8 cores of my 64-bit Windows 7 workstation are also effectively exercised by the x64 multicore target, which exploits SIMD processor instructions and multithreading. I won't say anything about performance in this blog post since what I want to focus on is how to use Accelerator from the F# functional programming language. We will work backwards by starting off with a complete implementation of a two dimensional convolver. Step by step we show how this convolver is expressed using Accelerator from F#.
First here is the beautiful implementation of a two dimensional convolver. The rest of this post explains why this code works.
This program sets up a pseudo-random two dimensional input array, declares a two dimensional convolver and then uses two of the targets supported by Accelerator to convolve the input array. The first target uses a GPU to perform the convolution and the second target uses multiple x64 processor cores and makes use of special SIMD instructions. When this program is run it produces the following output:
A convolver computes a weighted average around some point. Image editing programs typically have an operation that smoothes an image and this is usually performed by a two dimensional convolution around each pixel. We shall discuss how to implement a one dimensional (1D) convolver fist and then generalize this to a two dimensional (2D) convolver afterwards. A convolution in one dimension can be expressed as a weighted sum around each value in an input array (or stream):
Here is a concrete example of a one dimensional convolution of an eight element array x which is being convolved by a five element filter kernel a to produce an eight element result array y. The filter kernel a appears to be like a window which slides across each element of the input array. To simplify the explanation and code, the filter is placed to the left of the element being convolved. A more realistic convolver would centre the filter kernel over the element being convolved but that introduces some extra boundary condition code which I donÃ¢ÂÂt want to get bogged down with in this blog.
We can implement a 1D software convolver in F# as follows:
The clamp function has the following type which accepts an array index of type int and an array of float32 values and returns an element of the array or a clamped value.
This function allows us to define values for negative array indices and works by returning the value of the first array element input. whenever we try to compute a negative index of the input array. Note that in F# arrays are indexed with the .[...] notation.
The convolveAt function has the type:
and computes the convolution of a single position in the input array to produce the result at the corresponding output array. For example, the picture above illustrates the calculation of:
Note that array literals in F# are written using [| ... |] brackets. The definition of convolveAt makes use of an array comprehension which allows us to specify an expression that defines the values of an array:
If the value of kernel.Length is 5 and if we are convolving at the first array index value of 0 for i then this is equivalent to writing:
The final step is to sum the elements of this array which is accomplished with a call to the Array.sum library function.
We can also exploit an array comprehension to produce the completely convolved result array by applying the single point convolution function to every element of the array by using the software function which has the type:
and the definition:
Although this code correctly implements a 1D convolver it is not an ideal starting point for a parallel implementation of convolution. To make an efficient implementation of the convolver for Accelerator we have to express the computation avoiding the manipulation of absolute array indices. Instead we should try to think in terms of whole array operations and relative array indices. The first step in this direction is to think about how to compute the whole result array y rather than how to compute an individual element y[i]. That is, we wish to compute an array which has the following form:
y = [y, y, y, y, y, y, y, y]
Furthermore, we wish to define y using only whole array operations. To help spot how we should do this the equations for y are expanded:
y = ax + ax[-1] + ax[-2] + ax[-3] + ax[-4]
y = ax + ax + ax[-1] + ax[-2] + ax[-3]
y = ax + ax + ax + ax[-1] + ax[-2]
y = ax + ax + ax + ax + ax[-1]
y = ax + ax + ax + ax + ax
y = ax + ax + ax + ax + ax
y = ax + ax + ax + ax + ax
y = ax + ax + ax + ax + ax
We need some way of expressing the result of y in terms of whole array operations on a or x or through the use of only relative indices i.e. we wish to avoid mentioning absolute indices. Notice that the computation above does have whole operations e.g. the second column on the left shows the whole of the x array (shifted left by one) is multiplied by the scalar value a. The pattern occurs across all four columns with each column corresponding to one of the five coefficients. In each column the corresponding scalar coefficient value is used to multiply a shifted version of the x input. This observation allows us to express the value of y using only whole array operations on x and its shifted counterparts. The calculations below represent the x and y arrays as a vector. The multiplication corresponds to the multiplication of a vector by a scalar value. The addition corresponds to the addition of two vectors.
y = [y, y, y, y, y, y, y, y]
= a * [x, x, x, x, x, x, x, x] +
a * [x[-1], x, x, x, x, x, x, x] +
a * [x[-2], x[-1], x, x, x, x, x, x] +
a * [x[-3], x[-2], x[-1], x, x, x, x, x] +
a * [x[-4], x[-3], x[-2], x[-1], x, x, x, x]
This representation is an improvement from the original representation because it contains vector multiplications and vector additions which can be performed in parallel. The original formulation contained multiplications of scalar values (i.e. an element of the coefficient array) by scalar values (e.g. an element of the input array). This representation is not directly amenable to a parallel implementation.
To make the nature of the data-parallel operation shown above clearer we introduce the shift operator which takes as input an array and an amount to shift by N and returns an array of the same size except all the elements are shifted up by the specified number of positions. The first N elements of the result array are clamped by filling them with the first element of the input array. Here are some examples of shifts:
shift (x, 0) = [7, 2, 5, 9, 3, 8, 6, 4] = x
shift (x, -1) = [7, 7, 2, 5, 9, 3, 8, 6]
shift (x, -2) = [7, 7, 7, 2, 5, 9, 3, 8]
Using the shift operator we can re-express y as:
y = a * shift (x, 0) + a * shift (x, -1) + a * shift (x, -2) + a * shift (x, -3) +
a * shift (x, -4)
This leads to a data-parallel implementation which can perform each coefficient multiplication in parallel with the others followed by a sum in parallel. These parallel operations are described in the following figure.
Each blue arrow in this picture corresponds to a data-parallel operation. The F# program below shows how to implement this 1D convolution using both the DX9 GPU target and the x64 SIMD multi-core target.
When run this produces the expected output:
To use an Accelerator target the input data must be created for an Accelerator data-parallel array. Typically the data represented by these arrays will be held in some memory specified by the target implementation. For example, for the DX9 GPU target the input values will be copied to memory on your graphics card. Accelerator provides several methods for creating data parallel arrays and here we create a floating point data-parallel array using a constructor:
and the type of inputArray is FloatParallelArray .
In your F# program you should think of a FloatParallelArray as being a reference or a handle to some real array somewhere else. When you write computations over values of type FloatParallelArray what you are really doing is building up a symbolic expression tree that represents the computation that you want. When you try to evaluate such an expression the Accelerator system JITs (e.g. to GPU code via DX9) to produce the actual code for performing your computation and executes it with the data in the input data-parallel arrays and transmits the result back to your F# program.
Accelerator provides a library of data-parallel arrays and data-parallel operations over its data-parallel arrays. To perform a data-parallel multiplication of a data-parallel array by a single kernel coefficient we use one of AcceleratorÃ¢ÂÂs data-parallel multiplication operations (there are many other overloads):
This does not immediately perform the multiplication. What is really happening is that an expression tree is built up with a multiplication node in it. However, the JIT-ing nature of the Accelerator implementation almost makes it looks like the code is being executed immediately.
The addition operation used in the 1D convolver is also a data-parallel addition over FloatParallelArrays:
Finally, the shift operation here works over one dimensional arrays of type FloatParallelArray and takes just one number which specified how much to shift by:
We may be tempted to specify the overall expression for the convolver by writing:
and how nice that would be but sadly we donÃ¢ÂÂt have a proper 0 yet so instead we write a recursive formulation:
where expr has the type:
Given the index of the highest element in the kernel this function builds up an expression which computes the 1D convolution. This expression is used to call the DX9 GPU target to perform the computation and retrieve the result resDX which has the type float32 :
A target only needs to be created once. The first time a target is created for DX9 there is a delay while the system is initialized. For benchmarking it is a good idea to perform a dummy computation first to Ã¢ÂÂwarm upÃ¢ÂÂ the system.
The previous example convolved a one dimensional (1D) matrix. In this section we show how we can convolve a 2D matrix by applying the convolution independently to each line. This computation is illustrated in the diagram below for the following array of arrays:
An F# version of a 1D convolution of 2D input data is shown below which illustrates convolution in the X and Y directions (independently).
When run, this code computes the correct values:
Although this code performs multiplication, shifting and adding of arrays as we saw before in this case we use a different overloaded functions that work over two dimensional arrays and the shift operation uses a two element array to specify how much to shift by in each direction.
We can make a two dimensional convolver by first convolving in the x-direction and then convolving in the y-direction. Or we can perform a convolution in the y-direction and then in the x-direction. To convolve a 2D matrix we need to use a version of the shift operator which takes two arguments which describe how much to shift by in each dimension. Shifting just in the x-direction is illustrated in the diagram below.
A function which shifts a 2D array just in the x-direction could be written as:
This function captures the shifting right along the rows nature of the x-direction convolution by specifying a shift of i elements along the rows for the i-th iteration (but it never shifts down columns hence the value of 0 in the second element of the array literal).
Shifting in the y-direction is illustrated in the diagram below.
A function which shifts a 2D array just in the y-direction could be written as:
This function captures the shifting down along the columns of the y-direction convolution by specifying a shift of i elements along the columns for the i-th iteration (but it never shifts along rows hence the value of 0 in the first element of the array literal).
Rather than define two separate functions for convolving in the X and Y directions we can define just one function and pass a higher order function which allows us to abstract the direction of convolution.
Note that the convolution function takes a higher order parameter which specifies the shift direction. To convolve in the x-direction we provide a function which is given the degree to shift by i and return the (dx, dy) shift to be applied to the data parallel array i.e. (-i, 0). To shift in the y-direction we provide another function which is given the degree to shift by i and return the (dx, dy) shift to be applied to the data parallel array i.e. (0, -i). We use lambda expressions to defines these functions anonymously (in-line) as arguments to the convolve function.
To run the 2D F# convolver please follow these steps.
· Install Accelerator V2.
· Create a new F# console project and paste in the 2D convolver code at the start of this blog.
· If you are not running a 64-bit operating system please remove the code for using the x64 multi-core target i.e. delete the lines:
· Add a reference to System.Drawing and the managed Microsoft.Accelerator.dll (under the appropriate directory for your target e.g. C:\Program Files (x86)\Microsoft\Accelerator v2\bin\Managed\Release\Microsoft.Accelerator.dll
· Build under Visual Studio.
· Make sure that you have the right Accelerator.dll in your path for your platform and target e.g. C:\Program Files (x86)\Microsoft\Accelerator v2\bin\x64\Release\Accelerator.dll or make sure to have a copy of this file in your build directory.
· Run the generated EXE from the build directory e.g. bin/Release
If you want to use Accelerator code from inside F# Interactive you will need to reference the Accelerator library e.g.:
The Accelerator system encourages you to express data-parallel algorithms in terms of whole array operations which leads to efficient implementations for various back ends (targets). For example, the discipline of using only whole array operations allows me to make efficient address generator circuits for an experimental FPGA target that I am working on.
Accelerator is quite well suited for writing stencil-style data parallel programs. The expression orientated nature of Accelerator computations make them a nice fit for a functional language like F#. In effect the Accelerator system manifests itself as a domain specific language in F#, C# and C++ (as well as other .NET languages).
The F# code in this blog was written with the participation of Lubomir Litchev and James Margetson. Thank you Lubo and James! I aslo recommend this article on using Accelerator from F#: http://tomasp.net/blog/accelerator-intro.aspx.
The Accelerator system is still a research incubation project and the preview release is designed to solicit feedback. You can email email@example.com or use the Microsoft Connect Feedback Form. Thank you!