Image processing is a computational task that lends itself very well to GPU compute. Many of the most commonly used algorithms are inherently massively parallel, with each pixel in the image processed independently of the others. As a result, image processing toolkits have been early adopters of the GPGPU programming model. Many of these mass-market toolkits, however, are more accurately described as image manipulation packages that offer “image-in, image-out” capabilities: each operation takes an input image and produces a manipulated output image. An image processing workflow, in contrast, usually aims to portray or extract analytical information, which only becomes available after several processing steps. Such workflows are common in scientific and technical fields such as medical imaging.
For the last two years, Core Optical Inc. has been building an image processing toolkit for the .NET Framework called PrecisionImage.NET. Internally it centers on two separate execution branches, one targeting multicore CPUs and the other targeting GPUs. The CPU branch is a fully managed, CLS-compliant implementation that leans heavily on the .NET Framework’s excellent built-in thread pool, while the GPU branch is implemented with Microsoft’s brand-new C++ AMP compiler.
We had two requirements when choosing the GPGPU tool for that branch of our toolkit. First, the generated code needed to be vendor agnostic, so that a decision to use our SDK wouldn’t overly restrict our customers’ choice of graphics hardware. C++ AMP currently requires DirectX 11 as a minimum, a version that will soon be ubiquitous among modern GPUs from Intel, Nvidia and AMD. Second, since we focus on the Microsoft developer stack, we needed something that would play nicely with the .NET Framework. C++ AMP was the obvious choice in this regard, since it is produced by Microsoft.
For a v1 product, we’ve found C++ AMP to be both solid and easy to program against. Although Microsoft doesn’t produce an official managed wrapper, accessing AMP from .NET was a straightforward matter of P/Invoking from our existing C# code base. To keep the surface area between the two to a minimum, we kept our managed code for the CPU fallback and condensed the SDK’s operations into hundreds of compact AMP kernels, compiled in combinations of single/double precision and 32/64-bit builds. In almost all cases the simpler untiled model readily met our speedup goals. When it didn’t, we were able to produce a tiled version that met our performance objectives with minimal drama.
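As a rough illustration of the managed/native boundary described above, the native side of a P/Invoke call can be a plain C-ABI function over raw pixel buffers. The sketch below is an assumption made for illustration, not PrecisionImage.NET’s actual interface: the function name `pi_invert_f32`, its parameters and its return codes are all hypothetical, and the loop is ordinary C++ standing in for the point where a C++ AMP kernel would be launched.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical export: the name, signature and return codes are illustrative only.
// extern "C" disables C++ name mangling so a managed caller can bind by name.
// (A real Windows DLL would additionally mark this __declspec(dllexport).)
extern "C" std::int32_t pi_invert_f32(const float* src, float* dst,
                                      std::int32_t width, std::int32_t height)
{
    // Validate at the boundary and report failure through the return code,
    // since C++ exceptions must not propagate into managed code.
    if (src == nullptr || dst == nullptr || width <= 0 || height <= 0)
        return -1;

    const std::size_t count = static_cast<std::size_t>(width) *
                              static_cast<std::size_t>(height);

    // Stand-in for the GPU kernel launch: every pixel is independent of the others.
    for (std::size_t i = 0; i < count; ++i)
        dst[i] = 1.0f - src[i];

    return 0; // success
}
```

On the C# side, a matching declaration would look something like `[DllImport("native.dll", CallingConvention = CallingConvention.Cdecl)] static extern int pi_invert_f32(float[] src, float[] dst, int width, int height);`, where the DLL name is again hypothetical. Keeping the signature to pointers and integers is one way to keep the surface area between the two code bases small.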
To demonstrate the performance of the GPU branch, we compared the speed of two operations running on a 6-core CPU (multithreaded managed code) against the C++ AMP versions running on two different Nvidia GPUs. The first operation was chosen as an ideal case for the GPU: a 2D convolution with a large kernel, implemented using AMP’s simple untiled model. The second was chosen for its unsuitability to GPU processing and was implemented using the tiled model. Even including the overhead of marshalling arguments from managed to native code and copying memory to and from the GPU, we saw huge gains (60x) in the first test case. Perhaps more surprising were the gains in the second, less suitable test case (up to 7x), an indication of the quality of the AMP compiler. Based on our experience to date, if you are a developer considering using AMP from a managed code base, we can recommend it without reservation.
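To make the first test case concrete, here is a minimal sketch of a direct 2D convolution, written in plain standard C++ rather than C++ AMP so that it runs anywhere. It is not the benchmarked kernel: the function name `convolve2d`, the row-major float layout and the clamp-to-edge border handling are assumptions, since the kernel size and border policy of the real test are not stated here.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Direct ("untiled") 2D convolution on a row-major float image.
// Assumptions: ksize is odd, and out-of-range samples clamp to the nearest edge.
std::vector<float> convolve2d(const std::vector<float>& image, int width, int height,
                              const std::vector<float>& kernel, int ksize)
{
    std::vector<float> out(image.size(), 0.0f);
    const int r = ksize / 2; // kernel radius

    // The two outer loops form the parallel domain: on a GPU, each (x, y)
    // would be handled by its own thread with no dependence on its neighbours' results.
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int ky = -r; ky <= r; ++ky) {
                for (int kx = -r; kx <= r; ++kx) {
                    // Clamp the sample coordinates to the image bounds.
                    const int sy = std::min(std::max(y + ky, 0), height - 1);
                    const int sx = std::min(std::max(x + kx, 0), width - 1);
                    // Flipped kernel indices: true convolution, not correlation.
                    const float k = kernel[static_cast<std::size_t>((r - ky) * ksize + (r - kx))];
                    sum += image[static_cast<std::size_t>(sy * width + sx)] * k;
                }
            }
            out[static_cast<std::size_t>(y * width + x)] = sum;
        }
    }
    return out;
}
```

Each output pixel costs `ksize * ksize` multiply-adds and reads only from the input image, which is why a large kernel makes this such a favourable case for the GPU: the arithmetic per pixel grows while the data transferred stays the same. In C++ AMP, the two outer loops would correspond to the 2D extent passed to `parallel_for_each`.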
Currently, one aspect of C++ AMP imposes a performance limitation (acknowledged by Microsoft) on our particular use cases: redundant memory copying between CPU and GPU. This is imposed partly by hardware and partly by software. Because our SDK is designed to allow the easy assembly of processing chains, the overhead of these redundant copies can add up quickly. Microsoft has stated that this behavior needs improvement, and all our AMP kernels use the array_view abstraction so that they can take advantage of the improvement when it arrives. This will be a welcome enhancement to the AMP implementation, especially given AMD’s recent announcement of their hUMA architecture initiative. With both the hardware and software pieces falling into place soon, we should see a whole new generation of image processing software with unprecedented power and flexibility.
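To see how quickly the copies add up in a processing chain, consider a deliberately simplified counting model. It assumes that every stage takes one image and produces one image of the same size, and that in the worst case each stage uploads its input and downloads its output. These assumptions, and the names `TransferCount`, `count_transfers` and `redundant_bytes`, are illustrative only and do not describe measured C++ AMP behavior.

```cpp
#include <cstddef>

// Host<->device transfer counts for a chain of GPU stages (simplified model).
struct TransferCount {
    std::size_t per_stage_round_trip; // every stage uploads its input and downloads its output
    std::size_t device_resident;      // upload once at the start, download once at the end
};

// Counts transfers for a chain of `stages` operations under both strategies.
TransferCount count_transfers(std::size_t stages)
{
    if (stages == 0)
        return {0, 0}; // an empty chain moves no data
    return {2 * stages, 2};
}

// Bytes moved unnecessarily when intermediates make a round trip through the CPU.
std::size_t redundant_bytes(std::size_t stages, std::size_t image_bytes)
{
    const TransferCount c = count_transfers(stages);
    return (c.per_stage_round_trip - c.device_resident) * image_bytes;
}
```

Under this model a five-stage chain performs ten transfers where two would do, so eight image-sized copies are pure overhead. The gap grows linearly with the length of the chain.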