Using CUDA Libraries from C++ AMP

If you are a CUDA programmer, you may already know that we have published a customized C++ AMP learning guide. If you know CUDA and plan to use C++ AMP for new projects or to port existing ones, that guide shows you how to learn C++ AMP.

In addition to authoring and porting your own code, your scenario may require third-party library code. We have you covered with the C++ AMP libraries. If you need a library that we don’t yet offer on CodePlex, and you are happy with a solution that runs only on NVIDIA hardware, then you may want to reuse some of the third-party CUDA libraries, including those from NVIDIA.

In this blog post, I will address this scenario. First I will present some basics on C++ AMP and CUDA interop, then I will share a utility header file to smoothly enable this scenario, and then I will share a couple of examples that use the utility header file. At the end you can download the entire solution in the ZIP file attached to this blog post.

C++ AMP and CUDA interop basic background

A CUDA library function commonly takes one or more pointers to input/output CUDA buffers as parameters and performs some computation on a CUDA-capable GPU. Let’s assume that you have written some C++ AMP code that uses concurrency::arrays as data containers for computations that execute on an accelerator_view. Now, at some point, you want to reuse a CUDA library function so that it takes the resources underlying the arrays directly as input and output, without extra copies, and executes on the CUDA device that corresponds to the accelerator_view.

This requirement can be met by combining two kinds of interop: between C++ AMP and Direct3D11, and between Direct3D11 and CUDA. Here are the steps:

  1. Get an ID3D11Device from a C++ AMP accelerator_view via C++ AMP/D3D11 interop.
  2. Set the ID3D11Device as the current CUDA device via CUDA/D3D11 interop.
  3. Get ID3D11Buffers from C++ AMP arrays created on the aforementioned accelerator_view via C++ AMP/D3D11 interop.
  4. Register and map the ID3D11Buffers as CUDA resources, and get mapped pointers via CUDA/D3D11 interop.
  5. Invoke the CUDA library function by supplying the mapped pointers as parameters.
  6. Unmap and unregister the CUDA resources.
  7. Reset the current CUDA device.
  8. Continue using the C++ AMP arrays for the rest of the C++ AMP code.
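To make the order of the steps concrete, here is a minimal sketch of the sequence in portable C++. It does not touch a GPU: do_step is a hypothetical stand-in that only records each step, so the sketch compiles without the CUDA SDK. The calls named in the comments are the usual C++ AMP and CUDA runtime entry points for each role, listed here as an assumption; check amp_cuda.h for what the sample actually calls.

```cpp
#include <cassert>
#include <string>
#include <vector>

typedef std::vector<std::string> step_log;

// Hypothetical stand-in for one interop call: it only records the step name.
void do_step(step_log& log, const std::string& name) { log.push_back(name); }

step_log run_interop_sequence() {
    step_log log;
    // 1. concurrency::direct3d::get_device(av), then QueryInterface for ID3D11Device
    do_step(log, "get ID3D11Device");
    // 2. cudaD3D11SetDirect3DDevice (a CUDA 4.x-era call)
    do_step(log, "set CUDA device");
    // 3. concurrency::direct3d::get_buffer(arr), then QueryInterface for ID3D11Buffer
    do_step(log, "get ID3D11Buffer");
    // 4. cudaGraphicsD3D11RegisterResource, cudaGraphicsMapResources,
    //    cudaGraphicsResourceGetMappedPointer
    do_step(log, "register and map");
    // 5. For example curandGenerateUniform or cublasSgemm on the mapped pointer
    do_step(log, "call CUDA library");
    // 6. cudaGraphicsUnmapResources, cudaGraphicsUnregisterResource
    do_step(log, "unmap and unregister");
    // 7. cudaDeviceReset (cudaThreadExit in older toolkits)
    do_step(log, "reset CUDA device");
    // 8. Back to parallel_for_each and friends on the same array
    do_step(log, "continue in C++ AMP");
    assert(log.size() == 8); // one entry per step in the list above
    return log;
}
```

Note the symmetry: step 6 undoes step 4 and step 7 undoes step 2, which is what makes the sequence a natural fit for RAII wrappers.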

My sample provides a header that contains types I built to encapsulate the interop steps above, and it uses two examples to demonstrate how to invoke CUDA library functions. Now let me walk you through the sample.

Utilities for C++ AMP/CUDA interop

I created the “amp_cuda.h” header file that encapsulates the C++ AMP/CUDA interop utilities that I built. It mainly provides two types designed to be used in the RAII fashion. The types are available in the amp_cuda namespace.

The first type is called scoped_device. Its constructor takes an accelerator_view as a parameter and sets the current CUDA device using the ID3D11Device fetched from the accelerator_view. The constructor also checks whether the device is CUDA capable; if it is not, it throws an exception. The destructor resets the current CUDA device. You should construct a scoped_device object within a scope. When the scope exits, the object is destroyed automatically.

 {
     amp_cuda::scoped_device device(av); // av is a C++ AMP accelerator_view
  
     // Invoke CUDA library functions here
 }

Within the scope, after “device” is constructed, any CUDA library functions invoked will execute on “device”.
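The set-on-construct, reset-on-destruct behavior can be illustrated without a GPU. The sketch below is not the real amp_cuda::scoped_device: fake_accelerator_view, g_current_device, and scoped_device_sketch are hypothetical stand-ins that only mimic the behavior described above, including the exception thrown for a device that is not CUDA capable.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for an accelerator_view: an id and a capability flag.
struct fake_accelerator_view {
    int  device_id;
    bool cuda_capable;
};

// Stand-in for "the current CUDA device"; -1 means none is set.
static int g_current_device = -1;

// RAII guard in the spirit of scoped_device (not the real amp_cuda type).
class scoped_device_sketch {
public:
    explicit scoped_device_sketch(const fake_accelerator_view& av) {
        if (!av.cuda_capable)
            throw std::runtime_error("accelerator_view is not CUDA capable");
        g_current_device = av.device_id;  // "set the current CUDA device"
    }
    ~scoped_device_sketch() { g_current_device = -1; }  // "reset the current CUDA device"

private:
    // Non-copyable: two guards must not both reset the same device.
    scoped_device_sketch(const scoped_device_sketch&);
    scoped_device_sketch& operator=(const scoped_device_sketch&);
};

// Returns the device that was current while the guard was alive.
int device_seen_inside_scope(const fake_accelerator_view& av) {
    scoped_device_sketch device(av);
    assert(g_current_device == av.device_id);
    return g_current_device;
}   // the guard is destroyed here, so g_current_device is -1 again
```

If the constructor throws, the object never comes into existence, so its destructor does not run and the current device is left untouched.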

The other type is scoped_buffer. Its constructor takes a concurrency::array & and a map flag. The map flag is an enum of type scoped_buffer_map_kind with three mapping kinds: read-write, read-only, and write-discard. The constructor registers and maps the ID3D11Buffer fetched from the array according to the map flag, and stores the resulting mapped pointer to the CUDA buffer on the device inside the object. A second constructor takes only a const concurrency::array & and always maps the buffer as read-only. The destructor unmaps and unregisters the resource. The member function cuda_ptr returns the pointer to the CUDA buffer, and the member function size returns the size of the buffer in bytes. scoped_buffer is typically used as follows:

 {
     amp_cuda::scoped_device device(av); // av is a C++ AMP accelerator_view
     amp_cuda::scoped_buffer buffer(arr, // arr is a C++ AMP array on av
                                    amp_cuda::scoped_buffer_read_write_kind);
  
     float * ptr = buffer.cuda_ptr<float>();
  
     // Invoke a CUDA library function by supplying ptr as parameter
 }

Note that the arrays used to construct scoped_buffers must be located on the same accelerator_view that was used to construct the scoped_device. Once the scope exits, the destructors run in reverse order of construction, first for “buffer” and then for “device”, and we are ready to go back to using “av” and “arr” in the C++ AMP world.
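The destruction order matters because the buffer has to be unmapped while the device is still current. Here is a minimal sketch that shows the order C++ guarantees for locals in one scope. device_guard and buffer_guard are hypothetical stand-ins that only log events; they are not the real amp_cuda types.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

static std::vector<std::string> g_events;  // records what happened, in order

// Hypothetical stand-ins that log instead of calling into CUDA.
struct device_guard {
    device_guard()  { g_events.push_back("set device"); }
    ~device_guard() { g_events.push_back("reset device"); }
};

struct buffer_guard {
    explicit buffer_guard(std::size_t bytes) : bytes_(bytes) {
        g_events.push_back("register+map buffer");
    }
    ~buffer_guard() { g_events.push_back("unmap+unregister buffer"); }
    std::size_t size() const { return bytes_; }  // size in bytes
private:
    std::size_t bytes_;
};

// Returns how many floats a mapped buffer of the given byte size holds.
std::size_t use_buffer(std::size_t bytes) {
    device_guard device;         // constructed first
    buffer_guard buffer(bytes);  // constructed second
    assert(buffer.size() % sizeof(float) == 0);
    g_events.push_back("call library");
    return buffer.size() / sizeof(float);
}   // buffer is destroyed first, then device
```

Because the size is reported in bytes, divide by sizeof(float) (or the relevant element type) when a library function expects an element count.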

Two Examples

“Example.h” and “Example.cpp” use two examples to show how to use the scoped_device and scoped_buffer classes introduced in the previous section. One example shows how to invoke a CURAND function (curandGenerateUniform), and the other shows how to invoke a CUBLAS function (cublasSgemm). Note that you could have just used the C++ AMP RNG and the C++ AMP BLAS libraries, but I am using these functions to show you the technique. The code is simple and straightforward. Please take a look at the source code and apply the technique to your own third-party libraries.
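One general point to keep in mind when handing a two-dimensional concurrency::array to cublasSgemm: C++ AMP arrays are stored row-major, while CUBLAS follows the column-major BLAS convention. A common workaround is to swap the operands, because a row-major matrix read as column-major is its transpose. The sketch below is a CPU-only illustration of that idea, not code from the sample; sgemm_colmajor and multiply_rowmajor are hypothetical helpers, and sgemm_colmajor merely imitates the indexing of a column-major sgemm without transposes or scaling factors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference using column-major indexing, like BLAS sgemm without transposes:
// C (m x n) = A (m x k) * B (k x n), with leading dimensions lda, ldb, ldc.
void sgemm_colmajor(int m, int n, int k,
                    const float* A, int lda,
                    const float* B, int ldb,
                    float* C, int ldc) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p)
                sum += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = sum;
        }
}

// Row-major C (m x n) = A (m x k) * B (k x n) via the column-major routine.
// We compute C^T = B^T * A^T by swapping the operands and the m/n dimensions.
std::vector<float> multiply_rowmajor(int m, int n, int k,
                                     const std::vector<float>& A,
                                     const std::vector<float>& B) {
    assert(A.size() == static_cast<std::size_t>(m) * k);
    assert(B.size() == static_cast<std::size_t>(k) * n);
    std::vector<float> C(static_cast<std::size_t>(m) * n);
    sgemm_colmajor(n, m, k, B.data(), n, A.data(), k, C.data(), n);
    return C;
}
```

With a real cublasSgemm call the same swap applies: pass the pointer for B first, the pointer for A second, and exchange the m and n arguments.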

Download the sample

I tested the sample with an NVIDIA GTX 480 card (driver 9.18.13.479) on a Windows 7 64-bit SP1 machine with Visual Studio 2012 and CUDA SDK 4.2 (64-bit) installed. The two CUDA libraries used in the sample are part of the CUDA SDK. Each is provided as a header file, a lib, and a DLL. The Visual Studio project for the sample sets the “VC++ Directories” to point to the corresponding locations in the SDK. Your own CUDA libraries should be supplied and used in a similar way.

Please download the attached sample project that we discussed in this post, and take the time to understand what the code does. Note that the sample only demonstrates that it is feasible to use CUDA libraries from C++ AMP; it does not try to be complete. You are welcome to change and extend it to fit your own needs. As always, if you have any questions, please post below or in our forum.

CUDAInterOp.zip