CUDA and Stream Programming – The Next Big Thing?

Today, while researching which video card to buy, I kept seeing the word CUDA in the specs, but I did not know what it meant. About a year ago Nvidia bought Ageia, the company behind PhysX. Ageia created a physics runtime and a special add-in card that the runtime uses instead of the CPU. Now that Nvidia owns PhysX, CUDA-enabled video cards are also supported by the PhysX runtime. Games that use the PhysX runtime offload some of their physics math to it in order to do neat effects like realistic water, realistic cloth, smoke/gas effects, etc. Read Nvidia's page for more info.

CUDA stands for Compute Unified Device Architecture. CUDA is a parallel-programming-focused runtime/API that allows developers to write programs that can run partly on their GPU. Modern Nvidia GPUs have hundreds of stream processors; the new GTX 295 has a whopping 480 of them! These programs need not be graphics specific. CUDA unleashes a ton of power (4 teraflops on high-end Nvidia solutions) to us developers. CUDA/stream processing puts supercomputer power into our hands at a fraction of the cost.

Let's talk a little about stream processing. In stream programming, the stream is the data set you are working on. The stream processors perform the same operation on each element, or thread, in the stream in parallel. You can sort of think of stream programming as a relative of SIMD (Single Instruction, Multiple Data – MMX, 3DNow!, SSE, etc.). Nvidia calls their technology Single Instruction, Multiple Thread, or SIMT. With SIMD the size of the dataset/vector is fixed, and grouping data so it can be executed in parallel can be difficult. With SIMT you just need to write code that can be executed in parallel and CUDA takes care of the rest. Efficient CUDA code will easily scale if more processors are available.
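To make the SIMT idea concrete, here is a minimal sketch of what a CUDA kernel looks like – each thread performs the same operation on exactly one element of the stream. The names (`add`, `dev_a`, etc.) and the block size of 256 are my own assumptions for illustration, not anything from Nvidia's docs:

```cuda
// Hypothetical example: add two float arrays element-wise in parallel.
// Each thread computes its own global index and handles one element.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n)                                     // guard against overrun
        c[i] = a[i] + b[i];
}

// Host side: launch enough 256-thread blocks to cover all n elements.
// add<<<(n + 255) / 256, 256>>>(dev_a, dev_b, dev_c, n);
```

Notice there is no loop over the data – the hardware runs the same instruction across however many threads you launch, which is why the code scales to more processors for free.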

So what is the catch? Programming in CUDA/stream programming is a completely different experience than most developers are used to. Developers need to break their code into the simplest operations that can run in parallel without any dependencies on other threads. Not every program can take advantage of CUDA; on consumer computers the candidates are multimedia programs (image processing, 3D, sound, etc.) and maybe encryption. Only operations that deal with large amounts of data and are normally CPU intensive will see improvements. Data compression could be faster, but it is very I/O intensive and the data has to be read in and out of the video card (I would love to see 7-Zip running on CUDA since it is very, very CPU intensive).

If you look at Nvidia's user-submitted CUDA demo site you'll see some really, really interesting applications claiming huge speed increases: math libraries claiming 40x speedups over the CPU (not sure which GPU and CPU they are comparing), password crackers claiming 50x, quicksort 10x, Google's MapReduce algorithm 100x, etc.

CUDA programs are written in a special version of C. Developers create special functions called kernels. When invoked, a kernel function is run once on each of N threads. Inside a kernel you use a special variable called threadIdx to identify which thread you are in. threadIdx is a vector with x, y, and z components – I guess for doing 3D math, which is what these cards are made for 😉 The threadIdx variable is very useful when writing your kernels. Threads are divided into groups called blocks. Threads have access to three different kinds of memory: their own private thread-local storage, shared storage that is in scope for the thread block, and global memory which is accessible by all threads. Read the SDK documentation for more info.
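Here is a hedged sketch showing all three kinds of memory in one kernel – the kernel name, the scaling operation, and the block size of 256 are made up for the example:

```cuda
// Hypothetical kernel touching the three kinds of memory a thread can use.
__global__ void scale(float *data, float factor)
{
    // i lives in a register: private, per-thread storage.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // __shared__ memory is visible to every thread in this block only.
    __shared__ float cache[256];

    // data[] lives in global memory, accessible by all threads.
    cache[threadIdx.x] = data[i];
    __syncthreads(); // wait for the whole block before trusting cache[]

    data[i] = cache[threadIdx.x] * factor;
}
```

The __syncthreads() call is the kind of thing you never think about on a CPU: because the block's threads share that cache array, they have to agree on when it is fully written.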

Unfortunately I really don't have much experience with C or C++. I want to make a simple hello world example, but I don't really think there is an equivalent in CUDA. We need to create something that will make use of the massive number of parallel processors a GPU has. I am not a graphics developer, math expert, or anything like that, so I decided to make something really simple.

It is possible to work with the CUDA driver API using C#. You can work with nvcuda.dll using P/Invoke, but the driver API does not support emulation, so you will need an actual card to begin developing (I just ordered mine). You write your CUDA code and compile it with nvcc to generate a cubin file. You can then use the driver API to load the cubin file, execute kernels, read/write device memory, etc. There is a free third-party wrapper for nvcuda developed by the Company for Advanced Supercomputing Solutions, so we will be using it in our example code.
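Whether you call it through P/Invoke or straight C, the driver API flow is the same. Here is a rough sketch in C of that load-and-launch sequence; the file name "kernel.cubin" and the kernel name "add" are assumptions for illustration, and all error checking is omitted:

```cuda
/* Sketch of the CUDA driver API flow against nvcuda.
   Compile the device code separately with: nvcc -cubin kernel.cu */
#include <cuda.h> /* driver API header */

int main(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);                          /* must be called before anything else */
    cuDeviceGet(&dev, 0);               /* grab the first CUDA-capable card */
    cuCtxCreate(&ctx, 0, dev);          /* a context is like a GPU "process" */
    cuModuleLoad(&mod, "kernel.cubin"); /* load the nvcc-compiled cubin */
    cuModuleGetFunction(&fn, mod, "add");

    /* ...allocate device memory with cuMemAlloc, copy inputs with
       cuMemcpyHtoD, set kernel parameters, launch, copy results back... */

    cuCtxDestroy(ctx);
    return 0;
}
```

The C# wrapper is essentially doing these same calls for you under the hood, which is why it needs a real card – there is no emulated nvcuda.dll to fall back on.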

When I get my new video card next week we will write some really simple CUDA examples and compare performance results. Peace.


Comments (4)

  1. x1311 says:

    Hey, that's really interesting. Hope you'll get a hello world working 🙂

  2. You’ve been kicked (a good thing) – Trackback from

  3. Thank you for submitting this cool story – Trackback from DotNetShoutout
