Using C++ AMP to Perform Millions of Experiments in Hours Instead of Weeks

12/04/2013

Continuing with the series of posts on how customers are using C++ AMP, we are happy to publish the following guest blog post from Matthew Crews, a research student at Oregon State University. Captured below is Matthew's experience in his own words.

"I began working with C++ AMP only a few months after it was first announced at the AMD Summit by Herb Sutter and Daniel Moth. When I first saw the demonstration of the N-Body simulation I was stunned. At the time I was in the midst of a Master's Degree in Industrial Engineering and I was struggling to generate enough data for my experiments. My thesis required that I generate millions of data points from several different optimization techniques, and I was simply unable to get the performance that I needed out of the CPU that I was using. At this point I had rewritten the implementation three times, and while there was at least an order of magnitude improvement in speed each time, I was nowhere near where I wanted to be. I walked through the entirety of my code with a friend of mine who has an M.S. in Computer Science and he was unable to suggest further speed improvements.

A full set of tests would have taken weeks on my quad-core machine and there was no way I could afford to upgrade to an 8-core machine or a dual-socket system. I needed to be able to iterate on my findings, so a multi-week lead time for results was not going to be reasonable. I was close to just throwing away my thesis.

As a caveat, I will be the first to confess that I am not a professional programmer. I frequently utilize programming to get my job done though. I learned QBASIC when I was a kid and later picked up VBA, SQL, and Groovy for work and have done work in F#, C# and C++. I had been watching the rise of CUDA but did not feel that I had the programming chops to tackle that challenge. Even with this limited native coding experience though, C++ AMP appeared to be relatively simple to use. It was actually fairly straightforward to walk through the examples to get up and running. The greatest challenge did not come from C++ AMP but just from using C++. Coming from a predominantly managed code background the complexity of C++ as a language was daunting. Fortunately I was able to learn what I needed and felt comfortable with C++ AMP in a couple of weeks.

The problem that I was dealing with was not complex; it was just large. For my thesis I needed to perform several million iterations of two optimization techniques called Simulated Annealing and Threshold Accepting. These techniques belong to a class of optimization methods called Stochastic Optimization. They both use random variables as part of their search for solutions, so they are not guaranteed to provide the same solution every time. This randomness was the reason that I needed to perform so many iterations: I needed enough samples to form a solid understanding of the distribution of solutions. The work is naturally parallelizable, since each sample is completely independent of the others and can be treated as its own thread. The problem was that I needed so many samples and did not have the computing horsepower to complete them in a reasonable amount of time.
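To make the structure of such a workload concrete, here is a minimal CPU-only sketch in standard C++. It is not the code from the thesis: the toy objective `cost`, the starting range, the cooling factor, and the names `run_sample` and `run_samples` are all assumptions chosen for illustration. It shows the two acceptance rules side by side and how each sample owns its random engine, which is what makes the samples independent.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Toy objective, assumed for illustration only: minimize f(x) = x^2 over integers.
double cost(int x) { return static_cast<double>(x) * x; }

enum class Rule { SimulatedAnnealing, ThresholdAccepting };

// Runs one independent sample and returns the best cost it found.
// Each sample owns its random engine, so samples share no state and
// could run in any order or in parallel.
double run_sample(Rule rule, std::uint32_t seed, int steps) {
    assert(steps > 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> start(-100, 100);
    std::uniform_int_distribution<int> coin(0, 1);
    std::uniform_real_distribution<double> unit(0.0, 1.0);

    int x = start(rng);
    double best = cost(x);
    double level = 50.0;  // temperature (annealing) or threshold (threshold accepting)

    for (int i = 0; i < steps; ++i) {
        int candidate = x + (coin(rng) == 0 ? -1 : 1);
        double delta = cost(candidate) - cost(x);
        bool accept;
        if (rule == Rule::SimulatedAnnealing) {
            // Always accept improvements; accept a worse move with probability exp(-delta / T).
            accept = delta <= 0.0 || unit(rng) < std::exp(-delta / level);
        } else {
            // Accept any move that worsens the cost by less than the current threshold.
            accept = delta < level;
        }
        if (accept) x = candidate;
        if (cost(x) < best) best = cost(x);
        level *= 0.99;  // assumed geometric schedule for both rules
    }
    return best;
}

// Collects one result per sample; on a GPU each loop iteration would be its own thread.
std::vector<double> run_samples(Rule rule, int count, int steps) {
    assert(count >= 0);
    std::vector<double> results(count);
    for (int s = 0; s < count; ++s)
        results[s] = run_sample(rule, static_cast<std::uint32_t>(s), steps);
    return results;
}
```

Calling `run_samples(Rule::ThresholdAccepting, 1000, 500)` yields 1000 best-cost values whose distribution can then be studied. Because no sample reads another sample's state, the loop in `run_samples` is the part that maps onto GPU threads.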

So there I was, nearly ready to throw in the towel on my thesis, when I was blown away by this demonstration of a GPU being programmed with relatively simple C++ and showing incredible performance. What really set C++ AMP apart from CUDA for me was how simple it was to get data to and from the GPU, and not having to manage all of the GPU kernels. At the time I did not have a GPU that C++ AMP would run on, so I ended up borrowing a friend's laptop, which barely had a new enough GPU, and began saving to buy a more powerful discrete GPU. When I finally had enough money saved up I bought the most powerful compute card I could, which at the time was the Radeon HD 7970.

The performance improvements I achieved were astounding. What used to take me 8 hours on a quad-core CPU took the GPU only 20 minutes, a 24X improvement. Instead of a full battery of tests taking weeks, I could get them done in a day. My professor and I revised the approach to my thesis now that I could generate such large volumes of data. C++ AMP came along at a critical juncture in my thesis and is a key enabling technology for me to actually complete my research.

While re-writing my optimization algorithm for the 4th time, this time with C++ AMP, I learned several valuable lessons that I wanted to pass on. These are not meant to deter you but to help you be aware of some of the potential stumbling blocks along the way.

Lesson 1: Long Running Kernels Will Require Managing TDR

Windows has a built-in feature that detects when a GPU is no longer responsive and automatically tries to recover it; this is called Timeout Detection and Recovery (TDR). This means that if you have a GPU kernel which takes longer than 2 seconds to run, you will encounter this issue. For what I was doing the GPU would be taken up for hours if not days, so I had to find a solution to this. After doing some research I found that I could try managing this with DirectX if I wanted to. For my scenario it was easier to just disable TDR by changing the registry. This is not the advised solution. I was running all of my tests on my own computer, so I did not have to worry about running the code anywhere else. If you are planning on distributing an application which uses C++ AMP and has the potential to use the GPU for longer than 2 seconds, then you will have to manage this.
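One common way to stay under the timeout without touching the registry is to split a long run into many short launches and carry each sample's state from one launch to the next. The sketch below is an assumption about how that could look, not what was done for the thesis, and it is plain CPU C++: `launch` stands in for a single GPU kernel invocation, and `SampleState`, `step`, and `run_chunked` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// State carried between launches, one entry per independent sample.
struct SampleState {
    std::uint32_t rng;         // xorshift32 state, must be nonzero
    std::uint64_t steps_done;  // how many steps this sample has completed
};

// One unit of placeholder per-sample work: advance a xorshift32 generator.
inline void step(SampleState& s) {
    s.rng ^= s.rng << 13;
    s.rng ^= s.rng >> 17;
    s.rng ^= s.rng << 5;
    ++s.steps_done;
}

// Stand-in for one kernel launch: advances every sample by `steps` steps.
void launch(std::vector<SampleState>& samples, std::uint64_t steps) {
    for (auto& s : samples)
        for (std::uint64_t i = 0; i < steps; ++i) step(s);
}

// Runs `total_steps` per sample as a series of short launches and returns
// how many launches were needed. `steps_per_launch` would be tuned so that
// one launch finishes well inside the timeout.
std::uint64_t run_chunked(std::vector<SampleState>& samples,
                          std::uint64_t total_steps,
                          std::uint64_t steps_per_launch) {
    assert(steps_per_launch > 0);
    std::uint64_t launches = 0;
    for (std::uint64_t done = 0; done < total_steps;) {
        std::uint64_t n = std::min(steps_per_launch, total_steps - done);
        launch(samples, n);
        done += n;
        ++launches;
    }
    return launches;
}
```

Because all state lives in `SampleState`, a chunked run ends in exactly the same state as one long launch. The cost is the extra launch overhead, so the chunk size is a trade-off between responsiveness and throughput.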

Lesson 2: The Number of Kernels Matters

After I got my code running I played around with the number of kernels that I was instantiating on the GPU at once. If you create too many, the work becomes too large to fit on the GPU; if you create too few, most of the GPU will be left idle and there will be no performance benefit. I encourage you to experiment with the number of kernels so that you find the sweet spot for performance.
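A small timing harness makes this kind of experiment repeatable. The following is a sketch in standard C++ with hypothetical names (`Timing`, `sweep_batch_sizes`); the callable `run_batch` is assumed to launch the given number of instances and to return only once their results are back on the host, since otherwise the measured time would cover just the launch.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// One measurement: how many instances ran at once and the average time per instance.
struct Timing {
    int batch_size;
    double seconds_per_sample;
};

// Times `run_batch(n)` for every candidate size and reports seconds per sample,
// so that sizes can be compared on equal footing.
std::vector<Timing> sweep_batch_sizes(const std::vector<int>& batch_sizes,
                                      const std::function<void(int)>& run_batch) {
    std::vector<Timing> timings;
    for (int n : batch_sizes) {
        assert(n > 0);
        auto start = std::chrono::steady_clock::now();
        run_batch(n);  // assumed to block until the batch has finished
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        timings.push_back({n, elapsed.count() / n});
    }
    return timings;
}
```

Sweeping sizes such as 256, 1024, and 4096 and picking the entry with the lowest `seconds_per_sample` gives a first estimate of the sweet spot. Repeating each measurement a few times helps smooth out noise.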

Lesson 3: Balancing the Evils of Divergence and Global Memory Access

I believe at this point NVIDIA™ and AMD™ are both using a SIMD style of architecture for their GPUs. This means that threads which belong to the same block must all be performing the same instruction at the same time. In cases where a thread is not on the same instruction as the rest of the block, it will sit idle and unutilized. Some of this can be hidden by swapping threads in and out, but it is still better to reduce possible divergence. If your code has enough branching and divergence, a large portion of the GPU will be idle and you may not be able to experience significant performance gains. In my case I was running thousands of instances of the same algorithm, so divergence was less common, although it did occur. I tried to eliminate divergence by refactoring some of my code so that it would always be in sync. What I ended up doing, though, was massively increasing the number of global memory reads that kernels would need to perform. Any benefit from eliminating divergence was completely overwhelmed by all of the reading from global memory that was added. Be careful when designing your algorithm: a little divergence is much better than large amounts of reading from global memory." -- Matthew
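The trade-off in Lesson 3 can be reasoned about with a deliberately crude cost model. The sketch below is an illustration only: the function names are hypothetical, and every cycle count passed to them is a placeholder assumption rather than a measurement from any real GPU. It models a group of threads that executes in lockstep, so the group pays for every distinct branch that any of its threads takes, and compares that with a branch-free step that adds global memory reads.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Cost of one branching step for a lockstep group. `path_taken[i]` is the
// branch chosen by thread i and `path_cost[p]` is the assumed cost of branch p.
// The group executes each distinct branch once while the other threads idle.
int divergent_step_cost(const std::vector<int>& path_taken,
                        const std::vector<int>& path_cost) {
    std::set<int> distinct(path_taken.begin(), path_taken.end());
    int total = 0;
    for (int p : distinct) {
        assert(p >= 0 && p < static_cast<int>(path_cost.size()));
        total += path_cost[p];
    }
    return total;
}

// Cost of a branch-free step: every thread does the same arithmetic but
// pays for the extra global memory reads that keep the threads in sync.
int uniform_step_cost(int compute_cost,
                      int extra_global_reads,
                      int global_read_latency) {
    return compute_cost + extra_global_reads * global_read_latency;
}
```

With placeholder costs of 10 and 14 cycles for two branches, a group in which one thread diverges pays 24 cycles instead of 10. If the branch-free rewrite needs 4 extra global reads at an assumed 400 cycles each, it pays 12 + 4 × 400 = 1612 cycles. The exact figures will differ on real hardware, but the shape of the comparison matches the experience described above.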

Please do send us your experience using C++ AMP!
