C++ AMP used for GPU Benchmarking

04/23/2013

Continuing our series on how C++ AMP is being used, we are happy to highlight its use in GPU benchmarking. SystemCompute is an in-house benchmark developed by Dr. Ian Cutress of AnandTech. When we learned that the GPU version of this benchmark was written in C++ AMP, we reached out to Dr. Cutress for a blog post. His guest post below details his experience.

"My Experience with C++ AMP

My first foray into multi-threaded programming, as a computer enthusiast rather than a computer scientist, led to OpenMP. I was always fascinated by all the different BOINC projects, and how they were able to use the resources around them to brute-force their computations – the embarrassingly parallel tasks lent themselves to such methods and were welcomed with open arms. During my undergraduate degree I donated a lot of my PC's CPU time to helping these projects, then eventually GPU time. By the time I started my Computational Chemistry PhD, I was neither a competent programmer nor an optimized one; my master's thesis was computational yet written in Visual Basic .NET and thus amazingly slow. However, moving to C/C++ and looking at OpenMP gave the research group a good deal of speedup, especially when all around me people were only using one core of a quad-core machine (albeit with four simulations running at once). It was at this point I casually remarked how interesting it would be to program GPUs. The situation took a lucky turn when one of my friends, who was studying Computational Finance, put me in touch via Facebook with his supervisor, who was starting the CUDA course at Oxford. After a week-long crash course on Linux with no IDE (I was totally out of my depth), I set forth with CUDA knowledge and pains. What followed was three years of my PhD using CUDA to publish several papers and get to grips with what programming on a parallel device really entails – the intricacies of making sure kernels are register-light, and rearranging operations or formulae to use the fewest cycles and get results quicker.

Since leaving academia, I am now a hardware reviewer (which is a nice logical progression from chemistry… not :) ) for AnandTech.com, a well-known hardware technology review website. As the Senior Motherboard Editor, I have the chance to dice and tumble with a lot of hardware, especially on the CPU and motherboard side. This means not only the consumer-level kit, but also some workstation builds and dual-processor setups. While this has nothing to do with GPU programming, I enjoyed reading about CUDA developments, and what OpenCL was bringing to the table. Reading about them put me squarely in the higher-level CUDA camp – OpenCL looked very daunting and somewhat confusing without an instructor telling you what to do!

Throughout all this time I have been an avid competitive overclocker – practicing the dark art of making computer hardware run faster for short periods to prove to the world you can make hardware run faster than anyone else. I enjoyed some minor success, becoming UK number one in the enthusiast league at HWBot.org for a couple of years, before moving on to the UK Team. The way to determine whether you are better than someone else (in terms of speed and efficiency) is to run one of the benchmarks supported by the league, such as SuperPi or wPrime for CPUs or 3DMark for GPUs. The system is always looking for better benchmarks, and the lack of a compute benchmark for GPUs always irked me. So as part of my reviewing, and also for the overclocking scene, I set out to write one.

I cannot remember exactly how I came to [hear] about C++ AMP. In order to write a benchmark for the overclockers it had to be GPU vendor agnostic, so CUDA was out of the picture, and as I mentioned before, OpenCL looked confusingly daunting (even with a step-by-step manual in front of me). I did, however, come across C++ AMP and the Native Concurrency blog, and started reading about it. I very quickly picked up the C++ AMP book by Gregory and Miller and started to work through it along with the code examples posted on the Native Concurrency blog.

My first thought was that C++ AMP was incredibly simple. I mean amazingly so. For simple kernels there was no need to worry about allocating memory, although I was initially confused by array and array_view, given that I had not performed any matrix mathematics in the past. I had no prior knowledge of lambdas (in actual fact I doubt I could describe one to you now), but using a combination of monkey see and monkey do, I was able to port a good deal of the basic structure of my PhD research code into C++ AMP. All in all I probably saved 75% or more of the lines of code as well. The only thing that bugs me is getting my code to work on multi-GPU setups, as it only works on a single GPU right now.
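For readers who have not seen the array_view and lambda pattern described above, here is a minimal sketch. It is not taken from SystemCompute: the function name add_vectors and the vector-addition workload are illustrative assumptions. The C++ AMP path is only compiled where the Microsoft compiler provides <amp.h> (the preprocessor guard is itself an assumption about a typical setup); elsewhere the sketch falls back to a plain CPU loop so that it still builds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Use C++ AMP only where the Microsoft compiler ships <amp.h>.
// Recent Visual Studio versions mark AMP as deprecated, hence the macro.
#if defined(_MSC_VER) && !defined(__clang__) && defined(__has_include)
#  if __has_include(<amp.h>)
#    define _SILENCE_AMP_DEPRECATION_WARNINGS
#    include <amp.h>
#    define USE_CPP_AMP 1
#  endif
#endif
#ifndef USE_CPP_AMP
#  define USE_CPP_AMP 0
#endif

// Hypothetical example: element-wise sum of two equally sized vectors.
std::vector<float> add_vectors(const std::vector<float>& a,
                               const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    if (a.empty()) return c;  // an array_view needs a non-empty extent

#if USE_CPP_AMP
    const int n = static_cast<int>(a.size());
    // array_view wraps existing host memory; there is no explicit
    // device allocation or copy call in user code.
    concurrency::array_view<const float, 1> av(n, a);
    concurrency::array_view<const float, 1> bv(n, b);
    concurrency::array_view<float, 1> cv(n, c);
    cv.discard_data();  // c's initial contents need not be copied over

    // The lambda is the kernel; restrict(amp) marks it as accelerator code.
    concurrency::parallel_for_each(
        cv.extent,
        [=](concurrency::index<1> i) restrict(amp) {
            cv[i] = av[i] + bv[i];
        });
    cv.synchronize();  // copy the results back into c
#else
    // Fallback when C++ AMP is unavailable: the same work on the CPU.
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
#endif
    return c;
}
```

Calling add_vectors({1.0f, 2.0f}, {3.0f, 4.0f}) yields a vector holding 4 and 6 on either path. The only GPU-specific parts are the three array_view wrappers and the lambda passed to parallel_for_each.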

My final result was twofold – a benchmark for reviews (GPU version featured in Ryan Smith's GTX Titan review, CPU version featured in my Dual Sandy Bridge-E and Upgrading from Westmere reviews) and a potential benchmark for overclocking. The GPU version is purely C++ AMP, and the CPU version takes more bites from OpenMP due to efficiency.
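As a rough illustration of the OpenMP style a CPU version can lean on, the sketch below parallelises a simple loop with a reduction. The sum-of-squares workload and the name sum_of_squares are assumptions made for the example, not code from SystemCompute. Built without OpenMP support, the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <vector>

// Hypothetical workload: sum of squares of the input values.
double sum_of_squares(const std::vector<double>& v) {
    double total = 0.0;
    const long n = static_cast<long>(v.size());

    // Each thread accumulates a private partial sum; OpenMP combines
    // the partial sums into total when the loop finishes.
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += v[i] * v[i];
    }

    assert(total >= 0.0);  // squares of real values cannot sum below zero
    return total;
}
```

With GCC or Clang the pragma takes effect when compiling with -fopenmp, and with the Microsoft compiler with /openmp. For example, sum_of_squares({1.0, 2.0, 3.0}) returns 14.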

If I were advising someone learning to program GPUs today, from scratch, I would tell them to look at C++ AMP first. It seems quicker to get off the ground than CUDA, and not overly confusing if you are familiar with C++ and a Windows environment. CUDA does have its advantages, being the older and more optimized language with the clout of NVIDIA behind it, but C++ AMP is not restricted to one vendor: it also runs on AMD GPUs, which widens the hardware base of whatever software you are creating.

The thread for the software can be found over at overclock.net. Basic code snippets can also be found in my Dual Sandy Bridge-E review."

--Dr Ian Cutress.

Please do note that we are working with Dr. Cutress to get his software made available to the community.
