What’s new for C++ AMP in Visual Studio 2013


Since the first release of C++ AMP in Visual Studio 2012 nearly 8 months ago, we have been working hard to bring you the next set of C++ AMP features. The BUILD 2013 day 2 keynote demo provided a snapshot of C++ AMP in Visual Studio 2013. In this post, we delve into the C++ AMP features available in Visual Studio 2013 Preview.

Support for shared CPU/GPU memory

Data transfer between the CPU and GPU is now significantly more efficient on accelerators that share physical memory with the CPU, because redundant copying of data between GPU and CPU memory has been eliminated. Depending on how the code is written, C++ AMP applications that run on integrated GPU and WARP accelerators should see no (or significantly reduced) time spent copying data. This feature is available only on Windows 8.1 and is turned on by default for WARP and some integrated GPUs. Developers can also opt into the feature programmatically through a set of APIs.
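As a rough sketch of the programmatic opt-in, the snippet below assumes the `accelerator` members `supports_cpu_shared_memory` and `set_default_cpu_access_type` from `<amp.h>` as shipped with Visual Studio 2013. The helper name `try_enable_shared_memory` is hypothetical, and the non-Visual C++ branch is only there so the file compiles on other toolchains.

```cpp
#include <exception>

#ifdef _MSC_VER   // C++ AMP (<amp.h>) is assumed to be available only with Visual C++
#include <amp.h>

// Hypothetical helper: asks the default accelerator to use CPU/GPU shared
// memory for subsequent allocations. Returns true only if the request was
// accepted by the runtime.
bool try_enable_shared_memory()
{
    try {
        concurrency::accelerator acc;           // default accelerator
        if (!acc.supports_cpu_shared_memory) {  // e.g. a discrete GPU
            return false;
        }
        // Assumption: this should run early, before memory is allocated on
        // the accelerator; otherwise the call is expected to return false.
        return acc.set_default_cpu_access_type(
            concurrency::access_type_read_write);
    } catch (const std::exception&) {
        return false;                           // treat runtime errors as "not enabled"
    }
}
#else
// Fallback for toolchains without C++ AMP: nothing to enable.
bool try_enable_shared_memory() { return false; }
#endif
```

Calling the helper once at startup, before any `array` or `array_view` is used on the accelerator, is the intended usage in this sketch.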

Enhanced support for textures

In Visual Studio 2013, we added a number of features to enhance support for textures. The added features include:

  • Access to hardware texture sampling capabilities
  • Support for staging textures
  • Redesigned texture_view (more consistent with the array_view design)
  • A more complete and performant set of texture copy APIs including section copy
  • Better interop support for textures including a much bigger set of DXGI formats
  • Support for mipmap

Improved C++ AMP debugging experience

The debugging experience for C++ AMP code has been improved on multiple fronts. We had previously announced a series of improvements.

Apart from these, in Visual Studio 2013 we enabled the following features:

  • Side-by-side CPU/GPU debugging. Currently, mixed-mode debugging is available on Windows 8.1 for the WARP accelerator.
  • Ability to debug using the WARP accelerator instead of the single-threaded REF accelerator. Using WARP provides a much faster debugging experience.

Faster C++ AMP runtime

We have improved the performance of the C++ AMP runtime to provide even faster application performance. The work includes:

  • Reduced parallel_for_each launch overheads
  • Optimized texture copy performance
  • Optimized performance of copying small data sizes between the CPU and accelerator

Array_view API improvements

In Visual Studio 2013, the following improvements have been made to the array_view abstraction:

  • Ability to create array_view without a data source
  • Ability to synchronize to a specific accelerator
  • Performant array_view indexing operators on CPU
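To illustrate the first item, here is a minimal sketch of an `array_view` created from an extent alone, with no backing container. It assumes the Visual Studio 2013 `<amp.h>` header; the function name `squares_sketch` is hypothetical, and the plain loop in the non-Visual C++ branch is only a stand-in so the example builds on other compilers.

```cpp
#include <cstddef>
#include <vector>

#ifdef _MSC_VER   // C++ AMP (<amp.h>) is assumed to be available only with Visual C++
#include <amp.h>

// Hypothetical example: fills a source-less array_view on the accelerator
// with i*i and copies the result back to the CPU.
std::vector<int> squares_sketch(int n)
{
    using namespace concurrency;

    // No std::vector or raw pointer behind this view; the runtime
    // allocates storage on demand.
    array_view<int, 1> av(n);

    parallel_for_each(av.extent, [=](index<1> i) restrict(amp) {
        av[i] = i[0] * i[0];
    });

    std::vector<int> out(static_cast<std::size_t>(n));
    concurrency::copy(av, out.begin());   // bring the results to the CPU
    return out;
}
#else
// CPU-only stand-in with the same result, for toolchains without C++ AMP.
std::vector<int> squares_sketch(int n)
{
    std::vector<int> out(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {
        out[static_cast<std::size_t>(i)] = i * i;
    }
    return out;
}
#endif
```

In Visual Studio 2012 the same computation would typically have needed a container or an `array` as the data source; the sketch avoids that extra allocation on the CPU side.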

Additional changes

Apart from the changes listed above, we also took time to refine other parts of C++ AMP. These changes include:

  • New APIs to enable clean AMP runtime shutdown
  • Improved accuracy and helpfulness of C++ AMP runtime exception messages
  • Improved accuracy of ETW events for a better profiling experience
  • Ability to lock/unlock accelerator_views to allow safe access to shared resources between C++ AMP and Direct3D APIs

We are excited to bring the next set of features in C++ AMP and in the coming weeks, we will be discussing these new features in depth. We hope you will take the time to download Visual Studio 2013 Preview and send us your feedback, comments and questions – below or in our MSDN forum.

 


Comments (22)

  1. rahul says:

    Hi Boby. Great to see the C++ AMP improvements. The shared memory support is particularly exciting. Wondering if that is enabled by using something new from D3D 11.2? Secondly, is there info on supported hardware platforms from Intel and AMD for this feature?

  2. small_mountain says:

    This all sounds good, but is there any movement on a Mac implementation of C++/AMP?  We have a portable C++ app that runs on Mac and Windows, and would like a GPU acceleration solution, and C++/AMP might be the choice if it gets implemented on the Mac.

  3. rahul says:

    To answer my own question, the shared memory support is likely using the ability to map GPU default buffers without going through a staging buffer, which is being introduced in D3D 11.2. I had not realized this has been introduced when I first read the docs 🙂

  4. Yes, shared memory support in C++ AMP uses the ability to map D3D11_USAGE_DEFAULT buffers for CPU access which is being introduced with WDDM 1.3 (supported on Windows 8.1 and later OS versions only). This facility is available for D3D11 feature level 10+.

  5. @small_mountain_0705: there are a couple of proof-of-concept implementations of C++ AMP on top of OpenCL running in the wild, but nothing that you can depend on for production. That is the latest information we have. We are continuing to work with our partners and encouraging them to release support for C++ AMP on other platforms, and we will release an updated open specification to reflect these changes.

  6. Onur Gumus says:

    Is shared memory supported on Windows Server 8.1 aka blue ?

  7. Yes, shared memory is supported on Windows 8.1 Server.

  8. heff says:

    hey guys, great work on bringing parallel programming and gpgpu to the masses. Much appreciated. Any News on the ppl?

  9. Hi heff,

    The biggest news in the PPL (specifically, PPL tasks) in this release was support for cross-platform and user-defined schedulers. One of the biggest beneficiaries of this is the Casablanca project, which is built by the same team. We also fixed a number of bugs – you’ll care about it mostly if you’re writing Windows Store apps.

    Not everything we wanted to include in the product made the cut, but the good news is that we’re going to release a set of additions to the PPL called the PPL Power Pack that will contain some oft-requested features. We’re putting some finishing touches on it now, and will announce it on this blog soon, so stay tuned.

    Artur

  10. heff says:

    Hi, thank you for the answer … that's great news. You guys are doing a great job.

  11. Ola Weistrand says:

    Hi,

    Could you give more details on the "Reduced parallel_for_each launch overheads"? How much faster can one expect the new version to be?

  12. @ola,

    For this release, we focused on reducing the launch overhead related to the runtime. Our micro benchmarks (which measure those specific parts of the runtime overhead) showed a significant (25 to 50%) improvement over the previous release. However, due to additional elements such as DirectX and IHV driver overheads, end-to-end scenarios will not see the same improvement we saw in our micro benchmarks. The improvement in an end-to-end scenario depends on the number of parallel_for_each (p_f_e) invocations made. As you batch more p_f_e invocations together, you should see noticeable performance gains from the runtime improvements we made.

  13. Mahmoud says:

    Hi Bobby, Nice Article!.

    I am a computer science researcher and I am preparing a thesis about GPU high performance computing.

    I need to know about C++ amp internals compared to CUDA (because I found performance differences for the same problems):

      – Exactly what hardware devices does C++ AMP use from the GPU ?

      – how kernels run on that hardware (what is the clock speed of processors?,  what are the types and specs of memory used and how it is organized ?, threads scheduling ? ) ?

    are there any resources that i can read for these internal details ?

    Thanks very much for your time.

  14. AMP says:

    Hi there,

    is there any plans to support 64 bits integers or other types like char or short ?

  15. @AMP,

    As we plan for the next version, we will be considering support for types like char, short, etc. Could you explain your scenario and how support for other types helps you? This would aid in prioritizing features for the next release. Thanks. You can email your scenario to bobyg AT Microsoft Dot com.

  16. @Mahmoud  

     Can you contact me via email (bobyg AT Microsoft dot com), I would need more details regarding your request and we can have that conversation offline.

    Thanks

  17. James says:

    Are you requiring VS 2013 for the new version of AMP? If so why do you continue to put roadblock in front of your technology?

  18. @James,

    The new version of C++ AMP requires the 2013 version of the Visual C++ compiler and runtime. Additionally, shared memory is a Windows 8.1 and Windows Server 2012 R2 only feature (since it has OS dependencies). We are interested to hear more about how you are blocked from using the newer C++ AMP. Thanks

  19. VB says:

    Just yesterday I downloaded and installed VS 2013 (Express).  With all new stuff now in it I couldn't wait to try, among other things, the AMP [again].  So, I took the example (derived from MatrixMultiply) I had for 2012 and ran it.  Whaddya know, it's SLOWER by about 50% !!!  Two 1024×1024 matrices get multiplied in 340 ms built with VS 2012 and 530 ms built by VS 2013 !

    😯

    Of course I had to compare 'amp.h' files in both installed packages.  They are, as you'd expect, different.  Would those differences be enough to cause THAT much deterioration in performance?  Don't know.  Are all those changes worth losing 50% in execution time?  Don't think so…

  20. Victor Bazarov says:

    I guess I rushed into it.  Today I took the original MatrixMultiply AMP sample code and ran it after building with 2012 and 2013, and it ran with exactly the same speed.  Now I need to go back to my own code and dig deeper to understand what in it makes it slower with 2013…  Oh, what fun!  🙂

  21. Thanks Victor for clarifying.

    It saved us some of our time :). Your previous comment was indeed surprising and we were about to double check that at our end.

    Please feel free to post any further question you might have while trying to understand AMP behavior in your implementation of matrix multiplication.

  22. Martyn says:

    I have VS Express 2013 for Windows Desktop on Windows 8.1 and have been downloading C++ AMP code examples from this site and elsewhere.

    After command line compiling of this code, and saving and compiling this same code to either main.cpp or another filename, I notice that the code compiled from main.cpp runs a lot quicker. For example, string search sample with C++ AMP from this site.