restrict(amp) restrictions part 3 of N – function declarators and calls


 

This post assumes that you have read the introductory post to this series, which also includes a table of contents. With that out of the way, let’s look at the restrictions around function declarators and calls.

Function declarators with restrict(amp)

For a function declarator with restrict(amp) (or restrict(amp, cpu)), besides the obvious rule that its return type and parameter types must be supported for amp, there are some extra rules, as follows:

·         It is not allowed to have a trailing ellipsis (...) in its parameter list;

·         It is not allowed to have an exception specification (including the empty throw() and __declspec(nothrow));

·         It is not allowed to have extern "C" linkage when it has multiple restriction specifiers;

·         It is not allowed to be virtual.

 

Variadic functions require direct support from the C runtime, which is not amp-compatible in C++ AMP v1. In addition, C++ AMP does not support exception handling, so an exception cannot be thrown inside an amp-restricted function, nor can the function have an exception specification. The empty exception specification is harmless, but we disallow it for consistency. The limitation on extern "C" linkage exists because the current C++ AMP implementation generates multiple symbols for a function with multiple restriction specifiers; this cannot be done for extern "C" functions, since they do not have C++ decorated names and thus those symbols cannot be differentiated. Finally, the non-virtual requirement is due to the lack of hardware function call support.
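To make the declarator rules concrete, here is a minimal sketch rather than anything taken from the C++ AMP sources. The names scale, Accumulator, AMP_CPU and USE_CPP_AMP are hypothetical and introduced only for illustration: the shim macro AMP_CPU is assumed to expand to restrict(amp, cpu) when building with Visual C++ and C++ AMP, and to nothing elsewhere, so the allowed declarations still compile as plain C++. The disallowed forms are kept as comments because the compiler rejects them.

```cpp
// Hypothetical shim (an assumption for this sketch): AMP_CPU expands to
// restrict(amp, cpu) only when USE_CPP_AMP is defined under Visual C++;
// on other compilers it expands to nothing.
#if defined(_MSC_VER) && defined(USE_CPP_AMP)
#include <amp.h>
#define AMP_CPU restrict(amp, cpu)
#else
#define AMP_CPU
#endif

// Allowed: supported return and parameter types, no trailing ellipsis,
// no exception specification, C++ linkage, non-virtual.
int scale(int x, int factor) AMP_CPU { return x * factor; }

struct Accumulator
{
    int total;

    // Allowed: a non-virtual member function.
    void add(int x) AMP_CPU { total += x; }
};

// Disallowed forms, shown as comments because they do not compile:
// int sum_all(int n, ...) restrict(amp);        // trailing ellipsis
// int f(int x) restrict(amp) throw();           // exception specification, even an empty one
// extern "C" int g(int x) restrict(amp, cpu);   // extern "C" with multiple restriction specifiers
// struct B { virtual int h() restrict(amp); };  // virtual function
```

On the host, scale(3, 4) evaluates to 12, and calling add(5) followed by add(7) on an Accumulator whose total starts at 0 leaves total at 12.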

Function calls

Within an amp-restricted function, the target of a function-like invocation (e.g., functions, member functions, object constructors and destructors, operators) must be amp-restricted too. Following the amp type restrictions, we know that it cannot be a virtual function, a function pointer, or a pointer to member function either. In addition, due to the lack of hardware stack and function call support, a function is not allowed to invoke itself recursively, either directly or indirectly via other functions.
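The call rules can be illustrated with a small sketch under the same assumptions as before: AMP_CPU and USE_CPP_AMP are hypothetical shim macros, and square, sum_of_squares and factorial_iterative are made-up names. It shows a call whose target is itself amp-restricted, and a recursive function rewritten as a loop, which is one common way to stay within the no-recursion rule.

```cpp
// Hypothetical shim (an assumption for this sketch): AMP_CPU expands to
// restrict(amp, cpu) only when USE_CPP_AMP is defined under Visual C++;
// on other compilers it expands to nothing.
#if defined(_MSC_VER) && defined(USE_CPP_AMP)
#include <amp.h>
#define AMP_CPU restrict(amp, cpu)
#else
#define AMP_CPU
#endif

// The callee carries the amp restriction, so it is a valid call target
// from another amp-restricted function.
int square(int x) AMP_CPU { return x * x; }

int sum_of_squares(int a, int b) AMP_CPU
{
    return square(a) + square(b);
}

// Disallowed: direct recursion (the same holds for indirect recursion).
// int factorial(int n) restrict(amp)
// {
//     return n <= 1 ? 1 : n * factorial(n - 1);
// }

// Allowed: the same computation expressed as a loop.
int factorial_iterative(int n) AMP_CPU
{
    int result = 1;
    for (int i = 2; i <= n; ++i)
    {
        result *= i;
    }
    return result;
}
```

On the host, sum_of_squares(3, 4) evaluates to 25 and factorial_iterative(5) to 120.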

Comments (7)

  1. Arman says:

    I know the functionality is implied, but just for clarity could you discuss considerations for functions declared with "inline"? How will the VC++11 compiler respond to this keyword and will the rules for inlining functions be different for amp restricted code?

  2. Hi Arman, when a restrict(amp) or restrict(amp, cpu) function is called within the call graph rooted at the parallel_for_each, it will always be inlined. Please take a look at: blogs.msdn.com/…/c-amp-full-inlining-requirement.aspx. When a restrict(amp, cpu) function is called on the host, its inlining behavior is unchanged.

  3. Peter says:

    Is it possible to call variadic template amp restricted functions in parallel_for_each with restrict(amp) like this?

    template <typename... Functions>
    int FillArray(std::vector<double>& vArray, Functions... functs)
    {
        double dParam = 1.0;
        std::vector<std::function<bool(double)>> vFunctions = { functs... };
        for (auto funct : vFunctions)
            parallel_for_each(vArray.begin(), vArray.end(), [funct, dParam](double& d)
            {
                d += funct(dParam);
            });
    }

  4. Łukasz Mendakiewicz says:

    Hello Peter,

    The code exactly as you have written it will not work with amp-restricted functions. In this specific case, std::function is a non-amp-compatible type and as such cannot be captured by an amp-restricted lambda, which is what parallel_for_each accepts in C++ AMP (note that it is different for PPL's parallel_for_each, but that will execute only on the CPU). Generalizing, function pointers are not amp-compatible at all and cannot be used in such a scenario, so iterating dynamically over a collection of functions is not possible.

    However if your use case allows, you can work around these limitations by using static expansion instead — using function objects (or lambda closures). Note that every closure can call any function, so it should retain the generality you were intending in the example above.

    One example of achieving this could be as follows:

    #include <amp.h>
    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <vector>

    using namespace concurrency;

    // Terminates the static expansion once no closures are left.
    void FillArray_expand(array_view<double>&, double) {}

    // Launches one parallel_for_each per closure, peeling them off one at a time.
    template <typename Closure, typename... Closures>
    void FillArray_expand(array_view<double>& av, double dParam, Closure closure, Closures... closures)
    {
       parallel_for_each(av.extent, [av, dParam, closure](index<1> idx) restrict(amp)
       {
           av[idx] += closure(dParam);
       });
       FillArray_expand(av, dParam, closures...);
    }

    template <typename... Closures>
    void FillArray(std::vector<double>& vArray, Closures... closures)
    {
       double dParam = 1.0;
       array_view<double> av(vArray);
       FillArray_expand(av, dParam, closures...);
       av.synchronize();
    }

    double foo(double d) restrict(amp) { return d * 2; }
    double bar(double d) restrict(amp) { return d + 2; }

    int main()
    {
       try
       {
           std::vector<double> vArray = {1, 2, 3, 4, 5};
           // These might be shortened and inlined into the FillArray call with macros.
           auto c1 = [](double d) restrict(amp) { return foo(d); };
           auto c2 = [](double d) restrict(amp) { return bar(d); };
           FillArray(vArray, c1, c2);
           std::copy(begin(vArray), end(vArray), std::ostream_iterator<double>(std::cout, " "));
       }
       catch(const std::exception& ex)
       {
           std::cout << ex.what() << std::endl;
       }
    }

  5. Peter says:

    Thanks a lot Lukasz.

    I knew about the pointer restriction, and it is good to know that lambda closures are inlined statically at compile time in parallel_for_each(…) restrict(amp).

    My problem is more complicated because I need to use it for multiple GPUs, each running on a vArray.section, like this:

    //     std::vector<double> vParams = { 0, 1, 2, …, N};
    array_view<double> avP(vParams);
    std::vector<std::function<double(double)>> vFunctions = { functs... };
    for (auto funct : vFunctions)
        parallel_for_each(vGPUs.begin(), vGPUs.end(), [&](pair<accelerator, int> accel)
        {
            accelerator_view device = accel.first.get_default_view();
            accel.first.set_default_cpu_access_type(access_type_auto);
            device.wait();
            int nGPU = accel.second;
            auto vArray_index = concurrency::index<1>(nGPU * vArray.extent / nGPUCount);
            auto vArray_extent = concurrency::extent<1>(vArray.extent / nGPUCount);
            auto P_index = concurrency::index<1>(nGPU * vArray.extent / nGPUCount);
            auto vArraySection = vArray.section(vArray_index, vArray_extent);
            auto avP_section = avP.section(P_index);
            vArraySection.discard_data();
            parallel_for_each(device, vArraySection.extent, [&, funct, avP_section, vArraySection](index<1> idx) restrict(amp)
            {
                vArraySection(idx).val = funct(avP_section(idx));
            });
            vArraySection.synchronize();
        });

    The problem is that vArray.section is determined at run time.

  6. Łukasz Mendakiewicz says:

    Hi Peter!

    I don't think it should be a problem — the loop that needs to be statically expanded is the loop over functions, the loop over sections may be dynamic just fine.

    The below code is an example that should perform the operation you are asking for, modulo minor details (apologies for the blog software butchering the formatting):

    #include <amp.h>
    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <utility>
    #include <vector>

    using namespace concurrency;

    typedef std::vector<std::pair<accelerator_view, array_view<double>>> vec_accvw_av;

    // Terminates the static expansion once no closures are left.
    void FillArray_expand(vec_accvw_av&, double) {}

    // For each closure, submits work to every accelerator_view on its own section.
    template <typename Closure, typename... Closures>
    void FillArray_expand(vec_accvw_av& vAccvwSections, double dParam, Closure closure, Closures... closures)
    {
       for (auto& accvwSection : vAccvwSections)
       {
           auto& av = accvwSection.second;
           parallel_for_each(accvwSection.first, av.extent, [av, dParam, closure](index<1> idx) restrict(amp)
           {
               av[idx] += closure(dParam);
           });
       }
       FillArray_expand(vAccvwSections, dParam, closures...);
    }

    template <typename... Closures>
    void FillArray(std::vector<accelerator_view>& vAccVws, std::vector<double>& vArray, Closures... closures)
    {
       // Create the "master" view.
       array_view<double> av(vArray);

       // Create pairs of accelerator_views and sections they operate on.
       vec_accvw_av vAccvwSections;
       int step = av.extent.size() / vAccVws.size();
       for (size_t i = 0; i < vAccVws.size() - 1; ++i)
       {
           vAccvwSections.push_back({ vAccVws[i], av.section(i * step, step) });
       }

       // And the remainder, with atrocious syntax :(.
       vAccvwSections.push_back(
       {
           vAccVws.back(),
           av.section(index<1>{static_cast<int>(vAccVws.size()) - 1} * step)
       });

       // Statically expand the list of closures and submit the work.
       double dParam = 1.0;
       FillArray_expand(vAccvwSections, dParam, closures...);
       av.synchronize();
    }

    double foo(double d) restrict(amp) { return d * 2; }
    double bar(double d) restrict(amp) { return d + 2; }

    int main()
    {
       try
       {
           // Create a vector of accelerator_views we want to run the computation on.
           std::vector<accelerator_view> vAccVws;
           vAccVws.push_back(accelerator().default_view);
           vAccVws.push_back(accelerator(accelerator::direct3d_warp).default_view);

           // The data to be transformed.
           std::vector<double> vArray = { 1, 2, 3, 4, 5 };

           // The closures to be applied.
           // These might be shortened and inlined into the FillArray call with macros.
           auto c1 = [](double d) restrict(amp) { return foo(d); };
           auto c2 = [](double d) restrict(amp) { return bar(d); };

           // Run and print the results.
           FillArray(vAccVws, vArray, c1, c2);
           std::copy(begin(vArray), end(vArray), std::ostream_iterator<double>(std::cout, " "));
       }
       catch (const std::exception& ex)
       {
           std::cout << ex.what() << std::endl;
       }
    }

  7. Peter says:

    Thank you, Łukasz!

    It works, although I need to use std::tuple instead of std::pair because of additional parameters for the extent<2> and extent<3> array_views 😉