Passing pointers through C++ AMP

Hello, my name is Łukasz Mendakiewicz and I am an engineer on the C++ AMP team.

Converting C++ data parallel algorithms to take advantage of C++ AMP is fairly straightforward, assuming you can live with some of the restrictions of restrict(amp) functions. Today, let’s walk through a relatively simple problem: passing a type that is incompatible with C++ AMP through the C++ AMP runtime.

Problem

Consider a plain C++ program that calculates a weighted sum of two variables. For some background, this might be a decision problem in which we want to minimize the cost of ordering a certain quantity of apples and pears from local producers. (Alternatively, it might be the cost of ordering a certain number of GPUs from local wholesalers.)

Whichever scenario is closer to your heart, let’s assume we have a record struct defined as:

struct record
{
    float cost_1;
    float cost_2;
    float total_cost;
    char* label;
};

…and a quite straightforward algorithm:

std::vector<record> records;
float factor_1 = 100.0f;
float factor_2 = 10.0f;
// ...
std::for_each(records.begin(), records.end(),
    [=](record& r)
    {
        r.total_cost =
            factor_1 * r.cost_1
            + factor_2 * r.cost_2;
    }
);

To convert this algorithm to take advantage of C++ AMP, we can rewrite it as follows:

// Requires #include <amp.h> and using namespace concurrency;
std::vector<record> records;
float factor_1 = 100.0f;
float factor_2 = 10.0f;
// ...
array_view<record, 1> records_view(static_cast<int>(records.size()), records);
parallel_for_each(records_view.extent,
    [=](index<1> idx) restrict(amp)
    {
        record& r = records_view[idx];
        r.total_cost =
            factor_1 * r.cost_1
            + factor_2 * r.cost_2;
    }
);

However, if we try to compile it, we get some bad news:

C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\amp.h(2202): error C3581: 'void (const Concurrency::extent<_Rank> &,_Value_type *,bool) restrict(cpu, amp)': unsupported type in amp restricted code
with
[
_Rank=1,
_Value_type=record
]
main.cpp(14): pointer or reference is not allowed as pointed to type, array element type or data member type (except reference to Concurrency::array/texture)
main.cpp(105) : see reference to class template instantiation 'Concurrency::array_view<_Value_type,_Rank>' being compiled
with
[
_Value_type=record,
_Rank=1
]

It turns out the problem here is the char pointer, which is one of the types not allowed in the restricted context (as discussed in the restrictions blog series). Despite the fact that we are not going to touch it in the kernel, it is still going to be passed through the C++ AMP runtime and hence we get the compilation error.

In a perfect world, you would remove the pointer from the structure altogether, effectively turning the array of structures into a structure of arrays for an added performance gain. However, if you are constrained by the high cost of refactoring the rest of your application, or just want to get it up and running the quick and dirty way, you may be interested in exploring how we can work around this restriction…

Working towards a solution (you may skip straight to the final solution below if you wish)

The first attempt to address the problem might be to store the pointer in an integer field. Obviously pointers are stored in 4 bytes, just like plain old int variables in VC++, right? Well, this is only half-true once you consider 64-bit architectures (8-byte pointers there), and even less so if you target anything more exotic. However, let’s stick with the Windows world, use processor architecture macros, and define the structure as follows:

struct record_v_2
{
    float cost_1;
    float cost_2;
    float total_cost;
#if defined(_M_IX86)
    static_assert(sizeof(int) == sizeof(char*), "Cannot represent char* as int!");
    int label; // warning: char* in a disguise
#elif defined(_M_AMD64)
    static_assert(sizeof(__int64) == sizeof(char*), "Cannot represent char* as __int64!");
    __int64 label; // warning: char* in a disguise
#else
#error "Unknown architecture!"
#endif
};

Oh, that was close! We still have an issue with an AMD64 target, as __int64 is another type that is not allowed in a C++ AMP restricted function.

But as 2+2=4, similarly 4+4=8, or should I say sizeof(int) + sizeof(int) == sizeof(char*) in 64-bit Windows. Making this observation a little more platform agnostic leads us to the next version:

struct record_v_3
{
    float cost_1;
    float cost_2;
    float total_cost;
    int label[(sizeof(char*) + sizeof(int) - 1) / sizeof(int)]; // warning: char* in a disguise
};

This version looks promising and it will certainly work in the C++ AMP context. However, it will still require some refactoring in other parts of the code and, honestly, I would not trust myself to remember that this integer array is to be treated as a pointer!
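To see why the array length expression in record_v_3 always reserves enough room, here is a small compile-time sketch. The int_slots_for helper is a hypothetical name introduced only for this illustration; it is not part of C++ AMP or of the original project. It sticks to a class template with a static constant rather than constexpr, so it should stay within what Visual Studio 2012 accepts.

```cpp
#include <cstddef>

// Hypothetical helper (not from the original post): the number of int-sized
// slots needed to hold sizeof(T) bytes, rounded up to a whole slot.
template <typename T>
struct int_slots_for
{
    static const std::size_t value = (sizeof(T) + sizeof(int) - 1) / sizeof(int);
};

// The disguised storage must be at least as large as the pointer it holds...
static_assert(int_slots_for<char*>::value * sizeof(int) >= sizeof(char*),
              "int array too small to hold a char*");

// ...and the round-up never wastes a whole int.
static_assert(int_slots_for<char*>::value * sizeof(int) < sizeof(char*) + sizeof(int),
              "int array larger than necessary");

// A type smaller than int still gets one full slot, never zero.
static_assert(int_slots_for<char>::value == 1,
              "round-up must never yield zero slots");
```

With 4-byte ints this gives one slot for a 4-byte pointer and two slots for an 8-byte pointer, which matches the 4+4=8 observation above.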

One way to beautify record_v_3 would be to encapsulate the nitty-gritty details and use the __declspec(property) modifier (available in the VC++ compiler) to expose the data as a virtual char* data member. This would both make it safer to use and remove the need to refactor any other part of the code.

We would only need to devise a setter/getter pair of functions accepting/returning char* and operating internally on an int array. Wait, did I say accepting/returning char*, the very reason for the whole mess in the first place? Yes, member functions of C++ AMP compatible types are free to use any type or language construct available in C++, as long as we do not try to mark them as C++ AMP compatible using the “restrict” keyword.

I like to picture it as a C++ AMP runtime being a well-behaved youngster not peeking where she is not supposed to. As she will not be offended by things she cannot see and she will see only data structure fields and member functions with the C++ AMP restriction, we may have anything in the CPU restricted member functions. As a side note, the CPU is considerably older, but he seems to be a gentleman, not watching C++ AMP specific stuff either!

Solution

The ultimate solution is to generalize the idea for any pointer type, providing a generic wrapper for pointers to be passed through C++ AMP. We need to provide an interface compatible with a pointer concept and carefully mark only functions not operating on pointers as C++ AMP restricted.

The only caveat is the alignment of the data field on the AMD64 architecture. The natural alignment of pointers there is 8 bytes, while an integer array will be aligned to 4 bytes. Therefore, I have explicitly marked the access to it as unaligned. Keep in mind that this comes with an additional runtime cost. Also, if you prefer not to use compiler-specific modifiers, you may use memcpy, bitwise operations, etc. instead.

// Note: UNALIGNED is a macro from the Windows SDK headers (<windows.h>).
template <typename Type>
class pointer_holder
{
public:
    typedef Type element_type;
    typedef Type* pointer;

    // The only member usable in amp-restricted code; it touches no pointer.
    pointer_holder() restrict(cpu,amp)
    {
    }

    pointer_holder(pointer ptr) restrict(cpu)
    {
        reset(ptr);
    }

    pointer_holder& operator=(pointer ptr) restrict(cpu)
    {
        reset(ptr);
        return *this;
    }

    // Stores the pointer bits in the int array.
    void reset(pointer ptr) restrict(cpu)
    {
        *reinterpret_cast<pointer UNALIGNED *>(data) = ptr;
    }

    // Reads the pointer bits back from the int array.
    operator pointer() const restrict(cpu)
    {
        return *reinterpret_cast<const pointer UNALIGNED *>(data);
    }

    element_type& operator*() const restrict(cpu)
    {
        return *static_cast<pointer>(*this);
    }

    pointer operator->() const restrict(cpu)
    {
        return static_cast<pointer>(*this);
    }

private:
    int data[(sizeof(pointer) + sizeof(int) - 1) / sizeof(int)];
};

struct record
{
    float cost_1;
    float cost_2;
    float total_cost;
    pointer_holder<char> label;
};

Please note that the pointer_holder contains only integer data so it is perfectly safe not only to pass, but also to copy and assign it in the C++ AMP restricted code.
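For readers who would rather avoid the compiler-specific UNALIGNED modifier, the memcpy alternative mentioned above could look roughly like the sketch below. portable_pointer_holder and pointer_holder_demo are hypothetical names, and the restrict specifiers are left out so that the sketch builds with any C++ compiler; in C++ AMP code, only the default constructor would carry restrict(cpu,amp), as in the class above.

```cpp
#include <cassert>
#include <cstring>   // std::memcpy

// Hypothetical variant of pointer_holder: the pointer bits are copied in and
// out of the int array with std::memcpy, so no unaligned access is needed.
// Assumption: restrict(...) specifiers are omitted to keep this portable.
template <typename Type>
class portable_pointer_holder
{
public:
    typedef Type element_type;
    typedef Type* pointer;

    // Leaves the storage uninitialized, like the original default constructor.
    portable_pointer_holder() {}

    portable_pointer_holder(pointer ptr) { reset(ptr); }

    portable_pointer_holder& operator=(pointer ptr)
    {
        reset(ptr);
        return *this;
    }

    // Copies the bytes of the pointer into the int array.
    void reset(pointer ptr)
    {
        std::memcpy(data, &ptr, sizeof(pointer));
    }

    // Copies the bytes back out into a properly aligned local pointer.
    operator pointer() const
    {
        pointer ptr;
        std::memcpy(&ptr, data, sizeof(pointer));
        return ptr;
    }

    element_type& operator*() const { return *static_cast<pointer>(*this); }

    pointer operator->() const { return static_cast<pointer>(*this); }

private:
    int data[(sizeof(pointer) + sizeof(int) - 1) / sizeof(int)];
};

// Small self-check: a pointer survives the trip through the int array,
// and copying the holder is a plain member-wise copy of ints.
inline void pointer_holder_demo()
{
    char text[] = "apples";
    portable_pointer_holder<char> label(text);
    assert(static_cast<char*>(label) == text);

    portable_pointer_holder<char> copy = label;
    assert(*copy == 'a');
}
```

The data layout is the same as in pointer_holder, so the type remains a bag of ints as far as C++ AMP is concerned; only the CPU-side accessors differ.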

You can check for yourself that pointer_holder works in practice: the complete project for Visual Studio 2012 may be downloaded here.

Closing thoughts

Please note that the example uses char* only for the sake of simplicity; I recommend using std::string in your programs.

As I have noted before, while this blog post has some educational value and shows how to quickly port a previously unsupported data structure to C++ AMP, it does not present the solution you would want to follow for cutting-edge performance. Refactoring the record structure to a cache-oblivious counterpart will be beneficial for both CPU (cache effect) and GPU (reduced amount of copying) processing. The easiest way to do so is to split the vector into three, with related elements sharing the same index: the first containing the {cost_1, cost_2} pairs, the second containing total_cost (output only for the GPU), and the third containing the pointers (not copied to the GPU at all). Stay tuned, we are going to talk about these in more detail soon, or if you come up with the example code before I do, feel free to post below!
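As a starting point for that refactoring, here is a minimal CPU-only sketch of the three-vector layout. All names (cost_pair, split_records, add_record, compute_totals) are assumptions made for illustration, and the plain loop merely stands in for the kernel; no C++ AMP calls are made.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical input element: the only data a kernel would need to read.
struct cost_pair
{
    float cost_1;
    float cost_2;
};

// Hypothetical split layout: element i of every vector belongs to record i.
struct split_records
{
    std::vector<cost_pair> costs;    // input, would be copied to the GPU
    std::vector<float> total_costs;  // output, would be copied back from the GPU
    std::vector<char*> labels;       // CPU only, never handed to C++ AMP
};

// Appends one logical record, keeping the three vectors the same length.
inline void add_record(split_records& r, float cost_1, float cost_2, char* label)
{
    cost_pair c = { cost_1, cost_2 };
    r.costs.push_back(c);
    r.total_costs.push_back(0.0f);
    r.labels.push_back(label);
}

// CPU stand-in for the kernel body: the same weighted sum as before,
// reading only costs and writing only total_costs.
inline void compute_totals(split_records& r, float factor_1, float factor_2)
{
    assert(r.costs.size() == r.total_costs.size());
    for (std::size_t i = 0; i < r.costs.size(); ++i)
    {
        r.total_costs[i] = factor_1 * r.costs[i].cost_1
                         + factor_2 * r.costs[i].cost_2;
    }
}
```

In a C++ AMP port, only costs and total_costs would be wrapped in array_view objects, while labels would stay on the CPU side untouched, so the pointer restriction never comes into play.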

CppAMPPassingPointers.zip

1. MattPD says:

Out of curiosity: would using std::intptr_t (from <cstdint> in C++11) work instead of ISA-dependent int/__int64 ?

2. MattPD, thank you for your question. In VS, std::intptr_t is a typedef for int or __int64 (depending on whether the target is x86 or AMD64), so using it would effectively bring us back to the "record_v_2" solution. So, similarly, it would work for an x86 target but would not be allowed for AMD64.

In other compilers the underlying types may be selected differently, but the one for the AMD64 target will most probably be a 64-bit integer type, which is not C++ AMP compatible in version 1.

We will talk in-depth about restrictions in future posts on our blog, please check back!

3. MattPD says:

OK, didn't know about the 64-bit integer types not being compatible w/ C++ AMP v. 1, interesting — thanks for the answer! Looking forward to future posts!