C++ AMP: Full Inlining Requirement

12/21/2011

As you may know, in C++ AMP, the entry point for device code is the parallel_for_each function. What you may not know is that the compiler will inline all the function calls in the call graph rooted at the lambda/functor that is supplied to the parallel_for_each! In this blog post, I’m going to first guide you through what you need to know as a C++ AMP programmer and in the end, I will explain why C++ AMP has this inlining requirement.

Prerequisite

All the functions in the call graph must be amp-restricted functions (e.g. restrict(amp), or restrict(cpu, amp)). The restrictions imposed by restrict(amp) ensure that functions are not written in a way that would prevent successful inlining: features such as variable argument lists, virtual functions, and recursion are disallowed because they would make full inlining impossible. For example, if there is recursion in the call graph, an error is reported:

test.cpp(12) : error C3559: recursive call to 'void __cdecl Foo(int &,int) restrict(cpu, amp)' : recursion is detected when compiling the call graph for the entry function for an amp target 'test.cpp(36)'

Line number 12 refers to the location where the recursive call to function “Foo” is made. Line number 36 refers to the location of the root of the call graph.
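One common way around the recursion restriction is to rewrite the helper as a loop. The sketch below is only an illustration: SumTo is a made-up function, and AMP_RESTRICT and AMP_EXAMPLE_USE_AMP are hypothetical macros introduced here so that the same file also builds as ordinary C++ when C++ AMP is not available. Define AMP_EXAMPLE_USE_AMP only when compiling with a Visual C++ compiler that supports C++ AMP.

```cpp
// AMP_RESTRICT and AMP_EXAMPLE_USE_AMP are hypothetical helpers for this
// sketch; they are not part of C++ AMP. When AMP_EXAMPLE_USE_AMP is defined,
// the function carries the real restrict(cpu, amp) specifier; otherwise the
// macro expands to nothing and the code is plain C++.
#if defined(AMP_EXAMPLE_USE_AMP)
#define AMP_RESTRICT restrict(cpu, amp)
#else
#define AMP_RESTRICT
#endif

// A recursive version such as
//     int SumTo(int n) AMP_RESTRICT { return n <= 0 ? 0 : n + SumTo(n - 1); }
// cannot be fully inlined, so it would be rejected once it is reachable from
// a parallel_for_each call graph.

// Iterative version: no self-call, so nothing stands in the way of inlining.
int SumTo(int n) AMP_RESTRICT {
    int total = 0;
    for (int i = 1; i <= n; ++i) {
        total += i;  // accumulate 1 + 2 + ... + n
    }
    return total;  // 0 for n <= 0, matching the recursive version
}
```

The loop computes the same result as the recursive form, and because it contains no call to itself, it can appear anywhere in the call graph of a parallel_for_each lambda.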

Inlining Always Happens

Please be aware that, in this release, the C++ AMP compiler forces inlining regardless of whether the:

source file is compiled with /Ob0.
function is annotated with __declspec(noinline).
function is in a region annotated by #pragma auto_inline(off).

In short, full inlining is always enforced for the call graph rooted at the lambda/functor passed to parallel_for_each.

“What does this mean to me?”

As a C++ AMP developer, you need to make sure that the code for all the functions in the call graph is available to the compiler when it compiles the code in the file containing the parallel_for_each. If you don’t, you may see a compilation error like the following:

Y.cpp(10) : error C3560: 'void __cdecl Bar(void) restrict(amp)': IL is not available when compiling the call graph for the entry function for an amp target at: 'X.cpp(64)'

Let’s explore the two options for avoiding errors like that.

The first option is to ensure that the definitions for all functions are available in the same translation unit – the cpp source file and all the header files it includes (recursively). This practically means putting a lot of your reusable code directly in header files.
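As a rough sketch of this first option, the helper below is defined (not merely declared) in what would be a shared header, so its body is visible in the translation unit that contains the parallel_for_each. The names Scale, ScaleAll and scale.h are invented for illustration, and AMP_RESTRICT and AMP_EXAMPLE_USE_AMP are hypothetical macros: the C++ AMP path is compiled only when AMP_EXAMPLE_USE_AMP is defined, and a plain CPU loop stands in for it otherwise so the sketch still builds with other compilers.

```cpp
#include <cstddef>
#include <vector>

// AMP_EXAMPLE_USE_AMP and AMP_RESTRICT are hypothetical helpers for this
// sketch only; define AMP_EXAMPLE_USE_AMP when building with C++ AMP support.
#if defined(AMP_EXAMPLE_USE_AMP)
#include <amp.h>
#define AMP_RESTRICT restrict(cpu, amp)
#else
#define AMP_RESTRICT
#endif

// ---- would live in a shared header, e.g. a hypothetical scale.h ----
// The full definition is in the header, so every source file that includes
// it hands the compiler the body it needs for inlining.
inline float Scale(float value, float factor) AMP_RESTRICT {
    return value * factor;
}

// ---- would live in the .cpp file that contains the parallel_for_each ----
void ScaleAll(std::vector<float>& data, float factor) {
    if (data.empty()) return;  // nothing to do for an empty input
#if defined(AMP_EXAMPLE_USE_AMP)
    // Wrap the vector so the device code can read and write it.
    concurrency::array_view<float, 1> view(static_cast<int>(data.size()), data);
    concurrency::parallel_for_each(view.extent,
        [=](concurrency::index<1> idx) restrict(amp) {
            view[idx] = Scale(view[idx], factor);  // Scale is inlined here
        });
    view.synchronize();  // copy results back into the vector
#else
    // CPU stand-in used when C++ AMP is not available.
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] = Scale(data[i], factor);
    }
#endif
}
```

If Scale were only declared in the header and defined in another source file, the compiler would not see its body here, which is the situation that leads to error C3560 unless LTCG is used.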

The second option is to take advantage of the Link-time Code Generation option of the VC++ Compiler and Linker (cl option /GL, link option /LTCG). With LTCG, you can organize the code in more than one translation unit, as long as all the translation units where the functions in the call graph reside are compiled using LTCG. For example, assume

A.cpp contains main(), and the parallel_for_each that is supplied with a lambda, which calls function Foo();
B.cpp contains the code that defines function Foo(), which itself calls function Bar();
C.cpp contains the code that defines function Bar().

To make the compilation successful, you can compile and link as follows:

cl /GL /c A.cpp
cl /GL /c B.cpp
cl /GL /c C.cpp
link /ltcg A.obj B.obj C.obj

…or through a single command:

cl /GL A.cpp B.cpp C.cpp

In this way, via LTCG, the code from all three cpp source files is made available to the compiler for full inlining and code generation. You are encouraged to read more about LTCG in the MSDN documentation for /GL and /LTCG.

If the code of any function in the call graph is not available to the compiler, an error is issued. With the above example of three source files, if you compile them without /GL, as follows:

cl A.cpp B.cpp C.cpp

You will see an error such as this one:

A.cpp(68) : error C3560: 'void __cdecl Foo(void) restrict(cpu, amp)': IL is not available when compiling the call graph for the entry function for an amp target at: 'A.cpp(64)'

A.cpp(68) is the location where function Foo is called, and A.cpp(64) is the location of the root of the call graph. Similarly, if you do:

cl /GL /c A.cpp
cl /GL /c B.cpp
cl /c C.cpp
link /ltcg A.obj B.obj C.obj

Note that /GL is not given when compiling C.cpp. You would get:

B.cpp(6) : error C3560: 'void __cdecl Bar(void) restrict(cpu, amp)': IL is not available when compiling the call graph for the entry function for an amp target at: 'A.cpp(64)'

NOTE: You could get similar errors when using the first option (instead of the LTCG option), if the code of a function is missing from the single translation unit you supply to the compiler.

If you plan to use the LTCG option, please examine your project options in Visual Studio to ensure that you have turned it on. Please look at the “To set this compiler option in the Visual Studio development environment” section from the MSDN /LTCG page.

Why Full Inlining?

So why does C++ AMP require full inlining? There are two reasons, both stemming from the fact that, currently, C++ AMP is built on DirectX 11 DirectCompute.

First, HLSL is the language that DirectCompute uses to program the GPU and it assumes that GPU hardware has no stack. Therefore, the HLSL compiler (which C++ AMP uses behind the scenes) needs to inline all the function calls. As a result, separate compilation and linking is not supported by HLSL – all the functions being called need to be present in the same translation unit as the entry kernel function.

Second, even though HLSL does not support pointers at all, in C++ AMP we provide a certain level of pointer support. In order to conform to the HLSL compilation model and still provide pointer support, the C++ AMP compiler performs inter-procedural analysis to resolve pointer accesses back to the memory resources they point to. Inlining the entire call graph helps a lot in carrying out this analysis.

In Closing

I hope this post helps you understand the C++ AMP inlining requirement and what you can do to satisfy it. Please feel free to ask questions in our MSDN forum.