Binomial Option using C++ AMP


In the financial industry, Binomial Option is a one of the methods to valuate options. Another one is Black-Scholes, which is faster than Binomial Option, but is not as accurate.

main – Program entry point

Let’s start with main(). First it generates random data (financial options), then it normalizes random options (meaning updates the options’ move up or down factor based on volatility, and the discount factor based on time increment), then main calls two functions (the binomial_options_cpu function that executes on the CPU and the binomial_options_gpu function that executes on the GPU), and finally it verifies correctness of the results.

Below is the explanation both a CPU implementation and GPU implementation (using C++ AMP).

binomial_options_cpu – Regular C++ implementation (on CPU)

Here we go through each option one at a time, creating a binomial tree for each, and walking up this tree calculating the result of the value of the option. For more please see the actual code.

binomial_options_gpu – C++ AMP implementation

In our sample we have a small data set and normal data set (defined in macros at the beginning of file). The MAX_OPTIONS macro defines the number of options and we use that to specify the number of GPU threads to execute our kernel. We also use that as the size of our input data.

For the input data we are using array_view instances that wrap the original data that resides in std::vector instances. The last container (vector/array_view) is the result of the computation, while the rest are inputs to the computation.

Instead of iterating through each option, like we did for the CPU case, we use concurrency::parallel_for_each (passing our kernel algorithm as a lambda, that captures the array_view instances) with restrict(amp). Executing our kernel on the GPU will result in a GPU thread calculating the result of the value of the option, i.e. create binomial tree and walk up its tree to get the value of this option. Each thread calls binomial_options_kernel.

Below is the source of the function we just described.

void binomial_options_gpu(std::vector<float>& v_s, std::vector<float>& v_x,
	std::vector<float>& v_vdt, std::vector<float>& v_pu_by_df,
	std::vector<float>& v_pd_by_df,
	std::vector<float>& call_value)
{
	const unsigned data_size = v_s.size();

	// all data size should be same
	assert(data_size == MAX_OPTIONS); // kernel uses this macro directly
	assert((v_s.size() == data_size) && (v_x.size() == data_size) && (v_vdt.size() == data_size) && 
		(v_pu_by_df.size() == data_size) && (v_pd_by_df.size() == data_size) && (call_value.size() == data_size));

	extent<1> e(data_size);

	array_view<float, 1> av_call_value(e, call_value);
        av_call_value.discard_data();
	array_view<const float, 1> av_s(e, v_s);
	array_view<const float, 1> av_x(e, v_x);
	array_view<const float, 1> av_vdt(e, v_vdt);
	array_view<const float, 1> av_pu_by_df(e, v_pu_by_df);
	array_view<const float, 1> av_pd_by_df(e, v_pd_by_df);

	// Used as temporary buffer on GPU
	extent<1> ebuf(MAX_OPTIONS*(NUM_STEPS + 16));
	array<float, 1> a_call_buffer(ebuf);

	// Totally we have
	// number of tiles = (CACHE_SIZE * MAX_OPTIONS)/CACHE_SIZE
	// number of threads/tile = CACHE_SIZE;
	extent<1> compute_extent(CACHE_SIZE * MAX_OPTIONS);
	parallel_for_each(compute_extent.tile<CACHE_SIZE>(), [=, &a_call_buffer](tiled_index<CACHE_SIZE> ti) restrict(direct3d)
	{
		binomial_options_kernel(ti, av_s, av_x, av_vdt, av_pu_by_df, av_pd_by_df, av_call_value, a_call_buffer);
	});
	av_call_value.synchronize();
}

Before we examine the binomial_options_kernel, let’s briefly revisit our usage of array_view and also use of the single array object that we have not talked about until now.

Note that for all input std::vector objects we have created corresponding array_view wrapper objects as follows: array_view<const float, 1>. By using const before the element type we indicate that data can be discarded after the kernel finishes and there is no need to sync data back to the host – so in effect this can save unnecessary copy out and help our performance.

For our lone output parameter we have created the array_view as follows:

          array_view<float, 1> av_call_value(e, call_value);

          av_call_value.discard_data();

The method discard_data() indicates to runtime that data from backing store doesn’t need to be copied to the GPU – so in effect this can save unnecessary copy in and help our performance. After kernel execution, the result can be copied back to system memory: we can explicitly call the synchronize() member function of array_view in a CPU thread, or let the variable go out of scope (so the destructor will call the synchronize function).

Finally, we are also using an array object, because we need a temporary buffer to which we can read or write from our kernel. Observe that the array is not backed by host memory, so we don’t need to copy it between the CPU and GPU.

binomial_options_kernel – C++ AMP Kernel implementation

You’ll remember the binomial_options_kernel function that we called from the binomial_options_gpu function that we examined earlier.

The entire data is divided into chunks of tile size (CACHE_SIZE) and each thread in each tile will operate on this chunk calculating a partial result until it finishes will all chunks (at which point it has the final result). Notice that we copy the data from the captured array_view instances (which reside in global memory) into the array buffer (also in global memory), and then into tile_static arrays. Following that, repeated accesses for the computation are performed on the tile_static arrays. This was the reason we introduced tiles. Accessing tile_static memory is much faster than accessing global memory repeatedly.

 

expiry_call_value – restrict based overloaded function

Both the GPU kernel and CPU implementation call a function called expiry_call_value. The fast_math function used here uses function overload based on the restrict modifier. Hence implementing this function as restrict(amp, cpu) enables calling it from GPU kernel and/or CPU function.

//----------------------------------------------------------------------------
// GPU/CPU implementation - Call value at period t : V(t) = S(t) - X
//----------------------------------------------------------------------------
float expiry_call_value(float s, float x, float vdt, int t) restrict(amp, cpu)
{
	float d = s * direct3d::fast_exp(vdt * (2.0f * t - NUM_STEPS)) - x;
	return (d > 0) ? d : 0;
}

Download the sample

Please download the attached sample of the binomial options that we discussed here and run it on your hardware, and try to understand what the code does and to learn from it. You will need, as always, Visual Studio 11.


BinomialOptions.zip

Comments (11)

  1. The code segments are overflowing the page width and being hidden under the archive, for me anyway. Should there be a horizontal scrollbar on those panes?

    Also, why add new functions on the GPU side when you could just make exp() etc. compiler intrinsics (which they probably are on the CPU side anyway.) std::min etc. already exist too.

  2. dmccrady says:

    Hi Andrew,

    In our preview bits you need two different implementations of "expiry_call_value" because they actually have different bodies:  one calls ::exp and the other calls concurrency::fast_exp. The two functions have different names, so overloading can't help.

    As it happens, we've made some changes in C++ AMP since the preview where this gets a lot better and much closer to what you want.  For one, we've ensured that all available functions that compute "exp" are called "exp". With that, and some other refactorings, you WILL be able to write the following:

    float expiry_call_value(float s, float x, float vdt, int t) restrict(cpu,direct3d)

    {

    float d = s * exp(vdt * (2.0f * t – NUM_STEPS)) – x;

    return (d > 0) ? d : 0;

    }

    i.e., one implementation that satisfies both the "cpu" and the "direct3d" restriction, and which correctly resolves to the correct version of "exp" for the appropriate target.

    Note that we can do this with a completely general mechanism, not involving special intrinsic handling. The details behind this are probably worth a separate blog post, which I'll try to write up.

    Thanks for the feedback.

  3. Thanks Don. I was actually talking about the direct3d:: namespace rather than the twin expiry_call_value functions. I saw a reference to direct3d::min in the sample code as well, for instance. Doesn't seem any point adding new named functions there when the standard library names should just be workable. What else is going in there? I asked about accessing GPU-specific functionality (like texture sampling and SIMD instructions) when C++ AMP was announced, but was told there weren't any plans to add access for that. Might some of that turn up in the direct3d:: namespace?

  4. I have updated the sample attached to this blog post with some fixes and correction:

    1. Overflow in GPU kernel

    2. Typo in README.TXT

  5. Ilan says:

    Will the code running on the GPU support full C++ i.e. virtual functions, multiple inheritance, operator overloading, Exception handling and templates ?

  6. The GPU code can have C++ features but not all of them. There are limitations. I would highly recommend to read the documentation in MSDN and also other blog posts under this technology. They give you very good information about the limitations.

  7. Just to add more specifics here to Pavan's response: virtual functions: no, multiple inheritence: yes, but no virtual inheritence, operator overloading: yes, exception handling: no, templates: yes.

    Basically those restrictions (in the features you inquired about) are driven by the need to have complete visibility of the entire call tree at compile time.

  8. ghlee says:

    fatal error C1451

    T.T

  9. Hi ghlee,

    May you please provide more information about your error? What's the error message?

    And the information of your machine?

    (win7 or win8), the version of VS2012, model number of your graphic cards, driver version of the graphic card?

  10. istring says:

    fatal error C1451。。。

  11. Łukasz Mendakiewicz says:

    Hello istring,

    I'm sorry you've run into this issue. We have observed error C1451 being reported when TMP environmental variable value contains non-ASCII characters. The workaround is to set the user environmental variable TMP to e.g. C:tmp.

    Hope that helps!