accelerator_view::create_marker in C++ AMP

In this post I’d like to explain the accelerator_view::create_marker function. It is one of those APIs that are not necessary for every program, but can be useful in specific scenarios.

Some implementation background

In the current version of C++ AMP, when you run a program on an end-user machine, there is a high probability that it will execute on heterogeneous hardware: sections of code marked restrict(cpu) run on the traditional CPU, while those stemming from a parallel_for_each with restrict(amp) run on the machine’s GPU. As these are most likely separate pieces of silicon, with communication between them supervised by the OS, you can safely bet on them executing asynchronously with respect to each other. In C++ AMP these technical details are hidden for the sake of ease of use, and code executed on an accelerator_view behaves as if it were a sequential program; see Figure 1 for a mental model.


As you can see in the figure, the program performs two consecutive computations on the array_view av and prints out the result. The correct ordering, synchronization, and data transfers are managed by the C++ AMP runtime, giving you the final value as soon as possible. That way, you don’t have to worry about the intermediate state of execution. Unless you want to…

Markers overview

If you want to be aware of the progress of execution on the accelerator_view (this includes not only parallel_for_each but also explicit asynchronous copies), and you don’t want to do it in a blocking fashion (e.g. using accelerator_view::wait or queuing_mode_immediate), you may want to instrument the sequence of commands scheduled by your program with markers:

class accelerator_view
{
    completion_future create_marker();
};

A marker binds an accelerator_view-side command with a host-side completion_future (which is essentially a std::shared_future<void> with added support for continuations). The future becomes ready once the accelerator_view execution passes the associated command; see Figure 2 for a mental model.


There are two observations worth making about this mechanism.

First, you don’t have to trouble yourself with preventing commands from being reordered – markers themselves constitute a fence with regard to other commands. In other words, C++ AMP guarantees that neither a parallel_for_each nor a copy command will be reordered across a marker.

Second, the granularity of a marker is a single command issued from the host. That means you cannot create a marker in the middle of a parallel_for_each or copy operation.


In the attached project you will find a scenario similar to the one presented in Figure 2. There, I create a “watchdog” thread to report on the progress of 100 consecutive parallel_for_each loops executed on the main thread. The only part that I have not covered here is the data structure used to pass std::shared_futures safely between the two threads, but I tried to keep it simple and use only standard C++11 mechanisms, so you should have no problem following the code.

As a disclaimer, I am obligated to add that, in production-quality code, breaking a parallel_for_each loop into multiple parts just to add markers between them is bad practice, as it incurs additional overhead.

Other scenarios

Markers may also be used to implement a coarse-grained timer for accelerator_view-side execution – i.e. you may create two markers, one before and one after the operation of interest, and measure the time difference between the two futures becoming ready.

A more advanced use may be to keep a specified number of parallel_for_each loops in flight in an algorithm that balances work between multiple GPUs. A solution to this problem is left as an exercise for the reader :).

As always, your feedback is welcome below or in our MSDN Forum.

Comments (3)

  1. tobi says:

    The marker model might become problematic if some accelerator can execute jobs in parallel. When exactly will the marker be completed in such a model?

    Also, what if two unrelated CPU threads queue jobs on the same acc_view (i.e. the default view)? Will the marker barrier's completion depend on both CPU threads' queued jobs? This looks like a global solution to a local problem.

    I'd think it would be much more clearly defined to have each parallel_for_each return a future. The user can then use that future however he wants.

  2. tobi, thank you for your comment – I’m glad to answer the questions you’ve raised.

    As you can read on our blog, C++ AMP guarantees that parallel_for_each and copy operations behave as if synchronous. Behind the scenes, work may be reordered and parallelized as long as the effects cannot be observed to violate the sequential order of work submission.

    The marker model works by ensuring that any asynchronous execution that has been speculatively deferred by the hardware is executed to completion. That means the std::shared_future will be made ready only when all the work submitted to the given accelerator_view prior to the creation of the marker has physically completed – please note that I mentioned this briefly in the blog post, stating that markers are “fences” w.r.t. other commands. I believe this statement is the easiest mental model for understanding the underlying mechanism.

    If you are interested in a deeper explanation of the synchronization model C++ AMP employs, I would recommend reading section 8 of the C++ AMP open spec; please find a link in the right-hand side column.

    This explanation holds for the scenario you’ve described, i.e. a situation where two unrelated CPU threads queue commands on the same accelerator_view – the created marker will wait for all work previously issued by all threads. Note, however, that you would have to rely on other CPU-side synchronization mechanisms to ensure that the marker is created only after all the work items you are interested in have been issued from both threads – i.e. that you do not have a race condition in your algorithm; the accelerator_view itself is thread-safe.

    On the other hand, if your intent is to separate work submitted by different threads, it can be easily achieved by creating a separate accelerator_view for each of them.

    The idea of parallel_for_each returning a std::future is a good one for a future release (no pun intended) – thank you for the feedback.

  3. tobi says:

    Hey Łukasz,

    thanks for your response. I understand the semantics of the marker and the deferred execution of for-each. I was kind of concerned that the API was dangerously designed. I'd expect many people to make the mistake of using only the default view, which will introduce unneeded inter-thread coupling. Imagine two physics simulation windows being open at the same time. You don't want the marker of one window to depend on the other window's marker. Not even accidentally. This API _can_ be used correctly, but it is easy to make a mistake here. That is how the world works: APIs get misused all the time. Of course this is entirely in your domain of expertise. I just intended to raise awareness of this type of API misuse.

    For example, you surely would not introduce such a barrier primitive for CPU-side tasks and futures. It is quite obvious that it would be a mistake, because the API would be misused constantly. I think the same is true for GPU-side barriers.

    Btw, I think the amp object model is quite well designed. This is just a minor point that I noticed.