Handling TDRs in C++ AMP

Article
03/06/2012

In this blog post I would like to discuss the interaction of C++ AMP with Windows Timeout Detection and Recovery (TDR), a mechanism used to maintain GPU responsiveness.

What is a TDR

The GPU has traditionally been used mainly for display, which makes a responsive user experience its highest priority. One of the most common stability problems occurs when the GPU is busy processing an intensive graphics operation and nothing is updated on the screen. The system appears to be frozen, and the user usually resolves the hang by rebooting.

To improve the user experience, Windows uses a mechanism called TDR to reset the display driver when a task appears to be hanging or running longer than the permitted quantum time (default is 2 seconds). The display driver is restarted to free up the GPU for display and other waiting apps. Users will notice a momentary screen flicker with a message in the task bar like “Display driver has stopped responding and has successfully recovered”.

How C++ AMP apps are affected by TDR

C++ AMP apps can easily trigger a TDR during intensive computations that execute longer than the permitted quantum time. When a TDR occurs, all commands executing on the offending accelerator_view are cancelled and any state stored on that accelerator_view is lost. Here are some examples of situations where a TDR can occur in your C++ AMP app:

The app has a parallel_for_each that takes longer than 2 seconds to complete.
An irrecoverable out-of-memory condition is encountered while copying data to the accelerator.
The accelerator_view has its queuing mode set to automatic, and a batch of copy and parallel_for_each operations is sent to the accelerator as a single DMA buffer that takes longer than 2 seconds to execute.
Another app causes a TDR, which is broadcast to all apps using the same accelerator.
The device is physically removed from the system.

Detecting a TDR

Well-written apps should always be prepared to handle a TDR when executing an operation on an accelerator. C++ AMP reports a TDR by throwing the accelerator_view_removed exception. The code snippet below shows this exception being caught for a long running kernel.

#include <amp.h>
#include <iostream>
#include <vector>

using namespace concurrency;

// Long running kernel
void compute_kernel(std::vector<int>& data, std::vector<int>& results)
{
    // Copy the input to an array on the default accelerator_view
    array<int,1> result_array(static_cast<int>(data.size()), data.begin());
    parallel_for_each(result_array.extent,
        [&](index<1> idx) restrict(amp)
        {
            // Long running computation
        });
    // Copying back to the host synchronizes with the accelerator
    results = result_array;
}

// returns true if the computation was successful, false otherwise
bool compute(std::vector<int>& data, std::vector<int>& results)
{
    try
    {
        compute_kernel(data, results);
        return true;
    }
    catch (const concurrency::accelerator_view_removed& ex)
    {
        std::cout << "TDR exception received: " << ex.what() << std::endl;
        std::cout << "Error code: " << std::hex << ex.get_error_code() << std::endl;
        std::cout << "Removed reason: " << std::hex
                  << ex.get_view_removed_reason() << std::endl;
        return false;
    }
}

Output

TDR exception received: Failed to wait for D3D marker event.

Error code: 887a0005

Removed reason: 887a0006

The accelerator_view_removed exception usually has a message like the ones shown below:

Failed to wait for D3D marker event. Error code: 887A0005
Failed to map staging buffer. Error code: 887A0005

You can get additional diagnostics about the failure by compiling with the /MTd or /MDd compiler switches, which cause the debug version of the AMP runtime to be used. Here is the output for the previous program when compiled with the debug runtime. The more descriptive error message gives us a clue that this exception is due to a long running kernel.

TDR exception received: Failed to wait for D3D marker event.

ID3D11Device::RemoveDevice: Device removal has been triggered for the following reason(DXGI_ERROR_DEVICE_HUNG: The Device took an unreasonable amount of time to execute its commands, or the hardware crashed/hung. As a result, the TDR (Timeout Detection and Recovery) mechanism has been triggered. The current Device Context was executing commands when the hang occurred. The application may want to respawn and fallback to less aggressive use of the display hardware).

Always examine the error code and the removed reason to determine the cause of the TDR. In the example above, the executing operation exited with a DXGI_ERROR_DEVICE_HUNG error code. This means that the device was detected as hanging or possibly taking longer than its permitted quantum. To show another example, here is the exception message received for an error during large copy operations. Note that this TDR occurred because of an out-of-memory condition.

ID3D11Device::RemoveDevice: Device removal has been triggered for the following reason (E_OUTOFMEMORY: The application tried to use more adapter memory than the Device can simultaneously accommodate. As a result, the TDR (Timeout Detection and Recovery) mechanism has been triggered. The current Device Context was executing commands when exhaustion occurred. The application needs to make less aggressive use of the display memory, perhaps by leveraging ClearState to ensure large Resources do not stay bound to the pipeline, or by leveraging information, such as DXGI_ADAPTER_DESC::DedicatedVideoMemory)
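The hexadecimal values printed by the earlier snippet are easier to act on once they are mapped back to their symbolic names. Below is a minimal sketch of such a mapping. The helper name describe_hresult is hypothetical (it is not part of C++ AMP), and it takes the code as an unsigned 32-bit value so that it compiles without any Windows headers. The table covers only the codes mentioned in this post plus two closely related DXGI device errors.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of C++ AMP): returns the symbolic name of a
// few HRESULT values that can accompany an accelerator_view_removed exception.
std::string describe_hresult(std::uint32_t code)
{
    switch (code)
    {
    case 0x887A0005u: return "DXGI_ERROR_DEVICE_REMOVED";
    case 0x887A0006u: return "DXGI_ERROR_DEVICE_HUNG";
    case 0x887A0007u: return "DXGI_ERROR_DEVICE_RESET";
    case 0x887A0020u: return "DXGI_ERROR_DRIVER_INTERNAL_ERROR";
    case 0x8007000Eu: return "E_OUTOFMEMORY";
    default:          return "unrecognized HRESULT";
    }
}
```

In a catch block you could call it as describe_hresult(static_cast<std::uint32_t>(ex.get_view_removed_reason())). For the output shown earlier, the error code 887a0005 maps to DXGI_ERROR_DEVICE_REMOVED and the removed reason 887a0006 maps to DXGI_ERROR_DEVICE_HUNG.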

Recovering from a TDR

Depending on the nature of your app, you may want to recover and continue after the TDR. To achieve this, you will have to create a fresh accelerator_view, reallocate memory resources and populate them afresh on the new accelerator_view.

Note that if your app uses the default accelerator_view, it will not be possible to recover unless you switch to a different accelerator or restart the app. Using a non-default accelerator_view can make your code cleaner, because that accelerator_view can be discarded and a new one created (still corresponding to the same accelerator) in the event of a TDR.

If your individual commands are not long running but collectively they result in a long running execution, you can try again with queuing_mode set to immediate. Flushing regularly or using the immediate queuing mode can help avoid TDRs, as we discussed in our queuing mode blog post.

The code snippet below shows one way of restarting the computation by using a non-default accelerator_view and changing the queuing mode after a TDR.

#include <amp.h>
#include <iostream>
#include <vector>

using namespace concurrency;

// Long running kernel
void compute_kernel(std::vector<int>& data, std::vector<int>& results,
                    accelerator device,
                    queuing_mode qmode = queuing_mode::queuing_mode_automatic)
{
    // Create a new accelerator_view
    accelerator_view av = device.create_view(qmode);
    // Allocate and populate the data on the new accelerator_view
    array<int,1> result_array(static_cast<int>(data.size()), data.begin(), av);
    parallel_for_each(result_array.extent,
        [&](index<1> idx) restrict(amp)
        {
            // Long running computation
        });
    results = result_array;
}

// returns true if the computation was successful, false otherwise
bool compute(std::vector<int>& data, std::vector<int>& results)
{
    accelerator device = accelerator();
    try
    {
        compute_kernel(data, results, device);
    }
    catch (const concurrency::accelerator_view_removed& ex)
    {
        std::cout << "TDR exception received: " << ex.what() << std::endl;
        std::cout << "Error code: " << std::hex << ex.get_error_code() << std::endl;
        try
        {
            std::cout << "Retrying with immediate queuing_mode..." << std::endl;
            // Uses new accelerator_view
            // Reallocates memory resources
            compute_kernel(data, results, device,
                           queuing_mode::queuing_mode_immediate);
        }
        catch (const concurrency::accelerator_view_removed& retry_ex)
        {
            std::cout << "TDR exception received: " << retry_ex.what() << std::endl;
            std::cout << "Error code: " << std::hex << retry_ex.get_error_code() << std::endl;
            std::cout << "Removed reason: " << std::hex
                      << retry_ex.get_view_removed_reason() << std::endl;
            std::cout << "Aborting." << std::endl;
            return false;
        }
    }
    return true;
}

Avoiding a TDR

While your C++ AMP applications must always be prepared to deal with a TDR, you should try to avoid one if possible. Here are some tips on how to avoid a TDR.

First, you should design for short kernel execution time. If possible, make sure that each of your kernels runs for less than the permitted default quantum time (2 seconds). If the amount of data is determined by the user, design the kernel so that its running time stays bounded regardless of the input size. One way to do this is by chunking up copy operations or computations.
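To illustrate the chunking idea, here is a minimal host-side sketch that splits a large index range into pieces of bounded size. The names Chunk and make_chunks are hypothetical, and the code deliberately uses only standard C++, so it shows the partitioning logic rather than any C++ AMP API. The right chunk size is an assumption you would have to tune by measuring your own kernel on your target hardware.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical type: one piece of work covering [offset, offset + count).
struct Chunk
{
    std::size_t offset;
    std::size_t count;
};

// Hypothetical helper: splits [0, total) into consecutive chunks that each
// hold at most chunk_size elements. The last chunk may be smaller.
std::vector<Chunk> make_chunks(std::size_t total, std::size_t chunk_size)
{
    if (chunk_size == 0)
        throw std::invalid_argument("chunk_size must be greater than zero");

    std::vector<Chunk> chunks;
    std::size_t offset = 0;
    while (offset < total)
    {
        const std::size_t count = std::min(chunk_size, total - offset);
        chunks.push_back(Chunk{offset, count});
        offset += count; // never advances past total
    }
    return chunks;
}
```

In a C++ AMP app you would loop over the returned chunks and issue one copy or one parallel_for_each per chunk, covering only that chunk's offset and count. With the automatic queuing mode, it may also be worth flushing or waiting on the accelerator_view between chunks so that they are not batched back into a single long-running submission.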

Another option, available only on Windows 8 (and the corresponding Windows Server 8), is to programmatically create an accelerator_view with a setting that allows for execution time greater than 2 seconds as long as there is no contention for that accelerator. For more on that approach, please read our blog post “Disabling TDR on Windows 8”.

A third option to avoid TDRs when debugging and testing your app is to change the TDR related registry keys. However, we do not recommend using these registry keys as they may make your system unstable.

In conclusion

Handling TDRs can be tricky, but a graceful exit or recovery can enhance the experience of your C++ AMP app, especially when your app runs on a variety of accelerators.

Does your app handle TDR? Is recovery from TDR important to you? We would love to hear from you! Please feel free to share your comments below or at our MSDN forum.