After shipping a number of improvements to Graphics Diagnostics in Visual Studio 2013 Update 3 RC, the team has been working to bring you more profiling tools for DirectX applications. In Visual Studio 2013 Update 4 CTP1, released yesterday (Download here), you will find a brand new GPU Usage tool in the Performance and Diagnostics hub that you can use to collect and analyze GPU usage data for DirectX applications. CTP1 supports Windows Desktop and Windows Store apps running locally. Windows Phone app support and remote support will come in a later release. You can find the documentation here, watch a live demo in this Channel9 video, or read the rest of this blog post to learn more about this feature. :)
The world would be such a better place if all games ran at 60 FPS and no performance problems needed to be investigated! But in reality, during development and sometimes after release, there are apps that can’t hit their target frame rate (whether it’s 60 FPS on PC or 30 FPS on a smaller device) and apps whose frame rate drops in the middle of a session.
The cause of performance problems in DirectX apps can vary from utilizing only a single CPU core when multiple cores could readily be used, to the GPU rendering an overly complex mesh. To understand the cause, it usually helps to start by isolating whether the main problem is over- or under-utilization of either the CPU or the GPU. The GPU Usage tool can help you determine whether the CPU or the GPU is the performance bottleneck of the application. You can also inspect the timing of each individual GPU event if a supported graphics card is present and the latest drivers are installed. Please check this document for a list of supported graphics cards, and check your graphics card vendor’s website (Intel, NVIDIA, AMD) to download the latest driver, which supplies GPU event data for this feature.
Let’s give it a first try!
The GPU Usage tool can be launched from the Performance and Diagnostics hub, via the DEBUG->Performance and Diagnostics menu or Alt+F2.
From here you can choose to check the GPU Usage alone or you can check other tools to run along with it, such as CPU Usage.
To start off, let’s click Start to run the GPU Usage tool by itself on the default DirectX project created using the DirectX project template. On the prompted User Account Control dialog that asks for your permission to collect data, click Yes.
The GPU Usage tool starts collecting data and displays three live graphs in the opened diagsession file: the Frame time and FPS graphs, which are also available in the Graphics Diagnostics tool, and a brand new GPU utilization graph that shows how busy the GPU is at a high level.
Now let’s click on the Stop collection link at the bottom or the Stop button in the top left to generate a report. The generated report shows the same three graphs from the live session. If you would like to drill into the details of a specific range in the timeline, for instance a frame rate drop or a spike in GPU utilization, you can select that range and click the here link at the bottom to view the GPU usage details. In this example, the app was running smoothly for the entire session, so we can pick any range to inspect the GPU details.
The GPU details window then opens separately from the diagsession window. The top half is a timeline view with lanes showing how each CPU core and GPU engine is used over time, and the bottom half contains an event list showing the graphics events that occurred on the GPU. Note that the data in the event list requires graphics driver support, so it may not be available if your graphics card is not supported or the latest driver hasn’t been installed, in which case all the events will be marked “unattributed”. You can check this document for a list of supported graphics cards, and check your graphics card vendor’s website (Intel, NVIDIA, AMD) to download the latest driver, which supplies GPU event data for this feature.
All the processes that used the GPU will be captured, and each process is assigned a different color in the timeline view. In this example, yellow represents the target process of profiling which is App5.exe.
As you click or navigate through the event list, you’ll notice a little popup widget on the CPU and GPU lanes showing when the selected event was executed on the GPU and when its corresponding CPU work happened on the CPU. Light grey vertical lines across the lanes mark Vsyncs from each monitor. Vsync lines can be used as a reference to understand whether certain Present calls missed Vsync. There must be one Present call between every two Vsyncs in order for the app to steadily hit 60 FPS.
This GPU details view provides useful information to understand:
- How busy the CPU and GPU are on a more granular level
- When DirectX events were called on the CPU and when they were executed on the GPU
- How long each event took on both the GPU and CPU
- If the target frame rate was missed by Present calls missing Vsyncs
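The last point comes down to simple bookkeeping. As a minimal sketch, assume a fixed Vsync period and a list of Present timestamps in milliseconds. The helper names and the sample numbers are hypothetical and are not part of the GPU Usage tool. The code counts how many Vsync intervals pass without a Present and estimates the resulting frame rate:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the GPU Usage tool): counts the Vsync
// intervals in [0, intervalCount * vsyncPeriodMs) that contain no Present.
inline std::size_t CountMissedVsyncs(const std::vector<double>& presentTimesMs,
                                     double vsyncPeriodMs,
                                     std::size_t intervalCount)
{
    assert(vsyncPeriodMs > 0.0);
    std::vector<bool> hasPresent(intervalCount, false);
    for (double t : presentTimesMs)
    {
        if (t < 0.0) continue;  // ignore timestamps before the first Vsync
        const std::size_t index = static_cast<std::size_t>(t / vsyncPeriodMs);
        if (index < intervalCount) hasPresent[index] = true;
    }

    std::size_t missed = 0;
    for (bool hit : hasPresent)
    {
        if (!hit) ++missed;
    }
    return missed;
}

// Effective frame rate, assuming every interval that contains a Present
// shows exactly one new frame.
inline double EffectiveFps(double refreshHz, std::size_t intervalCount,
                           std::size_t missed)
{
    assert(intervalCount > 0 && missed <= intervalCount);
    return refreshHz * static_cast<double>(intervalCount - missed)
                     / static_cast<double>(intervalCount);
}
```

For example, on a 60 Hz display, if two out of eight Vsync intervals contain no Present, the effective frame rate under these assumptions is 60 × 6 / 8 = 45 FPS.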
The benefits may not be obvious in this example, because the app is very simple and neither the GPU nor the CPU is busy. In the next section, we’ll try it on a more realistic app and see how the data can be used.
Let’s get busy and analyze a more realistic app
In this example, we’re going to use an internal test app called CityDemo, which renders 3D scenes of a simulated city. This time, we’ll try to run both GPU Usage and CPU Usage tools in the same session. While only the GPU Usage tool is required to determine whether an app is CPU bound or GPU bound, adding the CPU Usage information will allow us to more quickly analyze the situation if the CPU is found to be a problem (hint, hint).
Again, let’s launch the Performance and Diagnostics hub, but this time we’ll select both GPU Usage and CPU Usage. The FPS graph tells us that the app is running at ~40 FPS. The red line in the FPS graph represents the default threshold value of 60 FPS. You can change it to 30 FPS using the dropdown if you want to target a lower frame rate. You’ll also notice that we have a CPU utilization graph because we selected the CPU Usage tool. This provides a cohesive view of GPU and CPU status at a high level. In this case, CPU utilization was around 20%, and GPU utilization was around 60%. So neither the CPU nor the GPU is fully utilized. Why, then, is the app not hitting 60 FPS?
To solve the mystery, let’s drill into the GPU details to see if there is any clue as to why the app is running slow. Since the graphs are constant, we can select any range and open the GPU details view. From the timelines in the details view we can tell that:
1. Present calls on the GPU miss Vsync roughly 1 out of 4 times, which resulted in ~40 FPS. We don’t mark the Present calls on the graph (yet), but in this case the Presents are at the end of each block on the GPU lane. Using the Filter control above the timeline to show just the Present events makes them easier to find.
2. Notice that some of the events are grouped, such as “Draw city” and “Draw rain”. The groups come from markers inserted into the app using the ID3DUserDefinedAnnotation interface. Adding markers to group your rendering code by section can greatly help with figuring out which part of your rendering code is expensive, especially for complex applications. Here is an example of how to insert markers into the app:
3. Looking at the event list, we can tell that “Draw city” took about 14 ms to render on the GPU. Compare where the CPU started “Draw city” to where it started “Draw rain” on the CPU3 lane in the two screenshots below: they’re very close to each other. This shows that the CPU quickly finished “Draw city” and started “Draw rain” right away. But from where “Draw rain” started to the end of the block on the CPU3 lane, it took the CPU nearly 12 ms to prepare the data for the rain drops.
4. At this point, we can tell that the app is CPU bound: the GPU was waiting on the CPU to process the data for the rain drops, which was expensive. Looking at the CPU core lanes, we see that this app uses only one core at a time while the other three CPU cores sit idle. One busy core out of four caps overall CPU utilization at 25%, which is consistent with the roughly 20% shown in the CPU utilization graph.
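The markers described in step 2 are often wrapped in a small scope guard so that every BeginEvent is paired with an EndEvent. The sketch below is not the CityDemo source. It assumes only that the annotation object exposes BeginEvent(const wchar_t*) and EndEvent(), as ID3DUserDefinedAnnotation does. ScopedGpuMarker, RenderFrame, and the RecordingAnnotation stand-in are hypothetical names, introduced so the example compiles without the Direct3D headers:

```cpp
#include <cassert>
#include <string>
#include <vector>

// RAII guard: opens a named event group on construction and closes it on
// destruction. Annotation is any type with BeginEvent(const wchar_t*) and
// EndEvent(); ID3DUserDefinedAnnotation (d3d11_1.h) has methods of that shape.
template <typename Annotation>
class ScopedGpuMarker
{
public:
    ScopedGpuMarker(Annotation* annotation, const wchar_t* name)
        : m_annotation(annotation)
    {
        assert(name != nullptr);
        if (m_annotation) m_annotation->BeginEvent(name);
    }

    ~ScopedGpuMarker()
    {
        if (m_annotation) m_annotation->EndEvent();
    }

    ScopedGpuMarker(const ScopedGpuMarker&) = delete;
    ScopedGpuMarker& operator=(const ScopedGpuMarker&) = delete;

private:
    Annotation* m_annotation;
};

// Hypothetical stand-in for illustration only: records the calls it receives
// instead of talking to a GPU.
struct RecordingAnnotation
{
    std::vector<std::wstring> log;

    int BeginEvent(const wchar_t* name)
    {
        log.push_back(std::wstring(L"begin:") + name);
        return 0;
    }

    int EndEvent()
    {
        log.push_back(L"end");
        return 0;
    }
};

// Hypothetical render function with two annotated sections.
template <typename Annotation>
void RenderFrame(Annotation* annotation)
{
    {
        ScopedGpuMarker<Annotation> marker(annotation, L"Draw city");
        // ... issue the draw calls for the city here ...
    }
    {
        ScopedGpuMarker<Annotation> marker(annotation, L"Draw rain");
        // ... issue the draw calls for the rain here ...
    }
}
```

In a Direct3D 11.1 app, the annotation pointer would come from calling QueryInterface on the ID3D11DeviceContext for ID3DUserDefinedAnnotation. The guard accepts a null pointer, so the rendering code still runs if that query fails.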
Now that we know the app is CPU bound, let’s look at the CPU usage details by going back to the main view and selecting the CPU Usage tab (good thing we turned on CPU Usage collection when we started!). Here we can drill into the call tree and see which functions were using the most CPU. In this case, it’s STL calls made by SkinnableModel::CommitUpdates that consumed 66.31% of the selected CPU time. We can right-click on the function and click View Source to bring up this function in the editor.
In the CommitUpdates function, we see that the STL is called by the following code:
for (auto iter = m_pendingInstances.begin(); iter != m_pendingInstances.end(); iter++)
    instanceList[instanceIndex++] = *iter;
At this point we know that this is the bottleneck of our app. This for loop iterates 5000 times to prepare data for each rain drop. We can make this faster by parallelizing the task to leverage all four CPU cores on this machine. One way to implement this could be to turn the for loop into parallel_for (for_each would do the same trick in this case :)).
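// Assumes m_pendingInstances is indexable (e.g. std::vector); each iteration writes only its own slot.
parallel_for(0, size, [&](int instanceIndex) {
    instanceList[instanceIndex] = m_pendingInstances[instanceIndex];
});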
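parallel_for comes from the Parallel Patterns Library that ships with Visual C++. As a portable illustration of the same idea, the sketch below swaps in std::thread: it splits the copy into contiguous chunks and gives each chunk to its own thread. It is not the CityDemo code. The function name ParallelCopy, the use of std::vector, and the chunking scheme are all assumptions made for the example:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: copies source into destination using up to
// workerCount threads. Each thread owns a disjoint index range, so no two
// threads write the same element and no shared iterator is advanced.
template <typename T>
void ParallelCopy(const std::vector<T>& source, std::vector<T>& destination,
                  std::size_t workerCount)
{
    assert(workerCount > 0);
    destination.resize(source.size());

    // Round up so that workerCount chunks always cover the whole range.
    const std::size_t chunk = (source.size() + workerCount - 1) / workerCount;

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < source.size(); begin += chunk)
    {
        const std::size_t end = std::min(begin + chunk, source.size());
        workers.emplace_back([&source, &destination, begin, end]
        {
            for (std::size_t i = begin; i < end; ++i)
                destination[i] = source[i];
        });
    }

    // Wait for every chunk to finish before the caller reads destination.
    for (std::thread& worker : workers)
        worker.join();
}
```

The key design point is that each worker derives its indices from its own range rather than from a counter or iterator shared across threads. Starting new threads on every call has a cost of its own, so a library that reuses worker threads, as parallel_for does, is usually the better fit for per-frame work.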
Now run the app again. Woohoo! The frame rate went up to 60 FPS, and here is the “after” graph, which shows the GPU hitting every Vsync and all four CPU cores being utilized.
In this blog post, we walked through how to use the GPU Usage tool. Is this tool helpful? How do you like it? If you haven’t already, please download Visual Studio 2013 Update 4 CTP1, try it out, and let us know! :)