Finally, we have shipped!
It’s been solid work since Nov from the Graphics and Terrain team on SP1 performance.
Here is a work-list on performance so you can understand what we did:
General Performance Work
1. performed more work on the LOD system,
2. optimized the UI rendering to reduce overhead when the UI is visible,
3. front-end texture loader no longer loads full mips when it doesn’t need to
4. removed redundant elevation queries when scenery complexity is low
5. avoid rasterizing water into the DEM for textures that are all land
6. fixed redundant vertex issue with key Autogen BGLs
7. updated XToMdl tool to leverage the same vertex fix, resulting in a model vertex reduction on the order of 25-40% for 3rd party developers.
D3D Performance Work
8. enabled skinning for more animated objects, which reduces Draw calls,
9. batched Autogen objects to reduce Draw calls,
10. optimized tree rendering to reduce SetTexture calls
11. coalesced shader state to reduce uploads to the card
12. fixed 3 FS8 AI aircraft in terms of Draw calls: the MD-80, Dash-8, and Cherokee. This is a >10x reduction. They are 25% of the worldwide air traffic DB, so this can be significant.
Multi-core Performance Work
13. moved DEM loading to threads,
14. moved terrain texture synthesis (the process itself is documented in Adams' "Global Terrain Technology for Flight sim" paper at http://fsinsider.com/Community/Developers-Corner/Global+Terrain+for+Flight+Simulator.htm; see the bit about the layers and texture synthesis) to threads,
15. moved Autogen batch rebuilds to threads
That’s 15 different work items. We were busy!
How it went together
Now, don’t think because we are calling it a “patch” that we are doing binary patches. We are not.
Yes, we use a delta-patching technology in the installer, for scenery BGLs and the like. Even there, some files have changed enough that the delta-patcher cannot handle them and we have to put a "full file" in the patch. The BGL for the Japanese traffic data is an example: we reversed the traffic vectors for the entire country, and almost every byte changed. We tried to put it through the delta-patcher, but it gave up and errored out during setup.
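To give a feel for why a file like that ends up shipping whole, here is a minimal sketch of a byte-level delta with a full-file fallback. This is not the installer's actual delta-patching technology. The `make_patch` and `apply_patch` helpers, the `(offset, byte)` delta format, and the size-based fallback rule are all assumptions made up for the example.

```python
def make_patch(old: bytes, new: bytes):
    """Return ("delta", changes) or ("full", new).

    Hypothetical format: a delta is a list of (offset, new_byte) pairs.
    """
    if len(old) != len(new):
        # This toy delta only handles in-place byte changes.
        return ("full", new)
    changes = [(i, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]
    # Assumption: each change costs about 5 bytes (4-byte offset + 1 byte).
    if len(changes) * 5 >= len(new):
        return ("full", new)   # the delta would be no smaller than the file
    return ("delta", changes)


def apply_patch(old: bytes, patch) -> bytes:
    """Rebuild the new file from the old file and a patch."""
    kind, payload = patch
    if kind == "full":
        return payload
    data = bytearray(old)
    for offset, value in payload:
        data[offset] = value
    return bytes(data)
```

When only a few bytes change the delta is tiny, but when nearly every byte changes the "delta" is bigger than the file it describes, so shipping the full file is the sensible choice.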
We *are* rebuilding the binaries from scratch. That's not trying to patch the old binaries; it's replacing them with new files, many of which have quite a bit of new code. The multi-core work, for instance, went through the terrain code stack from top to bottom. That's one reason why SP1 took so long. The multi-core infrastructure is solid, will use up to 256 cores if available, and will continue to be used as we migrate systems to it where it makes sense. Terrain and Autogen are it for now; we'll be evaluating when to do more.
General Performance Work
The general performance work reduced the amount of work we try to do in various scenarios. I want to call out the redundant vertex issue: that’s a key thing we fixed in the Autogen BGLs (autogen.bgl, vegetation.bgl, and roofs.bgl) as well as in the SDK tool, which passes the savings on to all 3rd party developers when they reauthor with the SP1 SDK.
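The details of the redundant vertex issue aren't spelled out here, so treat the following as a generic illustration only. One common form of redundant vertices is the same position stored once per triangle instead of being shared. A minimal sketch of collapsing those into a shared vertex list plus indices, where the `weld_vertices` name and the data layout are assumptions for the example:

```python
def weld_vertices(triangles):
    """Collapse duplicate vertices into a shared vertex list plus indices.

    `triangles` is a list of 3-tuples of (x, y, z) vertex positions.
    """
    vertices = []   # unique vertices, in first-seen order
    lookup = {}     # vertex -> index into `vertices`
    indices = []    # three indices per triangle
    for tri in triangles:
        for v in tri:
            if v not in lookup:
                lookup[v] = len(vertices)
                vertices.append(v)
            indices.append(lookup[v])
    return vertices, indices


# Two triangles forming a quad share an edge: 6 raw vertices become 4.
quad = [
    ((0, 0, 0), (1, 0, 0), (1, 1, 0)),
    ((0, 0, 0), (1, 1, 0), (0, 1, 0)),
]
vertices, indices = weld_vertices(quad)
```

In this toy quad that is a one-third reduction in stored vertices; real models will vary.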
D3D Performance Work
D3D API usage work was aimed at reducing our Draw and SetTexture API calls. So what are Draw and SetTexture calls? These are the D3D9 API calls that the engine issues to push textures and draw triangles down to the card, the bulk of the work in rendering. We were issuing way too many Draw and SetTexture calls; SP1 is a 35-40% reduction in both. Those optimizations are aimed at enabling the app to scale better on GPUs. We took some optimizations on shader state too, which is a nice win. And the 3 FS8 AI aircraft were just horrendous in RTM, so that’s another nice fix.
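To show why batching cuts both kinds of calls, here is a small sketch that only counts simulated calls; it does not touch D3D at all. The `count_calls` helper and the scene data are made up for illustration, and it assumes that consecutive objects sharing a texture can be merged into a single draw.

```python
def count_calls(objects, sort_by_texture):
    """Count simulated SetTexture and Draw calls for a list of objects.

    Each object is a (texture_name, triangle_count) pair. Consecutive
    objects sharing a texture are merged into one draw.
    """
    if sort_by_texture:
        objects = sorted(objects, key=lambda o: o[0])
    set_texture_calls = 0
    draw_calls = 0
    current = None
    for texture, _tris in objects:
        if texture != current:
            set_texture_calls += 1   # bind the new texture
            draw_calls += 1          # and start a new batch for it
            current = texture
        # Same texture as before: triangles join the open batch.
    return set_texture_calls, draw_calls


scene = [("oak", 12), ("roof", 40), ("oak", 12), ("roof", 40), ("oak", 12)]
unsorted_calls = count_calls(scene, sort_by_texture=False)
batched_calls = count_calls(scene, sort_by_texture=True)
```

With the interleaved scene every object forces a texture switch and its own draw; grouped by texture, the same five objects need two of each.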
Multi-core Performance Work
Intel is using FSX as one of their prime examples at IDF, and we had a lot of engineering time from one of their threading guys. Intel doesn’t do that lightly. We used the time to good benefit.
During loading, we run the DEM loader on threads. You'll see good, balanced usage across all cores, as well as about 1/3 faster load times on average.
During flight we spawn threads for Autogen batch rebuilds as well as the terrain texture synthesis. The terrain texture work tends to be a bit bursty; as an area gets generated, the load drops off. But as you fly forward, as you bank, and as the terrain is lighted (once a minute), threads are spawned. The terrain grid system is radial around the current viewpoint and, depending on the level of detail radius, can be up to 4.5 tiles in either direction, something like 64 tiles. So there is plenty of work to go around. Autogen is more constant, with a 2km extent being batched.
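The "something like 64 tiles" figure lines up with simple circle arithmetic, if you read "radial" as a circular area and 4.5 tiles as its radius. That reading is an assumption on my part, and `approx_tile_count` is just a name for the example:

```python
import math


def approx_tile_count(radius_tiles: float) -> float:
    """Area of a circle of the given radius, measured in tiles."""
    return math.pi * radius_tiles ** 2


estimate = approx_tile_count(4.5)   # pi * 20.25, a bit under 64
```

The exact count of tiles touched will depend on how the viewpoint sits relative to the grid, so treat this as a ballpark only.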
Even given the bursty nature of the core usage when flying, when there is load, it's pretty balanced across the cores. And we got rid of as many of the stutters as we could by going to a lock-free synchronization style. It's solid work that we are deservedly proud of.
As far as practical limits on the number of usable cores go, currently SetThreadAffinityMask only allows explicit scheduling of threads on 32 cores (the mask is a dword) on Win32. So that's our effective limit on the number of cores. But as soon as there is a way to explicitly schedule them, we can handle 256 cores.
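The 32-core ceiling falls straight out of the mask width: one bit per core, 32 bits in the mask. Here is a minimal sketch of just that bit arithmetic. It does not call the Win32 API, and `affinity_mask_for_core` is a hypothetical helper for illustration.

```python
MASK_BITS = 32  # width of a dword, per the text above


def affinity_mask_for_core(core_index: int) -> int:
    """Return a 32-bit mask with only the bit for `core_index` set."""
    if not 0 <= core_index < MASK_BITS:
        raise ValueError(
            f"core {core_index} does not fit in a {MASK_BITS}-bit mask"
        )
    return 1 << core_index
```

Cores 0 through 31 each get a bit; core 32 and up simply have nowhere to go in a mask that size.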
With all that said, the Draw and SetTexture API call reductions and Autogen size reductions are probably as important for FPS improvements; the multi-core work really shines for load balancing and reducing stutters and blurries. And both are critical for better scaling as CPUs and GPUs get better.
We think SP1 is going to deliver the goods for most users, and will reward users with better hw the most. We expect that, except on the very, very low end hw, all users should see a 20% gain. Some scenarios will see 40%, and some will see a bit more. It's really going to depend on a lot of variables. We hope this enables users to either fly at the same settings with greater FPS, or to bump the sliders up 1 or 2 ticks and still get the same FPS they had.
It’s going to take time to see if that holds true, but we had good results in our perf lab and with the beta testers.
The vastly improved batching of Autogen was one of the major performance items in SP1 and helps to reach our target reduction of 35-40% for Draw and SetTexture calls. However, it does have an implication that, when coupled with a feature we lost from FS9, you should be aware of.
FSX does not “alpha-fade” Autogen in the distance. This makes for a discernible “pop” of Autogen objects. SP1 batches objects in a 2 km boundary. This, when coupled with the lack of “alpha-fade”, does make the Autogen “pop” a little more obvious than in RTM. We think it’s a fair tradeoff, though, for the performance gain.
For DX10 we will look at bringing the “alpha-fade” back.
We changed our bucketing code in SP1, so if you use “Restore Defaults” from the UI, you may see different default settings. What did we do? Well, RTM only detected up to 512m of memory and used that as the “Ultra High” setting. With the 768m 8800 card out, there was no way to stratify that above the 512m 7950s. So SP1 detects up to 640m of graphics memory and treats 640m and greater as “Ultra High”.
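A minimal sketch of the effect, modeling detection as a simple clamp. The helper names and the clamp model are assumptions for illustration, not our actual bucketing code, and the lower buckets are left out since only the top one is described here.

```python
RTM_CAP_MB = 512   # highest value RTM detected
SP1_CAP_MB = 640   # highest value SP1 detects


def detected_memory_mb(actual_mb: int, cap_mb: int) -> int:
    """Clamp graphics memory to the highest value the detector knows about."""
    return min(actual_mb, cap_mb)


def is_ultra_high(actual_mb: int, cap_mb: int) -> bool:
    """A card lands in "Ultra High" when its detected memory reaches the cap."""
    return detected_memory_mb(actual_mb, cap_mb) >= cap_mb
```

Under the RTM cap, a 512m card and a 768m card both come out as “Ultra High”; under the SP1 cap only the 768m card does, which is the stratification we were after.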
There is an issue on Vista, where on some cards it can report a “shared” memory value larger than the physical value and that confuses our bucketing code. If you don’t have a DX10 card and you are getting bucketed “Ultra High” for instance, change your settings down. We’ll take a look at this again in DX10 to adjust the Vista bucketing.