I whipped up this guide to help people on internal teams take some of our larger goals and think about them as they apply to their own area. I thought a lot of the advice was generally applicable to managed library authors, so I rewrote it for a wider audience, and I offer it here with my usual disclaimers...
How to think about major performance themes in your feature/library
When considering new features or significant upgrades to existing features, it’s important to understand how these items will impact the overall perceived performance of the system. We have several classes of goals to help us achieve a good customer experience. They are:
- No regressions (in a series of scenarios)
- Addressing any hot items from partners
- Planned improvements on core performance theme items
The first two of these are fairly easy for teams to act on, so most of the discussion focuses on the third point.
As always, the discussions below are necessarily abbreviated and are only intended to provide a helpful and practical model to think about the problems. Providing a 100% correct discussion of any of these issues would consume entire volumes and not be terribly helpful in any case so no attempt at perfection is made here.
1 No Regressions
It’s easy enough to understand a no regression bar; the trick is to define the appropriate scenarios to use for measuring meaningful regressions. There are two things you should do to achieve this goal:
- Teams must define any additional regression testing that is needed for their feature areas in key performance scenarios above and beyond the baseline for the overall system
- Teams must plan to respond to performance bugs marked in their area – regressions are the first source of performance bugs
It’s comparatively simple to get performance tests running regularly once they are authored; however, it is a bit of an art to produce tests that give reasonably repeatable numbers that can be trended. It’s in everyone’s interest to invest in these tests and to get them into the normal performance battery so that specific changes causing regressions can be identified.
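As a minimal sketch of what "reasonably repeatable numbers" can look like in practice, the snippet below takes the median of several timed runs after a few warm-up runs and flags a regression only when the result exceeds a stored baseline by more than a tolerance. The function names, the run counts, and the 5% tolerance are illustrative assumptions, not part of any existing performance battery.

```python
import statistics
import time


def measure(workload, runs=9, warmups=2):
    """Return the median wall-clock seconds of `workload` over `runs` timed runs.

    Warm-up runs are discarded so one-time costs (caches, lazy
    initialization) do not pollute the trend line.
    """
    for _ in range(warmups):
        workload()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    # The median is less sensitive to a single noisy run than the mean.
    return statistics.median(samples)


def is_regression(baseline, current, tolerance=0.05):
    """True when `current` exceeds `baseline` by more than `tolerance`.

    `tolerance` is a fraction (0.05 means 5%) and is an assumed noise
    budget; pick it from the run-to-run variation you actually observe.
    """
    return current > baseline * (1.0 + tolerance)
```

A scenario whose baseline is 1.00 s and whose current median is 1.10 s would be flagged, while 1.04 s would fall inside the assumed noise budget.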
2 Partner Items
From time to time we get specific requests for corrections in key scenarios that block partners. Some of these we take on as must-fix issues and they are assigned to a particular milestone.
Particular teams are assigned performance bugs to track this work; this is the second source of performance bugs. Naturally creation and assignment of these bugs is done in cooperation with the appropriate teams as it would otherwise be highly randomizing.
3 Performance Theme Goals
These are the high-level goals set by the performance team, our management, and our partners, to drive certain key performance themes. Because of their broad nature it is sometimes difficult for individual teams to understand how they might help or hinder this process. The main purpose of this entry is to provide some advice on how to best internalize these goals.
3.1 Managed Module Working Set (with ngen)
Working Set refers to the pages of a process’s virtual memory that are currently resident in physical memory, both shared and private. Managed Module Working Set refers specifically to those pages whose origin can be traced directly to a DLL that contains managed code (e.g. mscorlib.dll, system.dll, system.xml.dll).
The easiest way to help reduce the size of module working sets is of course to reduce the size of the modules, or at least grow the modules as slowly as possible. While it isn’t always the case that adding new code will affect the working set (because the code might not run in scenarios that are working set sensitive), overall code growth is nonetheless the leading indicator of expected working set growth. In contrast, removing or consolidating code doesn’t generally cause working set issues (though it’s possible in exotic cases).
Don’t forget the hidden costs associated with the presence of managed classes – these are things like metadata, vtables and so forth. These can end up rivaling the code itself for size. You can make this situation worse by having large numbers of attributes or other metadata, especially if they are frequently consulted.
Lastly, virtually any use of reflection will cause some otherwise cold metadata to be forced into the process working set; it’s only a question of how much metadata and how often that data is shared with other processes. Avoiding reflection where it is practical to do so means you will never face that particular problem.
3.1.3 How to Plan
- Ballpark the size of the code you expect to add to support a new feature; track this before you check in your code and then in the builds after
- Consider that for managed code our current level of overhead is such that you can expect the working set cost of data structures to support the code to be close in size to the code itself
- Pay special attention to any code that has to run even when your new feature isn’t being used; costs that are not “pay for play” are especially to be avoided
- Consider and plan for the additional cost of accessed metadata in the event that you use reflection
- Consider opportunities to consolidate new features with old rather than duplicating or partly duplicating code – this sounds obvious but we often fail to do it
- Be sure that the value you intend to give to your customer justifies the amount of code you will be adding (this is applicable in all the below items and won’t be repeated)
- Write a unit test to verify that the cost is what you think it is; plan to submit this to the perf team as soon as is reasonable (again, this is applicable in all the below items and won’t be repeated)
- Decide what the acceptance criteria should be well before it’s time to check in the code – last minute justification is a sign of absent planning
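The ballpark in the list above can be written down as simple arithmetic. This is a rough sketch only: it assumes supporting data structures cost about as much as the code itself (the rule of thumb stated above), adds whatever metadata you expect reflection to touch, and rounds up to 4k pages. The function name, the parameters, and the default ratio of 1.0 are illustrative assumptions; real numbers have to come from measurement.

```python
import math

PAGE_SIZE = 4096  # bytes; assumes 4k pages


def estimate_module_working_set(code_bytes,
                                reflected_metadata_bytes=0,
                                overhead_ratio=1.0):
    """Ballpark the working set cost of new code as (total_bytes, pages).

    overhead_ratio is the assumed size of supporting data structures
    (metadata, vtables and so forth) relative to the code; 1.0 encodes
    the "close in size to the code itself" rule of thumb.
    """
    supporting_bytes = code_bytes * overhead_ratio
    total = code_bytes + supporting_bytes + reflected_metadata_bytes
    # Working set is paid in whole pages, so round up.
    return int(total), math.ceil(total / PAGE_SIZE)
```

For example, 40,000 bytes of new code comes out at roughly 80,000 bytes (20 pages) before any reflection cost, which gives you a number to write into the acceptance criteria ahead of check-in.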
3.2 Per-Application Domain Working Set
This includes state maintained on a per AppDomain basis by the CLR, such as loader data structures, security policy, evidence, grant-sets etc. In addition to CLR overhead, each library can have per AppDomain overhead.
Interestingly it turns out that the bulk of the time/space we spend initializing managed state isn’t general startup but actually has to do with creating the first app domain. This is noteworthy because of course we’ll be paying roughly that same cost on the second and subsequent AppDomains.
If you’re writing managed code the primary sources of per-application domain data are the static members of your classes. The cost may appear in different places (e.g. primitive types don’t end up on the GC heap) but it’s nonetheless per AppDomain memory.
To reduce your per AppDomain working set be sure to defer as much initialization as possible to a time when you’re sure the initialization is necessary (this saves both code and data), reduce the static data (and of course the objects to which the static data refers), and simplify the construction path of that data as much as possible. Where data is shareable between AppDomains, consider plans that would facilitate a central copy – this can be worth the complexity for largish data items.
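To make the idea of deferred initialization concrete, here is a small language-neutral sketch written in Python. It is not CLR code and the class name is hypothetical; the point is only the shape: the expensive value is built on first use rather than when the containing type is set up, so a domain that never touches it pays nothing beyond the empty holder.

```python
import threading


class Deferred:
    """Holds a value that is built on first access instead of up front."""

    def __init__(self, factory):
        self._factory = factory        # callable that builds the value
        self._lock = threading.Lock()  # guards one-time construction
        self._ready = False
        self._value = None

    def get(self):
        # Fast path: already built, no locking needed.
        if not self._ready:
            with self._lock:
                # Re-check under the lock so the factory runs only once.
                if not self._ready:
                    self._value = self._factory()
                    self._ready = True
        return self._value
```

In managed code the same shape would apply to a static field whose contents are created on first access; the trade-off is a small check on every access in exchange for skipping both the initialization code path and the data when the feature goes unused.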
3.2.3 How to Plan
- Ballpark the size of any static data that you plan to add; you will pay for the size of the primitives and pointers regardless of whether they are initialized
- Ballpark the size of any initialization code that will perforce be activated on creation of a new AppDomain
- Consider strategies which defer the initialization of as much data as possible; this reduces code path and data size
- Consider strategies for sharing as much of the data as possible between AppDomains, especially constant data
- Write unit tests that create lots of AppDomains (several thousand) and then measure the marginal cost of your new feature in that environment
- Measure with your feature in use, not in use, and totally absent to assess the cost – assess this against your initial estimates to see how good a job you did
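The three-way measurement in the last two bullets reduces to a subtraction and a division. The sketch below assumes you have already recorded total memory for the same many-AppDomain test in three configurations (feature absent, present but not in use, and in use); the function and parameter names are hypothetical.

```python
def marginal_costs(absent_bytes, idle_bytes, active_bytes, domain_count):
    """Return the per-AppDomain marginal cost of a feature, in bytes.

    absent_bytes: total memory with the feature not present at all
    idle_bytes:   total memory with the feature present but not in use
    active_bytes: total memory with the feature in use
    domain_count: number of AppDomains created by the test
    """
    if domain_count <= 0:
        raise ValueError("domain_count must be positive")
    return {
        # Cost paid even when nobody uses the feature (not pay for play).
        "idle_per_domain": (idle_bytes - absent_bytes) / domain_count,
        # Cost paid when the feature is actually exercised.
        "active_per_domain": (active_bytes - absent_bytes) / domain_count,
    }
```

With 4,000 domains, totals of 100 MB (absent), 104 MB (idle) and 120 MB (active) work out to 1,000 bytes per domain just for existing and 5,000 bytes per domain in use. Creating thousands of domains matters because it makes the per-domain figure large enough to stand out from measurement noise.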
3.3 Private Bytes
Private memory is defined as memory allocated for a process which cannot be shared by other processes. This memory is more expensive than shared memory when multiple such processes execute on a machine. Private memory in (traditional) unmanaged DLLs usually consists of C++ statics and is on the order of 5% of the total working set of the DLL. NGEN'ed assemblies, on the other hand, contain more information. Apart from the module statics, they contain key CLR data structures required to support managed code execution at run time. Some of these data structures are private to the process.
Private bytes in the native images are predominantly caused by the application of “fixups” to get access to data whose location could not be known at ngen time. The biggest source of these by far is the fixups for string literals. But in any case it’s generally a bad idea to think too much about fixup problems when writing managed code because fixups are routinely targeted for eradication at the mscorwks level and control over them is very limited at best.
Instead of thinking about per module private bytes, you should be thinking about reducing private bytes more generally, and of course the main source of private bytes will be objects on the GC heap, especially long-lived objects. General reduction of long-lived managed objects is at least as valuable as private page reduction in the modules.
3.3.3 How To Plan
- Ballpark your total managed heap usage; consider the peak levels during initialization and at steady state
- Consider “collateral damage” done to existing classes/structures, estimate growth in these existing structures
- Use CLR Profiler to get an idea of how many objects of types you expect to affect are typically in memory in your scenarios
- Highlight any non-pay-for-play costs that are incurred on the heap
- Plan for unit tests that run a benchmark under CLR Profiler; account for the additional allocations and validate them against estimates
- Pay special attention to types reported as “relocated” in CLR Profiler, as these are longer lived objects
3.4 Startup Time
This is the CPU time needed to load and initialize the CLR and .NET Framework libraries, load user libraries, and get to the Main entry point.
As previously discussed, many of the costs associated with startup actually pertain to the creation of the first AppDomain – this is true for both space and time costs. Most if not all of the points discussed in the reduction of per AppDomain space are entirely applicable here, so they won’t be repeated.
Startup time generally is consumed in two big ways: soft faults, and I/O, and they are deeply intertwined. A soft fault occurs when a page that is required in the process’s working set is not present in the process but is present elsewhere in physical memory. Three reasons this happens are:
- a requested disk I/O that could be satisfied from the disk cache
- a part of a module that is already loaded in another process was needed in this process (this is kind of the same as the above)
- a page of zeroed memory was needed in this process
The remaining cases involve actual disk I/O and are significantly (e.g. 1000x) more expensive than the above. They are:
- A part of a module that was not already loaded elsewhere was needed in this process (hard fault)
- A piece of swapped out read/write data was needed (hard faults of this ilk don’t happen so much in startup scenarios)
- An un-cached piece of a file was referenced (normal disk I/O)
The above 6 categories will tend to dominate the cost.
3.4.3 How To Plan
- Ballpark the bytes in each of the categories above. Charge yourself about 1 microsecond for each 4k in the cheap bucket and 1 millisecond for each 4k in the expensive bucket (I pulled those numbers out of thin air to help planning; you have to measure to get real numbers)
- Remember that shared code ends up in the cheap bucket, as do needed zeroed pages (i.e. new allocations) but these both add up
- Analyze your startup path to get an idea of how much code is going to run; you only have to be accurate to the nearest page – for simple startup code the actual cost of running the code is typically dwarfed by the cost of bringing it into the process
- Consider registry accesses as equivalent to I/O because they are – so they are cheap if that part of the registry is cached
- Keep in mind that a startup time that’s much more than 20 milliseconds for a simple library would be considered a lot
- Give special attention to the process of reading your initialization files/state; these can be a big cost-driver
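The charging rule from the first bullet can be turned into a quick back-of-the-envelope calculator. The sketch below uses the planning numbers given above (1 microsecond per 4k page in the cheap bucket, 1 millisecond per 4k page in the expensive bucket), which are explicitly not measured values; the function names are hypothetical.

```python
PAGE_BYTES = 4096                 # one 4k page
CHEAP_US_PER_PAGE = 1.0           # soft faults: planning number, not measured
EXPENSIVE_US_PER_PAGE = 1000.0    # hard faults / real disk I/O: planning number


def pages(byte_count):
    """Round a byte count up to whole pages."""
    return -(-byte_count // PAGE_BYTES)


def estimate_startup_ms(cheap_bytes, expensive_bytes):
    """Ballpark startup cost in milliseconds from the two buckets."""
    cheap_us = pages(cheap_bytes) * CHEAP_US_PER_PAGE
    expensive_us = pages(expensive_bytes) * EXPENSIVE_US_PER_PAGE
    return (cheap_us + expensive_us) / 1000.0
```

For instance, touching 2 MB of already-shared code (512 pages, about 0.5 ms) plus 64 KB that has to come from disk (16 pages, about 16 ms) gives roughly 16.5 ms, already close to the 20 millisecond mark mentioned above. The expensive bucket dominates even though it is 32 times smaller, which is why cold I/O deserves most of the planning attention.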