When Performance Matters (Part I)


Over the last year and a half, I have created and presented a performance talk at PDC, MDC, and Tech Ed. It’s kind of daunting to get up in front of 100+ peers, but I get through it with lots of preparation. Since I have already done the groundwork and not everybody has been able to attend these great events, I am going to write a couple of posts that cover the core material from my presentation. The material will be a mix of architecture and best practices. The end goal is to build a basic speed cost model for managed code running in the .NET Compact Framework and to show how you can apply it to your application. Many of the best practices apply to full .NET Framework applications as well; in fact, a number of parts of my presentation were derived from full .NET CLR performance presentations.

 

There is some reference material that I don’t need to cover since we already have some great posts.

The .NET Compact Framework provides a great RAD environment. Often there are multiple ways to do things, some more performance-oriented than others, and sometimes there is a trade-off between memory and performance. All of the best practices I can give are generalizations, which means they may not be best in every application.

 

Measure, measure, measure. If performance matters.

 

Well, of course performance matters. The more difficult question is: which performance matters? The key to any application or platform is to understand the constraints and build appropriately. That means incorporating performance into the development process from the start: define criteria early, set goals, and measure throughout, adjusting as issues are discovered. One of the challenges on devices is the trade-offs forced by the constrained nature of memory – physical memory, virtual memory, and even storage – all of which affect the way we need to look at performance. The JIT compiler in the .NET Compact Framework is relatively simple for a compiler so that it can generate code quickly and mitigate the cost of pitching (discarding) JIT-compiled code under memory pressure. Pre-JIT-compiled code is not currently feasible on the device because storage is limited and native code is approximately 3 times the size of MSIL. As storage becomes more plentiful, we will need to reconsider whether we can persist JIT-compiled code on the device.

 

For the rest of this article, I will cover two things: first, some general principles for device programming; and second, what the .NET Compact Framework team has done for you in the performance area in v2.

 

Less code is faster code

  • Often code size is traded off for “elegance” or extensibility. In performance-critical areas, keep it simple.

Fewer objects are better for performance

  • Due to the limited memory on the device, the more objects that are allocated, the more memory pressure this puts on the system. For the .NET Compact Framework, this can result in more garbage collections, larger GC latencies and even pitching JIT compiled code.
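For example, string concatenation in a loop is a classic source of hidden allocations. Here is a hedged sketch (not from the original talk) of the difference; the concatenating version creates a new, progressively larger string on every iteration, while a StringBuilder reuses one buffer:

```csharp
using System.Text;

class AllocationDemo
{
    // Allocates a new intermediate string each iteration --
    // O(n) short-lived objects for the garbage collector to clean up.
    public static string BuildWithConcat(string[] parts)
    {
        string result = "";
        foreach (string part in parts)
            result += part;
        return result;
    }

    // One StringBuilder plus one final string -- far less GC pressure.
    public static string BuildWithBuilder(string[] parts)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string part in parts)
            sb.Append(part);
        return sb.ToString();
    }
}
```

Both produce the same string; the difference is entirely in how many objects the garbage collector has to deal with.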

Recycle and reuse expensive objects

  • Objects that are expensive to create and manage should be considered for caching and re-using. Be aware of the memory pressure this may put on the system.
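One common approach is a simple object pool. This is an illustrative sketch, not a framework API; note the caveat in the comments about the pool itself becoming memory pressure:

```csharp
using System.Collections.Generic;

// A minimal pool: rent an instance instead of allocating, return it when done.
class Pool<T> where T : new()
{
    private readonly Stack<T> _items = new Stack<T>();

    // Reuse a pooled instance when one is available; allocate only on a miss.
    public T Rent()
    {
        return _items.Count > 0 ? _items.Pop() : new T();
    }

    // Return the instance so the next caller avoids an allocation.
    // Caution: an unbounded pool is itself memory pressure -- cap it in practice.
    public void Return(T item)
    {
        _items.Push(item);
    }
}
```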

Batch work

  • A good example of batching effectively is to have a web service call that returns an array of items instead of requiring the application to make a call to retrieve each item.
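At the API level, the difference looks something like the sketch below (IOrderService, Order, and the method names are hypothetical, not from the post); the chatty shape pays one network round trip per item, the batched shape pays one round trip for the whole array:

```csharp
class Order
{
    public int Id;
}

// Chatty contract: n items => n web service round trips.
interface IOrderServiceChatty
{
    Order GetOrder(int id);
}

// Batched contract: n items => 1 web service round trip.
interface IOrderServiceBatched
{
    Order[] GetOrders(int[] ids);
}
```

On a device, where each round trip can cost hundreds of milliseconds, the batched contract usually dominates.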

Initialize lazily

  • Only create objects when they are needed. An example of lazy initialization is creating an Exception object just before throwing it, instead of in advance.
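A minimal lazy-initialization sketch (ExpensiveParser is a hypothetical stand-in for any costly object): the object is built on first access, not at startup, and never built at all if the property is never read.

```csharp
class ExpensiveParser
{
    public static int Constructions;  // counts how many were actually built
    public ExpensiveParser() { Constructions++; }
}

class Document
{
    private ExpensiveParser _parser;  // stays null until actually needed

    public ExpensiveParser Parser
    {
        get
        {
            if (_parser == null)
                _parser = new ExpensiveParser();  // created on first access only
            return _parser;
        }
    }
}
```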

Do work in the background to affect “perceived” performance

  • If an expensive object is definitely going to be required, consider loading it in the background. A System.Windows.Forms.Form has only one thread that processes events; if you perform an expensive request on that thread, the UI will hang and the application will feel slow to the user. Threads are relatively inexpensive in the .NET Compact Framework and Windows CE.
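A sketch of the pattern, assuming a hypothetical MainForm: the expensive work runs on a background thread, and Control.Invoke marshals the completion back to the single UI thread (on the .NET Compact Framework v1, Invoke accepts only an EventHandler delegate).

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

class MainForm : Form
{
    private Label _status = new Label();

    public MainForm()
    {
        Controls.Add(_status);
        _status.Text = "Loading...";
    }

    protected override void OnLoad(EventArgs e)
    {
        base.OnLoad(e);
        // Start after Load so the window handle exists for Invoke.
        new Thread(LoadData).Start();      // UI thread stays responsive
    }

    private void LoadData()
    {
        Thread.Sleep(2000);                // stand-in for an expensive request
        Invoke(new EventHandler(OnLoaded)); // hop back to the UI thread
    }

    private void OnLoaded(object sender, EventArgs e)
    {
        _status.Text = "Done";             // safe: runs on the UI thread
    }
}
```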

Since we released v1 of the .NET Compact Framework, much of the team has been focused on performance. In the v1 Service Packs, we improved the performance of parenting controls, XmlTextReader saw a significant speedup, and the Data classes and resource manager also received good improvements. In v2, we went a lot further: we modified the execution engine, re-architected the JIT compiler, improved general memory management, and overhauled the garbage collector.

 

JIT Compiler

  • We have a unified JIT compiler architecture across all CPU architectures
  • Improved code generation with more inlining and enregistration
  • Improved call path

Garbage Collector

  • Less overhead per object
  • Faster allocator and collector

Memory management

  • Reduced heap fragmentation
  • Easier to return GC heap memory to the OS
  • Strings are widely used, so we reduced the per-string overhead

How does all of this relate to real world applications? Very early in the v2 development cycle, I wrote a test harness and a number of performance tests. This is my personal performance test suite; I use it to validate the findings of our official performance suite, and it lets me always measure and report numbers relative to my previous measurements. The tests are my baseline of things I care about and report to my management. Some of the tests are micro-benchmarks, such as call path performance, while others are scenario tests for areas that needed attention. The measurements below were taken on my device today; I expect final v2 performance to be similar.

 

How do you read my table?

 

Above the red line are my micro-benchmark tests for the .NET Compact Framework CLR. The units are in calls/sec, iterations/sec and bytes/sec, so bigger is better.

 

Below the red line are my scenario tests, which show how the micro-benchmark gains affect real-world scenarios. I created three data-intensive scenario tests. The units are seconds, so smaller is better.

 

The columns show three products: v1.0, v1.0 Service Pack 2, and v2. The v1.0 and v1.0 Service Pack 2 columns show the same numbers for the micro-benchmarks because we did not change the CLR in the service packs, so the micro-benchmarks are unaffected. However, we did make some XmlTextReader performance enhancements, which you can see in the Service Pack 2 column.

 

(Pocket PC 2003, XScale 400MHz)                       1.0     1.0 SP2   2.0
Method Calls (calls/sec)                              3.7M    3.7M      8.1M
Virtual Calls (calls/sec)                             2.4M    2.4M      5.3M
Simple P/Invoke (calls/sec)                           733K    733K      1.7M
Primes to 1500 (iterations/sec)                       562     562       859
GC Small, 8 bytes (bytes/sec)                         1M      1M        7.5M
GC Array, 100 ints (bytes/sec)                        25M     25M       114M
--------------------------------- (red line) -------------------------------
XML Text Reader, 200KB (seconds)                      1.7     1.2       0.72
DataSet, static data, 4 tables, 1000 records (secs)   13.1    6.6       7.0
DataSet, ReadXml, 3 tables, 100 records (seconds)     12.3    6.5       4.9

 

For the micro-benchmarks, I test three areas:

  • Callpath – what is the cost of making a call?
      o Approximately a 100% improvement across the board
      o You can now see the approximate cost of a virtual call compared to a regular method call, as well as compared to a P/Invoke
  • Code generation – what is the quality of the code the JIT compiler generates?
      o The Primes test is kind of crude, but it validates the code-generation improvement for a computational loop
  • Object allocation and garbage collection – how fast can we allocate and collect garbage?
      o GC Small allocates 8-byte objects over and over for at least one second. At 7.5 MB/s, the test has performed at least 7 GCs. This test also includes one method call for the constructor of each object. This is the lower edge of allocation cost for the garbage collector.
      o GC Array allocates an int array of size 100. This creates a test without a constructor call and fewer objects to allocate/collect.
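The timing loops behind calls/sec numbers like these can be sketched roughly as follows. This is my guess at the shape of such a harness, not the author's actual test suite; Environment.TickCount is a common low-overhead clock on Windows CE, and its roughly millisecond resolution is why the loop must run for a second or more.

```csharp
using System;

class Bench
{
    static int _sink;

    static void Work() { _sink++; }   // stand-in for the operation under test

    // Returns approximate calls/sec for Work(), running for at least runMs.
    public static double CallsPerSecond(int runMs)
    {
        long calls = 0;
        int start = Environment.TickCount;
        int elapsed;
        do
        {
            for (int i = 0; i < 1000; i++)  // batch calls to amortize clock reads
                Work();
            calls += 1000;
            elapsed = Environment.TickCount - start;
        } while (elapsed < runMs);
        return calls * 1000.0 / elapsed;    // TickCount is in milliseconds
    }
}
```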

 

For the scenario tests, we are generally better than v1 Service Pack 2. Note, however, that these tests also measure the performance of the new System.Data and System.Xml code. Some of the DataSet scenarios are slower than v1 Service Pack 2, mostly because we added a number of features and a lot of code. We need to pay attention to the “less code is faster code” principle ourselves; in this case, though, we believe the features added to System.Data are definitely worthwhile.

 

Scott

 

This posting is provided "AS IS" with no warranties, and confers no rights.

 
