Improving CLRProfiler 4: Reducing SampleObject memory consumption by 58%

In the previous three posts, we doubled CLRProfiler’s file loading speed through profile-guided optimization in three simple steps. Now let’s take a look at reducing CLRProfiler’s memory consumption, making it more useful for real-world applications.

I created a 10 GB profile using a performance test. The test program creates 19.7 million managed objects, averaging 195 bytes each, consuming a total of 3.85 GB of memory, with 32 garbage collections. CLRProfiler loads the file in 83 seconds with a 208 MB private working set.

Using the SOS command DumpHeap -stat, we can easily see what is really consuming memory in CLRProfiler:

002f4cc4        2      1040028 CLRProfiler.TimePos[]
618ff9ac    16048      1371112 System.String
618eebd4        2      4194336 System.UInt16[]
61902938    40227      4754996 System.Int32[]
002f2190  7870007    188880168 CLRProfiler.SampleObjectTable+SampleObject

The most expensive data type in memory is SampleObjectTable::SampleObject: there are 7.87 million instances, occupying 24 bytes each. The SampleObject class has three integer fields and one reference. These four fields should consume only 16 bytes, but on a 32-bit process the CLR adds 4 bytes for the method table pointer and 4 more bytes for the sync block index.

internal class SampleObject
{
    internal int typeIndex;
    internal int changeTickIndex;
    internal int origAllocTickIndex;
    internal SampleObject prev;

    internal SampleObject(int typeIndex, int changeTickIndex, int origAllocTickIndex, SampleObject prev)
    {
        this.typeIndex = typeIndex;
        this.changeTickIndex = changeTickIndex;
        this.origAllocTickIndex = origAllocTickIndex;
        this.prev = prev;
    }
}
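
The 24-byte figure can be checked with a little arithmetic. The following is only a back-of-the-envelope sketch of the 32-bit layout described above; the class and member names are hypothetical and are not part of CLRProfiler.

```csharp
using System;

// Rough size model for one SampleObject on a 32-bit CLR, using the sizes quoted in the post.
// All names here are hypothetical, for illustration only.
static class SampleObjectSizeEstimate
{
    const int IntFieldCount = 3;            // typeIndex, changeTickIndex, origAllocTickIndex
    const int IntSize = 4;                  // bytes per int field
    const int ReferenceSize = 4;            // the prev reference in a 32-bit process
    const int MethodTablePointerSize = 4;   // per-object overhead
    const int SyncBlockSize = 4;            // per-object overhead

    // Bytes used by the four fields alone.
    public static int FieldBytes => IntFieldCount * IntSize + ReferenceSize;

    // Bytes used by one heap object, including the 8 bytes of overhead.
    public static int ObjectBytes => FieldBytes + MethodTablePointerSize + SyncBlockSize;

    // Total heap bytes for a given number of instances.
    public static long TotalBytes(long instanceCount) => instanceCount * ObjectBytes;
}
```

Calling SampleObjectSizeEstimate.TotalBytes(7870007) gives 188880168, which matches the TotalSize column reported by DumpHeap -stat for SampleObject.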




If we store SampleObjects in an array, we can convert the pointer to the previous sample object into an index into that array. We can then declare SampleObject as a structure and pack the instances together in one big array, removing the 8-byte per-object overhead. In most cases, typeIndex, changeTickIndex, and origAllocTickIndex are small integers that fit in 16 bits instead of 32. The last field, prev, which refers to the previous SampleObject, can be quite large depending on the program we’re profiling. But normally the current object and the previous object are not far apart, so the difference between their indexes can be stored as a 16-bit integer. To reduce the impact on other code that uses SampleObject, we need to provide a method that reconstructs a SampleObject given an index:

/// <summary>
/// Create SampleObject when given an index into storage
/// </summary>
internal SampleObject GetSampleObject(int index)
{
    // Locate the chunk that holds this entry and the entry's offset within it.
    UInt16[] chunk = m_sampleChunks[index / SampleObjectChunkSize] as UInt16[];
    int p = index % SampleObjectChunkSize;

    SampleObject obj;
    UInt16 w0 = chunk[p];

    if ((w0 & bit_small) != 0)
    {
        // Compact form: every field fits in one 16-bit word.
        if ((w0 & bit_noprev) != 0)
        {
            // No previous sample.
            obj = new SampleObject(w0 & 0x3FFF, chunk[p + 1], chunk[p + 2], 0);
        }
        else
        {
            // prev is stored as a 16-bit distance back from the current index.
            obj = new SampleObject(w0 & 0x3FFF, chunk[p + 1], chunk[p + 2], index - chunk[p + 3]);
        }
    }
    else
    {
        // Full form: every field takes two 16-bit words, high half first.
        obj = new SampleObject(
                    (((int) chunk[p    ]) << 16) + chunk[p + 1],
                    (((int) chunk[p + 2]) << 16) + chunk[p + 3],
                    (((int) chunk[p + 4]) << 16) + chunk[p + 5],
                    index - (((int) chunk[p + 6] << 16) + chunk[p + 7]));
    }

    return obj;
}
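
The post does not show the code that writes this packed format, so here is a minimal sketch of what the writer side might look like, paired with a reader that mirrors GetSampleObject so the two can be checked against each other. Everything in it is an assumption made for illustration: the names SampleObjectPacker, Append and Read, the flag values 0x8000 and 0x4000 (chosen only because the reader masks the first word with 0x3FFF), the three-word layout for entries without a previous sample, and the use of one flat List<ushort> instead of the chunked m_sampleChunks storage. Field values are assumed to be non-negative, and a prev of 0 is taken to mean "no previous sample".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical writer/reader pair for the packed layout; not CLRProfiler's actual code.
static class SampleObjectPacker
{
    const ushort BitSmall  = 0x8000; // assumed flag: entry uses the compact 16-bit layout
    const ushort BitNoPrev = 0x4000; // assumed flag: compact entry has no previous sample

    // Appends one sample to 'words' and returns the index at which it was stored.
    public static int Append(List<ushort> words, int typeIndex, int changeTickIndex,
                             int origAllocTickIndex, int prev)
    {
        int index = words.Count;
        int delta = index - prev;   // distance back to the previous sample

        bool small = typeIndex >= 0 && typeIndex <= 0x3FFF
                  && changeTickIndex >= 0 && changeTickIndex <= 0xFFFF
                  && origAllocTickIndex >= 0 && origAllocTickIndex <= 0xFFFF
                  && (prev == 0 || (delta >= 0 && delta <= 0xFFFF));

        if (small)
        {
            ushort flags = prev == 0 ? (ushort)(BitSmall | BitNoPrev) : BitSmall;
            words.Add((ushort)(flags | typeIndex));
            words.Add((ushort)changeTickIndex);
            words.Add((ushort)origAllocTickIndex);
            if (prev != 0)
            {
                words.Add((ushort)delta);
            }
        }
        else
        {
            // Full form: each value takes two words, so the entry is 8 words long.
            AddInt32(words, typeIndex);
            AddInt32(words, changeTickIndex);
            AddInt32(words, origAllocTickIndex);
            AddInt32(words, delta);
        }
        return index;
    }

    // Mirrors GetSampleObject, but returns the four values as a tuple.
    public static (int typeIndex, int changeTickIndex, int origAllocTickIndex, int prev)
        Read(List<ushort> words, int index)
    {
        ushort w0 = words[index];
        if ((w0 & BitSmall) != 0)
        {
            int prev = (w0 & BitNoPrev) != 0 ? 0 : index - words[index + 3];
            return (w0 & 0x3FFF, words[index + 1], words[index + 2], prev);
        }
        return (ReadInt32(words, index), ReadInt32(words, index + 2),
                ReadInt32(words, index + 4), index - ReadInt32(words, index + 6));
    }

    static void AddInt32(List<ushort> words, int value)
    {
        words.Add((ushort)(value >> 16));     // high half first, as the reader expects
        words.Add((ushort)(value & 0xFFFF));  // low half
    }

    static int ReadInt32(List<ushort> words, int p) => (words[p] << 16) + words[p + 1];
}
```

In this sketch an entry takes 3, 4, or 8 words depending on which form it needs, which is why the index is a word offset rather than an object count. A real implementation would also have to keep an entry from straddling a chunk boundary and keep index 0 from being used by a real sample, neither of which is handled here.
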


Here is what DumpHeap -stat shows after the change:

609bf9ac    11663      1242188 System.String
00584a80        4      1808060 CLRProfiler.TimePos[]
609c2938    40176      5176228 System.Int32[]
609aebd4      935     84447264 System.UInt16[]

The 188.8 MB of SampleObject instances is replaced by an 80.2 MB increase in UInt16[] objects (42% of the original size). The 7.87 million SampleObjects are now packed into 16-bit integer arrays. The saving will be larger on 64-bit machines, where both the per-object overhead and the prev reference are bigger.
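
The percentages follow directly from the two DumpHeap -stat listings. The sketch below redoes that arithmetic; it assumes, as the paragraph above does, that the whole growth in UInt16[] is due to the packed samples and that the sample count is the same in both runs. The class and member names are hypothetical.

```csharp
using System;

// Savings arithmetic based on the TotalSize values in the two DumpHeap -stat listings.
// Hypothetical helper, for illustration only.
static class PackingSavings
{
    const long SampleObjectBytesBefore = 188880168; // SampleObject total before the change
    const long UInt16ArrayBytesBefore  = 4194336;   // System.UInt16[] total before the change
    const long UInt16ArrayBytesAfter   = 84447264;  // System.UInt16[] total after the change
    const long SampleCount             = 7870007;   // assumed unchanged between the two runs

    // Bytes now spent on packed samples (growth of the UInt16[] total).
    public static long PackedBytes => UInt16ArrayBytesAfter - UInt16ArrayBytesBefore;

    // Packed size as a fraction of the original SampleObject size.
    public static double RemainingFraction => (double)PackedBytes / SampleObjectBytesBefore;

    // Average bytes per sample in the packed form.
    public static double BytesPerSample => (double)PackedBytes / SampleCount;
}
```

PackedBytes comes to 80252928 bytes, RemainingFraction to a little over 0.42, and BytesPerSample to roughly 10.2, compared with 24 bytes per object before the change.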

Comments (2)

  1. Mike says:

    This is nice, but I wonder how this affects the rest of the code that actually uses SampleObject. Obviously, reconstructing the object from the bits can be expensive.

  2. We just can't afford to have 7 million small objects in the managed heap, and it could be worse for more complicated cases, so we have to do something. Luckily, SampleObjects are only used in TimeLineView, nothing else.

    I read some of the code in TimeLineViewForm.cs. Part of it seems to run lots of queries (CPU intensive), and part of it seems to be graphics intensive. CPU-intensive operations could be optimized using multi-threading; graphics-intensive ones could be optimized by better clipping before drawing, checking for overlapping drawing, or implementing our own drawing code to render into an off-screen bitmap first (GDI+ operations are infamous for taking a global lock). So there are lots of ways to optimize TimeLineViewForm when there is a real need to do so.