Performance Quiz #6 — Optimizing the Managed Version

Well, I’ve been slacking off too long… sitting on my nice 124ms time and doing nothing.  It’s time I did something to the managed code.

You’ll recall I had some observations in my analysis of the managed code and basically I can summarize my thoughts as follows: a good parser touches each character exactly once.  Right now we’re doing the line breaking and the tokenization seperately but anyone who’s taken an introductory course in parsing knows that these sorts of things can be readily combined.  I still want to read the file in chunks so I’m going to read a nice big buffer and then walk the whole lines and parse as I go.  I can do it all with a nice little state machine.  This kills two birds with one stone:  First I don’t have to allocate temporary strings for the lines and secondly I don’t have to make (at least) two passes over each character.  (Excercise: what assumption am I making about line lengths to do this?)

My new code looks like this:

using System;
using System.IO;
using System.Text;
using System.Collections;

namespace NS
class Test
private static extern bool QueryPerformanceCounter(out long lpPerformanceCount);

private static extern bool QueryPerformanceFrequency(out long lpFrequency);

static void Main(string[] args)

long startTime, endTime, freq;

QueryPerformanceFrequency(out freq);
QueryPerformanceCounter(out startTime);

Dictionary dict = new Dictionary();

QueryPerformanceCounter(out endTime);

Console.WriteLine(“Length: {0}”, dict.Length());
Console.WriteLine(“frequency: {0:n0}”, freq);
Console.WriteLine(“time: {0:n5}s”, (endTime – startTime)/(double)freq);

class DictionaryEntry
private string trad;
private string pinyin;
public string english;

static public DictionaryEntry ParseBuffer(char[] line, int ibIn, int ibMax, out int ibOut)
DictionaryEntry de = new DictionaryEntry();

int state = 0;
char chTrans = ‘ ‘;
int start = ibIn;
int end = 0;
int i;

// there is still a newline or we would not be here
for (i = ibIn; line[i] != ‘\n’; i++)
if (line[i] != chTrans)

switch (state)
case 0:
de.trad = new String(line, start, i – start);
state = 1;
chTrans = ‘[‘;

case 1:
start = i+1;
chTrans = ‘]’;
state = 2;

case 2:
de.pinyin = new String(line, start, i – start);
chTrans = ‘/’;
state = 3;

case 3:
start = i+1;
state = 4;

case 4:
end = i;
state = 5;

case 5:
end = i;

ibOut = i;

if (state == 5)
de.english = new String(line, start, end – start);
return de;

return null;

class Dictionary
ArrayList dict;

public Dictionary()
StreamReader src = new StreamReader(“cedict.b5”, System.Text.Encoding.GetEncoding(950));
DictionaryEntry de;
dict = new ArrayList();

char[] buffer = new char[40960];

int cb = src.Read(buffer, 0, buffer.Length);

while (cb != 0)
int ibLastNewline = cb – 1;

while (buffer[ibLastNewline] != ‘\n’ && ibLastNewline > 0)

// no more newlines… we’re done
if (ibLastNewline == 0)

int ib = 0;
while (ib < ibLastNewline)
if (buffer[ib] == ‘#’)
// there must still be a newline or we wouldn’t be here
while (buffer[ib] != ‘\n’)
int ibOut;
de = DictionaryEntry.ParseBuffer(buffer, ib, ibLastNewline, out ibOut);
ib = ibOut;
if (de != null)


int ibCopy = ibLastNewline + 1;
ib = 0;
while (ibCopy < cb)
buffer[ib++] = buffer[ibCopy++];

int cbNew = src.Read(buffer, ib, buffer.Length – ib);
cb = ib + cbNew;

public int Length() { return dict.Count; }

And it makes a good bit of difference.  On my machine the internal time to process the dictionary goes from about 37ms in the initial version to about 27ms in the optimized version.  That’s 37% faster — not too shabby for a nice simple algorithmic change.  No scary Win32 APIs as well.   For giggles I tested a version that used unsafe code to avoid any bounds checking, I found that I should shave 1ms more off the 27ms if I used unsafe code.  I deem this not worth the grief and leave it as is.

And here’s where it stands using the end-to-end times that we’ve been comparing all along.

Version Execution Time
Unmanaged v1 1.328
Unmanaged v2 0.828
Unmanaged v3 0.343
Unmanaged v4 0.187
Unmanaged v5 With Bug 0.296
Unmanaged v5 Corrected 0.124
Unoptimized Managed port of v1    0.124
Optimized Managed port of v1 0.093

An interesting thing to note is that we actually dropped more like 30ms in total time even though we only removed 10ms from the internal run.  How did we do that?  Well for one thing I’m beginning to suspect that the timer resolution on that external timing program isn’t as high as the output it creates.  Maybe it’s only timing at the 15ms tick rate.  But I think also part of what’s going on is that we actually used fewer library services and so overall load time was less (by the way I pre-jitted the .exe file in both runs although it ends up making only a 2ms difference in this particular scenario — remember the libraries are always pre-jitted and this .exe is small) 

Let’s have a look at the breakdown:

Name  Exclusive
  __CorExeMain (from mscoree.dll) 0.00 91.09
    __CorExeMain (from mscorwks.dll) 0.00 86.14
      ExecuteEXE(struct HINSTANCE__ *) 0.00 74.26
        SystemDomain::ExecuteMainMethod(…) 0.00 74.26
          Assembly::ExecuteMainMethod(class PtrArray * *) 0.00 67.33
            .. 0.00 67.33
              NS.Test.Main(string[]) 0.00 67.33
                NS.Test.Dictionary..ctor() 0.00 65.35
                  NS.Test.DictionaryEntry.ParseBuffer(char[],int32,int32,int32&) 14.85 38.61
                    System.String.CtorCharArrayStartLength(char[],int32,int32) 2.97 20.79
                      .. 0.00 7.92
                        Alloc(unsigned long,int,int) 7.92 7.92
                      System.String.wstrcpy(char*,char*,int32) 7.92 7.92
                  System.IO.StreamReader.Read(char[],int32,int32) 0.00 18.81
                    System.IO.StreamReader.ReadBuffer(char[],int32,int32,bool&) 0.99 18.81
                      .. 0.00 17.82
                        System.Text.DBCSCodePageEncoding.GetChars(…) 17.82 17.82
                  System.Text.Encoding.GetEncoding(int32) 0.00 5.94
                    .. 0.00 5.94
                      Sys.Text.DBCSCodePageEncoding.LoadManagedCodePage() 1.98 5.94
      EnsureEEStarted(enum tagCOINITEE) 0.00 10.89

Things are looking a good bit nicer than in the last run.  Percentage wise we’re still spending close to the same fraction of the time reading buffers from the file (16% then vs. 18% now) but you can see that actual necessary work (ParseBuffer) is up to 38% of the time and of course the wasted work of breaking the line up and allocating (which was 28-15 = 13%) is gone.  Also, raw time copying strings is up as a percentage — also a good sign because that’s part of the necessary work we are to do here.  So overall the total time is down and the blend of tasks is showing that we’re spending more time making forward progress on the job at hand — building the dictionary. 

Best of all we didn’t have to make giant sacrifices in clarity or re-invent basic things like strings and file-reading.

As a footnote, “Hello World” in C# on that same machine clocks in about 63ms start to finish measured with the same tool (such as it is).   If you add 63ms for hello world (which starts the CLR and does some basic console writes) to the time to read the dictionary as reported (27ms)  you get 63+27 = 90ms which is pretty darn close to the measured time of the dictionary reader.

If you figure a few milliseconds of that is writing to the console (which I redirected to a file) you get a rough number of about 60ms to start the CLR on this particular machine.  That seems not unreasonable.  It’s also a lower bound on how fast I can make this program in managed code.

Comments (16)

  1. Dean Harding says:

    Well, you’re assuming the lines are shorter than the buffer. But given that your buffer is 40KB, that’s OK. Also, (I didn’t look too closely, so I might have missed something) you’re assuming the last character in the file is a new line.

  2. Dean Harding says:

    Oh yeah, I forgot to mention, the times are pretty impressive. It just goes to show that not having to support deterministic destruction gives you more than just ease-of-use!

    Of course, you’d probably pay for it when a GC kicks in, but given that this program is meant to interact with a user, a GC can happen when the user’s not doing anything, and they won’t even notice.

  3. Matt Ellis says:

    Rico, I’m loving this nice little duet you and Raymond have got going, and I was very suprised with the initial time from the unoptimised managed version – top marks!

    As Dean mentioned above – it would be very interesting to see what happens to the times when GC kicks in.

    But what I’m after is what version of the framework are you running on? And it might be interesting to see what the timings are for previous versions…

  4. ricom says:

    The initial version reads each line into a temporary string and the file is over 1M in size in Big5 format. I’d expect over 2M of temporary allocations plus almost as much in long term storage for the dictionary.

    In other words the GC is doing work already in all these cases. Plenty of pressure to induce collections.

    But I’ll post something more specific after I run it under CLRProfiler which I have yet to do. There’s some nice comments to be made on space… good topic for the finale.

    Thanks for reading :)

  5. Hugo Batista says:

    Hi there,

    I’m wondering what if you start to use beta 2, and use "dict" as a generic ? wouldn’t be interesting to test it ?

  6. Anthony J. Mills says:

    Well, for one thing, why keep the dictionary file the way it is … why not preprocess it? Make it a Unicode text file. Put each item on a separate line:

    (Chinese characters)

    no3 mo2 pi4

    We’re out of pie.

    (Chinese characters)


    Heck, Raymond could just replace newlines with ‘’ … then, it could all be done with a smooth reading it into memory and updating a table of pointers.

    Only one memory allocation event necessary.

    Hey, why replace newlines with ? Just put it in the file that way.

    Not much flexibility, but you know what? He was already going that way anyway.

  7. ricom says:

    Hugo Batista writes: I’m wondering what if you start to use beta 2, and use "dict" as a generic ? wouldn’t be interesting to test it ?

    I could try replacing the ArrayList with List<DictionaryEntry> but the array processing isn’t a signficant portion of the time in any of the runs anyway. This isn’t an especially good benchmark to highlight the differences between those two classes.

    Despite the general objections to "all those copies" (especially among Raymond’s readers) neither the ArrayList nor the STL vector were ever the most important issue.

    Raymond wisely targetted the biggest bottleneck at each stage and by the time he’s done the vector still didn’t need any attention. I likewise decline to target the ArrayList. Plus I like the fact that my program works on v1.1 :)

  8. ricom says:

    Time to eat my words. My collegue Alex says he ran v1 under CLR profiler and he reports the GC never had to kick in. Interesting, I thought there would be plenty of allocs to force at *least* one GC.

    I’m writing about space in this test case tomorrow to close things out so I’ll get into it more then AFTER I have measured.

    That’s what I get for opening my big mouth without having that measurement… bad Rico… bad bad bad…

  9. ricom says:

    I’m uneating my words :)

    I just did the measurements and as expected (at least on my machine) there were garbage collects during this run. Three of them in fact. Not sure why Alex didn’t see GC’s in his run but with this amount of allocation traffic it’s totally expected that we see a few.

    More details in today’s Space Analysis posting — the last in this series.

  10. So I was reading through one of my favorite MSDN blogs (

    And he…

  11. Hugo Batista says:

    Raymond and Rico are doing a kind of &quot;combat&quot; about unmanaged and managed code&amp;nbsp;. This is becoming…

  12. I’m just not an expert.