Performance Quiz #6 -- Optimizing the Managed Version

Well, I've been slacking off too long... sitting on my nice 124ms time and doing nothing.  It's time I did something to the managed code.

You'll recall I had some observations in my analysis of the managed code, and basically I can summarize my thoughts as follows: a good parser touches each character exactly once.  Right now we're doing the line breaking and the tokenization separately, but anyone who's taken an introductory course in parsing knows that these steps can readily be combined.  I still want to read the file in chunks, so I'm going to read a nice big buffer and then walk the complete lines in it, parsing as I go.  I can do it all with a nice little state machine.  This kills two birds with one stone: first, I don't have to allocate temporary strings for the lines, and second, I don't have to make (at least) two passes over each character.  (Exercise: what assumption am I making about line lengths to do this?)

My new code looks like this:

using System;
using System.IO;
using System.Text;
using System.Collections;

namespace NS
{
    class Test
    {
        [System.Runtime.InteropServices.DllImport("Kernel32.dll")]
        private static extern bool QueryPerformanceCounter(out long lpPerformanceCount);

        [System.Runtime.InteropServices.DllImport("Kernel32.dll")]
        private static extern bool QueryPerformanceFrequency(out long lpFrequency);

        static void Main(string[] args)
        {

            long startTime, endTime, freq;

            QueryPerformanceFrequency(out freq);
            QueryPerformanceCounter(out startTime);

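            // time just the dictionary load -- this is the "internal" time discussed below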
            Dictionary dict = new Dictionary();         

            QueryPerformanceCounter(out endTime);

            Console.WriteLine("Length: {0}", dict.Length());
            Console.WriteLine("frequency: {0:n0}", freq);
            Console.WriteLine("time: {0:n5}s", (endTime - startTime)/(double)freq);
        }

        class DictionaryEntry
        {
            private string trad;
            private string pinyin;
            public string english;

            // ibMax is currently unused -- the scan stops at the newline the caller guarantees
            public static DictionaryEntry ParseBuffer(char[] line, int ibIn, int ibMax, out int ibOut)
            {
                DictionaryEntry de = new DictionaryEntry();

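                // state machine over one cedict line of the form:
                //    traditional [pinyin] /english/
                // state 0: in the traditional field, which ends at ' '
                // state 1: skipping ahead to the '[' that opens the pinyin
                // state 2: in the pinyin field, which ends at ']'
                // state 3: skipping ahead to the '/' that opens the english
                // state 4: in the english field, waiting for a closing '/'
                // state 5: saw a closing '/'; keep extending to the last one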
                int state = 0;
                char chTrans = ' ';
                int start = ibIn;
                int end = 0;
                int i;

                // there is still a newline or we would not be here
                for (i = ibIn; line[i] != '\n'; i++)
                {
                   if (line[i] != chTrans)
                       continue;

                   switch (state)
                   {
                   case 0:
                      de.trad = new String(line, start, i - start);
                      state = 1;
                      chTrans = '[';
                      break;

                   case 1:
                      start = i+1;
                      chTrans = ']';
                      state = 2;
                      break;

                   case 2:
                      de.pinyin = new String(line, start, i - start);
                      chTrans = '/';
                      state = 3;
                      break;

                   case 3:
                      start = i+1;
                      state = 4;
                      break;

                   case 4:
                      end = i;
                      state = 5;
                      break;

                   case 5:
                      end = i;
                      break;
                   }
                }

                ibOut = i;

                if (state == 5)
                {
                   de.english = new String(line, start, end - start);
                   return de;
                }

                return null;
            }
        }

        class Dictionary
        {
            ArrayList dict;
            
            public Dictionary()
            {
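                // code page 950 is Big5, the traditional Chinese encoding of cedict.b5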
                StreamReader src = new StreamReader("cedict.b5", System.Text.Encoding.GetEncoding(950));
                DictionaryEntry de;
                dict = new ArrayList();

                char[] buffer = new char[40960];

                int cb = src.Read(buffer, 0, buffer.Length);

                while (cb != 0)
                {
                    int ibLastNewline = cb - 1;

                    while (buffer[ibLastNewline] != '\n' && ibLastNewline > 0)
                       ibLastNewline--;

                    // no more newlines... we're done
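                    // (assumes the file ends with a newline; a final partial line would be dropped)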
                    if (ibLastNewline == 0)
                       return;

                    int ib = 0;
                    while (ib < ibLastNewline)
                    {
                        if (buffer[ib] == '#')
                        {
                          // there must still be a newline or we wouldn't be here
                          while (buffer[ib] != '\n')
                            ib++;
                        }
                        else
                        {
                            int ibOut;
                            de = DictionaryEntry.ParseBuffer(buffer, ib, ibLastNewline, out ibOut);
                            ib = ibOut;
                            if (de != null)
                               dict.Add(de);
                        }

                        ib++;
                    }

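                    // move the partial line after the last newline to the front
                    // of the buffer, then refill the space behind it from the file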
                    int ibCopy = ibLastNewline + 1;
                    ib = 0;
                    while (ibCopy < cb)
                        buffer[ib++] = buffer[ibCopy++];

                    int cbNew = src.Read(buffer, ib, buffer.Length - ib);
                    cb = ib + cbNew;
                }
            }

            public int Length() { return dict.Count; }       
        }
    }
}

And it makes a good bit of difference.  On my machine the internal time to process the dictionary goes from about 37ms in the initial version to about 27ms in the optimized version.  That's roughly 37% faster -- not too shabby for a nice simple algorithmic change.  And no scary Win32 APIs either.  For giggles I also tested a version that used unsafe code to avoid the array bounds checking; it turned out that would shave only about 1ms more off the 27ms.  I deem that not worth the grief and leave the code as is.
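
For the curious, the unsafe experiment looked roughly like this (a sketch of the idea, not the exact code I ran): pinning the buffer with fixed lets the scan loop walk a raw char*, so the JIT emits no array bounds checks.  You have to compile with /unsafe to use it.

    // sketch only: scan ahead to the next newline without bounds checks
    // (FindNewline is a hypothetical helper, not part of the listing above)
    static unsafe int FindNewline(char[] buffer, int ib)
    {
        fixed (char* pBuffer = buffer)
        {
            char* p = pBuffer + ib;
            while (*p != '\n')     // caller guarantees a newline is present
                p++;
            return (int)(p - pBuffer);
        }
    }

Given that it buys only about 1ms here, the extra ceremony (and the /unsafe switch) isn't paying for itself.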

And here's where it stands using the end-to-end times that we've been comparing all along.

Version                           Execution Time (seconds)
Unmanaged v1                      1.328
Unmanaged v2                      0.828
Unmanaged v3                      0.343
Unmanaged v4                      0.187
Unmanaged v5 With Bug             0.296
Unmanaged v5 Corrected            0.124
Unoptimized Managed port of v1    0.124
Optimized Managed port of v1      0.093

An interesting thing to note is that we actually dropped more like 30ms in total time even though we only removed 10ms from the internal run.  How did we do that?  Well, for one thing, I'm beginning to suspect that the timer resolution of the external timing program isn't as high as the output it creates -- maybe it's only timing at the 15ms tick rate.  But I think part of what's going on is also that we used fewer library services, so the overall load time was lower.  (By the way, I pre-jitted the .exe file in both runs, although it ends up making only about a 2ms difference in this particular scenario -- remember the libraries are always pre-jitted and this .exe is small.)
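
Pre-jitting just means running ngen against the executable so the native code is generated ahead of time instead of at startup.  Assuming the program was built as test.exe (the name here is hypothetical), it's something like:

    ngen test.exe

(On later runtimes the syntax is ngen install test.exe; either way the result is cached native code that skips the JIT at load.)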

Let's have a look at the breakdown:

Name  ExclusivePercent  InclusivePercent
  __CorExeMain (from mscoree.dll) 0.00 91.09
    __CorExeMain (from mscorwks.dll) 0.00 86.14
      ExecuteEXE(struct HINSTANCE__ *) 0.00 74.26
        SystemDomain::ExecuteMainMethod(...) 0.00 74.26
          Assembly::ExecuteMainMethod(class PtrArray * *) 0.00 67.33
            .. 0.00 67.33
              NS.Test.Main(string[]) 0.00 67.33
                NS.Test.Dictionary..ctor() 0.00 65.35
                  NS.Test.DictionaryEntry.ParseBuffer(char[],int32,int32,int32&) 14.85 38.61
                    System.String.CtorCharArrayStartLength(char[],int32,int32) 2.97 20.79
                      .. 0.00 7.92
                        Alloc(unsigned long,int,int) 7.92 7.92
                      System.String.wstrcpy(char*,char*,int32) 7.92 7.92
                  System.IO.StreamReader.Read(char[],int32,int32) 0.00 18.81
                    System.IO.StreamReader.ReadBuffer(char[],int32,int32,bool&) 0.99 18.81
                      .. 0.00 17.82
                        System.Text.DBCSCodePageEncoding.GetChars(...) 17.82 17.82
                  System.Text.Encoding.GetEncoding(int32) 0.00 5.94
                    .. 0.00 5.94
                      System.Text.DBCSCodePageEncoding.LoadManagedCodePage() 1.98 5.94
      EnsureEEStarted(enum tagCOINITEE) 0.00 10.89

Things are looking a good bit nicer than in the last run.  Percentage-wise we're still spending close to the same fraction of the time reading buffers from the file (16% then vs. 18% now), but you can see that the actual necessary work (ParseBuffer) is up to 38% of the time, and of course the wasted work of breaking the lines up and allocating them (which was 28 - 15 = 13%) is gone.  Raw time copying strings is also up as a percentage -- another good sign, because that's part of the essential work this job has to do.  So overall the total time is down, and the blend of tasks shows we're spending more of our time making forward progress on the job at hand -- building the dictionary.

Best of all we didn't have to make giant sacrifices in clarity or re-invent basic things like strings and file-reading.

As a footnote, "Hello World" in C# on that same machine clocks in at about 63ms start to finish, measured with the same tool (such as it is).   If you add the 63ms for hello world (which starts the CLR and does some basic console writes) to the reported time to read the dictionary (27ms), you get 63 + 27 = 90ms, which is pretty darn close to the measured 93ms for the dictionary reader.

If you figure a few milliseconds of that is writing to the console (which I redirected to a file), you get a rough number of about 60ms to start the CLR on this particular machine.  That seems not unreasonable.  It's also a lower bound on how fast I can make this program in managed code.