Rambling … loops and multi-processing. When things literally slow down…

Continuing from Rambling … why is my dual processor colleague not twice as fast? I thought I should have a quick look at some basic looping and multi-processor support, released with the Parallel Extensions to the .NET Framework in .NET 3.5. Again I developed some simple and probably ‘silly’ code to experiment with loops and parallel extensions. The code simply increments a long variable many, many times and performs some meaningless maths, just to keep the processor occupied. What I am trying to figure is whether the parallel extensions bring any simple relief.

Snippet 1 – Simple Loop

C# Code MSIL
    1: private static void TestZero(bool log)
    2: {
    3:     DateTime stampStart = DateTime.Now;
    4:     ResetState(log);
    5:     for (int k = 0; k < 2; k++)
    6:     {
    7:         if (log)
    8:         {
    9:             Console.WriteLine("0 -> k:{0}", k);
   10:         }
   11:         for (int i = 0x5f5e100; i > 0; i--)
   12:         {
   13:             SillyProcessing(k);
   14:         }
   15:     }
   16:     if (log)
   17:     {
   18:         Log(stampStart, "TestZero");
   19:     }
   20: }
    1: .method private hidebysig static void TestZero(bool log) cil managed
    2: {
    3:     .maxstack 2
    4:     .locals init (
    5:         [0] valuetype [mscorlib]System.DateTime stampStart,
    6:         [1] int32 k,
    7:         [2] int32 i)
    8:     L_0000: call valuetype [mscorlib]System.DateTime [mscorlib]System.DateTime::get_Now()
    9:     L_0005: stloc.0 
   10:     L_0006: ldarg.0 
   11:     L_0007: call void Test.Program::ResetState(bool)
   12:     L_000c: ldc.i4.0 
   13:     L_000d: stloc.1 
   14:     L_000e: br.s L_003d
   15:     L_0010: ldarg.0 
   16:     L_0011: brfalse.s L_0023
   17:     L_0013: ldstr "0 -> k:{0}"
   18:     L_0018: ldloc.1 
   19:     L_0019: box int32
   20:     L_001e: call void [mscorlib]System.Console::WriteLine(string, object)
   21:     L_0023: ldc.i4 0x5f5e100
   22:     L_0028: stloc.2 
   23:     L_0029: br.s L_0035
   24:     L_002b: ldloc.1 
   25:     L_002c: call void Test.Program::SillyProcessing(int32)
   26:     L_0031: ldloc.2 
   27:     L_0032: ldc.i4.1 
   28:     L_0033: sub 
   29:     L_0034: stloc.2 
   30:     L_0035: ldloc.2 
   31:     L_0036: ldc.i4.0 
   32:     L_0037: bgt.s L_002b
   33:     L_0039: ldloc.1 
   34:     L_003a: ldc.i4.1 
   35:     L_003b: add 
   36:     L_003c: stloc.1 
   37:     L_003d: ldloc.1 
   38:     L_003e: ldc.i4.2 
   39:     L_003f: blt.s L_0010
   40:     L_0041: ldarg.0 
   41:     L_0042: brfalse.s L_004f
   42:     L_0044: ldloc.0 
   43:     L_0045: ldstr "TestZero"
   44:     L_004a: call void Test.Program::Log(valuetype [mscorlib]System.DateTime, string)
   45:     L_004f: ret 
   46: }
   47:  
   48:  

As shown, the first test, TestZero() simply loops and calls the SillyProcessing() 100,000,000 times, whereby SillyProcessing() does nothing more than:

  1. float calculation = (float)index * (float)index * (float)data[index] / (float)3.14; // wake up the maths part of the processor
  2. string result = (Math.Cos(calculation)).ToString(); // use some heavy string processing spices
  3. data[index] = data[index] + 1; // actual logic

… do not analyse the logic :)

Snippet 2 – Simple Loop, getting rid of some of the loops

C# Code
    1: private static void TestOne(bool log)
    2: {
    3:     DateTime stampStart = DateTime.Now;
    4:     ResetState(log);
    5:     for (int k = 0; k < 2; k++)
    6:     {
    7:         if (log)
    8:         {
    9:             Console.WriteLine("1 -> k:{0}", k);
   10:         }
   11:         for (int i = 0x5f5e100; i > 0; i -= 0x19)
   12:         {
   13:             SillyProcessing(k);
   14:             SillyProcessing(k);
   15:             SillyProcessing(k);
   16:             SillyProcessing(k);
   17:             SillyProcessing(k);
   18:             SillyProcessing(k);
   19:             SillyProcessing(k);
   20:             SillyProcessing(k);
   21:             SillyProcessing(k);
   22:             SillyProcessing(k);
   23:             SillyProcessing(k);
   24:             SillyProcessing(k);
   25:             SillyProcessing(k);
   26:             SillyProcessing(k);
   27:             SillyProcessing(k);
   28:             SillyProcessing(k);
   29:             SillyProcessing(k);
   30:             SillyProcessing(k);
   31:             SillyProcessing(k);
   32:             SillyProcessing(k);
   33:             SillyProcessing(k);
   34:             SillyProcessing(k);
   35:             SillyProcessing(k);
   36:             SillyProcessing(k);
   37:             SillyProcessing(k);
   38:         }
   39:     }
   40:     if (log)
   41:     {
   42:         Log(stampStart, "TestOne");
   43:     }
   44: }
   45:  
   46:  

The second test, TestOne(), calls SillyProcessing() 25 times and increments the loop counter by 25, instead of the 1:1 as in testZero,

Snippet 3 – Parallel Extensions one level

C# Code MSIL
    1: private static void TestTwo(bool log)
    2: {
    3:     DateTime stampStart = DateTime.Now;
    4:     ResetState(log);
    5:     Parallel.For(0, 2, 1, delegate (int k) {
    6:         if (log)
    7:         {
    8:             Console.WriteLine("2 -> k:{0}", k);
    9:         }
   10:         for (int i = 0x5f5e100; i > 0; i--)
   11:         {
   12:             SillyProcessing(k);
   13:         }
   14:     });
   15:     if (log)
   16:     {
   17:         Log(stampStart, "TestTwo");
   18:     }
   19: }
   20:  
   21:  
    1: .method private hidebysig static void TestTwo(bool log) cil managed
    2: {
    3:     .maxstack 6
    4:     .locals init (
    5:         [0] valuetype [mscorlib]System.DateTime stampStart,
    6:         [1] class Test.Program/<>c__DisplayClass1 CS$<>8__locals2)
    7:     L_0000: newobj instance void Test.Program/<>c__DisplayClass1::.ctor()
    8:     L_0005: stloc.1 
    9:     L_0006: ldloc.1 
   10:     L_0007: ldarg.0 
   11:     L_0008: stfld bool Test.Program/<>c__DisplayClass1::log
   12:     L_000d: call valuetype [mscorlib]System.DateTime [mscorlib]System.DateTime::get_Now()
   13:     L_0012: stloc.0 
   14:     L_0013: ldloc.1 
   15:     L_0014: ldfld bool Test.Program/<>c__DisplayClass1::log
   16:     L_0019: call void Test.Program::ResetState(bool)
   17:     L_001e: ldc.i4.0 
   18:     L_001f: ldc.i4.2 
   19:     L_0020: ldc.i4.1 
   20:     L_0021: ldloc.1 
   21:     L_0022: ldftn instance void Test.Program/<>c__DisplayClass1::<TestTwo>b__0(int32)
   22:     L_0028: newobj instance void [mscorlib]System.Action`1<int32>::.ctor(object, native int)
   23:     L_002d: call void [System.Threading]System.Threading.Parallel::For(int32, int32, int32, class [mscorlib]System.Action`1<int32>)
   24:     L_0032: ldloc.1 
   25:     L_0033: ldfld bool Test.Program/<>c__DisplayClass1::log
   26:     L_0038: brfalse.s L_0045
   27:     L_003a: ldloc.0 
   28:     L_003b: ldstr "TestTwo"
   29:     L_0040: call void Test.Program::Log(valuetype [mscorlib]System.DateTime, string)
   30:     L_0045: ret 
   31: }
   32:  
   33:  

With TestTwo() we implemented the parallel extensions on the outer loop.

Snippet 4 – Parallel Extensions two levels

C# Code MSIL
    1: private static void TestThree(bool log)
    2: {
    3:     DateTime stampStart = DateTime.Now;
    4:     ResetState(log);
    5:     Parallel.For(0, 2, 1, delegate (int k) {
    6:         if (log)
    7:         {
    8:             Console.WriteLine("3 -> k:{0}", k);
    9:         }
   10:         Parallel.For(0, 0x5f5e100, 1, delegate (int i) {
   11:             SillyProcessing(k);
   12:         });
   13:     });
   14:     if (log)
   15:     {
   16:         Log(stampStart, "TestThree");
   17:     }
   18: }
   19:  
   20:  
    1: .method private hidebysig static void TestThree(bool log) cil managed
    2: {
    3:     .maxstack 6
    4:     .locals init (
    5:         [0] valuetype [mscorlib]System.DateTime stampStart,
    6:         [1] class Test.Program/<>c__DisplayClass5 CS$<>8__locals6)
    7:     L_0000: newobj instance void Test.Program/<>c__DisplayClass5::.ctor()
    8:     L_0005: stloc.1 
    9:     L_0006: ldloc.1 
   10:     L_0007: ldarg.0 
   11:     L_0008: stfld bool Test.Program/<>c__DisplayClass5::log
   12:     L_000d: call valuetype [mscorlib]System.DateTime [mscorlib]System.DateTime::get_Now()
   13:     L_0012: stloc.0 
   14:     L_0013: ldloc.1 
   15:     L_0014: ldfld bool Test.Program/<>c__DisplayClass5::log
   16:     L_0019: call void Test.Program::ResetState(bool)
   17:     L_001e: ldc.i4.0 
   18:     L_001f: ldc.i4.2 
   19:     L_0020: ldc.i4.1 
   20:     L_0021: ldloc.1 
   21:     L_0022: ldftn instance void Test.Program/<>c__DisplayClass5::<TestThree>b__3(int32)
   22:     L_0028: newobj instance void [mscorlib]System.Action`1<int32>::.ctor(object, native int)
   23:     L_002d: call void [System.Threading]System.Threading.Parallel::For(int32, int32, int32, class [mscorlib]System.Action`1<int32>)
   24:     L_0032: ldloc.1 
   25:     L_0033: ldfld bool Test.Program/<>c__DisplayClass5::log
   26:     L_0038: brfalse.s L_0045
   27:     L_003a: ldloc.0 
   28:     L_003b: ldstr "TestThree"
   29:     L_0040: call void Test.Program::Log(valuetype [mscorlib]System.DateTime, string)
   30:     L_0045: ret 
   31: }
   32:  
   33:  

With the last test, TestThree(), we went one step further and implemented the parallel extensions on the inner loop. Normally not needed, but as part of the test I decided to go where no-one would normally venture :)

Pop-Quiz

“Pop-Quiz”, which test will be the quickest and which will be problematic? I was surprised …

Analysis

Let us first look at the processor view, using the good old trusted task manager:

image  
Figure 1 – Processor View (Left:Processor 1, Right:Processor 2)

Blue Male Doctor Holding a Clipboard Clipart Illustration With both TestZero() and TestOne() the total processor utilisation remains around 50% and the processing seems to plod along quite sequentially. With TestTwo() and TestThree() you notice an uprising of threads and the total processor utilisation practically maxes on both processors as expected.

But who is the quickest and who is the problematic case?

image  
Figure 2 – Results

TestZero() and TestOne() basically arrive at the finish line at the same time. I expected TestOne() to be quicker, but with all the sceptical re-runs it always proved to be slightly slower than TestZero(). I used the profiling tools to get a more detailed picture, whereby the results are also not making me feel 100% happy although TestOne() is faster than TestZero() with instrumented code. Using sampling we get a less accurate picture, but with less impact on the application behaviour.

image 

TestTwo() was faster as expected and TestThree() proved to be the champion in terms of speed. It is important to test your application before making extensive use of parallel processing, because if we Blue Man Bending Over to Inspect Something Through a Magnifying Glass Clipart Illustrationremove the first two statements of the SillyProcessing() method, the parallelism actually proves slower than the classic counterparts. Also TestOne() speeded up as expected, when the calculation statement and especially the heavy string statement were removed as shown:

image

What is also evident is that one of the tests has a problem … yes, TestThree() seems to suffer from lost increments. Before we start debugging and analysing the anomaly in great detail, we can emphasise that the code is not thread safe and that the data counters are simply overridden by the inner loops competing for the battle zone.

So what, if anything have we learnt in this chat?

Throwing processors and threads at a problem will not necessarily speed up your application. There are exceptional profiling tools in VSTS 2008 and even better ones in VSTS 2010, which will enable you to profile and analyse your application in a number of environments. Remember that too much of a good thing, i.e. too much parallelism or too much salt, can destroy an exceptional application or an exceptional meal respectively q;-)