Performance Quiz #14: Memory Locality etc. Bonus Round!

[Note: I accidently pasted an x64 result in place of an x86 result. As it happens the point I was trying to make was that they were very similar, which they are, but they aren't identical... corrected.]

Thanks to some careful readers I discovered why the shuffling factors discussed in my previous blog entry were so strange.  The rand() method I was using only returns numbers between 0 and MAX_RAND which is 32767 on my system.  That meant that shuffling became a non-factor as the number of elements in the array increased.

I've since switched the code to use this instead:

 #include <random> // at the top of the file

    void Shuffle(int m)
    {
        std::mt19937 gen(0);  // repeatable results desired
        std::uniform_int_distribution<> dist(0, _n - 1);

        for (int nn = 0; nn < m; nn++)
        {
            int i = dist(gen);
            int j = dist(gen);

            if (i == j) continue;

            int tmp = _perm[i];
            _perm[i] = _perm[j];
            _perm[j] = tmp;            
        }
    }

With this change shuffling becomes generally effective again and the result is that non-locality dominates the effects, which is really a lot more like what we would expect.  I thought that the effect might top off on very large array sizes but it does not.  The normal prefechting of in order memory continues to give us benefits even at very large array sizes.  Let's look at the architectural differences once again. 

 Pointer implementation with no changes
sizeof(int*)=4, sizeof(T)=12
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   1.99,   1.99,   1.99,   1.99,   1.99,   1.99
     2000,   1.99,   1.99,   1.99,   1.99,   1.99,   1.99
     4000,   1.99,   1.99,   2.20,   2.49,   2.92,   3.20
     8000,   1.85,   1.96,   2.38,   3.13,   3.88,   4.34
    16000,   1.88,   1.96,   2.56,   3.40,   4.59,   4.94
    32000,   1.88,   2.17,   3.52,   6.20,   8.16,  11.24
    64000,   1.88,   2.33,   4.45,   8.13,  11.87,  15.96
   128000,   2.06,   2.27,   5.15,   9.32,  14.81,  18.29
   256000,   1.90,   2.35,   5.90,  10.99,  16.26,  21.39
   512000,   1.92,   2.77,   6.42,  11.59,  17.55,  23.71
  1024000,   2.22,   3.42,   7.96,  14.58,  22.97,  30.00
  2048000,   2.54,   4.68,  17.47,  38.54,  62.76,  90.17
  4096000,   2.50,   5.70,  25.17,  53.21,  89.79, 121.06
  8192000,   2.50,   6.31,  30.36,  63.07, 106.04, 142.09
 16384000,   2.50,   6.68,  32.94,  68.79, 115.24, 156.78
 32768000,   2.55,   7.22,  34.64,  73.71, 123.98, 165.47
 65536000,   2.55,   7.96,  36.49,  73.39, 126.17, 168.88

 OK here we see just how profound the non-locality effects are.  The cost per item goes up to a whopping 168.88 at the bottom of the table for fully shuffled data.  And you can see that even a small amount (1%) of non-locality makes a crushing difference.  You can also clearly see that while the data fits none of this matters a whit.  But even at 4k elements we're already starting to see the effects.

Now let's compare the same program run on x64.

 Pointer implementation with no changes
sizeof(int*)=8, sizeof(T)=20
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   2.28,   1.99,   1.99,   1.99,   2.28,   1.99
     2000,   2.28,   2.42,   2.56,   2.84,   3.27,   3.27
     4000,   2.28,   5.12,   2.92,   3.84,   4.62,   5.05
     8000,   2.17,   2.28,   3.13,   4.34,   5.62,   5.90
    16000,   2.24,   2.49,   3.95,   7.02,   8.28,  10.97
    32000,   2.25,   2.54,   5.00,   8.98,  13.33,  16.75
    64000,   2.24,   2.91,   6.01,  10.22,  15.69,  19.93
   128000,   2.40,   2.93,   7.06,  11.51,  18.80,  22.25
   256000,   2.45,   2.96,   7.16,  13.97,  19.75,  24.89
   512000,   3.06,   3.49,   8.25,  19.20,  25.23,  28.36
  1024000,   3.59,   5.39,  16.21,  34.18,  60.90,  83.61
  2048000,   3.79,   7.18,  27.84,  58.19,  97.72, 130.46
  4096000,   3.79,   8.18,  34.17,  70.37, 118.09, 153.55
  8192000,   3.78,   8.72,  37.35,  76.35, 128.28, 166.00
 16384000,   3.76,   9.27,  39.17,  81.54, 137.48, 173.84
 32768000,   3.77,  10.06,  40.25,  83.91, 140.59, 178.86
 65536000,   3.74,  10.94,  43.60,  86.91, 142.89, 183.30

OK you can see the increase in size of the data structure significantly makes the unshuffled data worse, just like before.  And the size difference actually dilutes the non-locality somewhat.  It's merely 49 times worse on x64 rather than 66 times worse like on x86.  But it's still hella bad.   Note that a 66% data growth in this case is resulting in a 46% performance hit.  So actually x64 is doing worse overall not just because of raw data growth, the larger pointers are causing other inefficiencies.

Now I was going to skip the unaligned stuff thinking it would all be the same but I was wrong about that, these data end up being interesting too.  In the interest of saving space I'm just providing the first few rows of the table and the last few for these 4 experiments.

 Pointer implementation with bogus byte in it to force unalignment
sizeof(int*)=4, sizeof(T)=13
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   2.28,   1.99,   2.28,   2.28,   1.99,   2.28
     2000,   1.99,   2.13,   2.13,   2.13,   2.13,   2.13
     4000,   2.06,   2.13,   2.70,   3.06,   3.63,   4.05
 16384000,   2.75,   7.39,  35.18,  74.71, 124.48, 167.20
 32768000,   2.74,   7.83,  37.35,  77.54, 126.54, 170.70
 65536000,   2.69,   8.31,  37.16,  77.20, 129.02, 175.87

OK naturally x86 is somewhat worse with unaligned pointers, but not that much worse. Mostly we can attribute this to the fact that the data structure size is now one byte bigger.  The 2.69 in the leading column is likely a fluke, so I'm going to go with 2.74 as the final value for unshuffled data.  Note that the execution time degraded by about 7.4% and the data grew by about 8.3% so really this is mostly about data growth and nothing else.  The shuffled data grew 4%, so the caching effects seem to dilute the size growth somewhat.

 Pointer implementation with bogus byte in it to force unalignment
sizeof(int*)=8, sizeof(T)=21
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   2.28,   2.28,   2.28,   2.28,   2.28,   2.28
     2000,   2.42,   2.42,   2.84,   3.27,   3.70,   3.84
     4000,   2.42,   2.49,   3.34,   4.48,   5.48,   5.97
 16384000,   3.81,   9.58,  40.01,  83.92, 142.32, 181.21
 32768000,   3.96,  10.34,  41.85,  88.28, 144.39, 182.84
 65536000,   3.96,  11.37,  45.78,  92.34, 151.77, 192.66

Here's where things get interesting though.  On x64 the cost of unshuffled data grew to 3.96ns per item from 3.74.  That's a 5.8% growth.  And the shuffled data grew to 192 from 183, also about 5%.  That's pretty interesting because the data also grew 5%, exactly 5% as it turns out, from 20 to 21 bytes.

What happens if we grow it a little more to get perfect alignment?

 Pointer implementation with extra padding to ensure alignment
sizeof(int*)=4, sizeof(T)=16
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   1.99,   1.99,   1.99,   1.99,   1.99,   1.99
     2000,   1.99,   1.99,   2.13,   2.13,   2.28,   2.13
     4000,   2.06,   1.99,   2.56,   3.34,   3.98,   4.05
 16384000,   3.04,   7.77,  35.52,  74.26, 125.54, 163.54
 32768000,   3.06,   8.44,  36.86,  77.08, 129.97, 168.43
 65536000,   3.08,   9.26,  38.16,  78.69, 129.42, 171.52

Well on x86, growing it doesn't help at all.  The size hit brings us to 3.08 and 171.52 on the bottom rows, the first number is worse, increasing from 2.74 so padding certainly didn't help there.  The second number is somewhat better... ~171 vs. ~175 perhaps alignment is helping us to avoid cache splits but it's only a 2% win and we paid a lot of space to get it (16 bytes instead of 13 or 23% growth.  What about on x64?)

Here the situation is interesting.   

 Pointer implementation with extra padding to ensure alignment
sizeof(int*)=8, sizeof(T)=24
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   1.99,   1.99,   1.99,   2.28,   1.99,   1.99
     2000,   2.13,   2.28,   2.70,   2.99,   3.56,   3.70
     4000,   2.28,   2.28,   2.99,   3.91,   4.48,   4.98
 16384000,   4.30,   9.79,  38.55,  80.31, 133.95, 168.14
 32768000,   4.29,  10.81,  39.99,  83.10, 137.47, 168.60
 65536000,   4.29,  11.45,  44.03,  88.23, 143.56, 176.47

Look at that... things did get worse on unshuffled data due to the size growth, we're up to 4.29.  That's pretty much the result I expected, data size is king, especially in ordered data.  However, the last column is suprising.  With shuffled data we're now at 176.46.  Even though we grew the structure to 24 bytes it's actually running quite a lot faster than the 196 we got in the worst case.  Again I blame the fact that the random access is subject to cache splits and those are not present if the data is aligned.  In fact even a small amount of randomness is enough to make the aligned datastructures better.  At 10% shuffling we were already ahead despite the fact that the data is bigger.  So bottom line here, the unaligned pointer wasn't costing us much but creating cache splits definitely does, and we see those in the unshuffled data.  [Note: this is a bit of a stretch because I didn't actually gather stats on cache splits, I know that the odd size packed data must have them and they seem sufficient to explain the behavior, whilst the in order data will be largely immune, but that isn't full confirmation, so I could be wrong.)

Finally, looking at using indices instead of pointers.

 Standard index based implementation
sizeof(int*)=4, sizeof(T)=12
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   3.70,   3.41,   3.41,   3.41,   3.70,   3.41
     2000,   3.41,   3.56,   3.41,   3.41,   3.41,   3.41
     4000,   3.63,   3.41,   3.63,   3.98,   4.34,   4.62
     8000,   3.45,   3.56,   3.84,   4.62,   5.19,   5.69
    16000,   3.52,   3.56,   4.00,   4.85,   5.67,   6.33
    32000,   3.48,   3.64,   5.12,   7.24,   9.83,  12.07
    64000,   3.48,   3.76,   6.10,   9.52,  13.78,  17.54
   128000,   3.48,   3.87,   6.70,  10.74,  15.90,  20.23
   256000,   3.48,   3.95,   7.34,  11.96,  17.48,  22.48
   512000,   3.46,   4.01,   7.75,  13.03,  19.69,  25.45
  1024000,   3.70,   4.89,  10.27,  16.68,  25.41,  32.78
  2048000,   3.80,   6.03,  19.54,  39.37,  65.81,  89.51
  4096000,   3.80,   7.12,  27.58,  55.90,  94.37, 125.29
  8192000,   3.78,   7.72,  32.42,  65.97, 110.21, 146.96
 16384000,   3.79,   8.07,  35.15,  71.50, 119.32, 159.43
 32768000,   3.63,   8.19,  35.50,  72.99, 121.11, 161.69
 65536000,   3.78,   9.18,  37.63,  76.34, 123.47, 164.85

As before, the x86 code is slower... there is nothing but downsize there since the pointers were already 32 bits.

 Standard index based implementation
sizeof(int*)=8, sizeof(T)=12
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   3.41,   3.41,   3.41,   3.70,   3.41,   3.70
     2000,   3.41,   3.56,   3.41,   3.41,   3.41,   3.41
     4000,   3.56,   3.41,   3.63,   4.05,   4.34,   4.55
     8000,   3.41,   3.52,   3.88,   4.59,   5.19,   5.65
    16000,   3.43,   3.61,   4.00,   4.84,   5.76,   6.31
    32000,   3.47,   3.64,   5.08,   7.24,   9.78,  12.20
    64000,   3.48,   3.75,   6.11,   9.52,  13.84,  17.54
   128000,   3.49,   3.86,   6.71,  10.72,  15.86,  20.23
   256000,   3.48,   3.96,   7.34,  11.96,  17.83,  22.74
   512000,   3.45,   4.00,   7.75,  12.80,  19.16,  25.49
  1024000,   3.70,   5.27,  16.56,  16.10,  23.88,  32.46
  2048000,   3.80,   6.10,  19.48,  39.09,  66.05,  89.76
  4096000,   3.80,   6.94,  26.73,  54.67,  91.38, 125.91
  8192000,   3.79,   7.72,  32.45,  66.14, 110.40, 146.90
 16384000,   3.62,   7.98,  35.10,  71.72, 120.31, 159.13
 32768000,   3.77,   8.43,  36.51,  75.20, 124.58, 165.14
 65536000,   3.77,   9.15,  37.52,  76.90, 126.96, 168.40

And we see that the x64 run times very similar.  x64 provides no benefits, but we can use the data reduction provided by 32 bit pointers to save space and net a significant savings in execution time with even modestly (1%) shuffled data, an entirely normal situation.  And the space savings don't suck either.  The raw speed when comparing to fully ordered data is about a wash.

Just for grins I hacked up an implementation that uses 16 bit integers, here's what it looks like on x64, keeping in mind that you could only use this if your array sizes were known to be less than 64k (which they often are frankly). Or if you could flexibly swap it in.

 Standard  short index based implementation
sizeof(int*)=8, sizeof(T)=6
  shuffle,   0.00,   0.01,   0.10,   0.25,   0.50,   1.00
     1000,   3.98,   3.98,   3.98,   3.98,   3.98,   3.98
     2000,   3.84,   3.98,   3.84,   3.98,   3.84,   3.84
     4000,   3.91,   4.05,   3.91,   3.91,   3.91,   3.91
     8000,   3.95,   3.98,   4.12,   4.37,   4.69,   4.94
    16000,   4.00,   3.98,   4.34,   4.84,   5.49,   6.04
    32000,   3.95,   4.06,   4.46,   5.11,   6.11,   6.28
    64000,   3.93,   4.09,   5.55,   7.05,   9.77,  11.45

What's interesting here is that  even though it's smaller the short word fetching on the x64 seems inferior and so at 64k elements the unshuffled cost is higher.  But if the data varies then the space savings kicks in and you get a nice win.  Its sort of like moving yourself one notch down in the table at that point.