Performance Quiz #14: Memory Locality, x64 vs. x86, Alignment, and Density


Note: The mystery of the shuffling is solved: the rand() function I was using returns only numbers between 0 and 32k, so the shuffling was ineffective at large array sizes. I will post an update. Thank you Ryan!

See the new entry for the updated results.
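For the curious, the root cause is that RAND_MAX in the Microsoft C runtime is 32767, so an index computed from a single rand() call can never reach past the first 32k slots. Here is a minimal sketch of the kind of wider random index that fixes it; wide_rand is my own hypothetical helper, not necessarily what the updated code does:

    #include <cstdlib>

    // rand() here returns 0..32767 (RAND_MAX), so any index derived from a
    // single call stays within the first 32k slots of a large array.
    // Composing two calls yields 30 random bits, enough for these tests.
    static size_t wide_rand(size_t limit)
    {
        size_t r = ((size_t)rand() << 15) | (size_t)rand();
        return r % limit;
    }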

It's been a very long time since I did a performance quiz, and so it's only right that this one covers a lot of ground.  Before I take even one step forward I want you to know that I will be basing my conclusions on:

  1. A lot of personal experience
  2. A micro-benchmark that I made to illustrate it

Nobody should be confused: it would be possible to get other results, especially because this is a micro-benchmark.  However, these results do line up pretty nicely with my own experience, so I'm happy to report them.  Clearly the weight of the "other code" in your application would significantly change these results, and yet they illustrate some important points, as well as point out a mystery...  But I'm getting ahead of myself discussing the answers...  First, the questions:

 

Q1: Is x64 code really slower than x86 code if you compile basically the same program and don't change anything just like you said all those years ago? (wow, what a loaded question)

Q2: Does unaligned pointer access really make a lot of difference?

Q3: Is it important to have your data mostly sequential or not so much?

Q4: If x64 really is slower, how much of it relates to bigger pointers?

 

OK kiddies... my answers to these questions are below... but if you want to make your own guesses then stop reading now... and maybe write some code to try out a few theories.

 

 

 

 

 

Keep scrolling...

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Are you ready?

 

 

 

 

OK.

To answer these questions I wrote a benchmarking program (see link at the bottom of this posting) that creates a data structure and walks it.  The primary thing it does is allocate an array of ever-increasing size and then build a doubly-linked list in it.  Then it walks that list forwards, then backwards.  The time it takes to walk the list is what is measured, not the construction.  The times reported are divided by the number of items, so in each case you see the cost per item.  Each item is of course visited twice, so if you like, the numbers are scaled by a factor of two.  All numbers reported are in nanoseconds.
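In rough code, the test looks something like this.  This is a minimal sketch with my own naming, not the actual benchmark source (that's linked at the bottom):

    #include <chrono>
    #include <vector>

    struct T { T* next; T* prev; int payload; };  // the baseline node

    // Build a doubly-linked list inside one contiguous array, walk it
    // forward and then backward, and report nanoseconds per item.
    double walk_cost_ns(size_t count)
    {
        std::vector<T> items(count);
        for (size_t i = 0; i < count; i++) {
            items[i].next = (i + 1 < count) ? &items[i + 1] : nullptr;
            items[i].prev = (i > 0) ? &items[i - 1] : nullptr;
            items[i].payload = (int)i;
        }

        auto start = std::chrono::high_resolution_clock::now();
        int sum = 0;
        for (T* p = &items[0]; p != nullptr; p = p->next) sum += p->payload;          // forward
        for (T* p = &items[count - 1]; p != nullptr; p = p->prev) sum += p->payload;  // backward
        auto stop = std::chrono::high_resolution_clock::now();
        volatile int sink = sum; (void)sink;  // keep the walks from being optimized away

        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        return (double)ns / (double)count;  // cost per item; note each item is visited twice
    }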

To make things more interesting, I also shuffle the items in the list so that they are not in their original order.  This adds some randomness to the memory access order.  To shuffle the data I simply exchange two slots a certain percentage of times, starting from 0% and then growing quickly to 100%.  A 100% shuffle means the number of exchanges is equal to the number of items in the list; that's pretty thoroughly mixed.
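Continuing the sketch, the shuffle amounts to the following; it reuses the wide_rand helper from the note at the top, and after the swaps the next/prev links are rebuilt to follow the new order:

    #include <utility>
    #include <vector>

    // An n% shuffle performs count * n / 100 random pairwise exchanges,
    // so a 100% shuffle does as many exchanges as there are items.
    void shuffle_positions(std::vector<size_t>& pos, int percent)
    {
        size_t swaps = pos.size() * (size_t)percent / 100;
        for (size_t i = 0; i < swaps; i++) {
            size_t a = wide_rand(pos.size());
            size_t b = wide_rand(pos.size());
            std::swap(pos[a], pos[b]);
        }
        // ...then relink next/prev so that traversal follows pos[] order.
    }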

As another datapoint, I run the exact same code (always compiled for maximum speed) on x64 and then on x86.  It's the exact same machine, my own home desktop unit, which is a hexacore high-end workstation.

And then some additional test cases: I do all of that four different ways.  First with regular next and prev pointers and an int as the payload.  Then I add a bogus byte just to make the alignment totally horrible (by the way, it would be interesting to try this on an architecture where misalignment hurts more than it does on Intel, but I don't happen to have such a machine handy).  Then, to try to make things better, I add a little padding so that things still line up pretty well, and we see how that looks.  And finally I avoid all the pointer resizing by using fixed-size array indices instead of pointers, so that the structure stays the same size when recompiled.  The four layouts are sketched below.
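Reconstructing the four layouts from the sizeof(T) values reported with each table, they look roughly like this.  Note that the reported sizes only work out if tight packing is in force; that's my assumption, not something stated here:

    #pragma pack(push, 1)  // assumed: the reported sizes imply no compiler-inserted padding

    // 1. Baseline: 12 bytes on x86, 20 on x64.
    struct TBase   { TBase* next; TBase* prev; int payload; };

    // 2. Bogus byte to wreck alignment: 13 bytes on x86, 21 on x64.
    struct TBogus  { TBogus* next; TBogus* prev; int payload; char bogus; };

    // 3. Bogus byte plus padding to restore alignment: 16 bytes on x86, 24 on x64.
    struct TPadded { TPadded* next; TPadded* prev; int payload; char bogus; char pad[3]; };

    #pragma pack(pop)

    // 4. Index-based: 32-bit slot numbers instead of pointers, so the layout
    //    is the same 12 bytes whether compiled for x86 or x64.
    struct TIndex  { int next; int prev; int payload; };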

And without further ado, here are the results.  I've put some notes inline. 

 

Pointer implementation with no changes
sizeof(int*)=4  sizeof(T)=12
items     shuffle: 0%    1%    10%   25%   50%   100%
1000 1.99 1.99 1.99 1.99 1.99 1.99
2000 1.99 1.85 1.99 1.99 1.99 1.99
4000 1.99 2.28 2.77 2.92 3.06 3.34
8000 1.96 2.03 2.49 3.27 4.05 4.59
16000 1.97 2.04 2.67 3.57 4.57 5.16
32000 1.97 2.18 3.74 5.93 8.76 10.64
64000 1.99 2.24 3.99 5.99 6.78 7.35
128000 2.01 2.13 3.64 4.44 4.72 4.8
256000 1.98 2.27 3.14 3.35 3.3 3.31
512000 2.06 2.21 2.93 2.74 2.9 2.99
1024000 2.27 3.02 2.92 2.97 2.95 3.02
2048000 2.45 2.91 3 3.1 3.09 3.1
4096000 2.56 2.84 2.83 2.83 2.84 2.85
8192000 2.54 2.68 2.69 2.69 2.69 2.68
16384000 2.55 2.62 2.63 2.61 2.62 2.62
32768000 2.54 2.58 2.58 2.58 2.59 2.6
65536000 2.55 2.56 2.58 2.57 2.56 2.56
 
Average 2.20 2.38 2.86 3.27 3.62 3.86
Overall 3.03

 

This is the baseline measurement.  You can see the structure is a nice round 12 bytes and it will align well on x86.  Looking at the first column, with no shuffling, as expected things get worse and worse as the array gets bigger until finally the cache isn't helping much and you have about the worst you're going to get, which is about 2.55ns on average per item.

The results for shuffling are not exactly what I expected.  At small sizes it makes no difference.  I expected this because basically the entire table is staying hot in the cache, so locality isn't mattering.  Then as the table grows you see that shuffling has a big impact starting at about 32000 elements.  That's 384k of data (32000 items at 12 bytes each), so likely we've blown past a 256k cache limit.

Now the bizarre thing is this: after that point the cost of shuffling actually goes down, to where later on it hardly matters at all.  I can understand that at some point shuffled vs. not shuffled really should make no difference, because the array is so huge that runtime is largely gated by memory bandwidth regardless of order.  However... there are points in the middle where the cost of non-locality is actually much worse than it will be at the endgame.

What I expected to see was that shuffling caused us to reach maximum badness sooner and stay there.  What actually happens is that at middle sizes non-locality seems to cause things to go very, very bad...  And I do not know why :)  (As the note at the top explains, this turned out to be the rand() artifact: at large sizes the shuffle wasn't actually shuffling.)

But other than that one anomaly things are going pretty much as expected.

Now let's look at the exact same thing, only now on x64.

 

Pointer implementation with no changes

sizeof(int*)=8  sizeof(T)=20
items     shuffle: 0%    1%    10%   25%   50%   100%
1000 2.28 2.28 2.28 1.99 2.28 1.99
2000 2.28 2.28 2.56 2.99 3.13 3.27
4000 2.28 2.35 3.06 3.91 4.84 5.26
8000 2.28 2.38 3.27 4.48 5.9 6.15
16000 2.36 2.63 4.12 6.28 8.53 10.2
32000 2.36 2.68 5.3 9.24 13.4 16.76
64000 2.25 2.9 5.5 8.28 10.36 10.62
128000 2.42 2.92 4.86 6.31 6.49 6.34
256000 2.42 2.74 4.25 4.52 4.43 4.61
512000 2.75 3.86 4.31 4.42 4.56 4.48
1024000 3.56 4.82 5.42 5.42 5.28 5.21
2048000 3.72 4.36 4.64 4.64 4.66 4.67
4096000 3.79 4.23 4.2 4.23 4.2 4.23
8192000 3.77 3.99 3.98 4 3.99 3.99
16384000 3.75 3.88 3.87 3.87 3.89 3.89
32768000 3.78 3.86 3.83 3.8 3.81 3.83
65536000 3.74 3.8 3.79 3.81 3.83 3.82
 
Average 2.93 3.29 4.07 4.83 5.50 5.84
Overall 4.41 X64/X86 1.46

 

Well, would you look at that... the increased data size has caused us to go quite a bit slower.  The ratio of the overall averages shows that execution takes 1.46 times as long.  That result is only slightly larger than what I typically see when analyzing data processing in pointer-rich structures.

Note that it doesn't just get bad at the end; it's bad all along.  There are a few weird data points, but this isn't an absolutely controlled experiment.  For instance, the 1.99 result for 1000 items isn't really indicating that it was better with more shuffling: the execution times are so small that timer granularity is a factor, and I saw it switching between 1.99 and 2.28.  Things get a lot more stable as n increases.

Now let's look at what happens when the data is unaligned.

 

Pointer implementation with a bogus byte in it to force misalignment

sizeof(int*)=4  sizeof(T)=13
items     shuffle: 0%    1%    10%   25%   50%   100%
1000 1.99 1.99 1.99 1.99 2.28 1.99
2000 2.13 2.13 2.13 2.13 2.13 2.13
4000 2.13 2.13 2.49 3.06 3.7 3.91
8000 2.1 2.17 2.88 3.88 4.76 5.33
16000 2.1 2.2 3.08 4.21 5.4 6.17
32000 2.17 2.39 4.21 6.92 10.1 12.83
64000 2.16 2.46 4.5 6.74 8.18 8.62
128000 2.14 2.45 4.13 5.19 5.4 5.41
256000 2.14 2.41 3.61 3.78 3.77 3.77
512000 2.18 2.51 2.97 3.12 3.16 3.11
1024000 2.45 3.12 3.44 3.43 3.46 3.54
2048000 2.76 3.3 3.36 3.35 3.37 3.36
4096000 2.75 3.08 3.05 3.04 3.07 3.05
8192000 2.75 2.9 2.88 2.9 2.9 2.9
16384000 2.75 2.82 2.82 2.82 2.82 2.82
32768000 2.74 2.78 2.77 2.79 2.77 2.78
65536000 2.74 2.76 2.75 2.75 2.76 2.76
 
Average 2.36 2.56 3.12 3.65 4.12 4.38
Overall 3.37

 

This data does show that things got somewhat slower.  But the data size also grew by about 8% (13 bytes vs. 12).  In fact, if you look at the first column and compare the bottom rows, you'll find that amortized execution at the limit grew by 7.4% (2.74 vs. 2.55), basically the same as the data growth.  On the other hand, the changes due to shuffling were greater, so the overall index grew by about 11% (3.37 vs. 3.03).  But I think we can support the conclusion that most of the growth had to do with the fact that we read more memory, and only a small amount of it had to do with any extra instruction cost.

Is the picture different on x64?

 

Pointer implementation with a bogus byte in it to force misalignment

sizeof(int*)=8  sizeof(T)=21
items     shuffle: 0%    1%    10%   25%   50%   100%
1000 2.28 2.28 2.28 2.28 2.28 2.28
2000 2.42 2.42 2.84 3.27 3.7 3.84
4000 2.42 2.49 3.34 4.48 5.55 6.12
8000 2.56 2.52 3.7 5.23 6.4 7.15
16000 2.61 2.81 4.85 7.36 9.96 12.02
32000 2.53 2.86 5.8 10.18 15.25 18.65
64000 2.53 2.94 5.88 9.14 11.33 11.64
128000 2.53 2.94 5.41 7.11 7.09 7.09
256000 2.57 3.09 5.14 4.96 5.07 4.98
512000 3.21 3.58 5.29 5.05 5.14 5.03
1024000 3.74 5.03 5.94 5.79 5.75 5.94
2048000 4.01 4.84 4.96 4.93 4.92 4.96
4096000 4 4.47 4.49 4.46 4.46 4.46
8192000 3.99 4.21 4.21 4.21 4.06 4.21
16384000 3.97 4.08 4.08 4.07 4.08 4.08
32768000 3.96 4.02 4.02 4.03 4.03 4.03
65536000 3.96 3.99 4 3.99 4 3.99
 
Average 3.13 3.45 4.48 5.33 6.06 6.50
Overall 4.83 X64/X86 1.43

 

The overall ratio was 1.43 vs. the previous ratio of 1.46.  That means the extra byte did not disproportionately affect the x64 build either.  And in this case the pointers are really crazily misaligned.  The same shuffling weirdness happens as before.

Unaligned pointers don't seem to be costing us much.

What if we run another control: increasing the size but re-aligning the pointers?

 

Pointer implementation with extra padding to ensure alignment

 
sizeof(int*)=4  sizeof(T)=16
items     shuffle: 0%    1%    10%   25%   50%   100%
1000 1.99 1.99 1.99 1.71 1.99 1.99
2000 1.99 1.99 2.13 2.13 2.13 2.13
4000 2.28 1.99 2.49 3.34 3.7 4.05
8000 1.99 2.06 2.74 3.66 4.59 5.08
16000 2.04 2.26 3.16 4.18 5.32 6.06
32000 2.04 2.35 4.44 7.43 10.92 14.2
64000 2.04 2.38 4.6 7.03 8.74 9.11
128000 2.03 2.37 4.24 5.42 5.58 5.59
256000 2.05 2.36 3.66 3.84 3.83 4.07
512000 2.22 2.59 3.15 3.37 3.1 3.39
1024000 2.76 3.81 4.1 4.09 4.26 4.18
2048000 3.03 3.66 3.83 3.82 3.78 3.78
4096000 3.04 3.42 3.4 3.43 3.41 3.42
8192000 3.06 3.23 3.24 3.23 3.24 3.24
16384000 3.05 3.15 3.14 3.14 3.13 3.14
32768000 3.05 3.1 3.1 3.09 3.1 3.09
65536000 3.07 3.08 3.07 3.08 3.07 3.08
 
Average 2.45 2.69 3.32 3.88 4.35 4.68
Overall 3.56

 

Well, in this result we converge at about 3.07, where our original code was at 2.55.  Certainly re-aligning the pointers did not help the situation.  We're actually about 20% worse than the original number and 12% worse than the unaligned version.

And let's look at x64...

 

Pointer implementation with extra padding to ensure alignment

 
sizeof(int*)=8  sizeof(T)=24
items     shuffle: 0%    1%    10%   25%   50%   100%
1000 1.99 1.99 1.99 1.99 1.99 1.99
2000 2.13 2.28 2.7 2.99 3.7 3.7
4000 2.2 2.28 2.99 3.84 4.55 4.84
8000 2.42 2.38 3.34 4.37 4.98 5.37
16000 2.45 2.68 4.55 7.04 9.71 11.88
32000 2.46 2.8 5.43 9.25 13.48 17.16
64000 2.42 2.8 5.46 8.46 10.37 10.7
128000 2.4 2.8 5 6.43 6.55 6.56
256000 2.51 3.18 4.92 5.34 5 4.89
512000 3.9 4.7 5.97 6.5 5.63 5.59
1024000 4.15 5.24 6.34 6.28 6.24 6.33
2048000 4.32 5.13 5.28 5.33 5.34 5.27
4096000 4.32 4.78 4.77 4.81 4.78 4.79
8192000 4.29 4.55 4.55 4.56 4.55 4.54
16384000 4.28 4.42 4.42 4.43 4.42 4.42
32768000 4.3 4.36 4.37 4.37 4.38 4.37
65536000 4.23 4.38 4.35 4.34 4.34 4.33
 
Average 3.22 3.57 4.50 5.31 5.88 6.28
Overall 4.79 X64/X86 1.35

 

Now with the extra padding we have 8-byte-aligned pointers; that should be good, right?  Well, no, it's worse.  The top end is now about 4.3 nanoseconds per item, compared with about 4 nanoseconds before, or about 7.5% worse for having used more space.  We didn't pay the full 14% of the data growth, so there are some alignment savings, but not nearly enough to pay for the space.  This is pretty typical.

 

And last but not least, this final implementation uses indices for storage instead of pointers.  How does that fare?

 

Standard index based implementation

 

sizeof(int*)=4  sizeof(T)=12
items     shuffle: 0%    1%    10%   25%   50%   100%
1000 3.41 3.7 3.41 3.41 3.41 4.27
2000 3.41 3.56 3.41 3.41 3.41 3.98
4000 3.41 3.48 3.63 3.98 4.41 4.62
8000 3.41 3.59 3.88 4.62 5.23 5.76
16000 3.43 3.48 4.02 4.8 5.76 6.31
32000 3.5 3.64 5.1 7.2 9.8 11.99
64000 3.48 3.74 5.41 7.26 8.52 8.88
128000 3.49 3.72 5.1 5.98 6.17 6.18
256000 3.48 3.7 4.66 4.82 4.83 4.82
512000 3.52 3.72 4.13 4.24 4.14 4.3
1024000 3.57 4.25 4.6 4.59 4.46 4.43
2048000 3.79 4.23 4.37 4.35 4.36 4.34
4096000 3.77 4.05 4.06 4.06 4.06 4.07
8192000 3.77 3.91 3.93 3.92 3.91 3.93
16384000 3.78 3.84 3.83 3.83 3.84 3.84
32768000 3.78 3.8 3.8 3.8 3.8 3.79
65536000 3.77 3.78 3.78 3.78 3.8 3.78
 
Average 3.57 3.78 4.18 4.59 4.94 5.25
Overall 4.39

 

Well, clearly the overhead of computing base-plus-offset addresses is a dead loss on x86, because there are no space savings for those indices: they are the same size as a pointer, so messing with them is pure overhead.
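For comparison, here is what one hop looks like in each version.  This is a sketch using the TIndex layout from earlier, and the -1 end-of-list marker is my assumption:

    // Pointer walk: each hop is a plain dereference.
    int sum_by_pointer(const TBase* head)
    {
        int sum = 0;
        for (const TBase* p = head; p != nullptr; p = p->next)
            sum += p->payload;
        return sum;
    }

    // Index walk: each hop costs a base + i*sizeof(TIndex) address
    // computation before the load; on x86 that is pure extra work.
    int sum_by_index(const TIndex* base, int head)
    {
        int sum = 0;
        for (int i = head; i != -1; i = base[i].next)
            sum += base[i].payload;
        return sum;
    }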

However... let's look at this test case on x64...

 

Standard index based implementation

sizeof(int*)=8  sizeof(T)=12
items     shuffle: 0%    1%    10%   25%   50%   100%
1000 3.41 3.41 3.41 3.98 3.41 3.41
2000 3.41 3.41 3.7 3.41 3.41 3.41
4000 3.41 3.48 3.63 3.98 4.34 4.76
8000 3.45 3.45 3.84 4.48 5.33 5.69
16000 3.48 3.57 3.98 4.78 5.71 6.28
32000 3.48 3.64 5.11 7.16 9.69 11.99
64000 3.48 3.73 5.37 7.2 8.47 8.84
128000 3.48 3.72 5.1 5.96 6.25 6.14
256000 3.49 3.69 4.66 4.83 4.82 4.88
512000 3.52 3.72 4.22 4.22 4.22 4.24
1024000 3.59 4.01 4.31 4.53 4.45 4.4
2048000 3.8 4.27 4.33 4.25 4.35 4.38
4096000 3.8 3.97 4.06 4.06 4.07 4.06
8192000 3.79 3.92 3.92 3.93 3.93 3.91
16384000 3.77 3.84 3.83 3.82 3.85 3.85
32768000 3.76 3.81 3.81 3.8 3.8 3.81
65536000 3.76 3.78 3.78 3.79 3.78 3.78
 
Average 3.58 3.73 4.18 4.60 4.93 5.17
Overall 4.37 X64/X86 1.00

 

And now we reach our final conclusion... At 3.76 the top end comes in at a dead heat with the x86 implementation; the raw x64 penalty in this case is basically zip.  And this benchmark actually tops out at about the same cost per slot as the original pointer version while using quite a bit less space (12 bytes per item vs. 20, a 40% space savings).  Sadly, the index manipulation eats up a lot of that savings, so in the biggest cases we only come out about 6% ahead.

 

 

Now of course it's possible to create a benchmark that makes these numbers pretty much whatever you want them to be by simply manipulating how much pointer math there is, vs. how much reading, vs. how much "actual work".   

And of course I'm entirely discounting all the other benefits you get from running on x64; this is just a memory cost example, so take it all with a grain of salt.  If there's a lesson here, it's that you shouldn't assume things will be automatically faster with more bits and bigger registers, or even more registers.

The source code I used to create this output is available here.

 

* The "Average" statistic is the average of the column above it.

* The "Overall" statistic is the average of all the reported nanosecond numbers.

* The x64/x86 ratio is simply the ratio of the two "Overall" numbers.