Why is the speedup not X on my X-way machine? Or, why does my parallel code run slower?
Less-than-ideal speedup can typically be attributed to two things:
1. Sequential costs. It’s often the case that parts of a particular algorithm must be executed sequentially. As the parallelizable parts get faster (with the addition of more cores), the sequential parts become more significant (Amdahl’s Law).
2. General overheads. These include the costs of partitioning and merging data, scheduling tasks on the runtime, synchronization to support additional features, and delegate invocations.
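To put a number on the first point, Amdahl's Law bounds the speedup on N cores at 1 / (s + (1 - s) / N), where s is the fraction of the work that must run sequentially. The sketch below is illustrative only: it is written in Python rather than against any .NET API, the function name `amdahl_speedup` is made up for this example, and it ignores the general overheads from point 2, so real results will usually be lower.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales with cores.

    Hypothetical helper for illustration; it models sequential cost only,
    not partitioning, scheduling, or synchronization overhead.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time regardless of core count;
    # the parallel part is divided evenly across the cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for cores in (2, 4, 8, 16):
        print(cores, round(amdahl_speedup(0.9, cores), 2))
```

With 90% of the work parallelizable, 8 cores give a bound of roughly 4.7x rather than 8x, and no number of cores can push the bound past 10x.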
The former can sometimes be addressed by choosing a different algorithm that requires less serial execution. The latter can often be addressed by reducing communication between the tasks in a parallel operation or by making fine-grained parallelism more coarse-grained, so that each task does more work per unit of overhead. Microsoft is also constantly working to reduce the impact of both, where possible.
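As a rough illustration of moving from fine-grained to coarse-grained parallelism, the sketch below computes the same sum of squares twice: once with one task per element, and once with one task per chunk. It is written in Python for brevity, not against any .NET API, and the function names and the choice of one chunk per worker are assumptions made for this example. Because of Python's global interpreter lock, it shows the structure of the change rather than a measurable CPU speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_squares_fine(values, workers=4):
    """Fine-grained: one task per element, so per-task overhead is paid len(values) times."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda v: v * v, values))


def sum_squares_coarse(values, workers=4):
    """Coarse-grained: one task per chunk, so per-task overhead is paid about `workers` times."""
    # Ceiling division, so the list is split into at most `workers` chunks.
    chunk_size = max(1, -(-len(values) // workers))
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each task does a tight sequential loop over its own chunk;
        # the only merge step is adding up the per-chunk partial sums.
        return sum(pool.map(lambda chunk: sum(v * v for v in chunk), chunks))


if __name__ == "__main__":
    data = list(range(100))
    print(sum_squares_fine(data), sum_squares_coarse(data))
```

Both functions return the same result; the coarse version simply schedules far fewer tasks and merges far fewer partial results, which is the kind of overhead reduction described above.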