Optimize Your Code: Matrix Multiplication

Matrix multiplication is common and the algorithm is easy to implementation. Here is one example:

Version 1:

template<typename T>
void SeqMatrixMult1(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            result[i][j] = 0;
            for (int k = 0; k < size; k++) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        }
    }
}

This implementation is straight-forward and you can find it in text book and many online samples.

Version 2:

template<typename T>
void SeqMatrixMult2(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++) {
                c += m1[i][k] * m2[k][j];
            }
            result[i][j] = c;
        }
    }
}

This version will use a temporary to store the intermediate result. So we can save a lot of unnecessary memory write. Notice that the optimizer can not help here because it doesn't know whether "result" is an alias of "m1" or "m2".

Version 3:

template<typename T>
void Transpose(int size, T** m)
{
    for (int i = 0; i < size; i++) {
        for (int j = i + 1; j < size; j++) {
            std::swap(m[i][j], m[j][i]);
        }
    }
}
template<typename T>
void SeqMatrixMult3(int size, T** m1, T** m2, T** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++) {
                c += m1[i][k] * m2[j][k];
            }
            result[i][j] = c;
        }
    }
    Transpose(size, m2);
}

This optimization is tricky. If you profile the function, you'll find a lot of data cache miss. We transpose the matrix so that both m1[i] and m2[i] can be accessed sequentially. This can greatly improve the memory read performance.

Version 4:

template<typename T>
void SeqMatrixMult4(int size, T** m1, T** m2, T** result);
// assume size % 2 == 0
// assume m1[i] and m2[i] are 16-byte aligned
// require SSE3 (haddpd)
template<>
void SeqMatrixMult4(int size, double** m1, double** m2, double** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            __m128d c = _mm_setzero_pd();

            for (int k = 0; k < size; k += 2) {
                c = _mm_add_pd(c, _mm_mul_pd(_mm_load_pd(&m1[i][k]), _mm_load_pd(&m2[j][k])));
            }
            c = _mm_hadd_pd(c, c);
            _mm_store_sd(&result[i][j], c);
        }
    }
    Transpose(size, m2);
}
// assume size % 4 == 0
// assume m1[i] and m2[i] are 16-byte aligned
// require SSE3 (haddps)
template<>
void SeqMatrixMult4(int size, float** m1, float** m2, float** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            __m128 c = _mm_setzero_ps();

            for (int k = 0; k < size; k += 4) {
                c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(&m1[i][k]), _mm_load_ps(&m2[j][k])));
            }
            c = _mm_hadd_ps(c, c);
            c = _mm_hadd_ps(c, c);
            _mm_store_ss(&result[i][j], c);
        }
    }
    Transpose(size, m2);
}

For float types, we can use SIMD instruction set to parallel the data processing.

Parallel version using PPL (Parallel Patterns Library) and lambda in VC2010 CTP:

template<typename T>
void ParMatrixMult1(int size, T** m1, T** m2, T** result)
{
    using namespace Concurrency;
    for (int i = 0; i < size; i++) {
        parallel_for(0, size, 1, [&](int j) {
            result[i][j] = 0;
            for (int k = 0; k < size; k++) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        });
    }
}

Result

Here are the test results (what really matters is the relative time between different version):

Matrix size = 500 (Intel Core 2 Duo T7250, 2 cores, L2 cache 2MB)

  int long long float double
Version 1 0.931119s 2.945134s 0.774894s 0.984585s
Version 2 0.571003s 2.310568s 0.724161s 0.929064s
Version 3 0.239538s 0.823095s 0.570772s 0.241691s
Version 4 N/A N/A 0.063196s 0.187614s
Version 1 + PPL 0.847534s 1.683765s 0.589513s 0.994161s
Version 2 + PPL 0.380174s 1.190713s 0.409321s 0.594859s
Version 3 + PPL 0.135760s 0.495152s 0.370499s 0.185800s
Version 4 + PPL N/A N/A 0.041959s 0.157932s

Matrix size = 500 (Intel Xeon E5430, 4 cores, L2 cache 12MB)

  int long long float double
Version 1 0.514330s 1.434509s 0.455168s 0.608127s
Version 2 0.314554s 1.231696s 0.447607s 0.593517s
Version 3 0.180176s 0.591002s 0.432129s 0.149511s
Version 4 N/A N/A 0.042900s 0.083286s
Version 1 + PPL 0.308766s 0.482934s 0.175585s 0.309159s
Version 2 + PPL 0.105717s 0.325413s 0.124862s 0.164156s
Version 3 + PPL 0.073418s 0.193824s 0.116971s 0.061268s
Version 4 + PPL N/A N/A 0.017891s 0.031734s

From the results, you can find that:

  • Parallelism only helps if you carefully tune your code to maximize its effect (Version 1)
  • Eliminating unnecessary memory write (Version 2) helps the parallelism
  • Data cache miss can be a big issue when there are lots of memory access (Version 3)
  • Using SIMD instead of FPU on aligned data is beneficial (Version 4)
  • Different data types, data sizes and host architectures may have different kinds of bottlenecks