Optimize Your Code: Matrix Multiplication

Article
04/28/2009

Matrix multiplication is common and the algorithm is easy to implementation. Here is one example:

Version 1:

template<typename T>
void SeqMatrixMult1(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            result[i][j] = 0;
            for (int k = 0; k < size; k++) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        }
    }
}

This implementation is straight-forward and you can find it in text book and many online samples.

Version 2:

template<typename T>
void SeqMatrixMult2(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++) {
                c += m1[i][k] * m2[k][j];
            }
            result[i][j] = c;
        }
    }
}

This version will use a temporary to store the intermediate result. So we can save a lot of unnecessary memory write. Notice that the optimizer can not help here because it doesn't know whether "result" is an alias of "m1" or "m2".

Version 3:

template<typename T>
void Transpose(int size, T** m)
{
    for (int i = 0; i < size; i++) {
        for (int j = i + 1; j < size; j++) {
            std::swap(m[i][j], m[j][i]);
        }
    }
}
template<typename T>
void SeqMatrixMult3(int size, T** m1, T** m2, T** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++) {
                c += m1[i][k] * m2[j][k];
            }
            result[i][j] = c;
        }
    }
    Transpose(size, m2);
}

This optimization is tricky. If you profile the function, you'll find a lot of data cache miss. We transpose the matrix so that both m1[i] and m2[i] can be accessed sequentially. This can greatly improve the memory read performance.

Version 4:

template<typename T>
void SeqMatrixMult4(int size, T** m1, T** m2, T** result);
// assume size % 2 == 0
// assume m1[i] and m2[i] are 16-byte aligned
// require SSE3 (haddpd)
template<>
void SeqMatrixMult4(int size, double** m1, double** m2, double** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            __m128d c = _mm_setzero_pd();

            for (int k = 0; k < size; k += 2) {
                c = _mm_add_pd(c, _mm_mul_pd(_mm_load_pd(&m1[i][k]), _mm_load_pd(&m2[j][k])));
            }
            c = _mm_hadd_pd(c, c);
            _mm_store_sd(&result[i][j], c);
        }
    }
    Transpose(size, m2);
}
// assume size % 4 == 0
// assume m1[i] and m2[i] are 16-byte aligned
// require SSE3 (haddps)
template<>
void SeqMatrixMult4(int size, float** m1, float** m2, float** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            __m128 c = _mm_setzero_ps();

            for (int k = 0; k < size; k += 4) {
                c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(&m1[i][k]), _mm_load_ps(&m2[j][k])));
            }
            c = _mm_hadd_ps(c, c);
            c = _mm_hadd_ps(c, c);
            _mm_store_ss(&result[i][j], c);
        }
    }
    Transpose(size, m2);
}

For float types, we can use SIMD instruction set to parallel the data processing.

Parallel version using PPL (Parallel Patterns Library) and lambda in VC2010 CTP:

template<typename T>
void ParMatrixMult1(int size, T** m1, T** m2, T** result)
{
    using namespace Concurrency;
    for (int i = 0; i < size; i++) {
        parallel_for(0, size, 1, [&](int j) {
            result[i][j] = 0;
            for (int k = 0; k < size; k++) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        });
    }
}

Result

Here are the test results (what really matters is the relative time between different version):

Matrix size = 500 (Intel Core 2 Duo T7250, 2 cores, L2 cache 2MB)

	int	long long	float	double
Version 1	0.931119s	2.945134s	0.774894s	0.984585s
Version 2	0.571003s	2.310568s	0.724161s	0.929064s
Version 3	0.239538s	0.823095s	0.570772s	0.241691s
Version 4	N/A	N/A	0.063196s	0.187614s
Version 1 + PPL	0.847534s	1.683765s	0.589513s	0.994161s
Version 2 + PPL	0.380174s	1.190713s	0.409321s	0.594859s
Version 3 + PPL	0.135760s	0.495152s	0.370499s	0.185800s
Version 4 + PPL	N/A	N/A	0.041959s	0.157932s

Matrix size = 500 (Intel Xeon E5430, 4 cores, L2 cache 12MB)

	int	long long	float	double
Version 1	0.514330s	1.434509s	0.455168s	0.608127s
Version 2	0.314554s	1.231696s	0.447607s	0.593517s
Version 3	0.180176s	0.591002s	0.432129s	0.149511s
Version 4	N/A	N/A	0.042900s	0.083286s
Version 1 + PPL	0.308766s	0.482934s	0.175585s	0.309159s
Version 2 + PPL	0.105717s	0.325413s	0.124862s	0.164156s
Version 3 + PPL	0.073418s	0.193824s	0.116971s	0.061268s
Version 4 + PPL	N/A	N/A	0.017891s	0.031734s

From the results, you can find that:

Parallelism only helps if you carefully tune your code to maximize its effect (Version 1)
Eliminating unnecessary memory write (Version 2) helps the parallelism
Data cache miss can be a big issue when there are lots of memory access (Version 3)
Using SIMD instead of FPU on aligned data is beneficial (Version 4)
Different data types, data sizes and host architectures may have different kinds of bottlenecks

Optimize Your Code: Matrix Multiplication

Additional resources