# Optimize Your Code: Matrix Multiplication

Matrix multiplication is common, and the algorithm is easy to implement. Here is one example:

Version 1:

template&lt;typename T&gt;
void SeqMatrixMult1(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            result[i][j] = 0;
            for (int k = 0; k < size; k++) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        }
    }
}

This implementation is straightforward; you can find it in textbooks and in many online samples.

Version 2:

template&lt;typename T&gt;
void SeqMatrixMult2(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++) {
                c += m1[i][k] * m2[k][j];
            }
            result[i][j] = c;
        }
    }
}

This version uses a temporary to store the intermediate result, which saves a lot of unnecessary memory writes. Notice that the optimizer cannot do this for us, because it doesn't know whether "result" is an alias of "m1" or "m2".
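When you can guarantee that the output never aliases the inputs, you can also tell the compiler so. A minimal sketch, assuming the non-standard `__restrict` keyword supported by MSVC, GCC and Clang (the function name is made up for this example):

```cpp
#include <cassert>

// Hypothetical variant of Version 2: __restrict promises the compiler that
// "result" does not alias m1 or m2, so it is free to keep the running sum
// in a register even without the manual temporary.
template<typename T>
void SeqMatrixMult2R(int size, T** m1, T** m2, T** __restrict result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++) {
                c += m1[i][k] * m2[k][j];
            }
            result[i][j] = c;
        }
    }
}
```

This only annotates the outer pointer; a full no-alias guarantee would annotate the row pointers as well, which the `T**` layout makes awkward.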

Version 3:

template&lt;typename T&gt;
void Transpose(int size, T** m)
{
    for (int i = 0; i < size; i++) {
        for (int j = i + 1; j < size; j++) {
            std::swap(m[i][j], m[j][i]);
        }
    }
}

template&lt;typename T&gt;
void SeqMatrixMult3(int size, T** m1, T** m2, T** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            T c = 0;
            for (int k = 0; k < size; k++) {
                c += m1[i][k] * m2[j][k];
            }
            result[i][j] = c;
        }
    }
    Transpose(size, m2);
}

This optimization is tricky. If you profile the function, you'll find a lot of data cache misses. We transpose the matrix so that both m1[i] and m2[j] are accessed sequentially in the inner loop, which greatly improves memory read performance.
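If mutating m2 is undesirable, the same sequential-access pattern can also be obtained by reordering the loops instead of transposing. A sketch (the name `SeqMatrixMult5` is made up for this example):

```cpp
// Sketch: loop interchange into i-k-j order. The innermost loop walks
// result[i] and m2[k] with stride 1, so every read and write is
// sequential, and m2 is left untouched.
template<typename T>
void SeqMatrixMult5(int size, T** m1, T** m2, T** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            result[i][j] = 0;
        }
        for (int k = 0; k < size; k++) {
            T a = m1[i][k];  // invariant across the inner loop
            for (int j = 0; j < size; j++) {
                result[i][j] += a * m2[k][j];
            }
        }
    }
}
```

The trade-off is that result[i] is now written size times instead of once, so which variant wins depends on cache size and matrix size.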

Version 4:

// requires &lt;pmmintrin.h&gt; (SSE2/SSE3 intrinsics)
template&lt;typename T&gt;
void SeqMatrixMult4(int size, T** m1, T** m2, T** result);

// assume size % 2 == 0
// assume m1[i] and m2[i] are 16-byte aligned
template&lt;&gt;
void SeqMatrixMult4(int size, double** m1, double** m2, double** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            __m128d c = _mm_setzero_pd();
            for (int k = 0; k < size; k += 2) {
                // multiply-accumulate two doubles at a time
                c = _mm_add_pd(c, _mm_mul_pd(_mm_load_pd(&m1[i][k]),
                                             _mm_load_pd(&m2[j][k])));
            }
            c = _mm_hadd_pd(c, c); // horizontal sum of the two lanes
            _mm_store_sd(&result[i][j], c);
        }
    }
    Transpose(size, m2);
}

// assume size % 4 == 0
// assume m1[i] and m2[i] are 16-byte aligned
template&lt;&gt;
void SeqMatrixMult4(int size, float** m1, float** m2, float** result)
{
    Transpose(size, m2);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            __m128 c = _mm_setzero_ps();
            for (int k = 0; k < size; k += 4) {
                // multiply-accumulate four floats at a time
                c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(&m1[i][k]),
                                             _mm_load_ps(&m2[j][k])));
            }
            c = _mm_hadd_ps(c, c); // two horizontal adds leave the
            c = _mm_hadd_ps(c, c); // full sum in every lane
            _mm_store_ss(&result[i][j], c);
        }
    }
    Transpose(size, m2);
}

For floating-point types, we can use the SIMD instruction set to process the data in parallel.

Parallel version using PPL (Parallel Patterns Library) and lambdas in the VC2010 CTP:

template&lt;typename T&gt;
void ParMatrixMult1(int size, T** m1, T** m2, T** result)
{
    using namespace Concurrency;
    for (int i = 0; i < size; i++) {
        parallel_for(0, size, 1, [&](int j) {
            result[i][j] = 0;
            for (int k = 0; k < size; k++) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        });
    }
}

## Results

Here are the test results (what really matters is the relative time between the different versions):

Matrix size = 500 (Intel Core 2 Duo T7250, 2 cores, L2 cache 2MB)

|                 | int       | long long | float     | double    |
|-----------------|-----------|-----------|-----------|-----------|
| Version 1       | 0.931119s | 2.945134s | 0.774894s | 0.984585s |
| Version 2       | 0.571003s | 2.310568s | 0.724161s | 0.929064s |
| Version 3       | 0.239538s | 0.823095s | 0.570772s | 0.241691s |
| Version 4       | N/A       | N/A       | 0.063196s | 0.187614s |
| Version 1 + PPL | 0.847534s | 1.683765s | 0.589513s | 0.994161s |
| Version 2 + PPL | 0.380174s | 1.190713s | 0.409321s | 0.594859s |
| Version 3 + PPL | 0.135760s | 0.495152s | 0.370499s | 0.185800s |
| Version 4 + PPL | N/A       | N/A       | 0.041959s | 0.157932s |

Matrix size = 500 (Intel Xeon E5430, 4 cores, L2 cache 12MB)

|                 | int       | long long | float     | double    |
|-----------------|-----------|-----------|-----------|-----------|
| Version 1       | 0.514330s | 1.434509s | 0.455168s | 0.608127s |
| Version 2       | 0.314554s | 1.231696s | 0.447607s | 0.593517s |
| Version 3       | 0.180176s | 0.591002s | 0.432129s | 0.149511s |
| Version 4       | N/A       | N/A       | 0.042900s | 0.083286s |
| Version 1 + PPL | 0.308766s | 0.482934s | 0.175585s | 0.309159s |
| Version 2 + PPL | 0.105717s | 0.325413s | 0.124862s | 0.164156s |
| Version 3 + PPL | 0.073418s | 0.193824s | 0.116971s | 0.061268s |
| Version 4 + PPL | N/A       | N/A       | 0.017891s | 0.031734s |

From the results, you can see that:

• Parallelism only helps if you carefully tune your code to maximize its effect (Version 1)

• Eliminating unnecessary memory writes (Version 2) helps parallelism

• Data cache misses can be a big issue when there are lots of memory accesses (Version 3)

• Using SIMD instead of the FPU on aligned data is beneficial (Version 4)

• Different data types, data sizes and host architectures may have different kinds of bottlenecks


1. agangz@gmail.com says:

For the matrix expressions A*A^T + B*B^T and B*A^T - A*B^T, is there any optimization method?

Here, A and B are m*n matrices, and ^T means matrix transpose.

Thanks.

2. Xiang Fan says:

The result of A * Tran(A) is symmetric, so you save about 50% of the computation.
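Exploiting that symmetry might look like this (a sketch; the flat row-major storage and the helper name are made up for this example):

```cpp
#include <cassert>

// Sketch: C = A * Tran(A) for an m x n matrix A (row-major, flat array).
// C is symmetric, so only the upper triangle is computed and the lower
// triangle is mirrored, roughly halving the multiply-add work.
void MulByOwnTranspose(const double* a, int m, int n, double* c)
{
    for (int i = 0; i < m; i++) {
        for (int j = i; j < m; j++) {
            double s = 0;
            for (int k = 0; k < n; k++) {
                s += a[i * n + k] * a[j * n + k];
            }
            c[i * m + j] = s;
            c[j * m + i] = s;  // mirror into the lower triangle
        }
    }
}
```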

Similarly, B * Tran(A) – A * Tran(B) = B * Tran(A) – Tran(B * Tran(A)), so only one matrix multiplication is needed.
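The second identity is easy to check numerically; a small sketch with 2x2 matrices stored as flat row-major arrays (the helper name is made up for this example):

```cpp
#include <cassert>

// Sketch: out = x * Tran(y) for 2x2 row-major matrices, i.e.
// out[i][j] is the dot product of row i of x with row j of y.
void MulTransposed(const double* x, const double* y, double* out)
{
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            out[i * 2 + j] = x[i * 2 + 0] * y[j * 2 + 0]
                           + x[i * 2 + 1] * y[j * 2 + 1];
}
```

With M = B * Tran(A), the check asserts that B * Tran(A) - A * Tran(B) equals M - Tran(M) elementwise, because A * Tran(B) is exactly Tran(B * Tran(A)).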

3. zenanMeng says:

nice article, helped me a lot, thank you

4. Rafael Setragni says:

I did it in a different way, using OO, pointers and increment operations, as follows:

// Optimized matrix multiplication R = A * B
const Matrix Matrix::operator*(const Matrix &other) const
{
    assert(this->colNum == other.rowNum);
    // initialize an empty matrix to return
    Matrix result(this->rowNum, other.colNum, 0.0);
    // initialize our flags
    int positionR = 0, positionX = -1, positionA = -1, positionB = -result.colNum;
    int rowsCount = 0, colsCount = 0;
    // result elements times iterations needed
    int limit = result.rowNum * result.colNum * this->colNum;
    for (int step = 0; step < limit; ++step)
    {
        // iteration counter
        positionX++;
        // if we reached a complete iteration
        if (positionX == this->colNum)
        {
            positionX = 0;
            positionR++; colsCount++;
            // avoid using the modulo operator
            if (colsCount == result.colNum)
            {
                colsCount = 0;
                rowsCount += result.colNum;
            }
            // redefine start points
            positionB = colsCount;
            positionA = rowsCount;
        }
        else
        {
            // move to the next positions
            positionA++;
            positionB += result.colNum;
        }
        // access the arrays in the order closest to how the machine runs
        result.matrixArray[positionR] += this->matrixArray[positionA]
                                       * other.matrixArray[positionB];
    }
    /*
    // TODO: optimize matrix multiplication
    for (int positionY = 0; positionY < result.rowNum; ++positionY)
        for (int positionX = 0; positionX < this->colNum; ++positionX)
            for (int positionZ = 0; positionZ < other.rowNum; ++positionZ)
                result.element(positionY, positionX) += this->element(positionY, positionZ)
                                                      * other.element(positionZ, positionX);
    */
    return result;
}

Maybe it could be better, without those negative initializations. But it's enough for me. =D

5. Rafael Setragni says:

I was wrong. The method above wasn't enough for me.

Here is my fastest matrix multiplication method; both versions use the same logic:

// simple to understand, but slower
// Optimized matrix multiplication (understandable version) R = A * B
const Matrix Matrix::operator*(const Matrix &other) const
{
    assert(this->colNum == other.rowNum);
    // initialize an empty matrix to return
    Matrix result(this->rowNum, other.colNum, 0.0);
    // initialize our flags
    int positionR = 0, positionA = 0, positionB = 0;
    int limit = other.rowNum, colPos = 0;
    // use a more efficient loop
loopMultiplication:
    // access the arrays in the order closest to how the machine runs
    result.matrixArray[positionR] += this->matrixArray[positionA]
                                   * other.matrixArray[positionB];
    // move to the next positions
    positionA++;
    positionB += result.colNum;
    // if we reached a complete iteration
    if (positionA == limit)
    {
        // move our positions
        colPos++; positionR++;
        // if every result position is done, there is nothing else to do
        if (positionR == result.matrixSize)
            return result;
        // if we reached a new line, reset our flags
        // (avoids the modulo operator)
        if (colPos == result.colNum)
        {
            colPos = 0;
            positionB = 0;
            limit += other.rowNum;
        }
        else
            // else just move the first flag back
            positionA -= other.rowNum;
        // B always follows
        positionB = colPos;
    }
    goto loopMultiplication;
}

6. Rafael Setragni says:

// complex to understand, but really faster
// Optimized matrix multiplication R = A * B
const Matrix Matrix::operator*(const Matrix &other) const
{
    assert(this->colNum == other.rowNum);
    // initialize an empty matrix to return
    Matrix result(this->rowNum, other.colNum, 0.0);
    // initialize our flags
    double *pointerR = result.matrixArray;
    double *pointerA = this->matrixArray;
    double *pointerB = other.matrixArray;
    int decrementB = (other.matrixSize - 1);
    int rowSize = other.rowNum;
    int colSize = this->colNum;
    double *limitE = pointerR + result.matrixSize;
    double *limitR = pointerR + rowSize;
    double *limitA = pointerA + rowSize;
    // use a more efficient loop
loopMultiplication:
    // access the arrays in the order closest to how the machine runs
    *pointerR += *pointerA * *pointerB;
    // move to the next positions
    pointerA++;
    pointerB += colSize;
    // if we reached a complete iteration
    if (pointerA == limitA)
    {
        // move our R position
        pointerR++;
        // if every result position is done, there is nothing else to do
        if (pointerR == limitE)
            return result;
        // if we reached a new line, reset our flags
        // (avoids the modulo operator)
        if (pointerR == limitR)
        {
            pointerB = other.matrixArray;
            limitA += rowSize;
            limitR += rowSize;
        }
        else
        {
            // else just move the first flag back
            pointerA -= rowSize;
            // B always follows R
            pointerB -= decrementB;
        }
    }
    goto loopMultiplication;
}

and my fastest transpose method, for any kind of matrix:

void Matrix::transpose()
{
    int tempColNum, positionA = 0, positionB = 0;
    double *tempMatrix;
    do
    {
        tempArray[positionB] = matrixArray[positionA++];
        positionB += rowNum;
        if (positionB >= matrixSize)
            positionB -= matrixSize - 1;
    } while (positionA < matrixSize);
    tempMatrix = matrixArray;
    matrixArray = tempArray;
    tempArray = tempMatrix;
    tempColNum = colNum;
    colNum = rowNum;
    rowNum = tempColNum;
}

Note: my multiplication methods don't need the transpose method; it doesn't make any difference.

7. Rafael Setragni says:

Replacing

int colSize = this->colNum;

by

int colSize = other.colNum;

I achieved a multiplication of MatrixA[1][3] x MatrixA[3][2], 1,000,000 times, in 242 milliseconds on an E7500 processor.

8. GAK says:

I followed most of the vectorization code, but can you please explain the use of these intrinsics:

Thank you.

1. Xiang Fan says:

You can check the documentation for what these intrinsics do: https://msdn.microsoft.com/en-us/library/yd9wecaa.aspx

Using _mm_hadd_ps as an example: you have a vector (C1, C2, C3, C4) stored in 'c' and you want to get back C1 + C2 + C3 + C4.

After the first call, you get (C1 + C2, C3 + C4, C1 + C2, C3 + C4).
After the second call, you get (C1 + C2 + C3 + C4) in every element.

Then you can extract the result C1 + C2 + C3 + C4 from any element of the vector.
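The effect is easy to see with a scalar model of _mm_hadd_ps (a plain-C++ emulation written for this example, not the real intrinsic):

```cpp
#include <cassert>

// Scalar model of _mm_hadd_ps(a, b): the result is
// (a0+a1, a2+a3, b0+b1, b2+b3). Calling it twice with the same vector
// leaves the full horizontal sum in every lane.
struct Vec4 { float v[4]; };

Vec4 HaddPs(const Vec4& a, const Vec4& b)
{
    return { { a.v[0] + a.v[1], a.v[2] + a.v[3],
               b.v[0] + b.v[1], b.v[2] + b.v[3] } };
}
```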

9. nanyang says: