Suppose you have 4 local variables contained in FP registers and named c11, c12, c21, c22. You also have two 'double*'s A and B, and you want to do a matrix-matrix accumulate into the matrix defined by cij. Also, Asep and Bsep are integer (in registers) that define the distance in doubles between two rows (i.e., the col dim of A and B resp). Goal: minimize the number of memory references, the number of additional FP registers and try to express the computation using multiply-accumulates.
#define mul_mdmd_md2x2(c11,c12,c21,c22,A,Asep,B,Bsep) \
{ \
const double *bp,*ap; \
double b1,b2; \
double a; \
\
bp = B; \
b1 = bp[0]; b2 = bp[1]; /*quad-word load*/ \
bp += Bsep; \
\
ap = A; \
a = ap[0]; ap += Asep; \
\
c11 += a*b1; /* c11 += a11*b11 */ /* fma */ \
c12 += a*b2; /* c12 += a11*b12 */ /* fma */ \
\
a = ap[0]; ap = &A[1]; \
\
c21 += a*b1; /* c21 += a21*b11 */ /* fma */ \
c22 += a*b2; /* c22 += a21*b12 */ /* fma */ \
\
b1 = bp[0]; b2 = bp[1]; /* quad-word load */ \
a = ap[0]; ap += Asep; \
\
c11 += a*b1; /* c11 += a12*b21 */ /* fma */ \
c12 += a*b2; /* c12 += a12*b22 */ /* fma */ \
\
a = ap[0]; \
\
c21 += a*b1; /* c21 += a22*b21 */ /* fma */ \
c22 += a*b2; /* c22 += a22*b22 */ /* fma */ \
}