Previous CS267 Assignment 1 results: [ 2008 | 2007 | 2004 | 2002 | 2000 | 1999 | 1997 ]
The following plot summarizes the performance achieved by the different teams. The numbers in the plot are team numbers; see the list of assigned teams to decode them. "GSI" is the code shown in the Homework 1 notes. "given" is the simple blocked implementation that was supplied.
The following plot shows the correlation between the number of lines in the code and its performance. Lines were counted with the CLOC tool. Empty lines and comments are not counted. Files that were not required to compile matrix multiply (e.g. benchmark.cpp) are not counted. The color code shows whether aligned intrinsics such as _mm_load_pd and _mm_store_pd are used in the code. The intrinsics _mm_loadu_pd, _mm_storeu_pd and _mm_load1_pd are considered unaligned.
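To make the aligned/unaligned distinction concrete, here is a minimal sketch in C. It is not taken from any submission: the names is_aligned16 and axpy_sse2 are hypothetical, and it assumes an x86 compiler with SSE2 support. The aligned branch uses _mm_load_pd and _mm_store_pd, which require 16-byte-aligned addresses; the fallback uses _mm_loadu_pd and _mm_storeu_pd, which accept any address.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if p is 16-byte aligned, 0 otherwise. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}

/* Hypothetical kernel: y[i] += a * x[i] for i in [0, n), n even.
 * x and y must not overlap. Two doubles are processed per iteration. */
void axpy_sse2(int n, double a, const double *x, double *y) {
    assert(n % 2 == 0);
    __m128d va = _mm_set1_pd(a);  /* broadcast a into both lanes */

    if (is_aligned16(x) && is_aligned16(y)) {
        /* Aligned path: addresses must be multiples of 16 bytes. */
        for (int i = 0; i < n; i += 2) {
            __m128d vx = _mm_load_pd(x + i);
            __m128d vy = _mm_load_pd(y + i);
            _mm_store_pd(y + i, _mm_add_pd(vy, _mm_mul_pd(va, vx)));
        }
    } else {
        /* Unaligned path: works for any address. */
        for (int i = 0; i < n; i += 2) {
            __m128d vx = _mm_loadu_pd(x + i);
            __m128d vy = _mm_loadu_pd(y + i);
            _mm_storeu_pd(y + i, _mm_add_pd(vy, _mm_mul_pd(va, vx)));
        }
    }
}
```

Both branches compute the same result; only the alignment requirement on the pointers differs. Calling the aligned intrinsics on a misaligned address typically faults, which is why a code that uses them has to control where its data lives.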
There was little or no correlation between performance and the use of local copies or transposes. (However, a local copy is required for alignment, and the correlation with alignment is clearly visible.)
The following graphs show the raw performance as I was able to reproduce it with the submitted codes. The vertical axis is the fraction of peak. The horizontal axis is the dimension of the matrices. Notes in the graphs indicate whether zero-padding (and, therefore, a local copy) was used to handle the fringes. You may notice that in most cases it helps achieve close to uniform performance across varying n.
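As an illustration of zero-padding with a local copy, here is a minimal sketch in C. It is not any team's code: the names round_up and pad_copy are hypothetical, and it assumes column-major storage and the C11 function aligned_alloc. The matrix is copied into a 16-byte-aligned buffer whose dimension is rounded up to a multiple of the block size, so the kernel never sees a partial block.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: round n up to the next multiple of block (block > 0). */
int round_up(int n, int block) {
    return ((n + block - 1) / block) * block;
}

/* Hypothetical helper: copy the n-by-n column-major matrix A into a newly
 * allocated, 16-byte-aligned np-by-np buffer. Rows and columns beyond n
 * are zero. np must be even and at least n. Returns NULL if allocation
 * fails; the caller releases the buffer with free(). */
double *pad_copy(int n, int np, const double *A) {
    assert(np >= n && np % 2 == 0);

    /* np even => bytes is a multiple of 16, as aligned_alloc requires. */
    size_t bytes = (size_t)np * (size_t)np * sizeof(double);
    double *P = aligned_alloc(16, bytes);
    if (P == NULL) return NULL;

    memset(P, 0, bytes);  /* all-zero bytes are 0.0 for IEEE 754 doubles */

    /* Copy column by column; the padded leading dimension is np. */
    for (int j = 0; j < n; ++j) {
        memcpy(P + (size_t)j * np, A + (size_t)j * n,
               (size_t)n * sizeof(double));
    }
    return P;
}
```

Because the padded entries are zero, they contribute nothing to the product, and the n-by-n result can be copied back out of the padded output. With an even np, every column of the buffer also starts on a 16-byte boundary, so the same copy provides the alignment that aligned loads and stores need.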