The following plot shows the performance results for each team. The code was run on an AMD Opteron (Budapest) 2.3 GHz processor.
The red line shows the median performance over all matrix sizes from 1 to 768, and the gray line indicates the maximum performance over that same range. The points are color-coded to indicate how each group made use of SSE intrinsics. Unaligned SSE means that the data was loaded and stored using unaligned vector accesses (_mm_loadu_pd, _mm_storeu_pd), as opposed to aligned ones (_mm_load_pd, _mm_store_pd). The use of SSE seems to cluster the performance results quite well.
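To make the aligned/unaligned distinction concrete, here is a minimal sketch in C. It is not taken from any team's submission: the function names (add_unaligned, add_aligned, is_aligned16, sse_add_demo) are hypothetical, and a simple vector add stands in for the matrix multiply kernel. Only the four SSE2 intrinsics are the ones named above.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_load_pd, _mm_loadu_pd, ... */
#include <stdint.h>     /* uintptr_t */

/* c[i] = a[i] + b[i], two doubles per step, using unaligned accesses.
   Valid for any pointers, whatever their alignment. */
static void add_unaligned(const double *a, const double *b, double *c, int n)
{
    int i = 0;
    for (; i + 1 < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(c + i, _mm_add_pd(va, vb));
    }
    if (i < n)
        c[i] = a[i] + b[i];  /* scalar tail for odd n */
}

/* Same operation using aligned accesses. The caller must guarantee that
   a, b and c are 16-byte aligned; otherwise the aligned load/store may fault.
   Each step advances 2 doubles = 16 bytes, so alignment is preserved. */
static void add_aligned(const double *a, const double *b, double *c, int n)
{
    int i = 0;
    for (; i + 1 < n; i += 2) {
        __m128d va = _mm_load_pd(a + i);
        __m128d vb = _mm_load_pd(b + i);
        _mm_store_pd(c + i, _mm_add_pd(va, vb));
    }
    if (i < n)
        c[i] = a[i] + b[i];  /* scalar tail for odd n */
}

/* Returns 1 if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical self-check: runs both variants and returns 1 if they agree. */
static int sse_add_demo(void)
{
    _Alignas(16) double a[9];
    _Alignas(16) double b[9];
    _Alignas(16) double c[9];
    _Alignas(16) double d[9];
    int i;

    for (i = 0; i < 9; i++) {
        a[i] = (double)i;
        b[i] = 10.0 * (double)i;
    }

    /* The arrays start on 16-byte boundaries; one double further in does not. */
    if (!is_aligned16(a) || is_aligned16(a + 1))
        return 0;

    add_aligned(a, b, c, 9);                /* aligned base pointers */
    add_unaligned(a + 1, b + 1, d + 1, 8);  /* pointers offset by 8 bytes */
    d[0] = a[0] + b[0];

    for (i = 0; i < 9; i++)
        if (c[i] != d[i] || c[i] != 11.0 * (double)i)
            return 0;
    return 1;
}
```

Passing a + 1 to add_aligned instead would violate its alignment requirement, which is why code that cannot control the layout of its operands typically falls back on the unaligned variants.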
The sharp peak in the maximum curve comes from a solution that runs exceptionally fast specifically for matrices whose sizes are near multiples of 64. This behavior is shown more clearly in the plot below.
These graphs show the performance on our benchmark, which tests every matrix size from 1 to 768.