Performance was quite good considering how unusual the Itanium2 architecture is. The most successful optimizations were register and cache blocking, but their impact depended strongly on the block sizes. The least successful was reordering the matrix without blocking. The fastest code overall, after the vendor libraries, was from Nishtala, Chang, and Yang (NCY), although the code from Kamil, Diez Canas, and Xiao makes a very strong showing at larger matrix sizes.
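For readers who have not seen cache blocking, here is a minimal sketch of the idea. It is not any group's submission. It assumes square, column-major matrices and the update C := C + A*B, and the tile edge `BLOCK` is a hypothetical value that would have to be tuned per machine. True register blocking would additionally unroll the inner loops over a small fixed tile, which this sketch does not do.

```c
#include <stddef.h>

/* Hypothetical tile edge; the best value depends on the cache and must be tuned. */
#define BLOCK 32

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C := C + A*B for n-by-n column-major matrices (assumed layout). */
void blocked_dgemm(size_t n, const double *A, const double *B, double *C)
{
    for (size_t jj = 0; jj < n; jj += BLOCK) {
        size_t jmax = min_sz(jj + BLOCK, n);
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            size_t kmax = min_sz(kk + BLOCK, n);
            for (size_t ii = 0; ii < n; ii += BLOCK) {
                size_t imax = min_sz(ii + BLOCK, n);
                /* Multiply one tile of A by one tile of B into one tile of C. */
                for (size_t j = jj; j < jmax; ++j) {
                    for (size_t k = kk; k < kmax; ++k) {
                        double b = B[k + j * n]; /* reused across the i loop */
                        for (size_t i = ii; i < imax; ++i)
                            C[i + j * n] += A[i + k * n] * b;
                    }
                }
            }
        }
    }
}
```

Changing `BLOCK` changes which tiles stay resident in cache between reuses, which is one reason block size matters so much.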
Many of the codes make unfortunate assumptions about maximum matrix sizes, and most of those assumptions lead to Heisenbugs. Solar-Lezama and Hoemmen at least acknowledged the limitation, and their code let me know what was going on. A few groups used dynamic rather than static allocation for large enough matrices, and Kamil, Diez Canas, and Xiao used dynamic allocation and still performed well. None of the codes with static buffers are thread-safe, but we didn't ask for that.
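One way to avoid a silent size limit is to keep a static buffer for small problems and fall back to the heap for large ones. The sketch below only illustrates that pattern. The names `get_workspace` and `release_workspace` and the limit `MAX_STATIC_N` are hypothetical, not taken from any submission. Like the submitted codes with static buffers, the static path here is not thread-safe.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical limit on the dimension served by the static buffer. */
#define MAX_STATIC_N 256

static double static_buf[MAX_STATIC_N * MAX_STATIC_N];

/*
 * Return workspace for an n-by-n matrix of doubles.
 * *needs_free is set to 1 when the caller must release the buffer.
 * Returns NULL if the heap allocation fails or the size would overflow.
 */
double *get_workspace(size_t n, int *needs_free)
{
    if (n <= MAX_STATIC_N) {
        *needs_free = 0;
        return static_buf; /* shared buffer: not thread-safe */
    }
    *needs_free = 1;
    if (n > SIZE_MAX / sizeof(double) / n)
        return NULL; /* n * n * sizeof(double) would overflow */
    return malloc(n * n * sizeof(double));
}

/* Release workspace obtained from get_workspace. */
void release_workspace(double *buf, int needs_free)
{
    if (needs_free)
        free(buf);
}
```

A caller checks for NULL and reports the failure instead of writing past the end of a fixed array, which is where the Heisenbugs come from.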
Each group's page has rate plots focusing on that group along with a link to that group's code.
The codes below were compiled with version 7 of Intel's compilers, not the version 8 ones I had initially used. The group ordering was the same with both compilers. The vertical bars in the plots mark changes in storage class: the first class is where all three matrices fit in the IA64's 128 FP registers, followed by L1, L2, L3, and main memory. The recorded rates were the best achieved over seven runs.
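As a rough illustration of where the first storage-class boundary falls and how a best-of-several rate is picked, consider the sketch below. Both function names are hypothetical and this is not the code from the timing driver. The capacity is counted in doubles, so the register class uses 128. The same function could be given a cache capacity in doubles, but no cache sizes are assumed here.

```c
#include <stddef.h>

/* Largest n such that three n-by-n matrices of doubles fit in `slots` doubles. */
size_t max_dim_for_capacity(size_t slots)
{
    size_t n = 0;
    while (3 * (n + 1) * (n + 1) <= slots)
        ++n;
    return n;
}

/* Best (largest) rate among `count` repeated runs; count must be at least 1. */
double best_rate(const double *rates, size_t count)
{
    double best = rates[0];
    for (size_t i = 1; i < count; ++i)
        if (rates[i] > best)
            best = rates[i];
    return best;
}
```

With 128 registers, `max_dim_for_capacity(128)` gives 6, since 3 * 36 = 108 values fit but 3 * 49 = 147 do not.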
The total time to completion below emphasizes performance on the large matrices, and was calculated using the maximum rates above. The red line marks the time for NCY (Nishtala, Chang, and Yang), and the blue line marks the time for MKL.
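The calculation can be sketched as follows, assuming that each n-by-n multiply is counted as 2n^3 floating-point operations and that the total is the sum over all tested sizes of operations divided by the best rate. Both the flop count and the function name `time_to_completion` are assumptions for illustration. The actual calculation is in stats.r.

```c
#include <stddef.h>

/*
 * Estimated seconds to run one multiply at each size, given the best
 * rate in MFLOP/s for that size. Assumes 2*n^3 flops per multiply.
 */
double time_to_completion(const size_t *n, const double *mflops, size_t count)
{
    double total = 0.0;
    for (size_t i = 0; i < count; ++i) {
        double dim = (double)n[i];
        double flops = 2.0 * dim * dim * dim;
        total += flops / (mflops[i] * 1.0e6);
    }
    return total;
}
```

Because the operation count grows as n^3, the largest sizes dominate the sum, which is why this measure emphasizes performance on large matrices.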
Rank, by time to completion:
The final driver used for timing is in matmul.c, and the raw data is in data.table. The R code to generate the plots is in stats.r.
Caveats:
Above is a breakdown of MFLOP rate by storage class. The left boxes correspond to version 7 of Intel's compiler, and the right boxes with purple outlines correspond to version 8. You can see why I first thought version 8 of icc gave a performance boost. It does, but not for all sizes.
Each box holds the middle 50% of the samples in its group. The line inside the box is the median, and the box extends from it to cover 25% of the sample in each direction. The lines ("whiskers") extend from each end of the box to the most extreme sample within about 1.5 times the box height, which covers nearly all of a well-behaved sample. Individual outliers beyond the whiskers are plotted separately. The notches give a rough confidence interval for the median: when the notches of two boxes do not overlap, their medians very likely differ. (See MathWorld for more information on box plots.)
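The box and whisker positions can be computed along these lines. This is only a sketch: it uses linearly interpolated quartiles, whereas R's box plots use Tukey's hinges, so the values can differ slightly from those in the plots. The `BoxStats` type and the function names are hypothetical, and the notches are left out.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    double q1, median, q3;          /* box edges and middle line */
    double lo_whisker, hi_whisker;  /* most extreme samples inside the fences */
} BoxStats;

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Quantile of sorted data s[0..n-1] by linear interpolation, 0 <= p <= 1. */
static double quantile(const double *s, size_t n, double p)
{
    double pos = p * (double)(n - 1);
    size_t lo = (size_t)pos;
    size_t hi = lo + 1 < n ? lo + 1 : lo;
    return s[lo] + (pos - (double)lo) * (s[hi] - s[lo]);
}

/* Sorts x in place and returns its box plot statistics; n must be at least 1. */
BoxStats box_stats(double *x, size_t n)
{
    BoxStats b;
    qsort(x, n, sizeof *x, cmp_double);
    b.q1 = quantile(x, n, 0.25);
    b.median = quantile(x, n, 0.50);
    b.q3 = quantile(x, n, 0.75);

    double iqr = b.q3 - b.q1; /* height of the box */
    double lo_fence = b.q1 - 1.5 * iqr;
    double hi_fence = b.q3 + 1.5 * iqr;

    /* Whiskers stop at the last sample inside each fence. */
    size_t i = 0;
    while (x[i] < lo_fence)
        ++i;
    b.lo_whisker = x[i];

    size_t j = n - 1;
    while (x[j] > hi_fence)
        --j;
    b.hi_whisker = x[j];
    return b;
}
```

Any sample below `lo_whisker` or above `hi_whisker` is what the plots draw as an individual outlier.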
E. Jason Riedy