CS267 Assignment 1 Results: Optimize Matrix Multiplication

Overall Comments

Performance was quite good considering how unusual the Itanium2 architecture is. The most successful optimizations were register and cache blocking, but their impact depended strongly on the block sizes. The least successful was reordering the matrix without blocking. The fastest code overall, after the vendor libraries, was from Nishtala, Chang, and Yang, although the code from Kamil, Diez Canas, and Xiao makes a very strong showing at larger matrix sizes.
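For readers who have not seen cache blocking, here is a minimal sketch of the idea next to the plain triple loop. It is not any group's code: the column-major layout, the function names, and the block edge `BLOCK` are all assumptions made for illustration, and a real submission would tune the block size and add register blocking inside the tile.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block edge length; the best value depends on the cache. */
#define BLOCK 32

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Reference triple loop: C += A * B for n-by-n matrices.
 * Column-major storage is assumed: element (i, j) lives at [i + j * n]. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    assert(n == 0 || (A && B && C));
    for (size_t j = 0; j < n; ++j)
        for (size_t k = 0; k < n; ++k) {
            const double b = B[k + j * n];
            for (size_t i = 0; i < n; ++i)
                C[i + j * n] += A[i + k * n] * b;
        }
}

/* Cache-blocked version: the same arithmetic, but performed one
 * BLOCK-by-BLOCK tile at a time so each tile is reused while it is
 * still resident in cache. Edge tiles are clipped to the matrix. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    assert(n == 0 || (A && B && C));
    for (size_t jj = 0; jj < n; jj += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t ii = 0; ii < n; ii += BLOCK) {
                const size_t jmax = min_size(jj + BLOCK, n);
                const size_t kmax = min_size(kk + BLOCK, n);
                const size_t imax = min_size(ii + BLOCK, n);
                for (size_t j = jj; j < jmax; ++j)
                    for (size_t k = kk; k < kmax; ++k) {
                        const double b = B[k + j * n];
                        for (size_t i = ii; i < imax; ++i)
                            C[i + j * n] += A[i + k * n] * b;
                    }
            }
}
```

Both routines accumulate into C, so the caller zeroes C first if a plain product is wanted. Changing `BLOCK` changes only the traversal order, which is why the block size affects speed but not the result.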

Many of the codes make unfortunate assumptions about maximum matrix sizes, and most of those assumptions lead to Heisenbugs. Solar-Lezama and Hoemmen at least acknowledged the limitation, and their code let me know what was going on. A few groups used dynamic rather than static allocation for large enough matrices, and Kamil, Diez Canas, and Xiao used dynamic allocation and still performed well. None of the codes with static buffers are thread-safe, but we didn't ask for that.
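The static-versus-dynamic trade-off can be sketched as follows. This is a hypothetical helper, not taken from any submission: `MAX_STATIC_N`, `workspace_acquire`, and `workspace_release` are names invented here. The point is that a size beyond the static limit is either served from the heap or reported as a failure, rather than silently overrunning a fixed buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical limit on the matrix edge served by the static buffer. */
#define MAX_STATIC_N 64

/* One shared buffer: fast to obtain, but NOT thread-safe, since every
 * caller gets the same storage. */
static double static_buf[MAX_STATIC_N * MAX_STATIC_N];

/* Return workspace for n*n doubles. *needs_free is set to 1 when the
 * memory came from malloc and must be passed to workspace_release.
 * Returns NULL if a large request cannot be satisfied, so the caller
 * can report the problem instead of corrupting memory. */
double *workspace_acquire(size_t n, int *needs_free)
{
    assert(needs_free != NULL);
    if (n <= MAX_STATIC_N) {
        *needs_free = 0;
        return static_buf;
    }
    *needs_free = 1;
    if (n > SIZE_MAX / sizeof(double) / n)   /* n*n*sizeof(double) would overflow */
        return NULL;
    return malloc(n * n * sizeof(double));
}

/* Release workspace obtained from workspace_acquire. */
void workspace_release(double *buf, int needs_free)
{
    if (needs_free)
        free(buf);
}
```

A caller would check for NULL after `workspace_acquire` and print a message naming the offending size, which is the kind of acknowledgement that makes a size limit easy to diagnose.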

Groups

Each group's page has rate plots focusing on that group along with a link to that group's code.

3-Loops: Example three nested loop code
Blocked: Example blocked code
ATLAS: Automatically Tuned Linear Algebra Software, Debian package version 3.2.1ln-7
MKL: Intel's Math Kernel Library, version 5.2
BKZ (writeup): Christian Bell, Omair Kamil, and Christian Zambrana
AAS (writeup): Dan Adkins, Mikhail Avrekh, and Sonesh Surana
BRT (writeup): Fabrizio Bisetti, Shariq Rizvi, and Michael Tung
GLH (writeup): Hormozd Gahvari, Benjamin C. Lee, and Jeff Hammel
SS (writeup): Ben Schwarz and Jimmy Su
KCX (writeup): Amir Kamil, Guillermo Diez Canas, and Wei Xiao
GNP (writeup): Frank Gennari, Meling Ngo, and Yatish Patel
SH (writeup): Armando Solar-Lezama and Mark Hoemmen
NCY (writeup): Rajesh Nishtala, Chen Chang, and Guang Yang

Rate Plots

The code measured below was compiled with version 7 of Intel's compilers, not the version 8 ones I had initially used. There was no difference in group orderings between the two compilers. The vertical bars mark the boundaries between storage classes. The first class is when all three matrices fit in the IA64's 128 FP registers, the second when they fit in L1, then L2, L3, and finally main memory. The recorded rates were the best achieved over seven runs.
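To make the storage-class boundaries concrete, the sketch below computes the largest matrix dimension n for which three n-by-n matrices of doubles fit in a given capacity. The function name is hypothetical, and the model is deliberately crude: it counts one double per register or per 8 bytes of cache and ignores reserved registers, associativity, and anything else sharing the cache.

```c
#include <assert.h>
#include <stddef.h>

/* Largest n such that three n-by-n matrices of doubles (3 * n * n
 * values) fit in a storage level holding capacity_doubles values.
 * Intended for realistic capacities; no overflow guard is included. */
size_t largest_n_fitting(size_t capacity_doubles)
{
    size_t n = 0;
    while (3 * (n + 1) * (n + 1) <= capacity_doubles)
        ++n;
    assert(3 * n * n <= capacity_doubles);
    return n;
}
```

With 128 FP registers this gives n = 6, since 3 * 6 * 6 = 108 values fit but 3 * 7 * 7 = 147 do not. For a cache level, pass its size in bytes divided by 8; the actual cache sizes of the test machine are not stated here, so any value plugged in is the reader's assumption.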

Rate plot in L2
Rate plot in memory

The total time to completion below emphasizes performance on the large matrices, and was calculated using the maximum rates above. The red line marks NCY's time, and the blue line marks MKL's.
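One plausible way to turn best rates into a time to completion is sketched below. It assumes the usual count of 2n^3 floating-point operations for an n-by-n multiply and rates in MFLOP/s; the function names are hypothetical, and the exact accounting used for the bar chart is in stats.r, not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Best (maximum) rate among repeated timing runs of one matrix size. */
double best_rate(const double *rates, size_t runs)
{
    assert(rates != NULL && runs > 0);
    double best = rates[0];
    for (size_t i = 1; i < runs; ++i)
        if (rates[i] > best)
            best = rates[i];
    return best;
}

/* Estimated total seconds to multiply one matrix pair of each size,
 * assuming 2 * n^3 flops per multiply and best rates in MFLOP/s. */
double total_time_seconds(const size_t *sizes, const double *best_mflops,
                          size_t count)
{
    double total = 0.0;
    for (size_t i = 0; i < count; ++i) {
        assert(best_mflops[i] > 0.0);
        const double n = (double)sizes[i];
        const double flops = 2.0 * n * n * n;
        total += flops / (best_mflops[i] * 1.0e6);
    }
    return total;
}
```

Because the work grows as n^3, the largest sizes dominate the sum, which is why a total-time measure emphasizes large-matrix performance.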

Barchart of time to completion

Rank, by time to completion:

  1. Rajesh Nishtala, Chen Chang, and Guang Yang [NCY]
  2. Amir Kamil, Guillermo Diez Canas, and Wei Xiao [KCX]
  3. Christian Bell, Omair Kamil, and Christian Zambrana [BKZ]
  4. Ben Schwarz and Jimmy Su [SS]
  5. Hormozd Gahvari, Benjamin C. Lee, and Jeff Hammel [GLH]
  6. Dan Adkins, Mikhail Avrekh, and Sonesh Surana [AAS]
  7. Frank Gennari, Meling Ngo, and Yatish Patel [GNP]
  8. Fabrizio Bisetti, Shariq Rizvi, and Michael Tung [BRT]
  9. Armando Solar-Lezama and Mark Hoemmen [SH]

The final driver used for timing is in matmul.c, and the raw data is in data.table. The R code to generate the plots is in stats.r.

Caveats:

Box-and-whiskers plot of MFLOP/s by memory class

Above is a breakdown of MFLOP rate by storage class. The left boxes correspond to version 7 of Intel's compiler, and the right boxes with purple outlines correspond to version 8. You can see why I first thought icc gave a performance boost. It does, but not for all sizes.

The boxes hold 50% of the samples in each group. The middle line in each box is the median, and the box extends outward to hold 25% of the sample in each direction. The lines ("whiskers") extend outward from the box to the most extreme sample within about 1.5 times the size of the box, ideally encompassing 90% of the sample. Any outliers beyond the whiskers are plotted individually. The notches give some idea of how sensitive the median is. (See MathWorld for more information on box plots.)
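The box and whisker positions described above can be computed with a short routine like the following sketch. It uses Tukey-style hinges for the box edges and a 1.5-box-length fence for the whiskers; I believe this matches R's default box plot statistics, but treat that as an assumption. The type and function names are hypothetical, and the notches are not computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    double lower_hinge, median, upper_hinge;   /* box edges and middle line */
    double lower_whisker, upper_whisker;       /* whisker end points */
    size_t n_outliers;                         /* samples beyond the whiskers */
} BoxStats;

static int cmp_double(const void *a, const void *b)
{
    const double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Value at a 1-based position that is a whole number or ends in .5;
 * a .5 position averages the two neighbouring sorted samples. */
static double at_half_pos(const double *sorted, double pos)
{
    const size_t lo = (size_t)pos;
    const size_t hi = (size_t)(pos + 0.5);
    return 0.5 * (sorted[lo - 1] + sorted[hi - 1]);
}

/* Sorts data in place and returns the box plot summary. */
BoxStats box_stats(double *data, size_t n)
{
    assert(data != NULL && n > 0);
    qsort(data, n, sizeof data[0], cmp_double);

    const double n4 = (double)((n + 3) / 2) / 2.0;   /* hinge position */
    BoxStats s;
    s.lower_hinge = at_half_pos(data, n4);
    s.median = at_half_pos(data, ((double)n + 1.0) / 2.0);
    s.upper_hinge = at_half_pos(data, (double)n + 1.0 - n4);

    /* Fences sit 1.5 box lengths beyond each box edge. */
    const double reach = 1.5 * (s.upper_hinge - s.lower_hinge);
    const double lo_fence = s.lower_hinge - reach;
    const double hi_fence = s.upper_hinge + reach;

    s.lower_whisker = s.upper_whisker = s.median;
    s.n_outliers = 0;
    int seen = 0;
    for (size_t i = 0; i < n; ++i) {
        if (data[i] < lo_fence || data[i] > hi_fence) {
            ++s.n_outliers;              /* plotted individually */
            continue;
        }
        if (!seen) {                     /* first sample inside the fences */
            s.lower_whisker = data[i];
            seen = 1;
        }
        s.upper_whisker = data[i];       /* last sample inside the fences */
    }
    return s;
}
```

Note that the whiskers stop at actual samples, not at the fences themselves, so a tight group draws short whiskers even though the fence lies further out.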



E. Jason Riedy
ejr@cs.berkeley.edu