References

ABB+92
E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK users' guide, release 1.0. SIAM, Philadelphia, 1992.

ACF95
B. Alpern, L. Carter, and J. Ferrante. Space-limited procedures: A methodology for portable high-performance. In International Working Conference on Massively Parallel Programming Models, 1995.

AGZ94
R. Agarwal, F. Gustavson, and M. Zubair. IBM Engineering and Scientific Subroutine Library, Guide and Reference, 1994. Available through IBM branch offices.

BAD+
J. Bilmes, K. Asanovic, J. Demmel, D. Lam, and C.W. Chin. The PHiPAC WWW home page. http://www.icsi.berkeley.edu/~bilmes/phipac.

BAD+96
J. Bilmes, K. Asanovic, J. Demmel, D. Lam, and C.W. Chin. PHiPAC: A portable, high-performance, ANSI C coding methodology and its application to matrix multiply. LAPACK working note 111, University of Tennessee, 1996.

BLL93
B. Kågström, P. Ling, and C. Van Loan. Portable high performance GEMM-based level 3 BLAS. In R.F. Sincovec et al., editors, Parallel Processing for Scientific Computing, pages 339-346, Philadelphia, 1993. SIAM Publications.

BLS91
D. H. Bailey, K. Lee, and H. D. Simon. Using Strassen's algorithm to accelerate the solution of linear systems. J. Supercomputing, 4:357-371, 1991.

CDD+96
J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley. ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. LAPACK working note 95, University of Tennessee, 1996.

CFH95
L. Carter, J. Ferrante, and S. Flynn Hummel. Hierarchical tiling for improved superscalar performance. In International Parallel Processing Symposium, April 1995.

DCDH90
J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Soft., 16(1):1-17, March 1990.

DCHH88
J. Dongarra, J. Du Croz, S. Hammarling, and R.J. Hanson. An extended set of FORTRAN basic linear algebra subroutines. ACM Trans. Math. Soft., 14:1-17, March 1988.

GL89
G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1989.

KHM94
C. Kamath, R. Ho, and D.P. Manley. DXML: A high-performance scientific subroutine library. Digital Technical Journal, 6(3):44-56, Summer 1994.

LHKK79
C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic linear algebra subprograms for FORTRAN usage. ACM Trans. Math. Soft., 5:308-323, 1979.

LRW91
M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of ASPLOS IV, pages 63-74, April 1991.

MS95
J.D. McCalpin and M. Smotherman. Automatic benchmark generation for cache optimization of matrix algorithms. In R. Geist and S. Junkins, editors, Proceedings of the 33rd Annual Southeast Conference, pages 195-204. ACM, March 1995.

SMP+96
R. Saavedra, W. Mao, D. Park, J. Chame, and S. Moon. The combined effectiveness of unimodular transformations, tiling, and software prefetching. In Proceedings of the 10th International Parallel Processing Symposium, April 15-19, 1996.

WL91
M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation, pages 30-44, June 1991.

Wol96
M. Wolfe. High performance compilers for parallel computing. Addison-Wesley, 1996.



Richard Vuduc
Tue Nov 18 15:58:12 PST 1997