Automatic Performance Tuning (Bebop)

Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors
(SIAM Review, December 2008)
Kaushik Datta, Shoaib Kamil, Samuel Williams, Leonid Oliker, John Shalf, and Katherine Yelick
PDF (2.8 MB)

Avoiding Communication in Sparse Matrix Computations
(IEEE International Parallel and Distributed Processing Symposium, April 2008)
James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, and Katherine Yelick
PDF (1 MB)
Talk slides: PDF (6.2 MB)

Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms
(IEEE International Parallel and Distributed Processing Symposium, April 2008) [Winner, Best Paper for Applications track]
Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, and Katherine Yelick
PDF (560k)
Talk slides: PDF (10.4 MB) | PPT (2.6 MB)

Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms
(Supercomputing 2007, November 2007)
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel
PDF (438k)
Talk slides: PDF (6.4 MB) | PPT (2.5 MB)

Avoiding Communication in Computing Krylov Subspaces
(UCB/EECS-2007-123, October 2007)
James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, and Katherine Yelick
PDF (34.9 MB)

Scientific Computing Kernels on the Cell Processor
(International Journal of Parallel Programming, April 2007)
Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and Katherine Yelick
PDF (376k)

When Cache Blocking of Sparse Matrix Vector Multiply Works and Why
(Applicable Algebra in Engineering, Communication and Computing, March 2007)
Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, and Katherine Yelick
PDF (390k)

Benchmarking Sparse Matrix-Vector Multiply in Five Minutes
(SPEC Benchmark Workshop 2007, Austin, TX, January 2007)
Hormozd Gahvari, Mark Hoemmen, James Demmel, Katherine Yelick
PDF (1 MB)
Talk slides: PPT (6.4 MB)

Implicit and Explicit Optimizations for Stencil Computations
(Memory Systems Performance and Correctness, San Jose, California, USA, October 2006)
Shoaib Kamil, Kaushik Datta, Samuel Williams, Leonid Oliker, John Shalf, and Katherine Yelick
PDF (604k)
Talk slides: PDF (3.2 MB)

The Potential of the Cell Processor for Scientific Computing
(Computing Frontiers 2006, Ischia, Italy, May 2006)
Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and Katherine Yelick
PDF (216k)

OSKI: A library of automatically tuned sparse matrix kernels
(Proceedings of SciDAC 2005, Journal of Physics: Conference Series, June 2005)
Richard Vuduc, James Demmel, and Katherine Yelick.
PDF (190k)

Self-Adapting Linear Algebra Algorithms and Software
(Proceedings of the IEEE, Special Issue on Program Generation, Optimization, and Adaptation, 93(2), February 2005)
James Demmel, Jack Dongarra, Victor Eijkhout, Erika Fuentes, Antoine Petitet, Richard Vuduc, R. Clint Whaley, and Katherine Yelick.
PDF (600k)

Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
(International Conference on Parallel Processing, Montreal, Quebec, Canada, August 2004) [Winner, Best Paper Award]
Benjamin C. Lee, Richard Vuduc, James Demmel, and Katherine Yelick.
PDF (178k) | Gzip'd PostScript (204k)
Talk slides: PDF (540k)

Toward automatic performance tuning of matrix triple products based on matrix structure
(PARA'04 Workshop on State-of-the-art in Scientific Computing, Copenhagen, Denmark, June 2004.)
Eun-Jin Im, Ismail Bustany, Cleve Ashcraft, James Demmel, and Katherine Yelick.

Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply
(UCB/CSD-04-1335, June 2004)
Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, and Katherine A. Yelick
PDF (~8MB)

SPARSITY: An Optimization Framework for Sparse Matrix Kernels
(International Journal of High Performance Computing Applications, 18 (1), pp. 135-158, February 2004)
Eun-Jin Im, Katherine A. Yelick, and Richard Vuduc.
PDF (1.1M) | Gzip'd PostScript (1.2M)

Performance Optimizations and Bounds for Sparse Symmetric Matrix-Multiple Vector Multiply
(UCB/CSD-03-1297, November 2003)
Benjamin C. Lee, Richard W. Vuduc, James W. Demmel, Katherine A. Yelick, Michael de Lorimier, and Lijue Zhong.
PDF (867k) | Gzip'd PostScript (1.3M)

Memory Hierarchy Optimizations and Performance Bounds for Sparse A^T A*x
(ICCS 2003: Workshop on Parallel Linear Algebra, Melbourne, Australia, June 2003)
Richard Vuduc, Attila Gyulassy, James W. Demmel, and Katherine A. Yelick.
Abstract | PDF (328k) | Gzip'd PostScript (91k)
Talk slides, PDF (735k) | Talk slides, gzip'd PostScript, 4-up (138k)
Extended version: U.C. Berkeley Technical Report UCB/CSD-03-1232

Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply
Richard Vuduc, James W. Demmel, Katherine A. Yelick, Shoaib Kamil, Rajesh Nishtala, Benjamin Lee.
SC 2002 (High Performance Networking and Computing, commonly called "Supercomputing"). Baltimore, November 2002.
Available in pdf (834k) | Gzip'd PostScript (2.7M)
Automatic Performance Tuning and Analysis of Sparse Triangular Solve
Richard Vuduc, Shoaib Kamil, Jen Hsu, Rajesh Nishtala, James W. Demmel, Katherine A. Yelick.
ICS 2002: Workshop on Performance Optimization via High-Level Languages and Libraries. New York, June 22-26, 2002.
Available in pdf (548k) | Gzip'd PostScript (1.2M)
Optimizing Sparse Matrix-Vector Multiplication for Register Reuse
E. Im and K. A. Yelick
International Conference on Computational Science, San Francisco, California, May 2001.
(Postscript)
Optimizing Sparse Matrix Kernels for Data Mining
E. Im and K. A. Yelick
Proceedings of the Text Mine Workshop
Chicago, IL, April 2001.
(Postscript)
Optimizing Sparse Matrix Vector Multiplication on SMPs
E. Im and K. A. Yelick
SIAM Conf. Parallel Processing for Scientific Computing, San Antonio, TX, March 1999.
(Postscript)
Model-based Memory Hierarchy Optimizations for Sparse Matrices
E. Im and K. A. Yelick
Workshop on Profile and Feedback-Directed Compilation, Paris, France, October 1998.
(Postscript)

Intelligent RAM (IRAM)

Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines,
Brian R. Gaeke, Parry Husbands, Xiaoye S. Li, Leonid Oliker, Katherine A. Yelick, and Rupak Biswas.
Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS). Ft. Lauderdale, FL.
April, 2002.
Available in PDF.
Hardware/Compiler Co-development for an Embedded Media Processor,
C. Kozyrakis, D. Judd, J. Gebis, S. Williams, D. Patterson, K. Yelick,
Proceedings of the IEEE, vol. 89, no. 11, November 2001 (p. 1694-709).
Draft available in PDF.
Exploiting On-Chip Memory Bandwidth in the VIRAM Compiler,
D. Judd, K. Yelick, C. Kozyrakis, D. Martin, and D. Patterson,
Second Workshop on Intelligent Memory Systems, Cambridge, November 2000.
Available in Postscript.
Performance Analysis of an H.263 Video Encoder on VIRAM,
T. Nguyen, A. Zakhor and K. Yelick
International Conference on Image Processing (ICIP),
Vancouver, B.C., Canada, September 2000.
Available in PDF
Efficient FFTs on IRAM
Thomas, R. and Yelick, K.
First Workshop on Media Processors and DSPs, November 15, 1999.
Postscript available.
Scalable processors in the billion-transistor era: IRAM
Kozyrakis, C.E., Perissakis, S., Patterson, D., Anderson, T., Asanovic, K., Cardwell, N., Fromm, R., Golbus, J., Gribstad, B., Keeton, K., Thomas, R., Treuhaft, N., Yelick, K.
Computer, vol.30, (no.9), IEEE Comput. Soc, Sept. 1997. p.75-8.
Available in PDF.
The Energy Efficiency of IRAM Architectures
R. Fromm, S. Perissakis, N. Cardwell, D. Patterson, T. Anderson, and K. Yelick
Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997.
Available in Postscript.
A Case for Intelligent DRAM: IRAM
D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick.
IEEE Micro, April 1997, pp. 34-44. Also appeared as an Award Paper, Hot Chips VIII, August 1996.
Available in PDF or Postscript.
Intelligent RAM (IRAM): Chips that remember and compute
D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick.
Proceedings of the 1997 IEEE International Solid-State Circuits Conference, February 1997, pp. 224-225.
Available in PDF or Postscript.

Clusters (includes ISTORE and ROC)

ROC-1: Hardware Support for Recovery-Oriented Computing.
Oppenheimer, D., A. Brown, J. Beck, D. Hettena, J. Kuroda, N. Treuhaft, D.A. Patterson, and K. Yelick.
IEEE Transactions on Computers Special Issue on Embedded Fault-Tolerant Computer Systems, Jul.-Aug., 2001.
Available in PDF
Cluster I/O with River: Making the Fast Case Common
R. H. Arpaci-Dusseau, E. Anderson, N. Treuhaft, D. E. Culler, J. M. Hellerstein, D. A. Patterson, and K. A. Yelick
Workshop on I/O in Parallel and Distributed Systems, Atlanta, GA, May 1999.
Postscript available.

Parallel Applications

Performance Modeling and Composition: A Case Study in Cell Simulation
Steve G. Steinberg, Jun Yang, and Katherine Yelick, IPPS '96, April 1996.
Abstract, Postscript available.
Parallelizing the Phylogeny Problem
J. Jones and K. Yelick, Supercomputing '95, December 1995.
Abstract, Postscript available.
Connected Components on Distributed Memory Machines
A. Krishnamurthy, S. Lumetta, D. Culler, and K. Yelick, June 1994.
Abstract, Postscript available.
Parallel Timing Simulation on a Distributed Memory Multiprocessor
Chih-Po Wen and Katherine Yelick, International Conference on Computer Aided Design, Santa Clara, California, November 1993.
Abstract, Postscript available.
Implementing an Irregular Application on a Distributed Memory Multiprocessor
Soumen Chakrabarti and Katherine Yelick, ACM Symposium on Principles and Practice of Parallel Programming, San Diego, California, June 1993.
Abstract, Postscript available.
A Parallel Completion Procedure for Term Rewriting Systems
Katherine Yelick and Stephen J. Garland, Conference on Automated Deduction, June 1992.
Abstract, Postscript available.

Compilation

Analyses and Optimizations for Shared Address Space Programs
A. Krishnamurthy and K. Yelick
Journal of Parallel and Distributed Computing, 1996.
Postscript available.
Optimizing Parallel Programs with Explicit Synchronization
Arvind Krishnamurthy and Katherine Yelick, Programming Language Design and Implementation, La Jolla, California, June 1995.
Abstract, Postscript available.
Optimizing Parallel SPMD Programs
Arvind Krishnamurthy and Katherine Yelick, Seventh Annual Workshop on Languages and Compilers for Parallel Computing, Ithaca, New York, August 1994.
Abstract, Postscript available.
Compiling Sequential Programs for Speculative Parallelism
Chih-Po Wen and Katherine Yelick, International Conference on Parallel and Distributed Systems, National Taiwan University, Taiwan, December 1993.
Abstract, Postscript available.

Scheduling and Load Balancing

Models and Scheduling Algorithms for Mixed Data and Task Parallel Programs
S. Chakrabarti, J. Demmel, and K. Yelick
Journal of Parallel and Distributed Computing, Vol. 47, pp. 168-184, 1997.
Modeling the Benefits of Mixed Data and Task Parallelism
Soumen Chakrabarti, James Demmel, and Katherine Yelick, Symposium on Parallel Algorithms and Architectures, Santa Barbara, California, July 1995.
Abstract, Postscript available.
Randomized Load Balancing for Tree Structured Computation
Soumen Chakrabarti, Abhiram Ranade, and Katherine Yelick, IEEE Scalable High Performance Computing Conference, Knoxville, Tennessee, May 1994.
Abstract, Postscript available.

Distributed Data Structures & the Multipol Library

Portable Parallel Irregular Applications.
K. Yelick, C.-P. Wen, S. Chakrabarti, E. Deprit, J. Jones, A. Krishnamurthy, Workshop on Parallel Symbolic Languages and Systems, Beaune, France, October 1995. To appear in Lecture Notes in Computer Science.
Abstract, Postscript available.
Multipol: A Distributed Data Structure Library.
S. Chakrabarti, E. Deprit, J. Jones, A. Krishnamurthy, E.-J. Im, C.-P. Wen, and K. Yelick, UCB/CSD-95-879, July 1995.
Abstract, Postscript available.
Portable Runtime Support for Asynchronous Simulation
Chih-Po Wen and Katherine Yelick, International Conference on Parallel Processing, August 1995.
Abstract, Postscript available.
Runtime Support for Portable Distributed Data Structures
C.-P. Wen, S. Chakrabarti, E. Deprit, A. Krishnamurthy, and K. Yelick, Workshop on Languages, Compilers, and Runtime Systems for Scalable Computers, May 1995.
Postscript available.
Distributed Data Structures and Algorithms for Gröbner Basis Computation
Soumen Chakrabarti and Katherine Yelick, Lisp and Symbolic Computation, Vol. 7, 1994.
Abstract available.
Data Structures for Irregular Applications
K. Yelick, S. Chakrabarti, E. Deprit, J. Jones, A. Krishnamurthy, and C.-P. Wen, DIMACS Workshop on Parallel Algorithms for Unstructured and Dynamic Problems, Piscataway, New Jersey, June 1993.
Abstract, Postscript available.
Programming Models for Irregular Applications
Katherine Yelick. Workshop on Languages and Compilers and Run-Time Environments for Distributed Memory Multiprocessors, October 1992. Also appeared in SIGPLAN Notices, January 1993.
Postscript available.
A Survey of Portable Message Passing Libraries
Chih-Po Wen and Katherine Yelick, unpublished manuscript, October 15, 1992.
Postscript available.

Parallel Languages: Split-C, Titanium, and UPC

An Evaluation of Current High Performance Networks,
C. Bell, D. Bonachea, Y. Cote, J. Duell, P. Hargrove, P. Husbands, C. Iancu, M. Welcome, K. Yelick,
International Parallel and Distributed Processing Symposium, Nice, France, April 22-26, 2003.
Available in PDF
Introduction to UPC and Language Specification,
W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren,
CCS-TR-99-157, IDA Center for Computing Sciences, 1999.
Available in PDF
Titanium: A High-Performance Java Dialect
K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, and A. Aiken
Concurrency: Practice and Experience, Vol. 10, No. 11-13, September-November 1998. An earlier version was presented at the Workshop on Java for High-Performance Network Computing, Palo Alto, CA, Feb. 1998.
Postscript available.
Empirical Evaluation of Global Memory Support on the Cray-T3D and Cray-T3E
A. Krishnamurthy, D. Culler, and K. Yelick
UCB/CSD-98-991, 1998.
Postscript available.
Evaluation of Architectural Support for Global Address-Based Communication in Large-Scale Parallel Machines
Arvind Krishnamurthy, Klaus E. Schauser, Chris Scheiman, Randy Wang, David Culler, and Katherine Yelick, Proceedings of Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, November 1996.
Postscript available.
Empirical Evaluation of the CRAY-T3D: A Compiler Perspective
Remzi H. Arpaci, David E. Culler, Arvind Krishnamurthy, Steve G. Steinberg, and Katherine Yelick, International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, June 1995.
Abstract, Postscript available.
Parallel Programming in Split-C
D. Culler, A. Dusseau, S. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick, Supercomputing, Portland, Oregon, November 1993.
Abstract, Postscript available.

Symbolic Computation

On the Correctness of a Distributed Memory Gröbner Basis Algorithm
Soumen Chakrabarti and Katherine Yelick, International Conference on Rewriting Techniques and Applications, Montreal, Canada, June 1993.
Abstract, Postscript available.
Compiling Verilog into Finite State Machines
S.-T. Cheng, R. Brayton, G. York, K. Yelick, A. Saldanha. International Verilog Conference, 1995.
Abstract, Postscript available.
Using Abstraction in Explicitly Parallel Programs
Katherine A. Yelick, MIT Laboratory for Computer Science, July 1991, TR-507. (Revised from PhD Thesis, December 1990.)
Abstract, Postscript available.
A Generalized Approach to Equational Unification
Katherine A. Yelick, MIT Laboratory for Computer Science, August 1985, TR-344.