CS294-76: Communication-Avoiding Algorithms

Fall, 2011

Administrative Info


Instructor: Jim Demmel, assisted by Oded Schwartz
When: Friday, 12-2pm.
Where: 405 Soda Hall
First meeting on: August 26, 2011
Number of meetings: 13.
Academic units: 2.
Course control number: 27271
Students: All students are welcome. The class is likely to be of particular interest to students of EECS, mathematics, or more generally CSE, with an interest and some background in algorithms, either numerical or discrete.
Prerequisite: None. Having taken CS267/EngC233 may make your preparation easier.

Announcements

Date Message
08/26/2011 Welcome!
08/29/2011 This term we will be using Piazza for class Q&A. The system is highly catered to getting you help fast and efficiently from classmates, and from Jim and Oded. Rather than emailing questions to the teaching staff, we encourage you to post your questions on Piazza. Find our class page at: http://www.piazza.com/berkeley/fall2011/cs29476.
10/03/2011 Next Friday we plan to visit NERSC. We meet at the Downtown Berkeley BART station at noon and take the 12:10 train to 19th St. Oakland (arriving 12:19pm). We will hear a 30-minute presentation on NERSC and tour the NERSC machine room. We intend to catch the 1:38pm train back to Downtown Berkeley (arriving 1:47pm).
Important notice: we need to provide OSF Security with a list of names of all students visiting the center. If you intend to participate in this tour but are not officially registered for this course, please email your name to Oded. If you are registered, no action is required.
10/12/2011 Students who have trouble making the 12:10 train from Downtown Berkeley BART should be fine on the next one. Everyone needs to check in at the center with a photo ID. To allow some time for that and to avoid a bottleneck, please try to make the earlier train.
10/18/2011 Students who need accounts at NERSC (for final class projects) may use the following link:
http://www.nersc.gov/users/accounts/user-accounts/get-a-nersc-account/
10/27/2011 If you have not yet picked a topic for your project or presentation, please send Oded a short email stating your choice by the end of this week, and email Oded a one-page description of your planned project by next Friday, Nov. 4.
11/08/2011 The deadline for submitting CS294 projects is the last week of this term, December 2. We will have a poster session the following Friday, December 9, 12-2pm. If you have presented a paper in class and have not done a project beyond that, you can present a poster version of your slides. We will start the event with a very short (2-minute) presentation of each poster. If you want more time than that, please email Oded so we can schedule your talk in one of the few remaining spots.
11/30/2011 Correction: the deadline for submitting CS294 projects is end of the day Friday of the poster session. The poster session remains on Friday, December 9, 12-2pm.
For poster printing instructions and a poster template, look here: http://parlab.eecs.berkeley.edu/wiki/printers/pls562p. If you prefer to prepare your poster using LaTeX and need a template, please email Oded.
Many students will be using the poster printer (for our course and other courses). To avoid contention, you may want to print your poster ahead of time.

Lectures

Date Speaker Topic Presentation
08/26/2011 12-1pm Jim Introduction to communication-avoiding algorithms PPT
08/26/2011 1-2pm Oded Communication costs lower and upper bounds: reductions PPT
09/02/2011 12-2pm Oded Communication costs lower and upper bounds: geometric embedding PPT
09/09/2011 12-2pm Jim CA algorithms for dense linear algebra: matrix multiplication and LU decomposition PPT
09/16/2011 12-2pm Jim/Oded CA algorithms for dense linear algebra: LU, QR, and Cholesky decompositions, and sparse Cholesky decomposition PPT
09/23/2011 12-1pm Edgar 2.5D algorithms: from hardware to theory and back PDF
09/23/2011 1-2pm Aydin Communication in sequential and parallel BFS PDF
09/30/2011 12-12:50pm Andrew & Grey Algorithms and lower bounds for heterogeneous models: minimizing communication, saving energy PDF
09/30/2011 12:50-1:15pm Jim Potential Projects PPT
09/30/2011 1:15-2pm Oded Communication costs lower and upper bounds: graph analysis PPT
10/07/2011 12-1pm Derrick Selective, Embedded Just-in-Time Specialization (SEJITS) PPT
10/07/2011 1-2pm Sam The roofline model PPT
10/14/2011 12-2pm Visiting NERSC: National Energy Research Scientific Computing Center, at Berkeley Lab's Oakland Scientific Facility (OSF)
10/21/2011 Erin/Nick Avoiding Communication in Sparse Iterative Solvers PPT
10/21/2011 Nick Avoiding Communication in Sparse Matrix-Vector Multiply (SpMV) PPT
10/21/2011 Erin Introduction to Blocking Covers PPT
10/28/2011 12-1pm Mike Communication Lower Bound for the Fast Fourier Transform PPT
10/28/2011 1-2pm Vasily Experience in accelerating linear algebra using GPUs PDF
11/04/2011 12-1pm Ben CA parallel implementation for fast matrix multiplication PDF
11/04/2011 1-2pm Oded Maximizing communication PPT
11/11/2011 Holiday
11/18/2011 12-1pm Jim Further techniques for arithmetic costs and communication costs lower bounds
11/18/2011 1-2pm Oded Towards communication avoiding algorithm for fast multiplication of sparse matrices. Part I: arithmetics. PPT
11/25/2011 Holiday
12/02/2011 12-1pm Bor-Yiing clOSKI: An OpenCL SpMV autotuner on GPU platforms PDF
12/02/2011 1-2pm Razvan Communication costs of LU decomposition algorithms for banded matrices PPT

Tentative plan

The course will include lectures by Jim Demmel, Oded Schwartz, other local experts, students presenting papers, outside experts, and (eventually) reports by students on class projects.
This course aims to familiarize students with the challenges of minimizing communication. In particular, by the end of the course students will better understand the following terms in the context of communication-avoiding algorithms: lower bounds, upper bounds, bandwidth, latency, the seven dwarfs, memory hierarchy, cache oblivious, network topology, computational graph, dense linear algebra, sparse linear algebra, recursive data structures, blocking, 2D/2.5D/3D. Other terms you may become familiar with, depending on your selected project, include: temporal locality, spatial locality, tiling, high-radix, scheduling, MapReduce, work-stealing.
Based on experience from past CS294 courses, some of your projects may evolve into articles.

Initial List of Potential Projects

You can do the project/presentation alone or with a fellow student. Time frame for projects: You should begin discussions with the instructors about your proposed project by the 4th week of the semester.

Links

Suggested reading

(L) - Will be presented in the lectures of the course.
(S) - Suggested paper for presentation by students.
(P) - Suggested paper for programming. You may also choose to implement algorithms from papers marked (L) and (S).

    Communication lower bounds in numerical linear algebra

  1. (S)
    J. W. Hong and H. T. Kung.
    I/O complexity: The red-blue pebble game.
    In STOC '81: Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 326--333, New York, NY, USA, 1981. ACM.
  2. (S)
    Y. Saad.
    Communication complexity of the Gaussian elimination algorithm on multiprocessors.
    Linear Algebra Appl., 77:315--340, 1986.
  3. (S)
    J. E. Savage.
    Extending the Hong-Kung model to memory hierarchies.
    In COCOON, pages 270--281, 1995.
  4. (L)
    D. Irony, S. Toledo, and A. Tiskin.
    Communication lower bounds for distributed-memory matrix multiplication.
    J. Parallel Distrib. Comput., 64(9):1017--1026, 2004.
  5. (L)
    G. Ballard, J. Demmel, O. Holtz, and O. Schwartz.
    Communication-optimal parallel and sequential Cholesky decomposition.
    SIAM Journal on Scientific Computing, 32(6):3495--3523, December 2010.
  6. (L)
    G. Ballard, J. Demmel, O. Holtz, and O. Schwartz.
    Minimizing communication in linear algebra.
    SIAM Journal on Matrix Analysis and Applications.
    Accepted. Available from http://arxiv.org/abs/0905.2485.
  7. (L)
    G. Ballard, J. Demmel, O. Holtz, and O. Schwartz.
    Graph expansion and communication costs of fast matrix multiplication.
    In 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2011), 2011.

    Communication avoiding algorithms in numerical linear algebra

  8. (L)
    J. Demmel.
    CS 267 Course Notes: Applications of Parallel Processing.
    Computer Science Division, University of California, 1996.
    http://www.cs.berkeley.edu/~demmel/cs267.
  9. (L)
    J. Demmel, L. Grigori, M. F. Hoemmen and J. Langou.
    Communication-optimal parallel and sequential QR and LU factorizations.
    SIAM Journal on Scientific Computing (SISC), to appear, 2011.
    http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-89.pdf
  10. (L)
    L. Grigori, J. Demmel, and H. Xiang.
    Communication avoiding Gaussian elimination.
    Proceedings of the IEEE/ACM SuperComputing Conference (SC'08), November 2008.
  11. (L)
    L. Grigori, J. Demmel, and H. Xiang.
    CALU: a communication optimal LU factorization algorithm.
    To appear in SIAM Journal on Matrix Analysis.
    Preliminary version as UCB-EECS-2010-29 and LAWN 226.
  12. (L)
    L. Grigori, P.-Y. David, J. Demmel, and S. Peyronnet.
    Brief announcement: Lower bounds on communication for sparse Cholesky factorization of a model problem.
    ACM SPAA, 2010.
  13. (L)
    M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran.
    Cache-oblivious algorithms.
    In FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, page 285, Washington, DC, USA, 1999. IEEE Computer Society.
  14. (S)
    W. F. McColl and A. Tiskin.
    Memory-efficient matrix multiplication in the BSP model.
    Algorithmica, 24:287--297, 1999.
  15. (L)
    E. Elmroth, F. Gustavson, I. Jonsson, and B. Kågström.
    Recursive blocked algorithms and hybrid data structures for dense matrix library software.
    SIAM Review, 46(1):3--45, March 2004.
  16. (S,P)
    J. Demmel, M. Hoemmen, M. Mohiyuddin, and K. Yelick.
    Avoiding communication in sparse matrix computations.
    In IPDPS'08: International Symposium on Parallel and Distributed Processing, 1--12, 2008.
  17. (L)
    V. Volkov and J. Demmel.
    LU, QR and Cholesky factorizations using vector capabilities of GPUs.
    Technical Report No. UCB/EECS-2008-49, EECS Department, University of California, Berkeley, 2008.
  18. (L)
    M. Anderson, G. Ballard, J. Demmel, and K. Keutzer.
    Communication-Avoiding QR Decomposition for GPUs.
    Technical Report No. UCB/EECS-2010-131, EECS Department, University of California, Berkeley, 2010.

    Communication avoiding parallel algorithms in numerical linear algebra

  19. (L)
    L. Cannon.
    A cellular computer to implement the Kalman filter algorithm.
    PhD thesis, Montana State University, Bozeman, MT, 1969.
  20. (L)
    E. Solomonik and J. Demmel.
    Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms.
    EECS Technical Report EECS-2011-10, UC Berkeley, February 2011.
    To appear in EURO-PAR 2011.

    Fast Fourier Transform

  21. (S)
    J. W. Hong and H. T. Kung.
    I/O complexity: The red-blue pebble game.
    In STOC '81: Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 326--333, New York, NY, USA, 1981. ACM.
  22. (S,P)
    M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran.
    Cache-oblivious algorithms.
    In FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, page 285, Washington, DC, USA, 1999. IEEE Computer Society.

    Software Generation

  23. (S)
    N. Ahmed, N. Mateev, and K. Pingali.
    A framework for sparse matrix code synthesis from high-level specifications.
    In Proceedings of the 2000 ACM/IEEE conference on Supercomputing (CDROM), Dallas, Texas, United States, 2000.
  24. (S)
    M. Mills Strout, L. Carter, and J. Ferrante.
    Compile-time Composition of Run-time Data and Iteration Reorderings.
    Proceedings of the 2003 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June, 2003.
  25. (S)
    M. M. Strout, L. Carter, J. Ferrante, J. Freeman, and B. Kreaseck.
    Combining Performance Aspects of Irregular Gauss-Seidel via Sparse Tiling.
    In Proc. 15th Workshop on Languages and Compilers for Parallel Computing (LCPC), College Park, Maryland, July 25-27, 2002.

    Graph algorithms

  26. (S)
    Y. J. Chiang, M. T. Goodrich, E. F. Grove, R. Tamassia, D. E. Vengroff, and J. S. Vitter.
    External-memory graph algorithms.
    Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms, 139-149, 1995.
  27. (S)
    J.-S. Park, M. Penner, and V. K. Prasanna.
    Optimizing graph algorithms for improved cache performance.
    In Proc. Int’l Parallel and Distributed Processing Symp. (IPDPS 2002), Fort Lauderdale, FL, pages 769--782, 2002.
  28. (S)
    L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, and J. I. Munro.
    An optimal cache-oblivious priority queue and its application to graph algorithms.
    SIAM Journal on Computing, 36(6): 1672-1695, 2007.

    Sorting and Searching

  29. (S)
    A. LaMarca and R. E. Ladner.
    The influence of caches on the performance of sorting.
    In SODA '97: Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms, pages 370--379, 1997.
  30. (S)
    A. Aggarwal and J. S. Vitter.
    The input/output complexity of sorting and related problems.
    Commun. ACM, 31(9):1116--1127, 1988.
  31. (S)
    M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran.
    Cache-oblivious algorithms.
    In FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, page 285, Washington, DC, USA, 1999. IEEE Computer Society.
  32. (S)
    M. A. Bender, G. S. Brodal, R. Fagerberg, D. Ge, S. He, H. Hu, J. Iacono, and A. López-Ortiz.
    The Cost of Cache-Oblivious Searching.
    Proceedings of the 44th Annual Symposium on Foundations of Computer Science (FOCS) 271--280, 2003.

    Data structures

  33. (S)
    A. LaMarca and R. E. Ladner.
    The Influence of Caches on the Performance of Heaps.
    Journal of Experimental Algorithmics, 1:4, 1996.
  34. (S)
    M. A. Bender, Z. Duan, J. Iacono, and J. Wu.
    A Locality-Preserving Cache-Oblivious Dynamic Dictionary.
    Journal of Algorithms, 53(2):115-136, 2004.
  35. (S)
    M. A. Bender, E. Demaine, and M. Farach-Colton.
    Cache-Oblivious B-Trees.
    SIAM Journal on Computing, 35(2):341-358, 2005.
  36. (S)
    L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, and J. I. Munro.
    An optimal cache-oblivious priority queue and its application to graph algorithms.
    SIAM Journal on Computing, 36(6): 1672-1695, 2007.
  37. (S)
    M. A. Bender, B. C. Kuszmaul, S. H. Teng, and K. Wang.
    Optimal Cache-Oblivious Mesh Layouts.
    Theory of Computing Systems, 48(2): 269-296, 2011.

    Dynamic programming

  38. (S)
    C. Cherng and R. E. Ladner.
    Cache Efficient Simple Dynamic Programming.
    AofA'05: International Conference on the Analysis of Algorithms, 49--58, 2005.
  39. (S)
    R. A. Chowdhury and V. Ramachandran.
    Cache-oblivious dynamic programming.
    In SODA '06: Proceedings of the seventeenth annual ACM-SIAM symposium on discrete algorithms, pages 591--600, New York, NY, USA, 2006. ACM.

    Work stealing

  40. (S)
    U. Acar, G. Blelloch, and R. Blumofe.
    The Data Locality of Work Stealing.
    Theory of Computing Systems (TCS), 35(3), May 2002.
  41. (S)
    K. Agrawal, C. E. Leiserson, Y. He, and W. J. Hsu.
    Adaptive work-stealing with parallelism feedback.
    ACM Trans. Comput. Syst., 26(3):7:1--7:32, 2008.

    Parallel implementations for fast matrix multiplication

  42. (S)
    B. Grayson, A. P. Shah, and R. A. van de Geijn.
    A High Performance Parallel Strassen Implementation.
    Parallel Processing Letters, 6:3--12, 1995.
  43. (S)
    F. Desprez and F. Suter.
    Impact of mixed-parallelism on parallel implementations of the Strassen and Winograd matrix multiplication algorithms.
    Concurrency and Computation: Practice & Experience, 16(8):771--797, 2004.
  44. (S)
    D. K. Nguyen, I. Lavallée, M. Bui, and Q. T. Ha.
    A General Scalable Implementation of Fast Matrix Multiplication Algorithms on Distributed Memory Computers.
    Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, 116--122, 2005.
  45. (S)
    G. E. Blelloch, R. A. Chowdhury, P. B. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch.
    Provably good multicore cache performance for divide-and-conquer algorithms.
    In Proc. 19th ACM-SIAM Sympos. Discrete Algorithms, 501--510, 2008.
  46. (L,P)
    G. Ballard, J. Demmel, O. Holtz, E. Rom and O. Schwartz.
    Communication-optimal parallel Fast Matrix Multiplication.
    Manuscript, 2011.