CS267 Assignment 3: Conjugate Gradient
Due Friday 21 March 2008 at 11:59pm
The method of conjugate gradients (CG) is an iterative
technique for solving symmetric positive-definite linear systems. The
conjugate gradient algorithm, popular in practice, is similar in structure to
many other linear and nonlinear optimization and equation-solving algorithms,
and is relatively simple to code. All these points make CG an attractive
benchmark kernel. Indeed, CG appears in both the NAS parallel benchmarks and
the SPEC floating point benchmark suite.
You are not required to understand CG or why it works
to solve a system of equations Ax = b, but the underlying principles are quite
interesting. Unlike many other scientific algorithms in use, CG has a
succinct, understandable text that explains it without much prerequisite knowledge of math,
written by Prof. Jonathan Shewchuk.
I highly recommend reading his paper.
In this project, you will parallelize CG using the UPC language, work out the performance
bottlenecks, and iteratively optimize your code.
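For orientation, here is a minimal serial sketch of the unpreconditioned CG iteration in C. It is not the provided course code: the function names (`cg_poisson1d`, `poisson1d_matvec`, `dot`), the caller-supplied work arrays, and the relative-residual stopping test are all assumptions made for illustration. The matrix is the unscaled 1-d Poisson matrix tridiag(-1, 2, -1), applied without storing it.

```c
#include <assert.h>
#include <stddef.h>

/* y = A*x, where A = tridiag(-1, 2, -1) is the unscaled 1-d Poisson matrix. */
static void poisson1d_matvec(size_t n, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        double left  = (i > 0)     ? x[i - 1] : 0.0;
        double right = (i + 1 < n) ? x[i + 1] : 0.0;
        y[i] = 2.0 * x[i] - left - right;
    }
}

/* Inner product u . v. In a parallel code this becomes a global reduction. */
static double dot(size_t n, const double *u, const double *v)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += u[i] * v[i];
    return s;
}

/* Unpreconditioned CG for A*x = b (hypothetical interface).
 * x holds the initial guess on entry and the approximate solution on exit.
 * r, p, q are caller-supplied work arrays of length n.
 * Stops when ||r|| <= rtol * ||b|| or after maxit iterations.
 * Returns the number of iterations performed. */
int cg_poisson1d(size_t n, const double *b, double *x,
                 double *r, double *p, double *q,
                 double rtol, int maxit)
{
    assert(n > 0 && rtol > 0.0 && maxit >= 0);

    poisson1d_matvec(n, x, q);                 /* q = A*x0 */
    for (size_t i = 0; i < n; ++i) {
        r[i] = b[i] - q[i];                    /* initial residual */
        p[i] = r[i];                           /* first search direction */
    }
    double rho = dot(n, r, r);                 /* squared residual norm */
    const double stop = rtol * rtol * dot(n, b, b);

    int it = 0;
    while (it < maxit && rho > stop) {
        poisson1d_matvec(n, p, q);             /* q = A*p */
        double alpha = rho / dot(n, p, q);     /* step length */
        for (size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * q[i];
        }
        double rho_new = dot(n, r, r);
        double beta = rho_new / rho;           /* direction update weight */
        for (size_t i = 0; i < n; ++i)
            p[i] = r[i] + beta * p[i];
        rho = rho_new;
        ++it;
    }
    return it;
}
```

Each iteration costs one matrix-vector product, two inner products, and three vector updates. In a parallel version the inner products are the global communication steps, so they are a natural place to start looking for bottlenecks.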
Details
You may work in groups of up to 3. The implementation we
give you is capable of solving a simple model problem (the 1-d Poisson
equation) or a more general sparse matrix problem. The code can read sparse
matrix files in the Matrix Market format; for instance
./cg gr_30_30.mtx
will solve the two-dimensional Poisson problem on a 30-by-30 grid from the Matrix Market
file gr_30_30.mtx.
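To make the file format concrete, here is a small sketch of a reader for Matrix Market coordinate files in C. It is independent of the reader in the provided code; the names `mm_read_header`, `mm_read_entries`, and `coo_matvec` are hypothetical. It assumes a lower-case banner of the form `%%MatrixMarket matrix coordinate real symmetric` (or `general`), and it relies on the format's convention that symmetric files store only one triangle, so the matrix-vector product mirrors each off-diagonal entry.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reads the banner, skips comment lines, and reads the size line.
 * Returns 0 on success, -1 on a malformed or unsupported header. */
int mm_read_header(FILE *f, int *nrows, int *ncols, int *nentries,
                   int *symmetric)
{
    char line[1024], object[64], format[64], field[64], symmetry[64];

    assert(f && nrows && ncols && nentries && symmetric);
    if (!fgets(line, sizeof line, f))
        return -1;
    if (sscanf(line, "%%%%MatrixMarket %63s %63s %63s %63s",
               object, format, field, symmetry) != 4)
        return -1;
    if (strcmp(object, "matrix") != 0 || strcmp(format, "coordinate") != 0 ||
        strcmp(field, "real") != 0)
        return -1;
    if (strcmp(symmetry, "symmetric") == 0)
        *symmetric = 1;
    else if (strcmp(symmetry, "general") == 0)
        *symmetric = 0;
    else
        return -1;

    do {                                   /* skip '%' comment lines */
        if (!fgets(line, sizeof line, f))
            return -1;
    } while (line[0] == '%');
    if (sscanf(line, "%d %d %d", nrows, ncols, nentries) != 3)
        return -1;
    return 0;
}

/* Reads nentries "i j value" triples and converts indices to 0-based. */
int mm_read_entries(FILE *f, int nentries, int *row, int *col, double *val)
{
    for (int k = 0; k < nentries; ++k) {
        if (fscanf(f, "%d %d %lf", &row[k], &col[k], &val[k]) != 3)
            return -1;
        --row[k];
        --col[k];
    }
    return 0;
}

/* y = A*x for a matrix in coordinate form. If symmetric is nonzero, only
 * one triangle is stored and each off-diagonal entry is applied twice. */
void coo_matvec(int nrows, int nentries, const int *row, const int *col,
                const double *val, int symmetric,
                const double *x, double *y)
{
    for (int i = 0; i < nrows; ++i)
        y[i] = 0.0;
    for (int k = 0; k < nentries; ++k) {
        y[row[k]] += val[k] * x[col[k]];
        if (symmetric && row[k] != col[k])
            y[col[k]] += val[k] * x[row[k]];
    }
}
```

Coordinate form is convenient for reading but scatters writes across `y`; a row-oriented layout such as compressed sparse row is usually easier to partition among threads.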
We will provide a serial C implementation with a dummy preconditioner. Your tasks
are to:
1. Create an initial parallel UPC implementation.
2. Analyze the performance of your parallel implementation. What are the bottlenecks?
3. Optimize your implementation. Restructure your code, change your parallelization strategy, etc.
If you reach a point where you're pretty satisfied with the parallelization, implement a simple
preconditioner (Jacobi or SSOR or some other one).
4. Iterate steps 2 and 3 until you are either satisfied with the result or you run out of time.
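As a rough idea of what the simplest preconditioner involves, here is a sketch of Jacobi (diagonal) preconditioning in C. It assumes the matrix is held in compressed sparse row (CSR) arrays named `rowptr`, `colind`, and `val`; the provided code may store the matrix differently, and the function names `jacobi_setup` and `jacobi_apply` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stores 1/a_ii for each row of an n-by-n CSR matrix in inv_diag.
 * Returns 0 on success, or -1 if some diagonal entry is not positive
 * (a symmetric positive-definite matrix always has a positive diagonal). */
int jacobi_setup(int n, const int *rowptr, const int *colind,
                 const double *val, double *inv_diag)
{
    assert(n >= 0 && rowptr && colind && val && inv_diag);
    for (int i = 0; i < n; ++i) {
        double d = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
            if (colind[k] == i)
                d += val[k];               /* sum duplicates, if any */
        if (d <= 0.0)
            return -1;
        inv_diag[i] = 1.0 / d;
    }
    return 0;
}

/* Applies the preconditioner: z = M^{-1} r with M = diag(A). */
void jacobi_apply(int n, const double *inv_diag, const double *r, double *z)
{
    for (int i = 0; i < n; ++i)
        z[i] = inv_diag[i] * r[i];
}
```

In preconditioned CG, `jacobi_apply` would be called once per iteration on the residual, and the search direction and the scalar r . z are built from `z` instead of `r`. The apply step is purely local to each row, so it adds no communication. SSOR involves triangular solves and is noticeably harder to parallelize.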
For this homework assignment, the primary platforms are Jacquard and
Franklin, NERSC's new flagship
Cray XT4. UPC is installed on both machines; on Jacquard, simply loading the UPC module and using
upcc as the compiler works. On Franklin,
see here
(contains information for both machines).
On Jacquard, make sure you load the ACML module before compiling/running.
Submission
Your group should put together a write-up describing
changes you've made to the programs and their effects on performance.
Mail me (skamil@cs) a
URL to a tar file containing the report, your program code, and any
necessary or useful additional information.
The primary questions of interest:
- Which parallelization strategies did you try? Why did you go from one to another?
- What performance gaffes (and bugs) exist in these codes? Can you
quantify the problems?
- What sort of data decomposition did you use? Were there any load-balancing
issues?
- What kinds of bottlenecks did you encounter? How did you address them?
- Is using UPC and a PGAS model more intuitive than the other languages/libraries we've used in the class? How does implementation ease compare? What about ease of getting good performance?
- How does the code perform on Jacquard vs. Franklin? How well does the code scale?
- What were some of the optimizations that made the most impact on performance? Was it more effective to change the implementation strategy or to add a preconditioner? When does preconditioning help?
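On the data decomposition and load-balancing question, one simple option is a contiguous block-row partition balanced by nonzero count rather than by row count. The sketch below assumes a CSR row-pointer array `rowptr`; the function name `partition_rows_by_nnz` and the rounding rule are illustrative choices, not part of the provided code.

```c
#include <assert.h>
#include <stddef.h>

/* Splits rows 0..n-1 into nparts contiguous ranges with roughly equal
 * numbers of nonzeros. Part p owns rows start[p] .. start[p+1]-1, so
 * start must have room for nparts + 1 entries. rowptr is the CSR row
 * pointer array, with rowptr[n] equal to the total nonzero count. */
void partition_rows_by_nnz(int n, const int *rowptr, int nparts, int *start)
{
    assert(n >= 0 && nparts > 0 && rowptr && start);
    const long long nnz = rowptr[n];
    int row = 0;

    start[0] = 0;
    for (int p = 1; p < nparts; ++p) {
        /* Ideal number of nonzeros owned by parts 0..p-1 together. */
        long long target = nnz * p / nparts;
        /* Advance while the next row still fits under the target. */
        while (row < n && rowptr[row + 1] <= target)
            ++row;
        start[p] = row;
    }
    start[nparts] = n;
}
```

For matrices with uniform rows, such as the Poisson problems, this reduces to an equal split by rows. It balances only the arithmetic in the matrix-vector product; it says nothing about how many remote vector entries each part must fetch, which is where a graph partitioner can help.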
Answer the questions I've asked or implied above, and
try to explain any interesting effects you see. If you don't see any,
explain why not. Explanations that are based on a well-understood
system model (PRAM, LogP, etc.) or well-understood programming models
(e.g. comparing shared memory to PGAS to message passing) are the most convincing. The page
should include appropriate speed-up plots and any other figures to convey
your story. Note that tracing may be difficult for UPC.
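For the speed-up plots, the usual quantities are speedup S(p) = T(p0) / T(p) and parallel efficiency E(p) = S(p) * p0 / p, where p0 is the smallest processor count you measured. A small helper along these lines can turn timings into plottable columns; the name `scaling_summary` and the convention that the first entry is the baseline are assumptions for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Computes speedup and parallel efficiency relative to the first entry.
 * procs[i] is a processor count and seconds[i] the wall-clock time
 * measured on that many processors; entry 0 is taken as the baseline. */
void scaling_summary(size_t count, const int *procs, const double *seconds,
                     double *speedup, double *efficiency)
{
    assert(count > 0 && procs && seconds && speedup && efficiency);
    for (size_t i = 0; i < count; ++i) {
        assert(procs[i] > 0 && seconds[i] > 0.0);
        speedup[i] = seconds[0] / seconds[i];
        efficiency[i] = speedup[i] * procs[0] / procs[i];
    }
}
```

Reporting efficiency alongside speedup makes it easier to see where scaling starts to fall off, and to compare that point against what a model such as LogP would predict.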
The goal of this assignment is to learn a PGAS language and
parallelize a scientific code that's actually used in practice. In addition, there
are two avenues for improving performance here: one is tuning the implementation, and the other
is using math to change the algorithm (for example, by adding a preconditioner).
Resources
- The Berkeley UPC Group has links to documentation, a
downloadable compiler, and general
resources for the Berkeley implementation of UPC.
- UPC at George Washington University is
another place to look for UPC tutorials and information.
- LAPACK Working
Note #56 discusses how to reorganize CG to eliminate reductions.
(Fewer global communication steps is a good thing.)
- The UPC language reference
will be valuable.
- Templates for the Solution of
Linear Systems is a great source of information on implementing and
optimizing CG, choosing a preconditioner, etc. There is also source
(albeit source written in Fortran with a reverse communication interface).
- Parallel
Numerical Linear Algebra by Demmel, Heath, and van der Vorst.
Additional information on parallel implementation of CG.
- An Introduction
to the Conjugate Gradient Method Without the Agonizing Pain is a very
popular (and very gentle) introduction to the mathematics of CG. If you
have vague recollections of eigenthingies from your distant past, you can
probably grok this explanation.
- The PETSc toolkit
includes CG and several of its relatives, along with a variety of
preconditioners. You might also be interested in ITPACK,
a package of iterative solvers that includes CG.
- The Matrix Market is a
great source for test cases. Some of the suggested test cases from
previous years included gr_30_30,
nos1,
nos2,
nos3,
nos4,
nos5,
nos6,
nos7
(all smallish), and bcsstk16,
bcsstk17,
bcsstk18
(larger). These look pretty arbitrarily chosen; other SPD problems from
the Matrix Market could be interesting, too. Note that you want the Matrix
Market format files, not Harwell-Boeing format.
- Sparsity is a
package by Eun-Jin Im for optimizing sparse matrix-vector multiply. Its
successor, OSKI, is a current, updated
implementation of optimized SpMV.
- Something like METIS
could be helpful for partitioning your problem across processors.
- You may want to call LAPACK for some of
the work.
LAPACK relies on the BLAS to do most of its work.
Note that the reference BLAS library provided with LAPACK gets
crappy performance. You really want something like ATLAS
if you plan to get good performance from LAPACK and the BLAS.
- You may notice that for slowly-converging cases, the behavior is
different for different numbers of processors even when the code
appears mathematically identical. Look at the residual histories when
this happens. Do they start off the same, and then start to drift away from
each other? If so, you may be seeing certain round-off effects, probably
due to differences in ordering of sums in the dot product.
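The effect of summation order is easy to reproduce in serial C. The sketch below computes the same dot product three ways: left to right, as partial sums over contiguous blocks, and as partial sums over a cyclic (round-robin) distribution of the elements. The function names are hypothetical, and the partial sums stand in for what each thread would compute before a reduction.

```c
#include <assert.h>
#include <stddef.h>

/* Left-to-right dot product, as a serial code would compute it. */
double dot_serial(size_t n, const double *x, const double *y)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}

/* Dot product as nparts partial sums over contiguous blocks,
 * combined in order of part index. */
double dot_blocked(size_t n, const double *x, const double *y, size_t nparts)
{
    assert(nparts > 0);
    double total = 0.0;
    for (size_t p = 0; p < nparts; ++p) {
        size_t lo = p * n / nparts, hi = (p + 1) * n / nparts;
        double partial = 0.0;
        for (size_t i = lo; i < hi; ++i)
            partial += x[i] * y[i];
        total += partial;
    }
    return total;
}

/* Dot product as nparts partial sums where part p owns elements
 * p, p + nparts, p + 2*nparts, ... (a cyclic distribution). */
double dot_cyclic(size_t n, const double *x, const double *y, size_t nparts)
{
    assert(nparts > 0);
    double total = 0.0;
    for (size_t p = 0; p < nparts; ++p) {
        double partial = 0.0;
        for (size_t i = p; i < n; i += nparts)
            partial += x[i] * y[i];
        total += partial;
    }
    return total;
}
```

Take x = {1e16, 1, -1e16, 1} and y = {1, 1, 1, 1}, whose exact dot product is 2. Assuming IEEE 754 double arithmetic with round-to-nearest-even and no extended-precision intermediates, `dot_serial` gives 1, `dot_blocked` with two parts gives 0, and `dot_cyclic` with two parts gives 2. Real CG runs show far smaller differences per dot product, but over many iterations on a slowly converging problem they can accumulate into visibly different residual histories.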
FAQ
How do I compile OSKI on Franklin?
Easy(ish):
- Copy this file over src/timer/cycle.h.
- Use the --disable-shared option in configure, i.e. "./configure --disable-shared". I also recommend --disable-bench
to save on compilation time.
- Edit libtool and comment out line 171 by adding a "#" to the beginning. It should look like
#export_dynamic_flag_spec="\${wl}--export-dynamic"
Remember, Franklin does not support dynamic libraries, so you must statically compile your code. If you are unable to get
your code working on Franklin, it is okay to use Bassi, but I'd really like people to work with (and around) Franklin.
Original project by David Bindel. Last updated March 8, 2008.