CS267 Assignment 3
Spring 2004
Modify and tune code in global address space languages.
Languages like UPC and Titanium are relatively new. This
assignment's goal is to expose you to these languages and the global
address space programming paradigm.
Each enrolled student should have been assigned to a group. Auditors
can ask to join a group or form groups of their own to tackle a
problem.
For your group's problem, you will need to perform and report
upon the following tasks:
- Create initial implementations in UPC and Titanium.
- Quantify the performance of the initial code and identify
performance problems.
- Change the code to address some of the performance
problems.
- Iterate until you're happy with the code or the assignment
is due.
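One way to quantify performance from one iteration to the next is to tabulate speedup and parallel efficiency from wall-clock timings. The sketch below assumes you have already measured run times at several processor counts. The function name scaling_table and the sample timings are made up for illustration and are not measurements from any course machine.

```python
def scaling_table(timings):
    """Return (procs, time, speedup, efficiency) rows.

    `timings` maps a processor count to a measured wall-clock time in
    seconds. Speedup is taken relative to the smallest processor count
    present, which is an assumption; use a good serial code as the
    baseline if you have one.
    """
    base_p = min(timings)
    base_t = timings[base_p]
    rows = []
    for p in sorted(timings):
        speedup = base_t / timings[p]
        # Scale by base_p so the baseline run has efficiency 1.0.
        efficiency = speedup * base_p / p
        rows.append((p, timings[p], speedup, efficiency))
    return rows


# Hypothetical timings, for illustration only.
sample = {1: 8.0, 2: 4.4, 4: 2.5, 8: 2.0}
for p, t, s, e in scaling_table(sample):
    print(f"{p:3d} procs  {t:6.2f} s  speedup {s:5.2f}  efficiency {e:5.2f}")
```

Efficiency that falls quickly as processors are added usually points at communication or load imbalance rather than at the computation itself.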
Remember the typical parallel performance curve: Most initial
versions perform poorly and improvements are incremental. Also,
there will be unexpected performance issues from the UPC and
Titanium implementations. Few people have experience with these
languages.
We also want to know how the languages affect the designs and
performance. (We know reductions are missing from UPC.) How
does the language affect your design? What would you do
differently in an MPI or OpenMP code? Which MPI, OpenMP,
pthreads, etc. features do you miss, and which Titanium
and UPC features do you enjoy?
Performance debugging can be tricky. You'll want to compare
performance between your two implementations as well as with
established benchmark codes when they're available. Also, think
about modelling both your codes' performance and a "best
possible" performance. For irregular problems, you will need to
base models on particular problem instances or assumptions about
instances.
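As a starting point for a "best possible" estimate, a simple latency-bandwidth model can be compared against measured times. The sketch below is only an illustration under stated assumptions, not a model prescribed by the assignment: it assumes computation and communication do not overlap, and the parameter names (flop_rate, latency, bandwidth) and all numbers are hypothetical and would have to be measured on the target machine.

```python
def model_time(flops, messages, bytes_moved, flop_rate, latency, bandwidth):
    """Predict per-process run time in seconds.

    Assumes no overlap between computation and communication:
      compute time = flops / flop_rate
      comm time    = messages * latency + bytes_moved / bandwidth
    """
    compute = flops / flop_rate
    comm = messages * latency + bytes_moved / bandwidth
    return compute + comm


def fraction_of_model(measured, predicted):
    """Ratio of predicted to measured time; 1.0 means the code meets the model."""
    return predicted / measured


# Hypothetical machine and problem parameters, for illustration only.
predicted = model_time(flops=2e8, messages=1000, bytes_moved=8e6,
                       flop_rate=1e9, latency=1e-5, bandwidth=1e8)
print(f"predicted {predicted:.3f} s, "
      f"fraction of model at 0.58 s measured: {fraction_of_model(0.58, predicted):.2f}")
```

For irregular problems, the flops, messages, and bytes_moved inputs would come from a particular problem instance rather than from a closed formula.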
Also, the NAS Parallel Benchmarks are "designed" to run with a
power-of-two number of processors. Any time you can remove that
restriction, please do. The power-of-two trick allows for some
optimizations that aren't applicable to everyday codes.
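One common way to drop the power-of-two restriction is to compute data distributions that work for any processor count. The sketch below shows a balanced block partition of n items over p processes. The function name block_range is hypothetical, and the NAS codes themselves may partition their data differently.

```python
def block_range(n, p, rank):
    """Return the half-open index range [lo, hi) owned by `rank`.

    The first n % p ranks get one extra item, so block sizes differ by
    at most one for any p >= 1, not just powers of two.
    """
    base, extra = divmod(n, p)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


# Ten items over three processes: block sizes 4, 3, 3.
for r in range(3):
    print(r, block_range(10, 3, r))
```

The same index arithmetic can be written directly in UPC or Titanium; only the ownership computation changes, not the algorithm.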
- CG
- Estimate the smallest eigenvalue of a randomly generated
sparse SPD matrix. The principal operation is a sparse
matrix-vector product (q = A * p) which involves gathering
elements of p from their home processors. CG also involves
multiple reductions. (Problem
from the NAS Parallel Benchmarks)
Mark Hoemmen, Armando Solar-Lezama, Fabrizio Bisetti, Guang
Yang, Ben Schwarz, Christian Rojas
- KNAP
- Pack books into a bag to maximize profit. This is a 0-1
knapsack problem solved via dynamic programming; each process
requires a block of previous solutions a random distance away.
(Problem
from David Culler, used in a previous 267 class)
Meling Ngo, Amir Kamel, Yatish Patel, Chen Chang, Frank
Gennari, Guillermo Canas
- FT
- Solve a simple PDE via 3-D fast Fourier transforms. Here the
main communication step is an all-to-all transpose. (Problem
from the NAS Parallel Benchmarks)
Christian Bell, Rajesh Nishtala, Jeffrey Hammel, Hormozd
Gahvari, Jimmy Su, Michael Tung
- IS
- Rank and sort small integer keys in parallel. The algorithm
is a basic bucket sort, and so the interesting piece is merging
the buckets across different processors. (Problem
from the NAS Parallel Benchmarks)
Dan Adkins, Sonesh Surana, Shariq Rizvi, Omair Kamil,
Benjamin Lee, Mikhail Avrekh
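To make the KNAP communication pattern concrete, here is a serial sketch of the 0-1 knapsack dynamic program. It is not the course's reference code, and the function name and sample data are invented for illustration. Each new row reads the previous row at capacity c - w, so if capacities are distributed across processes, the entries a process needs sit w positions away, and that distance changes with every item.

```python
def knapsack(weights, profits, capacity):
    """Return the maximum total profit for a 0-1 knapsack.

    prev[c] holds the best profit achievable with capacity c using the
    items processed so far.
    """
    prev = [0] * (capacity + 1)
    for w, v in zip(weights, profits):
        cur = prev[:]  # skipping the item keeps the previous value
        for c in range(w, capacity + 1):
            # Reads prev[c - w]: an entry w positions away in the previous row.
            cur[c] = max(prev[c], prev[c - w] + v)
        prev = cur
    return prev[capacity]


# Hypothetical books: weights 3, 4, 5 with profits 4, 5, 6, bag capacity 7.
print(knapsack([3, 4, 5], [4, 5, 6], 7))
```

In a parallel version, the inner loop over c is what gets split among processes, and the read of prev[c - w] is the remote access whose cost is worth measuring.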
Your group should put together a web page describing the versions
of your codes, performance aspects, and language effects. Mail me a URL to a tar
file containing the page, the programs, and any necessary or
useful additional information.
The NAS benchmarks are available through NASA. Their
reports contain some useful information on different optimizations
and platforms.
Local mirrors of:
UPC at LBL, UPC Community Pages, more examples
If you receive errors about too little memory, try setting the
UPC_SHARED_HEAP_SIZE environment variable to a moderately large
value. For example, export UPC_SHARED_HEAP_SIZE=256MB in bash will
use 256 MiB of memory for the shared heap. The UPC runtime also
understands the GB suffix.
Using UPC on CITRIS:
- Add /home/cs/ejr/cs267/ia64/bin and
/usr/mill/bin to your PATH.
- Compile programs with upcc -network=mpi or upcc
-network=smp -pthreads. See upcc --help for more
options.
- NOTE: This uses a UPC-to-C translator at NERSC.
Network problems will cause interruptions.
- Run programs with upcrun. For example, upcrun -n
4 ./knap-upc will run knap-upc on four processors.
If it's a pthreads executable, it will run on the local machine.
Using UPC on Seaborg and Alvarez:
- First, module load upc.
- Compile with upcc (for LAPI), upcc
-network=mpi (may not work), or upcc -network=smp
-pthreads.
- Run with upcrun as on CITRIS. WARNING: By
default, UPC jobs run with one task per node. To use multiple
tasks per node, use the -t option to upcrun
to see the poe command executed, copy it, and change
the -tasks_per_node option. (You can strip off the
retry option, etc.)
- On Alvarez, the command qsub -I -lnodes=4 will get you an interactive session with exclusive use of four nodes. You can use this for tuning, although I recommend using the normal batch queue for final timing runs.
Titanium
Using Titanium on CITRIS:
There are two Titanium installations available. One uses gcc,
the other ecc. In both cases, the compiler is tcbuild.
- For gcc, add
/project/cs/titanium/b/temp/citris/dist/bin to your
PATH.
- For ecc, add
/project/cs/titanium/b/temp/citris/dist-2.334-ecc/bin to your
PATH.
You'll want to use the udp-cluster-smp backend, so compile with
tcbuild --backend=udp-cluster-smp
Titanium jobs are run with tcrun. The --help
option lists options, and the --show option displays the
command to be executed (e.g. mpirun -np 4
yourfile).
Myrinet (the gm-* backends) can function, but Myrinet programs will only run on nodes c17-c32, and even then only if no one else has taken all the Myrinet ports. Use Alvarez for Myrinet cluster timings.
Using Titanium on Seaborg and Alvarez:
- Set up the environment with module load titanium.
- Compile with tcbuild, run with tcrun.
WARNING: Be sure to check the output of tcrun
--show to see how many tasks per node will be used...
- On Alvarez, the command qsub -I -lnodes=4 will get you an interactive session with exclusive use of four nodes. You can use this for tuning, although I recommend using the normal batch queue for final timing runs.
... What else would you like to know?
E. Jason Riedy
ejr@cs.berkeley.edu