CS267 Assignment 3
Spring 2004

Modify and tune code in global address space languages.

Due Date: 2 April, 2004

Goal

Languages like UPC and Titanium are relatively new. This assignment's goal is to expose you to these languages and the global address space programming paradigm.

Requirements

Each enrolled student should be assigned to a group. Auditors can ask to join an existing group or form groups of their own to tackle a problem.

For your group's problem, you will need to perform and report on the following tasks:

  1. Create initial implementations in UPC and Titanium.
  2. Quantify the performance of the initial code and identify performance problems.
  3. Change the code to address some of the performance problems.
  4. Iterate until you're happy with the code or the assignment is due.

Remember the typical parallel performance curve: most initial versions perform poorly, and improvements are incremental. Also expect surprising performance issues from the UPC and Titanium implementations themselves; few people have experience with these languages.

We also want to know how the languages affect your designs and performance. (We know reductions are missing from UPC.) What would you do differently in an MPI or OpenMP code? Which MPI, OpenMP, pthreads, etc. features do you miss, and which Titanium and UPC features do you enjoy?

Performance debugging can be tricky. You'll want to compare performance between your two implementations as well as with established benchmark codes when they're available. Also, think about modelling both your codes' performance and a "best possible" performance. For irregular problems, you will need to base models on particular problem instances or assumptions about instances.
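As one possible starting point for the modelling suggested above, the sketch below computes speedup and efficiency from measured times and evaluates a simple latency-bandwidth estimate of run time. The model itself, the function names, and every parameter are illustrative assumptions, not part of the assignment; a real model for your problem will likely need more terms.

```python
def speedup_and_efficiency(t_serial, t_parallel, procs):
    """Return (speedup, parallel efficiency) from two measured run times."""
    speedup = t_serial / t_parallel
    return speedup, speedup / procs


def modeled_time(work_seconds, procs, messages, bytes_moved,
                 latency, seconds_per_byte):
    """Assumed model: perfectly divided work plus per-process communication.

    messages and bytes_moved are counts for a single process;
    latency (seconds per message) and seconds_per_byte are machine
    parameters you would have to measure yourself.
    """
    compute = work_seconds / procs
    communicate = messages * latency + bytes_moved * seconds_per_byte
    return compute + communicate


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the calls.
    print(speedup_and_efficiency(8.0, 2.0, 8))
    print(modeled_time(8.0, 4, 10, 1000, 1e-3, 1e-4))
```

Comparing the modeled time against measured times for several processor counts shows whether the gap comes from the computation term or the communication term.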

Also, the NAS Parallel Benchmarks are "designed" to run with a power-of-two number of processors. Any time you can remove that restriction, please do. The power-of-two trick allows for some optimizations that aren't applicable to everyday codes.
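One common way to drop the power-of-two assumption is a block distribution that hands the leftover items to the lowest-numbered processes. The sketch below is only an illustration in sequential Python; block_range and owner are hypothetical helper names, not functions from UPC, Titanium, or the NAS codes.

```python
def block_range(n, procs, rank):
    """Return the half-open range [lo, hi) of items owned by rank.

    Works for any procs >= 1; the first (n % procs) ranks get one extra item.
    """
    base, extra = divmod(n, procs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def owner(n, procs, index):
    """Return the rank that owns item index (0 <= index < n)."""
    base, extra = divmod(n, procs)
    cutoff = extra * (base + 1)  # items held by the ranks with one extra item
    if index < cutoff:
        return index // (base + 1)
    return extra + (index - cutoff) // base


if __name__ == "__main__":
    for rank in range(3):
        print(rank, block_range(10, 3, rank))
```

Block sizes differ by at most one item, so the load stays balanced for any processor count.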

Problems

CG
Estimate the smallest eigenvalue of a randomly generated sparse symmetric positive definite (SPD) matrix. The principal operation is a sparse matrix-vector product (q = A * p), which involves gathering elements of p from their home processors. CG also involves multiple reductions. (Problem from the NAS Parallel Benchmarks)
Mark Hoemmen, Armando Solar-Lezama, Fabrizio Bisetti, Guang Yang, Ben Schwarz, Christian Rojas
KNAP
Pack books into a bag to maximize profit. This is a 0-1 knapsack problem solved via dynamic programming; each process requires a block of previous solutions a random distance away. (Problem from David Culler, used in a previous 267 class)
Meling Ngo, Amir Kamel, Yatish Patel, Chen Chang, Frank Gennari, Guillermo Canas
FT
Solve a simple PDE via 3-D fast Fourier transforms. Here the main communication step is an all-to-all transpose. (Problem from the NAS Parallel Benchmarks)
Christian Bell, Rajesh Nishtala, Jeffrey Hammel, Hormozd Gahvari, Jimmy Su, Michael Tung
IS
Rank and sort small integer keys in parallel. The algorithm is a basic bucket sort, and so the interesting piece is merging the buckets across different processors. (Problem from the NAS Parallel Benchmarks)
Dan Adkins, Sonesh Surana, Shariq Rizvi, Omair Kamil, Benjamin Lee, Mikhail Avrekh
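
To make the dependence pattern in the KNAP description concrete, here is a minimal sequential sketch of the standard 0-1 knapsack dynamic program. It is not the class code, and the function names are hypothetical. Each new row reads the previous row at an offset equal to the current book's weight. Assuming the capacity axis is block-distributed across processes, that offset is what places the needed entries a weight-dependent (hence effectively random) distance away.

```python
def knapsack_table(weights, profits, capacity):
    """Return the final DP row: best profit for each capacity 0..capacity."""
    prev = [0] * (capacity + 1)
    for w, p in zip(weights, profits):
        cur = prev[:]  # leaving this book out keeps the previous value
        for c in range(w, capacity + 1):
            # Taking the book reads prev[c - w], w entries to the left.
            cur[c] = max(prev[c], prev[c - w] + p)
        prev = cur
    return prev


def best_profit(weights, profits, capacity):
    """Best total profit that fits within capacity."""
    return knapsack_table(weights, profits, capacity)[capacity]


if __name__ == "__main__":
    print(best_profit([2, 3, 4], [3, 4, 5], 5))
```

In a parallel version, the read of prev[c - w] is the step that may touch another process's block, so it is the natural place to look first when measuring communication.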

Submission

Your group should put together a web page describing the versions of your codes, performance aspects, and language effects. Mail me a URL to a tar file containing the page, the programs, and any necessary or useful additional information.

Resources / Notes

The NAS benchmarks are available through NASA. Their reports contain some useful information on different optimizations and platforms.

Local mirrors of:

UPC

UPC at LBL, UPC Community Pages, more examples

If you receive errors about too little memory, try setting the UPC_SHARED_HEAP_SIZE environment variable to a moderately large value. For example, in bash,

export UPC_SHARED_HEAP_SIZE=256MB

uses 256 MiB of memory for the shared heap. The UPC runtime also understands the GB suffix.

Using UPC on CITRIS:

Using UPC on Seaborg and Alvarez:

Titanium

Titanium

Using Titanium on CITRIS:

There are two Titanium installations available. One uses gcc, the other ecc. In both cases, the compiler is tcbuild.

You'll want to use the udp-cluster-smp backend, so compile with

tcbuild --backend=udp-cluster-smp

Titanium jobs are run with tcrun. Its --help option lists the available options, and its --show option displays the command to be executed (e.g. mpirun -np 4 yourfile).

Myrinet (the gm-* backends) can function, but Myrinet programs will run only on nodes c17-c32, and even then only if no one else has used up all the Myrinet ports. Use Alvarez for Myrinet cluster timings.

Using Titanium on Seaborg and Alvarez:

... What else would you like to know?


Main CS267 page, and the TA's CS267 page

E. Jason Riedy
ejr@cs.berkeley.edu