CS267 Assignment 2: Fishy Code

Due Friday 7 March 2008 at 11:59pm

Background

The Fish Particle Simulation (FPS) is meant to represent an abstraction of natural phenomena that you may find in biology, chemistry, physics, astronomy, etc. and eventually want to model on a parallel machine. We intend to give you a taste of creating parallel algorithms, using parallel programming tools, libraries and compilers, and modelling performance and scalability on this simplified problem.

We give you an implementation of FPS in its simplest form -- essentially a simulation over time of equal-mass particles obeying gravity-like forces. The particles begin either randomly distributed or evenly spaced along a circle. At every time step, the forces exerted on each particle by all of the other particles are summed. Then, using Euler's method, the new positions of the particles are calculated and distributed to all of the processors.
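To make the time step concrete, here is a minimal sketch of an all-pairs force sum followed by a forward Euler update. It is not the provided fish code: the 2D layout, the inverse-square attraction, and the names all_pairs_forces and euler_step are assumptions made for illustration, and the provided code may order the updates differently.

```python
import math

def all_pairs_forces(pos, mass=1.0, g=1.0):
    """Sum an ASSUMED inverse-square attraction on each particle from all others.

    pos is a list of (x, y) tuples. Coincident particles (r == 0) are not handled.
    """
    forces = []
    for i, (xi, yi) in enumerate(pos):
        fx = fy = 0.0
        for j, (xj, yj) in enumerate(pos):
            if i == j:
                continue
            dx = xj - xi
            dy = yj - yi
            r = math.hypot(dx, dy)
            magnitude = g * mass * mass / (r * r)
            # Project the magnitude onto the unit vector pointing from i to j.
            fx += magnitude * dx / r
            fy += magnitude * dy / r
        forces.append((fx, fy))
    return forces

def euler_step(pos, vel, dt, mass=1.0):
    """One forward Euler step: both updates use the state at the start of the step."""
    forces = all_pairs_forces(pos, mass)
    new_pos = [(x + dt * vx, y + dt * vy)
               for (x, y), (vx, vy) in zip(pos, vel)]
    new_vel = [(vx + dt * fx / mass, vy + dt * fy / mass)
               for (vx, vy), (fx, fy) in zip(vel, forces)]
    return new_pos, new_vel
```

The double loop is what makes each step cost O(n^2) force evaluations.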

Details

We are providing fish codes in the following programming models:

The provided code computes the gravity-like interactions between fishes (i.e., particles) in a bounded box (fishes bounce off the walls). You will need to modify the code to compute van der Waals-like interactions, that is, forces with a cutoff. So, instead of the all-to-all interactions in the given code, you will be computing short-range interactions: a particle interacts only with other particles within a cutoff distance. A trivial modification of the given code will compute these short-range forces, but it will still take O(n^2) time, instead of the linear time you can expect when the fishes are uniformly distributed in the bounding box. So, there is a lot that can be done to get better performance.
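One common way to get close to linear time with a cutoff is to bin the fishes into square cells at least as wide as the cutoff, so each fish only needs to be checked against the 3x3 block of cells around it. The sketch below is one possible approach, not the assignment's required design; the function names, the 2D unit box with coordinates in [0, box], and the tuple layout are assumptions.

```python
import math
from collections import defaultdict

def build_cells(pos, cutoff, box=1.0):
    """Bin particle indices into square cells whose side is at least `cutoff`."""
    ncell = max(1, int(box / cutoff))   # cells per side
    size = box / ncell                  # cell side length, >= cutoff
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(pos):
        # Clamp so a particle sitting exactly on the far wall stays in range.
        cx = min(int(x / size), ncell - 1)
        cy = min(int(y / size), ncell - 1)
        cells[(cx, cy)].append(idx)
    return cells

def pairs_within_cutoff(pos, cutoff, box=1.0):
    """Return the set of index pairs (i, j), i < j, closer than `cutoff`."""
    cells = build_cells(pos, cutoff, box)
    found = set()
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get() avoids creating empty cells while we iterate.
                neighbors = cells.get((cx + dx, cy + dy))
                if not neighbors:
                    continue
                for i in members:
                    for j in neighbors:
                        if i < j:
                            r = math.hypot(pos[i][0] - pos[j][0],
                                           pos[i][1] - pos[j][1])
                            if r < cutoff:
                                found.add((i, j))
    return found
```

With a roughly uniform distribution, each cell holds a bounded number of fishes on average, so the total work grows about linearly with n.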


Figure 1: In the programs' current form, all processors have to store all fish.

In the provided MPI implementation, processors store all the fish and calculate all interactions at once, as in Figure 1. This allows you to use a very standard, sequential interaction function, but you lose many of the advantages of computing in parallel. Each processor needs to know about all of the n fish at once, and you end up storing nP fish. This grows linearly with the number of processors, P. An implementation is considered purely scalable if its resource utilization (time and space) does not grow with P. In real life, you often have some P-dependent bookkeeping, so people accept a tiny growth with P as long as the overall speed improves with P. Try to reduce the storage use to some constant times n to make the code more scalable.

A better distribution of fishes could be based on spatial decomposition. This way, if the distribution of fishes is roughly uniform, a processor would need to communicate only with a small number of processors (which would be independent of P) for the force-field calculations. Remember, however, that as fishes move, they might change their processor affinities. So, a mechanism will be needed to transfer fishes to their appropriate processors when needed. You could also switch to a data structure better suited to nearest-neighbor force calculations.
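As a rough illustration of ownership and migration, the sketch below splits the box into vertical strips, one per processor, and reassigns fishes after they move. It runs serially and only simulates what each rank would do; in a real MPI code the reassignment would be sends and receives between neighboring ranks. The strip layout and the names owner_of and migrate are assumptions, not part of the provided code.

```python
def owner_of(x, nprocs, box=1.0):
    """Rank owning x-coordinate `x` when the box is cut into equal vertical strips."""
    rank = int(x / box * nprocs)
    # Clamp so a fish sitting exactly on a wall still has a valid owner.
    return min(max(rank, 0), nprocs - 1)

def migrate(local_fish, nprocs, box=1.0):
    """Reassign fishes to the strips they now occupy.

    local_fish maps rank -> list of (x, y) positions currently held by that rank.
    Returns the updated mapping and the number of fishes that changed owner.
    """
    updated = {rank: [] for rank in range(nprocs)}
    moved = 0
    for rank, fish in local_fish.items():
        for (x, y) in fish:
            dest = owner_of(x, nprocs, box)
            updated[dest].append((x, y))
            if dest != rank:
                moved += 1
    return updated, moved
```

If fishes move less than one strip width per time step, each rank only ever hands fishes to its immediate left and right neighbors.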

Another optimization could be to overlap communication with computation. For example, at the start of a timestep, processors can begin exchanging the positions of fishes within cutoff distance needed for force computations. While the transfers are in progress, a processor can compute the interactions involving fishes owned by it. At the end of the timestep, each processor can finish the transfers and then compute the remaining interactions involving the transferred fishes.
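The ordering described above can be sketched without any MPI calls by splitting the work into the part that needs only local fishes and the part that needs fishes from a neighbor. The example below assumes a strip decomposition in x and only counts interacting pairs; halo and count_close are hypothetical helper names. In an MPI code, the halo lists are what you would post with MPI_Isend and MPI_Irecv before the local phase and complete with MPI_Waitall before the ghost phase.

```python
import math

def halo(fish, lo, hi, cutoff):
    """Fishes within `cutoff` of the strip edges [lo, hi): what the neighbors need."""
    left = [p for p in fish if p[0] - lo < cutoff]
    right = [p for p in fish if hi - p[0] < cutoff]
    return left, right

def count_close(a, b, cutoff, same=False):
    """Count pairs closer than `cutoff`.

    With same=True, counts unordered pairs inside list `a` and ignores `b`.
    Otherwise counts pairs with one fish from `a` and one from `b`.
    """
    total = 0
    for i, p in enumerate(a):
        others = a[i + 1:] if same else b
        for q in others:
            if math.hypot(p[0] - q[0], p[1] - q[1]) < cutoff:
                total += 1
    return total

def overlapped_pair_count(strip0, strip1, mid, cutoff, box=1.0):
    """Pair count for two strips, in the order an overlapped time step would use."""
    # Phase 1: decide what to exchange (the transfers would be started here).
    ghosts_for_0, _ = halo(strip1, mid, box, cutoff)
    # Phase 2: local-only work, done while the transfers are in flight.
    local = (count_close(strip0, strip0, cutoff, same=True)
             + count_close(strip1, strip1, cutoff, same=True))
    # Phase 3: transfers complete; finish the interactions that cross the boundary.
    cross = count_close(strip0, ghosts_for_0, cutoff)
    return local + cross
```

The local phase usually dominates when strips are much wider than the cutoff, which is what gives the communication time to hide behind computation.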

With MPI, you explicitly overlap computation and communication by using asynchronous sends and receives. The asynchronous communication needs an extra holding block, one that'll be communicating while you're computing with the other. This will raise the storage to 3n, so the obvious question is whether this is worthwhile.

Problem

You must work in groups of 3. One person in your group should be a non-CS student (if possible), but otherwise you're responsible for finding a group. The required files are packaged here (tar.gz).

Jacquard and Bassi will be the primary platforms, although debugging on the CITRIS cluster when possible is strongly recommended. You'll get faster response, and you can log into nodes directly and attach with gdb.

Modify the code to use a van der Waals-like force:

F = C (1/r - 1/(40 r^2))  for r < 1/40,  F = 0  for r > 1/40
where C = 1. You might want to change the code to do some sort of spatial decomposition of the problem. Evaluate the performance and accuracy of your simulation.
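A direct transcription of the force law might look like the sketch below. The function name and the choice to treat r exactly equal to the cutoff as zero force are assumptions; the two branches of the formula agree there anyway, since 1/r and 1/(40 r^2) are both 40 at r = 1/40.

```python
CUTOFF = 1.0 / 40.0

def vdw_like_force(r, c=1.0, cutoff=CUTOFF):
    """Magnitude of the cutoff force F = c * (1/r - cutoff / r^2) for r < cutoff.

    With cutoff = 1/40 the second term equals 1/(40 r^2), matching the formula above.
    r must be positive; coincident fishes (r == 0) are not handled.
    """
    if r >= cutoff:
        return 0.0
    return c * (1.0 / r - cutoff / (r * r))
```

Writing the second term as cutoff / r^2 keeps the force continuous at the cutoff if you later experiment with a different cutoff distance.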

In addition to modifying the MPI version, you should also modify the OpenMP version or the pthreads version.

Run and profile the MPI implementation and either the OpenMP or the pthreads version. Are there any obvious bottlenecks? Is the code spending lots of time waiting at synchronization points? Try creating a performance model of your chosen implementations and estimate any machine parameters. Profiling the sequential code might also uncover interesting information.
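One simple way to estimate machine parameters is to time messages of several sizes and fit a straight line, time = alpha + beta * bytes, where alpha approximates latency and beta the inverse bandwidth. This latency/bandwidth model is an assumption chosen for illustration, not something the assignment prescribes, and fit_line is a hypothetical helper. The usage example feeds it synthetic timings generated from known parameters, not measurements.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs.

    Returns (intercept, slope). Needs at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Usage with SYNTHETIC data: replace `times` with your own measured timings.
sizes = [1e3, 1e4, 1e5, 1e6]                  # message sizes in bytes
true_alpha, true_beta = 5e-6, 1e-9            # made-up parameters for the demo
times = [true_alpha + true_beta * m for m in sizes]
alpha, beta = fit_line(sizes, times)
```

Real timings are noisy, so repeat each measurement several times and check that the residuals of the fit are small before trusting the parameters.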

Implement the strategies described under the Details section using MPI and OpenMP or pthreads.

You can use tools like Tau or IPM on Bassi and Jacquard to obtain communication traces to supplement speed-up plots (time vs. number of PEs for a fixed problem). Also, compare performance results between the Bassi and Jacquard clusters. How do the results compare? Note that Bassi nodes have 8 processors in an SMP, while Jacquard nodes have only two.
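For the speed-up plots, the numbers are easy to derive once you have wall-clock times for a fixed problem size. The sketch below computes speed-up T(1)/T(P) and parallel efficiency, speed-up divided by P; the function name and the dictionary layout are assumptions.

```python
def speedup_table(timings):
    """Compute speed-up and efficiency from {num_PEs: seconds} for a fixed problem.

    Requires an entry for 1 PE as the baseline. Returns a list of
    (num_PEs, speedup, efficiency) tuples sorted by num_PEs.
    """
    base = timings[1]
    rows = []
    for pes in sorted(timings):
        speedup = base / timings[pes]
        rows.append((pes, speedup, speedup / pes))
    return rows
```

Efficiency well below 1 at small P usually points at communication or synchronization overhead rather than at the force computation itself.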

Implementation

The fish code contains the following four implementations: sequential, MPI, OpenMP, and pthreads.

The code uses GNU autoconf to detect various settings. To compile the code, use a sequence like

    mkdir build
    cd build
    ../configure
    make
  

This will leave the source untouched and place executables in appropriate directories under build.

Running sequential/fish -h displays all the options the program takes. On Jacquard, you can run an MPI job interactively with

    mpirun -np 4 mpi/fish

On Bassi, you can run an MPI job interactively with

    poe mpi/fish -procs 4

OpenMP uses an environment variable to set the number of threads. For four threads, use

    env OMP_NUM_THREADS=4 openmp/fish

Note that the OpenMP version can only use one node, since it requires shared memory. Thus, on Bassi, the OpenMP version can use at most 8 processors. On both platforms you probably want to consider using the batch system.

Submission

Your group should put together a write-up describing changes you've made to the programs and their effects on performance. Mail me (skamil@cs) a URL to a tar file containing the report, the modified programs, and any necessary or useful additional information.

The primary questions of interest:

Answer the questions I've asked or implied above, and try to explain any interesting effects you see. If you don't see any, explain why not. Explanations that are based on a well-understood system model (PRAM, LogP, etc.) are the most convincing. The page should include appropriate speed-up plots, traces with Vampir or TAU or IPM, or other pretty pictures to justify your conclusions. Can you see the effects of overlapping computation and communication?

The idea is for you to get a feel for the performance issues in parallel programming. That should help you decide which program designs are feasible for your final project and which are not.

Resources / Notes

Take a look at MPI_Sendrecv. It's quite handy. There's no asynchronous version of MPI_Sendrecv. It's a form of collective communication; you could almost implement it using MPI_Allgatherv. For some insane reason, the designers of MPI decided not to include asynchronous, collective communication. IBM has some vendor-specific extensions which may be worth investigating.

The NERSC website has useful MPI documentation and tutorials. The MPI Forum keeps the standards available.

OpenMP documentation is available through both the OpenMP standards group and NERSC.

Lawrence Livermore has a decent pthreads tutorial.


Last updated February 22, 2008