CS267 Assignment 2: Fishy Code

Due Wednesday 2 March 2005

Background

The Fish Particle Simulation (FPS) is meant to represent an abstraction of natural phenomena that you may find in biology, chemistry, physics, astronomy, etc. and eventually want to model on a parallel machine. We intend to give you a taste of creating parallel algorithms, using parallel programming tools, libraries and compilers, and modelling performance and scalability on this simplified problem.

We give you an implementation of FPS in its simplest form -- essentially a simulation over time of equal mass particles obeying gravitational-like forces. The particles begin either randomly distributed or placed equally along a circle. At every time step, the forces on each particle due to all of the other particles are summed up. Then, using Euler's method, the new positions of the particles are calculated and distributed to all of the processors.
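As a rough illustration of one time step, here is a minimal Python sketch of the all-pairs force sum followed by a forward Euler update. It is not the provided fish code: the 2-D layout, unit masses, the constant G, the softening term eps, and the function names are all assumptions made for the example.

```python
import math

def accelerations(pos, G=1.0, eps=1e-9):
    """Sum an inverse-square attraction on each particle from all others.

    Unit masses are assumed, so force and acceleration coincide.
    G and eps are illustrative values, not taken from the fish code.
    """
    acc = []
    for i, (xi, yi) in enumerate(pos):
        ax = ay = 0.0
        for j, (xj, yj) in enumerate(pos):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi
            r2 = dx * dx + dy * dy + eps   # eps guards against r = 0
            r = math.sqrt(r2)
            f = G / r2                     # magnitude of the pull toward j
            ax += f * dx / r
            ay += f * dy / r
        acc.append((ax, ay))
    return acc

def euler_step(pos, vel, dt, G=1.0):
    """Forward Euler: advance positions with the old velocities,
    then velocities with the accelerations at the old positions."""
    acc = accelerations(pos, G)
    new_pos = [(x + dt * vx, y + dt * vy) for (x, y), (vx, vy) in zip(pos, vel)]
    new_vel = [(vx + dt * ax, vy + dt * ay) for (vx, vy), (ax, ay) in zip(vel, acc)]
    return new_pos, new_vel
```

The double loop in accelerations is the O(n^2) part; it is the piece that the rest of this assignment distributes across processors.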

Details

We are providing fish codes in the following programming models: sequential, pthreads, OpenMP, and MPI.

The provided code is not the fastest. While implementing a better algorithm is beyond the scope of this assignment, there's still a good deal which can be done with these codes. For one, communication can be blocked to amortize overhead (the o in LogP). It can also be overlapped with computation, working on the latency and gap (L and g). The P term is mostly subject to budget constraints, and those are certainly outside our scope.


Figure 1: In the programs' current form, all processors have to store all fish.

In the provided MPI implementation, processors store all the fish and calculate all interactions at once, as in Figure 1. This allows you to use a very standard, sequential interaction function, but you lose many of the advantages of computing in parallel. Each processor needs to know about all of the n fish at once, and you end up storing nP fish. This grows linearly with the number of processors, P. An implementation is considered purely scalable if its resource utilization (time and space) does not grow with P. In real life, you often have some P-dependent bookkeeping, so people accept a tiny growth with P as long as the overall speed improves with P.


Figure 2: After the first modification, processors only store small blocks of fish, sending and receiving individual fish when necessary.

If we can reduce the storage use to some reasonable constant times n, we'll be much closer to achieving scalability. The constant 2 is reasonable, and not terribly difficult to reach. Figure 2 shows one way to do it. Each processor holds two blocks of b fish each. One block stores the local fish, those fish the processor is responsible for updating. The other block is used to hold the fish interacting with the local-fish block.
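To make the storage arithmetic concrete, here is a small Python sketch comparing the total number of fish stored across all processors under the replicated scheme of Figure 1 and the two-block scheme of Figure 2. The function names are made up for this example, and it assumes blocks are rounded up to ceil(n / P) fish so every processor has the same block size.

```python
def replicated_storage(n, P):
    """Every processor keeps all n fish: n * P fish in total."""
    return n * P

def blocked_storage(n, P, blocks_per_proc=2):
    """Each processor keeps blocks_per_proc blocks of b = ceil(n / P) fish."""
    b = -(-n // P)  # integer ceiling of n / P
    return blocks_per_proc * b * P
```

When P divides n, blocked_storage(n, P) is exactly 2n regardless of P, while replicated_storage(n, P) keeps growing with P.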

Each processor computes the interaction between each local fish and the fish directly across from it in the holding block. Then the held fish are rotated by one fish. From processor two's viewpoint (P2), the fish that falls off the left end (in this illustration) is sent to P1, and the fish to be brought in from the right is received from P3. The cycle repeats until all fish have experienced all the necessary forces.
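The rotation scheme can be checked without any MPI at all. The following Python sketch simulates the ring serially, with plain lists standing in for each processor's memory and list operations standing in for the sends and receives. Fish are just integer ids and the function name is invented for the example; it only records which pairs meet, not the forces.

```python
def ring_interactions(P, b):
    """Simulate P processors, each with b local fish and b held fish.

    Returns a dict mapping each fish id to the set of other fish
    it has interacted with after a full trip around the ring.
    """
    n = P * b
    local = [[p * b + i for i in range(b)] for p in range(P)]
    held = [block[:] for block in local]      # holding blocks start as copies
    seen = {f: set() for f in range(n)}

    for _ in range(n):
        # Interaction phase: each local fish meets the fish across from it.
        for p in range(P):
            for mine, other in zip(local[p], held[p]):
                if mine != other:             # a fish exerts no force on itself
                    seen[mine].add(other)
        # Rotation phase: the leftmost held fish leaves every processor...
        outgoing = [held[p].pop(0) for p in range(P)]
        # ...and each processor appends the fish sent by its right neighbor.
        for p in range(P):
            held[p].append(outgoing[(p + 1) % P])
    return seen
```

After n rotations every local fish has faced every other fish exactly once, which is the property the real MPI version has to preserve.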


Figure 3: The next modification leads to fully blocked communication; the processors send / receive whole blocks of fish to / from their neighbors.

Rotating one fish at a time results in lots of small messages, which isn't terribly efficient with most parallel platforms. (Why?) Figure 3 rotates a larger block of fish each time, amortizing the overhead over an entire block of fish.
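One way to reason about the block size is a simple two-parameter cost model: every message pays a fixed cost, and every fish in it pays a per-fish cost. The Python sketch below uses that model, which is a simplification and not full LogP; the parameter names and the function are assumptions for illustration, and the real costs have to be measured on your machine.

```python
def rotation_time(n, b, per_message, per_fish):
    """Modeled time for one processor to cycle all n fish through its
    holding block in messages of b fish each.

    per_message: fixed cost paid once per message (overhead plus latency)
    per_fish:    incremental cost of each fish in a message
    """
    messages = -(-n // b)  # integer ceiling of n / b
    return messages * per_message + n * per_fish
```

In this model the per-fish term does not depend on b at all; only the per-message term shrinks as the blocks grow.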


Figure 4: With two processors, the implementations so far will have activity diagrams like the one on the left, alternating phases of interaction and rotation. If the phases are completely overlapped, as on the right, twice as many interactions are computed in the same time.

Up to this point, computation (interaction) and communication (rotation) happened in separate phases, depicted in the left diagram of Figure 4. By overlapping these two phases, we can make use of idle CPU / network resources, as seen in the right diagram of Figure 4.

With MPI, you explicitly overlap computation and communication by using asynchronous sends and receives. The asynchronous communication needs an extra holding block, one that'll be communicating while you're computing with the other. This will raise the storage to 3n, so the obvious question is whether this is worthwhile.
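The shape of the overlapped loop can be sketched in Python with a background thread standing in for the asynchronous receive. This is only an analogy: in the MPI version the roles of submit and result would be played by calls such as MPI_Irecv and MPI_Wait, and the names process_blocks_overlapped, fetch_block, and interact are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def process_blocks_overlapped(num_blocks, fetch_block, interact):
    """Compute on one holding block while the next one is in flight.

    fetch_block(step) stands in for receiving the block for that step;
    interact(block) stands in for the force computation on it.
    """
    if num_blocks == 0:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        current = fetch_block(0)           # nothing to overlap with yet
        for step in range(1, num_blocks):
            incoming = pool.submit(fetch_block, step)  # start the transfer
            interact(current)              # compute while it is in flight
            current = incoming.result()    # wait, then swap holding blocks
        interact(current)                  # last block has no successor
```

Note that current and the block being filled by the pending fetch are alive at the same time, which is where the extra holding block comes from.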

Finding the answer will require looking at the effects of the block size. If you implemented the rotation mechanism in a general fashion, you'll have more data to use. Rotating by b requires transmitting and receiving a block of b fish. You can draw a few conclusions with only two data points (b = 1 or the number of local fish). Can you explain the results in terms of a reasonable model (like LogP)?
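With only two data points you can still back out two machine parameters, provided you commit to a two-parameter model. The Python sketch below assumes the rotation time for block size b is (n / b) * per_message + n * per_fish and solves for both unknowns from two timings. The model and the function name are assumptions for illustration; whether the fitted numbers mean anything depends on how well that model describes your platform.

```python
def fit_message_costs(n, b1, t1, b2, t2):
    """Solve t = (n / b) * per_message + n * per_fish for the two costs,
    given measured rotation times t1 and t2 at block sizes b1 != b2."""
    m1, m2 = n / b1, n / b2                  # message counts at each block size
    per_message = (t1 - t2) / (m1 - m2)
    per_fish = (t1 - m1 * per_message) / n
    return per_message, per_fish
```

More block sizes let you check whether the measured times actually fall on the curve this model predicts.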

Problem

You may work in groups of 2 or 3. One person in your group should be a non-CS student (if possible), but otherwise you're responsible for finding a group. The required files are packaged here (tar.gz).

Seaborg will be the primary platform, although I recommend debugging on the CITRIS cluster when possible. You'll get faster response, and you can log into nodes directly and attach with gdb. A somewhat portable Python + Tk viewer is provided so you can watch the fishes swim (e.g. fish -o filename.out; aquarium filename.out).

Run and profile the performance of the MPI implementation and either the pthreads or the OpenMP version. Are there any obvious bottlenecks? Is the code spending lots of time waiting at synchronization points? Try creating a performance model of your chosen implementations and estimate any machine parameters. Profiling the sequential code might also uncover interesting information.

Implement the strategies described under the Details section using MPI. Note the asynchronous communication needs an extra holding block (why?), requiring extra memory, so the obvious question is whether it's worthwhile.

If you have time, modify the code to use a van der Waals-like force:

F = C (2 - 1/(60 r))  for r < 1/60,  F = C / (60 r)^5  for r > 1/60
where C = 0.1. Try changing the MPI code to do some sort of spatial decomposition of the problem. Evaluate the performance and accuracy of your simulation.
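A direct transcription of the piecewise force into Python might look like the sketch below. It reads the short-range term as 1/(60 r), an assumption that makes the two branches agree at r = 1/60; it also assigns r = 1/60 itself to the long-range branch, which the formula leaves open. The function name is invented, and the sign convention (positive attracts, negative repels) is likewise an assumption.

```python
def vdw_like_force(r, C=0.1):
    """Piecewise van der Waals-like force magnitude for separation r > 0.

    Assumes the short-range term is 1 / (60 * r); under that reading the
    force changes sign at r = 1/120 and diverges as r goes to 0.
    """
    cutoff = 1.0 / 60.0
    if r < cutoff:
        return C * (2.0 - 1.0 / (60.0 * r))
    return C / (60.0 * r) ** 5
```

Because the long-range branch falls off like r^-5, fish far apart barely interact, which is what makes a spatial decomposition attractive here.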

You can use tools like Vampir on Seaborg to obtain communication traces to supplement speed-up plots (time v. number of PEs for a fixed problem). Also, compare performance results between Seaborg and the CITRIS cluster. How do the results compare? (Note: There's no Vampir on CITRIS. If you find a good tool, let me know.)
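For the speed-up plots, the usual quantities are speed-up S(P) = T(1) / T(P) and parallel efficiency S(P) / P. A minimal Python sketch for turning your own timings into those numbers follows; the function name and the dictionary layout are just one possible choice.

```python
def speedup_and_efficiency(times):
    """times maps processor count -> wall-clock time for a fixed problem.

    Requires an entry for one processor. Returns a dict mapping each
    processor count to a (speedup, efficiency) pair.
    """
    t1 = times[1]
    return {p: (t1 / t, t1 / (t * p)) for p, t in sorted(times.items())}
```

An efficiency that drops quickly as P grows is a hint to look at the communication traces.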

Implementation

The fish code contains the following four implementations: sequential, pthreads, OpenMP, and MPI.

The code uses GNU autoconf to detect various settings. To compile the code, use a sequence like

    mkdir build
    cd build
    ../configure
    make
  

This will leave the source untouched and place executables in appropriate directories under build. On Seaborg, load a few modules first: module load vampir vampirtrace. This enables building the VAMPIR-enabled MPI fish executable, mpi/fish_vt.

sequential/fish -h will display all the options the program takes. The pthreads implementation takes an additional -p option to specify the number of threads. On CITRIS, you can run an MPI job interactively with

    mpirun -np 4 mpi/fish

On Seaborg, you can run an MPI job interactively with

    poe mpi/fish -procs 4

OpenMP uses an environment variable to set the number of threads. For four threads, use

    env OMP_NUM_THREADS=4 openmp/fish

Note that the OpenMP version can only use one node, since it requires shared memory. On both platforms, you probably want to consider using the batch system. See CITRIS Essentials and Seaborg Essentials for more information.

On Seaborg, the mpi/fish_vt executable produces a VAMPIR trace in fish_vt.bvt. You can display it with

    vampir fish_vt.bvt

There is a Python script that allows you to see the "fish" visually. You can do something like

    mpirun -np 4 build/mpi/fish -o fish.out
    ./aquarium fish.out
  
Since neither Seaborg nor CITRIS has the Tkinter Python module installed, you can use the Python I installed myself on CITRIS:

    /home/eecs/yozo/opt/cs267-sp05/ia64-unknown-linux-gnu/python-2.3.4/bin

Submission

Your group should put together a write-up describing changes you've made to the programs and their effects on performance. Mail me a URL to a tar file containing the report, the modified programs, and any necessary or useful additional information.

The primary questions of interest:

Answer the questions I've asked or implied above, and try to explain any interesting effects you see. If you don't see any, explain why not. Explanations that are based on a well-understood system model (PRAM, LogP, etc.) are the most convincing. The page should include appropriate speed-up plots, traces with Vampir or TAU, or other pretty pictures to justify your conclusions. Can you see the effects of overlapping computation and communication? (It may not take as many fish as you'd expect; that's one reason why we're sticking with the O(n^2) algorithm.)

The idea is for you to get a feel for the performance issues in parallel programming. That should help you decide which program designs are feasible for your final project and which are not.

Resources / Notes

For rotating the fish in MPI, take a close look at MPI_Sendrecv. It's quite handy. Also, you can insert dummy fish to keep the blocks the same size on every node. You just need to make sure that the dummy fish do not affect the regular (smarter?) fish. There's no asynchronous version of MPI_Sendrecv. It's a form of collective communication; you could almost implement it using MPI_Allgatherv. For some insane reason, the designers of MPI decided not to include asynchronous, collective communication. IBM has some vendor-specific extensions which may be worth investigating.
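The dummy-fish padding mentioned above can be sketched in a few lines of Python. Here None marks a dummy slot and the interaction loop simply skips it; in the real code you would more likely carry a flag in the fish record. The function names are hypothetical and not part of the provided code.

```python
def split_into_blocks(fish, P):
    """Split a list of fish into P equal-sized blocks, padding the tail
    with None so every processor holds ceil(len(fish) / P) slots."""
    b = -(-len(fish) // P)  # integer ceiling
    blocks = []
    for p in range(P):
        block = fish[p * b:(p + 1) * b]
        block = block + [None] * (b - len(block))   # dummy fish
        blocks.append(block)
    return blocks

def real_pairs(local_block, held_block):
    """Yield only the (local, held) pairs in which both fish are real."""
    for mine, other in zip(local_block, held_block):
        if mine is not None and other is not None:
            yield mine, other
```

With equal-sized blocks, every exchange in the ring sends and receives the same number of slots, so the message sizes never need to be negotiated.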

Seaborg has copious documentation on-line, including MPI documentation and tutorials. The MPI Forum keeps the standards available.

There is pthread information available all over. Just hit Google or search NERSC's documentation. OpenMP documentation is available through both the OpenMP standards group and NERSC.


Last updated February 10, 2005