CS267 Assignment 2
Spring 2004

Optimize some fishy code.

Due Date: Monday, 8 March, 2004

Changes

Background

The Fish Particle Simulation (FPS) is meant to represent an abstraction of natural phenomena that you may find in biology, chemistry, physics, astronomy, etc. and eventually want to model on a parallel machine. We intend to give you a taste of creating parallel algorithms, using parallel programming tools, libraries and compilers, and modelling performance and scalability on this simplified problem.

We give you an implementation of FPS in its simplest form -- essentially a simulation over time of equal mass particles obeying gravitational-like forces. The particles begin either randomly distributed or placed equally along a circle. At every time step, the forces on each particle due to all of the other particles are summed up. Then, using Euler's method, the new positions of the particles are calculated and distributed to all of the processors.
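As a rough illustration of one time step, here is a minimal sequential sketch in Python. It is not the provided fish code: the 2-D layout, the unit masses, the constant g, the softening term eps, and the function names are all assumptions made for the example.

```python
import math

def accelerations(pos, g=1.0, eps=1e-3):
    """Sum the gravity-like pull on each fish from all others: O(n^2) work.

    g (force constant) and eps (softening, avoids division by zero)
    are assumed values, not taken from the assignment code.
    """
    n = len(pos)
    acc = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r2 = dx * dx + dy * dy + eps * eps
            inv_r3 = g / (r2 * math.sqrt(r2))
            acc[i][0] += dx * inv_r3
            acc[i][1] += dy * inv_r3
    return acc

def euler_step(pos, vel, dt):
    """Forward Euler: advance positions with the old velocities,
    and velocities with the accelerations at the old positions."""
    acc = accelerations(pos)
    new_pos = [[p[0] + dt * v[0], p[1] + dt * v[1]] for p, v in zip(pos, vel)]
    new_vel = [[v[0] + dt * a[0], v[1] + dt * a[1]] for v, a in zip(vel, acc)]
    return new_pos, new_vel
```

The double loop in accelerations is the part that the rest of this assignment distributes across processors.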

Details

You may work in groups of up to 3. Again, one person in your group should be a non-CS student, but otherwise you're responsible for finding a group. The required files are packaged here (tar.gz). See the README for building instructions.

Run, tune, and profile performance of the sample Fish implementations. We are providing fish codes in the following programming models:

For this assignment, you'll be focusing on the MPI implementation. Seaborg will be the primary platform, although I recommend debugging on the CITRIS cluster when possible. You'll get faster response, and you can log into nodes directly and attach with gdb. A somewhat portable Python + Tk viewer is provided so you can watch the fishes swim (e.g. fish -o filename.out; aquarium filename.out).

The provided code is not the fastest. While implementing a better algorithm is beyond the scope of this assignment, there's still a good deal which can be done with these codes. For one, communication can be blocked to amortize overhead (the o in LogP). It can also be overlapped with computation, working on the latency and gap (L and g). The P term is mostly subject to budget constraints, and those are certainly outside our scope.

In the provided MPI implementation, processors store all the fish and calculate all interactions at once, as in Figure 1. This allows you to use a very standard, sequential interaction function, but you lose many of the advantages of computing in parallel. Each processor needs to know about all of the n fish at once, and you end up storing nP fish. This grows linearly with the number of processors, P. An implementation is only considered purely scalable if its resource utilization (time and space) does not grow with P. In real life, you often have some P-dependent bookkeeping, so people accept a tiny growth with P as long as the overall speed improves with P.

Figure 1: In the programs' current form, all processors have to store all fish.

If we can reduce the storage use to some reasonable constant times n, we'll be much closer to achieving scalability. The constant 2 is reasonable, and not terribly difficult to reach. Figure 2 shows one way to do it. Each processor holds two blocks of b fish each. One block stores the local fish, those fish the processor is responsible for updating. The other block is used to hold the fish interacting with the local-fish block.
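To make the storage argument concrete, here is a small counting sketch. The function name and the scheme labels are made up for this example, and it assumes each processor's block size is n / P rounded up.

```python
def fish_stored(n, num_procs, scheme):
    """Total fish slots summed over all processors for n fish.

    'replicated' is the provided layout (every processor holds all n fish);
    'two-block' is one local block plus one holding block per processor.
    """
    b = -(-n // num_procs)  # block size, rounded up
    if scheme == "replicated":
        return n * num_procs
    if scheme == "two-block":
        return 2 * b * num_procs
    raise ValueError("unknown scheme: " + scheme)

# With 1024 fish on 8 processors the replicated layout stores 8n slots,
# while the two-block layout stores 2n.
print(fish_stored(1024, 8, "replicated"), fish_stored(1024, 8, "two-block"))
```

The replicated total grows with the processor count; the two-block total stays at 2n whenever the processor count divides n.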

Figure 2: After the first modification, processors only store small blocks of fish, sending and receiving individual fish when necessary.

Each processor computes the interaction between each local fish and the fish directly across from it in the holding block. Then the held fish are rotated by one fish. From processor two's viewpoint (P2), the fish that falls off the left end (in this illustration) is sent to P1. The fish to be brought in from the right is received from P3. The cycle repeats until all fish have experienced all the necessary forces.
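The rotation can be checked with a sequential simulation before any MPI is written. The sketch below only tracks which fish meet which; it assumes the processors form a ring (the first processor sends to the last), and the function and variable names are invented for the example.

```python
def rotate_one_fish(num_procs, b):
    """Simulate fish-at-a-time rotation with b fish per processor.

    Returns (seen, sends): seen[f] is the set of fish that local fish f
    interacted with, and sends is the number of single-fish messages
    each processor sent.
    """
    n = num_procs * b
    local = [list(range(p * b, (p + 1) * b)) for p in range(num_procs)]
    held = [block[:] for block in local]  # holding block starts as a copy
    seen = {fish: set() for fish in range(n)}
    sends = 0
    for _ in range(n):
        # Interact each local fish with the held fish directly across from it.
        for p in range(num_procs):
            for mine, other in zip(local[p], held[p]):
                if mine != other:  # a fish exerts no force on itself
                    seen[mine].add(other)
        # The leftmost held fish on processor p goes to processor p - 1;
        # processor p receives a fish from processor p + 1 on the right.
        outgoing = [held[p].pop(0) for p in range(num_procs)]
        for p in range(num_procs):
            held[p].append(outgoing[(p + 1) % num_procs])
        sends += 1
    return seen, sends
```

After n rotations every local fish has seen the other n - 1 fish, at the price of n single-fish messages per processor.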

Figure 3: The next modification leads to fully blocked communication; the processors send / receive whole blocks of fish to / from their neighbors.

Communicating one fish at a time isn't terribly efficient with most parallel platforms, just like computing one entry at a time in a matrix product. The overhead adds up quickly. Figure 3 expands the communication to the entire held block at once, amortizing the overhead over an entire block of fish. You can think of this as rotating the array by b rather than one. You can also implement it that way and have a unified view of both of these. Mentioning implementation brings me to the point, finally.
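A sequential sketch of the fully blocked rotation, again tracking only who meets whom. It assumes a ring of processors and equal block sizes; the names are invented for the example, and in real MPI code each rotation would be one message exchange per processor rather than a list shuffle.

```python
def rotate_blocks(num_procs, b):
    """Simulate rotating whole blocks of b fish around a ring.

    Returns (seen, sends): seen[f] is the set of fish that local fish f
    interacted with, and sends is the number of block-sized messages
    each processor sent.
    """
    n = num_procs * b
    local = [list(range(p * b, (p + 1) * b)) for p in range(num_procs)]
    held = [block[:] for block in local]
    seen = {fish: set() for fish in range(n)}
    sends = 0
    for _ in range(num_procs):
        # Every local fish interacts with every fish in the held block.
        for p in range(num_procs):
            for mine in local[p]:
                for other in held[p]:
                    if mine != other:
                        seen[mine].add(other)
        # Pass the whole held block one processor to the left.
        held = [held[(p + 1) % num_procs] for p in range(num_procs)]
        sends += 1
    return seen, sends
```

The same pairs are covered as with fish-at-a-time rotation, but each processor sends P messages of b fish instead of n messages of one fish.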

Problem

You need to implement the above strategies using MPI. Additionally, use the blocking to overlap the shifting communication with the interaction computations. The fish-at-a-time rotation accumulated the most overhead, while placing all the fish on every processor potentially had the least (although it loses on scalability). You now need to overlap computation and communication, as depicted in Figure 4.

Figure 4: With two processors, the implementations so far will have activity diagrams like the one on the left, alternating phases of interaction and rotation. If the phases are completely overlapped, as on the right, twice as many interactions are computed in the same time.

With MPI, you explicitly overlap computation and communication by using asynchronous sends and receives. The asynchronous communication needs an extra holding block, one that'll be communicating while you're computing with the other. This will raise the storage to 3n, so the obvious question is whether it's worthwhile.
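An idealized timing model of the two activity diagrams can help frame that question. This is a sketch, not a measurement: it assumes every step does one block of interactions and one block transfer, that both take a fixed time, and that overlap is perfect. The function name and parameters are invented for the example.

```python
def ring_time(steps, t_compute, t_comm, overlap):
    """Estimated time for `steps` rounds of interact-then-rotate.

    Without overlap the phases alternate, so their times add.
    With perfect overlap each round costs only the slower phase.
    """
    if overlap:
        return steps * max(t_compute, t_comm)
    return steps * (t_compute + t_comm)

# When the two phases take equal time, overlapping halves the total,
# which is the "twice as many interactions in the same time" case.
print(ring_time(8, 1.0, 1.0, overlap=False), ring_time(8, 1.0, 1.0, overlap=True))
```

If one phase dominates, the model predicts a much smaller gain, so the extra holding block pays off most when computation and communication times are comparable.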

Finding the answer will require looking at the effects of the block size. If you implemented the rotation mechanism in a general fashion, you'll have more data to use. Rotating by b requires transmitting and receiving a block of b fish. You can draw a few conclusions with only two data points (b = 1 or the number of local fish). Can you explain the results in terms of a reasonable model (like LogP)?
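One way to start on a model: charge each message a fixed cost plus a per-fish cost, and count messages as a function of b. The sketch below is an assumption-laden simplification, closer to LogGP than to plain LogP, since it adds a per-fish transfer time for long messages. The parameter values in the usage line are placeholders, not measured machine constants.

```python
def comm_time(n, b, o, L, per_fish):
    """Rough communication time for one processor to see all n fish
    pass through its holding block in messages of b fish each.

    o        : send or receive overhead per message (paid at both ends)
    L        : network latency per message
    per_fish : transfer time per fish inside a message (assumed parameter)
    """
    messages = -(-n // b)  # ceil(n / b); a partial last block is charged as full
    return messages * (2 * o + L + b * per_fish)

# Placeholder constants: compare b = 1 with b = 128 for n = 1024.
print(comm_time(1024, 1, o=2, L=5, per_fish=1), comm_time(1024, 128, o=2, L=5, per_fish=1))
```

The per-fish term totals n * per_fish regardless of b; only the (2o + L) term shrinks as b grows. Fitting o, L, and per_fish to your own timings is one way to test whether the model explains what you measure.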

You can use tools like Vampir on Seaborg to obtain communication traces to supplement speed-up plots (time v. number of PEs for a fixed problem). Also, compare performance results between Seaborg and the CITRIS cluster. How do the results compare? (Note: There's no Vampir on CITRIS. If you find a good tool, post it to the newsgroup.)
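For the speed-up plots, the arithmetic is simple enough to script. A minimal sketch, assuming you have wall-clock times for a fixed problem size keyed by processor count, including a one-processor run; the function name is invented and the timings in the usage lines are placeholders, not results from Seaborg or CITRIS.

```python
def speedup_table(times):
    """times maps processor count -> wall-clock seconds for a fixed problem.

    Returns a list of (p, speedup, efficiency) rows, where speedup is
    T(1) / T(p) and efficiency is speedup / p.
    """
    t1 = times[1]
    rows = []
    for p in sorted(times):
        s = t1 / times[p]
        rows.append((p, s, s / p))
    return rows

# Placeholder timings purely to show the output format.
for p, s, e in speedup_table({1: 8.0, 2: 4.0, 4: 2.5}):
    print(p, round(s, 2), round(e, 2))
```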

Submission

Your group should put together a web page describing changes you've made to the programs and their effects on performance. Mail me a URL to a tar file containing the page, the modified programs, and any necessary or useful additional information.

The primary questions of interest:

Answer the questions I've asked or implied above, and try to explain any interesting effects you see. If you don't see any, explain why not. Explanations that are based on a well-understood system model (PRAM, LogP, etc.) are the most convincing. The page should include appropriate speed-up plots, traces with Vampir or TAU, or other pretty pictures to justify your conclusions. Can you see the effects of overlapping computation and communication? (It may not take as many fish as you'd expect; that's one reason why we're sticking with the O(n^2) algorithm.)

The idea is for you to get a feel for the performance issues in parallel programming. That should help you decide which program designs are feasible for your final project and which are not.

Resources / Notes

Remember, you're a group. You can split up work, and you can also work together.

For rotating the fish in MPI, take a close look at MPI_Sendrecv. It's quite handy. Also, you can insert dummy fish to keep the blocks the same size on every node. You just need to make sure that the dummy fish do not affect the regular (smarter?) fish. There's no asynchronous version of MPI_Sendrecv. It's a form of collective communication; you could almost implement it using MPI_Allgatherv. For some insane reason, the designers of MPI decided not to include asynchronous, collective communication. IBM has some vendor-specific extensions which may be worth investigating.
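One possible way to handle dummy fish, sketched in Python rather than MPI: pad the fish list so it splits evenly, and give the dummies zero mass so any force involving them vanishes. The dictionary layout, the mass field, and the function names are assumptions for the example; the provided code may represent fish differently.

```python
import math

def pad_blocks(fish, num_procs):
    """Split fish into num_procs equal-sized blocks, padding with
    zero-mass dummy fish so every block has the same length."""
    b = -(-len(fish) // num_procs)  # block size, rounded up
    n_dummy = b * num_procs - len(fish)
    padded = fish + [{"x": 0.0, "y": 0.0, "mass": 0.0} for _ in range(n_dummy)]
    return [padded[p * b:(p + 1) * b] for p in range(num_procs)]

def pull(on, by, eps=1e-3):
    """Gravity-like force on fish `on` due to fish `by`.

    The product of the masses scales the force, so a zero-mass dummy
    neither exerts nor feels anything. eps is an assumed softening term
    that keeps the division safe when two fish coincide.
    """
    dx = by["x"] - on["x"]
    dy = by["y"] - on["y"]
    r2 = dx * dx + dy * dy + eps * eps
    scale = on["mass"] * by["mass"] / (r2 * math.sqrt(r2))
    return (dx * scale, dy * scale)
```

With 10 fish on 4 processors this gives four blocks of 3, two of whose slots are dummies. Remember to drop the dummies again before writing output.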

Seaborg has copious documentation on-line, including MPI documentation and tutorials. The MPI Forum keeps the standards available.

There is pthread information available all over. Just hit Google or search NERSC's documentation. OpenMP documentation is available through both the OpenMP standards group and NERSC.

... What else do you want to know?


Main CS267 page, and the TA's CS267 page

E. Jason Riedy
ejr@cs.berkeley.edu