CS267 Assignment 5
Spring 2000

Due Date: 15 March, 2000

Background

First, you should find a new-ish group. This way, we can spread expertise (and project ideas) around the groups...

Consider replacing the gravitational force with a van der Waals-like force. The force acts along the vector between two fish. At distances r >= 1/60.0, the force is attractive and falls off as 1/r^5. At very close distances, the force is repulsive and grows as 1/r. The exact force, with positive values implying attraction, is

F(r) = C * (2.0 - 1 / (60.0 * r))   when r < 1/60.0, and
F(r) = C / (60.0 * r)^5             when r >= 1/60.0, with
C = 0.1.

Figure 1: The force is strongly attractive outside the critical radius of 1/60 and somewhat repulsive within a radius of approximately 0.008.

As Figure 1 shows, the force is effectively zero outside a small radius. For simplicity, assume the force is zero outside a radius of 0.05. Your force code should check the radius and simply return zero when the radius is above 0.05.
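A minimal sketch of the force routine with the cutoff, written in Python for brevity rather than in MPI or UPC. The names `force_magnitude`, `R_CRIT`, and `R_CUT` are made up for illustration, and raising an error at r = 0 is an assumption, since the assignment does not say what to do with coincident fish.

```python
C = 0.1              # force constant from the assignment
R_CRIT = 1.0 / 60.0  # critical radius where the two branches meet
R_CUT = 0.05         # cutoff radius; the force is treated as zero beyond it


def force_magnitude(r):
    """Return F(r). Positive values mean attraction, negative mean repulsion."""
    if r <= 0.0:
        # Assumption: coincident fish are treated as an error here.
        raise ValueError("r must be positive")
    if r > R_CUT:
        return 0.0
    if r < R_CRIT:
        return C * (2.0 - 1.0 / (60.0 * r))
    return C / (60.0 * r) ** 5
```

Both branches give C at r = 1/60, so the force is continuous there, and the inner branch changes sign at r = 1/120 (about 0.008), which matches the repulsive radius in Figure 1.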

The next step of the assignment is to chop the 1.0-by-1.0 domain into 0.05-by-0.05 chunks. Then each of the small chunks need only interact with the neighboring chunks, as in Figure 2. This gives a 20-by-20 array of chunks.

Figure 2: The fish in the dark blue, center chunk can only be affected by fish in that chunk and the eight immediately adjacent, lighter blue chunks.
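The chunking can be sketched as a serial cell list, again in Python and only as an illustration. The helper names (`chunk_of`, `bin_fish`, `candidate_pairs`) are hypothetical, and the sketch assumes the domain does not wrap around at its edges; adjust the neighbor lookup if your fish live on a periodic domain.

```python
from collections import defaultdict

NCHUNKS = 20  # chunks per side: 1.0 / 0.05


def chunk_of(x, y):
    """Map a position in the unit square to its (cx, cy) chunk index."""
    # Clamp so a fish sitting exactly on the upper edge (1.0) stays in range.
    cx = min(int(x * NCHUNKS), NCHUNKS - 1)
    cy = min(int(y * NCHUNKS), NCHUNKS - 1)
    return cx, cy


def bin_fish(positions):
    """Group fish indices by the chunk that contains them."""
    chunks = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        chunks[chunk_of(x, y)].append(i)
    return chunks


def candidate_pairs(positions):
    """Yield each pair (i, j), i < j, whose chunks are the same or adjacent."""
    chunks = bin_fish(positions)
    for (cx, cy), members in chunks.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # Assumption: no wraparound, so off-grid neighbors are empty.
                for j in chunks.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j:
                            yield i, j
```

Only the pairs this generator yields need a distance check and a force evaluation; every other pair is farther apart than the 0.05 cutoff.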

Now there are multiple ways of assigning blocks of these chunks to processors. One is a nice, clean block-cyclic layout. The other is simply to allocate a physically contiguous group of chunks to each processor, which minimizes the surface-area-to-volume ratio of each processor's region. You can implement either. I have a muddling description of a block-cyclic layout available. For comparison, the former layout for two processors is shown in Figure 3.

Figure 3: The chunk allocation and inter-processor interaction region for two processors.
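For concreteness, here is one way the two chunk-to-processor maps could look. This is a sketch using the usual textbook definition of a 2-D block-cyclic layout, not necessarily the one in my description; the processor grid shape (`npx` by `npy`), the `block` parameter, and the function names are all assumptions made for illustration.

```python
NCHUNKS = 20  # chunks per side


def owner_contiguous(cx, cy, npx, npy):
    """Rank owning chunk (cx, cy) when each rank gets one contiguous region."""
    px = cx * npx // NCHUNKS
    py = cy * npy // NCHUNKS
    return px * npy + py


def owner_block_cyclic(cx, cy, npx, npy, block=1):
    """Rank owning chunk (cx, cy) when block-by-block tiles are dealt cyclically."""
    px = (cx // block) % npx
    py = (cy // block) % npy
    return px * npy + py


def chunks_per_rank(owner, npx, npy, **kwargs):
    """Count how many chunks each rank owns under the given owner map."""
    counts = [0] * (npx * npy)
    for cx in range(NCHUNKS):
        for cy in range(NCHUNKS):
            counts[owner(cx, cy, npx, npy, **kwargs)] += 1
    return counts
```

Both maps hand out the same number of chunks when the sizes divide evenly. The difference shows up once the fish are clustered: the cyclic map spreads a dense region over several ranks, at the price of more chunk boundaries shared between ranks.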

If the fish are scattered uniformly across the space, the total computational work should now be O(n) per interaction computation rather than O(n^2). This blocking should also greatly reduce the communication requirements. Not all processors need to know about all fish. If the fish are not distributed uniformly, you will experience load imbalance.

Problem

Implement the above force both with the grid and without in either MPI or UPC. You can implement either a simplistic allocation of chunks to processors or a block-cyclic distribution. If you implement the simplistic allocation, be sure to provide a visual representation of how the 1-by-1 region is split when describing results for a given number of processors. Note: If the number of processors P is not a divisor of twenty, you can either try a more complex implementation that actually tracks the size of each block, or you can simply waste some space and store fish-less dummy chunks to round out the size.
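The dummy-chunk option amounts to rounding the grid dimension up to the next multiple of P. A tiny sketch, assuming the chunks are split among P processors along one dimension; the helper name `padded_chunks` is hypothetical.

```python
def padded_chunks(nchunks, p):
    """Smallest multiple of p that is at least nchunks (ceiling division)."""
    return -(-nchunks // p) * p


# Chunks per side after padding, for a few processor counts.
for p in (3, 4, 8, 16):
    print(p, padded_chunks(20, p))
```

With P = 3 the grid grows from 20 to 21 chunks per side, and with P = 16 it grows to 32, so the wasted space can be substantial for awkward values of P.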

Also, collect statistics on the average load on each processor, both in terms of the number of interactions computed and the volume of communication. The number of communications is also interesting, but that will depend on your implementation. Include it if you wish. You need to demonstrate the effects of load imbalance.
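One common way to summarize these statistics is the ratio of the busiest processor's load to the average load. The sketch below assumes you have already gathered one count per processor (interactions computed, or bytes sent) into a list on a single rank; the name `load_summary` and the choice of max-over-mean as the imbalance measure are illustrative assumptions, not requirements.

```python
def load_summary(per_rank):
    """Summarize per-processor counts; imbalance is max load over mean load."""
    total = sum(per_rank)
    mean = total / len(per_rank)
    peak = max(per_rank)
    # A perfectly balanced run gives 1.0; one rank doing all the work gives P.
    imbalance = peak / mean if mean > 0 else 1.0
    return {"total": total, "mean": mean, "max": peak, "imbalance": imbalance}


# Example with made-up counts for four processors.
print(load_summary([120, 95, 310, 75]))
```

Since every processor waits for the slowest one at each step, an imbalance of 2.0 means roughly half of the aggregate compute time is spent idle.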

Optional, but worthwhile, topics: Feel free to implement this in both systems. There are many interesting comparisons to be drawn. Also, comparing the load imbalance between the block-cyclic distribution and the simplistic distribution would be interesting (compare the costs of load imbalance versus the amount of communication). You can also parameterize the force a bit more by changing the critical radius, the constant, and thus the cutoff radius. This will give you more room to maneuver and may lead to greater insights. For MPI, comparisons between the T3E, the NoW, and the Millennium cluster may be useful. And you can try other ways to assign blocks or chunks to processors and see how the assignments affect the load.

Submission

Your group should put together a web page describing implementations and their performance. Mail me a URL to a tar file containing the page, the programs, and any necessary or useful additional information.

Answer the questions I've asked or implied above, and try to explain any interesting effects you see. If you don't see any, explain why not. Explanations that are based on a well-understood system model (PRAM, LogP, etc.) are the most convincing. The page should include appropriate speed-up plots (up to 64 processors or so), traces with Vampir or TAU (if using MPI), or other pretty pictures to justify your conclusions. Examine and explain any load imbalance. The circle distribution should be helpful for that.

Resources / Notes

For using more than 16 or so processors on the T3E, you'll need to use the batch system. NERSC has a batch tutorial available.

Last year's version may lead you to other interesting topics to discuss.



E. Jason Riedy
ejr@cs.berkeley.edu