Now you need to use the observations from the previous assignment.
Your group needs to improve the code from the previous assignment. The codes are for
You probably have modified / cleaner versions by now.
The provided code is not the fastest. While implementing a better algorithm is beyond the scope of a week-long assignment, there's still a good deal which can be done with these codes. For one, communication can be blocked to amortize overhead (the o in LogP). It can also be overlapped with computation, working on the latency and gap (L and g). The P term is mostly subject to budget constraints, and those are certainly outside our scope.
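To put rough numbers on the amortization argument, here is a minimal cost sketch in C. It assumes each message pays a fixed 2o + L (send overhead, latency, receive overhead) plus a per-fish transfer time. The per-fish term is an assumption beyond plain LogP, the gap g is ignored, and the function name and parameters are hypothetical.

```c
#include <assert.h>

/* Rough time to move n_fish fish in messages of `block` fish each.
 * o, L:     LogP overhead and latency, paid once per message.
 * per_fish: assumed transfer time per fish (not part of plain LogP).
 * The gap g is ignored in this sketch. */
double comm_time(int n_fish, int block, double o, double L, double per_fish)
{
    assert(n_fish >= 0 && block > 0);
    int messages = (n_fish + block - 1) / block;   /* round up */
    return messages * (2.0 * o + L) + n_fish * per_fish;
}
```

With o = 2, L = 5 and per_fish = 0.5 (made-up units), moving 1000 fish one at a time costs 9500 in this model, while blocks of 100 cost 590. Only the per-message term shrinks.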
In the MPI implementation given in the previous assignment, processors store all the fish and calculate all interactions at once, as in Figure 1. This allows you to use a very standard, sequential interaction function, but you lose many of the advantages of computing in parallel. Each processor needs to know about all of the n fish at once, and you end up storing nP fish. This grows linearly with the number of processors, P. An implementation is only considered purely scalable if its resource utilization (time and space) does not grow with P. In real life, you often have some P-dependent bookkeeping, so people accept a tiny growth with P as long as the overall speed improves with P.
Figure 1: In the programs' current form, all processors have to store all fish.
If we can reduce the storage use to some reasonable constant times n, we'll be much closer to achieving scalability. The constant 2 is reasonable, and not terribly difficult to reach. Figure 2 shows one way to do it. Each processor holds two blocks of b fish each. One block stores the local fish, those fish the processor is responsible for updating. The other block is used to hold the fish interacting with the local-fish block.
Figure 2: After the first modification, processors only store small blocks of fish, sending and receiving individual fish when necessary.
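The storage comparison can be checked with a few lines of C. This is only a counting sketch. It assumes b is n / P rounded up, and the function names are hypothetical.

```c
#include <assert.h>

/* Total fish stored when every processor keeps all n fish. */
long replicated_storage(long n, long P)
{
    assert(n >= 0 && P > 0);
    return n * P;
}

/* Total fish stored when each processor keeps one local block and one
 * holding block of b = ceil(n / P) fish each. */
long blocked_storage(long n, long P)
{
    assert(n >= 0 && P > 0);
    long b = (n + P - 1) / P;
    return 2 * b * P;
}
```

For n = 1000 and P = 8 this gives 8000 fish against 2000. The blocked total stays near 2n as P grows and exceeds it only by the padding needed when P does not divide n.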
Each processor computes the interaction between each local fish and the fish directly across from it in the holding block. Then the held fish are rotated by one fish. From processor two's viewpoint (P2), the fish that falls off the left end (in this illustration) is sent to P1, and the fish brought in from the right is received from P3. The cycle repeats until all fish have experienced all the necessary forces.
Figure 3: The next modification leads to fully blocked communication; the processors send / receive whole blocks of fish to / from their neighbors.
Communicating one fish at a time isn't terribly efficient with most parallel platforms, just like computing one entry at a time in a matrix product. The overhead adds up quickly. Figure 3 expands the communication to the entire held block at once, amortizing the overhead over an entire block of fish. You can think of this as rotating the array by b rather than one. You can also implement it that way and have a unified view of both of these. Mentioning implementation brings me to the point, finally.
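One possible unified view of rotate-by-one and rotate-by-b can be tested serially before touching MPI. The sketch below simulates P processors in a single C program, with fish reduced to integer ids. It rests on an assumption that is mine, not the assignment's: with a rotation of `shift` fish, each local fish interacts with `shift` consecutive held slots per step, and `shift` must divide b. The function name is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Serial simulation of P processors, b fish each, rotating the held fish by
 * `shift` per step. Returns 1 if every local fish met every fish exactly
 * once (itself included; a real code must skip the self-interaction).
 * Returns 0 if shift does not divide b or memory runs out. */
int simulate_rotation(int P, int b, int shift)
{
    assert(P > 0 && b > 0 && shift > 0);
    if (b % shift != 0) return 0;

    int n = P * b;
    int *held = malloc((size_t)n * sizeof *held);  /* all holding blocks, end to end */
    int *next = malloc((size_t)n * sizeof *next);
    int *met  = calloc((size_t)n * n, sizeof *met);
    if (!held || !next || !met) { free(held); free(next); free(met); return 0; }

    for (int g = 0; g < n; g++) held[g] = g;       /* holding block starts as a copy of local */

    for (int t = 0; t < n / shift; t++) {
        /* Interaction phase: local fish i sees slots i .. i+shift-1 (mod b). */
        for (int p = 0; p < P; p++)
            for (int i = 0; i < b; i++)
                for (int k = 0; k < shift; k++) {
                    int other = held[p * b + (i + k) % b];
                    met[(p * b + i) * n + other]++;
                }
        /* Rotation phase: fish leave on the left, arrive from the right neighbor. */
        for (int g = 0; g < n; g++) next[g] = held[(g + shift) % n];
        memcpy(held, next, (size_t)n * sizeof *held);
    }

    int ok = 1;
    for (int k = 0; k < n * n; k++) if (met[k] != 1) ok = 0;
    free(held); free(next); free(met);
    return ok;
}
```

Calling `simulate_rotation(4, 3, 1)` models the fish-at-a-time scheme and `simulate_rotation(4, 3, 3)` the fully blocked one. Both should report that every pair met exactly once.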
You need to implement the above strategies using MPI and UPC. The UPC code already spreads the fish across all the processors, but not in contiguous blocks. For UPC, you'll need to allocate the spread array of fish in blocks. To do that, try allocating n_fish / BLOCK_SIZE (rounded up) entries of BLOCK_SIZE * sizeof (fish_t) size. You may also need to change the declaration of the fish array, or else the compiler will stride over it in ways you don't expect. (Or you could go the more straight-forward route and handle blocking yourself.)
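The block arithmetic can be written in plain C (not UPC) as a sanity check. The sketch assumes the layout rule of standard UPC blocked arrays, where element i of an array with block size B has affinity to thread (i / B) mod THREADS. Whether the course's compiler follows that rule exactly is an assumption, and the helper names are hypothetical.

```c
#include <assert.h>

/* Number of blocks needed for n_fish fish, rounded up. */
int num_blocks(int n_fish, int block_size)
{
    assert(n_fish >= 0 && block_size > 0);
    return (n_fish + block_size - 1) / block_size;
}

/* Thread that owns a given fish when blocks are dealt out round-robin. */
int owner_thread(int fish_index, int block_size, int threads)
{
    assert(fish_index >= 0 && block_size > 0 && threads > 0);
    return (fish_index / block_size) % threads;
}
```

With 10 fish and BLOCK_SIZE 4 you need 3 blocks, and on 2 threads the fish at index 9 lands back on thread 0. Comparing these answers with what your UPC code actually touches is a quick way to catch unexpected striding.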
Additionally, you need to experiment further with one of the two implementations, either MPI or UPC. Your group can choose one focus. You can play with both, but you need to go into decent depth about only one.
The additional experimentation revolves around latency. So far, you've played with overhead. The fish-at-a-time rotation accumulated the most overhead, while placing all the fish on every processor potentially had the least (although it loses on scalability). You now need to overlap computation and communication, as depicted in Figure 4.
Figure 4: With two processors, the implementations so far will have activity diagrams like the one on the left, alternating phases of interaction and rotation. If the phases are completely overlapped, as on the right, twice as many interactions are computed.
With the MPI implementation, you need to explicitly overlap computation and communication by using asynchronous sends and receives. You'll need an extra holding block, one that'll be communicating while you're computing with the other. This will raise the storage to 3n, so the obvious question is whether it's worthwhile.
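The buffer bookkeeping for the extra holding block is easy to get wrong, so here is a serial C sketch of it with no MPI calls. Blocks are reduced to integer ids, and the "transfer" is a plain copy standing in for a nonblocking send and receive (MPI_Isend and MPI_Irecv, completed by something like MPI_Waitall). The function name and MAX_P are hypothetical.

```c
#include <assert.h>

#define MAX_P 64   /* arbitrary limit for this sketch */

/* Simulate P ranks, each with two holding buffers. Returns 1 if every rank
 * computed against every block exactly once. */
int simulate_double_buffer(int P)
{
    assert(P > 0 && P <= MAX_P);
    int hold[2][MAX_P];                 /* hold[k][p]: block id in buffer k of rank p */
    int seen[MAX_P][MAX_P] = {{0}};
    int cur = 0;

    for (int p = 0; p < P; p++) hold[cur][p] = p;   /* start with a copy of the local block */

    for (int t = 0; t < P; t++) {
        int nxt = 1 - cur;
        /* "Post" the transfer: receive the right neighbor's current block
         * into the idle buffer. The current buffer is only read. */
        for (int p = 0; p < P; p++) hold[nxt][p] = hold[cur][(p + 1) % P];
        /* Compute with the current buffer while the transfer is in flight. */
        for (int p = 0; p < P; p++) seen[p][hold[cur][p]]++;
        /* "Wait" for the transfer, then swap buffers. */
        cur = nxt;
    }

    for (int p = 0; p < P; p++)
        for (int q = 0; q < P; q++)
            if (seen[p][q] != 1) return 0;
    return 1;
}
```

Because the buffer being computed on is only read, it can also serve as the send buffer. The receive needs its own buffer, which is where the third block per processor comes from. The transfer posted in the last step is never used and can be dropped in a real code.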
Answering will require looking at the effects of the block size. If you implemented the rotation mechanism in a general fashion, you'll have more data to use. Rotating by b requires transmitting and receiving a block of b fish. You can draw a few conclusions with only two data points (b = 1 or the number of local fish). Can you explain the results in terms of a reasonable model (like LogP)?
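As a starting point for such a model, the following C sketch estimates the time one processor spends on all rotations, with and without overlap. It assumes each step performs b * shift interactions and sends one message of `shift` fish, that a message costs a fixed amount (for example 2o + L) plus a per-fish transfer time, and that overlap is ideal. All of these are simplifying assumptions, and the function and parameter names are hypothetical.

```c
#include <assert.h>

/* Estimated time for one processor to complete every rotation step.
 * n: total fish, b: local fish per processor, shift: fish rotated per step.
 * t_pair: time per pairwise interaction.
 * msg_fixed: fixed cost per message, per_fish: transfer time per fish.
 * overlap: nonzero for ideal overlap of computation and communication. */
double rotation_time(int n, int b, int shift, double t_pair,
                     double msg_fixed, double per_fish, int overlap)
{
    assert(n > 0 && b > 0 && shift > 0 && n % shift == 0);
    int steps = n / shift;
    double comp = (double)b * shift * t_pair;
    double comm = msg_fixed + shift * per_fish;
    double step = overlap ? (comp > comm ? comp : comm) : comp + comm;
    return steps * step;
}
```

With n = 8, b = 4, t_pair = 1, msg_fixed = 10 and per_fish = 0.5 (made-up units), rotating by one gives 116 without overlap and 84 with it, while rotating by the whole block gives 56 and 32. Fitting the three cost parameters to your measurements tells you whether the model is reasonable.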
Examine the Vampir traces on the T3E as well as speed-up plots (time v. number of PEs for a fixed problem). Also, compare results with either the NoW or the Millennium cluster. Those systems have much greater message latencies and overheads. How do the results compare? Unless you feel like getting TAU to work on those systems, just use speed-up plots. If you do get TAU to work, remember that you can still use the Vampir visualization tool on the T3E to see the results. The MPI code itself should not need modifications to work on those platforms.
Ok, I wasn't being entirely truthful. The UPC implementation may well be overlapping some computation and communication, assuming the compiler chose relaxed memory semantics. So the first thing you need to do is compare the performance when you force strict or relaxed memory references. (Remember #include <upc-strict.h> and its cousin Darryl.) Is the compiler using strict or relaxed semantics by default?
You also need to determine if hand-blocking produces better results than the normal spread allocation. Why might it?
If your code is still following the style I used, you're pulling data to each processor. Try pushing it instead, moving the fish by storing blocks directly into other processors' memories. How does this change things? You'll probably need to put some thought into where to put the barriers in this case. Will split-phase barriers be effective?
The primary graphic for these explorations will be speed-up plots. You may also want to write and time a few small test programs (microbenchmarks) to get some idea of what to expect. For example, you may want to play with a simple code to copy the contents of an array from node to node. You can vary push v. pull, sizes, etc.
Note that explaining some of these points may lead you into the assumptions gcc 2.7 makes when optimizing code. And some effects may not be noticeable, simply because gcc 2.7 is missing some key optimizations. Remember that the code is available under ~rflucas/upc/src. Finding needles in the gcc haystack is too much to ask of you, but if you're interested, I may be able to help.
Your group should put together a web page describing changes you've made to the programs and their effects on performance. Mail me a URL to a tar file containing the page, the modified programs, and any necessary or useful additional information.
Answer the questions I've asked or implied above, and try to explain any interesting effects you see. If you don't see any, explain why not (this may happen with UPC). Explanations that are based on a well-understood system model (PRAM, LogP, etc.) are the most convincing. The page should include appropriate speed-up plots, traces with Vampir or TAU, or other pretty pictures to justify your conclusions. Can you see the effects of overlapping computation and communication? (It may not take as many fish as you'd expect; that's one reason why we're sticking with the O(n^2) algorithm.)
Again, I'm asking for more than is reasonable in a week. Definitely get the baseline code working in both systems. Many of the modifications will be the same for both MPI and UPC, so that shouldn't take too much effort. For the exploration, take aspects that are interesting to you and flesh them out. Not every group need explore exactly the same things. And if you find interesting things I haven't mentioned, and if you can explain them well enough for others to find them interesting, go for it.
The idea is for you to get a feel for the performance issues in parallel programming. That should help you decide which program designs are feasible for your final project and which are not.
Remember, you're a group. You can split up work, and you can also work together. Not all forms of working together effectively are immediately obvious.
For rotating the fish in MPI, take a close look at MPI_Sendrecv. It's quite handy. Also, you can insert dummy fish to keep the blocks the same size on every node. You just need to make sure that the dummy fish do not affect the regular (smarter?) fish. There's no asynchronous version of MPI_Sendrecv. Although it is a point-to-point call, a ring rotation built from it acts much like collective communication; you could almost implement it using MPI_Allgatherv. For some insane reason, the designers of MPI decided not to include asynchronous, collective communication. The implementors decided not to make their libraries thread-safe. The end product is that you lose access to a good deal of concurrency in your problem... sigh...
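One way to handle the dummy fish is to pad each block with massless entries and skip them in the interaction loop. The C sketch below assumes that real fish always have nonzero mass. The fish_t layout shown here is hypothetical and the one in your code may differ.

```c
#include <assert.h>

/* Hypothetical fish record; the assignment's real fish_t may differ. */
typedef struct { double x, y, vx, vy, mass; } fish_t;

/* Fill slots n_real .. b-1 with massless dummies so every node sends
 * blocks of exactly b fish. Returns the number of dummies added. */
int pad_block(fish_t *block, int n_real, int b)
{
    assert(block != 0 && n_real >= 0 && n_real <= b);
    for (int i = n_real; i < b; i++) {
        fish_t dummy = {0.0, 0.0, 0.0, 0.0, 0.0};
        block[i] = dummy;
    }
    return b - n_real;
}

/* Assumes real fish never have zero mass. */
int is_dummy(const fish_t *f)
{
    return f->mass == 0.0;
}
```

Testing `is_dummy` before computing a force also avoids dividing by a zero distance when two dummies sit at the same position.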
... What else do you want to know?
E. Jason Riedy