Several groups apparently didn't read the handout carefully enough to see what it asked for. In particular, you should have given formulas for "number of calls to collide()", "number of messages", and "number of bytes", each for both version 1 and version 2 of the algorithm. Admittedly, these questions (and others) were well ensconced in the assignment and needed to be searched for, but making one pass through the handout just for that purpose wouldn't have been too hard.
Several people seemed to be confused about what a message is. A message, as we meant it, is a CM-5 hardware message (otherwise, it would have been pointless to ask for both messages and bytes). Remember, a message is not the same thing as an interaction. Assuming the standard active message size (4 words/message), if a particle takes 6 words, you'll need 2 messages per particle if you send the particles one at a time using stores. If you pack particles together, 2 particles (12 words) take 3 messages instead of 4.
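To make the message arithmetic concrete, here is a minimal sketch in Python. The function names are made up for illustration, and the only hardware fact it relies on is the 4 words/message payload mentioned above; it is not the formula the handout expected, just the counting rule behind it.

```python
WORDS_PER_MESSAGE = 4  # payload words per active message (assumption taken from the text)

def ceil_div(a, b):
    """Integer ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)

def messages_unpacked(num_particles, words_per_particle):
    """Each particle is sent on its own, so every particle is rounded up separately."""
    return num_particles * ceil_div(words_per_particle, WORDS_PER_MESSAGE)

def messages_packed(num_particles, words_per_particle):
    """Particles are packed back to back, so only the total is rounded up."""
    return ceil_div(num_particles * words_per_particle, WORDS_PER_MESSAGE)

if __name__ == "__main__":
    for n in (1, 2, 3, 10):
        print(n, messages_unpacked(n, 6), messages_packed(n, 6))
```

With 6-word particles, sending one at a time wastes half of every second message, which is why the packed count grows as 1.5 messages per particle rather than 2.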
There was a large variety of solutions, and many people chose to ignore inter-processor collisions. Although I didn't penalize for this, I think you all should have handled them (it's not that bad if you have a good algorithm). Only one group "found" the algorithm I think is easiest *and* best; a short description of that algorithm follows.
We'll use the terminology in the assignment (so please excuse all the "sub"-ing). Inter-subsubblock interactions are computed by scanning all subsubblocks on each processor. An interaction between two subsubblocks is the usual check for collisions, update, etc. For each subsubblock, we follow an interaction path. The path, relative to the current subsubblock, is: right subsubblock, upper right subsubblock, upper subsubblock, upper left subsubblock (or its symmetric equivalents). These four directions cover exactly half of the eight neighbors, and the other half are covered when those neighbors take their own turn as the current subsubblock. Therefore, if all subsubblocks are scanned, all inter-subsubblock interactions will be found.
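If you want to convince yourself that the four-entry path really covers every neighboring pair exactly once, here is a small Python check. It is only a sketch: the grid is a plain non-periodic rectangle (an assumption; the assignment's domain may wrap), and the names PATH, scan_pairs, and all_neighbor_pairs are invented for this example.

```python
from collections import Counter

# Path offsets (dx, dy) relative to the current subsubblock:
# right, upper right, upper, upper left.
PATH = [(1, 0), (1, 1), (0, 1), (-1, 1)]

def in_grid(x, y, width, height):
    return 0 <= x < width and 0 <= y < height

def scan_pairs(width, height):
    """Scan every cell, follow the path, and count how often each pair is visited."""
    visited = Counter()
    for y in range(height):
        for x in range(width):
            for dx, dy in PATH:
                nx, ny = x + dx, y + dy
                if in_grid(nx, ny, width, height):
                    visited[frozenset({(x, y), (nx, ny)})] += 1
    return visited

def all_neighbor_pairs(width, height):
    """Every unordered pair of cells that touch along an edge or a corner."""
    pairs = set()
    for y in range(height):
        for x in range(width):
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nx, ny = x + dx, y + dy
                    if (dx, dy) != (0, 0) and in_grid(nx, ny, width, height):
                        pairs.add(frozenset({(x, y), (nx, ny)}))
    return pairs

if __name__ == "__main__":
    visited = scan_pairs(4, 3)
    print(set(visited) == all_neighbor_pairs(4, 3))  # every pair is found
    print(max(visited.values()))                     # and none is visited twice
```

The point is that no pair is ever handled twice, so collide() is not called redundantly on the same two subsubblocks.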
If the path goes off processor, a switching algorithm is employed. Here is a rough sketch of the switching algorithm.
=========
current_subsubblock = the current subsubblock to operate on
tmp_subsubblock = a temporary unused subsubblock buffer
for all entries in path
    if path is off processor
        swapWithNeighborProc(current_subsubblock,tmp_subsubblock)
    endif
    path_subsubblock = path at (path mod number of horizontal subsubblocks)
    inter_subsubblock_interactions(current_subsubblock,path_subsubblock)
endfor
=========
Therefore, the current subsubblock is sent to the neighboring processor when the path goes off processor, rather than fetching the path subsubblocks. This is a good idea because, in general, when a path goes off processor, more than one path subsubblock is off the processor (in the right and top cases). If we instead brought the path subsubblocks to the local processor, we would incur additional and unnecessary communication. In other words, we bring the subsubblock to the mountain rather than the mountain to the subsubblock. Other advantages should be obvious. Further (and complete) details of this algorithm may be found here, at Los Alamos.
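The communication argument can be checked with a rough count. The Python sketch below is a simplified model, not the algorithm itself: it counts only one-way subsubblock transfers, ignores sending any updated data back, and assumes each processor holds a width x height block of subsubblocks. The function names are invented for illustration.

```python
# Path offsets (dx, dy): right, upper right, upper, upper left.
PATH = [(1, 0), (1, 1), (0, 1), (-1, 1)]

def off_processor_targets(x, y, width, height):
    """Processor offsets (px, py) for each path entry that leaves the local block."""
    targets = []
    for dx, dy in PATH:
        nx, ny = x + dx, y + dy
        px = int(nx >= width) - int(nx < 0)   # -1, 0, or +1 processors horizontally
        py = int(ny >= height) - int(ny < 0)  # 0 or +1, since the path never goes down
        if (px, py) != (0, 0):
            targets.append((px, py))
    return targets

def transfers(x, y, width, height):
    """Return (fetch, send) transfer counts for the subsubblock at local (x, y)."""
    targets = off_processor_targets(x, y, width, height)
    fetch = len(targets)      # fetching: one transfer per off-processor path entry
    send = len(set(targets))  # sending: one transfer per distinct neighbor processor
    return fetch, send

if __name__ == "__main__":
    for name, (x, y) in {"interior": (1, 1), "right edge": (3, 1),
                         "top edge": (1, 3), "top-right corner": (3, 3)}.items():
        print(name, transfers(x, y, 4, 4))
```

Under this model a subsubblock on the top edge of a 4 x 4 block needs 3 fetches but only 1 send, and one on the right edge needs 2 fetches but only 1 send, which is the saving described above.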