The purpose of this assignment is introduction to programming in shared and distributed memory models.
Your goal is to parallelize a toy particle simulator (similar particle simulators are used in mechanics, biology, astronomy, etc.) that reproduces the behavior shown in the following animation:

The range of interaction forces is limited as shown in grey for a selected particle. Density is set sufficiently low so that given n particles, only O(n) interactions are expected.
Suppose we have a code that runs in time T = O(n) on a single processor. Then we'd hope to run in time T/p when using p processors. We'd like you to write parallel codes that approach these expectations.
You may start with the serial and parallel implementations supplied below. All of them
run in O(n2) time, which is unacceptably inefficient.
|
You are welcome to use any NERSC cluster in this assignment. If you wish to build it on other systems, you might need a custom implementation of pthread barrier, such as: pthread_barrier.c, pthread_barrier.h.
You may consider using the following visualization program to check the correctness of the result produced by your code: Linux/Mac version (requires SDL), Windows version.
You may work in groups of 2 or 3. One person in your group should be a non-CS student, but otherwise you're responsible for finding a group. After you have chosen a group, please come to the GSI office hours to discuss the distribution of work among team members. Email the GSIs your report and source codes. Here is the list of items you might show in your report:
You will also be running this assignment on GPUs. You have access to Dirac, an experimental GPU cluster at NERSC. Each node has an NVIDIA Tesla C2050, as well as two quad-core CPUs (See the NERSC Dirac Webpage for more detailed information.)
We will provide a naive O(n2) GPU implementation, similar to the openmp, pthreads, and MPI codes listed above. It will be your task to make the necessary algorithmic changes and machine optimizations to achieve favorable performance across a range of problem sizes.
It may help to have a clean O(n) serial CPU implementation as a reference. If you feel this will help you, please e-mail the GSIs after Part 1 is due and we can provide this.
Please include a section in your report detailing your GPU implementation, as well as its performance over varying numbers of particles. Here is the list of items you might show in your report: