Water Simulation in the SPLASH Benchmark
Problem
The Stanford Parallel Applications for Shared Memory (SPLASH) benchmark suite consists of several kernels and applications to evaluate aspects of parallel performance. The water application in the suite simulates forces and potential energy of water molecules in the liquid state. The simulation, after many time steps, should reach a stable state.
Possible uses of this application include studying specific properties of water, for example the speed of sound waves in water. However, the major use of this application, like the others in the SPLASH suite, is benchmarking.
Application Platform
The water simulation application is written in C, using parmacs for parallel functionality. Parmacs is a collection of macros from Argonne National Laboratory for parallel programming. The program was designed for multiprocessor machines with a shared memory architecture.
The application was run on a twelve-processor Encore Multimax machine. Each processor in the machine is rated at 2 MIPS (millions of instructions per second). These processors are relatively slow; by comparison, Pentium processors were rated at over 100 MIPS. Because the processors are slow, other aspects of the system, such as memory speed and inter-processor communication speed, are not major bottlenecks.
The application was also run on a simulator. The simulator assumed a perfect memory system, with every memory reference taking one computation cycle. The advantage of the simulator is that imperfect speedup can only be attributed to specific problems, for example load imbalances that cause increased wait time for some processors, overhead inherent in the parallelism, or redundant computation.
Calculation
The application sets up a number of water molecules, positioned in a cube. The benchmark can be initialized with a random distribution of the molecules, or in a regular lattice. Initial velocities are generated from an input file containing pseudo-random numbers.
The algorithm then solves the Newtonian equations of motion using a predictor-corrector method in each time step. The steps involved include predicting the values of certain variables for each atom (position, velocity, etc.), computing intra-molecular forces for each molecule, computing inter-molecular forces between molecules, calculating the corrected variable values using the predicted values and the forces, and calculating kinetic and potential energies of the system. Each task is partitioned across all of the processors. The vast majority of the work is in the inter-molecular force calculations, as this calculation can be O(n^2) and may involve a large amount of communication between processors.
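The predict, evaluate, correct cycle described above can be illustrated with a minimal sketch. It is not the benchmark's integrator: it swaps in a simple second-order scheme rather than the higher-order Gear method listed in the references, and it uses one particle on a unit-mass harmonic spring as a stand-in force law. The names `Particle`, `spring_accel`, `pc_step`, and `total_energy` are hypothetical.

```c
#include <assert.h>

/* State of one particle in 1D; a hypothetical stand-in for an atom. */
typedef struct {
    double x;  /* position */
    double v;  /* velocity */
    double a;  /* acceleration from the last force evaluation */
} Particle;

/* Illustrative force law (assumption): unit-mass spring, a = -k * x. */
static double spring_accel(double k, double x)
{
    return -k * x;
}

/* One predict / evaluate / correct step of size dt. */
void pc_step(Particle *p, double k, double dt)
{
    assert(p != 0 && dt > 0.0);

    /* 1. Predict position and velocity from the current derivatives. */
    double x_pred = p->x + p->v * dt + 0.5 * p->a * dt * dt;
    double v_pred = p->v + p->a * dt;

    /* 2. Evaluate the force at the predicted position. */
    double a_new = spring_accel(k, x_pred);

    /* 3. Correct the velocity using the change in acceleration. */
    p->x = x_pred;
    p->v = v_pred + 0.5 * (a_new - p->a) * dt;
    p->a = a_new;
}

/* 4. Kinetic plus potential energy of the unit-mass spring system. */
double total_energy(const Particle *p, double k)
{
    return 0.5 * p->v * p->v + 0.5 * k * p->x * p->x;
}
```

Calling `pc_step` repeatedly and checking that `total_energy` stays roughly constant is a quick sanity test for the step size. In the benchmark, step 2 is where the intra-molecular and inter-molecular force computations would sit.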
There are two versions of water in the SPLASH-2 suite to solve this problem. The original (n-squared) version of water does not maintain a spatial data structure, so the molecules stored next to each other in the main data structure are not necessarily near each other in space. For inter-molecular calculations, a cutoff radius is used (half the cube length) so that forces between molecules separated by a great distance are not calculated. However, because no spatial information is stored, each pair of molecules must be checked against the cutoff distance, resulting in O(n^2) calculations.
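The all-pairs cutoff test can be sketched as follows. This is an illustration only: it uses plain Euclidean distance with no periodic boundary handling (an assumption; the text does not say how the benchmark treats the cube's edges), and `Vec3` and `count_pairs_within_cutoff` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double x, y, z; } Vec3;

/* Count molecule pairs closer than the cutoff (half the cube length).
 * Every unordered pair is tested, so the work is O(n^2) even when
 * most pairs turn out to be too far apart to interact. */
size_t count_pairs_within_cutoff(const Vec3 *pos, size_t n, double box_len)
{
    assert(box_len > 0.0);

    double cutoff = 0.5 * box_len;
    double cutoff2 = cutoff * cutoff;  /* compare squared distances */
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            double dx = pos[i].x - pos[j].x;
            double dy = pos[i].y - pos[j].y;
            double dz = pos[i].z - pos[j].z;
            if (dx * dx + dy * dy + dz * dz <= cutoff2) {
                count++;  /* a real code would compute the force here */
            }
        }
    }
    return count;
}
```

Comparing squared distances avoids a square root per pair, but it does not change the quadratic number of checks.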
A major issue with this version is that the lack of spatial locality can cause increased communication between processors, as molecules are not distributed amongst processors in chunks determined by position in space. This can also affect load balancing as some processors may have more actual inter-molecular calculations to perform. A processor calculates the inter-molecular forces between a particular molecule and the n/2 molecules following it in the molecule array. If most of the close molecules are in the n/2 molecules preceding it in the array, other processors will be calculating these inter-molecular forces.
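The "n/2 molecules following it" rule can be sketched as a partner enumeration over the molecule array with wraparound. The function name `partners_of` is hypothetical, and the tie-break for an even number of molecules (where the partner at offset n/2 is reachable from both ends) is an assumption made here so that each pair is visited exactly once; the benchmark may resolve it differently.

```c
#include <assert.h>

/* Write the array indices of the molecules that molecule i is
 * responsible for into out[] and return how many there are.
 * out must have room for n / 2 entries. */
int partners_of(int i, int n, int *out)
{
    assert(n > 0 && i >= 0 && i < n);

    int half = n / 2;
    int count = 0;

    for (int k = 1; k <= half; k++) {
        /* Assumed tie-break: for even n, the pair at offset n/2 is
         * handled only by the molecule in the lower half of the array. */
        if (n % 2 == 0 && k == half && i >= half) {
            break;
        }
        out[count++] = (i + k) % n;  /* wrap around the end of the array */
    }
    return count;
}
```

Each molecule gets about the same number of candidate partners, but how many of those fall inside the cutoff radius depends on where the molecules sit in space, which is the source of the potential imbalance described above.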
The spatial version divides the cube containing the molecules into a grid of cells owned by different processors, and uses spatial locality to accomplish a running time of O(n). When calculating inter-molecular forces, only molecules in nearby cells within the cutoff distance need to be considered.
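A minimal sketch of the cell-grid idea follows, assuming a cube of side `box_len` split into `cells_per_side` cells along each axis and no periodic wraparound. The function names and the row-major cell numbering are illustrative assumptions, not the benchmark's actual data structure.

```c
#include <assert.h>

/* Map one coordinate in [0, box_len) to a cell index along that axis. */
int cell_index_1d(double coord, double box_len, int cells_per_side)
{
    assert(box_len > 0.0 && cells_per_side > 0);

    int c = (int)(coord / box_len * cells_per_side);
    if (c < 0) c = 0;                                /* clamp stray values */
    if (c >= cells_per_side) c = cells_per_side - 1;
    return c;
}

/* Flatten the three per-axis indices into one cell number. */
int cell_index(double x, double y, double z, double box_len, int cells_per_side)
{
    int cx = cell_index_1d(x, box_len, cells_per_side);
    int cy = cell_index_1d(y, box_len, cells_per_side);
    int cz = cell_index_1d(z, box_len, cells_per_side);
    return (cz * cells_per_side + cy) * cells_per_side + cx;
}

/* Count the cells (including the home cell) within `reach` cells of
 * (cx, cy, cz) along every axis. This number is bounded by
 * (2 * reach + 1)^3 regardless of how many molecules there are. */
int count_neighbor_cells(int cx, int cy, int cz, int cells_per_side, int reach)
{
    assert(cells_per_side > 0 && reach >= 0);

    int count = 0;
    for (int dz = -reach; dz <= reach; dz++) {
        for (int dy = -reach; dy <= reach; dy++) {
            for (int dx = -reach; dx <= reach; dx++) {
                int nx = cx + dx, ny = cy + dy, nz = cz + dz;
                if (nx >= 0 && nx < cells_per_side &&
                    ny >= 0 && ny < cells_per_side &&
                    nz >= 0 && nz < cells_per_side) {
                    count++;
                }
            }
        }
    }
    return count;
}
```

Because each molecule only examines a bounded number of nearby cells, the total work grows linearly with the number of molecules when the density is roughly uniform.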
We will only discuss the results of the n-squared version of the application here.
Results

The above diagram shows the speedup on the two platforms for a problem size of 288 molecules. Unfortunately, raw performance data for these machines is not available, so a comparison with the peak performance of the machines can't be done.
The speedup of this application is almost ideal. Load balancing is not an issue in these test runs, as data from the simulator show that synchronization wait time peaked at 0.39%. The application scaled well on the Multimax as well.
Load balancing was mentioned earlier as a potential issue, but the results show that it is not a problem in these runs. Although the results do not state it clearly, it is likely that the application was run with a uniform lattice of molecules. With a random distribution, molecules could be arranged in the molecule array in a way that causes a load imbalance.
Benchmark Example
A major use of this application is for benchmark purposes. An example of this is a comparison between message passing and distributed shared memory for a network of computers. In this example, the benchmark was done with problem sizes of 288 molecules and 1728 molecules.
8-processor Speedup
| Interface | 288 Molecules | 1728 Molecules |
| Distributed Shared Memory | 5.04 | 7.25 |
| Message Passing | 7.23 | 7.44 |
For both interfaces, the application scales well, particularly when given a large data set.
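One way to read the table is to convert each speedup into parallel efficiency, that is, speedup divided by the number of processors. The helper names below are hypothetical, and the sketch only restates the standard definitions.

```c
#include <assert.h>

/* Speedup: serial run time divided by parallel run time. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency: the fraction of ideal (linear) speedup achieved. */
double parallel_efficiency(double achieved_speedup, int processors)
{
    assert(processors > 0);
    return achieved_speedup / processors;
}
```

For the 8-processor figures above, `parallel_efficiency(5.04, 8)` gives about 0.63 for distributed shared memory at 288 molecules, while `parallel_efficiency(7.44, 8)` gives about 0.93 for message passing at 1728 molecules.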
Summary
The water simulation application is useful scientifically for studying different properties of liquid water. The application scales very well as the number of processors and the input size increase. This efficiency greatly benefits users of the simulation. Additionally, because load balancing was not an issue in the runs reported here, this application is well suited for comparing different parallel architectures. The application is also adaptable to different parallel paradigms: although originally designed for shared memory, it can be modified to benchmark distributed networks or clusters as well.
Links/References
SPLASH Home Page
SPLASH Report
SPLASH-2 Notes
Message Passing Versus Distributed Shared Memory
Gear's Predictor-Corrector Method