Currently Active Projects:

2.5D Algorithms:

Collaborators: James Demmel (UCB), BeBoP group (UCB/LBNL)
Parallel dense linear algorithms typically assign parallel work in matrix blocks. However, the communication cost can be reduced by using redundant copies of the matrices. A classical 3D matrix multiplication algorithm has been long known to achieve the minimal communication cost possible.

2.5D matrix multiplication is an adaptation of the classical 3D algorithm. 2.5D algorithms perform adaptive replication, controlling for communication as well as memory usage (the main flaw of the 3D algorithm).

We've also shown how to extend this approach to 2.5D LU, TRSM, and Cholesky. These new 2.5D algorithms achieve an asymptotically lower inter-processor communication complexity than any previously existing algorithms. In fact, in all cases, we prove their communication optimality. Our 2011 Euro-Par paper received a Distinguished Paper award for this work.

2.5D algorithms also map nicely to modern 3D Torus architectures. Our implementations have shown excellent strong scaling on the Intrepid BlueGene/P machine at Argonne National Lab. Careful mapping techniques allow the code to exploit topology-aware collectives. We analyze the implementations and mapping techniques, as well as create a performance model in our 2011 Supercomputing paper.

The parallelization technique used in 2.5D algorithms has also been extended outside the domain of numerical linear algebra. For example, we recently demonstrated how the all-pairs shortest-paths problem can be solved with the same computational complexity as 2.5D LU using a recursive 2.5D algorithm. The theoretical analysis is again matches by scalability results on a large supercomputer, in this case the Cray XE6. For details, see our 2013 IPDPS paper.

This type of communication analysis has also been applied to Molecular Dynamics (MD). 1.5D parallelization of MD codes allows for a reduction of communication bandwidth in latency in exchange for memory usage. Our analysis and performance of this problem will be presented as IPDPS 2013.

Cyclops Tensor Framework:

Collaborators: Jeff Hammond (ANL), Devin Matthews (UT), James Demmel (UCB)
Project webpage:

State of the art electronic structure calculation codes such as NWChem, spend most of their compute time performing tensor contractions. The most popular distributed implementations currently use a global-memory PGAS layer (e.g. Global Arrays) for communication. However, using this approach it is hard to exploit tensor symmetry or maintain a well-mapped communication pattern.

We've designed a tensor parallelization framework that uses cyclic operations (hence, Cycl-ops), to compute tensor contractions. By using a cyclic decomposition, we can seamlessly take advantage of tensor symmetry and decompose into a structured virtual topology. The framework performs automatic topology-aware mapping of tensors of any dimension, so long as they fit into memory. A technical report preceeding our 2013 IPDPS paper on this framework can be found here (here) the IPDPS paper is (here).

Inactive Projects:

Parallel Sorting:

Collaborators: Laxmikant V. Kale (UIUC)
Parallel sorting has a long and rich history of algorithm development and evolution. Yet, some ideas have persisted for decades. Bitonic Sort was discovered in 1964 by Batcher and used in circuit sorting networks. Today, it is again being used on GPU architectures. In fact, choosing the best parallel sorting algorithm depends on the target architecture. We considered the problem of parallel sorting on modern supercomputers. We found that another iterative algorithm (Histogram Sort) is the best choice when scalability is needed past 5000 nodes. We modified the Histogram Sort to release dependencies and aggressively scheduled it. As a result, we achieved full communication and computation overlap, alleviating the heavy all-to-all data exchange. For more information, read our 2010 IPDPS paper.

We also wrote a "Parallel Sorting" article for David Padua's encyclopedia, as well as contributed to the formulation of the parallel sorting pattern. In preparation for writing these articles, I surveyed the large body of parallel computing literature. I assembled a collection of 46 significant papers on the topic.

Molecular Dynamics:

Collaborators: Laxmikant V. Kale (UIUC), Abhinav Bhatele (UIUC), Michael Bergdorf (DESRES), Charles Rendleman (DESRES)
At University of Illinois, we developed a molecular dynamic simulation meant to be a light-weight framework for benchmarking NAMD. We used the patch/compute parallelization scheme, where both the problem domain and the computational domain get decomposed. We implemented several optimization layers, including multicasts, pairlists, and a specialized load balancer.

In summer of 2010, I interned at DE Shaw Research (DESRES). At DESRES, we implemented and heavily optimized a multi-GPU version of the Smooth Particle Ewald method. This work was part of an effort to port the Desmond parallel molecular dynamics code, to GPU clusters. I also worked on development of the pairlist and near-term GPU kernels.