Benjamin Lee (blee20@eecs.berkeley.edu)
Computer Science 267
Spring 2004

Biography

I am a fourth-year undergraduate student at the University of California, Berkeley, pursuing a Bachelor's degree in Electrical Engineering and Computer Sciences from the College of Engineering. My interests include digital design, computer architecture, and high-performance computing. I am also interested in finance, economics and the application of computing to these disciplines.

Research Interests

I am pursuing my research interests in areas of high-performance computing with the Berkeley Benchmarking and Optimization Group (BeBOP) under the advice and guidance of Professors Jim Demmel and Kathy Yelick. I have been examining the effects of various performance optimizations on symmetric sparse matrix-vector multiply, including algorithmic, data structure, compiler, and architecture-specific optimizations. These optimizations seek to exploit the symmetric structure of the matrix in conjunction with existing optimizations for general sparse matrix-vector multiply. I am also preparing to begin work on extending these optimizations for parallel implementations of matrix-vector multiply on commodity SMPs.

Objectives

I am hoping to get a better understanding of parallel computing and apply the concepts from this course to my own research in high-performance computing.

Homework 0
Optimizing Sparse Matrix Vector Multiplication on SMPs

Application

Sparse matrix-vector multiplication (SpMV) is an important computational kernel employed in scientific and engineering applications. Sparse matrix-vector multiply is the computation of y = y + Ax, where A is a sparse matrix and x and y are dense column vectors. Sparse matrix algorithms tend to run much slower than their dense counterparts, generally achieving less than 10% of a platform's peak performance. This low performance is attributed to the overhead of maintaining the sparse data structure.
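
As a point of reference, the kernel y = y + Ax can be sketched as follows. This is a minimal illustration, assuming the matrix is stored in compressed sparse row (CSR) format, which the summary above does not specify; the names spmv_csr, row_ptr, col_idx and values are chosen for this example only. The indirect access x[col_idx[k]] is one instance of the data structure overhead mentioned above.

```python
def spmv_csr(row_ptr, col_idx, values, x, y):
    """Compute y = y + A*x in place, with A stored in CSR format (assumed layout)."""
    for i in range(len(row_ptr) - 1):
        acc = y[i]
        # Non-zeros of row i live in positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]   # indirect access into x
        y[i] = acc
    return y

# Example: A = [[4, 0, 1],
#               [0, 2, 0],
#               [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x, [0.0, 0.0, 0.0])
```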

The Berkeley Benchmarking and Optimization Group (BeBOP) has implemented several optimization techniques to improve the performance of sparse matrix-vector multiplication: (1) register blocking, (2) cache blocking, and (3) matrix reordering. Although most of these techniques have been implemented and evaluated for uni-processors, they have also been applied to commodity SMPs with mixed performance results.

Register Blocking

Register blocking is an optimization technique for improving register reuse over that of a conventional implementation. Register blocking is designed to exploit naturally occurring dense blocks within a sparse matrix by reorganizing the matrix data structure into a sequence of small dense blocks. The size of these dense blocks is chosen such that the corresponding source vector elements can fit in registers. Register blocking reduces loop overhead, reduces indexing overhead, reduces irregular memory accesses, and increases temporal locality to the source vector.

In the register blocked implementation, consider an (m x n) matrix, logically divided into (m/r x n/c) submatrices, where each submatrix is of size (r x c). Assume for simplicity that r divides m and that c divides n. The computation of SpMV proceeds block-by-block. For each block, we can reuse the corresponding c elements of the source vector by keeping them in registers, assuming a sufficient number are available.
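
The block-by-block computation described above can be sketched for a fixed block size of r = c = 2. This is an illustration under assumptions, not the BeBOP implementation: the blocked CSR layout (brow_ptr, bcol_idx, bvalues) and the function name are hypothetical, and blocks that are not fully dense are stored here with explicit zeros. The locals x0, x1, y0 and y1 stand in for the values a compiled kernel would try to keep in registers.

```python
def spmv_bcsr_2x2(brow_ptr, bcol_idx, bvalues, x, y):
    """y = y + A*x for A stored as 2x2 dense blocks, row-major within each block."""
    for bi in range(len(brow_ptr) - 1):
        y0 = y[2 * bi]
        y1 = y[2 * bi + 1]
        for k in range(brow_ptr[bi], brow_ptr[bi + 1]):
            j = 2 * bcol_idx[k]
            x0, x1 = x[j], x[j + 1]      # c = 2 source elements, reused by both rows
            b = 4 * k                    # offset of block k in bvalues
            y0 += bvalues[b] * x0 + bvalues[b + 1] * x1
            y1 += bvalues[b + 2] * x0 + bvalues[b + 3] * x1
        y[2 * bi] = y0
        y[2 * bi + 1] = y1
    return y

# Example: A = [[1, 2, 0, 0],
#               [3, 4, 0, 0],
#               [0, 5, 6, 0],
#               [0, 0, 0, 7]]
brow_ptr = [0, 1, 3]
bcol_idx = [0, 0, 1]
bvalues = [1.0, 2.0, 3.0, 4.0,   # block (0, 0)
           0.0, 5.0, 0.0, 0.0,   # block (1, 0), padded with explicit zeros
           6.0, 0.0, 0.0, 7.0]   # block (1, 1)
y_blocked = spmv_bcsr_2x2(brow_ptr, bcol_idx, bvalues,
                          [1.0, 2.0, 3.0, 4.0], [0.0] * 4)
```

One column index is stored per block rather than per non-zero, which is where the reduction in indexing overhead comes from in this sketch.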

Cache Blocking

Cache blocking is an extension of the register blocking idea. This optimization is used to keep c_cache elements of the source vector in the cache while an (r_cache x c_cache) block of matrix A is multiplied by this portion of the source vector. Cache blocking effectively limits the vector products within a cache block so that elements of the source vector are located in cache and may be re-used for the vector product in the next block row.
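
The traversal implied by this description can be sketched as below. It is only an illustration of the loop structure: the coordinate-list input, the function names, and the choice to visit all block rows of one block column before moving to the next block column are assumptions made for this example, and Python itself will not show the cache effect.

```python
def build_cache_blocks(entries, r_cache, c_cache):
    """Group (i, j, value) entries by the (block row, block column) they fall in."""
    blocks = {}
    for i, j, v in entries:
        blocks.setdefault((i // r_cache, j // c_cache), []).append((i, j, v))
    return blocks

def spmv_cache_blocked(blocks, x, y):
    """y = y + A*x, one cache block at a time.

    Blocks are visited block column by block column, so consecutive blocks
    read the same c_cache-element slice of x.
    """
    for key in sorted(blocks, key=lambda rc: (rc[1], rc[0])):
        for i, j, v in blocks[key]:
            y[i] += v * x[j]
    return y

# Example: the non-zeros of a 4x4 matrix, split into 2x2 cache blocks.
entries = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0),
           (2, 1, 5.0), (2, 2, 6.0), (3, 3, 7.0)]
blocks = build_cache_blocks(entries, r_cache=2, c_cache=2)
y_cache = spmv_cache_blocked(blocks, [1.0, 2.0, 3.0, 4.0], [0.0] * 4)
```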

Matrix Reordering

Reordering the rows and columns of the matrix changes the memory access patterns to the sparse matrix, possibly reducing cache misses or memory coherence traffic across the bus or crossbar of an SMP. Two techniques for reordering symmetric matrices have been extended to the non-symmetric case. The first technique uses a graph numbering scheme, Reverse Cuthill-McKee (RCM), that numbers the graph nodes in the reverse of a breadth-first search order. The second technique is an application of graph-partitioning algorithms using the hMETIS package developed at the University of Minnesota.
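
A minimal sketch of the RCM numbering for the symmetric case is given below. It assumes the matrix structure is supplied as a symmetric adjacency list, starts each connected component from a lowest-degree node, and visits neighbours in order of increasing degree; these are common choices rather than details taken from the work summarized here, and the extension to non-symmetric matrices is not shown. The helper bandwidth is included only to show the effect of the reordering.

```python
from collections import deque

def reverse_cuthill_mckee(adj):
    """Return an RCM ordering of nodes 0..n-1 for a symmetric adjacency list."""
    n = len(adj)
    visited = [False] * n
    order = []
    # Start each connected component from a lowest-degree unvisited node.
    for start in sorted(range(n), key=lambda v: len(adj[v])):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Breadth-first search, neighbours in order of increasing degree.
            for w in sorted(adj[v], key=lambda u: len(adj[u])):
                if not visited[w]:
                    visited[w] = True
                    queue.append(w)
    order.reverse()          # the "reverse" in Reverse Cuthill-McKee
    return order

def bandwidth(adj, order):
    """Largest distance between the new positions of two adjacent nodes."""
    pos = {v: p for p, v in enumerate(order)}
    return max((abs(pos[v] - pos[w]) for v in range(len(adj)) for w in adj[v]),
               default=0)

# Example: a path graph 0-3-1-4-2 whose labels are scrambled.
adj = [[3], [3, 4], [4], [0, 1], [1, 2]]
order = reverse_cuthill_mckee(adj)
```

In this example the new ordering places adjacent nodes next to each other, so the non-zeros of the reordered matrix cluster around the diagonal.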

Evaluation on SMPs

BeBOP ran and evaluated the cache blocking optimization on an 8-way Ultrasparc SMP with four variations on processor load assignment:

C1: Rows in the same block are distributed evenly to participating processors.
C2: Each block of rows is assigned to one processor.
C3: Similar to C2 except the computation starts at the diagonal block.
C4: Each block of columns is assigned to one processor.
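
To make the row-based assignment concrete, the sketch below shows one possible reading of configuration C2: the rows are split into contiguous blocks, one per worker, and each worker updates a disjoint part of the destination vector, so no synchronization on y is needed. The CSR layout and all function names are assumptions for this example. The experiments were run with POSIX threads in C; Python threads are used here only to show the structure, not to obtain a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def row_blocks(n_rows, n_procs):
    """Split rows 0..n_rows-1 into n_procs contiguous, nearly equal ranges."""
    base, extra = divmod(n_rows, n_procs)
    bounds, start = [], 0
    for p in range(n_procs):
        stop = start + base + (1 if p < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def spmv_rows(row_ptr, col_idx, values, x, y, start, stop):
    """Update y[start:stop] with the product of rows start..stop-1 (CSR) and x."""
    for i in range(start, stop):
        acc = y[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc

def spmv_c2(row_ptr, col_idx, values, x, y, n_procs):
    """One block of rows per worker; workers write disjoint entries of y."""
    with ThreadPoolExecutor(max_workers=n_procs) as pool:
        futures = [pool.submit(spmv_rows, row_ptr, col_idx, values, x, y, s, e)
                   for s, e in row_blocks(len(row_ptr) - 1, n_procs)]
        for f in futures:
            f.result()       # wait for the worker and surface any exception
    return y

# Example: A = [[4, 0, 1], [0, 2, 0], [3, 0, 5]] on two workers.
y_c2 = spmv_c2([0, 2, 3, 5], [0, 2, 1, 0, 2], [4.0, 1.0, 2.0, 3.0, 5.0],
               [1.0, 2.0, 3.0], [0.0, 0.0, 0.0], n_procs=2)
```

A column-based assignment such as C4 would instead have several workers contributing to the same entries of y, which requires combining partial sums.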

The performance of the Ultrasparc SMP for these load assignments varied with the matrix. The left graph shows the performance per processor for SpMV with the LSI matrix (arising from document retrieval algorithms) and the right graph shows performance per processor for a FEM matrix (finite element method).

In the left graph, the per processor performance is plotted for the multiplication of the LSI matrix. The base performance for an unoptimized implementation is also shown for comparison. The portion of the LSI matrix used in the experiment to collect this performance data is (10K x 256K) with 3.8M non-zero entries. For this matrix, C1 yields no performance gains per processor because the long rows of the cache block do not fit in cache. In experimental results not shown below, a variation on the C1 configuration with limited cache block sizes yielded performance of 17-18 MFlop/s per processor. C3 starts the computation at the diagonal blocks such that each processor reads different parts of the source vector at any given time. These effects seem to be negligible, however, because C2 and C3 yield similar performance gains. C4 is most effective, yielding 16-17 MFlop/s per processor for larger numbers of processors.

In the right graph, the per processor performance is plotted for the multiplication of a finite element method (FEM) matrix. The dimensions of this matrix are (24696 x 24696) with 1.7M non-zero elements. Unlike the LSI matrix, the row length of this matrix did not cause problems with cache capacity. Although prior work had demonstrated limited benefits from cache blocking on a uni-processor, the multiplication of this matrix achieves a nearly linear speedup with configurations C2 and C3. Configurations C1 and C4 achieve lower per processor performance. The lower performance of configuration C4 may be caused by the latencies of the extra synchronization needed to accumulate partial sums before writing results into the destination vector.



Key Considerations

What is the scientific or engineering problem being solved?
BeBOP has conducted preliminary studies for the parallel implementation of sparse matrix-vector multiplication.

What parallel platform has the application targeted?
The experimental work discussed in this summary targeted the 167 MHz Ultrasparc SMP, a commodity shared memory multiprocessor. The optimized routines were implemented in C using POSIX threads.

How well did the application perform?
The per processor performance results varied with the matrix. For the two example matrices (LSI and FEM), the performance as a function of the number of utilized processors exhibited very different characteristics. This suggests that the choice of load assignment configuration and of the number of processors should take matrix structure and other matrix characteristics into account in order to maximize performance.

How does this compare to the platform's best possible performance?
The 167 MHz Ultrasparc processor had two floating point units. Thus, the peak per processor performance is 334 MFlop/s (167 MHz * 2 FP units). The theoretical peak performance of the aggregate 8-way SMP would be 2672 MFlop/s (8 processors * 334 MFlop/s per processor). The achieved per processor performance of approximately 20 MFlop/s is nearly 6% of the theoretical peak per processor performance.
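
The arithmetic above can be checked with a few lines. The figures are those quoted in this summary, and the calculation assumes, as the text does, that each floating point unit completes one operation per cycle.

```python
clock_mhz = 167          # processor clock
fp_units = 2             # floating point units per processor
n_procs = 8              # processors in the SMP
achieved_mflops = 20     # approximate achieved rate per processor

peak_per_proc = clock_mhz * fp_units          # MFlop/s per processor
peak_smp = n_procs * peak_per_proc            # MFlop/s for the whole SMP
fraction = achieved_mflops / peak_per_proc    # share of per processor peak

print(peak_per_proc, peak_smp, round(100 * fraction, 1))
```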

Does the application "scale" to large problems on many processors? What bottlenecks may have limited its performance?
Performance varied with the load assignments. Of the various matrices and load assignments presented, only the FEM matrix with configurations C2 or C3 scaled linearly. The experimental setup, however, did not consider parallel systems with more than eight processors. Therefore, there may be insufficient experimental data to evaluate the application's performance for large problems on many processors.

References

Optimizing Sparse Matrix Vector Multiplication on SMPs
E. Im and K. A. Yelick
SIAM Conf. Parallel Processing for Scientific Computing, San Antonio, TX, March 1999.

Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply
R. Vuduc, J. W. Demmel, K. A. Yelick, S. Kamil, R. Nishtala, B. Lee
Supercomputing 2002. Baltimore, November 2002.

Performance Optimizations and Bounds for Sparse Symmetric Matrix-Multiple Vector Multiply
B. Lee, R. Vuduc, J. Demmel, K. Yelick, M. de Lorimier, L. Zhong
Technical Report UCB//CSD-03-1297, University of California, Berkeley, November 2003.