120 GFLOPS in Matrix-Matrix Multiply using DirectX 9.0

January 31, 2007

Vasily Volkov
UC Berkeley


This is a GPU implementation of matrix-matrix multiply that follows the ideas outlined in the ATI CTM publications [1] but is written in DirectX 9.0. (CTM was previously known as DPVM.) In particular, it uses 4x4 blocking to reduce bandwidth requirements and fetch-4 for higher-bandwidth cache access. The performance achieved is similar to that achieved using CTM.
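The bandwidth saving from 4x4 blocking can be illustrated on the CPU. The sketch below is not the shader code from the package. It is a plain C++ analogue, assuming row-major square matrices whose dimension is a multiple of 4, and the function name multiply_blocked_4x4 is hypothetical. For each step of the inner loop it loads 8 values (4 from A, 4 from B) and performs 16 multiply-adds, that is 0.5 loads per multiply-add, compared with 2 loads per multiply-add for an unblocked inner product.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative CPU analogue of 4x4 blocking (not the DirectX shader itself).
// A, B, C are row-major n x n matrices in flat vectors; n must be a multiple of 4.
// Computes C = A * B, producing one 4x4 tile of C at a time.
void multiply_blocked_4x4(const std::vector<float>& A,
                          const std::vector<float>& B,
                          std::vector<float>& C,
                          std::size_t n)
{
    assert(n % 4 == 0);
    assert(A.size() == n * n && B.size() == n * n);
    C.assign(n * n, 0.0f);

    for (std::size_t i0 = 0; i0 < n; i0 += 4) {
        for (std::size_t j0 = 0; j0 < n; j0 += 4) {
            float acc[4][4] = {};  // the 4x4 tile of C, kept in local variables

            for (std::size_t k = 0; k < n; ++k) {
                float a[4];  // 4 elements of column k of A (rows i0..i0+3)
                float b[4];  // 4 elements of row k of B (columns j0..j0+3)
                for (std::size_t r = 0; r < 4; ++r) a[r] = A[(i0 + r) * n + k];
                for (std::size_t c = 0; c < 4; ++c) b[c] = B[k * n + (j0 + c)];

                // 8 loaded values feed 16 multiply-adds (a rank-1 update of the tile).
                for (std::size_t r = 0; r < 4; ++r)
                    for (std::size_t c = 0; c < 4; ++c)
                        acc[r][c] += a[r] * b[c];
            }

            // Write the finished tile back to C.
            for (std::size_t r = 0; r < 4; ++r)
                for (std::size_t c = 0; c < 4; ++c)
                    C[(i0 + r) * n + (j0 + c)] = acc[r][c];
        }
    }
}
```

On the GPU the loads correspond to texture fetches and the tile to the outputs of one shader invocation. How the actual package lays out the matrices in textures is not shown here.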


The following package includes the source code, written in C++ using the Windows API and DirectX 9.0, and project files for Microsoft Visual C++ 6.0 and Microsoft Visual Studio 2005. The code was tested only on an ATI Radeon X1900 XT.

Source code (2007-01-31)


The computation is performed in GPU memory. If the matrices reside in main memory, the inputs are first uploaded to GPU memory and the result is downloaded back afterwards. The resulting performance for square matrices is presented in Figures 1, 2 and 3. The testing platform was a 2.8GHz Pentium 4 with an ATI Radeon X1900 XT.

Figure 1: Computational rates achieved.

Figure 2: Transfer bandwidths for uploading the matrices to the GPU memory and downloading the result.

Figure 3: Breakdown of the runtime when matrices are in main memory. Runtime has three stages: "uploading" the input matrices to the GPU memory, "computing" on the GPU, and "downloading" the result back to the main memory. The stages are timed individually, which causes a slight discrepancy between the total observed time and the sum of the individual stage times. This difference is about 1% for dimensions above 400 and peaks at 10% for dimensions near 300.
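The quantities plotted in the figures can be derived from stage timings roughly as follows. This is a sketch under stated assumptions: the struct and function names are hypothetical and do not come from the package, and the rate uses the conventional count of 2n^3 floating-point operations for an n x n matrix multiply, which this page does not state explicitly.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the three individually timed stages (seconds).
struct StageTimes {
    double upload_s;    // input matrices: main memory -> GPU memory
    double compute_s;   // multiply on the GPU
    double download_s;  // result: GPU memory -> main memory
};

// Rate in GFLOPS, assuming the conventional 2*n^3 flop count (an assumption).
double gflops(double n, double seconds)
{
    return 2.0 * n * n * n / seconds / 1e9;
}

// Rate when the matrices are already in GPU memory (compute stage only).
double compute_rate_gflops(double n, const StageTimes& t)
{
    return gflops(n, t.compute_s);
}

// Rate when the matrices start and end in main memory (all three stages).
double end_to_end_rate_gflops(double n, const StageTimes& t)
{
    return gflops(n, t.upload_s + t.compute_s + t.download_s);
}

// Transfer bandwidth in GB/s for a given number of bytes moved.
double bandwidth_gb_per_s(double bytes, double seconds)
{
    return bytes / seconds / 1e9;
}

// Discrepancy between the observed total and the sum of the stage times,
// as a percentage of the observed total (assumed to be positive).
double discrepancy_percent(const StageTimes& t, double total_observed_s)
{
    const double sum = t.upload_s + t.compute_s + t.download_s;
    return 100.0 * std::fabs(total_observed_s - sum) / total_observed_s;
}
```

Separating the compute-only rate from the end-to-end rate makes it explicit how much of the runtime goes to transfers for a given dimension.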


The Radeon X1900 driver tends to run the GPU at lower clock rates (known as "2D mode"). The ATITool utility was used to restore the advertised clock rates.


[1] Segal, M. and Peercy, M. 2006. A Performance-Oriented Data Parallel Virtual Machine for GPUs, SIGGRAPH 2006 Sketch.