New:
Information on using timers and native method (C) code from Titanium below.
Workaround for compiler bug below.
Small bug in quicksort code.
This directory contains a Titanium implementation of a Conjugate Gradient program. To set your paths so you can use the compiler, check out the notes at the bottom of this page on the current installations of Titanium as well as the tcbuild documentation. Here are a few basic commands. Once you can find tcbuild, use:
tcbuild -O Driver.ti
The -O flag turns on optimizations. It automatically passes an optimization flag to the backend C compiler, and the Titanium compiler also performs its own optimizations. In addition, you may want to turn off bounds checking of arrays using the -nobcheck flag. To run the code, do:
Driver <n>
This solves a 1D Poisson problem of size n (i.e., with an n x n Poisson matrix).
The only matrix operation used in a standard Conjugate Gradient (CG) algorithm is a matrix-vector multiplication. Thus, CG can be written generically to take this routine (matvec) as an argument. When CG is preconditioned, the routine also needs the preconditioner operation (called psolve here). Since Java and Titanium do not support passing functions as arguments, these are done using inheritance. The following hierarchy is represented in the code:
       Sparse                       Preconditioner
       /    \                        /          \
    CSR    Poisson1D    IdentityPrecond    Poisson1DBlockJacobiPrecond ...
The Sparse interface requires a matvec method, and Preconditioner requires psolve. The only preconditioner currently implemented is
These classes are organized to have "tester" methods, which might be "main" functions if we were programming in Java, but in Titanium we are only allowed to have a single main function. (This is related to the compilation model, since a single executable is created for the class that contains main.)
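The inheritance-based design described above can be sketched in plain Java (not Titanium) as follows. This is a minimal illustration, not the code in this directory: the interface and class names mirror the hierarchy, but the method signatures, the CGSketch class, and the assumed Poisson stencil (2 on the diagonal, -1 on the off-diagonals) are assumptions made for the example.

```java
// Hypothetical Java sketch; the real Titanium signatures may differ.
interface Sparse {
    void matvec(double[] x, double[] y);      // y = A * x
}

interface Preconditioner {
    void psolve(double[] r, double[] z);      // z = M^{-1} * r
}

// Assumed 1D Poisson stencil: 2 on the diagonal, -1 on the off-diagonals.
class Poisson1D implements Sparse {
    private final int n;
    Poisson1D(int n) { this.n = n; }
    public void matvec(double[] x, double[] y) {
        for (int i = 0; i < n; i++) {
            double v = 2.0 * x[i];
            if (i > 0) v -= x[i - 1];
            if (i < n - 1) v -= x[i + 1];
            y[i] = v;
        }
    }
}

// The identity preconditioner just copies the residual.
class IdentityPrecond implements Preconditioner {
    public void psolve(double[] r, double[] z) {
        System.arraycopy(r, 0, z, 0, r.length);
    }
}

class CGSketch {
    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Preconditioned CG starting from x = 0; returns the iteration count.
    static int solve(Sparse a, Preconditioner m, double[] b, double[] x,
                     double tol, int maxIter) {
        int n = b.length;
        double[] r = b.clone();               // residual r = b - A*0 = b
        double[] z = new double[n];
        double[] p = new double[n];
        double[] q = new double[n];
        java.util.Arrays.fill(x, 0.0);
        double bnorm = Math.sqrt(dot(b, b));
        if (bnorm == 0.0) return 0;           // x = 0 is already the solution
        m.psolve(r, z);
        System.arraycopy(z, 0, p, 0, n);
        double rz = dot(r, z);
        for (int k = 1; k <= maxIter; k++) {
            a.matvec(p, q);
            double alpha = rz / dot(p, q);
            for (int i = 0; i < n; i++) {
                x[i] += alpha * p[i];
                r[i] -= alpha * q[i];
            }
            if (Math.sqrt(dot(r, r)) <= tol * bnorm) return k;
            m.psolve(r, z);
            double rzNew = dot(r, z);
            double beta = rzNew / rz;
            for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
            rz = rzNew;
        }
        return maxIter;
    }

    public static void main(String[] args) {
        int n = 8;
        Sparse a = new Poisson1D(n);
        double[] ones = new double[n];
        java.util.Arrays.fill(ones, 1.0);
        double[] b = new double[n];
        a.matvec(ones, b);                    // right-hand side with known solution
        double[] x = new double[n];
        int iters = solve(a, new IdentityPrecond(), b, x, 1e-10, 100);
        System.out.println("iterations: " + iters + ", x[0] = " + x[0]);
    }
}
```

Because solve only sees the two interfaces, swapping in a different matrix format or preconditioner requires no change to the CG loop, which is the point of the hierarchy.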
This program was influenced by the C version to make it easier to go back and forth between the two programs (and because David wrote his first). Titanium has support for operator overloading and templates, unlike Java, and both of these could be used to improve the reusability of the code, but this is orthogonal to parallelization issues. Feel free to try these features (and see if they impact performance -- they shouldn't).
There are many cases in which error checking and error handling should be done more carefully. In particular, several methods assume that arrays are aligned, in the sense that they start and end at the same index. The I/O code is also quite fragile.
Absolutely no performance tuning or measurements have been done on this code so far. There are some obvious places where the UPC implementation will be faster, e.g., it calls LAPACK for some dense matrix problems where the Titanium version uses Titanium. In the long run, this seems like the right approach for Titanium -- many people have put years of effort into writing LAPACK, and the highly tuned "BLAS" routines that it is based on, so it is unlikely that compiled Titanium code could beat this. It is possible to call native C from Titanium (recall that Titanium is compiled into C), but the three main issues are:
1) The name "mangling" that goes on to convert Java-style class/method names into C functions. This makes the interface ugly (something the Titanium group is working on), but it is workable--many people have written Titanium programs that interface to C libraries.
2) Accessing Titanium data structures, in particular arrays, from within C. This is also ugly but possible, since Titanium arrays contain some header information and a contiguous block of storage, which can be manipulated like a 1D C array.
3) Converting between row and column major layout for multidimensional arrays. This is needed because LAPACK was written in Fortran, and while it has a C implementation (and even a Java interface), in all languages it assumes a column-major layout, whereas Titanium uses row-major in its implementation. The necessary transpose can be done (see the transpose code in the Matrix class for a simple example), but it is clunky and potentially expensive.
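The layout conversion in point 3 can be illustrated with a small plain-Java sketch. It is not the transpose code from the Matrix class; the class and method names (LayoutSketch, rowToColumnMajor) are hypothetical, and the matrix is assumed to be stored as a flat 1D array.

```java
// Hypothetical sketch of a row-major to column-major copy.
class LayoutSketch {
    // Element (i, j) of a rows x cols matrix lives at
    //   row-major index:    i * cols + j
    //   column-major index: j * rows + i
    static double[] rowToColumnMajor(double[] a, int rows, int cols) {
        double[] out = new double[rows * cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                out[j * rows + i] = a[i * cols + j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] stored row by row.
        double[] rowMajor = {1, 2, 3, 4, 5, 6};
        double[] colMajor = rowToColumnMajor(rowMajor, 2, 3);
        // prints [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
        System.out.println(java.util.Arrays.toString(colMajor));
    }
}
```

The copy touches every element once and needs a second buffer of the same size, which is where the potential expense mentioned above comes from.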
Matrix-market style IO was not in the code that was handed out early on, but is available now. If you already downloaded and modified the code a while ago, pick up the files MMIO.ti and MatrixEntry.ti to get the new IO routines. You will need to make some small changes in Driver.ti (or copy ours) to call the IO routines in the main method. The simple conjugate gradient and matvec methods provided before have not been modified.
Missing Features and Known Bugs
Preconditioners: No preconditioner is available yet for general matrices, only for the Poisson problem. This is something you are welcome to add. The parallel code does not have the preconditioner even for the Poisson problem, although this should be quite easy to add, since the block preconditioner is embarrassingly parallel, and the sequential code you need is in the Matrix.ti file in the sequential code. Both the sequential and parallel code are set up to call preconditioners if they are available.
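To show why the block preconditioner is embarrassingly parallel, here is a plain-Java sketch of a block Jacobi psolve for the 1D Poisson matrix. It is not the code in Matrix.ti: it assumes the stencil has 2 on the diagonal and -1 on the off-diagonals, and it solves each diagonal block with the Thomas (tridiagonal) algorithm, which may differ from the solver the Titanium code uses. The names BlockJacobiSketch, solveBlock, and the blockSize parameter are hypothetical.

```java
// Hypothetical sketch of a block Jacobi preconditioner for tridiag(-1, 2, -1).
class BlockJacobiSketch {
    // Solve T z = r on rows [lo, hi), where T is the diagonal block
    // tridiag(-1, 2, -1), using the Thomas algorithm.
    static void solveBlock(double[] r, double[] z, int lo, int hi) {
        int m = hi - lo;
        double[] c = new double[m];           // modified super-diagonal
        double[] d = new double[m];           // modified right-hand side
        c[0] = -1.0 / 2.0;
        d[0] = r[lo] / 2.0;
        for (int i = 1; i < m; i++) {
            double denom = 2.0 + c[i - 1];    // b_i - a_i * c'_{i-1} with a_i = -1
            c[i] = -1.0 / denom;
            d[i] = (r[lo + i] + d[i - 1]) / denom;
        }
        z[hi - 1] = d[m - 1];
        for (int i = m - 2; i >= 0; i--) {
            z[lo + i] = d[i] - c[i] * z[lo + i + 1];
        }
    }

    // Each block reads and writes only its own index range, so the
    // iterations of this loop are independent of one another.
    static void psolve(double[] r, double[] z, int blockSize) {
        int n = r.length;
        for (int lo = 0; lo < n; lo += blockSize) {
            int hi = Math.min(lo + blockSize, n);
            solveBlock(r, z, lo, hi);
        }
    }

    public static void main(String[] args) {
        double[] r = {1, 0, 0, 0, 0, 1};
        double[] z = new double[6];
        psolve(r, z, 3);                      // two independent 3 x 3 blocks
        System.out.println(java.util.Arrays.toString(z));
    }
}
```

In a parallel version, each processor would run solveBlock on the blocks it owns with no communication, since no block depends on data from another.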
Quicksort: There was an off-by-1 error in the call to quicksort in the sort method of QSort.ti. It should be:
Thanks to Dan Bonachea, who noticed this while tracking down the following compiler bug.
Compiler bug: There is a subtle compiler bug noticed by Ryan Huebsch, which is a null pointer exception that occurs in the distributed memory backends (e.g., Millennium, the T3E, and the SP). On such machines, pointers that are not declared "local" are represented by a pair of items, a processor ID and a local memory address. This adds space and cost to these "global" pointers, so the compiler automatically converts global pointers to local ones using "Local Qualification Inference (LQI)" if it can prove that the pointer is only used by the processor on which the data resides. Unfortunately, LQI is inferring something incorrectly on our CG code, and converting a global pointer to local that should not be. The bug arises as a null pointer exception in the fillEntries method in the CSR.ti file. There are two possible fixes:
1. Turn off LQI by using the compiler flag:
  --tc-flags "-Onolocal"
This could have some detrimental performance effects on wide-pointer backends, unless you manually go through and add "local" annotations to the local pointers. Some attempt was made in the code to distinguish between local and global pointers (e.g., myX and allX, where myX points only to local data), with computation done using local pointers most of the time. This style helps the LQI analysis, but you might be able to manually get the same effect by annotating the local pointers.
2. The preferred approach is to change this line in your Driver.ti to an equivalent one:
from:
aMat = new CSR(aMM.entries, aMM.n);
to:
aMat = new CSR(broadcast aMM.entries from 0, aMM.n);
Titanium is based on Java 1.0. You may use the Java libraries (as long as they don't require threads), and you will see all of the standard libraries automatically. Although the following is for Java 1.1, it is pretty close to the 1.0 library interfaces:
We recommend you use Titanium on Millennium for basic code development, and then use mcurie or seaborg for performance testing. Titanium source code is available on the web page, and it can be built on most non-Windows machines. The Millennium installation is a bit peculiar, because it doesn't use the default gcc implementation as the backend. The reason is that there is a subtle bug in the default gcc that has been seen through some Titanium code. Most likely, you won't run into it, but we thought you would have a better chance if we used a different compiler. You can play with different C backends by passing compiler flags to tcbuild.
Timers in Titanium: To time your Titanium code, use the timer routines in the Language reference manual. On the IBM SP machine, where the PAPI interface to hardware counters is installed, you may also try using PAPI from Titanium.
Native methods in Titanium: You may also try calling native method code, for which a small example is provided. If you want to try something more sophisticated that accesses Titanium data structures from native code, you may find a former 267 project on PETSc from Titanium useful.
Some additional comments on the new parallel code:
1) The Poisson problem is not implemented. I recommend this as a good warm-up exercise for learning Titanium. (It should be much simpler than the CSR class.)
2) The code really should have used a parallel/distributed vector class, but instead it uses the Titanium analog of a distributed array, which is an array of Procs elements, replicated on all processors, with each element pointing to the block of the array on a given processor. It would be cleaner with a distributed array interface.
3) This has only been run on threads, but it should automatically run on Millennium, the T3E, and the SP.
4) To run the code on an SMP (or a single processor with multiple threads, just for debugging), use
tcbuild --backend smp Driver.ti
and then run with
tcrun --processors <n> Driver <filename>
On the Cray T3E and IBM SP, use tcbuild --help to see the names of the backend options for that machine. The compiler defaults to a uniprocessor backend if no flag is specified. Don't forget to use the tester methods for modular testing -- simply rename them to "main" and compile that class.
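The distributed-array layout described in point 2 above (an array of Procs elements, each pointing to one processor's block) can be modeled in plain Java to show the index arithmetic. This sketch is an assumption-laden illustration only: Java has no global pointers or processors, so it models just the block directory, and the class DistArraySketch, its methods, and the even block distribution are hypothetical rather than taken from the parallel code. The field name allX and the accessor myX follow the naming mentioned earlier on this page.

```java
// Hypothetical model of a block-distributed vector: allX[p] refers to the
// block owned by processor p. In Titanium the directory would be replicated
// and its entries would be global pointers; here everything is one JVM.
class DistArraySketch {
    final int blockSize;
    final double[][] allX;

    DistArraySketch(int n, int procs) {
        blockSize = (n + procs - 1) / procs;  // ceiling of n / procs
        allX = new double[procs][];
        for (int p = 0; p < procs; p++) {
            int len = Math.max(0, Math.min(blockSize, n - p * blockSize));
            allX[p] = new double[len];
        }
    }

    int owner(int g)  { return g / blockSize; }   // processor holding global index g
    int offset(int g) { return g % blockSize; }   // position within that block

    double get(int g)         { return allX[owner(g)][offset(g)]; }
    void set(int g, double v) { allX[owner(g)][offset(g)] = v; }

    // The block a given processor would compute on through a local pointer.
    double[] myX(int myProc)  { return allX[myProc]; }

    public static void main(String[] args) {
        DistArraySketch x = new DistArraySketch(10, 4);
        x.set(9, 3.5);
        // Global index 9 maps to processor 3, offset 0.
        System.out.println(x.owner(9) + " " + x.offset(9) + " " + x.myX(3)[0]);
    }
}
```

Wrapping the owner/offset arithmetic in one class is the kind of distributed array interface that point 2 suggests would make the code cleaner.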
A relatively new release of the Titanium compiler (called the "SRS release v1.910") has now been installed on a number of machines within the CS domain and elsewhere. Here is a list of the platforms you can use. Note that many of these can be accessed from desktop machines, as long as you have mounted these file systems.
Solaris SRS:
CS Division: /project/cs/titanium/srs/sparc-sun-solaris2.6/bin/tcbuild
Linux SRS:
CS Dept: /project/cs/titanium/srs/i686-pc-linux-gnu/bin/tcbuild
NERSC Cray T3E:
mcurie.nersc.gov: /usr/tmp/Titanium/dist/bin/tcbuild
NERSC IBM SP Power3:
seaborg.nersc.gov: /usr/common/ftg/titanium/dist/bin/tcbuild
SDSC ROCKS Linux/Myrinet cluster
slic00.sdsc.edu: ~bonachea/bin/tcbuild
NCSA Origin 2000 array
modi4.ncsa.uiuc.edu: /usr/apps/Ti/dist/bin/tcbuild
Blue Horizon
horizon.npaci.edu: /usr/local/apps/titanium/bin/tcbuild
The recommended parallel backend on this platform is currently mpi-cluster-smp (the system has 8 CPUs per node) until the performance bug on the sp3 backend gets fixed.
The Titanium software page:
http://www.cs.berkeley.edu/Research/Projects/titanium/software.html
has documentation on using the mpi-* and udp-* backends, as well as a language reference and a tutorial (in-progress). If you have any questions, please e-mail titanium-devel@cs.berkeley.edu.