PETSc: Toolkit for Scientific Applications
PETSc is a library intended to help application programmers in the
scientific computing community build large applications.
The library provides a suite of parallel linear and nonlinear
equation solvers and an interface that accommodates different coding styles
and data structures. The toolkit supports three kinds of solvers:
the solvers included in the package, based on Newton and Krylov methods;
hooks for using external solvers such as Matlab; and an interface that lets
users extend the library with their own solvers.
PETSc follows a fairly strict design methodology: although it is implemented
in C, it makes significant use of abstraction and other features
common in object-oriented languages. Its interface can also be called from
C++ and Fortran, and it includes built-in profiling support that reports
memory usage and other performance characteristics of a parallel
application.
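The abstraction described above, together with the ability to plug in user-written solvers, rests on a standard C idiom: an "object" is a struct holding data plus function pointers, and generic code calls only through those pointers. The sketch below illustrates the idea under stated assumptions: `Operator`, `diag_mult`, and `richardson_solve` are hypothetical names, not PETSc's API, and the simple Richardson iteration stands in for the Newton and Krylov methods PETSc actually provides.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical operator "object": implementation data plus a function
   pointer. Generic solvers see only this interface, never the storage. */
typedef struct {
    int n;                                                      /* dimension */
    void *ctx;                                                  /* implementation data */
    void (*mult)(void *ctx, int n, const double *x, double *y); /* y = A x */
} Operator;

/* One concrete implementation: a diagonal matrix stored as an array. */
static void diag_mult(void *ctx, int n, const double *x, double *y)
{
    const double *d = ctx;
    for (int i = 0; i < n; ++i)
        y[i] = d[i] * x[i];
}

/* Generic Richardson iteration x <- x + omega * (b - A x). It works with any
   Operator because it only calls op->mult. Returns the number of updates
   performed before ||b - A x||_2 < tol, or -1 if maxit is reached. */
int richardson_solve(const Operator *op, const double *b, double *x,
                     double omega, double tol, int maxit)
{
    int n = op->n;
    double *r = malloc((size_t)n * sizeof *r);
    assert(r != NULL);
    for (int k = 0; k < maxit; ++k) {
        op->mult(op->ctx, n, x, r);              /* r = A x */
        double norm2 = 0.0;
        for (int i = 0; i < n; ++i) {
            r[i] = b[i] - r[i];                  /* r = b - A x */
            norm2 += r[i] * r[i];
        }
        if (norm2 < tol * tol) {                 /* converged */
            free(r);
            return k;
        }
        for (int i = 0; i < n; ++i)
            x[i] += omega * r[i];
    }
    free(r);
    return -1;
}
```

A user who supplies a different `mult` (for example, a matrix-free operator) reuses `richardson_solve` unchanged; PETSc's "shell" matrix type applies a similar idea.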
PETSc is built on top of the Message Passing Interface (MPI),
which lets the toolkit run on the wide range of parallel
machines that support MPI. Although MPI is not hidden from
the user, PETSc does not require extensive knowledge of MPI.
Explicit message passing is left to PETSc; users instead call
functions that assemble vectors and matrices and distribute them
efficiently across a parallel job. The following vector and matrix
forms can be assembled:
- Vector index sets
  - general indices
  - block indices
  - strides
- Matrix formats
  - compressed sparse rows (optionally blocked)
  - block diagonal
  - dense
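To make the scalar compressed sparse row format and row-wise distribution concrete, the sketch below stores a matrix in three arrays (row pointers, column indices, values), multiplies it by a vector, and splits n rows into contiguous per-process ranges. PETSc calls its CSR format AIJ; the struct and function names here are hypothetical, and giving the remainder rows to the lowest ranks is one simple scheme assumed for illustration, not a description of PETSc's internals.

```c
#include <assert.h>

/* Compressed sparse row storage: row i's nonzeros are val[rowptr[i]] ..
   val[rowptr[i+1]-1], with matching column indices in colidx. */
typedef struct {
    int nrows;
    const int *rowptr;   /* length nrows + 1 */
    const int *colidx;   /* length nnz */
    const double *val;   /* length nnz */
} CsrMatrix;

/* y = A x for a CSR matrix. */
void csr_mult(const CsrMatrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->nrows; ++i) {
        double sum = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; ++k)
            sum += A->val[k] * x[A->colidx[k]];
        y[i] = sum;
    }
}

/* Contiguous row ownership: each of `size` processes owns rows
   [*start, *end); the first n % size ranks get one extra row. */
void split_rows(int n, int size, int rank, int *start, int *end)
{
    assert(size > 0 && rank >= 0 && rank < size && n >= 0);
    int base = n / size, extra = n % size;
    *start = rank * base + (rank < extra ? rank : extra);
    *end = *start + base + (rank < extra ? 1 : 0);
}
```

With this layout each process would store and multiply only its owned rows; the column indices that fall outside its range identify the remote vector entries that must be communicated.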
Example Application: PETSc-FUN3D
FUN3D is an unstructured mesh code developed at NASA that solves the
incompressible Euler and Navier-Stokes equations; it is an example of an
application that has been ported to the PETSc framework. FUN3D is used in the
design optimization of airplanes and automobiles, problems characterised by
irregular meshes with several million mesh points. In these problems the time
to solution is an important metric, perhaps even more so than the actual flop
rate, since rapid convergence is a necessity in design optimization. The
PETSc-FUN3D effort was first presented at Supercomputing '99 (SC99) and
received a Gordon Bell Prize.
Achieving a low time to solution requires minimizing memory references, and
excellent serial performance is essential because per-processor performance
bounds the achievable parallel performance. Cache performance is therefore
improved with the following approaches:
- Interlacing (storing all unknowns of a vertex contiguously) improves spatial locality
- Structural blocking improves register reuse and hence reduces the
number of loads
- Reordering of vertices and edges improves temporal locality
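The first two techniques can be illustrated together. With interlacing, the bs unknowns at a vertex sit next to each other in memory, so component c of vertex v lives at index v*bs + c rather than c*nv + v. Structural blocking then stores one dense bs×bs block per nonzero vertex-to-vertex coupling (block sparse row), so each column index is loaded once per block instead of once per entry, and each loaded x value is reused across the block's bs rows. The sketch below is a hypothetical plain-C version of this idea, not FUN3D's or PETSc's code.

```c
#include <assert.h>
#include <stddef.h>

/* Interlaced layout: all bs unknowns of vertex v are contiguous. */
static inline int interlaced_index(int v, int c, int bs) { return v * bs + c; }

/* Non-interlaced (field-major) layout, shown for contrast: component c of
   all nv vertices is contiguous, so one vertex's unknowns are nv apart. */
static inline int noninterlaced_index(int v, int c, int nv) { return c * nv + v; }

/* Block sparse row storage: block k is a dense bs x bs matrix stored
   row-major at val[k*bs*bs]; rowptr/colidx index blocks, not scalars. */
typedef struct {
    int nbrows;          /* number of block rows (vertices) */
    int bs;              /* unknowns per vertex */
    const int *rowptr;   /* length nbrows + 1 */
    const int *colidx;   /* block column (vertex) of each block */
    const double *val;   /* nblocks * bs * bs values */
} BsrMatrix;

/* y = A x with x and y in interlaced layout. */
void bsr_mult(const BsrMatrix *A, const double *x, double *y)
{
    const int bs = A->bs;
    assert(bs > 0);
    for (int i = 0; i < A->nbrows; ++i) {
        double *yi = y + (size_t)i * bs;
        for (int r = 0; r < bs; ++r)
            yi[r] = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; ++k) {
            /* One index load per block; the bs entries of x for vertex
               colidx[k] are contiguous thanks to interlacing. */
            const double *xj = x + (size_t)A->colidx[k] * bs;
            const double *blk = A->val + (size_t)k * bs * bs;
            for (int r = 0; r < bs; ++r) {
                double sum = 0.0;
                for (int c = 0; c < bs; ++c)
                    sum += blk[r * bs + c] * xj[c];
                yi[r] += sum;
            }
        }
    }
}
```

In practice the inner loops would typically be specialized for a fixed block size (for example, fully unrolled for bs = 4) so the compiler can keep a block's x and y values in registers; with a run-time bs, as here, that reuse is left to the compiler.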
The resulting performance improvements are shown below for three different architectures.
In each case, the optimizations yield a significant improvement:
As in most large parallel applications that must account for the cost of
accessing data on remote processors, the global data are partitioned, and
scatter/gather operations give each processor a local copy of the data it
needs, including ghost values owned by neighboring processors. Where possible,
the authors use split-phase operations, so that communication time can be
overlapped with computation.
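The scatter/gather step can be sketched serially with explicit index arrays. The index forms listed earlier (indices, block indices, and strides) each reduce to such an array; a gather then fills a processor's local copy, including ghost entries, and a scatter with addition accumulates locally computed contributions back into the global vector. All names below are hypothetical. In PETSc the distributed operation is split into VecScatterBegin and VecScatterEnd calls so that computation can run between them; this serial sketch does not model that overlap.

```c
#include <assert.h>

/* Stride index set: first, first+step, first+2*step, ... (n entries). */
void stride_indices(int n, int first, int step, int *idx)
{
    for (int i = 0; i < n; ++i)
        idx[i] = first + i * step;
}

/* Block index set: each block number b expands to the bs consecutive
   indices b*bs, ..., b*bs + bs - 1 (matches interlaced storage). */
void block_indices(int nblocks, int bs, const int *blocks, int *idx)
{
    assert(bs > 0);
    for (int b = 0; b < nblocks; ++b)
        for (int j = 0; j < bs; ++j)
            idx[b * bs + j] = blocks[b] * bs + j;
}

/* Gather: local[i] = global[idx[i]], e.g. owned plus ghost values. */
void gather(int n, const int *idx, const double *global, double *local)
{
    for (int i = 0; i < n; ++i)
        local[i] = global[idx[i]];
}

/* Scatter with addition: global[idx[i]] += local[i]. Repeated indices
   accumulate, as needed when several edges contribute to one vertex. */
void scatter_add(int n, const int *idx, const double *local, double *global)
{
    for (int i = 0; i < n; ++i)
        global[idx[i]] += local[i];
}
```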
In the parallel scalability results shown below (for the Cray T3E), the reported
implementation efficiency accounts for the increase in the number of
iterations to convergence as the number of processors grows. The
resulting efficiency remains above 80% for the largest configuration (1024
processors), a trend also visible in the near-linear aggregate Gflop/s graph.
The good scalability is partly explained by the T3E's 3D torus interconnect,
whose performance keeps pace with local processor performance.
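The efficiency bookkeeping can be made concrete with a small calculation. We assume the usual split of overall parallel efficiency, measured against a base run on p0 processors, into an algorithmic part (the growth in iterations to convergence) and an implementation part (the cost per iteration), with overall = algorithmic × implementation. The function below is a hypothetical helper for that calculation; it is not taken from the cited work.

```c
#include <assert.h>

/* Efficiencies of a run on p processors relative to a base run on p0. */
typedef struct {
    double overall;         /* (p0 * t0) / (p * t) */
    double algorithmic;     /* it0 / it: penalty from extra iterations */
    double implementation;  /* overall / algorithmic: per-iteration efficiency */
} Efficiency;

/* t0, t: wall-clock times; it0, it: iterations to convergence. */
Efficiency parallel_efficiency(int p0, double t0, int it0,
                               int p, double t, int it)
{
    assert(p0 > 0 && p > 0 && t0 > 0.0 && t > 0.0 && it0 > 0 && it > 0);
    Efficiency e;
    e.overall = ((double)p0 * t0) / ((double)p * t);
    e.algorithmic = (double)it0 / (double)it;
    /* Equivalent to comparing time per iteration:
       (p0 * t0 / it0) / (p * t / it). */
    e.implementation = e.overall / e.algorithmic;
    return e;
}
```

An implementation efficiency that stays high while the algorithmic efficiency falls would point to convergence degradation, rather than communication cost, as the factor limiting scaling.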
In summary, porting FUN3D to PETSc has proven successful not only in terms
of performance but also in terms of programmability and portability.
A more thorough discussion in [3] shows how the application scales on
other architectures. Overall, PETSc served as an effective vehicle for
allowing a complex application to achieve excellent performance across a
range of configurations.
[1] PETSc homepage. http://www-unix.mcs.anl.gov/petsc/petsc-2/
[2] S. Balay, K. Buschelman et al. PETSc Users Manual. ANL-95/11 Revision 2.1.5,
Argonne National Laboratory, 2002.
[3] PETSc-FUN3D homepage. http://www-fp.mcs.anl.gov/petsc-fun3d/
[4] W.K. Anderson, W.D. Gropp et al. Achieving High Sustained Performance in an
Unstructured Mesh CFD Application. IEEE Proceedings of Supercomputing '99, 1999.
-- ChristianBell - 28 Jan 2004