Parallel global address-space (GAS) languages such as UPC and Titanium provide the illusion of a shared address space to a parallel program regardless of the underlying hardware, offering the convenience of shared-memory programming on arbitrary parallel hardware. These languages have typically been implemented by interfacing with network-specific and vendor-specific lightweight communication layers provided by the manufacturers of the target architectures. This approach can provide good performance, but typically requires a substantial porting effort when moving to a new architecture.
There has been recent interest in implementing these languages in a way that is easily portable to new parallel architectures, yet still provides high performance (i.e. high bandwidth and low latency) for remote accesses. The natural mechanism is to identify a software interface that can be efficiently implemented on a number of network architectures and that provides all the functionality required to implement the languages.
Here are the basic properties these languages require of such an interface:
Our group is investigating the possibility of using a new interface based on Active Messages to achieve these goals. This paper explains why MPI is not an adequate solution.
Before discussing the details of why MPI is inadequate for meeting the above
goals, let us provide some more background on the languages of
interest.
Compilers for GAS languages generally have no way to know a priori which
specific memory locations will be accessed remotely, or when they will be
accessed. An important class of dynamic and irregular applications that we wish
to support using GAS languages has communication patterns which are
data-dependent, and therefore statically unpredictable - hence the need for
one-sided operations.
UPC has a concept of a shared memory heap which contains all the remotely
accessible objects (usage varies by application, but typically the major data
structures reside in shared space). In Titanium, every object can potentially be
accessed remotely, although a sophisticated
compiler escape analysis can detect which objects are potentially
"shared" (interesting applications typically have about 50-100% of the
total bytes allocated judged to be shared by the analysis).
Both languages allow local accesses to shared objects via "local"
pointers, and these accesses are indistinguishable from accesses to purely local
(private) objects - i.e. the language semantics provide no explicit information
about whether the memory being accessed is potentially shared or not. In both
languages, accessing shared data which resides locally through "local"
pointers generally provides significantly better performance than access through
"global" pointers. In fact, the performance impact is so dramatic that
most UPC programmers specifically optimize for this case, and the Titanium
compiler includes a specialized analysis
to automatically infer when such a transformation is provably legal.
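The difference between "global" and "local" pointer accesses can be illustrated with a minimal C sketch. This is not how UPC or Titanium actually represent pointers; the names `global_ptr`, `global_read`, `to_local` and `sum_local`, the fixed `my_rank`, and the simulated remote get are all assumptions made for illustration. The point is only that an access through a global pointer must check locality (and possibly communicate) every time, while a local pointer is an ordinary address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wide ("global") pointer: owning process plus address there. */
typedef struct {
    int   owner;   /* rank of the process whose memory holds the object */
    long *addr;    /* address, meaningful in the owner's address space */
} global_ptr;

static int  my_rank = 0;       /* assumed rank of the calling process */
static long remote_gets = 0;   /* counts simulated network round trips */

/* Stand-in for a one-sided network get; a real runtime would send a request.
 * This toy has a single address space, so it simply loads the value. */
static long simulated_remote_get(global_ptr p) {
    remote_gets++;
    return *p.addr;
}

/* Read through a global pointer: every access pays for the locality check. */
long global_read(global_ptr p) {
    if (p.owner == my_rank)
        return *p.addr;
    return simulated_remote_get(p);
}

/* Convert to a plain local pointer; legal only when the data is local. */
long *to_local(global_ptr p) {
    assert(p.owner == my_rank);
    return p.addr;
}

/* Sum n elements via a local pointer: one check up front, then plain loads. */
long sum_local(global_ptr base, size_t n) {
    long *q = to_local(base);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += q[i];
    return total;
}
```

In this sketch, `sum_local` corresponds to the hand optimization UPC programmers apply (or the transformation the Titanium compiler infers): the locality test is hoisted out of the loop, so the loop body is indistinguishable from an access to private memory.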
Finally, allocation of shared objects residing in the local process memory can
be achieved in both languages through purely local (i.e. non-collective)
allocation operations. UPC even allows shared objects to be allocated on remote
processes using a local operation (upc_global_alloc()) with no
explicit cooperation from the process allocating the data.
The most widely available, high-performance software interface for programming parallel machines is MPI - specifically the MPI 1.1 specification, which has been implemented and carefully tuned on most of the parallel machines of interest. Unfortunately, communication under MPI 1.1 is strictly two-sided (i.e. matched send and recv operations), which makes it generally unsuitable for directly implementing GAS languages. It is possible to simulate one-sided communication over MPI 1.1 using non-blocking operations; however, the communication is not truly one-sided (because it requires the receiver to occasionally communicate or poll the network in order for remote operations to make progress), and there is some latency penalty associated with the software buffering that this approach usually implies (relative to a native interface to the hardware network).
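The reason an emulation over two-sided messaging is not truly one-sided can be seen in a toy model. The following plain C sketch does not use MPI at all; `target_proc`, `emulated_put` and `target_poll` are hypothetical names, and the in-memory inbox merely stands in for messages that have been sent but not yet received. It is a sketch under those assumptions, not a description of any actual runtime.

```c
#include <assert.h>
#include <stddef.h>

#define MEM_SLOTS 8
#define QUEUE_CAP 16

/* A put request as it might travel inside an ordinary two-sided message. */
typedef struct {
    size_t index;   /* destination slot in the target's memory */
    long   value;   /* value to store there */
} put_request;

/* Toy target process: its memory plus a queue of not-yet-received messages. */
typedef struct {
    long        memory[MEM_SLOTS];
    put_request inbox[QUEUE_CAP];
    size_t      pending;
} target_proc;

/* Origin side: "send" the request. Returns 0 if the inbox is full, else 1.
 * Note that the target's memory is NOT modified here. */
int emulated_put(target_proc *t, size_t index, long value) {
    assert(index < MEM_SLOTS);
    if (t->pending == QUEUE_CAP)
        return 0;
    t->inbox[t->pending].index = index;
    t->inbox[t->pending].value = value;
    t->pending++;
    return 1;
}

/* Target side: the target itself must call this for queued puts to take
 * effect. Returns the number of requests serviced. */
size_t target_poll(target_proc *t) {
    size_t serviced = t->pending;
    for (size_t i = 0; i < t->pending; i++)
        t->memory[t->inbox[i].index] = t->inbox[i].value;
    t->pending = 0;
    return serviced;
}
```

If the target is busy in a long computation and never reaches `target_poll`, the put makes no progress, and the queued request occupies buffer space in the meantime. Both effects correspond to the progress and buffering costs noted above.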
There is a more recent version of the MPI specification (MPI 2.0) which extends MPI 1.1 with a number of useful features; however, the interface is very wide and complete implementations are just now starting to become available. MPI 2.0 (Chapter 6) adds support for "one-sided communications" (also called "Remote Memory Access (RMA)"), and a common question is whether this support is adequate to meet the needs of implementing GAS languages. This API was reportedly added specifically to support the needs of users such as ourselves, and on the surface it seems to be a natural fit.
The rest of this document explains why the MPI-2 one-sided API does not meet the above requirements for implementing global address space languages. The basic conclusion is that the strong restrictions placed on users of the API and the weakness of the semantic guarantees provided by the interface make it unusable for these purposes. Note that we are interested in writing portable code, and therefore are not concerned with the behavior or restrictions of a particular implementation of the MPI-2 one-sided API (which may happen to relax the usage constraints relative to the specification or have well-defined semantics for conditions the specification labels as erroneous), but rather with the guarantees provided by any MPI-2-compliant implementation (including one which aggressively exploits the intentionally under-specified aspects of the specification). We wish to write code which is guaranteed to be correct on any compliant implementation of the MPI-2 software interface; otherwise we lose the advantage of using a portable interface in the first place.
We will show that while the MPI-2 one-sided API does successfully address the issues of portability, collective operations/synchronization and possibly latency performance, its usage restrictions in combination prevent using it for true one-sided communication and non-blocking operations in GAS language implementations.
The semantics of the MPI-2 one-sided communication API are complicated and difficult to understand, which in itself is a significant barrier to usage - both for MPI clients seeking to write code with guaranteed well-defined semantics, and MPI implementors seeking to write a compliant implementation of the API. Here we try to unravel some of the key usage points and refer the reader to the full document for details.
The one-sided API revolves around the use of abstract objects called "windows" which intuitively specify regions of a process's memory which have been made available for remote operations by other MPI processes. Windows are created using a collective operation (MPI_Win_create) called by all processes within a "communicator" (a group of processes in MPI terminology), which specifies a base address and length (which may be different on each process), and is permitted to span very large areas (e.g. the entire virtual address space). All three one-sided RMA operations (MPI_Put, MPI_Get, MPI_Accumulate) take a reference to such a window and a rank integer to indicate which process is the remote target. All one-sided operations are implicitly non-blocking and must be synchronized using one of the synchronization methods described below.
There are two primary "modes" in which the one-sided API can be used, named "active target" and "passive target". The primary semantic distinction is whether or not cooperation is required by the remote node in order to complete a remote memory access. All RMA operations on a window must take place within a synchronization "epoch" (with a start and end point defined by explicit synchronization calls), and operations are not guaranteed to be complete until the end of such an epoch. The active and passive target modes differ in which process makes these synchronization calls.
Active target operation requires synchronization functions to be called on both the origin process (the one making the RMA get/put accesses) and the target process (the one hosting the memory in the referenced window). The origin process calls MPI_Win_start/MPI_Win_complete to begin/end the synchronization epoch, and the target process must cooperate by calling MPI_Win_post/MPI_Win_wait to acknowledge the beginning/end of the epoch (there is also a collective MPI_Win_fence operation which can be substituted for one or more of these calls). In any case, this required cooperation effectively destroys the possibility of implementing the truly one-sided operations that we wish to provide in GAS languages using active-target mode RMA.
Passive target operation provides more lenient synchronization. In
passive-target operation, only the originating process calls synchronization
functions (MPI_Win_lock/MPI_Win_unlock) to start/end the access epoch. As with
active target, all RMA accesses must take place within such an epoch and are not
guaranteed to complete until the MPI_Win_unlock call completes.
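The epoch rule for passive target access can be made concrete with a toy model in plain C. This is not MPI code: `toy_win`, `toy_win_lock`, `toy_put` and `toy_win_unlock` are hypothetical stand-ins for a window and for the MPI_Win_lock / MPI_Put / MPI_Win_unlock sequence, and the model shows just one strategy a compliant implementation would be free to adopt, namely queuing every put at the origin and applying it only at unlock.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OPS 8

/* One put recorded by the origin during an open epoch. */
typedef struct {
    size_t disp;    /* displacement into the target window */
    long   value;   /* value to be written */
} queued_put;

/* Toy window: target memory plus the puts queued in the current epoch. */
typedef struct {
    long      *base;        /* start of the exposed target memory */
    size_t     len;         /* number of elements exposed */
    int        epoch_open;  /* nonzero between lock and unlock */
    queued_put ops[MAX_OPS];
    size_t     n_ops;
} toy_win;

/* Start an access epoch (models the origin calling the lock function). */
void toy_win_lock(toy_win *w) {
    assert(!w->epoch_open);
    w->epoch_open = 1;
    w->n_ops = 0;
}

/* Record a put; nothing is written yet. Returns 0 if the queue is full. */
int toy_put(toy_win *w, size_t disp, long value) {
    assert(w->epoch_open);   /* RMA outside an epoch is not permitted */
    assert(disp < w->len);
    if (w->n_ops == MAX_OPS)
        return 0;
    w->ops[w->n_ops].disp = disp;
    w->ops[w->n_ops].value = value;
    w->n_ops++;
    return 1;
}

/* End the epoch: only now are the queued puts guaranteed to be applied. */
void toy_win_unlock(toy_win *w) {
    assert(w->epoch_open);
    for (size_t i = 0; i < w->n_ops; i++)
        w->base[w->ops[i].disp] = w->ops[i].value;
    w->n_ops = 0;
    w->epoch_open = 0;
}
```

A client that needs a single remote write to become visible therefore has to close the epoch around it. In this model, a value passed to `toy_put` cannot be observed in the target memory until `toy_win_unlock` returns.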
There are two forms of MPI_Win_lock - shared and exclusive.
MPI_Win_lock(exclusive) enforces mutual exclusion on the window and the RMA
operations performed within the epoch - i.e. it blocks until it can start an
exclusive access epoch to the window, and no other processes may enter a shared
or exclusive access epoch for that window until the process with exclusive
access unlocks (the semantics are actually slightly weaker than this, but the
intuition is correct). MPI_Win_lock(shared) allows other concurrent shared
epochs from other processes. The spec recommends the use of exclusive epochs
when executing any local or RMA update operations on the memory encompassed by
the window to ensure well-defined semantics.
The interface described thus far for passive target RMA seems reasonable; unfortunately, there are a large number of restrictions on how it may be legally used. Here are some of the most important restrictions:
Now, let us investigate the implications of the above restrictions on our effort to implement remote accesses in a GAS language.
In conclusion, the MPI-2 one-sided API was a nice idea with good potential, but the overly strong usage restrictions placed on the client of the interface conspire to make the API inadequate for implementing global address space languages such as UPC and Titanium, which require the means for efficiently implementing low-latency one-sided and non-blocking remote memory operations. Hopefully a future version of the MPI one-sided specification will address these issues and provide an interface better adapted to the needs of GAS language implementations, one that can serve as a portable network substrate for building such systems.