The Inadequacy of the MPI 2.0 One-sided Communication API for Implementing Parallel Global Address-Space Languages

by Dan Bonachea

 

Global Address-Space Languages

Parallel global address-space (GAS) languages, such as UPC and Titanium, provide the illusion of a shared address space to a parallel program regardless of the underlying hardware, offering the convenience of shared-memory programming on arbitrary parallel machines. These languages have typically been implemented by interfacing with network-specific, lightweight communication layers provided by the manufacturers of the target architectures. This approach can provide good performance, but typically requires a substantial porting effort when moving to a new architecture.

There has been recent interest in implementing these languages in a way that is easily portable to new parallel architectures, yet still provides high performance (i.e. both high bandwidth and low latency) for remote accesses. The natural approach is to identify a software interface that can be implemented efficiently on a number of network architectures and that provides all the functionality required to implement the languages.

Here are the basic properties these languages require of such an interface:

  1. Portability - the interface must be efficiently implementable across the parallel architectures and networks of interest.
  2. Truly one-sided remote put/get operations, which complete without explicit cooperation from the remote (target) process.
  3. Non-blocking remote operations, so that communication can be overlapped with computation.
  4. Low latency and high bandwidth for remote accesses.
  5. Remote access that does not require collective operations or synchronization among processes.

Our group is investigating the possibility of using a new interface based on Active Messages to achieve these goals. This paper explains why MPI is not an adequate solution.

Additional Background

Before discussing the details of why MPI is inadequate for meeting the above goals, let us provide some more background on the languages of interest.  

Compilers for GAS languages generally have no way to know a priori which specific memory locations will be accessed remotely, or when they will be accessed. An important class of dynamic and irregular applications that we wish to support using GAS languages have communication patterns which are data-dependent, and therefore statically unpredictable - hence the need for one-sided operations. 

UPC has a concept of a shared memory heap which contains all the remotely accessible objects (usage varies by application, but typically the major data structures reside in shared space). In Titanium, every object can potentially be accessed remotely, although a sophisticated compiler escape analysis can detect which objects are potentially "shared" (in interesting applications, the analysis typically judges roughly 50-100% of the total allocated bytes to be shared). 

Both languages allow local accesses to shared objects via "local" pointers, and these accesses are indistinguishable from accesses to purely local (private) objects - i.e. the language semantics provide no explicit information about whether the memory being accessed is potentially shared or not. In both languages, accessing locally resident shared data through "local" pointers generally provides significantly better performance than access through "global" pointers. In fact, the performance impact is so dramatic that most UPC programmers specifically optimize for this case, and the Titanium compiler includes a specialized analysis to automatically infer when such a transformation is provably legal. 

Finally, allocation of shared objects residing in the local process's memory can be achieved in both languages through purely local (i.e. non-collective) allocation operations. UPC even allows shared objects to be allocated on remote processes using a local operation (upc_global_alloc()) with no explicit cooperation from the remote processes hosting the data.
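
As a concrete illustration of these facilities, here is a minimal UPC sketch (invented for this discussion, using the standard upc_alloc, upc_global_alloc and upc_free library calls) showing non-collective shared allocation and the "local" pointer optimization described above:

  #include <upc.h>
  #include <stdio.h>

  int main(void) {
    /* Purely local (non-collective) allocation of shared space with
       affinity to the calling thread; other threads may still access
       it through a pointer-to-shared. */
    shared [] double *mine = (shared [] double *)upc_alloc(100 * sizeof(double));

    /* Non-collective allocation of a shared array distributed across
       all threads: memory residing on remote threads is allocated
       without any explicit cooperation from those threads. */
    shared double *dist = (shared double *)upc_global_alloc(THREADS, 10 * sizeof(double));

    /* "Local" pointer optimization: shared data with affinity to this
       thread may be cast to an ordinary C pointer and accessed at the
       speed of a private access. */
    double *fast = (double *)mine;
    fast[0] = 1.0;

    if (MYTHREAD == 0)
      printf("allocated %d distributed blocks\n", THREADS);

    upc_free(dist);
    upc_free(mine);
    return 0;
  }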

MPI 1.1

The most widely available, high-performance software interface for programming parallel machines is MPI - specifically the MPI 1.1 specification, which has been implemented and carefully tuned on most of the parallel machines of interest. Unfortunately, communication under MPI 1.1 is strictly two-sided (i.e. matched send and recv operations), which makes it generally unsuitable for directly implementing GAS languages. It is possible to simulate one-sided communication over MPI 1.1 using non-blocking operations; however, the communication is not truly one-sided (the receiver must occasionally communicate or poll the network in order for remote operations to make progress), and there is a latency penalty associated with the software buffering that this approach usually implies (relative to a native interface to the hardware network).
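
To make this concrete, the following rough sketch shows one way a remote "get" might be emulated over two-sided MPI 1.1; the tags, buffer and helper names are invented for illustration, and a real implementation would use non-blocking operations and pre-posted buffers rather than the blocking calls shown here:

  #include <mpi.h>

  #define REQ_TAG  1          /* request messages: { offset, nbytes } */
  #define DATA_TAG 2          /* reply messages carrying the data     */

  static char shared_heap[1 << 20];   /* remotely accessible region */

  /* Emulated one-sided get: fetch nbytes at 'offset' from 'target'. */
  void emulated_get(void *dst, int target, long offset, int nbytes) {
    long req[2] = { offset, (long)nbytes };
    MPI_Status st;
    MPI_Send(req, 2, MPI_LONG, target, REQ_TAG, MPI_COMM_WORLD);
    /* Blocks until the target happens to poll and send the reply. */
    MPI_Recv(dst, nbytes, MPI_BYTE, target, DATA_TAG, MPI_COMM_WORLD, &st);
  }

  /* Every process must call this routine often enough for gets that
     target it to make progress - so the communication is not truly
     one-sided. */
  void poll_network(void) {
    int flag;
    MPI_Status st;
    MPI_Iprobe(MPI_ANY_SOURCE, REQ_TAG, MPI_COMM_WORLD, &flag, &st);
    while (flag) {
      long req[2];
      int src = st.MPI_SOURCE;
      MPI_Recv(req, 2, MPI_LONG, src, REQ_TAG, MPI_COMM_WORLD, &st);
      MPI_Send(shared_heap + req[0], (int)req[1], MPI_BYTE,
               src, DATA_TAG, MPI_COMM_WORLD);
      MPI_Iprobe(MPI_ANY_SOURCE, REQ_TAG, MPI_COMM_WORLD, &flag, &st);
    }
  }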

MPI 2.0

There is a more recent version of the MPI specification (MPI 2.0) which extends MPI 1.1 with a number of useful features; however, the interface is very wide and complete implementations are only now starting to become available. MPI 2.0 (Chapter 6) adds support for "one-sided communications" (also called "Remote Memory Access" (RMA)), and a common question is whether this support is adequate to meet the needs of implementing GAS languages. This API was reportedly added specifically to support the needs of users such as ourselves, and on the surface it appears to be a natural fit.

The rest of this document explains why the MPI-2 one-sided API does not meet the above requirements for implementing global address-space languages. The basic conclusion is that the strong restrictions placed on users of the API, together with the weakness of the semantic guarantees provided by the interface, make it unusable for these purposes. Note that we are interested in writing portable code, and therefore are not concerned with the behavior or restrictions of any particular implementation of the MPI-2 one-sided API (which may happen to relax the usage constraints relative to the specification, or have well-defined semantics for conditions the specification labels as erroneous), but rather with the guarantees provided by any MPI-2-compliant implementation (including one which aggressively exploits the intentionally under-specified aspects of the specification). We wish to write code which is guaranteed to be correct on any compliant implementation of the MPI-2 software interface; otherwise we lose the advantage of using a portable interface in the first place.

We will show that while the MPI-2 one-sided API does successfully address the issues of portability, collective operations/synchronization, and possibly latency performance, the combination of usage restrictions conspires to prevent its use for true one-sided communication and non-blocking operations in GAS language implementations.

Basics of the MPI-2 One-sided API 

The semantics of the MPI-2 one-sided communication API are complicated and difficult to understand, which is itself a significant barrier to usage - both for MPI clients seeking to write code with guaranteed well-defined semantics, and for MPI implementors seeking to write a compliant implementation of the API. Here we try to unravel some of the key usage points and refer the reader to the full document for details.

The one-sided API revolves around the use of abstract objects called "windows", which intuitively specify regions of a process's memory that have been made available for remote operations by other MPI processes. Windows are created using a collective operation (MPI_Win_create) called by all processes within a "communicator" (a group of processes in MPI terminology); each process specifies a base address and length (which may differ across processes), and a window is permitted to span very large areas (e.g. the entire virtual address space). All three one-sided RMA operations (MPI_Put, MPI_Get, MPI_Accumulate) take a reference to such a window and a rank integer indicating which process is the remote target. All one-sided operations are implicitly non-blocking and must be synchronized using one of the synchronization methods described below.
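
For concreteness, here is a minimal sketch (invented for this discussion) in which every process exposes a buffer in a window and rank 0 puts a value into rank 1's memory, using the collective MPI_Win_fence synchronization described in the next section:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
    int rank, nprocs;
    double local[100] = {0};          /* memory exposed for RMA */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Collective: every process in the communicator exposes its buffer
       (the base address and length may differ on each process). */
    MPI_Win_create(local, 100 * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);            /* open an access epoch */
    if (rank == 0 && nprocs > 1) {
      double val = 3.14;
      /* Implicitly non-blocking put into element 5 of rank 1's window */
      MPI_Put(&val, 1, MPI_DOUBLE, 1, 5, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);            /* puts complete only here */

    if (rank == 1) printf("local[5] = %g\n", local[5]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
  }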

Active Target vs. Passive Target

There are two primary "modes" in which the one-sided API can be used, named "active target" and "passive target". The primary semantic distinction is whether or not cooperation is required from the remote (target) process in order to complete a remote memory access. All RMA operations on a window must take place within a synchronization "epoch" (with a start and end point defined by explicit synchronization calls), and operations are not guaranteed to be complete until the end of such an epoch. The active and passive target modes differ in which process makes these synchronization calls.

Active target operation requires synchronization functions to be called on both the origin process (the one making the RMA get/put accesses) and the target process (the one hosting the memory in the referenced window). The origin process calls MPI_Win_start/MPI_Win_complete to begin/end the synchronization epoch, and the target process must cooperate by calling MPI_Win_post/MPI_Win_wait to acknowledge the beginning/end of the epoch (there is also a collective MPI_Win_fence operation which can be substituted for one or more of these calls). In any case, this required cooperation effectively destroys the possibility of using active-target RMA to implement the truly one-sided operations that we wish to provide in GAS languages.
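
The following sketch (again invented for illustration, with rank 0 as origin and rank 1 as target, and assuming a window exposing doubles as in the previous example) shows the required cooperation - the target cannot simply remain passive:

  #include <mpi.h>

  /* 'win' is assumed to expose a buffer of doubles on every process
     (as in the previous sketch); ranks 0 (origin) and 1 (target) are
     chosen arbitrarily for illustration. */
  void active_target_example(MPI_Win win, int rank) {
    MPI_Group world_grp, peer_grp;
    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

    if (rank == 0) {                    /* origin side */
      int target = 1;
      double val = 42.0;
      MPI_Group_incl(world_grp, 1, &target, &peer_grp);
      MPI_Win_start(peer_grp, 0, win);  /* begin access epoch */
      MPI_Put(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
      MPI_Win_complete(win);            /* end access epoch */
      MPI_Group_free(&peer_grp);
    } else if (rank == 1) {             /* target side MUST cooperate */
      int origin = 0;
      MPI_Group_incl(world_grp, 1, &origin, &peer_grp);
      MPI_Win_post(peer_grp, 0, win);   /* expose the window to the origin */
      MPI_Win_wait(win);                /* wait for the origin to finish */
      MPI_Group_free(&peer_grp);
    }
    MPI_Group_free(&world_grp);
  }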

Passive target operation provides more lenient synchronization. In passive-target operation, only the originating process calls synchronization functions (MPI_Win_lock/MPI_Win_unlock) to start/end the access epoch. As with active target, all RMA accesses must take place within such an epoch and are not guaranteed to complete until the MPI_Win_unlock call completes. 

There are two forms of MPI_Win_lock - shared and exclusive. MPI_Win_lock(exclusive) enforces mutual exclusion on the window and the RMA operations performed within the epoch - i.e. it blocks until it can start an exclusive access epoch to the window, and no other processes may enter a shared or exclusive access epoch for that window until the process with exclusive access unlocks (the semantics are actually slightly weaker than this, but the intuition is correct). MPI_Win_lock(shared) allows other concurrent shared epochs from other processes. The spec recommends the use of exclusive epochs when executing any local or RMA update operations on the memory encompassed by the window to ensure well-defined semantics.
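
By contrast, a minimal passive-target sketch (also invented for illustration, and assuming at least two processes) needs synchronization calls only at the origin; here the window memory is allocated with MPI_Alloc_mem because, as discussed below, implementations may require this for passive-target RMA:

  #include <mpi.h>

  /* Passive-target sketch: window memory comes from MPI_Alloc_mem,
     since implementations may require this for passive-target RMA.
     Assumes at least two processes; all of them call this routine. */
  void passive_target_example(int rank) {
    double *buf;
    MPI_Win win;

    MPI_Alloc_mem(100 * sizeof(double), MPI_INFO_NULL, &buf);
    MPI_Win_create(buf, 100 * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {                    /* the origin is the only process
                                           that makes synchronization calls */
      double val = 7.0;
      MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);   /* target rank 1 */
      MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
      MPI_Win_unlock(1, win);           /* the put completes here */
    }
    /* Rank 1 (the target) makes no RMA or synchronization calls. */

    MPI_Win_free(&win);                 /* collective */
    MPI_Free_mem(buf);
  }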

Restrictions on the Use of Passive Target RMA

The interface described thus far for passive-target RMA seems reasonable; unfortunately, however, there are a large number of restrictions on how it may legally be used. Here are some of the most important restrictions:

  1. Window creation is a collective operation - all processes which intend to use a window for RMA (including all intended origin and target processes) must participate in the creation of that window.
  2. Implementors may restrict the use of passive-target RMA operations to only work on memory allocated using the "special" memory allocator MPI_Alloc_mem (p. 131). This prevents the use of passive-target RMA on static data and forces all globally-visible objects to be allocated using this "special" allocation call (no guarantees are made about how much memory can be allocated using this call, although some implementations are likely to restrict it to a small number of pinnable pages).
  3. It is erroneous to have concurrent conflicting RMA get/put (or local load/store) accesses to the same memory location (p. 113).
  4. The memory spanned by a window may not concurrently be updated by a remote RMA operation and a local store operation (i.e. within a single access epoch) - even if these two updates access different (i.e. non-overlapping) locations in the window (p. 113). 
  5. Multiple windows are permitted to include overlapping memory regions; however, it is erroneous to use concurrent operations on distinct, overlapping windows (p. 111).
  6. RMA operations on a given window are only permitted to access the memory of a single process during an access epoch (p. 131).

Implications 

Now, let us investigate the implications of the above restrictions on our effort to implement remote accesses in a GAS language.

Conclusions

In conclusion, the MPI-2 one-sided API was a nice idea with good potential, but the overly strong usage restrictions placed on clients of the interface conspire to make the API inadequate for implementing global address-space languages such as UPC and Titanium, which require efficient, low-latency, one-sided and non-blocking remote memory operations. Hopefully a future version of the MPI one-sided specification will address these issues and provide an interface better adapted to the needs of GAS language implementations - one that can serve as a portable network substrate for building such systems.