The Aggregate Remote Memory Copy Interface (ARMCI) is the runtime system for the Global Arrays (GA) package provided by Pacific Northwest National Laboratory (PNNL). These are some notes concerning my investigations into this interface, as relevant to the implementation of GAS languages such as UPC and Titanium.
ARMCI seems to be primarily designed for optimizing various flavors of bulk operations (large messages): it is tuned for high bandwidth rather than low latency, and does not appear to provide good support for small, individual memory operations.
ARMCI provides excellent support for bulk contiguous, scatter/gather and strided operations and would be a good substrate for implementing the bulk operations in a GAS language (e.g. upc_memcpy, Titanium array copy, etc.). It also supports a number of atomic remote operations (e.g. accumulate-sum).
It's unclear to what extent ARMCI provides support for non-blocking operations (some of the documents seem to be contradictory). It appears that at least "put" operations are implicitly non-blocking: they return as soon as the local buffer can be reused, and are synchronized using a per-target synchronization function (ARMCI_Fence). Operations to a given target complete in order (which seems like a silly restriction). However, it doesn't appear possible to synchronize on the completion of an individual put operation (only on all the outstanding puts to a particular target, or to all targets), or to issue non-blocking gets at all.
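The put/fence usage described above would look roughly like the following pseudocode sketch (C syntax, not runnable as-is); `remote_addr` and `target` are assumed to have been set up beforehand, e.g. via ARMCI's shared allocation, and the call names (ARMCI_Put, ARMCI_Fence) are those described above:

```c
/* Pseudocode sketch of ARMCI's implicit non-blocking put semantics. */
double buf[1024];
/* ... fill buf with the data to send ... */
ARMCI_Put(buf, remote_addr, sizeof(buf), target); /* returns once buf is reusable */
buf[0] = 1.0;        /* safe: the local buffer may be reused immediately */
ARMCI_Fence(target); /* blocks until ALL outstanding puts to 'target'    */
                     /* are complete -- there is no per-operation handle */
```

Note the consequence of the missing per-operation handle: a caller who needs to know that one specific put has completed remotely must fence the whole target.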
ARMCI guarantees progress for remote requests even in the absence of receiver-side network activity or polling, which means that all ARMCI implementations must either use a separate network thread or support some form of interrupt-based network reception. I'm not planning to restrict the design of GASNet by requiring this progress guarantee, but the GASNet core interface will definitely support implementations that utilize interrupt-driven or separate-thread asynchronous message reception to provide better network attentiveness and reduce average service time for remote requests.
The ARMCI folks acknowledge the inadequacies of MPI-2 one-sided operations and claim that ARMCI is a good alternative to the one-sided interface (for bulk operations, I agree with them).
Paper #4 below presents some very interesting and relevant results concerning the tradeoffs of using intranode LAPI loopback versus System V shared memory to implement local node memory accesses on the IBM SP. They support both a process-based model, which allocates all shared memory within System V shared memory, and a pthread-based model (which naturally shares the entire address space). They conclude there is a measurable latency and bandwidth advantage to using pthreads, or processes with shared memory, for local accesses (no big surprise) because it reduces the number of memory copies and the contention for the network adapter. The improvement is about 15x in one-way latency and 67% more bandwidth for microbenchmarks, which adds up to about a 12% overall performance improvement for the SPLASH LU benchmark on 64 processors (16 x 4-way SMP nodes).
Criticisms of paper #4: they failed to measure the potentially negative caching effects of direct memory access. They also failed to recognize that mapping the entire shared-memory area into each process (especially on large 16-way, 64 GB SP nodes like ours) can quickly exhaust the limits of the 32-bit virtual address space without utilizing the full physical memory resources of the node (i.e. the only reason we plan to ever use GASNet or LAPI for intranode puts and gets is to relieve the problems caused by this memory limit).