SP Communication Layer Questions

Michael Welcome
December, 2001

Background

During the UPC meeting on Thursday December 6, 2001, Kathy asked each of us to answer a collection of questions concerning the communication subsystem for each of the target architectures.  The goal is to help determine the best model of parallelism for UPC within the confines of a single SMP node.

The choices are:

1. A thread model: UPC threads are implemented as threads (e.g., Pthreads) within a single process per node.
2. A process model: UPC threads are implemented as separate processes that share data through shared memory segments.

There are numerous issues involved in each choice.  For example, static (non-stack) variables in UPC are local to each thread or process.  In a thread model these variables must be stored in a thread-specific data region, while shared variables are automatically visible to all the threads.  In a process model, all variables are private to the process, but shared variables must be placed in shared memory segments.  Another issue is access to the network for communication with processes/threads on another node.  UPC requires a high-performance, low-latency network and therefore some form of user-space access to the networking hardware.
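As a concrete illustration of the storage issue above, the C sketch below shows the two mechanisms involved: keeping a per-thread "static" in thread-specific data under a thread model, and creating a shared region with SysV shared memory under a process model.  The function names (my_static, attach_shared_region) are hypothetical and are not part of any UPC runtime; this is a sketch of the underlying POSIX/SysV calls, not an implementation.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    /* Thread model: each UPC thread gets its own copy of a "static" variable
     * by storing it in thread-specific data keyed off a pthread_key_t.      */
    static pthread_key_t my_static_key;

    static void make_key(void)
    {
        pthread_key_create(&my_static_key, free);
    }

    int *my_static(void)                     /* hypothetical accessor */
    {
        static pthread_once_t once = PTHREAD_ONCE_INIT;
        pthread_once(&once, make_key);

        int *val = pthread_getspecific(my_static_key);
        if (val == NULL) {                   /* first use by this thread */
            val = calloc(1, sizeof(int));
            pthread_setspecific(my_static_key, val);
        }
        return val;
    }

    /* Process model: shared variables must live in a shared memory segment
     * that every process attaches; private variables need no special care. */
    void *attach_shared_region(key_t key, size_t bytes)  /* hypothetical helper */
    {
        int id = shmget(key, bytes, IPC_CREAT | 0600);
        if (id < 0)
            return NULL;
        return shmat(id, NULL, 0);           /* each process maps the same segment */
    }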

IBM SP Information

The IBM SP supports three communication APIs: MPI, LAPI and OpenMP (Fortran only).  Of these, LAPI is the most basic (lowest level) and should provide the best performance.  It is a one-sided communication library that supports remote puts and gets, as well as general "active messages".  At the hardware level, each node has up to two switch adaptors.  Access to the switch can be achieved either via IP, which requires traversing a protocol stack in the kernel as well as (possibly multiple) data copies, or via the US (user-space) protocol.  A "user-space window" is a virtualized adaptor that does not require a kernel transition and allows data to be DMAed directly between the user's address space and the adaptor.  Regardless of the number of physical adaptors on each node, the number of user-space windows is fixed on a per-system basis.  A user-space window can be bound to a particular adaptor (css0, css1, ...) or to a "bonding driver", csss, which multiplexes communication across all the physical adaptors.  MPI applications can use either IP or US; LAPI applications must use US.  Note that a lower-level API to the switch adaptors, called the Hardware Abstraction Layer (HAL), exists but is not exported by IBM to the user community.
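To make the one-sided flavor of LAPI concrete, here is a minimal sketch of a remote put followed by a wait on the origin-side counter.  The entry points used (LAPI_Init, LAPI_Setcntr, LAPI_Put, LAPI_Waitcntr, LAPI_Term) are real LAPI calls, but the exact argument lists below are reconstructed from memory and should be checked against <lapi.h>; treat the signatures, and the omission of address exchange (normally done with LAPI_Address_init), as assumptions.

    #include <lapi.h>
    #include <string.h>

    /* Sketch: put 'len' bytes from a local buffer into a buffer in task 'tgt's
     * address space, then wait until the origin buffer may be reused.
     * Argument order approximates the LAPI manual; verify against <lapi.h>.   */
    void put_example(void *remote_addr, void *local_buf, int len, int tgt)
    {
        lapi_handle_t hndl;
        lapi_info_t   info;
        lapi_cntr_t   org_cntr, cmpl_cntr;
        int           tmp;

        memset(&info, 0, sizeof(info));
        LAPI_Init(&hndl, &info);          /* one handle per task (consumes one US window) */

        LAPI_Setcntr(hndl, &org_cntr, 0);
        LAPI_Setcntr(hndl, &cmpl_cntr, 0);

        /* One-sided put: the target task issues no matching receive.
         * remote_addr must be a valid address in the target's address space,
         * typically obtained earlier via LAPI_Address_init (omitted here).   */
        LAPI_Put(hndl, tgt, len, remote_addr, local_buf,
                 NULL /* tgt_cntr */, &org_cntr, &cmpl_cntr);

        /* Wait until the data has left the origin buffer. */
        LAPI_Waitcntr(hndl, &org_cntr, 1, &tmp);

        LAPI_Term(hndl);
    }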

Parallel applications consist of a collection of "tasks" communicating via MPI or LAPI (or both).  A task is the tree of all processes and threads that is started as a unit of work by the job management system.  In MPI terms, each task is assigned a single MPI rank.  Parallelism within a task is achieved using multiple threads.  When running a job, the user specifies the number of nodes and the number of tasks per node.  For example, an MPI job with 32 tasks could run on 4 nodes with 8 tasks per node, 2 nodes with 16 tasks per node, etc.  If the tasks are multi-threaded, the user will want to adjust the number of tasks per node so that the threads do not contend for the available processors.  Each task that uses MPI requires a user-space window for MPI; a task that uses LAPI also requires a user-space window; a task that uses both MPI and LAPI therefore requires two user-space windows.  Thus, on a system that provides 16 user-space windows per node, an application that uses both LAPI and MPI may run at most 8 tasks per node.  In this situation, the tasks should be multi-threaded to make full use of the hardware.  The other alternative would be to use MPI over IP and reserve all 16 user-space windows for LAPI.  Unfortunately, IP has much higher latency and lower bandwidth than US.
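The window arithmetic above is simple enough to capture in a few lines.  The helper below is hypothetical and exists only to illustrate the counting; it computes the maximum number of tasks per node given the per-node window count and which user-space protocols each task opens.

    /* Hypothetical helper: maximum tasks per node, given the number of
     * user-space windows a node provides and the windows each task needs.
     * A task needs one window for MPI over US, one for LAPI, two for both. */
    int max_tasks_per_node(int windows_per_node, int uses_mpi_us, int uses_lapi)
    {
        int windows_per_task = (uses_mpi_us ? 1 : 0) + (uses_lapi ? 1 : 0);
        if (windows_per_task == 0)
            return windows_per_node;  /* e.g., MPI over IP only: windows are not the limit */
        return windows_per_node / windows_per_task;
    }

    /* Examples: max_tasks_per_node(16, 1, 1) == 8   (MPI/US + LAPI, as on seaborg)
     *           max_tasks_per_node(16, 0, 1) == 16  (MPI over IP, LAPI over US)    */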

Communication between two MPI tasks on the same node is done using shared memory segments.  Communication between two LAPI tasks on the same node goes through the switch adaptor.  That is, LAPI does not shortcut the operation; the switch adaptor notices that it is both the source and the destination and copies the data back up to the appropriate target user-space window.

On the NERSC SP (seaborg.nersc.gov): each node provides 16 user-space windows, shared among all tasks on the node regardless of which user-space protocols (MPI and/or LAPI) they use.

Questions

How many processes can share a communication endpoint?  
ONE.
An endpoint on the SP is a user-space window.  Each task (process) must have its own window for each communication protocol it uses (MPI and/or LAPI).  Two tasks cannot share a user-space window.

How many endpoints per adaptor?
16 endpoints per node.  
As mentioned above, the user-space windows are virtual adaptors that multiplex use of the physical adaptor(s).  The number of windows is fixed on a per-system basis.  Seaborg.nersc.gov has 16 per node; the previous system (gseaborg) had two processors and one adaptor per node but only four user-space windows per node.

Can threads share an endpoint?
Yes, all the threads in a task share the endpoint(s) assigned to the task.

What does the MPI implementation do about mixing shared memory and threads?
The IBM MPI implementation is thread safe; that is, all the threads in a task can make MPI calls.  On-node communication between two tasks is done through SysV shared memory.  MPI and LAPI cannot share the same user-space window, so applications that use both can only run half the number of tasks per node.
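As a minimal illustration of multi-threaded tasks driving a thread-safe MPI, the sketch below uses the standard MPI-2 call MPI_Init_thread to request MPI_THREAD_MULTIPLE so that every thread in the task may make MPI calls.  Whether the 2001-era IBM MPI accepted this call (as opposed to an environment-variable switch), and what the threads then do, are assumptions for illustration only.

    #include <mpi.h>
    #include <pthread.h>
    #include <stdio.h>

    /* Each thread in the task makes its own MPI call, which is legal
     * only when MPI_THREAD_MULTIPLE is provided.                      */
    static void *worker(void *arg)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("task %d, thread %ld making MPI calls\n", rank, (long)(size_t)arg);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            fprintf(stderr, "full thread support not granted (got %d)\n", provided);

        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);

        MPI_Finalize();
        return 0;
    }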