During the UPC meeting on Thursday December 6, 2001, Kathy asked each of
us to answer a collection of questions concerning the communication subsystem
for each of the target architectures. The goal is to help determine
the best model of parallelism for UPC within the confines of a single SMP.
The choices are:
- Processes using either SysV shared memory or shared (mmap'ed) pages
- The native thread and synchronization model available on each target
There are numerous issues involved in each choice. For example, static
(non-stack) variables in UPC are local to each thread or process. In
a thread model these variables must be stored in a thread specific data region,
but shared variables are automatically available to all the threads. In
a process model, all variables are private to the process, but shared variables
must be placed in shared memory segments. Another issue is access to
the network for communication to processes/threads on another node. UPC
will require a high-performance, low-latency network and therefore some
form of user-space access to the networking hardware.
IBM SP Information
The IBM SP supports three communication APIs: MPI, LAPI and OpenMP (Fortran
only). Of these, LAPI is the most basic (lowest level) and should provide
the best performance. It is a one-sided communication library that
supports remote puts and gets, as well as general "active messages". At
the hardware level, each node has up to two switch adaptors. Access
to the switch can be achieved using either IP, which requires traversing
a protocol stack in the kernel as well as (possibly multiple) data copies,
or the US (user-space) protocol. A "user-space window" is a virtualized
adaptor which does not require a kernel transition, and provides the ability
to DMA data to and from the user's address space and the adaptor. Regardless
of the number of physical adaptors on each node, the number of user-space
windows is fixed on a per-system basis. A user-space window can be
bound to a particular adaptor (css0, css1, ...) or to a "bonding driver" csss,
which multiplexes communications across all the physical adaptors. MPI
applications can use either IP or US; LAPI applications must use US.
Note that a lower level API to the switch adaptors, called the Hardware Abstraction
Layer (HAL), exists but is not exported by IBM to the user community.
Parallel applications consist of a collection of "tasks" communicating via
MPI or LAPI (or both). A task is the tree of all processes and threads
that gets started as a unit of work by the job management system. In
terms of MPI, a task is assigned a single MPI rank. Parallelism within
a task is achieved using multiple threads. When running a job, the
user specifies the number of tasks per node and the number of nodes. For
example, an MPI job with 32 tasks could run on 4 nodes with 8 tasks per
node, 2 nodes with 16 tasks per node, etc. If the tasks are multi-threaded,
the user will want to adjust the number of tasks per node to avoid contention
among the threads for the available processors. Each task that uses
MPI requires a user-space window for MPI. A task that uses LAPI also
requires a user-space window. A task that uses both MPI and LAPI requires
two user-space windows. Thus, on a system that provides 16 user-space
windows, an application that uses both LAPI and MPI may run at most 8 tasks
per node. In this situation, the tasks should be multi-threaded to
make full use of the hardware. The other alternative would be to use
MPI with the IP communication protocol and reserve all 16 of the user-space
windows for LAPI. Unfortunately, IP has much higher latency and lower
bandwidth than US.
Communication between two MPI tasks on the same node is done using shared
memory segments. Communication between two LAPI tasks on the same node
is done through the switch adaptor. That is, LAPI does not shortcut
the operation; the switch adaptor notices that it is both source and destination
and copies the data back up to the appropriate target user-space window.
On the NERSC SP:
- Each node has 16 processors.
- Each node has 2 switch adaptors (css0, css1)
and the software "bonding driver" csss.
- There are only 16 user-space windows available
on each node. The user-space window is our "communication endpoint".
How many processes can share a communication endpoint?
How many endpoints per adaptor?
An endpoint on the SP is a user-space window. Each task (process) must
have its own window for each communication protocol it uses (MPI and/or LAPI).
Two tasks cannot share a user-space window.
16 endpoints per node.
Can threads share an endpoint?
As mentioned above, the user-space windows are virtual adaptors which multiplex
use of the physical adaptor. The number of windows is fixed on a per-system
basis. Seaborg.nersc.gov has 16, but the previous system (gseaborg)
had two processors per node and one adaptor per node, but four user-space windows.
Yes, all the threads in a task
share the endpoint(s) assigned to the task.
What does the MPI implementation do about mixing
shared memory and threads?
The IBM MPI implementation is thread-safe; that is, all
the threads in a task can make MPI calls. On-node communication between
two tasks is done through SysV shared memory. MPI and LAPI cannot share
the same user-space window so applications that use both can only run half
the number of tasks per node.