Objective
The goal of this study is to understand how best to implement a UPC
run-time communication layer over LAPI on the IBM SP. In the UPC programming
model, data can either be local to a processor or distributed on remote
processors but accessable through the abstractions of global pointers and
arrays. When needed, a value is simply read (or written) rather than
the programmer having to issue a message passing call. The read triggers
the run-time communication layer to access the value across the system interconnect
and read it into a local register. For UPC to be most effective, the
interconnect latency must be low and, ideally, aided by special hardware that
can off-load the work of fetching the value from the CPU. In such an
environment, the UPC compiler can pre-fetch remote values and perform overlapped
computation (with other data) while the values are in transit.
Preliminary Results
Initial results indicate that LAPI would be a poor run-time communication
layer for UPC. Not only is the communication latency high, but the
communication calls are heavy-weight. Rather than simply issueing
a communication operation for an asyncronous hardware engine to process,
nearly every LAPI function call, including the communiction calls, enter
the "dispatcher". The dispatcher is the LAPI progress engine, it packetizes
messages on outstanding LAPI communication operation and feeds them
to the switch adaptor. It also receives and re-assembles packets into
target buffers, acknowledges packet receipts, manages packet flow control
and deals with the re-transmission of unacknowledged packets. That
is, a call into the dispatcher can take a nearly arbitrary amount of time
for the calling thread. This limits the UPC compilers' ability to predict
when the value will be available and makes it diffucult to overlap computation
with communication.
Although the IBM Switch Adaptor cards have an independent Power PC processor,
memory and a DMA engine, the programming interface to the adaptor, the Hardware
Abstraction Layer (HAL), is not exported to the user. LAPI is the
lowest level communication interface available to user applications.
LAPI Overview
LAPI is a library that supports one-sided communication between
the tasks of a parallel application. It supports non-blocking
PUT, GET and Active Message passing calls along with "counters" that
can tell the initiator (or target) task that the operation is complete.
For example, a
Lapi_Put operation
is a remote write in which the origin process writes data directly to
the virtual address space of a target task, without the target application
being aware it is happening. The initiator of the PUT operation
can specify up to three counters:
LAPI Counters
|
|
Incremented on Origin when it is safe to re-use
the data region being copied
|
Target Counter
|
Incremented on the target when the data has arrived
at its destination
|
Completion Counter
|
Incremented on the origin when the operation is
complete both at the origin and target.
|
A
Lapi_Get operation is a read
from remote memory and for which only Target and Completion counters
are available.
General "active message" calls are provided by the
Lapi_Amsend function. This is a one-sided communiction operation
in which the initiator sends data to the target and specifies a user-supplied
function to execute on the data when it arrives. Specifically, the
active message call specifies the data to be sent, the virtual address
of a "header_handler" function to be called when the first packet of data
arrives on the target, as well as an argument structure (of limited size)
which gets copied to the remote task and is passed to the header_handler
as an argument. An Origin, Target add Completion counter can also
be specified in the call. It works as follows:
- When the data first begins to arrive at the target, the LAPI dispatcher
executes the spedified header_handler function with the arguments
supplied by the caller, along with the size (in bytes) of the associated
message. The header_handler must return the address (within the
target tasks virtual address space) where the message should be placed.
It may also return (via a reference argument) the address of a
"completion_handler" function that will be called when the entire message
has arrived.
- As message data packets arrive, the LAPI dispatcher places them
in the location specified by the header_handler.
- When all the data arrives, the completion_handler function (if
specified) is scheduled to execute.
- The Target counter is incremented after the completion_handler
executes, or after the last data packet arrives in the case where
no completion_handler is specified.
LAPI also provides functions for gather/scatter data transfers using either
vectors of buffer locations on the origin and target or general strided operations.
Further,
LAPI_Rmw provides for remote
read-modify-write operations, in which the remote operations can be any of
SWAP, COMPARE_AND_SWAP, FETCH_AND_ADD and FETCH_AND_OR.
A task can check on the status of a communiation operation by examining
the appropriate counter. LAPI provides two functions for this purpose:
LAPI_Getcntr simply returns the current value
of the counter (and does not enter the dispatcher), while
LAPI_Waitcntr will block the calling thread until the counter
reaches a desired value. Note that LAPI_Waitcntr also has the side
effect of decrementing the counter by this value. In addition,
LAPI_Probe is a function that simply attempts to make progress
by entering the dispatcher. One can emulate LAPI_Waitcntr by polling
in a loop, with alternate calls to LAPI_Getcntr and LAPI_Probe. Finally,
LAPI_Setcntr provides a mechanism to initialize
a counter value.
LAPI also offers simple synchronization mechanisms:
LAPI_Gfence is a global barrier and
LAPI_Fence
is a local communication barrier. That is, LAPI_Fence will block
until all outstanding local communication operations are complete. Finally,
since the one-sided communication calls must specify the the virtual address
of target locations in the remote task,
LAPI_Address_init
provides a mechanism for tasks to exchange address values.
Lapi Architecture
The
LAPI_Init function call creates
a context and prepares the task for communication with other tasks in
the parallel job. It creates several threads within the context
of the task to assist in making progress on communication calls. A
completion handler thread is created to run
all active message completion handlers. A
notification
thread is created by registering a callback function with the HAL
layer. This function executes when LAPI runs in Interrupt mode and
a message has been received on the HAL receive FIFO. It will not execute
when LAPI is put into Polling mode. A third LAPI thread, the
retransmission thread, is created, again by registering a callback
with the HAL layer, to deal with retransmission timeouts. It executes
every 400000 usec to deal with stalled transactions and will terminate
the task (or call a handler) if progress is not made on a transaction
within the TIMEOUT period.
Nearly all LAPI function calls attempt to make progress on outstanding
communications by executing the Dispatcher; the LAPI progress engine.
The dispatcher code is a critical section so each thread must obtain
a lock before entry. The dispatcher executes the header handler
for any newly arriving messages, copies incoming data into destination
buffers for ongoing arrivals, make progress on stalled outgoing messages
as credit tokens become available and re-transmits data for un-acknowledged
send operations.
LAPI can be set to operate in either
Interrupt mode
(the default) or
Polling mode. In
interrupt mode, the notification thread will attempt to execute the
dispatcher whenever messages arrive on the receive FIFO. An examination
of the source code indicates that some form of interrupt coalescing is occuring,
such that the notification thread only executes after several receive
events have occurred. The number of receive events it takes to trigger
the notification thread execution varies by some dynamic algorithm. The
notification thread is disabled when LAPI is put into Polling mode. In
Polling mode, if the user application does not make calls into the LAPI
library, progress is only made every 400000 usec when the re-transmission
thread executes. In fairly synchronous applications, where latency
is important, polling mode performs better and with less timing variance
than interrupt mode. The downside of polling mode is that the application
thread is doing all the communication work.
Interrupt mode also introduces a larger variance in the latency timings.
The dynamic nature of the interrupt coalescing algorithm might explain
part of this, but lock contention between the user thread and the notification
thread may also play a large role. Inspection of the source code reveals
that nearly all LAPI function calls block waiting for a global lock prior
to doing any work. If the notification thread holds the lock, because
it is executing the dispatcher and making progress on all outstanding communication
calls, the user thread blocks until it relinquishes control. On the
other hand, if the user thread holds the lock while executing a LAPI function
call and the notification thread is scheduled (due to an interrupt), the
notification thread tries to get the lock, fails and simply goes back to sleep
knowing the user thread will enter the dispatcher and deal with the newly
arrived packets. In a sense, this is the reverse of what we would like
for the UPC runtime layer. We would like the notification thread to
do the vast majority of the work on making progress and keep the calls to
the LAPI functions from the user thread as lightweight as possible. In
order for this to occur, LAPI would have to have to use finer grain locks
so that the user thread could simply queue communication operations without
gaining the "big" lock and then, optionally, enter the dispatcher to make
progress if the dispatcher lock is available.
Latency and Bandwidth Measurements
We define "Latency" as the time required to send a message
of N bytes to a remote task and receive acknowledgement of its arrival.
Bandwidth is the rate, in MB/sec, that the application sees when
sending a message to a remote task. In Lapi, we can implement these
with either puts or gets. For both the Lapi_Get and Lapi_Put implementation,
we measure the time on the origin between when the call is issued and
when the completion counter is incremented.
The graphs of Latency and Bandwidth are presented below. The
computed times are averages of multiple message sends for each message
size (10000 iterations). The curve titles define the type of
operation and the modes in which the operation was performed. For
example, "GET_W_POLLING" means it was a GET operation in which we Wait
on the completion counter with LAPI in POLLING mode. Likewise, "PUT_P_INTERRUPT"
means it was a PUT operation in which we wait for the completion counter
in a LAPI_Getcntr/LAPI_Probe loop, with LAPI set in INTERRUPT mode. Note
that all tests were run between two dedicated nodes, that is, there were
no other user tasks running on the nodes at the time.
The Bandwidth graph shows substantial variation in bandwidth
for large messages, especially when LAPI is in interrupt mode. Multiple
runs of the code produce similar results in that LAPI Polling mode performs
better than LAPI Interrupt mode and shows less variation.
Selected Bandwidth Values using LAPI_Waitcntr
WaitCntr Mode
|
1168 Bytes
|
14337 Bytes
|
61639 bytes
|
127812 bytes
|
Put_Polling
|
21.5 MB/sec
|
118.8 MB/sec
|
251.5 MB/sec
|
302.9
MB/sec
|
Get_Polling
|
21.1 MB/sec
|
117.4 MB/sec
|
246.4 MB/sec
|
244.6 MB/sec
|
Put_Interrupt
|
19.9 MB/sec
|
112.9 MB/sec
|
247.5 MB/sec
|
300.3 MB/sec
|
Get_Interrupt
|
20.0 MB/sec
|
116.9 MB/sec
|
242.0 MB/sec
|
294.4 MB/sec
|
The Latency graph is really only interresting for small messages.
Again we see that we get the best performance with LAPI Polling
mode. For some reason, Interrupt mode while polling for the completion
counter with a LAPI_Getcntr/LAPI_Probe loop produces latencies a factor
of two larger than using LAPI_Waitcntr. Again, substantial variation
was observed between runs with LAPI in Interrupt mode.
Selected Latency Values using LAPI_Waitcntr
WaitCntr Mode
|
2 Byte
|
8 Bytes
|
60 bytes
|
1168 bytes
|
Put_Polling
|
38.1 usec
|
41.3 usec
|
40.3 usec
|
51.7 usec
|
Get_Polling
|
35.4 usec
|
35.4 usec
|
38.1 usec
|
52.9 usec
|
Put_Interrupt
|
41.2 usec
|
40.6 usec
|
43.0 usec
|
55.9 usec
|
Get_Interrupt
|
39.2 usec
|
39.3 usec
|
41.8 usec
|
55.6 usec
|
For the small messages, the acknowledgement (the notification from the
target that the completion counter can be incremented) represents a round
trip. Given this, we see a one-way small message latency (from origin
buffer to target buffer) of 18-20 microseconds.
One compiler optimization that can help to midigate this higher latency
is to overlap communication calls. The hardware may be able to handle
multiple communication calls at once (in parallel) more efficiently than
by executing them in a serial fashon. If so, the effective latency for
each call is reduced. To test this, we issue Q_Depth communiction calls
of a given message size and measure how long it takes for all to complete.
Of course, we perform the test a large number of times for each message
size and Q_Depth and report the average time.
The graphs below show the effective bandwidth and small message latency
for GET operations in LAPI polling mode at various values of Q_Depth.
As we can see, the effective bandwidth rises more steeply and peeks out
at about 350 MB/sec for this test. Similarly, the effective latency
drops to about 13 microseconds (round-trip). Furthermore, both bandwidth
and latency saturate with a Q_Depth in the range of 8-10, indicating no further
benefit (but also little harm) in issuing more than this number of communication
operations simultaniously.
Computation and Communication Overlap
In addition to overlapping communication calls, we can overlap multiple
communication calls with computation. That is, the compiler issues communication
calls for a collection of values it will need in the near future, then, while
the values are being fetched, it issues instructions to compute on other
data. In this section we measure the amount of overlap we can obtain
by interleaving computation with communication calls. The psuedo-code
for the test is given below:
// Inputs:
Num_Iter; // Total number
of Puts or Gets to issue (default = 10000)
Q_Depth; // MAX Number
of simultanious Puts/Gets issued at a time
N_Float; // Number
of times to iterate the CPU loop per issue
Size; // Number
of bytes per message (default = 8 bytes)
// Algorithm:
Fill_Level = (Q_Depth+1)/2; // try
to keep this number of issues in progress
// at
all times
Issued = 0;
// Total number of Gets/Puts already
issued
Flops = 0;
// Total number of times cpu_loop has
executed
T1 = Get_Timer();
// Start the timer (usec)
Issue_Comm(Size, Q_Depth); // Issue
Q_Depth Put or Get operations
Issued = Q_Depth;
while (Issued < Num_Iter) {
Cpu_Loop(N_Float * Fill_Level);
// Floating point loop
Flops += Fill_Level;
to_issue = MIN(Num_Iter - Issued, Fill_Level);
Reap_And_Reissue(Size, to_issue);
// See comments below
Issued += to_issue;
}
Cpu_Loop(N_Float * (Num_Iter - Flops));
// Flop remaining number of times
Reap_Until_Done();
// Wait
for all comm to complete
T2 = Get_Timer();
// Stop the
timer.
Return (double) (T2 - T1)/Num_Iter;
// The Effective Latency
The function "Issue_Comm" will issue the given number of non-blocking
communication calls (PUTS or GETS) of the given message size. For
this test we used 8-byte messages. The "Cpu_Loop" simply computes
a floating point operation the given number of times. One must be
careful that this loop is not optimized away, as it effectively measures
the ammount of computation we can overlap with communication. The "Reap_And_Reissue"
function polls for the completion of the communication operations and,
when found, re-issues another, until "to_issue" operations have been
re-issued. Finally, "Reap_Until_Done" simply waits for all outstanding
communication operations to complete. The value returned is the
effective latency in microseconds.
For this test, when using a PUT operation, we are not really interrested
in when the data arrives at the target, just when the sending buffer can be
re-used. That is, for PUT we wait for the origin counter to increment.
For the GET operation, we need to wait until the data arrives; that
is, we wait until the completion counter increments.
In the graphs below, we show the effective latency as a function
of the time, in microseconds, it takes to compute N_Float floating point
operations. The results we show are for Lapi_Get operations operating
in LAPI Polling mode. We saw no overlap of computation with PUT operations
because, for 8-byte messages, by the time LAPI_Put returned, the data was
sent. For LAPI_Get, we expect that, initially, the floating point
loop has no effect on the total time because it is consuming CPU cycles
that would otherwise be idle while the processor is waiting for the communication
to complete. Eventually, the value of N_Float increases to the point
where the floating point loop consumes more cycles than is available for
communication overlap and the curve begins to rise.
As we can see, the curve begins to rise at about 20 usec for
a Q_Depth of one, and less than 5 microseconds (if any) for a Q_Depth
greater than one.
Note that the graphs are fairly noisy even though the datapoints are taken
as the average value of 20000 iterations.
In an attempt to probe the region of interest for Q_Depth > 1, we re-ran
the calculation with more data points at the left end of the curve. Note
that we removed the Q_Depth = 1 curve to concentrate on values of Q_Depth
> 1. It appears that for a Q_Depth of two, the overlap is about 3-4
microseconds, and for larger values values of Q_Depth, the overlap is nearly
insignificant.
I repeated the experiment with a message size of 5211 bytes,
assuming that it would increase the communication costs and therefore
show a more dramatic overlap of computation with communication. The
results show significant changes for a Q_Depth of 1 and two, but beyond that
the overlap is minor. This indicates that the majority of the work
involved in sending messages is performed by LAPI and not the switch adaptors.