Results of LAPI Performance Tests on the NERSC IBM SP

Michael Welcome
December, 2001

Objective

The goal of this study is to understand how best to implement a UPC run-time communication layer over LAPI on the IBM SP.  In the UPC programming model, data can either be local to a processor or distributed on remote processors but accessible through the abstractions of global pointers and arrays.  When needed, a value is simply read (or written) rather than the programmer having to issue a message-passing call.  The read triggers the run-time communication layer to access the value across the system interconnect and read it into a local register.  For UPC to be most effective, the interconnect latency must be low and, ideally, aided by special hardware that can off-load the work of fetching the value from the CPU.  In such an environment, the UPC compiler can pre-fetch remote values and perform overlapped computation (with other data) while the values are in transit.

Preliminary Results

Initial results indicate that LAPI would be a poor run-time communication layer for UPC.  Not only is the communication latency high, but the communication calls are heavy-weight.  Rather than simply issuing a communication operation for an asynchronous hardware engine to process, nearly every LAPI function call, including the communication calls, enters the "dispatcher".  The dispatcher is the LAPI progress engine: it packetizes messages for outstanding LAPI communication operations and feeds them to the switch adaptor.  It also receives and re-assembles packets into target buffers, acknowledges packet receipts, manages packet flow control and deals with the re-transmission of unacknowledged packets.  As a result, a call into the dispatcher can take a nearly arbitrary amount of time for the calling thread.  This limits the UPC compiler's ability to predict when a value will be available and makes it difficult to overlap computation with communication.

Although the IBM Switch Adaptor cards have an independent Power PC processor, memory and a DMA engine, the programming interface to the adaptor, the Hardware Abstraction Layer (HAL), is not exported to the user.  LAPI is the lowest level communication interface available to user applications.

LAPI Overview

LAPI is a library that supports one-sided communication between the tasks of a parallel application.  It supports non-blocking PUT, GET and Active Message calls along with "counters" that can tell the initiator (or target) task that an operation is complete.   For example, a LAPI_Put operation is a remote write in which the origin task writes data directly into the virtual address space of a target task, without the target application being aware it is happening.  The initiator of the PUT operation can specify up to three counters:

LAPI Counters
Origin Counter      - Incremented on the origin when it is safe to re-use the data region being copied.
Target Counter      - Incremented on the target when the data has arrived at its destination.
Completion Counter  - Incremented on the origin when the operation is complete at both the origin and the target.

A LAPI_Get operation is a read from remote memory, for which only the Target and Completion counters are available.
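
As a concrete illustration of the interface described above, the following minimal sketch issues a LAPI_Put and a LAPI_Get and waits on the associated counters.  The argument order follows my reading of the LAPI documentation and should be checked against lapi.h on the target system; remote_buf is assumed to be an address in the peer's virtual address space obtained ahead of time (for example with LAPI_Address_init, described below).

#include <lapi.h>

/* Minimal sketch: a remote write followed by a remote read between this
 * task and task "peer".  "remote_buf" is assumed to be a virtual address
 * in the peer's address space, obtained earlier. */
void put_get_example(lapi_handle_t hndl, int peer,
                     double *local_buf, double *remote_buf, int n)
{
    lapi_cntr_t org_cntr, tgt_cntr, cmpl_cntr;
    int tmp;

    LAPI_Setcntr(hndl, &org_cntr, 0);
    LAPI_Setcntr(hndl, &tgt_cntr, 0);
    LAPI_Setcntr(hndl, &cmpl_cntr, 0);

    /* PUT: write n doubles from local_buf into the peer's remote_buf. */
    LAPI_Put(hndl, peer, n * sizeof(double), remote_buf, local_buf,
             &tgt_cntr, &org_cntr, &cmpl_cntr);
    LAPI_Waitcntr(hndl, &org_cntr, 1, &tmp);   /* safe to re-use local_buf   */
    LAPI_Waitcntr(hndl, &cmpl_cntr, 1, &tmp);  /* data is in place at target */

    /* GET: read n doubles from the peer's remote_buf into local_buf.  The
     * last counter is incremented at the origin once the data has arrived
     * (the "completion" counter in the terminology above). */
    LAPI_Setcntr(hndl, &cmpl_cntr, 0);
    LAPI_Get(hndl, peer, n * sizeof(double), remote_buf, local_buf,
             &tgt_cntr, &cmpl_cntr);
    LAPI_Waitcntr(hndl, &cmpl_cntr, 1, &tmp);
}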

General "active message" calls are provided by the Lapi_Amsend function.  This is a one-sided communiction operation in which the initiator sends data to the target and specifies a user-supplied function to execute on the data when it arrives.  Specifically, the active message call specifies the data to be sent, the virtual address of a "header_handler" function to be called when the first packet of data arrives on the target, as well as an argument structure (of limited size) which gets copied to the remote task and is passed to the header_handler as an argument.  An Origin, Target add Completion counter can also be specified in the call.  It works as follows:
LAPI also provides functions for gather/scatter data transfers, using either vectors of buffer locations on the origin and target or general strided operations.  Further, LAPI_Rmw provides remote read-modify-write operations, in which the remote operation can be any of SWAP, COMPARE_AND_SWAP, FETCH_AND_ADD and FETCH_AND_OR.

A task can check on the status of a communication operation by examining the appropriate counter.  LAPI provides two functions for this purpose: LAPI_Getcntr simply returns the current value of the counter (and does not enter the dispatcher), while LAPI_Waitcntr blocks the calling thread until the counter reaches a desired value.  Note that LAPI_Waitcntr also has the side effect of decrementing the counter by this value.  In addition, LAPI_Probe is a function that simply attempts to make progress by entering the dispatcher.  One can emulate LAPI_Waitcntr by polling in a loop, with alternating calls to LAPI_Getcntr and LAPI_Probe.  Finally, LAPI_Setcntr provides a mechanism to initialize a counter value.
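
For reference, the LAPI_Getcntr/LAPI_Probe polling pattern used by the "_P_" benchmark variants later in this report can be sketched as follows; lapi_poll_until is a hypothetical helper name, not part of LAPI.

#include <lapi.h>

/* Hypothetical helper: spin until the counter reaches at least "val",
 * driving the LAPI progress engine ourselves.  This emulates LAPI_Waitcntr
 * with a LAPI_Getcntr/LAPI_Probe loop, except that it does not decrement
 * the counter the way LAPI_Waitcntr does. */
static void lapi_poll_until(lapi_handle_t hndl, lapi_cntr_t *cntr, int val)
{
    int cur = 0;

    for (;;) {
        LAPI_Getcntr(hndl, cntr, &cur);   /* cheap: does not enter the dispatcher */
        if (cur >= val)
            break;
        LAPI_Probe(hndl);                 /* enter the dispatcher to make progress */
    }
}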

LAPI also offers simple synchronization mechanisms: LAPI_Gfence is a global barrier and LAPI_Fence is a local communication barrier.  That is, LAPI_Fence blocks until all outstanding local communication operations are complete.  Finally, since the one-sided communication calls must specify the virtual address of target locations in the remote task, LAPI_Address_init provides a mechanism for tasks to exchange address values.
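
A typical setup sequence is therefore: allocate a local buffer, publish its address to the other tasks, and synchronize before any one-sided traffic begins.  The sketch below assumes the NUM_TASKS query constant and the LAPI_Address_init argument order as I recall them from the LAPI documentation; verify both against lapi.h.

#include <stdlib.h>
#include <lapi.h>

/* Sketch: allocate a local buffer and exchange its address with every task
 * so that they can later target it with LAPI_Put/LAPI_Get.  On return,
 * (*addr_tab_out)[t] holds task t's buffer address. */
double *setup_window(lapi_handle_t hndl, int nelem, void ***addr_tab_out)
{
    int ntasks;
    double *buf;
    void **addr_tab;

    LAPI_Qenv(hndl, NUM_TASKS, &ntasks);              /* number of tasks in the job */
    buf = (double *) malloc(nelem * sizeof(double));
    addr_tab = (void **) malloc(ntasks * sizeof(void *));

    LAPI_Address_init(hndl, buf, addr_tab);           /* all-to-all address exchange */
    LAPI_Gfence(hndl);                                /* barrier before any one-sided traffic */

    *addr_tab_out = addr_tab;
    return buf;
}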

LAPI Architecture

The LAPI_Init function call creates a context and prepares the task for communication with other tasks in the parallel job.  It creates several threads within the context of the task to assist in making progress on communication calls.  A completion handler thread is created to run all active message completion handlers.  A notification thread is created by registering a callback function with the HAL layer.  This function executes when LAPI runs in Interrupt mode and a message has been received on the HAL receive FIFO.  It will not execute when LAPI is put into Polling mode.  A third LAPI thread, the retransmission thread, is created, again by registering a callback with the HAL layer, to deal with retransmission timeouts.  It executes every 400000 usec to deal with stalled transactions and will terminate the task (or call a handler) if progress is not made on a transaction within the TIMEOUT period.  

Nearly all LAPI function calls attempt to make progress on outstanding communications by executing the dispatcher, the LAPI progress engine.  The dispatcher code is a critical section, so each thread must obtain a lock before entry.  The dispatcher executes the header handler for any newly arriving messages, copies incoming data into destination buffers for ongoing arrivals, makes progress on stalled outgoing messages as credit tokens become available, and re-transmits data for unacknowledged send operations.

LAPI can be set to operate in either Interrupt mode (the default) or Polling mode.  In Interrupt mode, the notification thread will attempt to execute the dispatcher whenever messages arrive on the receive FIFO.  An examination of the source code indicates that some form of interrupt coalescing is occurring, such that the notification thread only executes after several receive events have occurred.  The number of receive events it takes to trigger the notification thread varies by some dynamic algorithm.  The notification thread is disabled when LAPI is put into Polling mode.  In Polling mode, if the user application does not make calls into the LAPI library, progress is only made every 400000 usec when the re-transmission thread executes.  In fairly synchronous applications, where latency is important, Polling mode performs better and with less timing variance than Interrupt mode.  The downside of Polling mode is that the application thread does all the communication work.
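
For reference, a minimal sketch of how a task might select the mode at run time, assuming that the INTERRUPT_SET attribute of LAPI_Senv is the correct knob (check the LAPI documentation on the target system):

#include <lapi.h>

/* Sketch: select Polling or Interrupt mode for an existing LAPI context.
 * INTERRUPT_SET is assumed to be the LAPI_Senv attribute that controls
 * whether the notification thread fires on packet arrival. */
void set_polling_mode(lapi_handle_t hndl, int polling)
{
    LAPI_Senv(hndl, INTERRUPT_SET, polling ? 0 : 1);  /* 0 = polling, 1 = interrupts (default) */
}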

Interrupt mode also introduces a larger variance in the latency timings.  The dynamic nature of the interrupt coalescing algorithm might explain part of this, but lock contention between the user thread and the notification thread may also play a large role.  Inspection of the source code reveals that nearly all LAPI function calls block waiting for a global lock prior to doing any work.  If the notification thread holds the lock, because it is executing the dispatcher and making progress on all outstanding communication calls, the user thread blocks until the notification thread relinquishes control.  On the other hand, if the user thread holds the lock while executing a LAPI function call and the notification thread is scheduled (due to an interrupt), the notification thread tries to get the lock, fails, and simply goes back to sleep, knowing the user thread will enter the dispatcher and deal with the newly arrived packets.  In a sense, this is the reverse of what we would like for the UPC runtime layer.  We would like the notification thread to do the vast majority of the progress work and the LAPI calls made from the user thread to be as lightweight as possible.  For this to occur, LAPI would have to use finer-grain locks so that the user thread could simply queue communication operations without acquiring the "big" lock and then, optionally, enter the dispatcher to make progress if the dispatcher lock were available.

Latency and Bandwidth Measurements

We define "Latency" as the time required to send a message of N bytes to a remote task and receive acknowledgement of its arrival.  Bandwidth is the rate, in MB/sec, that the application sees when sending a message to a remote task.  In Lapi, we can implement these with either puts or gets.  For both the Lapi_Get and Lapi_Put implementation, we measure the time on the origin between when the call is issued and when the completion counter is incremented.

The graphs of Latency and Bandwidth are presented below.  The computed times are averages of multiple message sends for each message size (10000 iterations).  The curve titles define the type of operation and the modes in which the operation was performed.  For example, "GET_W_POLLING" means it was a GET operation in which we Wait on the completion counter with LAPI in POLLING mode.  Likewise, "PUT_P_INTERRUPT" means it was a PUT operation in which we wait for the completion counter in a LAPI_Getcntr/LAPI_Probe loop, with LAPI set in INTERRUPT mode.  Note that all tests were run between two dedicated nodes, that is, there were no other user tasks running on the nodes at the time.

The Bandwidth graph shows substantial variation in bandwidth for large messages, especially when LAPI is in interrupt mode.  Multiple runs of the code produce similar results in that LAPI Polling mode performs better than LAPI Interrupt mode and shows less variation.  

LAPI Bandwidth Graph


Selected Bandwidth Values using LAPI_Waitcntr

WaitCntr Mode     1168 Bytes     14337 Bytes     61639 Bytes     127812 Bytes
Put_Polling       21.5 MB/sec    118.8 MB/sec    251.5 MB/sec    302.9 MB/sec
Get_Polling       21.1 MB/sec    117.4 MB/sec    246.4 MB/sec    244.6 MB/sec
Put_Interrupt     19.9 MB/sec    112.9 MB/sec    247.5 MB/sec    300.3 MB/sec
Get_Interrupt     20.0 MB/sec    116.9 MB/sec    242.0 MB/sec    294.4 MB/sec


The Latency graph is really only interesting for small messages.  Again we see that we get the best performance with LAPI Polling mode.  For some reason, Interrupt mode while polling for the completion counter with a LAPI_Getcntr/LAPI_Probe loop produces latencies a factor of two larger than using LAPI_Waitcntr.  Again, substantial variation was observed between runs with LAPI in Interrupt mode.

LAPI Latency Graph


Selected Latency Values using LAPI_Waitcntr

WaitCntr Mode     2 Bytes      8 Bytes      60 Bytes     1168 Bytes
Put_Polling       38.1 usec    41.3 usec    40.3 usec    51.7 usec
Get_Polling       35.4 usec    35.4 usec    38.1 usec    52.9 usec
Put_Interrupt     41.2 usec    40.6 usec    43.0 usec    55.9 usec
Get_Interrupt     39.2 usec    39.3 usec    41.8 usec    55.6 usec

For the small messages, the acknowledgement (the notification from the target that the completion counter can be incremented) represents a round trip.  Given this, we see a one-way small message latency (from origin buffer to target buffer) of 18-20 microseconds.    

One compiler optimization that can help to mitigate this higher latency is to overlap communication calls.  The hardware may be able to handle multiple communication calls at once (in parallel) more efficiently than by executing them in a serial fashion.  If so, the effective latency for each call is reduced.  To test this, we issue Q_Depth communication calls of a given message size and measure how long it takes for all of them to complete.  Of course, we perform the test a large number of times for each message size and Q_Depth and report the average time.
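
One timing step of this test can be sketched as follows, assuming GET operations and a single shared completion counter; the helper and variable names are hypothetical.

#include <lapi.h>

/* Sketch: issue q_depth overlapped GETs of "len" bytes and wait for all of
 * them to complete on a single counter.  Each GET increments cmpl_cntr once
 * when its data has arrived, so waiting for the value q_depth reaps them all. */
void issue_and_wait(lapi_handle_t hndl, int peer, void *remote_addr,
                    char *local_buf, int len, int q_depth)
{
    lapi_cntr_t tgt_cntr, cmpl_cntr;
    int tmp, i;

    LAPI_Setcntr(hndl, &tgt_cntr, 0);
    LAPI_Setcntr(hndl, &cmpl_cntr, 0);
    for (i = 0; i < q_depth; i++) {
        /* each request lands in its own slice of the local buffer */
        LAPI_Get(hndl, peer, len, remote_addr, local_buf + (long) i * len,
                 &tgt_cntr, &cmpl_cntr);
    }
    LAPI_Waitcntr(hndl, &cmpl_cntr, q_depth, &tmp);
}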

The graphs below show the effective bandwidth and small message latency for GET operations in LAPI polling mode at various values of Q_Depth.

Effective Bandwidth Graph

As we can see, the effective bandwidth rises more steeply and peaks at about 350 MB/sec for this test.  Similarly, the effective latency drops to about 13 microseconds (round trip).  Furthermore, both bandwidth and latency saturate at a Q_Depth in the range of 8-10, indicating no further benefit (but also little harm) in issuing more than this number of communication operations simultaneously.

Effective Latency Graph

Computation and Communication Overlap

In addition to overlapping communication calls with each other, we can overlap them with computation.  That is, the compiler issues communication calls for a collection of values it will need in the near future and then, while the values are in transit, issues instructions to compute on other data.  In this section we measure the amount of overlap we can obtain by interleaving computation with communication calls.  The pseudo-code for the test is given below:

// Inputs:
Num_Iter;     // Total number of Puts or Gets to issue  (default = 10000)
Q_Depth;      // MAX number of simultaneous Puts/Gets issued at a time
N_Float;      // Number of times to iterate the CPU loop per issue
Size;         // Number of bytes per message (default = 8 bytes)


// Algorithm:
Fill_Level = (Q_Depth+1)/2;  // try to keep this number of issues in progress
                             // at all times
Issued = 0;                  // Total number of Gets/Puts already issued
Flops = 0;                   // Number of N_Float-sized CPU work units completed

T1 = Get_Timer();             // Start the timer (usec)
Issue_Comm(Size, Q_Depth);   // Issue Q_Depth Put or Get operations
Issued = Q_Depth;

while (Issued < Num_Iter) {
    Cpu_Loop(N_Float * Fill_Level);       // Floating point loop
    Flops += Fill_Level;

    to_issue = MIN(Num_Iter - Issued, Fill_Level);
    Reap_And_Reissue(Size, to_issue);     // See comments below
    Issued += to_issue;
}
Cpu_Loop(N_Float * (Num_Iter - Flops));   // Run the remaining CPU work units
Reap_Until_Done();                        // Wait for all comm to complete
T2 = Get_Timer();                         // Stop the timer.

Return (double) (T2 - T1)/Num_Iter;       // The Effective Latency

The function "Issue_Comm" will issue the given number of non-blocking communication calls (PUTS or GETS) of the given message size.  For this test we used 8-byte messages.  The "Cpu_Loop" simply computes a floating point operation the given number of times.  One must be careful that this loop is not optimized away, as it effectively measures the ammount of computation we can overlap with communication.  The "Reap_And_Reissue" function polls for the completion of the communication operations and, when found, re-issues another, until "to_issue" operations have been re-issued.  Finally, "Reap_Until_Done" simply waits for all outstanding communication operations to complete.  The value returned is the effective latency in microseconds.

For this test, when using a PUT operation, we are not really interested in when the data arrives at the target, just in when the sending buffer can be re-used.  That is, for PUT we wait for the origin counter to increment.  For the GET operation, we need to wait until the data arrives; that is, we wait until the completion counter increments.

In the graphs below, we show the effective latency as a function of the time, in microseconds, it takes to compute N_Float floating point operations.  The results shown are for LAPI_Get operations in LAPI Polling mode.  We saw no overlap of computation with PUT operations because, for 8-byte messages, the data had already been sent by the time LAPI_Put returned.  For LAPI_Get, we expect that, initially, the floating point loop has no effect on the total time because it consumes CPU cycles that would otherwise sit idle while the processor waits for the communication to complete.  Eventually, the value of N_Float increases to the point where the floating point loop consumes more cycles than are available for communication overlap, and the curve begins to rise.

As we can see, the curve begins to rise at about 20 usec for a Q_Depth of one, and less than 5 microseconds (if any) for a Q_Depth greater than one.

Communication Overlap Graph 1

Note that the graphs are fairly noisy even though each data point is the average of 20000 iterations.

In an attempt to probe the region of interest for Q_Depth > 1, we re-ran the calculation with more data points at the left end of the curve.  Note that we removed the Q_Depth = 1 curve to concentrate on values of Q_Depth > 1.  It appears that for a Q_Depth of two the overlap is about 3-4 microseconds, and for larger values of Q_Depth the overlap is nearly insignificant.

Overlap Test - Close Up


I repeated the experiment with a message size of 5211 bytes, assuming that it would increase the communication costs and therefore show a more dramatic overlap of computation with communication.  The results show significant changes for a Q_Depth of one and two, but beyond that the overlap is minor.  This indicates that the majority of the work involved in sending messages is performed by LAPI and not by the switch adaptors.


Overlap Test with 4K Message Size