AMLAPI - An Implementation of AM-2 using LAPI for the IBM SP.

Michael Welcome
January 2002


Overview

AMLAPI is an implementation of the AM-2 active message specification for the IBM SP using LAPI.  AMLAPI was originally written by Simon Yau at UC Berkeley as part of a class project.  Its main use is as the run-time communication layer for the UC Berkeley Titanium compiler.  The version presented here modifies Simon's library in an effort to improve Titanium communication performance.  The modifications were made by me and by Dan Bonachea of UC Berkeley.

Short Message Performance

Titanium relies heavily on AM-2 short and medium-length messages.  AMLAPI implements AM request and reply messages using the general LAPI active message call (LAPI_Amsend).  When the first packet of such a message arrives at the target, the LAPI header handler mallocs space for the message and tells the dispatcher where to place the incoming data; it also registers a completion handler to be called when the entire message has arrived.  In addition, the header handler places a token for this message on the AM bundle task queue.  Upon return from the header handler, the dispatcher acknowledges the original packet and arranges for all incoming data for this message to be written to the supplied buffer.  Once the entire message has arrived, the completion handler runs (in a special completion handler thread) and marks the token on the task queue as "ready".  The AM request and reply handlers themselves execute during a call to AM_Poll: AM_Poll examines the first element of the queue and, if it is marked "ready", executes the corresponding user handler.  AM request and reply handlers must run in the context of the user application thread(s) that enter the bundle, so the completion handler cannot execute the user handler directly.  Request handlers are required to issue an AM reply, which causes another LAPI_Amsend call in the opposite direction.
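
In code, the large-message path looks roughly like the sketch below.  This is an illustrative reconstruction, not the actual AMLAPI source: am_token_t, make_token, and task_queue_enqueue are hypothetical names, and the handler prototypes follow the LAPI header/completion handler interface (exact argument types vary slightly between LAPI releases; check lapi.h on your system).

    #include <lapi.h>
    #include <stdlib.h>

    /* Hypothetical token placed on the AM bundle task queue. */
    typedef struct am_token {
        struct am_token *next;
        volatile int     ready;    /* set once the whole message has arrived */
        void            *payload;  /* malloc'ed buffer for the message body  */
        /* ... user handler pointer, handler arguments, etc. ... */
    } am_token_t;

    extern am_token_t *make_token(void *uhdr, uint uhdr_len, ulong msg_len);
    extern void        task_queue_enqueue(am_token_t *tok);

    /* Runs in the LAPI completion handler thread after the dispatcher has
     * deposited every packet of the message; it only flags the token. */
    static void completion_handler(lapi_handle_t *hndl, void *param)
    {
        ((am_token_t *)param)->ready = 1;  /* AM_Poll may now run the handler */
    }

    /* Runs in the dispatcher when the first packet arrives: allocate space,
     * queue a not-yet-ready token, and tell LAPI where to place the data
     * and which completion handler to call. */
    static void *header_handler(lapi_handle_t *hndl, void *uhdr, uint *uhdr_len,
                                ulong *msg_len, compl_hndlr_t **comp_h,
                                void **user_info)
    {
        am_token_t *tok = make_token(uhdr, *uhdr_len, *msg_len);
        tok->payload = malloc(*msg_len);
        task_queue_enqueue(tok);           /* queued, but tok->ready is 0 */
        *comp_h    = completion_handler;
        *user_info = tok;
        return tok->payload;               /* dispatcher writes data here */
    }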

Clearly, this presents substantial overhead for small messages.  The main performance gain came from packing short and medium-length messages into the argument structure that is delivered to the header handler.  The header handler requires this data, so it must be included in the first (1 KB) packet sent to the target.  This argument structure is user defined, but is limited to 864 bytes because of the 1 KB packet size and other packet overhead.  For messages of this size and smaller, the header handler can immediately place the token on the bundle task queue and mark it ready for processing.  It returns a NULL pointer to the dispatcher, indicating that no additional data needs to be collected, and it does not register a completion handler.  AM_Poll is then free to execute the user request or reply handler as soon as possible.  We added a further optimization that allows the header handler to execute the user handler directly when it is running in an application thread (as opposed to one of the special LAPI threads).  This is only possible for AM reply handlers, because reply handlers are not allowed to issue communication calls and LAPI communication calls cannot be performed inside LAPI header handlers without risking deadlock.  The other main optimization was to re-implement the bundle task queue without locking.  AMLAPI does not allow parallel access to the bundle (only AM_Seq mode), so the task queue has exactly one producer thread and one consumer thread.  The queue is implemented as a linked list in which the producer and consumer are prevented from updating the same element by a "firewall" element in the list.  See the discussion in amlapi_task_queue.c for an explanation.
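
A minimal sketch of such a firewall queue appears below.  The names are illustrative rather than taken from amlapi_task_queue.c, and the sketch ignores the memory-ordering issues a real SMP implementation must address; it only shows how the firewall element keeps the single producer and the single consumer from touching the same list element.

    #include <stdlib.h>

    typedef struct elem {
        struct elem *next;
        volatile int ready;      /* set by the LAPI completion handler      */
        void        *task;       /* the AM token carried by this element    */
    } elem_t;

    typedef struct {
        elem_t *head;            /* consumer side: always the firewall      */
        elem_t *tail;            /* producer side: last element in the list */
    } queue_t;

    void q_init(queue_t *q)
    {
        elem_t *fw = calloc(1, sizeof(elem_t));   /* initial firewall */
        q->head = q->tail = fw;
    }

    /* Producer (header handler): only touches tail and tail->next. */
    void q_enqueue(queue_t *q, elem_t *e)
    {
        e->next = NULL;
        q->tail->next = e;       /* linking in publishes e to the consumer  */
        q->tail = e;
    }

    /* Consumer (AM_Poll): only touches head.  The element after the
     * firewall is the oldest entry; if it is ready, consume it and let
     * it become the new firewall.  Returns NULL if nothing is ready. */
    void *q_dequeue_if_ready(queue_t *q)
    {
        elem_t *e = q->head->next;
        if (e == NULL || !e->ready)
            return NULL;
        free(q->head);           /* retire the old firewall                 */
        q->head = e;             /* e is consumed and becomes the firewall  */
        return e->task;
    }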

The following graph shows the latency of an AM-2 ping test in which messages of the given size are sent to the target node and an acknowledgment is sent back via the reply handler.  Note the increase in latency from about 60-65 microseconds to over 100 as the message size crosses the 864-byte limit.  The additional 40 microseconds is attributed to the completion handler that must be used when the message is not entirely contained in the header handler argument structure.  In that case, once the entire message has arrived, the LAPI dispatcher signals the LAPI completion handler thread, and the completion handler simply marks the request as "ready" so the bundle can execute its request handler.  Additional testing with pure LAPI programs (as opposed to AMLAPI programs) confirms a 35-40 microsecond overhead whenever a completion handler must be scheduled, even if it does no work.  It is known that the completion handler thread is created at system contention scope and therefore maps directly to an AIX kernel thread; that is, it does not share a kernel thread with other application threads at process contention scope.  One possibility is that the completion handler thread is contending for the same CPU as the user thread(s).  The application tasks ran on dedicated 16-CPU SMP nodes, so there was no contention with other applications.  Further, each task used only four threads (the user thread, the LAPI notification thread, the LAPI completion handler thread, and the LAPI retransmission thread), so there should be no contention among threads for CPUs.  In one test we set the pthread concurrency level to 10, but this did not change the performance; of course, the AIX pthreads implementation is free to ignore this hint.  Further investigation is required to explain, and possibly remove, this overhead.
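
For reference, the concurrency hint mentioned above is the standard pthread call shown in this sketch (the test harness itself is not reproduced):

    #include <pthread.h>

    int main(void)
    {
        /* Hint that up to 10 kernel execution vehicles may be useful.
         * AIX pthreads is free to ignore this; in our test it did not
         * change the measured latency. */
        pthread_setconcurrency(10);
        /* ... run the AM-2 ping test ... */
        return 0;
    }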

Note that even the best performance of 50-60 microseconds is substantial compared with what can be achieved using LAPI_Put or LAPI_Get.  See this page for additional information on LAPI performance using Put and Get operations.  Finally, note that the best and most consistent performance is obtained when LAPI runs in polling mode rather than interrupt mode.
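
For comparison, a bare LAPI put with a completion wait looks roughly like this sketch (error checking omitted; remote_addr stands for a target-side address obtained earlier, for example with LAPI_Address_init):

    #include <lapi.h>

    /* Sketch: push len bytes from local_buf to remote_addr in task tgt,
     * then wait for the transfer to complete at both origin and target. */
    void put_and_wait(lapi_handle_t hndl, uint tgt, void *remote_addr,
                      void *local_buf, ulong len)
    {
        lapi_cntr_t cmpl_cntr;
        int         tmp;

        LAPI_Setcntr(hndl, &cmpl_cntr, 0);
        LAPI_Put(hndl, tgt, len, remote_addr, local_buf,
                 NULL /* tgt_cntr */, NULL /* org_cntr */, &cmpl_cntr);
        LAPI_Waitcntr(hndl, &cmpl_cntr, 1, &tmp);
    }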

[AMLAPI latency graph]


The graph below shows bandwidth curves for AMLAPI on the NERSC IBM SP.  Using LAPI_Put and LAPI_Get in polling mode, we can achieve better than 300 MB/sec on messages of about 128 KB.

[AMLAPI bandwidth graph]


Code Modifications by Dan Bonachea

Code Modifications by Mike

Source Code

Download AMLAPI.tgz for the latest version (1.4) of AMLAPI for Titanium.

LAPI Notes

LAPI will create three additional threads for each LAPI task during initialization:

    - the notification thread
    - the retransmission thread
    - the completion handler thread

Since AMLAPI puts LAPI in polling mode by default, the notification thread will never run.  The retransmission thread wakes only every 400000 microseconds, so it should probably not have a processor reserved for it.  The completion handler thread runs only for large messages (those that cannot be packed into the header handler argument structure).  Further, the completion handler is very light-weight: it simply sets a variable in a task queue structure.  Given all this, you should reserve at most one processor for LAPI overhead.
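
As a sketch, disabling interrupt mode (i.e., selecting polling mode) uses the standard LAPI environment call:

    #include <lapi.h>

    /* With interrupts off, incoming messages are progressed only when
     * the application calls into LAPI (e.g., from AM_Poll), and the
     * notification thread never runs. */
    void set_polling_mode(lapi_handle_t hndl)
    {
        LAPI_Senv(hndl, INTERRUPT_SET, 0);
    }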

