Advanced Topics in Computer Systems
12/3/01
Anthony Joseph & Joe Hellerstein
Remote Procedure Call (RPC)

  Client                            Server
  Request  ---------------------->  Handler
                                       |
                                       v
                                     Work
                                       |
                                       v
  Handler  <----------------------  Reply
     |
     v
  “match” (pair the reply with its outstanding request)
Performance issues:
· Sender blocks while waiting for reply (why? Simplicity: no buffering needed)
· Request and reply are different messages: slow-start effect for TCP connections
· Can’t suspend an interrupt handler (it runs in privileged mode and interrupts can’t be nested)
· Can’t assume that messages are delivered in order
· Can use polling:
  o Puts receiver in control, but requires sender buffering
· Event-driven (typical choice):
  o Must handle stuff as you get it (see the sketch contrasting the two disciplines after this list)
· No loops, so no deadlocks (provable by induction)
· Message passing machines (commercial machines):
  o Treat network as an I/O device
· Message driven architectures (research machines):
  o Integrate message reception into instruction scheduling, and message send into the execution unit (register-based model)
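A minimal C sketch contrasting the two receive disciplines; nic_try_recv() and the handler wiring are assumed primitives, not any real NIC API:

#include <stdint.h>

typedef struct { uint32_t src; uint32_t len; uint8_t data[256]; } msg_t;

/* Assumed device primitive: returns 1 and fills *m if a message is waiting. */
extern int nic_try_recv(msg_t *m);

/* Polling: the receiver decides when to look at the network, so it stays in
 * control of its own schedule; the sender may have to buffer messages until
 * the receiver gets around to polling. */
void poll_loop(void (*consume)(const msg_t *))
{
    msg_t m;
    for (;;) {
        while (!nic_try_recv(&m))
            ;                       /* spin (or do other work) until data arrives */
        consume(&m);
    }
}

/* Event-driven: the NIC interrupt invokes this handler as each message
 * arrives.  It must handle the message now: an interrupt handler cannot be
 * suspended, so no blocking or waiting for a later message is allowed. */
void nic_interrupt_handler(const msg_t *m, void (*consume)(const msg_t *))
{
    consume(m);                     /* must run to completion without blocking */
}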
Multiprocessor model:
· Multiple CPUs on a high-perf. network, usually in the same machine room
· System Area Network (SAN)
  o Very high performance: 2–10 GB/s per link, ~2 microsecond latency
  o Homogeneous, single administrative domain, one location
· Dark ages:
  o Synchronous communication between all processors
  o Enforced by a barrier between computation rounds/stages
  o Very poor CPU utilization
· Levels of indirection:
  o Heterogeneous HW can be handled by the network interface card
  o Heterogeneous OSes can be handled by drivers, libraries, etc.
  o Heterogeneous physical networks can be handled by virtual naming schemes for nodes (NI cards)
· Pointless to use TCP for a SAN, because the environment is essentially 100% reliable and error-free, whereas TCP is designed for untrusted, high-latency, variable-loss situations
Basic computation model:
LOOP
  compute;
  communicate;
END
Separate communication and computation phases yield the following observations:
· If the compute and communicate phases don’t overlap, then processor utilization is low unless much more computation than communication is occurring.
  o Example:
    § 90% peak CPU utilization at the cost of 10% network usage
    § Requires a very-high-performance network, but wastes it!
  => Thus, a coarser-grained sharing model is required for good performance
· Overlapping the compute and communicate phases implies that high processor utilization is achieved as long as processor and network costs/utilization balance.
  o Computation only needs to slightly exceed communication
  o But, requires flow control between sender and receiver (3-phase protocol) or large buffer allocations
  => So, we ideally need an asynchronous communications model
· Note that a shorter communication phase implies that a finer-grained sharing model is possible.
  o Want to minimize comm. start-up time (i.e., low-latency comm.)
  o What about the speed of light?
  => Need to “fill the pipeline” with messages (Cray-1)
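To make the overlap concrete, here is a minimal C sketch of one overlapped step of the LOOP model above; async_send() and all_sends_done() are hypothetical non-blocking primitives:

#include <stddef.h>

enum { CHUNKS = 64, CHUNK_BYTES = 4096 };

extern void async_send(int dest, const void *buf, size_t len);  /* returns immediately */
extern int  all_sends_done(void);
extern void compute_chunk(int i, void *out, size_t len);

static char out[CHUNKS][CHUNK_BYTES];

/* Instead of compute-then-communicate, inject each chunk's message as soon
 * as it is ready: the network drains while the CPU computes the next chunk,
 * keeping both resources busy ("filling the pipeline"). */
void overlapped_step(int neighbor)
{
    for (int i = 0; i < CHUNKS; i++) {
        compute_chunk(i, out[i], CHUNK_BYTES);
        async_send(neighbor, out[i], CHUNK_BYTES);  /* off the critical path */
    }
    while (!all_sends_done())
        ;   /* block only at the very end of the step */
}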
Approach:
· Match the software model to the hardware dispatch model (an arriving message fires off an interrupt handler),
· Treat message sending as the critical path and get everything off the critical path that you can.
Two primary sources of slow-down:
· Generalized buffering & resource allocation,
· Allowing for blocking/delayed server activity.
Active message solution:
· Head of a message packet contains the address of the receive handler to run.
o Upcall model
o Version 1: pointer to handler (actual address)
o Version 2: Symbolic name (allows heterogeneity)
· No buffering (beyond that needed for data transport).
o Immediate reply or pre-allocated user-level buffers
· Deadlock avoidance – Short user-level receive handlers that may not block:
o Either generate a reply message right away, or
o Extract the received message from the network into user space and return.
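A minimal C sketch of the “version 1” scheme (handler address at the head of the packet); am_send() and the handler names are illustrative assumptions, not the paper’s actual API:

#include <stdint.h>

typedef void (*am_handler_t)(int src_node, void *arg, uint32_t len);

/* Assumed primitive: places the handler address at the head of the packet,
 * copies the (small) payload, and hands the packet to the NI with no
 * intermediate buffering.  Version 1 assumes an SPMD program, so code
 * addresses are valid on every node. */
extern void am_send(int dest_node, am_handler_t handler,
                    const void *arg, uint32_t len);

static double table[1024];      /* illustrative application state */
static double remote_cell;

/* Reply handler: runs on the requester when the answer arrives.  It only
 * deposits data into user space and returns; it never blocks. */
static void reply_handler(int src, void *arg, uint32_t len)
{
    (void)src; (void)len;
    remote_cell = *(double *)arg;   /* extract the message from the network */
}

/* Request handler: runs at the receiver directly out of message arrival.
 * To avoid deadlock it must either reply immediately (as here) or just
 * pull the data into user space and return. */
static void read_handler(int src, void *arg, uint32_t len)
{
    (void)len;
    uint32_t index = *(uint32_t *)arg;
    am_send(src, reply_handler, &table[index], sizeof(double));
}

void request_remote_read(int dest_node, uint32_t index)
{
    am_send(dest_node, read_handler, &index, sizeof index);
    /* the caller keeps computing; reply_handler fires on arrival */
}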
Some ways to think about active messages:
· Interrupt-level RPC (when response to a message is generated immediately).
· “Link layer” communication facility: gets bits from A to B and does nothing else; everything else is the responsibility of the application.
· Exports a “raw hardware” model: asynchronous hand-off to network on sender side; interrupt handler on receiver side.
Potentially an order of magnitude faster than more generalized communication facilities.
Systems such as Split-C, which employ simple communication abstractions like Put & Get, benefit from that speed-up (a sketch of Get over active messages follows).
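As a rough illustration of that fit, here is how a split-phase Get might map onto the active-message primitives sketched above; the completion counter and all names are assumptions, not Split-C’s actual implementation:

#include <stdint.h>

typedef void (*am_handler_t)(int src, void *arg, uint32_t len);
extern void am_send(int dest, am_handler_t h, const void *arg, uint32_t len);

static volatile int gets_pending;   /* split-phase completion counter */

typedef struct { double *dst_local; double *src_remote; } get_req_t;
typedef struct { double *dst_local; double value; } get_rep_t;

static void get_reply_handler(int src, void *arg, uint32_t len)
{
    (void)src; (void)len;
    get_rep_t *r = (get_rep_t *)arg;
    *r->dst_local = r->value;       /* deposit the fetched word */
    gets_pending--;                 /* handlers assumed atomic w.r.t. each other */
}

static void get_request_handler(int src, void *arg, uint32_t len)
{
    (void)len;
    get_req_t *q = (get_req_t *)arg;
    get_rep_t r = { q->dst_local, *q->src_remote };
    am_send(src, get_reply_handler, &r, sizeof r);   /* reply immediately */
}

/* Split-phase Get: start the fetch, keep computing, synchronize later. */
void get(int node, double *dst_local, double *src_remote)
{
    gets_pending++;
    get_req_t q = { dst_local, src_remote };
    am_send(node, get_request_handler, &q, sizeof q);
}

void get_sync(void)
{
    while (gets_pending)
        ;   /* spin; replies are delivered by interrupt (or poll here) */
}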
Message-driven machine designs also benefit from the minimalist approach:
· Very fine-grained data model in which computation is driven by messages that contain a function designator and data
· Frequently a message does not contain all the data needed to invoke a computation (1/3 of J-Machine messages)
=> Have to block awaiting the arrival of the rest of the data
· The active message approach implies that “simple” messages get processed quickly, while resource allocation for, and execution of, multi-message functions gets handled in an application-specific manner, which allows for application-specific optimizations and batching (see the fragment-counter sketch below).
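A sketch of what that application-specific handling might look like: each fragment handler deposits its piece, and a counter enables the computation only when the last fragment arrives. All names are illustrative:

enum { FRAGS = 3 };                 /* e.g., a three-operand remote function */

typedef struct {
    double operand[FRAGS];
    int    missing;                 /* fragments still in flight */
} pending_call_t;

extern void run_function(pending_call_t *c);    /* the actual computation */

/* Per-fragment handler: cheap, non-blocking, application-specific.
 * Contrast with blocking the node to await the rest of the data, as a
 * message-driven machine would. */
void fragment_handler(pending_call_t *call, int slot, double value)
{
    call->operand[slot] = value;
    if (--call->missing == 0)       /* last piece has arrived */
        run_function(call);         /* execute (or schedule) at user level */
}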
3 key features about the paper:
· Try to improve utilization of massively parallel machines by focusing on the interaction between overlapping computation and communication.
· Provide an extremely lean communication facility, called active messages, that tries to remove as much processing as possible from the basic communications operation of getting a message from node A to node B.
· 2 primary sources of slow-down removed: generalized buffering for messages and support for blocking/delayed receiver activity.
Some flaws:
· The paper discusses lots of details that seem only semi-relevant to what this paper is really about. Maybe the paper is confused; maybe the Instructor is confused :-)
· The active message design pushes all the hard buffering and scheduling decisions into the application and observes that what’s left over is simple and fast.
o For what fraction of various workloads will this end up merely being an “accounting trick”?
o Similarities to E2E and ILP arguments?
· A lesson: Say it again, Sam: Optimize the critical path.
Active Messages in practice: where are/could they be used?
· Web servers (commonly served up content, like root page)
· NFS/CIFS servers (commonly accessed content, like root directory)
· Network caches (commonly accessed content)
· …
U-Net
Motivation – performance-limiting factors:
· Lots of kernel-to-user-space buffer copying (high latency, bad for small messages)
  o Example: NFS – most messages are under 200 bytes, but they account for half of the bits on the wire.
· Restrictive user-level view of the network interface limits the ability to implement novel network protocols at user level (think ALF and ILP)
New ideas for active messages:
· Managing resources without the kernel in the (common!) path
· Minimize copies
  o “Zero” copy is really one copy, from the NI to the correct place in the receiving process’s network buffer (“base”-level U-Net)
    § Still have to copy to application data structures
  o True zero copy: the NI puts data directly into application data structures in the process address space (like the CM-5)
    § Security risk in untrusted systems, so only really useful in SANs (but you can still have application bugs!)
· Virtualize the network interface: each process has its own virtual NI (see the endpoint sketch after this list)
  o Some direct, some emulated by the kernel
  o Set-up is always through the kernel
· Protection between processes for network state
  o Use VM hardware to enforce this
· Authentication of messages (again for protection)
  o Reliably tag the sending endpoint
  o Reliably dispatch to the correct receiving endpoint
· Uses conventional OS and hardware
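A rough C sketch of what such a virtual-NI endpoint might look like: per-process send, receive, and free queues plus a buffer area, mapped into the process so the common send/receive path avoids the kernel. Field names and sizes are assumptions, not the paper’s exact layout:

#include <stdint.h>

typedef struct {
    uint32_t channel;       /* tag that reliably identifies the peer endpoint */
    uint32_t len;
    uint32_t buf_offset;    /* offset of the data within the buffer area */
} unet_desc_t;

typedef struct {
    unet_desc_t send_q[64];         /* descriptors the process fills in */
    unet_desc_t recv_q[64];         /* descriptors the NI fills in on arrival */
    uint32_t    free_q[64];         /* buffers the process hands to the NI */
    uint8_t     buffers[64 * 2048]; /* the process's network buffer area */
} unet_endpoint_t;

/* Endpoint creation (and channel authentication) always goes through the
 * kernel, which maps the segment into the process; after that, sends and
 * receives only touch this shared structure, which the NI polls/DMAs. */
extern unet_endpoint_t *unet_create_endpoint(void);   /* assumed kernel call */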
“Dumb” ATM adapter (SBA-100)
· Simple I/O model, multi-level FIFO queues
“Intelligent” ATM adapter (SBA-200)
· Adapter-level hacking!
· On-board 25 MHz i960 embedded processor
· Multi-level FIFO queues or direct memory-mapped I/O
· Naïve off-loading of functionality can backfire (the 25 MHz i960 is much slower than the 60 MHz SPARC host CPU)
· New model pushes a significant amount of functionality into the adapter (see the firmware sketch below)
  o Scheduling requests to/from FIFO queues
  o De/multiplexing incoming packets
  o Copying to/from user space
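A hedged sketch of the kind of receive loop the on-board i960 might run: match each incoming cell’s virtual circuit to an endpoint and DMA the payload straight into that process’s buffers. All names are invented for illustration:

#include <stdint.h>

typedef struct { uint32_t vci; void *recv_buffers; } channel_t;

extern channel_t channels[256];     /* channel table, set up by the kernel */
extern int  fifo_next_cell(uint32_t *vci, uint8_t payload[48]);
extern void dma_to_host(void *dst, const uint8_t *src, uint32_t len);

/* Demultiplex in the adapter: the host CPU never sees cells for other
 * processes, and no kernel crossing happens on the receive path. */
void firmware_rx_loop(void)
{
    uint32_t vci;
    uint8_t  payload[48];           /* an ATM cell carries a 48-byte payload */
    for (;;) {
        if (!fifo_next_cell(&vci, payload))
            continue;               /* nothing waiting */
        channel_t *ch = &channels[vci & 0xff];
        if (ch->vci != vci)
            continue;               /* unknown channel: drop the cell */
        dma_to_host(ch->recv_buffers, payload, sizeof payload);
    }
}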
Results closely track raw performance
· About 20% slower for 64 byte packets
· What about larger packets?
o Not shown on graph, but curves appear to be diverging
o Bandwidth graph implies that RTT might converge for very large packets, however, packet sizes beyond 1500 bytes are rare (except for large bulk transfers – when do they occur?)
· Can this approach be generalized to other adapters?
o TCP offload for web servers
· Is complexity of adapter implementation worthwhile?
o What about the rate of host CPU speed improvement versus embedded processors?
Protocols
· Apples-and-oranges comparisons of Ethernet and ATM?
o Expect that ATM will be faster than Ethernet (dedicated, switched links with no MAC contention)
· Approach pushes complexity to application level
o But, could be done using libraries (e.g., TCP/IP stack, etc.)
Some flaws:
· Complex I/O adapter logic for managing communications (comparable to OS complexity)
o Shouldn’t this be under OS control?
o How to track different vendors, revisions, etc.?
· Results are for very complex network protocol (ATM vs Ethernet)
o Would be interesting to compare with SAN protocols (e.g., Myrinet)
· As with AM paper, lots and lots of details that are not always clearly presented