Advanced Topics in Computer Systems

12/3/01

Anthony Joseph & Joe Hellerstein

 

Active Messages

U-Net

 

Active Messages

 

Remote Procedure Call (RPC)

 

Request  --------------------->  Handler
                                    |
                                    v
                                   Work
                                    |
                                    v
Handler  <---------------------  Reply
   |
   v
“match”
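
A minimal sketch of this blocking request/reply pattern in C; net_send() and net_recv() are hypothetical transport primitives, not a real API:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t xid;                  /* transaction id used to match reply to request */
    uint32_t len;
    char     data[256];
} msg_t;

void net_send(int node, const msg_t *m);   /* hypothetical */
void net_recv(msg_t *m);                   /* hypothetical: blocks until a message arrives */

static uint32_t next_xid = 1;

/* Send a request, then spin until the reply with a matching xid shows up
   (the “match” step in the diagram above).  The caller is idle the whole time. */
int rpc_call(int server, const void *req, size_t req_len, msg_t *reply)
{
    msg_t m;
    m.xid = next_xid++;
    m.len = (uint32_t)req_len;
    memcpy(m.data, req, req_len);
    net_send(server, &m);
    do {
        net_recv(reply);
    } while (reply->xid != m.xid);
    return 0;
}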

 

Performance issues:

·        Sender blocks while waiting for the reply (why? Simplicity: no buffering needed)

·        Request and reply are different messages: slow-start effect for TCP connections

·        Can’t suspend the interrupt handler (it runs in privileged mode, and interrupts can’t be nested)

·        Can’t assume that messages are delivered in order

·        Can use polling (see the sketch after this list):

o       Puts receiver in control, but requires sender-side buffering

·        Event-driven (typical choice):

o       Must handle messages as they arrive

·        No loops (a request handler may at most reply, and a reply handler may not send further messages), so no deadlocks (provable by induction)

·        Message passing machines (commercial machines):

o       Treat network as I/O device

·        Message driven architectures (research machines):

o       Integrate message reception into instruction scheduling, and message send into execution unit (register-based model)
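
Back to the polling option above, a minimal sketch; net_poll(), do_some_work(), and handle() are hypothetical, and msg_t is repeated from the RPC sketch:

#include <stdint.h>

typedef struct { uint32_t xid, len; char data[256]; } msg_t;

int  net_poll(msg_t *m);          /* hypothetical: non-blocking, returns 0 if queue empty */
void do_some_work(void);          /* one slice of the application’s computation */
void handle(const msg_t *m);      /* application-supplied message handler */

/* The receiver decides when handlers run by draining the network between
   slices of computation; meanwhile the sender must buffer unconsumed messages. */
void compute_with_polling(void)
{
    msg_t m;
    for (;;) {
        do_some_work();
        while (net_poll(&m))      /* drain whatever arrived while computing */
            handle(&m);
    }
}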

 

Multiprocessor model:

·        Multiple CPUs on a high perf. network, usually in same machine room

·        System Area Network (SAN)

o       Very high performance: 2 – 10 GB/s per link, ~2 microsecond latency

o       Homogenous, single administrative domain, one location

·        Dark ages:

o       Synchronous communication between all processors

o       Enforced by barrier between computation rounds/stages

o       Very poor CPU utilization

·        Levels of indirection:

o       Heterogeneous HW can be handled by network interface card

o       Heterogeneous OS can be handled by drivers, libraries, etc.

o       Heterogeneous physical network handled by virtual naming schemes for nodes (NI cards)

·        Pointless to use TCP for a SAN, because the environment is 100% reliable and error-free, whereas TCP is designed for untrusted, high-latency, variable-loss environments

 

Basic computation model:

LOOP
    compute;
    communicate;
END

 

Separate communication and computation phases yield following observations:

·        If the compute and communicate phases don’t overlap, processor utilization is low unless there is much more computation than communication.

o       Example:

§         90% peak CPU utilization comes at the cost of only 10% network usage

§         Requires a very high-performance network, but wastes it!

=> Thus, a coarser-grained sharing model is required for good performance

·        Overlapping the compute and communicate phases yields high processor utilization, as long as processor and network utilization stay in balance.

o       Only needs computation to slightly exceed communication

o       But requires flow control between sender and receiver (3-phase protocol) or large buffer allocations

=> So, we ideally need an asynchronous communications model

·        Note that a shorter communication phase implies that a finer-grained sharing model is possible.

o       Want to minimize comm. start up time (i.e., low latency comm.)

o       What about the speed of light?

=> Need to “fill the pipeline” with messages (cf. the Cray-1), as sketched below
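
A sketch of the overlapped model, assuming hypothetical asynchronous primitives net_send_async() and net_all_done(); each block is shipped while the next one is computed, so the pipeline stays full and the node synchronizes only once per round:

#include <stddef.h>

void net_send_async(int node, const void *buf, size_t len);  /* returns immediately */
int  net_all_done(void);              /* hypothetical: have all async sends completed? */
void compute_block(double *b, size_t n);                /* the per-block computation */

void compute_round(int peer, double *blocks, size_t nblocks, size_t blksz)
{
    for (size_t i = 0; i < nblocks; i++) {
        compute_block(&blocks[i * blksz], blksz);       /* work on block i */
        net_send_async(peer, &blocks[i * blksz],        /* ship it while block i+1 runs */
                       blksz * sizeof(double));
    }
    while (!net_all_done())
        ;                             /* one synchronization, at the end of the round */
}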

 

Approach:

·        Match the software model to the hardware dispatch model (an arriving message fires off an interrupt handler),

·        Treat message sending as the critical path and get everything off the critical path that you can.

 

Two primary sources of slow-down:

·        Generalized buffering & resource allocation,

·        Allowing for blocking/delayed server activity.

 

Active message solution:

·        The head of a message packet contains the address of the receive handler to run (sketched after this list).

o       Upcall model

o       Version 1: pointer to handler (actual address)

o       Version 2: Symbolic name (allows heterogeneity)

·        No buffering (beyond that needed for data transport).

o       Immediate reply or pre-allocated user-level buffers

·        Deadlock avoidance – Short user-level receive handlers that may not block:

o       Either generate a reply message right away, or

o       Extract the received message from the network into user space and return.
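
A minimal sketch of the upcall model (“version 1”: the packet head carries the handler’s actual address); the names are illustrative, not the paper’s API:

#include <stddef.h>
#include <stdint.h>

#define AM_PAYLOAD 64

typedef void (*am_handler_t)(int src, void *payload, size_t len);

typedef struct {
    am_handler_t handler;              /* head of packet: receive handler to run */
    uint32_t     len;
    uint8_t      payload[AM_PAYLOAD];
} am_packet_t;

/* Receive side, e.g. invoked from the network interrupt handler: dispatch is a
   single indirect call, with no buffering or scheduling in between.  The handler
   must be short and non-blocking: reply at once, or copy the payload out and return. */
void am_dispatch(int src, am_packet_t *pkt)
{
    pkt->handler(src, pkt->payload, pkt->len);
}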

 

Some ways to think about active messages:

·        Interrupt-level RPC (when response to a message is generated immediately).

·        “Link layer” communication facility: gets bits from A to B and does nothing else; everything else is the responsibility of the application.

·        Exports a “raw hardware” model: asynchronous hand-off to network on sender side; interrupt handler on receiver side.

 

Potentially an order of magnitude faster than more generalized communication facilities.

 

Systems such as Split-C that employ simple communication abstractions like Put & Get benefit from that speed-up; a split-phase get is sketched below.
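
A sketch of a split-phase get on top of active messages; am_send() and the handler signatures are hypothetical, following the dispatch sketch above:

#include <stddef.h>

typedef void (*am_handler_t)(int src, void *payload, size_t len);
void am_send(int node, am_handler_t h, const void *payload, size_t len);  /* hypothetical */

typedef struct {
    double *remote_addr;           /* word to read on the remote node */
    double *local_dst;             /* where the reply should land back home */
    volatile int *done;            /* completion flag back home */
} get_req_t;

typedef struct {
    get_req_t req;                 /* echoed back so the reply handler has context */
    double    value;
} get_reply_t;

static void get_reply_handler(int src, void *payload, size_t len);

/* Remote side: a short, non-blocking handler that replies immediately. */
static void get_request_handler(int src, void *payload, size_t len)
{
    get_reply_t r;
    r.req   = *(get_req_t *)payload;
    r.value = *r.req.remote_addr;
    am_send(src, get_reply_handler, &r, sizeof r);
}

/* Home side: deposit the value and flag completion. */
static void get_reply_handler(int src, void *payload, size_t len)
{
    get_reply_t *r = payload;
    *r->req.local_dst = r->value;
    *r->req.done = 1;
}

/* Issue the request and return; the caller overlaps other work until *done != 0. */
void split_get(int node, double *remote, double *local, volatile int *done)
{
    get_req_t req = { remote, local, done };
    *done = 0;
    am_send(node, get_request_handler, &req, sizeof req);
}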

 

Message-driven machine designs also benefit from the minimalist approach:

·        Very fine-grained data model in which computation is driven by messages that contain a function designator and data

·        Frequently a message does not contain all the data needed to invoke a computation (1/3 of J-Machine messages)

=> Have to block awaiting the arrival of the rest of the data

·        The active message approach means that “simple” messages get processed quickly, while resource allocation for, and execution of, multi-message functions is handled in an application-specific manner, which allows application-specific optimizations and batching.

 

 

3 key features about the paper:

·        Try to improve utilization of massively parallel machines by focusing on the interaction between overlapping computation and communication.

·        Provide an extremely lean communication facility, called active messages, that tries to remove as much processing as possible from the basic communications operation of getting a message from node A to node B.

·        2 primary sources of slow-down removed: generalized buffering for messages and support for blocking/delayed receiver activity.

 

Some flaws:

·        The paper discusses lots of details that seem only semi-relevant to what this paper is really about. Maybe the paper is confused; maybe the Instructor is confused :-)

·        The active message design pushes all the hard buffering and scheduling decisions into the application and observes that what’s left over is simple and fast.

o       For what fraction of various workloads will this end up merely being an “accounting trick”?

o       Similarities to E2E and ILP arguments?

·        A lesson: Say it again, Sam: Optimize the critical path.

 

Active Messages in practice: where are/could they be used?

·        Web servers (commonly served up content, like root page)

·        NFS/CIFS servers (commonly accessed content, like root directory)

·        Network caches (commonly accessed content)

 

U-Net User-Level Network Interface

 

Motivation – performance limiting factors:

·        Lots of kernel to user space buffer copying (high latency, bad for small messages)

o       Example: NFS – most messages are under 200 bytes, but they account for half of the bits on the wire.

·        Restrictive user-level view of network interface limits ability to implement novel network protocols at user-level (think ALF and ILP)

 

New ideas for active messages:

·        Managing resources without the kernel in the (common!) path

·        Minimize copies

o       “Zero” copy is really one copy, from the NI to the correct place in the receiving process’ network buffer (“base”-level U-Net)

§         Still have to copy to application data structures

o       True zero copy: the NI puts data directly into application data structures in the process’ address space (like the CM-5)

§         Security risk in untrusted systems, so only really useful in SANs (but application bugs are still possible!)

·        Virtualize the network interface: each process has its own virtual NI (see the endpoint sketch after this list)

o       Some direct, some emulated by the kernel

o       Set up is always through the kernel

·        Protection between processes for network state

o       Use VM hardware to enforce this

·        Authentication of messages (again for protection)

o       Reliably tag the sending endpoint

o       Reliably dispatch to the correct receiving endpoint

·        Uses conventional OS and hardware
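
A sketch of a U-Net-style endpoint, following the paper’s model of a communication segment plus send, receive, and free queues; field names and sizes here are illustrative. The queues and segment are pinned and mapped into the process, so the common send/receive path never enters the kernel (only endpoint setup does):

#include <stdint.h>

#define UNET_NQ 64

typedef struct {
    uint32_t offset;   /* buffer location within the communication segment */
    uint16_t len;      /* message length in bytes */
    uint16_t tag;      /* channel tag used to (de)multiplex and authenticate */
} unet_desc_t;

typedef struct {
    uint8_t     *segment;             /* pinned communication segment */
    unet_desc_t  sendq[UNET_NQ];      /* descriptors the NI should transmit */
    unet_desc_t  recvq[UNET_NQ];      /* descriptors for arrived messages */
    unet_desc_t  freeq[UNET_NQ];      /* empty buffers handed back to the NI */
    volatile uint32_t send_head, send_tail;
    volatile uint32_t recv_head, recv_tail;
    volatile uint32_t free_head, free_tail;
} unet_endpoint_t;

/* User-level send: fill in a descriptor and bump the tail pointer; the NI polls
   (or is doorbelled) and transmits; no system call anywhere on this path. */
int unet_send(unet_endpoint_t *ep, uint16_t tag, uint32_t off, uint16_t len)
{
    uint32_t t = ep->send_tail;
    if ((t + 1) % UNET_NQ == ep->send_head)
        return -1;                    /* send queue full: caller must back off */
    ep->sendq[t] = (unet_desc_t){ .offset = off, .len = len, .tag = tag };
    ep->send_tail = (t + 1) % UNET_NQ;
    return 0;
}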

 

“Dumb” ATM adapter (SBA-100)

·        Simple I/O model, multi-level FIFO queues

 

“Intelligent” ATM adapter (SBA-200)

·        Adapter-level hacking!

·        On-board 25 MHz i960 embedded processor

·        Multi-level FIFO queues or direct memory-mapped I/O

·        Naïve off-loading of functionality can backfire (the 60 MHz SPARC host CPU outruns the 25 MHz i960)

·        New model pushes significant amount of functionality into adapter

o       Scheduling requests to/from FIFO queues

o       De/multiplexing incoming packets

o       Copying to/from user-space

 


Results closely track raw performance

·        About 20% slower for 64-byte packets

·        What about larger packets?

o       Not shown on graph, but curves appear to be diverging

o       Bandwidth graph implies that RTT might converge for very large packets, however, packet sizes beyond 1500 bytes are rare (except for large bulk transfers – when do they occur?)

·        Can this approach be generalized to other adapters?

o       TCP offload for web servers

·        Is complexity of adapter implementation worthwhile?

o       What about the rate of host CPU speed improvement versus embedded processors?

 

Protocols

·        Apples-and-oranges comparison between Ethernet and ATM?

o       Expect that ATM will be faster than Ethernet (dedicated, switched links with no MAC contention)

·        Approach pushes complexity to application level

o       But, could be done using libraries (e.g., TCP/IP stack, etc.)

 

Some flaws

·        Complex I/O adapter logic for managing communications (comparable to OS complexity)

o       Shouldn’t this be under OS control?

o       How to track different vendors, revisions, etc.?

·        Results are for very complex network protocol (ATM vs Ethernet)

o       Would be interesting to compare with SAN protocols (e.g., Myrinet)

·        As with the AM paper, lots and lots of details that are not always clearly presented