Advanced Topics in Computer Systems
12/3/01
Anthony Joseph & Joe Hellerstein
Remote Procedure Call (RPC)

  Client                            Server
  Request  ---------------------->  Handler
                                       |
                                       v
                                     Work
                                       |
                                       v
  Handler  <----------------------  Reply
     |
     v
  “match” (pair the reply with its outstanding request)
Performance issues:
· Sender blocks while waiting for reply (why? Simplicity: no buffering needed)
· Request and reply are different messages: slow-start effect for TCP connections
· Can’t suspend an interrupt handler (it runs in privileged mode and interrupts can’t be nested)
· Can’t assume that messages are delivered in order
· Can use polling:
  o Puts receiver in control, but requires sender buffering
· Event-driven (typical choice):
  o Must handle stuff as you get it (see the sketch contrasting the two disciplines after this list)
· No loops, so no deadlocks (provable by induction)
· Message passing machines (commercial machines):
  o Treat network as an I/O device
· Message driven architectures (research machines):
  o Integrate message reception into instruction scheduling, and message send into the execution unit (register-based model)
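A minimal C sketch contrasting the two receive disciplines; nic_try_recv() and the handler wiring are assumed primitives, not any real NIC API:

#include <stdint.h>

typedef struct { uint32_t src; uint32_t len; uint8_t data[256]; } msg_t;

/* Assumed device primitive: returns 1 and fills *m if a message is waiting. */
extern int nic_try_recv(msg_t *m);

/* Polling: the receiver decides when to look at the network, so it stays in
 * control of its own schedule; the sender may have to buffer messages until
 * the receiver gets around to polling. */
void poll_loop(void (*consume)(const msg_t *))
{
    msg_t m;
    for (;;) {
        while (!nic_try_recv(&m))
            ;                       /* spin (or do other work) until data arrives */
        consume(&m);
    }
}

/* Event-driven: the NIC interrupt invokes this handler as each message
 * arrives.  It must handle the message now: an interrupt handler cannot be
 * suspended, so no blocking or waiting for a later message is allowed. */
void nic_interrupt_handler(const msg_t *m, void (*consume)(const msg_t *))
{
    consume(m);                     /* must run to completion without blocking */
}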
Multiprocessor model:
· Multiple CPUs on a high-perf. network, usually in the same machine room
· System Area Network (SAN)
  o Very high performance: 2–10 GB/s per link, ~2 microsecond latency
  o Homogeneous, single administrative domain, one location
· Dark ages:
  o Synchronous communication between all processors
  o Enforced by a barrier between computation rounds/stages
  o Very poor CPU utilization
· Levels of indirection:
  o Heterogeneous HW can be handled by the network interface card
  o Heterogeneous OSes can be handled by drivers, libraries, etc.
  o Heterogeneous physical networks can be handled by virtual naming schemes for nodes (NI cards)
· Pointless to use TCP for a SAN, because the environment is essentially 100% reliable and error-free, whereas TCP is designed for untrusted, high-latency, variable-loss situations
Basic computation model:
LOOP
  compute;
  communicate;
END
Separate communication and computation phases yield the following observations:
· If the compute and communicate phases don’t overlap, then processor utilization is low unless much more computation than communication is occurring.
  o Example:
    § 90% peak CPU utilization at the cost of 10% network usage
    § Requires a very-high-performance network, but wastes it!
  => Thus, a coarser-grained sharing model is required for good performance
· Overlapping the compute and communicate phases implies that high processor utilization is achieved as long as processor and network costs/utilization balance.
  o Computation only needs to slightly exceed communication
  o But, requires flow control between sender and receiver (3-phase protocol) or large buffer allocations
  => So, we ideally need an asynchronous communications model
· Note that a shorter communication phase implies that a finer-grained sharing model is possible.
  o Want to minimize comm. start-up time (i.e., low-latency comm.)
  o What about the speed of light?
  => Need to “fill the pipeline” with messages (Cray-1)
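To make the overlap concrete, here is a minimal C sketch of one overlapped step of the LOOP model above; async_send() and all_sends_done() are hypothetical non-blocking primitives:

#include <stddef.h>

enum { CHUNKS = 64, CHUNK_BYTES = 4096 };

extern void async_send(int dest, const void *buf, size_t len);  /* returns immediately */
extern int  all_sends_done(void);
extern void compute_chunk(int i, void *out, size_t len);

static char out[CHUNKS][CHUNK_BYTES];

/* Instead of compute-then-communicate, inject each chunk's message as soon
 * as it is ready: the network drains while the CPU computes the next chunk,
 * keeping both resources busy ("filling the pipeline"). */
void overlapped_step(int neighbor)
{
    for (int i = 0; i < CHUNKS; i++) {
        compute_chunk(i, out[i], CHUNK_BYTES);
        async_send(neighbor, out[i], CHUNK_BYTES);  /* off the critical path */
    }
    while (!all_sends_done())
        ;   /* block only at the very end of the step */
}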
Approach:
· Match the software model to the hardware dispatch model (an arriving message fires off an interrupt handler),
· Treat message sending as the critical path and get everything off the critical path that you can.
Two primary sources of slow-down:
· Generalized buffering & resource allocation,
· Allowing for blocking/delayed server activity.
Active message solution:
· Head of a message packet contains the address of the receive handler to run.
o Upcall model
o Version 1: pointer to handler (actual address)
o Version 2: Symbolic name (allows heterogeneity)
· No buffering (beyond that needed for data transport).
o Immediate reply or pre-allocated user-level buffers
· Deadlock avoidance – Short user-level receive handlers that may not block:
o Either generate a reply message right away, or
o Extract the received message from the network into user space and return.
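A minimal C sketch of the “version 1” scheme (handler address at the head of the packet); am_send() and the handler names are illustrative assumptions, not the paper’s actual API:

#include <stdint.h>

typedef void (*am_handler_t)(int src_node, void *arg, uint32_t len);

/* Assumed primitive: places the handler address at the head of the packet,
 * copies the (small) payload, and hands the packet to the NI with no
 * intermediate buffering.  Version 1 assumes an SPMD program, so code
 * addresses are valid on every node. */
extern void am_send(int dest_node, am_handler_t handler,
                    const void *arg, uint32_t len);

static double table[1024];      /* illustrative application state */
static double remote_cell;

/* Reply handler: runs on the requester when the answer arrives.  It only
 * deposits data into user space and returns; it never blocks. */
static void reply_handler(int src, void *arg, uint32_t len)
{
    (void)src; (void)len;
    remote_cell = *(double *)arg;   /* extract the message from the network */
}

/* Request handler: runs at the receiver directly out of message arrival.
 * To avoid deadlock it must either reply immediately (as here) or just
 * pull the data into user space and return. */
static void read_handler(int src, void *arg, uint32_t len)
{
    (void)len;
    uint32_t index = *(uint32_t *)arg;
    am_send(src, reply_handler, &table[index], sizeof(double));
}

void request_remote_read(int dest_node, uint32_t index)
{
    am_send(dest_node, read_handler, &index, sizeof index);
    /* the caller keeps computing; reply_handler fires on arrival */
}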
Some ways to think about active messages:
· Interrupt-level RPC (when response to a message is generated immediately).
· “Link layer” communication facility: gets bits from A to B and does nothing else; everything else is the responsibility of the application.
· Exports a “raw hardware” model: asynchronous hand-off to network on sender side; interrupt handler on receiver side.
Potentially an order of magnitude faster than more generalized communication facilities.
Systems such as Split-C, which employ simple communication abstractions like Put & Get, benefit from that speed-up (a sketch of Get over active messages follows).
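As a rough illustration of that fit, here is how a split-phase Get might map onto the active-message primitives sketched above; the completion counter and all names are assumptions, not Split-C’s actual implementation:

#include <stdint.h>

typedef void (*am_handler_t)(int src, void *arg, uint32_t len);
extern void am_send(int dest, am_handler_t h, const void *arg, uint32_t len);

static volatile int gets_pending;   /* split-phase completion counter */

typedef struct { double *dst_local; double *src_remote; } get_req_t;
typedef struct { double *dst_local; double value; } get_rep_t;

static void get_reply_handler(int src, void *arg, uint32_t len)
{
    (void)src; (void)len;
    get_rep_t *r = (get_rep_t *)arg;
    *r->dst_local = r->value;       /* deposit the fetched word */
    gets_pending--;                 /* handlers assumed atomic w.r.t. each other */
}

static void get_request_handler(int src, void *arg, uint32_t len)
{
    (void)len;
    get_req_t *q = (get_req_t *)arg;
    get_rep_t r = { q->dst_local, *q->src_remote };
    am_send(src, get_reply_handler, &r, sizeof r);   /* reply immediately */
}

/* Split-phase Get: start the fetch, keep computing, synchronize later. */
void get(int node, double *dst_local, double *src_remote)
{
    gets_pending++;
    get_req_t q = { dst_local, src_remote };
    am_send(node, get_request_handler, &q, sizeof q);
}

void get_sync(void)
{
    while (gets_pending)
        ;   /* spin; replies are delivered by interrupt (or poll here) */
}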
Message-driven machine designs also benefit from the minimalist approach:
· Very fine-grained data model in which computation is driven by messages that contain a function designator and data
· Frequently a message does not contain all the data needed to invoke a computation (1/3 of J-Machine messages)
=> Have to block awaiting the arrival of the rest of the data
· The active message approach implies that “simple” messages get processed quickly, while resource allocation for, and execution of, multi-message functions gets handled in an application-specific manner, which allows for application-specific optimizations and batching (see the fragment-counter sketch below).
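A sketch of what that application-specific handling might look like: each fragment handler deposits its piece, and a counter enables the computation only when the last fragment arrives. All names are illustrative:

enum { FRAGS = 3 };                 /* e.g., a three-operand remote function */

typedef struct {
    double operand[FRAGS];
    int    missing;                 /* fragments still in flight */
} pending_call_t;

extern void run_function(pending_call_t *c);    /* the actual computation */

/* Per-fragment handler: cheap, non-blocking, application-specific.
 * Contrast with blocking the node to await the rest of the data, as a
 * message-driven machine would. */
void fragment_handler(pending_call_t *call, int slot, double value)
{
    call->operand[slot] = value;
    if (--call->missing == 0)       /* last piece has arrived */
        run_function(call);         /* execute (or schedule) at user level */
}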
3 key features about the paper:
· Try to improve utilization of massively parallel machines by focusing on the interaction between overlapping computation and communication.
· Provide an extremely lean communication facility, called active messages, that tries to remove as much processing as possible from the basic communications operation of getting a message from node A to node B.
· 2 primary sources of slow-down removed: generalized buffering for messages and support for blocking/delayed receiver activity.
Some flaws:
· The paper discusses lots of details that seem only semi-relevant to what this paper is really about. Maybe the paper is confused; maybe the Instructor is confused :-)
· The active message design pushes all the hard buffering and scheduling decisions into the application and observes that what’s left over is simple and fast.
o For what fraction of various workloads will this end up merely being an “accounting trick”?
o Similarities to E2E and ILP arguments?
· A lesson: Say it again, Sam: Optimize the critical path.
Active Messages in practice: where are/could they be used?
· Web servers (commonly served up content, like root page)
· NFS/CIFS servers (commonly accessed content, like root directory)
· Network caches (commonly accessed content)
· …
U-Net
Motivation – performance-limiting factors:
· Lots of kernel-to-user-space buffer copying (high latency, bad for small messages)
  o Example: NFS – most messages are under 200 bytes, but they account for half of the bits on the wire.
· Restrictive user-level view of the network interface limits the ability to implement novel network protocols at user level (think ALF and ILP)
New ideas for active messages:
· Managing resources without the kernel in the (common!) path
· Minimize copies
  o “Zero” copy is really one copy, from the NI to the correct place in the receiving process’s network buffer (“base”-level U-Net)
    § Still have to copy to application data structures
  o True zero copy: the NI puts data directly into application data structures in the process address space (like the CM-5)
    § Security risk in untrusted systems, so only really useful in SANs (but you can still have application bugs!)
· Virtualize the network interface: each process has its own virtual NI (see the endpoint sketch after this list)
  o Some direct, some emulated by the kernel
  o Set-up is always through the kernel
· Protection between processes for network state
  o Use VM hardware to enforce this
· Authentication of messages (again for protection)
  o Reliably tag the sending endpoint
  o Reliably dispatch to the correct receiving endpoint
· Uses conventional OS and hardware
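A rough C sketch of what such a virtual-NI endpoint might look like: per-process send, receive, and free queues plus a buffer area, mapped into the process so the common send/receive path avoids the kernel. Field names and sizes are assumptions, not the paper’s exact layout:

#include <stdint.h>

typedef struct {
    uint32_t channel;       /* tag that reliably identifies the peer endpoint */
    uint32_t len;
    uint32_t buf_offset;    /* offset of the data within the buffer area */
} unet_desc_t;

typedef struct {
    unet_desc_t send_q[64];         /* descriptors the process fills in */
    unet_desc_t recv_q[64];         /* descriptors the NI fills in on arrival */
    uint32_t    free_q[64];         /* buffers the process hands to the NI */
    uint8_t     buffers[64 * 2048]; /* the process's network buffer area */
} unet_endpoint_t;

/* Endpoint creation (and channel authentication) always goes through the
 * kernel, which maps the segment into the process; after that, sends and
 * receives only touch this shared structure, which the NI polls/DMAs. */
extern unet_endpoint_t *unet_create_endpoint(void);   /* assumed kernel call */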
“Dumb” ATM adapter (SBA-100)
· Simple I/O model, multi-level FIFO queues
“Intelligent” ATM adapter (SBA-200)
· Adapter-level hacking!
· On-board 25 MHz i960 embedded processor
· Multi-level FIFO queues or direct memory-mapped I/O
· Naïve off-loading of functionality can backfire (the 25 MHz i960 is much slower than the 60 MHz SPARC host CPU)
· New model pushes a significant amount of functionality into the adapter (see the firmware sketch below)
  o Scheduling requests to/from FIFO queues
  o De/multiplexing incoming packets
  o Copying to/from user space
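A hedged sketch of the kind of receive loop the on-board i960 might run: match each incoming cell’s virtual circuit to an endpoint and DMA the payload straight into that process’s buffers. All names are invented for illustration:

#include <stdint.h>

typedef struct { uint32_t vci; void *recv_buffers; } channel_t;

extern channel_t channels[256];     /* channel table, set up by the kernel */
extern int  fifo_next_cell(uint32_t *vci, uint8_t payload[48]);
extern void dma_to_host(void *dst, const uint8_t *src, uint32_t len);

/* Demultiplex in the adapter: the host CPU never sees cells for other
 * processes, and no kernel crossing happens on the receive path. */
void firmware_rx_loop(void)
{
    uint32_t vci;
    uint8_t  payload[48];           /* an ATM cell carries a 48-byte payload */
    for (;;) {
        if (!fifo_next_cell(&vci, payload))
            continue;               /* nothing waiting */
        channel_t *ch = &channels[vci & 0xff];
        if (ch->vci != vci)
            continue;               /* unknown channel: drop the cell */
        dma_to_host(ch->recv_buffers, payload, sizeof payload);
    }
}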
Results closely track raw performance
· About 20% slower for 64 byte packets
· What about larger packets?
o Not shown on graph, but curves appear to be diverging
o Bandwidth graph implies that RTT might converge for very large packets, however, packet sizes beyond 1500 bytes are rare (except for large bulk transfers – when do they occur?)
· Can this approach be generalized to other adapters?
o TCP offload for web servers
· Is complexity of adapter implementation worthwhile?
o What about the rate of host CPU speed improvement versus embedded processors?
Protocols
· Apples-and-oranges comparisons of Ethernet and ATM?
o Expect that ATM will be faster than Ethernet (dedicated, switched links with no MAC contention)
· Approach pushes complexity to application level
o But, could be done using libraries (e.g., TCP/IP stack, etc.)
Some flaws:
· Complex I/O adapter logic for managing communications (comparable to OS complexity)
o Shouldn’t this be under OS control?
o How to track different vendors, revisions, etc.?
· Results are for very complex network protocol (ATM vs Ethernet)
o Would be interesting to compare with SAN protocols (e.g., Myrinet)
· As with AM paper, lots and lots of details that are not always clearly presented