Active Messages: Simple, Fast, and Flexible Communication

David E. Culler
Computer Science Division
U.C. Berkeley
ARPA PI, 9/29/93


Goals:

- Simple primitives to support efficient implementation of parallel languages and communication libraries
- Pay for what you need
- Efficient mapping to machines that we know how to build on a large scale

[Figure labels: Shared Memory, Msg Passing, Data Parallel, Data Flow; Communication Architecture; nCUBE, CM5]

Assumption: communication within a parallel program is between two logically related entities executing at the same time on the same task.


Shared Memory

[Figure]


Conventional Message Passing Overhead

[Figure labels: NI, NI; Send Overhead, Network Latency, Receive Overhead]

Machine     Year   Send+Recv overhead   Cycles per msg   FLOPS per msg
nCUBE/10    87     400 µs               4000             600
iPSC/2      88     700 µs               10000            225
nCUBE/2     90     150 µs               3000             330
iPSC/860    91     160 µs               6400             3200
Delta       91     55 µs                2100             1100
CM5         92     95 µs                3200             310


Active Messages

[Figure labels: Primary Computation, Handler, Primary Computation; Data, PC; Network]

Key Idea: associate a small user-level handler directly with each message

- Sender injects the message directly into the network
- Handler executes immediately upon arrival
  - pulls the message out of the network and integrates it into the ongoing computation, or replies
- No buffering (beyond transport), no parsing, no allocation, primitive scheduling


A Communication "machine language"

[Figure labels: Data, PC; Network; NI, NI]

- CM5: register to register communication
  User-level memory-mapped NI (1.6 µs) (1.7 µs)
- nCUBE/2: memory to memory communication
  System DMA (21 inst, 10+1 µs) (34 inst, 10+3 µs) (+5 µs int)


Message Passing Costs

[Figure labels: NI, NI; Send Overhead, Network Latency, Receive Overhead]

Machine     Year   Send+Recv overhead   Cycles per msg   FLOPS per msg
nCUBE/10    87     400 µs               4000             600
iPSC/2      88     700 µs               10000            225
nCUBE/2     90     150 µs               3000             330
  w/ A.M.          25 µs                500              555
iPSC/860    91     160 µs               6400             3200
Delta       91     55 µs                2100             1100
CM5         92     95 µs                3200             310
  w/ A.M.          3 µs                 100              10


Send & Receive from Simple Primitives

[Figure labels: Time; processors P and Q; compute, Send s,Q, Recv P,t, Recv Q,t, Send s,P; req, ready, data; Blocking Send&Receive; Non-Blocking Send&Receive]

Fundamental Costs:

- 3 messages, 3 times latency
- Copying and buffer management


Compiling "Global Address" Languages

Fortran-D, HPF, . . .

FORALL . . .
    Send, Send, . . .
    Compute local
    Rec, Rec, . . .
    Compute
    Barrier

- Compiler has already allocated the buffer space!
- Compiler knows which processor and what address to store into!
- Compiler knows if it is available!
=> Construct simple memory-to-memory transfer (i.e., write or put or bulk version)
- For user decomposition, use read or get!
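
A minimal sketch, in Python, of how the put and get transfers named above could be layered on the active-message idea: each message names a handler, and the handler stores straight into memory the receiver already owns. Everything here is an illustrative assumption rather than anything from the slides: the Network and Node classes, the handler names (put_handler, store_handler, get_handler, ack_handler), and the explicit poll() call, which stands in for "handler executes immediately upon arrival" because this is a single-threaded simulation and not real hardware.

```python
from collections import deque


class Network:
    """Hypothetical in-process 'network': one arrival queue per node."""

    def __init__(self):
        self.queues = {}

    def attach(self, node):
        self.queues[node.node_id] = deque()

    def inject(self, dest, handler, src, args):
        # Transport-level queueing only; the message names its own handler.
        self.queues[dest].append((handler, src, args))


class Node:
    """One simulated processor with local memory (names are assumptions)."""

    def __init__(self, node_id, mem_size, network):
        self.node_id = node_id
        self.memory = [0] * mem_size
        self.acks = 0
        self.network = network
        network.attach(self)

    def send(self, dest, handler, *args):
        # Sender injects the message directly into the network.
        self.network.inject(dest, handler, self.node_id, args)

    def poll(self):
        # Run the handler of every message that has arrived at this node.
        queue = self.network.queues[self.node_id]
        while queue:
            handler, src, args = queue.popleft()
            handler(self, src, *args)


def ack_handler(node, src):
    # Integrates the reply into the ongoing computation: count completions.
    node.acks += 1


def put_handler(node, src, addr, value):
    # Remote write: store at a pre-agreed address, then reply to the sender.
    node.memory[addr] = value
    node.send(src, ack_handler)


def store_handler(node, src, addr, value):
    # Reply half of a get: store the returned value, no further reply.
    node.memory[addr] = value


def get_handler(node, src, remote_addr, reply_addr):
    # Remote read: reply with the value, addressed to the requester's memory.
    node.send(src, store_handler, reply_addr, node.memory[remote_addr])


def demo():
    net = Network()
    p = Node(0, 8, net)
    q = Node(1, 8, net)

    p.send(q.node_id, put_handler, 3, 42)  # put 42 into q.memory[3]
    q.poll()                               # q stores the value and sends an ack
    p.poll()                               # p runs the ack handler

    q.memory[5] = 7
    p.send(q.node_id, get_handler, 5, 0)   # get q.memory[5] into p.memory[0]
    q.poll()                               # q replies with the value
    p.poll()                               # p stores the reply
    return p, q


if __name__ == "__main__":
    p, q = demo()
    print(q.memory[3], p.acks, p.memory[0])
```

In this sketch no receive call is ever matched against a send and no message is parsed or copied into an intermediate buffer: the destination address travels in the message, which is the property the compiler-generated case relies on.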