Intelligent RAM (IRAM): Chips that remember and compute

David Patterson, Thomas Anderson, Krste Asanovic, Ben Gribstad, Neal Cardwell, Richard Fromm, Jason Golbus, Kimberly Keeton, Christoforos Kozyrakis, Stelianos Perissakis, Randi Thomas, Noah Treuhaft, John Wawrzynek, and Katherine Yelick

patterson@cs.berkeley.edu
http://iram.cs.berkeley.edu/
EECS, University of California
Berkeley, CA 94720-1776

Notes:
- Early. As a result of thinking about 2020: really going to keep spending billions per fab, separate for memory and microprocessor, for the next 25 years?
- SLIDES TO ADD:
  - Breakdown of RAS/CAS times and where the time goes, to show what 60 ns means
  - Put back in DRAM opening chip
  - Get photos/GIFs of boards and chips of Sun Server?
  - History of sizes of Excel, Word? Ask Gray?
  - Ask Sites on quote?

[Figure labels: Proc, L2$, Bus (logic fab); DRAM (DRAM fab)]

IRAM Vision Statement
- Microprocessor & DRAM on a single chip:
  - bridge the processor-memory performance gap via on-chip latency 5-10X, bandwidth 100X
  - improve energy efficiency 2X-4X (no DRAM bus)
  - adjustable memory size/width (designer picks any amount)
  - smaller board area/volume

Notes:
- $B for separate lines for logic and memory
- Single chip: either processor in DRAM or memory in logic fab

Outline
- Today's Situation: Microprocessor
- Today's Situation: DRAM
- IRAM Opportunities
- IRAM Architecture Options
- IRAM Challenges
- Potential Industrial Impact

Processor-DRAM Gap (latency)
[Figure: performance (log scale, 1 to 1000) vs. time (1980 to 2000). CPU curve: Proc 60%/yr. DRAM curve: DRAM 7%/yr. Processor-Memory Performance Gap: grows 50% / year.]

Notes:
- Y axis is performance; X axis is time
- Latency
- Cliché: note that x86 didn't have cache on chip until 1989

Processor-Memory Performance Gap "Tax"

Processor          % Area (cost)   % Transistors (power)
Alpha 21164        37%             77%
StrongArm SA110    61%             94%
Pentium Pro        64%             88%
  2 dies per package: Proc/I$/D$ + L2$

- Caches have no inherent value, only try to close performance gap

Notes:
- 1/3 to 2/3 of the area of the processor
- 386 had 0; successor PPro has second die!

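The "grows 50% / year" annotation on the Processor-DRAM Gap chart follows from the two growth rates on the same chart. A minimal sketch of that arithmetic, assuming both rates compound annually; the function names are illustrative and not from the slides:

```python
def gap_growth_per_year(proc_rate: float = 0.60, dram_rate: float = 0.07) -> float:
    """Annual growth of the processor/DRAM performance ratio.

    proc_rate and dram_rate are the yearly improvement rates from the chart
    (60%/yr and 7%/yr); compounding annually is an assumption of this sketch.
    """
    return (1.0 + proc_rate) / (1.0 + dram_rate) - 1.0


def gap_after(years: int, proc_rate: float = 0.60, dram_rate: float = 0.07) -> float:
    """Size of the performance ratio after `years`, starting from 1.0."""
    return ((1.0 + proc_rate) / (1.0 + dram_rate)) ** years


if __name__ == "__main__":
    # 1.60 / 1.07 is about 1.495, i.e. the gap grows just under 50% per year.
    print(f"gap growth per year: {gap_growth_per_year():.1%}")
    print(f"gap after 20 years:  {gap_after(20):,.0f}x")
```

The ratio 1.60/1.07 comes out slightly under 1.5, which the chart rounds to 50% per year.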
Today's Situation: Microprocessor
- Microprocessor-DRAM performance gap
  - time of a full cache miss in instructions executed
    1st Alpha (7000):   340 ns/5.0 ns =  68 clks x 2 or 136
    2nd Alpha (8400):   266 ns/3.3 ns =  80 clks x 4 or 320
    3rd Alpha (t.b.d.): 180 ns/1.7 ns = 108 clks x 6 or 648
  - 1/2X latency x 3X clock rate x 3X Instr/clock => about 5X
- Power limits performance (battery, cooling)
- Rely on caches to bridge gap
  - Doesn't work well for a few apps: data bases, ...

Notes:
- 1st generation: latency 1/2, but clock rate 3X and IPC is 3X
- Now move to other 1/2 of industry

Today's Situation: DRAM
- Commodity, second source industry => high volume, low profit, conservative
  - Little organization innovation (vs. processors) in 20 years: page mode, EDO, Synch DRAM
- DRAM industry at a crossroads:
  - Fewer DRAMs per computer over time
  - Starting to question buying larger DRAMs?

Notes:
- EDO: small tweak to get 30% more chip BW
- 3) is refresh
- RAMBUS: 10% larger die area
- Will DRAM industry hit a wall on current path?
- Facing a crossroads: next 3 slides

Fewer DRAMs/System over Time
[Figure: DRAM generation across the top ('86 1 Mb, '89 4 Mb, '92 16 Mb, '96 64 Mb, '99 256 Mb, '02 1 Gb); minimum memory size down the side (4 MB, 8 MB, 16 MB, 32 MB, 64 MB, 128 MB, 256 MB); cells give the number of DRAM chips per system. Memory per system growth @ 25% / year; memory per DRAM growth @ 60% / year. (from Pete MacWilliams, Intel)]

Notes:
- Top is generations
- Side is minimum memory per PC
- 32 chips in 1986; 4 chips today
- DRAM @ 60% vs. system at 25%

Reluctance for New DRAMs: DRAM BW ≠ App BW
[Figure labels: Proc, I$, D$, L2$, Bus, DRAM (x4)]
- More App Bandwidth (BW) => Cache misses => DRAM RAS/CAS
- Application BW => Lower DRAM latency
- RAMBUS, Synch DRAM increase BW but higher latency
- EDO DRAM, Synch DRAM < 5% performance in PCs

Notes:
- BW to cache is then the latency + transfer time of block => need latency as much as BW
- EDO: 30% higher DRAM BW => < 5% on PC benchmarks

Multiple Motivations for IRAM
- Some apps: energy, board area, memory size
- Gap means performance limit is memory
- Dwindling interest in future DRAM: 256Mb/1Gb?
  - Too much memory per chip?
  - Industry supplies higher bandwidth at higher latency, but computers need lower latency
- Alternatives: packaging breakthrough, more out-of-order CPU, fix capacity but shrink DRAM die, ...

Notes:
- Memory is performance bottleneck
- Energy, board area, memory size make IRAM attractive
- DRAM futures:
  - Higher capacity/DRAM: fewer chips, lower BW
  - Higher BW/DRAM: higher latency and higher cost/bit, lower BW
- When caches are not the best solution
- Why IRAM exciting? 3 reasons

Potential IRAM Latency: 5 - 10X
- No parallel DRAMs, memory controller, bus to turn around, SIMM module, pins
- New focus: Latency oriented DRAM?
  - Dominant delay = RC of the word lines
  - keep wire length short & block sizes small?
- << 30 ns for 1024b IRAM RAS/CAS?
- AlphaSta. 600: 180 ns = 128b, 270 ns = 512b
  Next generation (21264): 180 ns for 512b?

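The cache-miss table on the "Today's Situation: Microprocessor" slide divides miss latency by cycle time and multiplies by instructions issued per clock. A minimal sketch of that calculation, using the slide's own inputs; the function names are illustrative, and the slide rounds its clock counts, so the computed values land near its figures (136, 320, 648) rather than exactly on them:

```python
def miss_clocks(miss_ns: float, cycle_ns: float) -> float:
    """Clock cycles spent waiting on one full cache miss."""
    return miss_ns / cycle_ns


def miss_cost_in_instructions(miss_ns: float, cycle_ns: float, instr_per_clock: int) -> float:
    """Instruction issue slots lost to one full cache miss."""
    return miss_clocks(miss_ns, cycle_ns) * instr_per_clock


if __name__ == "__main__":
    # (label, miss latency in ns, cycle time in ns, instructions per clock), from the slide
    generations = [
        ("1st Alpha (7000)", 340.0, 5.0, 2),
        ("2nd Alpha (8400)", 266.0, 3.3, 4),
        ("3rd Alpha (t.b.d.)", 180.0, 1.7, 6),
    ]
    for label, miss_ns, cycle_ns, ipc in generations:
        clocks = miss_clocks(miss_ns, cycle_ns)
        lost = miss_cost_in_instructions(miss_ns, cycle_ns, ipc)
        print(f"{label}: {clocks:.1f} clocks, about {lost:.0f} instructions")
```

The trend is the point of the slide: latency only halves across the three generations while clock rate and issue width each roughly triple, so the cost of a miss in instructions rises several-fold.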
Notes:
- 1st
- 2nd: innovate inside DRAM
- Even compared to latest Alpha

Potential IRAM Bandwidth: 100X
- 1024 1Mbit modules, each 1Kb wide (1Gb)
  - 10% @ 40 ns RAS/CAS = 320 GBytes/sec
- If cross bar switch or multiple busses deliver 1/3 to 2/3 of total 10% of modules => 100 - 200 GBytes/sec
- FYI: AlphaServer 8400 = 1.2 GBytes/sec
  - 75 MHz, 256-bit memory bus, 4 banks

Notes:
- 2nd reason
- Delivered BW on AlphaServer

Potential Energy Efficiency: 2X-4X
- Case study of StrongARM memory hierarchy vs. IRAM memory hierarchy
  - cell size advantages => much larger cache => fewer off-chip references => up to 2X-4X energy efficiency for memory
  - less energy per bit access for DRAM
- Memory cell area ratio/process: 21164, SA 110
  cache/logic : SRAM/SRAM : DRAM/DRAM
  25-50 : 10 : 1

Notes:
- Bigger caches or less memory on board
- Cache in logic process vs. SRAM in SRAM process vs. DRAM in DRAM process
- Main reason

Potential Innovation in Standard DRAM Interfaces
- Optimizations when chip is a system vs. chip is a memory component
  - Lower power with more selective module activation?
  - Lower voltage if all signals on chip?
  - Improved yield with variable refresh rate?
- IRAM advantages even greater if innovate inside DRAM memory modules?

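The 320 GBytes/sec figure on the "Potential IRAM Bandwidth: 100X" slide can be checked from the module count, module width, and RAS/CAS time. A minimal sketch, assuming "1Kb wide" means 1024 bits, "10%" means one tenth of the modules are active per access, and 1 GByte = 10^9 bytes; the function names are illustrative:

```python
def peak_bandwidth_gbytes_per_s(
    modules: int = 1024,
    width_bits: int = 1024,
    fraction_active: float = 0.10,
    access_ns: float = 40.0,
) -> float:
    """Aggregate on-chip bandwidth in GBytes/sec (bytes per ns equals 10^9 bytes/sec)."""
    bytes_per_access = modules * fraction_active * width_bits / 8
    return bytes_per_access / access_ns


def delivered_range(peak: float, low: float = 1 / 3, high: float = 2 / 3) -> tuple[float, float]:
    """Bandwidth delivered if a crossbar or multiple busses carry 1/3 to 2/3 of peak."""
    return peak * low, peak * high


if __name__ == "__main__":
    peak = peak_bandwidth_gbytes_per_s()
    lo, hi = delivered_range(peak)
    print(f"peak:      {peak:.1f} GBytes/sec")         # slide rounds this to 320
    print(f"delivered: {lo:.0f} to {hi:.0f} GBytes/sec")  # slide rounds to 100 - 200
    print(f"vs. AlphaServer 8400 (1.2 GBytes/sec): {lo / 1.2:.0f}x to {hi / 1.2:.0f}x")
```

Under these assumptions the peak is about 328 GBytes/sec and the delivered range about 109 to 218 GBytes/sec, consistent with the slide's rounded 320 and 100 - 200, and roughly 100X the AlphaServer 8400 figure.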
Notes:
- System on chip allows optimizations not available in a component
- Now chip architectures

"Vanilla" Approach to IRAM
- Estimate performance of an IRAM version of Alpha (same caches, benchmarks, standard DRAM)
  - Used optimistic and pessimistic factors for logic (1.3-2.0 slower), SRAM (1.1-1.3 slower), DRAM speed (5X-10X faster) for standard DRAM
  - SPEC92 benchmark => 1.2 to 1.8 times slower
  - Database => 1.1 times slower to 1.1 times faster
  - Sparse matrix => 1.2 to 1.8 times faster
- Conventional architecture/benchmarks/DRAM: not exciting performance; energy, board area only

A More Revolutionary Approach
- Faster logic in DRAM process
  - DRAM vendors offer same fast transistors + same number of metal layers as good logic process?
  - @ 20% higher cost per wafer?
  - As die cost ≈ f(die area^4), 4% die shrink => equal cost
- Find an architecture to exploit IRAM yet simple programming model so can deliver exciting cost/performance for many applications
  - Evolve software while changing underlying hardware
  - Simple sequential (not parallel) program; large memory; uniform memory access time

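The claim on the "A More Revolutionary Approach" slide that a 4% die shrink offsets a 20% higher wafer cost can be checked numerically. A minimal sketch, assuming die cost is proportional to the fourth power of die area and that the 4% shrink applies to area; the function names are illustrative:

```python
def relative_die_cost(
    wafer_cost_factor: float = 1.20,
    area_shrink: float = 0.04,
    exponent: int = 4,
) -> float:
    """Die cost relative to the baseline process, with cost proportional to area**exponent."""
    return wafer_cost_factor * (1.0 - area_shrink) ** exponent


def break_even_shrink(wafer_cost_factor: float = 1.20, exponent: int = 4) -> float:
    """Area shrink at which the higher wafer cost is exactly cancelled."""
    return 1.0 - wafer_cost_factor ** (-1.0 / exponent)


if __name__ == "__main__":
    print(f"relative cost after 4% shrink: {relative_die_cost():.3f}")
    print(f"break-even shrink:             {break_even_shrink():.1%}")
```

With these assumptions a 4% shrink leaves the die within about 2% of the baseline cost, and the exact break-even shrink is a little over 4%, which matches the slide's "equal cost" as a rounded statement.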
V-IRAM-2: 0.18 µm, Fast Logic, 1GHz
16 GFLOPS(64b)/128 GOPS(8b)/96MB
[Figure: block diagram. Labels: Memory Crossbar; M (memory modules); Vector Registers; Load/Store; 8K I cache; 8K D cache; 2-way Superscalar Processor; Vector; Instruction Queue; Net Int; 8 x 64 (per lane)]

Notes:
- 1Gbit technology
- Put in perspective: 10X of Cray T90 today

V-IRAM-2 Floorplan
[Figure labels: CPU + $; Memory Crossbar (x2); I/O; 8 Vector Units (+ 1 spare); Memory (48 MBytes) (x2)]
- 0.18 µm, 1 Gbit DRAM
- Die size = DRAM die
- 1B Xtors: 80% Memory, 4% Vector, 3% CPU; regular design
- Spare VU & Memory: 80% die repairable

Notes:
- Floor plan showing memory in purple
- Crossbar in blue (need to match vector unit, not maximum memory syte)
- Vector units in pink
- CPU in orange
- I/O in yellow
- How to spend 1B transistors vs. all CPU!

Vector IRAM Generations

V-IRAM-1 (1999)                      V-IRAM-2 (2002)
256 Mbit generation (0.25 µm)        1 Gbit generation (0.18 µm)
Die size = DRAM (290 mm²)            Die size = DRAM (420 mm²)
1.5 - 2.0 volts (logic)              1.0 - 1.5 volts (logic)
0.5 - 2.0 watts                      0.5 - 2.0 watts
300 - 500 MHz                        500 - 1000 MHz
4 64-bit pipes/lanes                 8 64-bit pipes/lanes
4 GFLOPS(64b)/32 GOPS(8b)            16 GFLOPS/128 GOPS
24 MB capacity + DRAM bus            96 MB cap. + DRAM bus
PCI bus/Fast serial lines            Firewire/FC-AL serial lines

Notes:
- We work first in 256 Mbit technology (V-IRAM-1); products in Gbit technology

IRAM Applications
- "Supercomputer on a AA battery"
  - Super PDA/Smart Phone: speech I/O + voice email + pager + GPS + ...
  - Super Gameboy/Portable Network Computer: 3D graphics + 3D sound + speech I/O + Gbit link + ...
- Intelligent SIMM (ISIMM)
  - Put IRAMs + serial network + serial I/O into SIMM & put in standard memory system => Cluster/Network of IRAMs
  - Read/compare/write all memory in 1 ms
  - Apps? Full text search? Fast sort? No index database?
- Intelligent Disk (IDISK): 2.5" disk + IRAM + net.

Notes:
- A supercomputer you could lose? "Honey, I can't find my supercomputer; have you seen it?"
- Look at the speed of processor and amount of I/O: seems that can have a balanced system using GHz serial I/O
- Point 2: DRAM vs. disk: now 10^4 faster latency and bandwidth

ISIMM/IDISK Example: Sort
- Berkeley NOW cluster has world record sort: 6GB disk-to-disk using 64 processors in 1 minute
- Balanced system ratios for processor:memory:I/O
  - Processor: N MIPS
  - Large memory: N Mbit/s disk I/O & 2N Mb/s Network
  - Small memory: 2N Mbit/s disk I/O & 2N Mb/s Network
- Serial I/O at 2-4 GHz today
- IRAM: 2-4 GIPS + 2 x 2-4 Gb/s I/O + 2 x 2-4 Gb/s Net
- ISIMM: 8 IRAMs + net switch + FC-AL links + disks
- IDISK: Intelligent Disks + switch = cluster

Why IRAM now? Lower risk than before
- DRAM manufacturers now facing challenges
  - Before not interested, so early IRAM = SRAM
- Past efforts memory limited => multiple chips => 1st solve the unsolved (parallel processing)
  - Gigabit DRAM => about 100 MB; OK for many apps?
- Fast Logic + DRAM available now/soon?
- Embedded apps leverage energy efficiency, adjustable mem. capacity, smaller board area => OK market v. desktop (55M 32b RISC '96)

Notes:
- Why are we excited now? Not a new idea.
- Embedded market is larger (32b market is growing)
- E.g., 5.5 M MIPS microprocessors shipped in 1995 (more than PPC). Only 5% went into computers.
- Rule of thumb: it takes 1 year between a chip working and shipping a system (caches, busses, boards). This means whatever fab must wait 1 year before it gets volume. Is this a good idea?
- IRAM as system on a chip should be more like putting computers on a LAN

IRAM Challenges
- Chip
  - Speed, area, power, yield, cost in DRAM process?
  - Good performance and reasonable power?
  - BW/Latency oriented DRAM tradeoffs?
  - Testing time of IRAM vs DRAM vs microprocessor?
  - Reconfigurable logic to make IRAM more generic?
- Architecture
  - How to turn high memory bandwidth into performance for real applications?
  - Extensible IRAM: Large program/data solution? (e.g., external DRAM, clusters, CC-NUMA, ...)

Notes:
- Or speed, area, power, yield of DRAM in logic process
- Can slow down performance of a portion and still be attractive
- Testing time much worse, or better due to BIST?
- DRAM operates at 1 watt: every 10 degrees increase in operating temperature doubles refresh rate; what to do?
- IRAM: acts as MP, acts as cache to real memory, acts as low part of physical address space + OS?

IRAM Conclusion
- IRAM potential in bandwidth (memory and I/O), latency, energy, capacity, board area; challenges in yield, power, testing, memory size
- 10X-100X improvements based on technology shipping for 20 years (not photons, MEMS, ...)
- Potential shift in balance of power in DRAM/microprocessor (µP) industry in 5-7 years?
  µP-oriented vs. DRAM-oriented manufacturers: Who ships the most memory? Who ships the most microprocessors?

Notes:
- Captain of industry challenge is taking advantage of new technology once see quantification
- Balance of power: MPer companies shipping most of DRAM, or DRAM companies shipping most of MPers
- Not talking about exotic technology based on photons or neurons; based on opening up technology shipped for 20 years

Interested in Participating?
- Looking for industrial partners to help fab, (design?) test chips and prototype of V-IRAM-1
  - Fast, modern DRAM process
  - Existing RISC CPU core?
- Looking for partners with memory intensive apps
- Contact us if you're interested:
  http://iram.cs.berkeley.edu/
  email: patterson@cs.berkeley.edu
- Thanks for advice/support: DARPA, Intel, Neomagic, Samsung, SGI/Cray, Sun

Backup Slides
(The following slides are used to help answer questions)

Why a company should try IRAM
- If IRAM doesn't happen, then someday:
  - $10B fab for 16B Xtor MPU (too many Xtors per die)??
  - $10B fab for 16 Gbit DRAM (too many bits per die)??
- This is not rocket science. In 1997:
  - 25-50X improvement in memory density => more memory per die or smaller die
  - 10X-100X improvement in memory performance
  - Regularity simplifies design/CAD/validate: 1B Xtors "easy"
  - Logic same speed
  - About 20% higher cost / wafer (but redundancy improves yield)
- IRAM success requires MPU expertise + DRAM fab

Notes:
- ISIMM/IDISK: Intel pursuing cluster strategy already
- Captain of industry challenge is taking advantage of new technology once see quantification
- Balance of power: MPer companies shipping most of DRAM, or DRAM companies shipping most of MPers
- Not talking about exotic technology based on photons or neurons; based on opening up technology shipped for 20 years

Words to Remember
"...a strategic inflection point is a time in the life of a business when its fundamentals are about to change. ... Let's not mince words: A strategic inflection point can be deadly when unattended to. Companies that begin a decline as a result of its changes rarely recover their previous greatness."
  Only the Paranoid Survive, Andrew S. Grove, 1996

Commercial IRAM highway is governed by memory per IRAM?
[Figure labels: Graphics Acc.; Super PDA/Phone/Games; Embedded Proc.; Network Computer; Laptop; 2 MB; 8 MB; 32 MB]

Notes:
- Limited by DRAM on chip: DRAM/chip increases faster than application memory demand, so I expect new applications to become popular as memory per chip increases
- 1 MB to 4 MB to 16 MB to 64 MB (1 Gbit = 128 MB)

Energy to Access Memory by Level of Memory Hierarchy

For 1 access, measured in nJoules   Conventional   IRAM
on-chip L1$ (SRAM)                    0.5            0.5
on-chip L2$ (SRAM v. DRAM)            2.4            1.6
L1 to Memory (off- v. on-chip)       98.5            4.6
L2 to Memory (off-chip)             316.0            (n.a.)

- Based on Digital StrongARM, 0.35 µm technology
- See "The Energy Efficiency of IRAM Architectures," 24th Int'l Symp. on Computer Architecture, June 1997

Reluctance for New DRAMs: Proc. v. DRAM BW, Min. Mem. size
- Processor DRAM bus BW = width x clock rate
  - Pentium Pro = 64b x 66 MHz, about 500 MB/sec
  - RISC = 256b x 66 MHz, about 2000 MB/sec
- DRAM bus BW = width x clock rate
  - EDO DRAM, 8b wide x 40 MHz = 40 MB/sec
  - Synch DRAM, 16b wide x 125 MHz = 250 MB/sec
- CPU BW / DRAM BW = 8-16 chips minimum
  - 64Mb => 64-128 MB min. memory; 256Mb/Gb?
  - Wider DRAMs more expensive: bigger die, test time

Notes:
- 2nd slide for reluctance
- 1st processor, 2nd DRAM
- Divide gets 8 to 16
- 64Mb is too large?
- Hence prefer older generation because cheaper per bit AND more desirable size

[Figure: Bits of Arithmetic Unit (10 to 10000) vs. Mbits of Memory (0.1 to 100). Points: Computational RAM, PIP-RAM, Mitsubishi M32R/D, Execube, Pentium Pro, Alpha 21164, Transputer T9, PPRAM, IRAM UNI?, IRAM MPP?. Legend: SIMD on chip (DRAM); Uniprocessor (SRAM); MIMD on chip (DRAM); Uniprocessor (DRAM); MIMD component (SRAM)]

Notes:
- Scale no. of processors with memory capacity: on-chip MPP; difficult SW problem, especially with limited memory/proc
- Scale memory capacity with processor speed: uniprocessor; easier SW problem, especially with more memory/proc

IRAM Conclusion
- Research challenge is quantifying the evolutionary-revolutionary spectrum
[Figure: spectrum from Evolutionary to Revolutionary: Packaging; Standard CPU in DRAM process; Standard ISA + Vector CPU in DRAM process; New ISA CPU in DRAM process; New ISA + FPGA in DRAM process]

Notes:
- Captain of industry challenge is taking advantage of new technology once see quantification
- Balance of power: MPer companies shipping most of DRAM, or DRAM companies shipping most of MPers

Justification #2: Berkeley has done one lap; ready for new architecture?
- RISC: Instruction set/Processor design + Compilers (1980-84)
- SOAR/SPUR: Obj. Oriented SW, Caches, & Shared Memory Multiprocessors + OS kernel (1983-89)
- RAID: Disk I/O + File systems (1988-93)
- NOW: Networks + Protocols (1993-98)
- IRAM: Instruction set/Processor design/Memory Hierarchy and Compilers/OS (1996-200?)

21st Century Benchmarks?
- Potential Applications (new model highlighted)
  - Text: spelling checker (ispell), Java compilers (Javac, Espresso), content-based searching (Digital Library)
  - Image: text interpreter (Ghostscript), mpeg-encode, ray tracer (povray), Synthetic Aperture Radar (2D FFT)
  - Multimedia: Speech (Noway), Handwriting (HSFSYS)
  - Simulations: Digital circuit (DigSim), Mandelbrot (MAJE)
- Others? Suggestions requested!
  - Encryption (pgp), Games?, Object Relational Database?, Word Proc?, Reality Simulation/Holodeck?, ...
