CS 294-4 "Intelligent DRAM (IRAM)" Wednesday-Friday 2:10-3:30 in 505 Soda Hall Course Control Number: 24942 4 Units Prerequisite: CS 250 or CS 252 or CS 254 or CS 262 or CS 264 or EECS 225A or EECS 241 Background: Microprocessors and memories are made on distinct manufacturing lines, yielding 10M transistor microprocessors and 256M transistor DRAMs. One of the biggest performance challenge today is the speed mismatch between the microprocessors and memory. To address this challenge, I predict that over the next decade processors and memory will be merged onto a single chip. Not only will this narrow or altogether remove the processor-memory performance gap, it will have the following additional benefits: provide an ideal building-block for parallel processing, amortize the costs of fabrication lines, and better utilize the phenomenal number of transistors that can be placed on a single chip. Let's dub it an "IRAM", standing for Intelligent RAM, since most of transistors on this merged chip will be devoted to memory. Whereas current microprocessors rely on hundreds of wires to connect to external memory chips, IRAMs will need no more than computer network connections and a power plug. All input/output devices will be linked to them via networks, as will be other IRAMs. If they need more memory, they get more processing power as well, and vice versa--an arrangement that will keep the memory capacity and processor speed in balance. A single gigabit IRAM should have an internal memory bandwidth of nearly 1000 gigabits per second (32K bits in 50 ns), a hundredfold increase over the fastest computers today. Off-chip accesses will go over 1 gigabit per second serial links. Hence the fastest programs will keep most memory accesses within a single IRAM, rewarding compact representations of code and data. Course: This advanced graduate course re-examines the design of hardware and software that is based on the traditional separation of the memory and the processor. Without prior constraints of legacy architecture or legacy software, the goal of the course is to lay the foundation for IRAM; it could play the role that prior Berkeley courses did for RISC and RAID. As in the past, this is a true EECS course which needs a mixture of students with different backgrounds: IC design, computer architecture, compilers, and operating systems. The ideal student will have taken one of the prerequisites, enjoys learning from students in other disciplines, shows initiative to help identify important questions and sources of answers, and is excited by the opportunity to shape the directions of a new technology where many issues are cross-disciplinary and unresolved. The first part of the course will consist of weekly readings with round table discussions followed by a short lecture to bring people of all backgrounds up to speed for the next topic. There will also be several guest lectures followed by extensive questions and answers. Students will take turns putting up the summary of the paper and conclusions from the discussions and lectures on the course home page. In the last part of the course we will break up into teams to work on related term projects, ideally with an interim milestone to make sure that the project makes sense and to make midcourse corrections in the projects. The end of the course will be a series of presentations of the results and then a final lecture where we determine our progress on IRAMs and what are the remaining steps and most promising directions. 
The home page at the end of the course should document our contributions to IRAM. There are no exams: grades are based on class participation and on the term projects.

I expect the course and projects will answer questions such as:

• Are vector instructions needed to use IRAM bandwidth efficiently?
• Does current compiler technology allow replacement of traditional multilevel data caches with scratch pad memories or vector registers? (For example, Dick Sites has an Alpha address trace of a database that breaks all known data caches: how well would that trace perform on an IRAM?)
• How much bigger and slower is a microprocessor designed in a DRAM process versus an IC process tuned for microprocessors? (For example, what is the size and clock rate of a MIPS CPU designed in a straight DRAM process?)
• What are the appropriate compiler optimizations when data bandwidth is relatively cheap (due to IRAM) and instructions are relatively slow (due to lower clock rates)?
• Does the power budget of a DRAM imply that the IRAM processor must use low-power techniques? How does that impact IRAM performance?
• An alternative model is a new packaging technology ("flip chip") that promises thousands of wires between a processor chip and a DRAM chip: if we can get access to the full page mode buffer on a DRAM in a single 8K bit transfer, do the architecture/software research issues remain the same even if the hardware implementation is quite different?
• Current data structures allocate maximum sizes per data element: what is the real size of data elements in a running program, and how often does that size change? (For example, what is the allocated data size vs. the actual size in the SPEC95 programs?)
• How can compression, which is inherently variable, be combined with the fixed-block architecture of IRAMs?
• Given the importance of compact code and data, what is the tradeoff between segmented and fixed addressing?
• Can linked data structures be linearized on the fly to improve IRAM performance?
• Are programs written in Java, which emphasizes code size and uses garbage collection, a better match to IRAM than programs written in C, which ignores code size and relies on malloc?
• Are programs written in Fortran 90, which offers array operations, better for IRAMs than programs written in Fortran 77, which does not?
• Are gigabit serial lines sufficient to satisfy the IRAM demands on disk, networks, and displays? Do we need to stripe data across these lines? How many lines do we need? (A back-of-envelope sketch of the bandwidth involved follows this list.)
• What are the characteristics of an ideal operating system for an IRAM: virtual memory, scheduling, protection, and so on?
• What applications are a good match to IRAM: digital signal processing, systolic array applications, graphics? Which are a poor match to IRAM?
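To make the bandwidth gap behind several of these questions concrete, here is a minimal back-of-envelope sketch in C. It only restates the figures quoted above for a gigabit IRAM: a 32K-bit internal access every 50 ns versus 1 gigabit per second off-chip serial links. The exact arithmetic gives roughly 650 Gbit/s internally, the same order of magnitude as the "nearly 1000 gigabits per second" quoted in the background. The program itself, its variable names, and the example aggregate I/O demand at the end are illustrative assumptions, not part of any IRAM design.

#include <stdio.h>

int main(void) {
    /* Figures taken from the announcement above. */
    const double bits_per_access = 32.0 * 1024.0; /* 32K bits per internal access */
    const double access_time_s   = 50e-9;         /* 50 ns DRAM access time       */
    const double serial_link_bps = 1e9;           /* 1 gigabit/s off-chip link    */

    /* Internal memory bandwidth of a single gigabit IRAM. */
    double internal_bps = bits_per_access / access_time_s;

    printf("internal bandwidth      : %.0f Gbit/s\n", internal_bps / 1e9);
    printf("one serial link         : %.0f Gbit/s\n", serial_link_bps / 1e9);
    printf("ratio (on-chip:off-chip): %.0f : 1\n", internal_bps / serial_link_bps);

    /* Hypothetical aggregate I/O demand (an assumption, for illustration only):
       how many 1 Gbit/s links would have to be striped to carry it? */
    double assumed_io_demand_bps = 4e9; /* 4 Gbit/s, an arbitrary example */
    printf("links needed for a %.0f Gbit/s demand: %.0f\n",
           assumed_io_demand_bps / 1e9, assumed_io_demand_bps / serial_link_bps);
    return 0;
}

The resulting ratio of several hundred to one between on-chip and off-chip bandwidth is what makes compact representations of code and data, and possibly striping across several serial lines, central questions for the course.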