CS 294-4 "Intelligent DRAM (IRAM)" Wednesday-Friday 2:10-3:30 in 505 Soda Hall Course Control Number: 24942 4 Units Prerequisite: CS 250 or CS 252 or CS 254 or CS 262 or CS 264 or EECS 225A or EECS 241 Background: Microprocessors and memories are made on distinct manufacturing lines, yielding 10M transistor microprocessors and 256M transistor DRAMs. One of the biggest performance challenge today is the speed mismatch between the microprocessors and memory. To address this challenge, I predict that over the next decade processors and memory will be merged onto a single chip. Not only will this narrow or altogether remove the processor-memory performance gap, it will have the following additional benefits: provide an ideal building-block for parallel processing, amortize the costs of fabrication lines, and better utilize the phenomenal number of transistors that can be placed on a single chip. Let's dub it an "IRAM", standing for Intelligent RAM, since most of transistors on this merged chip will be devoted to memory. Whereas current microprocessors rely on hundreds of wires to connect to external memory chips, IRAMs will need no more than computer network connections and a power plug. All input/output devices will be linked to them via networks, as will be other IRAMs. If they need more memory, they get more processing power as well, and vice versa--an arrangement that will keep the memory capacity and processor speed in balance. A single gigabit IRAM should have an internal memory bandwidth of nearly 1000 gigabits per second (32K bits in 50 ns), a hundredfold increase over the fastest computers today. Off-chip accesses will go over 1 gigabit per second serial links. Hence the fastest programs will keep most memory accesses within a single IRAM, rewarding compact representations of code and data. Course: This advanced graduate course re-examines the design of hardware and software that is based on the traditional separation of the memory and the processor. Without prior constraints of legacy architecture or legacy software, the goal of the course is to lay the foundation for IRAM; it could play the role that prior Berkeley courses did for RISC and RAID. As in the past, this is a true EECS course which needs a mixture of students with different backgrounds: IC design, computer architecture, compilers, and operating systems. The ideal student will have taken one of the prerequisites, enjoys learning from students in other disciplines, shows initiative to help identify important questions and sources of answers, and is excited by the opportunity to shape the directions of a new technology where many issues are cross-disciplinary and unresolved. The first part of the course will consist of weekly readings with round table discussions followed by a short lecture to bring people of all backgrounds up to speed for the next topic. There will also be several guest lectures followed by extensive questions and answers. Students will take turns putting up the summary of the paper and conclusions from the discussions and lectures on the course home page. In the last part of the course we will break up into teams to work on related term projects, ideally with an interim milestone to make sure that the project makes sense and to make midcourse corrections in the projects. The end of the course will be a series of presentations of the results and then a final lecture where we determine our progress on IRAMs and what are the remaining steps and most promising directions. 
The home page at the end of the course should document our contributions to IRAM. There are no exams: grades are based on class participation and on the term projects.

I expect the course and projects will answer questions such as:

• Are vector instructions needed to use IRAM bandwidth efficiently?
• Does current compiler technology allow replacement of traditional multilevel data caches with scratch pad memories or vector registers? (For example, Dick Sites has an Alpha address trace of a database that breaks all known data caches: how well would that trace perform on an IRAM?)
• How much bigger and slower is a microprocessor designed in a DRAM process versus an IC process tuned for microprocessors? (For example, what is the size and clock rate of a MIPS CPU designed in a straight DRAM process?)
• What are the appropriate compiler optimizations when data bandwidth is relatively cheap (due to IRAM) and instructions are relatively slow (due to lower clock rates)?
• Does the power budget of a DRAM imply that the IRAM processor must use low-power techniques? How does that impact IRAM performance?
• An alternative model is a new packaging technology ("flip chip") that promises thousands of wires between a processor chip and a DRAM chip: if we can get access to the full page mode buffer on a DRAM in a single 8K bit transfer, do the architecture/software research issues remain the same even if the hardware implementation is quite different?
• Current data structures allocate maximum sizes per data element: what is the real size of data elements in a running program, and how often does that size change? (For example, what is the allocated data size vs. the actual size in the SPEC95 programs?)
• How can compression, which is inherently variable, be combined with the fixed-block architecture of IRAMs?
• Given the importance of compact code and data, what is the tradeoff between segmented and fixed addressing?
• Can linked data structures be linearized on the fly to improve IRAM performance?
• Are programs written in Java, which emphasizes code size and uses garbage collection, a better match to IRAM than programs written in C, which ignores code size and relies on malloc?
• Are programs written in Fortran 90, which offers array operations, better for IRAMs than programs written in Fortran 77, which does not?
• Are gigabit serial lines sufficient to satisfy the IRAM demands on disk, networks, and displays? Do we need to stripe data across these lines? How many lines do we need? (A back-of-envelope sketch of the bandwidth involved follows this list.)
• What are the characteristics of an ideal operating system for an IRAM: virtual memory, scheduling, protection, and so on?
• What applications are a good match to IRAM: digital signal processing, systolic array applications, graphics? Which are a poor match to IRAM?
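To make the bandwidth gap behind several of these questions concrete, here is a minimal back-of-envelope sketch in C. It only restates the figures quoted above for a gigabit IRAM: a 32K-bit internal access every 50 ns versus 1 gigabit per second off-chip serial links. The exact arithmetic gives roughly 650 Gbit/s internally, the same order of magnitude as the "nearly 1000 gigabits per second" quoted in the background. The program itself, its variable names, and the example aggregate I/O demand at the end are illustrative assumptions, not part of any IRAM design.

#include <stdio.h>

int main(void) {
    /* Figures taken from the announcement above. */
    const double bits_per_access = 32.0 * 1024.0; /* 32K bits per internal access */
    const double access_time_s   = 50e-9;         /* 50 ns DRAM access time       */
    const double serial_link_bps = 1e9;           /* 1 gigabit/s off-chip link    */

    /* Internal memory bandwidth of a single gigabit IRAM. */
    double internal_bps = bits_per_access / access_time_s;

    printf("internal bandwidth      : %.0f Gbit/s\n", internal_bps / 1e9);
    printf("one serial link         : %.0f Gbit/s\n", serial_link_bps / 1e9);
    printf("ratio (on-chip:off-chip): %.0f : 1\n", internal_bps / serial_link_bps);

    /* Hypothetical aggregate I/O demand (an assumption, for illustration only):
       how many 1 Gbit/s links would have to be striped to carry it? */
    double assumed_io_demand_bps = 4e9; /* 4 Gbit/s, an arbitrary example */
    printf("links needed for a %.0f Gbit/s demand: %.0f\n",
           assumed_io_demand_bps / 1e9, assumed_io_demand_bps / serial_link_bps);
    return 0;
}

The resulting ratio of several hundred to one between on-chip and off-chip bandwidth is what makes compact representations of code and data, and possibly striping across several serial lines, central questions for the course.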