Vector IRAM Dave Patterson March 2, 1996

Problems in IRAM design:

  • Logic is slower in DRAM process
  • Adding a processor means an instruction set, which customizes the part and limits software
  • How can you really use the phenomenal bandwidth?

    Observations:

  • The vector processing units on different brands of vector computer are largely the same, with the same operations and registers. There is basically widespread agreement on the design of vector units.
  • Vector units can trade off clock rate and amount of hardware: you can build a vector processor with the same peak bandwidth in a slower technology simply by replicating the function units so that they do, say, 4 elements per clock cycle.
  • A large cost of vector systems is the network that connects the memory banks to the vector units. It is essentially a crossbar or fat tree.
  • There is a very clear dividing line between a vector processor and a scalar processor: vector instructions, plus operations to load the vector length and vector mask registers, are the primary items that cross the line.
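
    The clock-rate/hardware tradeoff in the second observation can be sketched with a few lines of arithmetic; all figures below are assumed for illustration, not measurements:

    ```python
    # Hypothetical numbers: peak bandwidth is preserved when a slower clock
    # is offset by replicating the function units.

    def peak_elements_per_sec(clock_hz, elements_per_clock):
        """Peak vector throughput is just clock rate times elements per clock."""
        return clock_hz * elements_per_clock

    # A fast-logic design: 200 MHz, 1 element per clock (assumed figures).
    fast = peak_elements_per_sec(200e6, 1)

    # A DRAM-process design at half the clock, with 2x the function units.
    slow = peak_elements_per_sec(100e6, 2)

    assert fast == slow  # same peak bandwidth, traded hardware for clock rate
    ```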

    Proposal: Instead of putting a full processor in a DRAM, put a vector unit in a DRAM and provide a port between a traditional processor and the vector IRAM. Across this port go vector instructions and possibly scalar values, which can specify a lot of work in a few bits. Thus a conventional processor-cache complex might handle the things that work well on caches, with anything that needs lots of bandwidth done inside the memory, using the standard load-store interface to communicate between the two worlds.
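
    The division of labor can be sketched as a toy model; the opcodes, register counts, and command encoding below are invented for illustration, not a proposed design:

    ```python
    # A toy model of the proposed split: the scalar processor ships compact
    # vector commands across a narrow port, and the vector IRAM does the
    # bandwidth-heavy work against its own on-chip memory.

    class VectorIRAM:
        def __init__(self, words=1024, vregs=8, vlen=64):
            self.mem = [0.0] * words                        # on-chip DRAM
            self.vreg = [[0.0] * vlen for _ in range(vregs)]
            self.vl = vlen                                  # vector length register

        def execute(self, op, *args):
            """One compact command specifies work on up to vl elements."""
            if op == "setvl":
                self.vl = args[0]
            elif op == "vload":                             # vload vd, base
                vd, base = args
                self.vreg[vd][:self.vl] = self.mem[base:base + self.vl]
            elif op == "vadd":                              # vadd vd, vs1, vs2
                vd, vs1, vs2 = args
                for i in range(self.vl):
                    self.vreg[vd][i] = self.vreg[vs1][i] + self.vreg[vs2][i]
            elif op == "vstore":                            # vstore vs, base
                vs, base = args
                self.mem[base:base + self.vl] = self.vreg[vs][:self.vl]

    iram = VectorIRAM()
    iram.mem[0:4] = [1.0, 2.0, 3.0, 4.0]
    iram.mem[100:104] = [10.0, 20.0, 30.0, 40.0]
    # Five small commands from the scalar side; all the data movement and
    # arithmetic happens inside the memory chip.
    for cmd in [("setvl", 4), ("vload", 0, 0), ("vload", 1, 100),
                ("vadd", 2, 0, 1), ("vstore", 2, 200)]:
        iram.execute(*cmd)
    assert iram.mem[200:204] == [11.0, 22.0, 33.0, 44.0]
    ```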

    Details:

  • The width of this port depends on how radical you want the IRAM to be.
  • A conservative model would use, say, Synchronous DRAM to send instructions in pieces into an instruction queue at whatever rate the processor can generate information. By reserving a portion of the address space for commands, you can get information from the address lines as well as the data lines.
  • A more radical model might use the Rambus interface to ship instructions in 8b chunks. There is no need to send a single instruction at a time.
  • Of course, you can make the port as wide as you want.
  • You can have multiple vector IRAMs if you need more memory or more processing; communication between IRAMs could be done by:
      – chip-to-chip transfers over the memory bus, assuming an appropriate controller;
      – through the processor via a block move instruction using the normal memory interface;
      – via a network connection between IRAMs.
  • By adding some instructions to manipulate the vector control registers (moves, possibly simple arithmetic), you may be able to reduce the number of scalar moves.
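
    The radical model above can be sketched as follows; the 32-bit instruction width and low-byte-first ordering are assumptions for illustration, not a proposed format:

    ```python
    # Shipping instructions over a Rambus-style 8-bit interface: the sender
    # splits each instruction word into byte-sized chunks, and the IRAM side
    # reassembles bytes into its instruction queue.

    def to_bytes(insn32):
        """Split a 32-bit instruction word into four 8-bit chunks, low byte first."""
        return [(insn32 >> (8 * i)) & 0xFF for i in range(4)]

    def from_bytes(chunks):
        """Reassemble the 8-bit chunks on the IRAM side."""
        insn = 0
        for i, b in enumerate(chunks):
            insn |= b << (8 * i)
        return insn

    queue = []   # the vector IRAM's instruction queue
    wire = []    # bytes in flight on the 8-bit interface
    for insn in (0xDEADBEEF, 0x00C0FFEE):
        wire.extend(to_bytes(insn))   # nothing forces one instruction per transfer
    while wire:
        queue.append(from_bytes(wire[:4]))
        del wire[:4]
    assert queue == [0xDEADBEEF, 0x00C0FFEE]
    ```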

    Comments:

  • The single, central set of vector registers means that memory on chip is uniformly accessible, unlike some of the SIMD approaches to IRAM.
  • The speed of the logic in a DRAM process simply determines the width of the vector units. If logic is 1/2 speed, we can get the same peak performance by doubling the number of vector elements processed per clock (costing twice as much hardware). The cost is larger vector startup time, making a large N1/2 value.
  • Key to the design is the interconnect: how much area does it take to provide a potent interconnect?
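
    A minimal model of the second comment, with assumed clock rates and an assumed 10-cycle startup: halving the clock while doubling the lanes preserves peak rate but doubles N1/2, the vector length needed to reach half of peak.

    ```python
    # Startup-time cost of trading clock rate for lanes (all figures assumed).

    def elements_per_usec(n, clock_mhz, lanes, startup_cycles):
        """Rate for an n-element vector op: startup plus n/lanes cycles."""
        cycles = startup_cycles + n / lanes
        return n / (cycles / clock_mhz)

    def n_half(clock_mhz, lanes, startup_cycles):
        """Vector length at which the op reaches half its peak rate."""
        # Half of peak when n/lanes == startup_cycles, i.e. n == lanes * startup.
        return lanes * startup_cycles

    fast = dict(clock_mhz=200, lanes=1, startup_cycles=10)   # fast logic
    slow = dict(clock_mhz=100, lanes=2, startup_cycles=10)   # half clock, 2x lanes

    # Near-identical rates on long vectors (same peak of 200 elements/usec)...
    assert abs(elements_per_usec(10_000, **fast)
               - elements_per_usec(10_000, **slow)) < 1.0
    # ...but the wider, slower design needs longer vectors to amortize startup.
    assert n_half(**slow) == 2 * n_half(**fast)
    ```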

    Questions:

  • How much data and how many instructions really cross the vector-scalar interface?
  • How expensive is the hardware to fully connect to the memory modules?
  • How much area would it take for, say, 16 or 32 vector registers, each with 64 or 128 vector elements? What about the functional units?
  • Can the scalar registers remain in the vector unit, or must scalar values be transmitted as well across that interface? (Since it is only for reads, this may not be too bad.)
  • How good a match are vectors to visualization instructions?
  • Could such chips become popular as graphics accelerators?
  • Is the overhead of communication so high that it is better to perform length-1 vector operations than to perform operations in the scalar unit?
  • Can software handle all the synchronization between scalar and vector accesses? (e.g., read/write conflicts to the same word)
  • Do any such machines exist so that we could look at the code?
  • How many applications will run well with vector assist?