CS 258 Parallel Processors
University of California, Berkeley
Dept. of Electrical Engineering and Computer Sciences
 
 
Prof. David E. Culler
Spring 1999


Project Suggestions
 

Evaluation

  1. Intel has produced a series of SMP-oriented microprocessors and chipsets, starting with the PPro 'quadpack', then the Pentium II, now the Xeon, and soon the PIII.  These have different physical organizations, somewhat different system busses, cache coherence, and synchronization support.  The scale of the basic node and the emphasis on NUMA extensions have changed.  There is lots of scattered data out there, but no real systematic study of the differences and their implications.  We've got several of the machines available, but some careful thought needs to go into the evaluation methodology, benchmarks, etc.
  2. VIA is an emerging high-performance communication substrate for clusters.  An increasing number of network and platform vendors are supporting it, and major applications, such as Oracle, are relying on it.  There is a limited test suite, but benchmarking in this regime is pretty immature.  Can you develop a set of microbenchmarks to tease apart aspects of the implementation, like Saavedra-Barrera's microbenchmarks do for memory systems?  The LogP benchmarks for Active Messages and the ISCA 96 T3D paper are potential starting points; a round-trip sketch in this spirit follows the list.
  3. Multithreading is finally real.  The Tera machine is running at SDSC.  It has hardware support for thousands of threads with single-cycle context switching, and it relies on multithreading and lots of wires, rather than caches, to keep the processor fed.  There were a couple of preliminary studies in SC98, but they were of the form, "How does it compare to my SMP for the applications that I'm running on my SMP?"  The conclusion is "You need to use lots of threads!"  Surprise.  There is wide scope for interesting evaluations, from microbenchmarks on up.  It would be especially important to look at the class of programs that don't do well on conventional architectures, such as those with very large memory bandwidth requirements.  Maybe this machine should be evaluated on primitives typical of databases.  Maybe it's a sorting machine!
  4. Programming environment and tools.  Implementation and analysis of MPI, SVM, Split-C, etc. over VIA.  You end up learning a lot about architecture by using it.
  5. Performance Engineering: Tools for understanding parallel machine and/or parallel application performance.  Characterizing application demand, sensitivity, scalability.  Assistance in developing and validating analytical models.
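
For item 2, here is a minimal sketch, in C, of the kind of round-trip microbenchmark that teases apart a communication layer in the LogP spirit: time many round trips at each message size, then vary the size and the number of outstanding messages to separate latency, overhead, and gap.  The send_msg() and recv_msg() calls are hypothetical placeholders for whatever transport is actually under test (VIA, Active Messages, or sockets as a baseline); only the measurement structure is the point.

    /* pingpong.c - round-trip microbenchmark sketch (not a real VIA program).
     * send_msg()/recv_msg() are hypothetical wrappers over the transport
     * under test; the peer node simply echoes every message it receives.
     */
    #include <stdio.h>
    #include <sys/time.h>

    extern void send_msg(void *buf, int nbytes);   /* placeholder */
    extern void recv_msg(void *buf, int nbytes);   /* placeholder */

    #define TRIALS 10000

    static double now_usec(void)                   /* wall-clock time in usec */
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1e6 + tv.tv_usec;
    }

    static double round_trip_usec(int nbytes)
    {
        char buf[8192];
        double start;
        int i;

        for (i = 0; i < 100; i++) {                /* warm up connections, caches */
            send_msg(buf, nbytes);
            recv_msg(buf, nbytes);
        }
        start = now_usec();
        for (i = 0; i < TRIALS; i++) {
            send_msg(buf, nbytes);                 /* outbound: o_send + L */
            recv_msg(buf, nbytes);                 /* echoed reply: L + o_recv */
        }
        return (now_usec() - start) / TRIALS;      /* average round-trip time */
    }

    int main(void)
    {
        int n;
        for (n = 8; n <= 8192; n *= 2)
            printf("%5d bytes: RTT = %8.2f usec\n", n, round_trip_usec(n));
        return 0;
    }

Repeating the same loop with many messages posted before the first reply is consumed gives the gap (bandwidth limit) rather than the latency, and subtracting a locally measured send overhead separates o from L.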

Radical Designs

We have more transistors than we know what to do with, we are crossing critical levels of integration, wires are going really fast, optics is here, and applications are changing.  Perhaps it's time to think about things very differently, or to return to old ideas and see if their time has come!
  1. The impact of the system-on-a-chip building block.  We saw a tremendous architectural renaissance when a processor fit on a chip and a computer fit on a board.  It enabled not just the personal computer and the workstation, but the bus-based SMP, the MPPs, and a host of more radical concepts - dataflow, systolic arrays, and so on.  We are now reaching the point where a complete system fits on a chip.  How does this change the equation?  What does it enable that wasn't viable before?  There were designs that pushed into this space, such as the J-machine and the Cosmic Cube, but the technology wasn't ready.  Or maybe the ideas were somehow flawed.  There are lots of local spins on this with the IRAM work.  Technology-threshold thinking tends to be harder than technology-discontinuity thinking, because there isn't an obvious single line of change, but a wealth of possibilities.  Functions that were previously separate can be integrated, or those that were previously combined can be separated.  You can cut across levels of abstraction and put more calculation into each action.  Certainly connectivity is essential, but we have multigigabit copper links, etc. today.
  2. Multiprocessors on a chip.  Clearly it can be done, but what is really worth doing differently?  There is some recent work at Stanford and other places in ISCA, but not a lot of real creative thinking.
  3. Database machines.  Several times in history there has been tremendous excitement over specialized database machines.  One of the most interesting parts of Illiac IV was the multiple heads per disk and the work to push processing close to the heads.  This also had a huge revival in the eighties, with computation per head, per track, and per disk.  It kind of closed with a paper by DeWitt entitled "Database machines are dead.  Long live database machines," in which he produced a decent set of benchmarks and showed that commodity technology dominates.  These days, Jim Gray says all the processing is in the sheet metal.  We have so much processing in every controller, soon the CPUs will look like old bureaucrats.  (Final part is my wording.)  The I-disk project here is certainly looking at processing and memory near the disk.  I've also seen one-inch disks - so perhaps the design point is million-disk systems!  Certainly room for forward thinking.
  4. What happens if we put 100x the amount of DRAM that we are accustomed to putting in systems?  It is quite reasonable to build a terabyte main memory today - at least in a moderate-sized cluster.  The cost continues to go through the floor, so why not run with it?  Peter Chen has done some nice work showing how to make memory as reliable as disk - so think wild.  Interesting engineering issues here too.
  5. Integration of IP packet processing at wire speed into the classical Proc/Mem/IO structure.  The latest routers and switches are doing this to some degree today.  They can run ten to a hundred layer-3 filters on each of many ports; a sketch of what such a filter involves appears after this list.  What if this were part of our everyday building blocks?  Does it lead to a new machine organization?  What goes where?  Does SMP become P+M+GPIO+IPIO?
  6. Parallel I/O systems.  To first order, we still don't understand how to design good I/O systems for parallel machines.  My understanding is that when the $85M ASCI Blue Pacific machine was delivered, the I/O performance fell two orders of magnitude below spec.  A complete rethinking was required, with a much larger fraction of the resources, including a separate network and machine processors, devoted to I/O.  The big database systems today look like one cabinet of processing with 50 to 100 disk cabinets around it, in a gronky SCSI or Fibre Channel network to a few thousand disks.  Generally, the fallback is a few big fat pipes with lots of disks on one end and lots of processors on the other.
  7. Networks for I/O.  Today we have one kind of network for interprocessor traffic in a parallel machine, another for the system area, another for I/O, and another for the LAN.  It is believed by many that I/O networks need to behave more gracefully near saturation than the others.  This, for example, keeps them off the Ethernet curve.  The argument is a lot like the token-ring vs. Ethernet debates of old.  What really is true here?  Is there a unifying view?
  8. As we have moved into distributed, cache-coherent memory systems, it ain't your grandfather's DRAM access anymore.  There is a tremendous amount of work going into every memory access across the cache controllers, directories, etc.  Perhaps it is time to elevate the semantics of memory operations.  What about ACID memory?  Transactional memory?  There are some papers by Herlihy on the latter, and a sketch of the kind of interface involved appears after this list.  Are there interesting extensions of the cache protocols that make this possible?  Can we move persistence and durability assumptions from disk to memory?
  9. Large-scale machines are increasingly hosting internet services, and internet services are increasingly demanding ever larger machines.  So far, most are either big SMPs or farms serving independent streams.  Inktomi is one of the few examples exploiting an alternative architecture.  Is there a new design space emerging?  How is it different?
  10. Reconfigurable logic is a very viable option.  How does it change parallel computer design?  Is it just new functional units and compiling past the instruction set, a la uniprocessors, or is there something much more interesting to be done?
  11. Everybody sticks vector units on their nodes for a while (the Intel Paragon i860 numerical libraries, the CM-5 vector memory controllers, the Meiko CS-2 with the Fujitsu vector units).  They always mess up the integration with the memory system, especially the cache coherence protocols, and they always fail because they are too hard to program.  Is there a way to do it right?
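
For item 5, here is a sketch in C of what a layer-3 filter amounts to computationally.  The rule structure, action codes, and linear scan are illustrative assumptions, not how any particular router does it; real implementations also match protocol and port fields and use compressed or hardware-assisted lookup structures.

    /* Sketch of a layer-3 (IPv4 address) filter: first matching rule wins. */
    struct rule {
        unsigned int src_addr, src_mask;   /* match if (src & mask) == addr */
        unsigned int dst_addr, dst_mask;
        int action;                        /* e.g., 0 = drop, 1 = forward  */
    };

    int classify(unsigned int src, unsigned int dst,
                 const struct rule *rules, int nrules, int dflt)
    {
        int i;
        for (i = 0; i < nrules; i++)
            if ((src & rules[i].src_mask) == rules[i].src_addr &&
                (dst & rules[i].dst_mask) == rules[i].dst_addr)
                return rules[i].action;
        return dflt;                       /* no rule matched */
    }

At wire speed this decision has to finish in the arrival time of a minimum-size packet on every port, which is what makes folding it into the everyday building block an organizational question and not just a software one.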
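
For item 8, here is a sketch in C of the kind of interface transactional memory suggests.  The tm_* primitives are hypothetical stand-ins; in the published proposals they are new instructions that ride on the cache coherence protocol, and a failed commit discards the transactionally written lines so the code can retry.

    /* Move 'amount' between two counters atomically, with no locks.
     * The tm_* calls are hypothetical hardware-transaction primitives.
     */
    extern void tm_start(void);                 /* begin a transaction       */
    extern long tm_read(long *addr);            /* transactional load        */
    extern void tm_write(long *addr, long v);   /* transactional store       */
    extern int  tm_commit(void);                /* nonzero iff commit worked */

    void transfer(long *from, long *to, long amount)
    {
        do {
            tm_start();
            tm_write(from, tm_read(from) - amount);
            tm_write(to,   tm_read(to)   + amount);
        } while (!tm_commit());                 /* retry on conflict */
    }

ACID memory would layer durability on top of this, which is where the question of moving persistence assumptions from disk to memory comes in.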

Availability, Reliability, Manageability

There are a number of factors that limit the ultimate scalability of machines.  By the early 90s several vendors - Thinking Machines, Intel, Cray, IBM - got to the point where they could build parallel machines as large as anyone could afford to buy.  In practice, this peaked at about 2,000 processors and 30 million dollars.  Today we have the ASCI machines pushing up toward 10,000 processors and a hundred million dollars.  (Remember, it isn't the cost of the processor, it's the cost of a processor's share of memory, interconnect, disk storage, access, and maintenance.)  There is perhaps a more fundamental trend: over the past 40 years the largest computer systems have peaked at about 10,000 components - tubes, then discrete components, then chips.  Today, these are huge chips!  (The cost per ton is also pretty constant, when corrected for the value of the dollar.)  Machines much above this scale never seem to quite work on a production basis - they're temperamental.  The fundamental limit seems to be the ability to diagnose and isolate problems.  It takes time - downtime - and it grows rapidly with the scale of the system.  Generally, fixing it is quick once it is found.  Many interesting questions come out of this; a back-of-envelope availability sketch follows the list.
  1. Clearly, the new levels of integration allow a very different engineering point.  Does this mean that with the right building blocks we can set a trend out to million processor designs?
  2. Why is it that within a chip the scale has continued to grow, much more so than at the system level?  Certainly part of it is that system yield is, by definition, 1.  If it doesn't work, we don't throw it out; we fix it until it does.
  3. First-fault diagnosis. How do you create a system that accurately captures HW and SW fault information as the faults happen?
  4. Large systems are constantly changing: some of the parts are always broken, and they evolve at different rates.  This is one of the things that the internet got right - perhaps at the cost of functionality or performance.  Large parallel machines must track the technology curves, and you can't wait for a year and do a lot of expensive reengineering.  How do we design parallel machines so that they are much more a federated system?  The answer is certainly not just "Go clusters, so everything is connected through a scalable network."  The parts need to learn about each other and negotiate their best roles.  They need to monitor and react.  So far, cluster technology hasn't really gotten there.
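
A back-of-envelope sketch of the scaling problem behind these questions, assuming N components that fail independently, each with mean time between failures m, and a repair time dominated by the diagnosis effort, which grows with N:

    \[
      \mathrm{MTBF}_{\mathrm{system}} \approx \frac{m}{N}, \qquad
      \mathrm{Availability} \approx
        \frac{\mathrm{MTBF}_{\mathrm{system}}}
             {\mathrm{MTBF}_{\mathrm{system}} + \mathrm{MTTR}(N)}
    \]

Failures arrive roughly N times as often, and each one costs more downtime to isolate, so availability drops on both fronts as the machine grows; the questions above are about attacking one term or the other.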

Better Protocols, Programming Systems, Tools

There is a wealth of interesting ideas to explore in cache coherence protocols, message protocols, I/O protocols, and diagnostic protocols.  You can look through recent ISCA, HPCA, and SC conference proceedings to get some ideas of open problems.  Here are a few ideas.
