CS 258 Parallel Computer Architecture

CS 258 Parallel Processors University of California, Berkeley Dept. of Electrical Engineering and Computer Sciences

Prof. David E. Culler

Spring 1999

Project Suggestions

Evaluation

Intel has produced a series of SMP-oriented microprocessors and chipsets, starting with the PPro 'quadpack', then the Pentium II, now the Xeon and soon the PIII. These have different physical organization and somewhat different system busses, cache coherence, and synchronization support. The scale of the basic node and the emphasis on NUMA extensions have changed. There is lots of scattered data out there, but not a real systematic study of the differences and their implications. We've got several of the machines available, but some careful thought needs to go into the evaluation methodology, benchmarks, etc.
VIA is an emerging high performance communication substrate for clusters. An increasing number of network and platform vendors are supporting it, and major applications, such as Oracle, are relying on it. There is a limited test suite, but benchmarking in this regime is pretty immature. Can you develop a set of microbenchmarks to tease apart aspects of the implementation, like Saavedra-Barrera's microbenchmarks do for memory systems? The LogP benchmarks for Active Messages and the ISCA 96 T3D paper are potential starting points.
Multithreading is finally real. The Tera machine is running at SDSC. It has hardware support for thousands of threads with single cycle context switching, and it relies on multithreading and lots of wires, rather than caches, to keep the processor fed. There were a couple of preliminary studies in SC98, but they were of the form, "How does it compare to my SMP for applications that I'm running on my SMP?" The conclusion is "You need to use lots of threads!" Surprise. There is wide scope for interesting evaluations, from microbenchmarks on up. It would be especially important to look at the class of programs that don't do well on conventional architectures, such as those with very large memory bandwidth requirements. Maybe this machine should be evaluated on primitives typical of databases. Maybe it's a sorting machine!
Programming environment and tools. Implementation and analysis of MPI, SVM, Split-C, etc. over VIA. You end up learning a lot about architecture by using it.
Performance Engineering: Tools for understanding parallel machine and/or parallel application performance. Characterizing application demand, sensitivity, scalability. Assistance in developing and validating analytical models.
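For the VIA microbenchmark item above, one possible starting point is the LogP style of analysis. The sketch below is only an illustration, not the method of the LogP benchmark papers: it assumes you have already measured a ping-pong round-trip time, the send and receive overheads, and the total time of a long back-to-back burst, and it derives the remaining parameters from the standard LogP relation that a one-way small message costs o_send + L + o_recv. The function and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LogPEstimate:
    o_send: float   # processor time spent issuing a message (us)
    o_recv: float   # processor time spent receiving a message (us)
    gap: float      # approximate minimum interval between sends (us)
    latency: float  # network latency L (us)


def estimate_logp(rtt_us, o_send_us, o_recv_us, burst_time_us, burst_len):
    """Derive L and an approximate g from hypothetical measured quantities."""
    if burst_len < 2:
        raise ValueError("need a burst of at least two messages")
    # A ping-pong is two one-way trips, each costing o_send + L + o_recv.
    latency = rtt_us / 2.0 - o_send_us - o_recv_us
    # In a long burst the per-message time approaches the gap g.
    # This is an approximation: short bursts are dominated by o_send.
    gap = burst_time_us / burst_len
    return LogPEstimate(o_send_us, o_recv_us, gap, latency)


if __name__ == "__main__":
    # Made-up numbers, purely to show the arithmetic.
    est = estimate_logp(rtt_us=40.0, o_send_us=3.0, o_recv_us=4.0,
                        burst_time_us=1000.0, burst_len=100)
    print(est)
```

A real study would sweep message size and burst length, and would have to decide how VIA doorbells, descriptor posting, and completion queues map onto the overhead and gap terms.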

Radical Designs

We have more transistors than we know what to do with, we are crossing critical levels of integration, wires are going really fast, optics is here, applications are changing. Perhaps it's time to think about things very differently, or to return to old ideas and see if their time has come!

The impact of the system-on-a-chip building block. We saw a tremendous architectural renaissance when a processor fit on a chip and a computer fit on a board. It enabled not just the personal computer and workstation, but the bus-based SMP, the MPPs, and a host of more radical concepts - dataflow, systolic arrays, and so on. We are now seeing the point where a complete system fits on a chip. How does this change the equation? What does it enable that wasn't viable before? There were designs that pushed into this space, such as the J-machine and the Cosmic Cube, but the technology wasn't ready. Or maybe the ideas were somehow flawed. There are lots of local spins on this with the IRAM work. Technology threshold thinking tends to be harder than technology discontinuity, because there isn't an obvious single line of change, but a wealth of possibilities. Functions that were previously separate can be integrated, or those that were previously combined can be separated. You can cut across levels of abstraction, and put more calculation into each action. Certainly connectivity is essential, but we have multigigabit copper links, etc. today.
Multiprocessors on a chip. Clearly it can be done, but what is really worth doing differently? There is some recent work at Stanford and other places in ISCA, but not a lot of real creative thinking.
Database machines. Several times in history there has been tremendous excitement over specialized database machines. One of the most interesting parts of Illiac IV was the multiple heads per disk and the work to push processing close to the heads. This also had a huge revival in the eighties, with computation per head, per track, per disk. It kind of closed with a paper by DeWitt entitled, "Database machines are dead. Long live database machines" in which he produced a decent set of benchmarks and showed that commodity technology dominates. These days, Jim Gray says all the processing is in the sheet metal. We have so much processing in every controller, soon the CPUs will look like old bureaucrats. (Final part is my wording.) The I-disk project here is certainly looking at processing and memory near the disk. I've also seen one inch disks - so perhaps the design point is million disk systems! Certainly room for forward thinking.
What happens if we put 100x the amount of DRAM that we are accustomed to putting in systems? It is quite reasonable to build a terabyte main memory today - at least in a moderate sized cluster. The cost continues to go through the floor, so why not run with it? Peter Chen has done some nice work showing how to make memory as reliable as disk - so think wild. Interesting engineering issues here too.
Integration of IP packet processing at wirespeed into the classical Proc/Mem/IO structure. The latest routers/switches are doing this to some degree today. They can run ten to a hundred layer 3 filters on each of many ports. What if this were part of our everyday building blocks? Does it lead to a new machine organization? What goes where? Does SMP become P+M+GPIO+IPIO?
Parallel I/O systems. To first order, we still don't understand how to design good I/O systems for parallel machines. My understanding is that when the $85M ASCI Blue Pacific machine was delivered, the I/O performance fell two orders of magnitude below spec. A complete rethinking was required with a much larger fraction of the resources, including a separate network and machine processors, devoted to I/O. The big database systems today look like one cabinet of processing with 50 to 100 disk cabinets around it in a gronky SCSI or Fibre Channel network to a few thousand disks. Generally, the fallback is a few big fat pipes with lots of disks on one end and lots of processors on the other.
Networks for I/O. Today we have one kind of network for interprocessor stuff in a parallel machine, another for system-area, another for I/O, and another for LAN. It is believed by many that I/O networks need to behave more gracefully near saturation than the others. This, for example, keeps them off the ethernet curve. The argument is a lot like the token-ring vs ethernet debates of old. What really is true here? Is there a unifying view?
As we have moved into distributed, cache-coherent memory systems, it ain't your grandfather's DRAM access anymore. There is a tremendous amount of work going into every memory access across the cache controllers, directories, etc. Perhaps it is time to elevate the semantics of memory operations. What about ACID memory? Transactional memory? There are some papers by Herlihy on the latter. Are there interesting extensions of the cache protocols that make this possible? Can we move persistence and durability assumptions from disk to memory?
Large-scale machines are increasingly hosting internet services and internet services are increasingly demanding ever larger machines. So far, most are either big SMPs or farms serving independent streams. Inktomi is one of the few examples exploiting an alternative architecture. Is there a new design space emerging? How is it different?
Reconfigurable logic is a very viable option. How does it change parallel computer design? Is it just new functional units and compiling past the instruction set, a la uniprocessors, or is there something much more interesting to be done?
Everybody sticks vector units on their nodes for a while (the Intel Paragon i860 numerical libraries, the CM-5 vector memory controllers, the Meiko CS-2 with the Fujitsu vector units). They always mess up the integration with the memory system, especially the cache coherence protocols, and they always fail because they are too hard to program. Is there a way to do it right?

Availability, Reliability, Manageability

There are a number of factors that limit the ultimate scalability of machines. By the early 90s several vendors (Thinking Machines, Intel, Cray, IBM) got to the point where they could build parallel machines as large as anyone could afford to buy. In practice, this peaked at about 2000 processors and 30 million dollars. Today we have the ASCI machines pushing up toward 10,000 processors and a hundred million dollars. (Remember, it isn't the cost of the processor, it's the cost of a processor's share of memory, interconnect, disk storage, access, maintenance.) There is perhaps a more fundamental trend, where over the past 40 years the largest computer systems peaked at about 10,000 components - tubes, then discrete components, then chips. Today, these are huge chips! (The cost per ton is also pretty constant, when corrected for the value of the dollar.) Machines much above this never seem to quite really work on a production basis - they're temperamental. The fundamental limit seems to be the ability to diagnose and isolate problems. Diagnosis takes time - downtime - and that time grows rapidly with the scale of the system. Generally, fixing a problem is quick once it is found. Many interesting questions come out of this.

Clearly, the new levels of integration allow a very different engineering point. Does this mean that with the right building blocks we can set a trend out to million processor designs?
Why is it that within a chip the scale has continued to grow, much more so than at the system level? Certainly part of it is that system yield is by definition 1. If it doesn't work, we don't throw it out, we fix it until it does.
First-fault diagnosis. How do you create a system that accurately captures HW and SW fault information as the faults happen?
Large systems are constantly changing: some of the parts are always broken, and they evolve at different rates. This is one of the things that the internet got right - perhaps at the cost of functionality or performance. Large parallel machines must track the technology curves, and you can't wait for a year and do a lot of expensive reengineering. How do we design parallel machines so that they are much more a federated system? The answer is certainly not just "Go clusters, so everything is connected through a scalable network". The parts need to learn about each other and negotiate their best roles. They need to monitor and react. So far, cluster technology hasn't really gotten there.

Better Protocols, Programming Systems, Tools

There is a wealth of interesting ideas to explore in the cache coherence protocols, message protocols, I/O protocols, diagnostics protocols. You can look through recent ISCAs, HPCAs, SC conference proceedings to get some ideas of open problems. Here are a few ideas.

Highly available cache coherence. Today, the limiting factor on SAS is availability, not engineering. In fact, most large SAS machines are partitioned and run as multiple smaller machines, with availability firewalls between them. In part the problem is that operating systems, as a crucial SAS program, don't scale too well from a performance perspective, and especially from an availability perspective. Rosenblum's work on Hive and Disco addresses some of this aspect. However, the availability of the cache coherence protocols themselves is a problem. Critical information gets spread all over the place, so loss of a single node can effectively take out the rest. So, one question is how do you back off of existing protocols - at the cost of some performance - to increase their robustness. A different question is whether you can use the existence of such protocols to enhance availability. Perhaps it should be ensured that there are always two copies of every block in caches. Before update, we establish a primary and secondary and run a commit protocol across them.
Most protocols today are conservative. For example, we make sure that there are no outstanding shared copies before performing an update. Meanwhile, inside the processor we are speculating like crazy and fixing it up later. It would seem that there is lots of room to apply optimistic protocols from the distributed system literature to the cache coherence problem.
Software shared virtual memory relative to Java. The SVM approach is pretty well investigated and well understood today, but one of the interesting questions is what happens at compile time and what happens at run-time through a mix of software and hardware. The presence of a type-safe language changes the equation. Perhaps this also enables new opportunities for mobile code.
A lot of people are talking about a worldwide global store. Call it Aetherstore, call it what you like. It clearly starts with and extends the CC-NUMA protocols today.
In the world of disconnected mail, calendars, files, pilots, etc. we live today with the most awful, idiosyncratic, often misguided notions of consistency. These are usually expressed in terms of human conflict resolution actions. The results are an abuse of the word consistency. Perhaps it is time to apply a crisp architectural notion of a consistent store to these problems. Maybe not sequentially consistent, but one where a sequence of actions to distinct objects is meaningful. The presumption might be that data access is disconnected, but there is some lower grade communication for the consistency protocol. Perhaps a small class of actions would be prevented. Where is the consistency model weakened? Where are optimistic techniques used? What are the abort/roll-back options?
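To make the two-copy idea from the highly available cache coherence item concrete, here is a minimal sketch, not a real coherence protocol. It assumes a hypothetical `Node` that can stage a block value and then commit it, and it models only a node that is already down when the update starts. Failures between prepare and commit, which are the hard part, are deliberately left out.

```python
class Node:
    """Hypothetical cache node holding committed and staged block values."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.committed = {}  # block address -> committed value
        self.staged = {}     # block address -> value awaiting commit

    def prepare(self, block, value):
        # A dead node cannot vote yes.
        if not self.alive:
            return False
        self.staged[block] = value
        return True

    def commit(self, block):
        self.committed[block] = self.staged.pop(block)

    def abort(self, block):
        self.staged.pop(block, None)


def replicated_update(primary, secondary, block, value):
    """Update a block only if both copies can be staged, then commit both."""
    holders = (primary, secondary)
    if all(node.prepare(block, value) for node in holders):
        for node in holders:
            node.commit(block)
        return True
    for node in holders:
        node.abort(block)
    return False


def read_after_failure(primary, secondary, block):
    """Return the block from whichever live node still has a committed copy."""
    for node in (primary, secondary):
        if node.alive and block in node.committed:
            return node.committed[block]
    raise KeyError(block)


if __name__ == "__main__":
    p, s = Node("primary"), Node("secondary")
    replicated_update(p, s, 0x40, 7)
    p.alive = False                       # lose the primary
    print(read_after_failure(p, s, 0x40))  # value survives on the secondary
```

The interesting project questions start where this sketch stops: how a new secondary is chosen after a loss, what the extra round trip costs on every write, and whether the directory state can be protected the same way.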
