258 Parallel Processors
University of California,
Dept. of Electrical Engineering
and Computer Sciences
Prof. David E. Culler
Intel has produced a series of SMP-oriented microprocessors and chipsets,
starting with the PPro 'quadpack', then the Pentium II, now the Xeon and
soon the PIII. These have different physical organizations, somewhat
different system busses, cache coherence, and synchronization support.
The scale of the basic node and the emphasis on NUMA extensions have changed.
There is lots of scattered data out there, but no real systematic study
of the differences and their implications. We've got several of the
machines available, but some careful thought needs to go into the evaluation
methodology, benchmarks, etc.
VIA is an emerging high performance communication substrate for clusters.
An increasing number of network and platform vendors are supporting it,
and major applications, such as Oracle, are relying on it. There is
a limited test suite, but benchmarking in this regime is pretty immature.
Can you develop a set of microbenchmarks to tease apart aspects of the
implementation, like Saavedra-Barrera's microbenchmarks do for memory systems?
The LogP benchmarks for Active Messages and the ISCA '96 T3D paper are potential
starting points.
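As a sketch of what such microbenchmarks might look like, here is a
LogP-style round-trip timing loop. The via_send/via_recv calls are
hypothetical stand-ins for whatever transport is under test; the
methodology - sweep message size, then vary the inter-send delay to
separate overhead, latency, and gap - is the point, not the API:

    /* LogP-style round-trip microbenchmark (sketch).  via_send/via_recv
       are hypothetical stand-ins for the transport under test. */
    #include <stdio.h>
    #include <sys/time.h>

    #define ITERS 10000

    extern void via_send(const void *buf, int len);  /* hypothetical */
    extern void via_recv(void *buf, int len);        /* hypothetical */

    static double now_usec(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1e6 + tv.tv_usec;
    }

    int main(void)
    {
        static char buf[4096];

        for (int len = 4; len <= 4096; len *= 2) {
            double start = now_usec();
            for (int i = 0; i < ITERS; i++) {
                via_send(buf, len);      /* request */
                via_recv(buf, len);      /* echo from the peer */
            }
            double rtt = (now_usec() - start) / ITERS;
            /* one-way time ~ os + L + or; varying the delay between
               successive sends is what exposes the gap g */
            printf("%4d bytes: RTT %.2f us\n", len, rtt);
        }
        return 0;
    }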
Multithreading is finally real. The Tera machine is running at SDSC.
It has hardware support for thousands of threads with single cycle context
switching, and it relies on multithreading and lots of wires, rather than
caches, to keep the processor fed. There were a couple of preliminary
studies in SC98, but they were of the form, "How does it compare to my
SMP for applications that I'm running on my SMP?" The conclusion
is "You need to use lots of threads!" Surprise. There is wide scope
for interesting evaluations, from microbenchmarks on up. It would
be especially important to look at the class of programs that don't do
well on conventional architectures, such as those with very large memory
bandwidth requirements. Maybe this machine should be evaluated on
primitives typical of databases. Maybe it's a sorting machine!
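To make "very large memory bandwidth requirements" concrete, here is the
sort of cache-hostile kernel I have in mind - random updates over a table
far larger than any cache. Sizes are illustrative assumptions:

    /* Cache-hostile microbenchmark (sketch): random read-modify-writes
       over a table far larger than any cache.  A cache-based SMP is
       bound by memory latency here; a latency-tolerant machine like the
       Tera MTA should hide it with thread parallelism. */
    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_WORDS (1UL << 27)   /* 2^27 8-byte words = 1 GB */
    #define UPDATES     (1UL << 24)

    int main(void)
    {
        unsigned long *table = calloc(TABLE_WORDS, sizeof *table);
        unsigned long x = 1;

        if (!table) return 1;
        for (unsigned long i = 0; i < UPDATES; i++) {
            x = x * 6364136223846793005UL + 1;   /* cheap LCG stream   */
            table[x % TABLE_WORDS] ^= x;         /* nearly every access
                                                    misses the caches  */
        }
        printf("%lx\n", table[0]);               /* defeat dead-code
                                                    elimination        */
        free(table);
        return 0;
    }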
Programming environment and tools. Implementation and analysis of
MPI, SVM, Split-C, etc. over VIA. You end up learning a lot about
architecture by using it.
Performance Engineering: Tools for understanding parallel machine and/or
parallel application performance. Characterizing application demand,
sensitivity, scalability. Assistance in developing and validating
performance models.
We have more transistors than we know what to do with, we are crossing
critical levels of integration, wires are going really fast, optics is
here, applications are changing. Perhaps it's time to think about
things very differently, or to return to old ideas and see if their time
has come.
The impact of the system-on-a-chip building block. We saw a tremendous
architectural renaissance when a processor fit on a chip and a computer
fit on a board. It enabled not just the personal computer and workstation,
but the bus-based SMP, the MPPs, and a host of more radical concepts -
dataflow, systolic arrays, and so on. We are now seeing the point
where a complete system fits on a chip. How does this change the
equation? What does it enable that wasn't viable before? There
were designs that pushed into this space, such as the J-machine and the
Cosmic Cube, but the technology wasn't ready. Or maybe, the ideas
were somehow flawed. There are lots of local spins on this in the IRAM
work. Technology-threshold thinking tends to be harder than
technology-discontinuity thinking, because there isn't an obvious single line of change, but
a wealth of possibilities. Functions that were previously separate
can be integrated, or those that were previously combined separated.
You can cut across levels of abstraction, and put more calculation into
each action. Certainly connectivity is essential, but we have multigigabit
copper links, etc. today.
Multiprocessors on a chip. Clearly it can be done, but what is really
worth doing differently? There is some recent work at Stanford and
other places in ISCA, but not a lot of real creative thinking.
Database machines. Several times in history there has been tremendous
excitement over specialized database machines. One of the most interesting
parts of Illiac IV was the multiple heads per disk and the work to push
processing close to the heads. This also had a huge revival in the
eighties, with computation per head, per track, per disk. It kind
of closed with a paper by DeWitt entitled, "Database machines are dead.
Long live database machines" in which he produced a decent set of benchmarks
and showed that commodity technology dominates. These days, Jim Gray
says all the processing is in the sheet metal. We have so much processing
in every controller, soon the CPUs will look like old bureaucrats.
(Final part is my wording.) The I-disk project here is certainly
looking at processing and memory near the disk. I've also seen one-inch
disks - so perhaps the design point is million-disk systems!
Certainly room for forward thinking.
What happens if we put in 100x the amount of DRAM that we are accustomed to
putting in systems? It is quite reasonable to build a terabyte main
memory today - at least in a moderately sized cluster. The cost continues
to go through the floor, so why not run with it? Peter Chen has done
some nice work showing how to make memory as reliable as disk - so think
wild. Interesting engineering issues here too.
Integration of IP packet processing at wirespeed into classical Proc/Mem/IO
structure. The latest routers/switches are doing this to some degree today.
They can run ten to a hundred layer 3 filters on each of many ports.
What if this were part of our everyday building blocks? Does
it lead to a new machine organization? What goes where? Does the SMP
organization still make sense?
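For concreteness, a layer 3 filter is essentially a per-packet match of a
few header fields against a rule table. A minimal sketch, with field
layout and actions simplified for illustration:

    /* Sketch of a layer 3 filter: match each packet header against a
       small rule table.  Fields and actions are simplified assumptions. */
    #include <stdint.h>

    typedef struct {
        uint32_t src, src_mask;     /* source address and mask          */
        uint32_t dst, dst_mask;     /* destination address and mask     */
        uint8_t  proto;             /* IP protocol, 0 = wildcard        */
        int      action;            /* 0 = drop, 1 = forward            */
    } rule_t;

    /* Return the action of the first matching rule (default: drop). */
    int filter(const rule_t *rules, int nrules,
               uint32_t src, uint32_t dst, uint8_t proto)
    {
        for (int i = 0; i < nrules; i++) {
            if ((src & rules[i].src_mask) == rules[i].src &&
                (dst & rules[i].dst_mask) == rules[i].dst &&
                (rules[i].proto == 0 || rules[i].proto == proto))
                return rules[i].action;
        }
        return 0;
    }

Running ten to a hundred of these over every packet on every port is a
real compute and memory load - the question is where it belongs in the
Proc/Mem/IO structure.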
Parallel I/O systems. To first order, we still don't understand how
to design good I/O systems for parallel machines. My understanding
is that when the $85M ASCI Blue Pacific machine was delivered, the I/O performance
fell two orders of magnitude below spec. A complete rethinking was
required with a much larger fraction of the resources, including a separate
network and machine processors, devoted to I/O. The big database
systems today look like one cabinet of processing with 50 to 100 disk
cabinets around it, in a gronky SCSI or Fibre Channel network to a few thousand
disks. Generally, the fallback is a few big fat pipes with lots
of disks on one end and lots of processors on the other.
Networks for I/O. Today we have one kind of network for interprocessor
stuff in a parallel machine, another for system-area, another for I/O,
and another for LAN. It is believed by many that I/O networks need
to behave more gracefully near saturation than the others. This,
for example, keeps them off the Ethernet curve. The argument is a
lot like the token-ring vs. Ethernet debates of old. What really is
true here? Is there a unifying view?
As we have moved into distributed, cache-coherent memory systems, it ain't
your grandfather's DRAM access anymore. There is a tremendous amount
of work going into every memory access across the cache controllers, directories,
etc. Perhaps it is time to elevate the semantics of memory operations.
What about ACID memory? Transactional memory? There are some
papers by Herlihy on the latter. Are there interesting extensions
of the cache protocols that make this possible? Can we move persistence
and durability assumptions from disk to memory?
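To make the transactional direction concrete, here is the flavor of
Herlihy-style transactional memory as the programmer would see it - a
minimal sketch, with hypothetical tx_* primitives standing in for the
hardware support:

    /* Transactional update in the style of Herlihy & Moss (sketch).
       The tx_* primitives are hypothetical. */
    extern void tx_begin(void);
    extern int  tx_commit(void);            /* 0 = aborted by conflict */
    extern long tx_read(long *addr);
    extern void tx_write(long *addr, long v);

    /* Atomically move v between two accounts; retry on conflict. */
    void transfer(long *from, long *to, long v)
    {
        do {
            tx_begin();
            tx_write(from, tx_read(from) - v);
            tx_write(to,   tx_read(to)   + v);
        } while (!tx_commit());   /* both updates become visible
                                     atomically, or neither does */
    }

The coherence protocol's job becomes detecting conflicting accesses to a
transaction's read and write sets and forcing an abort - exactly the kind
of protocol extension the question above is after.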
Large-scale machines are increasingly hosting internet services and internet
services are increasingly demanding ever larger machines. So far,
most are either big SMPs or farms serving independent streams. Inktomi
is one of the few examples exploiting an alternative architecture.
Is there a new design space emerging? How is it different?
Reconfigurable logic is a very viable option. How does it change
parallel computer design? Is it just new functional units and compiling
past the instruction set, a la uniprocessors, or is there something much
more interesting to be done?
Everybody sticks vector units on their nodes for a while (the Intel Paragon
i860 numerical libraries, the CM-5 vector memory controllers, the Meiko
CS-2 with the Fujitsu vector units). They always mess up the integration
with the memory system, especially the cache coherence protocols, and they
always fail because they are too hard to program. Is there a way
to do it right?
Availability, Reliability, Manageability
There are a number of factors that limit the ultimate scalability of machines.
By the early 90s, several vendors - Thinking Machines, Intel, Cray, IBM -
got to the point where they could build parallel machines as large as anyone
could afford to buy. In practice, this peaked at about 2000 processors
and 30 million dollars. Today we have the ASCI machines pushing up
toward 10,000 processors and a hundred million dollars. (Remember,
it isn't the cost of the processor, it's the cost of a processor's share
of memory, interconnect, disk storage, access, maintenance). There
is perhaps a more fundamental trend, where over the past 40 years the largest
computer systems peaked at about 10,000 components - tubes, then discrete
components, then chips. Today, these are huge chips! (The cost
per ton is also pretty constant, when corrected for the value of $).
Machines much above this never seem to quite really work on a production
basis - they're temperamental. The fundamental limit seems to be the
ability to diagnose and isolate problems. It takes time - downtime -
and grows rapidly with the scale of the system. Generally, fixing
it is quick, once it is found. Many interesting questions come out
of this.
Clearly, the new levels of integration allow a very different engineering
point. Does this mean that with the right building blocks we can
set a trend out to million-processor designs?
Why is it that within a chip the scale has continued to grow, much
more so than at the system level? Certainly part of it is that system
yield is by definition 1. If it doesn't work, we don't throw it out,
we fix it until it does.
First-fault diagnosis. How do you create a system that accurately captures
HW and SW fault information as the faults happen?
Large systems are constantly changing: some of the parts are always broken,
and they evolve at different rates. This is one of the things that the
internet got right - perhaps at the cost of functionality or performance.
Large parallel machines must track the technology curves, and you can't
wait for a year and do a lot of expensive reengineering. How do we
design parallel machines so that they are much more a federated system?
The answer is certainly not just "go clusters, so everything is connected
through a scalable network". The parts need to learn about each
other and negotiate their best roles. They need to monitor and react.
So far, cluster technology hasn't really gotten there.
Better Protocols, Programming Systems, Tools
There is a wealth of interesting ideas to explore in the cache coherence
protocols, message protocols, I/O protocols, diagnostics protocols.
You can look through recent ISCAs, HPCAs, SC conference proceedings to
get some ideas of open problems. Here are a few ideas.
Highly available cache coherence. Today, the limiting factor on SAS
is availability, not engineering. In fact, most large SAS machines
are partitioned and run as multiple smaller machines, with availability
firewalls between them. In part the problem is that operating systems,
as a crucial SAS program, don't scale too well from a performance perspective,
but especially from an availability one. Rosenblum's work on Hive
and Disco addresses some of this. However, the availability
of the cache coherence protocols themselves is a problem. Critical
information gets spread all over the place, so loss of a single node can
effectively take out the rest. So, one question is how do you back
off of existing protocols - at the cost of some performance - to increase
their robustness. A different question is whether you can use the
existence of such protocols to enhance availability. Perhaps it should
be ensured that there are always two copies of every block in caches:
before an update, we establish a primary and a secondary and run a commit protocol.
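One way to read that idea - my sketch, not an existing protocol - is as a
small two-phase commit between the two replicas on every write, so that
losing either node never loses the only up-to-date copy. The
directory_*, send, and await_ack helpers are hypothetical:

    /* Two-copy commit on write (sketch of the idea above, not an
       existing protocol).  All helpers are hypothetical. */
    typedef struct block block_t;
    typedef struct node  node_t;
    enum msg { PREPARE, COMMIT, ABORT };

    extern node_t *directory_primary(block_t *b);
    extern node_t *directory_secondary(block_t *b);
    extern void    send(node_t *n, enum msg m, block_t *b, long v);
    extern int     await_ack(node_t *n);

    void write_block(block_t *b, long v)
    {
        node_t *p = directory_primary(b);
        node_t *s = directory_secondary(b);

        send(p, PREPARE, b, v);             /* phase 1: stage the update */
        send(s, PREPARE, b, v);
        if (await_ack(p) && await_ack(s)) {
            send(p, COMMIT, b, v);          /* phase 2: make it visible  */
            send(s, COMMIT, b, v);
        } else {
            send(p, ABORT, b, v);           /* either surviving replica  */
            send(s, ABORT, b, v);           /* still holds a good copy   */
        }
    }

The cost is an extra round on every write; the question is whether it can
be hidden the way write latency is hidden today.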
Most protocols today are conservative. For example, we make sure
that there are no outstanding shared copies before performing an update.
Meanwhile, inside the processor we are speculating like crazy and fixing
it up later. It would seem that there is lots of room to apply optimistic
protocols from the distributed systems literature to the cache coherence
domain.
Software shared virtual memory relative to Java. The SVM approach
is pretty well investigated and well understood today, but one of the interesting
questions is what happens at compile time and what happens at run-time
through a mix of software and hardware. The presence of a type-safe
language changes the equation. Perhaps this also enables new opportunities
for mobile code.
A lot of people are talking about a worldwide global store. Call
it Aetherstore, call it what you like. It clearly starts from and
extends today's CC-NUMA protocols.
In the world of disconnected mail, calendars, files, Pilots, etc., we live
today with the most awful, idiosyncratic, often misguided notions of consistency.
These are usually expressed in terms of human conflict-resolution actions.
The results are an abuse of the word consistency. Perhaps it is time
to apply a crisp architectural notion of a consistent store to these problems.
Maybe not sequentially consistent, but one where a sequence of actions
to distinct objects is meaningful. The presumption might be that
data access is disconnected, but there is some lower-grade communication
for the consistency protocol. Perhaps a small class of actions would
be prevented. Where is the consistency model weakened? Where
are optimistic techniques used? What are the abort/rollback options?
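As one concrete starting point (an illustration, not a prescription), the
version vectors used in optimistic replication give "a sequence of actions
to distinct objects" a precise meaning and make the conflict/rollback
cases explicit. N_REPLICAS and the merge policy here are assumptions:

    /* Optimistic reconciliation with version vectors (sketch).
       N_REPLICAS and the merge policy are illustrative assumptions. */
    #define N_REPLICAS 4

    typedef struct {
        long vv[N_REPLICAS];   /* vv[i] = last update from replica i seen */
    } version_t;

    /* a dominates b if a has seen every update b has. */
    static int dominates(const version_t *a, const version_t *b)
    {
        for (int i = 0; i < N_REPLICAS; i++)
            if (a->vv[i] < b->vv[i])
                return 0;
        return 1;
    }

    typedef enum { TAKE_A, TAKE_B, CONFLICT } outcome_t;

    /* On reconnection: keep the dominant version; concurrent updates
       are a true conflict that must be merged or rolled back. */
    static outcome_t reconcile(const version_t *a, const version_t *b)
    {
        if (dominates(a, b)) return TAKE_A;
        if (dominates(b, a)) return TAKE_B;
        return CONFLICT;
    }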