Introduction to Parallel Computing
CS267: Notes for Lecture 1, Jan 17 1995
Introduction
We will have much class material available on the World Wide Web (WWW),
accessed via Mosaic or Netscape.
Mixed student teams (CS and non-CS) will do homework and a project.
Grading will be 25% homework, 25% midterm, and 50% project.
Machines available to the class include the CM-5, SP-1, and NOW (networks
of workstations).
Several related classes are being given this semester.
There is some overlap with these other classes, but we will mostly consider
applications and software tools, including the following.
Programming Languages
Libraries for Parallel Machines
We will try to emphasize programming tools that have a future, or at least
embody interesting ideas.
Applications and related algorithms
We will use a collection of programs called
Sharks and Fish,
which simulate fish and sharks swimming around,
eating, breeding, and dying, while obeying rules which model more
complicated and realistic force fields.
Sharks and Fish provide a simple model of
particle simulations, such as
ions in a plasma, cars on a freeway, stars in a galaxy, etc.
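To make the particle-simulation pattern concrete, here is a minimal sketch
of one time step in the style these assignments share. It is purely
illustrative, not the actual Sharks and Fish code: the inverse-square force
law, the softening constant, and the forward Euler update are all
assumptions chosen for simplicity.

    import math

    def step(positions, velocities, dt=0.01, g=1.0):
        # One time step: compute pairwise inverse-square forces (the
        # softening term 1e-9 avoids division by zero), then update
        # velocities and positions with forward Euler.
        n = len(positions)
        new_pos, new_vel = [], []
        for i in range(n):
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = positions[j][0] - positions[i][0]
                dy = positions[j][1] - positions[i][1]
                r2 = dx * dx + dy * dy + 1e-9
                r = math.sqrt(r2)
                fx += g * dx / (r2 * r)   # (g / r2) * (dx / r)
                fy += g * dy / (r2 * r)
            vx = velocities[i][0] + dt * fx
            vy = velocities[i][1] + dt * fy
            new_vel.append((vx, vy))
            new_pos.append((positions[i][0] + dt * vx,
                            positions[i][1] + dt * vy))
        return new_pos, new_vel

Every particle interacts with every other particle, so a step costs O(n^2)
work; distributing that work is what makes these problems interesting to
parallelize.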
We will also discuss linear algebra problems, such as
solving Ax=b and finding eigenvalues. These
arise in many applications from all fields (structural mechanics,
circuit simulation, computational chemistry, etc.). We will cover
both dense and sparse problems, and direct and iterative methods.
These linear algebra problems often arise from solving
differential equations, which come up in many areas. We will
discuss the heat equation, Poisson's equation, and
equations involved in climate modeling in particular.
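As a concrete instance of this connection, discretizing the 1D Poisson
equation -u'' = f on a uniform grid yields a tridiagonal linear system
Ax=b. The sketch below is illustrative; the grid size and right-hand side
are arbitrary choices.

    # Discretize -u''(x) = f(x) on [0,1] with u(0) = u(1) = 0 using n
    # interior grid points; the result is the tridiagonal system
    # A u = h^2 f. A is stored densely here only for clarity; in
    # practice one would store just the three diagonals.
    n = 5                                   # illustrative grid size
    h = 1.0 / (n + 1)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 2.0
        if i > 0:
            A[i][i - 1] = -1.0
        if i < n - 1:
            A[i][i + 1] = -1.0
    f = [1.0] * n                           # illustrative right-hand side
    b = [h * h * fi for fi in f]
    # Solving A u = b (by Gaussian elimination, say) gives u at the
    # grid points x_i = i*h.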
Combinatorial problems such as sorting and the Traveling Salesman Problem
will also be discussed.
Motivation for Parallel Computing
The
traditional scientific paradigm is first to do theory (say, on paper),
and then lab experiments to confirm or refute the theory. The
traditional engineering paradigm is first to do a design (say, on paper),
and then build a laboratory prototype.
Both paradigms are being replaced by numerical experiments and
numerical prototyping. There are several reasons for this.
Scientific and engineering problems requiring the most computing power to
simulate are commonly called "Grand Challenges". One example is global
change research, which requires both very large simulations and a very
large database of environmental data; the Sequoia 2000 Global Change
Research Project is concerned with building this database.
Why parallelism is essential
The speed of light is an intrinsic limitation
to the speed of computers. Suppose we wanted to build a completely sequential
computer with 1 TB of memory running at 1 Tflop. If the data has to travel
a distance r to get from the memory to the CPU, and it has to travel this
distance 10^12 times per second at the speed of light c = 3e8 m/s,
then r <= c/10^12 = 0.3 mm. So the computer has to fit into a box 0.3 mm
on a side.
Now consider the 1 TB memory. Memory is conventionally built as a planar grid
of bits, in our case say a 10^6-by-10^6 grid of words. If this grid
is 0.3 mm by 0.3 mm, then one word occupies about 3 Angstroms by 3 Angstroms,
or the size of a small atom. It is hard to imagine where the wires would go!
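The arithmetic above is easy to check; here is the back-of-the-envelope
calculation spelled out, using Python purely as a calculator:

    # Back-of-the-envelope check of the speed-of-light argument above.
    c = 3e8              # speed of light, m/s
    rate = 1e12          # 10^12 memory references per second (1 Tflop)
    r = c / rate         # farthest the memory can be from the CPU
    print(r)             # 3e-04 m, i.e. 0.3 mm

    side = 1e6           # the memory is a 10^6 x 10^6 grid of words
    spacing = r / side   # edge length available to each word
    print(spacing)       # 3e-10 m, i.e. 3 Angstroms -- a small atom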
Why writing fast programs is hard
On the World Wide Web (WWW) there is a long list of "all" computers,
sorted by their speed in solving systems of linear equations Ax=b with
Gaussian elimination. The list is called the
Linpack Benchmark.
Currently (as of December 1994) the fastest machine is an Intel Paragon with
6768 processors and a peak speed of 50 Mflops per processor, for an overall
peak speed of 6768*50 Mflops = 338 Gflops. Doing Gaussian elimination, the
machine gets 281 Gflops on a 128600-by-128600 matrix; the whole problem takes
84 minutes. This is also a record for the largest dense matrix solved by
Gaussian elimination (a record destined to fall within months, as these
records tend to).
Current Paragons have i860 chips, but future Paragons may have
Pentiums (hopefully ones that do division correctly).
But if we try to solve a much smaller 100-by-100 linear system, the fastest
the machine will go is 10 Mflops rather than 281 Gflops, and that is attained
using just one of the 6768 processors, an expensive waste of silicon. Staying
with a single processor but going up to a 1000-by-1000 matrix, the speed goes
up to 36 Mflops.
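A rough way to observe this size dependence for yourself is to time solves
of random systems at a few sizes. This sketch assumes NumPy is available;
the absolute rates on your machine will of course differ from the Paragon
numbers above.

    # Time Ax=b solves at two sizes to watch the flop rate change.
    import time
    import numpy as np

    for n in (100, 1000):
        A = np.random.rand(n, n)
        b = np.random.rand(n)
        reps = 100 if n == 100 else 3       # repeat small solves so the
        t = time.perf_counter()             # timing is measurable
        for _ in range(reps):
            np.linalg.solve(A, b)
        elapsed = (time.perf_counter() - t) / reps
        flops = (2.0 / 3.0) * n ** 3        # leading term for LU
        print(n, flops / elapsed / 1e6, "Mflops")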
Where do the flops go? Why does the speed depend so much on the problem size?
The answer lies in understanding the memory hierarchy.
All computers, even cheap ones, look something like this:
Registers
|
Cache (perhaps more than one level)
|
Main Memory (perhaps local and remote)
|
Disk
The memory at the top level of this hierarchy, the registers, is small, fast
and expensive. The memory at the bottom level of the hierarchy, disk, is
large, slow and cheap (relatively speaking!). There is a gradual change
in size, speed and cost from level to level.
Useful work, such as floating point operations, can only be done on
data at the top of the hierarchy. So data stored lower in the
hierarchy must first be transferred to the registers before we can operate
on it, perhaps displacing other data already there.
Transferring data among levels is slow, much slower than the rate at which
we can do useful work on data in the registers, and in fact
this data transfer is the bottleneck in most computations.
Good algorithm design consists in keeping active data near the top
of the hierarchy as long as possible, and minimizing movement
between levels. For many problems, like Gaussian elimination, only if
the problem is large enough is there enough work to do at the top of
the hierarchy to mask the time spent transferring data among lower
levels. The more processors one has, the larger the problem has to be
to mask this transfer time. We will study this example in detail later.
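As a preview of the idea, compare a naive matrix multiplication with a
blocked one. In a compiled language the blocked version runs much faster
once the matrices outgrow the cache; in this pure-Python sketch (the block
size b is an arbitrary choice) only the structure matters: each b-by-b
block is reused many times while it is resident near the top of the
hierarchy.

    def matmul_naive(A, B):
        # Straightforward triple loop: streams through B with no reuse.
        n = len(A)
        C = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                s = 0.0
                for k in range(n):
                    s += A[i][k] * B[k][j]
                C[i][j] = s
        return C

    def matmul_blocked(A, B, b=32):
        # Same arithmetic, reordered so that one b-by-b block of each
        # matrix is the active data; each block is reused about b times
        # per load instead of once.
        n = len(A)
        C = [[0.0] * n for _ in range(n)]
        for i0 in range(0, n, b):
            for j0 in range(0, n, b):
                for k0 in range(0, n, b):
                    for i in range(i0, min(i0 + b, n)):
                        for k in range(k0, min(k0 + b, n)):
                            aik = A[i][k]
                            for j in range(j0, min(j0 + b, n)):
                                C[i][j] += aik * B[k][j]
        return C

Both functions do the same 2n^3 floating point operations; only the order
of the loops, and hence the pattern of data movement, differs.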