Introduction to Parallel Computing

CS267: Notes for Lecture 1, Jan 17 1995

Introduction

We will have much class material available on the World Wide Web (WWW), accessed via Mosaic or Netscape:
  • The URL for the class home page is http://www.cs.berkeley.edu/~demmel/cs267.
  • You can click on Computational Science Education Project or Designing and Building Parallel Programs or MIT's 18.337 to access on-line courses similar to this one.
  • You can click on CS 258 to get an on-line version of Prof. David Culler's Parallel Architecture class.
  • You can click on CS 294 to get an on-line version of Prof. Eric Brewer's Multiprocessor Networks class.
  • You can click on Castle to get a description of a portable parallel programming environment; this local research project has produced a number of programming tools we will use this semester.
  • Mixed student teams (CS and non-CS) will do homework and a project. Grading will be 25% homework, 25% midterm, and 50% project.
  • Machines available to the class include the CM-5, SP-1, and NOW (networks of workstations).
  • Related classes being given this semester include
  • CS 273 (Parallel Algorithms, Ranade, MW 12:30-2, 505 Soda)
  • CS 258 (Parallel Architecture, Culler, MW 9:30-11, 310 Soda)
  • There is some overlap with these other classes, but we will mostly consider applications and software tools, including the following.

    Programming Languages

  • Matlab - This is a serial language with a parallel flavor, plus good graphics.
  • CM Fortran - This is like a parallel Matlab, but not nearly so simple. It runs only on the CM-5, but HPF (High Performance Fortran) is intended to be a more portable and closely related language.
  • Message passing - This is the "assembly language" of parallel computing. It is low level and error prone, but currently it is the most common and most portable parallel programming style (see the short sketch after this list).
  • Split-C is C augmented with a few constructs to make parallelism available and predictable. It is produced locally by Profs. Culler, Yelick, et al, and runs on many platforms.
  • Other parallel languages we may consider include pSather, CC++, NESL, Linda, ID90, etc. Click on the National HPCC Software Exchange (NHSE) for a more complete list.
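    To give a flavor of the message-passing style, here is a minimal sketch in C using the MPI interface (an assumption on our part; the message-passing tools used in class may differ). Every processor runs the same program; each processor contributes one integer, and processor 0 collects and sums the contributions with explicit sends and receives.

        /* Minimal message-passing sketch using MPI (assumed here for
           illustration; the class may use a different library).
           Each processor contributes one integer; processor 0
           receives and sums them. */
        #include <stdio.h>
        #include <mpi.h>

        int main(int argc, char **argv)
        {
            int rank, size, i, mine, other, sum;
            MPI_Status status;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my processor number   */
            MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processors  */

            mine = rank + 1;                        /* my contribution       */
            if (rank == 0) {
                sum = mine;
                for (i = 1; i < size; i++) {        /* receive from the rest */
                    MPI_Recv(&other, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
                    sum += other;
                }
                printf("sum of 1..%d = %d\n", size, sum);
            } else {
                MPI_Send(&mine, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            }
            MPI_Finalize();
            return 0;
        }

    Every data transfer must be written out by hand as a matching send and receive, which is why this style is called the "assembly language" of parallel programming.
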
    Libraries for Parallel Machines

  • ScaLAPACK - a parallel numerical linear algebra library (Demmel et al).
  • Multipol - a parallel distributed data structure library (Yelick et al).
  • PETSc, LPAR - solving PDEs (partial differential equations) and related scientific problems.
  • Chaco - automatic mesh partitioning and load balancing of irregular problems.
  • We will try to emphasize programming tools that have a future, or at least interesting ideas.

    Applications and related algorithms

    We will use a collection of programs called Sharks and Fish, which simulate fish and sharks swimming around, eating, breeding, and dying, while obeying rules which model more complicated and realistic force fields. Sharks and Fish provide a simple model of particle simulations, such as ions in a plasma, cars on a freeway, stars in a galaxy, etc.
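    As a rough illustration of what a particle simulation looks like, here is a generic serial sketch in C of one time step (this is not the actual Sharks and Fish code; the force law, constants, and array layout are placeholders). Each step computes a force on every particle from all the others and then moves the particles a little.

        /* One time step of a generic O(n^2) particle simulation
           (serial sketch; the force law and constants are
           placeholders, not the Sharks and Fish rules). */
        #define NPART 1000
        double x[NPART], y[NPART];     /* positions  */
        double vx[NPART], vy[NPART];   /* velocities */
        double fx[NPART], fy[NPART];   /* forces     */

        void time_step(double dt)
        {
            int i, j;
            double dx, dy, r2, f;

            for (i = 0; i < NPART; i++) {          /* compute forces */
                fx[i] = fy[i] = 0.0;
                for (j = 0; j < NPART; j++) {
                    if (j == i) continue;
                    dx = x[j] - x[i];
                    dy = y[j] - y[i];
                    r2 = dx*dx + dy*dy + 1e-9;     /* avoid division by zero */
                    f  = 1.0 / r2;                 /* placeholder force law  */
                    fx[i] += f * dx;
                    fy[i] += f * dy;
                }
            }
            for (i = 0; i < NPART; i++) {          /* move the particles */
                vx[i] += dt * fx[i];
                vy[i] += dt * fy[i];
                x[i]  += dt * vx[i];
                y[i]  += dt * vy[i];
            }
        }

    A parallel version must divide the particles among processors and communicate whatever information each processor needs about the other processors' particles; doing that partitioning and communication efficiently is where the interesting questions arise.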

    We will also discuss linear algebra problems, such as solving Ax=b and finding eigenvalues. These arise in many applications from all fields (structural mechanics, circuit simulation, computational chemistry, etc.). We will cover both dense and sparse problems, and direct and iterative methods.

    These linear algebra problems often arise from solving differential equations, which arise in many areas. We will discuss the heat equation, Poisson's equation, and equations involved in climate modeling in particular.
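    For reference, the two model equations in standard form are (boundary and initial conditions omitted; alpha is a diffusion constant):

        u_t = \alpha \, \nabla^2 u          (heat equation)
        \nabla^2 u = f                      (Poisson's equation)

    Discretizing the Laplacian on a grid, for example by finite differences, turns such equations into exactly the kind of large, sparse linear systems Ax=b mentioned above.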

    Combinatorial problems such as sorting and the Traveling Salesman Problem will also be discussed.

    Motivation for Parallel Computing

    The traditional scientific paradigm is first to do theory (say on paper), and then lab experiments to confirm or deny the theory. The traditional engineering paradigm is first to do a design (say on paper), and then build a laboratory prototype. Both paradigms are being replaced by numerical experiments and numerical prototyping. There are several reasons for this.
  • Real phenomena are too complicated to model on paper (e.g., climate prediction).
  • Real experiments are too hard, too expensive, or too dangerous for a laboratory (e.g., oil reservoir simulation, large wind tunnels, galactic evolution, whole factory or product life cycle design and optimization, etc.).
  • Scientific and engineering problems requiring the most computing power to simulate are commonly called "Grand Challenges". Click on Grand Challenges for examples, or on the Sequoia 2000 Global Change Research Project, which is concerned with building a large database for global change research.

    Why parallelism is essential

    The speed of light is an intrinsic limitation to the speed of computers. Suppose we wanted to build a completely sequential computer with 1 TB of memory running at 1 Tflop. If the data has to travel a distance r to get from the memory to the CPU, and it has to travel this distance 10^12 times per second at the speed of light c=3e8 m/s, then r <= c/10^12 = .3 mm. So the computer has to fit into a box .3 mm on a side.

    Now consider the 1TB memory. Memory is conventionally built as a planar grid of bits, in our case say a 10^6 by 10^6 grid of words. If this grid is .3mm by .3mm, then one word occupies about 3 Angstroms by 3 Angstroms, or the size of a small atom. It is hard to imagine where the wires would go!
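    Spelling out the arithmetic:

        r \le \frac{c}{10^{12}} = \frac{3 \times 10^8 \ \mathrm{m/s}}{10^{12} \ \mathrm{s}^{-1}}
          = 3 \times 10^{-4} \ \mathrm{m} = 0.3 \ \mathrm{mm}

        \frac{0.3 \ \mathrm{mm}}{10^6} = 3 \times 10^{-10} \ \mathrm{m} = 3 \ \mathrm{Angstroms\ between\ words}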

    Why writing fast programs is hard

    On the World Wide Web (WWW) there is a long list of "all" computers, sorted by their speed in solving systems of linear equations Ax=b with Gaussian elimination. The list is called the Linpack Benchmark. Currently (as of December 1994) the fastest machine is an Intel Paragon with 6768 processors and a peak speed of 50 Mflops/proc, for an overall peak speed of 6768*50 = 338 Gflops. Doing Gaussian elimination, the machine gets 281 Gflops on a 128600x128600 matrix; the whole problem takes 84 minutes. This is also a record for the largest dense matrix solved by Gaussian elimination (a record destined to fall within months, as these records tend to). Current Paragons have i860 chips, but future Paragons may have Pentiums (hopefully ones that do division correctly).
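    As a sanity check on these numbers: Gaussian elimination on an n-by-n matrix takes about (2/3)n^3 floating point operations, so

        \tfrac{2}{3} n^3 = \tfrac{2}{3} (1.286 \times 10^5)^3 \approx 1.42 \times 10^{15} \ \mathrm{flops},
        \qquad \frac{1.42 \times 10^{15} \ \mathrm{flops}}{281 \times 10^9 \ \mathrm{flops/s}} \approx 5.1 \times 10^3 \ \mathrm{s} \approx 84 \ \mathrm{minutes}.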

    But if we try to solve a much tinier 100-by-100 linear system, the fastest the machine will go is about 10 Mflops rather than 281 Gflops, and that is attained using just one of the 6768 processors, an expensive waste of silicon. Staying with a single processor, but going up to a 1000x1000 matrix, the speed goes up to 36 Mflops.

    Where do the flops go? Why does the speed depend so much on the problem size? The answer lies in understanding the memory hierarchy. All computers, even cheap ones, look something like this:

            Registers
                |
            Cache (perhaps more than one level)
                |
            Main Memory (perhaps local and remote)
                |
            Disk
    
    The memory at the top level of this hierarchy, the registers, is small, fast and expensive. The memory at the bottom level of the hierarchy, disk, is large, slow and cheap (relatively speaking!). There is a gradual change in size, speed and cost from level to level.

    Useful work, such as floating point operations, can only be done on data at the top of the hierarchy. So data stored lower in the hierarchy must first be transferred to the registers before we can work on it, perhaps displacing other data already there. Transferring data among levels is slow, much slower than the rate at which we can do useful work on data in the registers, and in fact this data transfer is the bottleneck in most computations.

    Good algorithm design consists in keeping active data near the top of the hierarchy as long as possible, and minimizing movement between levels. For many problems, like Gaussian elimination, only if the problem is large enough is there enough work to do at the top of the hierarchy to mask the time spent transferring data among lower levels. The more processors one has, the larger the problem has to be to mask this transfer time. We will study this example in detail later.
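    To make the idea concrete, here is the classic illustration with matrix multiplication rather than Gaussian elimination itself (a serial sketch; the matrix size N and block size BS are placeholders, and BS should be chosen so that three BS-by-BS blocks fit in cache). The blocked version performs the same arithmetic as the usual three nested loops, but each block of data is reused many times while it sits high in the memory hierarchy.

        /* Blocked (tiled) matrix multiplication C = C + A*B, a serial
           sketch of keeping active data near the top of the memory
           hierarchy.  N and BS are placeholder values; assume N is a
           multiple of BS and three BSxBS blocks fit in cache. */
        #define N  512
        #define BS 32

        double A[N][N], B[N][N], C[N][N];

        void matmul_blocked(void)
        {
            int i0, j0, k0, i, j, k;
            for (i0 = 0; i0 < N; i0 += BS)
              for (j0 = 0; j0 < N; j0 += BS)
                for (k0 = 0; k0 < N; k0 += BS)
                  /* multiply one BSxBS block of A by one BSxBS block
                     of B and accumulate into a BSxBS block of C; the
                     three blocks stay in cache while each of their
                     entries is reused BS times */
                  for (i = i0; i < i0 + BS; i++)
                    for (j = j0; j < j0 + BS; j++)
                      for (k = k0; k < k0 + BS; k++)
                        C[i][j] += A[i][k] * B[k][j];
        }

    Gaussian elimination can be reorganized in a similar blocked fashion; only when the matrix is large enough do the blocks contain enough work to hide the data movement, which is the effect behind the Linpack numbers above.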