CS 258 Parallel Processors
University of California, Berkeley
Dept. of Electrical Engineering and Computer Sciences

Prof. David E. Culler
Assignment 2
Spring 1999
Due Fri, 2/12 Before Class

This assignment has both a hands-on part and book problems.  You should do the assignment in teams of two.  Please work together; don't split it, "You do message passing, I'll do shared address, and we'll glue them together at the end."  It is the talking and puzzling through problems that is important.  Please email your assignment to culler@cs.berkeley.edu with a subject line containing CS258 HW2.  Include your names and email addresses.

Hands-on with NOW

If you have not already filled out a NOW account form, please do so at http://now.cs.berkeley.edu/Spr99NewAccount/NewAccount.html. This is a front-end to a cluster of about 100 UltraSPARCs (more or less, depending on the day) connected by a very fast network to provide a general-purpose parallel machine.  It runs under the GLUnix global operating system layer.  (You can find out more about NOW at http://now.cs.berkeley.edu. Tutorial information is at http://now.cs.berkeley.edu/nowTutorial.html. Man pages are at http://now.cs.berkeley.edu/man/html1/glunix.html.)  We will treat it as a conventional parallel machine; you should find little need to log in to specific nodes within it.

The machine is configured into a flexible set of partitions; the commands for listing the current partition aliases and for examining a particular partition are covered in the GLUnix man pages.  You will notice that several overlapping partitions are defined.  You are currently associated with the default production partition.  You don't really care which nodes you have, just how many are available, which you can get from glustat; there are also commands to see what is running and what is reserved.  From time to time, researchers reserve nodes for dedicated use; if glustat reports that few or no nodes are available, that is what is going on.  The GLUnix tools understand how to steer clear of reservations, so you won't get burned.

The production nodes tend to get filled up with a bunch of sequential stuff, so we have arranged a couple of partitions that are intended for parallel execution.  To switch over to one of these, you set an environment variable naming the partition; jobs will then run in the partition of that name, and glustat will give you information on those nodes.

Using MPI

So now let's learn about message passing using MPI. I have built a simple example at http://www.cs.berkeley.edu/~culler/cs258-s99/programs/mpi. Copy this to your home directory and build it; this should produce the executable msample.  You run it on N processors in the current partition with glurun, and GLUnix will pick the N most lightly loaded nodes and run the executable on each of them.

Spend some time studying msample.c.  It shows several of the MPI concepts.  All N processes start executing at main. The first thing they all do is initialize their MPI environment. They are, by default, in the MPI_COMM_WORLD communicator.  Each determines how many processes are included and what its rank, or address, is within the communicator.

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &NPROCS);       /* Number of Procs */
  MPI_Comm_rank(MPI_COMM_WORLD, &MyProc);         /* Local address */

Remember, they are all executing the same code.  In this example, process 0 behaves specially as a kind of master process.  All other processes send a message to it consisting of their address using the standard send, MPI_Send.

   msg[0] = MyProc;
   MPI_Send(msg, 1, MPI_INT, 0, Tag, MPI_COMM_WORLD);

This transfers one integer from the buffer address msg to process 0 in communicator MPI_COMM_WORLD with tag Tag.  Each of these processes then goes on to wait for a response.  The master receives these messages in whatever order they happen to arrive, using a wildcard source in

    MPI_Recv(msg, 128, MPI_INT, MPI_ANY_SOURCE, Tag, MPI_COMM_WORLD, &status);

It uses its msg buffer to hold the incoming data and is happy to accept up to 128 ints.  It records the order of the messages in a record array.  Then it sends a 16-word message to each of the other processes, which they receive with a specific MPI_Recv call that must match.

    MPI_Recv(msg, 16, MPI_INT, 0, Tag, MPI_COMM_WORLD, &status);

As you can see, this amounts to a kind of strange reduction followed by a broadcast, forming a global barrier - and it is entirely sequential.
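Putting the pieces together, the whole exchange looks roughly like this (a sketch of the pattern, not the literal contents of msample.c; record is the assumed name of the order-keeping array):

  if (MyProc == 0) {
    /* Master: collect one check-in from each other process, in arrival order */
    for (i = 1; i < NPROCS; i++) {
      MPI_Recv(msg, 128, MPI_INT, MPI_ANY_SOURCE, Tag, MPI_COMM_WORLD, &status);
      record[i] = msg[0];          /* who arrived i-th */
    }
    /* ... then release everyone with the 16-word response */
    for (i = 1; i < NPROCS; i++)
      MPI_Send(msg, 16, MPI_INT, i, Tag, MPI_COMM_WORLD);
  } else {
    msg[0] = MyProc;
    MPI_Send(msg, 1, MPI_INT, 0, Tag, MPI_COMM_WORLD);            /* check in */
    MPI_Recv(msg, 16, MPI_INT, 0, Tag, MPI_COMM_WORLD, &status);  /* wait for release */
  }

Note that the master does its N-1 receives and N-1 sends one after another: every process waits on a single node.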

Your MPI task

Write the equivalent of MPI_Allreduce under addition.  It should be scalable, efficient, and reliable.  You should be able to call it over and over without the distinct calls getting confused.  Clearly you want something that is log depth, i.e., a tree, rather than linear depth.  However, if the array is small, you may do better with a fan-in greater than two.  You will find that timing parallel programs is a little bit tricky.  You can find a bunch of handy routines at http://www.cs.berkeley.edu/~culler/cs258-s99/benchmarks. Still, the machine is a complex empirical device: you will likely need to repeat experiments to get clock fidelity and take several measurements to get reliable data. Turn in your program and your graphs.
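To make the tree structure concrete, here is a minimal sketch for the scalar case with fan-in two, using a binary tree rooted at process 0 (my_allreduce is a hypothetical name, not an MPI routine; a full solution must also handle arrays, a larger fan-in, and careful timing):

  #include <mpi.h>

  /* Sum-allreduce of one int: partial sums flow up the tree, the total
     flows back down.  Children of rank r are 2r+1 and 2r+2. */
  int my_allreduce(int myval, int tag)
  {
    int rank, nprocs, child, parent, tmp;
    MPI_Status status;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (child = 2*rank + 1; child <= 2*rank + 2; child++)
      if (child < nprocs) {                       /* reduce phase: sum the subtrees */
        MPI_Recv(&tmp, 1, MPI_INT, child, tag, MPI_COMM_WORLD, &status);
        myval += tmp;
      }
    if (rank != 0) {
      parent = (rank - 1) / 2;
      MPI_Send(&myval, 1, MPI_INT, parent, tag, MPI_COMM_WORLD);
      MPI_Recv(&myval, 1, MPI_INT, parent, tag, MPI_COMM_WORLD, &status);
    }
    for (child = 2*rank + 1; child <= 2*rank + 2; child++)
      if (child < nprocs)                         /* broadcast phase: push the total down */
        MPI_Send(&myval, 1, MPI_INT, child, tag, MPI_COMM_WORLD);

    return myval;
  }

Passing a fresh tag on each call is one easy way to keep distinct calls from getting confused; since the pattern is itself a barrier, even a fixed tag turns out to be safe, but you should convince yourself why.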

Hands-on with PTHREADS on SMPs

You should be all set to run on any of the 8-processor Sun Enterprise 5000s in the CLUMPS cluster of SMPs: clumpN.cs.berkeley.edu, where N = 0,1,2,3.  In this case, you will just start a process and it will fork a bunch of threads.  Again, I have given you a simple example to get started, at http://www.cs.berkeley.edu/~culler/cs258-s99/programs/pthreads. Copy this to your home directory and build it; this should produce the executable psample, which you run with N threads on one of the clump machines.

Spend some time studying psample.c.  It shows several of the PTHREADS concepts.  You can find out more about the specifics from man pthreads.  Notice that PTHREADS and Solaris threads are very similar, and the man pages cover both together.  This adds some unnecessary confusion - let's stick to the portable standard, pthreads.

psample starts out with only one thread, in main.  This provides a handy time to initialize things before the threads start trying to use them.  Then we need to create all the threads.  We create N of them and start them all working at start_fun, giving each its logical thread number.  The main thread sets the shared GO flag and then waits for all the worker threads to complete, in the pthread_join loop.  It should block and get out of the way so that the other threads get the processors.

One of the distinctions that shows up already is that between system threads, which are preemptible scheduling units, and user threads, which are not.  We create the real thing.  We don't bind them to specific processors, but we do bind each to a Solaris light-weight process (LWP), i.e., a system thread.  Observe that each thread spins on the GO flag.  If these were non-preemptible user-level threads, they might spin here forever while the main thread never gets scheduled to set the flag. The use of the shared GO flag creates a cheap-and-dirty barrier: all threads wait on it.  (What happens inside the machine while they are doing that?  Why do we need the volatile storage modifier? What would we need to do if we wanted to use such a construct repeatedly?)
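The startup code has roughly the following shape (a sketch reconstructed from the description above, not the literal psample.c; NPROCS and start_fun are the assumed names):

  #include <pthread.h>

  #define NPROCS 8                /* number of worker threads (assumed) */

  volatile int GO = 0;            /* shared flag; volatile forces the spin to re-read memory */
  pthread_t tid[NPROCS];

  void *start_fun(void *arg)
  {
    int me = (int)(long)arg;      /* logical thread number */
    while (!GO) ;                 /* cheap-and-dirty barrier: spin until main releases us */
    /* ... the real work goes here ... */
    return NULL;
  }

  int main()
  {
    pthread_attr_t attr;
    int i;

    pthread_attr_init(&attr);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);   /* bind each thread to an LWP */
    /* initialize shared data here, while main is still the only thread */
    for (i = 0; i < NPROCS; i++)
      pthread_create(&tid[i], &attr, start_fun, (void *)(long)i);
    GO = 1;                       /* release everybody */
    for (i = 0; i < NPROCS; i++)
      pthread_join(tid[i], NULL); /* block, getting out of the workers' way */
    return 0;
  }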

The third important construct is the use of a pthread mutex lock to provide a critical section around the update of the shared report array.  As in the MPI code, the position of a thread in this array is nondeterministic, depending on when it gets started.  (I haven't been able to get them out of order yet, but logically they can be.)  This is a tricky aspect of thread programming: subtle reorderings may fail to occur time and time again while testing, yet they still pop up in the real world.
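In a sketch, the critical section is only a few lines (report_lock, next_slot, and me are assumed names):

  pthread_mutex_lock(&report_lock);   /* at most one thread past this point at a time */
  report[next_slot++] = me;           /* without the lock, two threads could claim the same slot */
  pthread_mutex_unlock(&report_lock);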

What is the actual construction of the shared address space for these threads?  Are there actually any private regions?  Clearly each thread has its own stack, but are the stacks private?  You can do some tests to figure it out.

Your PTHREAD task

Now that you have a starting point, write the analogous All_Reduce for the shared address space case.  You will only be able to test scalability up to 8 processors on one machine.  What happens to your all_reduce beyond this?  How does it behave when there are other things going on in the machine?  Turn in your code and your results.
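One possible shape for the shared version, sketched below under my own naming, is a single accumulator guarded by a mutex plus a sense-reversing release flag, so the routine can be called over and over - exactly the issue raised by the GO-flag questions above.  (A tree of partial sums would scale better than one central lock; improving on this is part of the exercise.)

  #include <pthread.h>

  static pthread_mutex_t ar_lock = PTHREAD_MUTEX_INITIALIZER;
  static int          ar_sum = 0;     /* partial sums accumulate here           */
  static int          ar_count = 0;   /* threads that have arrived in this call */
  static volatile int ar_result;      /* total, published by the last arrival   */
  static volatile int ar_sense = 0;   /* flips once per call (sense reversal)   */

  int all_reduce_add(int myval, int nthreads)
  {
    int my_sense = !ar_sense;         /* what ar_sense becomes when all arrive */

    pthread_mutex_lock(&ar_lock);
    ar_sum += myval;
    if (++ar_count == nthreads) {     /* last arrival: publish, reset, release */
      ar_result = ar_sum;
      ar_sum = 0;
      ar_count = 0;
      ar_sense = my_sense;
    }
    pthread_mutex_unlock(&ar_lock);

    while (ar_sense != my_sense) ;    /* spin; volatile forces the re-read */
    return ar_result;
  }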

Parallel Programming Experience

Now we're ready to get serious and have some fun developing an effective parallel program for a non-trivial algorithm in these two models.  The problem you are going to solve is the 0-1 knapsack problem.  You are given N balls with positive integer weights Wi and profits Pi, for i = 0..N-1, and a knapsack with integer capacity C.  (Only mathematicians start counting at 1.)  Your goal is to determine a subset of the balls such that the sum of the weights is no more than C and the sum of the profits is maximized.  The obvious exhaustive approach is to try all 2^N combinations: plenty of parallelism, no communication, and it will take forever.

There is an elegant dynamic programming solution to this problem, however.  Imagine that you have an NxC table T.  The idea is that entry T(i,j) gives the maximum profit obtainable within capacity j using only balls 0..i.  We can also think of it as describing the subset that obtains that maximum.  Clearly you can compute T(0,j) directly: the first ball either fits or it doesn't.  In fact, you can compute that entire row in parallel.  We can fill in the table row by row.  Consider T(i,j), assuming row T(i-1,*) has been computed.  The basic question is: now that ball i is available, do we include it?  If we do include it, we fill up the rest of the knapsack as best possible with the earlier balls, i.e., T(i-1, j-Wi).  So the recurrence is
        T(i,j) = max[ T(i-1,j), T(i-1, j-Wi) + Pi ]
You can find a sequential C version of the solution at http://www.cs.berkeley.edu/~culler/cs258-s99/programs/knapsack
The nice aspect of this problem is that while the parallelism is obvious and regular, the dependences, and thereby the communication pattern, are not: they depend on the actual weights of the input set.
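In code, the sequential table fill is just a pair of loops (a sketch, assuming arrays W[] and P[] and a table T with C+1 columns; the version at the URL above may differ in details):

  /* Row 0: with only ball 0 available, the profit is P[0] iff it fits. */
  for (j = 0; j <= C; j++)
    T[0][j] = (W[0] <= j) ? P[0] : 0;

  /* Each later row depends only on the row above it. */
  for (i = 1; i < N; i++)
    for (j = 0; j <= C; j++) {        /* every j here is independent: row parallelism */
      T[i][j] = T[i-1][j];                        /* leave ball i out ...  */
      if (W[i] <= j && T[i-1][j - W[i]] + P[i] > T[i][j])
        T[i][j] = T[i-1][j - W[i]] + P[i];        /* ... or put it in      */
    }
  /* T[N-1][C] holds the maximum profit. */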

You will implement it in both a shared address space and a message passing framework.  Follow the step-by-step process outlined in Chapter 2.  After you have thought through the decomposition (partitioning and assignment), you should make (and turn in) some basic calculations to model the expected performance.  The total amount of work is obviously xNC, for some constant x.  With your decomposition, what is the work and communication per processor?  Hint: the latter will be a function of the average weight.  How does this scale with N, C, and p, the number of processors?  I would build the shared address space version first.  Test it and show that it matches your performance model.  What information gets replicated automatically into caches?
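As one worked example of the kind of model intended (the block decomposition here is my assumption, not the required answer): suppose each row is divided into p contiguous blocks of C/p columns, with processor k owning columns kC/p through (k+1)C/p - 1.  Computing T(i,j) needs T(i-1,j), which is local, and T(i-1, j-Wi), which is remote exactly when j-Wi falls to the left of your block - that is, for the first min(Wi, C/p) columns you own.  Roughly, then,

  work per processor, per row        ~  xC/p
  remote entries needed, per row     ~  min(Wi, C/p)
  total communication over the table ~  N * min(Wavg, C/p)

where Wavg is the average weight.  Notice how the communication-to-computation ratio changes once p grows past C/Wavg.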

Think about and explain how to tackle the message passing implementation before you build it.  How is the global data structure (the table) represented?  What information is explicitly replicated?  Do you emulate the shared address space, or do you perform a global data exchange?  Is termination a problem?  Build a simple performance model along the lines of the above and compare it with what your program does.

Book Problems (Culler and Singh):

2.4, 2.6, 2.9, 3.12, 3.14, and one problem of your choice from either chapter.