CS 258 Parallel Processors
University of California, Berkeley
Dept. of Electrical Engineering and Computer Sciences
Prof. David E. Culler
Due Fri, 2/12, before class
This assignment has both a hands-on part and book problems. You should
do the assignment in teams of two. Please work together. Don't
split it: "You do message passing, I'll do shared address, and we'll glue
them together at the end." It is the talking and puzzling through
the problems together that is important. Please email your assignment to firstname.lastname@example.org
with a subject line containing CS258 HW2. Include your names.
Hands-on with NOW
If you have not already filled out a NOW account form, please do so at
To get set up: log in to now.cs.berkeley.edu using ssh, add /usr/now/bin
to your path, and add /usr/now/man to your man path.
NOW is a front-end to a cluster of about 100 UltraSPARCs (more or less,
depending on the day) connected by a very fast network to provide a
general-purpose parallel machine. It runs under the GLUnix global
operating system layer. (You can find out more about NOW at
http://now.cs.berkeley.edu. Tutorial information is at
http://now.cs.berkeley.edu/nowTutorial.html. Man pages are at
http://now.cs.berkeley.edu/man/html1/glunix.html.)
We will treat it as a conventional parallel machine. You should find
little need to log in to specific nodes within the machine. The machine
is configured into a flexible set of partitions. To find out the
current partition aliases, run
You will notice that there are several overlapping partitions defined.
You are currently associated with the default production partition.
To find out more about it, type
You don't really care which nodes you have, just how many are available.
You can get this from
You can see what is running with
From time to time, researchers reserve nodes for dedicated use. If
glustat reports that few or no nodes are available, that is what is going
on. To see what is reserved, run
The glunix tools understand how to steer clear of reservations, so you
won't get burned. The production nodes tend to fill up with a bunch of
sequential jobs, so we have arranged a couple of partitions that are
intended for parallel execution. To switch over to one of these, you
set an environment variable. For example,

setenv GLUNIX_NODES 32pns

will cause your jobs to run in the partition of that name, and glustat
will then report on those nodes.
So now let's learn about message passing using MPI.
I have built a simple example at http://www.cs.berkeley.edu/~culler/cs258-s99/programs/mpi.
Copy this to your home directory

cp -r ~culler/public_html/cs258-s99/programs/mpi yourMPI

and build it. This should produce the executable msample. You run it
on N processors in the current partition by
Glunix will pick the N most lightly loaded nodes and run the executable
on each of them.
Spend some time studying msample.c. It shows several of the
MPI concepts. All N processes start executing at main. The first
thing they all do is initialize their MPI environment. They are, by
default, in the MPI_COMM_WORLD communicator. They each determine
how many processes are included and what their rank, or address, is
within it:

MPI_Comm_size(MPI_COMM_WORLD, &NumProcs);  /* Number of Procs */
MPI_Comm_rank(MPI_COMM_WORLD, &MyProc);    /* Local address */
Remember, they are all executing the same code. In this example,
process 0 behaves specially, as a kind of master process. All other
processes send it a message consisting of their address, using the
standard send:

msg[0] = MyProc;
MPI_Send(msg, 1, MPI_INT, 0, Tag, MPI_COMM_WORLD);
This transfers one integer from the buffer address msg to process
0 in communicator MPI_COMM_WORLD with tag Tag. Each of these
processes then goes on to eventually wait for a response. The master
receives these messages in whatever order they happen to arrive, using
a wildcard receive:

MPI_Recv(msg, 128, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
It uses its msg buffer to hold the incoming data and
is happy to accept up to 128 ints. It records the order of the messages
in a record array. Then it sends a 16-word message to each of the
other processes. They receive it with a specific MPI_Recv call, which
names the source and tag explicitly:

MPI_Recv(msg, 16, MPI_INT, 0, Tag, MPI_COMM_WORLD, &status);
As you can see, this does a kind of strange reduction and broadcast,
forming a global barrier. It is entirely sequential.
Your MPI task
Write the equivalent of MPI_Allreduce under addition. It should be
scalable, efficient and reliable. You should be able to call it over
and over, without the distinct calls getting confused. Clearly you
want something that is log depth, i.e., a tree, rather than linear depth.
However, if the array is small, you may be better off using a fan-in
greater than two. You will find that timing parallel programs is a
little bit tricky. You can find a bunch of handy routines in
http://www.cs.berkeley.edu/~culler/cs258-s99/benchmarks.
Still, the machine is a complex empirical device. You will likely need
to repeat experiments to get clock fidelity and take several measurements
to get reliable data. Turn in your program and your graphs.
Hands-on with PTHREADS on SMPs
You should be all set to run on any of the 8-processor Sun Enterprise
5000s in the CLUMPS cluster of SMPs: clumpN.cs.berkeley.edu, where N =
0,1,2,3. In this case, you will just start a process and it will
fork a bunch of threads. Again, I have given you a simple example
to get started at http://www.cs.berkeley.edu/~culler/cs258-s99/programs/pthreads.
Copy this to your home directory

cp -r ~culler/public_html/cs258-s99/programs/pthreads yourPT

and build it. This should produce the executable psample. You run it
on N processors by
Spend some time studying psample.c. It shows several of the PTHREADS
concepts. You can find out more about the specifics from man pthreads.
Notice that PTHREADS and Solaris threads are very similar, and the man
pages cover both together. This adds some unnecessary confusion; let's
stick to the portable standard, pthreads.
psample starts out with only one thread, in main. This provides a handy
time to initialize things before the threads start trying to use them.
Then we need to create all the threads. We create N of them and start
them all working at start_fun, giving each its logical thread number.
The main thread sets the shared GO flag and then waits for all the
worker threads to complete in the pthread_join loop. It should block
and get out of the way so that the other threads get the processors.
One of the distinctions that shows up already is that between system
threads, which are preemptible scheduling units, and user threads, which
are not. We create the real thing. We don't bind the threads to specific
processors, but we do bind each to a Solaris light-weight process (LWP),
i.e., a system thread. Observe that each thread spins on the GO flag.
If these were non-preemptible user-level threads, they might spin here
forever while the main thread never gets scheduled to set the flag. The use of
the shared GO flag has created a cheap and dirty barrier. All threads
wait on it. (What happens inside the machine while they are doing
that? Why do we need the volatile storage modifier? What would we
need to do if we wanted to use such a construct repeatedly?)
The third important construct is the use of a mutex_lock to provide
a critical section around the update of the shared report array.
Like the MPI code, the position of a thread in this array is
nondeterministic, depending on when each thread gets started. (I haven't
been able to get them out of order yet, but logically they could be.)
This is a tricky aspect of thread programming: subtle reorderings may
not occur in test after test, yet they still pop up in the real world.
What is the actual construction of the shared address space for these
threads? Are there actually any private regions? Clearly each
thread has its own stack, but are the stacks private? You can do some
tests to figure it out.
Your PTHREAD task
Now that you have a starting point, write the analogous All_Reduce for
the shared address space case. You will only be able to test
scalability up to 8 processors. What happens to your all_reduce beyond
this? How does it behave when there are other things going on in
the machine? Turn in your code and your results.
Parallel Programming Experience
Now we're ready to get serious and have some fun developing an effective
parallel program for a non-trivial algorithm in these two models.
The problem you are going to solve is the 0-1 knapsack problem. You
are given N balls with positive integer weights Wi, for i = 0..N-1,
and profits Pi, for i = 0..N-1, and a knapsack with integer capacity
C. (Only mathematicians start counting at 1.) Your goal is to determine
a subset of the balls such that the sum of the weights is no more than
C and the sum of the profits is maximized. The obvious exhaustive
approach is to try all 2^N combinations. Plenty of parallelism,
no communication, and it will take forever. There is an elegant dynamic
programming solution to this problem, however. Imagine that you have
an NxC table T. The idea is that entry T(i,j) gives the maximum profit
obtained within capacity j using only balls 0..i. We can also think
of it as describing the subset that obtains that max. Clearly you
can compute T(0,j) directly: the first ball either fits or it doesn't.
In fact, you can compute that entire row in parallel. We can fill
the table in row by row. Consider T(i,j), assuming row i-1 has been
computed. The basic question is whether we now include ball i. If
we do include it, we fill up the rest of the knapsack as best possible,
i.e., with T(i-1, j-Wi). So the recurrence is

T(i,j) = max[ T(i-1,j), T(i-1,j-Wi) + Pi ].
You can find a sequential C version of the solution in http://www.cs.berkeley.edu/~culler/cs258-s99/programs/knapsack
The nice aspect of this problem is that while the parallelism is obvious
and regular, the dependences, and thereby the communication pattern, are
not: they depend on the actual weights of the input set.
You will implement it both in a shared address space and message passing
framework. Follow the step-by-step process outlined in chapter 2.
After you have thought through the decomposition (partitioning and assignment)
you should make (and turn in) some basic calculations to model the expected
performance. The total amount of work is obviously xNC, for constant
x. With your decomposition, what is the work and communication per processor?
Hint: the latter will be a function of the average weight. How does
this scale with N, C and p, the number of processors? I would build
the shared address space version first. Test it and show that it
matches your performance model. What information gets replicated
automatically into caches?
Think about and explain how to tackle the message passing implementation
before you build it. How is the global data structure (the table)
represented? What information is explicitly replicated? Do
you emulate the shared address space, or do you perform a global data
exchange? Is termination a problem? Build a simple performance model
along the lines of the above and compare it with what your program does.
Book Problems Culler and Singh:
2.4, 2.6, 2.9, 3.12, 3.14, and one problem of your choice from either chapter.