CS 194-2: Homework 2



Homework 2, Part 1

Provided code, and writeup.

In discussion on 07 Sep 2007, we talked about using one of Python's string substitution libraries for code generation. The script template.py shows some introductory examples for how to use this library, and unroll.py demonstrates a (relatively) complete loop unroller, with rudimentary code indenting (so that the generated output looks decent). FYI, by "loop unrolling" we don't necessarily mean eliminating the branch altogether; we mean something more like what you might have heard called "strip mining."


Homework 2, Part 2

Update (mfh 23:44 22 Sep 2007): I've posted examples of how to generate and submit multiple, parametrized PBS scripts, using a Python script pbsgen.py. The example template for that script is a file called "pbstemplate".

Assignment

Part 2 of Homework 2 is to write and benchmark a parallel matrix-matrix multiply routine, and report the results. The provided code does not change. You may use either Pthreads or OpenMP for your multithreaded parallel solution. In either case, you are responsible for ensuring correctness and maximizing parallel efficiency.

Due Date

Part 2 of Homework 2 is due next Tuesday, 25 September, at 5pm.

Special instructions for report

Your report should measure two kinds of speedup: strong and weak. Strong speedup is what you get when you keep the amount of data (which is 3n^2) constant and increase the number of processors p. Weak speedup is what you get when you increase the amount of data proportionately with p.

Sequential BLAS routine

You may use either your own optimized matrix-matrix multiply routine, or the highly optimized Intel BLAS routine DGEMM, which is called by the provided code. If you use Intel's DGEMM, you should be aware that the Intel BLAS does its own internal multithreading (using OpenMP). In order to avoid data corruption, you need to tell the Intel BLAS to use only one thread. If you use Pthreads for parallelism, you can set OMP_NUM_THREADS=1 in the calling environment (i.e., in the PBS script). If you use OpenMP for parallelism, don't do this (or your OpenMP program will only use one thread); instead, set MKL_SERIAL=yes. For more information, see p. 28 of the Intel MKL Users' Guide.

Comparison with multithreaded Intel BLAS

For obvious reasons, just calling the Intel BLAS isn't acceptable as a parallel solution; the point is to learn how to do it! ;-) However, we strongly encourage you to benchmark the multithreaded Intel BLAS and graph the results in your report. You can adjust the number of threads used by setting the environment variable OMP_NUM_THREADS appropriately. This will give you a baseline against which you can compare your results. If you don't have this, then you won't have any measure of success other than "perfect linear speedup," which even Intel's BLAS probably won't get.

OpenMP

Online OpenMP tutorials

You are welcome to try OpenMP for your solution, although it is not required. (You'll probably find it less tedious to code.) LLNL offers an online OpenMP tutorial, as does NERSC. Note that they tend to focus on Fortran-flavored OpenMP syntax, which is slightly different than when using OpenMP with C. You may also like to read one of Intel's OpenMP tutorials or poke around Intel's website for more tutorials.

Compiling and linking OpenMP programs

If you use any OpenMP library functions, be sure to "#include <omp.h>" at the top of your program. Otherwise, you only need to use the "-openmp" (without double quotes) flag with Intel's C compiler. You don't need to link in the Pthreads library.

Here's an example code:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int
main ()
{
  int i;
  int n = 1000;
  double x[1000];
  double sum = 0.0;

#pragma omp parallel 
  {
    // "single" means "run only on one thread"
    #pragma omp single
    {
      printf ("nthreads = %d\n", omp_get_num_threads());
    }
    // Parallelize the loop; i is an index variable and so the threads 
    // don't need to share it.  Declaring it "private" can improve 
    // performance (in essence, it removes a mutex).
#pragma omp for private(i)
    for (i = 0; i < n; i++)
      x[i] = (double) 1.0;

    // This for loop is a "reduction" -- we apply the "+" operator to
    // each element of x and store the resulting sum in the "sum" variable.
#pragma omp for reduction(+:sum) private(i) 
    for (i = 0; i < n; i++)
      sum += x[i];
  }
  printf ("sum = %e\n", sum);
  return 0;
}
and here is the output (after running on the PSI cluster's front-end node):
nthreads = 8
sum = 1.000000e+03
I compiled it like this:
$ icc -openmp foo.c
and got the following output from the Intel compiler:
foo.c(19): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
foo.c(23): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
foo.c(13): (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
foo.c(20): (col. 5) remark: LOOP WAS VECTORIZED.
foo.c(24): (col. 5) remark: LOOP WAS VECTORIZED.
You can see that the Intel compiler likes to tell you when it has success parallelizing a loop :-) (Note that "vectorized" and "parallelized" mean two different things.)

OpenMP code outline

If you choose to use OpenMP, a blocked code may look something like this:

#pragma omp parallel for private(k)
for (k over starting indices of chunks of C) {
    Invoke sequential matrix-matrix multiply to compute k-th chunk of C;
}
or like this:
#pragma omp parallel for private(i,j)
for (i over block row indices of C) 
    // Don't need "parallel" in the pragma below because 
    // we're already in a parallel region.
#pragma omp for
    for (j over block column indices of C)
        Invoke sequential matrix-matrix multiply to compute C_{ij};


Last updated 25 Sep 2007.