Proposed Implementation Scheme for the BLAS Technical Forum Draft Standard

Participants

  • Jim Demmel
  • Xiaoye Li
  • Anil Kapur (anilk@uclink4.berkeley.edu)
  • Michael Martin (mcmartin@cory.eecs.berkeley.edu)
  • Weihua Shen (weihua@cs.berkeley.edu)
  • Berkat Tung (meio@cs.berkeley.edu)
  • Ben Wanzo (wanzo@uclink4.berkeley.edu)

Table of Contents

  • Description
  • Testing
  • Other Languages Besides C
  • Related Work by Andrew Lumsdaine
  • Future Work
  • Sample Software for DOT PRODUCT (as of 3/14/99)
    Description

    Our goal is to automate the construction and testing of the BLAS as proposed in Chapters 2 and 4 of the BLAS Technical Forum Draft Standard. We are using the M4 macro package to automatically produce a C reference implementation and test code for some of the routines found in these chapters.

    Here is the challenge and our proposed solution. Consider just the dot product routine dot, which computes

       r <- alpha * sum (i=1 to n) x(i)*y(i) + beta * r
    
    There are 32 versions of this subroutine, depending on the mathematical type (real or complex) and precision (single or double) of the input and output arguments. For example, in c_sdot all the arguments are real and single, whereas in c_zdot_d_d the arguments r, alpha, and beta are complex double while x and y are real double.

    Furthermore, each subroutine may contain 1, 2 or 3 implementations of the dot product, depending on the internal precision PREC requested by the user (a run-time case statement). For example, c_sdot only uses internal precision = single, but in c_sdot_x the user can specify single, double, or extended internal precision. (On an Intel machine, 80-bit extended would be a fourth possibility.)
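
    For concreteness, here is a minimal C sketch of the run-time dispatch in a routine like c_sdot_x. The function body, enum, and names are ours, not the draft standard's, and long double merely stands in for a true double-double accumulator in the extra-precision branch:

       enum prec { prec_single, prec_double, prec_extra };

       /* Sketch only: r <- alpha * sum(x[i]*y[i]) + beta * r, with the
          accumulation done in the internal precision chosen at run time. */
       float sdot_x(enum prec prec, int n, float alpha, const float *x,
                    const float *y, float beta, float r)
       {
           switch (prec) {
           case prec_single: {                /* accumulate in float */
               float t = 0.0f;
               for (int i = 0; i < n; i++) t += x[i] * y[i];
               return alpha * t + beta * r;
           }
           case prec_double: {                /* accumulate in double */
               double t = 0.0;
               for (int i = 0; i < n; i++) t += (double) x[i] * y[i];
               return (float) (alpha * t + beta * r);
           }
           default: {                         /* extra: double-double in the
                                                 real code; long double here */
               long double t = 0.0L;
               for (int i = 0; i < n; i++) t += (long double) x[i] * y[i];
               return (float) ((long double) alpha * t
                               + (long double) beta * r);
           }
           }
       }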

    Our goal is to automate the construction and testing of this enormous range of possible algorithms.

    To this end, we are using the M4 macro package to automatically generate all the subroutines and implementations. There are 3 levels of macros.

  • The top level generates the 32 subroutines with the correct names, type statements, and switch statements based on precision. This is very similar for all the BLAS. This currently consists of 247 lines inside the file dot.m4.
  • The middle level is the actual dot product algorithm. It is written exactly once, in a way that lets it support all combinations of types and precisions. It is about as long and complicated as a straightforward C implementation, the difference being that statements like prod = x(i)*y(i) are converted into M4 macro calls. This currently consists of 64 lines inside the file dot.m4.
  • The bottom level consists of (currently) 39 macros that are common to all the BLAS, and implement basic operations like multiplying two double precision numbers and returning their extended precision product. This currently consists of 493 lines inside the file cblas.m4.h. (A plain-C rendering of two such operations appears after this list.)
  • We also have macro code for symv, symmetric-matrix-vector multiplication.
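
    As a rough plain-C rendering of what two instantiations of the bottom-level MUL macro expand to (the function names are ours, and the real macros generate inline code rather than calls):

       /* MUL with float arguments and a float result: a plain product. */
       static void mul_s_s_s(float *c, float a, float b) { *c = a * b; }

       /* MUL with double arguments and an extra (double-double) result:
          Dekker's algorithm splits each operand into 26-bit halves and
          recovers the exact product as the unevaluated sum c_hi + c_lo. */
       static void mul_e_d_d(double *c_hi, double *c_lo, double a, double b)
       {
           const double split = 134217729.0;  /* 2^27 + 1 */
           double ap = a * split, a_hi = ap - (ap - a), a_lo = a - a_hi;
           double bp = b * split, b_hi = bp - (bp - b), b_lo = b - b_hi;
           *c_hi = a * b;
           *c_lo = ((a_hi * b_hi - *c_hi) + a_hi * b_lo + a_lo * b_hi)
                   + a_lo * b_lo;
       }

    On hardware with a fused multiply-add instruction, the low part can be computed as fma(a, b, -c_hi) instead; the splitting above is the portable fallback.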

    Testing

    We have designed but not yet built test code for all 32 variations. The challenge is again to use macros to generate all versions automatically. We also have the extra challenge of determining whether the implementation actually uses as much internal precision as the PREC argument promises.

    Here is the general scheme for the dot product. All flavors of matrix-vector multiplication and matrix-matrix multiplication will be tested the same way, based on our ability to test dot products.

    The idea is to choose r(in), alpha, x(1:n), y(1:n), beta, r(out,true) and a scale factor s so that

  • r(out,true) = alpha*sum(i=1 to n) x(i)*y(i) + beta*r(in) is as accurate as possible, and
  • e = | fl(r(out)) - r(out,true) | / s is a good approximation of the actual internal precision used to compute fl(r(out)). In other words, if the output fl(r(out)) of the dot-product routine being tested is computed with IEEE single internal precision, e should be O(2^(-24)) for most choices of r(in), alpha, x(1:n), y(1:n) and beta. Similarly, e should be O(2^(-53)) if fl(r(out)) is computed using IEEE double internal precision, and O(2^(-106)) if fl(r(out)) is computed using double-double (extra internal precision).
  • Here is the algorithm for choosing r(in), alpha, x(1:n), y(1:n), beta, r(out,true) and s:
    1)     Choose r(in), alpha, x(1:n), beta and y(1:n-k) randomly, or from the user
              (for some small k specified below)
    2)     for j=n-k+1 to n
    2.1)        Compute tmp = alpha*sum(i=1 to j-1) x(i)*y(i) + beta*r(in) as
                     accurately as possible (presumably in double-double)
    2.2)        y(j) = -round(tmp/(alpha*x(j))) all in working precision
           endfor
    3)     tmp = alpha*sum(i=1 to n) x(i)*y(i) + beta*r(in) as accurately as possible
    4)     r(out,true) = round(tmp)
    5)     s = |alpha|*sum(i=1 to n) |x(i)*y(i)| + |beta*r(in)|
    
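    Here is a minimal C sketch of steps 1-5 above. The function names are ours, and long double stands in for the double-double accumulation the real test code would use, so this version only makes sense for single-precision output:

       #include <math.h>

       /* Steps 2.1 and 3: alpha * sum(x[i]*y[i], i = 0..m-1) + beta * r_in,
          accumulated as accurately as this sketch allows. */
       static long double acc_dot(int m, float alpha, const float *x,
                                  const float *y, float beta, float r_in)
       {
           long double t = 0.0L;
           for (int i = 0; i < m; i++)
               t += (long double) x[i] * (long double) y[i];
           return (long double) alpha * t
                  + (long double) beta * (long double) r_in;
       }

       /* Given random r_in, alpha, beta, x[0..n-1], and y[0..n-k-1]
          (step 1), fill in y[n-k..n-1] to force cancellation (step 2),
          then return r(out,true) and the scale s (steps 3-5). */
       void gen_dot_test(int n, int k, float alpha, float *x, float *y,
                         float beta, float r_in, float *r_true, float *s)
       {
           for (int j = n - k; j < n; j++) {
               long double tmp = acc_dot(j, alpha, x, y, beta, r_in);
               y[j] = (float) (-tmp / ((long double) alpha * x[j]));  /* 2.2 */
           }
           *r_true = (float) acc_dot(n, alpha, x, y, beta, r_in);     /* 3, 4 */

           long double sl = 0.0L;                                     /* 5 */
           for (int i = 0; i < n; i++)
               sl += fabsl((long double) x[i] * y[i]);
           *s = (float) (fabsl((long double) alpha) * sl
                         + fabsl((long double) beta * r_in));
       }

    A driver would then run the dot product under test on these inputs and compare e = |fl(r(out)) - r(out,true)| / s against the thresholds listed above.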

    Note that the loop tries to choose y(n-k+1) through y(n) in order to create as much cancellation as possible, and make r(out,true) as close to zero as possible. Here is why. The usual error analysis of dot products says that a correctly computed fl(r(out)) should satisfy

          | fl(r(out)) - r(true) | <= O(eps_output)*|r(true)| + O(eps_internal)*s
    
    where r(true) is the mathematically correct answer. Since we only know r(out,true), where

          | r(out,true) - r(true) | <= O(eps_output)*|r(true)| + O(eps_doubledouble)*s
    
    we get
          | fl(r(out)) - r(out,true) | <= O(eps_output)*|r(true)| + 
                                          O(eps_internal+eps_doubledouble)*s
    
    or
          | fl(r(out)) - r(out,true) |/s <= O(eps_output)*|r(true)|/s + 
                                            O(eps_internal+eps_doubledouble)
    
    For this last quantity to be a good estimate of eps_internal, we need that |r(true)|/s be as small as possible (at most eps_doubledouble/eps_output). This is what the loop above tries to achieve.

    How big should k be, i.e. how many cancellations do we need? It depends on the relative sizes of eps_output and eps_doubledouble. Basically,

          k = ceiling( log(eps_doubledouble)/log(eps_output) )
    
    which is as large as ceiling(106/24) = 5 for eps_output = single. This means that we cannot easily test dot products with single output and extra internal precision unless n >= 6 (more likely 7 or 8).
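
    (For comparison, for eps_output = double the same formula gives k = ceiling(106/53) = 2, so n >= 3 suffices there.)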

    Notice that we essentially need a "trusted" highly accurate dot product in order to build the rest of our test code. This trusted implementation will need separate unit testing outside this scheme.

    Other Languages Besides C

    Although I have not tried it, I believe similarly structured macros could generate F77 code as well.

    As long as F95 code eventually called the generated F77 routines, that would help here too.

    Let me note that conventional operator overloading, where the implementation of each binary operation depends only on the argument types and not on the result type, is not enough for our purposes. It cannot implement extra = double*double correctly, for example. The alternative would be to promote one argument to the output type, but this is unnecessarily inefficient (extra = extra*double is significantly more expensive than extra = double*double). This may limit the utility of operator overloading in F95 and C++.
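
    To make this concrete (a worked example, using the mul_e_d_d sketch from the macro discussion above): take a = 1 + 2^(-27) and b = 1 - 2^(-27), both exactly representable in double. The true product is 1 - 2^(-54), but "*" rounding to double returns exactly 1, and assigning that 1 to an extra precision variable cannot recover the lost 2^(-54). A product that knows its result is extra precision returns both halves: c_hi = 1 and c_lo = -2^(-54).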

    Related Work by Andrew Lumsdaine

    Andrew Lumsdaine has sent us a pointer to his Matrix Template Library (MTL), which is written in C++. Let me compare and contrast the two efforts. Andrew et al. use powerful features of the C++ language to provide a much more extensive implementation of various linear algebra routines than we are attempting here, including sparse BLAS, interfacing to LAPACK, etc. They are also clearly much farther along than we are.

    I can't tell from my brief reading of his web page how he deals with many flavors of mixed precision, possible extended precision, or testing it all. But I have not had time to read all the material.

    From the point of view of the BLAST Forum, where we voted not to have a C++ interface, I wonder whether we could then go ahead and use C++ for our reference implementation. Presumably we would need "wrappers", but I'm not sure how we would construct them, or whether they would take much less effort than what we have been trying to do at Berkeley.

    I look forward to a discussion of these issues at the BLAST meeting in March 99.

    Future Work

    I hope that we can come to a consensus approach to producing a reference implementation at the next BLAST meeting. If there turns out to be consensus around the approach presented here, then it would be natural to divide the work by subroutine, using our DOT implementation as a model. There is clearly a lot to do.

    There is the possibility of an F77 version, as described above.

    It might be possible to use the same code templates for the interval arithmetic versions of the routines in Chapter 5, but I have not pursued this.

    The sparse routines in Chapter 3 seem sufficiently different in scope that I do not see any big advantage of this approach there. If they go for extended precision versions some day, then maybe.

    Everything we have discussed so far applies to all the BLAS that "look like dot products", i.e. matrix-vector multiplication, matrix-matrix multiplication, etc. But it does not apply so simply to solving triangular systems with extended internal precision.

    First, solving triangular systems with extended internal precision requires mallocing an array of extra precise temporary variables, since we need to keep the solution in extended to perform substitution, before finally rounding at the end. (Blocked implementations of matrix-matrix multiplication will require this too, but not the reference implementation.) This is not hard, but it is different, and mallocing can be slow. Perhaps we should not care about this in a reference implementation. (Note that using all stride-1 access in GEMV also would require mallocing a temporary array of extra precision variables, which we have chosen not to do in our current efforts.)
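
    Here is a minimal sketch of the point, with our own names, long double standing in for the double-double temporaries, and row-major storage assumed. The forward substitution must keep every computed solution component in the wide type until the end, which is why a temporary array has to be allocated:

       #include <stdlib.h>

       /* Solve L*x = b for lower triangular L, keeping the intermediate
          solution in extra precision and rounding only once at the end.
          Returns -1 if the temporary array cannot be allocated. */
       int trsv_lower_extra(int n, const double *L, int ldl,
                            const double *b, double *x)
       {
           long double *t = malloc((size_t) n * sizeof *t);
           if (t == NULL) return -1;
           for (int i = 0; i < n; i++) {
               long double s = b[i];
               for (int j = 0; j < i; j++)
                   s -= (long double) L[i * ldl + j] * t[j];  /* wide t[j] */
               t[i] = s / L[i * ldl + i];
           }
           for (int i = 0; i < n; i++)
               x[i] = (double) t[i];                 /* final rounding */
           free(t);
           return 0;
       }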

    Second, we have not figured out how to test the accuracy of triangular solves as straightforwardly as for dot products, where we can easily construct problems whose answers are known ahead of time to extended-precision accuracy. We are still thinking about this.

    Sample Software for DOT PRODUCT (as of 3/14/99)

  • cblas.m4.h - Shared macros for all BLAS, containing operations like MUL(c, c_type, a, a_type, b, b_type) which does c=a*b with the appropriate algorithm, depending on the types. (493 lines)
  • cblas.h - Used by cblas.m4.h (40 lines)
  • dot.m4 - Code specifically for generating the dot product. (311 lines).
  • dot.c - automatically generated C code for all versions of the dot product (6936 lines). No human intervention should be needed with this file, or with the analogous files for other routines.
  • symv.m4 - Code specifically for generating the symmetric-matrix-vector product. (376 lines).
  • symv.c - automatically generated C code for all 32 versions of symv (19613 lines). No human intervention should be needed with this file, as before.