Notes for Ma221 Lecture 1, Aug 29, 2014

Greetings!
   class URL: www.cs.berkeley.edu/~demmel/ma221_Fall14
   Prereqs: good linear algebra, 
            programming experience in Matlab (other languages welcome,
               but homework assignments will be provided in Matlab)
            numerical sophistication at the level of Ma128ab desirable
   Office Hours (subject to change): T 9-10, Th 1-2, F 1-2
   First Homework on web page - due Monday Sep 8
   Reader - recruiting now!
   Enrollment: hopefully enough room for everyone 
   No final, will be project instead
   Status of books in bookstore - should be there
   Fill out survey - what we cover may depend on what you say
      you are interested in

To motivate the syllabus, we tell a story about a typical office hours meeting:
A student says: "I need to solve an n-by-n linear system Ax=b. What should I do?"
Professor replies: "The standard algorithm is Gaussian Elimination (GE),
     which costs (2/3)*n^3 floating point operations (flops)."
   Student: "That's too expensive."
   Professor: "Tell me more about your problem."
   S: "Well, the matrix is real and symmetric, A=A^T."
   P: "Anything else?"
   S: "Oh yes, it's positive definite, x^T*A*x > 0 for all nonzero vectors x"
   P: "Great, you can use Cholesky, it costs only (1/3)*n^3 flops, half as much."
 The professor also begins to record their conversation in a "decision tree",
 where each node represents an algorithm, each edge represents a property of the matrix,
 and the arrows point to the next node/algorithm depending on the answer.
   S: "That's still too expensive."    P: "Tell me more about your matrix"
   S: "It has a lot of zeros it, in fact all zeros once you're a distance
       n^(2/3) from the diagonal." 
   P: "Great, you have a band matrix with bandwidth bw=n^(2/3), so there is a
       version of Cholesky that only costs O(bw^2*n) = O(n^(7/3)) flops, much cheaper!"
   S: "Still too expensive."  P: "So tell me more."
   S: "I need to solve the problem over and over again, with the same A and different b,
       so should I just precompute inv(A) once and multiply by it?"
ASK&WAIT: Is this a good idea?
   P: "inv(A) will be dense, so just multiplying by it costs 2*n^2 flops, 
       but you can reuse the output of Cholesky (the L factor) to solve for each b
        in just O(bw*n) = O(n^(5/3)) flops."
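   (Aside: a minimal Matlab sketch of "factor once, solve many times"; the banded
    SPD test matrix below is just a stand-in for the student's actual A.)
        n = 1000; bw = 10;
        B = randn(n, bw+1);
        A = spdiags(B, 0:bw, n, n);
        A = A*A' + n*speye(n);        % sparse, banded, symmetric positive definite
        R = chol(A);                  % upper triangular factor, A = R'*R (R' is the L factor); done once
        for k = 1:5                   % many right-hand sides, same A
            b = randn(n,1);
            x = R \ (R' \ b);         % two banded triangular solves, O(bw*n) flops each
        end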
   S: "That's still too expensive."   P: "Tell me more."
   S: "There are actually a lot more zero entries, just at most 7 nonzeros per row."
   P: "Let's think about using an iterative method instead of a direct method, 
       which just needs to multiply your matrix times a vector many times, updating
       an approximate answer until it is accurate enough."
   S: "How many matrix-vectors multiplies will I need to do, to get a reasonably 
       accurate answer?"
   P: "Can you say anything about the range of eigenvalues, say the
        condition number = kappa(A) = lambda_max / lambda_min?"
   S: "Yes, kappa(A) is about n^(2/3) too."
   P: "You could use the conjugate gradient method, which will need about
       sqrt(kappa(A)) iterations, so n^(1/3). With at most 7 nonzeros
       per row, matrix-vector multiplication costs at most 14n flops, so
       altogether O(n^(1/3)*n) = O(n^(4/3)) flops. Happy yet?"
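   (Aside: a minimal Matlab sketch of the iterative approach using the built-in
    conjugate gradient routine pcg; the sparse SPD model problem returned by
    gallery('poisson',k), with at most 5 nonzeros per row, stands in for the student's matrix.)
        k = 100;
        A = gallery('poisson', k);    % k^2-by-k^2, sparse, symmetric positive definite
        b = ones(size(A,1), 1);
        tol = 1e-8; maxit = 1000;
        [x, flag, relres, iter] = pcg(A, b, tol, maxit);
        iter                          % iteration count grows like sqrt(kappa(A))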
   S: "No."  P: "Tell me more."
   S: "I actually know the largest and smallest eigenvalues, does that help?"
   P: "You know a lot about your matrix. What problem are you really trying to solve?"
ASK&WAIT:
   S: "I have a cube of metal, I know the temperature everywhere on the surface,
       and I want to know the temperature everywhere inside."
   P: "Oh, you're solving the 3D Poisson equation, why didn't you say so!
       Your best choice is either a direct method using an FFT = Fast Fourier Transform
       costing O(n log n) flops, or an iterative method called multigrid, costing
       O(n) flops. And O(n) flops is O(1) flops per component of the solution,
       you won't do better."
   S: "And where can I download the software?" ...

This illustrates an important theme of this course, exploiting the mathematical
structure of your problem to find the fastest solution. The Poisson equation
is one of the best studied examples, but the number of interesting mathematical
structures is bounded only by the imaginations of people working in math, science, 
engineering and other fields, who come up with problems to solve, so we will only
explore some of the most widely used structures and algorithms in this course.
   
The story above suggests that counting floating point operations is the right metric 
for choosing the fastest algorithm. In fact, other metrics may be much more important.
Here are some other metrics:
(1) Fewest keystrokes:  eg "A\b" to solve Ax=b. More generally, the metric is finding
    an existing reasonable implementation with as little human effort as possible.  
    We will try to give pointers to the best available implementations, usually in 
    libraries.  There are lots of pointers on the class webpage (eg netlib, GAMS) (demo).
    A\b invokes the LAPACK library, which is also used as the basis of the libraries 
    used by most computer vendors, and has been developed in a collaboration
    between Berkeley and other universities over many years.
(2) Fewest flops = floating point operations.
    ASK&WAIT: how many operations does it take to multiply 2 nxn matrices?    
    Classical: 2*n^3
    Strassen (1969):  O(n^(log2 7)) ~ O(n^2.81) - sometimes practical
    Coppersmith/Winograd (1987):  O(n^2.376)  - not practical
    Umans/Cohn: O(n^2.376), maybe O(n^2)? reduces matrix multiplication to a
          group theory problem (FOCS 2003) - not yet practical
    Williams (2013): O(n^2.3728642)
    Le Gall  (2014): O(n^2.3728639)
    Demmel, Dumitriu, Holtz (2008): all the other standard linear algebra problems
       (solve Ax=b, eigenvalues, etc.) have algorithms that can be as fast as matmul
       (and backward stable) - the ideas behind some of these algorithms are practical,
       and will be discussed
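    (Aside: here is a minimal recursive Strassen sketch in Matlab, assuming n is a power
     of 2; it is only meant to show where the 7 recursive multiplies, and hence the
     O(n^(log2 7)) flop count, come from.)
        % strassen.m
        function C = strassen(A, B, cutoff)
          n = size(A,1);
          if n <= cutoff
            C = A*B;               % classical (BLAS) multiply below the cutoff
            return
          end
          m = n/2; i1 = 1:m; i2 = m+1:n;
          A11 = A(i1,i1); A12 = A(i1,i2); A21 = A(i2,i1); A22 = A(i2,i2);
          B11 = B(i1,i1); B12 = B(i1,i2); B21 = B(i2,i1); B22 = B(i2,i2);
          M1 = strassen(A11+A22, B11+B22, cutoff);   % 7 recursive multiplies...
          M2 = strassen(A21+A22, B11,     cutoff);
          M3 = strassen(A11,     B12-B22, cutoff);
          M4 = strassen(A22,     B21-B11, cutoff);
          M5 = strassen(A11+A12, B22,     cutoff);
          M6 = strassen(A21-A11, B11+B12, cutoff);
          M7 = strassen(A12-A22, B21+B22, cutoff);   % ...plus O(n^2) additions
          C  = [M1+M4-M5+M7, M3+M5; M2+M4, M1-M2+M3+M6];
        end
     A quick check: with A = randn(512); B = randn(512); the relative error
     norm(strassen(A,B,64) - A*B)/norm(A*B) is near machine precision (Strassen satisfies
     a weaker, normwise error bound than the classical algorithm).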

    But counting flops is not the only important metric in today's world (or tomorrow's), for two reasons:
    (1) ASK&WAIT: who knows Moore's Law?
       Until 2007, computers doubled in speed every 18 months with no code
       changes. This has ended; instead, the only way computers can run faster
       is by having multiple processors, so all code that needs to run faster
       (not just linear algebra!) has to change to run in parallel.
       This is causing upheaval in the computer industry, and generating lots of
       research and business opportunities.  We will discuss parallel versions of 
       some algorithms, a number of which are different numerically from the best
       sequential algorithms, in terms of error properties, how the answer is
       represented, etc.
       (This is also discussed in more detail in CS267, taught in the Spring).

    (2) ASK&WAIT: what is most expensive operation in a computer? Is it doing arithmetic?
        No: it is moving data, say between DRAM and cache, or between processors over 
        a network.  You can only do arithmetic on data stored in cache, not in DRAM,
        or on data stored on the same parallel processor, not different ones
        (draw pictures of basic architectures).
        It can cost 10x, 100x, even 1000x more to move a word than to do an add/sub/multiply.
         Ex: Consider adding two nxn matrices C=A+B. The cost
             is n^2 reads of A (moving each A(i,j) from main memory to the 
             CPU, i.e. the chip that does the addition), n^2 reads of B,
             n^2 additions, and n^2 writes of C. The reads and writes cost
             O(100) times as much as the additions.
          Ex: nVIDIA GPU (circa 2008) attached to a CPU: it cost 4 microseconds
              just to call a subroutine on the GPU, in which time the GPU could have
              done over 1e6 flops (4e-6 s * 300e9 flops/s), since it ran at 300 GFlops/s.
          Technology trends are making this worse: arithmetic has been getting faster
              by about 60%/year, but the speed of getting data from DRAM by only about 23%/year.
         Consequence: two different algorithms for the same problem, even if
           they do the same number of arithmetic operations, may differ vastly 
           in speed, because they do different numbers of memory moves.
          Ex: the speed difference between matmul written in the naive way and an
            optimized implementation is easily 40x; the same holds for other
            operations (this is again a topic of CS267).
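          (Aside: one way to see this gap from Matlab is to compare a naive triple loop
           with the built-in A*B, which calls an optimized, cache-blocked BLAS; in Matlab
           the interpreter overhead exaggerates the gap well beyond the pure memory-movement
           effect, but the point stands.)
            n = 500;
            A = randn(n); B = randn(n);
            tic; C1 = A*B; t_blas = toc;      % cache-blocked, vendor-optimized multiply
            tic;
            C2 = zeros(n);
            for i = 1:n
              for j = 1:n
                for k = 1:n
                  C2(i,j) = C2(i,j) + A(i,k)*B(k,j);
                end
              end
            end
            t_naive = toc;
            [t_naive/t_blas, norm(C1-C2)/norm(C1)]   % both do 2*n^3 flops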

          We have recently discovered new algorithms for most of the problems
          discussed in this class that provably minimize data movement,
         and are much faster than the conventional algorithms.
         We have also gotten grants to replace the standard libraries 
         (called LAPACK and ScaLAPACK) used by essentially all companies 
         (including Matlab!). So you (and many others) will be using these 
         new algorithms in the next few years (whether you know it or not!).
         This is ongoing work with lots of open problems and possible class projects.

(3) ASK&WAIT: So far we have been talking about minimizing the time to solve a problem.
    Is there any other metric besides time that matters?
    Yes: energy. It turns out that a major obstacle to Moore's Law continuing as it has
    in the past is that it would take too much energy: the chips would get so hot they
    would melt if we tried to build them the same way as before. And whether you are
    concerned about the battery in your laptop dying, or the $1M per megawatt per year 
    it costs to run your datacenter or supercomputer, people are looking for ways to 
    save energy. 
    ASK&WAIT: Which operations performed by a computer cost the most energy?
    Again, moving data can cost orders of magnitude more energy per operation than 
    arithmetic, so the algorithms that minimize communication can also minimize
    energy.
      
Another office hours story:
  Someone from industry: "Could you help me solve a least squares problem, that is 
     find x to minimize the 2-norm of A*x-b, where A has more rows than columns."
  Professor: "How accurately do you need the solution?"
  SFI: "I suppose it would be easiest for my customers if I could promise them
        the answer was always correct to 16 digits."
  P: "That will be expensive, there are hard cases where your matrix is exactly
      low rank, which means the answer is not unique, and to provably identify
      and report that to the user, we'll need arbitrary precision arithmetic,
      which can be quite slow."
  SFI: "That's too expensive. What if I don't promise them what happens if they
        give me a low-rank matrix (for which I can blame them for giving me an
        ill-posed problem)? What if I just promise that it is accurate for full-rank 
        problems?"
  P: "When your matrix is close to low-rank, something we call ill-conditioned,
      then we still may need arbitrary precision to get an accurate answer.
      But we can promise an accurate answer as long as you are not too close
      to a low-rank matrix, say at least a distance 1e-16 away, but we will
      need to do some 128-bit ("quadruple precision") arithmetic, so 
      it still won't be as cheap as possible."
  SFI: "My customers don't even give me the data to 16 digits, so I think
        that would be overkill."
  P: "In that case, you might want to use the standard algorithm, which is
      'backward stable', which means that it is actually equivalent (in the 
      usual 16-digit double precision arithmetic) to perturbing the input
      data in the 16-th digit, and then computing the exact answer:
        output of the algorithm =  f_alg(x) = f_exact(x + dx) 
      where x is your data, x+dx is the perturbed data, and f_exact() is
      the mathematically exact solution that the algorithm f_alg() approximates.
      This makes almost all users happy, getting exactly the right answer
      to almost the same problem they wanted to solve, since they usually
      only know the problem approximately anyway.
      And there are low cost ways to estimate an error bound, i.e. a bound
      on the error of the form
        error = f_alg(x) - f_exact(x) = f_exact(x+dx) - f_exact(x) ~ f_exact'(x)*dx
      using calculus to estimate f_exact'(x), which may be large if the input
      matrix is close to low-rank. The size of f_exact'(x), appropriately scaled,
      is called the condition number of the problem x."
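   (Aside: a small numerical illustration of this error bound, using a linear system
    with the famously ill-conditioned Hilbert matrix; the example is mine, not part
    of the conversation.)
        n = 12;
        A = hilb(n);                  % cond(A) is roughly 1e16
        x_true = ones(n,1);
        b = A*x_true;                 % itself rounded, i.e. perturbed at the eps level
        x_hat = A\b;                  % backward-stable solve in double precision
        rel_err   = norm(x_hat - x_true)/norm(x_true)
        err_bound = cond(A)*eps       % condition number * backward error; often pessimistic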
  SFI: "Actually, I work on Wall Street, and probably getting an answer that
        probably has a not-too-large error is probably good enough."
  P: "In that case you should consider a randomized algorithm that has just
      those properties, and is much cheaper than the standard algorithm, 
      costing as little as O(nnz) where nnz = number of nonzeros in your
      matrix, vs O(m*n^2) where m = #rows and n = #columns."
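   (Aside: to make the last remark concrete, here is one simple "sketch-and-solve"
    variant in Matlab. The dense Gaussian sketch used here is the simplest to write
    down, though not itself cheap to apply; the O(nnz) methods alluded to above use
    sparser or structured random transforms. It shows the idea of trading a little
    accuracy, with high probability, for speed.)
        m = 100000; n = 50;
        A = randn(m,n); b = randn(m,1);
        k = 8*n;                      % sketch size; this rule of thumb is an assumption
        S = randn(k,m)/sqrt(k);       % random Gaussian sketching matrix
        x_sketch = (S*A) \ (S*b);     % solve the much smaller k-by-n least squares problem
        x_exact  = A \ b;             % standard backward-stable solve, for comparison
        norm(A*x_sketch-b)/norm(A*x_exact-b)   % typically only slightly larger than 1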

There is one more kind of "accuracy" demand that a user might make:
to get the bit-wise identical answer every time they run the program
(or call the same subroutine with the same input). This seems like
a reasonable expectation, and many users rely on it for debugging
(to be able to reproduce errors) and for correctness checking. But because of
parallelism, this is no longer guaranteed, or even likely to occur.
We will return to this when we talk about floating point arithmetic.
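(Aside: a tiny Matlab demonstration of why: floating point addition is not associative,
 so summing the same numbers in a different order, as different parallel schedules do,
 can change the last bits of the result.)
        x = randn(1e6,1);
        s1 = sum(x);                  % one summation order
        s2 = sum(x(end:-1:1));        % the same numbers, summed in reverse order
        s1 - s2                       % typically a small nonzero difference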

This illustrates another theme of the semester: there is a tradeoff between
accuracy and speed; "backward stability", getting the exact answer for
almost the right problem, is a property of many of the algorithms we will discuss;
and we can estimate condition numbers to get error bounds if desired.
      
To summarize the syllabus of the course:
   Ax=b, least squares, eigenproblems, Singular Value Decomposition (SVD) and variations
   Direct methods (aka matrix factorizations: LU, Cholesky, QR, Schur form etc.)
   Iterative methods (eg Jacobi, Gauss-Seidel, Conjugate Gradients, Multigrid, etc.)
   Deterministic and randomized algorithms
   Shared themes: 
      exploiting mathematical structure (eg symmetry, sparsity, ...)
      numerical stability / condition numbers - how accurate is my answer?
      efficiency (minimizing flops and communication = data movement)
      finding good existing software