CS 4: Lecture 23
                            Monday, April 17, 2006

The Nelder-Mead Simplex Algorithm
---------------------------------
This algorithm (not to be confused with the even more famous simplex algorithm
for linear programming) is a local search algorithm like grid search, but
improves upon it in several ways.  First, it adapts its step sizes as it
searches, and can make them longer as well as shorter.  Second, it saves time
by performing fewer evaluations of the objective function f per iteration.

A _simplex_ is a simple geometric shape defined by connecting together d + 1
vertices in d-dimensional space.  For example, a 1D simplex is an edge; a 2D
simplex is a triangle; a 3D simplex is a tetrahedron.  You can have simplices
of any dimension.  If we're optimizing a function on d parameters, then we're
searching in a d-dimensional parameter space, and our simplex has d + 1
vertices.

The Nelder-Mead simplex algorithm always maintains a set of d + 1 vertices, and
the values of the objective function at those vertices.  Each iteration
repositions one (or occasionally most) of the vertices.  The simplex must never
become _degenerate_.  That means no three vertices can lie on a common line; no
four vertices can lie on a common plane; and so on.

We begin by choosing d + 1 vertices to start with.  For example, if each
parameter ranges from -5 to 5, we could try

    x_0 = (0, 0, ..., 0),
    x_1 = (1, 0, ..., 0),
    x_2 = (0, 1, ..., 0),
    ...
    x_d = (0, 0, ..., 1).

Evaluate the objective function f(x_i) at each vertex x_i, 0 <= i <= d.  Let
k be the index of the _worst_ vertex; that is, f(x_k) is as large or larger
than the objective values at the other vertices.

The simplex algorithm repeats the following loop.

1 Let x_k be the worst vertex of the simplex, and let y be the centroid
  (average) of all the vertices _except_ x_k.

    y = sum   x_i.
       i != k

2 Let z = 2y - x_k be the _reflection_point_ of x_k through the centroid.  For
  example, in the following figure, x_2 is the worst vertex, y is the centroid
  of x_1 and x_3, and z is the reflection of x_2 through y.

      z<-_    O x_3                              z O-------O x_3
          -_                                        \  new |\        
            -_                                       \\    | \\
              y-_                                      \   |   \
                 -_                                     \\ |    \\
                   -_                                     \| old  \
              O      -O                                    O-------O 
             x_1     x_2                                  x_1     x_2

3 If z is better than x_k--in other words, if f(z) < f(x_k)--then we replace
  x_k by z.  In this manner, the simplex "walks" toward better and better
  positions in the parameter space.  Notice that a reflection step replaces the
  simplex with a rotated version of the simplex, but doesn't change its shape.

  However, if z is the _best_ vertex--in other words, if f(z) < f(x_i) for
  every vertex x_i of the simplex--then we might have found an unusually good
  direction, and we should see if the objective function will get even better
  if we continue in that direction.  So we compute an _expansion_vertex_
  w = 2z - y, and check the objective function at w.

      w<-_                               w O--__
          -_                                \_  --___
            -_                                \_     --__
              z<-_    O x_3                     \_       --O x_3
                  -_                              \_  new  |\
                    -_                              \_     | \\
                      y-_                             \_   |   \
                         -_                             \_ |    \\
                           -_                             \| old  \
                      O      -O                            O-------O
                     x_1     x_2                          x_1     x_2

  If f(w) < f(z), we replace x_k with w (instead of z).  Otherwise, we replace
  x_2 with z as usual.  In the former case, the new simplex has a different
  shape than the old one--it's elongated in the successful search direction.

  If either w or z is better than x_k, this iteration is over, and we start the
  loop from the beginning.

4 If z is not better than x_k, the far size of y might just be bad territory
  for searching.  Instead, we compute a _contraction_vertex_ v = 0.5 (x_k + y),
  halfway between the worst vertex x_k and the centroid y.  If f(v) < f(x_k),
  we replace x_k with v, end the iteration, and start the loop over.

              O x_3                                        O x_3
                                                           |\        
                                                           | \\
              y                                            |   \
                  v                                        |  _O_\
                   -_                                      |_-new-\
              O      -O                                    O-------O 
             x_1     x_2                                  x_1     x_2

5 If we still haven't found a point better than x_k, we perform a _shrink_step_
  like in the grid search algorithm.  Let x_j be the _best_ vertex of the
  simplex--the one whose objective value f(x_j) is least.  We shrink the
  simplex by moving every vertex halfway toward x_j--that is, we replace each
  x_i with 0.5 (x_i + x_j).  The vertex x_j doesn't move, because it's the best
  vertex the search algorithm has ever found, and we don't want to forget it
  until we find a better one.

              O x_3                                        o
              |\                                           |
              | \\                                         v
              |   \                 ===>                   O x_3
              |    \\                                      |\_
              |      \                                     |  \
              O-------O                                    O---O<--o
             x_1     x_2                                  x_1 x_2

  The idea of the shrink step is that we're probably near a local minimum, so
  we refine the simplex so we're searching a smaller region around x_j.

  The simplex algorithm terminates when the shortest edge of the simplex falls
  below a certain size.

If the search space is bounded (not infinite in all directions), the easiest
way to prevent the simplex algorithm from searching outside the domain is to
have f(x) return a very bad (large) value for any point x outside the domain.

An interesting property of this algorithm is that the expansion and contraction
steps can make the simplex long and thin.  In this way, the algorithm can adapt
the simplex so its size along the different axes is tuned to the different
parameters.  This is particularly useful if the parameters have different units
of measurement--for example, engine temperature, air/gasoline mixture, and
engine RPMs.  With grid search, you have to guess in advance how to scale the
different axes to make the objective function "isotropic".  But the simplex
algorithm can often figure it out for you.

Combined Global & Local Search
------------------------------
An important observation is that local search algorithms like grid walk and
simplex often find different local minima if you start them from different
starting points.  Global search algorithms like random search and grid search
are good at finding starting points, but they're bad at homing in on local
minima.  So let's combine them.  Here are some ways to do so.

- Use grid search (with a very coarse grid) to find a good starting point.
  Then run grid walk from there, with an initial step size of half the original
  grid size.

- Use grid search to find 20 good starting points.  (20 is just a number; pick
  a number based on how long you're willing to wait.)  You could take the 20
  best points on the grid, but it's probably better to try to space the points
  out by choosing grid points that aren't near an even better grid point.  Then
  do simplex search once from each of those 20 points, with each starting
  simplex being half the size of the grid.  Choose the overall winner out of 20
  runs of the simplex algorithm.