CS 70 - Lecture 22 - Mar 11, 2011 - 10 Evans

Goal for today (Note 13): Hash functions

Recall data structures for storing n items of data:
  Simplest: unsorted list
    cost = constant to add something to the end of the list
    cost = proportional to n to look something up
  Better: sorted list
    cost = proportional to n*log n to sort given all the data,
           so log n per item; adding a new item trickier
    cost = proportional to log n to look something up
  Best: constant time to add something, or look up something: Hash table

A Hash table is a data structure for quickly storing
and looking up data, usually in constant time per item,
if designed properly:

Given keys in some set S, a hash function h:S ->[0,n-1]
maps each key into an integer from 0 to n-1, which is used to
look up the data in a list. So there is one List(i) (possibly empty)
for each i in the range [0,n-1].
   insert(key,data)
     add (key,data) to List(h(key))
   data = find(key)
     if (key,data) stored in List(h(key)), return data, else "empty"
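
The insert/find operations above can be sketched in Python as follows. This
is only a minimal illustration, not a definitive implementation: the class
name ChainedHashTable, the default choice h(key) = hash(key) mod n, and the
decision to overwrite the data when a key is inserted twice are all
assumptions that do not come from the notes.

```python
class ChainedHashTable:
    """n lists; List(i) holds the (key, data) pairs with h(key) == i."""

    def __init__(self, n, h=None):
        self.n = n
        # Assumption: if no h is supplied, use Python's built-in hash mod n.
        self.h = h if h is not None else (lambda key: hash(key) % n)
        self.lists = [[] for _ in range(n)]   # List(0), ..., List(n-1)

    def insert(self, key, data):
        bucket = self.lists[self.h(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                      # assumption: overwrite on repeat
                bucket[i] = (key, data)
                return
        bucket.append((key, data))

    def find(self, key):
        for k, data in self.lists[self.h(key)]:
            if k == key:
                return data
        return None                           # plays the role of "empty"
```

Both operations scan only List(h(key)), so their cost is proportional to
the length of that one list.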

For this to work well, all the List(i) have to be short, so adding to
or searching List(i) takes a constant amount of time. This depends
on the hash function h(x).

A really bad h(x) would be the function h(x) = 0 for all x, so
everything would be stored in List(0), and inserting and finding
data would be as slow as using a single list.

Ideally, if the number of lists is at least as large as the number
of data items, i.e. n >= |S|, then each list will have at most 
one data item. In other words h(x) would spread out the data as
evenly as possible across List(0),...,List(n-1).

There are a lot of different kinds of hash functions that try to
achieve this. One example is h(x) = x mod n. This works well if
the rightmost digits of x are likely to be uniformly distributed.
If this is not the case then h(x) = a*x mod n, where a is a large 
number such that gcd(a,n) = 1, might be a better choice. 
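
To see why the rightmost digits matter for h(x) = x mod n, here is a small
sketch that counts how many keys land in each list. The helper name
list_lengths and the two key sets are made up for illustration.

```python
from collections import Counter

def list_lengths(keys, n):
    """Map each list index i to the number of keys x with x mod n == i."""
    return Counter(x % n for x in keys)

n = 100
spread = list_lengths(range(100), n)            # last two digits all differ
clumped = list_lengths(range(0, 1000, 10), n)   # every key ends in the digit 0

print(len(spread), max(spread.values()))    # 100 1
print(len(clumped), max(clumped.values()))  # 10 10
```

Both key sets contain 100 keys, but the second one uses only 10 of the 100
lists, each holding 10 items.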

Designing good hash functions is a topic for another class.
Here we assume we have done a good job, so that using h(x) is
like picking a random integer from [0,n-1].
In other words, inserting m data items is like throwing m balls
at random into n bins (Lists).
This lets us use probability theory to ask interesting questions, like
  How long is the longest list likely to be?
  How big does n = #Lists have to be compared to m = #keys, so that
    the probability of having a long List is small?

We will start with the simpler question: how big does n have to be 
so that the chance of a List having more than 1 item is less than 1/2?
Clearly n has to be at least m, even if h(x) did a perfect job of
distributing the keys uniformly across the lists. But since
we are throwing balls into bins at random, n will have to be larger.

So let's compute P(E) where
  E = {m balls thrown into n bins with no "collisions"}
Each possible outcome (where each ball lands) is equally likely,
so we just need to count 
  |E| =     n    ... number of ways to throw first ball
        x (n-1)  ... number of ways to throw second ball
        x (n-2)  ... number of ways to throw third ball
        x  ...
        x (n-m+1)  ... number of ways to throw m-th ball
and divide by
  |S| = |{all ways to throw m balls into n bins}|
      =   n ... number of ways to throw first ball
        x n ... number of ways to throw second ball
        x ...
        x n ... number of ways to throw m-th ball
      = n^m
(Here S denotes the sample space, not the set of keys used earlier.)
So P(E) = |E|/|S| = prod_{i=0 to m-1} (n-i)/n
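
As a quick check of this formula, the product can be evaluated exactly with
rational arithmetic. The function name p_no_collision is hypothetical; the
formula is the one just derived.

```python
from fractions import Fraction

def p_no_collision(m, n):
    """Exact P(E) = prod_{i=0 to m-1} (n-i)/n for m balls thrown into n bins."""
    p = Fraction(1)
    for i in range(m):
        p *= Fraction(n - i, n)
    return p

print(p_no_collision(2, 4))   # 3/4
print(p_no_collision(3, 4))   # 3/8
print(p_no_collision(5, 4))   # 0
```

The last line illustrates that with more balls than bins (m > n) a
collision is certain, because the factor with i = n is zero.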

Here is another way to compute the same results, using a result
on conditional probability from last time:
  P(A_1 inter A_2 inter ...  inter A_m)
    =  P(A_1)
     * P(A_2 | A_1)
     * P(A_3 | A_1 inter A_2 )
     * P(A_4 | A_1 inter A_2  inter A_3 )
     * ...
     * P(A_m | A_1 inter A_2 inter ... inter A_{m-1} )
  Let A_i = {i-th ball does not collide with balls 1...i-1}
  Then A_1 inter ... inter A_m
    = {for i in [1,m], ball i does not collide with balls 1 through i-1}
    = {for i neq j, both in [1,m], ball i does not collide with ball j}
  and our goal is to compute P(A_1 inter ... inter A_m):
  P(A_1) = 1, and
  P(A_i | A_1 inter ... inter A_{i-1}) = (n-i+1)/n
    because A_1 inter ... inter A_{i-1} means that balls 1 to i-1
    occupy i-1 different bins, so there are n-i+1 empty bins for the
    i-th ball to land in to avoid collisions.
  Multiplying these together,
    P(A_1 inter ... inter A_m) = prod_{i=1 to m} (n-i+1)/n
                               = prod_{i=0 to m-1} (n-i)/n
  as above.
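
The product formula can also be sanity-checked by simulation: throw m balls
into n bins many times and count how often no two balls share a bin. This is
a rough sketch only; the function names, the number of trials, and the fixed
random seed are arbitrary choices made for illustration.

```python
import random

def simulate_no_collision(m, n, trials=20000, seed=0):
    """Estimate P(no collision) by throwing m balls into n bins repeatedly."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        bins = [rng.randrange(n) for _ in range(m)]
        if len(set(bins)) == m:       # all m balls landed in different bins
            successes += 1
    return successes / trials

def exact_no_collision(m, n):
    """The product prod_{i=1 to m-1} (n-i)/n, in floating point."""
    p = 1.0
    for i in range(1, m):
        p *= (n - i) / n
    return p

print(simulate_no_collision(10, 100), exact_no_collision(10, 100))
```

The two printed numbers should agree to within the sampling error of the
simulation.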

Given n, we seek the value of m that makes
 P(m balls thrown into n bins without collision)
    = prod_{i=1 to m-1} (n-i)/n      ... the i=0 factor equals 1
    = prod_{i=1 to m-1} (1- i/n)
    = p(m,n)
equal to .5 (or just smaller).

We compute instead ln p(m,n) = sum_{i=1 to m-1} ln(1 - i/n)
   ln(1 - x) = -x - x^2/2 - x^3/3 - ...  Taylor expansion
             ~ -x when x is small (i.e. when i/n is small, which holds
                  for all i < m when m is much smaller than n)
   ln p(m,n) = sum_{i=1 to m-1} ln(1 - i/n)
             ~ sum_{i=1 to m-1} -i/n
             = (-1/n) sum_{i=1 to m-1} i
             = (-1/n) m*(m-1)/2
             ~ (-1/n) m^2/2
Equating ln (1/2) = ln p(m,n) ~ -m^2/(2*n) yields
   m^2/(2*n) = ln 2  
   m = sqrt(2 * ln 2 * n) ~ 1.177 * sqrt(n).
In other words, we can throw only about sqrt(# bins) balls before 
a collision becomes more likely than not.
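
To get a feel for how good the approximation is, the following sketch
compares the exact product p(m,n) with exp(-m^2/(2n)) at m rounded from
sqrt(2 * ln 2 * n). The function names and the sample values of n are
assumptions chosen for illustration.

```python
import math

def p_exact(m, n):
    """p(m,n) = prod_{i=1 to m-1} (1 - i/n)."""
    p = 1.0
    for i in range(1, m):
        p *= 1 - i / n
    return p

def p_approx(m, n):
    """The approximation exp(-m^2/(2n)) from the Taylor-series argument."""
    return math.exp(-m * m / (2 * n))

def m_half(n):
    """The value m = sqrt(2 * ln 2 * n) where the approximation equals 1/2."""
    return math.sqrt(2 * math.log(2) * n)

for n in (100, 10_000, 1_000_000):
    m = round(m_half(n))
    print(n, m, round(p_exact(m, n), 3), round(p_approx(m, n), 3))
```

Both columns should be close to 0.5, with the agreement improving as n
grows, since the dropped terms of the Taylor expansion become negligible.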

ASK&WAIT: What would change if we want the probability of collision to be 5%?

Ex: "Birthday Paradox"
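How many people do you need before the chance that some two of them
share a birthday is at least .5? Here the bins are the n = 365 days of the
year and the balls are the m people, so the estimate above gives
m ~ 1.177 * sqrt(365) ~ 22.5.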
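
The exact answer can be found by multiplying out the no-collision product
one person at a time. This sketch assumes n = 365 equally likely birthdays
(no leap years) and independent people; the function name smallest_group is
made up for illustration.

```python
def smallest_group(n=365, target=0.5):
    """Smallest m with P(some two of m people share a birthday) >= target."""
    p_distinct = 1.0      # P(the first m birthdays are all different)
    m = 0
    while 1 - p_distinct < target:
        p_distinct *= (n - m) / n   # person m+1 avoids the m earlier birthdays
        m += 1
    return m

print(smallest_group())   # 23
```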

Ex: "Coupon collector's problem"
Suppose I like to buy cereal because each box contains a random baseball
card from a collection of n baseball cards. How many boxes m do I have to
buy so that the probability that I have at least one of each card is at
least 1/2? 

Pick a card, any card. The chance that I do not get this particular card 
in any one box is (n-1)/n, so the chance of not getting it in m boxes is
   ((n-1)/n)^m  = (1 - 1/n)^m
From calculus we know that 
   lim_{n -> infinity} (1 - 1/n)^n = 1/e = 1/2.71828...
so when n is large
   (1 - 1/n)^m = ((1 - 1/n)^n)^(m/n) ~ exp(-m/n)
Said another way, let E_i be the event that I do not get card i in m boxes, so
   P(E_i) ~ exp(-m/n)
What we want to bound is the probability that some card is missing, 
i.e. that card 1 or card 2 or ... or card n is missing; this is 
P(E_1 union E_2 union ... union E_n)

The sets E_i and E_j are not disjoint when i neq j, but it is still true that
Thm (Union Bound)  For any events E_i, disjoint or not
   P(E_1 union ... union E_n) <= sum_{i=1 to n} P(E_i)
Proof: Let E = E_1 union ... union E_n
 P(E) = sum_{x in E} P(x)
      <= sum_{i=1 to n} sum_{x in E_i} P(x)    
          ... over-counting will occur if E_i not all disjoint, 
          ... but we still get an upper bound
      = sum_{i=1 to n} P(E_i)
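
A tiny concrete instance of the Union Bound, using one roll of a fair die
as the sample space. The three events are arbitrary choices made for
illustration; they overlap, so the sum over-counts.

```python
from fractions import Fraction

# Sample space: one roll of a fair die, each outcome with probability 1/6.
P = {x: Fraction(1, 6) for x in range(1, 7)}

def prob(event):
    """P(event) = sum of P(x) over the outcomes x in the event."""
    return sum(P[x] for x in event)

E1 = {2, 4, 6}    # roll is even
E2 = {4, 5, 6}    # roll is at least 4
E3 = {6}          # roll is a six

union = E1 | E2 | E3
print(prob(union))                       # 2/3
print(prob(E1) + prob(E2) + prob(E3))    # 7/6
```

The bound 7/6 exceeds 1 here, which shows that the Union Bound can be very
loose when the events overlap heavily.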

Therefore P(some card missing) = P(E) <= sum_{i=1 to n} P(E_i) ~ n*exp(-m/n)
So if we pick m large enough to make the upper bound n*exp(-m/n) < 1/2
then we will be sure that the probability of missing some card is < 1/2.
Solving
      1/2 = n*exp(-m/n)
for m we get m = n*ln(2*n). This is clearly a good marketing ploy for cereal.
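
To see how the bound behaves, the sketch below computes m = n*ln(2*n)
(rounded up) and compares the upper bound n*exp(-m/n) with the exact
probability of missing some card. The exact value is computed by
inclusion-exclusion, a technique not used in these notes and brought in here
only as a check; the function names and the choice n = 50 are likewise
assumptions.

```python
import math

def boxes_from_bound(n):
    """Smallest integer m >= n*ln(2n), so that n*exp(-m/n) <= 1/2."""
    return math.ceil(n * math.log(2 * n))

def p_missing_exact(m, n):
    """Exact P(some card missing after m boxes), by inclusion-exclusion."""
    return sum((-1) ** (k + 1) * math.comb(n, k) * (1 - k / n) ** m
               for k in range(1, n + 1))

n = 50
m = boxes_from_bound(n)
print(m, n * math.exp(-m / n), p_missing_exact(m, n))
```

The exact probability comes out below the upper bound, as the Union Bound
guarantees, so n*ln(2*n) boxes are sufficient but not necessarily the
smallest number that works.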