```CS 70 - Lecture 33 - Apr 13, 2011 - 10 Evans

Goals for today:  Inference: how to use what we know about random variables
to infer "hidden" information from noisy measurements

Ex 1: Communication: bits are sent over noisy channel (wireless, DSL phone line, etc);
how do we infer true bits from what we receive?

Ex 2: Control: A plane on autopilot / spacecraft landing on moon / drone
gets noisy data from sensors about where it is and how fast it is going,
and wants to control its path to go in right direction / land safely /
do whatever the CIA wants it to do.

Ex 3: Object recognition: From an image containing an object (partially
behind others, at an unknown angle, with unknown lighting), identify it

Ex 4: Speech recognition: Given an audio signal of one or more people
speaking in a noisy environment in some language, what are they saying?

Ex 5: Investing: Given past data about one or more stocks, try to predict
what their prices will be in the future (1 second, 1 day, 1 month, ...)

In all these cases:
(1) There is a hidden quantity X that we'd like to know
(2) We know the prior distribution of X, i.e. the probabilities
P(X=a) before we collect any observations
(3) We measure random variables Y1, Y2,... called "observations";
say the measured value of Yi is bi.
(4) We know the conditional distributions of Yi given X: P(Yi=b|X=a)

What we want to compute is the conditional distribution of X given
all the observed values of Yi, that is P(X=a | Y1=b1 and Y2=b2 and ... ).

Def: Choosing the value of a for which P(X=a | Y1=b1 and Y2=b2 and ... )
is largest, i.e. the most likely value of X, is called the maximum
a posteriori (MAP) rule.

We first consider the case of one observation, call it Y.
Recall Bayes Theorem from Note 12:
P(X=a | Y=b) = P(X=a and Y=b)/P(Y=b)
= P(Y=b | X=a) * P(X=a) / P(Y=b)

Now P(Y=b) = sum_i P(Y=b and X=ai)
= sum_i P(Y=b | X=ai) * P(X=ai)
(this is called the Total Probability Rule), so
P(X=a | Y=b) = P(Y=b | X=a) * P(X=a) / sum_i P(Y=b | X=ai) * P(X=ai)
where all the quantities on the right hand side are known.
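As a quick sketch in Python (the priors and likelihoods below are
made-up illustrative numbers, not from any example above):

```python
# Bayes rule with the Total Probability Rule in the denominator.
# priors[i] = P(X=ai); likelihoods[i] = P(Y=b | X=ai).
def posterior(priors, likelihoods):
    joint = [lk * pr for lk, pr in zip(likelihoods, priors)]
    total = sum(joint)            # P(Y=b), by the Total Probability Rule
    return [j / total for j in joint]

# Two equally likely values of X; illustrative likelihoods.
print(posterior([0.5, 0.5], [0.9, 0.2]))   # [0.45/0.55, 0.10/0.55]
```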

Ex: Multi-armed Bandits (slot machines)
Suppose you walk into a room full of n slot machines, where the
i-th machine lets you win with probability p_i. But you don't know
which is which. So you pick a random machine and play it.
Q1: What is the probability of winning?
Q2: What is the probability that you chose the i-th machine, given you win?
Q3: What is the probability of winning the second time if you win the first time?
Q4: What is the probability that you chose the i-th machine, given you win twice?
Q5: What is the probability that you chose the i-th machine, given you win m times?
Which machine i maximizes this probability, i.e. what is the most likely machine?
(i.e., use the MAP Rule)

We let X = identity of machine you chose (an integer 1 <= i <= n) - unknown.
We let Yi = W if you win on the i-th play, and L if you lose - known.

Q1: What is the probability of winning?
We are asking for P(Y1=W). By the Total Probability Rule we get
P(Y1=W) = sum_{i=1 to n} P(Y1 = W | X = i) * P(X=i)
= sum_{i=1 to n} p_i * (1/n)

Q2: What is the probability that you chose the i-th machine, given you win?
We are asking for P(X=i | Y1=W), which by Bayes Rule is
P(X=i | Y1= W) =  P(Y1=W | X=i) * P(X=i) / P(Y1=W)
= [ p_i ] * [ 1/n / sum_j p_j/n ]
= p_i / sum_j p_j
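A quick numeric check of Q1 and Q2 (the win probabilities p_i below
are made up for illustration):

```python
# Illustrative win probabilities for n = 3 machines.
p = [0.1, 0.3, 0.6]
n = len(p)

p_win = sum(pi / n for pi in p)       # Q1: P(Y1=W) = sum_i p_i * (1/n)
post = [pi / sum(p) for pi in p]      # Q2: P(X=i | Y1=W) = p_i / sum_j p_j

print(p_win)   # 1/3 for these numbers
print(post)
```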

Q3: What is the probability of winning the second time if you win the first time?
We are asking for P(Y2=W | Y1=W) = P(Y2=W and Y1=W) / P(Y1=W)
If we knew X=i, it would be easy to compute P(Y2=W and Y1=W),
because each play is independent and wins with probability p_i,
so we'd just multiply to get p_i^2. More carefully:

Def: Two events A and B are conditionally independent given event C if
P(A and B | C) = P(A | C) * P(B | C)
Two random variables Y1 and Y2 are conditionally independent
given another random variable X if
P(Y1=b1 and Y2=b2 | X=a) = P(Y1=b1 | X=a) * P(Y2=b2 | X=a)

This lets us write
P(Y2 = W | Y1 = W) = P(Y2=W and Y1=W) / P(Y1=W)
= sum_{i=1 to n} P(Y2=W and Y1=W and X=i) / P(Y1=W)
... by the Total Probability Rule
= sum_{i=1 to n} P(Y2=W and Y1=W | X=i) * P(X=i) / P(Y1=W)
... by the definition of conditional probability
= sum_{i=1 to n} P(Y2=W | X=i) * P(Y1=W | X=i) * P(X=i) / P(Y1=W)
... by conditional independence of Y1 and Y2 given X
= sum_{i=1 to n} p_i * p_i * (1/n) / sum_{i=1 to n} p_i/n
= sum_{i=1 to n} p_i^2 / sum_{i=1 to n} p_i

We would expect that given one W, the second one is at least as likely, that is
P(Y2=W | Y1=W) >= P(Y1=W)
or
sum_{i=1 to n} p_i^2 / sum_{i=1 to n} p_i  >= sum_{i=1 to n} p_i/n
or
n * sum_{i=1 to n} p_i^2 >= ( sum_{i=1 to n} p_i )^2
This follows from a Math 54 fact, the Cauchy-Schwarz inequality,
applied to the two vectors [p_1,...,p_n] and [1,...,1].
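Both the Q3 formula and the inequality can be checked numerically
(the p_i values below are made up):

```python
# Illustrative win probabilities.
p = [0.1, 0.3, 0.6]
n = len(p)

win2_given_win1 = sum(pi**2 for pi in p) / sum(p)   # P(Y2=W | Y1=W)
win1 = sum(p) / n                                   # P(Y1=W)
print(win2_given_win1, win1)   # the first is at least the second

# Cauchy-Schwarz form used above: n * sum p_i^2 >= (sum p_i)^2
assert n * sum(pi**2 for pi in p) >= sum(p)**2
```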

Q4: What is the probability that you chose the i-th machine, given you win twice?
This is asking for P(X=i | Y1=W and Y2=W). Applying Bayes Rule again, we get
P(X=i | Y1=W and Y2=W) = P(Y1=W and Y2=W | X=i) * P(X=i) / P(Y1=W and Y2=W)
= P(Y1=W and Y2=W | X=i) * P(X=i)
/ sum_{j=1 to n} P(Y1=W and Y2=W | X=j ) * P(X=j)
= p_i^2 * (1/n) / sum_{j=1 to n} p_j^2 * (1/n)
... by conditional independence
= p_i^2 / sum_{j=1 to n} p_j^2

Q5: What is the probability that you chose the i-th machine, given you win m times?
This is asking for P(X=i | Y1=W and ... and Ym=W ).
Applying Bayes Rule yet again, we get
P(X=i | Y1=W and ... and  Ym=W)
= P(Y1=W and ... and Ym=W | X=i) * P(X=i) / P(Y1=W and ... and Ym=W)
= P(Y1=W and ... | X=i) * P(X=i)
/ sum_{j=1 to n} P(Y1=W and ... | X=j ) * P(X=j)
= p_i^m * (1/n) / sum_{j=1 to n} p_j^m * (1/n)
... by conditional independence
= p_i^m / sum_{j=1 to n} p_j^m
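The Q5 formula is easy to compute directly; the p_i below are made up:

```python
# P(X=i | m wins in a row) = p_i^m / sum_j p_j^m, with a uniform prior.
def posterior_after_wins(p, m):
    powers = [pi ** m for pi in p]
    total = sum(powers)
    return [pw / total for pw in powers]

print(posterior_after_wins([0.1, 0.3, 0.6], 1))
print(posterior_after_wins([0.1, 0.3, 0.6], 2))
```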

To understand which machine is most likely, i.e. which value of i
maximizes P(X=i | Y1=W and ... and  Ym=W), assume for simplicity that
we have numbered the machines so p_1 > p_2 > ... > p_n.
Then clearly the choice i=1 maximizes P(X=i | Y1=W and ... and Ym=W),
i.e. if you keep winning, the most likely identity of the machine you
are playing is the machine with the highest probability of winning.

To understand what you learn as m grows, we can write
p_i^m / sum_{j=1 to n} p_j^m
= (p_i/p_1)^m / sum_{j=1 to n} (p_j/p_1)^m
= (p_i/p_1)^m / [ 1 + sum_{j=2 to n} (p_j/p_1)^m ]
where each ratio p_i/p_1 is less than 1, for i>1.
So as m grows, this expression approaches 1 if i=1 and 0 if i>1.
In other words, if you keep winning, the probability that you have
chosen the machine with the highest probability of winning p_1
approaches 1, and the probability of having chosen any other
machine approaches zero.
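This convergence is visible numerically; the p_i below are made up
and the machines are numbered so p_1 is largest:

```python
# Posterior after m straight wins concentrates on machine 1 as m grows.
p = [0.6, 0.3, 0.1]
for m in (1, 5, 25):
    powers = [pi ** m for pi in p]
    total = sum(powers)
    print(m, [pw / total for pw in powers])
```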

Ex: Communication over a Noisy Channel. Now we consider the
question of sending bits over a noisy channel, where there
is a probability p < 1/2 that each bit gets flipped. How much
more reliable can you make the transmission by sending each
bit n times? Now
X = correct bit = 0 or 1, each with probability 1/2
Yi = received value of the i-th copy of bit X that is sent
= X (is correct) with probability 1-p > 1/2
= 1-X (is flipped) with probability p < 1/2
As before, let b_i be the observed value of Y_i.
We assume the Yi are conditionally independent given X,
i.e. each Yi is flipped or not independently of the others.
We need to compute and compare
P(X=0 | Y1=b1 and ... and Yn=bn)
and
P(X=1 | Y1=b1 and ... and Yn=bn)
and choose the value X=0 or X=1, depending on which is larger
(use the MAP rule). By Bayes rule, and conditional independence
p_a = P(X=a | Y1=b1 and ... Yn=bn )
= P(Y1=b1 and ... and Yn=bn | X=a) * P(X=a)
/ P(Y1=b1 and ... and Yn=bn)
= [ prod_{i=1 to n} P(Yi=bi | X=a) ] * P(X=a)
/ P(Y1=b1 and ... and Yn=bn)
We need to decide if p_0 or p_1 is larger. To save work,
we can just compute their ratio r = p_1/p_0, and ask
if it is larger or smaller than 1.
This ratio is called a likelihood ratio.
After cancelling common terms in r
(so we don't have to compute them!) we get
r = p_1/p_0 = prod_{i=1 to n} P(Yi=bi | X=1) / P(Yi=bi | X=0)
where
P(Yi = bi | X=a) = { 1-p if bi=a  (bit sent correctly)
{ p   if bi=1-a (bit flipped)
Thus
prod_{i=1 to n} P(Yi=bi | X=a)
= (1-p)^{#bi's that equal a} * p^{#bi's that do not equal a}
or, letting x = #bi=1 and n-x = #bi=0
r = [ (1-p)^x * p^(n-x) ] / [ p^x * (1-p)^(n-x) ]
= (p/(1-p))^(n-2x)
We need to decide if r > 1, i.e. p_1 > p_0, so we decide X = 1,
or if r < 1, i.e. p_1 < p_0, so we decide X = 0. Since
p < 1/2, then p/(1-p) < 1. So
r = (p/(1-p))^(n-2x) > 1 if n-2x < 0, i.e. x > n/2, i.e. more 1s than 0s received,
so we decide X=1
r = (p/(1-p))^(n-2x) < 1 if n-2x > 0, i.e. x < n/2, i.e. more 0s than 1s received,
so we decide X=0
In other words, we simply use a majority vote! And if x=n/2, a tie,
X=0 and X=1 are equally likely (so we had better send an odd number
of bits, which we henceforth assume).
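The decision rule we just derived is a one-liner:

```python
# MAP decoding for the repetition code: majority vote (n odd, p < 1/2).
def map_decode(bits):
    return 1 if sum(bits) > len(bits) / 2 else 0

print(map_decode([1, 0, 1, 1, 0]))   # three 1s out of five, so decode 1
```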

It is of interest to ask what is the probability that
the majority vote gets it wrong, i.e. that the number
of flipped bits is bigger than n/2. This is a binomial distribution:
P(# flipped bits = k) = C(n,k)*p^k*(1-p)^(n-k)
P(# flipped bits > n/2) = sum_{k=(n+1)/2 to n} C(n,k)*p^k*(1-p)^(n-k)
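This error probability can be computed exactly; the n and p below are
illustrative:

```python
from math import comb

# P(majority vote is wrong) = P(# flipped bits > n/2), n odd.
def majority_error(n, p):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n + 1) // 2, n + 1))

print(majority_error(5, 0.1))   # about 0.00856 for n=5, p=0.1
```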
If we know how reliable our network is, i.e. the chance p of flipping a bit,
we can use this to pick n large enough to make the network as reliable
as we like. We can use Chebyshev's inequality to get a bound on how big
n has to be: Let random variable f = #flipped bits, so E(f) = n*p
and Var(f) = n*p*(1-p), and
P(f > n/2) <= P( | f - n*p | > n*(1/2 - p) )
<= Var(f)/(n*(1/2 - p))^2
= p*(1-p)/[n* (1/2 - p)^2]
So when p, the probability of flipping a bit, is small, we don't have
to pick n very large to make the probability of choosing the wrong
bit as small as we like. But as p gets close to 1/2, we have to
pick n larger and larger.
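Comparing the Chebyshev bound to the exact error probability shows the
bound is loose but still useful for picking n (the n and p values
below are illustrative):

```python
from math import comb

def exact_error(n, p):
    # P(# flipped bits > n/2), n odd: the exact majority-vote error.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n + 1) // 2, n + 1))

def chebyshev_bound(n, p):
    # p*(1-p) / (n * (1/2 - p)^2), from the derivation above.
    return p * (1 - p) / (n * (0.5 - p)**2)

for n in (11, 101):
    print(n, exact_error(n, 0.1), chebyshev_bound(n, 0.1))
```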
```