CS 298-2
Theory Seminar
Tom Hayes
U.C. Berkeley
The "k-armed bandit" is a modified slot machine, which has k arms
you can pull, rather than just one. Each arm has a different payout,
which can change every time you play, but is always in the range [0,1].
If we get to play this machine T times, how well can we do, compared
to the best single arm in hindsight?
The right answer appears to be "within an additive O(\sqrt{T})."
Random payouts establish this as a lower bound.
An O(\sqrt{T}) upper bound is known if the problem is weakened in
any of the following ways:
(a) the adversary is non-adaptive
(b) the machine announces the payouts for all k arms,
rather than just the chosen one,
(c) instead of comparing against the best arm in hindsight, let the
adversary secretly choose one arm in advance, and compare only
against that one.
The best upper bound for the original problem has been O(T^{2/3}).
We'll show why this bound was tight for the previous champion, and
present a new algorithm which achieves O(\sqrt{T \log T}).
This is joint work with Varsha Dani.