CS 298-2
Theory Seminar

Kamalika Chaudhuri
UC Berkeley

Learning Mixtures of Distributions

Monday, September 17, 2007
4pm-5pm
The Wozniak Lounge, on the fourth floor of Soda Hall


Clustering, a method of finding structure in unlabelled data by grouping
the data points into few groups based on a similarity measure, has many
applications in AI, Physics and Biology.A simple theoretical model
that captures clustering is the problem of learning mixtures of
distributions. In this setting, one is given sample points generated
from a mixture of T distributions of a certain type, and
the goal is to recover these distributions and classify the points
correctly. In this talk, motivated by applications in biology, I will focus
on learning mixtures of product distributions.

The most common method in practice is uses principal component
analysis(PCA) as a preprocessing step to find the T-dimensional
subspace that contains the T centers. While this has been analysed
theoretically, it is known to be ineffective in certain situations --
namely, when the proportion of different distributions in the mixture
is too skewed, or when the variance in irrelevant directions is too
high. In the first part of the talk, we present a simple method which
simultaneously exploits the correlation between the signal coordinates
and independence between the noise coordinates to effectively separate
the centers of the distributions. Our method performs better than
PCA-based algorithms when learning mixtures of binary product
distributions and axis-aligned Gaussians.

In the second part of the talk, motivated again by our application in
biology, we address the sample complexity of learning mixtures of
distributions. We present a simple and efficient algorithm that learns
mixtures of two binary product distributions with near-optimal sample
complexity.