Research

...

Below is a summary of a selection of research projects that I have worked on. A full list of publications is here.

Submodularity

Submodular set functions are characterized by an intuitive diminishing marginal costs property. Their structure admits efficient (approximate) optimization algorithms in many settings.
Unfortunately, not all set functions are submodular, and I am collecting a growing list of functions that may appear submodular at first look but are actually not.



Submodularity on graphs

Many classical combinatorial problems ask to select a set of graph edges that form a particular structure - for example, a graph cut, while minimizing a sum of edge weights. We replace the sum of edge weights by a more general, non-additive submodular set function. The resulting problem is a Cooperative Graph Cut.

Cooperative Graph cuts in Computer Vision

Graph cuts have been used as a generic tool in combinatorial optimization. Replacing the common sum of edge weights by a submodular function enhances the representative capabilities of graph cuts. For example, graph cuts haven been popular for MAP inference in pairwise Markoc Random Fields, used e.g. for image segmentation. These have some well-known shortcomings: the optimal segmentations tend to short-cut fine object boundaries, in particular when the contrast is low. Cooperative cuts correspond to high-order potentials and admit to incorporate information about the object boundary, leading to much better segmentations.
Cooperative cuts indeed unify and generalize a number of higher-order energy functions that have been used in Computer Vision.

S. Jegelka and J. Bilmes. Submodularity beyond submodular energies: coupling edges in graph cuts. CVPR 2011.

Cooperative Graph cuts - complexity and algorithms

In its most general form, the problem of finding a minimum cooperative cut is extremely hard, and polynomial lower bounds apply. However, we have explored several approximation algorithms that in practice work better than their theoretical worst-case approximation factor. Some of these are also efficient enough to be applied to large graphs arising from images.
Our theoretical results complement recent work in theoretical computer science on related problems.

S. Jegelka and J. Bilmes. Approximation bounds for inference using cooperative cut. ICML 2011.
Jegelka and J. Bilmes. Multi-label cooperative cuts. CVPR 2011 Workshop on Inference in Graphical Models with Structured Potentials.

Sequential combinatorial problems with submodular costs

Sequential decision problems ask to repeatedly solve an optimization problem with an unknown, changing cost function. A decision for the current cost must be made based on observing the costs at previous steps, but not the current one. Such problems become challenging when the optimization problem is combinatorial and therefore the decision space exponentially large. We address sequential combinatorial problems and derive the first algorithms that handle nonlinear, submodular cost functions.

S. Jegelka and J. Bilmes. Online submodular minimization for combinatorial structures. ICML 2011.

On faster approximate submodular minimization

Graph cuts have particularly been used to represent pseudo-boolean and set functions and thereby as a tool to minimize such functions more efficiently than e.g. with a generic algorithm for minimizing submodular functions. However, not all submodular functions can be (efficiently) represented as graph cuts. However, any submodular function can be represented as a cooperative cut, and this leads to efficient approximation algorithms that work well on several submodular functions.

S. Jegelka, H. Lin and J. Bilmes. On fast approximate submodular minimization. NIPS 2011.



Clustering and graph partitioning


LP Stability for graph partitioning problems

Data in machine learning often arises from noisy measurements. When such data is used in an optimization problem, it is beneficial to know the stability of the optimal solution to perturbations in the data. We show a method for analyzing this stability for LP relaxations of graph partitioning problems. The method can handle the exponential number of constraints and applies to problems such as correlation clustering, clustering aggregation or modularity clustering.

S. Nowozin and S. Jegelka. Solution stability in linear programming relaxations: graph partitioning and unsipervised learning. ICML 2009.

Separating distributions by higher-order moments

Many clustering criteria aim for clusers that are spatially separated. For example, the popular k-means crtiterion seeks clusers whose means are maximally far apart. If the data is assumed to be samples of a mixture of distributions and we want to recover the underlying distributions, then spatial separation may not be the ideal criterion, e.g. if we have two Gaussians with the same mean but different variance. Using a kernel criterion, however, we can separate distributions by higher-order moments. This observation also explains capabilities of the kernel k-means algorithm for example in separating distributions by moments other than the mean.

S. Jegelka, A. Gretton, B. Schoelkopf, B.K Sriperumbudur and U. von Luxburg. Generalized clustering via kernel embeddings. KI 2009.

Tensor Clustering

The tensor clustering problem generalizes co-custering (also called biclustering) from matrices to tensors. Informally, we aim to partition a given tensor into honogeneous "cubes". Formally, we want to find the closest best low-rank factorization of a particular form. We show the first approximation bounds for tensor clustering with metrics and Bregman divergences. This work also illustrates the limits of ignoring the "co" in co-clustering.

S. Jegelka, S. Sra and A. Banerjee. Approximation algorithms for tensor clustering. ALT 2009.

Statistically consistent clustering

Most clustering problems correspond to NP-hard optimization problems. Furthermore, even if we could find the optimzal solution, this procedure may fail to be statisticaly consistent. Therefore, we relax computationally hard clustering problems (such as k-means or normalized cut) to formulations that cen be solved exactly in polynomial time and that are statistically consistent and converge to the solution of the given objective as the number of sample points grows.

U. von Luxburg, S. Bubeck, S. Jegelka and M. Kaufmann. Consistent minimization of clustering objective functions. NIPS 2007



Kernel independent component analysis

In Independent Component Analysis (ICA), we observe a linear mixture of signals from independent source distributions and aum to recover the unknown sources. Kernel dependence measures have proved particularly useful for ICA. However, optimizing such a kernel criterion over the special orthogonal group is a difficult optimization problem, and can quickly become inefficient as the kernel matrices become large. We therefore derive an approximate Newton method that handles these problems more efficiently. Empirically, the method compares favorably to state-of-the-art ICA methods.
In eralier work, we explored the effectiveness of different factorizations to approximate large kernel matrices.

H. Shen, S. Jegelka and A. Gretton. Fast Kernel-based Independent Component Analysis. IEEE Transactions on Signal Processing 57(9), pp. 3498-3511, 2009.
H. Shen, S. Jegelka and A. Gretton. Fast Kernel ICA using an Approximate Newton Method. AISTATS, 2007.
S. Jegelka and A. Gretton. Brisk Kernel Independent Component Analysis. In L. Bottou, O. Chapelle, D. DeCoste, J. Weston, editors. Large Scale Kernel Machines, pp. 225-250. MIT Press, 2007.