...

Below is a summary of a selection of research projects that
I have worked on. A full list of publications
is here.

### Submodularity and submodular edge weights in graphs

#### On faster (approximate/parallel) submodular minimization

#### Cooperative Graph cuts in Computer Vision

#### Submodular edge weights in combinatorial problems - theory and algorithms

#### Sequential combinatorial problems with submodular costs

### Concurrency control for machine learning

### Clustering and graph partitioning

#### LP Stability for graph partitioning problems

#### Separating distributions by higher-order moments

#### Tensor Clustering

#### Statistically consistent clustering

### Kernel independent component analysis

In Independent Component Analysis (ICA), we observe a linear
mixture of signals from independent source distributions
and aim to recover the unknown sources. Kernel dependence
measures have proved particularly useful for ICA. However,
optimizing such a kernel criterion over the special
orthogonal group is a difficult optimization problem, and
can quickly become inefficient as the kernel matrices
become large. We therefore derive an approximate Newton
method that handles these problems more
efficiently. Empirically, the method compares favorably to
state-of-the-art ICA methods.

In earlier work, we explored the effectiveness of different factorizations to approximate large kernel matrices.

Submodular set functions are characterized by an intuitive
diminishing marginal costs property. They have long been important
in combinatorics and many other areas, and it turns out that they can help model
interesting complex interactions in machine learning problems too.
Luckily, their structure
admits efficient (approximate) optimization algorithms in
many settings.
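The diminishing-returns property can be checked by brute force on a small ground set. The sketch below is illustrative only: the coverage instance `AREAS` and the helper names are hypothetical, and the exhaustive check is exponential in the ground set size, so it is only meant for toy examples.

```python
from itertools import combinations


def powerset(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = list(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_submodular(f, ground, tol=1e-12):
    """Check f(A + e) - f(A) >= f(B + e) - f(B) for all A within B, e outside B."""
    ground = frozenset(ground)
    for B in powerset(ground):
        for A in powerset(B):
            for e in ground - B:
                gain_small = f(A | {e}) - f(A)
                gain_large = f(B | {e}) - f(B)
                if gain_small < gain_large - tol:
                    return False
    return True


# Hypothetical toy instance: each item covers a set of elements.
AREAS = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}


def coverage(S):
    """Number of elements covered by the items in S (a classic submodular function)."""
    covered = set()
    for item in S:
        covered |= AREAS[item]
    return len(covered)


def squared_size(S):
    """|S|^2 has increasing marginal gains, so it is not submodular."""
    return len(S) ** 2


print(is_submodular(coverage, AREAS))      # coverage passes the check
print(is_submodular(squared_size, AREAS))  # squared size fails it
```

Such a check is also a quick way to test whether a candidate function only appears submodular at first glance.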

Unfortunately, however, not all set functions are submodular, and I
am collecting a
growing list
of functions that may appear
submodular at first look but are actually not.

I am interested in how we can efficiently use submodularity in machine learning, and what new interesting (combinatorial) models are possible. The first projects below address the efficient minimization of submodular functions, and the others the effect and complexity of having submodular functions on graph edges.

It is known that submodular set functions can be minimized in polynomial time, but many algorithms are not very practical for large real data. We address this problem with two very different approaches: (1) We write the minimization of a decomposable submodular function as a "Best Approximation" problem and apply operator splitting methods. The result is an easy, parallelizable and efficient algorithm for decomposable submodular functions that needs no parameter tuning. (2) Graph Cuts have been popular tools for representing set functions (and thereby offering an efficient tool for their optimization). Unfortunately, this is not possible for all submodular functions. However, every submodular function can be represented as a cooperative graph cut, and this insight leads to practical approximate algorithms.

S. Jegelka, F. Bach and S. Sra. Reflection methods
for user-friendly submodular optimization. NIPS 2013.

S. Jegelka, H. Lin and J. Bilmes. On fast
approximate submodular minimization. NIPS 2011.

Graph cuts have been widely used as a generic tool in
combinatorial optimization. Replacing the common sum of
edge weights by a submodular function enhances the
representative capabilities of graph cuts. For example,
graph cuts have been popular for MAP inference in
pairwise Markov Random Fields, used for image
segmentation. These have some well-known shortcomings:
the optimal segmentations tend to short-cut fine object
boundaries, in particular when the contrast is
low. Cooperative cuts enable us to introduce complex long-range dependencies between variables (high-order potentials),
such as incorporating global information about the object
boundary, and thereby lead to much better segmentations.
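To make the difference between a standard cut and a cooperative cut concrete, here is a minimal sketch on a toy graph. Everything in it is an assumption made for illustration: the graph, the edge groups, and the particular submodular cost (a per-group cap on the summed weights, which is a concave function of a sum and hence submodular). It is not the energy used in the papers below, and the brute-force search only works for tiny graphs.

```python
from itertools import combinations

# Hypothetical toy graph: (tail, head, weight, group). One heavy "coarse" edge
# models a short-cut, four light "fine" edges model a long, fine boundary.
EDGES = [("s", "m", 2.5, "coarse")] + [("m", "t", 1.0, "fine")] * 4


def cut_edges(source_side):
    """Edges leaving the source side of the partition."""
    return [e for e in EDGES if e[0] in source_side and e[1] not in source_side]


def modular_cost(edges):
    """Standard graph cut cost: the sum of the cut edge weights."""
    return sum(weight for _, _, weight, _ in edges)


def cooperative_cost(edges, caps):
    """Assumed submodular cost: per-group weight sums, each truncated at a cap."""
    totals = {}
    for _, _, weight, group in edges:
        totals[group] = totals.get(group, 0.0) + weight
    return sum(min(total, caps.get(group, float("inf")))
               for group, total in totals.items())


def best_cut(cost, inner_nodes=("m",)):
    """Brute-force minimum s-t cut; returns (cost, source side)."""
    best = None
    for r in range(len(inner_nodes) + 1):
        for extra in combinations(inner_nodes, r):
            side = frozenset(("s",) + extra)
            value = cost(cut_edges(side))
            if best is None or value < best[0]:
                best = (value, side)
    return best


CAPS = {"fine": 2.0}  # cutting many "fine" edges together is discounted
print(best_cut(modular_cost))
print(best_cut(lambda edges: cooperative_cost(edges, CAPS)))
```

With summed weights the cheapest cut takes the single heavy edge, while the cooperative cost prefers the longer boundary because edges in the same group share their cost.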

Cooperative cuts indeed unify and generalize a number
of higher-order energy functions that have been used in
Computer Vision.

P. Kohli, A. Osokin and S. Jegelka. A principled deep random field for image segmentation. CVPR 2013

S. Jegelka and J. Bilmes. Submodularity
beyond submodular energies: coupling edges in graph
cuts. CVPR 2011.

S. Jegelka and J. Bilmes. Multi-label Cooperative
Cuts. CVPR 2011 Workshop on Inference in Graphical
Models with Structured Potentials.

Starting with (a generalized) minimum cut, we study combinatorial
problems where instead of a sum of edge weights, we have a submodular
function on the edges. In their most general form, such problems are
usually very hard, with polynomial lower bounds on the approximation
factor (as several recent works show). But with some assumptions, efficient algorithms can give very
decent results.

Motivated by good empirical results, we continued to study
properties of functions that affect the complexity of
submodular problems: the curvature
of the function is a general complexity measure for
minimization, maximization, approximation and learning.
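For a monotone submodular function $f$ on ground set $V$ with $f(\emptyset) = 0$, the standard total curvature is $\kappa_f = 1 - \min_{j \in V} \frac{f(V) - f(V \setminus \{j\})}{f(\{j\})}$. It is 0 for modular (additive) functions and approaches 1 as the function saturates. The helper below is a small sketch of this definition; the function name and the example functions are chosen here for illustration.

```python
import math


def total_curvature(f, ground):
    """Total curvature of a monotone submodular f with f(empty set) = 0.

    Elements whose singleton value is zero are skipped to avoid dividing by zero.
    """
    ground = frozenset(ground)
    full_value = f(ground)
    ratios = []
    for j in ground:
        singleton = f(frozenset({j}))
        if singleton > 0:
            ratios.append((full_value - f(ground - {j})) / singleton)
    if not ratios:
        return 0.0  # f is identically zero on singletons
    return 1.0 - min(ratios)


V = range(4)
print(total_curvature(lambda S: float(len(S)), V))       # modular function
print(total_curvature(lambda S: math.sqrt(len(S)), V))   # concave over cardinality
print(total_curvature(lambda S: float(min(len(S), 1)), V))  # fully saturating
```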

S. Jegelka and J. Bilmes. Graph Cuts with Interacting Edge Costs - Examples, Approximations, and Algorithms. Manuscript

R. Iyer, S. Jegelka and J. Bilmes. Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions. NIPS 2013.

R. Iyer, S. Jegelka and J. Bilmes. Fast Semidifferential-based Submodular Function Optimization. ICML 2013

S. Jegelka and J. Bilmes. Approximation
bounds for inference using cooperative cut. ICML 2011.

Sequential decision problems ask us to repeatedly solve an optimization problem with an unknown, changing cost function. A decision for the current cost must be made based on observing the costs at previous steps, but not the current one. Such problems become challenging when the optimization problem is combinatorial and the decision space is therefore exponentially large. We address sequential combinatorial problems and derive the first algorithms that handle nonlinear, submodular cost functions.

S. Jegelka and J. Bilmes. Online submodular minimization for combinatorial structures. ICML 2011.

Many machine learning algorithms iteratively transform some global state (e.g., model parameters or variable assignment) giving the illusion of serial dependencies between each operation. However, due to sparsity, exchangeability, and other symmetries, it is often the case that many, but not all, of the state-transforming operations can be computed concurrently while still preserving serializability: the equivalence to some serial execution where individual operations have been reordered.

This opportunity for serializable concurrency forms the foundation of distributed database systems. In this project, we implement updates in machine learning algorithms as concurrent transactions in a distributed database. As a result, we achieve high scalability while maintaining the semantics and theoretical properties of the original serial algorithm.
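The optimistic pattern can be sketched on a deliberately simplified task: a single pass that opens a new cluster center whenever a point is farther than a threshold from all existing centers. The code is a toy simulation written for this page, not the system from the paper; the function names, the 1-D data, and the sequential stand-in for the parallel phase are all assumptions.

```python
def serial_pass(points, threshold):
    """Reference serial algorithm: open a center for every point far from all centers."""
    centers = []
    for x in points:
        if all(abs(x - c) > threshold for c in centers):
            centers.append(x)
    return centers


def optimistic_pass(points, threshold, batch_size):
    """Same pass, organized as optimistic transactions with serial validation."""
    centers = []
    for start in range(0, len(points), batch_size):
        batch = points[start:start + batch_size]
        snapshot = list(centers)  # state that every worker reads for this batch

        # Optimistic phase: each point is checked against the snapshot only.
        # These checks are independent, so they could run concurrently.
        proposals = [x for x in batch
                     if all(abs(x - c) > threshold for c in snapshot)]

        # Validation phase (serial): a proposal is committed only if it does not
        # conflict with centers committed earlier, including in this batch.
        for x in proposals:
            if all(abs(x - c) > threshold for c in centers):
                centers.append(x)
    return centers


points = [0.0, 0.2, 5.0, 5.1, 9.7, 0.1, 10.0, 5.3, 20.0]
print(serial_pass(points, threshold=1.0))
print(optimistic_pass(points, threshold=1.0, batch_size=4))
```

In this sketch the validation step rejects exactly the proposals that the serial algorithm would have rejected, so both passes return the same centers for any batch size.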

X. Pan, J. Gonzalez, S. Jegelka, T. Broderick and M.I. Jordan. Optimistic Concurrency Control for Distributed Unsupervised Learning. NIPS 2013

Data in machine learning often arises from noisy measurements. When such data is used in an optimization problem, it is beneficial to know the stability of the optimal solution to perturbations in the data. We show a method for analyzing this stability for LP relaxations of graph partitioning problems. The method can handle the exponential number of constraints and applies to problems such as correlation clustering, clustering aggregation or modularity clustering.

S. Nowozin and S. Jegelka. Solution stability in linear programming relaxations: graph partitioning and unsupervised learning. ICML 2009.

Many clustering criteria aim for clusters that are spatially separated. For example, the popular k-means criterion seeks clusters whose means are maximally far apart. If the data is assumed to be samples of a mixture of distributions and we want to recover the underlying distributions, then spatial separation may not be the ideal criterion, e.g., if we have two Gaussians with the same mean but different variance. Using a kernel criterion, however, we can separate distributions by higher-order moments. This observation also explains capabilities of the kernel k-means algorithm, for example in separating distributions by moments other than the mean.
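One kernel criterion with this property is the maximum mean discrepancy (MMD), the distance between kernel mean embeddings of two samples. The sketch below uses it as a stand-in (whether it matches the exact criterion of the paper is an assumption) and compares two zero-mean Gaussian samples that differ only in variance. The sample sizes, bandwidth and seed are arbitrary choices.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on real numbers."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


rng = random.Random(0)
narrow = [rng.gauss(0.0, 1.0) for _ in range(200)]
narrow_again = [rng.gauss(0.0, 1.0) for _ in range(200)]
wide = [rng.gauss(0.0, 3.0) for _ in range(200)]  # same mean, larger variance

mmd_same = mmd_squared(narrow, narrow_again)
mmd_diff = mmd_squared(narrow, wide)
print("same distribution:     ", mmd_same)
print("same mean, diff. var.: ", mmd_diff)
```

The sample means of `narrow` and `wide` are both close to zero, so a purely mean-based criterion sees little difference, while the kernel statistic is clearly larger for the pair with different variances.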

S. Jegelka, A. Gretton, B. Schoelkopf, B.K. Sriperumbudur and U. von Luxburg. Generalized clustering via kernel embeddings. KI 2009.

The tensor clustering problem generalizes co-clustering (also called biclustering) from matrices to tensors. Informally, we aim to partition a given tensor into homogeneous "cubes". Formally, we want to find the best low-rank factorization of a particular form. We show the first approximation bounds for tensor clustering with metrics and Bregman divergences. This work also illustrates the limits of ignoring the "co" in co-clustering.

S. Jegelka, S. Sra and A. Banerjee. Approximation algorithms for tensor clustering. ALT 2009.

Most clustering problems correspond to NP-hard optimization problems. Furthermore, even if we could find the optimal solution, this procedure may fail to be statistically consistent. Therefore, we relax computationally hard clustering problems (such as k-means or normalized cut) to formulations that can be solved exactly in polynomial time, are statistically consistent, and converge to the solution of the given objective as the number of sample points grows.

U. von Luxburg, S. Bubeck, S. Jegelka and M. Kaufmann. Consistent minimization of clustering objective functions. NIPS 2007

H. Shen, S. Jegelka and A. Gretton. Fast Kernel-based Independent Component Analysis. IEEE
Transactions on Signal Processing 57(9),
pp. 3498-3511, 2009.

H. Shen, S. Jegelka and A. Gretton. Fast Kernel ICA
using an Approximate Newton Method. AISTATS, 2007.

S. Jegelka and A. Gretton. Brisk Kernel Independent Component Analysis. In L. Bottou, O. Chapelle,
D. DeCoste, J. Weston, editors. Large Scale Kernel
Machines, pp. 225-250. MIT Press, 2007.