A New Look at Regret (with J. Abernethy, A. Agarwal, and P. Bartlett). In preparation.
Matrix Regularization Techniques for Online Multitask Learning (with A. Agarwal and P. Bartlett). In preparation.
Game-Theoretic Timing Analysis (with S. Seshia). IEEE/ACM Conference on Computer-Aided Design (ICCAD), 2008, to appear. Abstract:
Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization (with J. Abernethy and E. Hazan). COLT 2008, to appear. Abstract:We introduce an efficient algorithm for the problem of online linear optimization in the bandit setting which achieves the optimal $O^*(\sqrt{T})$ regret. The setting is a natural generalization of the non-stochastic multi-armed bandit problem, and the existence of an efficient optimal algorithm has been posed as an open problem in a number of recent papers. We show how the difficulties encountered by previous approaches are overcome by the use of a self-concordant potential function. Our approach presents a novel connection between online learning and interior point methods.
High probability regret bounds for online optimization (with P. Bartlett, V. Dani, T. Hayes, S. Kakade, and A. Tewari). COLT 2008, to appear. Abstract: We present a modification of the algorithm of Dani et al. for the online linear optimization problem in the bandit setting, which with high probability has regret at most $O^*(\sqrt{T})$ against an adaptive adversary. This improves on the previous algorithm of Dani et al whose regret is bounded \emph{in expectation } against an \emph{oblivious} adversary. We obtain the same dependence on the dimension ($n^{3/2}$) as that exhibited by Dani et al. The results of this paper rest firmly on those of Dani et al and the remarkable technique of Auer et al. for obtaining high-probability bounds via optimistic estimates. This paper answers an open question: it eliminates the gap between the high-probability bounds obtained in the full-information vs bandit settings.
Optimal Strategies and Minimax Lower Bounds for Online Convex Games (with Jacob Abernethy, Peter Bartlett, and Ambuj Tewari). COLT 2008, to appear. Abstract:
A number of learning problems can be cast as an Online Convex Game: on each round, a learner makes a prediction $x$ from a convex set, the environment plays a loss function $f$, and the learner's long-term goal is to minimize regret. Algorithms have been proposed by Zinkevich, when $f$ is assumed to be convex, and Hazan et al., when $f$ is assumed to be strongly convex, that have provably low regret. We consider these two settings and analyze such games from a minimax perspective, proving minimax strategies and lower bounds in each case. These results prove that the existing algorithms are essentially optimal.
Closing the Gap between Bandit and Full-Information Online Optimization: High-Probability Regret Bound (with Peter Bartlett and Ambuj Tewari), 2007 Abstract:We demonstrate a modification of the algorithm of Dani et al for the online linear optimization problem in the bandit setting, which allows us to achieve an $O(\sqrt{T\ln T})$ regret bound {\it in high probability}, as opposed to the {\it in expectation } result of Dani et al. Moreover, we obtain the same dependence on the dimension ($n^{3/2}$) as that exhibited by Dani et al. The results of this paper rest firmly on those of Dani et al and the remarkable technique of Auer et al for obtaining high-probability bounds via optimistic estimates. This paper answers an open question: it eliminates the gap between the high-probability bounds obtained in the full-information vs bandit settings. Note: A similar analysis has been carried out independently by Dani, Hayes & Kakade. A merged version of our results is being submitted.
Adaptive Online Gradient Descent (with Peter Bartlett and Elad Hazan), NIPS 2007. Technical report version available here. Abstract:We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between $\sqrt{T}$ and $\log T$. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms.
Online Discovery of Similarity Mappings (with Jacob Abernethy and Peter Bartlett), ICML 2007. Abstract:We consider the problem of choosing, sequentially, a map which assigns elements of a set A to a few elements of a set B. On each round, the algorithm suffers some cost associated with the chosen assignment, and the goal is to minimize the cumulative loss of these choices relative to the best map on the entire sequence. Even though the offline problem of finding the best map is provably hard, we show that there is an equivalent online approximation algorithm, Randomized Map Prediction (RMP), that is efficient and performs nearly as well. While drawing upon results from the ``Online Prediction with Expert Advice'' setting, we show how RMP can be utilized as an online approach to several standard batch problems. We apply RMP to online clustering as well as online feature selection and, surprisingly, RMP often outperforms the standard batch algorithms on these problems.
Multitask Learning with Expert Advice (with Jacob Abernethy and Peter Bartlett), COLT 2007. Technical report version available here. Abstract:We consider the problem of prediction with expert advice in the setting where a forecaster is presented with several online prediction tasks. Instead of competing against the best expert separately on each task, we assume the tasks are related, and thus we expect that a few experts will perform well on the entire set of tasks. That is, our forecaster would like, on each task, to compete against the best expert chosen from a small set of experts. While we describe the "ideal" algorithm and its performance bound, we show that the computation required for this algorithm is as hard as computation of a matrix permanent. We present an efficient algorithm based on mixing priors, and prove a bound that is nearly as good for the sequential task presentation case. We also consider a harder case where the task may change arbitrarily from round to round, and we develop an efficient approximate randomized algorithm based on Markov chain Monte Carlo techniques.
Stability of K-means Clustering (with Andrea Caponnetto). NIPS, 2006. Abstract: We phrase K-means clustering as an empirical risk minimization procedure over a class $H_K$ and explicitly calculate the covering number for this class. Next, we show that stability of K-means clustering is characterized by the geometry of $H_K$ with respect to the underlying distribution. We prove that in the case of a unique global minimizer, the clustering solution is stable with respect to complete changes of the data, while for the case of multiple minimizers, the change of $\Omega(n^{1/2})$ samples defines the transition between stability and instability. While for a finite number of minimizers this result follows from multinomial distribution estimates, the case of infinite minimizers requires more refined tools. We conclude by proving that stability of the functions in $H_K$ implies stability of the actual centers of the clusters. Since stability is often used for selecting the number of clusters in practice, we hope that our analysis serves as a starting point for finding theoretically grounded recipes for the choice of K.
Stability Properties of Empirical Risk Minimization over Donsker Classes (with Andrea Caponnetto). Journal of Machine Learning Research. Vol. 7 (Dec), 2565--2583, 2006. (Older version as a technical report: AI Memo 2005-018. May 2005) Abstract:We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number $n$ of samples grows, the $L_2$-diameter of the set of almost-minimizers of empirical error with tolerance $\xi(n)=o(n^{-\frac{1}{2}})$ converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as $n$ increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than $n^{-1/2}$.
Risk Bounds for Mixture Density Estimation (with Dmitry Panchenko and Sayan Mukherjee). ESAIM Probability and Statistics. Vol. 9, 220-229, June 2005. (Older version as a technical report: AI Memo 2004-001. Jan 2004) Abstract:In this paper we focus on the problem of estimating a bounded density using a finite combination of densities from a given class. We consider the Maximum Likelihood Estimator (MLE) and the greedy procedure described by Li and Barron \cite{LiBarron99,Li99} under the additional assumption of boundedness of densities. We prove an $O(\frac{1}{\sqrt{n}})$ bound on the estimation error which does not depend on the number of densities in the estimated combination. Under the boundedness assumption, this improves the bound of Li and Barron by removing the $\log n$ factor and also generalizes it to the base classes with converging Dudley integral.
Stability Results in Learning Theory (with Sayan Mukherjee and Tomaso Poggio). Analysis and Applications, Special Issue on Learning Theory. Vol. 3, No. 4, 397-419. October 2005. Abstract:The problem of proving generalization bounds for the performance of learning algorithms can be formulated as a problem of bounding the bias and variance of estimators of the expected error. We show how various {\it stability assumptions} can be employed for this purpose. We provide a necessary and sufficient stability condition for bounding the bias and variance for the Empirical Risk Minimization algorithm, and various sufficient conditions for bounding bias and variance of estimators for general algorithms. We discuss settings in which it is possible to obtain exponential bounds, and we prove an extension of the bounded-difference inequality for "almost always" stable algorithms.
B (with Poggio, T., S. Mukherjee, R. Rifkin and A. Verri). In: Uncertainty in Geometric Computations, J. Winkler and M. Niranjan (eds.), Kluwer Academic Publishers, 131-141, 2002. Abstract:In this chapter we discuss the role of $b$, which is the constant in the standard form of the solutions provided by the Support Vector Machine technique $f(\bx x)=\Sum_i^l $\alpha_i$ K({\bf x}, {\bf x}_i) + b$, which is a special case of Regularization Machines. In the process, we describe properties of Reproducing Kernel Hilbert Spaces induced by different classes of kernels.
Bagging Regularizes (with T. Poggio, R. Rifkin, S. Mukherjee). AI Memo 2002-003, Massachusetts Institute of Technology, Cambridge, MA, February 2002. Abstract:Intuitively, we expect that averaging -- or bagging -- different regressors with low correlation should smooth their behavior and be somewhat similar to regularization. In this note we make this intuition precise. Using an almost classical definition of stability, we prove that a certain form of averaging provides generalization bounds with a rate of convergence of the same order as Tikhonov regularization -- similar to fashionable RKHS-based learning algorithms.
"Extra-label Information: Experiments with View-based Classification." (with G. Yeo and T. Poggio). Proceedings of the Sixth International Conference on Knowledge-Based Intelligent Information & Engineering Systems (KES'2002), Podere d'Ombriano, Crema, Italy, September 16-18, 2002. Abstract:Extra information is often readily available but not utilized in a classification paradigm. Here we explore using extra labels (profile faces and rotated faces) to aid in distinguishing faces versus non-faces. We propose a way to combine simple discriminant classifiers to build a more complex ones and justify the combination in a probabilistic setting.