|
|
BEN RUBINSTEINPROJECTS
PROJECT DESCRIPTIONThis work assesses the security of machine learning algorithms in computer systems operating in adversarial environments. I am interested in the experimental and theoretical analysis of existing systems, such as email spam filtering, network anomaly detection, automatic worm signature generation, and detection of phishing webpage. Work within these domains involves designing attacks on learners, devising effective counter-measures, and producing theoretical guarantees on a learner's robustness (or lack thereof) to adversarially compromised data. A second complementary goal is the theoretical understanding of inherent trade-offs between generalization and security in whole families of learners, and across varying levels of adversarial information & control. While SecML offers a practical venue for worst-case analysis, opportunities come with the challenge of incorporating realistic threat models of adversarial capabilities that do not feature in existing studies. See also the project descriptions in the RAD Lab's SecML project page and the 2007 EECS Research Summaries. COLLABORATORS Marco A. Barreno (now Google), Peter L. Bartlett (Berkeley CS and Stat), Jack Chi (Berkeley CS), Ling Huang (Intel Research Berkeley), Anthony D. Joseph (Berkeley CS and Intel Research Berkeley), Blaine Nelson (Berkeley CS), Udam Saini (Berkeley CS), Charles Sutton (Berkeley CS), Nina Taft (Intel Research Berkeley), Doug Tygar (Berkeley CS & I School), Kai Xia (Berkeley CS) PRESENTATIONS
PROJECT DESCRIPTIONThis work explores a number of related problems in statistical learning theory centered around general geometric representations of concept classes. Concept classes on a finite domain X correspond to subsets of the Boolean |X|-cube, and connecting points differing on only one component by edges forms what's called the one-inclusion graph. A deep relationship between Valiant's Probably Approximately Correct (PAC) learning model and the one-inclusion graph has emerged. A bound on the one-inclusion graph density by the underlying concept class's Vapnik-Chervonenkis (VC) dimension leads to a bound on the worst-case expected risk of the graph's prediction strategy. In our NIPS06 paper we tightened the graph density bound, resolving a conjecture of Kuzmin & Warmuth's, and by bounding one-inclusion hypergraph density were able to develop mistake bounds for multi-class classification. The full-length paper explored cubical complex characterizations of concept classes that are maximum (having the largest cardinality for their VC-dimension) and maximal (experiencing an increase to VC-dimension with the addition of any new concept). In particular we extended the result of Dudley that all VC-1 maximum classes have one-inclusion trees, to VC-d maximum classes being contractible d-dimensional cubical complexes (algebraic topology's higher dimensional analogue of the tree). Our motivation for studying maximum/maximal classes is Littlestone & Warmuth's long-standing Sample Compression Conjecture. Sample compression schemes of size k compress a labeled sample of n>k examples down to a sub-sequence of length at most k, from which the original n labels can be reconstructed. The conjecture states that concept classes of VC-dimension d admit compression schemes of size O(d), which to-date is only known to be true for maximum classes; compressing maximal classes is necessary and sufficient for compressing all classes. We have resolved several conjectures of Kuzmin & Warmuth relating to their proposed Min-Peeling scheme. In our COLT08 paper we represent maximum classes as arrangements of Euclidean, Hyperbolic or Piecewise-Linear hyperplanes and compress these maximum classes by sweeping their arrangements. These results bring new geometric understanding to familiar concept classes, the VC-dimension, and an elegant approach to sample compression. See also the project descriptions in the 2006, 2007 EECS Research Summaries. COLLABORATORS Peter L. Bartlett (Berkeley CS and Stats), J. Hyam Rubinstein (Melbourne University Maths & Stats) PRESENTATIONS
PROJECT DESCRIPTIONThe property of algorithmic stability intuitively quantifies how much the image of a learning map may change under small perturbations to the training sample (similar to continuity for topological spaces). Many definitions of stability have been proposed and shown to be necessary and/or sufficient for learnability; and in some cases the presence of stability provides the easiest avenue to deriving risk bounds for a given learning algorithm. In this project, we positively resolve a conjecture of Kutin and Niyogi on the effect of multiple risk minimizers on the rate of a certain notion of stability. The probability of instability with respect to sample size undergoes a phase transition when the number of risk minimizers increases past one. See also the project description in the 2007 EECS Research Summaries. COLLABORATORS Aleksandr Simma (Berkeley Computer Science). [ top ]
PROJECT DESCRIPTIONDNA microarrays are a powerful and widely used tool for massively parallel measurement of gene expression (levels of genome-wide mRNA transcripts in a cell) - an important characteristic of cell state. Gene discovery, disease explanation and diagnosis, and drug discovery are common applications of gene expression analysis. In order to gain biologically useful information such as gene expression level or differential expression from the raw data produced by microarray assays, it is necessary to transform this low-level data in some intelligent way. In this investigation the classification task of detecting presence or absence of gene expression for the Affymetrix GeneChip oligonucleotide microarray is explored. New detection algorithms based on statistical rank-sum tests and various robust averages are compared to algorithms based on the popular robust multi-array average (RMA) expression level estimator, the Affymetrix Microarray Suite 5.0 (MAS5) expression level estimator and the MAS5 detection algorithm, using several spike-in datasets. Receiver Operating Characteristic (ROC) analysis is used to compare these algorithms, resulting in the conclusion that the new algorithms have superior detection performance in a range of practical situations. In particular the ROC Convex Hull (ROCCH) method of Provost and Fawcett is used to derive optimality conditions based on assumed misclassification costs and class priors. In addition to allowing comparison of detection algorithms based on the class prior and misclassification cost ranges relevant to the particular problem at hand, the ROCCH method allows principled selection of threshold parameters for the chosen detection algorithm. COLLABORATORS Simon Cawley (Affymetrix), Rao Kotagiri (Melbourne University CS), Jon McAuliffe (now UPenn Wharton Statistics), Marimuthu Palaniswami (Melbourne University EE), Terence P. Speed (Berkeley Statistics and WEHI Bioinformatics). PRESENTATIONS
FUNDING
PROJECT DESCRIPTIONDue to the unintuitive nature of quantum algorithms, it has been difficult for researchers to discover new algorithms for tomorrow's quantum computers. Thus for some time it has been desirable to automate part of the design process. Evolutionary computing is an area of study that has proven useful when applied to automated design, especially where humans find manual design difficult. This project investigated the generation of quantum algorithms using genetic programming, a popular technique within the evolutionary computing community. The problem tackled is an instance of supervised learning where the GP is given pairs of inputs and outputs produced by some (hypothetical) quantum computer, and the task of the GP is to search for a quantum algorithm/circuit that can fit these pairs and thereby approximate the behavior of the computer. PRESENTATIONS
|