
BayesStore






Overview

As massive data acquisition (e.g., web documents, sensor networks, click-streams, software logs) and storage become increasingly affordable, a wide variety of enterprises are employing statistical and machine learning models for advanced probabilistic data analysis. For instance, information extraction systems apply statistical models over free text to extract structured data; pervasive computing applications must constantly reason about large volumes of noisy sensor readings to accomplish tasks like motion prediction and modeling of human behavior; web applications (e.g., social networks, recommendation systems) need to analyze click-streams to model their customers; and large software companies need to analyze huge volumes of software logs to predict and debug errors at run time.

One way to support such probabilistic data analyses over large volumes of data is a probabilistic database management system (PDBMS). Early PDBMS designs relied on somewhat simplistic models of uncertainty that map easily onto existing relational architectures. However, these designs introduce a gap between the statistical models used for probabilistic analytics and the uncertainty model in the PDBMS. Our solution to this “model-mismatch” problem is to support statistical models, evidence data, and inference algorithms as first-class citizens in a PDBMS.

BayesStore is a novel probabilistic data management architecture built on the principle of handling statistical models and probabilistic inference tools as first-class citizens of the database system. BayesStore represents model and evidence data as relational tables, implements inference algorithms efficiently in SQL, adds probabilistic relational operators to the query engine, and optimizes queries that combine relational and inference operators. The design goals of BayesStore are: (1) to support query processing over different models that is efficient relative to off-the-shelf machine learning libraries; (2) to provide an extensible API for plugging in new models and inference algorithms; and (3) to scale to very large data sets.
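To make the idea of "model and evidence as relational tables, inference in SQL" concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not BayesStore's schema or code: the table names (prior, likelihood, evidence), the two labels, and all probabilities are hypothetical, and the model is a toy naive-Bayes-style classifier rather than the graphical models described in the papers. The query computes a per-token posterior with a join followed by a normalizing aggregate.

```python
import sqlite3

# Hypothetical toy schema: model parameters and observed tokens live in tables.
SCHEMA_AND_DATA = """
CREATE TABLE prior(label TEXT PRIMARY KEY, prob REAL);
CREATE TABLE likelihood(label TEXT, word TEXT, prob REAL);
CREATE TABLE evidence(pos INTEGER PRIMARY KEY, word TEXT);

INSERT INTO prior VALUES ('street', 0.5), ('city', 0.5);
INSERT INTO likelihood VALUES
    ('street', 'main', 0.8), ('street', 'berkeley', 0.2),
    ('city',   'main', 0.1), ('city',   'berkeley', 0.9);
INSERT INTO evidence VALUES (0, 'main'), (1, 'berkeley');
"""

# Inference as SQL: join evidence with the model, then normalize per token.
POSTERIOR_QUERY = """
WITH joint AS (
    SELECT e.pos AS pos, pr.label AS label, pr.prob * l.prob AS score
    FROM evidence e
    JOIN likelihood l ON l.word = e.word
    JOIN prior pr ON pr.label = l.label
),
norm AS (
    SELECT pos, SUM(score) AS z FROM joint GROUP BY pos
)
SELECT j.pos, j.label, j.score / n.z AS posterior
FROM joint j JOIN norm n ON n.pos = j.pos
ORDER BY j.pos, j.label
"""


def token_posteriors():
    """Return {(pos, label): P(label | word at pos)} for the toy tables."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA_AND_DATA)
        rows = conn.execute(POSTERIOR_QUERY).fetchall()
    finally:
        conn.close()
    return {(pos, label): posterior for pos, label, posterior in rows}


if __name__ == "__main__":
    for key, value in sorted(token_posteriors().items()):
        print(key, round(value, 3))
```

Because the model is ordinary relational data, swapping in different parameters is an UPDATE or INSERT rather than a code change, which is the property the architecture description above emphasizes.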

Papers

[6] Hybrid In-Database Inference for Declarative Information Extraction
To appear, Proceedings of SIGMOD, 2011
Daisy Zhe Wang, Michael J. Franklin, Minos Garofalakis, Joseph M. Hellerstein, and Michael L. Wick

[5] Querying Probabilistic Information Extraction pvldb10 pvldb10slides
Proceedings of VLDB, 2010, PVLDB Vol.3
Daisy Zhe Wang, Michael J. Franklin, Minos Garofalakis, and Joseph M. Hellerstein

[4] Probabilistic Declarative Information Extraction icde10 icde10slides TR-pdb-ie
Proceedings of ICDE, 2010, short paper
Daisy Zhe Wang, Eirinaios Michelakis, Michael J. Franklin, Minos Garofalakis, and Joseph M. Hellerstein

[3] BayesStore: Managing Large, Uncertain Data Repositories with Probabilistic Graphical Models vldb08a slides
Proceedings of VLDB, 2008
Daisy Zhe Wang, Eirinaios Michelakis, Minos Garofalakis, and Joseph M. Hellerstein

[2] Granularity Conscious Modeling for Probabilistic Databases dune07
Proceedings of ICDM DUNE, 2007: 501-506
Eirinaios Michelakis, Daisy Zhe Wang, Minos N. Garofalakis, and Joseph M. Hellerstein

[1] Probabilistic Data Management for Pervasive Computing: The Data Furnace Project ieee06
IEEE Data Engineering Bulletin 29, No. 1: 57-63, 2006
Minos N. Garofalakis, Kurt P. Brown, Michael J. Franklin, Joseph M. Hellerstein, Daisy Zhe Wang, Eirinaios Michelakis, Liviu Tancau, Eugene Wu, Shawn R. Jeffery, and Ryan Aipperspach

Data

YellowBookAddress: Address Strings [text segmentation]
273,436 address strings scraped from yellow pages for small businesses. In the raw data file, each line contains one address string. In the training data file, tagged addresses are separated by an empty line. Five labels are used in the tagged training data file: 1 street number, 3 street address, 5 city, 6 state, 8 country. The Java implementation of the CRF model from the CRF open source project was used to perform the learning, and was compared with the SQL implementation of the Viterbi inference algorithm in [4].
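For readers unfamiliar with Viterbi inference, the following is a minimal sketch of the dynamic program over an address string, written in plain Python for readability. It is neither the SQL implementation from [4] nor the Java CRF package. The label names come from the dataset description above, but the scoring functions (emit_score, trans_score), the STREET_SUFFIXES lexicon, and every numeric weight are hand-made assumptions standing in for learned CRF feature weights.

```python
LABELS = ["street number", "street address", "city", "state", "country"]
STREET_SUFFIXES = {"St", "Ave", "Rd", "Blvd"}  # hypothetical lexicon


def emit_score(token, label):
    """Hand-made stand-in for learned per-token feature weights."""
    if token.isdigit():
        return 2.0 if label == "street number" else 0.0
    if label == "street number":
        return -2.0  # non-numeric tokens are unlikely street numbers
    if token in STREET_SUFFIXES:
        return 1.0 if label == "street address" else 0.0
    if len(token) == 2 and token.isupper():
        return 2.0 if label == "state" else 0.0
    return 0.0


def trans_score(prev_label, label):
    """Reward staying on a label or advancing to the next one in LABELS."""
    step = LABELS.index(label) - LABELS.index(prev_label)
    if step == 0:
        return 0.5
    if step == 1:
        return 1.0
    return -1.0


def viterbi(tokens, labels, emit, trans):
    """Return the highest-scoring label sequence for tokens."""
    if not tokens:
        return []
    # best[t][y]: best score of any labeling of tokens[:t+1] ending in y.
    best = [{y: emit(tokens[0], y) for y in labels}]
    back = []  # back[t-1][y]: label at t-1 on the best path ending in y at t
    for t in range(1, len(tokens)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[t - 1][p] + trans(p, y))
            scores[y] = best[t - 1][prev] + trans(prev, y) + emit(tokens[t], y)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    tokens = "123 Main St Berkeley CA".split()
    for token, label in zip(tokens, viterbi(tokens, LABELS, emit_score, trans_score)):
        print(token, "->", label)
```

The running time is proportional to the number of tokens times the square of the number of labels, since each position considers every pair of previous and current labels.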

Rich Thomason's BibTex Bibliography: Bibliography Strings [text segmentation]
18,312 bibliography strings extracted from the original fully tagged bibliography dataset with 27 labels.

Demo/Software

People