One way to support such probabilistic data analyses over large volumes of data is a probabilistic data management system (PDBMS). Early PDBMS designs relied on somewhat simplistic models of uncertainty that map easily onto existing relational architectures. However, these designs introduce a gap between the statistical models used for probabilistic analytics and the uncertainty model in the PDBMS. Our solution to this “model-mismatch” problem is to support statistical models, evidence data, and inference algorithms as first-class citizens in a PDBMS.
BayesStore is a novel probabilistic data management architecture built on the principle of handling statistical models and probabilistic inference tools as first-class citizens of the database system. BayesStore represents model and evidence data as relational tables, implements inference algorithms efficiently in SQL, adds probabilistic relational operators to the query engine, and optimizes queries that combine relational and inference operators. The design goals of BayesStore are: (1) to support efficient query processing over different models, relative to off-the-shelf machine learning libraries; (2) to provide an extensible API for plugging in new models and inference algorithms; and (3) to scale to very large data sets.
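As a rough illustration of what storing model and evidence data as relational tables can look like, the sketch below uses Python's built-in `sqlite3` module. The table names (`evidence`, `factor`), their columns, and all scores are hypothetical and chosen only for this example; they are not BayesStore's actual schema. The query picks the highest-scoring label for each token independently, which is far simpler than the inference algorithms BayesStore implements, but it shows how a model lookup can be phrased as an ordinary join.

```python
import sqlite3

# In-memory database; schema and values are illustrative assumptions only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE evidence (doc_id INTEGER, pos INTEGER, token TEXT);
CREATE TABLE factor   (token TEXT, label TEXT, score REAL);
""")

# Evidence: one tokenized address string.
conn.executemany(
    "INSERT INTO evidence VALUES (?, ?, ?)",
    [(1, 0, "12"), (1, 1, "Main"), (1, 2, "Berkeley")],
)

# Model: made-up per-token label scores stored as rows.
conn.executemany(
    "INSERT INTO factor VALUES (?, ?, ?)",
    [
        ("12", "street_number", 2.0), ("12", "city", 0.1),
        ("Main", "street_address", 1.5), ("Main", "city", 0.4),
        ("Berkeley", "street_address", 0.3), ("Berkeley", "city", 1.8),
    ],
)

# Join evidence with the model and keep the best-scoring label per token.
rows = conn.execute("""
SELECT e.pos, e.token, f.label, f.score
FROM evidence e
JOIN factor f ON f.token = e.token
WHERE f.score = (SELECT MAX(f2.score) FROM factor f2 WHERE f2.token = e.token)
ORDER BY e.pos
""").fetchall()

for pos, token, label, score in rows:
    print(pos, token, label, score)
```

Keeping both the model and the evidence in tables means the same query engine that filters and joins the data can also consult the model, which is the intuition behind treating models as first-class citizens.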
 Hybrid In-Database Inference for Declarative Information Extraction
To Appear, Proceedings of SIGMOD, 2011
Daisy Zhe Wang, Michael J. Franklin, Minos Garofalakis, Joseph M. Hellerstein, and Michael L. Wick
 Probabilistic Declarative Information Extraction
Proceedings of ICDE, 2010, short paper
Daisy Zhe Wang, Eirinaios Michelakis, Michael J. Franklin, Minos Garofalakis, and Joseph M. Hellerstein
 BayesStore: Managing Large, Uncertain Data Repositories with Probabilistic Graphical Models
Proceedings of VLDB, 2008
Daisy Zhe Wang, Eirinaios Michelakis, Minos Garofalakis, and Joseph M. Hellerstein
 Granularity Conscious Modeling for Probabilistic Databases
Proceedings of ICDM DUNE, 2007: 501-506
Eirinaios Michelakis, Daisy Zhe Wang, Minos N. Garofalakis, Joseph M. Hellerstein
 Probabilistic Data Management for Pervasive Computing: The Data Furnace Project
IEEE Data Engineering Bulletin 29, No. 1: 57-63, 2006
Minos N. Garofalakis, Kurt P. Brown, Michael J. Franklin, Joseph M. Hellerstein, Daisy Zhe Wang, Eirinaios Michelakis, Liviu Tancau, Eugene Wu, Shawn R. Jeffery, Ryan Aipperspach
YellowBookAddress: Address Strings [text segmentation]
273,436 address strings scraped from yellow pages for small businesses. In the raw data file, each line contains one address string. In the training data file, tagged addresses are separated by an empty line. Five labels are used in the tagged training data file: 1 street number, 3 street address, 5 city, 6 state, 8 country. The Java implementation of the CRF model from the CRF open source project was used for learning, and was compared with the SQL implementation of the Viterbi inference algorithm in .
- ViterbiCRF implementation in PostgreSQL 8.4.1
- Viterbi and MCMC inference implementations in MAD Library
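For readers unfamiliar with Viterbi inference over a linear-chain model such as a CRF, the following is a minimal Python sketch of the dynamic program applied to address segmentation. It is not the SQL or MAD Library implementation listed above. The label names, the `emit` and `trans` scoring functions, and the tiny city lexicon are all hypothetical stand-ins for learned CRF feature weights.

```python
# Hypothetical label set; the dataset itself uses numeric label ids.
LABELS = ["street_number", "street_address", "city"]
CITY_WORDS = {"Berkeley"}  # toy lexicon, an assumption for this example


def emit(token, label):
    """Toy log-space score for assigning `label` to `token`."""
    if token.isdigit():
        return 2.0 if label == "street_number" else 0.0
    if token in CITY_WORDS:
        return 2.0 if label == "city" else 0.0
    return 1.0 if label == "street_address" else 0.0


def trans(prev_label, label):
    """Toy transition score: reward labels that keep the usual address order."""
    return 1.0 if LABELS.index(label) >= LABELS.index(prev_label) else -1.0


def viterbi(tokens, labels, emit, trans):
    """Return (best_score, best_label_sequence) for a linear-chain model."""
    if not tokens:
        return 0.0, []
    # best[i][y]: score of the best labeling of tokens[: i + 1] that ends in y
    best = [{y: emit(tokens[0], y) for y in labels}]
    back = []  # back[i][y]: best predecessor label of y at position i + 1
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans(p, y))
            scores[y] = best[-1][prev] + trans(prev, y) + emit(tok, y)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label to recover the path.
    last = max(labels, key=lambda y: best[-1][y])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[-1][last], path


score, path = viterbi(["12", "Main", "St", "Berkeley"], LABELS, emit, trans)
print(score, path)
```

The dynamic program keeps one score per label at each position, so its cost grows linearly with the number of tokens and quadratically with the number of labels.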