I am a final-year Ph.D. student in the EECS department at UC Berkeley. My advisors are Michael J. Franklin, Joseph M. Hellerstein, and Minos Garofalakis. My primary research interest is databases and systems for large-scale probabilistic data management and analysis using statistical machine learning. In my thesis work, I proposed, built, and evaluated BayesStore, a probabilistic database system that natively supports graphical models and their inference algorithms. I studied information extraction as the driving application.
- [NEW!] Feb/16/11 My Hybrid-Inference paper has been accepted to SIGMOD 2011! See you all in Athens, Greece!
- [NEW!] Feb/6/11 The Viterbi and MCMC inference implementations are included in MAD Library! GO MAD!
- [NEW!] Nov/17/10 I gave a CSAIL talk at MIT on “Querying Probabilistic Information Extraction”.
- [NEW!] Oct/29/10 My paper with the Almaden folks, “Selectivity Estimation for Extraction Operators over Text Data”, has been accepted to ICDE 2011. Woohoo!
- Aug/12/10 I will be presenting the paper “Querying Probabilistic Information Extraction” pvldb10 at VLDB 2010 in Singapore. Come to my talk!
- Mar/02/10 I am at ICDE 2010 in Long Beach, giving a talk on “Probabilistic Declarative Information Extraction”. See you there!
- Jan/13/10 I attended the Berkeley RAD Lab retreat at Lake Tahoe. It was very exciting to meet folks from many leading IT companies. I presented a poster that was named a runner-up! radlab-2010sp-retreat-poster
- Jan/05/10 I am visiting the University of Toronto and giving a talk at the Database Seminar.
- [NEW!] ICDE11, April 2011, Selectivity Estimation for Extraction Operators over Text Data icde11slides
- [NEW!] MIT, CSAIL Seminar, November 2010, Querying Probabilistic Information Extraction mit10slides
- [NEW!] VLDB10, September 2010, Querying Probabilistic Information Extraction pvldb10slides
- ICDE10, March 2010, Probabilistic Declarative Information Extraction icde10slides
- University of Toronto, DBSeminar, Jan 2010, BayesStore: Querying Probabilistic Information Extraction UofT10slides
- WebDB, June 2009, Functional Dependency Generation and Applications in Pay-As-You-Go Data Integration Systems webdb09slides
- Berkeley Machine Learning Tea, 8th May 2009, BayesStore: Supporting Statistical Models in Probabilistic Databases
- Stanford Info Lunch, 1st May 2009, Declarative Information Extraction in a Probabilistic Database System stanford09slides
- VLDB08, August 2008, BayesStore: Managing Large, Uncertain Data Repositories with Probabilistic Graphical Models vldb08slides
- Berkeley Database Seminar, 2006, Probabilistic Complex Event Triggering (PCET)
Hybrid In-Database Inference for Declarative Information Extraction sigmod11
To Appear, Proceedings of SIGMOD, 2011
Daisy Zhe Wang, Michael J. Franklin, Minos Garofalakis, Joseph M. Hellerstein, and Michael L. Wick
Selectivity Estimation for Extraction Operators over Text Data icde11 icde11slides
To Appear, Proceedings of ICDE, 2011
Daisy Zhe Wang, Long Wei, Yunyao Li, Frederick Reiss, and Shivakumar Vaithyanathan
Probabilistic Declarative Information Extraction icde10 icde10slides TR-pdb-ie
Proceedings of ICDE, 2010, short paper
Daisy Zhe Wang, Eirinaios Michelakis, Michael J. Franklin, Minos Garofalakis, and Joseph M. Hellerstein
Functional Dependency Generation and Applications in Pay-as-you-go Data Integration Systems webdb09 webdb09slides TR-probFDgen
Proceedings of SIGMOD WebDB, 2009
Daisy Zhe Wang, Luna Dong, Anish Das Sarma, Michael J. Franklin, and Alon Halevy
BayesStore: Managing Large, Uncertain Data Repositories with Probabilistic Graphical Models vldb08a vldb08slides
Proceedings of VLDB, 2008
Daisy Zhe Wang, Eirinaios Michelakis, Minos Garofalakis, and Joseph M. Hellerstein
WebTables: Exploring the Power of Tables on the Web vldb08b
Proceedings of VLDB, 2008
Michael Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang
- BayesStore: The BayesStore system is designed and built to support data analysis using graphical models and to enable ad-hoc queries over the uncertainties and probabilities inherent in the data and the analysis results. The fundamental ideas underlying BayesStore include: (1) creating a novel data model that treats uncertain relational data and graphical models of uncertainty as first-class objects; (2) implementing inference as a native operator in a query execution engine; (3) developing algorithms for relational operators over probabilistic models and data; and (4) devising query execution strategies that optimize across inference and relational operators. I used information extraction (IE) as the driving application for BayesStore.
- WebTables: As a member of the WebTables project at Google, I worked on the problem of scalably extracting the metadata of HTML tables from the entire Web. I developed statistical classifiers and rule-based detectors, which recovered the schemas of millions of HTML tables. This vast number opened up a whole new data-driven way of thinking about schemas. In addition, I developed and evaluated algorithms based on Bayes’ Theorem, which statistically derive probabilistic functional dependencies from the extracted schemas.
- SystemT: I collaborated with researchers at IBM Almaden Research on building an optimizer for SystemT, a rule-based information extraction system that uses AQL, a declarative SQL-like language. I developed estimators for the cost and the output size of text extractors, such as dictionary and regular-expression extractors. I further developed different document synopses for more accurate estimation of various statistics over text corpora. Experimental results demonstrated the accuracy of the estimators and the benefits of the optimizer.
- Probabilistic Complex Event Triggering (PCET): In my work on PCET, I built an infrastructure that automatically infers and reasons about the probabilities of triggered events by applying a principled probabilistic model (i.e., Bayes Nets) to the underlying noisy sensor data. I demonstrated that PCET simplifies the development process and, by using appropriate probabilistic models, boosts the accuracy of complex event-triggering systems, which deal with inherently uncertain and correlated data streams.
- Bonsai: In collaboration with David Purdy, a Ph.D. student from the statistics department, I explored the alternative of using interactive visualization as a means to cultivate the statistical modeling process over large datasets. I built Bonsai, an interactive visualization tool, and demonstrated how such a tool with different types of visualizations over the data can help in building better decision trees.
Long Wei (undergrad, fall 2009)
- Guided Long through the implementation and evaluation of various document synopses for text databases.
Michael Zhang (M.S. student, fall 2009)
- Supervised Michael through the implementation of a BayesStore demonstration.
Dwight Crow (undergrad, summer 2009)
- Guided Dwight through the feasibility study of clustering millions of HTML table schemas into domains.
Open Source Projects
PG-ML (with Milenko Petrovic): a PostgreSQL wrapper for statistical machine learning libraries
A Parable of Modern Research
Bob has lost his keys in a room which is dark except for one brightly lit corner.
“Why are you looking under the light, you lost them in the dark!”
“I can only see here.”