First-Order Probabilistic Models for Information Extraction

Bhaskara Marthi
Brian Milch
Stuart Russell

Abstract: Information extraction (IE) is the problem of constructing a knowledge base from a corpus of text documents. In this paper, we argue that first-order probabilistic models (FOPMs) are a promising framework for IE, for two main reasons. First, FOPMs allow us to reason explicitly about entities that are mentioned in multiple documents, and to compute the probability that two strings refer to the same entity -- thus addressing the problem of coreference or record linkage in a principled way. Second, FOPMs allow us to resolve ambiguities in a text passage using information from the whole corpus, rather than disambiguating based on local cues alone and then trying to merge the results into a coherent knowledge base. This paper presents a comprehensive FOPM for a bibliographic database, and explains how the desired inference patterns emerge from the model.
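To give a flavor of the coreference computation mentioned in the abstract, the sketch below shows a toy Bayesian comparison of two citation strings. It is not the model from the paper: the prior p_same, the per-character noise rate, and the alphabet size are all made-up illustrative parameters, and the noisy-channel likelihood is a crude stand-in for the paper's full generative model over authors, papers, and citations.

    # Toy illustration only: posterior probability that two citation strings
    # refer to the same underlying paper, under a crude noisy-channel model.
    # All parameters (p_same, noise, alphabet) are assumptions for readability.

    def edit_distance(a: str, b: str) -> int:
        """Standard Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def prob_same_entity(s1: str, s2: str,
                         p_same: float = 0.5,    # prior that two mentions corefer
                         noise: float = 0.05,    # per-character corruption rate
                         alphabet: int = 64) -> float:
        """P(same entity | s1, s2) under a noisy-channel likelihood."""
        d = edit_distance(s1, s2)
        n = max(len(s1), len(s2))
        # If the mentions corefer, the d differing positions were corrupted.
        like_same = (noise ** d) * ((1 - noise) ** (n - d))
        # If they refer to different entities, characters agree only by chance.
        like_diff = (1.0 / alphabet) ** n
        num = p_same * like_same
        return num / (num + (1 - p_same) * like_diff)

    if __name__ == "__main__":
        a = "Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach"
        b = "S. Russel and P. Norvig, Artificial Intelligence - A Modern Approach"
        print(f"P(same entity) = {prob_same_entity(a, b):.3f}")

In the full first-order model described in the paper, this pairwise decision is replaced by joint inference over the whole corpus, so evidence from every mention of a paper or author can influence each coreference judgment.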

Appeared in: IJCAI 2003 Workshop on Learning Statistical Models from Relational Data, Acapulco, Mexico, August 2003.
