Results 1 - 10 of 81
BLOG: Probabilistic models with unknown objects
In IJCAI, 2005
"... This paper introduces and illustrates BLOG, a formal language for defining probability models over worlds with unknown objects and identity uncertainty. BLOG unifies and extends several existing approaches. Subject to certain acyclicity constraints, every BLOG model specifies a unique probability di ..."
Abstract
-
Cited by 179 (10 self)
- Add to MetaCart
(Show Context)
Abstract: This paper introduces and illustrates BLOG, a formal language for defining probability models over worlds with unknown objects and identity uncertainty. BLOG unifies and extends several existing approaches. Subject to certain acyclicity constraints, every BLOG model specifies a unique probability distribution over first-order model structures that can contain varying and unbounded numbers of objects. Furthermore, complete inference algorithms exist for a large fragment of the language. We also introduce a probabilistic form of Skolemization for handling evidence.
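To make the "unknown objects" idea concrete, here is a minimal Python sketch (not BLOG syntax) of the generative semantics the abstract describes: a world first draws how many objects exist, then draws their attributes, so queries are answered over worlds of varying size. The Poisson prior, the attribute, and the query are illustrative assumptions, not taken from the paper.

```python
import math
import random

def poisson(lam):
    # Knuth's inverse-transform sampler for a Poisson variate.
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def sample_world(mean_num_objects=4.0):
    # "Number statement": the number of objects is itself random.
    n = poisson(mean_num_objects)
    # "Dependency statements": attributes are drawn per object.
    return [{"id": i, "position": random.gauss(0.0, 10.0)} for i in range(n)]

# Monte Carlo query over worlds of varying size:
# estimate P(at least 5 objects exist).
worlds = [sample_world() for _ in range(10_000)]
print(sum(len(w) >= 5 for w in worlds) / len(worlds))
```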
Reference reconciliation in complex information spaces
In SIGMOD, 2005
"... Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed references to a single class that had a fair number of attributes (e.g., research publications). W ..."
Abstract
-
Cited by 157 (1 self)
- Add to MetaCart
(Show Context)
Abstract: Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed references to a single class that had a fair number of attributes (e.g., research publications). We consider complex information spaces: our references belong to multiple related classes and each reference may have very few attribute values. A prime example of such a space is Personal Information Management, where the goal is to provide a coherent view of all the information on one’s desktop. Our reconciliation algorithm has three principal features. First, we exploit the associations between references to design new methods for reference comparison. Second, we propagate information between reconciliation decisions to accumulate positive and negative evidence. Third, we gradually enrich references by merging attribute values. Our experiments show that (1) we considerably improve precision and recall over standard methods on a diverse set of personal information datasets, and (2) there are advantages to using our algorithm even on a standard citation dataset benchmark.
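As a rough illustration of the three features named above, the following Python sketch compares references by attribute overlap plus already-reconciled associations, re-runs comparisons so decisions propagate, and merges attribute values on reconciliation. The data layout, weights, and threshold are illustrative assumptions, not the paper's algorithm.

```python
def canon(x, merged):
    # Follow merge links to the canonical representative.
    while x in merged:
        x = merged[x]
    return x

def similarity(a, b, merged):
    # Direct attribute overlap (references may carry very few values).
    overlap = len(a["values"] & b["values"]) / max(len(a["values"] | b["values"]), 1)
    # Association evidence: shared already-reconciled neighbors.
    neighbors_a = {canon(x, merged) for x in a["assoc"]}
    neighbors_b = {canon(x, merged) for x in b["assoc"]}
    return overlap + 0.5 * len(neighbors_a & neighbors_b)

def reconcile(refs, threshold=0.8):
    # refs: id -> {"values": set of attribute values, "assoc": set of ref ids}
    merged = {}              # reference id -> id it was merged into
    changed = True
    while changed:           # propagation: each merge may enable new merges
        changed = False
        ids = sorted(refs)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                ca, cb = canon(a, merged), canon(b, merged)
                if ca == cb:
                    continue
                if similarity(refs[ca], refs[cb], merged) >= threshold:
                    merged[cb] = ca
                    # Enrichment: the surviving reference absorbs b's values.
                    refs[ca]["values"] |= refs[cb]["values"]
                    changed = True
    return {r: canon(r, merged) for r in refs}
```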
A Mention-Synchronous Coreference Resolution Algorithm Based on the Bell Tree
In Proc. of the ACL, 2004
"... This paper proposes a new approach for coreference resolution which uses the Bell tree to represent the search space and casts the coreference resolution problem as finding the best path from the root of the Bell tree to the leaf nodes. A Maximum Entropy model is used to rank these paths. The corefe ..."
Abstract
-
Cited by 118 (9 self)
- Add to MetaCart
(Show Context)
Abstract: This paper proposes a new approach for coreference resolution which uses the Bell tree to represent the search space and casts the coreference resolution problem as finding the best path from the root of the Bell tree to the leaf nodes. A Maximum Entropy model is used to rank these paths. We report coreference performance on the 2002 and 2003 Automatic Content Extraction (ACE) data. We also train a coreference system on the MUC6 data and obtain competitive results.
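The Bell-tree search can be pictured with a small beam-search sketch in Python: each search state is a partition of the mentions seen so far, and each new mention either joins an existing entity or starts a new one. The link_score function below is a toy stand-in for the paper's Maximum Entropy model, and the beam width is an illustrative assumption.

```python
def link_score(entity, mention):
    # Toy stand-in: probability the mention corefers with the entity.
    return 0.9 if any(mention.lower() == m.lower() for m in entity) else 0.1

def best_partition(mentions, beam_size=5):
    beam = [([], 1.0)]                     # (partition, path score)
    for mention in mentions:
        candidates = []
        for partition, score in beam:
            # Extend: the mention joins each existing entity in turn...
            for i, entity in enumerate(partition):
                p = link_score(entity, mention)
                new = [e + [mention] if j == i else e
                       for j, e in enumerate(partition)]
                candidates.append((new, score * p))
            # ...or starts a new entity.
            start_p = 1.0 if not partition else min(
                1.0 - link_score(e, mention) for e in partition)
            candidates.append((partition + [[mention]], score * start_p))
        # Keep only the best paths through the Bell tree.
        beam = sorted(candidates, key=lambda c: -c[1])[:beam_size]
    return beam[0]

print(best_partition(["Clinton", "she", "Clinton", "the president"]))
```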
Supervised clustering with support vector machines
In ICML, 2005
"... Supervised clustering is the problem of training a clustering algorithm to produce desirable clusterings: given sets of items and complete clusterings over these sets, we learn how to cluster future sets of items. Example applications include noun-phrase coreference clustering, and clustering news a ..."
Abstract
-
Cited by 94 (5 self)
- Add to MetaCart
(Show Context)
Abstract: Supervised clustering is the problem of training a clustering algorithm to produce desirable clusterings: given sets of items and complete clusterings over these sets, we learn how to cluster future sets of items. Example applications include noun-phrase coreference clustering and clustering news articles by whether they refer to the same topic. In this paper we present an SVM algorithm that trains a clustering algorithm by adapting the item-pair similarity measure. The algorithm can optimize a variety of clustering functions for a variety of clustering performance measures. We empirically evaluate the algorithm for noun-phrase and news article clustering.
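A simplified sketch of the item-pair idea: learn a similarity function on pair features from gold clusterings, then cluster with it. Note this replaces the paper's structural-SVM training with an ordinary pairwise SVM plus greedy single-link assignment; the features, data format, and threshold are illustrative assumptions.

```python
from sklearn.svm import SVC

def pair_features(a, b):
    # Toy features for two items represented as token sets.
    return [len(a & b), len(a | b), abs(len(a) - len(b))]

def train_similarity(items, gold_cluster_ids):
    # Positive pairs are items sharing a gold cluster id.
    X, y = [], []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            X.append(pair_features(items[i], items[j]))
            y.append(1 if gold_cluster_ids[i] == gold_cluster_ids[j] else 0)
    return SVC(kernel="linear").fit(X, y)

def cluster(items, model, threshold=0.0):
    clusters = []
    for item in items:                      # greedy single-link assignment
        for c in clusters:
            scores = [model.decision_function([pair_features(item, other)])[0]
                      for other in c]
            if max(scores) > threshold:
                c.append(item)
                break
        else:
            clusters.append([item])
    return clusters

items = [{"john", "smith"}, {"j", "smith"}, {"mary", "jones"}, {"m", "jones"}]
model = train_similarity(items, gold_cluster_ids=[0, 0, 1, 1])
print(cluster(items, model))
```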
A Platform for Personal Information Management and Integration
"... The explosion of the amount of information available in digital form has made search a hot research topic for the Information Management Community. While most of the research on search is focused on the WWW, individual computer users have developed their own vast collections of data on their desktop ..."
Abstract
-
Cited by 76 (6 self)
- Add to MetaCart
(Show Context)
Abstract: The explosion of the amount of information available in digital form has made search a hot research topic for the Information Management Community. While most of the research on search is focused on the WWW, individual computer users have developed their own vast collections of data on their desktops, and these collections are in critical need of good search tools. We describe the Semex System, which offers users a flexible platform for personal information management. Semex has two main goals. The first goal is to enable browsing personal information by semantically meaningful associations. The challenge is to automatically create such associations between data items on one’s desktop, and to create enough of them that Semex becomes an indispensable tool. Our second goal is to leverage the personal information space we created to increase users’ productivity. As our first target, Semex leverages the personal information to enable lightweight information integration tasks that are discouragingly difficult to perform with today’s tools.
Iterative Record Linkage for Cleaning and Integration
2004
"... Record linkage, the problem of determining when two records refer to the same entity, has applications for both data cleaning (deduplication) and for integrating data from multiple sources. Traditional approaches use a similarity measure that compares tuples ’ attribute values; tuples with similarit ..."
Abstract
-
Cited by 76 (10 self)
- Add to MetaCart
Abstract: Record linkage, the problem of determining when two records refer to the same entity, has applications for both data cleaning (deduplication) and for integrating data from multiple sources. Traditional approaches use a similarity measure that compares tuples’ attribute values; tuples with similarity scores above a certain threshold are declared to be matches. While this method can perform quite well in many domains, particularly domains where there is not a large amount of noise in the data, in some domains looking only at tuple values is not enough. By also examining the context of the tuple, i.e., the other tuples to which it is linked, we can come up with a more accurate linkage decision. But this additional accuracy comes at a price. In order to correctly find all duplicates, we may need to make multiple passes over the data; as linkages are discovered, they may in turn allow us to discover additional linkages.
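The multi-pass behavior the abstract ends on can be sketched as a fixpoint loop: merging two tuples changes the context (linked tuples) of others, which can push new pairs over the match threshold on a later pass. The data layout, context weight, and threshold below are illustrative assumptions.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def iterative_linkage(tuples, links, threshold=0.6):
    # tuples: id -> set of attribute values; links: id -> set of linked ids
    cluster = {t: t for t in tuples}        # id -> cluster representative
    while True:
        merged_any = False
        for a in tuples:
            for b in tuples:
                ra, rb = cluster[a], cluster[b]
                if ra == rb:
                    continue
                score = jaccard(tuples[a], tuples[b])
                # Context: compare linked neighborhoods under the
                # clustering discovered so far.
                score += 0.5 * jaccard({cluster[x] for x in links[a]},
                                       {cluster[x] for x in links[b]})
                if score >= threshold:
                    for t, r in cluster.items():   # merge rb into ra
                        if r == rb:
                            cluster[t] = ra
                    merged_any = True
        if not merged_any:      # fixpoint: a full pass found no new links
            return cluster
```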
Bayesian Learning in Undirected Graphical Models: Approximate MCMC algorithms
2004
"... Bayesian learning in undirected graphical models --- computing posterior distributions over parameters and predictive quantities --- is exceptionally difficult. We conjecture that for general undirected models, there are no tractable MCMC (Markov Chain Monte Carlo) schemes giving the correct equilib ..."
Abstract
-
Cited by 51 (2 self)
- Add to MetaCart
Abstract: Bayesian learning in undirected graphical models (computing posterior distributions over parameters and predictive quantities) is exceptionally difficult. We conjecture that for general undirected models, there are no tractable MCMC (Markov Chain Monte Carlo) schemes giving the correct equilibrium distribution over parameters. While this intractability, due to the partition function, is familiar to those performing parameter optimisation, Bayesian learning of posterior distributions over undirected model parameters has been unexplored and poses novel challenges. We propose several approximate MCMC schemes and test them on fully observed binary models (Boltzmann machines) for a small coronary heart disease data set and larger artificial systems. While approximations must perform well on the model, their interaction with the sampling scheme is also important. Samplers based on variational mean-field approximations generally performed poorly; more advanced methods using loopy propagation, brief sampling, and stochastic dynamics led to acceptable parameter posteriors. Finally, we demonstrate these techniques on a Markov random field with hidden variables.
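To see where the partition function bites, consider a Metropolis sampler over Boltzmann-machine weights: the acceptance ratio needs the intractable normalizer, so the sketch below substitutes the tractable pseudo-likelihood as a surrogate. That substitution is an illustrative workaround chosen here; it is not necessarily one of the approximation schemes the paper proposes.

```python
import math
import random

def pseudo_log_lik(W, b, data):
    # sum_i log p(x_i | x_{-i}): each conditional is a logistic unit, so
    # no sum over all 2^d joint states (the partition function) is needed.
    total = 0.0
    for x in data:
        for i in range(len(x)):
            field = b[i] + sum(W[i][j] * x[j] for j in range(len(x)) if j != i)
            p1 = 1.0 / (1.0 + math.exp(-field))
            total += math.log(p1 if x[i] == 1 else 1.0 - p1)
    return total

def mh_posterior(data, d, steps=2000, step_size=0.1, prior_sd=1.0):
    # Metropolis over symmetric weights W (d >= 2; biases held at 0).
    W = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    cur = pseudo_log_lik(W, b, data)
    samples = []
    for _ in range(steps):
        i, j = sorted(random.sample(range(d), 2))
        prop_W = [row[:] for row in W]
        prop_W[i][j] = prop_W[j][i] = W[i][j] + random.gauss(0, step_size)
        new = pseudo_log_lik(prop_W, b, data)
        # Gaussian prior ratio on the perturbed weight.
        log_prior = -(prop_W[i][j] ** 2 - W[i][j] ** 2) / (2 * prior_sd ** 2)
        if math.log(random.random()) < new - cur + log_prior:
            W, cur = prop_W, new
        samples.append([row[:] for row in W])
    return samples

data = [(1, 1, 0), (1, 1, 0), (0, 0, 1)]
posterior = mh_posterior(data, d=3, steps=500)
```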
Identification and tracing of ambiguous names: Discriminative and generative approaches
In Proc. of the National Conference on Artificial Intelligence (AAAI), 2004
"... A given entity – representing a person, a location or an organization – may be mentioned in text in multiple, ambiguous ways. Understanding natural language requires identifying whether different mentions of a name, within and across documents, represent the same entity. We present two machine learn ..."
Abstract
-
Cited by 40 (10 self)
- Add to MetaCart
Abstract: A given entity, representing a person, a location, or an organization, may be mentioned in text in multiple, ambiguous ways. Understanding natural language requires identifying whether different mentions of a name, within and across documents, represent the same entity. We present two machine learning approaches to this problem, which we call the “Robust Reading” problem. Our first approach is a discriminative approach, trained in a supervised way. Our second approach is a generative model, at the heart of which is a view on how documents are generated and how names (of different entity types) are “sprinkled” into them. In its most general form, our model assumes: (1) a joint distribution over entities (e.g., a document that mentions “President Kennedy” is more likely to mention “Oswald” or “White House” than “Roger Clemens”), (2) an “author” model, which assumes that at least one mention of an entity in a document is easily identifiable and then generates other mentions via (3) an appearance model, governing how mentions are transformed from the “representative” mention. We show that both approaches perform very accurately, in the range of 90%–95% F1 measure for different entity types, much better than previous approaches to (some aspects of) this problem. Our extensive experiments exhibit the contribution of relational and structural features and show, somewhat surprisingly, that the assumptions made within our generative model are strong enough to yield a very powerful approach that performs better than a supervised approach with limited supervised information.
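The three-part generative story can be sketched as a toy document generator: entities are drawn jointly, each emits one easily identifiable representative mention, and an appearance model corrupts the representative into further mentions. Every distribution and name table below is a toy assumption, not the paper's model.

```python
import random

CO_OCCUR = {                      # (1) entities that tend to appear together
    "John F. Kennedy": ["Lee Harvey Oswald", "White House"],
    "Roger Clemens": ["Yankee Stadium"],
}

def appearance(representative):
    # (3) transform the representative mention, e.g. keep one name part.
    parts = representative.split()
    return random.choice([representative, parts[-1], parts[0]])

def generate_document(seed_entity, extra_mentions=3):
    entities = [seed_entity] + CO_OCCUR.get(seed_entity, [])
    mentions = []
    for e in entities:
        mentions.append(e)                        # (2) representative mention
        for _ in range(random.randrange(extra_mentions)):
            mentions.append(appearance(e))        # (3) variant mentions
    return mentions

print(generate_document("John F. Kennedy"))
```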
Efficient name disambiguation for large-scale databases
In PKDD, 2006
"... Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retr ..."
Abstract
-
Cited by 36 (3 self)
- Add to MetaCart
(Show Context)
Abstract: Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names, and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is calculated by an online active selection support vector machine algorithm (LASVM), yielding a simpler model, lower test errors, and faster prediction time than a standard SVM. We prove that by recasting transitivity as density reachability in DBSCAN, transitivity is guaranteed for core points. For evaluation, we manually annotated 3,355 papers, yielding 490 authors, and achieved 90.6% pairwise F1. For scalability, authors in the entire CiteSeer dataset, over 700,000 papers, were readily disambiguated.
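The pipeline shape (blocking, then density-based clustering over a learned pairwise distance) can be sketched with scikit-learn's DBSCAN on a precomputed distance matrix. The paper derives that matrix from a LASVM classifier; the distance function below is a toy token-overlap stand-in, and the blocking key and eps/min_samples values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def block_key(record):
    # Blocking: first initial + last name retrieves candidate classes.
    first, last = record["author"].split()[0], record["author"].split()[-1]
    return (first[0].lower(), last.lower())

def distance(a, b):
    # Stand-in for the learned metric: 1 - Jaccard overlap of title words.
    ta, tb = set(a["title"].lower().split()), set(b["title"].lower().split())
    return 1.0 - len(ta & tb) / len(ta | tb)

def disambiguate(records, eps=0.7, min_samples=2):
    blocks = {}
    for r in records:
        blocks.setdefault(block_key(r), []).append(r)
    labeled = []
    for block in blocks.values():
        # Cluster each candidate class by author via density reachability.
        D = np.array([[distance(a, b) for b in block] for a in block])
        labels = DBSCAN(eps=eps, min_samples=min_samples,
                        metric="precomputed").fit_predict(D)
        labeled.append(list(zip(labels, block)))
    return labeled
```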
BLOG: Relational modeling with unknown objects
In ICML 2004 Workshop on Statistical Relational Learning and Its Connections
"... In many real-world probabilistic reasoning problems, one of the questions we want to answer is: how many objects are out there? Examples of such problems range from multitarget tracking to extracting information from text documents. However, most probabilistic modeling formalisms — even firstorder o ..."
Abstract
-
Cited by 35 (2 self)
- Add to MetaCart
(Show Context)
Abstract: In many real-world probabilistic reasoning problems, one of the questions we want to answer is: how many objects are out there? Examples of such problems range from multitarget tracking to extracting information from text documents. However, most probabilistic modeling formalisms — even first-order ones — assume a fixed, known set of objects. We introduce a language called BLOG for specifying probability distributions over relational structures that include varying sets of objects. In this paper we present BLOG informally, by means of example models for multi-target tracking and citation matching. We discuss some attractive features of BLOG models and some avenues of future work.