Results 1 - 10 of 34
Reference reconciliation in complex information spaces
- In SIGMOD, 2005
Cited by 88 (1 self)
Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed references to a single class that had a fair number of attributes (e.g., research publications). We consider complex information spaces: our references belong to multiple related classes and each reference may have very few attribute values. A prime example of such a space is Personal Information Management, where the goal is to provide a coherent view of all the information on one’s desktop. Our reconciliation algorithm has three principal features. First, we exploit the associations between references to design new methods for reference comparison. Second, we propagate information between reconciliation decisions to accumulate positive and negative evidence. Third, we gradually enrich references by merging attribute values. Our experiments show that (1) we considerably improve precision and recall over standard methods on a diverse set of personal information datasets, and (2) there are advantages to using our algorithm even on a standard citation dataset benchmark.
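The third feature in this abstract, enriching references by merging attribute values, can be illustrated with a minimal sketch. This is not the paper's algorithm: the matching rule (two references match if they share any value of a key attribute), the attribute names in `KEY_ATTRS`, and the sample data are all assumptions made for illustration.

```python
# Hypothetical key attributes; the paper's actual comparison methods are richer.
KEY_ATTRS = ("email", "name")


def shares_key_value(a, b):
    """Return True if two references share at least one value of a key attribute."""
    return any(a.get(k, set()) & b.get(k, set()) for k in KEY_ATTRS)


def reconcile(references):
    """Greedily merge references, pooling attribute values after each merge.

    Each reference is a dict mapping an attribute name to a collection of values.
    A merged reference carries the union of its members' values, so it can
    match references that none of its members matched individually.
    """
    clusters = [{k: set(v) for k, v in ref.items()} for ref in references]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if shares_key_value(clusters[i], clusters[j]):
                    # Enrich cluster i with every attribute value of cluster j.
                    for attr, values in clusters[j].items():
                        clusters[i].setdefault(attr, set()).update(values)
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters


refs = [
    {"name": {"A. Smith"}, "email": {"as@example.org"}},
    {"name": {"Alice Smith"}, "email": {"as@example.org"}},
    {"name": {"Alice Smith"}, "email": {"alice@example.com"}},
    {"name": {"Bob Jones"}},
]
result = reconcile(refs)
```

In this sample the first and third references share no value directly; they end up together only because the second reference contributes the name "Alice Smith" to the merged record.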
A Mention-Synchronous Coreference Resolution Algorithm Based on the Bell Tree
- In Proc. of the ACL, 2004
Cited by 64 (4 self)
This paper proposes a new approach for coreference resolution which uses the Bell tree to represent the search space and casts the coreference resolution problem as finding the best path from the root of the Bell tree to the leaf nodes. A Maximum Entropy model is used to rank these paths. The coreference performance on the 2002 and 2003 Automatic Content Extraction (ACE) data will be reported. We also train a coreference system using the MUC6 data and competitive results are obtained.
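To make the search space concrete, the following sketch enumerates the leaves of a Bell tree by processing mentions in order: each mention either links to an existing entity or starts a new one. It covers only the enumeration; the Maximum Entropy ranking of paths described in the abstract is not modelled, and the function name is hypothetical.

```python
def bell_tree_leaves(mentions):
    """Enumerate every partition of `mentions` into entities, in mention order."""
    partials = [[]]  # root of the tree: no mentions seen, no entities yet
    for mention in mentions:
        next_partials = []
        for partition in partials:
            # Option 1: link the mention to each existing entity in turn.
            for i in range(len(partition)):
                linked = [list(entity) for entity in partition]
                linked[i].append(mention)
                next_partials.append(linked)
            # Option 2: start a new entity containing only this mention.
            next_partials.append([list(entity) for entity in partition] + [[mention]])
        partials = next_partials
    return partials


leaves = bell_tree_leaves(["m1", "m2", "m3"])
```

The number of leaves is the Bell number of the mention count (1, 2, 5, 15, ...), which grows quickly, so exhaustive enumeration like this is feasible only for a handful of mentions and a practical system has to prune or rank paths instead.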
A Platform for Personal Information Management and Integration
Cited by 46 (4 self)
The explosion of the amount of information available in digital form has made search a hot research topic for the Information Management Community. While most of the research on search is focused on the WWW, individual computer users have developed their own vast collections of data on their desktops, and these collections are in critical need of good search tools. We describe the Semex System that offers users a flexible platform for personal information management. Semex has two main goals. The first goal is to enable browsing personal information by semantically meaningful associations. The challenge is to automatically create such associations between data items on one’s desktop, and to create enough of them so Semex becomes an indispensable tool. Our second goal is to leverage the personal information space we created to increase users’ productivity. As our first target, Semex leverages the personal information to enable lightweight information integration tasks that are discouragingly difficult to perform with today’s tools.
Supervised clustering with support vector machines
- In ICML, 2005
Cited by 37 (4 self)
Supervised clustering is the problem of training a clustering algorithm to produce desirable clusterings: given sets of items and complete clusterings over these sets, we learn how to cluster future sets of items. Example applications include noun-phrase coreference clustering, and clustering news articles by whether they refer to the same topic. In this paper we present an SVM algorithm that trains a clustering algorithm by adapting the item-pair similarity measure. The algorithm may optimize a variety of different clustering functions to a variety of clustering performance measures. We empirically evaluate the algorithm for noun-phrase and news article clustering.
Bayesian Learning in Undirected Graphical Models: Approximate MCMC algorithms
- 2004
Cited by 27 (1 self)
Bayesian learning in undirected graphical models (computing posterior distributions over parameters and predictive quantities) is exceptionally difficult. We conjecture that for general undirected models, there are no tractable MCMC (Markov Chain Monte Carlo) schemes giving the correct equilibrium distribution over parameters. While this intractability, due to the partition function, is familiar to those performing parameter optimisation, Bayesian learning of posterior distributions over undirected model parameters has been unexplored and poses novel challenges. We propose several approximate MCMC schemes and test on fully observed binary models (Boltzmann machines) for a small coronary heart disease data set and larger artificial systems. While approximations must perform well on the model, their interaction with the sampling scheme is also important. Samplers based on variational mean-field approximations generally performed poorly; more advanced methods using loopy propagation, brief sampling and stochastic dynamics lead to acceptable parameter posteriors. Finally, we demonstrate these techniques on a Markov random field with hidden variables.
Blog: Relational modeling with unknown objects
- ICML 2004 Workshop on Statistical Relational Learning and Its Connections, 2004
Cited by 25 (1 self)
In many real-world probabilistic reasoning problems, one of the questions we want to answer is: how many objects are out there? Examples of such problems range from multi-target tracking to extracting information from text documents. However, most probabilistic modeling formalisms, even first-order ones, assume a fixed, known set of objects. We introduce a language called Blog for specifying probability distributions over relational structures that include varying sets of objects. In this paper we present Blog informally, by means of example models for multi-target tracking and citation matching. We discuss some attractive features of Blog models and some avenues of future work.
A Note on the Unification of Information Extraction and Data Mining using Conditional-Probability, Relational Models
- In Proceedings of the IJCAI-2003 Workshop on Learning Statistical Models from Relational Data, 2003
Cited by 18 (2 self)
Although information extraction and data mining appear together in many applications, their interface in most current systems would better be described as serial juxtaposition than as tight integration.
Efficient name disambiguation for large-scale databases
- PKDD, 2006
Cited by 14 (2 self)
Name disambiguation is needed when one seeks a list of publications by an author who has used different name variations, or when multiple other authors share the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is calculated by an online active selection support vector machine algorithm (LASVM), yielding a simpler model, lower test errors and faster prediction time than a standard SVM. We prove that by recasting transitivity as density reachability in DBSCAN, transitivity is guaranteed for core points. For evaluation, we manually annotated 3,355 papers yielding 490 authors and achieved 90.6% pairwise-F1. For scalability, authors in the entire CiteSeer dataset, over 700,000 papers, were readily disambiguated.
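The role of density reachability can be seen in a small, self-contained DBSCAN sketch over a precomputed distance matrix. This is a textbook-style implementation, not the authors' code; the distances here are plain numeric gaps standing in for the LASVM-derived metric, and `eps` and `min_pts` are arbitrary illustrative values.

```python
def dbscan(dist, eps, min_pts):
    """Cluster items given a symmetric distance matrix `dist`.

    Returns one label per item: a cluster index, or -1 for noise.
    """
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # The neighborhood includes the point itself (distance 0).
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is a core point, so keep expanding
        cluster += 1
    return labels


# Five hypothetical papers placed on a line; distance is the absolute gap.
positions = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in positions] for a in positions]
labels = dbscan(dist, eps=1.5, min_pts=2)
```

Items 0 and 2 are farther apart than `eps`, yet they receive the same label because both are density-reachable through item 1, which is the sense in which clustering supplies transitivity that pairwise decisions alone would not.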
Relational Markov Networks for Collective Information Extraction
- Relational Learning and Its Connection to Other Fields (SRL-2004), 2004
Cited by 13 (0 self)
Most information extraction (IE) systems treat separate potential extractions as independent. However, in many cases, considering influences between different potential extractions could improve overall accuracy. Statistical methods based on undirected graphical models, such as conditional random fields (CRFs), have been shown to be an effective approach to learning accurate IE systems. We present a new IE method that employs Relational Markov Networks, which can represent arbitrary dependencies between extractions. This allows for “collective information extraction” that exploits the mutual influence between possible extractions. Experiments on learning to extract protein names from biomedical text demonstrate the advantages of this approach.
A fast linkage detection scheme for multi-source information integration
- In ‘Web Information Retrieval and Integration’ (WIRI’05), 2005
Cited by 9 (0 self)
Record linkage refers to techniques for identifying records associated with the same real-world entities. Record linkage is not only crucial in integrating multi-source databases that have been generated independently, but is also considered to be one of the key issues in integrating heterogeneous Web resources. However, when targeting large-scale data, the cost of enumerating all the possible linkages often becomes impracticably high. Against this background, this paper proposes a fast and efficient method for linkage detection. The proposed approach has two features. First, it exploits a suffix array structure that enables linkage detection using variable-length n-grams. Second, it dynamically generates blocks of possibly associated records using ‘blocking keys’ extracted from already known reliable linkages. The paper also reports results from preliminary experiments in which the proposed method was applied to the integration of four bibliographic databases, which scale up to more than 10 million records.
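The idea of avoiding all-pairs comparison through n-gram blocking can be sketched as follows. This is a deliberate simplification and not the paper's method: it uses fixed-length character n-grams held in a dictionary-based inverted index rather than variable-length n-grams from a suffix array, and the parameters `n` and `min_shared` as well as the sample records are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def ngrams(text, n):
    """Return the set of lowercase character n-grams in `text`."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def candidate_pairs(records, n=4, min_shared=3):
    """Return index pairs of records that share at least `min_shared` n-grams.

    Only records that co-occur in some n-gram's posting list are ever compared,
    so unrelated records never form a pair.
    """
    index = defaultdict(set)  # n-gram -> ids of records containing it
    for rid, text in enumerate(records):
        for gram in ngrams(text, n):
            index[gram].add(rid)

    shared = defaultdict(int)  # (id_a, id_b) -> number of shared n-grams
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


records = [
    "Reference reconciliation",
    "reference reconcilation",  # misspelled variant of the first record
    "Supervised clustering",
]
pairs = candidate_pairs(records)
```

A very common n-gram would still produce a long posting list and many pairs, so a real system would need to cap or filter such keys; the sketch leaves that out.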

