Reference reconciliation in complex information spaces
In SIGMOD, 2005
Cited by 88 (1 self)
Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed references to a single class that had a fair number of attributes (e.g., research publications). We consider complex information spaces: our references belong to multiple related classes and each reference may have very few attribute values. A prime example of such a space is Personal Information Management, where the goal is to provide a coherent view of all the information on one’s desktop. Our reconciliation algorithm has three principal features. First, we exploit the associations between references to design new methods for reference comparison. Second, we propagate information between reconciliation decisions to accumulate positive and negative evidence. Third, we gradually enrich references by merging attribute values. Our experiments show that (1) we considerably improve precision and recall over standard methods on a diverse set of personal information datasets, and (2) there are advantages to using our algorithm even on a standard citation dataset benchmark.
Semantic integration research in the database community: A brief survey
AI Magazine, 2005
Cited by 75 (4 self)
Semantic integration has been a long-standing challenge for the database community. It has received steady attention over the past two decades, and has now become a prominent area of database research. In this article, we first review database applications that require semantic integration, and discuss the difficulties underlying the integration process. We then describe recent progress and identify open research issues. We will focus in particular on schema matching, a topic that has received much attention in the database community, but will also discuss data matching (e.g., tuple deduplication), and open issues beyond the match discovery context (e.g., reasoning with matches, match verification and repair, and reconciling inconsistent data values). For previous surveys of database research on semantic integration, see (Rahm & Bernstein 2001;
Author Disambiguation using Error-driven Machine Learning with a Ranking Loss Function
Cited by 6 (3 self)
Author disambiguation is the problem of determining whether records in a publications database refer to the same person. A common supervised machine learning approach is to build a classifier to predict whether a pair of records is coreferent, followed by a clustering step to enforce transitivity. However, this approach ignores powerful evidence obtainable by examining sets (rather than pairs) of records, such as the number of publications or co-authors an author has. In this paper we propose a representation that enables these first-order features over sets of records. We then propose a training algorithm well-suited to this representation that is (1) error-driven in that training examples are generated from incorrect predictions on the training data, and (2) rank-based in that the classifier induces a ranking over candidate predictions. We evaluate our algorithms on three author disambiguation datasets and demonstrate error reductions of up to 60% over the standard binary classification approach.
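The baseline this abstract argues against (pairwise coreference decisions followed by a clustering step that enforces transitivity) can be sketched as follows. This is only an illustration under stated assumptions: the token-overlap scorer `jaccard`, the threshold of 0.5, and the sample records are hypothetical stand-ins, not the classifier, features, or data used in the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Hypothetical pairwise scorer: token overlap between two name strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cluster_by_transitive_closure(records, score=jaccard, threshold=0.5):
    """Link every pair scoring at or above the threshold, then take the
    transitive closure of those links with a union-find structure."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if score(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)
    return sorted(clusters.values())


# Made-up records for illustration only.
records = ["jane doe", "doe jane q", "john roe", "roe john"]
clusters = cluster_by_transitive_closure(records)
```

Because each decision looks at one pair at a time, this baseline cannot use set-level evidence such as the number of co-authors in a cluster, which is the gap the abstract describes.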
A unified approach for schema matching, coreference, and canonicalization
In KDD, Las Vegas, 2008
Cited by 5 (1 self)
The automatic consolidation of database records from many heterogeneous sources into a single repository requires solving several information integration tasks. Although tasks such as coreference, schema matching, and canonicalization are closely related, they are most commonly studied in isolation. Systems that do tackle multiple integration problems traditionally solve each independently, allowing errors to propagate from one task to another. In this paper, we describe a discriminatively-trained model that reasons about schema matching, coreference, and canonicalization jointly. We evaluate our model on a real-world data set of people and demonstrate that simultaneously solving these tasks reduces errors over a cascaded or isolated approach. Our experiments show that a joint model is able to improve substantially over systems that either solve each task in isolation or with the conventional cascade. We demonstrate nearly a 50% error reduction for coreference and a 40% error reduction for schema matching.
Efficient Strategies for Improving Partitioning-Based Author Coreference by Incorporating Web Pages as Graph Nodes
International Workshop on Information Integration on the Web, 2007
Cited by 1 (1 self)
Entity resolution in the domain of research paper authors is an important but difficult problem. It suffers from insufficient contextual information, hence adding information from the web can significantly improve performance. We formulate the author coreference problem as one of graph partitioning with discriminatively trained edge weights. Building on our previous work, this paper presents improved and more comprehensive results for the method in which we incorporate web documents as additional nodes in the graph. We also propose efficient strategies to select a subset of nodes to add to the graph and to select a subset of queries to gather additional nodes, without significant loss of performance gain. We extend the classic Set-cover problem to develop a node selection criterion, hence opening up interesting theoretical possibilities. Finally, we propose a hybrid approach that achieves 74.3% of the total improvement gain using only 18.3% of all additional mentions.
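For readers unfamiliar with the classic Set-cover problem mentioned above, the standard greedy approximation is sketched below. It shows only the textbook algorithm; the paper's extension and its actual node selection criterion are not described in the abstract, and the query names and mention ids are made up for illustration.

```python
def greedy_set_cover(universe, subsets):
    """Textbook greedy heuristic: repeatedly pick the subset that covers
    the most still-uncovered elements.

    `subsets` maps a label (here, a hypothetical web query) to the set of
    elements (here, hypothetical mention ids) that it covers.
    Returns the chosen labels and any elements that could not be covered.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            break  # no remaining subset covers anything new
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen, uncovered


# Made-up example: which queries to issue so every mention gains context.
mentions = {1, 2, 3, 4, 5}
queries = {"q1": {1, 2, 3}, "q2": {3, 4}, "q3": {4, 5}, "q4": {5}}
chosen, uncovered = greedy_set_cover(mentions, queries)
```

The greedy choice keeps the number of selected subsets small, which mirrors the abstract's goal of gathering fewer additional nodes without losing much of the performance gain.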
unknown title
In this paper we introduce a new semantic desktop system called IRIS, an application framework for enabling users to create a “personal map” across their office-related information objects. Built as part of the CALO Cognitive Assistant project, IRIS represents a step in our quest to construct the kinds of tools that will significantly augment the user’s ability to perform knowledge work. This paper explains our design decisions, progress, and shortcomings. The IRIS project has grown from the past work of others and offers opportunities to augment and otherwise collaborate with other current and future semantic desktop projects. This paper marks our entry into the ongoing conversation about semantic desktops, intelligent knowledge management, and systems for augmenting the performance of human teams.
Record Linkage Based on Entities’ Behavior
2008
Record linkage is the problem of identifying similar records across different data sources. Traditional record linkage techniques focus on using simple database attributes in a textual similarity comparison to decide on matched and non-matched records. Recently, record linkage techniques have considered useful extracted knowledge and domain information to help enhance the matching accuracy. In this paper, we present a new technique for record linkage that is based on entities’ behavior, which can be extracted from a transaction log. In the matching process, we measure the improvement in identifying a behavior when comparing two entities by merging their transaction logs. To do so, we use two matching phases: first, a candidate generation phase, which is fast and produces almost no false negatives, at the cost of low precision; second, an accurate matching phase, which enhances the precision of the matching at a high run-time cost. In the candidate generation phase, behavior is represented by points in the complex plane, where we perform approximate evaluations. In the accurate matching phase, we use a heuristic called compressibility, where identified behaviors are more compressible. Our experiments show that the proposed technique can be used to enhance the record linkage quality while being practical for large logs. We also perform extensive sensitivity analysis for the technique’s accuracy and performance.
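The intuition behind the compressibility heuristic (two logs produced by the same behavior should compress better when merged) can be illustrated with a general-purpose compressor. This is a rough sketch under loud assumptions: the abstract does not define the paper's compressibility measure, so zlib compressed size is swapped in as a crude proxy, and the function names and sample logs are hypothetical.

```python
import zlib


def compressed_size(events):
    """Size in bytes of the zlib-compressed event sequence (a crude proxy)."""
    return len(zlib.compress(" ".join(events).encode("utf-8"), 9))


def compressibility_gain(log_a, log_b):
    """Relative saving from compressing the merged log instead of the two
    logs separately. A higher value suggests more shared behavior."""
    separate = compressed_size(log_a) + compressed_size(log_b)
    merged = compressed_size(log_a + log_b)
    return (separate - merged) / separate


# Made-up transaction logs: two with the same repeating pattern, one different.
log_a = ["milk", "bread", "eggs"] * 20
log_b = ["milk", "bread", "eggs"] * 20
log_c = ["nails", "paint", "rope"] * 20

same_behavior = compressibility_gain(log_a, log_b)
different_behavior = compressibility_gain(log_a, log_c)
```

In a two-phase design like the one described, a comparison of this kind would run only on the pairs that survive the cheap candidate generation phase, since it is the expensive step.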

