Matching and record linkage (1993)

by W E Winkler
Results 1 - 10 of 61

The State of Record Linkage and Current Research Problems

by William E. Winkler - Statistical Research Division, U.S. Census Bureau, 1999
Cited by 172 (7 self)
This paper provides an overview of methods and systems developed for record linkage. Modern record linkage begins with the pioneering work of Newcombe and is especially based on the formal mathematical model of Fellegi and Sunter. In their seminal work, Fellegi and Sunter introduced many powerful ideas for estimating record linkage parameters and other ideas that still influence record linkage today. Record linkage research is characterized by its synergism of statistics, computer science, and operations research. Many difficult algorithms have been developed and put in software systems. Record linkage practice is still very limited. Some limits are due to existing software. Other limits are due to the difficulty in automatically estimating matching parameters and error rates, with current research highlighted by the work of Larsen and Rubin.
Keywords: computer matching, modeling, iterative fitting, string comparison, optimization
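
The Fellegi and Sunter model mentioned in this abstract is usually presented as a sum of per-field log-likelihood-ratio weights compared against two thresholds. The sketch below illustrates that textbook form only; the field names, the m and u probabilities, and the thresholds are made-up assumptions, not values from the paper.

```python
import math

def match_weight(agreements, m, u):
    """Sum per-field agreement/disagreement weights for one record pair.

    agreements: dict field -> True if the two records agree on that field
    m[field]:   assumed P(field agrees | records are a true match)
    u[field]:   assumed P(field agrees | records are not a match)
    """
    total = 0.0
    for field, agrees in agreements.items():
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total

def classify(weight, lower, upper):
    """Three-way decision: link, non-link, or hold for clerical review."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"

# Hypothetical parameters, chosen only for illustration.
m = {"surname": 0.9, "zip": 0.8}
u = {"surname": 0.1, "zip": 0.2}

w = match_weight({"surname": True, "zip": False}, m, u)
print(classify(w, lower=-3.0, upper=3.0))
```

In practice the m and u probabilities are estimated from data rather than set by hand, which is the parameter-estimation difficulty the abstract refers to.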

Interactive Deduplication using Active Learning

by Sunita Sarawagi, Anuradha Bhamidipaty, 2002
Cited by 161 (3 self)
Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can resolve when a pair of records refer to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method critically hinges on being able to provide a covering and challenging set of training pairs that bring out the subtlety of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists. We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
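
One common way to pick "challenging" pairs for a human to label is uncertainty sampling: ask about the pairs whose current duplicate score is closest to the decision boundary. The sketch below shows that generic idea under simple assumptions; it is not the selection method of the paper, and the token-overlap scorer `jaccard` and the helper `most_uncertain` are hypothetical stand-ins for a trained classifier and its query strategy.

```python
def jaccard(pair):
    """Toy duplicate score: token overlap between the two record strings."""
    a, b = (set(s.lower().split()) for s in pair)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def most_uncertain(pairs, score, k=1):
    """Return the k pairs whose score lies closest to the 0.5 boundary."""
    return sorted(pairs, key=lambda p: abs(score(p) - 0.5))[:k]

candidates = [
    ("john smith", "john smith"),   # clearly a duplicate
    ("john smith", "jon smith"),    # ambiguous
    ("john smith", "mary jones"),   # clearly distinct
]

# The ambiguous pair is the one worth sending to a human labeler.
print(most_uncertain(candidates, jaccard, k=1))
```

After each batch of labels the scorer would be retrained and the selection repeated, so labeling effort concentrates on pairs the current model cannot yet resolve.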

Learning to Match and Cluster Large High-Dimensional Data Sets For Data Integration

by William W. Cohen, Jacob Richman, 2002
Cited by 96 (6 self)
Part of the process of data integration is determining which sets of identifiers refer to the same real-world entities. In integrating databases found on the Web or obtained by using information extraction methods, it is often possible to solve this problem by exploiting similarities in the textual names used for objects in different databases. In this paper we describe techniques for clustering and matching identifier names that are both scalable and adaptive, in the sense that they can be trained to obtain better performance in a particular domain. An experimental evaluation on a number of sample datasets shows that the adaptive method sometimes performs much better than either of two non-adaptive baseline systems, and is nearly always competitive with the best baseline system.

Learning Object Identification Rules for Information Integration

by Sheila Tejada, Craig A. Knoblock, Steven Minton - Information Systems, 2001
Cited by 77 (8 self)
When integrating information from multiple websites, the same data objects can exist in inconsistent text formats across sites, making it difficult to identify matching objects using exact text match. We have developed an object identification system called Active Atlas, which compares the objects' shared attributes in order to identify matching objects. Certain attributes are more important for deciding if a mapping should exist between two objects. Previous methods of object identification have required manual construction of object identification rules or mapping rules for determining the mappings between objects. This manual process is time consuming and error-prone.

Advanced Methods For Record Linkage

by William E. Winkler, 1994
Cited by 59 (14 self)
s Service. The study showed that the fewest errors typically occur at the beginning of a string and the error rates by character position increase monotonically as the position moves to the right. The enhancement basically consisted of adjusting the string comparator value upward by a fixed amount if the first four characters agreed; by lesser amounts if the first three, two, or one characters agreed. The string comparator examined by Budzinsky (1991) consisted of the Jaro comparator with only the Winkler enhancement. The final enhancement due to Lynch and Winkler (1994) adjusts the string comparator value if the strings are longer than six characters and more than half the characters beyond the first four agree. The final enhancement was based on detailed comparisons between versions of the comparator. The comparisons involved tens of thousands of pairs of last names, first names, and street names that did not agree on a character-by-character basis but were associated with truly...
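
The prefix adjustment described in this abstract can be illustrated with the commonly cited Jaro-Winkler form, in which the Jaro score is raised in proportion to the length of the shared prefix (up to four characters). This is a minimal sketch under that assumption: the scaling factor 0.1 is the conventional default rather than a value stated here, and the later Lynch and Winkler long-string adjustment is not implemented.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1] based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    seq1 = [s1[i] for i in range(len1) if matched1[i]]
    seq2 = [s2[j] for j in range(len2) if matched2[j]]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, scaling=0.1, max_prefix=4):
    """Raise the Jaro score according to the length of the agreeing prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * scaling * (1.0 - score)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Because the boost is proportional to the prefix length, agreement on four leading characters earns the largest increase and agreement on one character the smallest, which matches the graded adjustment described in the abstract.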

Text Joins in an RDBMS for Web Data Integration

by Luis Gravano, Panagiotis G. Ipeirotis, Nick Koudas, Divesh Srivastava, 2003
Cited by 57 (8 self)
The integration of data produced and collected across autonomous, heterogeneous web services is an increasingly important and challenging problem. Due to the lack of global identifiers, the same entity (e.g., a product) might have different textual representations across databases. Textual data is also often noisy because of transcription errors, incomplete information, and lack of standard formats. A fundamental task during data integration is matching of strings that refer to the same entity.

Overview of record linkage and current research directions

by William E. Winkler - Bureau of the Census, 2006
Cited by 55 (1 self)
This paper provides background on record linkage methods that can be used in combining data from a variety of sources such as person lists and business lists. It also gives some areas of current research.

Schema Matching using Duplicates

by Alexander Bilke, Felix Naumann, 2005
Cited by 37 (4 self)
Most data integration applications require a matching between the schemas of the respective data sets. We show how the existence of duplicates within these data sets can be exploited to automatically identify matching attributes. We describe an algorithm that first discovers duplicates among data sets with unaligned schemas and then uses these duplicates to perform schema matching between schemas with opaque column names. Discovering

A Hierarchical Graphical Model for Record Linkage

by Pradeep Ravikumar, William W. Cohen - In UAI, 2004
Cited by 29 (1 self)
The task of matching co-referent records is known among other names as record linkage. For large

Record Linkage: Current Practice and Future Directions

by Lifang Gu, Rohan Baxter, Deanne Vickers, Chris Rainsford - CSIRO Mathematical and Information Sciences, 2003
Cited by 27 (0 self)
Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the merge/purge problem. This paper presents the "standard" probabilistic record linkage model and the associated algorithm. Recent work in information retrieval, federated database systems and data mining has proposed alternatives to key components of the standard algorithm. The impact of these alternatives on the standard approach is assessed. The key question is whether and how these new alternatives are better in terms of time, accuracy and degree of automation for a particular record linkage application.