Results 1 - 10
of
19
Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity
, 1998
"... Most databases contain "name constants" like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. ..."
Abstract
-
Cited by 193 (13 self)
- Add to MetaCart
Most databases contain "name constants" like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both. In this paper, we reject the assumption that global domains can be easily constructed, and assume instead that the names are given in natural language text. We then propose a logic called WHIRL which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. We describe an efficient implementation of WHIRL and evaluate it experimentally on data extracted from the World Wide Web. We show that WHIR...
An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records
, 1997
"... Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same realworld entity because of data entry errors, because of unstandardized abbreviations, or because of differences in the detailed sc ..."
Abstract
-
Cited by 154 (2 self)
- Add to MetaCart
Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same realworld entity because of data entry errors, because of unstandardized abbreviations, or because of differences in the detailed schemas of records from multiple databases, among other reasons. In this paper, we present an efficient algorithm for recognizing clusters of approximately duplicate records. Three key ideas distinguish the algorithm presented. First, a version of the Smith-Waterman algorithm for computing minimum edit-distance is used as a domainindependent method to recognize pairs of approximately duplicate records. Second, the union/find algorithm is used to keep track of clusters of duplicate records incrementally, as pairwise duplicate relationships are discovered. Third, the algorithm uses a priority queue of cluster subsets to respond adaptively to the size and homogeneity of the clusters discovered as...
Learning Object Identification Rules for Information Integration
- Information Systems
, 2001
"... When integrating information from multiple websites, the same data objects can exist in inconsistent text formats across sites, making it di#cult to identify matching objects using exact text match. We have developed an object identification system called Active Atlas, which compares the objects' ..."
Abstract
-
Cited by 77 (8 self)
- Add to MetaCart
When integrating information from multiple websites, the same data objects can exist in inconsistent text formats across sites, making it di#cult to identify matching objects using exact text match. We have developed an object identification system called Active Atlas, which compares the objects' shared attributes in order to identify matching objects. Certain attributes are more important for deciding if a mapping should exist between two objects. Previous methods of object identification have required manual construction of object identification rules or mapping rules for determining the mappings between objects. This manual process is time consuming and error-prone.
A Comparison of Fast Blocking Methods for Record Linkage
- KDD 2003 WORKSHOPS
, 2003
"... Record linkage of millions of individual health records for ethically-approved research purposes is a computationally expensive task. Blocking methods are used in record linkage systems to reduce the number of candidate record comparison pairs to a feasible number whilst still maintaining linkage ac ..."
Abstract
-
Cited by 72 (15 self)
- Add to MetaCart
Record linkage of millions of individual health records for ethically-approved research purposes is a computationally expensive task. Blocking methods are used in record linkage systems to reduce the number of candidate record comparison pairs to a feasible number whilst still maintaining linkage accuracy. New blocking methods have been implemented recently using high-dimensional indexing or clustering algorithms.
Overview of record linkage and current research directions
- BUREAU OF THE CENSUS
, 2006
"... This paper provides background on record linkage methods that can be used in combining data from a variety of sources such as person lists business lists. It also gives some areas of current research. ..."
Abstract
-
Cited by 55 (1 self)
- Add to MetaCart
This paper provides background on record linkage methods that can be used in combining data from a variety of sources such as person lists business lists. It also gives some areas of current research.
Record Linkage: Current Practice and Future Directions
- CSIRO Mathematical and Information Sciences
, 2003
"... Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the merge/purge problem. This paper presents the "standard" probabil ..."
Abstract
-
Cited by 27 (0 self)
- Add to MetaCart
Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the merge/purge problem. This paper presents the "standard" probabilistic record linkage model and the associated algorithm. Recent work in information retrieval, federated database systems and data mining have proposed alternatives to key components of the standard algorithm. The impact of these alternatives on the standard approach are assessed. The key question is whether and how these new alternatives are better in terms of time, accuracy and degree of automation for a particular record linkage application.
Methods for record linkage and bayesian networks
- Series RRS2002/05, U.S. Bureau of the Census
, 2002
"... Although terminology differs, there is considerable overlap between record linkage methods based on the Fellegi-Sunter model (JASA 1969) and Bayesian networks used in machine learning (Mitchell 1997). Both are based on formal probabilistic models that can be shown to be equivalent in many situations ..."
Abstract
-
Cited by 27 (3 self)
- Add to MetaCart
Although terminology differs, there is considerable overlap between record linkage methods based on the Fellegi-Sunter model (JASA 1969) and Bayesian networks used in machine learning (Mitchell 1997). Both are based on formal probabilistic models that can be shown to be equivalent in many situations (Winkler 2000). When no missing data are present in identifying fields and training data are available, then both can efficiently estimate parameters of interest. When missing data are present, the EM algorithm can be used for parameter estimation in Bayesian Networks when there are training data (Friedman 1997) and in record linkage when there are no training data (unsupervised learning). EM and MCMC methods can be used for automatically estimating error rates in some of the record linkage situations (Belin and Rubin
Matching Algorithms within a Duplicate Detection System
- Bulletin of the Technical Committee on Data Engineering
, 2000
"... Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, unstandardized abbreviations, or differences in the detailed schemas of records from ..."
Abstract
-
Cited by 16 (0 self)
- Add to MetaCart
Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, unstandardized abbreviations, or differences in the detailed schemas of records from multiple databases – such as what happens in data warehousing where records from multiple data sources are integrated into a single source of information – among other reasons. In this paper we review a system to detect approximate duplicate records in a database and provide properties that a pair-wise record matching algorithm must have in order to have a successful duplicate detection system. 1
Quality of Very Large Databases
, 2001
"... Analyses and data mining of large computer files are affected by the quality of the information in the files. For large population registers and for files that are created by merging two or more files, duplicate entries must be identified. Duplicate identification can depend on record linkage softwa ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
Analyses and data mining of large computer files are affected by the quality of the information in the files. For large population registers and for files that are created by merging two or more files, duplicate entries must be identified. Duplicate identification can depend on record linkage software that can deal with name, address, and date-of-birth data containing many typographical errors. Quantitative and qualitative data must be edited to assure that mutually contradictory or missing items are changed automatically and quickly. This paper describes computational methods and software that are suitable for groups of files where individual files contain between 1 million and 4 billion records. Keywords: record linkage, editing, imputation, data mining 1.
Regression Analysis with Linked Data
, 2004
"... Record linkage, or exact matching, can be used to join together two files that contain information on the same individuals, but lack unique personal identification codes. The possibility of errors in linkage causes problems for estimating the relationships between variables on the two files. The eff ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
Record linkage, or exact matching, can be used to join together two files that contain information on the same individuals, but lack unique personal identification codes. The possibility of errors in linkage causes problems for estimating the relationships between variables on the two files. The effect is analogous to the impact of measurement error. A model of a linear regression relationship between variables in linked files is proposed. Assuming the probabilities that pairs of records are links are known, an unbiased estimator of the regression coefficients is derived. Methods for estimating the linkage probabilities by using mixture models are discussed. A consistent estimator of the covariance matrix of the proposed estimator is proposed. A bootstrap estimator is used to reflect the impact of the uncertainty in record linkage model parameters on the estimators of the regression parameters. A simulation study compares the performance of the proposed estimator and alternatives.

