A Comparison of Fast Blocking Methods for Record Linkage
- KDD 2003 WORKSHOPS, 2003
Cited by 72 (15 self)
Record linkage of millions of individual health records for ethically-approved research purposes is a computationally expensive task. Blocking methods are used in record linkage systems to reduce the number of candidate record comparison pairs to a feasible number whilst still maintaining linkage accuracy. New blocking methods have been implemented recently using high-dimensional indexing or clustering algorithms.
Record Linkage: Current Practice and Future Directions
- CSIRO Mathematical and Information Sciences, 2003
Cited by 27 (0 self)
Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the merge/purge problem. This paper presents the "standard" probabilistic record linkage model and the associated algorithm. Recent work in information retrieval, federated database systems and data mining has proposed alternatives to key components of the standard algorithm. The impact of these alternatives on the standard approach is assessed. The key question is whether and how these new alternatives are better in terms of time, accuracy and degree of automation for a particular record linkage application.
Adaptive blocking: Learning to scale up record linkage
- In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM-2006), 2006
Cited by 17 (1 self)
Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index-based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.
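To make the idea of predicate-based blocking concrete, here is a minimal Python sketch. It is not the learned blocking function from the paper: the predicates are hand-picked, and the function name `candidate_pairs`, the field names `surname` and `zip`, and the sample records are all assumptions made for illustration. Two records become a candidate pair if at least one predicate assigns them the same key.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, predicates):
    """Return index pairs that share a key under at least one predicate."""
    pairs = set()
    for predicate in predicates:
        # Group record indices by the key this predicate produces.
        blocks = defaultdict(list)
        for idx, record in enumerate(records):
            blocks[predicate(record)].append(idx)
        # Only records inside the same block are compared later.
        for members in blocks.values():
            pairs.update(combinations(members, 2))
    return pairs


# Hypothetical records and hand-written predicates (illustration only).
records = [
    {"surname": "Smith", "zip": "2600"},
    {"surname": "Smyth", "zip": "2600"},
    {"surname": "Smith", "zip": "2913"},
    {"surname": "Taylor", "zip": "3000"},
]
predicates = [
    lambda r: r["surname"][:3].lower(),  # first three letters of the surname
    lambda r: r["zip"],                  # exact postcode
]

pairs = candidate_pairs(records, predicates)
total_pairs = len(records) * (len(records) - 1) // 2  # all-pairs baseline
print(sorted(pairs), "of", total_pairs, "possible pairs")
```

In this toy data the two predicates together select 2 of the 6 possible pairs. An adaptive approach would choose the predicates from labeled examples rather than by hand.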
Adaptive filtering for efficient record linkage
- In ICDM, 2004
Cited by 15 (0 self)
The process of identifying record pairs that represent the same real-world entity in multiple databases, commonly known as record linkage, is one of the important initial steps in many data mining applications. Record linkage of millions of records is a computationally expensive task. Various blocking methods have been used in record linkage systems to reduce the number of record pairs for comparison. A good blocking key is critical to the success of a blocking method and will ideally result in a lot of small blocks. However, in practice, there are almost always large blocks no matter how good the blocking key is. For example, when blocking on surname for an Anglo-Celtic population, ‘Smith’ and ‘Taylor’ are populous and result in very large block sizes. The efficiency of a blocking method is hindered by these large blocks since the resulting number of record pairs is dominated by the sizes of these large blocks. In this paper, we present an adaptive filtering algorithm to post-process large blocks to enhance the blocking efficiency. Experimental results show that our filtering algorithm can reduce the number of record pairs produced by the standard blocking method by 88% on a small real-world data set. The algorithm also reduces the number of record pairs generated by a 3-pass standard blocking method by 50% on several synthetic test data sets, with minimal loss of accuracy.
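The claim that a few large blocks dominate the number of record pairs follows from the fact that a block of n records yields n(n-1)/2 pairs. The short Python sketch below illustrates this with made-up block sizes; the surname counts are invented for the example and are not taken from the paper, and `pairs_per_block` is a hypothetical helper name.

```python
from collections import Counter


def pairs_per_block(keys):
    """Map each blocking key to the number of record pairs its block generates."""
    sizes = Counter(keys)
    return {key: n * (n - 1) // 2 for key, n in sizes.items()}


# Invented data: two populous surnames plus 50 rare surnames with 2 records each.
surnames = (
    ["Smith"] * 100
    + ["Taylor"] * 50
    + [f"Rare{i}" for i in range(50) for _ in range(2)]
)

counts = pairs_per_block(surnames)
total = sum(counts.values())
large = counts["Smith"] + counts["Taylor"]
print(f"Smith: {counts['Smith']}, Taylor: {counts['Taylor']}, total: {total}")
print(f"share of pairs from the two large blocks: {large / total:.1%}")
```

Here the two large blocks hold 60% of the records but produce over 99% of the candidate pairs, which is why post-processing only the large blocks can remove most of the comparison work.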
A fast linkage detection scheme for multi-source information integration
- In ‘Web Information Retrieval and Integration’ (WIRI’05), 2005
Cited by 9 (0 self)
Record linkage refers to techniques for identifying records associated with the same real-world entities. Record linkage is not only crucial in integrating multi-source databases that have been generated independently, but is also considered to be one of the key issues in integrating heterogeneous Web resources. However, when targeting large-scale data, the cost of enumerating all the possible linkages often becomes impracticably high. Against this background, this paper proposes a fast and efficient method for linkage detection. The proposed approach has two features. First, it exploits a suffix array structure that enables linkage detection using variable-length n-grams. Second, it dynamically generates blocks of possibly associated records using ‘blocking keys’ extracted from already known reliable linkages. The paper also reports results from preliminary experiments in which the proposed method was applied to the integration of four bibliographic databases, which scale up to more than 10 million records.
A Service Oriented Architecture for a Health Research Data Network
- 2004
Cited by 3 (1 self)
Layer Model. The Preparing layer is about collecting data by data custodians. The challenges and software services required for managing data collection are out of the scope of the present project.
Active Learning Genetic Programming for Record Deduplication
The great majority of genetic programming (GP) algorithms that deal with the classification problem follow a supervised approach, i.e., they consider that all fitness cases available to evaluate their models are labeled. However, in certain application domains, a lot of human effort is required to label training data, and methods following a semi-supervised approach might be more appropriate. This is because they significantly reduce the time required for data labeling while maintaining acceptable accuracy rates. This paper presents the Active Learning GP (AGP), a semi-supervised GP, and instantiates it for the data deduplication problem. AGP uses an active learning approach in which a committee of multi-attribute functions votes for classifying record pairs as duplicates or not. When the committee majority voting is not enough to predict the class of the data pairs, a user is called to solve the conflict. The method was applied to three datasets and compared to two other deduplication methods. Results show that AGP guarantees the quality of the deduplication while reducing the number of labeled examples needed.
A Comparative Study of Record Matching Algorithms
Record matching is an important process in data integration and data cleaning. It involves identifying cases where multiple database entities correspond to the same real-world entity. Often, duplicate records do not share a common key and contain erroneous data that make record matching a difficult task. The quality of a record matching system depends heavily on an approach that can detect duplicates accurately and efficiently. Despite the many techniques that have been introduced over the decades, it is unclear which technique is the current state of the art. Hence, the objectives of this project are: 1. Compare a few record matching techniques and evaluate their advantages and disadvantages. 2. Develop a technique that combines the best features from these techniques to produce an improved record matching technique. Currently, there are two main approaches for duplicate record detection: approaches that rely on training data, and approaches that rely on domain knowledge or distance metrics. This project focuses on comparisons between the Probabilistic-based
Real World Performance of Approximate String Comparators for use in Patient Matching
- 2004
Medical record linkage is becoming increasingly important as clinical data is distributed across independent sources. To improve linkage accuracy we studied different name comparison methods that establish agreement or disagreement between corresponding names. In addition to exact raw name matching and exact phonetic name matching, we tested three approximate string comparators. The approximate comparators included the modified Jaro-Winkler method, the longest common substring, and the Levenshtein edit distance. We also calculated the combined root-mean square of all three. We tested each name comparison method using a deterministic record linkage algorithm. Results were consistent across both hospitals. At a threshold comparator score of 0.8, the Jaro-Winkler comparator achieved the highest linkage sensitivities of 97.4% and 97.7%. The combined root-mean square method achieved sensitivities higher than the Levenshtein edit distance or longest common substring while sustaining high linkage specificity. Approximate string comparators increase deterministic linkage sensitivity by up to 10% compared to exact match comparisons and represent an accurate method of linking to vital statistics data.
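As an illustration of how an approximate string comparator can be turned into an agree/disagree decision, the Python sketch below implements the Levenshtein edit distance and scales it to a score between 0 and 1. The abstract does not say how the study normalized its scores, so the formula `1 - distance / max(len)` is an assumption, as are the function names and example names; only the 0.8 threshold comes from the abstract.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def names_agree(a, b, threshold=0.8):
    """Treat two names as agreeing when their score reaches the threshold."""
    return similarity(a.lower(), b.lower()) >= threshold


print(levenshtein("kitten", "sitting"))
print(names_agree("Johnson", "Jonson"))
print(names_agree("Smith", "Taylor"))
```

With this normalization a single dropped letter in a seven-letter name still counts as agreement, while unrelated names fall well below the threshold.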

