DogmatiX Tracks down Duplicates in XML
, 2005
Cited by 30 (7 self)
Duplicate detection is the problem of detecting different entries in a data source representing the same real-world entity. While research abounds in the realm of duplicate detection in relational data, there is yet little work for duplicates in other, more complex data models, such as XML. In this paper, we present a generalized framework for duplicate detection, dividing the problem into three components: candidate definition defining which objects are to be compared, duplicate definition defining when two duplicate candidates are in fact duplicates, and duplicate detection specifying how to efficiently find those duplicates.
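The three components named in this abstract can be illustrated with a minimal Python sketch. It is not the DogmatiX algorithm: the sample XML, the choice of `<movie>` elements as candidates, the token-based Jaccard similarity, the 0.8 threshold, and the naive pairwise comparison are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from itertools import combinations

# Hypothetical sample data, not taken from the paper.
XML = """<catalog>
  <movie><title>The Matrix</title><year>1999</year></movie>
  <movie><title>Matrix, The</title><year>1999</year></movie>
  <movie><title>Signs</title><year>2002</year></movie>
</catalog>"""


def candidates(root, tag="movie"):
    """Candidate definition: which objects are compared (assumed: every <movie>)."""
    return list(root.iter(tag))


def tokens(elem):
    """Describe an element by the lowercased word tokens of its child texts."""
    text = " ".join((child.text or "") for child in elem)
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_duplicate(a, b, threshold=0.8):
    """Duplicate definition: when two candidates count as duplicates (assumed threshold)."""
    return jaccard(tokens(a), tokens(b)) >= threshold


def detect(root):
    """Duplicate detection: how pairs are found (here, naive all-pairs comparison)."""
    cands = candidates(root)
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(cands), 2)
        if is_duplicate(a, b)
    ]


print(detect(ET.fromstring(XML)))  # [(0, 1)]
```

The all-pairs loop is quadratic in the number of candidates; the abstract's third component concerns doing this step efficiently, which this sketch does not attempt.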
Moma - a mapping-based object matching system
- In CIDR
, 2007
Cited by 19 (10 self)
Object matching or object consolidation is a crucial task for data integration and data cleaning. It addresses the problem of identifying object instances in data sources referring to the same real-world entity. We propose a flexible framework called MOMA for mapping-based object matching. It allows the construction of match workflows combining the results of several matcher algorithms on both attribute values and contextual information. The output of a match task is an instance-level mapping that supports information fusion in P2P data integration systems and can be re-used for other match tasks. MOMA utilizes further semantic mappings of different cardinalities and provides merge and compose operators for mapping combination. We propose and evaluate several strategies both for object matching between different sources and for duplicate identification within a single data source.
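The abstract mentions merge and compose operators on instance-level mappings without defining them. The Python sketch below shows one plausible reading, under loudly stated assumptions: a mapping is a dictionary from object-ID pairs to a similarity score, merge keeps the higher score for each pair, and compose chains two mappings through a shared middle source using the minimum score along the path. The names `Mapping`, `merge`, and `compose` and all sample IDs are hypothetical and are not MOMA's API.

```python
from typing import Dict, Tuple

# Assumed representation: (source_id, target_id) -> similarity in [0, 1].
Mapping = Dict[Tuple[str, str], float]


def merge(m1: Mapping, m2: Mapping) -> Mapping:
    """Combine two mappings over the same sources (assumed: keep the max score)."""
    out = dict(m1)
    for pair, sim in m2.items():
        out[pair] = max(sim, out.get(pair, 0.0))
    return out


def compose(m1: Mapping, m2: Mapping) -> Mapping:
    """Chain A->B and B->C into A->C (assumed: min score along the path)."""
    out: Mapping = {}
    for (a, b1), s1 in m1.items():
        for (b2, c), s2 in m2.items():
            if b1 == b2:
                sim = min(s1, s2)
                out[(a, c)] = max(sim, out.get((a, c), 0.0))
    return out


# Hypothetical matcher outputs between sources A and B.
title_match: Mapping = {("a1", "b1"): 0.9, ("a2", "b2"): 0.6}
author_match: Mapping = {("a2", "b2"): 0.8, ("a3", "b3"): 0.7}
a_to_b = merge(title_match, author_match)

# Hypothetical existing mapping between sources B and C, re-used via compose.
b_to_c: Mapping = {("b1", "c1"): 0.95, ("b2", "c9"): 0.5}
a_to_c = compose(a_to_b, b_to_c)
print(a_to_c)  # {('a1', 'c1'): 0.9, ('a2', 'c9'): 0.5}
```

Other combination functions (average, weighted sum, product) would fit the same signatures; the choice of max and min here is only for illustration.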
MAL4:6- Using Data Mining for Record Linkage
Cited by 1 (1 self)
This paper presents a first attempt at using pedigree-based data to improve record linkage. It describes a composite metric for similarity and a mechanism to extract relevant generational features. Results on a large data set demonstrate promise.
Utilizing Stacking for Feature Reduction in Graph-Based Genealogical Record Linkage
Genealogy research is centered on collecting records about an individual from various sources and combining the information to gain a larger historical perspective about that individual, commonly in the form of a pedigree. Data extraction, the internet, and other technological advancements have made large amounts of digital genealogical data more accessible. Discovering the relevancy of a digital record to a given pedigree involves determining whether the individual described in the record is in fact an individual within the pedigree. This process is called Genealogical Record Linkage (GRL). GRL can be automated through data mining techniques by creating machine-learned models from hand-labeled comparisons. In this paper, we compare two such models, a tabular approach and a graph-based stacking approach, and report the successful application of both on a large, post-blocking database. We also note the successful integration of these approaches in an open source distributed genealogy program that finds relevant matches to a given pedigree from multiple online repositories.
Problem of Matching Far Infra-Red Astronomical Sources to Optical Counterparts
, 2005
The problem of record linkage is often seen simply in terms of making links between data points that might be generated from the same source. However, in many cases the grounds for linking items are themselves not certain. In fact it is often desirable to learn, in an unsupervised manner, what form linked objects take in different databases. One simple case of this is the “one-to-many” linkage problem, where each object in one dataset is potentially linked to one of many objects in another dataset, and where the candidate matches are mutually exclusive. We show how the Expectation Maximisation algorithm can be used for this matching problem, both to calculate the probability of a match, and to learn something about the characteristics that matched objects have. The approach is derived for the specific astronomical problem of linking far infra-red observations to optical counterparts, but is generally applicable. This report outlines the theory of this record linkage procedure, but does not discuss its application or any
Beyond Probabilistic Record Linkage: Using Neural Networks and Complex Features to Improve Genealogical Record Linkage
Probabilistic record linkage has been used for many years in a variety of industries, including medical, government, private sector and research groups. The formulas used for probabilistic record linkage have been recognized by some as being equivalent to the naïve Bayes classifier. While this method can produce useful results, it is not difficult to improve accuracy by using one of a host of other machine learning or neural network algorithms. Even a simple single-layer perceptron tends to outperform the naïve Bayes classifier (and thus traditional probabilistic record linkage methods) by a substantial margin. Furthermore, many record linkage systems use simple field comparisons rather than more complex features, partially due to the limits of the probabilistic formulas they use. This paper presents an overview of probabilistic record linkage, shows how to cast it in machine

