Abstract:
Identifying approximately duplicate records in databases is an essential step in data cleaning and data integration. Most existing approaches rely on generic or manually tuned distance metrics to estimate the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose employing a learnable text distance function for each database field, and show that such measures can adapt to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space-based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate detection accuracy over traditional techniques.
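The idea of a separate, tunable distance function per database field can be illustrated with a minimal sketch. This is not the paper's method: it uses a plain edit distance with adjustable operation costs as a stand-in for the learned measures, and the names `weighted_edit_distance`, `field_distances`, and the `costs` dictionary are hypothetical, introduced only for this example. In a trainable setting, the per-field costs would be fitted from labeled duplicate pairs rather than set by hand.

```python
from typing import Dict, List


def weighted_edit_distance(a: str, b: str, sub_cost: float = 1.0,
                           ins_cost: float = 1.0, del_cost: float = 1.0) -> float:
    """Edit distance between a and b with adjustable operation costs.

    Illustrative stand-in only: the costs are hand-set parameters here,
    whereas a learnable measure would estimate them from training data.
    """
    # prev[j] holds the cost of turning the current prefix of a into b[:j].
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * del_cost]
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            substitute = prev[j - 1] + (0.0 if same else sub_cost)
            delete = prev[j] + del_cost
            insert = curr[j - 1] + ins_cost
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def field_distances(r1: Dict[str, str], r2: Dict[str, str],
                    costs: Dict[str, Dict[str, float]]) -> List[float]:
    """One distance per field, each computed with that field's own costs.

    Fields are visited in sorted name order so the vector layout is stable.
    """
    return [weighted_edit_distance(r1[f], r2[f], **costs[f])
            for f in sorted(costs)]


if __name__ == "__main__":
    rec_a = {"name": "Jon Smith", "city": "Austin"}
    rec_b = {"name": "John Smith", "city": "Austin"}
    # Hypothetical per-field cost settings (empty dict = default unit costs).
    per_field_costs = {"name": {"ins_cost": 0.5}, "city": {}}
    print(field_distances(rec_a, rec_b, per_field_costs))
```

The resulting vector of per-field distances is the kind of input a classifier could use to decide whether two records are duplicates; that classification step is left out of this sketch.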