Adaptive Duplicate Detection Using Learnable String Similarity Measures (2003) [117 citations — 10 self]

by Mikhail Bilenko, Raymond J. Mooney
In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003)

Abstract:

Identifying approximately duplicate records in databases is an essential step in data cleaning and data integration. Most existing approaches rely on generic or manually tuned distance metrics to estimate the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ a learnable text distance function for each database field, and show that such measures can adapt to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate detection accuracy over traditional techniques.
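
To make the idea of a field-specific string distance concrete, here is a minimal sketch. It is not the paper's method: it is a plain dynamic-programming edit distance whose substitution costs are passed in as a function. The function names and the hand-set costs in `street_sub` are hypothetical stand-ins for costs that a trained model would supply.

```python
from typing import Callable


def weighted_edit_distance(
    s: str,
    t: str,
    sub_cost: Callable[[str, str], float],
    indel_cost: float = 1.0,
) -> float:
    """Edit distance with a pluggable substitution cost (illustrative only)."""
    m, n = len(s), len(t)
    # d[i][j] = cheapest way to turn s[:i] into t[:j]
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + indel_cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost,                         # delete s[i-1]
                d[i][j - 1] + indel_cost,                         # insert t[j-1]
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),   # substitute or match
            )
    return d[m][n]


def uniform_sub(a: str, b: str) -> float:
    """Generic metric: every mismatch costs the same."""
    return 0.0 if a == b else 1.0


def street_sub(a: str, b: str) -> float:
    """Hypothetical costs for an address field: punctuation swaps are cheap.

    The 0.1 value is an assumption chosen by hand for illustration,
    standing in for a cost that would be learned from labeled pairs.
    """
    if a == b:
        return 0.0
    if a in " .," and b in " .,":
        return 0.1
    return 1.0


generic = weighted_edit_distance("St. Louis", "St, Louis", uniform_sub)
adapted = weighted_edit_distance("St. Louis", "St, Louis", street_sub)
```

Under the generic costs the two address strings are one full edit apart, while the field-adapted costs treat the punctuation difference as nearly free. That is the kind of domain-specific notion of similarity the abstract describes.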

Citations

5044 Statistical Learning Theory – Vapnik - 1998
2372 A tutorial on hidden Markov Models and selected applications in speech recognition – Rabiner - 1989
1439 Modern Information Retrieval – Baeza-Yates, Ribeiro-Neto - 1999
1392 Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations – Witten, Frank - 1999
805 Making large-scale SVM learning practical – Joachims - 1999
730 A general method applicable to the search for similarities in the amino acid sequence of two proteins – Needleman, Wunsch - 1970
630 Algorithms on Strings, Trees and Sequences – Gusfield - 1997
441 Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge – Durbin, Eddy, et al. - 1998
365 Transductive inference for text classification using support vector machines – Joachims - 1999
317 Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods – Platt - 2000
208 A theory for record linkage – Fellegi, Sunter - 1969
202 The merge/purge problem for large databases – Hernandez, Stolfo - 1995
145 Efficient clustering of high-dimensional data sets with application to reference matching – McCallum, Nigam, et al. - 2000
132 An efficient domain-independent algorithm for detecting approximately duplicate database records – Monge, Elkan - 1997
112 The state of record linkage and current research problems – Winkler - 1999
110 Learning string edit distance – Ristad, Yianilos - 1998
108 Substructure discovery using minimum description length and background knowledge – Cook, Holder
108 Interactive deduplication using active learning – Sarawagi, Bhamidipaty - 2002
98 The Field Matching Problem: Algorithms and Applications – Monge, Elkan - 1996
83 Automatic linkage of vital records – Newcombe, Kennedy, et al. - 1959
70 Learning to match and cluster large high-dimensional data sets for data integration – Cohen, Richman - 2002
70 Learning domain-independent string transformation weights for high accuracy object identification – Tejada, Knoblock, et al. - 2002
65 The alternating decision tree learning algorithm – Freund, Mason - 1999
54 Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers – Zadrozny, Elkan - 2001
41 Hardening soft information sources – Cohen, Kautz, et al. - 2000
28 Using Information Extraction to Aid the Discovery of Prediction Rules from Text – Nahm, Mooney - 2000
8 Learning to combine trained distance metrics for duplicate detection in databases – Bilenko, Mooney - 2002