## Adaptive Duplicate Detection Using Learnable String Similarity Measures (2003)

Venue: | Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003) |

Citations: | 237 - 11 self |

### BibTeX

@INPROCEEDINGS{Bilenko03adaptiveduplicate,

author = {Mikhail Bilenko and Raymond J. Mooney},

title = {Adaptive Duplicate Detection Using Learnable String Similarity Measures},

booktitle = {Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003)},

year = {2003},

pages = {39--48}

}

### Abstract

The problem of identifying approximately duplicate records in databases is an essential step for data cleaning and data integration processes. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ learnable text distance functions for each database field, and show that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate detection accuracy over traditional techniques.
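To illustrate why field-specific, trainable costs can matter, the following Python sketch computes an edit distance whose insertion, deletion, and substitution costs are passed in as functions. It is a simplified stand-in and not the paper's measure: the abstract describes an extended learnable edit distance whose parameters are estimated from training data, whereas here the costs are set by hand. The names `weighted_edit_distance`, `sub_cost`, `indel_cost`, and the phone-number cost values are hypothetical.

```python
from typing import Callable


def weighted_edit_distance(
    s: str,
    t: str,
    sub_cost: Callable[[str, str], float],
    indel_cost: Callable[[str], float],
) -> float:
    """Minimum total cost of edits that turn s into t (dynamic programming)."""
    rows, cols = len(s) + 1, len(t) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    # First column: delete every character of s.
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + indel_cost(s[i - 1])
    # First row: insert every character of t.
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + indel_cost(t[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = s[i - 1], t[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost(a),  # delete a
                dist[i][j - 1] + indel_cost(b),  # insert b
                dist[i - 1][j - 1] + (0.0 if a == b else sub_cost(a, b)),
            )
    return dist[rows - 1][cols - 1]


# Generic unit costs (plain Levenshtein distance).
def unit_sub(a: str, b: str) -> float:
    return 1.0


def unit_indel(c: str) -> float:
    return 1.0


# Hypothetical hand-set costs for a phone-number field: punctuation is cheap.
PUNCT = set("-/(). ")


def phone_sub(a: str, b: str) -> float:
    return 0.1 if a in PUNCT and b in PUNCT else 1.0


def phone_indel(c: str) -> float:
    return 0.1 if c in PUNCT else 1.0
```

With unit costs, `"555-1234"` and `"555/1234"` are one full edit apart; with the phone-field costs the same pair is nearly identical, which is the kind of domain-specific behavior a learned measure would be expected to pick up from labeled duplicates rather than from hand-written rules.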

### Citations

8983 | The nature of statistical learning theory
- Vapnik
- 1995
Citation Context ...imization (EM) algorithm for estimating the parameters of a generative model based on string edit distance with affine gaps. The other string similarity measure employs a Support Vector Machine (SVM) [24] to obtain a similarity estimate based on the vector-space model of text. The character-based distance is best suited for shorter strings with minor variations, while the measure based on vector-space ...

4275 | A tutorial on hidden Markov models and selected applications in speech recognition
- Rabiner
- 1989
Citation Context ...generating the pair of prefixes $(x^T_{1 \ldots t}, y^V_{1 \ldots v})$ and suffixes $(x^T_{t+1 \ldots T}, y^V_{v+1 \ldots V})$ can be computed using dynamic programming in standard forward and backward algorithms in $O(TV)$ time [20]. Then, given a corpus of $n$ matched strings corresponding to pairs of duplicates, $C = \{(x^{T_1}, y^{V_1}), \ldots, (x^{T_n}, y^{V_n})\}$, this model can be trained using a variant of the Baum-Welch algorithm,...

2966 | Data mining: practical machine learning tools and techniques with Java implementations
- Witten, Frank
- 2000
Citation Context ... the Restaurant dataset due to the limited number of duplicates in it). The SVMlight implementation of a support vector machine with a radial basis function kernel was compared with the WEKA package [26] implementation of alternating decision trees [8], a state-of-the-art algorithm that combines boosting and decision tree learning. Unlearned vector-space normalized dot product was used as the field-l...

2362 | Modern Information Retrieval
- Baeza-Yates, Ribeiro-Neto
- 1999
Citation Context ...ning soft databases [3], reference matching [13], and entity-name clustering and matching [4]. Typically, standard string similarity metrics such as edit distance [9] or vector-space cosine similarity [1] are used to determine whether two values or records are alike enough to be duplicates. Some more recent work [4, 22, 23] has investigated the use of pairing functions that combine multiple standard m...

1441 | Making Large-Scale SVM Learning Practical
- Joachims
- 1999
Citation Context ...tion 2.2.3, implemented over TF-IDF representations after stemming and stopword removal. SVM implementation with the radial basis function kernel from the SVMlight package was used as the classifier [11] with # = 10. Results for field-level duplicate detection experiments are summarized in Table 4. Each entry in the table contains the average of maximum F-measure values over the 40 evaluated folds. R...

1410 | A general method applicable to the search for similarities in the amino acid sequence of two proteins
- Needleman, Wunsch
- 1970
Citation Context ...er-based string similarity metric is Levenshtein distance, defined as the minimum number of insertions, deletions or substitutions necessary to transform one string into another. Needleman and Wunsch [17] extended the model to allow contiguous sequences of mismatched characters, or gaps, in the alignment of two strings, and described a general dynamic programming method for computing edit distance. Mo...

899 | Algorithms on Strings, Trees, and Sequences
- Gusfield
- 1997
Citation Context ...], duplicate detection [15, 22], hardening soft databases [3], reference matching [13], and entity-name clustering and matching [4]. Typically, standard string similarity metrics such as edit distance [9] or vector-space cosine similarity [1] are used to determine whether two values or records are alike enough to be duplicates. Some more recent work [4, 22, 23] has investigated the use of pairing func...

830 | Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
- Durbin, Eddy, et al.
- 1998
Citation Context ... are likely to be aligned in a given domain, such as substitution #/, -# for phone numbers, or deletion #., ## for addresses. This generative model is similar to one given for amino-acid sequences in [6] with two important differences: (1) transition probabilities are distinct for states D and I, and (2) every transition has a probability parameter associated with it, instead of being expressed thro...

701 | Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods
- Platt
- 1999
Citation Context ...in on the training data affects hypothesis choice), they are guaranteed not to overfit the solution to the training pairs. Methods for obtaining calibrated posterior probabilities from the SVM output [19] could be used to obtain a probabilistically meaningful similarity function at this point. Because in the deduping framework we are only concerned with obtaining a correct ranking of pairs with respec...

683 | Transductive inference for text classification using support vector machines
- Joachims
- 1999
Citation Context ...ion of the quadratic optimization problem used by SVMlight needs to be extended to be robust to differences between distributions of test data and training data. This task is similar to transduction [24, 12] because it would require using unlabeled test data in the learning process, but with the fundamental departure from transduction in using unlabeled test data from a different distribution. Alternativ...

398 | A Theory for Record Linkage
- Fellegi, Sunter
- 1969
Citation Context ... records in databases was originally identified by Newcombe [18] as record linkage in the context of identifying medical records of the same individual from different time periods. Fellegi and Sunter [7] developed a formal theory for record linkage and offered statistical methods for estimating matching parameters and error rates. In more recent work in statistics, Winkler proposed using EM-based met...

300 | The Merge/Purge Problem for Large Databases
- Hernandez, Stolfo
- 1995
Citation Context ...stribute to lists, requires prior specific permission and/or a fee. SIGKDD '03, August 24-27, 2003, Washington, DC, USA Copyright 2003 ACM 1-58113-737-0/03/0008 ...$5.00. 25], the merge/purge problem [10], duplicate detection [15, 22], hardening soft databases [3], reference matching [13], and entity-name clustering and matching [4]. Typically, standard string similarity metrics such as edit distance [...

258 | Efficient Clustering of High Dimensional Data Sets with Application to Reference
- McCallum, Nigam, et al.
- 2000
Citation Context ...25], the merge/purge problem [10], duplicate detection [15, 22], hardening soft databases [3], reference matching [13], and entity-name clustering and matching [4]. Typically, standard string similarity metrics such as edit distance [9] or vector-space cosine similarity [1] are used to determine whether two values or ...

218 | The state of record linkage and current research problems
- Winkler
- 1999
Citation Context ...rding to their contribution to the true distance between records. While statistical aspects of combining similarity scores for individual fields have been addressed in previous work on record linkage [25], availability of labeled duplicates allows a more direct approach that uses a binary classifier that computes a "pairing function" [4]. Given a database that contains records composed of k different...

197 | Interactive Deduplication Using Active Learning
- Sarawagi, Bhamidipaty
- 2002
Citation Context ...25], the merge/purge problem [10], duplicate detection [15, 22], hardening soft databases [3], reference matching [13], and entity-name clustering and matching [4]. Typically, standard string similarity metrics such as edit distance [9] or vector-space cosine simi...

195 | Learning String-Edit Distance
- Ristad, Yianilos
- 1998
Citation Context ...sed by a typo or an abbreviation. Therefore, adapting string edit distance to a particular domain requires assigning different weights to different edit operations. In prior work, Ristad and Yianilos [21] have developed a generative model for Levenshtein distance along with an Expectation-Maximization algorithm that learns model parameters using a training set consisting of matched strings. We propose ...

174 | An efficient domain-independent algorithm for detecting approximately duplicate database records
- Monge, Elkan
- 1997
Citation Context ...25], the merge/purge problem [10], duplicate detection [15, 22], hardening soft databases [3], reference matching [13], and entity-name clustering and matching [4]. Typically, standard string similarity metrics such as edit distance [9] or vector-space cosine simi...

153 | The Field Matching Problem: Algorithms and Applications
- Monge, Elkan
- 1996
Citation Context ...cientific citations. Monge and Elkan developed the iterative merging algorithm based on the union-find data structure [15] and showed the advantages of using a string distance metric that allows gaps [14]. Cohen et al. [3] posed the duplicate detection task as an optimization problem, proved NP-hardness of solving the problem optimally, and proposed a nearly linear algorithm for finding a local optim...

152 | Substructure discovery using minimum description length and background knowledge
- Cook, Holder
- 1994
Citation Context ...this task, since it would allow discovering useful deletion sequences by developing a stochastic model based on the gaps created when computing minimum-cost alignments. Substructure discovery methods [5] could also be used to identify useful edit operation sequences that include different edit operations. 7. CONCLUSIONS Duplicate detection is an important problem in data cleaning, and an adaptive app...

131 | Automatic linkage of vital records
- Newcombe, Kennedy, et al.
- 1959
Citation Context ...ative record similarity that leads to poor accuracy in the duplicate detection process. 5. RELATED WORK The problem of identifying duplicate records in databases was originally identified by Newcombe [18] as record linkage in the context of identifying medical records of the same individual from different time periods. Fellegi and Sunter [7] developed a formal theory for record linkage and offered sta...

128 | Learning to match and cluster large high-dimensional data sets for data integration
- Cohen, Richman
- 2002
Citation Context ...25], the merge/purge problem [10], duplicate detection [15, 22], hardening soft databases [3], reference matching [13], and entity-name clustering and matching [4]. Typically, standard string similarity metrics such as edit distance [9] or vector-space cosine similarity [1] are used to determine whether two values or records are alike enough to be duplicates. S...

123 | The alternating decision tree learning algorithm
- Freund, Mason
- 1999
Citation Context ...of duplicates in it). The SVMlight implementation of a support vector machine with a radial basis function kernel was compared with the WEKA package [26] implementation of alternating decision trees [8], a state-of-the-art algorithm that combines boosting and decision tree learning. Unlearned vector-space normalized dot product was used as the field-level similarity measure. Figs. 11 and 12 illustrat...

107 | Learning domain-independent string transformation weights for high accuracy object identification
- Tejada, Knoblock, et al.
- 2002
Citation Context ... string similarity metrics such as edit distance [9] or vector-space cosine similarity [1] are used to determine whether two values or records are alike enough to be duplicates. Some more recent work [4, 22, 23] has investigated the use of pairing functions that combine multiple standard metrics. Because an estimate of similarity between strings can vary significantly depending on the domain and specific fie...

97 | Obtaining calibrated probability estimates from decision trees and naive Bayes classifiers
- Zadrozny, Elkan
- 2001
Citation Context ...licate detection. While decision trees are reliable classifiers, obtaining calibrated confidence scores from them relies on probability estimates based on training data statistics over the tree nodes [27]. When little training data is available, such frequency-based estimates are very unreliable. As a result, the confidence of the decision tree classifier is an inaccurate measure of relative record si...

54 | Hardening soft information sources
- Cohen, Kautz, et al.
- 2000
Citation Context ...reviations, as well as integration of multiple data sources. Variations are particularly pronounced in data that is automatically extracted from unstructured or semi-structured documents or web pages [16, 3]. Such approximate duplicates can have many deleterious effects, including preventing data-mining algorithms from discovering important regularities. This problem is typically handled during a tedious...

29 | Using information extraction to aid the discovery of prediction rules from text
- Nahm
- 2000
Citation Context ...reviations, as well as integration of multiple data sources. Variations are particularly pronounced in data that is automatically extracted from unstructured or semi-structured documents or web pages [16, 3]. Such approximate duplicates can have many deleterious effects, including preventing data-mining algorithms from discovering important regularities. This problem is typically handled during a tedious...

8 | Learning to combine trained distance metrics for duplicate detection in databases
- Mooney
- 2002
Citation Context ...transition and an accompanying character pair emission. In the MAXIMIZATION procedure all model parameters are updated using the collected expectations. Pseudo-code for the algorithms can be found in [2]. It can be proved that this training procedure is guaranteed to converge to a local maximum of likelihood of observing the training corpus C. The trained model can be used for estimating distance bet...
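Several of the contexts above refer to vector-space cosine similarity over TF-IDF representations as the standard, unlearned baseline. A minimal Python sketch of that baseline follows, under stated assumptions: tokens are lowercased whitespace-separated words (the stemming and stopword removal mentioned in the contexts are omitted), the IDF weight is the plain `log(N / df)` form, and the function names `tfidf_vectors` and `cosine` are hypothetical. It does not include the SVM-based learning step the paper proposes.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (token -> weight) per input string."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many strings each token occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {tok: count * math.log(n / df[tok]) for tok, count in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For example, given the three strings `"Fenix at the Argyle"`, `"Fenix"`, and `"Spago Los Angeles"`, the first two share a token and so receive a positive similarity, while the first and third share none and score zero.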