Robust and efficient fuzzy match for online data cleaning (2003) [91 citations — 5 self]

by Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani
In SIGMOD

Abstract:

To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.
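
As a rough illustration of the fuzzy match operation described in the abstract, the sketch below first tries an exact match against a reference relation and, failing that, returns the most similar reference tuple if it clears a threshold. Everything here is an assumption made for illustration: the `REFERENCE` rows, the `fuzzy_match` and `similarity` names, and the 0.8 threshold are hypothetical, and the similarity used is Python's generic `difflib.SequenceMatcher` ratio, not the similarity function the paper proposes. The linear scan is also only a stand-in for the efficient algorithm the paper develops.

```python
from difflib import SequenceMatcher

# Hypothetical reference relation of (name, description) tuples.
REFERENCE = [
    ("Acme Widget", "Steel widget, 10 mm"),
    ("Acme Gadget", "Plastic gadget, blue"),
    ("Bolt Pack", "Box of 100 hex bolts"),
]


def similarity(a, b):
    """Average case-insensitive per-field similarity in [0, 1].

    Stand-in measure (difflib ratio), NOT the paper's similarity function.
    """
    scores = [
        SequenceMatcher(None, x.lower(), y.lower()).ratio()
        for x, y in zip(a, b)
    ]
    return sum(scores) / len(scores)


def fuzzy_match(record, reference=REFERENCE, threshold=0.8):
    """Return (matched reference tuple or None, similarity score)."""
    # An exact match needs no cleaning.
    if record in reference:
        return record, 1.0
    # Otherwise scan every reference tuple and keep the closest one.
    best = max(reference, key=lambda ref: similarity(record, ref))
    score = similarity(record, best)
    # Reject the candidate when even the best score is below the threshold.
    return (best, score) if score >= threshold else (None, score)


if __name__ == "__main__":
    dirty = ("Acme Widgit", "Steel widget 10mm")
    print(fuzzy_match(dirty))
```

A scan like this touches every reference tuple for every incoming record, so it would likely be too slow for a large reference relation; that cost is the kind of problem an efficient fuzzy match algorithm has to address.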

Citations

1458 Modern Information Retrieval – Baeza-Yates, Ribeiro-Neto
829 Identification of common molecular subsequences – Smith, Waterman - 1981
788 Applied Cryptography – Schneier - 1996
445 Multidimensional access methods – Gaede, Günther - 1998
367 M-tree: An efficient access method for similarity search in metric spaces – Ciaccia, Zezula, et al. - 1997
359 Approximate nearest neighbors: towards removing the curse of dimensionality – Indyk, Motwani - 1998
206 The merge/purge problem for large databases – Hernandez, Stolfo - 1995
186 On the resemblance and containment of documents – Broder - 1998
176 Integration of heterogeneous databases without common domains using queries based on textual similarity – Cohen - 1998
111 Interactive deduplication using active learning – Sarawagi, Bhamidipaty
91 Approximate string joins in a database (almost) for free – Gravano, Ipeirotis, et al. - 2001
75 Eliminating fuzzy duplicates in data warehouses – Ananthakrishna, Chaudhuri, et al. - 2002
65 Size-estimation framework with applications to transitive closure and reachability – Cohen - 1997
62 Two algorithms for approximate string matching in static texts – Jokinen, Ukkonen - 1991
52 Data integration using similarity joins and a word-based information representation language – Cohen - 2000
45 Searching in metric spaces by spatial approximation – Navarro
39 Indexing methods for approximate string matching – Navarro, Baeza-Yates, et al. - 2001
33 Indexing text with approximate q-grams – Navarro, Sutinen, et al. - 2000
25 Learning to match and cluster entity names – Cohen, Richman - 2001
23 Approximating matrix multiplication for pattern recognition tasks – Cohen, Lewis - 1997
17 Randomized Algorithms – Motwani, Raghavan - 2005
13 A practical index for text retrieval allowing errors – Navarro, Baeza-Yates - 1998
1 http://www.trilliumsoft.com – Software