Abstract:
To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.
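To make the fuzzy match operation concrete, the following is a minimal sketch of matching an incoming tuple against a reference relation. It is not the similarity function or the algorithm proposed in the paper: the character-level ratio from Python's `difflib` is a stand-in, the linear scan ignores efficiency entirely, and the reference tuples, the `fuzzy_match` name, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical reference relation: (product name, description) tuples.
REFERENCE = [
    ("Acme Widget", "steel widget, 10 mm"),
    ("Acme Gadget", "plastic gadget, blue"),
    ("Bolt Pack", "box of 100 hex bolts"),
]


def similarity(record, ref):
    """Stand-in similarity: mean of per-field character-level ratios in [0, 1]."""
    scores = [
        SequenceMatcher(None, a.lower(), b.lower()).ratio()
        for a, b in zip(record, ref)
    ]
    return sum(scores) / len(scores)


def fuzzy_match(record, reference, threshold=0.8):
    """Return (closest reference tuple, score), or (None, score) if below threshold.

    Assumes `reference` is non-empty. The threshold is an assumed value.
    """
    if record in reference:
        # Exact match: the incoming tuple is already clean.
        return record, 1.0
    # Naive linear scan over the reference relation.
    best = max(reference, key=lambda ref: similarity(record, ref))
    score = similarity(record, best)
    return (best, score) if score >= threshold else (None, score)


if __name__ == "__main__":
    dirty = ("Acme Widgit", "steel widget 10mm")
    print(fuzzy_match(dirty, REFERENCE))
```

A misspelled record such as `("Acme Widgit", "steel widget 10mm")` is mapped to the closest reference tuple, while a record that resembles nothing in the reference relation comes back as `None` so it can be routed for separate handling. Scanning every reference tuple costs time linear in the size of the relation per incoming tuple, which is the cost an efficient fuzzy match algorithm has to avoid.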