## Robust and efficient fuzzy match for online data cleaning (2003)

Venue: SIGMOD

Citations: 155 (7 self)

### BibTeX

@INPROCEEDINGS{Chaudhuri03robustand,

author = {Surajit Chaudhuri and Kris Ganjam and Venkatesh Ganti and Rajeev Motwani},

title = {Robust and efficient fuzzy match for online data cleaning},

booktitle = {In SIGMOD},

year = {2003},

pages = {313--324}

}

### Abstract

To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.

### Citations

2362 | Modern Information Retrieval
- Baeza-Yates, Ribeiro-Neto
- 1999
Citation Context ...IR literature for quantifying the notion of token importance; informally, the importance of a token decreases with its frequency, which is the number of times a token occurs in the reference relation [3]. Even though the approach of weight association is common in the IR literature, the effective use of token weights in combination with data entry errors (e.g., spelling mistakes, missing values, inco...

1461 | Identification of common molecular subsequences
- Smith, Waterman
- 1981
Citation Context ...nsforming u[i] into v[i]. $tc(u, v) = \sum_i tc(u[i], v[i])$. The minimum transformation cost tc(u[i], v[i]) can be computed using the dynamic programming algorithm used for edit distance computation [22]. Consider the input tuple u[Beoing Corporation, Seattle, WA, 98004] in Table 2 and the reference tuple v[Boeing Company, Seattle, WA, 98004]. The minimum cost transformation of u[1] into v[1] require...

1011 | Applied Cryptography
- Schneier
- 1995
Citation Context ...memory, we can adopt one of following approaches. Cache without Collisions: We can reduce the size of the token-frequency cache by mapping each token to an integer using a 1-1 hash function (e.g., MD5 [21]). We now only require 24 bytes of space (as opposed to a higher number earlier) for each token: the hash value (16 bytes), the column to which it belongs (4 bytes), and the frequency (4 bytes). Now, ...

713 | Approximate nearest neighbors: towards removing the curse of dimensionality, Proceedings of the thirtieth annual ACM symposium on Theory of computing
- Indyk, Motwani
- 1998
Citation Context ...rence tuples closest to an input tuple. It is well-known that efficiently identifying the exact K nearest neighbors even according to the Euclidean and Hamming norms in high-dimensional spaces is hard [14]. Since the Hamming norm is a special case of the edit distance obtained by allowing only replacements, the identification of the exact closest K matches according to our fuzzy match similarity—which ...

563 | Multidimensional Access Methods
- Gaede, Gunther
- 1998
Citation Context ...imilarity threshold. This formulation is essentially that of the nearest neighbor problem, but there the domain is typically a Euclidean (or other normed) space with well-behaved similarity functions [11]. In our case, the data are not represented in “geometric” spaces, and it is hard to map them into one because the similarity function is relatively complex. Previous approaches addressing the fuzzy m...

513 | M-tree: An efficient access method for similarity search in metric spaces - Ciaccia, Patella, et al. - 1997 |

341 | On the resemblance and containment of documents
- Broder
- 1997
Citation Context ...here the i-th coordinate $mh_i(S)$ is defined as $mh_i(S) = \arg\min_{a \in S} h_i(a)$. Let $I[X]$ denote an indicator variable over a boolean X, i.e., $I[X] = 1$ if X is true, and 0 otherwise. Then (as shown in [4, 6]), $E[sim(S_1, S_2)] = \frac{1}{H} \sum_{i=1}^{H} I[mh_i(S_1) = mh_i(S_2)]$. Computing the min-hash signature is like throwing darts at a board and stopping when we hit an element of S. Hence, the probability that we...

300 | The Merge/Purge Problem for Large Databases
- Hernandez, Stolfo
- 1995
Citation Context ...uted for further cleaning before considering it as referring to a new customer. A fuzzy match operation that is resilient to input errors can effectively prevent the proliferation of fuzzy duplicates [13] in a relation, i.e., multiple tuples describing the same real world entity. [Figure 1: A Template for using Fuzzy Match] Our goal in this paper is to develop a robust and efficient fuzzy match algorith...

214 | Integration of heterogeneous databases without common domains using queries based on textual similarity - Cohen - 1998 |

197 | Interactive Deduplication Using Active Learning - Sarawagi, Bhamidipaty - 2002 |

161 | Approximate string joins in a database (almost) for free
- Gravano, Ipeirotis, et al.
- 2001
Citation Context ...pirical study on real datasets, and conclude in Section 7. 2. RELATED WORK Several methods for approximate string matching over dictionaries or collections of text documents have been proposed (e.g., [12], [17]). All of the above methods use edit distance as the similarity function, not considering the crucial aspect of differences in importance of tokens while measuring similarity. Approximate string...

114 | Size-estimation framework with applications to transitive closure and reachability
- Cohen
- 1997
Citation Context ...here the i-th coordinate $mh_i(S)$ is defined as $mh_i(S) = \arg\min_{a \in S} h_i(a)$. Let $I[X]$ denote an indicator variable over a boolean X, i.e., $I[X] = 1$ if X is true, and 0 otherwise. Then (as shown in [4, 6]), $E[sim(S_1, S_2)] = \frac{1}{H} \sum_{i=1}^{H} I[mh_i(S_1) = mh_i(S_2)]$. Computing the min-hash signature is like throwing darts at a board and stopping when we hit an element of S. Hence, the probability that we...

111 | Eliminating fuzzy duplicates in data warehouses
- Ananthakrishna, Chaudhuri, et al.
- 2002
Citation Context ...it distance [e.g., 13], some on cosine similarity with IDF weights [e.g., 8], some on learning similarity functions from training datasets [e.g., 10, 20], and some on the use of dimension hierarchies [1]. However, all such techniques are designed for use in an offline setting and do not satisfy the efficiency requirements of the online fuzzy match operation where input tuples have to be quickly match...

83 | Data Integration Using Similarity Joins and a Word-Based Information Representation Language - Cohen |

66 | Two algorithms for approximate string matching in static texts
- Jokinen, Ukkonen
- 1991
Citation Context ...on as $fms^{apx}$ and using Chernoff bounds [16], we have the following inequality, which yields Result (ii). $E[X < (1 - \delta) f(u, v)] \le E[X < (1 - \delta) E[X]] \le e^{-\delta^2 H f(u, v) / 2}$. Lemma 4.2 [15]: Let $t_1$, $t_2$ be two tokens, and $m = \max(|t_1|, |t_2|)$. Let $d = (1 - 1/q)(1 - 1/m)$. Then, $1 - ed(s_1, s_2) \le \frac{|QG(s_1) \cap QG(s_2)|}{mq} + d$. Because the probability $P(fms^{apx}(u, v) \le (1 - \delta) fms(u, v))$ ...

65 | Searching in metric spaces by spatial approximation - Navarro - 2002 |

54 | Indexing methods for approximate string matching
- Navarro, Baeza-Yates, et al.
Citation Context ...dopt proprietary domain-specific functions (e.g., Trillium’s reference matching operation for the address domain [23]) or use the string edit distance function for measuring similarity between tuples [17]. A limitation of the edit distance is illustrated by the following example. The edit distance function would consider the input tuple I3 in Table 2 to be closest to R2 in Table 1, even though we know...

38 | Indexing text with approximate q-grams - Navarro, Sutinen, et al. |

33 | Approximating matrix multiplication for pattern recognition tasks (special issue of selected papers from SODA’97)
- Cohen, Lewis
- 1999
Citation Context ...he limitation of ignoring erroneous input tokens. Further, Cohen et al. improve efficiency by choosing probabilistically a subset of tokens from each document under the correct input token assumption [9]. In this paper, we propose a similarity function that does not assume correctness of input tokens, and further improve efficiency by exploiting the variance in weights of input tokens. As discussed e...

31 | Learning to match and cluster entity names - Cohen, Richman - 2001 |

25 | Randomized Algorithms (Cambridge
- Motwani, Raghavan
- 1995
Citation Context ...H > 0, splitting $fms^{apx}(u, v)$ into the average of H independent functions $f_1'$, …, $f_H'$ one for each min-hash coordinate such that $f_i'$ has the same expectation as $fms^{apx}$ and using Chernoff bounds [16], we have the following inequality, which yields Result (ii). $E[X < (1 - \delta) f(u, v)] \le E[X < (1 - \delta) E[X]] \le e^{-\delta^2 H f(u, v) / 2}$. Lemma 4.2 [15]: Let $t_1$, $t_2$ be two tokens, and m = ma...

14 | A practical index for text retrieval allowing errors - Baeza-Yates, Navarro - 1997 |

1 | http://www.trilliumsoft.com
- Software
Citation Context ...s relatively complex. Previous approaches addressing the fuzzy match operation either adopt proprietary domain-specific functions (e.g., Trillium’s reference matching operation for the address domain [23]) or use the string edit distance function for measuring similarity between tuples [17]. A limitation of the edit distance is illustrated by the following example. The edit distance function would con...
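
The min-hash estimator quoted in the citation contexts for [4, 6] can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the paper's implementation: the helper names (`make_hash`, `minhash_signature`, `estimated_similarity`, `qgrams`), the use of seeded MD5 digests as the hash functions $h_i$, the choice of q = 3, and H = 64 are all hypothetical choices made here. Each signature coordinate is the element of the set that minimizes $h_i$, and the similarity estimate is the fraction of coordinates on which two signatures agree.

```python
import hashlib


def make_hash(seed):
    """Return a deterministic hash function h_seed: str -> int.

    Assumption: seeded MD5 digests stand in for the hash functions h_i.
    """
    def h(item):
        digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return h


def minhash_signature(items, hash_funcs):
    """Signature [mh_1(S), ..., mh_H(S)], where mh_i(S) = argmin_{a in S} h_i(a)."""
    return [min(items, key=h) for h in hash_funcs]


def estimated_similarity(sig1, sig2):
    """(1/H) * sum_i I[mh_i(S1) == mh_i(S2)] for two signatures of equal length H."""
    matches = sum(1 for a, b in zip(sig1, sig2) if a == b)
    return matches / len(sig1)


def qgrams(token, q=3):
    """Set of contiguous substrings of length q (hypothetical helper)."""
    return {token[i:i + q] for i in range(len(token) - q + 1)}


H = 64  # number of min-hash coordinates (assumed value)
hash_funcs = [make_hash(i) for i in range(H)]

s1 = qgrams("boeing company")
s2 = qgrams("beoing company")  # same token with a transposition error

sig1 = minhash_signature(s1, hash_funcs)
sig2 = minhash_signature(s2, hash_funcs)

estimate = estimated_similarity(sig1, sig2)
exact = len(s1 & s2) / len(s1 | s2)  # exact set resemblance, for comparison
print(f"estimated: {estimate:.3f}  exact: {exact:.3f}")
```

Comparing the two printed values gives a feel for how the estimate tracks the exact set resemblance; a larger H would be expected to tighten the estimate at the cost of longer signatures.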