Robust and efficient fuzzy match for online data cleaning
- In SIGMOD, 2003
Cited by 130 (6 self)
To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.
Robust Identification of Fuzzy Duplicates
- In ICDE, 2005
Cited by 43 (0 self)
Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.
Practical Suffix Tree Construction
- In Proc. 30th International Conference on Very Large Data Bases, 2004
Cited by 26 (2 self)
Large string datasets are common in a number of emerging text and biological database applications.
A comparison of personal name matching: Techniques and practical issues
- In ‘Workshop on Mining Complex Data’ (MCD’06), held at IEEE ICDM’06, Hong Kong, 2006
Cited by 23 (6 self)
Average-Optimal Multiple Approximate String Matching
- In Proc. 14th Combinatorial Pattern Matching (CPM 2003), LNCS 2676, 2003
Cited by 9 (8 self)
We present a new algorithm for multiple approximate string matching, based on an extension of the optimal (on average) single-pattern approximate string matching algorithm of Chang and Marr. Our algorithm inherits the optimality and is also competitive in practice.
Practical methods for constructing suffix trees
- 2005
Cited by 9 (1 self)
Sequence datasets are ubiquitous in modern life science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not
Matchsimile: A Flexible Approximate Matching Tool for Personal Names Searching
- Journal of the American Society for Information Science and Technology, 2003
Cited by 8 (0 self)
In this paper we present the architecture and algorithms behind Matchsimile, an approximate string matching lookup tool especially designed for searching human and company names against a large textual database. Part of a larger information retrieval environment, this engine accepts an input text file with a set of personal and company names and a set of restrictions for the search. After batch processing, the engine outputs another text file containing the occurrences that match each record of the input names file, according to its search parameters. Beyond the similarity search capabilities applied to each word that forms a name, the tool considers a set of formation rules for personal names, such as combination, abbreviation, character mapping, duplicity detection, ordering, and word omission and insertion, among others. This engine is used in a successful commercial application (also named Matchsimile), which uses the tool to search for lawyers' names in many official law journal publications.
The smoothed complexity of edit distance
- In Proc. of ICALP, 2008
Cited by 7 (2 self)
We initiate the study of the smoothed complexity of sequence alignment, by proposing a semi-random model of edit distance between two input strings, generated as follows. First, an adversary chooses two binary strings of length d and a longest common subsequence A of them. Then, every character is perturbed independently with probability p, except that A is perturbed in exactly the same way inside the two strings. We design two efficient algorithms that compute the edit distance on smoothed instances up to a constant factor approximation. The first algorithm runs in near-linear time, namely $d^{1+\varepsilon}$ for any fixed $\varepsilon > 0$. The second one runs in time sublinear in d, assuming the edit distance is not too small. These approximation and runtime guarantees are significantly better than the bounds known for worst-case inputs, e.g., a near-linear time algorithm achieving approximation roughly $d^{1/3}$, due to Batu, Ergün, and Sahinalp [SODA 2006]. Our technical contribution is twofold. First, we rely on finding matches between substrings in the two strings, where two substrings are considered a match if their edit distance is relatively small, a prevailing technique in commonly used heuristics, such as PatternHunter of Ma, Tromp and Li [Bioinformatics, 2002]. Second, we effectively reduce the smoothed edit distance to a simpler variant of (worst-case) edit distance, namely, edit distance on permutations (a.k.a. Ulam’s metric). We are thus able to build on algorithms developed for the Ulam metric, whose much better algorithmic guarantees usually do not carry over to general edit distance.
A Practical Index for Genome Searching
- In Proceedings of the 10th International Symposium on String Processing and Information Retrieval (SPIRE 2003), LNCS 2857, 2003
Cited by 6 (1 self)
Current search tools for computational biology trade efficiency for precision, losing many relevant matches. We push in the direction of obtaining maximum efficiency from an indexing scheme that does not lose any relevant match. We show that it is feasible to search the human genome efficiently on an average desktop computer.

