Results 1-10 of 15
A Guided Tour to Approximate String Matching
ACM Computing Surveys, 1999
Cited by 418 (38 self)
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
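To make "online searching under edit distance" concrete, here is a minimal sketch of the classic column-wise dynamic programming search, usually attributed to Sellers. The abstract does not say which algorithms the survey recommends, so this is only a baseline illustration; the function name `approximate_search` and its signature are assumptions made for this example.

```python
def approximate_search(pattern: str, text: str, k: int) -> list[int]:
    """Return the 1-based end positions in `text` where some substring
    ending there matches `pattern` with edit distance at most `k`."""
    m = len(pattern)
    # column[i] = smallest edit distance between pattern[:i] and any
    # suffix of the text read so far. column[0] stays 0 because a match
    # may start at any text position.
    column = list(range(m + 1))
    matches = []
    for end, ch in enumerate(text, start=1):
        prev_diag = column[0]  # old value of column[i - 1]
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new_value = min(
                prev_diag + cost,   # match or substitution
                column[i] + 1,      # extra text character
                column[i - 1] + 1,  # missing pattern character
            )
            prev_diag = column[i]
            column[i] = new_value
        if column[m] <= k:
            matches.append(end)
    return matches
```

The text is scanned once with no preprocessing, which is what makes the search online; the cost is O(mn) time and O(m) space for a pattern of length m and a text of length n.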
Approximate string matching
ACM Computing Surveys, 1980
Cited by 135 (0 self)
Approximate matching of strings is reviewed with the aim of surveying techniques suitable for finding an item in a database when there may be a spelling mistake or other error in the keyword. The methods found are classified as either equivalence or similarity problems. Equivalence problems are seen to be readily solved using canonical forms. For similarity problems, difference measures are surveyed, with a full description of the well-established dynamic programming method, relating this to the approach using probabilities and likelihoods. Searches for approximate matches in large sets using a difference function are seen to be an open problem still, though several promising ideas have been suggested. Approximate matching (error correction) during parsing is briefly reviewed.
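The dynamic programming method mentioned in this abstract can be sketched as the standard unit-cost edit distance recurrence. This is a minimal illustration, assuming insertions, deletions, and substitutions each cost 1; the helper `closest_keyword` is hypothetical and simply scans a small in-memory list, which sidesteps the large-set search problem the abstract calls open.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between `a` and `b`, computed row by row."""
    previous = list(range(len(b) + 1))  # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def closest_keyword(keyword: str, items: list[str]) -> str:
    """Hypothetical helper: return the item with the smallest distance."""
    return min(items, key=lambda item: edit_distance(keyword, item))
```

For example, `closest_keyword("serach", ["search", "parse", "string"])` selects `"search"`, the only entry within two edits of the misspelled keyword.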
Algorithms for the Satisfiability (SAT) Problem: A Survey
DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 1996
Cited by 126 (3 self)
The satisfiability (SAT) problem is a core problem in mathematical logic and computing theory. In practice, SAT is fundamental in solving many problems in automated reasoning, computer-aided design, computer-aided manufacturing, machine vision, database, robotics, integrated circuit design, computer architecture design, and computer network design. Traditional methods treat SAT as a discrete, constrained decision problem. In recent years, many optimization methods, parallel algorithms, and practical techniques have been developed for solving SAT. In this survey, we present a general framework (an algorithm space) that integrates existing SAT algorithms into a unified perspective. We describe sequential and parallel SAT algorithms including variable splitting, resolution, local search, global optimization, mathematical programming, and practical SAT algorithms. We give performance evaluation of some existing SAT algorithms. Finally, we provide a set of practical applications of the sat...
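As a rough illustration of the "variable splitting" family named in this abstract, the sketch below implements a small DPLL-style procedure: unit propagation followed by branching on one literal. The abstract does not describe the survey's own formulation, so the choice of DPLL, the clause encoding (lists of nonzero integers, where a negative number is a negated variable), and the function names are all assumptions.

```python
from typing import Optional

Formula = list[list[int]]  # CNF: each clause is a list of literals


def simplify(clauses: Formula, literal: int) -> Formula:
    """Assume `literal` is true: drop satisfied clauses and remove the
    opposite literal from the remaining ones."""
    return [[x for x in clause if x != -literal]
            for clause in clauses if literal not in clause]


def dpll(clauses: Formula,
         assignment: Optional[dict[int, bool]] = None) -> Optional[dict[int, bool]]:
    """Return a satisfying (possibly partial) assignment, or None."""
    assignment = dict(assignment or {})
    while True:
        if any(len(clause) == 0 for clause in clauses):
            return None  # an empty clause cannot be satisfied
        units = [clause[0] for clause in clauses if len(clause) == 1]
        if not units:
            break
        literal = units[0]  # unit propagation: this literal is forced
        assignment[abs(literal)] = literal > 0
        clauses = simplify(clauses, literal)
    if not clauses:
        return assignment  # every clause has been satisfied
    literal = clauses[0][0]  # split on the first remaining literal
    for choice in (literal, -literal):
        branch = {**assignment, abs(choice): choice > 0}
        result = dpll(simplify(clauses, choice), branch)
        if result is not None:
            return result
    return None
```

Variables missing from the returned dictionary are unconstrained and may take either value.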
Approximate Text Searching
1998
Cited by 22 (6 self)
This thesis focuses on the problem of text retrieval allowing errors, also called "approximate" string matching. The problem is to find a pattern in a text, where the pattern and the text may have "errors". This problem has received a lot of attention in recent years because of its applications in many areas, such as information retrieval, computational biology and signal processing, to name a few. The aim of this work is the development and analysis of novel algorithms to deal with the problem under various conditions, as well as a better understanding of the problem itself and its statistical behavior. Although our results are valid in many different areas, we focus our attention on typical text searching for information retrieval applications. This makes some ranges of values for the parameters of the problem more interesting than others. We have divided this presentation in two parts. The first one deals with online approximate string matching, i.e. when there is no time or space to preprocess the text. These algorithms are the core of offline algorithms as well. Online searching is the area of the problem where better algorithms existed. We have obtained new bounds for the probability of an approximate match of a pattern in
An effective algorithm for string correction using generalized edit distances  I. Description of the . . .
1981
Cited by 19 (11 self)
This paper deals with the problem of estimating a transmitted string X_s from the corresponding received string Y, which is a noisy version of X_s. We assume that Y contains any number of substitution, insertion, and deletion errors, and that no two consecutive symbols of X_s were deleted in transmission. We have shown that for channels which cause independent errors, and whose error probabilities exceed those of noisy strings studied in the literature [12], at least 99.5% of the erroneous strings will not contain two consecutive deletion errors. The best estimate X+ of X_s is defined as that element of H which minimizes the generalized Levenshtein distance D(X/Y) between X and Y. Using dynamic programming principles, an algorithm is presented which yields X+ without computing individually the distances between every word of H and Y. Though this algorithm requires more memory, it can be shown that it is, in general, computationally less complex than all other existing algorithms which perform the same task.
Fast Approximate Search in Large Dictionaries
Computational Linguistics, 2004
Cited by 14 (4 self)
The need to correct garbled strings arises in many areas of natural language processing. If a dictionary is available that covers all possible input tokens, a natural set of candidates for correcting an erroneous input P is the set of all words in the dictionary for which the Levenshtein distance to P does not exceed a given (small) bound k. In this article we describe methods for efficiently selecting such candidate sets. After introducing as a starting point a basic correction method based on the concept of a "universal Levenshtein automaton," we show how two filtering methods known from the field of approximate text search can be used to improve the basic procedure in a significant way. The first method, which uses standard dictionaries plus dictionaries with reversed words, leads to very short correction times for most classes of input strings. Our evaluation results demonstrate that correction times for fixed-distance bounds depend on the expected number of correction candidates, which decreases for longer input words. Similarly, the choice of an optimal filtering method depends on the length of the input words.
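The candidate set described above (all dictionary words within Levenshtein distance k of an input P) can be computed with a much simpler baseline than the article's methods. The sketch below is not the universal Levenshtein automaton and not the reversed-dictionary filter; it walks a trie of the dictionary while carrying one dynamic programming row per node and prunes branches whose row minimum already exceeds k. The names `build_trie` and `candidates` are assumptions for this example.

```python
def build_trie(words: list[str]) -> dict:
    """Nested-dict trie; the key None marks the end of a word."""
    root: dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word
    return root


def candidates(trie: dict, query: str, k: int) -> list[tuple[str, int]]:
    """Return (word, distance) pairs with Levenshtein distance <= k."""
    results = []

    def walk(node: dict, row: list[int]) -> None:
        # row[j] = distance between the trie prefix and query[:j]
        if None in node and row[-1] <= k:
            results.append((node[None], row[-1]))
        for ch, child in node.items():
            if ch is None:
                continue
            next_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                next_row.append(min(next_row[j - 1] + 1,
                                    row[j] + 1,
                                    row[j - 1] + cost))
            # Row minima never decrease along a path, so a branch whose
            # minimum exceeds k cannot lead to a candidate.
            if min(next_row) <= k:
                walk(child, next_row)

    walk(trie, list(range(len(query) + 1)))
    return sorted(results)
```

Sharing prefixes across words means each dynamic programming row is computed once per trie node rather than once per dictionary word.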
Backwards Search in Context Bound Text Transformations
Cited by 1 (1 self)
The Burrows-Wheeler Transform (BWT) is the basis for many of the most effective compression and self-indexing methods used today. A key to the versatility of the BWT is the ability to search for patterns directly in the transformed text. A backwards search for a pattern P can be performed on a transformed text by iteratively determining the range of suffixes that match P. The search can be further enhanced by constructing a wavelet tree over the output of the BWT in order to emulate a suffix array. In this paper, we investigate new algorithms for search derived from a variation of the BWT whereby rotations are only sorted to a depth k, commonly referred to as a context bound transform. Interestingly, this BWT variant can be used to mimic a k-gram index, which is used in a variety of applications that need to efficiently return occurrences in text position order. In this paper, we present the first backwards search algorithms on the k-BWT, and show how to construct a self-index containing many of the attractive properties of a k-gram index. Keywords: backwards search; text indexing; BWT; context bound transform; k-gram index.
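The backwards search described in this abstract can be sketched for the standard, fully sorted BWT. This is not the context-bound k-BWT variant the paper studies, and it counts characters naively instead of using a wavelet tree. The sentinel `"\0"` (assumed absent from the text and smaller than every other character) and the function names are assumptions.

```python
from collections import Counter


def bwt(text: str) -> tuple[str, list[int]]:
    """Return the BWT of `text` plus the suffix array used to build it."""
    text += "\0"  # assumed-unique sentinel, sorts before all characters
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    # The last character of the rotation starting at i is text[i - 1].
    return "".join(text[i - 1] for i in order), order


def backward_search(last_column: str, pattern: str) -> tuple[int, int]:
    """Return the half-open range [lo, hi) of sorted rotations that start
    with `pattern`; lo == hi means there is no occurrence."""
    counts = Counter(last_column)
    smaller, total = {}, 0  # smaller[c] = number of characters below c
    for c in sorted(counts):
        smaller[c] = total
        total += counts[c]
    lo, hi = 0, len(last_column)
    for c in reversed(pattern):  # consume the pattern right to left
        if c not in smaller:
            return 0, 0
        # Naive rank query: occurrences of c in last_column[:position].
        lo = smaller[c] + last_column.count(c, 0, lo)
        hi = smaller[c] + last_column.count(c, 0, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi
```

Slicing the suffix array with the returned range, as in `sorted(order[lo:hi])`, lists the match positions in text order. A practical index would replace the `count` calls with constant-time or logarithmic-time rank structures.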
Text Searching: Theory and Practice
Cited by 1 (0 self)
We present the state of the art of the main component of text retrieval systems: the search engine. We outline the main lines of research and issues involved. We survey the relevant techniques in use today for text searching and explore the gap between theoretical and practical algorithms. The main observation is that simpler ideas are better in practice.
iVA-File: Efficiently Indexing Sparse Wide Tables in Community Systems
IEEE International Conference on Data Engineering
In community web management systems (CWMS), storage structures inspired by universal tables are being used increasingly to manage sparse datasets. Such a sparse wide table (SWT) typically embodies thousands of attributes, with many of them being undefined in each tuple, and low-dimensional structured similarity search on a combination of numerical and text attributes is a common operation. However, many properties of such wide tables and their associated Web 2.0 services render most multidimensional indexing structures irrelevant. Recent studies in this area have mainly focused on improving the storage efficiency and efficient deployment of inverted indices; so far no new index has been proposed for indexing SWTs. The inverted index is fast for scanning but not efficient in reducing random accesses to the data file as it captures little information about the content of attribute values. In this paper, we propose the iVA-file, which works on the basis of approximate contents and keeps scanning efficiency within a bounded range. We introduce the nG-signature to approximately represent data strings and improve the existing approximate vectors for numerical values. We also propose an efficient query processing strategy for the iVA-file, which is different from strategies used for existing scan-based indices. To enable the use of different metrics of distance between a query and a tuple that may vary from application to application, the iVA-file has been designed to be metric-oblivious and to provide efficient filter-and-refine search based on any rational metric. Extensive experiments on real datasets show that the iVA-file significantly outperforms existing proposals in query efficiency and, at the same time, keeps a good update speed.