
Approximate Text Search, (1998)

by G Navarro

Results 1 - 10 of 25

Information retrieval on the Web

by Mei Kobayashi, Koichi Takeda - ACM Computing Surveys, 2000
Cited by 95 (0 self)
In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts, and Web sites. Although numerical figures vary, overall trends cited

Better Filtering with Gapped q-Grams

by Stefan Burkhardt, Juha Kärkkäinen, 2001
Cited by 90 (2 self)
A popular and well-studied class of filters for approximate string matching compares substrings of length q, the q-grams, in the pattern and the text to identify text areas that contain potential matches. A generalization of the method that uses gapped q-grams instead of contiguous substrings is mentioned a few times in the literature but has never been analyzed in any depth. In this paper, we report the first results of a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. To achieve these results, the arrangement of the gaps in the q-gram and a filter parameter called threshold have to be optimized. Both of these tasks are nontrivial combinatorial optimization problems for which we present efficient solutions. We concentrate on the k mismatches problem, i.e., approximate string matching with the Hamming distance.
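
To make the notion of a gapped q-gram concrete, here is a minimal Python sketch that extracts q-grams according to a shape and counts how many aligned q-grams a pattern shares with a text window of the same length. The shape notation ('#' for a sampled position, '-' for a gap) and all function names are illustrative assumptions, not taken from the paper, and the sketch does not compute the optimized shapes or thresholds that the abstract refers to.

```python
def gapped_qgrams(s, shape):
    """Return the gapped q-grams of s for a shape such as '##-#'.

    '#' marks a sampled position and '-' a gap (hypothetical notation).
    """
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    span = len(shape)
    return [
        "".join(s[start + off] for off in offsets)
        for start in range(len(s) - span + 1)
    ]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def shared_aligned_qgrams(pattern, window, shape):
    """Count start positions whose gapped q-grams agree in pattern and window.

    Under the Hamming distance the pattern and the window are aligned
    position by position, so q-grams are compared at equal offsets.
    """
    if len(pattern) != len(window):
        raise ValueError("window must have the same length as the pattern")
    pairs = zip(gapped_qgrams(pattern, shape), gapped_qgrams(window, shape))
    return sum(p == w for p, w in pairs)
```

A filter built on this count would discard a window whenever the number of shared q-grams falls below a threshold; choosing that threshold and the shape is the optimization problem the abstract describes.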

Fast and Flexible String Matching by Combining Bit-parallelism and Suffix Automata

by Gonzalo Navarro, Mathieu Raffinot - ACM Journal of Experimental Algorithmics (JEA), 1998
Cited by 74 (10 self)
... In this paper we merge bit-parallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bit-parallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as Shift-Or. It inherits from Shift-Or the ability to handle flexible patterns and from BDM the ability to skip characters. BNDM is 30%-40% faster than BDM and up to 7 times faster than Shift-Or. When compared to the fastest existing algorithms on exact patterns (which belong to the BM family), BNDM is from 20% slower to 3 times faster, depending on the alphabet size. With respect to flexible pattern searching, BNDM is by far the fastest technique to deal with classes of characters and is competitive to search allowing errors. In particular, BNDM seems very adequate for computational biology applications, since it is the fastest algorithm to search on DNA sequences and flexible searching is an important problem in that
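
The bit-parallel simulation of the suffix automaton can be sketched as follows. This is a minimal Python illustration of the BNDM idea for exact matching, written from the general published description of the algorithm rather than from the authors' code. It uses Python's unbounded integers as bit vectors, so the usual requirement that the pattern fit in a machine word does not apply here; the function name is an assumption.

```python
def bndm_search(pattern, text):
    """Return the start positions of all exact occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # B[c] has bit (m - 1 - i) set when pattern[i] == c.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))

    full = (1 << m) - 1   # all m bits set
    top = 1 << (m - 1)    # set when the text read so far is a pattern prefix
    occurrences = []
    pos = 0
    while pos <= n - m:
        j, last = m, m
        D = full
        # Read the window right to left while it is still a pattern factor.
        while D != 0:
            D &= B.get(text[pos + j - 1], 0)
            j -= 1
            if D & top:
                if j > 0:
                    last = j          # remember the prefix: limits the shift
                else:
                    occurrences.append(pos)
            D = (D << 1) & full
        pos += last                   # skip characters, as in BDM
    return occurrences
```

Supporting classes of characters, as the abstract mentions, would only require setting the same bit in `B` for every character a pattern position accepts.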

A Hybrid Indexing Method for Approximate String Matching

by Gonzalo Navarro, Ricardo Baeza-Yates
Cited by 63 (10 self)
We present a new indexing method for the approximate string matching problem. The method is based on a suffix array combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the average retrieval time is $O(n^{\lambda} \log n)$, for some $\lambda > 0$ that depends on the error fraction tolerated $\alpha$ and the alphabet size $\sigma$. It is shown that $\lambda < 1$ for approximately $\alpha < 1 - e/\sqrt{\sigma}$, where $e = 2.718\ldots$. The space required is four times the text size, which is quite moderate for this problem. We experimentally show that this index can outperform by far all the existing alternatives for indexed approximate searching. These are also the first experiments that compare the different existing schemes.

Citation Context

...= log n) with a complicated constant. For larger values the pattern partitioning method gives linear complexity and we need to resort to the traditional suffix tree traversal (j = 1). As shown in [7, 25], it is very unlikely that this limit of $1 - e/\sqrt{\sigma}$ can be improved, since there are too many real approximate occurrences in the text. A simplified technique that gives a reasonable result in most case...

Block Addressing Indices for Approximate Text Retrieval

by Ricardo Baeza-Yates, Gonzalo Navarro - Journal of the American Society for Information Science (JASIS), 1997
Cited by 49 (26 self)
Although the issue of approximate text retrieval has been gaining importance in recent years, it is currently addressed by only a few indexing schemes. To reduce space requirements, the indices may point to text blocks instead of exact word positions. This is called "block addressing". The most notorious index of this kind is Glimpse. However, block addressing has not been well studied yet, especially regarding approximate searching. Our main contribution is an analytical study of the space-time trade-offs related to the block size. We find that, under reasonable assumptions, it is possible to build an index which is simultaneously sublinear in space overhead and in query time. We validate the analysis with extensive experiments, obtaining typical performance figures. These results are valid not only for approximate searching queries but also for classical ones. Finally, we propose a new strategy for approximate searching on block addressing indices, which we experimentally find 4-5 times f...

Citation Context

...umber of allowed "errors"). This problem has a number of other applications in computational biology, signal processing, etc. There exist a number of solutions for the on-line version of this problem [31] (i.e. the pattern can be preprocessed but the text cannot). All these algorithms traverse the whole text sequentially. If the text database is large, even the fastest on-line algorithms are not pract...
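
The block addressing idea described in the abstract above (index entries point to text blocks, and candidate blocks are then scanned) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: blocks hold a fixed number of whitespace-separated words, queries are exact single words, and the function names are hypothetical. It does not reproduce Glimpse or the approximate search strategy proposed in the paper.

```python
def build_block_index(text, block_size):
    """Split text into blocks of block_size words and index words by block.

    Returns (blocks, index), where index maps each word to the set of
    block numbers that contain it, not to exact word positions.
    """
    words = text.split()
    blocks = [words[i:i + block_size] for i in range(0, len(words), block_size)]
    index = {}
    for block_no, block in enumerate(blocks):
        for word in block:
            index.setdefault(word, set()).add(block_no)
    return blocks, index


def search(word, blocks, index):
    """Return (block number, offset in block) for every occurrence of word.

    The index only narrows the search to candidate blocks; each candidate
    block is then scanned sequentially to find the exact positions.
    """
    hits = []
    for block_no in sorted(index.get(word, ())):
        for offset, candidate in enumerate(blocks[block_no]):
            if candidate == word:
                hits.append((block_no, offset))
    return hits
```

Larger blocks shrink the index but lengthen the sequential scan of each candidate block, which is the space-time trade-off the abstract analyzes.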

Boosting Precision and Recall of Dictionary-Based Protein Name Recognition

by Yoshimasa Tsuruoka, Jun'ichi Tsujii - Proc. of the ACL-03 Workshop on Natural Language Processing in Biomedicine, 2003
Cited by 33 (1 self)
Dictionary-based protein name recognition is the first step for practical information extraction from biomedical documents because, unlike machine-learning-based approaches, it provides ID information for recognized terms. However, dictionary-based approaches have two serious problems: (1) a large number of false recognitions, mainly caused by short names, and (2) low recall due to spelling variation.

New and Faster Filters for Multiple Approximate String Matching

by Ricardo Baeza-Yates, Gonzalo Navarro - Random Structures and Algorithms (RSA), 1998
Cited by 18 (8 self)
We present three new algorithms for on-line multiple string matching allowing errors. These are extensions of previous algorithms that search for a single pattern. The average running time achieved is in all cases linear in the text size for moderate error level, pattern length and number of patterns. They adapt (with higher costs) to the other cases. However, the algorithms differ in speed and thresholds of usefulness. We analyze theoretically when each algorithm should be used, and show experimentally their performance. The only previous solution for this problem allows only one error. Our algorithms are the first to allow more errors, and are faster than previous work for a moderate number of patterns (e.g. less than 50-100 on English text, depending on the pattern length).

A New Indexing Method for Approximate String Matching

by Gonzalo Navarro, Ricardo Baeza-Yates - Combinatorial Pattern Matching, 10th Annual Symposium, 1999
Cited by 16 (5 self)
We present a new indexing method for the approximate string matching problem. The method is based on a suffix tree combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the retrieval time is $O(n^{\lambda})$, for $0 < \lambda < 1$, whenever $\alpha < 1 - e/\sqrt{\sigma}$, where $\alpha$ is the error level tolerated and $\sigma$ is the alphabet size. We experimentally show that this index outperforms by far all other algorithms for indexed approximate searching, also being the first experiments that compare the different existing schemes. We finally show how this index can be implemented using much less space.

1 Introduction

Approximate string matching is a recurrent problem in many branches of computer science, with applications to text searching, computational biology, pattern recognition, signal processing, etc. The problem is: given a long text of length n, and a (comparatively short) pattern of length m, retrieve all the text segments (or "occurrences") whose edit distance t...

Citation Context

...age retrieval time can be made $O(n^{2(\alpha + H_{\sigma}(\alpha))/(1+\alpha)})$, where $H_{\sigma}(\alpha)$ is the base-$\sigma$ entropy function. This is sublinear for $\alpha < 1 - e/\sqrt{\sigma}$. This limit on $\alpha$ cannot probably be improved [8, 25]. We finally propose an alternative data structure to reduce the space requirements of the suffix tree, with little time penalty.

2 Combining Suffix Trees and Pattern Partitioning

We present now our a...
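
The problem statement in the abstract above (given a text of length n and a pattern of length m, report text segments close to the pattern in edit distance) can be illustrated with the classical on-line dynamic-programming scan. This sketch is not the indexed method of the paper: it traverses the whole text in O(mn) time. It assumes the truncated sentence bounds the edit distance by a threshold, here called k; that parameter and the function name are assumptions.

```python
def approx_end_positions(pattern, text, k):
    """Return text indices where an occurrence with edit distance <= k ends.

    col[i] holds the smallest edit distance between pattern[:i] and any
    text segment ending at the current text position.
    """
    m = len(pattern)
    col = list(range(m + 1))   # before reading any text: i deletions
    ends = []
    for j, text_char in enumerate(text):
        prev_diag = col[0]     # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == text_char else 1
            col[i] = min(
                prev_diag + cost,  # match or substitution
                old + 1,           # extra text character
                col[i - 1] + 1,    # missing pattern character
            )
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

Only end positions are reported; recovering the start of each segment would need either a backward pass or storing the full matrix.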

One-gapped q-gram filters for Levenshtein Distance

by Stefan Burkhardt, Juha Kärkkäinen - In: Proceedings of the 13th Symposium on Combinatorial Pattern Matching (CPM’02), Volume 2373, 2002
Cited by 15 (1 self)
We have recently shown that q-gram filters based on gapped q-grams instead of the usual contiguous q-grams can provide orders of magnitude faster and/or more efficient filtering for the Hamming distance. In this paper, we extend the results for the Levenshtein distance, which is more problematic for gapped q-grams because an insertion or deletion in a gap affects a q-gram while a replacement does not. To keep this effect under control, we concentrate on gapped q-grams with just one gap. We demonstrate with experiments that the resulting filters provide a significant improvement over the contiguous q-gram filters. We also develop new techniques for dealing with complex q-gram filters.

On-line approximate string searching algorithms: survey and experimental results

by P. D. Michailidis, K. G. Margaritis - International Journal of Computer Mathematics, 2002
Cited by 10 (5 self)
The problem of approximate string searching comprises two classes of problems: string searching with k mismatches and string searching with k differences. In this paper we present a short survey and experimental results for well known sequential approximate string searching algorithms. We consider algorithms based on different approaches including dynamic programming, deterministic finite automata, filtering, counting and bit parallelism. We compare these algorithms in terms of running time against pattern length and for several values of k for four different kinds of text: binary alphabet, alphabet of size 8, English alphabet and DNA alphabet. Finally, we compare the experimental results of the algorithms with their theoretical complexities.