Results 1-10 of 25
Information retrieval on the Web
ACM Computing Surveys, 2000
Cited by 95 (0 self)
In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts, and Web sites. Although numerical figures vary, overall trends cited
Better Filtering with Gapped q-Grams
2001
Cited by 90 (2 self)
A popular and well-studied class of filters for approximate string matching compares substrings of length q, the q-grams, in the pattern and the text to identify text areas that contain potential matches. A generalization of the method that uses gapped q-grams instead of contiguous substrings is mentioned a few times in literature but has never been analyzed in any depth. In this paper, we report the first results of a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. To achieve these results the arrangement of the gaps in the q-gram and a filter parameter called threshold have to be optimized. Both of these tasks are nontrivial combinatorial optimization problems for which we present efficient solutions. We concentrate on the k mismatches problem, i.e., approximate string matching with the Hamming distance.
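To make the terms in this abstract concrete, here is a minimal sketch of a gapped q-gram filter for the k mismatches problem. It is an illustration only, not the authors' method: the function names, the example shape `(0, 1, 3)`, and the simple counting bound are assumptions. The bound uses the fact that one mismatch can destroy at most q of the q-grams, whereas the paper optimizes both the shape and the threshold.

```python
def gapped_qgrams(s, shape):
    """Return the gapped q-grams of s; shape is a sorted tuple of offsets, e.g. (0, 1, 3)."""
    span = shape[-1] + 1  # number of text positions one q-gram covers
    return ["".join(s[i + off] for off in shape) for i in range(len(s) - span + 1)]


def shared_at_same_position(a, b, shape):
    """Count start positions where equal-length strings a and b have the same gapped q-gram."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length for Hamming distance")
    return sum(x == y for x, y in zip(gapped_qgrams(a, shape), gapped_qgrams(b, shape)))


def simple_threshold(m, shape, k):
    """Lower bound on shared q-grams when two length-m strings differ in at most k positions.

    Each mismatch touches at most len(shape) q-grams. This is a simple bound,
    not the optimized threshold discussed in the abstract.
    """
    span = shape[-1] + 1
    return (m - span + 1) - len(shape) * k


def may_match(a, b, shape, k):
    """Filter step: False means the pair certainly has more than k mismatches."""
    return shared_at_same_position(a, b, shape) >= simple_threshold(len(a), shape, k)
```

A pair that passes `may_match` still has to be verified by counting mismatches directly; the filter can only rule candidates out.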
Fast and Flexible String Matching by Combining Bit-parallelism and Suffix Automata
ACM JOURNAL OF EXPERIMENTAL ALGORITHMICS (JEA), 1998
Cited by 74 (10 self)
... In this paper we merge bit-parallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bit-parallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as Shift-Or. It inherits from Shift-Or the ability to handle flexible patterns and from BDM the ability to skip characters. BNDM is 30%-40% faster than BDM and up to 7 times faster than Shift-Or. When compared to the fastest existing algorithms on exact patterns (which belong to the BM family), BNDM is from 20% slower to 3 times faster, depending on the alphabet size. With respect to flexible pattern searching, BNDM is by far the fastest technique to deal with classes of characters and is competitive to search allowing errors. In particular, BNDM seems very adequate for computational biology applications, since it is the fastest algorithm to search on DNA sequences and flexible searching is an important problem in that
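The idea of simulating a nondeterministic suffix automaton with bit-parallelism can be sketched for exact matching as follows. This is a minimal illustration of BNDM as it is commonly described, not the authors' implementation; the function name `bndm_search` is hypothetical. Python integers are unbounded, so the pattern length is not limited by the machine word here.

```python
def bndm_search(pattern, text):
    """Return the start offsets of all exact occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # masks[c] has bit (m - 1 - j) set when pattern[j] == c,
    # so pattern[0] corresponds to the highest bit.
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - j))

    full = (1 << m) - 1      # all m automaton states active
    top = 1 << (m - 1)       # set when the window suffix read so far is a pattern prefix
    occurrences = []
    pos = 0
    while pos <= n - m:
        j = m                # window characters still unread (window is scanned right to left)
        shift = m            # safe shift if no pattern prefix is recognized
        state = full
        while state:
            state &= masks.get(text[pos + j - 1], 0)
            j -= 1
            if state & top:
                if j > 0:
                    shift = j            # a prefix starts at window offset j
                else:
                    occurrences.append(pos)
                    break
            state = (state << 1) & full
        pos += shift
    return occurrences
```

The inner loop stops as soon as the characters read are no longer a factor of the pattern, which is what lets the window skip ahead without inspecting every text character.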
A Hybrid Indexing Method for Approximate String Matching
Cited by 63 (10 self)
We present a new indexing method for the approximate string matching problem. The method is based on a suffix array combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the average retrieval time is $O(n^{\lambda} \log n)$, for some $\lambda > 0$ that depends on the error fraction tolerated $\alpha$ and the alphabet size $\sigma$. It is shown that $\lambda < 1$ for approximately $\alpha < 1 - e/\sqrt{\sigma}$, where $e = 2.718\ldots$. The space required is four times the text size, which is quite moderate for this problem. We experimentally show that this index can outperform by far all the existing alternatives for indexed approximate searching. These are also the first experiments that compare the different existing schemes.
Block Addressing Indices for Approximate Text Retrieval
Journal of the American Society for Information Science (JASIS), 1997
Cited by 49 (26 self)
Although the issue of approximate text retrieval is gaining importance in the last years, it is currently addressed by only a few indexing schemes. To reduce space requirements, the indices may point to text blocks instead of exact word positions. This is called "block addressing". The most notorious index of this kind is Glimpse. However, block addressing has not been well studied yet, especially regarding approximate searching. Our main contribution is an analytical study of the space-time tradeoffs related to the block size. We find that, under reasonable assumptions, it is possible to build an index which is simultaneously sublinear in space overhead and in query time. We validate the analysis with extensive experiments, obtaining typical performance figures. These results are valid not only for approximate searching queries but also for classical ones. Finally, we propose a new strategy for approximate searching on block addressing indices, which we experimentally find 4-5 times f...
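Block addressing, as described in this abstract, can be illustrated with a small sketch. The version below is an assumption-laden simplification: blocks are measured in words rather than bytes, queries are exact words only, and the names `build_block_index` and `search_word` are hypothetical. It is not Glimpse and not the authors' strategy.

```python
def build_block_index(words, words_per_block):
    """Map each distinct word to the set of block numbers that contain it."""
    index = {}
    for i, word in enumerate(words):
        index.setdefault(word, set()).add(i // words_per_block)
    return index


def search_word(words, index, words_per_block, query):
    """Return exact word positions of query, scanning only the candidate blocks."""
    positions = []
    for block in sorted(index.get(query, ())):
        start = block * words_per_block
        end = min(start + words_per_block, len(words))
        # The index only narrows the search to a block; the block itself
        # is scanned sequentially to recover exact positions.
        for i in range(start, end):
            if words[i] == query:
                positions.append(i)
    return positions
```

Larger blocks shrink the index, since each word stores fewer block numbers, but lengthen the sequential scan at query time. That is the space-time tradeoff the abstract analyzes.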
Boosting Precision and Recall of Dictionary-Based Protein Name Recognition
Proc. of the ACL03 Workshop on Natural Language Processing in Biomedicine, 2003
Cited by 33 (1 self)
Dictionary-based protein name recognition is the first step for practical information extraction from biomedical documents because it provides ID information of recognized terms, unlike machine-learning-based approaches. However, dictionary-based approaches have two serious problems: (1) a large number of false recognitions, mainly caused by short names, and (2) low recall due to spelling variation.
New and Faster Filters for Multiple Approximate String Matching
RANDOM STRUCTURES AND ALGORITHMS (RSA), 1998
Cited by 18 (8 self)
We present three new algorithms for online multiple string matching allowing errors. These are extensions of previous algorithms that search for a single pattern. The average running time achieved is in all cases linear in the text size for moderate error level, pattern length and number of patterns. They adapt (with higher costs) to the other cases. However, the algorithms differ in speed and thresholds of usefulness. We analyze theoretically when each algorithm should be used, and show experimentally their performance. The only previous solution for this problem allows only one error. Our algorithms are the first to allow more errors, and are faster than previous work for a moderate number of patterns (e.g. less than 50-100 on English text, depending on the pattern length).
A New Indexing Method for Approximate String Matching
Combinatorial Pattern Matching, 10th Annual Symposium, 1999
Cited by 16 (5 self)
We present a new indexing method for the approximate string matching problem. The method is based on a suffix tree combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the retrieval time is $O(n^{\lambda})$, for $0 < \lambda < 1$, whenever $\alpha < 1 - e/\sqrt{\sigma}$, where $\alpha$ is the error level tolerated and $\sigma$ is the alphabet size. We experimentally show that this index outperforms by far all other algorithms for indexed approximate searching, also being the first experiments that compare the different existing schemes. We finally show how this index can be implemented using much less space.
1 Introduction
Approximate string matching is a recurrent problem in many branches of computer science, with applications to text searching, computational biology, pattern recognition, signal processing, etc. The problem is: given a long text of length n, and a (comparatively short) pattern of length m, retrieve all the text segments (or "occurrences") whose edit distance t...
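The problem statement above is cut off. Assuming it ends in the usual way (occurrences are text segments whose edit distance to the pattern is at most some bound k), the problem can be illustrated with the classical column-wise dynamic programming scan. This is a baseline sequential method, not the indexed method of the paper, and the name `approx_search` and parameter `k` are assumptions.

```python
def approx_search(pattern, text, k):
    """Return end offsets (exclusive) of text segments within edit distance k of pattern.

    col[i] holds the smallest edit distance between pattern[:i] and any
    segment of the text ending at the current position.
    """
    m = len(pattern)
    col = list(range(m + 1))  # before reading any text: i deletions
    ends = []
    for j, c in enumerate(text, 1):
        diag = col[0]         # value for (i - 1, j - 1)
        col[0] = 0            # a match may start at any text position
        for i in range(1, m + 1):
            above = col[i]    # value for (i, j - 1), saved before overwrite
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(diag + cost,    # match or substitution
                         above + 1,      # extra text character
                         col[i - 1] + 1) # missing pattern character
            diag = above
        if col[m] <= k:
            ends.append(j)
    return ends
```

The scan takes O(mn) time and O(m) space, which is the kind of sequential cost that an index tries to avoid.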
One-gapped q-gram filters for Levenshtein Distance
In: Proceedings of the 13th Symposium on Combinatorial Pattern Matching (CPM’02). Volume 2373, 2002
Cited by 15 (1 self)
We have recently shown that q-gram filters based on gapped q-grams instead of the usual contiguous q-grams can provide orders of magnitude faster and/or more efficient filtering for the Hamming distance. In this paper, we extend the results for the Levenshtein distance, which is more problematic for gapped q-grams because an insertion or deletion in a gap affects a q-gram while a replacement does not. To keep this effect under control, we concentrate on gapped q-grams with just one gap. We demonstrate with experiments that the resulting filters provide a significant improvement over the contiguous q-gram filters. We also develop new techniques for dealing with complex q-gram filters.
Online approximate string searching algorithms: survey and experimental results
International Journal of Computer Mathematics, 2002
Cited by 10 (5 self)
The problem of approximate string searching comprises two classes of problems: string searching with k mismatches and string searching with k differences. In this paper we present a short survey and experimental results for well known sequential approximate string searching algorithms. We consider algorithms based on different approaches including dynamic programming, deterministic finite automata, filtering, counting and bit parallelism. We compare these algorithms in terms of running time against pattern length and for several values of k for four different kinds of text: binary alphabet, alphabet of size 8, English alphabet and DNA alphabet. Finally, we compare the experimental results of the algorithms with their theoretical complexities.
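For the first problem class named in this abstract, string searching with k mismatches, a naive baseline can be sketched as follows. It is a plain O(mn) scan under the Hamming distance, given only for illustration and not one of the surveyed algorithms; the name `k_mismatch_search` is hypothetical.

```python
def k_mismatch_search(pattern, text, k):
    """Return start offsets where pattern matches text with at most k mismatches."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(pattern, text[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # too many mismatches at this alignment
        else:
            # The loop finished without exceeding k mismatches.
            hits.append(start)
    return hits
```

String searching with k differences additionally allows insertions and deletions, so alignments are no longer position by position and a dynamic programming or automaton-based method is needed instead.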