Results 1 - 10 of 16
Compressed suffix arrays and suffix trees with applications to text indexing and string matching
, 2005
Abstract

Cited by 239 (19 self)
The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with lg |Σ| bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg_|Σ| n), which is significant when Σ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in O(m lg |Σ|) time or in O(m + lg n) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg_|Σ| n + lg^ɛ_|Σ| n) search time in the worst case, for any constant ...
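For orientation, the uncompressed baseline that the abstract contrasts against can be sketched as a plain suffix array searched by binary search. This is only an illustration, not the compressed index of the paper: the function names `build_suffix_array` and `find_occurrences` are hypothetical, the construction is deliberately naive, and the search shown costs O(m lg n) character comparisons rather than the O(m + lg n) bound mentioned above.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction (sorts full suffix strings); fine for a sketch,
    far from the space and time bounds discussed in the abstract.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text, in increasing order."""
    m = len(pattern)

    # First binary search: leftmost suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Second binary search: leftmost suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with pattern (the "occ" occurrences).
    return sorted(sa[start:lo])
```

The array holds n integers of lg n bits each, which is the Ω(n lg n)-bit overhead the abstract sets out to reduce.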
Dictionary Matching and Indexing with Errors and Don't Cares
 In Proceedings of STOC
, 2004
Indexing and Dictionary Matching with One Error (Extended Abstract)
, 1999
Abstract

Cited by 29 (5 self)
The indexing problem is the one where a text is preprocessed and subsequent queries of the form: "Find all occurrences of pattern P in the text" are answered in time proportional to the length of the query and the number of occurrences. In the dictionary matching problem a set of patterns is preprocessed and subsequent queries of the form: "Find all occurrences of dictionary patterns in text T" are answered in time proportional to the length of the text and the number of occurrences. There exist efficient worst-case solutions for the indexing problem and the dictionary matching problem, but none that find approximate occurrences of the patterns, i.e., where the pattern is within a bound edit (or Hamming...
Multi-Method Dispatching: A Geometric Approach with Applications to String Matching Problems
, 1999
Abstract

Cited by 20 (3 self)
Current object-oriented programming languages (OOPLs) rely on mono-method dispatching. Recent research has identified multi-methods as a new, powerful feature to be added to OOPLs, and several experimental OOPLs now have multi-methods. Their ultimate success and impact in practice depends, among other things, on whether multi-method dispatching can be supported efficiently. We show that the multi-method dispatching problem can be transformed to a geometric problem on multidimensional integer grids, for which we then develop a data structure that uses near-linear space and has O(log log n) query time. This gives a solution whose performance almost matches that of the best known algorithm for standard mono-method dispatching. Our geometric data structure has other applications as well, namely in two string matching problems: matching multiple rectangular patterns against a rectangular query text, and approximate dictionary matching with edit distance at most one. Our results f...
Improved Bounds for Dictionary Lookup with One Error
 Information Processing Letters
, 2000
Abstract

Cited by 7 (0 self)
Given a dictionary S of n binary strings each of length m, we consider the problem of designing a data structure for S that supports d-queries; given a binary query string q of length m, a d-query reports if there exists a string in S within Hamming distance d of q. We construct a data structure for the case d = 1, that requires space O(n log m) and has query time O(1) in a cell probe model with word size m. This generalizes and improves the previous bounds of Yao and Yao for the problem in the bit probe model. The data structure can be constructed in randomized expected time O(nm). Key words: Data Structures, Dictionaries, Hashing, Hamming Distance. 1 Introduction. Minsky and Papert in 1969 posed the following problem, that has remained a challenge in data structure design [9]. Let S be a set of n binary strings of length m each. We want to construct a data structure for S that supports fast d-queries; that is, given a binary ...
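To make the d-query definition concrete for d = 1, here is a simple hash-set baseline. It is not the O(1)-probe data structure of the paper: it answers a query with up to m + 1 set lookups by trying the query itself and every single-bit flip. The names `build_dictionary` and `one_query` are illustrative assumptions.

```python
def build_dictionary(strings):
    """Store each binary string (e.g. "0101") as an integer in a hash set."""
    return {int(s, 2) for s in strings}


def one_query(dictionary, q, m):
    """Report whether some stored string is within Hamming distance 1 of q.

    q is a binary string of length m. The query itself is tried first
    (distance 0), then each of the m strings obtained by flipping one bit.
    """
    x = int(q, 2)
    if x in dictionary:
        return True
    return any((x ^ (1 << i)) in dictionary for i in range(m))
```

For example, with the dictionary {"0000", "1111"} and m = 4, the query "0001" succeeds (one flip from "0000") while "0011" fails (two flips from either string).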
Dictionary Look-Up Within Small Edit Distance
 In Proc. 8th Annual Intl. Computing and Combinatorics Conference (COCOON’02)
, 2002
Abstract

Cited by 5 (1 self)
Let W be a dictionary consisting of n binary strings of length m each, represented as a trie. The usual d-query asks if there exists a string in W within Hamming distance d of a given binary query string q. We present an algorithm to determine if there is a member in W within edit distance d of a given query string q of length m. The method takes time O(dm ) in the RAM model, independent of n, and requires O(dm) additional space.
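As a point of reference for edit-distance look-up on a trie, the sketch below uses the textbook technique of walking the trie while carrying one row of the edit-distance dynamic-programming table per node, pruning a branch once every entry of its row exceeds d. This is not the algorithm of the paper, and its running time depends on the size of the trie. The names `build_trie` and `within_edit_distance`, and the "$" end-of-word marker, are assumptions made for this example.

```python
def build_trie(words):
    """Build a nested-dict trie; the key "$" marks the end of a stored word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def within_edit_distance(trie, q, d):
    """Return True if some word in the trie is within edit distance d of q."""
    # Row for the empty prefix: turning "" into q[:j] costs j insertions.
    first_row = list(range(len(q) + 1))

    def visit(node, row):
        # row[j] = edit distance between the current trie prefix and q[:j].
        if node.get("$") and row[-1] <= d:
            return True
        for ch, child in node.items():
            if ch == "$":
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(q) + 1):
                cost = 0 if q[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # insertion
                                   row[j] + 1,           # deletion
                                   row[j - 1] + cost))   # match or substitution
            # Prune: if every entry exceeds d, no extension can recover.
            if min(new_row) <= d and visit(child, new_row):
                return True
        return False

    return visit(trie, first_row)
```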
Orthogonal range searching for text indexing
 In Space-Efficient Data Structures, Streams, and Algorithms
, 2013
Abstract

Cited by 5 (2 self)
Abstract. Text indexing, the problem in which one desires to preprocess a (usually large) text for future (shorter) queries, has been researched ever since the suffix tree was invented in the early 70’s. With textual data continuing to increase and with changes in the way it is accessed, new data structures and new algorithmic methods are continuously required. Therefore, text indexing is of utmost importance and is a very active research domain. Orthogonal range searching, classically associated with the computational geometry community, is one of the tools that has increasingly become important for various text indexing applications. Initially, in the mid 90’s there were a couple of results recognizing this connection. In the last few years we have seen an increase in use of this method and are reaching a deeper understanding of the range searching uses for text indexing. In this monograph we survey some of these results.
Large Scale Hamming Distance Query Processing
Abstract

Cited by 5 (0 self)
Abstract: Hamming distance has been widely used in many application domains, such as near-duplicate detection and pattern recognition. We study Hamming distance range query problems, where the goal is to find all strings in a database that are within a Hamming distance bound k from a query string. If k is fixed, we have a static Hamming distance range query problem. If k is part of the input, we have a dynamic Hamming distance range query problem. For the static problem, the prior art uses lots of memory due to its aggressive replication of the database. For the dynamic range query problem, as far as we know, there is no space and time efficient solution for arbitrary databases. In this paper, we first propose a static Hamming distance range query algorithm called HEngine_s, which addresses the space issue in prior art by dynamically expanding the query on the fly. We then propose a dynamic Hamming distance range query algorithm called HEngine_d, which addresses the limitation in prior art using a divide-and-conquer strategy. We implemented our algorithms and conducted side-by-side comparisons on large real-world and synthetic datasets. In our experiments, HEngine_s uses 4.65 times less space and processes queries 16% faster than the prior art, and HEngine_d processes queries 46 times faster than linear scan while using only 1.7 times more space.
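The linear-scan baseline mentioned in the abstract can be sketched in a few lines, assuming each database string is stored as an integer bit vector. This is only the baseline, not HEngine_s or HEngine_d; the function name `hamming_range_query` is hypothetical.

```python
def hamming_range_query(database, query, k):
    """Return every item whose Hamming distance to query is at most k.

    Items and the query are integers read as bit vectors; the distance
    is the number of 1 bits in their XOR. Cost is one pass over the data.
    """
    return [x for x in database if bin(x ^ query).count("1") <= k]
```

Because k is an ordinary argument, the same scan serves both the static (fixed k) and dynamic (k given per query) variants, at the price of touching every item on every query.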
Efficient approximate dictionary lookup over small alphabets
, 2005
Abstract

Cited by 3 (1 self)
Given a dictionary W consisting of n binary strings of length m each, a d-query asks if there exists a string in W within Hamming distance d of a given binary query string q. The problem was posed by Minsky and Papert in 1969 [10] as a challenge to data structure design. Efficient solutions have been developed only for the special case when d = 1 (the 1-query problem). We assume the standard RAM model of computation, and consider the case of the problem when alphabet size is arbitrary but finite, and d is small. We preprocess the dictionary, and construct an edge-labelled tree with bounded branching factor, and height. We present an algorithm to answer dictionary lookup within given distance d of a given query string q. The algorithm is efficient when the alphabet size is small, or the dictionary is sparse. In particular, for the d-query problem the algorithm takes time O(m (log_{4/3} n − 1)^d (log_2 n)^{d+1}). This is an improvement over previously known algorithms for the d-query problem when d > 1. We also generalize the results for the case of the problem when edit distances are used. The algorithm can be modified such that it allows for words of different lengths as well as different lengths of query strings.
Improved approximate common interval
 www.elsevier.com/locate/ipl
, 2007
Abstract
During the course of evolution, speciation results in the divergence of genomes that initially have the same gene order and content. If there is no selective pressure, successive rearrangements that are common in prokaryotic