Results 1-10 of 13
Compressed suffix arrays and suffix trees with applications to text indexing and string matching
, 2005
Cited by 189 (17 self)
The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with lg |Σ| bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg_|Σ| n), which is significant when Σ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in O(m lg |Σ|) time or in O(m + lg n) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg_|Σ| n + lg^ε_|Σ| n) search time in the worst case, for any constant ...
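For orientation, the following is a minimal sketch of pattern search over a plain (uncompressed) suffix array, the kind of index the abstract above takes as its starting point. It is not the compressed index the paper presents. The function names are hypothetical, the construction is deliberately naive, and the search uses simple binary search costing O(m lg n + occ) rather than the O(m + lg n) bound mentioned in the abstract.

```python
def build_suffix_array(text):
    # Naive construction by sorting all suffixes; acceptable for a small
    # illustration, far too slow and memory-hungry for real indexing.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    # Return the sorted start positions of `pattern` in `text`.
    m = len(pattern)

    # Lower bound: first suffix whose m-symbol prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-symbol prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with the pattern.
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
matches = find_occurrences(text, sa, "ana")
```

The suffix array here stores one integer per text position, which is the Ω(n lg n)-bit overhead the abstract sets out to reduce.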
Indexing and Dictionary Matching with One Error (Extended Abstract)
, 1999
Cited by 24 (2 self)
The indexing problem is the one where a text is preprocessed and subsequent queries of the form: "Find all occurrences of pattern P in the text" are answered in time proportional to the length of the query and the number of occurrences. In the dictionary matching problem a set of patterns is preprocessed and subsequent queries of the form: "Find all occurrences of dictionary patterns in text T" are answered in time proportional to the length of the text and the number of occurrences. There exist efficient worst-case solutions for the indexing problem and the dictionary matching problem, but none that find approximate occurrences of the patterns, i.e. where the pattern is within a bound edit (or Hamming...
Multi-Method Dispatching: A Geometric Approach with Applications to String Matching Problems
, 1999
Cited by 14 (3 self)
Current object-oriented programming languages (OOPLs) rely on mono-method dispatching. Recent research has identified multi-methods as a new, powerful feature to be added to OOPLs, and several experimental OOPLs now have multi-methods. Their ultimate success and impact in practice depends, among other things, on whether multi-method dispatching can be supported efficiently. We show that the multi-method dispatching problem can be transformed to a geometric problem on multidimensional integer grids, for which we then develop a data structure that uses near-linear space and has O(log log n) query time. This gives a solution whose performance almost matches that of the best known algorithm for standard mono-method dispatching. Our geometric data structure has other applications as well, namely in two string matching problems: matching multiple rectangular patterns against a rectangular query text, and approximate dictionary matching with edit distance at most one. Our results f...
Improved Bounds for Dictionary Lookup with One Error
 Information Processing Letters
, 2000
Cited by 7 (0 self)
Given a dictionary S of n binary strings each of length m, we consider the problem of designing a data structure for S that supports d-queries; given a binary query string q of length m, a d-query reports if there exists a string in S within Hamming distance d of q. We construct a data structure for the case d = 1, that requires space O(n log m) and has query time O(1) in a cell probe model with word size m. This generalizes and improves the previous bounds of Yao and Yao for the problem in the bit probe model. The data structure can be constructed in randomized expected time O(nm).
Key words: Data Structures, Dictionaries, Hashing, Hamming Distance
1 Introduction. Minsky and Papert in 1969 posed the following problem, that has remained a challenge in data structure design [9]. Let S be a set of n binary strings of length m each. We want to construct a data structure for S that supports fast d-queries; that is, given a binary ...
Footnote 1: BRICS (Basic Research in Computer Science), a Center o...
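To make the d = 1 query concrete, here is a minimal sketch that answers it by brute-force neighbor enumeration over a hash set. This is a baseline for illustration only, not the O(1)-query cell-probe structure the abstract describes: it performs m + 1 hash probes per query. The function names and the representation of binary strings as Python `str` values of "0"/"1" are assumptions.

```python
def build_dictionary(strings):
    # Assumption: every string has the same length m and uses only "0"/"1".
    return frozenset(strings)


def one_error_query(dictionary, q):
    # True if some dictionary string is within Hamming distance 1 of q.
    if q in dictionary:
        return True  # distance 0
    for i, bit in enumerate(q):
        # Flip bit i and probe the hash set for the resulting neighbor.
        flipped = q[:i] + ("1" if bit == "0" else "0") + q[i + 1:]
        if flipped in dictionary:
            return True
    return False


S = build_dictionary(["0000", "1111"])
near = one_error_query(S, "0001")   # one flip away from "0000"
far = one_error_query(S, "0011")    # two flips from either string
```

Each probe also costs O(m) to build and hash the neighbor string, which is the overhead a dedicated data structure aims to avoid.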
Dictionary Look-Up Within Small Edit Distance
 In Proc. 8th Annual Intl. Computing and Combinatorics Conference (COCOON'02)
, 2002
Cited by 4 (1 self)
Let W be a dictionary consisting of n binary strings of length m each, represented as a trie. The usual d-query asks if there exists a string in W within Hamming distance d of a given binary query string q. We present an algorithm to determine if there is a member in W within edit distance d of a given query string q of length m. The method takes time O(dm ) in the RAM model, independent of n, and requires O(dm) additional space.
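As a rough illustration of edit-distance look-up over a trie, the sketch below walks the trie while maintaining one row of the standard Levenshtein dynamic-programming table per node, pruning branches whose row minimum already exceeds d. This is the textbook trie-plus-DP technique, not the algorithm of the paper above, and its running time does depend on the trie size. The nested-dict trie layout, the "$" end marker, and the function names are assumptions.

```python
def build_trie(words):
    # Trie as nested dicts; "$" marks the end of a word.
    # Assumption: "$" never occurs in the alphabet.
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def within_edit_distance(trie, q, d):
    # True if some word in the trie is within edit distance d of q.
    def walk(node, prev_row):
        # prev_row[j] = edit distance between the path to `node` and q[:j].
        if "$" in node and prev_row[-1] <= d:
            return True
        for ch, child in node.items():
            if ch == "$":
                continue
            row = [prev_row[0] + 1]
            for j in range(1, len(q) + 1):
                cost = 0 if q[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1,          # insertion
                               prev_row[j] + 1,         # deletion
                               prev_row[j - 1] + cost)) # match/substitution
            # Row entries never decrease along a path, so prune when
            # every entry already exceeds the bound.
            if min(row) <= d and walk(child, row):
                return True
        return False

    return walk(trie, list(range(len(q) + 1)))


W = build_trie(["0101", "1110"])
hit = within_edit_distance(W, "0111", 1)    # one substitution from "0101"
miss = within_edit_distance(W, "0000", 1)   # nearest word is 2 edits away
```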
Large Scale Hamming Distance Query Processing
Cited by 3 (0 self)
Hamming distance has been widely used in many application domains, such as near-duplicate detection and pattern recognition. We study Hamming distance range query problems, where the goal is to find all strings in a database that are within a Hamming distance bound k from a query string. If k is fixed, we have a static Hamming distance range query problem. If k is part of the input, we have a dynamic Hamming distance range query problem. For the static problem, the prior art uses lots of memory due to its aggressive replication of the database. For the dynamic range query problem, as far as we know, there is no space and time efficient solution for arbitrary databases. In this paper, we first propose a static Hamming distance range query algorithm called HEngine_s, which addresses the space issue in prior art by dynamically expanding the query on the fly. We then propose a dynamic Hamming distance range query algorithm called HEngine_d, which addresses the limitation in prior art using a divide-and-conquer strategy. We implemented our algorithms and conducted side-by-side comparisons on large real-world and synthetic datasets. In our experiments, HEngine_s uses 4.65 times less space and processes queries 16% faster than the prior art, and HEngine_d processes queries 46 times faster than linear scan while using only 1.7 times more space.
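The linear-scan baseline that the abstract compares against can be sketched in a few lines. This shows only the problem statement (return every database entry within Hamming distance k of the query), not HEngine itself. Representing fixed-width binary strings as Python integers and the function names are assumptions.

```python
def hamming(a, b):
    # Number of differing bit positions between two non-negative integers.
    return bin(a ^ b).count("1")


def range_query(database, query, k):
    # Linear scan: test every entry against the distance bound k.
    return [x for x in database if hamming(x, query) <= k]


database = [0b0000, 0b0011, 0b0111, 0b1111]
within_one = range_query(database, 0b0001, 1)
within_two = range_query(database, 0b0001, 2)
```

Because k is an argument of `range_query`, the same scan serves both the static and the dynamic variants; its cost is one XOR and one popcount per database entry.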
Efficient approximate dictionary lookup over small alphabets
, 2005
Cited by 3 (1 self)
Given a dictionary W consisting of n binary strings of length m each, a d-query asks if there exists a string in W within Hamming distance d of a given binary query string q. The problem was posed by Minsky and Papert in 1969 [10] as a challenge to data structure design. Efficient solutions have been developed only for the special case when d = 1 (the 1-query problem). We assume the standard RAM model of computation, and consider the case of the problem when alphabet size is arbitrary but finite, and d is small. We preprocess the dictionary, and construct an edge-labelled tree with bounded branching factor, and height. We present an algorithm to answer dictionary lookup within given distance d of a given query string q. The algorithm is efficient when the alphabet size is small, or the dictionary is sparse. In particular, for the d-query problem the algorithm takes time O(m (log_{4/3} n − 1)^d (log_2 n)^{d+1}). This is an improvement over previously known algorithms for the d-query problem when d > 1. We also generalize the results for the case of the problem when edit distances are used. The algorithm can be modified such that it allows for words of different lengths as well as different lengths of query strings.
HmSearch: An Efficient Hamming Distance Query Processing Algorithm
Hamming distance measures the number of dimensions where two vectors have different values. In applications such as pattern recognition, information retrieval, and databases, we often need to efficiently process Hamming distance queries, which retrieve the vectors in a database that are within Hamming distance k of a given query vector. Existing work on efficient Hamming distance query processing has some of the following limitations: it is only applicable to tiny error threshold values, unable to deal with vectors where the value domain is large, or unable to attain robust performance in the presence of data skew. In this paper, we propose HmSearch, an efficient query processing method for Hamming distance queries that addresses the above-mentioned limitations. Our method is based on improved enumeration-based signatures, enhanced filtering, and hierarchical binary filtering-and-verification. We also design an effective dimension rearrangement method to deal with data skew. Extensive experimental results demonstrate that our methods outperform state-of-the-art methods by up to two orders of magnitude.
This document in subdirectory RS/99/50 / Improved Bounds for Dictionary Lookup with One Error
, 1999
Reproduction of all or part of this work is permitted for educational or research use on condition that this copyright notice is included in any copy. See back inner page for a list of recent BRICS Report Series publications. Copies may be obtained by contacting: BRICS