Results 1  10
of
65
Compressed suffix arrays and suffix trees with applications to text indexing and string matching
, 2005
"... The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for spaceefficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. ..."
Abstract

Cited by 192 (17 self)
 Add to MetaCart
The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for spaceefficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg Σ  bits by encoding each symbol with lg Σ  bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg Σ  n), which is significant when Σ is of constant size, such as in ascii or unicode. On the other hand, these indexes support fast searching, either in O(m lg Σ) timeorinO(m +lgn) time, plus an outputsensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg Σ  n +lgɛ Σ  n) search time in the worst case, for any constant
Compressed fulltext indexes
 ACM COMPUTING SURVEYS
, 2007
"... Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text l ..."
Abstract

Cited by 180 (81 self)
 Add to MetaCart
(Show Context)
Fulltext indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into selfindexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying selfindexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant selfindexes up to date, focusing on the essential aspects on how they exploit the text compressibility and how they solve efficiently various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
I.: Type less, find more: fast autocompletion search with a succinct index
 In: SIGIR
, 2006
"... We consider the following fulltext search autocompletion feature. Imagine a user of a search engine typing a query. Then with every letter being typed, we would like an instant display of completions of the last query word which would lead to good hits. At the same time, the best hits for any of th ..."
Abstract

Cited by 47 (5 self)
 Add to MetaCart
(Show Context)
We consider the following fulltext search autocompletion feature. Imagine a user of a search engine typing a query. Then with every letter being typed, we would like an instant display of completions of the last query word which would lead to good hits. At the same time, the best hits for any of these completions should be displayed. Known indexing data structures that apply to this problem either incur large processing times for a substantial class of queries, or they use a lot of space. We present a new indexing data structure that uses no more space than a stateoftheart compressed inverted index, but with 10 times faster query processing times. Even on the large TREC Terabyte collection, which comprises over 25 million documents, we achieve, on a single machine and with the index on disk, average response times of one tenth of a second. We have built a fullfledged, interactive search engine that realizes the proposed autocompletion feature combined with support for proximity search, semistructured (XML) text, subword and phrase completion, and semantic tags.
Rank and select revisited and extended
 Workshop on SpaceConscious Algorithms, University of
, 2006
"... The deep connection between the BurrowsWheeler transform (BWT) and the socalled rank and select data structures for symbol sequences is the basis of most successful approaches to compressed text indexing. Rank of a symbol at a given position equals the number of times the symbol appears in the corr ..."
Abstract

Cited by 32 (17 self)
 Add to MetaCart
The deep connection between the BurrowsWheeler transform (BWT) and the socalled rank and select data structures for symbol sequences is the basis of most successful approaches to compressed text indexing. Rank of a symbol at a given position equals the number of times the symbol appears in the corresponding prefix of the sequence. Select is the inverse, retrieving the positions of the symbol occurrences. It has been shown that improvements to rank/select algorithms, in combination with the BWT, turn into improved compressed text indexes. This paper is devoted to alternative implementations and extensions of rank and select data structures. First, we show that one can use gap encoding techniques to obtain constant time rank and select queries in essentially the same space as what is achieved by the best current direct solution (and sometimes less). Second, we extend symbol rank and select to substring rank and select, giving several space/time tradeoffs for the problem. An application of these queries is in positionrestricted substring searching, where one can specify the range in the text where the search is restricted to, and only occurrences residing in that range are to be reported. In addition, arbitrary occurrences are reported in text position order. Several byproducts of our results display connections with searchable partial sums, Chazelle’s twodimensional data structures, and Grossi et al.’s wavelet trees.
Positionrestricted substring searching
 OF LECTURE NOTES IN COMPUTER SCIENCE
, 2006
"... A fulltext index is a data structure built over a text string T[1, n]. The most basic functionality provided is (a) counting how many times a pattern string P[1, m] appears in T and (b) locating all those occ positions. There exist several indexes that solve (a) in O(m) time and (b) in O(occ) tim ..."
Abstract

Cited by 19 (3 self)
 Add to MetaCart
(Show Context)
A fulltext index is a data structure built over a text string T[1, n]. The most basic functionality provided is (a) counting how many times a pattern string P[1, m] appears in T and (b) locating all those occ positions. There exist several indexes that solve (a) in O(m) time and (b) in O(occ) time. In this paper we propose two new queries, (c) counting how many times P[1, m] appears in T[l, r] and (d) locating all those occl,r positions. These can be solved using (a) and (b) but this requires O(occ) time. We present two solutions to (c) and (d) in this paper. The first is an index that requires O(n log n) bits of space and answers (c) in O(m + log n) time and (d) in O(log n) time per occurrence (that is, O(occl,r log n) time overall). A variant of the first solution answers (c) in O(m + log log n) time and (d) in constant time per occurrence, but requires O(nlog 1+ǫ n) bits of space for any constant ǫ> 0. The second solution requires O(nm log σ) bits of space, solving (c) in O(m⌈log σ/log log n⌉) time and (d) in O(m⌈log σ/log log n⌉) time per
Geometric burrowswheeler transform: Linking range searching and text indexing
 In DCC
"... We introduce a new variant of the popular BurrowsWheeler transform (BWT) called Geometric BurrowsWheeler Transform (GBWT). Unlike BWT, which merely permutes the text, GBWT converts the text into a set of points in 2dimensional geometry. Using this transform, we can answer to many open questions i ..."
Abstract

Cited by 18 (3 self)
 Add to MetaCart
(Show Context)
We introduce a new variant of the popular BurrowsWheeler transform (BWT) called Geometric BurrowsWheeler Transform (GBWT). Unlike BWT, which merely permutes the text, GBWT converts the text into a set of points in 2dimensional geometry. Using this transform, we can answer to many open questions in compressed text indexing: (1) Can compressed data structures be designed in external memory with similar performance as the uncompressed counterparts? (2) Can compressed data structures be designed for position restricted pattern matching [16]? We also introduce a reverse transform, called Points2Text, which converts a set of points into text. This transform allows us to derive the first known lower bounds in compressed text indexing. We show strong equivalence between data structural problems in geometric range searching and text pattern matching. This provides a way to derive new results in compressed text indexing by translating the results from range searching. 1
Efficient computation of gapped substring kernels on large alphabets
 Journal of Machine Leaning Research
, 2005
"... We present a sparse dynamic programming algorithm that, given two strings s and t, a gap penalty λ, and an integer p, computes the value of the gapweighted lengthp subsequences kernel. The algorithm works in time O(pMlogt), where M = {(i, j)si = t j} is the set of matches of characters in the ..."
Abstract

Cited by 16 (1 self)
 Add to MetaCart
(Show Context)
We present a sparse dynamic programming algorithm that, given two strings s and t, a gap penalty λ, and an integer p, computes the value of the gapweighted lengthp subsequences kernel. The algorithm works in time O(pMlogt), where M = {(i, j)si = t j} is the set of matches of characters in the two sequences. The algorithm is easily adapted to handle bounded length subsequences and different gappenalty schemes, including penalizing by the total length of gaps and the number of gaps as well as incorporating characterspecific match/gap penalties. The new algorithm is empirically evaluated against a full dynamic programming approach and a triebased algorithm both on synthetic and newswire article data. Based on the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On mediumsized alphabets the triebased approach is best if the maximum number of allowed gaps is strongly restricted.
Orthogonal Range Searching on the RAM, Revisited
, 2011
"... We present a number of new results on one of the most extensively studied topics in computational geometry, orthogonal range searching. All our results are in the standard word RAM model: 1. We present two data structures for 2d orthogonal range emptiness. The first achieves O(n lg lg n) space and ..."
Abstract

Cited by 15 (4 self)
 Add to MetaCart
We present a number of new results on one of the most extensively studied topics in computational geometry, orthogonal range searching. All our results are in the standard word RAM model: 1. We present two data structures for 2d orthogonal range emptiness. The first achieves O(n lg lg n) space and O(lg lg n) query time, assuming that the n given points are in rank space. This improves the previous results by Alstrup, Brodal, and Rauhe (FOCS’00), with O(n lg ε n) space and O(lg lg n) query time, or with O(n lg lg n) space and O(lg 2 lg n) query time. Our second data structure uses O(n) space and answers queries in O(lg ε n) time. The best previous O(n)space data structure, due to Nekrich (WADS’07), answers queries in O(lg n / lg lg n) time. 2. We give a data structure for 3d orthogonal range reporting with O(n lg 1+ε n) space and O(lg lg n+ k) query time for points in rank space, for any constant ε> 0. This improves the previous results by Afshani (ESA’08), Karpinski and Nekrich (COCOON’09), and Chan (SODA’11), with O(n lg 3 n) space and O(lg lg n + k) query time, or with O(n lg 1+ε n) space and O(lg 2 lg n + k) query time. Consequently, we obtain improved upper bounds for orthogonal range reporting in all constant dimensions above 3.
Succinct Orthogonal Range Search Structures on a Grid with Applications to Text Indexing
"... We present a succinct representation of a set of n points on an n × n grid using n lg n + o(nlg n) bits 3 to support orthogonal range counting in O(lg n / lg lg n) time, and range reporting in O(k lg n/lg lg n) time, where k is the size of the output. This achieves an improvement on query time by ..."
Abstract

Cited by 15 (0 self)
 Add to MetaCart
(Show Context)
We present a succinct representation of a set of n points on an n × n grid using n lg n + o(nlg n) bits 3 to support orthogonal range counting in O(lg n / lg lg n) time, and range reporting in O(k lg n/lg lg n) time, where k is the size of the output. This achieves an improvement on query time by a factor of lg lg n upon the previous result of Mäkinen and Navarro [15], while using essentially the informationtheoretic minimum space. Our data structure not only can be used as a key component in solutions to the general orthogonal range search problem to save storage cost, but also has applications in text indexing. In particular, we apply it to improve two previous spaceefficient text indexes that support substring search [7] and positionrestricted substring search [15]. We also use it to extend previous results on succinct representations of sequences of small integers, and to design succinct data structures supporting certain types of orthogonal range query in the plane.
A Fast Algorithm for Optimal Alignment Between Similar Ordered Trees
 Fundamenta Informaticae
, 2001
"... . We present a fast algorithm for optimal alignment between two similar ordered trees with node labels. Let S and T be two such trees with jSj and jT j nodes, respectively. An optimal alignment between S and T which uses at most d blank symbols can be constructed in O(n log n (maxdeg) 4 d 2 ..."
Abstract

Cited by 13 (3 self)
 Add to MetaCart
(Show Context)
. We present a fast algorithm for optimal alignment between two similar ordered trees with node labels. Let S and T be two such trees with jSj and jT j nodes, respectively. An optimal alignment between S and T which uses at most d blank symbols can be constructed in O(n log n (maxdeg) 4 d 2 ) time, where n = maxfjSj; jT jg and maxdeg is the maximum degree of a node in S or T . In particular, if the input trees are of bounded degree, the running time is O(n log n d 2 ): 1 Introduction Let R be a rooted tree. R is called a labeled tree if each node of R is labeled by a symbol from a xed nite set . R is an ordered tree if the lefttoright order among siblings in R is given. The problem of determining the similarity between two labeled trees occurs in several dierent areas of computer science. For example, in computational biology, methods for measuring the similarity between ordered labeled trees of bounded degree can be used in the comparison of RNA secondary st...