Geometric Burrows-Wheeler transform: Linking range searching and text indexing
 In DCC
Cited by 18 (3 self)
We introduce a new variant of the popular Burrows-Wheeler transform (BWT) called the Geometric Burrows-Wheeler Transform (GBWT). Unlike BWT, which merely permutes the text, GBWT converts the text into a set of points in 2-dimensional geometry. Using this transform, we can answer many open questions in compressed text indexing: (1) Can compressed data structures be designed in external memory with performance similar to that of their uncompressed counterparts? (2) Can compressed data structures be designed for position-restricted pattern matching [16]? We also introduce a reverse transform, called Points2Text, which converts a set of points into text. This transform allows us to derive the first known lower bounds in compressed text indexing. We show strong equivalence between data structural problems in geometric range searching and text pattern matching. This provides a way to derive new results in compressed text indexing by translating the results from range searching.
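To make the remark that BWT "merely permutes the text" concrete, here is a minimal sketch of the classical transform built from sorted rotations. It does not show the GBWT point construction, which the abstract does not specify. The function name `bwt` and the `$` sentinel are assumptions chosen for illustration.

```python
def bwt(text, sentinel="$"):
    """Classical Burrows-Wheeler transform via sorted rotations (illustrative).

    Assumes `sentinel` does not occur in `text` and sorts before every
    other character, as '$' does for ASCII letters.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # All cyclic rotations of s, in lexicographic order.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)
```

For example, `bwt("banana")` returns `"annb$aa"`, which contains exactly the characters of `"banana$"` in a different order. This quadratic-space construction is for exposition only; practical indexes build the transform from a suffix array.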
Improved algorithms for the range next value problem and applications
 In: Proc. STACS
, 2008
Cited by 12 (0 self)
The Range Next Value problem (Problem RNV) is a recent and interesting variant of range search problems, where the query asks for the immediate next (or equal) value of a given number within a given interval of an array. Problem RNV was introduced and studied very recently by Crochemore et al. [Finding Patterns In Given Intervals, MFCS 2007]. In this paper, we present improved algorithms for Problem RNV. We also show how this problem can be used to achieve optimal query time for a number of interesting variants of the classic pattern matching problems.
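The query defined in this abstract can be pinned down with a brute-force baseline. The sketch below is not the paper's improved algorithm; it only states the semantics of an RNV query in code. The name `range_next_value` and the convention of inclusive, 0-based interval bounds are assumptions.

```python
def range_next_value(a, i, j, k):
    """Return the smallest value v >= k among a[i..j] (inclusive), else None.

    Brute-force baseline that scans the interval in O(j - i + 1) time per
    query; shown only to fix what the query means.
    """
    candidates = [v for v in a[i:j + 1] if v >= k]
    return min(candidates) if candidates else None
```

With `a = [5, 2, 9, 4, 7]`, the query `range_next_value(a, 1, 3, 3)` inspects the values 2, 9, 4 and returns 4, while `range_next_value(a, 1, 3, 10)` returns `None` because no value in the interval is at least 10.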
Wavelet Trees for All
Cited by 10 (5 self)
The wavelet tree is a versatile data structure that serves a number of purposes, from string processing to geometry. It can be regarded as a device that represents a sequence, a reordering, or a grid of points. In addition, its space adapts to various entropy measures of the data it encodes, enabling compressed representations. New competitive solutions to a number of problems, based on wavelet trees, are appearing every year. In this survey we give an overview of wavelet trees and the surprising number of applications in which we have found them useful: basic and weighted point grids, sets of rectangles, strings, permutations, binary relations, graphs, inverted indexes, document retrieval indexes, full-text indexes, XML indexes, and general numeric sequences.
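As a rough illustration of the "represents a sequence" view, the following sketch builds a pointer-based wavelet tree over small integers and answers rank queries. It stores plain prefix-count lists in place of compressed bitmaps, so it shows only the query logic and none of the space bounds discussed in the survey. The class name `WaveletTree` and its interface are assumptions.

```python
class WaveletTree:
    """Pointer-based wavelet tree over a sequence of integers (sketch)."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            if not seq:
                raise ValueError("sequence must be non-empty")
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        # prefix[i] = how many of the first i symbols are routed to the left child.
        self.prefix = [0]
        if lo == hi:
            return  # leaf: every symbol stored here equals lo
        mid = (lo + hi) // 2
        for s in seq:
            self.prefix.append(self.prefix[-1] + (s <= mid))
        self.left = WaveletTree([s for s in seq if s <= mid], lo, mid)
        self.right = WaveletTree([s for s in seq if s > mid], mid + 1, hi)

    def rank(self, c, i):
        """Number of occurrences of symbol c among the first i symbols."""
        if c < self.lo or c > self.hi:
            return 0
        if self.lo == self.hi:
            return i
        mid = (self.lo + self.hi) // 2
        went_left = self.prefix[i]
        if c <= mid:
            return self.left.rank(c, went_left)
        return self.right.rank(c, i - went_left)
```

For the sequence `[3, 1, 4, 1, 5, 9, 2, 6]`, `WaveletTree(seq).rank(1, 4)` returns 2, matching a direct count of 1s in the first four symbols. Each query descends one root-to-leaf path, so it touches a number of nodes logarithmic in the alphabet range.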
Compression, indexing, and retrieval for massive string data
 COMBINATORIAL PATTERN MATCHING. LNCS
, 2010
Cited by 7 (1 self)
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
Finding patterns in given intervals
 of Lecture Notes in Computer Science
, 2007
Cited by 5 (2 self)
In this paper, we study the pattern matching problem in given intervals. Depending on whether the intervals are given a priori for preprocessing, during the query along with the pattern, or in both cases, we develop solutions for different variants of this problem. In particular, we present efficient indexing schemes for each of these variants.
Faster Compact Top-k Document Retrieval
Cited by 4 (3 self)
An optimal index solving top-k document retrieval [Navarro and Nekrich, SODA’12] takes O(m + k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n–3n bytes, with O(m + (k + log log n) log log n) time, on typical texts. The index is up to 25 times faster than the best previous compressed solutions, and requires at most 5% more space in practice (and in some cases as little as one half). Apart from replacing classical by compressed data structures, our main idea is to replace suffix tree sampling by frequency thresholding to achieve compression.
On Entropy-Compressed Text Indexing in External Memory
Cited by 2 (1 self)
A new trend in the field of pattern matching is to design indexing data structures which take space very close to that required by the indexed text (in entropy-compressed form) and also simultaneously achieve good query performance. Two popular indexes, namely the FM-index [Ferragina and Manzini, 2005] and the CSA [Grossi and Vitter, 2005], achieve this goal by exploiting the Burrows-Wheeler transform (BWT) [Burrows and Wheeler, 1994]. However, due to the intricate permutation structure of BWT, no locality of reference can be guaranteed when we perform pattern matching with these indexes. Chien et al. [2008] gave an alternative text index which is based on sparsifying the traditional suffix tree and maintaining an auxiliary 2D range query structure. Given a text T of length n drawn from a σ-sized alphabet set, they achieved an O(n log σ)-bit index for T and showed that this index can preserve locality in pattern matching and hence is amenable to use in external-memory settings. We improve upon this index and show how to apply entropy compression to reduce index space. Our index takes O(n(H_k + 1)) + o(n log σ) bits of space, where H_k is the kth-order empirical entropy of the text. This is achieved by creating variable-length blocks of text using arithmetic coding.
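The space bound above is stated in terms of the kth-order empirical entropy H_k. As a hedged illustration, the sketch below computes it under the common definition in which each length-k context is paired with the symbols that follow it in the text; the abstract itself does not spell out the definition, so this convention and the function names `h0` and `hk` are assumptions.

```python
import math
from collections import Counter, defaultdict


def h0(symbols):
    """Zeroth-order empirical entropy, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(symbols).values())


def hk(text, k):
    """kth-order empirical entropy of `text`, in bits per symbol.

    Assumed convention: group each symbol by the k symbols preceding it,
    then average the zeroth-order entropies of the groups, weighted by size.
    """
    if not text:
        return 0.0
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(len(text) - k):
        followers[text[i:i + k]].append(text[i + k])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)
```

For instance, `h0("aabb")` is 1.0 bit per symbol, while `hk("abab", 1)` is 0.0 because each symbol is fully determined by its predecessor. The gap between the two is what entropy-compressed indexes aim to exploit on repetitive or predictable texts.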
The Wavelet Matrix
Cited by 2 (1 self)
The wavelet tree (Grossi et al., SODA 2003) is nowadays a popular succinct data structure for text indexes, discrete grids, and many other applications. When it has many nodes, a levelwise representation proposed by Mäkinen and Navarro (LATIN 2006) is preferable. We propose a different arrangement of the levelwise data, so that the bitmaps are shuffled in a different way. The result can no longer be called a wavelet tree, and we dub it the wavelet matrix. We demonstrate that the wavelet matrix is simpler to build, simpler to query, and faster in practice than the levelwise wavelet tree. This has a direct impact on many applications that use the levelwise wavelet tree for different purposes.
Compressed Text Indexing and Range Searching
, 2006
Cited by 1 (1 self)
We introduce two transformations, Text2Points and Points2Text, that, respectively, convert text to points in space and vice versa. With these transformations, data structural problems in pattern matching and geometric range searching can be linked. We show strong connections between space versus query time tradeoffs in these fields. Thus, the results in range searching can be applied to compressed indexing and vice versa. In particular, we show that for a given equivalent space, pattern matching queries can be done using 2D range searching and vice versa with query times within a factor of O(log n) of each other. This two-way connection enables us not only to design new data structures for compressed text indexing, but also to derive new lower bounds. For compressed text indexing, we propose alternative data structures based on our Text2Points transform and 4-sided orthogonal query structures in 2D. Currently, all proposed compressed text indexes are based on the Burrows-Wheeler transform (BWT) or its inverse [16,17,20,22,42]. We observe that our Text2Points transform is related to BWT on blocked text, and hence we also call it geometric BWT. With this variant, we solve some well-known open problems in this area of compressed text indexing. In particular, we present the first external memory results for compressed text indexing. We give the first compressed data structures for position-restricted pattern matching [27,34]. We also show lower bounds for these problems and for the problem of text indexing in general. These are the first known lower bounds (hardness results) in this area.