Results 1 - 10
of
55
Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract
- in Proceedings of the 32nd Annual ACM Symposium on the Theory of Computing
, 2000
"... Abstract. The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed al ..."
Abstract
-
Cited by 172 (15 self)
- Add to MetaCart
Abstract. The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ | bits by encoding each symbol with lg |Σ | bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg |Σ | n), which is significant when Σ is of constant size, such as in ascii or unicode. On the other hand, these indexes support fast searching, either in O(m lg |Σ|) timeorinO(m +lgn) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg |Σ | n +lgɛ |Σ | n) search time in the worst case, for any constant
Succinct indexable dictionaries with applications to encoding k-ary trees and multisets
- In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA
"... We consider the indexable dictionary problem, which consists of storing a set S ⊆ {0,...,m − 1} for some integer m, while supporting the operations of rank(x), which returns the number of elements in S that are less than x if x ∈ S, and −1 otherwise; and select(i) which returns the i-th smallest ele ..."
Abstract
-
Cited by 149 (5 self)
- Add to MetaCart
We consider the indexable dictionary problem, which consists of storing a set S ⊆ {0,...,m − 1} for some integer m, while supporting the operations of rank(x), which returns the number of elements in S that are less than x if x ∈ S, and −1 otherwise; and select(i) which returns the i-th smallest element in S. We give a data structure that supports both operations in O(1) time on the RAM model and requires B(n,m)+ o(n)+O(lg lg m) bits to store a set of size n, where B(n,m) = ⌈ lg ( m) ⌉ n is the minimum number of bits required to store any n-element subset from a universe of size m. Previous dictionaries taking this space only supported (yes/no) membership queries in O(1) time. In the cell probe model we can remove the O(lg lg m) additive term in the space bound, answering a question raised by Fich and Miltersen, and Pagh. We present extensions and applications of our indexable dictionary data structure, including: • an information-theoretically optimal representation of a k-ary cardinal tree that supports standard operations in constant time, • a representation of a multiset of size n from {0,...,m − 1} in B(n,m+n) + o(n) bits that supports (appropriate generalizations of) rank and select operations in constant time, and • a representation of a sequence of n non-negative integers summing up to m in B(n,m + n) + o(n) bits that supports prefix sum queries in constant time. 1
Compressed full-text indexes
- ACM COMPUTING SURVEYS
, 2007
"... Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text l ..."
Abstract
-
Cited by 142 (70 self)
- Add to MetaCart
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes up to date, focusing on the essential aspects on how they exploit the text compressibility and how they solve efficiently various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
Succinct Representation of Balanced Parentheses, Static Trees and Planar Graphs
, 1999
"... We consider the implementation of abstract data types for the static objects: binary tree, rooted ordered tree and balanced parenthesis expression. Our representations use an amount of space within a lower order term of the information theoretic minimum and support, in constant time, a richer set ..."
Abstract
-
Cited by 116 (5 self)
- Add to MetaCart
We consider the implementation of abstract data types for the static objects: binary tree, rooted ordered tree and balanced parenthesis expression. Our representations use an amount of space within a lower order term of the information theoretic minimum and support, in constant time, a richer set of navigational operations than has previously been considered in similar work. In the case of binary trees, for instance, we can move from a node to its left or right child or to the parent in constant time while retaining knowledge of the size of the subtree at which we are positioned. The approach is applied to produce succinct representation of planar graphs in which one can test adjacency in constant time. Keywords: abstract data type, succinct representation, binary trees, balanced parenthesis, rooted ordered trees, planar graphs. AMS subject classifications: 68P05, 68Q65 1 Introduction The binary tree is among the most fundamental of data structures. While it is often the c...
The String B-Tree: A New Data Structure for String Search in External Memory and its Applications.
- Journal of the ACM
, 1998
"... We introduce a new text-indexing data structure, the String B-Tree, that can be seen as a link between some traditional external-memory and string-matching data structures. In a short phrase, it is a combination of B-trees and Patricia tries for internal-node indices that is made more effective by a ..."
Abstract
-
Cited by 110 (11 self)
- Add to MetaCart
We introduce a new text-indexing data structure, the String B-Tree, that can be seen as a link between some traditional external-memory and string-matching data structures. In a short phrase, it is a combination of B-trees and Patricia tries for internal-node indices that is made more effective by adding extra pointers to speed up search and update operations. Consequently, the String B-Tree overcomes the theoretical limitations of inverted files, B-trees, prefix B-trees, suffix arrays, compacted tries and suffix trees. String B-trees have the same worst-case performance as B-trees but they manage unbounded-length strings and perform much more powerful search operations such as the ones supported by suffix trees. String B-trees are also effective in main memory (RAM model) because they improve the online suffix tree search on a dynamic set of strings. They also can be successfully applied to database indexing and software duplication.
Space Efficient Suffix Trees
, 1998
"... We first give a representation of a suffix tree that uses n lg n + O(n) bits of space and supports searching for a pattern in the given text (from a fixed size alphabet) in O(m) time, where n is the size of the text and m is the size of the pattern. The structure is quite simple and answers a questi ..."
Abstract
-
Cited by 47 (4 self)
- Add to MetaCart
We first give a representation of a suffix tree that uses n lg n + O(n) bits of space and supports searching for a pattern in the given text (from a fixed size alphabet) in O(m) time, where n is the size of the text and m is the size of the pattern. The structure is quite simple and answers a question raised by Muthukrishnan in [17]. Previous compact representations of suffix trees had a higher lower order term in space and had some expectation assumption [3], or required more time for searching [5]. Then, surprisingly, we show that we can even do better, by developing a structure that uses a suffix array (and so ndlg ne bits) and an additional o(n) bits. String searching can be done in this structure also in O(m) time. Besides supporting string searching, we can also report the number of occurrences of the pattern in the same time using no additional space. In this case the space occupied...
Succinct Representations of lcp Information and Improvements in the Compressed Suffix Arrays
, 2002
"... We introduce two succinct data structures to solve various string problems. One is for storing the information of lcp, the longest common prefix, between suffixes in the suffix array, and the other is an improvement in the compressed suffix array which supports linear time counting queries for any p ..."
Abstract
-
Cited by 46 (5 self)
- Add to MetaCart
We introduce two succinct data structures to solve various string problems. One is for storing the information of lcp, the longest common prefix, between suffixes in the suffix array, and the other is an improvement in the compressed suffix array which supports linear time counting queries for any pattern. The former occupies only 2n + o(n) bits for a text of length n for computing lcp between adjacent suffixes in lexicographic order in constant time, and 6n + o(n) bits between any two suffixes. No data structure in the literature attained linear size. The latter has size proportional to the text size and it is applicable to texts on any alphabet Σ such that |Σ| = log^O(1) n. These space-economical data structures are useful in processing huge amounts of text data.
Efficient External-Memory Data Structures and Applications
, 1996
"... In this thesis we study the Input/Output (I/O) complexity of large-scale problems arising e.g. in the areas of database systems, geographic information systems, VLSI design systems and computer graphics, and design I/O-efficient algorithms for them. A general theme in our work is to design I/O-effic ..."
Abstract
-
Cited by 38 (12 self)
- Add to MetaCart
In this thesis we study the Input/Output (I/O) complexity of large-scale problems arising e.g. in the areas of database systems, geographic information systems, VLSI design systems and computer graphics, and design I/O-efficient algorithms for them. A general theme in our work is to design I/O-efficient algorithms through the design of I/O-efficient data structures. One of our philosophies is to try to isolate all the I/O specific parts of an algorithm in the data structures, that is, to try to design I/O algorithms from internal memory algorithms by exchanging the data structures used in internal memory with their external memory counterparts. The results in the thesis include a technique for transforming an internal memory tree data structure into an external data structure which can be used in a batched dynamic setting, that is, a setting where we for example do not require that the result of a search operation is returned immediately. Using this technique we develop batched dynamic external versions of the (one-dimensional) range-tree and the segment-tree and we develop an external priority queue. Following our general philosophy we show how these structures can be used in standard internal memory sorting algorithms
A simple optimal representation for balanced parentheses
- In Proc. 15th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS v. 3109 (2004
, 2004
"... b Institute of Mathematical Sciences, Chennai 600 113, India. We consider succinct, or highly space-efficient, representations of a (static) string consisting of n pairs of balanced parentheses, that support natural operations such as finding the matching parenthesis for a given parenthesis, or find ..."
Abstract
-
Cited by 33 (1 self)
- Add to MetaCart
b Institute of Mathematical Sciences, Chennai 600 113, India. We consider succinct, or highly space-efficient, representations of a (static) string consisting of n pairs of balanced parentheses, that support natural operations such as finding the matching parenthesis for a given parenthesis, or finding the pair of parentheses that most tightly enclose a given pair. This problem was considered by Jacobson, [Proc. 30th FOCS, 549–554, 1989] and Munro and Raman, [SIAM J. Comput. 31 (2001), 762–776], who gave O(n)-bit and 2n + o(n)-bit representations, respectively, that supported the above operations in O(1) time on the RAM model of computation. This data structure is a fundamental tool in succinct representations, and has applications in representing suffix trees, ordinal trees, planar graphs and permutations. We consider the practical performance of parenthesis representations. First, we give a new 2n + o(n)-bit representation that supports all the above operations in O(1) time. This representation is conceptually simpler, its space bound has a smaller o(n) term and it also has a simple and uniform o(n) time and space construction algorithm. We implement our data structure and a variant of Jacobson’s, and evaluate their practical performance (speed and memory usage), when used in a succinct representation of trees derived from XML documents. As a baseline, we compare our representations against a widely-used implementation of the standard DOM (Document Object Model) representation of XML documents. Both succinct representations use orders of magnitude less space than DOM and tree traversal operations are usually only slightly slower than in DOM. Key words: Succinct data structures, parentheses representation of trees, compressed dictionaries, XML DOM. Preprint submitted to Theoretical Computer Science 29 November 2006 1
Succinct ordinal trees with level-ancestor queries
- In SODA ’04: Proceedings of the Fifteenth annual ACM-SIAM Symposium on Discrete Algorithms
, 2004
"... We consider succinct or space-efficient representations of trees that efficiently support a variety of navigation operations. We focus on static ordinal trees, i.e., arbitrary static rooted trees where the children of each node are ordered. The set of operations is essentially the union of the sets ..."
Abstract
-
Cited by 32 (2 self)
- Add to MetaCart
We consider succinct or space-efficient representations of trees that efficiently support a variety of navigation operations. We focus on static ordinal trees, i.e., arbitrary static rooted trees where the children of each node are ordered. The set of operations is essentially the union of the sets of operations supported by previous succinct

