Results 11  20
of
239
Dictionary matching and indexing with errors and don’t cares,”
 in Proceedings of the thirtysixth annual ACM symposium on Theory of computing. ACM,
, 2004
"... ..."
Engineering a lightweight suffix array construction algorithm (Extended Abstract)
"... In this paper we consider the problem of computing the suffix array of a text T [1, n]. This problem consists in sorting the suffixes of T in lexicographic order. The suffix array [16] (or pat array [9]) is a simple, easy to code, and elegant data structure used for several fundamental string matchi ..."
Abstract

Cited by 81 (3 self)
 Add to MetaCart
(Show Context)
In this paper we consider the problem of computing the suffix array of a text T [1, n]. This problem consists in sorting the suffixes of T in lexicographic order. The suffix array [16] (or pat array [9]) is a simple, easy to code, and elegant data structure used for several fundamental string matching problems involving both linguistic texts and biological data [4, 11]. Recently, the interest in this data structure has been revitalized by its use as a building block for three novel applications: (1) the BurrowsWheeler compression algorithm [3], which is a provably [17] and practically [20] effective compression tool; (2) the construction of succinct [10, 19] and compressed [7, 8] indexes; the latter can store both the input text and its fulltext index using roughly the same space used by traditional compressors for the text alone; and (3) algorithms for clustering and ranking the answers to user queries in websearch engines [22]. In all these applications the construction of the suffix array is the computational bottleneck both in time and space. This motivated our interest in designing yet another suffix array construction algorithm which is fast and "lightweight" in the sense that it uses small space...
Succinct data structures for flexible text retrieval systems
 Journal of Discrete Algorithms
, 2007
"... University, Fukuoka, Japan. We propose succinct data structures for text retrieval systems supporting document listing queries and ranking queries based on the tf*idf (term frequency times inverse document frequency) scores of documents. Traditional data structures for these problems support querie ..."
Abstract

Cited by 77 (1 self)
 Add to MetaCart
(Show Context)
University, Fukuoka, Japan. We propose succinct data structures for text retrieval systems supporting document listing queries and ranking queries based on the tf*idf (term frequency times inverse document frequency) scores of documents. Traditional data structures for these problems support queries only for some predetermined keywords. Recently Muthukrishnan proposed a data structure for document listing queries for arbitrary patterns at the cost of data structure size. For computing the tf*idf scores there has been no efficient data structures for arbitrary patterns. Our new data structures support these queries using small space. The space is only 2/ɛ times the size of compressed documents plus 10n bits for a document collection of length n, for any 0 <ɛ ≤ 1. This is much smaller than the previous O(n log n) bit data structures. Query time is O(m+q log ɛ n) for listing and computing tf*idf scores for all q documents containing a given pattern of length m. Our data structures are flexible in a sense that they support queries for arbitrary patterns.
A taxonomy of suffix array construction algorithms
 ACM Computing Surveys
, 2007
"... In 1990, Manber and Myers proposed suffix arrays as a spacesaving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abunda ..."
Abstract

Cited by 77 (12 self)
 Add to MetaCart
In 1990, Manber and Myers proposed suffix arrays as a spacesaving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abundance. This survey paper attempts to provide simple highlevel descriptions of these numerous algorithms that highlight both their distinctive features and their commonalities, while avoiding as much as possible the complexities of implementation details. New hybrid algorithms are also described. We provide comparisons of the algorithms ’ worstcase time complexity and use of additional space, together with results of recent experimental test runs on many of their implementations.
Indexing Text using the ZivLempel Trie
 Journal of Discrete Algorithms
, 2002
"... Let a text of u characters over an alphabet of size be compressible to n symbols by the LZ78 or LZW algorithm. We show that it is possible to build a data structure based on the ZivLempel trie that takes 4n log 2 n(1+o(1)) bits of space and reports the R occurrences of a pattern of length m in ..."
Abstract

Cited by 69 (44 self)
 Add to MetaCart
(Show Context)
Let a text of u characters over an alphabet of size be compressible to n symbols by the LZ78 or LZW algorithm. We show that it is possible to build a data structure based on the ZivLempel trie that takes 4n log 2 n(1+o(1)) bits of space and reports the R occurrences of a pattern of length m in worst case time O(m log(m)+(m+R)log n).
Lineartime construction of suffix arrays
 In Proc. 14th Symposium on Combinatorial Pattern Matching (CPM ’03
, 2003
"... Abstract. The time complexity of suffix tree construction has been shown to be equivalent to that of sorting: O(n) for a constantsize alphabet or an integer alphabet and O(n logn) for a general alphabet. However, previous algorithms for constructing suffix arrays have the time complexity of O(n l ..."
Abstract

Cited by 69 (2 self)
 Add to MetaCart
(Show Context)
Abstract. The time complexity of suffix tree construction has been shown to be equivalent to that of sorting: O(n) for a constantsize alphabet or an integer alphabet and O(n logn) for a general alphabet. However, previous algorithms for constructing suffix arrays have the time complexity of O(n logn) even for a constantsize alphabet. In this paper we present a lineartime algorithm to construct suffix arrays for integer alphabets, which do not use suffix trees as intermediate data structures during its construction. Since the case of a constantsize alphabet can be subsumed in that of an integer alphabet, our result implies that the time complexity of directly constructing suffix arrays matches that of constructing suffix trees. 1
Space Efficient Suffix Trees
, 1998
"... We first give a representation of a suffix tree that uses n lg n + O(n) bits of space and supports searching for a pattern in the given text (from a fixed size alphabet) in O(m) time, where n is the size of the text and m is the size of the pattern. The structure is quite simple and answers a questi ..."
Abstract

Cited by 61 (4 self)
 Add to MetaCart
We first give a representation of a suffix tree that uses n lg n + O(n) bits of space and supports searching for a pattern in the given text (from a fixed size alphabet) in O(m) time, where n is the size of the text and m is the size of the pattern. The structure is quite simple and answers a question raised by Muthukrishnan in [17]. Previous compact representations of suffix trees had a higher lower order term in space and had some expectation assumption [3], or required more time for searching [5]. Then, surprisingly, we show that we can even do better, by developing a structure that uses a suffix array (and so ndlg ne bits) and an additional o(n) bits. String searching can be done in this structure also in O(m) time. Besides supporting string searching, we can also report the number of occurrences of the pattern in the same time using no additional space. In this case the space occupied...
Succinct suffix arrays based on runlength encoding
 Nordic Journal of Computing
, 2005
"... A succinct fulltext selfindex is a data structure built on a text T = t1t2...tn, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2... pm in T, and is able to reproduce any text substring, so the selfindex re ..."
Abstract

Cited by 60 (32 self)
 Add to MetaCart
A succinct fulltext selfindex is a data structure built on a text T = t1t2...tn, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2... pm in T, and is able to reproduce any text substring, so the selfindex replaces the text. Several remarkable selfindexes have been developed in recent years. Many of those take space proportional to nH0 or nHk bits, where Hk is the kth order empirical entropy of T. The time to count how many times does P occur in T ranges from O(m) to O(m log n). In this paper we present a new selfindex, called RLFM index for “runlength FMindex”, that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)). The RLFM index requires nHk log σ + O(n) bits of space, for any k ≤ α log σ n and constant 0 < α < 1. Previous indexes that achieve O(m) counting time either require more than nH0 bits of space or require that σ = O(1). We also show that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ. In addition, we prove a close relationship between the kth order entropy of the text and some regularities that show up in their suffix arrays and in the BurrowsWheeler transform of T. This relationship is of independent interest and permits bounding the space occupancy of the RLFM index, as well as that of other existing compressed indexes. Finally, we present some practical considerations in order to implement the RLFM index, obtaining two implementations with different spacetime tradeoffs. We empirically compare our indexes against the best existing implementations and show that they are practical and competitive against those. 1
Breaking a TimeandSpace Barrier in Constructing FullText Indices
"... Suffix trees and suffix arrays are the most prominent fulltext indices, and their construction algorithms are well studied. It has been open for a long time whether these indicescan be constructed in both o(n log n) time and o(n log n)bit working space, where n denotes the length of the text. Int ..."
Abstract

Cited by 59 (4 self)
 Add to MetaCart
(Show Context)
Suffix trees and suffix arrays are the most prominent fulltext indices, and their construction algorithms are well studied. It has been open for a long time whether these indicescan be constructed in both o(n log n) time and o(n log n)bit working space, where n denotes the length of the text. Inthe literature, the fastest algorithm runs in O(n) time, whileit requires O(n log n)bit working space. On the other hand,the most spaceefficient algorithm requires O(n)bit working space while it runs in O(n log n) time. This paper breaks the longstanding timeandspace barrier under the unitcost word RAM. We give an algorithm for constructing the suffix array which takes O(n) time and O(n)bit working space, for texts with constantsize alphabets. Note that both the time and the space bounds are optimal. For constructing the suffix tree, our algorithm requires O(n logffl n) time and O(n)bit working space forany 0! ffl! 1. Apart from that, our algorithm can alsobe adopted to build other existing fulltext indices, such as