Results 1 - 10 of 38
Compressed full-text indexes
- ACM Computing Surveys, 2007
Cited by 142 (70 self)
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it has triggered a wealth of activity, produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and the regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes to date, focusing on the essential aspects of how they exploit text compressibility and how they efficiently solve various search problems. We aim to give the theoretical background needed to understand and follow the developments in this area.
Hamsa: fast signature generation for zero-day polymorphic worms with provable attack resilience
- In SP ’06: Proceedings of the 2006 IEEE Symposium on Security and Privacy (S&P ’06), 2006
Cited by 53 (5 self)
Zero-day polymorphic worms pose a serious threat to the security of Internet infrastructures. Given their rapid propagation, it is crucial to detect them at edge networks and automatically generate signatures in the early stages of infection. Most existing approaches for automatic signature generation need host information and are thus not applicable for deployment on high-speed network links. In this paper, we propose Hamsa, a network-based automated signature generation system for polymorphic worms which is fast, noise-tolerant and attack-resilient. Essentially, we propose a realistic model to analyze the invariant content of polymorphic worms which allows us to make analytical attack-resilience guarantees for the signature generation algorithm. Evaluation based on a range of polymorphic worms and polymorphic engines demonstrates that Hamsa significantly outperforms Polygraph [16] in terms of efficiency, accuracy, and attack resilience.
A taxonomy of suffix array construction algorithms
- ACM Computing Surveys, 2007
Cited by 30 (10 self)
In 1990, Manber and Myers proposed suffix arrays as a space-saving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abundance. This survey paper attempts to provide simple high-level descriptions of these numerous algorithms that highlight both their distinctive features and their commonalities, while avoiding as much as possible the complexities of implementation details. New hybrid algorithms are also described. We provide comparisons of the algorithms’ worst-case time complexity and use of additional space, together with results of recent experimental test runs on many of their implementations.
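To make "suffix array construction and use" concrete, here is a minimal Python sketch. It is not any of the algorithms covered by the survey: construction is done by naively sorting suffix start positions, and lookup uses two plain binary searches. The function names build_suffix_array and find_occurrences are hypothetical and chosen only for this illustration.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix they begin.

    This is far slower than the specialised algorithms in the literature;
    it only illustrates what the resulting array contains.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text.

    All suffixes that start with pattern are contiguous in the suffix
    array, so two binary searches delimit the matching range.
    """
    m = len(pattern)

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "ana"))
```

Each binary search step compares at most len(pattern) characters, so a lookup in this sketch costs on the order of len(pattern) times log n character comparisons once the array is built.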
Two space saving tricks for linear time LCP computation
- In: Proc. SWAT. Volume 3111 of Lecture Notes in Computer Science, 2004
Cited by 26 (3 self)
In this paper we consider the linear time algorithm of Kasai et al. [10] for the computation of the LCP array given the text and the suffix array. We show that this algorithm can be implemented without any auxiliary array in addition to the ones required for the input (the text and the suffix array) and the output (the LCP array). Thus, for a text of length n, we reduce the space occupancy of this algorithm from 13n bytes to 9n bytes.
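For orientation, the following Python sketch shows the textbook form of the linear-time LCP computation attributed to Kasai et al. It is not the space-saving variant this paper describes: it keeps an explicit rank array, which is exactly the kind of auxiliary array the paper avoids. Assuming 1-byte characters and 4-byte integers, text, suffix array, LCP array and rank array add up to n + 4n + 4n + 4n = 13n bytes, which is consistent with the 13n and 9n figures quoted above. The name lcp_kasai is hypothetical.

```python
def lcp_kasai(text, sa):
    """Compute the LCP array from a text and its suffix array in linear time.

    lcp[i] is the length of the longest common prefix of the suffixes
    starting at sa[i - 1] and sa[i]; lcp[0] is defined as 0.
    """
    n = len(text)

    # rank[p] = position of suffix p in the suffix array (inverse of sa).
    # This is the auxiliary array of the plain version of the algorithm.
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i

    lcp = [0] * n
    h = 0  # number of characters already known to match
    for i in range(n):  # visit suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # lexicographic predecessor of suffix i
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


if __name__ == "__main__":
    # Suffix array of "banana", written out by hand for this example.
    print(lcp_kasai("banana", [5, 3, 1, 0, 4, 2]))
```

The counter h drops by at most one per iteration and never exceeds n, which is what bounds the total number of character comparisons linearly.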
Fast lightweight suffix array construction and checking
- 14th Annual Symposium on Combinatorial Pattern Matching, 2003
Cited by 24 (5 self)
We describe an algorithm that, for any v ∈ [2, n], constructs the suffix array of a string of length n in O(vn + n log n) time using O(v + n/√v) space in addition to the input (the string) and the output (the suffix array). By setting v = log n, we obtain an O(n log n) time algorithm using O(n/√ ...
Better external memory suffix array construction
- In: Workshop on Algorithm Engineering & Experiments, 2005
Cited by 24 (5 self)
Suffix arrays are a simple and powerful data structure for text processing that can be used for full-text indexes, data compression, and many other applications, in particular in bioinformatics. However, so far it has looked prohibitive to build suffix arrays for huge inputs that do not fit into main memory. This paper presents the design, analysis, implementation, and experimental evaluation of several new and improved algorithms for suffix array construction. The algorithms are asymptotically optimal in the worst case or on average. Our implementation can construct suffix arrays for inputs of up to 4 GBytes in hours on a low-cost machine. As a tool of possible independent interest we present a systematic way to design, analyze, and implement pipelined algorithms.
Linear-Time Computation of Similarity Measures for Sequential Data
- 2008
Cited by 13 (10 self)
Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms—enabling peak performances of up to 10^6 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA.
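As an illustration of the sorted-array idea, the Python sketch below embeds each sequence as a sorted array of (k-gram, count) pairs and computes a simple linear kernel, the inner product of the two count vectors, by merging the arrays in a single pass. This is a simplified sketch under stated assumptions, not the authors' implementation: the names sorted_kgram_counts and linear_kernel are hypothetical, and the sorting step here is comparison-based, so only the merge is linear in the array lengths.

```python
from collections import Counter


def sorted_kgram_counts(seq, k):
    """Embed a sequence as a sorted array of (k-gram, count) pairs."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return sorted(counts.items())


def linear_kernel(a, b):
    """Inner product of two k-gram count vectors given as sorted arrays.

    Both arrays are scanned once, as in the merge step of merge sort,
    so the cost is linear in len(a) + len(b).
    """
    i = j = 0
    total = 0
    while i < len(a) and j < len(b):
        gram_a, count_a = a[i]
        gram_b, count_b = b[j]
        if gram_a == gram_b:
            total += count_a * count_b
            i += 1
            j += 1
        elif gram_a < gram_b:
            i += 1
        else:
            j += 1
    return total


if __name__ == "__main__":
    x = sorted_kgram_counts("abab", 2)
    y = sorted_kgram_counts("baba", 2)
    print(x, y, linear_kernel(x, y))
```

Other similarity functions that decompose over shared k-grams could reuse the same merge loop with a different per-match update.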
Fast BWT in small space by blockwise suffix sorting
- In Proc. DIMACS Working Group on the Burrows-Wheeler Transform: Ten Years Later
Cited by 13 (2 self)
The usual way to compute the Burrows–Wheeler transform (BWT) [3] of a text is by constructing the suffix array of the text. Even with space-efficient suffix array construction algorithms [12, 2], the space requirement of the suffix array itself is often the main factor limiting the size of the text that can be handled in one piece, which is crucial for constructing compressed text indexes [4, 5]. Typically, the suffix array needs 4n bytes while the text and the BWT need only n bytes each and sometimes even less, for example 2n bits each for a DNA sequence. We reduce the space dramatically by constructing the suffix array in blocks of lexicographically consecutive suffixes. Given such a block, the corresponding block of the BWT is trivial to compute. Theorem 1. The BWT of a text of length n can be computed in O(n log n + n√v + D_v) time (with high probability) and O(n/√v + v) space (in addition to the text and the BWT), for any v ∈ [1, n]. Here D_v = ∑_{i∈[0,n)} min(d_i, v) = O(nv), where d_i is the length of the shortest unique substring starting at i. Proof (sketch). Assume first that the text has no repetitions longer than v, i.e., d_i ≤ v for all i. Choose a set of O(v) random suffixes that divide the suffix array into blocks. The sizes of the blocks
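The blockwise idea can be illustrated with a deliberately naive Python sketch: pick a few splitter suffixes, then for each lexicographic range collect and sort only the suffixes that fall in it, and emit the character preceding each one. This is not the algorithm of the paper and has none of its time or space guarantees. The splitters are evenly spaced positions rather than random ones, suffixes are compared as whole strings, and the name bwt_blockwise, the num_blocks parameter and the "$" sentinel are assumptions made for this illustration only.

```python
def bwt_blockwise(text, num_blocks=2, sentinel="$"):
    """Compute the BWT one block of the suffix array at a time.

    Assumes the sentinel does not occur in text and sorts before every
    character of text.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    n = len(t)

    # Hypothetical splitter choice: suffixes at evenly spaced positions.
    step = max(1, n // num_blocks)
    splitters = sorted({t[i:] for i in range(0, n, step)})

    out = []
    lower = None
    for upper in splitters + [None]:
        # Suffixes s with lower <= s < upper form one block of
        # lexicographically consecutive suffixes.
        block = [
            i for i in range(n)
            if (lower is None or t[i:] >= lower)
            and (upper is None or t[i:] < upper)
        ]
        block.sort(key=lambda i: t[i:])
        # BWT character of a suffix is the character preceding it;
        # for position 0 this wraps around to the sentinel.
        out.extend(t[i - 1] for i in block)
        lower = upper
    return "".join(out)


if __name__ == "__main__":
    print(bwt_blockwise("banana"))
```

Only one block of suffix positions is held at a time, which is the point of the exercise; the output is the same string that sorting all suffixes at once would give.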
Faster Lightweight Suffix Array Construction
Cited by 9 (3 self)
The suffix array is a data structure formed by sorting the suffixes of a string into lexicographical order. It is important for a variety of applications, perhaps most notably pattern matching, pattern discovery and block-sorting data compression. The last decade has seen intensive research toward efficient construction of suffix arrays with algorithms striving not only to be fast, but also “lightweight” (in the sense that they use small working memory). In this paper we describe a new lightweight suffix array construction algorithm. By exploiting several interesting properties of suffixes in combination with cache-conscious programming we achieve excellent runtimes. Extensive experiments show our approach to be faster than all other known algorithms for the task.
Practical methods for constructing suffix trees
- 2005
Cited by 9 (1 self)
Sequence datasets are ubiquitous in modern life science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not

