Results 1 - 8 of 8
Reducing the Space Requirement of Suffix Trees
Software – Practice and Experience, 1999
Cited by 109 (10 self)
We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different type. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained. Copyright © 1999 John Wiley & Sons, Ltd. KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
Efficient Implementation of Lazy Suffix Trees
1999
Cited by 36 (5 self)
We present an efficient implementation of a write-only top-down construction for suffix trees. Our implementation is based on a new, space-efficient representation of suffix trees that requires only 12 bytes per input character in the worst case, and 8.5 bytes per input character on average for a collection of files of different type. We show how to efficiently implement the lazy evaluation of suffix trees such that a subtree is evaluated only when it is traversed for the first time. Our experiments show that for the problem of searching many exact patterns in a fixed input string, the lazy top-down construction is often faster and more space efficient than other methods. Copyright © 2003 John Wiley & Sons, Ltd. KEY WORDS: string matching; suffix tree; space-efficient implementation; lazy evaluation
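The idea of evaluating a subtree only when it is first traversed can be sketched as follows. This is a minimal illustration, not the paper's implementation: it does not use the space-efficient representation described in the abstract, it runs in quadratic time in the worst case, and the names LazyNode, lazy_suffix_tree, find_occurrences and evaluated_nodes are hypothetical. It also assumes the sentinel character "$" does not occur in the input.

```python
class LazyNode:
    """Suffix tree node whose outgoing edges are computed on first traversal."""

    def __init__(self, text, suffixes, depth):
        self.text = text
        self.suffixes = suffixes  # start positions of all suffixes below this node
        self.depth = depth        # characters on the path from the root to this node
        self._children = None     # None means "not evaluated yet"

    def children(self):
        """Return {first_char: (edge_length, child)}, evaluating the node if needed."""
        if self._children is None:
            self._children = self._evaluate()
        return self._children

    def _evaluate(self):
        text, depth = self.text, self.depth
        # Group the suffixes that continue past this node by their next character.
        groups = {}
        for s in self.suffixes:
            if s + depth < len(text):
                groups.setdefault(text[s + depth], []).append(s)
        children = {}
        for first_char, group in groups.items():
            # Extend the edge label while every suffix in the group agrees.
            length = 1
            while True:
                pos = group[0] + depth + length
                if pos >= len(text):
                    break
                c = text[pos]
                if all(s + depth + length < len(text)
                       and text[s + depth + length] == c for s in group):
                    length += 1
                else:
                    break
            children[first_char] = (length, LazyNode(text, group, depth + length))
        return children


def lazy_suffix_tree(text, sentinel="$"):
    """Create only the unevaluated root (assumption: sentinel is absent from text)."""
    full = text + sentinel
    return LazyNode(full, list(range(len(full))), 0)


def find_occurrences(root, pattern):
    """Return sorted start positions of pattern, evaluating only the nodes visited."""
    node, i = root, 0
    while i < len(pattern):
        entry = node.children().get(pattern[i])
        if entry is None:
            return []
        length, child = entry
        start = child.suffixes[0] + node.depth  # where this edge label begins in text
        for k in range(min(length, len(pattern) - i)):
            if node.text[start + k] != pattern[i + k]:
                return []
        i += length
        node = child
    return sorted(node.suffixes)


def evaluated_nodes(node):
    """Count the nodes whose children have been computed so far."""
    if node._children is None:
        return 0
    return 1 + sum(evaluated_nodes(child) for _, child in node._children.values())


if __name__ == "__main__":
    root = lazy_suffix_tree("banana")
    print(find_occurrences(root, "ana"))  # [1, 3]
    print(evaluated_nodes(root))          # 2: only the root and the "a" node
```

After the single query for "ana", only the two nodes on the search path have been expanded; the rest of the tree stays unevaluated until some later query reaches it.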
Efficient phrase-based document indexing for Web document clustering
IEEE Transactions on Knowledge and Data Engineering, 2004
Cited by 31 (1 self)
Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, more informative features including phrases and their weights are particularly important in such scenarios. Document clustering is particularly useful in many applications such as automatic categorization of documents, grouping search engine results, building a taxonomy of documents, and others. This paper presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the Document Index Graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
Optimal Parallel Suffix Tree Construction
1997
Cited by 16 (1 self)
An O(m)-work, O(m)-space, O(log m)-time CREW-PRAM algorithm for constructing the suffix tree of a string s of length m drawn from any fixed alphabet set is obtained. This is the first known work and space optimal parallel algorithm for this problem. It can be generalized to a string s drawn from any general alphabet set to perform in O(log m) time and O(m log |Σ|) work and space, after the characters in s have been sorted alphabetically, where |Σ| is the number of distinct characters in s. In this case too, the algorithm is work-optimal.
Comparative n-Gram Analysis of Whole-Genome Protein Sequences
In Proceedings of the Human Language Technologies Conference, 2002
Cited by 7 (0 self)
A current barrier for successful rational drug design is the lack of understanding of the structure space provided by the proteins in a cell that is determined by their sequence space. The protein sequences capable of folding to functional three-dimensional shapes of the proteins are clearly different for different organisms, since sequences obtained from human proteins often fail to form correct three-dimensional structures in bacterial organisms. In analogy to the question "What kind of things do people say?" we therefore need to ask the question "What kind of amino acid sequences occur in the proteins of an organism?" An understanding of the sequence space occupied by proteins in different organisms would have important applications for "translation" of proteins from the language of one organism into that of another and design of drugs that target sequences that might be unique or preferred by pathogenic organisms over those in human hosts. Here we describe
Linear-Time Construction of Two-Dimensional Suffix Trees (Extended Abstract)
In Proceedings of the 26th International Colloquium on Automata, Languages and Programming (ICALP), volume 1644 of LNCS, 1999
Cited by 4 (1 self)
Dong Kyue Kim, Kunsoo Park. Department of Computer Engineering, Seoul National University, Seoul 151-742, Korea. {dkkim,kpark}@theory.snu.ac.kr
The suffix tree of a string S is a compacted trie that represents all suffixes of S. Linear-time algorithms for constructing the suffix tree have been known for quite a while. In two dimensions, however, linear-time construction of two-dimensional suffix trees has been an open problem. We present the first linear-time algorithm for constructing two-dimensional suffix trees.
Generalizations of Suffix Arrays to Multi-Dimensional Matrices
Cited by 1 (0 self)
We propose multi-dimensional index data structures that generalize suffix arrays to square matrices and cubic matrices. Giancarlo proposed a two-dimensional index data structure, the Lsuffix tree, that generalizes suffix trees to square matrices. However, the construction algorithm for Lsuffix trees maintains complicated data structures and uses a large amount of space. We present simple and practical construction algorithms for multi-dimensional suffix arrays by applying a new partitioning technique to lexicographic sorting. Our contributions are the following: (1) We present the first algorithm for constructing two-dimensional suffix arrays directly. Our algorithm is ten times faster and five times more space-efficient than Giancarlo's algorithm for Lsuffix trees. (2) We present an efficient algorithm for three-dimensional suffix arrays, which is the first algorithm for constructing three-dimensional index data structures.
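For orientation, the ordinary one-dimensional suffix array that this work generalizes can be sketched as below. This is only the textbook structure built by naive lexicographic sorting, not the partitioning technique or the multi-dimensional algorithms of the paper; the function names suffix_array and find_occurrences are hypothetical.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction: sorts the suffixes directly, which is simple but
    far slower than specialised construction algorithms on long inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Binary-search the suffix array for all start positions of pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(sa)                                 # [5, 3, 1, 0, 4, 2]
    print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

All suffixes that start with the pattern occupy one contiguous range of the sorted array, which is why two binary searches are enough to locate every occurrence.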
Designing Pattern Matching Algorithms by Exploiting Structural Pattern Properties
1994
This thesis presents algorithms and, in some cases, lower bounds for some fundamental pattern matching problems. In all cases, the algorithms are obtained by understanding and strongly exploiting structural pattern properties. The following results are obtained. Exact Complexity of String Matching: We consider the question of how many character comparisons are needed to find all occurrences of a pattern string of length m in a text string of length n. We show an almost tight upper bound of the form n + O(n/m) character comparisons, following preprocessing. Specifically, we show an upper bound of n + (8/(3(m+1)))(n − m) character comparisons. This bound is achieved by an on-line algorithm which performs O(n) work in total, requires O(m) space and O(m^2) time for preprocessing. The following lower bounds are also shown: for on-line algorithms, a bound of n + (9/(4(m+1)))(n − m) character comparisons for m = 35 + 36k, for any integer k ≥ 1, and for general algorithms, a bound of n + (2/(m+3))(n − m) character comparisons, for m = 2k + 1, for any integer k ≥ 1.

