Results 1 - 8 of 8
Reducing the Space Requirement of Suffix Trees
 Software – Practice and Experience
, 1999
Cited by 118 (10 self)
We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different type. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained. Copyright © 1999 John Wiley & Sons, Ltd. KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
Efficient Implementation of Lazy Suffix Trees
, 1999
Cited by 38 (5 self)
We present an efficient implementation of a write-only top-down construction for suffix trees. Our implementation is based on a new, space-efficient representation of suffix trees that requires only 12 bytes per input character in the worst case, and 8.5 bytes per input character on average for a collection of files of different type. We show how to efficiently implement the lazy evaluation of suffix trees such that a subtree is evaluated only when it is traversed for the first time. Our experiments show that for the problem of searching many exact patterns in a fixed input string, the lazy top-down construction is often faster and more space efficient than other methods. Copyright © 2003 John Wiley & Sons, Ltd. KEY WORDS: string matching; suffix tree; space-efficient implementation; lazy evaluation
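The lazy evaluation idea in this abstract can be illustrated with a minimal sketch. It assumes a plain, uncompacted trie with one character per level, not the space-efficient representation the paper describes; the names `LazyNode`, `expand`, and `occurrences` are hypothetical and chosen only for this illustration.

```python
class LazyNode:
    """Node of an uncompacted suffix trie whose children are built on demand."""

    def __init__(self, text, suffixes, depth):
        self.text = text
        self.suffixes = suffixes  # start positions of all suffixes passing through this node
        self.depth = depth        # number of characters matched so far
        self.children = None      # None means "not evaluated yet"

    def expand(self):
        """Evaluate this node's children the first time it is traversed."""
        if self.children is None:
            groups = {}
            for start in self.suffixes:
                pos = start + self.depth
                if pos < len(self.text):
                    # Group suffixes by their next character (top-down partitioning).
                    groups.setdefault(self.text[pos], []).append(start)
            self.children = {
                ch: LazyNode(self.text, starts, self.depth + 1)
                for ch, starts in groups.items()
            }
        return self.children


def occurrences(root, pattern):
    """Return all start positions of pattern, expanding only the nodes on its path."""
    node = root
    for ch in pattern:
        node = node.expand().get(ch)
        if node is None:
            return []
    return sorted(node.suffixes)


text = "banana"
root = LazyNode(text, list(range(len(text))), 0)
hits = occurrences(root, "ana")
```

After the single query above, only the nodes along the path for "ana" have been evaluated; the subtree below the root's "b" child remains unevaluated until some later pattern reaches it.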
Efficient phrase-based document indexing for Web document clustering
 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 2004
Cited by 37 (2 self)
Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, more informative features including phrases and their weights are particularly important in such scenarios. Document clustering is particularly useful in many applications such as automatic categorization of documents, grouping search engine results, building a taxonomy of documents, and others. This paper presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the Document Index Graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pairwise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
Optimal Parallel Suffix Tree Construction
, 1997
Cited by 17 (1 self)
An O(m)-work, O(m)-space, O(log m)-time CREW-PRAM algorithm for constructing the suffix tree of a string s of length m drawn from any fixed alphabet set is obtained. This is the first known work and space optimal parallel algorithm for this problem. It can be generalized to a string s drawn from any general alphabet set to perform in O(log m) time and O(m log |Σ|) work and space, after the characters in s have been sorted alphabetically, where |Σ| is the number of distinct characters in s. In this case too, the algorithm is work-optimal.
Comparative n-Gram Analysis of Whole-Genome Protein Sequences
 IN PROCEEDINGS OF THE HUMAN LANGUAGE TECHNOLOGIES CONFERENCE
, 2002
Cited by 8 (0 self)
A current barrier for successful rational drug design is the lack of understanding of the structure space provided by the proteins in a cell that is determined by their sequence space. The protein sequences capable of folding to functional three-dimensional shapes of the proteins are clearly different for different organisms, since sequences obtained from human proteins often fail to form correct three-dimensional structures in bacterial organisms. In analogy to the question "What kind of things do people say?" we therefore need to ask the question "What kind of amino acid sequences occur in the proteins of an organism?" An understanding of the sequence space occupied by proteins in different organisms would have important applications for "translation" of proteins from the language of one organism into that of another and design of drugs that target sequences that might be unique or preferred by pathogenic organisms over those in human hosts. Here we describe
Linear-Time Construction of Two-Dimensional Suffix Trees (Extended Abstract)
 In Proceedings of the 26th International Colloquium on Automata, Languages and Programming (ICALP), volume 1644 of LNCS
, 1999
Cited by 4 (1 self)
Dong Kyue Kim, Kunsoo Park. Department of Computer Engineering, Seoul National University, Seoul 151-742, Korea. {dkkim,kpark}@theory.snu.ac.kr. Abstract. The suffix tree of a string S is a compacted trie that represents all suffixes of S. Linear-time algorithms for constructing the suffix tree have been known for quite a while. In two dimensions, however, linear-time construction of two-dimensional suffix trees has been an open problem. We present the first linear-time algorithm for constructing two-dimensional suffix trees.
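As background for the definition in this abstract, the following is a minimal sketch of the one-dimensional case only. It builds an uncompacted suffix trie by naive quadratic-time insertion, so it is neither the compacted suffix tree nor any of the linear-time algorithms mentioned; the function names and the "$" terminator are assumptions made for this illustration.

```python
def build_suffix_trie(s, terminator="$"):
    """Insert every suffix of s + terminator into a nested-dict trie (naive, quadratic)."""
    text = s + terminator  # assumes the terminator does not occur in s
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """A pattern occurs in s exactly when it labels a path starting at the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def count_leaves(trie):
    """Each suffix ends at its own leaf because the terminator is unique."""
    if not trie:
        return 1
    return sum(count_leaves(child) for child in trie.values())


trie = build_suffix_trie("banana")
```

A compacted trie would merge every chain of single-child nodes into one edge labeled with a substring, which is what brings the node count down to linear in the length of S.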
Generalizations of Suffix Arrays to Multi-Dimensional Matrices
Cited by 2 (0 self)
We propose multidimensional index data structures that generalize suffix arrays to square matrices and cubic matrices. Giancarlo proposed a two-dimensional index data structure, the Lsuffix tree, that generalizes suffix trees to square matrices. However, the construction algorithm for Lsuffix trees maintains complicated data structures and uses a large amount of space. We present simple and practical construction algorithms for multidimensional suffix arrays by applying a new partitioning technique to lexicographic sorting. Our contributions are the following: (1) We present the first algorithm for constructing two-dimensional suffix arrays directly. Our algorithm is ten times faster and five times more space-efficient than Giancarlo's algorithm for Lsuffix trees. (2) We present an efficient algorithm for three-dimensional suffix arrays, which is the first algorithm for constructing three-dimensional index data structures.
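For readers unfamiliar with the structure being generalized, here is a minimal sketch of an ordinary one-dimensional suffix array built by lexicographic sorting and queried by binary search. It does not implement the partitioning technique or the multidimensional construction of this paper; `build_suffix_array` and `find_occurrences` are hypothetical names used only here.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Binary-search the block of suffixes that begin with pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
positions = find_occurrences(text, sa, "ana")
```

The array stores only integer positions, which is the usual reason suffix arrays take less space than suffix trees over the same text.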
Designing Pattern Matching Algorithms by Exploiting Structural Pattern Properties
, 1994
This thesis presents algorithms and, in some cases, lower bounds for some fundamental pattern matching problems. In all cases, the algorithms are obtained by understanding and strongly exploiting structural pattern properties. The following results are obtained. Exact Complexity of String Matching: We consider the question of how many character comparisons are needed to find all occurrences of a pattern string of length m in a text string of length n. We show an almost tight upper bound of the form n + O(n/m) character comparisons, following preprocessing. Specifically, we show an upper bound of n+ (n m) character comparisons. This bound is achieved by an online algorithm which performs O(n) work in total, requires O(m) space and O(m ) time for preprocessing. The following lower bounds are also shown: for online algorithms, a bound of n+ (n m) character comparisons for m = 35 + 36k, for any integer k ≥ 1, and for general algorithms, a bound of n + m+3 character comparisons, for m = 2k + 1, for any integer k ≥ 1.