Results 1-10 of 24
Reducing the Space Requirement of Suffix Trees
Software – Practice and Experience, 1999
Cited by 118 (10 self)
We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different type. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained. Copyright © 1999 John Wiley & Sons, Ltd. KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
Spelling Approximate Repeated Or Common Motifs Using a Suffix Tree
1998
Cited by 70 (7 self)
We present in this paper two algorithms. The first one extracts repeated motifs from a sequence defined over an alphabet Σ. For instance, Σ may be equal to {A, C, G, T} and the sequence represents an encoding of a DNA macromolecule. The motifs searched correspond to words over the same alphabet which occur a minimum number q of times in the sequence with at most e mismatches each time (q is called the quorum constraint). The second algorithm extracts common motifs from a set of N ≥ 2 sequences. In this case, the motifs must occur, again with at most e mismatches, in 1 ≤ q ≤ N distinct sequences of the set. In both cases, the words representing the motifs may never be present exactly in the sequences. We therefore speak of the motifs, repeated in a sequence or common to a set of them, as being "external" objects and denote them by the expression "valid models" if they verify the quorum constraint q. The approach we introduce here for finding all valid models corr...
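The quorum and mismatch constraints in this abstract can be illustrated with a brute-force sketch. It is not the suffix-tree algorithm of the paper: it simply enumerates every word of a fixed length `k` and checks the constraints directly. The function names and the fixed-length restriction are assumptions made for illustration only.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def occurrences(model, seq, e):
    """Start positions where `model` matches `seq` with at most `e` mismatches."""
    k = len(model)
    return [i for i in range(len(seq) - k + 1)
            if hamming(model, seq[i:i + k]) <= e]

def repeated_models(seq, k, e, q, alphabet="ACGT"):
    """Length-k words occurring at least q times in `seq`, each with <= e mismatches."""
    models = ("".join(w) for w in product(alphabet, repeat=k))
    return [m for m in models if len(occurrences(m, seq, e)) >= q]

def common_models(seqs, k, e, q, alphabet="ACGT"):
    """Length-k words occurring, with <= e mismatches, in at least q distinct sequences."""
    models = ("".join(w) for w in product(alphabet, repeat=k))
    return [m for m in models
            if sum(bool(occurrences(m, s, e)) for s in seqs) >= q]
```

For example, `common_models(["ACA", "AGA"], k=3, e=1, q=2)` includes `AAA` and `ATA`, neither of which appears exactly in either sequence. This is the sense in which the abstract calls valid models "external" objects. The enumeration costs |Σ|^k word checks, so it is only practical for very short models.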
Efficient Searches for Similar Subsequences of Different Lengths in Sequence Databases
In ICDE, 2000
Cited by 39 (4 self)
We propose an indexing technique for fast retrieval of similar subsequences using time warping distances. A time warping distance is a more suitable similarity measure than the Euclidean distance in many applications, where sequences may be of different lengths or different sampling rates. Our indexing technique uses a disk-based suffix tree as an index structure and employs lower-bound distance functions to filter out dissimilar subsequences without false dismissals. To make the index structure compact and thus accelerate the query processing, we convert sequences of continuous values to sequences of discrete values via a categorization method and store only a subset of suffixes whose first values are different from their preceding values. The experimental results reveal that our proposed technique can be a few orders of magnitude faster than sequential scanning.
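A minimal sketch of three ingredients named in this abstract follows: a time warping distance, categorization of continuous values, and the selection of suffixes whose first value differs from the preceding value. The textbook dynamic-programming recurrence with |x - y| as the base cost is assumed here; the paper's exact distance definition, its categorization method, and its lower-bound functions are not given in the abstract, so the boundary-based binning and all function names are illustrative assumptions.

```python
from bisect import bisect_right

def time_warping_distance(a, b):
    """Textbook dynamic-programming time warping distance, base cost |x - y|."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def categorize(values, boundaries):
    """Map each continuous value to the index of its bin (assumed sorted boundaries)."""
    return [bisect_right(boundaries, v) for v in values]

def sparse_suffix_starts(symbols):
    """Start positions of suffixes whose first symbol differs from the preceding one."""
    return [i for i, s in enumerate(symbols) if i == 0 or s != symbols[i - 1]]
```

Sequences of different lengths can have distance zero under warping: `time_warping_distance([1, 2, 3], [1, 2, 2, 3])` is `0.0` because the repeated `2` is absorbed by stretching. With a single boundary at `0.5`, the values `[0.1, 0.2, 0.9, 0.95, 0.3]` become `[0, 0, 1, 1, 0]`, and only the suffixes starting at positions 0, 2 and 4 would be stored.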
OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences
In VLDB, 2003
Cited by 31 (4 self)
A common query against large protein and gene sequence data sets is to locate targets that are similar to an input query sequence. The current set of popular search tools, such as BLAST, employ heuristics to improve the speed of such searches. However, such heuristics can sometimes miss targets, which in many cases is undesirable.
Extracting structured motifs using a suffix tree: algorithms and application to promoter consensus identification
In Proceedings of RECOMB 2000, 2000
Fast retrieval of similar subsequences in long sequence databases
In 3rd IEEE Knowledge and Data Engineering Exchange Workshop, 1999
Cited by 20 (3 self)
Although the Euclidean distance has been the most popular similarity measure in sequence databases, recent techniques prefer to use high-cost distance functions such as the time warping distance and the editing distance for wider applicability. However, if these distance functions are applied to the retrieval of similar subsequences, the number of subsequences to be inspected during the search is quadratic in the average length of data sequences. In this paper, we propose a novel subsequence matching scheme, called the aligned subsequence matching, where the number of subsequences to be compared with a query sequence is reduced to linear in that average length. We also present an indexing technique to speed up the aligned subsequence matching using the similarity measure of the modified time warping distance. The experiments on the synthetic data sequences demonstrate the effectiveness of our proposed approach; ours consistently outperformed the sequential scanning and achieved up to 6.5 times speedup.
Segment-Based Approach for Subsequence Searches in Sequence Databases
2001
Cited by 20 (0 self)
This paper investigates the subsequence searching problem under time warping in sequence databases. Time warping makes it possible to find sequences with similar changing patterns even when they are of different lengths. Our work is motivated by the observation that subsequence searches slow down quadratically as the total length of data sequences increases. To resolve this problem, we propose the Segment-Based Approach for Subsequence Searches (SBASS), which modifies the similarity measure from time warping to piecewise time warping and limits the number of possible subsequences to be compared with a query sequence. For efficient
Accelerating Protein Classification Using Suffix Trees
2000
Cited by 18 (1 self)
Position-specific scoring matrices have been used extensively to recognize highly conserved protein regions. We present a method for accelerating these searches using a suffix tree data structure computed from the sequences to be searched. Building on earlier work that allows evaluation of a scoring matrix to be stopped early, the suffix tree-based method excludes many protein segments from consideration at once by pruning entire subtrees. Although suffix trees are usually expensive in space, the fact that scoring matrix evaluation requires an in-order traversal allows nodes to be stored more compactly without loss of speed, and our implementation requires only 17 bytes of primary memory per input symbol. Searches are accelerated by up to a factor of ten.
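The combination of early stopping and subtree pruning described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: a plain dictionary trie of fixed-length windows stands in for the suffix tree, the scoring matrix is a list of per-column dictionaries, and the pruning bound is the usual "best score still attainable" suffix sum. All names are hypothetical.

```python
def max_remaining(pssm):
    """rest[i] = best score still attainable from column i to the last column."""
    rest = [0.0] * (len(pssm) + 1)
    for i in range(len(pssm) - 1, -1, -1):
        rest[i] = rest[i + 1] + max(pssm[i].values())
    return rest

def build_trie(text, depth):
    """Dict trie of every length-`depth` window; leaves store start positions."""
    root = {}
    for start in range(len(text) - depth + 1):
        node = root
        for ch in text[start:start + depth]:
            node = node.setdefault(ch, {})
        node.setdefault("$starts", []).append(start)
    return root

def search(pssm, text, threshold):
    """Start positions whose window scores >= threshold, pruning hopeless subtrees."""
    rest = max_remaining(pssm)
    hits = []

    def walk(node, depth, score):
        if depth == len(pssm):
            hits.extend(node["$starts"])
            return
        for ch, child in node.items():
            # Symbols missing from a column are treated as impossible matches.
            partial = score + pssm[depth].get(ch, float("-inf"))
            # Even a perfect remainder cannot reach the threshold: skip the subtree.
            if partial + rest[depth + 1] < threshold:
                continue
            walk(child, depth + 1, partial)

    walk(build_trie(text, len(pssm)), 0, 0.0)
    return sorted(hits)
```

With `pssm = [{"A": 2, "C": 0}, {"C": 3, "G": 1}]` and `text = "ACAGCC"`, a threshold of 4 keeps only the window at position 0 (`AC`, score 5). Every window that shares a rejected prefix is discarded in one step, which is the effect the abstract attributes to pruning entire subtrees.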
A First Approach to Finding Common Motifs With Gaps
International Journal of Foundations of Computer Science
Cited by 13 (11 self)
We present three linear algorithms for as many formulations of the problem of finding motifs with gaps. The three versions of the problem are distinct in that they assume different constraints on the size of the gaps. The outline of the algorithm is always the same, although this is adapted each time to the specific problem, while maintaining a linear time complexity with respect to the input size. The approach we suggest is based on a rewriting of the text that uses a new alphabet made of labels representing words of the original input text. The computational complexity of the algorithm allows it to be used to find long motifs as well. The algorithm is in fact general enough that it could be applied to several variants of the problem other than those suggested in this paper.
The reconstruction of user sessions from a server log using improved time-oriented heuristics
In: Proceedings of the Second Annual Conference on Communication Networks and Services Research, 2004
Cited by 10 (0 self)
Web usage mining plays an important role in the personalization of Web services, adaptation of Web sites, and the improvement of Web server performance. It applies data mining techniques to discover Web access patterns from Web usage data. In order to discover access patterns, Web usage data should be reconstructed into sessions with or without user identification. However, not all Web server logs contain complete information for constructing user sessions. One approach for solving such a problem is to use time-oriented heuristics to reconstruct user sessions. This paper describes improved statistical-based time-oriented heuristics for the reconstruction of user sessions from a server log. Comparative analyses are carried out using two similarity measures. The performance results of the proposed improved heuristics are promising and in some cases show reasonable improvements.
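For orientation, a basic time-oriented heuristic can be sketched as below. It is a plain fixed-threshold baseline, not the improved statistical heuristics the paper proposes: a new session starts whenever the pause between two consecutive requests from the same client exceeds a threshold. The 600-second default, the `(user_key, timestamp, url)` record layout, and the use of a client key such as an IP address are assumptions made for illustration.

```python
from collections import defaultdict

def sessions_by_gap(requests, max_gap=600):
    """Split each client's requests into sessions wherever the pause exceeds max_gap.

    `requests` is an iterable of (user_key, timestamp_seconds, url) tuples.
    Returns a list of (user_key, [(timestamp, url), ...]) pairs.
    """
    by_user = defaultdict(list)
    for user, ts, url in requests:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each client's requests by time
        current = [hits[0]]
        for prev, cur in zip(hits, hits[1:]):
            if cur[0] - prev[0] > max_gap:
                # Pause too long: close the running session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(cur)
        sessions.append((user, current))
    return sessions
```

Given requests from client `10.0.0.1` at 0 s, 120 s and 2000 s, the sketch yields two sessions for that client: the first two requests together, and the third alone. A related variant bounds the total session duration instead of the gap between requests.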