Reducing the Space Requirement of Suffix Trees
 Software – Practice and Experience
, 1999
Cited by 145 (12 self)
We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different type. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained. Copyright © 1999 John Wiley & Sons, Ltd. KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
Fast Kernels for String and Tree Matching
, 2004
Cited by 110 (7 self)
Introduction Many problems in machine learning require a data classification algorithm to work with a set of discrete objects. Common examples include biological sequence analysis, where data is represented as strings (Durbin et al., 1998), and Natural Language Processing (NLP), where the data is given in the form of a string combined with a parse tree (Collins and Duffy, 2001) or an annotated sequence (Altun et al., 2003). In order to apply kernel methods one defines a measure of similarity between discrete structures via a feature map $\phi : X \to F$. Here $X$ is the set of discrete structures (e.g. the set of all parse trees of a language) and $F$ is a Hilbert space. Since $\phi(x) \in F$ we can define a kernel by evaluating the scalar products $k(x, x') = \langle \phi(x), \phi(x') \rangle$ (1.1) where $x, x' \in X$. The success of a kernel method employing $k$ depends both on the faithful representation of discrete data and an efficient means of computing $k$. Recent research effort has focussed on defining meaningful ker
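To make the definition of a kernel as an inner product of feature vectors concrete, here is a minimal Python sketch. It assumes one particular feature map, the count of every non-empty substring, and computes it explicitly in quadratic space. The function names are hypothetical and this is not the paper's fast algorithm, only an illustration of the definition.

```python
from collections import Counter


def substring_features(s):
    """Explicit feature map (an assumption for illustration):
    one coordinate per distinct non-empty substring, holding its count."""
    return Counter(
        s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)
    )


def substring_kernel(x, y):
    """k(x, y) = <phi(x), phi(y)>: dot product of the two count vectors."""
    fx, fy = substring_features(x), substring_features(y)
    if len(fx) > len(fy):
        fx, fy = fy, fx  # iterate over the smaller feature vector
    # Counter returns 0 for substrings that do not occur in the other string.
    return sum(count * fy[sub] for sub, count in fx.items())
```

For example, `substring_kernel("ab", "ab")` sums the products for the shared substrings "a", "b" and "ab". The explicit map enumerates all substrings, which is exactly the cost an efficient method would have to avoid.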
Efficient implementation of lazy suffix trees
 MESSAGE SEQUENCE CHARTS AND PETRI NETS, CITESEER.NJ.NEC.COM/VANDERAALST99INTERORGANIZATIONAL.HTML
, 1999
Cited by 52 (6 self)
We present an efficient implementation of a write-only top-down construction for suffix trees. Our implementation is based on a new, space-efficient representation of suffix trees which requires only 12 bytes per input character in the worst case, and 8.5 bytes per input character on average for a collection of files of different type. We show how to efficiently implement the lazy evaluation of suffix trees such that a subtree is evaluated not before it is traversed for the first time. Our experiments show that for the problem of searching many exact patterns in a fixed input string, the lazy top-down construction is often faster and more space efficient than other methods.
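The idea of evaluating a subtree only when it is first traversed can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it builds an uncompacted suffix trie rather than a suffix tree, it does not use the space-efficient representation described in the abstract, and the names `LazyNode`, `build_lazy_trie` and `occurrences` are hypothetical.

```python
class LazyNode:
    """Trie node whose children are built only on first access."""

    def __init__(self, text, starts, depth):
        self.text = text
        self.starts = starts    # start positions of suffixes passing through here
        self.depth = depth      # number of characters matched so far
        self._children = None   # None means "not evaluated yet"

    @property
    def evaluated(self):
        return self._children is not None

    def children(self):
        if self._children is None:
            # Group the suffixes by their next character (top-down step).
            groups = {}
            for start in self.starts:
                pos = start + self.depth
                if pos < len(self.text):
                    groups.setdefault(self.text[pos], []).append(start)
            self._children = {
                ch: LazyNode(self.text, group, self.depth + 1)
                for ch, group in groups.items()
            }
        return self._children


def build_lazy_trie(text):
    """Create only the unevaluated root; no subtree is built yet."""
    return LazyNode(text, list(range(len(text))), 0)


def occurrences(root, pattern):
    """Return sorted start positions of pattern, expanding nodes on demand."""
    node = root
    for ch in pattern:
        node = node.children().get(ch)
        if node is None:
            return []
    return sorted(node.starts)
```

Searching for a pattern expands only the nodes on its path, so subtrees that no query ever visits are never built.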
OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences
 In VLDB
, 2003
Cited by 43 (4 self)
A common query against large protein and gene sequence data sets is to locate targets that are similar to an input query sequence. The current set of popular search tools, such as BLAST, employ heuristics to improve the speed of such searches. However, such heuristics can sometimes miss targets, which in many cases is undesirable.
Practical Suffix Tree Construction
 In Proc. 13th International Conference on Very Large Data Bases
, 2004
Cited by 36 (2 self)
Large string datasets are common in a number of emerging text and biological database applications.
A Comparison of Imperative and Purely Functional Suffix Tree Constructions
 Science of Computer Programming
, 1995
Cited by 22 (6 self)
We explore the design space of implementing suffix tree algorithms in the functional paradigm. We review the linear time and space algorithms of McCreight and Ukkonen. Based on a new terminology of nested suffixes and nested prefixes, we give a simpler and more declarative explanation of these algorithms than was previously known. We design two "naive" versions of these algorithms which are not linear time, but use simpler data structures, and can be implemented in a purely functional style. Furthermore, we present a new, "lazy" suffix tree construction which is even simpler. We evaluate both imperative and functional implementations of these algorithms. Our results show that the naive algorithms perform very favourably, and in particular, the lazy construction compares very well to all the others. 1 Introduction Suffix trees are the method of choice when a large sequence of symbols, the "text", is to be searched frequently for occurrences of short sequences, the "patterns". Given tha...
A linear lower bound on index size for text retrieval
 J. ALGORITHMS
, 2003
Cited by 21 (1 self)
Most information-retrieval systems preprocess the data to produce an auxiliary index structure. Empirically, it has been observed that there is a tradeoff between query response time and the size of the index. When indexing a large corpus, such as the web, the size of the index is an important consideration. In this case it would be ideal to produce an index that is substantially smaller than the text. In this work we prove a linear worst-case lower bound on the size of any index that reports the location (if any) of a substring in the text in time proportional to the length of the pattern. In other words, an index supporting linear-time substring searches requires about as much space as the original text. Here “time” is measured in the number of bit probes to the text; an arbitrary amount of computation may be done on an arbitrary amount of the index. Our lower bound applies to inverted word indices as well.
The at most k-deep factor tree
, 2003
Cited by 18 (4 self)
This article presents a new indexing structure closely related to the suffix tree. The structure indexes all factors (substrings) of length at most k of a string. Construction time and memory usage are linear in the length of the string, as for the suffix tree. For small values of k, however, the factor tree offers a substantial memory saving compared with the suffix tree. Keywords: suffix tree, factor tree, indexing structure.
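A rough Python sketch of indexing all factors of length at most k is given below. It is an assumption-laden illustration: it uses an uncompacted trie of nested dictionaries and takes O(nk) time, so it does not reproduce the linear-time, linear-space construction claimed in the abstract. The function names are hypothetical.

```python
def build_factor_trie(text, k):
    """Insert text[i:i+k] for every start i, so every factor of
    length at most k corresponds to a path from the root."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:i + k]:
            node = node.setdefault(ch, {})
    return root


def contains_factor(trie, pattern):
    """True if pattern labels a path from the root (so len(pattern) <= k)."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def count_nodes(trie):
    """Number of trie nodes, including the root."""
    return 1 + sum(count_nodes(child) for child in trie.values())
```

Comparing `count_nodes(build_factor_trie(text, k))` for a small k against the value for `k = len(text)` gives a feel for the memory saving that bounding the depth can bring, at the price of answering queries only for patterns of length at most k.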
Suffix Trees and their Applications in String Algorithms
Cited by 18 (0 self)
The suffix tree is a compacted trie that stores all suffixes of a given text string. This data structure has been intensively employed in pattern matching on strings and trees, with a wide range of applications, such as molecular biology, data processing, text editing, term rewriting, interpreter design, information retrieval, abstract data types and many others. In this paper, we survey some applications of suffix trees and some algorithmic techniques for their construction. Special emphasis is given to the most recent developments in this area, such as parallel algorithms for suffix tree construction and generalizations of suffix trees to higher dimensions, which are important in multidimensional pattern matching.
Fast and space efficient string kernels using suffix arrays
 In Proceedings, 23rd ICML
, 2006
Cited by 18 (2 self)
String kernels which compare the set of all common substrings between two given strings have recently been proposed by Vishwanathan & Smola (2004). Surprisingly, these kernels can be computed in linear time and linear space using annotated suffix trees. Even though, in theory, the suffix-tree-based algorithm requires O(n) space for a string of length n, in practice at least 40n bytes are required: 20n bytes for storing the suffix tree, and an additional 20n bytes for the annotation. This large memory requirement, coupled with the poor locality of memory access inherent in suffix trees, means that the performance of the suffix-tree-based algorithm deteriorates on large strings. In this paper, we describe a new linear-time yet space-efficient and scalable algorithm for computing string kernels, based on suffix arrays. Our algorithm is a) faster and easier to implement, b) on average requires only 19n bytes of storage, and c) exhibits strong locality of memory access. We show that our algorithm can be extended to perform linear-time prediction on a test string, and present experiments to validate our claims.
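For readers unfamiliar with suffix arrays, the following Python sketch shows the basic data structure: the start positions of all suffixes in sorted order, searched by binary search over one flat array. It is only an illustration under simplifying assumptions. The construction sorts suffixes naively rather than in linear time, it is not the kernel algorithm of the paper, and the function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the
    suffixes (naive construction by sorting; not linear time)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Sorted start positions of pattern, found with two binary searches."""
    m = len(pattern)

    def prefix(rank):
        # First m characters of the suffix with the given rank.
        return text[sa[rank]:sa[rank] + m]

    # Lower bound: first rank whose prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first rank whose prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])
```

All matches occupy one contiguous range of the array, which hints at why array-based indexes tend to access memory more locally than pointer-based trees.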