Results 1-10 of 88
Reducing the Space Requirement of Suffix Trees
 Software – Practice and Experience
, 1999
Cited by 144 (12 self)
We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different type. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained. Copyright © 1999 John Wiley & Sons, Ltd. KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
Fast and Intuitive Clustering of Web Documents
 In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining
, 1997
Cited by 113 (2 self)
Conventional document retrieval systems (e.g., Alta Vista) return long lists of ranked documents in response to user queries. Recently, document clustering has been put forth as an alternative method of organizing retrieval results (Cutting et al. 1992). A person browsing the clusters can discover patterns that could be overlooked in the traditional presentation. This paper describes two novel clustering methods that intersect the documents in a cluster to determine the set of words (or phrases) shared by all the documents in the cluster. We report on experiments that evaluate these intersection-based clustering methods on collections of snippets returned from Web search engines. First, we show that word-intersection clustering produces superior clusters and does so faster than standard techniques. Second, we show that our O(n log n) time phrase-intersection clustering method produces comparable clusters and does so more than two orders of magnitude faster than all methods tested. I...
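The intersection step mentioned in this abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization and case folding, and the function name `shared_words` is hypothetical. It shows only how the words common to every document in a given cluster could be computed, not the clustering methods of the paper.

```python
def shared_words(cluster: list[str]) -> set[str]:
    """Return the words that occur in every document of the cluster.

    Assumption: documents are plain strings, tokenized on whitespace
    and compared case-insensitively.
    """
    word_sets = [set(doc.lower().split()) for doc in cluster]
    if not word_sets:
        return set()
    # Intersect the word sets of all documents in the cluster.
    return set.intersection(*word_sets)


if __name__ == "__main__":
    snippets = [
        "suffix tree construction",
        "lazy suffix tree",
        "sparse suffix tree index",
    ]
    print(sorted(shared_words(snippets)))
```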
Space efficient linear time construction of suffix arrays
 Journal of Discrete Algorithms
, 2003
Cited by 102 (1 self)
We present a linear time algorithm to sort all the suffixes of a string over a large alphabet of integers. The sorted order of suffixes of a string is also called suffix array, a data structure introduced by Manber and Myers that has numerous applications in pattern matching, string processing, and computational biology. Though the suffix tree of a string can be constructed in linear time and the sorted order of suffixes derived from it, a direct algorithm for suffix sorting is of great interest due to the space requirements of suffix trees. Our result improves upon the best known direct algorithm for suffix sorting, which takes O(n log n) time. We also show how to construct suffix trees in linear time from our suffix sorting result. Apart from being simple and applicable for alphabets not necessarily of fixed size, this method of constructing suffix trees is more space efficient.
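To make the notion of a suffix array concrete, here is a small sketch that builds one by plain comparison sorting and then uses it for pattern matching by binary search. This is not the linear time algorithm of the paper: the naive sort below can take O(n^2 log n) time in the worst case. The function names are hypothetical.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction by sorting the suffixes directly; for
    illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of pattern in text, found via binary search on sa."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    sa = suffix_array("banana")
    print(sa)                               # [5, 3, 1, 0, 4, 2]
    print(occurrences("banana", sa, "ana"))  # [1, 3]
```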
Lempel-Ziv parsing and sublinear-size index structures for string matching (Extended Abstract)
 Proc. 3rd South American Workshop on String Processing (WSP'96)
, 1996
Cited by 61 (1 self)
String matching over a long text can be significantly speeded up with an index structure formed by preprocessing the text. For very long texts, the size of such an index can be a problem. This paper presents the first sublinear-size index structure. The new structure is based on Lempel-Ziv parsing of the text and has size linear in N, the size of the Lempel-Ziv parse. For a text of length n, N = O(n / log n) and can be still smaller if the text is compressible. With the new index structure, all occurrences of a pattern string of length m can be found in time O(m 2
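As a rough illustration of why a Lempel-Ziv parse can be much shorter than the text, the sketch below computes an LZ78-style parse. The choice of LZ78 is an assumption made for brevity; the abstract does not say which Lempel-Ziv variant the index uses, and the function name `lz78_phrases` is hypothetical. The number of phrases plays the role of N.

```python
def lz78_phrases(text: str) -> list[str]:
    """Split text into LZ78-style phrases.

    Each phrase is the shortest prefix of the remaining text that has
    not appeared as an earlier phrase. The final phrase may repeat an
    earlier one if the text ends first.
    """
    seen: set[str] = set()
    phrases: list[str] = []
    current = ""
    for ch in text:
        current += ch
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # Leftover characters that did not form a new phrase.
        phrases.append(current)
    return phrases


if __name__ == "__main__":
    parse = lz78_phrases("abababab")
    print(parse, len(parse))
```

On repetitive input the phrase count grows far more slowly than the text length, which is the property a parse-based index relies on.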
Efficient implementation of lazy suffix trees
 MESSAGE SEQUENCE CHARTS AND PETRI NETS, CITESEER.NJ.NEC.COM/VANDERAALST99INTERORGANIZATIONAL.HTML
, 1999
Cited by 52 (6 self)
We present an efficient implementation of a write-only top-down construction for suffix trees. Our implementation is based on a new, space-efficient representation of suffix trees which requires only 12 bytes per input character in the worst case, and 8.5 bytes per input character on average for a collection of files of different type. We show how to efficiently implement the lazy evaluation of suffix trees such that a subtree is evaluated not before it is traversed for the first time. Our experiments show that for the problem of searching many exact patterns in a fixed input string, the lazy top-down construction is often faster and more space efficient than other methods.
Sparse suffix trees
 In Proc. 2nd Annual International Conference on Computing and Combinatorics (COCOON), LNCS v. 1090
, 1996
Database indexing for large DNA and protein sequence collections
, 2002
Cited by 31 (3 self)
Our aim is to develop new database technologies for the approximate matching of unstructured string data using indexes. We explore the potential of the suffix tree data structure in this context. We present a new method of building suffix trees, allowing us to build trees in excess of RAM size, which has hitherto not been possible. We show that this method performs in practice as well as the O(n) method of Ukkonen [70]. Using this method we build indexes for 200Mb of protein and 300Mbp of DNA, whose disk image exceeds the available RAM. We show experimentally that suffix trees can be effectively used in approximate string matching with biological data. For a range of query lengths and error bounds the suffix tree reduces the size of the unoptimised O(mn) dynamic programming calculation required in the evaluation of string similarity, and the gain from indexing increases with index size. In the indexes we built this reduction is significant, and less than 0.3% of the expected matrix is evaluated. We detail the requirements for further database and algorithmic research to support efficient use of large suffix indexes in biological applications.
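For reference, the unoptimised O(mn) dynamic programming calculation mentioned above can be sketched as follows. This is the standard approximate matching recurrence with unit costs, where a match may start anywhere in the text. It is assumed here as the baseline and is not the suffix tree method of the paper. The function name is hypothetical.

```python
def best_match_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between pattern and any substring of text.

    Fills the full (m+1) x (n+1) matrix row by row, so it takes O(mn)
    time. Insertions, deletions and substitutions each cost 1 (an
    assumption; biological scoring schemes differ).
    """
    # Row 0 is all zeros: a match may begin at any text position.
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, 1):
        cur = [i]  # matching i pattern characters against empty text
        for j, tc in enumerate(text, 1):
            cur.append(min(
                prev[j] + 1,               # pattern character unmatched
                cur[j - 1] + 1,            # extra text character
                prev[j - 1] + (pc != tc),  # match or substitution
            ))
        prev = cur
    # A match may end at any text position.
    return min(prev)


if __name__ == "__main__":
    print(best_match_distance("ACGT", "TTACGTTT"))  # 0
    print(best_match_distance("ACGT", "TTAGGTTT"))  # 1
```

An index reduces this cost by avoiding most cells of the matrix; the sketch evaluates all of them.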
Musical information retrieval using musical parameters
 In Proceedings of the International Computer Music Conference (ICMC98)
, 1998
Fast sequence clustering using a suffix array algorithm
 Bioinformatics
, 2003
Cited by 30 (6 self)
Motivation: Efficient clustering is important for handling the large amount of available EST sequences. Most contemporary methods are based on some kind of all-against-all comparison, resulting in a quadratic time complexity. A different approach is needed to keep up with the rapid growth of EST data. Results: A new, fast EST clustering algorithm is presented. Subquadratic time complexity is achieved by using an algorithm based on suffix arrays. A prototype implementation has been developed and run on a benchmark data set. The produced clusterings are validated by comparing them to clusterings produced by other methods, and the results are quite promising. Availability: The source code for the prototype implementation is available under a GPL license from
Matching a Set of Strings with Variable Length Don’t Cares
 Theoretical Computer Science 178
, 1997
Cited by 23 (4 self)
Given an alphabet A, a pattern p is a sequence (v1,...,vm) of words from A* called keywords. We represent p as a single word