Reducing the Space Requirement of Suffix Trees
 Software – Practice and Experience
, 1999
Cited by 144 (12 self)
We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different type. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained. Copyright © 1999 John Wiley & Sons, Ltd. KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
q-gram based database searching using a suffix array (QUASAR)
 Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB 99)
, 1999
Cited by 82 (7 self)
With the increasing amount of DNA sequence information deposited in public databases, searching for similarity to a query sequence has become a basic operation in molecular biology. But even today’s fast algorithms reach their limits when applied to all-versus-all comparisons of large databases. Here we present a new database searching algorithm called QUASAR (Q-gram Alignment based on Suffix ARrays) which was designed to quickly detect sequences with strong similarity to the query in a context where many searches are conducted on one database. Our algorithm applies a modification of q-tuple filtering implemented on top of a suffix array. Two versions were developed, one for a RAM resident suffix array and one for access to the suffix array on disk. We compared our implementation with BLAST and found that our approach is an order of magnitude faster. It is, however, restricted to the search for strongly similar DNA sequences as is typically required, e.g., in the context of clustering expressed sequence tags (ESTs).
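The idea of q-gram filtering on top of a suffix array can be illustrated with a small sketch. This is not the QUASAR implementation: the function names, the fixed non-overlapping blocks, and the `block_size` and `min_hits` parameters are assumptions made for illustration, and the suffix array is built naively. Each q-gram of the query is looked up by binary search, hits are counted per database block, and only blocks with enough hits are kept as candidates for a later alignment step.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def build_suffix_array(text):
    # Naive construction by sorting suffixes; adequate only for small inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def candidate_blocks(text, query, q, block_size, min_hits):
    """Return indices of text blocks that share at least min_hits q-gram hits with query.

    block_size and min_hits are hypothetical tuning parameters for this sketch.
    """
    sa = build_suffix_array(text)
    # Length-q prefixes of the sorted suffixes are themselves in sorted order,
    # so every q-gram occupies one contiguous range that bisect can find.
    prefixes = [text[i:i + q] for i in sa]
    counts = Counter()
    for j in range(len(query) - q + 1):
        gram = query[j:j + q]
        lo = bisect_left(prefixes, gram)
        hi = bisect_right(prefixes, gram)
        for pos in sa[lo:hi]:
            counts[pos // block_size] += 1
    return sorted(block for block, hits in counts.items() if hits >= min_hits)


if __name__ == "__main__":
    database = "TTTTTTTTACGTACGG"
    print(candidate_blocks(database, "ACGTAC", q=3, block_size=8, min_hits=3))
```

In this toy database, the second block (index 1) collects most of the q-gram hits, so it would be the only region handed to a more expensive alignment step.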
Database indexing for large DNA and protein sequence collections
, 2002
Cited by 31 (3 self)
Our aim is to develop new database technologies for the approximate matching of unstructured string data using indexes. We explore the potential of the suffix tree data structure in this context. We present a new method of building suffix trees, allowing us to build trees in excess of RAM size, which has hitherto not been possible. We show that this method performs in practice as well as the O(n) method of Ukkonen [70]. Using this method we build indexes for 200 Mb of protein and 300 Mbp of DNA, whose disk image exceeds the available RAM. We show experimentally that suffix trees can be effectively used in approximate string matching with biological data. For a range of query lengths and error bounds the suffix tree reduces the size of the unoptimised O(mn) dynamic programming calculation required in the evaluation of string similarity, and the gain from indexing increases with index size. In the indexes we built this reduction is significant, and less than 0.3% of the expected matrix is evaluated. We detail the requirements for further database and algorithmic research to support efficient use of large suffix indexes in biological applications.
Compressed index for a dynamic collection of texts
 In Proc. CPM’04, LNCS 3109
, 2004
Cited by 24 (1 self)
Let T be a string with n characters over an alphabet of bounded size. The recent breakthrough on compressed indexing allows us to build an index for T in optimal space (i.e., O(n) bits), while supporting very efficient pattern matching [2, 4]. This paper extends the work on optimal-space indexing to a dynamic collection of texts. Precisely, we give a compressed index using O(n) bits where n is the total length of texts, such that searching for a pattern P takes O(|P| log n + occ log^2 n) time where occ is the number of occurrences, and inserting or deleting a text T takes O(|T| log n) time.
Suffix Tree Construction for Large Strings
, 2002
Cited by 3 (0 self)
Our aim is the development of algorithms for the efficient construction of suffix trees of very large strings. We present an algorithm that improves upon results presented by Hunt, Atkinson and Irving (Proc. VLDB, 2001). Our algorithm is stable w.r.t. degeneration of the suffix tree, and its expected construction time is O(n log n) rather than (n ).
An experimental study of SB-trees
 In ACM-SIAM Symposium on Discrete Algorithms
, 1996
On the Construction and Application of Compressed Text Indexes
, 2004
Cited by 1 (0 self)
Text indexing involves the preprocessing of a text to build a data structure which enables efficient matching of this text against any pattern afterwards. Traditional indexes like suffix trees and suffix arrays offer superb matching performance in theory, and have found numerous applications in practice. However, they consume a lot of space. For a text of length n, they have a space requirement of O(n log n) bits, and this demanding space requirement has limited their use for indexing long texts such as DNA sequences. Recent developments, notably the Compressed Suffix Arrays (CSA) of Grossi and Vitter, and the FM-index of Ferragina and Manzini, have offered hope of reducing the space requirements of traditional indexes. These two indexes require space linear in the number of bits of the original text, and the matching performance is only slowed down by a polylog(n) factor. However, for such an index to be useful, it must have a space-efficient construction. In this study, we improve an existing construction algorithm for the CSA, and show that the CSA and FM-index can be constructed in O(n log n) time using optimal space. We also
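To make the FM-index idea concrete, here is a minimal sketch of backward search over the Burrows-Wheeler transform. It is an uncompressed teaching version, not the construction algorithm of the study above: it keeps the full suffix array and full occurrence tables, so it uses far more than linear bits. The function names and the use of `"\0"` as a sentinel (assumed absent from the text) are choices made for this illustration.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, C table, and prefix occurrence counts."""
    text = text + "\0"  # sentinel; assumed not to occur in the input text
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix sort
    bwt = [text[i - 1] for i in sa]  # character preceding each sorted suffix
    alphabet = sorted(set(text))
    # C[c] = number of characters in text that are strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ


def locate(index, pattern):
    """Return sorted start positions of pattern, scanning the pattern right to left."""
    sa, C, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix array rows
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    index = build_fm_index("banana")
    print(locate(index, "ana"))
```

Each pattern character costs two table lookups, which is where the O(|P|) counting behaviour comes from; real compressed indexes replace the plain `occ` tables and the stored suffix array with succinct structures, which is the source of the polylog(n) slowdown mentioned above.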
Approximate Multiple String Searching by Clustering
We are given a finite set S of text strings and a pattern P over some fixed alphabet Σ. The topic of this paper is the design of a data structure D(S) which supports approximate multiple string searching queries efficiently. Thereby, for a given upper bound k ∈ Z+ on the allowable distance, P = p_1 ... p_m is said to appear approximately in a text T = t_1 ... t_n, m, n ∈ Z+, if there exist positions u, v in T such that the edit distance between P and t_u ... t_v is at most k. Let N denote the sum of the lengths of all strings in S. We present an algorithm that constructs the data structure D(S) in O(N) time and space. Afterwards, an approximate multiple string search query can be answered in O(N) expected time if the allowable distance k is bounded above by O(m / log m). The method can be used to search large nucleotide and amino acid sequence databases for similar sequences.
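The definition of an approximate occurrence can be checked directly with the classical dynamic programming approach, sketched below. This is the plain O(mn) baseline for a single text, not the clustering data structure D(S) proposed in the paper; the function name and the choice to report 1-based end positions v are assumptions for illustration.

```python
def approximate_occurrences(pattern, text, k):
    """Return 1-based end positions v such that some substring t_u ... t_v of text
    has edit distance at most k to pattern."""
    m = len(pattern)
    # column[i] = minimum edit distance between pattern[:i] and any substring
    # of the text ending at the current position. Before any text is read,
    # matching pattern[:i] against the empty string costs i.
    column = list(range(m + 1))
    ends = []
    for v, tc in enumerate(text, start=1):
        prev_diag = column[0]  # column[0] stays 0: an occurrence may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == tc else 1
            new = min(
                prev_diag + cost,   # match or substitute
                column[i] + 1,      # extra character in the text
                column[i - 1] + 1,  # character of the pattern left unmatched
            )
            prev_diag = column[i]
            column[i] = new
        if column[m] <= k:
            ends.append(v)
    return ends


if __name__ == "__main__":
    print(approximate_occurrences("abc", "xxabcxx", 1))
```

Running this over every string in S costs time proportional to m times N, which is the kind of cost an index-based method aims to avoid.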
Multiple organism algorithm for finding ultraconserved elements
 BMC Bioinformatics
, 2008
This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
MOTIPS: Automated Motif Analysis for Predicting Targets of Modular Protein Domains
, 2010
This Provisional PDF corresponds to the article as it appeared upon acceptance. Fully formatted PDF and full text (HTML) versions will be made available soon.