Reducing the Space Requirement of Suffix Trees
- Software – Practice and Experience
, 1999
Cited by 109 (10 self)
We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different type. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained. Copyright © 1999 John Wiley & Sons, Ltd. KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
q-gram based database searching using a suffix array (QUASAR)
- Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB 99)
, 1999
Cited by 59 (5 self)
With the increasing amount of DNA sequence information deposited in public databases, searching for similarity to a query sequence has become a basic operation in molecular biology. But even today’s fast algorithms reach their limits when applied to all-versus-all comparisons of large databases. Here we present a new database searching algorithm called QUASAR (Q-gram Alignment based on Suffix ARrays) which was designed to quickly detect sequences with strong similarity to the query in a context where many searches are conducted on one database. Our algorithm applies a modification of q-tuple filtering implemented on top of a suffix array. Two versions were developed, one for a RAM resident suffix array and one for access to the suffix array on disk. We compared our implementation with BLAST and found that our approach is an order of magnitude faster. It is, however, restricted to the search for strongly similar DNA sequences as is typically required, e.g., in the context of clustering expressed sequence tags (ESTs).
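The q-gram lookup that such filtering relies on can be illustrated with a small sketch. This is not the QUASAR implementation: the function names `build_suffix_array` and `qgram_positions` are hypothetical, the suffix array is built naively (suitable only for short strings), and the filtering and alignment stages are omitted. It only shows how a binary search over a suffix array yields every position of a given q-gram.

```python
def build_suffix_array(text):
    # Naive construction (an assumption for brevity): sort the suffix
    # start positions by comparing the suffixes themselves.
    return sorted(range(len(text)), key=lambda i: text[i:])


def qgram_positions(text, sa, qgram):
    # Suffixes are sorted, so their length-q prefixes are non-decreasing.
    # Two binary searches delimit the block of suffixes starting with qgram.
    q = len(qgram)

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose q-prefix is >= qgram
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + q] < qgram:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    hi = len(sa)
    while lo < hi:  # first suffix whose q-prefix is > qgram
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + q] <= qgram:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


if __name__ == "__main__":
    dna = "ACGTACGA"
    sa = build_suffix_array(dna)
    print(qgram_positions(dna, sa, "ACG"))
```

Once the suffix array exists, each lookup costs a logarithmic number of q-character comparisons, which is why the approach suits a setting where many queries are run against one database.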
Database indexing for large DNA and protein sequence collections
, 2002
Cited by 20 (3 self)
Our aim is to develop new database technologies for the approximate matching of unstructured string data using indexes. We explore the potential of the suffix tree data structure in this context. We present a new method of building suffix trees, allowing us to build trees in excess of RAM size, which has hitherto not been possible. We show that this method performs in practice as well as the O(n) method of Ukkonen [70]. Using this method we build indexes for 200Mb of protein and 300Mbp of DNA, whose disk-image exceeds the available RAM. We show experimentally that suffix trees can be effectively used in approximate string matching with biological data. For a range of query lengths and error bounds the suffix tree reduces the size of the unoptimised O(mn) dynamic programming calculation required in the evaluation of string similarity, and the gain from indexing increases with index size. In the indexes we built this reduction is significant, and less than 0.3% of the expected matrix is evaluated. We detail the requirements for further database and algorithmic research to support efficient use of large suffix indexes in biological applications.
Compressed index for a dynamic collection of texts
- In Proc. CPM’04, LNCS 3109
, 2004
Cited by 16 (1 self)
Let T be a string with n characters over an alphabet of bounded size. The recent breakthrough on compressed indexing allows us to build an index for T in optimal space (i.e., O(n) bits), while supporting very efficient pattern matching [2, 4]. This paper extends the work on optimal-space indexing to a dynamic collection of texts. Precisely, we give a compressed index using O(n) bits where n is the total length of texts, such that searching for a pattern P takes O(|P| log n + occ log^2 n) time where occ is the number of occurrences, and inserting or deleting a text T takes O(|T| log n) time.
Suffix Tree Construction for Large Strings
, 2002
Cited by 2 (0 self)
Our aim is the development of algorithms for the efficient construction of suffix trees of very large strings. We present an algorithm that improves upon results presented by Hunt, Atkinson and Irving (Proc. VLDB, 2001). Our algorithm is stable w.r.t. degeneration of the suffix tree, and its expected construction time is O(n log n) rather than (n ).
An experimental study of SB-trees
- In ACM-SIAM symposium on Discrete Algorithms
, 1996
Cited by 2 (0 self)
In a previous work of ours [13], we proposed a text indexing data structure for external memory, which we called SB-tree, that combines the best B-tree and suffix array qualities to overcome the limitations of inverted files, suffix arrays, suffix trees, and prefix B-trees. In this paper, we study the performance of SB-trees in a practical setting by running a large number of searching and updating experiments. We obtain fast practical performance by means of a new space-efficient and alphabet-independent organization of SB-tree nodes and a new batch insertion procedure that avoids thrashing.
1 Introduction
Textual data in electronic form are more available than before and range from published documents (e.g., electronic dictionaries, libraries and archives, etc.) to private databases (e.g., marketing information, legal records, medical histories, etc.). Online providers of legal and newswire texts (such as Westlaw and Lexis-Nexis) already have hundreds of text gigabytes and will have...
Approximate Multiple String Searching by Clustering
We are given a finite set S of text strings and a pattern P over some fixed alphabet Σ. The topic of this paper is the design of a data structure D(S) which supports approximate multiple string searching queries efficiently. Thereby, for a given upper bound k ∈ Z+ on the allowable distance, P = p_1 ... p_m is said to appear approximately in a text T = t_1 ... t_n, m, n ∈ Z+, if there exist positions u, v in T such that the edit distance between P and t_u ... t_v is at most k. Let N denote the sum of the lengths of all strings in S. We present an algorithm that constructs the data structure D(S) in O(N) time and space. Afterwards, an approximate multiple string search query can be answered in O(N) expected time if the allowable distance k is bounded above by O(m / log m). The method can be used to search large nucleotide and amino acid sequence databases for similar sequences.
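The definition of an approximate appearance can be made concrete with a short sketch. This is the textbook O(mn) dynamic programming check for a single text, not the clustering data structure D(S) described in the abstract; the function name `approx_end_positions` is hypothetical. It reports every 1-based end position v for which some substring ending at v is within edit distance k of the pattern.

```python
def approx_end_positions(pattern, text, k):
    m = len(pattern)
    # prev[i] holds the smallest edit distance between pattern[:i] and any
    # substring of the text ending at the previous text position.
    # Before any text is read, pattern[:i] can only be matched by i deletions.
    prev = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        # cur[0] = 0: an occurrence may start anywhere in the text.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in the text
                cur[i - 1] + 1,      # pattern character missing from the text
            )
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


if __name__ == "__main__":
    print(approx_end_positions("abc", "xxabcxx", 0))
```

Running this check against every string in S costs time proportional to m times N, which is the baseline that an index over S aims to improve on.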
Multiple organism algorithm for finding ultraconserved elements
- BMC Bioinformatics (BioMed Central), Methodology article
, 2008
This is an Open Access article distributed under the terms of the Creative Commons Attribution License

