PSIST: Indexing protein structures using suffix trees
In IEEE Computational Systems Bioinformatics Conference, 2005
Cited by 12 (2 self)
Approaches for indexing proteins, and for fast and scalable searching for structures similar to a query structure, have important applications such as protein structure and function prediction, protein classification and drug discovery. In this paper, we developed a new method for extracting the local feature vectors of protein structures. Each residue is represented by a triangle, and the correlation between a set of residues is described by the distances between Cα atoms and the angles between the normals of the planes in which the triangles lie. The normalized local feature vectors are indexed using a suffix tree. For all query segments, suffix trees can be used effectively to retrieve the maximal matches, which are then chained to obtain alignments with database proteins. Similar proteins are selected by their alignment score against the query. Our results show classification accuracy of up to 97.8% and 99.4% at the superfamily and class levels according to the SCOP classification, and show that on average 7.49 out of 10 proteins from the same superfamily are obtained among the top 10 matches. These results are competitive with the best previous methods.
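As a rough illustration of the kind of local feature this abstract describes, the sketch below computes, for two residue triangles, the distance between one reference atom of each residue and the angle between the normals of the two triangle planes. The choice of triangle vertices, the function names and the coordinates are assumptions made for illustration; this is not the PSIST implementation, and normalization and suffix-tree indexing are not shown.

```python
import math

def sub(a, b):
    """Component-wise difference of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def triangle_normal(p0, p1, p2):
    """Unit normal of the plane through three points."""
    n = cross(sub(p1, p0), sub(p2, p0))
    length = norm(n)
    if length == 0.0:
        raise ValueError("degenerate triangle")
    return (n[0] / length, n[1] / length, n[2] / length)

def pair_feature(tri_a, tri_b, anchor=0):
    """Return (distance between anchor vertices, angle between normals in radians).

    `anchor` selects which vertex stands in for the residue's reference atom;
    that convention is an assumption of this sketch.
    """
    distance = norm(sub(tri_a[anchor], tri_b[anchor]))
    cos_theta = dot(triangle_normal(*tri_a), triangle_normal(*tri_b))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding drift
    return distance, math.acos(cos_theta)

# Two made-up triangles: one in the xy-plane, one in the plane x = 3.
tri_a = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
tri_b = ((3.0, 0.0, 0.0), (3.0, 1.0, 0.0), (3.0, 0.0, 1.0))
distance, angle = pair_feature(tri_a, tri_b)
```

For these two triangles the planes are perpendicular, so the angle is a right angle and the anchor vertices are 3 units apart.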
Search-optimized suffix-tree storage for biological applications
In Proc. 12th IEEE International Conference on High Performance Computing, 2005
Cited by 7 (0 self)
Suffix-trees are popular indexing structures for various sequence processing problems in biological data management. We investigate here the possibility of enhancing the search efficiency of disk-resident suffix-trees through customized layouts of tree-nodes to disk-pages. Specifically, we propose a new layout strategy, called Stellar, that provides significantly improved search performance on a representative set of real genomic sequences. Further, Stellar supports both the standard root-to-leaf lookup queries as well as sophisticated sequence-search algorithms that exploit the suffix-links of suffix-trees. Our results are encouraging with regard to the ultimate objective of seamlessly integrating sequence processing in database engines.
Obtaining provably good performance from suffix trees in secondary storage
In Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 4009, 2006
Cited by 5 (1 self)
Designing external memory data structures for string databases is of significant recent interest due to the proliferation of biological sequence data. The suffix tree is an important indexing structure that provides optimal algorithms for memory-bound data. However, string B-trees provide the best known asymptotic performance in external memory for substring search and update operations. Work on external memory variants of suffix trees has largely focused on constructing suffix trees in external memory or layout schemes for suffix trees that preserve link locality. In this paper, we present a new suffix tree layout scheme for secondary storage and present construction, substring search, insertion and deletion algorithms that are competitive with the string B-tree. For a set of strings of total length $n$, a pattern $p$ and disk blocks of size $B$, we provide a substring search algorithm that uses $O(|p|/B + \log_B n)$ disk accesses. We present algorithms for insertion and deletion of all suffixes of a string of length $m$ that take $O(m \log_B (n + m))$ and $O(m \log_B n)$ disk accesses, respectively. Our results demonstrate that suffix trees can be directly used as efficient secondary storage data structures for string and sequence data.
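To give a feel for the substring-search bound quoted in this abstract, the sketch below evaluates the expression pattern length / B + log_B n for concrete numbers. The constant factors hidden by the O-notation are ignored, and the function name and parameter values are made up for illustration.

```python
import math

def search_io_estimate(pattern_len, block_size, total_len):
    """Evaluate pattern_len / block_size + log_B(total_len), ignoring constants."""
    if block_size < 2:
        raise ValueError("block_size must be at least 2")
    scan_blocks = pattern_len / block_size            # blocks needed to read the pattern
    tree_descent = math.log(total_len, block_size)    # log base B of n
    return scan_blocks + tree_descent

# Hypothetical numbers: a 1024-symbol pattern, 1024-symbol blocks,
# and an indexed collection of 1024**3 symbols.
estimate = search_io_estimate(pattern_len=1024, block_size=1024, total_len=1024 ** 3)
```

With these values the first term contributes 1 and the logarithmic term contributes about 3, so the estimate is approximately 4 block accesses up to constant factors.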
The Suffix Sequoia Index for Approximate String Matching
2003
Cited by 5 (0 self)
We address the problem of approximate string matching over protein, DNA, and RNA strings, using an arbitrary cost matrix. We focus on full-sensitivity searching, equivalent to the Smith-Waterman algorithm. The data structure we propose is compact. To index a text of length n, we use just over 4n bytes. This data structure is amenable for future...
Indexed searching on proteins using a suffix sequoia
Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2004
Cited by 4 (0 self)
Approximate searching on protein sequence data under arbitrary cost models is not supported by database indexing technology. We present a new data structure, the suffix sequoia, which reduces the time complexity of the dynamic programming (DP) matrix calculation required in approximate matching. The data structure is compact: it uses just over 4 bytes per symbol indexed. We show that the time complexity of the DP calculation is $O(q g^d)$ for a pattern of length $q$, alphabet size $g$, and indexing window size $d$. The DP calculation requires no disk access, and can be executed efficiently. The second phase of the algorithm is based on sequential disk access, and appears to be effective. Approximate matching experiments are promising and offer a lot of scope for algorithm refinement and data structure engineering.
High Throughput and Large Capacity Pipelined Dynamic Search Tree on FPGA
Cited by 3 (2 self)
We propose a pipelined Dynamic Search Tree (pDST) on FPGA which offers high throughput for lookup, insert and delete operations as well as the capability to perform in-place incremental updates. Based on the pipelined 2-3 tree data structure, our pDST supports one lookup per clock cycle and maintains tree balance under continual insert and delete operations. A novel buffered update scheme together with a bidirectional linear pipeline allows the pDST to perform one insert or delete operation per $O(\log N)$ cycles ($N$ being the tree capacity) without stalling the lookup operations. Nodes at each pipeline stage are allocated and freed by a free-node chaining mechanism which greatly simplifies the memory management circuit. Our prototype implementation of a 15-level, 32-bit key dual-port pDST requires 192 blocks of 36 Kb BRAMs (64%) and 12.8k LUTs (6.3%) on a Virtex 5 LX330 FPGA. The circuit has a maximum capacity of 96k 32-bit keys and a clock rate of 135 MHz, supporting 242 million lookups and concurrently 3.97 million inserts or deletes per second.
String Searching in Referentially Compressed Genomes
Cited by 2 (2 self)
Keywords: genome compression, referential compression, string search. Background: Improved sequencing techniques have led to large amounts of biological sequence data. One of the challenges in managing sequence data is efficient storage. Recently, referential compression schemes, storing only the differences between a to-be-compressed input and a known reference sequence, gained a lot of interest in this field. However, so far sequences always have to be decompressed prior to an analysis. There is a need for algorithms working on compressed data directly, avoiding costly decompression. Summary: In our work, we address this problem by proposing an algorithm for exact string search over compressed data. The algorithm works directly on referentially compressed genome sequences, without needing an index for each genome and only using partial decompression. Results: Our string search algorithm for referentially compressed genomes performs exact string matching for large sets of genomes faster than using an index structure, e.g. suffix trees, for each genome, especially for short queries. We think that this is an important step towards space and runtime efficient management of large biological data sets.
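The storage idea behind referential compression, keeping only the differences from a known reference, can be sketched as follows. The entry format (tagged tuples for copied and literal segments), the function name and the sequences are assumptions for illustration; the sketch shows full decoding only and does not attempt the paper's search algorithm, which works with partial decompression.

```python
def decompress(reference, entries):
    """Rebuild a sequence from a reference and a list of entries.

    Each entry is either ("ref", start, length), meaning "copy this slice of
    the reference", or ("raw", text), meaning "insert this literal text".
    This entry format is hypothetical.
    """
    parts = []
    for entry in entries:
        kind = entry[0]
        if kind == "ref":
            _, start, length = entry
            parts.append(reference[start:start + length])
        elif kind == "raw":
            parts.append(entry[1])
        else:
            raise ValueError(f"unknown entry kind: {kind!r}")
    return "".join(parts)

# Made-up example: the compressed genome differs from the reference
# by a single substitution at position 5.
reference = "ACGTACGTTTGA"
entries = [("ref", 0, 5), ("raw", "G"), ("ref", 6, 6)]
genome = decompress(reference, entries)
```

Here three short entries describe a 12-symbol sequence; for real genomes that closely resemble the reference, long copied segments are what make the representation compact.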
Constructing Chromosome Scale Suffix Trees
Suffix trees have been the focus of significant research interest as they permit very efficient solutions to a range of string and sequence searching problems. Given a suffix tree that encodes a particular string, it is possible to solve problems such as searching for a specific pattern in time proportional to the length of the pattern rather than the length of the string. Suffix trees can also support inexact matching by dramatically improving the performance of dynamic programming. Therefore, suffix trees may enable a number of large-scale bioinformatics problems to be solved more efficiently than previously thought. However, these benefits presume that a suffix tree of sufficient scale can be constructed. An inherent difficulty in suffix tree construction is that it requires a semi-random walk over the tree as it is constructed. Therefore, very large trees that will not fit in memory could take an unacceptably long time to construct due to excessive page faulting. In this paper we present a linear-time construction algorithm that can construct suffix trees larger than memory using discrete subtrees. The subtrees can be constructed in parallel. The performance of the algorithm is evaluated using suffix trees constructed for chromosomes 1 and 12 of the human genome.
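The claim that a pattern can be located in time proportional to its own length can be seen in a deliberately naive sketch. The code below builds an uncompressed suffix trie with quadratic construction cost, a simplification chosen for brevity; it is not the linear-time, subtree-based construction this paper presents, and the function names are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a nested-dict trie.

    Quadratic in len(text); a real suffix tree compresses edges and is
    built in linear time.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """Follow one edge per pattern character: len(pattern) dict lookups."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("GATTACA")
found = contains(trie, "TTAC")    # substring of the text
missing = contains(trie, "TAG")   # not a substring of the text
```

The lookup cost depends only on the pattern: the walk stops after at most one step per pattern character, however long the indexed text is.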