Indexing and Retrieval for Genomic Databases
- IEEE Transactions on Knowledge and Data Engineering, 2002
Cited by 40 (6 self)
Genomic sequence databases are widely used by molecular biologists for homology searching. Amino-acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences only and to reduce the costs of the alignments that are attempted. We present an index-based approach both for selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in significant savings in computationally intensive local alignments, and that index-based searching is as accurate as existing exhaustive search schemes.
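The coarse selection step described in this abstract can be illustrated with a small sketch. It assumes overlapping fixed-length substrings (k-mers) as index terms and the count of shared distinct k-mers as the similarity measure; both choices, and the names `build_index` and `coarse_rank`, are illustrative assumptions and not necessarily what the paper uses. Only the top-ranked candidates would then be passed to an expensive local alignment.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(collection, k):
    """Map each k-mer to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in collection.items():
        for term in kmers(seq, k):
            index[term].add(seq_id)
    return index

def coarse_rank(index, query, k, top):
    """Rank sequences by how many distinct query k-mers they contain."""
    hits = Counter()
    for term in set(kmers(query, k)):
        for seq_id in index.get(term, ()):
            hits[seq_id] += 1
    return [seq_id for seq_id, _ in hits.most_common(top)]

# Toy collection (hypothetical data, for illustration only).
collection = {
    "s1": "ACGTACGTGA",
    "s2": "TTTTGGGGCC",
    "s3": "ACGTTTGACA",
}
index = build_index(collection, k=4)
candidates = coarse_rank(index, "ACGTACGA", k=4, top=2)
print(candidates)  # s1 shares four query k-mers, s3 shares one, s2 none
```

Sequences that share no k-mer with the query never enter the ranking, which is where the savings in alignment work would come from in a scheme of this kind.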
The ND-Tree: A Dynamic Indexing Technique for Multidimensional Non-ordered Discrete Data Spaces
- In Proc. of VLDB, 2003
Cited by 12 (6 self)
Similarity searches in multidimensional Non-ordered Discrete Data Spaces (NDDS) are becoming increasingly important for application areas such as genome sequence databases.
Comparing Compressed Sequences for Faster Nucleotide BLAST Searches
- IEEE/ACM Transactions on Computational Biology and Bioinformatics
Cited by 2 (0 self)
Molecular biologists, geneticists, and other life scientists use the BLAST homology search package as their first step for discovery of information about unknown or poorly annotated genomic sequences. There are two main variants of BLAST: blastp for searching protein collections and blastn for nucleotide collections. Surprisingly, blastn has had very little attention; for example, the algorithms it uses do not follow those described in the 1997 BLAST paper (Altschul, Madden, Schaffer, Zhang, Zhang, Miller & Lipman 1997) and no exact description has been published. It is important that blastn is state-of-the-art: nucleotide collections such as GenBank dwarf the protein collections in size, they double in size almost yearly, and they take many minutes to search on modern general-purpose workstations. This paper proposes significant improvements to the blastn algorithms. Each of our schemes is based on compressed byte-packed formats that allow queries and collection sequences to be compared four bases at a time, permitting very fast query evaluation using lookup tables and numeric comparisons. Our most significant innovations are two new, fast gapped alignment schemes that allow accurate sequence alignment without decompression of the collection sequences. Overall, our innovations more than double the speed of blastn with no effect on accuracy and have been integrated into our new version of BLAST that is freely available for download from
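The idea of comparing byte-packed sequences four bases at a time can be sketched as follows. This is a minimal illustration and not the paper's algorithm: it assumes a 2-bit code (A=0, C=1, G=2, T=3), sequences whose length is a multiple of four, and no ambiguity codes. The helper names `pack`, `MATCHES`, and `count_matches` are hypothetical. A 256-entry table, indexed by the XOR of two packed bytes, gives the number of matching bases in that byte.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding

def pack(seq):
    """Pack a nucleotide string into bytes, four bases per byte.

    Assumes len(seq) is a multiple of 4 and only A/C/G/T occur.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

def _zero_fields(x):
    """Count the 2-bit fields of x that are zero (i.e. matching bases)."""
    return sum(1 for shift in (0, 2, 4, 6) if (x >> shift) & 3 == 0)

# Lookup table: XOR of two packed bytes -> number of matching bases.
MATCHES = [_zero_fields(x) for x in range(256)]

def count_matches(a, b):
    """Count matching bases between two packed sequences, one byte at a time."""
    return sum(MATCHES[x ^ y] for x, y in zip(a, b))

query = pack("ACGTACGT")
subject = pack("ACGTACCT")
print(count_matches(query, subject))  # 7: one mismatch in the second byte
print(query[0] == subject[0])         # True: a four-base exact match is one numeric comparison
```

The sequences are never unpacked during the comparison; each step is a table lookup or an integer comparison on a whole byte.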
Fast Database Indexing for Large Protein Sequence Collections Using Parallel N-Gram Transformation Algorithm
With the rapid development of the life sciences and the flood of genomic information, the need for faster and more scalable search methods has become urgent. One approach that has been investigated is indexing. Indexing methods fall into three categories: length-based algorithms, transformation-based algorithms, and mixed-technique algorithms. In this research, we focus on transformation-based methods. We embed the N-gram method into the transformation-based method to build an inverted index table. We then apply parallel methods to speed up index building and to reduce the overall retrieval time when querying the genomic database. Our experiments show that the N-gram transformation algorithm is an economical solution that saves both time and space. The index is smaller than the dataset when the N-gram size is 5 or 6. The results of the parallel N-gram transformation algorithm indicate that parallel programming on large datasets is promising and can be improved further.
Keywords: Biological sequence, Database index, N-gram indexing, Parallel computing, Sequence retrieval.
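One way to picture the parallel index build is a partition-and-merge scheme: split the records into chunks, build a partial inverted index per chunk, and merge the partial indexes. The sketch below is an assumption about how this could be arranged, not the authors' implementation; `index_chunk` and `build_parallel` are hypothetical names, and threads are used only to keep the example portable (CPU-bound work in CPython would normally need processes to see a speed-up).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def index_chunk(chunk, n):
    """Build a partial inverted index (N-gram -> set of ids) for one chunk."""
    partial = defaultdict(set)
    for seq_id, seq in chunk:
        for i in range(len(seq) - n + 1):
            partial[seq[i:i + n]].add(seq_id)
    return partial

def build_parallel(records, n, workers=2):
    """Index each chunk in a worker, then merge into N-gram -> sorted ids."""
    chunks = [records[i::workers] for i in range(workers)]
    merged = defaultdict(set)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(lambda c: index_chunk(c, n), chunks):
            for gram, ids in partial.items():
                merged[gram] |= ids
    return {gram: sorted(ids) for gram, ids in merged.items()}

# Toy protein records (hypothetical data, for illustration only).
records = [(0, "MKVLAAGIV"), (1, "GIVKVLAAM"), (2, "MKTTTAAGI")]
index = build_parallel(records, n=3)
print(index["KVL"])  # ids of the records containing the 3-gram "KVL"
```

Because the merge is a set union per N-gram, the result does not depend on how the records are split across workers.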

