Results 1 - 5 of 5
Compressing Integers for Fast File Access
The Computer Journal, 1999
Cited by 51 (13 self)
In this paper we show experimentally that, for large or small collections, storing integers in a compressed format reduces the time required for either sequential stream access or random access. We compare different approaches to compressing integers, including the Elias gamma and delta codes, Golomb coding, and a variable-byte integer scheme. As a conclusion, we recommend that, for fast access to integers, files be stored compressed.
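As a rough illustration of two of the schemes named in this abstract, the sketch below shows one common variable-byte layout and the textbook Elias gamma code. It is not taken from the paper: the function names are hypothetical, and the choices of least-significant group first and a set high bit marking the final byte are assumptions, since the abstract does not specify a byte layout.

```python
def vbyte_encode(numbers):
    """Variable-byte encode non-negative integers, 7 data bits per byte.

    Assumed convention: low-order 7-bit groups come first, and the high
    bit is set only on the last byte of each integer.
    """
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("only non-negative integers are supported")
        while n >= 128:
            out.append(n & 0x7F)   # continuation byte, high bit clear
            n >>= 7
        out.append(n | 0x80)       # final byte, high bit set
    return bytes(out)


def vbyte_decode(data):
    """Invert vbyte_encode, returning the list of integers."""
    numbers = []
    value = 0
    shift = 0
    for byte in data:
        if byte & 0x80:            # final byte of the current integer
            numbers.append(value | ((byte & 0x7F) << shift))
            value = 0
            shift = 0
        else:
            value |= byte << shift
            shift += 7
    return numbers


def elias_gamma(n):
    """Textbook Elias gamma code for n >= 1, returned as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1")
    binary = bin(n)[2:]
    # len(binary) - 1 zeros announce the length, then the binary digits follow.
    return "0" * (len(binary) - 1) + binary
```

Small values take a single byte under the variable-byte scheme, which is why it suits lists dominated by small integers; the gamma code is bit-aligned and so needs bit-level I/O to be used in a real file.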
Indexing and Retrieval for Genomic Databases
IEEE Transactions on Knowledge and Data Engineering, 2002
Cited by 40 (6 self)
Genomic sequence databases are widely used by molecular biologists for homology searching. Amino-acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences only and to reduce the costs of the alignments that are attempted. We present an index-based approach for both selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in significant savings in computationally intensive local alignments, and that index-based searching is as accurate as existing exhaustive search schemes.
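The general idea of using an index to select candidate sequences before running expensive alignments can be sketched as follows. This is only an illustration under assumptions: the abstract does not describe the index structure, so a plain inverted index over fixed-length substrings (k-mers) is swapped in here, and the names `build_index` and `rank_candidates` and the parameters `k` and `top` are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(sequences, k=3):
    """Map each length-k substring to the set of sequence ids containing it."""
    index = defaultdict(set)
    for seq_id, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def rank_candidates(index, query, k=3, top=2):
    """Return ids of the sequences sharing the most distinct k-mers with the query.

    Only these candidates would then be passed to a local alignment step.
    """
    hits = Counter()
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    for kmer in query_kmers:
        for seq_id in index.get(kmer, ()):
            hits[seq_id] += 1
    return [seq_id for seq_id, _ in hits.most_common(top)]
```

Sequences that share no k-mer with the query never appear in the ranking, so they are never aligned, which is where the saving in alignment cost would come from.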
A deterministic finite automaton for faster protein hit detection in BLAST
Journal of Computational Biology, 2005
Cited by 9 (3 self)
BLAST is the most popular bioinformatics tool and is used to run millions of queries each day. However, evaluating such queries is slow, taking typically minutes on modern workstations. Therefore, continuing evolution of BLAST, by improving its algorithms and optimisations, is essential to improve search times in the face of exponentially-increasing collection sizes. We present an optimisation to the first stage of the BLAST algorithm specifically designed for protein search. It produces the same results as NCBI-BLAST but in around 59% of the time on Intel-based platforms; we also present results for other popular architectures. Overall, this is a saving of around 15% of the total typical BLAST search time. Our approach uses a deterministic finite automaton (DFA), inspired by the original scheme used in the 1990 BLAST algorithm. The techniques are optimised for modern hardware, making careful use of cache-conscious approaches to improve speed. Our optimised DFA approach has been integrated into a new version of BLAST that is freely available for download at
Comparing Compressed Sequences for Faster Nucleotide BLAST Searches
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Cited by 2 (0 self)
Molecular biologists, geneticists, and other life scientists use the blast homology search package as their first step for discovery of information about unknown or poorly annotated genomic sequences. There are two main variants of blast: blastp for searching protein collections and blastn for nucleotide collections. Surprisingly, blastn has had very little attention; for example, the algorithms it uses do not follow those described in the 1997 blast paper (Altschul, Madden, Schaffer, Zhang, Zhang, Miller & Lipman 1997) and no exact description has been published. It is important that blastn is state-of-the-art: nucleotide collections such as GenBank dwarf the protein collections in size, they double in size almost yearly, and take many minutes to search on modern general-purpose workstations. This paper proposes significant improvements to the blastn algorithms. Each of our schemes is based on compressed byte-packed formats that allow queries and collection sequences to be compared four bases at a time, permitting very fast query evaluation using lookup tables and numeric comparisons. Our most significant innovations are two new, fast gapped alignment schemes that allow accurate sequence alignment without decompression of the collection sequences. Overall, our innovations more than double the speed of blastn with no effect on accuracy and have been integrated into our new version of blast that is freely available for download from
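To make "compared four bases at a time" concrete, here is a minimal sketch of byte-packing nucleotides at two bits per base and counting matching positions with a 256-entry lookup table. It is not the paper's scheme: the 2-bit code assignment, the XOR-then-lookup trick, the padding with `A`, and the names `pack`, `MATCHES`, and `count_matches` are all assumptions made for illustration, and wildcard bases are ignored.

```python
# Assumed 2-bit code for each base; the paper's actual assignment is not given.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack a nucleotide string into bytes, four bases per byte.

    The final byte is padded with 'A' (code 00) when len(seq) is not a
    multiple of four.
    """
    padded = seq + "A" * (-len(seq) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        byte = 0
        for base in padded[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)


# MATCHES[x] is the number of 2-bit groups of x equal to 00. After XOR-ing
# two packed bytes, a 00 group means the bases at that position agree.
MATCHES = [
    sum(1 for shift in (0, 2, 4, 6) if (x >> shift) & 3 == 0)
    for x in range(256)
]


def count_matches(a, b):
    """Count positions where two equal-length sequences hold the same base."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    padding = -len(a) % 4  # padded positions always match, so subtract them
    total = sum(MATCHES[x ^ y] for x, y in zip(pack(a), pack(b)))
    return total - padding
```

In a real search tool the collection would be stored already packed, so only the query needs packing at search time and each table lookup settles four positions at once.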
Compact In-Memory Models for Compression for Large Text Databases
1999
Cited by 1 (0 self)
For compression of text databases, semi-static word-based models are a pragmatic choice. They provide good compression with a model of moderate size, and allow independent decompression of stored documents. Previous experiments have shown that, where there is not sufficient memory to store a full word-based model, encoding rare words as sequences of characters can still allow good compression, while a pure character-based model is poor. In addition, there are other kinds of semi-static model that can be used for text, such as word pairs. We propose a further kind of model that reduces main memory costs of a word-based model: approximate models, in which rare words are represented by similarly-spelt common words and a sequence of edits. We investigate the compression available with different memory-efficient models, including characters, words, word pairs, and edits, and with combinations of these approaches. We show experimentally that carefully chosen combinations of models can significantly improve the compression available in limited memory and greatly reduce overall memory requirements.

