Efficient Single-Pass Index Construction for Text Databases
- Jour. of the American Society for Information Science and Technology, 2003
Cited by 31 (2 self)
Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this paper, we review the principal approaches to inversion, analyse their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approaches and propose a single-pass inversion method that, in contrast to previous approaches, does not require the complete vocabulary of the indexed collection in main memory, can operate within limited resources, and does not sacrifice speed with high temporary storage requirements. We show that the performance of the single-pass approach can be improved by constructing inverted files in segments, reducing the cost of disk accesses during inversion of large volumes of data.
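The single-pass idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `single_pass_invert` and `max_postings` are hypothetical, whitespace tokenisation is an assumption, and the sorted runs are kept as in-memory lists standing in for the temporary files a real system would write to disk. Postings accumulate in memory, a sorted run is emitted whenever the budget is reached, and the runs are merged at the end.

```python
import heapq
from collections import defaultdict


def flush_run(postings):
    """Turn the in-memory postings into one run sorted by term."""
    return sorted(postings.items())


def single_pass_invert(docs, max_postings=4):
    """Invert docs in one pass over the text.

    max_postings is a hypothetical memory budget: once that many postings
    have accumulated, they are flushed as a sorted run (assumption: a real
    system would write each run to disk instead of keeping it in a list).
    """
    runs, postings, count = [], defaultdict(list), 0
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            postings[term].append(doc_id)
            count += 1
        if count >= max_postings:
            runs.append(flush_run(postings))
            postings, count = defaultdict(list), 0
    if postings:
        runs.append(flush_run(postings))

    # Merge the sorted runs; documents are processed in order, so the
    # posting lists of a term arrive in increasing document order.
    index = {}
    for term, doc_ids in heapq.merge(*runs):
        index.setdefault(term, []).extend(doc_ids)
    return index


docs = ["the cat sat", "the dog sat", "a cat ran"]
index = single_pass_invert(docs, max_postings=4)
```

With these three documents the first two are flushed together as one run and the third forms a second run, so the merge step is exercised even on a tiny input.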
OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences
- In VLDB, 2003
Cited by 27 (4 self)
A common query against large protein and gene sequence data sets is to locate targets that are similar to an input query sequence. The current set of popular search tools, such as BLAST, employ heuristics to improve the speed of such searches. However, such heuristics can sometimes miss targets, which in many cases is undesirable.
Burst Tries: A Fast, Efficient Data Structure for String Keys
- ACM Transactions on Information Systems, 2002
Cited by 21 (10 self)
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
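As a rough illustration of the structure this abstract describes, the sketch below keeps strings in small containers hanging off trie nodes and "bursts" a container into a new trie node once it grows too large. Everything here is an assumption made for illustration: the names `Node`, `insert`, `burst`, and `items`, the use of plain dicts as containers, and the threshold `limit`. The paper's own container types and burst heuristics are not reproduced.

```python
class Node:
    """Trie node: children map one character to a Node or a dict container."""

    def __init__(self):
        self.children = {}
        self.count = 0  # occurrences of the string that ends exactly here


def burst(container):
    """Replace a container (suffix -> count) by a trie node with sub-containers."""
    node = Node()
    for suffix, count in container.items():
        if suffix == "":
            node.count = count
        else:
            node.children.setdefault(suffix[0], {})[suffix[1:]] = count
    return node


def insert(root, word, limit=2):
    """Insert word, bursting any container that exceeds `limit` entries."""
    node, i = root, 0
    while i < len(word):
        ch = word[i]
        child = node.children.setdefault(ch, {})
        if isinstance(child, Node):
            node, i = child, i + 1
            continue
        suffix = word[i + 1:]
        child[suffix] = child.get(suffix, 0) + 1
        if len(child) > limit:
            node.children[ch] = burst(child)
        return
    node.count += 1


def items(node, prefix=""):
    """Yield (string, count) pairs in lexicographic order."""
    if node.count:
        yield prefix, node.count
    for ch in sorted(node.children):
        child = node.children[ch]
        if isinstance(child, Node):
            yield from items(child, prefix + ch)
        else:
            for suffix in sorted(child):
                yield prefix + ch + suffix, child[suffix]


root = Node()
for w in "the cat then than the to cat".split():
    insert(root, w, limit=2)
result = list(items(root))
```

A sub-container created by a burst may briefly hold more than `limit` entries; in this sketch it is burst on the next insertion that reaches it.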
n-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure
- In VLDB, 2005
Cited by 19 (1 self)
The n-gram inverted index has two major advantages: it is language-neutral and error-tolerant. Due to these advantages, it has been widely used in information retrieval or in similar sequence matching for DNA and protein databases. Nevertheless, the n-gram inverted index also has drawbacks: its size tends to be very large, and query performance tends to be poor. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index) that significantly reduces the size and improves the query performance while preserving the advantages of the n-gram inverted index. The proposed index eliminates the redundancy of the position information that exists in the n-gram inverted index. The proposed index is constructed in two steps: 1) extracting subsequences of length m from documents and 2) extracting n-grams from those subsequences. We formally prove that this two-step construction is identical to the relational normalization process that removes the redundancy caused by a non-trivial multivalued dependency. The n-gram/2L index has excellent properties: 1) it significantly reduces the size and improves the performance compared with the n-gram inverted index, with these improvements becoming more marked as the database size gets larger; 2) the query processing time increases only very slightly as the query length gets longer.
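The two construction steps can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the function names, the dict-based storage, and in particular the choice to let consecutive subsequences overlap by n-1 characters (so that no n-gram spanning a boundary is lost) are assumptions. Step 1 fills a back-end index from subsequences to document positions; step 2 fills a front-end index from n-grams to positions inside subsequences.

```python
from collections import defaultdict


def build_ngram_2l(docs, m=4, n=2):
    """Build (back, front) indexes in two steps.

    back:  subsequence -> [(doc_id, offset of subsequence in document)]
    front: n-gram      -> [(subsequence, offset of n-gram in subsequence)]
    Assumption: subsequences start every m-n+1 characters, so neighbours
    overlap by n-1 characters and every n-gram lies inside exactly one.
    """
    assert m >= n >= 1
    step = m - n + 1
    back = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for off in range(0, len(text) - n + 1, step):
            back[text[off:off + m]].append((doc_id, off))

    front = defaultdict(list)
    for sub in back:  # each distinct subsequence is indexed only once
        for off in range(len(sub) - n + 1):
            front[sub[off:off + n]].append((sub, off))
    return dict(back), dict(front)


def ngram_positions(back, front, gram):
    """Recover (doc_id, offset) pairs of an n-gram by joining both levels."""
    return sorted(
        (doc_id, sub_off + gram_off)
        for sub, gram_off in front.get(gram, [])
        for doc_id, sub_off in back[sub]
    )


back, front = build_ngram_2l(["ABCDABCD", "ABCD"], m=4, n=2)
```

In this toy input the subsequence "ABCD" occurs in both documents, yet its n-grams are stored once in the front-end index, which is the kind of positional redundancy the abstract says the two-level layout removes.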
The ed-tree: an index for large DNA sequence databases
- In Proc. 15th Int. Conf. on Scientific and Statistical Database Management, 2003
Cited by 16 (2 self)
The growing interest in genomic research has caused explosive growth in the size of DNA databases, making it increasingly challenging to search them. In this paper, we propose an index structure called the ed-tree for supporting fast and effective homology searches on DNA databases. The ed-tree is designed to support probe-based homology search algorithms such as Blastn, which generate short probe strings from the query sequence and then match them against the sequence database to identify potential regions of high similarity to the query sequence. Unlike Blastn, however, the homology search algorithm we developed for the ed-tree supports a more flexible probe model with longer probes and more relaxed matching. As a consequence, the ed-tree is not only more effective and efficient than the latest Blastn (NCBI Blast2) for homology search but also requires only moderate storage compared to existing data structures such as the suffix tree. To index a DNA database of 2 giga base pairs (Gbps), the ed-tree takes less than 3Gb of secondary storage, which is easily handled by a desktop PC. Experiments are presented in this paper to support our claim.
An efficient index-based protein structure database searching method
- Intl. Conf. on Database Systems for Advanced Applications (DASFAA), 2003
Cited by 13 (5 self)
In this paper, we present a novel indexing method called ProtDex to facilitate fast searching in 3-dimensional protein structure databases. In ProtDex, we first build an index on the representative properties of all proteins in the database. When evaluating a query, with the help of the index, we filter out a small candidate list of proteins. Then, we can either directly report them, with their respective rankings, to the user, or perform the expensive actual alignments on them at the user's request. Preliminary experimental results show that our solution is up to 16 times faster than the popular DALI method for the database searching task (without actual alignments), while its overall accuracy is only slightly inferior to that of DALI. The software is available upon request by sending emails to the authors.
Self-Adjusting Trees in Practice for Large Text Collections
- Software - Practice and Experience, 2002
Cited by 13 (4 self)
Splay and randomised search trees are self-balancing binary tree structures with little or no space overhead compared to a standard binary search tree. Both trees are intended for use in applications where node accesses are skewed, for example in gathering the distinct words in a large text collection for index construction. We investigate the efficiency of these trees for such vocabulary accumulation. Surprisingly, unmodified splaying and randomised search trees are on average around 25% slower than using a standard binary tree. We investigate heuristics to limit splay tree reorganisation costs and show their effectiveness in practice. In particular, a periodic rotation scheme improves the speed of splaying by 27%, while other proposed heuristics are less effective. We also report the performance of efficient bit-wise hashing and red-black trees for comparison.
Searching on the Secondary Structure of Protein Sequences
- In: Proceedings of the 28th VLDB Conference, Hong Kong, 2002
Cited by 8 (1 self)
In spite of the many decades of progress in database research, scientists in the life sciences community surprisingly still struggle with inefficient and awkward tools for querying biological data sets. This work highlights a specific problem involving searching large volumes of protein data sets based on their secondary structure. In this paper we define an intuitive query language that can be used to express queries on secondary structure and develop several algorithms for evaluating these queries. We implement these algorithms both in Periscope, a native system that we have built, and in a commercial ORDBMS. We show that the choice of algorithms can have a significant impact on query performance. As part of the Periscope implementation we have also developed a framework for optimizing these queries and for accurately estimating the costs of the various query evaluation plans. Our performance studies show that the proposed techniques are very efficient in the Periscope system and can provide scientists with interactive secondary structure querying options even on large protein data sets.
Indexing DNA sequences using q-grams
- In Proceedings of the 10th International Conference on Database Systems for Advanced Applications, 2005
Cited by 7 (0 self)
We have observed in recent years a growing interest in similarity search on large collections of biological sequences. Contributing to this interest, this paper presents a method for indexing DNA sequences efficiently based on q-grams to facilitate similarity search in a DNA database and sidestep the need for a linear scan of the entire database. A two-level index, consisting of a hash table and c-trees, is proposed based on the q-grams of DNA sequences. The proposed data structures allow the quick detection of sequences within a certain distance of the query sequence. Experimental results show that our method is efficient in detecting similarity regions in a DNA sequence database with high sensitivity.
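The hash-table level of such a q-gram index can be sketched in a few lines. This is only an assumption-laden illustration: the c-tree level is not described in the abstract and is omitted, and the names `build_qgram_index`, `candidates`, and the `min_shared` threshold are hypothetical. The sketch maps each q-gram to its occurrences and uses shared q-grams as a filter, so that only candidate sequences need closer inspection instead of a linear scan.

```python
from collections import defaultdict


def build_qgram_index(seqs, q=3):
    """Hash table: q-gram -> list of (sequence id, position)."""
    index = defaultdict(list)
    for sid, seq in enumerate(seqs):
        for pos in range(len(seq) - q + 1):
            index[seq[pos:pos + q]].append((sid, pos))
    return index


def candidates(index, query, q=3, min_shared=2):
    """Ids of sequences matching at least `min_shared` q-grams of the query.

    min_shared is a hypothetical filter threshold; each q-gram position of
    the query is counted at most once per sequence.
    """
    hits = defaultdict(int)
    for pos in range(len(query) - q + 1):
        gram = query[pos:pos + q]
        for sid in {sid for sid, _ in index.get(gram, [])}:
            hits[sid] += 1
    return sorted(sid for sid, count in hits.items() if count >= min_shared)


seqs = ["ACGTACGT", "TTTTGGGG", "ACGTTTTT"]
qindex = build_qgram_index(seqs, q=3)
```

Calling `candidates(qindex, "ACGTA")` then returns the ids of sequences that share enough q-grams with the query to be worth a full comparison.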
A Practical Index for Genome Searching
- In Proceedings of the 10th International Symposium on String Processing and Information Retrieval (SPIRE 2003), LNCS 2857, 2003
Cited by 6 (1 self)
Current search tools for computational biology trade efficiency for precision, losing many relevant matches. We push in the direction of obtaining maximum efficiency from an indexing scheme that does not lose any relevant match. We show that it is feasible to search the human genome efficiently on an average desktop computer.

