Results 11-20 of 368
An efficient indexing technique for full-text database systems
 In Proceedings of 18th International Conference on Very Large Databases
, 1992
Cited by 79 (10 self)
Abstract: Full-text database systems require an index to allow fast access to documents based on their content. We propose an inverted file indexing scheme based on compression. This scheme allows users to retrieve documents using words occurring in the documents, sequences of adjacent words, and statistical ranking techniques. The compression methods chosen ensure that the storage requirements are small and that dynamic update is straightforward. The only assumption that we make is that sufficient main memory is available to support an in-memory vocabulary; given this assumption, the method we describe requires at most one disc access per query term to identify answers to queries.
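A minimal sketch of the kind of structure this abstract describes: an in-memory vocabulary that maps each word to a postings list stored as differences (d-gaps) between ascending document ids, which is the form that integer compression schemes usually operate on. The function names (`build_index`, `lookup`, `query_and`) and the whitespace tokenisation are assumptions made for illustration; this is not the authors' scheme and it does not cover adjacency or ranked queries.

```python
from collections import defaultdict


def build_index(docs):
    """Build a vocabulary mapping each word to a list of d-gaps.

    Document ids start at 1. The first entry of each list is the first
    document id; every later entry is the difference to the previous id.
    """
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        for word in set(text.lower().split()):
            postings[word].append(doc_id)  # ids arrive in ascending order
    index = {}
    for word, ids in postings.items():
        index[word] = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
    return index


def lookup(index, word):
    """Decode the d-gaps of one word back into ascending document ids."""
    ids, total = [], 0
    for gap in index.get(word.lower(), []):
        total += gap
        ids.append(total)
    return ids


def query_and(index, words):
    """Return ids of documents containing every query word."""
    result = None
    for word in words:
        ids = set(lookup(index, word))
        result = ids if result is None else result & ids
    return sorted(result or [])
```

Small gaps dominate for frequent words, which is why a gap list compresses better than a list of raw document ids.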
Compressed text databases with efficient query algorithms based on the compressed suffix array
 Proceedings of ISAAC'00, number 1969 in LNCS
, 2000
Cited by 62 (3 self)
A compressed text database based on the compressed suffix array is proposed. The compressed suffix array of Grossi and Vitter occupies only O(n) bits for a text of length n; however it also uses the text itself that occupies O(n log |Σ|) bits for the alphabet Σ. On the other hand, our data structure does not use the text itself, and supports important operations for text databases: inverse, search and decompress. Our algorithms can find occ occurrences of any substring P of the text ...
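To make the "search" operation concrete, here is a sketch that uses a plain, uncompressed suffix array and keeps the text, so it illustrates only the query being answered (all occurrences of a pattern P), not the compressed structure the abstract proposes. The names `build_suffix_array` and `find_occurrences` are illustrative, and the construction by sorting whole suffixes is chosen for brevity rather than efficiency.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order.

    Sorting whole suffixes is simple but slow; it is only meant for
    small illustrative inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of every occurrence of pattern."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])
```

The two binary searches work because truncating lexicographically sorted suffixes to their first m characters keeps them in non-decreasing order, so all matches occupy one contiguous range of the array.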
Compressing Integers for Fast File Access
 The Computer Journal
, 1999
Cited by 60 (14 self)
In this paper we show experimentally that, for large or small collections, storing integers in a compressed format reduces the time required for either sequential stream access or random access. We compare different approaches to compressing integers, including the Elias gamma and delta codes, Golomb coding, and a variable-byte integer scheme. As a conclusion, we recommend that, for fast access to integers, files be stored compressed.
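Two of the codes named in this abstract can be sketched briefly. The Elias gamma code below follows the standard definition (a unary length prefix followed by the binary value). The variable-byte layout is one common convention (seven data bits per byte, low-order group first, high bit set on the final byte) and is an assumption, since the listing does not say which layout the paper uses. Bit strings are used for gamma codes purely for readability, and the decoders assume well-formed input.

```python
def elias_gamma_encode(n):
    """Encode a positive integer as a string of '0'/'1' characters."""
    if n < 1:
        raise ValueError("Elias gamma codes are defined for n >= 1")
    binary = bin(n)[2:]
    # len(binary) - 1 zeros announce how many bits follow the leading 1.
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode a concatenation of gamma codes into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def vbyte_encode(n):
    """Encode a non-negative integer, 7 bits per byte, low group first."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)  # high bit marks the final byte of this integer
    return bytes(out)


def vbyte_decode(data):
    """Decode a concatenation of variable-byte codes."""
    values, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            values.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return values
```

Gamma codes are bit-aligned and favour very small values, while variable-byte codes are byte-aligned and therefore cheaper to decode, which is the kind of trade-off the abstract says the paper measures.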
A Subquadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
, 2002
Cited by 60 (4 self)
The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix, and compares two strings of size n in O(n^2) time. We address the challenge of computing the similarity of two strings in subquadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global alignment computations. The speedup is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n^2 / log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn^2 / log n) where h ≤ 1 is the entropy of the text.
Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France, email: mac@univmlv.fr.
Department of Computer Science, Haifa University, Haifa 31905, Israel, phone: (9724) 8240103, FAX: (9724) 8249331; Department of Computer and Information Science, Polytechnic University, Six MetroTech Center, Brooklyn, NY 112013840; email: landau@poly.edu; partially supported by NSF grant CCR0104307, by NATO Science Programme grant PST.CLG.977017, by the Israel Science Foundation (grants 173/98 and 282/01), by the FIRST Foundation of the Israel Academy of Science and Humanities, and by IBM Faculty Partnership Award.
Department of Computer Science, Haifa University, Haifa 31905, Israel; On Education Leave from the IBM T.J.W. Research Center; email: michal@cs.haifa.il; partially supported by the Israel Science Foundation (grants 173/98 and 282/01), and by the FIRST Foundation of the Israel Academy of Science ...
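For reference, the classical quadratic dynamic programming baseline that the abstract of this entry starts from can be sketched as follows. This is the textbook global alignment recurrence with a linear gap penalty, not the subquadratic block-based algorithm of the paper; the function name and the `score` and `gap` parameters are assumptions made for illustration.

```python
def global_alignment_score(a, b, score, gap):
    """Return the best global alignment score of strings a and b.

    score(x, y) gives the weight of aligning symbols x and y, and gap is
    the (usually negative) weight of aligning a symbol with a gap.
    Runs in O(len(a) * len(b)) time and keeps only two matrix rows.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                prev[j] + gap,                            # a[i-1] against a gap
                curr[j - 1] + gap,                        # b[j-1] against a gap
            ))
        prev = curr
    return prev[-1]
```

Because `score` is an arbitrary function, the sketch places no restriction on the weights, which matches the "unrestricted" scoring setting the abstract refers to.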
Adding Compression to Block Addressing Inverted Indexes
, 2000
Cited by 50 (28 self)
Inverted index compression, block addressing and sequential search on compressed text are three techniques that have been separately developed for efficient, low-overhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its size and allow searching it directly and faster than the uncompressed text. Inverted index compression significantly reduces the size of the index at the same processing speed. Block addressing makes the inverted lists point to text blocks instead of exact positions, paying for the reduction in space with some sequential text scanning. In this work we combine the three ideas in a single scheme. We present a compressed inverted file that indexes compressed text and uses block addressing. We consider different techniques to compress the index and study their performance with respect to the block size. We compare the index against three separate techniques for varying block sizes, showing that our index is superior to each isolated approach. For instance, with just 4% of extra space overhead the index has to scan less than 12% of the text for exact searches and about 20% allowing one error in the matches.
Keywords: Text compression, inverted files, block addressing, text databases.
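The block addressing idea can be illustrated on its own, leaving out both index compression and text compression. In this sketch the inverted lists hold block numbers rather than word positions, and a query scans only the candidate blocks to recover exact positions. The fixed block size measured in words and the function names are assumptions made for illustration.

```python
def build_block_index(words, block_size):
    """Map each word to the sorted list of blocks in which it occurs."""
    index = {}
    for pos, word in enumerate(words):
        index.setdefault(word, set()).add(pos // block_size)
    return {word: sorted(blocks) for word, blocks in index.items()}


def find_positions(words, index, block_size, term):
    """Return exact positions of term by scanning only candidate blocks."""
    hits = []
    for block in index.get(term, []):
        start = block * block_size
        end = min(start + block_size, len(words))
        for pos in range(start, end):  # sequential scan inside the block
            if words[pos] == term:
                hits.append(pos)
    return hits
```

Larger blocks shrink the inverted lists but lengthen each scan, which is the space versus scanning trade-off the abstract studies as a function of block size.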
Indexing and Retrieval for Genomic Databases
 IEEE Transactions on Knowledge and Data Engineering
, 2002
Cited by 50 (6 self)
Genomic sequence databases are widely used by molecular biologists for homology searching. Amino-acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences only and to reduce the costs of the alignments that are attempted. We present an index-based approach for both selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in significant savings in computationally intensive local alignments, and that index-based searching is as accurate as existing exhaustive search schemes.
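One simple way to picture index-based candidate selection is a k-mer index: each fixed-length substring maps to the sequences that contain it, and sequences are ranked by how many distinct query k-mers they share. This is an illustrative stand-in and not the index described in the paper; the function names, the choice of k and the ranking rule are all assumptions.

```python
from collections import Counter, defaultdict


def build_kmer_index(sequences, k):
    """Map every length-k substring to the ids of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def rank_candidates(index, query, k, top):
    """Return up to `top` sequence ids sharing the most distinct k-mers.

    Ties are broken by the smaller sequence id. Only the returned
    candidates would then be passed to an expensive local alignment.
    """
    counts = Counter()
    for kmer in {query[i:i + k] for i in range(len(query) - k + 1)}:
        for seq_id in index.get(kmer, ()):
            counts[seq_id] += 1
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [seq_id for seq_id, _ in ordered[:top]]
```

Sequences that share no k-mer with the query never enter the ranking, so they are never aligned at all.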
Compact Encodings of Planar Graphs via Canonical Orderings and Multiple Parentheses
, 1998
Cited by 47 (11 self)
We consider the problem of coding planar graphs by binary strings. Depending on whether O(1)-time queries for adjacency and degree are supported, we present three sets of coding schemes which all take linear time for encoding and decoding. The encoding lengths are significantly shorter than the previously known results in each case.
1 Introduction
This paper investigates the problem of encoding a graph G with n nodes and m edges into a binary string S. This problem has been extensively studied with three objectives: (1) minimizing the length of S, (2) minimizing the time needed to compute and decode S, and (3) supporting queries efficiently. A number of coding schemes with different tradeoffs have been proposed. The adjacency-list encoding of a graph is widely useful but requires 2m⌈log n⌉ bits. (All logarithms are of base 2.) A folklore scheme uses 2n bits to encode a rooted n-node tree into a string of n pairs of balanced parentheses. Since the total number of such trees is...
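The folklore scheme mentioned in this introduction is short enough to sketch: a depth-first traversal writes "(" on entering a node and ")" on leaving it, giving n pairs of balanced parentheses (2n bits) for an n-node rooted ordered tree. The dictionary-of-children representation and the function names are assumptions made for illustration, and the recursive encoder is only suitable for trees of modest depth.

```python
def encode_tree(children, node):
    """Encode the subtree rooted at node as balanced parentheses.

    children maps each node to the ordered list of its child nodes.
    """
    inner = "".join(encode_tree(children, child)
                    for child in children.get(node, []))
    return "(" + inner + ")"


def decode_tree(code):
    """Rebuild the tree shape; nodes are numbered in preorder from 0."""
    children, stack, next_id = {}, [], 0
    for symbol in code:
        if symbol == "(":
            node = next_id
            next_id += 1
            children[node] = []
            if stack:
                children[stack[-1]].append(node)  # attach to the open parent
            stack.append(node)
        else:
            stack.pop()  # ")" closes the most recently opened node
    return children
```

The encoding preserves only the shape of the tree, so decoding returns fresh preorder labels rather than the original node names.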