Progressive and selective merge: computing top-k with ad-hoc ranking functions
- In SIGMOD Conference, 2007
Cited by 19 (0 self)
The family of threshold algorithms (TA) has been widely studied for efficiently computing top-k queries. TA uses a sort-merge framework that assumes data lists are pre-sorted and the ranking functions are monotone. However, in many database applications, attribute values are indexed by tree-structured indices (e.g., B-tree, R-tree), and the ranking functions are not necessarily monotone. To answer top-k queries with ad-hoc ranking functions, this paper studies an index-merge paradigm that performs progressive search over the space of joint states composed of multiple index nodes. We address two challenges for efficient query processing. First, to minimize the search complexity, we present a double-heap algorithm which supports not only progressive state search but also progressive state generation. Second, to avoid unnecessary disk access, we characterize a type of “empty-state” that does not contribute to the final results, and propose a new materialization model, join-signature, to prune empty-states. Our performance study shows that the proposed method achieves one order of magnitude speed-up over baseline solutions.
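As a point of reference for the sort-merge baseline that the abstract above contrasts itself with, here is a minimal Python sketch of a threshold-style top-k scan. It is not the paper's index-merge or double-heap method. The function name `threshold_topk`, the dict-based attribute lists, non-negative scores, a default score of 0.0 for missing objects, and `sum` as the monotone ranking function are all illustrative assumptions.

```python
import heapq


def threshold_topk(lists, k, agg=sum):
    """Threshold-style top-k over per-attribute score lists.

    lists: one dict per attribute, mapping object id -> non-negative score
           (assumed layout, not taken from the paper).
    agg:   a monotone aggregation function over an iterable of scores.
    """
    # Sorted access: each attribute list in descending score order.
    sorted_lists = [sorted(d.items(), key=lambda kv: -kv[1]) for d in lists]
    max_depth = max((len(s) for s in sorted_lists), default=0)
    seen = {}  # object id -> aggregated score

    for depth in range(max_depth):
        last_scores = []
        for s in sorted_lists:
            if depth < len(s):
                obj, score = s[depth]
                last_scores.append(score)
                if obj not in seen:
                    # Random access: fetch the object's score in every list.
                    seen[obj] = agg(d.get(obj, 0.0) for d in lists)
            else:
                last_scores.append(0.0)  # this list is exhausted

        # No unseen object can score higher than the aggregated frontier.
        threshold = agg(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) >= k and top[-1][1] >= threshold:
            return top

    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    attr1 = {"a": 0.9, "b": 0.8, "c": 0.1}
    attr2 = {"a": 0.2, "b": 0.7, "c": 0.9}
    print(threshold_topk([attr1, attr2], k=1))
```

The early exit is only valid because `agg` is monotone and the lists are sorted, which are exactly the two assumptions the abstract says do not hold for ad-hoc ranking functions over tree-structured indices.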
Bounding the Depth of Search Trees
- The Computer Journal, 1993
Cited by 15 (5 self)
For an ordered sequence of n weights, Huffman's algorithm constructs in time and space O(n) a search tree with minimum average path length, or, which is equivalent, a minimum redundancy code. However, if an upper bound B is imposed on the length of the codewords, the best known algorithms for the construction of an optimal code have time and space complexities O(Bn^2). A new algorithm is presented, which yields sub-optimal codes, but in time O(n log n) and space O(n). Under certain conditions, these codes are shown to be close to optimal, and extensive experiments suggest that in many practical applications, the deviation from the optimum is negligible.
1. Motivation and Introduction. We consider the set B(n; b) of extended binary trees with n leaves, labelled 1 to n, and with depth b, henceforth called b-restricted trees. An extended binary tree is a binary tree in which every internal node has two sons (here, and in what follows, we use the terminology of Knuth [16, pp. 399--...
Searching Large Lexicons for Partially Specified Terms using Compressed Inverted Files
- Proc. International Conference on Very Large Databases, 1993
Cited by 15 (5 self)
There are several advantages to be gained by storing the lexicon of a full text database in main memory. In this paper we describe how to use a compressed inverted file index to search such a lexicon for entries that match a pattern or partially specified term. Our experiments show that this method provides an effective compromise between speed and space, running orders of magnitude faster than brute force search, but requiring less memory than other pattern-matching data structures; indeed, in some cases requiring less memory than would be consumed by a single pointer to each string. The pattern search method is based on text indexing techniques and is a successful adaptation of inverted files to main memory databases.
Robust Universal Complete Codes for Transmission and Compression
- Discrete Applied Mathematics, 1996
Cited by 8 (4 self)
Several measures are defined and investigated, which allow the comparison of codes as to their robustness against errors. Then new universal and complete sequences of variable-length codewords are proposed, based on representing the integers in a binary Fibonacci numeration system. Each sequence is constant and need not be generated for every probability distribution. These codes can be used as alternatives to Huffman codes when the optimal compression of the latter is not required, and simplicity, faster processing and robustness are preferred. The codes are compared on several "real-life" examples.
1. Motivation and Introduction. Let $A = \{A_1, A_2, \cdots, A_n\}$ be a finite set of elements, called cleartext elements, to be encoded by a static uniquely decipherable (UD) code. For notational ease, we use the term 'code' as abbreviation for 'set of codewords'; the corresponding encoding and decoding algorithms are always either given or clear from the context. A code i...
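To make the idea of a binary Fibonacci numeration system concrete, the following Python sketch implements the classic Fibonacci code for positive integers (Zeckendorf representation, least significant digit first, terminated by an extra 1 bit). This is the textbook variant and is only assumed to be related to the sequences proposed in the paper; the function names `fib_encode` and `fib_decode` are hypothetical.

```python
def fib_encode(n: int) -> str:
    """Return the classic Fibonacci codeword of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Fibonacci codes are defined for positive integers")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()  # drop the first Fibonacci number larger than n

    # Greedy Zeckendorf decomposition, largest Fibonacci number first.
    bits = []
    for f in reversed(fibs):
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")

    # Emit least significant digit first, then the terminating 1.
    return "".join(reversed(bits)) + "1"


def fib_decode(stream: str) -> list[int]:
    """Decode a concatenation of Fibonacci codewords back into integers."""
    fibs = [1, 2]
    values, total, idx, prev_one = [], 0, 0, False
    for ch in stream:
        if ch == "1" and prev_one:
            # Two consecutive 1s mark the end of a codeword.
            values.append(total)
            total, idx, prev_one = 0, 0, False
            continue
        if idx >= len(fibs):
            fibs.append(fibs[-1] + fibs[-2])
        if ch == "1":
            total += fibs[idx]
        prev_one = ch == "1"
        idx += 1
    return values


if __name__ == "__main__":
    words = [fib_encode(n) for n in (1, 2, 3, 4)]
    print(words)
    print(fib_decode("".join(words)))
```

Because no Zeckendorf representation contains two adjacent 1s, the pair "11" can only occur at the end of a codeword. The code table is fixed and independent of the probability distribution, which matches the abstract's remark that each sequence is constant.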
Supporting Temporal Text-Containment Queries in Temporal Document Databases
- 2003
Cited by 7 (1 self)
In temporal document databases and temporal XML databases, temporal text-containment queries are a potential performance bottleneck. In this paper we describe how to manage documents and index structures in such databases in a way that makes temporal text-containment querying feasible. We describe and discuss different index structures that can improve such queries. Three of the alternatives have been implemented in the V2 temporal document database system, and the performance of the index structures is studied using temporal web data. The results show that even a very simple time-indexing approach can reduce query cost by up to three orders of magnitude.
Compressing Term Positions in Web Indexes
Cited by 6 (4 self)
Large search engines process thousands of queries per second on billions of pages, making query processing a major factor in their operating costs. This has led to a lot of research on how to improve query throughput, using techniques such as massive parallelism, caching, early termination, and inverted index compression. We focus on techniques for compressing term positions in web search engine indexes. Most previous work has focused on compressing docID and frequency data, or position information in other types of text collections. Compression of term positions in web pages is complicated by the fact that term occurrences tend to cluster within documents but not across document boundaries, making it harder to exploit clustering effects. Also, typical access patterns for position data are different from those for docID and frequency data. We perform a detailed study of a number of existing and new techniques for compressing position data in web indexes. We also study how to efficiently access position data for ranking functions that take proximity features into account.
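The abstract above does not name the specific techniques it studies. As a generic illustration of why clustered term positions compress well, the Python sketch below stores positions as gaps and encodes each gap with a variable-byte code. Both choices, and the function names `vbyte_encode`, `vbyte_decode`, `encode_positions` and `decode_positions`, are assumptions made for this example rather than the paper's method.

```python
from itertools import accumulate


def vbyte_encode(nums):
    """Variable-byte encode non-negative integers, 7 payload bits per byte."""
    out = bytearray()
    for n in nums:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)  # high bit set marks the last byte of a number
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            nums.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return nums


def encode_positions(positions):
    """Encode a strictly increasing list of term positions within one document."""
    gaps = [p - q for p, q in zip(positions, [0] + positions[:-1])]
    return vbyte_encode(gaps)


def decode_positions(data):
    """Rebuild absolute positions by summing the decoded gaps."""
    return list(accumulate(vbyte_decode(data)))


if __name__ == "__main__":
    positions = [3, 7, 150, 1000]
    blob = encode_positions(positions)
    print(len(blob), decode_positions(blob))
```

Gaps restart at each document, so this scheme only benefits from clustering inside a document, which is the limitation the abstract points out for web pages.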
V2: A Database Approach to Temporal Document Management
- 2002
Cited by 5 (4 self)
The advent of large amounts of data on the web has closed the gap between the document storage and database communities. In this paper, this work is continued by the description of the foundations for temporal document databases. We describe the V2 temporal document database, which supports storage, retrieval, and querying of temporal documents. We describe functionality and operations/operators to be supported by such systems, and more specifically we describe the architecture for management of temporal documents used in the V2 prototype. We also give some performance results from a mini-benchmark run on the V2 prototype.
Design, Implementation, and Performance of the V2 Temporal Document Database System
- 2002
Cited by 5 (1 self)
The advent of large amounts of data on the web has closed the gap between the document storage and the database communities. In this paper, this work is continued by the description of the foundations for temporal document databases. We describe functionality and operations/operators to be supported by such systems, and more specifically we describe the architecture for management of temporal documents used in the prototype of the V2 temporal document database system, which supports storage, retrieval, and querying of temporal documents. We also give some performance results from a mini-benchmark run on the V2 prototype.
Improving space-efficiency in temporal text-indexing
- 2004
Cited by 3 (1 self)
Support for temporal text-containment queries, i.e., queries for all versions of documents that contained one or more particular words at a particular time, is of interest in a number of contexts. In previous papers we have presented two approaches to temporal text-indexing, the V2X and ITTX indexes. In this paper, we first present improvements to the previous techniques. We then perform a study of the space usage of the indexing approaches based on both analytical models and results from indexing temporal text collections created by a synthetic temporal document generator. These results show for which kinds of document collections each technique should be employed. The results also show that, regarding space usage, the new ITTX/VIDPI technique proposed in this paper is in most cases superior to V2X, except when the number of new documents is high relative to the number of updated documents.
Supporting Temporal Text-Containment Queries
- 2002
Cited by 1 (1 self)
In temporal document databases and temporal XML databases, temporal text-containment queries are a potential performance bottleneck. In this paper we describe how to manage documents and index structures in such databases in a way that makes temporal text-containment querying feasible. We describe and discuss different index structures that can improve such queries. Three of the alternatives have been implemented in the V2 temporal document database system, and the performance of the index structures is studied using temporal web data. The results show that even a very simple time-indexing approach can reduce query cost by up to three orders of magnitude.

