RankSQL: Query algebra and optimization for relational top-k queries
- In SIGMOD
, 2005
Cited by 71 (15 self)
This paper introduces RankSQL, a system that provides a systematic and principled framework to support efficient evaluation of ranking (top-k) queries in relational database systems (RDBMS) by extending relational algebra and query optimization. Previously, top-k query processing has been studied in the middleware scenario or in RDBMS in a “piecemeal” fashion, i.e., focusing on a specific operator or sitting outside the core of query engines. In contrast, we aim to support ranking as a first-class database construct. As a key insight, the new ranking relationship can be viewed as another logical property of data, parallel to the “membership” property of the relational data model. While membership is essentially supported in RDBMS, the same support for ranking is clearly lacking. We address the fundamental integration of ranking in RDBMS in a way similar to how membership, i.e., Boolean filtering, is supported. We extend relational algebra by proposing a rank-relational model to capture the ranking property, and introducing new and extended operators to support ranking as a first-class construct. Enabled by the extended algebra, we present a pipelined and incremental execution model of ranking query plans (that cannot be expressed traditionally) based on a fundamental ranking principle. To optimize top-k queries, we propose a dimensional enumeration algorithm to explore the extended plan space by enumerating plans along two dual dimensions: ranking and membership. We also propose a sampling-based method to estimate the cardinality of rank-aware operators, for costing plans. Our experiments show the validity of our framework and the accuracy of the proposed estimation model.
Continuous monitoring of top-k queries over sliding windows
- In SIGMOD
, 2006
Cited by 35 (3 self)
Given a dataset P and a preference function f, a top-k query retrieves the k tuples in P with the highest scores according to f. Even though the problem is well-studied in conventional databases, the existing methods are inapplicable to highly dynamic environments involving numerous long-running queries. This paper studies continuous monitoring of top-k queries over a fixed-size window W of the most recent data. The window size can be expressed either in terms of the number of active tuples or time units. We propose a general methodology for top-k monitoring that restricts processing to the sub-domains of the workspace that influence the result of some query. To cope with high stream rates and provide fast answers in an on-line fashion, the data in W reside in main memory. The valid records are indexed by a grid structure, which also maintains book-keeping information. We present two processing techniques: the first one computes the new answer of a query whenever some of the current top-k points expire; the second one partially precomputes the future changes in the result, achieving better running time at the expense of slightly higher space requirements. We analyze the performance of both algorithms and evaluate their efficiency through extensive experiments. Finally, we extend the proposed framework to other query types and a different data stream model.
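To make the problem statement above concrete, the following is a minimal baseline sketch in Python, not the grid-based method of the paper. It assumes a count-based window and recomputes the answer from scratch on every arrival; the names `monitor_top_k`, `stream`, and `score` are hypothetical and chosen only for illustration.

```python
import heapq
from collections import deque


def monitor_top_k(stream, window_size, k, score):
    """Yield the top-k tuples among the last `window_size` tuples after each arrival.

    Naive baseline: the whole window is rescanned on every arrival.
    """
    window = deque(maxlen=window_size)  # the oldest tuple is evicted automatically
    for t in stream:
        window.append(t)
        yield heapq.nlargest(k, window, key=score)


# Illustrative data: 2-attribute tuples scored by the sum of their attributes.
stream = [(1, 5), (4, 4), (2, 9), (0, 1), (3, 3)]
score = lambda t: t[0] + t[1]
results = list(monitor_top_k(stream, window_size=3, k=2, score=score))
```

Each step costs time proportional to the window size, which is the kind of repeated work that restricting processing to the relevant sub-domains, as the abstract describes, is meant to avoid.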
Deltasky: Optimal maintenance of skyline deletions without exclusive dominance region generation
- In UCSB Tech Report, 2006. http://www.cs.ucsb.edu/~pingwu/deltasky.pdf
, 2007
Cited by 15 (1 self)
This paper addresses the problem of efficient maintenance of a materialized skyline view in response to skyline removals. While there has been significant progress on skyline query computation, an equally important but largely unanswered issue is the incremental maintenance for skyline deletions. Previous work suggested the use of the so-called exclusive dominance region (EDR) to achieve optimal I/O performance for deletion maintenance. However, the shape of an EDR becomes extremely complex in higher dimensions, and algorithms for its computation have not been developed. We derive a systematic way to decompose a d-dimensional EDR into a collection of hyper-rectangles. We show that the number of such hyper-rectangles is O(m^d), where m is the current skyline result size. We then propose a novel algorithm, DeltaSky, which determines whether an intermediate R-tree MBR intersects with the EDR without explicitly calculating the EDR itself. This reduces the worst-case complexity of the EDR intersection check from O(m^d) to O(md). Thus DeltaSky helps the branch and bound skyline algorithm achieve I/O optimality for deletion maintenance by finding only the newly appeared skyline points after the deletion. We discuss implementation issues and show that DeltaSky can be efficiently implemented using one extra B-Tree. Moreover, we propose two optimization techniques which further reduce the average cost in practice. Extensive experiments demonstrate that DeltaSky achieves orders of magnitude performance gain over alternative solutions.
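The role of the exclusive dominance region can be illustrated with a small in-memory Python sketch. This is not DeltaSky: it uses no R-tree or B-tree and scans every point. It assumes that larger values are better in every dimension and that points are distinct; all function and variable names are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(points):
    """Points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def delete_from_skyline(points, sky, removed):
    """Return (remaining points, updated skyline) after deleting skyline point `removed`."""
    remaining = [p for p in points if p != removed]
    survivors = [s for s in sky if s != removed]
    # Points in the exclusive dominance region of `removed`:
    # dominated by `removed` but by no surviving skyline point.
    candidates = [p for p in remaining
                  if dominates(removed, p)
                  and not any(dominates(s, p) for s in survivors)]
    # Only the skyline of those candidates can newly appear.
    return remaining, survivors + skyline(candidates)


points = [(5, 1), (1, 5), (4, 4), (3, 3), (2, 2), (3, 1)]
sky = skyline(points)
remaining, new_sky = delete_from_skyline(points, sky, (4, 4))
```

In this example only `(3, 3)` is promoted after `(4, 4)` is deleted; `(3, 1)` stays out because the surviving skyline point `(5, 1)` still dominates it.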
Ad-hoc Top-k Query Answering for Data Streams
, 2007
Cited by 12 (1 self)
A top-k query retrieves the k highest scoring tuples from a data set with respect to a scoring function defined on the attributes of a tuple. The efficient evaluation of top-k queries has been an active research topic and many different instantiations of the problem, in a variety of settings, have been studied. However, techniques developed for conventional, centralized or distributed databases are not directly applicable to highly dynamic environments and on-line applications, like data streams. Recently, techniques supporting top-k queries on data streams have been introduced. Such techniques are restrictive however, as they can only efficiently report top-k answers with respect to a pre-specified (as opposed to ad-hoc) set of queries. In this paper we introduce a novel geometric representation for the top-k query problem that allows us to raise this restriction. Utilizing notions of geometric arrangements, we design and analyze algorithms for incrementally maintaining a data set organized in an arrangement representation under streaming updates. We introduce query evaluation strategies that operate on top of an arrangement data structure that are able to guarantee efficient evaluation for ad-hoc queries. The performance of our core technique is augmented by incorporating tuple pruning strategies, minimizing the number of tuples that need to be stored and manipulated. This results in a main memory indexing technique supporting both efficient incremental updates and the evaluation of ad-hoc top-k queries. A thorough experimental study evaluates the efficiency of the proposed technique.
Efficient Skyline and Top-k Retrieval in Subspaces
Cited by 12 (0 self)
Skyline and top-k queries are two popular operations for preference retrieval. In practice, applications that require these operations usually provide numerous candidate attributes, whereas, depending on their interests, users may issue queries regarding different subsets of the dimensions. The existing algorithms are inadequate for subspace skyline/top-k search because they have at least one of the following defects: they (i) require scanning the entire database at least once; (ii) are optimized for one subspace but incur significant overhead for other subspaces; (iii) demand expensive maintenance cost or space consumption. In this paper, we propose a technique, SUBSKY, which settles both types of queries using purely relational technologies. The core of SUBSKY is a transformation that converts multidimensional data to 1D values. These values are indexed by a simple B-tree, which allows us to answer subspace queries by accessing a fraction of the database. SUBSKY entails low maintenance overhead, which equals the cost of updating a traditional B-tree. Extensive experiments with real data confirm that our technique outperforms alternative solutions significantly in both efficiency and scalability.
Reverse Top-k Queries
Cited by 9 (3 self)
Rank-aware query processing has become essential for many applications that return to the user only the top-k objects based on the individual user’s preferences. Top-k queries have been mainly studied from the perspective of the user, focusing primarily on efficient query processing. In this work, for the first time, we study top-k queries from the perspective of the product manufacturer. Given a potential product, which are the user preferences for which this product is in the top-k query result set? We identify a novel query type, namely reverse top-k query, that is essential for manufacturers to assess the potential market and impact of their products based on the competition. We formally define reverse top-k queries and introduce two versions of the query, namely monochromatic and bichromatic. We first provide a geometric interpretation of the monochromatic reverse top-k query in the solution space that helps to understand the reverse top-k query conceptually. Then, we study in more detail the case of the bichromatic reverse top-k query, which is more interesting for practical applications. Such a query, if computed in a straightforward manner, requires evaluating a top-k query for each user preference in the database, which is prohibitively expensive even for moderate datasets. In this paper, we present an efficient threshold-based algorithm that eliminates candidate user preferences, without processing the respective top-k queries. Furthermore, we introduce an indexing structure based on materialized reverse top-k views in order to speed up the computation of reverse top-k queries. Materialized reverse top-k views trade preprocessing cost for query speedup in a controllable manner. Our experimental evaluation demonstrates the efficiency of our techniques, which reduce the required number of top-k computations by 1 to 3 orders of magnitude.
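The straightforward evaluation that the abstract calls prohibitively expensive can be sketched in a few lines of Python. This is only the brute-force baseline, not the threshold-based algorithm or the materialized views. The sketch assumes linear scoring with weight vectors, that a higher score is better, and that ties favor the query product; every name in it is hypothetical.

```python
import heapq


def score(weights, product):
    """Linear preference score of a product under one user's weights."""
    return sum(w * x for w, x in zip(weights, product))


def reverse_top_k(products, preferences, query, k):
    """Return the preference vectors for which `query` would rank in the top-k."""
    result = []
    for w in preferences:
        # One full top-k evaluation per user preference (the expensive part).
        best = heapq.nlargest(k, (score(w, p) for p in products))
        if len(best) < k or score(w, query) >= best[-1]:
            result.append(w)
    return result


products = [(8, 2), (2, 8), (5, 5)]
preferences = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
query = (6, 6)
top1_users = reverse_top_k(products, preferences, query, k=1)
top2_users = reverse_top_k(products, preferences, query, k=2)
```

Here the balanced product `(6, 6)` is the top choice only for the user who weights both attributes equally, but it enters the top-2 of all three users.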
Efficient processing of distributed top-k queries
- In DEXA 2005
, 2005
Cited by 7 (0 self)
Ranking-aware queries, or top-k queries, have received much attention recently in various contexts such as web, multimedia retrieval, relational databases, and distributed systems. Top-k queries play a critical role in many decision-making related activities such as identifying interesting objects, network monitoring, load balancing, etc. In this paper, we study the ranking aggregation problem in distributed systems. Prior research addressing this problem did not take data distributions into account, simply assuming a uniform data distribution among nodes, which is not realistic for real data sets and is, in general, inefficient. In this paper, we propose three efficient algorithms that consider data distributions in different ways. Our extensive experiments demonstrate the advantages of our approaches in terms of bandwidth consumption.
Dominant Graph: An Efficient Indexing Structure to Answer Top-K Queries
Cited by 5 (0 self)
Given a record set D and a query score function F, a top-k query returns the k records from D whose values of function F on their attributes are the highest. In this paper, we investigate the intrinsic connection between top-k queries and the dominant relationship between records, based on which we propose an efficient layer-based indexing structure, Dominant Graph (DG), to improve query efficiency. Specifically, DG is built offline to express the dominant relationship between records, and a top-k query is implemented as a graph traversal problem, i.e., the Traveler algorithm. We prove theoretically that the size of the search space (that is, the number of records retrieved from the record set to answer a top-k query) in our basic algorithm is directly related to the cardinality of skyline points in the record set (see Theorem 3.2). Based on the cost analysis, we propose an optimization technique, pseudo record, to improve the search efficiency. In order to handle top-k queries in high-dimensional record sets, we also propose the N-Way Traveler algorithm. Finally, extensive experiments demonstrate that our proposed methods achieve significant improvement over their counterparts, including both classical and state-of-the-art top-k algorithms. For example, the search space in our algorithm is less than 1/5 of that in AppRI [1], one of the state-of-the-art top-k algorithms. Furthermore, our method can support any aggregate monotone query function.
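The link between dominance layers and top-k answers can be shown with a short Python sketch. It is not the Dominant Graph index or the Traveler algorithm: it only peels the records into skyline layers and relies on the fact that, for a monotone score function, a record in layer i has at least i records that score no lower, so a top-k answer can always be drawn from the first k layers. Larger attribute values are assumed to be better, and all names are hypothetical.

```python
import heapq


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def dominance_layers(records):
    """Layer 0 is the skyline, layer 1 the skyline of what is left, and so on."""
    remaining = list(records)
    layers = []
    while remaining:
        layer = [r for r in remaining if not any(dominates(q, r) for q in remaining)]
        layers.append(layer)
        remaining = [r for r in remaining if r not in layer]
    return layers


def top_k_from_layers(layers, k, f):
    """Answer a top-k query by scoring only the records in the first k layers."""
    candidates = [r for layer in layers[:k] for r in layer]
    return heapq.nlargest(k, candidates, key=f)


records = [(9, 1), (1, 9), (6, 6), (5, 5), (4, 7), (2, 2)]
f = lambda r: r[0] + r[1]  # a monotone aggregate function
layers = dominance_layers(records)
answer = top_k_from_layers(layers, 2, f)
```

For this data the record `(2, 2)` in the third layer is never scored when k is 2, which is a small instance of the search-space reduction that a layer-based index aims for.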
Sliding Window Query Processing over Data Streams
, 2006
Cited by 4 (0 self)
Database management systems (DBMSs) have been used successfully in traditional business applications that require persistent data storage and an efficient querying mechanism. Typically, it is assumed that the data are static, unless explicitly modified or deleted by a user or application. Database queries are executed when issued and their answers reflect the current state of the data. However, emerging applications, such as sensor networks, real-time Internet traffic analysis, and on-line financial trading, require support for processing of unbounded data streams. The fundamental assumption of a data stream management system (DSMS) is that new data are generated continually, making it infeasible to store a stream in its entirety. At best, a sliding window of recently arrived data may be maintained, meaning that old data must be removed as time goes on. Furthermore, as the contents of the sliding windows evolve over time, it makes
Evaluating top-k queries over incomplete data streams
- In CIKM’09
Cited by 4 (2 self)
We study the problem of continuous monitoring of top-k queries over multiple non-synchronized streams. Assuming a sliding window model, this general problem has been a well-addressed research topic in recent years. Most approaches, however, assume synchronized streams where all attributes of an object are known simultaneously to the query processing engine. In many streaming scenarios, though, different attributes of an item are reported in separate non-synchronized streams which do not allow for exact score calculations. We present how the traditional notion of object dominance changes in this case such that the k dominance set still includes all and only those objects which have a chance of being among the top-k results in their lifetime. Based on this, we propose an exact algorithm which builds on generating multiple instances of the same object in a way that enables efficient object pruning. We show that even with object pruning the necessary storage for exact evaluation of top-k queries is linear in the size of the sliding window. As data should reside in main memory to provide fast answers in an online fashion and cope with high stream rates, storing all this data may not be possible with limited resources. We present an approximate algorithm which leverages correlation statistics of pairs of streams to evict more objects while maintaining accuracy. We evaluate the efficiency of our proposed algorithms with extensive experiments.

