Results 1-10 of 167
Semantics of ranking queries for probabilistic data and expected ranks
 In Proc. of ICDE’09, 2009
Cited by 63 (1 self)
When dealing with massive quantities of data, top-k queries are a powerful technique for returning only the k most relevant tuples for inspection, based on a scoring function. The problem of efficiently answering such ranking queries has been studied and analyzed extensively within traditional database settings. The importance of the top-k is perhaps even greater in probabilistic databases, where a relation can encode exponentially many possible worlds. There have been several recent attempts to propose definitions and algorithms for ranking queries over probabilistic data. However, these all lack many of the intuitive properties of a top-k over deterministic data. Specifically, we define a number of fundamental properties, including exact-k, containment, unique-rank, value-invariance, and stability, which are all satisfied by ranking queries on certain data. We argue that all these conditions should also be fulfilled by any reasonable definition for ranking uncertain data. Unfortunately, none of the existing definitions is able to achieve this. To remedy this shortcoming, this work proposes an intuitive new approach of expected rank. This uses the well-founded notion of the expected rank of each tuple across all possible worlds as the basis of the ranking. We are able to prove that, in contrast to all existing approaches, the expected rank satisfies all the required properties for a ranking query. We provide efficient solutions to compute this ranking across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. For an uncertain relation of N tuples, the processing cost is O(N log N), no worse than simply sorting the relation. In settings where there is a high cost for generating each tuple in turn, we provide pruning techniques based on probabilistic tail bounds that can terminate the search early and guarantee that the top-k has been found. Finally, a comprehensive experimental study confirms the effectiveness of our approach.
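To make the notion of expected rank concrete, here is a minimal Python sketch for the attribute-level case. It assumes each tuple has an independent discrete score distribution and that a tuple's rank in one possible world is the number of other tuples with a strictly higher score; by linearity of expectation, the expected rank is then the sum of pairwise probabilities of being outscored. This is a naive quadratic illustration, not the O(N log N) algorithm the abstract refers to, and the function names and sample data are hypothetical.

```python
from itertools import product


def expected_ranks(tuples):
    """Return the expected rank of every tuple (lower is better).

    `tuples` maps a tuple name to a list of (score, probability) pairs
    whose probabilities sum to 1. Assumption: score distributions are
    independent across tuples (attribute-level uncertainty).
    """
    ranks = {}
    for name_i, dist_i in tuples.items():
        total = 0.0
        for name_j, dist_j in tuples.items():
            if name_j == name_i:
                continue
            # Probability that tuple j strictly outscores tuple i.
            total += sum(
                p_i * p_j
                for (s_i, p_i), (s_j, p_j) in product(dist_i, dist_j)
                if s_j > s_i
            )
        ranks[name_i] = total
    return ranks


def top_k_by_expected_rank(tuples, k):
    """Return the k tuple names with the smallest expected rank."""
    ranks = expected_ranks(tuples)
    return sorted(ranks, key=lambda name: ranks[name])[:k]


# Hypothetical sample relation with three uncertain tuples.
sample = {
    "t1": [(100, 0.5), (70, 0.5)],
    "t2": [(92, 0.6), (80, 0.4)],
    "t3": [(85, 1.0)],
}
sample_ranks = expected_ranks(sample)
sample_top2 = top_k_by_expected_rank(sample, 2)
```

Because every tuple receives a single number, sorting by it always yields exactly k results with unique positions, which is the intuition behind properties such as exact-k and unique-rank.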
A Unified Approach to Ranking in Probabilistic Databases
Cited by 62 (3 self)
The dramatic growth in the number of application domains that naturally generate probabilistic, uncertain data has resulted in a need for efficiently supporting complex querying and decision-making over such data. In this paper, we present a unified approach to ranking and top-k query processing in probabilistic databases by viewing it as a multi-criteria optimization problem, and by deriving a set of features that capture the key properties of a probabilistic dataset that dictate the ranked result. We contend that a single, specific ranking function may not suffice for probabilistic databases, and we instead propose two parameterized ranking functions, called PRF^ω and PRF^e, that generalize or can approximate many of the previously proposed ranking functions. We present novel generating functions-based algorithms for efficiently ranking large datasets according to these ranking functions, even if the datasets exhibit complex correlations modeled using probabilistic and/xor trees or Markov networks. We further propose that the parameters of the ranking function be learned from user preferences, and we develop an approach to learn those parameters. Finally, we present a comprehensive experimental study that illustrates the effectiveness of our parameterized ranking functions, especially PRF^e, at approximating other ranking functions and the scalability of our proposed algorithms for exact or approximate ranking.
Visual Modelling of Complex Business Processes with Trees, Overlays and Distortion-Based Displays
 Proc VLHCC’07, IEEE CS
"... evolution laws for thin crystalline films: ..."
Colored Range Queries and Document Retrieval
Cited by 31 (18 self)
Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important one-dimensional colored range queries (colored range listing, colored range top-k queries and colored range counting) and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the high-order entropies of the library of documents. We then show how (approximate) colored top-k queries can be reduced to (approximate) range-mode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.
A Survey on Representation, Composition and Application of Preferences in Database Systems
 ACM TODS, 2011
Cited by 30 (6 self)
Preferences have been traditionally studied in philosophy, psychology, and economics and applied to decision making problems. Recently, they have attracted the attention of researchers in other fields, such as databases where they capture soft criteria for queries. Databases bring a whole fresh perspective to the study of preferences, both computational and representational. From a representational perspective, the central question is how we can effectively represent preferences and incorporate them in database querying. From a computational perspective, we can look at how we can efficiently process preferences in the context of database queries. Several approaches have been proposed but a systematic study of these works is missing. The purpose of this survey is to provide a framework for placing existing works in perspective and highlight critical open challenges to serve as a springboard for researchers in database systems. We organize our study around three axes: preference representation, preference composition, and preference query processing.
Ranking with uncertain scores
 In ICDE, 2009
Cited by 24 (2 self)
Large databases with uncertain information are becoming more common in many applications including data integration, location tracking, and Web search. In these applications, ranking records with uncertain attributes needs to handle new problems that are fundamentally different from conventional ranking. Specifically, uncertainty in records' scores induces a partial order over records, as opposed to the total order that is assumed in the conventional ranking settings. In this paper, we present a new probabilistic model, based on partial orders, to encapsulate the space of possible rankings originating from score uncertainty. Under this model, we formulate several ranking query types with different semantics. We describe and analyze a set of efficient query evaluation algorithms. We show that our techniques can be used to solve the problem of rank aggregation in partial orders. In addition, we design novel sampling techniques to compute approximate query answers. Our experimental evaluation uses both real and synthetic data. The experimental study demonstrates the efficiency and effectiveness of our techniques in different settings.
Top-k queries on uncertain data: On score distribution and typical answers
 In SIGMOD 2009
Cited by 18 (0 self)
Uncertain data arises in a number of domains, including data integration and sensor networks. Top-k queries that rank results according to some user-defined score are an important tool for exploring large uncertain data sets. As several recent papers have observed, the semantics of top-k queries on uncertain data can be ambiguous due to tradeoffs between reporting high-scoring tuples and tuples with a high probability of being in the resulting data set. In this paper, we demonstrate the need to present the score distribution of top-k vectors to allow the user to choose between results along this score-probability dimension. One option would be to display the complete distribution of all potential top-k tuple vectors, but this set is too large to compute. Instead, we propose to provide a number of typical vectors that effectively sample this distribution. We propose efficient algorithms to compute these vectors. We also extend the semantics and algorithms to the scenario of score ties, which is not dealt with in the previous work in the area. Our work includes a systematic empirical study on both real and synthetic datasets.
Efficient processing of top-k spatial keyword queries
 In SSTD, 2011
Cited by 16 (1 self)
Given a spatial location and a set of keywords, a top-k spatial keyword query returns the k best spatio-textual objects ranked according to their proximity to the query location and relevance to the query keywords. There are many applications handling huge amounts of geotagged data, such as Twitter and Flickr, that can benefit from this query. Unfortunately, the state-of-the-art approaches require non-negligible processing cost that results in long response times. In this paper, we propose a novel index to improve the performance of top-k spatial keyword queries named Spatial Inverted Index (S2I). Our index maps each distinct term to a set of objects containing the term. The objects are stored differently according to the document frequency of the term and can be retrieved efficiently in decreasing order of keyword relevance and spatial proximity. Moreover, we present algorithms that exploit S2I to process top-k spatial keyword queries efficiently. Finally, we show through extensive experiments that our approach outperforms the state-of-the-art approaches in terms of update and query cost.
Diversifying Top-K Results
Cited by 15 (1 self)
Top-k query processing finds a list of k results that have the largest scores w.r.t. the user-given query, with the assumption that all the k results are independent of each other. In practice, some of the top-k results returned can be very similar to each other. As a result, some of the top-k results returned are redundant. In the literature, diversified top-k search has been studied to return k results that take both score and diversity into consideration. Most existing solutions on diversified top-k search assume that scores of all the search results are given, and some works solve the diversity problem on a specific problem and can hardly be extended to general cases. In this paper, we study the diversified top-k search problem. We define a general diversified top-k search problem that only considers the similarity of the search results themselves. We propose a framework, such that most existing solutions for top-k query processing can be extended easily to handle diversified top-k search, by simply applying three new functions: a sufficient stop condition sufficient(), a necessary stop condition necessary(), and an algorithm for diversified top-k search on the current set of generated results, div-search-current(). We propose three new algorithms, namely, div-astar, div-dp, and div-cut, to solve the div-search-current() problem. div-astar is an A*-based algorithm; div-dp is an algorithm that decomposes the results into components which are searched using div-astar independently and combined using dynamic programming. div-cut further decomposes the current set of generated results using cut points and combines the results using sophisticated operations. We conducted extensive performance studies using two real datasets, enwiki and reuters. Our div-cut algorithm finds the optimal solution for the diversified top-k search problem in seconds even for k as large as 2,000.
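As a rough illustration of the problem statement only, the following Python sketch treats diversified top-k search as choosing at most k results, no two of which are similar, with the largest total score. That formulation is an assumption read from the abstract, and the exhaustive search below is a stand-in for exposition, not any of the algorithms named above; the function name and sample data are hypothetical.

```python
from itertools import combinations


def diversified_top_k(scores, similar_pairs, k):
    """Brute-force diversified top-k (exponential; for tiny inputs only).

    `scores` maps a result id to a positive score. `similar_pairs` lists
    pairs of ids considered too similar to be returned together.
    Returns (chosen ids as a set, total score).
    """
    similar = {frozenset(pair) for pair in similar_pairs}
    best, best_total = (), 0.0
    items = sorted(scores)
    for size in range(1, k + 1):
        for subset in combinations(items, size):
            # Skip any candidate set that contains two similar results.
            if any(frozenset(p) in similar for p in combinations(subset, 2)):
                continue
            total = sum(scores[item] for item in subset)
            if total > best_total:
                best, best_total = subset, total
    return set(best), best_total


# Hypothetical data: "a" scores highest but is similar to "b" and "c".
sample_scores = {"a": 10, "b": 9, "c": 8, "d": 1}
sample_similar = [("a", "b"), ("a", "c")]
chosen, total = diversified_top_k(sample_scores, sample_similar, 2)
```

In this sample the plain top-2 by score would be "a" and "b", which are similar; the diversified answer drops "a" in favour of the pair "b" and "c".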
Processing a Large Number of Continuous Preference Top-k Queries
, 2012
Cited by 13 (2 self)
Given a set of objects, each with multiple numeric attributes, a (preference) top-k query retrieves the k objects with the highest scores according to a user preference, defined as a linear combination of attribute values. We consider the problem of processing a large number of continuous top-k queries, each with its own preference. When objects or user preferences change, the query results must be updated. We present a dynamic index that supports the reverse top-k query, which is of independent interest. Combining this index with another one for top-k queries, we develop a scalable solution for processing many continuous top-k queries that exploits the clusteredness in user preferences. We also define an approximate version of the problem and present a solution significantly more efficient than the exact one with little loss in accuracy.
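The definition of a preference top-k query in the first sentence can be sketched in a few lines of Python. This is a one-shot scan over all objects, assuming each object is an (id, attribute vector) pair and the preference is a weight vector of the same length; it does not attempt the dynamic index or the continuous maintenance the abstract describes, and all names and data are hypothetical.

```python
import heapq


def preference_top_k(objects, weights, k):
    """Return the ids of the k objects with the highest weighted score.

    `objects` is a list of (object_id, attribute_tuple) pairs and
    `weights` is the user's preference vector; the score of an object is
    the linear combination sum(w * a) over matching positions.
    """
    def score(attrs):
        return sum(w * a for w, a in zip(weights, attrs))

    best = heapq.nlargest(k, objects, key=lambda item: score(item[1]))
    return [object_id for object_id, _ in best]


# Hypothetical objects with two numeric attributes each.
sample_objects = [("a", (1.0, 5.0)), ("b", (4.0, 1.0)), ("c", (3.0, 3.0))]
first_pref = preference_top_k(sample_objects, (0.8, 0.2), 2)
second_pref = preference_top_k(sample_objects, (0.1, 0.9), 2)
```

The two calls use the same objects with different weight vectors and return different answers, which is why each continuous query has to be maintained against its own preference.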