Results 1-10 of 32
Probabilistic skylines on uncertain data
In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB’07), Vienna, 2007
Cited by 63 (16 self)
Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on uncertain data, how to conduct advanced analysis on uncertain data remains an open problem at large. In this paper, we tackle the problem of skyline analysis on uncertain data. We propose a novel probabilistic skyline model where an uncertain object may take a probability to be in the skyline, and a p-skyline contains all the objects whose skyline probabilities are at least p. Computing probabilistic skylines on large uncertain data sets is challenging. We develop two efficient algorithms. The bottom-up algorithm computes the skyline probabilities of some selected instances of uncertain objects, and uses those instances to prune other instances and uncertain objects effectively. The top-down algorithm recursively partitions the instances of uncertain objects into subsets, and prunes subsets and objects aggressively. Our experimental results on both the real NBA player data set and the benchmark synthetic data sets show that probabilistic skylines are interesting and useful, and our two algorithms are efficient on large data sets, and complementary to each other in performance.
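The p-skyline definition in the abstract above can be illustrated with a brute-force sketch. It is not the paper's bottom-up or top-down algorithm. It assumes that each uncertain object is a list of equally likely instances, that objects are independent, and that smaller values are better; the names `dominates`, `skyline_probability`, `p_skyline`, and the sample data are hypothetical.

```python
from math import prod

def dominates(a, b):
    """a dominates b: no worse in every dimension, strictly better in one (smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline_probability(name, objects):
    """Average, over the object's instances, of the chance that no other object dominates it."""
    instances = objects[name]
    total = 0.0
    for u in instances:
        # For each other object, 1 - (fraction of its instances that dominate u).
        total += prod(
            1.0 - sum(dominates(v, u) for v in other) / len(other)
            for other_name, other in objects.items()
            if other_name != name
        )
    return total / len(instances)

def p_skyline(objects, p):
    """All objects whose skyline probability is at least p."""
    return {name for name in objects if skyline_probability(name, objects) >= p}

# Hypothetical data: three uncertain objects with two instances each.
objects = {
    "A": [(1, 4), (2, 2)],
    "B": [(3, 3), (1, 5)],
    "C": [(4, 4), (5, 5)],
}
print(sorted(p_skyline(objects, 0.5)))  # ['A', 'B']
```

Here object A is never dominated (probability 1.0), each instance of B is dominated by exactly one of A's two instances (probability 0.5), and C is always dominated by A (probability 0).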
Continuous monitoring of top-k queries over sliding windows
In SIGMOD, 2006
Cited by 60 (7 self)
Given a dataset P and a preference function f, a top-k query retrieves the k tuples in P with the highest scores according to f. Even though the problem is well-studied in conventional databases, the existing methods are inapplicable to highly dynamic environments involving numerous long-running queries. This paper studies continuous monitoring of top-k queries over a fixed-size window W of the most recent data. The window size can be expressed either in terms of the number of active tuples or time units. We propose a general methodology for top-k monitoring that restricts processing to the subdomains of the workspace that influence the result of some query. To cope with high stream rates and provide fast answers in an online fashion, the data in W reside in main memory. The valid records are indexed by a grid structure, which also maintains bookkeeping information. We present two processing techniques: the first one computes the new answer of a query whenever some of the current top-k points expire; the second one partially precomputes the future changes in the result, achieving better running time at the expense of slightly higher space requirements. We analyze the performance of both algorithms and evaluate their efficiency through extensive experiments. Finally, we extend the proposed framework to other query types and a different data stream model.
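The first processing technique described above (recompute the answer only when a current top-k point expires) can be sketched as follows. This is a simplified illustration: it uses a count-based window and a linear scan of the window where the paper uses a grid index, and the class name `TopKMonitor` and its fields are hypothetical.

```python
from collections import deque
import heapq

class TopKMonitor:
    """Count-based sliding window that rescans only when a result tuple expires."""

    def __init__(self, k, window_size, score):
        self.k, self.window_size, self.score = k, window_size, score
        self.window = deque()      # valid tuples, oldest first
        self.top = []              # current top-k, best first
        self.recomputations = 0    # how often the whole window was rescanned

    def insert(self, tup):
        self.window.append(tup)
        expired = self.window.popleft() if len(self.window) > self.window_size else None
        if expired is not None and expired in self.top:
            # A result tuple left the window: recompute from the valid tuples.
            self.recomputations += 1
            self.top = heapq.nlargest(self.k, self.window, key=self.score)
        elif len(self.top) < self.k or self.score(tup) > self.score(self.top[-1]):
            # The arrival enters the result: merge it without scanning the window.
            self.top = heapq.nlargest(self.k, self.top + [tup], key=self.score)
        return self.top

monitor = TopKMonitor(k=2, window_size=3, score=lambda t: t)
for value in [5, 1, 4, 2, 3]:
    result = monitor.insert(value)
print(result, monitor.recomputations)  # [4, 3] 1
```

In this run only the expiry of 5 forces a rescan; the expiry of 1 is ignored because 1 was not part of the result at that time.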
Selecting Stars: The k Most Representative Skyline Operator
In Proc. of the Int. IEEE Conf. on Data Engineering (ICDE), 2007
Cited by 57 (2 self)
Skyline computation has many applications including multi-criteria decision making. In this paper, we study the problem of selecting k skyline points so that the number of points, which are dominated by at least one of these k skyline points, is maximized. We first present an efficient dynamic programming based exact algorithm in a 2d-space. Then, we show that the problem is NP-hard when the dimensionality is 3 or more and it can be approximately solved by a polynomial time algorithm with the guaranteed approximation ratio 1 − 1/e. To speed up the computation, an efficient, scalable, index-based randomized algorithm is developed by applying the FM probabilistic counting technique. A comprehensive performance evaluation demonstrates that our randomized technique is very efficient, highly accurate, and scalable.
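Selecting k skyline points that dominate as many points as possible is an instance of maximum coverage, for which the standard greedy heuristic is known to achieve a 1 − 1/e approximation ratio. The sketch below shows that greedy heuristic; whether it matches the paper's polynomial-time algorithm in detail is an assumption. It assumes smaller values are better, and the function names and sample points are hypothetical.

```python
def dominates(a, b):
    """a dominates b: no worse in every dimension, strictly better in one (smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Points not dominated by any other point, deduplicated and sorted."""
    return sorted({p for p in points if not any(dominates(q, p) for q in points)})

def greedy_representatives(points, k):
    """Repeatedly pick the skyline point that dominates the most not-yet-covered points."""
    sky = skyline(points)
    coverage = {s: {p for p in points if dominates(s, p)} for s in sky}
    chosen, covered = [], set()
    for _ in range(min(k, len(sky))):
        best = max((s for s in sky if s not in chosen),
                   key=lambda s: len(coverage[s] - covered))
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered

# Hypothetical 2d data set.
points = [(1, 6), (2, 3), (5, 1), (2, 7), (3, 4), (4, 5), (6, 2), (7, 7)]
chosen, covered = greedy_representatives(points, k=2)
print(chosen, len(covered))  # [(2, 3), (5, 1)] 5
```

The skyline here is (1, 6), (2, 3), (5, 1). The greedy choice takes (2, 3) first because it dominates four points, then (5, 1) because it adds (6, 2), whereas (1, 6) adds nothing new.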
Efficient Skyline Computation over Low-Cardinality Domains
2007
Cited by 32 (0 self)
Current skyline evaluation techniques follow a common paradigm that eliminates data elements from skyline consideration by finding other elements in the dataset that dominate them. The performance of such techniques is heavily influenced by the underlying data distribution (i.e. whether the dataset attributes are correlated, independent, or anti-correlated). In this paper, we propose the Lattice Skyline Algorithm (LS) that is built around a new paradigm for skyline evaluation on datasets with attributes that are drawn from low-cardinality domains. LS continues to apply even if one attribute has high cardinality. Many skyline applications naturally have such data characteristics, and previous skyline methods have not exploited this property. We show that for typical dimensionalities, the complexity of LS is linear in the number of input tuples. Furthermore, we show that the performance of LS is independent of the input data distribution. Finally, we demonstrate through extensive experimentation on both real and synthetic datasets that LS can result in a significant performance advantage over existing techniques.
Categorical Skylines for Streaming Data
2008
Cited by 16 (3 self)
The problem of skyline computation has attracted considerable research attention. In the categorical domain the problem becomes more complicated, primarily due to the partially-ordered nature of the attributes of tuples. In this paper, we initiate a study of streaming categorical skylines. We identify the limitations of existing work for offline categorical skyline computation and realize novel techniques for the problem of maintaining the skyline of categorical data in a streaming environment. In particular, we develop a lightweight data structure for indexing the tuples in the streaming buffer, that can gracefully adapt to tuples with many attributes and partially ordered domains of any size and complexity. Additionally, our study of the dominance relation in the dual space allows us to utilize geometric arrangements in order to index the categorical skyline and efficiently evaluate dominance queries. Lastly, a thorough experimental study evaluates the efficiency of the proposed techniques.
Probabilistic Skyline Operators over Sliding Windows
ICDE 2009
Cited by 16 (6 self)
Skyline computation has many applications including multi-criteria decision making. In this paper, we study the problem of efficient processing of continuous skyline queries over sliding windows on uncertain data elements regarding given probability thresholds. We first characterize what kind of elements we need to keep in our query computation. Then we show the size of dynamically maintained candidate set and the size of skyline. We develop novel, efficient techniques to process a continuous, probabilistic skyline query. Finally, we extend our techniques to the applications where multiple probability thresholds are given or we want to retrieve “top-k” skyline data objects. Our extensive experiments demonstrate that the proposed techniques are very efficient and handle a high-speed data stream in real time.
Continuous Nearest Neighbor Queries over Sliding Windows
IEEE Transactions on Knowledge and Data Engineering (TKDE), 2007
Cited by 14 (2 self)
This paper studies continuous monitoring of nearest neighbor (NN) queries over sliding window streams. According to this model, data points continuously stream in the system, and they are considered valid only while they belong to a sliding window that contains 1) the W most recent arrivals (count-based) or 2) the arrivals within a fixed interval W covering the most recent time stamps (time-based). The task of the query processor is to constantly maintain the result of long-running NN queries among the valid data. We present two processing techniques that apply to both count-based and time-based windows. The first one adapts conceptual partitioning, the best existing method for continuous NN monitoring over update streams, to the sliding window model. The second technique reduces the problem to skyline maintenance in the distance-time space and precomputes the future changes in the NN set. We analyze the performance of both algorithms and extend them to variations of NN search. Finally, we compare their efficiency through a comprehensive experimental evaluation. The skyline-based algorithm achieves lower CPU cost, at the expense of slightly larger space overhead. Index Terms: Location-dependent and sensitive, spatial databases, query processing, nearest neighbors, data streams, sliding windows.
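The reduction to skyline maintenance in the distance-time space can be illustrated for a single nearest neighbor and a count-based window. A point can only ever be the NN if no other valid point is both at least as close to the query and expires later, so only that skyline needs to be kept. The sketch below is a minimal illustration of this idea, not the paper's algorithm; the class name `SlidingWindowNN` and its fields are hypothetical.

```python
from collections import deque
from math import dist

class SlidingWindowNN:
    """Keeps only points not dominated in (distance, expiry) space."""

    def __init__(self, query, window_size):
        self.query, self.window_size = query, window_size
        self.candidates = deque()  # (arrival index, distance, point); distance increases toward the back
        self.arrivals = 0

    def insert(self, point):
        d = dist(self.query, point)
        # Older points that are no closer expire first and can never become the NN.
        while self.candidates and self.candidates[-1][1] >= d:
            self.candidates.pop()
        self.candidates.append((self.arrivals, d, point))
        self.arrivals += 1
        # Drop front candidates that are no longer among the W most recent arrivals.
        while self.candidates[0][0] <= self.arrivals - 1 - self.window_size:
            self.candidates.popleft()
        return self.candidates[0][2]  # current nearest neighbor

nn = SlidingWindowNN(query=(0, 0), window_size=3)
stream = [(1, 0), (5, 0), (3, 0), (4, 0), (6, 0)]
print([nn.insert(p) for p in stream])
# [(1, 0), (1, 0), (1, 0), (3, 0), (3, 0)]
```

When (1, 0) expires, the next NN, (3, 0), is already at the front of the candidate list, which is the sense in which future changes to the result are precomputed.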
Probabilistic path queries in road networks: traffic uncertainty aware path selection
In EDBT, 2010
Cited by 12 (0 self)
Path queries such as “finding the shortest path in travel time from my hotel to the airport” are heavily used in many applications of road networks. Currently, simple statistic aggregates such as the average travel time between two vertices are often used to answer path queries. However, such simple aggregates often cannot capture the uncertainty inherent in traffic. In this paper, we study how to take traffic uncertainty into account in answering path queries in road networks. To capture the uncertainty in traffic such as the travel time between two vertices, the weight of an edge is modeled as a random variable and is approximated by a set of samples. We propose three novel types of probabilistic path queries using basic probability principles: (1) a probabilistic path query like “what are the paths from my hotel to the airport whose travel time is at most 30 minutes with a probability of at least 90%?”; (2) a weight-threshold top-k path query like “what are the top-3 paths from my hotel to the airport with the highest probabilities to take at most 30 minutes?”; and (3) a probability-threshold top-k path query like “what are the top-3 shortest paths from my hotel to the airport whose travel time is guaranteed by a probability of at least 90%?” To evaluate probabilistic path queries efficiently, we develop three efficient probability calculation methods: an exact algorithm, a constant factor approximation method and a sampling based approach. Moreover, we devise the P* algorithm, a best-first search method based on a novel hierarchical partition tree index and three effective heuristic evaluation functions. An extensive empirical study using real road networks and synthetic data sets shows the effectiveness of the proposed path queries and the efficiency of the query evaluation methods.
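The first query type above (paths whose travel time is at most a limit with at least a given probability) can be sketched for paths that are already enumerated. The sketch assumes that the samples of an edge are equally likely and that edges are independent, which is a simplification that may not match the paper's model. It is not the paper's exact algorithm or the P* search; the function names and the two sample paths are hypothetical.

```python
from collections import Counter

def path_time_distribution(edge_samples):
    """Distribution of total travel time for a path given per-edge sample sets."""
    distribution = Counter({0: 1.0})
    for samples in edge_samples:
        combined = Counter()
        for total, pr in distribution.items():
            for s in samples:
                # Each sample of this edge is assumed equally likely.
                combined[total + s] += pr / len(samples)
        distribution = combined
    return distribution

def prob_within(edge_samples, limit):
    """Probability that the total travel time is at most `limit`."""
    return sum(pr for total, pr in path_time_distribution(edge_samples).items()
               if total <= limit)

def probabilistic_path_query(paths, limit, tau):
    """Names of paths whose travel time is at most `limit` with probability >= tau."""
    return [name for name, edges in paths.items() if prob_within(edges, limit) >= tau]

# Hypothetical paths: each is a list of edges, each edge a list of sampled minutes.
paths = {
    "highway":  [[10, 15], [10, 20]],
    "downtown": [[12, 14], [14, 16]],
}
print(prob_within(paths["highway"], 30))          # 0.75
print(probabilistic_path_query(paths, 30, 0.9))   # ['downtown']
```

The highway path has the lower best-case time (20 minutes) but exceeds 30 minutes in one of four equally likely outcomes, so only the downtown path satisfies the 90% threshold.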
Online interval skyline queries on time series
In Proceedings of the 25th International Conference on Data Engineering (ICDE’09), 2009
Cited by 10 (3 self)
In many applications, we need to analyze a large number of time series. Segments of time series demonstrating dominating advantages over others are often of particular interest. In this paper, we advocate interval skyline queries, a novel type of time series analysis queries. For a set of time series and a given time interval [i : j], an interval skyline query returns the time series which are not dominated by any other time series in the interval. We illustrate the usefulness of interval skyline queries in applications. Moreover, we develop an on-the-fly method and a view-materialization method to online answer interval skyline queries on time series. The on-the-fly method keeps the minimum and the maximum values of the time series using radix priority search trees and sketches, and computes the skyline at the query time. The view-materialization method maintains the skylines over all intervals in a compact data structure. Through theoretical analysis and extensive experiments, we show that both methods only require linear space and are efficient in query answering as well as incremental maintenance.
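The interval skyline definition above can be made concrete with a brute-force sketch. It is neither the on-the-fly nor the view-materialization method from the abstract. It assumes that larger values are better and that the interval [i : j] includes both endpoints; the function names and the sample series are hypothetical.

```python
def dominates_in_interval(s, t, i, j):
    """s dominates t on [i : j]: never smaller, and strictly larger at some time stamp."""
    stamps = range(i, j + 1)
    return all(s[k] >= t[k] for k in stamps) and any(s[k] > t[k] for k in stamps)

def interval_skyline(series, i, j):
    """Names of the time series not dominated by any other series on [i : j]."""
    return [
        name for name, s in series.items()
        if not any(dominates_in_interval(t, s, i, j)
                   for other, t in series.items() if other != name)
    ]

# Hypothetical data: three time series over four time stamps.
series = {
    "s1": [4, 3, 2, 5],
    "s2": [1, 5, 3, 1],
    "s3": [2, 2, 1, 6],
}
print(interval_skyline(series, 0, 2))  # ['s1', 's2']
print(interval_skyline(series, 2, 3))  # ['s1', 's2', 's3']
```

On [0 : 2] the series s3 is dominated by s1, but on [2 : 3] it has the largest final value and rejoins the skyline, which shows why the answer depends on the queried interval.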
Distributed Skyline Retrieval with Low Bandwidth Consumption
Cited by 9 (0 self)
We consider skyline computation when the underlying dataset is horizontally partitioned onto geographically distant servers that are connected to the Internet. The existing solutions are not suitable for our problem, because they have at least one of the following drawbacks: (i) applicable only to distributed systems adopting vertical partitioning or restricted horizontal partitioning, (ii) effective only when each server has limited computing and communication abilities, and (iii) optimized only for skyline search in subspaces but inefficient in the full space. This paper proposes an algorithm, called feedback-based distributed skyline (FDS), to support arbitrary horizontal partitioning. FDS aims at minimizing the network bandwidth, measured in the number of tuples transmitted over the network. The core of FDS is a novel feedback-driven mechanism, where the coordinator iteratively transmits certain feedback to each participant. Participants can leverage such information to prune a large amount of local data, which otherwise would need to be sent to the coordinator. Extensive experimentation confirms that FDS significantly outperforms alternative approaches in both effectiveness and progressiveness.