Results 1 - 10 of 53
A Survey of Top-k Query Processing Techniques in Relational Database Systems
Cited by 167 (6 self)
Efficient processing of top-k queries is a crucial requirement in many interactive environments that involve massive amounts of data. In particular, efficient top-k processing in domains such as the Web, multimedia search and distributed systems has shown a great impact on performance. In this survey, we describe and classify top-k processing techniques in relational databases. We discuss different design dimensions in the current techniques including query models, data access methods, implementation levels, data and query certainty, and supported scoring functions. We show the implications of each dimension on the design of the underlying techniques. We also discuss top-k queries in the XML domain, and show their connections to relational approaches.
Top-k query processing in uncertain databases
- In ICDE
, 2007
Cited by 125 (9 self)
Top-k processing in uncertain databases is semantically and computationally different from traditional top-k processing. The interplay between score and uncertainty makes traditional techniques inapplicable. We introduce new probabilistic formulations for top-k queries. Our formulations are based on a “marriage” of traditional top-k semantics and possible worlds semantics. In the light of these formulations, we construct a framework that encapsulates a state space model and efficient query processing techniques to tackle the challenges of uncertain data settings. We prove that our techniques are optimal in terms of the number of accessed tuples and materialized search states. Our experiments show the efficiency of our techniques under different data distributions with orders of magnitude improvement over naïve materialization of possible worlds.
Outlier Detection in Sensor Networks
- MobiHoc'07
, 2007
Cited by 35 (1 self)
Outlier detection has many important applications in sensor networks, e.g., abnormal event detection, animal behavior change, etc. It is a difficult problem since global information about data distributions must be known to identify outliers. In this paper, we use a histogram-based method for outlier detection to reduce communication cost. Rather than collecting all the data in one location for centralized processing, we propose collecting hints (in the form of a histogram) about the data distribution, and using the hints to filter out unnecessary data and identify potential outliers. We show that this method can be used for detecting outliers in terms of two different definitions. Our simulation results show that the histogram method can dramatically reduce the communication cost.
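The idea of using histogram hints to filter data can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes one-dimensional readings, a distance-based outlier definition (a value is an outlier if fewer than `k` other values lie within distance `d`), and a bucket width equal to `d`. The function names `merge_histograms` and `classify_buckets` are hypothetical. Under these assumptions, all neighbors of a value lie in its own bucket or the two adjacent ones, so bucket counts alone settle many cases and only "undecided" buckets would need raw data.

```python
from collections import Counter

def merge_histograms(node_values, width):
    """Build one bucket-count histogram from the readings of all nodes.

    In a real network each node would send only its local counts;
    here the per-node lists are bucketed and merged in one place.
    """
    hist = Counter()
    for values in node_values:
        hist.update(int(v // width) for v in values)
    return hist

def classify_buckets(hist, k):
    """Label each non-empty bucket using counts only (bucket width == d).

    'outlier'  : even counting both adjacent buckets, fewer than k neighbors.
    'inlier'   : the bucket alone already supplies at least k neighbors.
    'undecided': counts are inconclusive; raw values would be needed.
    """
    labels = {}
    for b, count in hist.items():
        nearby = hist.get(b - 1, 0) + count + hist.get(b + 1, 0)
        if nearby - 1 < k:          # upper bound on neighbors is below k
            labels[b] = "outlier"
        elif count - 1 >= k:        # lower bound on neighbors reaches k
            labels[b] = "inlier"
        else:
            labels[b] = "undecided"
    return labels

# Three hypothetical sensor nodes; d = 5, k = 2.
hist = merge_histograms([[1, 2, 3], [2.5, 4], [50]], width=5)
labels = classify_buckets(hist, k=2)
print(labels)
```

Only values that fall in "undecided" buckets would have to be transmitted in full, which is where the communication saving in such a scheme would come from.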
Top-k monitoring in wireless sensor networks
- IEEE TRANS. KNOWLEDGE AND DATA ENG
, 2007
Cited by 22 (2 self)
Top-k monitoring is important to many wireless sensor applications. This paper exploits the semantics of top-k query and proposes an energy-efficient monitoring approach called FILA. The basic idea is to install a filter at each sensor node to suppress unnecessary sensor updates. Filter setting and query reevaluation upon updates are two fundamental issues to the correctness and efficiency of the FILA approach. We develop a query reevaluation algorithm that is capable of handling concurrent sensor updates. In particular, we present optimization techniques to reduce the probing cost. We design a skewed filter setting scheme, which aims to balance energy consumption and prolong network lifetime. Moreover, two filter update strategies, namely, eager and lazy, are proposed to favor different application scenarios. We also extend the algorithms to several variants of top-k query, that is, order-insensitive, approximate, and value monitoring. The performance of the proposed FILA approach is extensively evaluated using real data traces. The results show that FILA substantially outperforms the existing TAG-based approach and range caching approach in terms of both network lifetime and energy consumption under various network configurations.
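The filter idea in this abstract can be sketched in a few lines of Python. This is an illustration only, not FILA's actual filter setting or reevaluation algorithm: it assumes the order-insensitive variant, a single boundary placed midway between the k-th and (k+1)-th readings (a uniform rather than skewed setting), and the hypothetical names `assign_filters` and `must_report`. A node stays silent while its reading remains inside its interval, because the top-k set cannot change in that case.

```python
import math

def assign_filters(readings, k):
    """Give each node an interval [low, high) around its current reading.

    readings: dict mapping node id -> latest value; requires k < len(readings).
    Top-k nodes get [boundary, inf); all others get (-inf, boundary).
    """
    ranked = sorted(readings, key=readings.get, reverse=True)
    boundary = (readings[ranked[k - 1]] + readings[ranked[k]]) / 2
    filters = {}
    for rank, node in enumerate(ranked):
        if rank < k:
            filters[node] = (boundary, math.inf)
        else:
            filters[node] = (-math.inf, boundary)
    return filters

def must_report(value, flt):
    """A node transmits only when its new reading leaves its filter."""
    low, high = flt
    return not (low <= value < high)

# Hypothetical readings from three nodes, monitoring the top-1 value.
filters = assign_filters({"n1": 30, "n2": 20, "n3": 10}, k=1)
print(must_report(27, filters["n1"]))  # n1 drops but stays above the boundary
print(must_report(26, filters["n2"]))  # n2 crosses the boundary
```

When a node does report, the base station would have to reevaluate the query and reset filters; that step is exactly the part the paper treats in detail and is omitted here.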
Distributed Image Search in Camera Sensor Networks
Cited by 22 (2 self)
Recent advances in sensor networks permit the use of a large number of relatively inexpensive distributed computational nodes with camera sensors linked in a network and possibly linked to one or more central servers. We argue that the full potential of such a distributed system can be realized if it is designed as a distributed search engine where images from different sensors can be captured, stored, searched and queried. However, unlike traditional image search engines that are focused on resource-rich situations, the resource limitations of camera sensor networks in terms of energy, bandwidth, computational power, and memory capacity present significant challenges. In this paper, we describe the design and implementation of a distributed search system over a camera sensor network where each node is a search engine that senses, stores and searches information. Our work involves innovation at many levels including local storage, local search, and distributed search, all of which are designed to be efficient under the resource constraints of sensor networks. We present an implementation of the search engine on a network of iMote2 sensor nodes equipped with low-power cameras and extended flash storage. We evaluate our system for a dataset comprising book images, and demonstrate more than two orders of magnitude reduction in the amount of data communicated and up to 5x reduction in overall energy consumption over alternate techniques.
Query answering techniques on uncertain and probabilistic data
- In SIGMOD 2008
Cited by 18 (4 self)
Uncertain data are inherent in some important applications, such as environmental surveillance, market analysis, and quantitative economics research. Due to the importance of those applications and the rapidly increasing amount of uncertain data collected and accumulated, analyzing large collections of uncertain data has become an important task and has attracted more and more interest from the database community. Recently, uncertain data management has become an emerging hot area in database research and development. In this tutorial, we systematically review some representative studies on answering various queries on uncertain and probabilistic data.
Cleaning Uncertain Data with Quality Guarantees
, 2008
Cited by 17 (1 self)
Uncertain or imprecise data are pervasive in applications like location-based services, sensor monitoring, and data collection and integration. For these applications, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with statistical confidence. Given that a limited amount of resources is available to “clean” the database (e.g., by probing some sensor data values to get their latest values), we address the problem of choosing the set of uncertain objects to be cleaned, in order to achieve the best improvement in the quality of query answers. For this purpose, we present the PWS-quality metric, which is a universal measure that quantifies the ambiguity of query answers under the possible world semantics. We study how PWS-quality can be efficiently evaluated for two major query classes: (1) queries that examine the satisfiability of tuples independent of other tuples (e.g., range queries); and (2) queries that require the knowledge of the relative ranking of the tuples (e.g., MAX queries). We then propose a polynomial-time solution to achieve an optimal improvement in PWS-quality. Other fast heuristics are presented as well. Experiments, performed on both real and synthetic datasets, show that the PWS-quality metric can be evaluated quickly, and that our cleaning algorithm provides an optimal solution with high efficiency. To the best of our knowledge, this is the first work that develops a quality metric for a probabilistic database, and investigates how such a metric can be used for data cleaning purposes.
Best Position Algorithms for Top-k Queries
- INTERNATIONAL CONFERENCE ON VERY LARGE DATA BASES (VLDB) (2007) 495-506
, 2007
Cited by 9 (2 self)
The general problem of answering top-k queries can be modeled using lists of data items sorted by their local scores. The most efficient algorithm proposed so far for answering top-k queries over sorted lists is the Threshold Algorithm (TA). However, TA may still incur a lot of useless accesses to the lists. In this paper, we propose two new algorithms which stop much sooner. First, we propose the best position algorithm (BPA) which executes top-k queries more efficiently than TA. For any database instance (i.e., a set of sorted lists), we prove that BPA stops as early as TA, and that its execution cost is never higher than TA. We show that the position at which BPA stops can be (m-1) times lower than that of TA, where m is the number of lists. We also show that the execution cost of our algorithm can be (m-1) times lower than that of TA. Second, we propose the BPA2 algorithm which is much more efficient than BPA. We show that the number of accesses to the lists done by BPA2 can be about (m-1) times lower than that of BPA. Our performance evaluation shows that over our test databases, BPA and BPA2 achieve significant performance gains in comparison with TA.
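For readers unfamiliar with the baseline, the Threshold Algorithm (TA) that this abstract compares against can be sketched as follows. This is a minimal Python illustration of TA only, not of BPA or BPA2. It assumes that every item appears in every list, that the overall score is the sum of local scores, and that random access is modeled with dictionaries; the function name `threshold_algorithm` and the sample data are hypothetical.

```python
import heapq

def threshold_algorithm(lists, k):
    """Return (top_k, stop_position) for m lists sorted by descending local score.

    lists: list of m lists of (item, local_score) pairs.
    top_k: list of (item, overall_score), best first.
    stop_position: number of positions read under sorted access in each list.
    """
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # random access: item -> local score
    seen = {}                               # item -> overall score
    n = len(lists[0])
    for pos in range(n):
        # Sorted access: read the next entry of every list in parallel.
        for i in range(m):
            item, _ = lists[i][pos]
            if item not in seen:
                # Random access to the other lists to get the overall score.
                seen[item] = sum(lookup[j][item] for j in range(m))
        # Threshold: best overall score any unseen item could still have.
        threshold = sum(lists[i][pos][1] for i in range(m))
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) >= k and best[-1][1] >= threshold:
            return best, pos + 1
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), n

# Two hypothetical sorted lists over four items, with integer scores.
list1 = [("a", 9), ("b", 8), ("c", 3), ("d", 1)]
list2 = [("b", 9), ("a", 7), ("d", 4), ("c", 2)]
top, stop = threshold_algorithm([list1, list2], k=2)
print(top, stop)
```

In this example TA stops after reading two positions in each list, because by then the second-best seen score (16) is at least the threshold (8 + 7 = 15). The stop position is the quantity the abstract says BPA can reduce.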
Processing top-k queries in distributed hash tables
- Euro-Par Conf.
, 2007
Cited by 8 (3 self)
Distributed Hash Tables (DHTs) provide a scalable solution for data sharing in large-scale distributed systems, e.g. P2P systems. However, they only provide good support for exact-match queries, and it is hard to support complex queries such as top-k queries. In this paper, we propose a family of algorithms which deal with efficient processing of top-k queries in DHTs. We evaluated the performance of our solution through implementation over a 64-node cluster and simulation. Our performance evaluation shows very good performance, in terms of communication cost and response time.
Ranking Distributed Probabilistic Data
, 2009
Cited by 7 (1 self)
Ranking queries are essential tools to process large amounts of probabilistic data that encode exponentially many possible deterministic instances. In many applications where uncertainty and fuzzy information arise, data are collected from multiple sources in distributed, networked locations, e.g., distributed sensor fields with imprecise measurements, and multiple scientific institutes with inconsistency in their scientific data. Due to the network delay and the economic cost associated with communicating large amounts of data over a network, a fundamental problem in these scenarios is to retrieve the global top-k tuples from all distributed sites with minimum communication cost. Using the well-founded notion of the expected rank of each tuple across all possible worlds as the basis of ranking, this work designs both communication- and computation-efficient algorithms for retrieving the top-k tuples with the smallest ranks from distributed sites. Extensive experiments using both synthetic and real data sets confirm the efficiency and superiority of our algorithms over the straightforward approach of forwarding all data to the server.
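The notion of expected rank across possible worlds can be made concrete with a brute-force Python sketch. It is a centralized illustration, not the distributed algorithms of the paper. It assumes tuples exist independently with a given probability, and it assumes the convention that a tuple absent from a world takes a rank equal to that world's size; neither assumption is stated in the abstract. The names `expected_ranks` and `top_k_by_expected_rank` are hypothetical. Enumeration is exponential in the number of tuples, so this is only suitable for tiny examples.

```python
from itertools import product

def expected_ranks(tuples):
    """tuples: list of (name, score, probability_of_existence).

    Returns a dict name -> expected rank, where rank 0 is the best.
    """
    exp = {name: 0.0 for name, _, _ in tuples}
    # Each possible world is one choice of present/absent per tuple.
    for present in product([False, True], repeat=len(tuples)):
        p_world = 1.0
        for (_, _, p), here in zip(tuples, present):
            p_world *= p if here else 1 - p
        world = [t for t, here in zip(tuples, present) if here]
        for (name, score, _), here in zip(tuples, present):
            if here:
                # Number of tuples in this world with a strictly higher score.
                rank = sum(1 for _, s, _ in world if s > score)
            else:
                # Assumed convention: an absent tuple ranks after the whole world.
                rank = len(world)
            exp[name] += p_world * rank
    return exp

def top_k_by_expected_rank(tuples, k):
    """Names of the k tuples with the smallest expected rank."""
    exp = expected_ranks(tuples)
    return sorted(exp, key=exp.get)[:k]

# Two hypothetical uncertain tuples, each present with probability 0.5.
data = [("t1", 100, 0.5), ("t2", 50, 0.5)]
ranks = expected_ranks(data)
print(ranks, top_k_by_expected_rank(data, 1))
```

With these inputs the four worlds each have probability 0.25, giving t1 an expected rank of 0.25 and t2 an expected rank of 0.5, so t1 is the top-1 answer.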