Categorical Skylines for Streaming Data
, 2008
Cited by 10 (3 self)
The problem of skyline computation has attracted considerable research attention. In the categorical domain the problem becomes more complicated, primarily due to the partially-ordered nature of the attributes of tuples. In this paper, we initiate a study of streaming categorical skylines. We identify the limitations of existing work for offline categorical skyline computation and realize novel techniques for the problem of maintaining the skyline of categorical data in a streaming environment. In particular, we develop a lightweight data structure for indexing the tuples in the streaming buffer, that can gracefully adapt to tuples with many attributes and partially ordered domains of any size and complexity. Additionally, our study of the dominance relation in the dual space allows us to utilize geometric arrangements in order to index the categorical skyline and efficiently evaluate dominance queries. Lastly, a thorough experimental study evaluates the efficiency of the proposed techniques.
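The dominance relation that this abstract builds on can be illustrated with a minimal sketch. It is not the paper's streaming index or its dual-space technique; it is a naive offline skyline over partially ordered attributes, and the names `transitive_closure`, `dominates`, `skyline`, and the example preference data are all hypothetical.

```python
from itertools import product

def transitive_closure(prefs):
    """prefs: set of (better, worse) pairs for one attribute.
    Returns every (x, y) pair where x is strictly preferred to y."""
    closure = set(prefs)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure

def dominates(p, q, orders):
    """p dominates q if p is at least as good on every attribute
    and strictly better on at least one."""
    strictly_better = False
    for x, y, order in zip(p, q, orders):
        if x == y:
            continue
        if (x, y) in order:
            strictly_better = True
        else:
            return False  # y is preferred to x, or the two are incomparable
    return strictly_better

def skyline(tuples, orders):
    """Naive quadratic skyline: keep the tuples that no other tuple dominates."""
    return [t for t in tuples if not any(dominates(o, t, orders) for o in tuples)]

# Hypothetical preferences: red > blue > green; brand A > B, brand C incomparable.
orders = [
    transitive_closure({("red", "blue"), ("blue", "green")}),
    transitive_closure({("A", "B")}),
]
tuples = [("red", "A"), ("green", "B"), ("blue", "C"), ("green", "C")]
result = skyline(tuples, orders)
```

Because brand C is incomparable to A and B, `("blue", "C")` survives even though `("red", "A")` has the better colour; this incomparability is what makes the partially ordered case harder than the totally ordered one.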
Thread Cooperation in Multicore Architectures for Frequency Counting over Multiple Data Streams
Cited by 5 (0 self)
Many real-world data stream analysis applications such as network monitoring, click stream analysis, and others require combining multiple streams of data arriving from multiple sources. This is referred to as multi-stream analysis. To deal with high stream arrival rates, it is desirable that such systems be capable of supporting very high processing throughput. The advent of multicore processors and powerful servers driven by these processors calls for efficient parallel designs that can effectively utilize the parallelism of the multicores, since performance improvement is possible only through effective parallelism. In this paper, we address the problem of parallelizing multi-stream analysis in the context of multicore processors. Specifically, we concentrate on parallelizing frequent elements, top-k, and frequency counting over multiple streams. We discuss the challenges in designing an efficient parallel system for multi-stream processing. Our evaluation and analysis reveal that traditional “contention”-based locking results in excessive overhead and waiting, which in turn leads to severe performance degradation in modern multicore architectures. Based on our analysis, we propose a “cooperation”-based locking paradigm for efficient parallelization of frequency counting. The proposed “cooperation”-based paradigm removes waits associated with synchronization, and allows replacing locks by much cheaper atomic synchronization primitives. Our implementation of the proposed paradigm to parallelize a well-known frequency counting algorithm shows the benefits of the proposed “cooperation”-based locking paradigm when compared to the traditional “contention”-based locking paradigm. In our experiments, the proposed “cooperation”-based design outperforms the traditional “contention”-based design by a factor of 2 to 5.5 for synthetic Zipfian data sets.
CoTS: A Scalable Framework for Parallelizing Frequency Counting over Data Streams
- IN ICDE
, 2009
Cited by 3 (2 self)
Applications involving analysis of data streams have gained significant popularity and importance. Frequency counting, frequent elements and top-k queries form a class of operators that are used for a wide range of stream analysis applications. In spite of the abundance of these algorithms, all known techniques for answering data stream queries are sequential in nature. The imminent ubiquity of Chip Multi-Processor (CMP) architectures requires algorithms that can exploit the parallelism of such architectures. In this paper, we first explore the challenges in parallelizing frequent elements and top-k queries in the context of the inherent parallelism available in multi-core processors, evaluate different naive techniques for intra-operator parallelism, and summarize the insights obtained from the different
The Gist of Everything New: Personalized Top-k Processing over Web 2.0 Streams
Cited by 3 (1 self)
Web 2.0 portals have made content generation easier than ever, with millions of users contributing news stories in the form of posts in weblogs or short textual snippets as in Twitter. Efficient and effective filtering solutions are key to allowing users to stay tuned to this ever-growing ocean of information, releasing only relevant trickles of personal interest. In classical information filtering systems, user interests are formulated using standard IR techniques, and data from all available information sources is filtered based on a predefined absolute quality-based threshold. In contrast to this restrictive approach, which may still overwhelm the user with the returned stream of data, we envision a system which continuously keeps the user updated with only the top-k relevant new information. Freshness of data is guaranteed by considering it valid for a particular time interval, controlled by a sliding window. Considering relevance as relative to the existing pool of new information creates a highly dynamic setting. We present POL-filter, which together with our maintenance module constitutes an efficient solution to this kind of problem. We show by comprehensive performance evaluations using real-world data, obtained from a weblog crawl, that our approach brings performance gains compared to the state of the art.
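The setting described in this abstract (top-k relevance evaluated only over items inside a sliding time window) can be sketched as follows. This is not POL-filter or its maintenance module; it is a simple recompute-on-query baseline, and the class name `SlidingWindowTopK`, its methods, and the example scores are hypothetical. It assumes items arrive with non-decreasing timestamps.

```python
import heapq
from collections import deque

class SlidingWindowTopK:
    """Keeps items that arrived within the last `window` time units
    and reports the k highest-scoring ones on demand."""

    def __init__(self, k, window):
        self.k = k
        self.window = window
        self.items = deque()  # (timestamp, score, item) in arrival order

    def add(self, timestamp, score, item):
        # Assumption: timestamps are non-decreasing across calls.
        self.items.append((timestamp, score, item))

    def top_k(self, now):
        # Expire everything that has fallen out of the window.
        while self.items and self.items[0][0] <= now - self.window:
            self.items.popleft()
        best = heapq.nlargest(self.k, self.items, key=lambda e: e[1])
        return [item for _, _, item in best]

feed = SlidingWindowTopK(k=2, window=10)
feed.add(0, 0.9, "a")
feed.add(5, 0.5, "b")
feed.add(8, 0.7, "c")
early = feed.top_k(now=9)   # all three items are still valid
late = feed.top_k(now=12)   # "a" has expired, so "b" re-enters the top-k
```

The second query shows why the setting is dynamic: an item that was previously outside the top-k can become part of the result purely because a better item expired, without any new arrival.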
CAM Conscious Integrated Answering of Frequent Elements and Top-k Queries over Data Streams
Cited by 1 (1 self)
Frequent elements and top-k queries constitute an important class of queries for data stream analysis applications. Certain applications require answers for both frequent elements and top-k queries on the same stream. In addition, the ever-increasing data rates call for providing fast answers to the queries, and researchers have been looking towards exploiting specialized hardware for this purpose. Content Addressable Memory (CAM) provides an efficient way of looking up elements and hence is well suited for the class of algorithms that involve lookups. In this paper, we present a fast and efficient CAM conscious integrated solution for answering both frequent elements and top-k queries on the same stream. We call our scheme CAM conscious Space Saving with Stream Summary (CSSwSS), and it can efficiently answer continuous queries. We provide an implementation of the proposed scheme using commodity CAM chips, and the experimental evaluation demonstrates that not only does the proposed scheme outperform existing CAM conscious techniques by an order of magnitude at query loads of about 10%, but it can also efficiently answer continuous queries.
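For readers unfamiliar with the Space-Saving algorithm that the scheme's name refers to, a minimal software sketch is given below. It is not the CAM conscious CSSwSS scheme: it uses plain dictionaries and a linear scan for the minimum counter instead of a Stream Summary structure or CAM lookups, and the class and method names are hypothetical.

```python
class SpaceSaving:
    """Bounded-memory frequency counting with at most `capacity` counters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # item -> estimated count (never an underestimate)
        self.errors = {}  # item -> maximum possible overestimation

    def offer(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Evict the item with the smallest counter; the newcomer inherits
            # that count as its error bound. This scan is O(capacity); a
            # Stream Summary structure would avoid it.
            victim = min(self.counts, key=self.counts.get)
            min_count = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = min_count + 1
            self.errors[item] = min_count

    def top_k(self, k):
        """The k monitored items with the largest estimated counts."""
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]

    def frequent(self, threshold):
        """Candidate items whose estimated count exceeds `threshold`."""
        return {x: c for x, c in self.counts.items() if c > threshold}

ss = SpaceSaving(capacity=2)
for token in ["a", "a", "b", "c", "a"]:
    ss.offer(token)
```

Both query types read the same counter table, which is what makes an integrated answer for frequent elements and top-k natural. In the example, `c` replaces `b` and inherits its count of 1 as an error bound, so its estimate of 2 may overstate its true frequency by at most 1.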
Aggregate Computation over Data Streams
High-speed data streams are now a widely recognized phenomenon. Many applications need to compute various statistics over data streams, including the processing of relational-type queries, data mining, and high-speed network management. In this paper, we provide a survey of three important kinds of aggregate computation over data streams: frequency moments, frequency counts, and order statistics.
Efficient Distributed Top-k Query in Wireless Sensor Networks
In this paper, we focus on designing efficient queries for the top-k data produced by sensor nodes in a wireless sensor network (WSN). Although efficient top-k querying has been extensively studied in the database community, surprisingly little is known about efficient top-k querying in WSNs. Assume that we are given a connected WSN of diameter $D$, consisting of $n$ nodes with maximum node degree $\Delta$. Two different network models are studied. In the first model, each node holds a numeric element, and the goal is to determine the $k$ smallest of these elements. In the second model, there is a set $L$ of $m$ objects, and each node $v_i$ holds a numeric value $s_j(v_i)$ for each object $L_j \in L$; the goal is to find the $k$ objects with the $k$ smallest aggregated values $f(s_j(v_1), s_j(v_2), \dots, s_j(v_n))$, where $f$ is an aggregation function given in advance. We propose both delay-efficient and message-efficient methods for conducting top-k queries in both models. Then we study the minimum delay and number of messages required by any distributed method for top-k queries in both models. Our analysis shows that our methods are almost optimal. We conducted extensive testbed experiments and simulations to study the practical performance of our methods.
de Lausanne
Existing content-based publish/subscribe systems are designed assuming that all matching publications are equally relevant to a subscription. As we cannot know in advance the distribution of publication content, the following two unwanted situations are highly possible: a subscriber either receives too many or only few publications. In this paper we present a new publish/subscribe model which is based on the sliding window computation model. Our model assumes that publications have different relevance to a subscription. In the model, a subscriber receives k most relevant publications published within a time window w, where k and w are parameters defined per each subscription. As a consequence, the arrival rate of incoming relevant publications per subscription is predefined, and does not depend on the publication
JTop Algorithms for Top-k Join Queries
, 2008
Top-k join queries have become very important in many areas of computing. One of the most efficient algorithms for top-k join queries is the Rank-Join algorithm [17] [18]. However, there are many cases where Rank-Join performs many unnecessary accesses to the input data sources. In this report, we first show that there are many cases where Rank-Join’s stopping mechanism is not efficient, which causes these unnecessary accesses. Then, we propose JTop, a family of much more efficient algorithms for top-k queries. We prove that our algorithms always perform less work than Rank-Join, and thus are more efficient. We also show that the performance of our algorithms can be O(n) times better than that of Rank-Join, where n is the number of data items in the database. We evaluated the performance of our algorithms through experimentation over databases with different distributions. The results show that over the tested databases our algorithms significantly outperform Rank-Join.

