Towards Expressive Publish/Subscribe Systems
- In Proc. EDBT, 2006
Cited by 43 (8 self)
Abstract. Traditional content-based publish/subscribe (pub/sub) systems allow users to express stateless subscriptions evaluated on individual events. However, many applications such as monitoring RSS streams, stock tickers, or management of RFID data streams require the ability to handle stateful subscriptions. In this paper, we introduce Cayuga, a stateful pub/sub system based on nondeterministic finite state automata (NFA). Cayuga allows users to express subscriptions that span multiple events, and it supports powerful language features such as parameterization and aggregation, which significantly extend the expressive power of standard pub/sub systems. Based on a set of formally defined language operators, the subscription language of Cayuga provides unambiguous subscription semantics as well as unique opportunities for optimizations. We experimentally demonstrate that common optimization techniques used in NFA-based systems, such as state merging, have only limited effectiveness, and we propose novel efficient indexing methods to speed up subscription processing. In a thorough experimental evaluation we show the efficacy of our approach.
Maintaining Sliding Window Skylines on Data Streams
- IEEE Transactions on Knowledge and Data Engineering, 2006
Cited by 30 (5 self)
Abstract. The skyline of a multidimensional data set contains the “best” tuples according to any preference function that is monotonic on each dimension. Although skyline computation has received considerable attention in conventional databases, the existing algorithms are inapplicable to stream applications because 1) they assume static data that are stored on disk (rather than continuously arriving/expiring), 2) they focus on “one-time” execution that returns a single skyline (in contrast to constantly tracking skyline changes), and 3) they aim at reducing the I/O overhead (as opposed to minimizing the CPU cost and main-memory consumption). This paper studies skyline computation in stream environments, where query processing takes into account only a “sliding window” covering the most recent tuples. We propose algorithms that continuously monitor the incoming data and maintain the skyline incrementally. Our techniques utilize several interesting properties of stream skylines to improve space/time efficiency by expunging data from the system as early as possible (i.e., before their expiration). Furthermore, we analyze the asymptotic performance of the proposed solutions, and evaluate their efficiency with extensive experiments. Index Terms: Skyline, stream, database, algorithm.
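The idea of expunging tuples before they expire can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes a count-based window, that smaller values are better on every dimension, and it relies only on the general observation that a tuple dominated by a newer tuple can never re-enter the skyline, because the newer tuple outlives it. The names `dominates` and `SlidingSkyline` are hypothetical.

```python
from collections import deque


def dominates(a, b):
    """True if a is no worse than b on every dimension and strictly better on one.

    Assumption: smaller values are preferred on each dimension.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


class SlidingSkyline:
    """Keeps only the tuples that may still appear in a future skyline."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.candidates = deque()  # (arrival_index, point), oldest first
        self.clock = 0

    def insert(self, point):
        self.clock += 1
        # Drop tuples that have slid out of the count-based window.
        while self.candidates and self.candidates[0][0] <= self.clock - self.window_size:
            self.candidates.popleft()
        # Early expunging: an older tuple dominated by the new one is dropped
        # now, since the new tuple will still be valid when the old one expires.
        self.candidates = deque(
            (t, p) for t, p in self.candidates if not dominates(point, p)
        )
        self.candidates.append((self.clock, point))

    def skyline(self):
        """Current skyline: candidates not dominated by any other candidate."""
        points = [p for _, p in self.candidates]
        return [p for p in points if not any(dominates(q, p) for q in points)]


sky = SlidingSkyline(window_size=3)
for pt in [(5, 5), (3, 6), (4, 4)]:
    sky.insert(pt)
print(sky.skyline())
```

In this sketch `(5, 5)` is discarded as soon as `(4, 4)` arrives, two insertions before it would have expired, which is the kind of memory saving the abstract alludes to.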
On joining and caching stochastic streams
- In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2005
Cited by 12 (1 self)
We consider the problem of joining data streams using limited cache memory, with the goal of producing as many result tuples as possible from the cache. Many cache replacement heuristics have been proposed in the past. Their performance often relies on implicit assumptions about the input streams, e.g., that the join attribute values follow a relatively stationary distribution. However, in general and in practice, streams often exhibit more complex behaviors, such as increasing trends and random walks, rendering these “hardwired” heuristics inadequate. In this paper, we propose a framework that is able to exploit known or observed statistical properties of input streams to make cache replacement decisions aimed at maximizing the expected number of result tuples. To illustrate the complexity of the solution space, we show that even an algorithm that considers, at every time step, all possible sequences of future replacement decisions may not be optimal. We then identify a condition between two candidate tuples under which an optimal algorithm would always choose one tuple over the other to replace. We develop a heuristic that behaves consistently with an optimal algorithm whenever this condition is satisfied. We show through experiments that our heuristic outperforms previous ones. As further evidence of the generality of our framework, we show that the classic caching/paging problem for static objects can be reduced to a stream join problem and analyzed under our framework, yielding results that agree with or extend classic ones.
Adaptive load shedding for windowed stream joins
- In Proc. Int. Conf. on Information and Knowledge Management (CIKM), 2005
Cited by 11 (5 self)
We present an adaptive load shedding approach for windowed stream joins. In contrast to the conventional approach of dropping tuples from the input streams, we explore the concept of selective processing for load shedding. We allow stream tuples to be stored in the windows and shed excessive CPU load by performing the join operations, not on the entire set of tuples within the windows, but on a dynamically changing subset of tuples that are learned to be highly beneficial. We support such dynamic selective processing through three forms of runtime adaptations: adaptation to input stream rates, adaptation to time correlation between the streams, and adaptation to join directions. Our load shedding approach enables us to integrate utility-based load shedding with time correlation-based load shedding. Indexes are used to further speed up the execution of stream joins. Experiments are conducted to evaluate our adaptive load shedding in terms of output rate and utility. The results show that our selective processing approach to load shedding is very effective and significantly outperforms the approach that drops tuples from the input streams.
Declarative Network Monitoring with an Underprovisioned Query Processor
- In ICDE, 2006
Cited by 9 (2 self)
Many of the data sources used in stream query processing are known to exhibit bursty behavior. We focus here on passive network monitoring, an application in which the data rates typically exhibit a large peak-to-average ratio. Provisioning a stream query processor to handle peak rates in such a setting can be prohibitively expensive. In this paper, we propose to solve this problem by provisioning the query processor for typical data rates instead of much higher peak data rates. To enable this strategy, we present mechanisms and policies for managing the tradeoffs between the latency and accuracy of query results when bursts exceed the steady-state capacity of the query processor. We describe the current status of our implementation and present experimental results on a testbed network monitoring application to demonstrate the utility of our approach.
Out-of-Order Processing: a New Architecture for High-Performance Stream Systems
- In Proc. of the VLDB Endowment, 2008
Cited by 9 (4 self)
Many stream-processing systems enforce an order on data streams during query evaluation to help unblock blocking operators and purge state from stateful operators. Such in-order processing (IOP) systems not only must enforce order on input streams, but also require that query operators preserve order. This order-preserving requirement constrains the implementation of stream systems and incurs significant performance penalties, particularly for memory consumption. Especially for high-performance, potentially distributed stream systems, the cost of enforcing order can be prohibitive. We introduce a new architecture for stream systems, out-of-order processing (OOP), that avoids ordering constraints. The OOP architecture frees stream systems from the burden of order maintenance by using explicit stream progress indicators, such as punctuation or heartbeats, to unblock and purge operators. We describe the implementation of OOP stream systems and discuss the benefits of this architecture in depth. For example, the OOP approach has proven useful for smoothing workload bursts caused by expensive end-of-window operations, which can overwhelm internal communication paths in IOP approaches. We have implemented OOP in two stream systems, Gigascope and NiagaraST. Our experimental study shows that the OOP approach can significantly outperform IOP in a number of aspects, including memory, throughput and latency.
Continuous Query Processing in Spatiotemporal Databases
- In Proceedings of the ICDE/EDBT PhD Workshop, 2004
Cited by 8 (1 self)
The tremendous increase in cellular phones, GPS-like devices, and RFIDs results in highly dynamic environments where objects as well as queries are continuously moving. In this paper, we present a continuous query processor designed specifically for highly dynamic environments (e.g., location-aware environments). We implemented the proposed continuous query processor inside the PLACE server (Pervasive Location-Aware Computing Environments), a scalable location-aware database server currently being developed at Purdue University. The PLACE server extends data stream management systems to support location-aware environments. Such environments are characterized by a wide variety of continuous spatio-temporal queries and by unbounded spatio-temporal streams. The proposed continuous query processor mainly includes: (1) developing new incremental spatio-temporal operators to support a wide variety of continuous spatio-temporal queries, (2) extending the semantics of sliding window queries to deal with spatial sliding windows as well as temporal sliding windows, and (3) providing a shared execution framework for scalable execution of a set of concurrent continuous spatio-temporal queries. Preliminary experimental evaluation shows the promising performance of the continuous query processor of the PLACE server.
State-slice: New paradigm of multiquery optimization of window-based stream queries
- In Proc. Int. Conf. on Very Large Data Bases (VLDB), 2006
Cited by 7 (0 self)
Modern stream applications such as sensor monitoring systems and publish/subscribe services necessitate the handling of large numbers of continuous queries specified over high-volume data streams. Efficient sharing of computations among multiple continuous queries, especially for the memory- and CPU-intensive window-based operations, is critical. A novel challenge in this scenario is to allow resource sharing among similar queries, even if they employ windows of different lengths. This paper first reviews the existing sharing methods in the literature, and then illustrates the significant performance shortcomings of these methods. This paper then presents a novel paradigm for the sharing of window join queries. Namely, we slice window states of a join operator into fine-grained window slices and form a chain of sliced window joins. By using an elaborate pipelining methodology, the number of joins after state slicing is reduced from quadratic to linear. This novel sharing paradigm enables us to push selections down into the chain and flexibly select subsequences of such sliced window joins for computation sharing among queries with different window sizes. Based on the state-slice sharing paradigm, two algorithms are proposed for the chain buildup. One minimizes the memory consumption while the other minimizes the CPU usage. The algorithms are proven to find the optimal chain with respect to memory or CPU usage for a given query workload. We have implemented the slice-share paradigm within the data stream management system CAPE. The experimental results show that our strategy provides the best performance over a diverse range of workload settings among all alternative solutions in the literature.
Incremental Evaluation of Sliding-Window Queries over Data Streams
- 2007
Cited by 6 (3 self)
Two research efforts have been conducted to realize sliding-window queries in data stream management systems, namely, query reevaluation and incremental evaluation. In the query reevaluation method, two consecutive windows are processed independently of each other. On the other hand, in the incremental evaluation method, the query answer for a window is obtained incrementally from the answer of the preceding window. In this paper, we focus on the incremental evaluation method. Two approaches have been adopted for the incremental evaluation of sliding-window queries, namely, the input-triggered approach and the negative tuples approach. In the input-triggered approach, only the newly inserted tuples flow in the query pipeline and tuple expiration is based on the timestamps of the newly inserted tuples. On the other hand, in the negative tuples approach, tuple expiration is separated from tuple insertion, and a tuple flows in the pipeline for every inserted or expired tuple. The negative tuples approach avoids the unpredictable output delays that result from the input-triggered approach. However, negative tuples double the number of tuples through the query pipeline, thus reducing the pipeline bandwidth. Based on a detailed study of the incremental evaluation pipeline, we classify the incremental query operators into two classes according to whether an operator can avoid the processing of negative tuples or not. Based on this classification, we present several optimization techniques over the negative tuples approach that aim to reduce the overhead of processing negative tuples while avoiding the output delay of the query answer. A detailed experimental study, based on a prototype system implementation, shows the performance gains of the negative tuples approach over the input-triggered approach when accompanied by the proposed optimizations.
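The negative tuples approach described above can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the window is time-based, expiration is driven by an explicit clock call rather than by arrivals, and the downstream operator is a running SUM. The class names `WindowOperator` and `IncrementalSum` are hypothetical.

```python
from collections import deque


class WindowOperator:
    """Time-based sliding window that emits signed tuples downstream."""

    def __init__(self, window_range):
        self.window_range = window_range
        self.buffer = deque()  # (timestamp, value), oldest first

    def insert(self, timestamp, value):
        """A new tuple enters the window: emit it as a positive tuple."""
        self.buffer.append((timestamp, value))
        return [(+1, value)]

    def advance(self, now):
        """Expire tuples by the clock, independently of arrivals.

        Each expired tuple is emitted as a negative tuple.
        """
        expired = []
        while self.buffer and self.buffer[0][0] <= now - self.window_range:
            _, value = self.buffer.popleft()
            expired.append((-1, value))
        return expired


class IncrementalSum:
    """Maintains SUM over the window from signed tuples only."""

    def __init__(self):
        self.total = 0

    def process(self, signed_tuples):
        for sign, value in signed_tuples:
            self.total += sign * value
        return self.total


window = WindowOperator(window_range=10)
agg = IncrementalSum()
agg.process(window.insert(1, 5))
agg.process(window.insert(4, 7))
print(agg.process(window.advance(11)))
```

Because `advance` runs on its own schedule, the aggregate is corrected as soon as a tuple expires even if no new input arrives, at the cost of one extra tuple flowing through the pipeline per expiration.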
A Survey on Algorithms for Mining Frequent Itemsets over Data Streams
Cited by 5 (0 self)
The increasing prominence of data streams arising in a wide range of advanced applications such as fraud detection and trend learning has led to the study of online mining of frequent itemsets (FIs). Unlike mining static databases, mining data streams poses many new challenges. In addition to the one-scan nature, the unbounded memory requirement and the high data arrival rate of data streams, the combinatorial explosion of itemsets exacerbates the mining task. The high complexity of the FI mining problem hinders the application of the stream mining techniques. We recognize that a critical review of existing techniques is needed in order to design and develop efficient mining algorithms and data structures that are able to match the processing rate of the mining with the high arrival rate of data streams. Within a unifying set of notations and terminologies, we describe in this paper the efforts and main techniques for mining data streams and present a comprehensive survey of a number of the state-of-the-art algorithms on mining frequent itemsets over data streams. We classify the stream-mining techniques into two categories based on the window model that they adopt in order to provide insights into how and why the techniques are useful. Then, we further analyze the algorithms according to whether they are exact or approximate and, for approximate approaches, whether they are false-positive or false-negative. We also discuss various interesting issues, including the merits and limitations in existing research and substantive areas for future research.

