Models and issues in data stream systems
In PODS, 2002
Cited by 520 (18 self)
In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.
The Design of an Acquisitional Query Processor for Sensor Networks
In ACM SIGMOD, 2002
Cited by 371 (22 self)
We discuss the design of an acquisitional query processor for data collection in sensor networks. Acquisitional issues are those that pertain to where, when, and how often data is physically acquired (sampled) and delivered to query processing operators. By focusing on the locations and costs of acquiring data, we are able to significantly reduce power consumption over traditional passive systems that assume the a priori existence of data. We discuss simple extensions to SQL for controlling data acquisition, and show how acquisitional issues influence query optimization, dissemination, and execution. We evaluate these issues in the context of TinyDB, a distributed query processor for smart sensor devices, and show how acquisitional techniques can provide significant reductions in power consumption on our sensor devices.
TinyDB: An acquisitional query processing system for sensor networks
ACM Trans. Database Syst., 2005
Cited by 295 (7 self)
We discuss the design of an acquisitional query processor for data collection in sensor networks. Acquisitional issues are those that pertain to where, when, and how often data is physically acquired (sampled) and delivered to query processing operators. By focusing on the locations and costs of acquiring data, we are able to significantly reduce power consumption over traditional passive systems that assume the a priori existence of data. We discuss simple extensions to SQL for controlling data acquisition, and show how acquisitional issues influence query optimization, dissemination, and execution. We evaluate these issues in the context of TinyDB, a distributed query processor for smart sensor devices, and show how acquisitional techniques can provide significant reductions in power consumption on our sensor devices. Categories and Subject Descriptors: H.2.3 [Database Management]: Languages—Query languages; H.2.4 [Database Management]: Systems—Distributed databases; query processing
Continuous Queries over Data Streams
2004
Cited by 215 (8 self)
In many recent applications, data may take the form of continuous data streams, rather than finite stored data sets. Several aspects of data management need to be reconsidered in the presence of data streams, offering a new research direction for the database community. In this paper we focus primarily on the problem of query processing, specifically on how to define and evaluate continuous queries over data streams. We address semantic issues as well as efficiency concerns. Our main contributions are threefold. First, we specify a general and flexible architecture for query processing in the presence of data streams. Second, we use our basic architecture as a tool to clarify alternative semantics and processing techniques for continuous queries. The architecture also captures most previous work on continuous queries and data streams, as well as related concepts such as triggers and materialized views. Finally, we map out research topics in the area of query processing over data streams, showing where previous work is relevant and describing problems yet to be addressed.
Surfing wavelets on streams: One-pass summaries for approximate aggregate queries
In VLDB, 2001
Cited by 175 (16 self)
We present techniques for computing small space representations of massive data streams. These are inspired by traditional wavelet-based approximations that consist of specific linear projections of the underlying data. We present general "sketch" based methods for capturing various linear projections of the data and use them to provide pointwise and rangesum estimation of data streams. These methods use small amounts of space and per-item time while streaming through the data, and provide accurate representation as our experiments with real data streams show.
Processing Complex Aggregate Queries over Data Streams
2002
Cited by 144 (16 self)
Recent years have witnessed an increasing interest in designing algorithms for querying and analyzing streaming data (i.e., data that is seen only once in a fixed order) with only limited memory. Providing (perhaps approximate) answers to queries over such continuous data streams is a crucial requirement for many application environments; examples include large telecom and IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed.
DIMENSIONS: Why do we need a new Data Handling architecture for Sensor Networks?
2002
Cited by 114 (13 self)
An important class of networked systems is emerging that involve very large numbers of small, low-power, wireless devices. These systems offer the ability to sense the environment densely, offering unprecedented opportunities for many scientific disciplines to obtain detailed datasets for analysis. In this paper, we argue that a data handling architecture for these devices should incorporate their extreme resource constraints - energy, storage and processing - and spatiotemporal interpretation of the physical world in the design, cost model, and metrics of evaluation. We describe DIMENSIONS, a system that provides a unified view of data handling in sensor networks, incorporating long-term storage, multiresolution data access and spatio-temporal pattern mining.
Query Processing, Approximation, and Resource Management In a Data Stream Management System
2002
Cited by 90 (3 self)
This paper describes our ongoing work developing the Stanford Stream Data Manager (STREAM), a system for executing continuous queries over multiple continuous data streams. The STREAM system supports a declarative query language, and it copes with high data rates and query workloads by providing approximate answers when resources are limited. This paper describes specific contributions made so far and enumerates our next steps in developing a general-purpose Data Stream Management System.
Tracking join and self-join sizes in limited storage
2002
Cited by 89 (0 self)
This paper presents algorithms for tracking (approximate) join and self-join sizes in limited storage, in the presence of insertions and deletions to the data set(s). Such algorithms detect changes in join and self-join sizes without an expensive recomputation from the base data, and without the large space overhead required to maintain such sizes exactly. Query optimizers rely on fast, high-quality estimates of join sizes in order to select between various join plans, and estimates of self-join sizes are used to indicate the degree of skew in the data. For self-joins, we consider two approaches proposed in [Alon, Matias, and Szegedy. The Space Complexity of Approximating the Frequency Moments. JCSS, vol. 58, 1999, p.137-147], which we denote tug-of-war and sample-count. We present fast algorithms for implementing these approaches, and extensions to handle deletions as well as insertions. We also report on the first experimental study of the two approaches, on a range of synthetic and real-world data sets. Our study shows that tug-of-war provides more accurate estimates for a given storage limit than sample-count, which in turn is far more accurate than a standard sampling-based approach. For example, tug-of-war needed only 4–256 memory words, depending on the data set, in order to estimate the self-join size
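The tug-of-war idea named in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the class name, the parameters `groups` and `per_group`, and the use of a BLAKE2 hash as a stand-in for the 4-wise independent ±1 hash family are all assumptions made for illustration. Each counter accumulates a ±1 sign per item, deletions subtract the same sign, and the self-join size is estimated as a median of means of the squared counters.

```python
import hashlib
from statistics import median


class TugOfWarSketch:
    """Illustrative tug-of-war (AMS-style) sketch for self-join size."""

    def __init__(self, groups=5, per_group=16):
        # groups * per_group counters; both sizes are illustrative defaults.
        self.groups = groups
        self.per_group = per_group
        self.counters = [0] * (groups * per_group)

    def _sign(self, index, value):
        # Assumption: a cryptographic hash stands in for a 4-wise
        # independent hash function mapping values to +1 or -1.
        data = f"{index}:{value!r}".encode()
        digest = hashlib.blake2b(data, digest_size=1).digest()
        return 1 if digest[0] & 1 else -1

    def update(self, value, count=1):
        # Add count * sign to every counter; a negative count is a deletion.
        for j in range(len(self.counters)):
            self.counters[j] += count * self._sign(j, value)

    def insert(self, value):
        self.update(value, 1)

    def delete(self, value):
        self.update(value, -1)

    def estimate(self):
        # Each squared counter estimates the sum of squared frequencies;
        # average within groups, then take the median across groups.
        squares = [z * z for z in self.counters]
        means = [
            sum(squares[g * self.per_group:(g + 1) * self.per_group])
            / self.per_group
            for g in range(self.groups)
        ]
        return median(means)
```

With a single distinct value inserted n times every counter equals +n or -n, so the estimate is exactly n squared; with many distinct values the estimate is only approximate, and its accuracy depends on the number of counters.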
Distinct sampling for highly-accurate answers to distinct values queries and event reports
In Proceedings of the 27th International Conference on Very Large Data Bases
Cited by 73 (5 self)
Estimating the number of distinct values is a well-studied problem, due to its frequent occurrence in queries and its importance in selecting good query plans. Previous work has shown powerful negative results on the quality of distinct-values estimates based on sampling (or other techniques that examine only part of the input data). We present an approach, called distinct sampling, that collects a specially tailored sample over the distinct values in the input, in a single scan of the data. In contrast to the previous negative results, our small Distinct Samples are guaranteed to accurately estimate the number of distinct values. The samples can be incrementally maintained up-to-date in the presence of data insertions and deletions, with minimal time and memory overheads, so that the full scan may be performed only once. Moreover, a stored Distinct Sample can be used to accurately estimate the number of distinct values within any range specified by the query, or within any other subset of the data satisfying a query predicate. We present an extensive experimental study of distinct sampling. Using synthetic and real-world data sets, we show that distinct sampling gives distinct-values estimates to within 0%–10% relative error, whereas previous methods typically incur 50%–250% relative error. Next, we show how distinct sampling can provide fast, highly-accurate approximate answers for “report” queries in high-volume, session-based event recording environments, such as IP networks, customer service call centers, etc. For a commercial call center environment, we show that a 1% Distinct Sample
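A much simplified sketch of the single-scan sampling idea in this abstract follows. It is an illustration under stated assumptions, not the paper's algorithm: the class name, the `capacity` parameter, and the BLAKE2-based level function are hypothetical, and it omits deletions and the per-value rows a full Distinct Sample would store. Each value is hashed to a level that it reaches with probability halving per level; the sample keeps only values at or above the current level, and the level rises whenever the sample outgrows its capacity.

```python
import hashlib


class DistinctSampler:
    """Illustrative hash-level sample over distinct values (insert-only)."""

    def __init__(self, capacity=64):
        self.capacity = capacity  # hypothetical bound on the sample size
        self.level = 0
        self.sample = {}  # value -> hash level

    def _level_of(self, value):
        # Assumption: a cryptographic hash stands in for the random hash
        # function; the level is the number of trailing zero bits.
        digest = hashlib.blake2b(repr(value).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h == 0:
            return 64
        return (h & -h).bit_length() - 1

    def add(self, value):
        lvl = self._level_of(value)
        if lvl < self.level or value in self.sample:
            return  # below the current level, or already sampled
        self.sample[value] = lvl
        # When the sample overflows, raise the level and evict values
        # whose hash level is now too low.
        while len(self.sample) > self.capacity:
            self.level += 1
            self.sample = {
                v: l for v, l in self.sample.items() if l >= self.level
            }

    def estimate(self):
        # Each retained value represents about 2**level distinct values.
        return len(self.sample) * (2 ** self.level)
```

While the number of distinct values stays within the capacity, the level remains 0 and the estimate is exact; beyond that it becomes a scaled estimate whose error shrinks as the capacity grows.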

