Models and issues in data stream systems
In PODS, 2002
Cited by 519 (18 self)
In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.
Continuous Queries over Data Streams
2004
Cited by 215 (8 self)
In many recent applications, data may take the form of continuous data streams, rather than finite stored data sets. Several aspects of data management need to be reconsidered in the presence of data streams, offering a new research direction for the database community. In this paper we focus primarily on the problem of query processing, specifically on how to define and evaluate continuous queries over data streams. We address semantic issues as well as efficiency concerns. Our main contributions are threefold. First, we specify a general and flexible architecture for query processing in the presence of data streams. Second, we use our basic architecture as a tool to clarify alternative semantics and processing techniques for continuous queries. The architecture also captures most previous work on continuous queries and data streams, as well as related concepts such as triggers and materialized views. Finally, we map out research topics in the area of query processing over data streams, showing where previous work is relevant and describing problems yet to be addressed.
Mining time-changing data streams
In Proc. of the 2001 ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2001
Cited by 196 (4 self)
Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.
A Framework for Clustering Evolving Data Streams
In VLDB, 2003
Cited by 156 (15 self)
The clustering problem is a difficult problem for the data stream domain. This is because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. In recent years, a...
Processing Complex Aggregate Queries over Data Streams
2002
Cited by 144 (16 self)
Recent years have witnessed an increasing interest in designing algorithms for querying and analyzing streaming data (i.e., data that is seen only once in a fixed order) with only limited memory. Providing (perhaps approximate) answers to queries over such continuous data streams is a crucial requirement for many application environments; examples include large telecom and IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed.
StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time
In VLDB, 2002
Cited by 133 (8 self)
Consider the problem of monitoring tens of thousands of time series data streams in an online fashion and making decisions based on them. In addition to single stream statistics such as average and standard deviation, we also want to find high correlations among all pairs of streams. A stock market trader might use such a tool to spot arbitrage opportunities.
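The monitoring task described in this abstract can be illustrated with a minimal sketch. It computes exact per-stream window statistics and brute-force pairwise Pearson correlation; it is not the StatStream algorithm itself, whose technique is not given in the abstract. The names `WindowStats`, `pearson`, and `correlated_pairs` are hypothetical.

```python
from collections import deque
from itertools import combinations
from math import sqrt


class WindowStats:
    """Exact mean/std over the last `size` values of one stream (assumed helper)."""

    def __init__(self, size):
        self.values = deque(maxlen=size)  # old values fall off automatically

    def push(self, x):
        self.values.append(x)

    def mean(self):
        return sum(self.values) / len(self.values)

    def std(self):
        m = self.mean()
        return sqrt(sum((v - m) ** 2 for v in self.values) / len(self.values))


def pearson(a, b):
    """Pearson correlation of two equal-length windows (0.0 if either is constant)."""
    ma, mb = a.mean(), b.mean()
    cov = sum((x - ma) * (y - mb) for x, y in zip(a.values, b.values)) / len(a.values)
    denom = a.std() * b.std()
    return cov / denom if denom > 0 else 0.0


def correlated_pairs(streams, threshold):
    """Brute-force check of all pairs; quadratic in the number of streams."""
    return [
        (p, q)
        for p, q in combinations(sorted(streams), 2)
        if pearson(streams[p], streams[q]) > threshold
    ]


streams = {name: WindowStats(size=3) for name in ("A", "B", "C")}
for a, c in zip([1, 2, 3, 4, 5], [5, 1, 4, 2, 3]):
    streams["A"].push(a)
    streams["B"].push(2 * a + 1)  # B moves in lockstep with A
    streams["C"].push(c)
pairs = correlated_pairs(streams, threshold=0.9)
```

The all-pairs loop is the part that becomes expensive at tens of thousands of streams, which is the scale the abstract targets.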
Mining Concept-Drifting Data Streams Using Ensemble Classifiers
2003
Cited by 132 (23 self)
Recently, mining data streams with concept drifts for actionable insights has become an important and challenging task for a wide range of applications including credit card fraud protection, target marketing, network intrusion detection, etc. Conventional knowledge discovery tools are facing two challenges: the overwhelming volume of the streaming data, and the concept drifts. In this paper, we propose a general framework for mining concept-drifting data streams using weighted ensemble classifiers. We train an ensemble of classification models, such as C4.5, RIPPER, naive Bayesian, etc., from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. Our empirical study shows that the proposed methods have a substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.
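The chunk-based weighting idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's method: the base learner `MajorityByBucket` is a hypothetical toy stand-in for models such as C4.5, and each member's weight is simply its accuracy on the newest chunk, used here as a proxy for expected accuracy on upcoming data.

```python
from collections import Counter


class MajorityByBucket:
    """Toy base learner (assumption): majority label per discrete feature value."""

    def fit(self, chunk):
        counts = {}
        for x, y in chunk:
            counts.setdefault(x, Counter())[y] += 1
        self.table = {x: c.most_common(1)[0][0] for x, c in counts.items()}
        self.default = Counter(y for _, y in chunk).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.table.get(x, self.default)


def accuracy(model, chunk):
    return sum(model.predict(x) == y for x, y in chunk) / len(chunk)


class WeightedEnsemble:
    def __init__(self, max_models=3):
        self.max_models = max_models
        self.members = []  # list of (weight, model) pairs

    def update(self, chunk):
        """Train a model on the newest chunk, then re-weight all members on it.

        Simplification: the new model is scored on its own training chunk,
        which is optimistic; a held-out estimate would be more faithful.
        """
        models = [m for _, m in self.members] + [MajorityByBucket().fit(chunk)]
        scored = [(accuracy(m, chunk), m) for m in models]
        scored.sort(key=lambda wm: wm[0], reverse=True)
        self.members = scored[: self.max_models]  # keep the best-weighted models

    def predict(self, x):
        votes = Counter()
        for weight, model in self.members:
            votes[model.predict(x)] += weight
        return votes.most_common(1)[0][0]


ensemble = WeightedEnsemble()
ensemble.update([(0, "a"), (1, "b"), (0, "a"), (1, "b")])
before_drift = ensemble.predict(0)
ensemble.update([(0, "b"), (1, "a"), (0, "b"), (1, "a")])  # the concept flips
after_drift = ensemble.predict(0)
```

After the drift, the model trained on the old chunk scores zero on the new one, so its vote no longer counts and the prediction follows the current concept.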
Mining Frequent Patterns in Data Streams at Multiple Time Granularities
2002
Cited by 75 (5 self)
Although frequent-pattern mining has been widely studied and used, it is challenging to extend it to data streams. Compared to mining from a static transaction data set, the streaming case has far more information to track and far greater complexity to manage. Infrequent items can become frequent later on and hence cannot be ignored. The storage structure needs to be dynamically adjusted to reflect the evolution of itemset frequencies over time.
Maintaining Variance and k-Medians over Data Stream Windows
In PODS, 2003
Cited by 60 (0 self)
The sliding window model is useful for discounting stale data in data stream applications. In this model, data elements arrive continually and only the most recent N elements are used when answering queries. We present a novel technique for solving two important and related problems in the sliding window model: maintaining variance and maintaining a k-median clustering. Our solution to the problem of maintaining variance provides a continually updated estimate of the variance of the last N values in a data stream with relative error of at most ε using O((1/ε²) log N) memory. We present a constant-factor approximation algorithm which maintains an approximate k-median solution for the last N data points using O((k/τ⁴) N^(2τ) log² N) memory, where τ < 1/2 is a parameter which trades off the space bound with the approximation factor of O(2^O(1/τ)).
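For orientation, the sliding window model can be made concrete with an exact baseline that stores all of the last N values. This is not the paper's algorithm, which maintains an approximate estimate in far less memory; the class name `SlidingVariance` is hypothetical and the sketch only shows what quantity is being maintained.

```python
from collections import deque


class SlidingVariance:
    """Exact population variance of the last n values, using O(n) memory."""

    def __init__(self, n):
        self.n = n
        self.window = deque()
        self.total = 0.0     # running sum of values in the window
        self.total_sq = 0.0  # running sum of squared values in the window

    def push(self, x):
        self.window.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.window) > self.n:
            old = self.window.popleft()  # expire the stalest element
            self.total -= old
            self.total_sq -= old * old

    def variance(self):
        k = len(self.window)
        mean = self.total / k
        # Clamp at zero to absorb floating-point cancellation error.
        return max(self.total_sq / k - mean * mean, 0.0)


sv = SlidingVariance(n=3)
for value in [1, 2, 3, 4, 5]:
    sv.push(value)
result = sv.variance()  # variance of the window [3, 4, 5]
```

Each update is O(1) time, but the window itself costs O(N) space, which is the cost an approximate scheme is meant to avoid.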

