Clustering data streams: Theory and practice
- IEEE TKDE
, 2003
Cited by 75 (2 self)
The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm's performance on synthetic and real data streams. Index Terms: clustering, data streams, approximation algorithms.
Maintaining Variance and k-Medians over Data Stream Windows
- In PODS
, 2003
Cited by 60 (0 self)
The sliding window model is useful for discounting stale data in data stream applications. In this model, data elements arrive continually and only the most recent N elements are used when answering queries. We present a novel technique for solving two important and related problems in the sliding window model: maintaining variance and maintaining a k-median clustering. Our solution to the problem of maintaining variance provides a continually updated estimate of the variance of the last N values in a data stream with relative error of at most $\epsilon$ using $O(\frac{1}{\epsilon^2} \log N)$ memory. We present a constant-factor approximation algorithm which maintains an approximate k-median solution for the last N data points using $O(\frac{k}{\tau^4} N^{2\tau} \log^2 N)$ memory, where $\tau < 1/2$ is a parameter which trades off the space bound with the approximation factor of $O(2^{O(1/\tau)})$.
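As a rough illustration of why variance can be maintained from compact summaries, the sketch below merges per-block statistics (count, mean, sum of squared deviations) exactly, using the standard pairwise combination formula. It is not the algorithm from the abstract above, which additionally bounds the number of blocks and approximates the contribution of the oldest, partially expired one. The names `Summary`, `summarize`, `merge`, and `window_variance` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    n: int       # number of values summarized
    mean: float  # mean of those values
    ssd: float   # sum of squared deviations from the mean


def summarize(values):
    """Build an exact summary for one non-empty block of values."""
    n = len(values)
    mean = sum(values) / n
    ssd = sum((v - mean) ** 2 for v in values)
    return Summary(n, mean, ssd)


def merge(a, b):
    """Combine two summaries without revisiting the raw values."""
    n = a.n + b.n
    mean = (a.n * a.mean + b.n * b.mean) / n
    # The cross term accounts for the gap between the two block means.
    ssd = a.ssd + b.ssd + (a.n * b.n / n) * (a.mean - b.mean) ** 2
    return Summary(n, mean, ssd)


def window_variance(summaries):
    """Population variance of all values covered by the summaries."""
    total = summaries[0]
    for s in summaries[1:]:
        total = merge(total, s)
    return total.ssd / total.n
```

Because each block is reduced to three numbers, memory depends on the number of blocks rather than on the window length N. Keeping that number logarithmic in N is where a real sliding-window scheme has to do additional work.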
Clustering Binary Data Streams with K-means
- In 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
, 2003
Cited by 33 (0 self)
Clustering data streams is an interesting Data Mining problem. This article presents three variants of the K-means algorithm to cluster binary data streams. The variants include On-line K-means, Scalable K-means, and Incremental K-means, a newly proposed variant that finds higher-quality solutions in less time. Higher-quality solutions are obtained with a mean-based initialization and incremental learning. The speedup is achieved through a simplified set of sufficient statistics and operations with sparse matrices. A summary table of clusters is maintained on-line. The K-means variants are compared with respect to quality of results and speed. The proposed algorithms can be used to monitor transactions.
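To make the idea of sufficient statistics and sparse operations concrete, here is a minimal sketch of a generic one-pass K-means over binary points. It is an assumption-laden illustration, not any of the three variants named in the abstract. Each point is given as the set of indices whose value is 1, each cluster keeps only a count and per-dimension sums, and the class name `OnlineBinaryKMeans` and helper `sq_dist` are hypothetical.

```python
def sq_dist(active, centroid, centroid_sq_norm):
    """Squared Euclidean distance from a binary point to a centroid.

    For binary x, ||x - c||^2 = ||c||^2 + sum over active i of (1 - 2*c_i),
    so only the nonzero positions of x are visited.
    """
    return centroid_sq_norm + sum(1.0 - 2.0 * centroid[i] for i in active)


class OnlineBinaryKMeans:
    def __init__(self, initial_centroids):
        # Seed centroids are used only until a cluster absorbs its first point.
        self.centroids = [list(c) for c in initial_centroids]
        self.counts = [0] * len(self.centroids)              # points per cluster
        self.sums = [[0] * len(c) for c in self.centroids]   # ones per dimension

    def update(self, active):
        """Assign one point (set of active indices) and refresh its cluster."""
        norms = [sum(v * v for v in c) for c in self.centroids]
        j = min(
            range(len(self.centroids)),
            key=lambda k: sq_dist(active, self.centroids[k], norms[k]),
        )
        self.counts[j] += 1
        for i in active:
            self.sums[j][i] += 1
        # The centroid is the mean of the points assigned so far.
        n = self.counts[j]
        self.centroids[j] = [s / n for s in self.sums[j]]
        return j
```

The per-cluster state (`counts` and `sums`) is also enough to build a summary table of clusters at any time without storing the stream itself.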
GenIc: A Single Pass Generalized Incremental Algorithm for Clustering
- In SIAM Int. Conf. on Data Mining
, 2004
Cited by 17 (2 self)
In this paper we introduce a new single pass clustering algorithm called GenIc designed with the objective of having low overall cost. We examine some of the properties of GenIc and compare it to windowed k-means. We also study its performance using experimental data sets obtained from network monitoring.
Mining frequent itemsets from data streams with a time-sensitive sliding window
- In SDM
, 2005
Cited by 11 (0 self)
Mining frequent itemsets has been widely studied over the last decade. Past research focuses on mining frequent itemsets from static databases. In many of the new applications, data flow through the Internet or sensor networks. It is challenging to extend the mining techniques to such a dynamic environment. The main challenges include a quick response to the continuous request, a compact summary of the data stream, and a mechanism that adapts to the limited resources. In this paper, we develop a novel approach for mining frequent itemsets from data streams based on a time-sensitive sliding window model. Our approach consists of a storage structure that captures all possible frequent itemsets and a table providing approximate counts of the expired data items, whose size can be adjusted by the available storage space. Experiment results show that in our approach both the execution time and the storage space remain small under various parameter settings. In addition, our approach guarantees either no false alarms or no false dismissals in the results yielded.
Adaptive Mining Techniques for Data Streams Using Algorithm Output Granularity
- Workshop (AusDM 2003), held in conjunction with the 2003 Congress on Evolutionary Computation (CEC 2003)
, 2003
Cited by 10 (5 self)
Mining data streams is an emerging area of research given the potentially large number of business and scientific applications. A significant challenge in analyzing and mining data streams is the high data rate of the stream. In this paper, we propose a novel approach to cope with the high data rate of incoming data streams. We term our approach "algorithm output granularity". It is a resource-aware approach that is adaptable to available memory, time constraints, and data stream rate. The approach is generic and applicable to clustering, classification, and frequent-item counting techniques. We have developed a data stream clustering algorithm based on the algorithm output granularity approach. We present this algorithm and discuss its implementation and empirical evaluation. The experiments show acceptable accuracy accompanied by run-time efficiency. They show that the proposed algorithm outperforms K-means in terms of running time while preserving the accuracy that our algorithm can achieve.
Scuba: Scalable cluster-based algorithm for evaluating continuous spatio-temporal queries on moving objects
- In EDBT
, 2006
Cited by 8 (1 self)
In this paper, we propose SCUBA, a Scalable Cluster Based Algorithm for evaluating a large set of continuous queries over spatio-temporal data streams. The key idea of SCUBA is to group moving objects and queries based on common spatio-temporal properties at runtime into moving clusters to optimize query execution and thus facilitate scalability. SCUBA exploits shared cluster-based execution by abstracting the evaluation of a set of spatio-temporal queries as a spatial join first between moving clusters. This cluster-based filtering prunes true negatives. Then the execution proceeds with a fine-grained within-moving-cluster join process for all pairs of moving clusters identified as potentially joinable by a positive cluster-join match. A moving cluster can serve as an approximation of the location of its members. We show how moving clusters can serve as a means for intelligent load shedding of spatio-temporal data to avoid performance degradation with minimal harm to result quality. Our experiments on real datasets demonstrate that SCUBA can achieve a substantial improvement when executing continuous queries on spatio-temporal data streams.
Online clustering of parallel data streams
- In press for Data & Knowledge Engineering
, 2005
Cited by 8 (1 self)
In recent years, the management and processing of so-called data streams has become a topic of active research in several fields of computer science, such as distributed systems, database systems, and data mining. A data stream can roughly be thought of as a transient, continuously increasing sequence of time-stamped data. In this paper, we consider the problem of clustering parallel streams of real-valued data, that is to say, continuously evolving time series. In other words, we are interested in grouping data streams whose evolution over time is similar in a specific sense. In order to maintain an up-to-date clustering structure, it is necessary to analyze the incoming data in an online manner, tolerating not more than a constant time delay. For this purpose, we develop an efficient online version of the classical K-means clustering algorithm. Our method's efficiency is mainly due to a scalable online transformation of the original data which allows for a fast computation of approximate distances between streams. Key words: data mining, clustering, data streams, fuzzy sets
Subsequence matching on structured time series data
- In SIGMOD’05
, 2005
Cited by 7 (2 self)
Subsequence matching in time series databases is a useful technique, with applications in pattern matching, prediction, and rule discovery. Internal structure within the time series data can be used to improve these tasks, and provide important insight into the problem domain. This paper introduces our research effort in using the internal structure of a time series directly in the matching process. This idea is applied to the problem domain of respiratory motion data in cancer radiation treatment. We propose a comprehensive solution for analysis, clustering, and online prediction of respiratory motion using subsequence similarity matching. In this system, a motion signal is captured in real time as a data stream, and is analyzed immediately for treatment and also saved in a database for future study. A piecewise linear representation of the signal is generated from a finite state model, and is used as a query for subsequence matching. To ensure that the query subsequence is representative, we introduce the concept of subsequence stability, which can be used to dynamically adjust the query subsequence length. To satisfy the special needs of similarity matching over breathing patterns, a new subsequence similarity measure is introduced. This new measure uses a weighted distance function to capture the relative importance of each source stream, amplitude, frequency, and proximity in time. From the subsequence similarity measure, stream and patient similarity can be defined, which are then used for offline and online applications. The matching results are analyzed and applied for motion prediction and correlation discovery. While our system has been customized for use in radiation therapy, our approach to time series modeling is general enough for application domains with structured time series data.
Density-Based Clustering for Real-Time Stream Data
- Proc. of KDD '07
, 2007
Cited by 7 (0 self)
Existing data-stream clustering algorithms such as CluStream are based on k-means. These clustering algorithms are unable to find clusters of arbitrary shapes and cannot handle outliers. Further, they require knowledge of k and a user-specified time window. To address these issues, this paper proposes D-Stream, a framework for clustering stream data using a density-based approach. The algorithm uses an online component which maps each input data record into a grid and an offline component which computes the grid density and clusters the grids based on the density. The algorithm adopts a density decaying technique to capture the dynamic changes of a data stream. Exploiting the intricate relationships between the decay factor, data density and cluster structure, our algorithm can efficiently and effectively generate and adjust the clusters in real time. Further, a theoretically sound technique is developed to detect and remove sporadic grids mapped to by outliers in order to dramatically improve the space and time efficiency of the system. The technique makes high-speed data stream clustering feasible without degrading the clustering quality. The experimental results show that our algorithm has superior quality and efficiency, can find clusters of arbitrary shapes, and can accurately recognize the evolving behaviors of real-time data streams.
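The online step described above, mapping each record to a grid cell and letting cell density decay over time, might look roughly like the following sketch. It assumes exponential decay with a factor between 0 and 1 and uniform cell width. The class `DecayedGrid`, its parameters, and the lazy update rule are illustrative assumptions, not the published D-Stream implementation. The offline clustering of dense cells and the removal of sporadic cells are not shown.

```python
from collections import defaultdict


class DecayedGrid:
    def __init__(self, cell_width, decay):
        self.cell_width = cell_width        # assumed uniform width per dimension
        self.decay = decay                  # assumed decay factor in (0, 1)
        self.density = defaultdict(float)   # density as of each cell's last update
        self.last_update = {}               # time of each cell's last update

    def cell_of(self, record):
        """Map a numeric record to the integer coordinates of its cell."""
        return tuple(int(x // self.cell_width) for x in record)

    def add(self, record, t):
        """Decay the cell's stored density up to time t, then count the record."""
        cell = self.cell_of(record)
        last = self.last_update.get(cell, t)
        self.density[cell] = self.density[cell] * self.decay ** (t - last) + 1.0
        self.last_update[cell] = t
        return cell

    def density_at(self, cell, t):
        """Density of a cell at time t, without modifying stored state."""
        if cell not in self.last_update:
            return 0.0
        return self.density[cell] * self.decay ** (t - self.last_update[cell])
```

Updating a cell only when a record lands in it keeps the per-record cost constant. Cells that receive no records simply decay implicitly until they are next read.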

