
## Mining Data Streams: A Review. (2005)

Venue: SIGMOD Record

Citations: 113 (6 self)

### Citations

786 | Models and issues in data stream systems.
- Babcock, Babu, et al.
- 2002
Citation Context ...n proposed to address the communication overhead issue. Different variations of algorithms have been developed in order to increase the accuracy of the generated global model. More details about distributed data mining can be found in [47]. Recently, the data generation rates in some data sources have become faster than ever before. This rapid generation of continuous streams of information has challenged our storage, computation and communication capabilities in computing systems. Systems, models and techniques have been proposed and developed over the past few years to address these challenges [5, 44]. In this paper, we review the theoretical foundations of data stream analysis. Data stream mining systems and techniques are critically reviewed. Finally, we outline and discuss research problems in the stream mining field of study. These research issues should be addressed in order to realize robust systems that are capable of fulfilling the needs of data stream mining applications. The paper is organized as follows. Section 2 presents the theoretical background of data stream analysis. Data stream mining techniques and systems are reviewed in Sections 3 and 4, respectively. Open and addressed re... |

533 | Data streams: Algorithms and applications.
- Muthukrishnan
- 2005
Citation Context ...n proposed to address the communication overhead issue. Different variations of algorithms have been developed in order to increase the accuracy of the generated global model. More details about distributed data mining can be found in [47]. Recently, the data generation rates in some data sources have become faster than ever before. This rapid generation of continuous streams of information has challenged our storage, computation and communication capabilities in computing systems. Systems, models and techniques have been proposed and developed over the past few years to address these challenges [5, 44]. In this paper, we review the theoretical foundations of data stream analysis. Data stream mining systems and techniques are critically reviewed. Finally, we outline and discuss research problems in the stream mining field of study. These research issues should be addressed in order to realize robust systems that are capable of fulfilling the needs of data stream mining applications. The paper is organized as follows. Section 2 presents the theoretical background of data stream analysis. Data stream mining techniques and systems are reviewed in Sections 3 and 4, respectively. Open and addressed re... |

516 | Principles of data mining.
- Hand, Mannila, et al.
- 2001
Citation Context ...e objective was to find computationally efficient solutions to data analysis problems. Along with the progress in machine learning research, new data analysis problems have been addressed. Due to the increase in database sizes, new algorithms have been proposed to deal with the scalability issue. Moreover, machine learning and statistical analysis techniques have been adopted and modified in order to address the problem of very large databases. Data mining is the interdisciplinary field of study that can extract models and patterns from large amounts of information stored in data repositories [30, 31, 34]. Advances in networking and parallel computation have led to the introduction of distributed and parallel data mining. The goal was to extract knowledge from different subsets of a dataset and integrate these generated knowledge structures in order to gain a global model of the whole dataset. Client/server, mobile-agent-based and hybrid models have been proposed to address the communication overhead issue. Different variations of algorithms have been developed in order to increase the accuracy of the generated global model. More details about distributed data mining can be found in [47... |

418 | Approximate frequency counts over data streams.
- Manku, Motwani
- 2002
Citation Context ..., all the cases represented by this class is released from the memory. 3.3 Frequency Counting Giannella et al. [20] have developed a frequent itemsets mining algorithm over data stream. They have proposed the use of tilted windows to calculate the frequent patterns for the most recent transactions based on the fact that users are more interested in the most recent transactions. They use an incremental algorithm to maintain the FP-stream which is a tree data structure to represent the frequent itemsets. They conducted a number of experiments to prove the algorithm efficiency. Manku and Motwani [43] have proposed and implemented an approximate frequency counts in data streams. The implemented algorithm uses all the previous historical data to calculate the frequent patterns incrementally. Cormode and Muthukrishnan [13] have developed an algorithm for counting frequent items. The algorithm uses group testing to find the hottest k items. The algorithm is used with the turnstile data stream model which allows addition as well as deletion of data items. An approximation randomized algorithm has been used to approximately count the most frequent items. It is worth mentioning that this data st... |
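The context above mentions approximate frequency counting over the whole stream history without giving details. The following is a minimal sketch of one well-known scheme of this kind, Lossy Counting, offered as an illustration rather than as the exact implementation of [43]. The class name `LossyCounter`, its method names, and the parameters `epsilon` and `support` are assumptions introduced here.

```python
import math


class LossyCounter:
    """Illustrative one-pass approximate frequency counter (Lossy Counting style).

    Counts are underestimated by at most epsilon * n, where n is the number
    of items seen so far. All names here are hypothetical.
    """

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1.0 / epsilon)  # bucket width
        self.n = 0                             # items seen so far
        self.entries = {}                      # item -> [count, max undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # id of the current bucket
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # The item may have been dropped earlier, at most bucket - 1 times.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # End of bucket: prune entries that cannot be frequent.
            self.entries = {
                key: value
                for key, value in self.entries.items()
                if value[0] + value[1] > bucket
            }

    def frequent(self, support):
        """Return items whose stored count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {
            key: value[0]
            for key, value in self.entries.items()
            if value[0] >= threshold
        }
```

The design trades exactness for bounded memory: rare items are pruned at every bucket boundary, so only items that could still exceed the support threshold stay in the table.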

402 | Mining high-speed data streams.
- Domingos, Hulten
- 2000
Citation Context ...hm [27]. They use the same method described above; however, they address the problem of merging clusters when the two sets of cluster centers to be merged are far apart by maintaining the EH data structure. They have studied their proposed algorithm analytically. Charikar et al. [11] have proposed another k-median algorithm that overcomes the problem of approximation factors in the Guha et al. [27] algorithm increasing with the number of levels used to reach the final solution of the divide-and-conquer algorithm. The algorithm has also been studied analytically. Domingos et al. [15, 16, 35] have proposed a general method for scaling up machine learning algorithms. They have termed this approach Very Fast Machine Learning (VFML). This method depends on determining an upper bound for the learner’s loss as a function of the number of data items to be examined in each step of the algorithm. They have applied this method to K-means clustering (VFKM) and decision tree classification (VFDT) techniques. These algorithms have been implemented and evaluated using synthetic data sets as well as real web data streams. VFKM uses the Hoeffding bound to determine the number of examples needed in each st... |

359 | A framework for clustering evolving data streams.
- Aggarwal, Han, et al.
- 2003
Citation Context ... frequency moments [5] have been proposed as synopsis data structures. Since a synopsis of the data does not represent all the characteristics of the dataset, approximate answers are produced when using such data structures. 2.1.5 Aggregation Aggregation is the process of computing statistical measures such as means and variance that summarize the incoming stream. This aggregated data can then be used by the mining algorithm. The problem with aggregation is that it does not perform well with highly fluctuating data distributions. Merging online aggregation with offline mining has been studied in [1, 2, 3]. 2.2 Task-based Techniques Task-based techniques are those methods that modify existing techniques or invent new ones in order to address the computational challenges of data stream processing. Approximation algorithms, sliding window and algorithm output granularity represent this category. In the following subsections, we examine each of these techniques and its application in the context of data stream analysis. 2.2.1 Approximation algorithms Approximation algorithms [44] have their roots in algorithm design. They are concerned with designing algorithms for computationally hard problems. These a... |

338 | Mining Time-changing Data Streams.
- Hulten, Spencer, et al.
- 2001
Citation Context ...hm [27]. They use the same method described above; however, they address the problem of merging clusters when the two sets of cluster centers to be merged are far apart by maintaining the EH data structure. They have studied their proposed algorithm analytically. Charikar et al. [11] have proposed another k-median algorithm that overcomes the problem of approximation factors in the Guha et al. [27] algorithm increasing with the number of levels used to reach the final solution of the divide-and-conquer algorithm. The algorithm has also been studied analytically. Domingos et al. [15, 16, 35] have proposed a general method for scaling up machine learning algorithms. They have termed this approach Very Fast Machine Learning (VFML). This method depends on determining an upper bound for the learner’s loss as a function of the number of data items to be examined in each step of the algorithm. They have applied this method to K-means clustering (VFKM) and decision tree classification (VFDT) techniques. These algorithms have been implemented and evaluated using synthetic data sets as well as real web data streams. VFKM uses the Hoeffding bound to determine the number of examples needed in each st... |

315 | A symbolic representation of time series, with implications for streaming algorithms.
- Lin, Keogh, et al.
- 2003
Citation Context ...time series streams. The first phase clusters sliding window patterns of each time series. Using the created clusters, an association rule discovery technique is used to create affinity analysis results among the created clusters of time series. Zhu and Shasha [54] have proposed techniques to compute some statistical measures over time series data streams. The proposed techniques use discrete Fourier transform. The system is called StatStream and is able to compute approximate error bounded correlations and inner products. The system works over an arbitrarily chosen sliding window. Lin et al. [42] have proposed the use of symbolic representation of time series data streams. This representation allows dimensionality/numerosity reduction. They have demonstrated the applicability of the proposed representation by applying it to clustering, classification, indexing and anomaly detection. The approach has two main stages. The first one is the transformation of time series data to Piecewise Aggregate Approximation followed by transforming the output to discrete string symbols in the second stage. Chen et al. [12] have proposed the application of what so called regression cubes for data strea... |
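The two stages described in the context above (Piecewise Aggregate Approximation, then mapping to discrete symbols) can be sketched roughly as follows. This is a simplified illustration, not the authors' code: the function names `paa` and `symbolize`, the z-normalization step, the four-letter alphabet, and the use of equiprobable standard-normal breakpoints are assumptions made here.

```python
from statistics import NormalDist, mean, pstdev


def paa(values, segments):
    """Stage 1: replace each equal-width segment by its mean.

    Assumes len(values) is a multiple of segments (a simplification).
    """
    size = len(values) // segments
    return [mean(values[i * size:(i + 1) * size]) for i in range(segments)]


def symbolize(series, segments, alphabet="abcd"):
    """Stage 2: map each PAA value to a letter of a small alphabet."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        normalized = [0.0] * len(series)  # constant series: nothing to scale
    else:
        normalized = [(x - mu) / sigma for x in series]
    # Assumed breakpoints: cut the standard normal into equiprobable regions.
    cuts = [
        NormalDist().inv_cdf(i / len(alphabet))
        for i in range(1, len(alphabet))
    ]
    # The letter index is the number of breakpoints below the segment mean.
    return "".join(
        alphabet[sum(v > c for c in cuts)] for v in paa(normalized, segments)
    )
```

A series of 16 points reduced to 4 segments becomes a 4-letter word, which is the kind of dimensionality/numerosity reduction the context refers to.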

295 | Clustering data streams
- Guha, Mishra, et al.
- 2000
Citation Context ...ication and frequency counting [21]. Having discussed the different theoretical approaches to data stream analysis problems, the following section is devoted to stream mining techniques that use the above theoretical approaches in different ways. 3- Mining Techniques Mining data streams has attracted the attention of the data mining community for the last three years. A number of algorithms have been proposed for extracting knowledge from streaming information. In this section, we review clustering, classification, frequency counting and time series analysis techniques. 3.1 Clustering Guha et al. [27, 28] have analytically studied clustering data streams using the k-median technique. The proposed algorithm makes a single pass over the data stream and uses small space. It requires O(nk) time and O(n^ε) space, where k is the number of centers, n is the number of points, and ε < 1. They have proved that any k-median algorithm that achieves a constant factor approximation cannot achieve a better run time than O(nk). The algorithm starts by clustering a calculated size sample according to the available memory into 2k, and then at a second level, the algorithm clusters the above points for a number of s... |

280 | Mining concept-drifting data streams using ensemble classifiers.
- Wang, Fan, et al.
- 2003
Citation Context ... to choose the subsequences that the algorithm can work on to produce meaningful results. Gaber et al. [21] have developed Lightweight Clustering LWC. It is an AOG-based algorithm. AOG has been discussed in section 2. The algorithm adjusts a threshold that represents the minimum distance measure between data items in different clusters. This adjustment is done regularly according to a pre-specified time frame. It is done according to the available resources by monitoring the input-output rate. This process is followed by merging clusters when the memory is full. 3.2 Classification Wang et al. [53] have proposed a general framework for mining concept drifting data streams. They have observed that data stream mining algorithms proposed so far have not addressed the concept of drifting in the evolving data. The proposed technique uses weighted classifier ensembles to mine data streams. The expiration of old data in their model depends on the data distribution. They use synthetic and real life data streams to test their algorithm and compare between the single classifier and classifier ensembles. The proposed algorithm combines multiple classifiers weighted by their expected prediction acc... |

221 | StatStream: Statistical monitoring of thousands of data streams in real time
- Zhu, Shasha
- 2002
Citation Context ...mputing the sketches over an arbitrarily chosen time window and creating what so called sketch pool. Using this pool of sketches, relaxed periods and average trends are computed. The algorithms have shown experimentally efficiency in running time and accuracy. Perlman and Java [49] have proposed a two phase approach to mine astronomical time series streams. The first phase clusters sliding window patterns of each time series. Using the created clusters, an association rule discovery technique is used to create affinity analysis results among the created clusters of time series. Zhu and Shasha [54] have proposed techniques to compute some statistical measures over time series data streams. The proposed techniques use discrete Fourier transform. The system is called StatStream and is able to compute approximate error bounded correlations and inner products. The system works over an arbitrarily chosen sliding window. Lin et al. [42] have proposed the use of symbolic representation of time series data streams. This representation allows dimensionality/numerosity reduction. They have demonstrated the applicability of the proposed representation by applying it to clustering, classification, ... |

199 | What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically.
- Cormode, Muthukrishnan
- 2003
Citation Context ... algorithms are considered hard computational problems given its features of continuality and speed and the generating environment that is featured by being resource constrained. Approximation algorithms have attracted researchers as a direct solution to data stream mining problems. However, the problem of data rates with regard with the available resources could not be solved using approximation algorithms. Other tools should be used along with these algorithms in order to adapt to the SIGMOD Record, Vol. 34, No. 2, June 2005 19 available resources. Approximation algorithms have been used in [13] 2.2.2 Sliding Window The inspiration behind sliding window is that the user is more concerned with the analysis of most recent data streams. Thus the detailed analysis is done over the most recent data items and summarized versions of the old ones. This idea has been adopted in many techniques in the undergoing comprehensive data stream mining system MAIDS [17]. 2.2.3 Algorithm Output Granularity The algorithm output granularity (AOG) [21, 22, 23] introduces the first resource-aware data analysis approach that can cope with fluctuating very high data rates according to the available memory an... |

164 | Issues in Data Stream Management.
- Golab, Ozsu
- 2003
Citation Context ...sented in snow, ice and clouds using kernel clustering methods. These techniques are used for data compression. The motivation of the project is to preserve the limited bandwidth needed to send image streams to the ground centers. The kernel methods have been chosen due to their low computational complexity in such a resource-constrained environment. 5- Research Issues Data stream mining is a stimulating field of study that has raised challenges and research issues to be addressed by the database and data mining communities. The following is a discussion of both addressed and open research issues [17, 21, 26, 37]. The following is a brief discussion of previously addressed issues: Handling the continuous flow of data streams: this is a data management issue. Traditional database management systems are not capable of dealing with such a continuous high data rate. Novel indexing, storage and querying techniques are required to handle this nonstop, fluctuating flow of information streams. Minimizing energy consumption of the mobile device: Large amounts of data streams are generated in resource-constrained environments. Sensor networks represent a typical example. These devices have short-life batteries. T... |

145 | Event detection from time series data.
- Guralnik, Srivastava
- 1999
Citation Context ...coming streams. This research has been extended to be adopted in an ongoing project, Mining Alarming Incidents in Data Streams (MAIDS). Himberg et al. [33] have presented and analyzed randomized variations of segmenting time series data streams generated onboard mobile phone sensors. One of the discussed applications of clustering time series is changing the user interface of the mobile phone screen according to the user context. It has been proven in this study that Global Iterative Replacement provides an approximately optimal solution with high efficiency in running time. Guralnik and Srivastava [29] have developed a generic event detection approach for time series streams. They have developed techniques for batch and online incremental processing of time series data. The techniques have proven efficient with real and synthetic data sets. 4- Systems Several applications have stimulated the development of robust streaming analysis systems. The following represents a list of such applications. • Burl et al. [9] have developed Diamond Eye for NASA and JPL. The aim of this project is to enable remote computing systems as well as observing scientists to... |

144 | Multi-dimensional regression analysis of time-series data streams
- Chen, Dong, et al.
- 2002
Citation Context ...roducts. The system works over an arbitrarily chosen sliding window. Lin et al. [42] have proposed the use of symbolic representation of time series data streams. This representation allows dimensionality/numerosity reduction. They have demonstrated the applicability of the proposed representation by applying it to clustering, classification, indexing and anomaly detection. The approach has two main stages. The first one is the transformation of time series data to Piecewise Aggregate Approximation followed by transforming the output to discrete string symbols in the second stage. Chen et al. [12] have proposed the application of what so called regression cubes for data streams. Due to the success of OLAP technology in the application of static stored data, it has been proposed to use multidimensional regression analysis to create a compact cube that could be used for answering aggregate queries over the incoming streams. This research has been extended to be adopted in an undergoing project Mining Alarming Incidents in Data Streams MAIDS. Himberg et al. [33] have presented and analyzed randomized variations of segmenting time series data streams generated onboard mobile phone sensors.... |

134 | Mining Frequent Patterns in Data Streams at Multiple Time Granularities.
- Giannella, Han, et al.
- 2003
Citation Context ...uitable for streaming applications. Gaber et al. [21] have developed Lightweight Classification (LWClass). It is a variation of LWC. It is also an AOG-based technique. The idea is to use K-nearest neighbors while updating the frequency of class occurrence given the data stream features. In case of contradiction between the incoming stream and the stored summary of the cases, the frequency is reduced. If the frequency reaches zero, all the cases represented by this class are released from memory. 3.3 Frequency Counting Giannella et al. [20] have developed a frequent itemset mining algorithm over data streams. They have proposed the use of tilted windows to calculate the frequent patterns for the most recent transactions based on the fact that users are more interested in the most recent transactions. They use an incremental algorithm to maintain the FP-stream, which is a tree data structure that represents the frequent itemsets. They conducted a number of experiments to prove the algorithm's efficiency. Manku and Motwani [43] have proposed and implemented approximate frequency counts over data streams. The implemented algorithm uses ... |

117 | Clustering of time series subsequences is meaningless: implications for previous and future research.
- Keogh, Lin
- 2005
Citation Context ...hm. The proposed technique divides the clustering process into two components. The online component stores summarized statistics about the data streams and the offline one performs clustering on the summarized data according to a number of user preferences such as the time frame and the number of clusters. A number of experiments on real datasets have been conducted to prove the accuracy and efficiency of the proposed algorithm. They [2] have recently proposed HPStream; a projected clustering for high dimensional data streams. HPStream has outperformed CluStream in recent results. Keogh et al [39] have proved empirically that most highly cited clustering of time series data streams algorithms proposed so far in the literature come out with meaningless results in subsequence clustering. They have proposed a solution approach using k-motif to choose the subsequences that the algorithm can work on to produce meaningful results. Gaber et al. [21] have developed Lightweight Clustering LWC. It is an AOG-based algorithm. AOG has been discussed in section 2. The algorithm adjusts a threshold that represents the minimum distance measure between data items in different clusters. This adjustment ... |

97 | Streaming-data algorithms for high-quality clustering.
- O'Callaghan, Mishra, et al.
- 2002
Citation Context ...ithm outperforms the scalable k-means in the majority of cases. The proposed algorithm is a one-pass algorithm with O(Tkn) complexity, where T is the average transaction size, n is the number of transactions and k is the number of centers. The use of binary data simplifies the manipulation of categorical data and eliminates the need for data normalization. The main idea behind the proposed algorithm is that it updates the cluster centers and weights after examining a batch of transactions whose size equals the square root of the number of transactions, rather than updating them one by one. O’Callaghan et al. [45] have proposed STREAM and LOCALSEARCH algorithms for high-quality data stream clustering. The STREAM algorithm starts by determining the size of the sample and then applies the LOCALSEARCH algorithm if the sample size is larger than a pre-specified equation result. This process is repeated for each data chunk. Finally, the LOCALSEARCH algorithm is applied to the cluster centers generated in the previous iterations. Aggarwal et al. [1] have proposed a framework for clustering data streams called the CluStream algorithm. The proposed technique divides the c... |

94 | Maintaining variance and k-medians over data stream windows
- Babcock, Datar, et al.
- 2003
Citation Context ...m and uses small space. It requires O(nk) time and O(n^ε) space, where k is the number of centers, n is the number of points, and ε < 1. They have proved that any k-median algorithm that achieves a constant factor approximation cannot achieve a better run time than O(nk). The algorithm starts by clustering a calculated size sample according to the available memory into 2k, and then at a second level, the algorithm clusters the above points for a number of samples into 2k and this process is repeated to a number of levels, and finally it clusters the 2k clusters into k clusters. Babcock et al. [7] have used the exponential histogram (EH) data structure to improve the algorithm of Guha et al. [27]. They use the same method described above; however, they address the problem of merging clusters when the two sets of cluster centers to be merged are far apart by maintaining the EH data structure. They have studied their proposed algorithm analytically. Charikar et al. [11] have proposed another k-median algorithm that overcomes the problem of approximation factors in the Guha et al. [27] algorithm increasing with the number of levels used to reach the final solution of the divide and c... |

94 | Better streaming algorithms for clustering problems.
- Charikar, O’Callaghan, et al.
- 2003
Citation Context ...ry into 2k, and then at a second level, the algorithm clusters the above points for a number of samples into 2k and this process is repeated to a number of levels, and finally it clusters the 2k clusters into k clusters. Babcock et al. [7] have used the exponential histogram (EH) data structure to improve the algorithm of Guha et al. [27]. They use the same method described above; however, they address the problem of merging clusters when the two sets of cluster centers to be merged are far apart by maintaining the EH data structure. They have studied their proposed algorithm analytically. Charikar et al. [11] have proposed another k-median algorithm that overcomes the problem of approximation factors in the Guha et al. [27] algorithm increasing with the number of levels used to reach the final solution of the divide-and-conquer algorithm. The algorithm has also been studied analytically. Domingos et al. [15, 16, 35] have proposed a general method for scaling up machine learning algorithms. They have termed this approach Very Fast Machine Learning (VFML). This method depends on determining an upper bound for the learner’s loss as a function of the number of data items to be examined in e... |

83 | A framework for projected clustering of high dimensional data streams.
- Aggarwal, Han, et al.
- 2004
Citation Context ... frequency moments [5] have been proposed as synopsis data structures. Since a synopsis of the data does not represent all the characteristics of the dataset, approximate answers are produced when using such data structures. 2.1.5 Aggregation Aggregation is the process of computing statistical measures such as means and variance that summarize the incoming stream. This aggregated data can then be used by the mining algorithm. The problem with aggregation is that it does not perform well with highly fluctuating data distributions. Merging online aggregation with offline mining has been studied in [1, 2, 3]. 2.2 Task-based Techniques Task-based techniques are those methods that modify existing techniques or invent new ones in order to address the computational challenges of data stream processing. Approximation algorithms, sliding window and algorithm output granularity represent this category. In the following subsections, we examine each of these techniques and its application in the context of data stream analysis. 2.2.1 Approximation algorithms Approximation algorithms [44] have their roots in algorithm design. They are concerned with designing algorithms for computationally hard problems. These a... |

79 | Identifying Representative Trends in Massive Time Series Data Sets Using Sketches.
- Indyk, Koudas, et al.
- 2000
Citation Context ... frequent items. It is worth mentioning that this data stream model is the hardest to analyze. Time series and cash register models are computationally easier. The former does not allow increments and decrements and the latter allows only increments. Gaber et al. [21] have developed one more AOG-based algorithm: Lightweight frequency counting (LWF). It has the ability to find an approximate solution to the most frequent items in the incoming stream using adaptation and releasing the least frequent items regularly in order to count the more frequent ones. 3.4 Time Series Analysis Indyk et al. [36] have proposed approximate solutions with probabilistic error bounding to two problems in time series analysis: relaxed periods and average trends. The algorithms use dimensionality reduction sketching techniques. The process starts with computing the sketches over an arbitrarily chosen time window and creating a so-called sketch pool. Using this pool of sketches, relaxed periods and average trends are computed. The algorithms have experimentally shown efficiency in running time and accuracy. Perlman and Java [49] have proposed a two-phase approach to mine astronomical time series streams. ... |

75 | A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering.
- Domingos, Hulten
- 2001
Citation Context ... be analyzed. Sampling, load shedding and sketching techniques represent the former one. Synopsis data structures and aggregation represent the latter. Here is an outline of the basics of these techniques with pointers to their applications in the context of data stream analysis. 2.1.1 Sampling Sampling refers to the probabilistic choice of whether or not a data item is processed. Sampling is an old statistical technique that has been used for a long time. Bounds on the error rate of the computation are given as a function of the sampling rate. Very Fast Machine Learning techniques [16] have used the Hoeffding bound to measure the sample size according to some derived loss functions. The problem with using sampling in the context of data stream analysis is the unknown dataset size. Thus, the treatment of data streams should follow a special analysis to find the error bounds. Another problem with sampling is that it would be important to check for anomalies for surveillance analysis as an application in mining data streams. Sampling may not be the right choice for such an application. Sampling also does not address the problem of fluctuating data rates. It would be worth investigat... |
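To illustrate how an error bound can be tied to the sample size, here is a minimal sketch of the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), for a quantity with range R averaged over n independent observations. It shows the general inequality only, not the specific loss functions derived in [16]. The function and parameter names are assumptions introduced for this example.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """Error margin: with probability at least 1 - delta, the true mean is
    no more than this amount away from the mean of n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def hoeffding_sample_size(value_range, delta, epsilon):
    """Smallest n for which the Hoeffding margin is at most epsilon."""
    return math.ceil(
        value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2)
    )
```

Note that the bound depends only on the range, the confidence level, and the number of observations, not on the total dataset size, which is one reason it is attractive when the stream length is unknown.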

70 | On Demand Classification of Data Streams.
- Aggarwal, Han, et al.
- 2004
Citation Context ... frequency moments [5] have been proposed as synopsis data structures. Since a synopsis of the data does not represent all the characteristics of the dataset, approximate answers are produced when using such data structures. 2.1.5 Aggregation Aggregation is the process of computing statistical measures such as means and variance that summarize the incoming stream. This aggregated data can then be used by the mining algorithm. The problem with aggregation is that it does not perform well with highly fluctuating data distributions. Merging online aggregation with offline mining has been studied in [1, 2, 3]. 2.2 Task-based Techniques Task-based techniques are those methods that modify existing techniques or invent new ones in order to address the computational challenges of data stream processing. Approximation algorithms, sliding window and algorithm output granularity represent this category. In the following subsections, we examine each of these techniques and its application in the context of data stream analysis. 2.2.1 Approximation algorithms Approximation algorithms [44] have their roots in algorithm design. They are concerned with designing algorithms for computationally hard problems. These a... |

70 | Distributed Data Mining: Algorithms, Systems, and Applications. To be published in the Data Mining Handbook. Editor: Nong Ye.
- Park, Kargupta
- 2002
Citation Context ...34]. Advances in networking and parallel computation have led to the introduction of distributed and parallel data mining. The goal was to extract knowledge from different subsets of a dataset and integrate these generated knowledge structures in order to gain a global model of the whole dataset. Client/server, mobile agent based and hybrid models have been proposed to address the communication overhead issue. Different variations of algorithms have been developed in order to increase the accuracy of the generated global model. More details about distributed data mining can be found in [47]. Recently, the data generation rates in some data sources have become faster than ever before. This rapid generation of continuous streams of information has challenged our storage, computation and communication capabilities in computing systems. Systems, models and techniques have been proposed and developed over the past few years to address these challenges [5, 44]. In this paper, we review the theoretical foundations of data stream analysis. Data stream mining systems and techniques are critically reviewed. Finally, we outline and discuss research problems in the stream mining field of study. Thes... |

67 | Clustering Binary Data Streams with K-means
- Ordonez
- 2003
Citation Context ... loss as a function of the number of data items to be examined in each step of the algorithm. They have applied this method to K-means clustering (VFKM) and decision tree classification (VFDT) techniques. These algorithms have been implemented and evaluated using synthetic data sets as well as real web data streams. VFKM uses the Hoeffding bound to determine the number of examples needed in each step of the K-means algorithm. VFKM runs as a sequence of K-means executions, with each run using more examples than the previous one until a calculated statistical bound (the Hoeffding bound) is satisfied. Ordonez [46] has proposed several improvements to the k-means algorithm to cluster binary data streams. He has developed an incremental k-means algorithm. The experiments were conducted on real data sets as well as synthetic ones. He has demonstrated experimentally that the proposed algorithm outperforms the scalable k-means in the majority of cases. The proposed algorithm is a one-pass algorithm with O(Tkn) complexity, where T is the average transaction size, n is the number of transactions and k is the number of centers. The use of binary data simplifies the manipulation of categorical data and eliminates the need fo... |

59 | Time series segmentation for context recognition in mobile devices.
- Himberg, Korpiaho, et al.
- 2001
Citation Context ...iecewise Aggregate Approximation followed by transforming the output to discrete string symbols in the second stage. Chen et al. [12] have proposed the application of so-called regression cubes for data streams. Due to the success of OLAP technology on static stored data, it has been proposed to use multidimensional regression analysis to create a compact cube that could be used for answering aggregate queries over the incoming streams. This research has been extended and adopted in an ongoing project, Mining Alarming Incidents in Data Streams (MAIDS). Himberg et al. [33] have presented and analyzed randomized variations of segmenting time series data streams generated by onboard mobile phone sensors. One application of clustering time series that is discussed is changing the user interface of a mobile phone screen according to the user context. It has been proven in this study that Global Iterative Replacement provides an approximately optimal solution with high efficiency in running time. Guralnik and Srivastava [29] have developed a generic event detection approach for time series streams. They have developed techniques for batch and online incremental processing o... |

55 | One-pass wavelet decomposition of data streams.
- Gilbert, Kotidis, et al.
- 2003
Citation Context ...f the features. It is the process of vertically sampling the incoming stream. Sketching has been applied in comparing different data streams and in aggregate queries. The major drawback of sketching is that of accuracy. It is hard to use it in the context of data stream mining. Principal Component Analysis (PCA) would be a better solution that has been applied in streaming applications [38]. 2.1.4 Synopsis Data Structures Creating a synopsis of data refers to the process of applying summarization techniques that are capable of summarizing the incoming stream for further analysis. Wavelet analysis [25], histograms, quantiles and frequency moments [5] have been proposed as synopsis data structures. Since a synopsis of data does not represent all the characteristics of the dataset, approximate answers are produced when using such data structures. 2.1.5 Aggregation Aggregation is the process of computing statistical measures such as means and variance that summarize the incoming stream. This aggregated data could be used by the mining algorithm. The problem with aggregation is that it does not perform well with highly fluctuating data distributions. Merging online aggregation with offline ... |

51 | Mining Data Streams under Block Evolution.
- Ganti, Gehrke, Ramakrishnan
- 2002
Citation Context ... drifting in the evolving data. The proposed technique uses weighted classifier ensembles to mine data streams. The expiration of old data in their model depends on the data distribution. They use synthetic and real life data streams to test their algorithm and compare the single classifier with classifier ensembles. The proposed algorithm combines multiple classifiers weighted by their expected prediction accuracy. Selecting a subset of the classifiers instead of using all of them is also an option in the proposed framework, without losing accuracy in the classification process. Ganti et al. [18] have developed analytically an algorithm for model maintenance under insertion and deletion of blocks of data records. This algorithm can be applied to any incremental data mining model. They have also described a generic framework for change detection between two data sets in terms of the data mining results they induce. They formalize the above two techniques into two general algorithms: GEMM and FOCUS. The algorithms have been applied to decision tree models and the frequent itemset model. The GEMM algorithm accepts a class of models and an incremental model maintenance algorithm for the unres... |

46 | Adaptive, Hands-Off Stream Mining,
- Papadimitriou, Faloutsos, et al.
- 2003
Citation Context ...del maintenance algorithm for both window-independent and window-dependent block selection sequences. The FOCUS framework uses the difference between data mining models as the deviation in data sets. Domingos et al. [15] have developed VFDT. It is a decision tree learning system based on Hoeffding trees. It splits the tree using the current best attribute, provided that the number of examined data items satisfies a statistical measure, the Hoeffding bound. The algorithm also deactivates the least promising leaves and drops the non-potential attributes. Papadimitriou et al. [48] have proposed AWSOM (Arbitrary Window Stream mOdeling Method) for interesting pattern discovery from sensors. They developed a one-pass algorithm to incrementally update the patterns. Their method requires only O(log N) memory where N is the length of the sequence. They conducted experiments with real and synthetic data sets. They use wavelet coefficients for compact information representation and correlation structure detection, and then apply a linear regression model in the wavelet domain. Aggarwal et al. have adopted the idea of microclusters introduced in CluStream in On-Demand classifica... |

38 | Online mining of changes from data streams: Research problems and preliminary results,
- Dong, Han, et al.
- 2003
Citation Context ...d not be solved using approximation algorithms. Other tools should be used along with these algorithms in order to adapt to the available resources. [SIGMOD Record, Vol. 34, No. 2, June 2005] Approximation algorithms have been used in [13]. 2.2.2 Sliding Window The inspiration behind the sliding window is that the user is more concerned with the analysis of the most recent data streams. Thus the detailed analysis is done over the most recent data items and summarized versions of the old ones. This idea has been adopted in many techniques in the ongoing comprehensive data stream mining system MAIDS [17]. 2.2.3 Algorithm Output Granularity The algorithm output granularity (AOG) [21, 22, 23] introduces the first resource-aware data analysis approach that can cope with fluctuating very high data rates according to the available memory and the processing speed represented in time constraints. The AOG performs the local data analysis on a resource constrained device that generates or receives streams of information. AOG has three main stages. Mining followed by adaptation to resources and data stream rates represent the first two stages. Merging the generated knowledge structures when running out ... |
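
The sliding window idea quoted above (detailed analysis of recent items, summarized versions of old ones) might be sketched as follows. This is only an illustration under stated assumptions: the class name, the fixed window size and the choice of count and sum as the summary of old items are all hypothetical, and it does not reproduce how MAIDS implements the idea.

```python
from collections import deque
from typing import Optional


class SummarizingWindow:
    """Keeps the most recent `size` items in full; older items survive
    only as an aggregate (count and sum)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.recent = deque()  # detailed view of the newest items
        self.old_count = 0     # number of items already evicted
        self.old_sum = 0.0     # sum of the evicted items

    def add(self, x: float) -> None:
        """Append a new item and fold the oldest one into the summary if needed."""
        self.recent.append(x)
        if len(self.recent) > self.size:
            evicted = self.recent.popleft()
            self.old_count += 1
            self.old_sum += evicted

    def old_mean(self) -> Optional[float]:
        """Mean of the evicted items, or None if nothing has been evicted yet."""
        return self.old_sum / self.old_count if self.old_count else None


if __name__ == "__main__":
    window = SummarizingWindow(size=3)
    for value in [1, 2, 3, 4, 5]:
        window.add(value)
    print(list(window.recent), window.old_mean())
```

Memory use is bounded by the window size plus a constant, regardless of how long the stream runs.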

38 | MobiMine: Monitoring the Stock Market from a PDA.
- Kargupta, Park, et al.
- 2002
Citation Context ...e development of robust streaming analysis systems. The following represents a list of such applications. • Burl et al. [9] have developed Diamond Eye for NASA and JPL. The aim of this project is to enable remote computing systems as well as observing scientists to extract patterns from spatial objects in real time image streams. The success of this project will enable “a new era of exploration using highly autonomous spacecraft, rovers, and sensors” [9]. This project represents an early development in streaming analysis applications. • Kargupta et al. [37] have developed the first ubiquitous data stream mining system: MobiMine. It is a client/server PDA-based distributed data stream mining application for stock market data. It should be pointed out that the mining component is located at the server side rather than the PDA. There are different interactions between the server and the PDA until the results are finally displayed on the PDA screen. The tendency to perform data mining at the server side has changed with the increase of the computational power of small devices. • Kargupta et al. [38] have developed Vehicle Data Stream Mining System (VEDA... |

36 | Statistics and Data Mining: Intersecting Disciplines
- Hand
- 1999
Citation Context ...e objective was to find computationally efficient solutions to data analysis problems. Along with the progress in machine learning research, new data analysis problems have been addressed. Due to the increase in database sizes, new algorithms have been proposed to deal with the scalability issue. Moreover, machine learning and statistical analysis techniques have been adopted and modified in order to address the problem of very large databases. Data mining is the interdisciplinary field of study that can extract models and patterns from large amounts of information stored in data repositories [30, 31, 34]. Advances in networking and parallel computation have led to the introduction of distributed and parallel data mining. The goal was to extract knowledge from different subsets of a dataset and integrate these generated knowledge structures in order to gain a global model of the whole dataset. Client/server, mobile agent based and hybrid models have been proposed to address the communication overhead issue. Different variations of algorithms have been developed in order to increase the accuracy of the generated global model. More details about distributed data mining can be found in [47... |

36 | Advances in Intelligent Data Analysis.
- Hoffmann, Hand, et al.
- 2001
Citation Context ...e objective was to find computationally efficient solutions to data analysis problems. Along with the progress in machine learning research, new data analysis problems have been addressed. Due to the increase in database sizes, new algorithms have been proposed to deal with the scalability issue. Moreover, machine learning and statistical analysis techniques have been adopted and modified in order to address the problem of very large databases. Data mining is the interdisciplinary field of study that can extract models and patterns from large amounts of information stored in data repositories [30, 31, 34]. Advances in networking and parallel computation have led to the introduction of distributed and parallel data mining. The goal was to extract knowledge from different subsets of a dataset and integrate these generated knowledge structures in order to gain a global model of the whole dataset. Client/server, mobile agent based and hybrid models have been proposed to address the communication overhead issue. Different variations of algorithms have been developed in order to increase the accuracy of the generated global model. More details about distributed data mining can be found in [47... |

32 | Load shedding techniques for data stream systems.
- Babcock, Datar, et al.
- 2003
Citation Context ...e context of data stream analysis is the unknown dataset size. Thus the treatment of a data stream should follow a special analysis to find the error bounds. Another problem with sampling is that it would be important to check for anomalies for surveillance analysis as an application in mining data streams. Sampling may not be the right choice for such an application. Sampling also does not address the problem of fluctuating data rates. It would be worth investigating the relationship among the three parameters: data rate, sampling rate and error bounds. 2.1.2 Load Shedding Load shedding [6, 52] refers to the process of dropping a sequence of data streams. Load shedding has been used successfully in querying data streams. It has the same problems as sampling. Load shedding is difficult to use with mining algorithms because it drops chunks of data streams that could be used in the structuring of the generated models or might represent a pattern of interest in time series analysis. 2.1.3 Sketching Sketching [5, 44] is the process of randomly projecting a subset of the features. It is the process of vertically sampling the incoming stream. Sketching has been applied in comparing different da... |

30 | MAIDS: Mining Alarming Incidents from Data Streams.
- Cai, Clutter, et al.
- 2004
Citation Context ...r is not interested in mining data stream results, but in how these results change over time. If, for example, the number of clusters generated changes, it might represent some changes in the dynamics of the arriving stream. Dynamics of data streams using changes in the knowledge structures generated would benefit many temporal-based analysis applications. Developing algorithms for mining results’ changes: this is related to the previous issue. Traditional data mining algorithms do not produce any results that show the change of the results over time. This issue has been addressed in MAIDS [10]. Visualization of data mining results on small screens of mobile devices: visualization of traditional data mining results on a desktop is still a research issue. Visualization on the small screen of a PDA, for example, is a real challenge. Imagine a businessman whose data are being streamed and analyzed on his PDA. Such results should be efficiently visualized in a way that enables him to make a quick decision. This issue has been addressed in [37]. The above are the addressed research issues in mining data streams. Open Issues in the field are discussed i... |

26 | Online Classification of Nonstationary Data Streams,
- Last
- 2002
Citation Context ...mentally update the patterns. Their method requires only O(log N) memory where N is the length of the sequence. They conducted experiments with real and synthetic data sets. They use wavelet coefficients for compact information representation and correlation structure detection, and then apply a linear regression model in the wavelet domain. Aggarwal et al. have adopted the idea of microclusters introduced in CluStream in On-Demand classification [3], and it shows high accuracy. The technique uses clustering results to classify data using statistics of class distribution in each cluster. Last [41] has proposed an online classification system that can adapt to concept drift. The system rebuilds the classification model with the most recent examples. Using the error rate as a guide to concept drift, the frequency of model building and the window size are adjusted. The system uses info-fuzzy techniques for model building and information theory to calculate the window size. Ding et al. [14] have developed a decision tree based on the Peano count tree data structure. It has been shown experimentally that it is a fast building algorithm that is suitable for streaming applications. Gaber et al. [... |

22 | Telegraphcq: An architectural status report.
- Krishnamurthy, Chandrasekaran, et al.
- 2003
Citation Context ...being streamed and analyzed on his PDA. Such results should be efficiently visualized in a way that enables him to make a quick decision. This issue has been addressed in [37]. The above are the addressed research issues in mining data streams. Open Issues in the field are discussed in the following: Interactive mining environment to satisfy user requirements: mining data streams is a highly application oriented field. The user requirements are considered a vital research problem to be addressed. The integration between data stream management systems [4, 40] and the ubiquitous data stream mining approaches: it is an essential issue that should be addressed to realize fully functioning ubiquitous mining. The integration among storage, querying, mining and reasoning over streaming information would realize robust streaming systems that could be used in different applications. Current database management systems have achieved this goal over static stored datasets. The needs of real world applications: the relationship between the proposed techniques and the needs of the real world applications is another important issue. Some of the proposed techn... |

18 | Resource-Aware Knowledge Discovery in Data Streams, - Gaber, Zaslavsky, et al. - 2004 |

17 | Decision Tree Classification of Spatial Data Streams Using Peano Count Trees,
- Ding, Ding, et al.
- 2002
Citation Context ... introduced in CluStream in On-Demand classification [3], and it shows high accuracy. The technique uses clustering results to classify data using statistics of class distribution in each cluster. Last [41] has proposed an online classification system that can adapt to concept drift. The system rebuilds the classification model with the most recent examples. Using the error rate as a guide to concept drift, the frequency of model building and the window size are adjusted. The system uses info-fuzzy techniques for model building and information theory to calculate the window size. Ding et al. [14] have developed a decision tree based on the Peano count tree data structure. It has been shown experimentally that it is a fast building algorithm that is suitable for streaming applications. Gaber et al. [21] have developed Lightweight Classification (LWClass). It is a variation of LWC. It is also an AOG-based technique. The idea is to use K-nearest neighbors with updating the frequency of class occurrence given the data stream features. In case of contradiction between the incoming stream and the stored summary of the cases, the frequency is reduced. In ... |

17 | Load Shedding on Data Streams,
- Tatbul, Cetintemel, et al.
- 2003
Citation Context ...e context of data stream analysis is the unknown dataset size. Thus the treatment of a data stream should follow a special analysis to find the error bounds. Another problem with sampling is that it would be important to check for anomalies for surveillance analysis as an application in mining data streams. Sampling may not be the right choice for such an application. Sampling also does not address the problem of fluctuating data rates. It would be worth investigating the relationship among the three parameters: data rate, sampling rate and error bounds. 2.1.2 Load Shedding Load shedding [6, 52] refers to the process of dropping a sequence of data streams. Load shedding has been used successfully in querying data streams. It has the same problems as sampling. Load shedding is difficult to use with mining algorithms because it drops chunks of data streams that could be used in the structuring of the generated models or might represent a pattern of interest in time series analysis. 2.1.3 Sketching Sketching [5, 44] is the process of randomly projecting a subset of the features. It is the process of vertically sampling the incoming stream. Sketching has been applied in comparing different da... |

15 | Energy Consumption in Data Analysis for On-board and Distributed Applications,
- Bhargava, Kargupta, et al.
- 2003
Citation Context ...able of dealing with such a continuous high data rate. Novel indexing, storage and querying techniques are required to handle this nonstop, fluctuating flow of information streams. Minimizing energy consumption of the mobile device: Large amounts of data streams are generated in resource-constrained environments. Sensor networks represent a typical example. These devices have short-life batteries. The design of techniques that are energy efficient is a crucial issue given that sending all the generated stream to a central site is energy inefficient in addition to its lack of scalability [8]. Unbounded memory requirements due to the continuous flow of data streams: machine learning techniques represent the main source of data mining algorithms. Most machine learning methods require data to be resident in memory while executing the analysis algorithm. Due to the huge amounts of the generated streams, it is a very important concern to design space efficient techniques that can have only one look or less over the incoming stream. Required result accuracy: designing space and time efficient techniques should be accompanied by acceptable result accuracy. Approximation al... |

12 | Towards an Adaptive Approach for Mining Data Streams in Resource Constrained Environments,
- Gaber, Zaslavsky, et al.
Citation Context ... with these algorithms in order to adapt to the available resources. Approximation algorithms have been used in [13]. 2.2.2 Sliding Window The inspiration behind the sliding window is that the user is more concerned with the analysis of the most recent data streams. Thus the detailed analysis is done over the most recent data items and summarized versions of the old ones. This idea has been adopted in many techniques in the ongoing comprehensive data stream mining system MAIDS [17]. 2.2.3 Algorithm Output Granularity The algorithm output granularity (AOG) [21, 22, 23] introduces the first resource-aware data analysis approach that can cope with fluctuating very high data rates according to the available memory and the processing speed represented in time constraints. The AOG performs the local data analysis on a resource constrained device that generates or receives streams of information. AOG has three main stages. Mining followed by adaptation to resources and data stream rates represent the first two stages. Merging the generated knowledge structures when running out of memory represents the last stage. AOG has been used in clustering, classification and... |

7 | Diamond Eye: A distributed architecture for image data mining,
- Fowlkes, Stechert, et al.
- 1999
Citation Context ... context. It has been proven in this study that Global Iterative Replacement provides an approximately optimal solution with high efficiency in running time. Guralnik and Srivastava [29] have developed a generic event detection approach for time series streams. They have developed techniques for batch and online incremental processing of time series data. The techniques have proven efficient with real and synthetic data sets. 4 Systems Several applications have stimulated the development of robust streaming analysis systems. The following represents a list of such applications. • Burl et al. [9] have developed Diamond Eye for NASA and JPL. The aim of this project is to enable remote computing systems as well as observing scientists to extract patterns from spatial objects in real time image streams. The success of this project will enable “a new era of exploration using highly autonomous spacecraft, rovers, and sensors” [9]. This project represents an early development in streaming analysis applications. • Kargupta et al. [37] have developed the first ubiquitous data stream mining system: MobiMine. It is a client/server PDA-based distributed d... |

7 | A Cost-Efficient Model for Ubiquitous Data Stream Mining,
- Gaber, Zaslavsky, et al.
Citation Context ... with these algorithms in order to adapt to the available resources. Approximation algorithms have been used in [13]. 2.2.2 Sliding Window The inspiration behind the sliding window is that the user is more concerned with the analysis of the most recent data streams. Thus the detailed analysis is done over the most recent data items and summarized versions of the old ones. This idea has been adopted in many techniques in the ongoing comprehensive data stream mining system MAIDS [17]. 2.2.3 Algorithm Output Granularity The algorithm output granularity (AOG) [21, 22, 23] introduces the first resource-aware data analysis approach that can cope with fluctuating very high data rates according to the available memory and the processing speed represented in time constraints. The AOG performs the local data analysis on a resource constrained device that generates or receives streams of information. AOG has three main stages. Mining followed by adaptation to resources and data stream rates represent the first two stages. Merging the generated knowledge structures when running out of memory represents the last stage. AOG has been used in clustering, classification and... |

4 | STREAM: The Stanford Stream Data Manager Demonstration description - short overview of system status and plans;
- Babu, Datar, et al.
- 2003
Citation Context ...being streamed and analyzed on his PDA. Such results should be efficiently visualized in a way that enables him to make a quick decision. This issue has been addressed in [37]. The above are the addressed research issues in mining data streams. Open Issues in the field are discussed in the following: Interactive mining environment to satisfy user requirements: mining data streams is a highly application oriented field. The user requirements are considered a vital research problem to be addressed. The integration between data stream management systems [4, 40] and the ubiquitous data stream mining approaches: it is an essential issue that should be addressed to realize fully functioning ubiquitous mining. The integration among storage, querying, mining and reasoning over streaming information would realize robust streaming systems that could be used in different applications. Current database management systems have achieved this goal over static stored datasets. The needs of real world applications: the relationship between the proposed techniques and the needs of the real world applications is another important issue. Some of the proposed techn... |

2 | Querying and mining data streams: you only get one look, a tutorial. SIGMOD Conference - Garofalakis, Gehrke, Rastogi - 2002 |

2 | Predictive Mining of Time Series Data in Astronomy
- Perlman, Java
- 2003
Citation Context ... order to count the more frequent ones. 3.4 Time Series Analysis Indyk et al. [36] have proposed approximate solutions with probabilistic error bounding to two problems in time series analysis: relaxed periods and average trends. The algorithms use dimensionality reduction sketching techniques. The process starts with computing the sketches over an arbitrarily chosen time window and creating a so-called sketch pool. Using this pool of sketches, relaxed periods and average trends are computed. The algorithms have been shown experimentally to be efficient in running time and accuracy. Perlman and Java [49] have proposed a two-phase approach to mine astronomical time series streams. The first phase clusters sliding window patterns of each time series. Using the created clusters, an association rule discovery technique is used to create affinity analysis results among the created clusters of time series. Zhu and Shasha [54] have proposed techniques to compute some statistical measures over time series data streams. The proposed techniques use the discrete Fourier transform. The system is called StatStream and is able to compute approximate error bounded correlations and inner products. The system wor... |