Results 1 – 10 of 13
Mining high-speed data streams
, 2000
Abstract

Cited by 318 (10 self)
 Add to MetaCart
Iterative Incremental Clustering of Time Series
 EDBT
"... Abstract. We present a novel anytime version of partitional clustering algorithms, such as k-Means and EM, for time series. The algorithm works by leveraging off the multiresolution property of wavelets. The dilemma of choosing the initial centers is mitigated by initializing the centers at each app ..."
Abstract

Cited by 29 (8 self)
 Add to MetaCart
Abstract. We present a novel anytime version of partitional clustering algorithms, such as k-Means and EM, for time series. The algorithm works by leveraging off the multiresolution property of wavelets. The dilemma of choosing the initial centers is mitigated by initializing the centers at each approximation level, using the final centers returned by the coarser representations. In addition to casting the clustering algorithms as anytime algorithms, this approach has two other very desirable properties. By working at lower dimensionalities we can efficiently avoid local minima; therefore, the quality of the clustering is usually better than that of the batch algorithm. In addition, even if the algorithm is run to completion, our approach is much faster than its batch counterpart. We explain, and empirically demonstrate, these surprising and desirable properties with comprehensive experiments on several publicly available real data sets. We further demonstrate that our approach can be generalized to a framework covering a much broader range of algorithms and data mining problems.
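The coarse-to-fine initialization this abstract describes can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: it assumes Haar-style pairwise averaging for the wavelet approximation, power-of-two series length, and simple Lloyd's iterations; all function names are made up for the sketch.

```python
import numpy as np

def haar_approx(X, level):
    """Coarse representation: average adjacent pairs `level` times.
    Assumes the series length is divisible by 2**level."""
    A = X
    for _ in range(level):
        A = (A[:, 0::2] + A[:, 1::2]) / 2.0
    return A

def kmeans(X, centers, iters=20):
    """Plain Lloyd's iterations starting from the given centers."""
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(len(centers)):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers, labels

def anytime_kmeans(X, k, levels):
    """Cluster coarse-to-fine; each level's final centers seed the next.
    Stopping after any level still yields a usable clustering (anytime)."""
    centers, labels = None, None
    for level in range(levels, -1, -1):      # coarsest first, raw data last
        A = haar_approx(X, level)
        if centers is None:
            # simple spread-out initial pick at the coarsest level
            idx = np.linspace(0, len(A) - 1, k).astype(int)
            centers = A[idx].astype(float).copy()
        else:
            # upsample centers to the finer resolution by repetition
            centers = np.repeat(centers, 2, axis=1)
        centers, labels = kmeans(A, centers)
    return labels
```

Each refinement starts from centers that are already near a good solution, which is why running to completion can still be cheaper than batch k-Means on the raw data.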
Anytime classification using the nearest neighbor algorithm with applications to stream mining
 IEEE International Conference on Data Mining (ICDM)
, 2006
"... For many real world problems we must perform classification under widely varying amounts of computational resources. For example, if asked to classify an instance taken from a bursty stream, we may have from milliseconds to minutes to return a class prediction. For such problems an anytime algorithm ..."
Abstract

Cited by 19 (8 self)
 Add to MetaCart
For many real-world problems we must perform classification under widely varying amounts of computational resources. For example, if asked to classify an instance taken from a bursty stream, we may have from milliseconds to minutes to return a class prediction. For such problems an anytime algorithm may be especially useful. In this work we show how we can convert the ubiquitous nearest neighbor classifier into an anytime algorithm that can produce an instant classification or, if given the luxury of additional time, can utilize the extra time to increase classification accuracy. We demonstrate the utility of our approach with a comprehensive set of experiments on data from diverse domains.
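The core idea, interruptible nearest-neighbor search, can be sketched as follows. This is an assumption-laden simplification: the paper orders the training instances carefully (e.g., by expected utility), whereas here the order is simply the order given, and the time budget is modeled as a count of comparisons; `anytime_1nn` is an illustrative name.

```python
import numpy as np

def anytime_1nn(query, train_X, train_y, budget):
    """Scan up to `budget` training examples in their given order and
    return the label of the nearest one seen so far. A prediction is
    available after a single comparison; more budget refines it."""
    best_d, best_label = float("inf"), None
    for i in range(min(budget, len(train_X))):
        d = np.linalg.norm(query - train_X[i])
        if d < best_d:
            best_d, best_label = d, train_y[i]
    return best_label
```

With a small budget the answer may be wrong but is instant; as the budget grows the prediction converges to the exact 1-NN classification.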
A Wavelet-Based Anytime Algorithm for k-Means Clustering of Time Series
 In Proc. Workshop on Clustering High Dimensionality Data and Its Applications
, 2003
"... The emergence of the field of data mining in the last decade has sparked an increasing interest in clustering of time series. Although there has been much research on clustering in general, most classic machine learning and data mining algorithms do not work well for time series due to their unique ..."
Abstract

Cited by 19 (3 self)
 Add to MetaCart
The emergence of the field of data mining in the last decade has sparked an increasing interest in clustering of time series. Although there has been much research on clustering in general, most classic machine learning and data mining algorithms do not work well for time series due to their unique structure. In particular, the high dimensionality, very high feature correlation, and the (typically) large amount of noise that characterize time series data present a difficult challenge. In this work we address these challenges by introducing a novel anytime version of the k-Means clustering algorithm for time series. The algorithm works by leveraging off the multiresolution property of wavelets. In particular, an initial clustering is performed with a very coarse resolution representation of the data. The results obtained from this "quick and dirty" clustering are used to initialize a clustering at a slightly finer level of approximation. This process is repeated until the clustering results stabilize or until the "approximation" is the raw data. In addition to casting k-Means as an anytime algorithm, our approach has two other very unintuitive properties. The quality of the clustering is often better than the batch algorithm, and even if the algorithm is run to completion, the time taken is typically much less than the time taken by the original algorithm. We explain, and empirically demonstrate, these surprising and desirable properties with comprehensive experiments on several publicly available real data sets.
Automating Exploratory Data Analysis for Efficient Data Mining
 Mining
, 2000
"... Having access to large data sets for the purpose of predictive data mining does not guarantee good models, even when the size of the training data is virtually unlimited. Instead, careful data preprocessing is required, including data cleansing, handling missing values, attribute representation a ..."
Abstract

Cited by 6 (2 self)
 Add to MetaCart
Having access to large data sets for the purpose of predictive data mining does not guarantee good models, even when the size of the training data is virtually unlimited. Instead, careful data preprocessing is required, including data cleansing, handling missing values, attribute representation and encoding, and generating derived attributes. In particular, the selection of the most appropriate subset of attributes to include is a critical step in building an accurate and efficient model. We describe an automated approach to the exploration, preprocessing, and selection of the optimal attribute subset whose goal is to simplify the KDD process and dramatically shorten the time to build a model. Our implementation finds inappropriate and suspicious attributes, performs target dependency analysis, determines optimal attribute encoding, generates new derived attributes, and provides a flexible approach to attribute selection. We present results generated by an industrial KDD environment called the Accrue Decision Series on several real-world Web data sets.
Parallel Incremental 2D-Discretization on Dynamic Data Sets
 Proc. Int’l Parallel and Distributed Processing Symp.
, 2002
"... Most current work in data mining assumes that the database is static, and a database update requires rediscovering all the patterns by scanning the entire old and new database. Such approaches can waste a lot of computational and I/O resources, and result in relatively slow response times, to essent ..."
Abstract

Cited by 3 (3 self)
 Add to MetaCart
Most current work in data mining assumes that the database is static, and a database update requires rediscovering all the patterns by scanning the entire old and new database. Such approaches can waste a lot of computational and I/O resources, and result in relatively slow response times, for what is essentially an interactive process. In this paper we address this issue in the context of 2-dimensional discretization within a multi-attribute database. Discretization, an important problem in data mining, is typically used to partition the range of continuous attribute(s) into intervals which highlight the behavior of a related discrete attribute. It can be used to build decision trees and to determine appropriate aggregations for On-Line Analytical Processing. We first propose a time-optimal solution to the problem. We then parallelize and incrementalize the algorithm so that it can dynamically maintain the required information even in the presence of data updates, without re-executing the algorithm on the entire dataset. Experimental results confirm that our approach results in execution time improvements of up to several orders of magnitude on large datasets.
Autocannibalistic and Anyspace Indexing Algorithms with Applications to Sensor Data Mining
"... Efficient indexing is at the heart of many data mining algorithms. A simple and extremely effective algorithm for indexing under any metric space was introduced in 1991 by Orchard. Orchard’s algorithm has not received much attention in the data mining and database community because of a fatal flaw; ..."
Abstract

Cited by 3 (3 self)
 Add to MetaCart
Efficient indexing is at the heart of many data mining algorithms. A simple and extremely effective algorithm for indexing under any metric space was introduced in 1991 by Orchard. Orchard’s algorithm has not received much attention in the data mining and database community because of a fatal flaw: it requires quadratic space. In this work we show that we can produce a reduced version of Orchard’s algorithm that requires much less space, but produces nearly identical speedup. We achieve this by casting the algorithm in an anyspace framework, allowing deployed applications to take as much of an index as their main memory/sensor can afford. As we shall demonstrate, this ability to create an anyspace algorithm also allows us to create autocannibalistic algorithms. Autocannibalistic algorithms are algorithms which initially require a certain amount of space to index or classify data, but if unexpected circumstances require them to store additional information, they can dynamically delete parts of themselves to make room for the new data. We demonstrate the utility of autocannibalistic algorithms in a fielded project on insect monitoring with low-power sensors, and a simple autonomous robot application.
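Orchard's index, and the anyspace truncation the abstract describes, can be sketched as below. This is a simplified illustration under stated assumptions: Euclidean distance, a fixed starting pivot, and a `list_len` parameter standing in for the memory budget; with truncated lists the search becomes approximate rather than exact. All names are invented for the sketch.

```python
import numpy as np

def build_orchard(points, list_len=None):
    """For each point, precompute the other points sorted by distance.
    Full lists need O(n^2) space; truncating to `list_len` entries per
    point gives the reduced (anyspace) version of the index."""
    n = len(points)
    D = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    ranked = [np.argsort(D[i])[1:] for i in range(n)]   # skip self
    if list_len is not None:
        ranked = [r[:list_len] for r in ranked]
    return D, ranked

def orchard_nn(query, points, D, ranked):
    """Walk the index: repeatedly jump to any neighbor of the current
    best candidate that is closer to the query."""
    best = 0
    best_d = np.linalg.norm(query - points[best])
    improved = True
    while improved:
        improved = False
        for j in ranked[best]:
            if D[best][j] > 2 * best_d:
                break          # triangle-inequality cutoff: list is sorted,
                               # so no later entry can beat the current best
            dj = np.linalg.norm(query - points[j])
            if dj < best_d:
                best, best_d = j, dj
                improved = True
                break          # restart from the new candidate's list
    return best, best_d
```

Deleting the tails of the sorted lists (the "autocannibalistic" step) frees memory while leaving the early, most useful entries intact, which is why speedup degrades gracefully.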
Facilitating Interactive Distributed Data Stream Processing and Mining
, 2004
"... The past few years have seen the emergence of application domains that need to process data elements arriving as a continuous stream. Recently, several architectures to process database queries over these data streams have been proposed in the literature. Although these architectures may be suitable ..."
Abstract

Cited by 1 (1 self)
 Add to MetaCart
The past few years have seen the emergence of application domains that need to process data elements arriving as a continuous stream. Recently, several architectures to process database queries over these data streams have been proposed in the literature. Although these architectures may be suitable for general purpose query processing in a centralized setting, they have serious limitations when it comes to supporting data mining queries in a distributed setting. Data mining is an interactive process and it is crucial that we provide the user with interactive response times. In addition, many data mining applications, such as network intrusion detection, need to process data streams arriving at distributed endpoints. Centralized processing of data streams for network intrusion detection would be overwhelming. These are fundamental issues for data mining over data streams and have been addressed in this paper. Our schemes give controlled interactive response times when processing data streams in a distributed setting.
Multi-Resolution k-Means Clustering of Time Series and Application to Images
 Workshop on Multimedia Data Mining, the 4th SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington D.C.
, 2003
"... Clustering is vital in the process of condensing and outlining information, since it can provide a synopsis of the stored data. However, the high dimensionality of multimedia data today presents an insurmountable challenge for clustering algorithms. Based on the well known fact that time series and ..."
Abstract

Cited by 1 (1 self)
 Add to MetaCart
Clustering is vital in the process of condensing and outlining information, since it can provide a synopsis of the stored data. However, the high dimensionality of multimedia data today presents an insurmountable challenge for clustering algorithms. Based on the well-known fact that time series and image histograms can both be represented accurately in a lower resolution using orthonormal decompositions, we present an anytime version of the k-Means algorithm. The algorithm works by leveraging the multiresolution property of wavelets. The dilemma of choosing the initial centers for k-Means is mitigated by assigning the final centers at each approximation level as the initial centers for the subsequent, finer approximation. This process is repeated until the clustering results stabilize or until the finest “approximation” level is reached. In addition to casting k-Means as an anytime algorithm, our approach has two other very desirable properties. We observe that even by working at coarser approximations, the achieved quality is better than the batch algorithm, and even if the algorithm is run to completion, the running time is significantly reduced. We show how this algorithm can be suitably extended to chromatic and textural features extracted from images. Finally, we demonstrate the applicability of this approach on the online image search engine scenario.
4. Multiresolution Clustering of Time Series and Application to Images
"... Summary. Clustering is vital in the process of condensing and outlining information, since it can provide a synopsis of the stored data. However, the high dimensionality of multimedia data today presents an insurmountable challenge for clustering algorithms. Based on the well-known fact that time se ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
Summary. Clustering is vital in the process of condensing and outlining information, since it can provide a synopsis of the stored data. However, the high dimensionality of multimedia data today presents an insurmountable challenge for clustering algorithms. Based on the well-known fact that time series and image histograms can both be represented accurately in a lower resolution using orthonormal decompositions, we present an anytime version of the k-means algorithm. The algorithm works by leveraging off the multiresolution property of wavelets. The dilemma of choosing the initial centers for k-means is mitigated by assigning the final centers at each approximation level as the initial centers for the subsequent, finer approximation. In addition to casting k-means as an anytime algorithm, our approach has two other very desirable properties. We observe that even by working at coarser approximations, the achieved quality is better than the batch algorithm, and that even if the algorithm is run to completion, the running time is significantly reduced. We show how this algorithm can be suitably extended to chromatic and textural features extracted from images. Finally, we demonstrate the applicability of this approach on the online image search engine scenario.