Results 1 - 10
of
18
Data Clustering: 50 Years Beyond K-Means
, 2008
"... Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc.). Cluster analysis is the formal study of algorithms and m ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc.). Cluster analysis is the formal study of algorithms and methods for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is exploratory in nature to find structure in data. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general purpose clustering algorithm and the illposed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection, and data clustering and large scale data clustering.
Visualising the Cluster Structure of Data Streams
"... Abstract. The increasing availability of streaming data is a consequence of the continuing advancement of data acquisition technology. Such data provides new challenges to the various data analysis communities. Clustering has long been a fundamental procedure for acquiring knowledge from data, and n ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Abstract. The increasing availability of streaming data is a consequence of the continuing advancement of data acquisition technology. Such data provides new challenges to the various data analysis communities. Clustering has long been a fundamental procedure for acquiring knowledge from data, and new tools are emerging that allow the clustering of data streams. However the dynamic, temporal components of streaming data provide extra challenges to the development of stream clustering and associated visualisation techniques. In this work we combine a streaming clustering framework with an extension of a static cluster visualisation method, in order to construct a surface that graphically represents the clustering structure of the data stream. The proposed method, OpticsStream, provides intuitive representations of the clustering structure as well as the manner in which this structure changes through time. 1
DOI 10.1007/s10115-007-0070-x REGULAR PAPER
"... Tracking clusters in evolving data streams over sliding windows ..."
Robust Clustering of Data Streams using Incremental Optimization
"... Abstract. In many applications, it is useful to detect the evolving patterns in a data stream, and be able to capture them accurately (e.g. detecting the purchasing trends of customers over time on an e-commerce website). Data stream mining is challenging because of harsh constraints due to the cont ..."
Abstract
- Add to MetaCart
Abstract. In many applications, it is useful to detect the evolving patterns in a data stream, and be able to capture them accurately (e.g. detecting the purchasing trends of customers over time on an e-commerce website). Data stream mining is challenging because of harsh constraints due to the continuous arrival of huge amounts of data that prevent unlimited storage and processing in memory, and the lack of control over the data arrival pattern. In this paper, we present a new approach to discover the evolving dense clusters in a dynamic data stream by incrementally updating the cluster parameters using a method based on robust statistics. Our approach exhibits robustness toward an unknown number of outliers, with no assumptions about the number of clusters. Moreover, it can adapt to the evolution of the clusters in the input data stream. 1
An EM-Based Algorithm for Clustering Data Streams in Sliding Windows
"... Abstract. Cluster analysis has played a key role in data understanding. When such an important data mining task is extended to the context of data streams, it becomes more challenging since the data arrive at a mining system in one-pass manner. The problem is even more difficult when the clustering ..."
Abstract
- Add to MetaCart
Abstract. Cluster analysis has played a key role in data understanding. When such an important data mining task is extended to the context of data streams, it becomes more challenging since the data arrive at a mining system in one-pass manner. The problem is even more difficult when the clustering task is considered in a sliding window model which requiring the elimination of outdated data must be dealt with properly. We propose SWEM algorithm that exploits the Expectation Maximization technique to address these challenges. SWEM is not only able to process the stream in an incremental manner, but also capable to adapt to changes happened in the underlying stream distribution. 1
rEMM: Extensible Markov Model for Data Stream Clustering in R
"... Clustering streams of continuously arriving data has become an important application of data mining in recent years and efficient algorithms have been proposed by several researchers. However, clustering alone neglects the fact that data in a data stream is not only characterized by the proximity of ..."
Abstract
- Add to MetaCart
Clustering streams of continuously arriving data has become an important application of data mining in recent years and efficient algorithms have been proposed by several researchers. However, clustering alone neglects the fact that data in a data stream is not only characterized by the proximity of data points which is used by clustering, but also by a temporal component. The Extensible Markov Model (EMM) adds the temporal component to data stream clustering by superimposing a dynamically adapting Markov Chain. In this paper we introduce the implementation of the R˜extension package˜rEMM which implements EMM on top of a simple clustering algorithm for data streams and discuss some examples and applications. 1
CORE: Nonparametric Clustering of Large Numeric Databases
"... Current clustering techniques are able to identify arbitrarily shaped clusters in the presence of noise, but depend on carefully chosen model parameters. The choice of model parameters is difficult: it depends on the data and the clustering technique at hand, and finding good model parameters often ..."
Abstract
- Add to MetaCart
Current clustering techniques are able to identify arbitrarily shaped clusters in the presence of noise, but depend on carefully chosen model parameters. The choice of model parameters is difficult: it depends on the data and the clustering technique at hand, and finding good model parameters often requires time consuming human interaction. In this paper we propose CORE, a new nonparametric clustering technique that explicitly computes the local maxima of the density and represents them with cores. CORE proposes an adaptive grid and gradients to define and compute the cores of clusters. The incrementally constructed adaptive grid and the gradients make the identification of cores robust, scalable, and independent of small density fluctuations. Our experimental studies show that CORE without any carefully
2010 TABLE OF CONTENTS
, 2010
"... ABSTRACT....................................................................................................................... iv LIST OF TABLES...............................................................................................................v LIST OF FIGURES........................... ..."
Abstract
- Add to MetaCart
ABSTRACT....................................................................................................................... iv LIST OF TABLES...............................................................................................................v LIST OF FIGURES........................................................................................................... vi
Author manuscript, published in "15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (2009)" Toward Autonomic Grids: Analyzing the Job Flow with Affinity Streaming
, 2009
"... The Affinity Propagation (AP) clustering algorithm proposed by Frey and Dueck (2007) provides an understandable, nearly optimal summary of a dataset, albeit with quadratic computational complexity. This paper, motivated by Autonomic Computing, extends AP to the data streaming framework. Firstly a hi ..."
Abstract
- Add to MetaCart
The Affinity Propagation (AP) clustering algorithm proposed by Frey and Dueck (2007) provides an understandable, nearly optimal summary of a dataset, albeit with quadratic computational complexity. This paper, motivated by Autonomic Computing, extends AP to the data streaming framework. Firstly a hierarchical strategy is used to reduce the complexity to O(N 1+ε); the distortion loss incurred is analyzed in relation with the dimension of the data items. Secondly, a coupling with a change detection test is used to cope with non-stationary data distribution, and rebuild the model as needed. The presented approach Strap is applied to the stream of jobs submitted to the EGEE Grid, providing an understandable description of the job flow and enabling the system administrator to spot online some sources of failures. Categories and Subject Descriptors
Online Fuzzy C Means
"... Abstract- Clustering streaming data presents the problem of not having all the data available at one time. Further, the total size of the data may be larger than will fit in the available memory of a typical computer. If the data is very large, it is a challenge to apply fuzzy clustering algorithms ..."
Abstract
- Add to MetaCart
Abstract- Clustering streaming data presents the problem of not having all the data available at one time. Further, the total size of the data may be larger than will fit in the available memory of a typical computer. If the data is very large, it is a challenge to apply fuzzy clustering algorithms to get a partition in a timely manner. In this paper, we present an online fuzzy clustering algorithm which can be used to cluster streaming data, as well as very large data sets which might be treated as streaming data. Results on several large volumes of magnetic resonance images show that the new algorithm produces partitions which are very close to what you could get if you clustered all the data at one time. So, the algorithm is an accurate approach for online clustering. I.

