Results 1  10
of
40
Data Clustering: A Review
 ACM COMPUTING SURVEYS
, 1999
"... Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exp ..."
Abstract

Cited by 1284 (13 self)
 Add to MetaCart
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities has made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify crosscutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.
BIRCH: an efficient data clustering method for very large databases
 In Proc. of the ACM SIGMOD Intl. Conference on Management of Data (SIGMOD
, 1996
"... Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely st,udied problems in this area is the identification of clusters, or deusel y populated regions, in a multidir nensional clataset. Prior work does not adequately address the problem of ..."
Abstract

Cited by 434 (2 self)
 Add to MetaCart
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely st,udied problems in this area is the identification of clusters, or deusel y populated regions, in a multidir nensional clataset. Prior work does not adequately address the problem of large datasets and minimization of 1/0 costs. This paper presents a data clustering method named Bfll (;”H (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and clynamicall y clusters incoming multidimensional metric data points to try to produce the best quality clustering with the available resources (i. e., available memory and time constraints). BIRCH can typically find a goocl clustering with a single scan of the data, and improve the quality further with a few aclditioual scans. BIRCH is also the first clustering algorithm proposerl in the database area to handle “noise) ’ (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH’S time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparisons of BIR (;’H versus CLARA NS, a clustering method proposed recently for large datasets, and S11OW that BIRCH is consistently 1
Approximation Algorithms for Projective Clustering
 Proceedings of the ACM SIGMOD International Conference on Management of data, Philadelphia
, 2000
"... We consider the following two instances of the projective clustering problem: Given a set S of n points in R d and an integer k ? 0; cover S by k hyperstrips (resp. hypercylinders) so that the maximum width of a hyperstrip (resp., the maximum diameter of a hypercylinder) is minimized. Let w ..."
Abstract

Cited by 246 (21 self)
 Add to MetaCart
We consider the following two instances of the projective clustering problem: Given a set S of n points in R d and an integer k ? 0; cover S by k hyperstrips (resp. hypercylinders) so that the maximum width of a hyperstrip (resp., the maximum diameter of a hypercylinder) is minimized. Let w be the smallest value so that S can be covered by k hyperstrips (resp. hypercylinders), each of width (resp. diameter) at most w : In the plane, the two problems are equivalent. It is NPHard to compute k planar strips of width even at most Cw ; for any constant C ? 0 [50]. This paper contains four main results related to projective clustering: (i) For d = 2, we present a randomized algorithm that computes O(k log k) strips of width at most 6w that cover S. Its expected running time is O(nk 2 log 4 n) if k 2 log k n; it also works for larger values of k, but then the expected running time is O(n 2=3 k 8=3 log 4 n). We also propose another algorithm that computes a c...
Clustering Based On Association Rule Hypergraphs
"... Clustering in data mining is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized. These discovered clusters are used to explain the characteristics of the data distribution. In this paper we propose a new metho ..."
Abstract

Cited by 88 (16 self)
 Add to MetaCart
Clustering in data mining is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized. These discovered clusters are used to explain the characteristics of the data distribution. In this paper we propose a new methodology for clustering related items using association rules, and clustering related transactions using clusters of items. Our approach is linearly scalable with respect to the number of transactions. The frequent itemsets used to derive association rules are also used to group items into a hypergraph edge, and a hypergraph partitioning algorithm is used to find the clusters. Our experiments indicate that clustering using association rule hypergraphs holds great promise in several application domains. Our experiments with stockmarket data and congressional voting data show that this clustering scheme is able to successfully group items that belong to the same group. Clustering of items can ...
BIRCH: a new data clustering algorithm and its applications
 Data Min. Knowl. Disc
, 1997
"... Abstract. Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of ..."
Abstract

Cited by 65 (0 self)
 Add to MetaCart
Abstract. Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch. However existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and cpu cycles). So as the dataset size increases, they do not scale up well in terms of memory requirement, running time, and result quality. In this paper, an efficient and scalable data clustering method is proposed, based on a new inmemory data structure called CFtree, which serves as an inmemory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two reallife problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression.
Distributed Clustering Using Collective Principal Component Analysis
 Knowledge and Information Systems
, 1999
"... This paper considers distributed clustering of high dimensional heterogeneous data using a distributed Principal Component Analysis (PCA) technique called the Collective PCA. It presents the Collective PCA technique that can be used independent of the clustering application. It shows a way to inte ..."
Abstract

Cited by 49 (9 self)
 Add to MetaCart
This paper considers distributed clustering of high dimensional heterogeneous data using a distributed Principal Component Analysis (PCA) technique called the Collective PCA. It presents the Collective PCA technique that can be used independent of the clustering application. It shows a way to integrate the Collective PCA with a given otheshelf clustering algorithm in order to develop a distributed clustering technique. It also presents experimental results using dierent test data sets including an application for web mining.
Clustering Large Datasets in Arbitrary Metric Spaces
 IN PROC. OF THE 15TH INT'L CONF. ON DATA ENG
, 1998
"... Clustering partitions a collection of objects into groups called clusters, such that similar objects fall into the same group. Similarity between objects is defined by a distance function satisfying the triangle inequality; this distance function along with the collection of objects describes a dist ..."
Abstract

Cited by 47 (3 self)
 Add to MetaCart
Clustering partitions a collection of objects into groups called clusters, such that similar objects fall into the same group. Similarity between objects is defined by a distance function satisfying the triangle inequality; this distance function along with the collection of objects describes a distance space. In a distance space, the only operation possible on data objects is the computation of distance between them. All scalable algorithms in the literature assume a special type of distance space, namely a kdimensional vector space, which allows vector operations on objects. We present two scalable algorithms designed for clustering very large datasets in distance spaces. Our first algorithm BUBBLE is, to our knowledge, the first scalable clustering algorithm for data in a distance space. Our second algorithm BUBBLEFM improves upon BUBBLE by reducing the number of calls to the distance function, which may be computationally very expensive. Both algorithms make only a single scan o...
Collective, Hierarchical Clustering from Distributed, Heterogeneous Data
, 1999
"... . This paper presents the Collective Hierarchical Clustering (CHC) algorithm for analyzing distributed, heterogeneous data. This algorithm first generates local cluster models and then combines them to generate the global cluster model of the data. The proposed algorithm runs in O(jSjn 2 ) tim ..."
Abstract

Cited by 44 (8 self)
 Add to MetaCart
. This paper presents the Collective Hierarchical Clustering (CHC) algorithm for analyzing distributed, heterogeneous data. This algorithm first generates local cluster models and then combines them to generate the global cluster model of the data. The proposed algorithm runs in O(jSjn 2 ) time, with a O(jSjn) space requirement and O(n) communication requirement, where n is the number of elements in the data set and jSj is the number of data sites. This approach shows significant improvement over naive methods with O(n 2 ) communication costs in the case that the entire distance matrix is transmitted and O(nm) communication costs to centralize the data, where m is the total number of features. A specific implementation based on the single link clustering and results comparing its performance with that of a centralized clustering algorithm are presented. An analysis of the algorithm complexity, in terms of overall computation time and communication requirements, is pres...
Clustering Binary Data Streams with Kmeans
 In 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
, 2003
"... Clustering data streams is an interesting Data Mining problem. This article presents three variants of the Kmeans algorithm to cluster binary data streams. The variants include Online Kmeans, Scalable Kmeans, and Incremental Kmeans, a proposed variant introduced that finds higher quality solutio ..."
Abstract

Cited by 42 (1 self)
 Add to MetaCart
Clustering data streams is an interesting Data Mining problem. This article presents three variants of the Kmeans algorithm to cluster binary data streams. The variants include Online Kmeans, Scalable Kmeans, and Incremental Kmeans, a proposed variant introduced that finds higher quality solutions in less time. Higher quality of solutions are obtained with a meanbased initialization and incremental learning. The speedup is achieved through a simplified set of sufficient statistics and operations with sparse matrices. A summary table of clusters is maintained online. The Kmeans variants are compared with respect to quality of results and speed. The proposed algorithms can be used to monitor transactions.
Interactive learning and probabilistic retrieval in remote sensing image archives
 IEEE Trans. on Geoscience and Remote Sensing
, 2000
"... Abstract—We present a concept of interactive learning and probabilistic retrieval of userspecific cover types in a contentbased remote sensing image archive. A cover type is incrementally defined via userprovided positive and negative examples. From these examples, we infer probabilities of the B ..."
Abstract

Cited by 27 (3 self)
 Add to MetaCart
Abstract—We present a concept of interactive learning and probabilistic retrieval of userspecific cover types in a contentbased remote sensing image archive. A cover type is incrementally defined via userprovided positive and negative examples. From these examples, we infer probabilities of the Bayesian network that link the user interests to a preextracted content index. Due to the stochastic nature of the cover type definitions, the database system not only retrieves images according to the estimated coverage but also according to the accuracy of that estimation given the current state of learning. For the latter, we introduce the concept of separability. We expand on the steps of Bayesian inference to compute the applicationfree content index using a family of data models, and on the description of the stochastic link using hyperparameters. In particular, we focus on the interactive nature of our approach, which provides instantaneous feedback to the user in the form of an immediate update of the posterior map, and a very fast, approximate search in the archive. A javabased demonstrator using the presented concept of contentbased access to a test archive of Landsat TM, XSAR, and aerial images are available over the Internet