Results 1–10 of 70
Automatic Subspace Clustering of High Dimensional Data
Data Mining and Knowledge Discovery, 2005
Cited by 561 (12 self)
Abstract
Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
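The bottom-up search described above starts from one-dimensional dense units. A minimal sketch of that first pass, with illustrative parameter names (`xi` for the grid resolution, `tau` for the density threshold; higher-dimensional units would be built by an Apriori-style join of these, which is omitted here):

```python
from collections import Counter

def dense_units_1d(points, xi=4, tau=0.25):
    # Partition each dimension into xi equal-width intervals, count the
    # points in every (dimension, interval) cell, and keep cells holding
    # more than a tau fraction of all points. Sketch of the 1-D pass of
    # a CLIQUE-style bottom-up search, not the full algorithm.
    n = len(points)
    d = len(points[0])
    lo = [min(p[i] for p in points) for i in range(d)]
    hi = [max(p[i] for p in points) for i in range(d)]
    counts = Counter()
    for p in points:
        for i in range(d):
            width = (hi[i] - lo[i]) / xi or 1.0  # guard zero-width dims
            cell = min(int((p[i] - lo[i]) / width), xi - 1)
            counts[(i, cell)] += 1
    return {u for u, c in counts.items() if c > tau * n}

# All points pile into one interval of dimension 0 but spread evenly
# over dimension 1, so only dimension 0 yields a dense unit.
pts = [(0.0, k / 7) for k in range(8)]
units = dense_units_1d(pts)
```

Because units are counted per dimension independently, the result is identical regardless of the input order of records, matching the order-insensitivity requirement above.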
Clustering data streams: Theory and practice
IEEE TKDE, 2003
Cited by 106 (2 self)
Abstract
The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm’s performance on synthetic and real data streams. Index Terms—Clustering, data streams, approximation algorithms.
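The single-pass, small-memory idea can be sketched as a divide-and-conquer summary: cluster each bounded chunk down to a few representatives, then cluster the representatives. This is our own simplification of the general strategy (using greedy farthest-point selection as the inner clustering step), not the paper's exact algorithm:

```python
def k_centers(points, k):
    # Greedy farthest-point selection: repeatedly add the point farthest
    # from the centers chosen so far (a classic 2-approximation for the
    # k-center objective); used here as the per-chunk clustering step.
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        centers.append(max(points,
                           key=lambda p: min(abs(p - c) for c in centers)))
    return centers

def stream_cluster(stream, k, chunk=50):
    # Single pass over the stream: summarize each chunk by k weighted-out
    # representatives, then cluster the collected representatives once at
    # the end. Memory stays bounded by the chunk plus the representatives.
    reps, buf = [], []
    for x in stream:
        buf.append(x)
        if len(buf) == chunk:
            reps.extend(k_centers(buf, k))
            buf = []
    if buf:
        reps.extend(k_centers(buf, k))
    return k_centers(reps, k)

# Two well-separated 1-D groups survive the single pass:
centers = stream_cluster([0.0] * 60 + [100.0] * 60, k=2)
```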
Matrix approximation and projective clustering via volume sampling
In SODA, 2006
Cited by 63 (2 self)
Abstract
We present two new results for the problem of approximating a given real m × n matrix A by a rank-k matrix D, where k &lt; min{m, n}, so as to minimize ‖A − D‖²_F. It is known that by sampling O(k/ε) rows of the matrix, one can find a low-rank approximation with additive error ε‖A‖²_F. Our first result shows that with adaptive sampling in t rounds and O(k/ε) samples in each round, the additive error drops exponentially as ε^t; the computation time is nearly linear in the number of nonzero entries. This demonstrates that multiple passes can be highly beneficial for a natural (and widely studied) algorithmic problem. Our second result is that there exists a subset of O(k²/ε) rows such that their span contains a rank-k approximation with multiplicative (1 + ε) error (i.e., the sum-of-squares distance has a small “coreset” whose span determines a good approximation). This existence theorem leads to a PTAS for the following projective clustering problem: given a set of points P in ℝ^d and integers k, j, find a set of j subspaces F_1, …, F_j, each of dimension at most k, that minimizes ∑_{p∈P} min_i d(p, F_i)².
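The adaptive-sampling result can be illustrated numerically: sample rows proportionally to the squared norms of the current residual, project, and repeat. The sketch below is our own simplification of that multi-pass idea (not the paper's exact volume-sampling procedure); it shows the residual shrinking as rounds accumulate:

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_row_sample(A, s, t):
    # In each of t rounds, draw s rows with probability proportional to
    # the squared norm of each row's residual (the part of A not yet
    # captured by the span of previously sampled rows), then re-project.
    E = A.copy()
    S = np.empty((0, A.shape[1]))
    for _ in range(t):
        p = (E ** 2).sum(axis=1)
        idx = rng.choice(A.shape[0], size=s, p=p / p.sum())
        S = np.vstack([S, A[idx]])
        Q, _ = np.linalg.qr(S.T)      # orthonormal basis of sampled span
        E = A - A @ Q @ Q.T           # residual of A off that span
    return E

A = rng.standard_normal((50, 20))
err_1 = np.linalg.norm(adaptive_row_sample(A, s=5, t=1))
err_3 = np.linalg.norm(adaptive_row_sample(A, s=5, t=3))
```

With three rounds the sampled span covers more of A, so `err_3` comes out strictly below `err_1` on this random instance, consistent with the multi-pass error decay claimed above.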
Clustering ensembles: Models of consensus and weak partitions
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005
Cited by 45 (3 self)
Abstract
Clustering ensembles have emerged as a powerful method for improving both the robustness and the stability of unsupervised classification solutions. However, finding a consensus clustering from multiple partitions is a difficult problem that can be approached from graph-based, combinatorial or statistical perspectives. This study extends previous research on clustering ensembles in several respects. First, we introduce a unified representation for multiple clusterings and formulate the corresponding categorical clustering problem. Second, we propose a probabilistic model of consensus using a finite mixture of multinomial distributions in a space of clusterings. A combined partition is found as a solution to the corresponding maximum likelihood problem using the EM algorithm. Third, we define a new consensus function that is related to the classical intra-class variance criterion using the generalized mutual information definition. Finally, we demonstrate the efficacy of combining partitions generated by weak clustering algorithms that use data projections and random data splits. A simple explanatory model is offered for the behavior of combinations of such weak clustering components. Combination accuracy is analyzed as a function of several parameters that control the power and resolution of component partitions as well as the number of partitions. We also analyze clustering ensembles with incomplete information and the effect of missing cluster labels on the quality of overall consensus. Experimental results demonstrate the effectiveness of the proposed methods on several real-world datasets.
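A simple way to see why consensus over weak partitions works is the co-association (evidence-accumulation) view: count how often each pair of items is co-clustered across the ensemble, link majority pairs, and take connected components. This is a basic sketch of that view, not the paper's EM-based mixture-of-multinomials model:

```python
from itertools import combinations

def consensus(partitions, threshold=0.5):
    # Link two items if they are co-clustered in more than `threshold`
    # of the input partitions, then return the connected components
    # (found with a small union-find) as the consensus clustering.
    n = len(partitions[0])
    m = len(partitions)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        votes = sum(p[i] == p[j] for p in partitions)
        if votes / m > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())

# Three weak partitions of six items; labels disagree, co-membership agrees.
parts = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 2, 2],
]
result = consensus(parts)
```

Note that the second partition uses opposite labels and the third is noisy, yet majority co-association recovers the two underlying groups.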
Computing Clusters of Correlation Connected Objects
2004
Cited by 34 (10 self)
Abstract
The detection of correlations between different features in a set of feature vectors is a very important data mining task because correlation indicates a dependency between the features or some association of cause and effect between them. This association can be arbitrarily complex, i.e., one or more features might be dependent on a combination of several other features. Well-known methods like principal component analysis (PCA) can perfectly find correlations which are global, linear, not hidden in a set of noise vectors, and uniform, i.e., the same type of correlation is exhibited in all feature vectors. In many applications such as medical diagnosis, molecular biology, time sequences, or electronic commerce, however, correlations are not global since the dependency between features can be different in different subgroups of the set. In this paper, we propose a method called 4C (Computing Correlation Connected Clusters) to identify local subgroups of the data objects sharing a uniform but arbitrarily complex correlation. Our algorithm is based on a combination of PCA and density-based clustering (DBSCAN). Our method has a determinate result and is robust against noise. A broad comparative evaluation demonstrates the superior performance of 4C over competing methods such as DBSCAN, CLIQUE and ORCLUS.
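The local-PCA ingredient of a 4C-style method can be sketched in a few lines: run PCA on a point's ε-neighborhood and count the significant eigenvalues to get a local correlation dimension. Parameter names (`eps`, `delta`) and the significance rule are our illustrative choices, not the paper's exact definitions:

```python
import numpy as np

def local_correlation_dimension(points, idx, eps=1.0, delta=0.1):
    # Take the eps-neighborhood of points[idx], diagonalize its
    # covariance matrix, and count eigenvalues carrying more than a
    # delta fraction of the largest one. Points on a local line get
    # dimension 1, a local plane 2, and so on.
    c = points[idx]
    nbrs = points[np.linalg.norm(points - c, axis=1) <= eps]
    eig = np.sort(np.linalg.eigvalsh(np.cov(nbrs.T)))[::-1]
    return int((eig > delta * eig[0]).sum())

# Thirty points on a line through the origin in 3-D: the local
# correlation dimension is 1 even though the ambient space is 3-D.
t = np.linspace(0.0, 1.0, 30)
line = np.stack([t, 2 * t, 3 * t], axis=1)
dim = local_correlation_dimension(line, 15)
```

A density-based pass (as in DBSCAN) would then only connect neighbors whose local eigensystems agree, yielding correlation-connected clusters.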
HARP: A practical projected clustering algorithm
IEEE Transactions on Knowledge and Data Engineering, 2004
Cited by 23 (3 self)
Abstract
In high-dimensional data, clusters can exist in subspaces that hide themselves from traditional clustering methods. A number of algorithms have been proposed to identify such projected clusters, but most of them rely on some user parameters to guide the clustering process. The clustering accuracy can be seriously degraded if incorrect values are used. Unfortunately, in real situations, it is rarely possible for users to supply the parameter values accurately, which causes practical difficulties in applying these algorithms to real data. In this paper, we analyze the major challenges of projected clustering and suggest why these algorithms need to depend heavily on user parameters. Based on the analysis, we propose a new algorithm that exploits the clustering status to adjust the internal thresholds dynamically without the assistance of user parameters. According to the results of extensive experiments on real and synthetic data, the new method has excellent accuracy and usability. It outperformed the other algorithms even when correct parameter values were artificially supplied to them. The encouraging results suggest that projected clustering can be a practical tool for various kinds of real applications. Index Terms—Data mining, mining methods and algorithms, clustering, bioinformatics.
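The idea of adjusting internal thresholds dynamically instead of asking the user for one can be sketched as a merge procedure that starts strict and relaxes step by step. This is our own illustration of that general idea, not HARP's actual relevance criteria:

```python
def relax_and_merge(items, can_merge, levels):
    # Merge greedily at the strictest requirement first, then relax
    # level by level, so well-separated structure forms before any
    # loose merges are allowed and no single threshold must be guessed.
    # `levels` lists thresholds from strict to loose; `can_merge(a, b, t)`
    # decides whether two groups may merge at level t.
    groups = [[x] for x in items]
    for t in levels:
        merged = True
        while merged:
            merged = False
            for i in range(len(groups)):
                for j in range(i + 1, len(groups)):
                    if can_merge(groups[i], groups[j], t):
                        groups[i] += groups[j]
                        del groups[j]
                        merged = True
                        break
                if merged:
                    break
    return groups

# 1-D toy: merge groups whose closest members are within t of each other.
gap = lambda a, b, t: min(abs(x - y) for x in a for y in b) <= t
groups = relax_and_merge([0.0, 1.0, 10.0, 11.0], gap, levels=[0.5, 1.5])
```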
triCluster: An Effective Algorithm for Mining Coherent Clusters in 3D Microarray Data
In Proc. of the 2005 ACM SIGMOD International Conference on Management of Data, 2005
Cited by 23 (2 self)
Abstract
In this paper we introduce a novel algorithm called triCluster, for mining coherent clusters in three-dimensional (3D) gene expression datasets. triCluster can mine arbitrarily positioned and overlapping clusters, and depending on different parameter values, it can mine different types of clusters, including those with constant or similar values along each dimension, as well as scaling and shifting expression patterns. triCluster relies on a graph-based approach to mine all valid clusters. For each time slice, i.e., a gene-sample matrix, it constructs the range multigraph, a compact representation of all similar value ranges between any two sample columns. It then searches for constrained maximal cliques in this multigraph to yield the set of biclusters for this time slice. Then triCluster constructs another graph using the biclusters (as vertices) from each time slice; mining cliques from this graph yields the final set of triclusters. Optionally, triCluster merges/deletes some clusters having large overlaps. We present a useful set of metrics to evaluate the clustering quality, and we show that triCluster can find significant triclusters in real microarray datasets.
Subspace clustering of high dimensional data
SIAM International Conference on Data Mining, 2004
Cited by 21 (7 self)
Abstract
Clustering suffers from the curse of dimensionality, and similarity functions that use all input features with equal relevance may not be effective. We introduce an algorithm that discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features. This approach avoids the risk of loss of information encountered in global dimensionality reduction techniques, and does not assume any data distribution model. Our method associates with each cluster a weight vector whose values capture the relevance of features within the corresponding cluster. We experimentally demonstrate the gain in performance our method achieves, using both synthetic and real data sets. In particular, our results show the feasibility of the proposed technique to perform simultaneous clustering of genes and conditions in microarray data.
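The per-cluster weight vector described above can be sketched as an exponential weighting of within-cluster spread: dimensions along which the cluster is tight get large weights, spread-out dimensions get small ones. This is a minimal sketch of that style of local feature weighting; the bandwidth `h` is an illustrative parameter of ours:

```python
import math

def feature_weights(cluster, h=1.0):
    # Compute the average within-cluster squared deviation along each
    # dimension and weight dimensions with small spread exponentially
    # higher (w_i proportional to exp(-spread_i / h)), normalized to
    # sum to 1 so the weights act as per-cluster feature relevances.
    d = len(cluster[0])
    n = len(cluster)
    centroid = [sum(p[i] for p in cluster) / n for i in range(d)]
    spread = [sum((p[i] - centroid[i]) ** 2 for p in cluster) / n
              for i in range(d)]
    raw = [math.exp(-s / h) for s in spread]
    total = sum(raw)
    return [r / total for r in raw]

# A cluster tight in dimension 0 but spread out in dimension 1:
cluster = [(0.0, 0.0), (0.1, 5.0), (0.0, 10.0), (0.1, 15.0)]
w = feature_weights(cluster)
```

The weighted distance used for assignment would then scale each dimension's contribution by `w[i]`, so the cluster effectively lives in the subspace where it is dense.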
Density connected clustering with local subspace preferences
In ICDM ’04: Proceedings of the Fourth IEEE International Conference on Data Mining, 2004
Cited by 19 (9 self)
Abstract
Many clustering algorithms tend to break down in high-dimensional feature spaces, because the clusters often exist only in specific subspaces (attribute subsets) of the original feature space. Therefore, the task of projected clustering (or subspace clustering) has been defined recently. As a novel solution to tackle this problem, we propose the concept of local subspace preferences, which captures the main directions of high point density. Using this concept we adapt density-based clustering to cope with high-dimensional data. In particular, we achieve the following advantages over existing approaches: our proposed method has a determinate result, does not depend on the order of processing, is robust against noise, performs only a single scan over the database, and is linear in the number of dimensions. A broad experimental evaluation shows that our approach yields results of significantly better quality than recent work on clustering high-dimensional data.
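A local subspace preference can be sketched per point: in the point's ε-neighborhood, mark each dimension whose variance is small as "preferred" (the neighborhood is flat along it). The marked dimensions then steer a density-based clustering toward the local subspace. Parameter names (`eps`, `delta`) follow the general idea only; they are not the paper's exact definitions:

```python
def subspace_preference(points, idx, eps=2.0, delta=0.05):
    # For the eps-neighborhood of points[idx], compute the per-dimension
    # variance and flag dimensions whose variance is at most delta as
    # preferred. Returns a boolean preference vector, one flag per
    # dimension.
    c = points[idx]
    d = len(c)
    nbrs = [p for p in points
            if sum((p[i] - c[i]) ** 2 for i in range(d)) <= eps ** 2]
    prefs = []
    for i in range(d):
        mean = sum(p[i] for p in nbrs) / len(nbrs)
        var = sum((p[i] - mean) ** 2 for p in nbrs) / len(nbrs)
        prefs.append(var <= delta)
    return prefs

# Points spread along the x-axis with y almost constant: the y dimension
# is preferred (locally flat), the x dimension is not.
pts = [(k * 0.5, 0.01 * (k % 2)) for k in range(8)]
prefs = subspace_preference(pts, 4)
```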
A generic framework for efficient subspace clustering of highdimensional data
In Proc. ICDM, 2005
Cited by 19 (4 self)
Abstract
Subspace clustering has been investigated extensively since traditional clustering algorithms often fail to detect meaningful clusters in high-dimensional data spaces. Many recently proposed subspace clustering methods suffer from two severe problems: first, the algorithms typically scale exponentially with the data dimensionality and/or the subspace dimensionality of the clusters. Second, for performance reasons, many algorithms use a global density threshold for clustering, which is quite questionable since clusters in subspaces of significantly different dimensionality will most likely exhibit significantly varying densities. In this paper, we propose a generic framework to overcome these limitations. Our framework is based on an efficient filter-refinement architecture that scales at most quadratically w.r.t. the data dimensionality and the dimensionality of the subspace clusters. It can be applied to any clustering notion, including notions that are based on a local density threshold. A broad experimental evaluation on synthetic and real-world data empirically shows that our method achieves a significant gain in runtime and quality in comparison to state-of-the-art subspace clustering algorithms.
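The filter-refinement architecture itself is generic and can be sketched independently of any clustering notion: a cheap, conservative filter prunes candidates whose upper-bound score already falls below the requirement, and only the survivors pay for the expensive exact check. A toy sketch of that skeleton (the candidates, scores, and checks here are purely illustrative):

```python
def filter_refine(candidates, cheap_bound, expensive_check, requirement):
    # Filter step: discard candidates whose cheap upper-bound score is
    # below the requirement (the bound must never underestimate, so no
    # true answer is lost). Refine step: run the expensive exact check
    # only on the survivors.
    survivors = [c for c in candidates if cheap_bound(c) >= requirement]
    return [c for c in survivors if expensive_check(c)]

# Toy instance: candidates are integers, the cheap bound is the value
# itself, the expensive check tests evenness, and values below 10 are
# pruned before the check ever runs.
result = filter_refine(range(20),
                       cheap_bound=lambda c: c,
                       expensive_check=lambda c: c % 2 == 0,
                       requirement=10)
```

The quadratic scaling claimed above comes from keeping the filter's cost per candidate low while the refinement runs on far fewer subspace candidates.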