## Automatic Subspace Clustering of High Dimensional Data (2005)


Venue: Data Mining and Knowledge Discovery

Citations: 560 (12 self)

### BibTeX

@MISC{Agrawal05automaticsubspace,
  author = {Rakesh Agrawal and Johannes Gehrke and Dimitrios Gunopulos and Prabhakar Raghavan},
  title = {Automatic Subspace Clustering of High Dimensional Data},
  year = {2005}
}


### Abstract

Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.

### Citations

3916 |
Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context ...ks to identify homogeneous groups of objects based on the values of their attributes (dimensions) [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly class... |

2645 |
Statistical Pattern Recognition (2nd edition)
- Fukunaga
- 1990
Citation Context ... identify homogeneous groups of objects based on the values of their attributes (dimensions) [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified ... |

2432 |
The Design and Analysis of Computer Algorithms
- Aho, Hopcroft, et al.
- 1974
Citation Context ...e in the same cluster. On the other hand, units corresponding to vertices in different components cannot be connected, and therefore cannot be in the same cluster. We use a depth-first search algorithm [2] to find the connected components of the graph. We start with some unit u in D, assign it the first cluster number, and find all the units it is connected to. Then, if there still are units in D that have n... |
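The excerpt above describes the cluster-labeling step: dense units are vertices, units sharing a common face are connected, and clusters are the connected components found by depth-first search. A minimal sketch under the assumption that each unit is a tuple of integer grid coordinates (the function name `connected_components` is ours, not the paper's):

```python
def connected_components(dense_units):
    """Group dense units (tuples of grid coordinates) into clusters via
    iterative depth-first search; two units are connected when they share
    a common face, i.e. differ by 1 in exactly one coordinate."""
    dense = set(dense_units)
    cluster_of = {}
    cluster_id = 0
    for start in dense:
        if start in cluster_of:
            continue  # already labeled by an earlier traversal
        stack = [start]
        while stack:
            u = stack.pop()
            if u in cluster_of:
                continue
            cluster_of[u] = cluster_id
            # Push every face-adjacent dense neighbor.
            for d in range(len(u)):
                for step in (-1, 1):
                    v = u[:d] + (u[d] + step,) + u[d + 1:]
                    if v in dense and v not in cluster_of:
                        stack.append(v)
        cluster_id += 1
    return cluster_of

units = [(0, 0), (0, 1), (1, 1), (5, 5)]
labels = connected_components(units)
# (0,0), (0,1), (1,1) end up in one cluster; (5,5) in another.
```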

2139 |
Algorithms for Clustering Data
- Jain, Dubes
- 1988
Citation Context ... clusters in large high dimensional datasets. 1 Introduction Clustering is a descriptive task that seeks to identify homogeneous groups of objects based on the values of their attributes (dimensions) [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33],... |

1321 |
Finding Groups in Data - An Introduction to Cluster Analysis
- Kaufman, Rousseeuw
- 1990
Citation Context ...ters in large high dimensional datasets. 1 Introduction Clustering is a descriptive task that seeks to identify homogeneous groups of objects based on the values of their attributes (dimensions) [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focu... |

1092 | A density-based algorithm for discovering clusters in large spatial databases with noise
- Ester, Kriegel, et al.
- 1996
Citation Context ... extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] [25]: partitional and hierarchical. Given a set of objects and a clustering criterion [39], partitional clustering obt... |

623 | A threshold of ln n for approximating set cover
- Feige
- 1998
Citation Context ...r setting. For the general set cover problem, the best known algorithm for approximating the smallest set cover gives an approximation factor of ln n where n is the size of the universe being covered [16] [28]. This problem is similar to the problem of constructive solid geometry formulae in solid-modeling [44]. It is also related to the problem of covering marked boxes in a grid with rectangles in lo... |

592 | Efficient and effective clustering methods for spatial data mining
- Ng, Han
- 1994
Citation Context ... [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] [25]: partitional and hierarchical. Given a set of objects and a c... |

565 | CURE: an efficient clustering algorithm for large databases - Guha, Rastogi, et al. - 1998 |

496 |
Stochastic Complexity in Statistical Inquiry
- Rissanen
- 1989
Citation Context ...pply the MDL (Minimal Description Length) principle. The basic idea underlying the MDL principle is to encode the input data under a given model and select the encoding that minimizes the code length [35]. Assume we have the subspaces S1, S2, ..., Sn. Our pruning technique first groups together the dense units that lie in the same subspace. Then, for each subspace, it computes the fraction of th... |
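The MDL pruning described in this excerpt starts by grouping dense units by the subspace they lie in and computing each subspace's coverage of the database. A simplified sketch of just that grouping step (the unit representation, function name, and counts are illustrative assumptions; the paper goes on to choose the cut point that minimizes total code length):

```python
from collections import defaultdict

def subspace_coverage(dense_units, unit_counts, total_points):
    """Group dense units by the subspace they lie in (the set of dimensions
    their coordinates constrain) and compute, per subspace, the fraction of
    the database covered by that subspace's dense units."""
    covered = defaultdict(int)
    for unit in dense_units:
        subspace = frozenset(dim for dim, _ in unit)
        covered[subspace] += unit_counts[unit]
    return {s: c / total_points for s, c in covered.items()}

# Units are frozensets of (dimension, interval) pairs; counts are assumed.
u1 = frozenset({(0, 1)})
u2 = frozenset({(0, 2)})
u3 = frozenset({(0, 1), (1, 3)})
cov = subspace_coverage([u1, u2, u3], {u1: 40, u2: 20, u3: 30}, total_points=100)
# Subspace {0} covers 60% of the points; subspace {0, 1} covers 30%.
```

Subspaces with low coverage are the candidates for pruning, since encoding their units buys little compression of the data.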

483 | Dynamic itemset counting and implication rules for market basket data
- Brin, Motwani, et al.
- 1997
Citation Context ...algorithm makes k passes over the database. It follows that the running time of our algorithm is O(c^k + mk) for a constant c. The number of database passes can be reduced by adapting ideas from [41] [8]. 3.1.2 Making the bottom-up algorithm faster While the procedure just described dramatically reduces the number of units that are tested for being dense, we still may have a computationally infeasibl... |

479 |
Bayesian classification (AutoClass): Theory and results
- Cheeseman, Stutz
- 1996
Citation Context ...s of objects based on the values of their attributes (dimensions) [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] ... |

471 |
Fast discovery of association rules
- Agrawal, Mannila, et al.
- 1996
Citation Context ...hm that exploits the monotonicity of the clustering criterion with respect to dimensionality to prune the search space. This algorithm is similar to the Apriori algorithm for mining association rules [1]. A somewhat similar bottom-up scheme was also used in [10] for determining modes in high dimensional histograms. Lemma 1 (Monotonicity): If a collection of points S is a cluster in a k-dimensional sp... |
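Lemma 1's monotonicity is what enables the Apriori-style level-wise search this excerpt mentions: a k-dimensional unit can be dense only if all of its (k-1)-dimensional projections are dense. An illustrative sketch, assuming each unit is represented as a frozenset of (dimension, interval) pairs (the representation and function name are ours):

```python
from itertools import combinations

def generate_candidates(prev_dense):
    """Join (k-1)-dimensional dense units that agree on k-2 of their
    (dimension, interval) pairs, then prune any candidate having a
    (k-1)-dimensional projection that is not dense (monotonicity)."""
    prev = set(prev_dense)
    candidates = set()
    for a, b in combinations(prev, 2):
        merged = a | b
        dims = {d for d, _ in merged}
        # A valid join adds exactly one new pair and one new dimension.
        if len(merged) == len(a) + 1 and len(dims) == len(merged):
            candidates.add(frozenset(merged))
    # Prune: every (k-1)-subset of a surviving candidate must be dense.
    return {c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))}

dense_2d = [frozenset({(0, 1), (1, 2)}), frozenset({(0, 1), (2, 5)}),
            frozenset({(1, 2), (2, 5)})]
print(generate_candidates(dense_2d))
# The single 3-dimensional candidate {(0,1), (1,2), (2,5)} survives pruning.
```

Each level then requires one database pass to check which candidates are actually dense, which is where the k passes in the running-time analysis come from.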

436 | BIRCH: an efficient data clustering method for very large databases
- Zhang, Ramakrishnan, et al.
- 1996
Citation Context ...have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] [25]: partitional and hierarchical. Given a set of objects and a clustering criterion [39], partitio... |

384 | Efficiently Mining Long Patterns from Databases
- Bayardo
- 1998
Citation Context ...finding dense units. If the user is only interested in clusters in the subspaces of highest dimensionality, we can use techniques based on recently proposed algorithms for discovering maximal itemsets [5] [26]. These techniques will allow CLIQUE to find dense units of high dimensionality without having to find all of their projections. Acknowledgment The code for CLIQUE builds on several components that R... |

381 |
On the hardness of approximating minimization problems
- LUND, YANNAKAKIS
- 1994
Citation Context ...ting. For the general set cover problem, the best known algorithm for approximating the smallest set cover gives an approximation factor of ln n where n is the size of the universe being covered [16] [28]. This problem is similar to the problem of constructive solid geometry formulae in solid-modeling [44]. It is also related to the problem of covering marked boxes in a grid with rectangles in logic m... |

374 | Sampling large databases for association rules
- Toivonen
- 1996
Citation Context ... The algorithm makes k passes over the database. It follows that the running time of our algorithm is O(c^k + mk) for a constant c. The number of database passes can be reduced by adapting ideas from [41] [8]. 3.1.2 Making the bottom-up algorithm faster While the procedure just described dramatically reduces the number of units that are tested for being dense, we still may have a computationally infea... |

349 | Mining quantitative association rules in large relational tables
- Srikant, Agrawal
- 1996
Citation Context ...to get closer to an optimal solution. The subspace identification problem is related to the problem of finding quantitative association rules that also identify interesting regions of various attributes [40] [32]. However, the techniques proposed are quite different. One can also imagine adapting a tree-classifier designed for data mining (e.g. [30] [37]) for subspace clustering. In the tree-growth phase, t... |

260 |
On the ratio of optimal integral and fractional covers
- Lovász
- 1975
Citation Context ...he procedure until the whole cluster is covered. For general set cover, the addition heuristic is known to give a cover within a factor ln n of the optimum where n is the number of units to be covered [27]. Thus it would appear that the addition heuristic, since its quality of approximation matches the negative results of [16] [28], would be the obvious choice. However, its implementation in our high d... |
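The addition heuristic discussed in this excerpt is the classic greedy set-cover strategy: repeatedly add the region that covers the most still-uncovered units, which yields the ln n approximation factor cited. A generic sketch (the representation of regions as frozensets of unit identifiers is an assumption for illustration):

```python
def greedy_cover(universe, regions):
    """Greedy set cover: repeatedly pick the region covering the most
    still-uncovered units.  For |universe| = n this is within a factor
    ln n of the optimal cover size."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(regions, key=lambda r: len(uncovered & r))
        if not uncovered & best:
            break  # remaining units cannot be covered by any region
        cover.append(best)
        uncovered -= best
    return cover

units = {1, 2, 3, 4, 5}
regions = [frozenset({1, 2, 3}), frozenset({3, 4}),
           frozenset({4, 5}), frozenset({5})]
print(greedy_cover(units, regions))
```

In CLIQUE's setting the regions are maximal rectangles grown inside a cluster, so the expensive part is enumerating candidate rectangles rather than the greedy selection itself.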

250 | M.: SPRINT: A scalable parallel classifier for data mining
- Shafer, Agrawal, et al.
- 1996
Citation Context ...identify interesting regions of various attributes [40] [32]. However, the techniques proposed are quite different. One can also imagine adapting a tree-classifier designed for data mining (e.g. [30] [37]) for subspace clustering. In the tree-growth phase, the splitting criterion will have to be changed so that some clustering criterion (e.g. average cluster diameter) is optimized. In the tree-pruning ... |

246 |
Learning from observation: conceptual clustering
- Michalski, Stepp
- 1983
Citation Context ... objects based on the values of their attributes (dimensions) [24] [25]. Clustering techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] [25]: ... |

190 | Sliq: A fast scalable classifier for data mining
- Mehta, Agrawal, et al.
- 1996
Citation Context ...also identify interesting regions of various attributes [40] [32]. However, the techniques proposed are quite different. One can also imagine adapting a tree-classifier designed for data mining (e.g. [30] [37]) for subspace clustering. In the tree-growth phase, the splitting criterion will have to be changed so that some clustering criterion (e.g. average cluster diameter) is optimized. In the tree-pru... |

183 | A cost model for nearest neighbor search in high-dimensional data space
- Berchtold, Bohm, et al.
- 1997
Citation Context ...omain for each attribute can be large. It is not meaningful to look for clusters in such a high dimensional space as the average density of points anywhere in the data space is likely to be quite low [6]. Compounding this problem, many dimensions or combinations of dimensions can have noise or values that are uniformly distributed. Therefore, distance functions that use all the dimensions of the data... |

160 | Almost optimal set covers in finite vc-dimension - Bronnimann, Goodrich - 1994 |

139 | Finding generalized projected clusters in high dimensional spaces - Aggarwal, Yu |

101 | Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set
- Lin, Kedem
- 1998
Citation Context ...finding dense units. If the user is only interested in clusters in the subspaces of highest dimensionality, we can use techniques based on recently proposed algorithms for discovering maximal itemsets [5] [26]. These techniques will allow CLIQUE to find dense units of high dimensionality without having to find all of their projections. Acknowledgment The code for CLIQUE builds on several components that Ramakr... |

83 |
An algorithm for point clustering and grid generation
- Berger, Rigoutsos
- 1991
Citation Context ...ulae in solid-modeling [44]. It is also related to the problem of covering marked boxes in a grid with rectangles in logic minimization (e.g. [22]). Some clustering algorithms in image analysis (e.g. [7] [36] [42]) also find rectangular dense regions. In these domains, datasets are in low dimensional spaces and the techniques used are computationally too expensive for large datasets of high dimensional... |

83 | A monte carlo algorithm for fast projective clustering - Procopiuc, Jones, et al. - 2002 |

73 |
Association Rules over Interval Data
- Miller, Yang
- 1997
Citation Context ...t closer to an optimal solution. The subspace identification problem is related to the problem of finding quantitative association rules that also identify interesting regions of various attributes [40] [32]. However, the techniques proposed are quite different. One can also imagine adapting a tree-classifier designed for data mining (e.g. [30] [37]) for subspace clustering. In the tree-growth phase, the sp... |

65 | Data mining, hypergraph transversals, and machine learning
- Gunopulos, Khardon, et al.
- 1997
Citation Context ...subset of the k dimensions, that is, O(2^k) different combinations, are also dense. The running time of our algorithm is therefore exponential in the highest dimensionality of any dense unit. As in [1] [20], it can be shown that the candidate generation procedure produces the minimal number of candidates that can guarantee that all dense units will be found. Let k be the highest dimensionality of any den... |

59 | Range queries in olap data cubes
- Ho, Agrawal, et al.
- 1997
Citation Context ...ee structure is used to store sparse regions. Currently, users are required to specify dense and sparse dimensions [4]. Similarly, the precomputation techniques for range queries over OLAP data cubes [21] require identification of dense regions in sparse data cubes. CLIQUE can be used for this purpose. In future work, we plan to address the problem of evaluating the quality of clusterings in different s... |

50 |
A database interface for clustering in large spatial databases
- Ester, Kriegel, et al.
- 1995
Citation Context ... techniques have been studied extensively in statistics [3], pattern recognition [11] [19], and machine learning [9] [31]. Recent work in the database community includes CLARANS [33], Focused CLARANS [14], BIRCH [45], and DBSCAN [13]. Current clustering techniques can be broadly classified into two categories [24] [25]: partitional and hierarchical. Given a set of objects and a clustering criterion [39... |

35 |
Hierarchical image segmentation by multi-dimensional clustering and orientation-adaptive boundary refinement
- Schroeter, Bigun
- 1995
Citation Context ... in solid-modeling [44]. It is also related to the problem of covering marked boxes in a grid with rectangles in logic minimization (e.g. [22]). Some clustering algorithms in image analysis (e.g. [7] [36] [42]) also find rectangular dense regions. In these domains, datasets are in low dimensional spaces and the techniques used are computationally too expensive for large datasets of high dimensionality. ... |

21 | Subspace clustering of high dimensional data - Domeniconi, Papadopoulos, et al. - 2004 |

16 |
Performance guarantees on a sweep-line heuristic for covering rectilinear polygons with rectangles
- Franzblau
- 1989
Citation Context ...es. The best approximate algorithm known for the special case of finding a cover of a 2-dimensional rectilinear polygon with no holes produces a cover of size bounded by a factor of 2 times the optimal [17]. Since this algorithm only works for the 2-dimensional case, it cannot be used in our setting. For the general set cover problem, the best known algorithm for approximating the smallest set cover giv... |

16 |
Some NP-complete set covering problems
- Masek
- 1979
Citation Context ...ver of C if every region R ∈ R is contained in C, and each unit in C is contained in at least one of the regions in R. Computing the optimal cover is known to be NP-hard, even in the 2-dimensional case [29] [34]. The optimal cover is the cover with the minimal number of rectangles. The best approximate algorithm known for the special case of finding a cover of a 2-dimensional rectilinear polygon with no h... |

10 | An algorithm for constructing regions with rectangles: Independence and minimum generating sets for collections of intervals - Franzblau, Kleitman - 1984 |

10 | Minimum dissection of a rectilinear polygon with arbitrary holes into rectangles - Soltan, Gorpinevich - 1993 |

10 |
A comparative study of clustering methods. Future Generation Computer Systems
- Zait, Messatfa
- 1997
Citation Context ...tion. The data resided in the AIX file system and was stored on a 2GB SCSI drive with sequential throughput of about 2 MB/second. 4.1 Synthetic data generation We use the synthetic data generator from [43] to produce datasets with clusters of high density in specific subspaces. The data generator allows control over the structure and the size of datasets through parameters such as the number of records... |

9 |
Covering simple orthogonal polygon with a minimum number of orthogonally convex polygons
- Reckhow, Culberson
- 1987
Citation Context ...f C if every region R ∈ R is contained in C, and each unit in C is contained in at least one of the regions in R. Computing the optimal cover is known to be NP-hard, even in the 2-dimensional case [29] [34]. The optimal cover is the cover with the minimal number of rectangles. The best approximate algorithm known for the special case of finding a cover of a 2-dimensional rectilinear polygon with no holes ... |
