Results 1–7 of 7
Automatic Subspace Clustering of High Dimensional Data
Data Mining and Knowledge Discovery, 2005
Abstract

Cited by 561 (12 self)
Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
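A minimal sketch of the grid-density idea the abstract describes: each dimension is partitioned into equal intervals and an interval is "dense" when it holds more than a threshold fraction of the points (the interval count `xi` and threshold `tau` here are illustrative parameters, not values from the paper; the full algorithm joins dense units Apriori-style into higher-dimensional subspaces).

```python
def dense_units_1d(points, dim, xi=10, tau=0.15, lo=0.0, hi=1.0):
    """Partition dimension `dim` of the unit-range data into `xi` equal
    intervals and return indices of intervals holding more than a `tau`
    fraction of all points."""
    counts = [0] * xi
    width = (hi - lo) / xi
    for p in points:
        idx = min(int((p[dim] - lo) / width), xi - 1)
        counts[idx] += 1
    threshold = tau * len(points)
    return {i for i, c in enumerate(counts) if c > threshold}

# Toy data in [0, 1]^2: a cluster concentrated near x ~ 0.15, uniform in y.
points = [(0.1 + 0.001 * i, i / 100.0) for i in range(100)]
print(dense_units_1d(points, dim=0))  # {1}: the x-cluster shows up as a dense interval
print(dense_units_1d(points, dim=1))  # set(): y is uniform, no interval is dense
```

Candidate dense units in higher-dimensional subspaces would then be formed only from combinations whose lower-dimensional projections are themselves dense.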
Approximation Algorithms for Projective Clustering
Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, 2000
Abstract

Cited by 246 (21 self)
We consider the following two instances of the projective clustering problem: given a set S of n points in R^d and an integer k > 0, cover S by k hyperstrips (resp. hypercylinders) so that the maximum width of a hyperstrip (resp. the maximum diameter of a hypercylinder) is minimized. Let w* be the smallest value so that S can be covered by k hyperstrips (resp. hypercylinders), each of width (resp. diameter) at most w*. In the plane, the two problems are equivalent. It is NP-hard to compute k planar strips of width even at most Cw*, for any constant C > 0 [50]. This paper contains four main results related to projective clustering: (i) For d = 2, we present a randomized algorithm that computes O(k log k) strips of width at most 6w* that cover S. Its expected running time is O(nk^2 log^4 n) if k^2 log k <= n; it also works for larger values of k, but then the expected running time is O(n^{2/3} k^{8/3} log^4 n). We also propose another algorithm that computes a c...
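For intuition about the planar k = 1 case, the minimum-width strip covering a point set has a direction parallel to an edge of the convex hull, so it can be found by brute force over directions defined by point pairs. A small sketch (function names are illustrative, not from the paper):

```python
from math import hypot, inf

def strip_width(points, dx, dy):
    """Width of the narrowest strip with direction (dx, dy) covering the
    points: the spread of their projections onto the unit normal (-dy, dx)."""
    n = hypot(dx, dy)
    projs = [(-dy * x + dx * y) / n for x, y in points]
    return max(projs) - min(projs)

def min_width_strip(points):
    """Brute force over directions defined by point pairs; the optimal
    direction is parallel to some convex-hull edge, hence among these."""
    best = inf
    for i, (x1, y1) in enumerate(points):
        for x2, y2 in points[i + 1:]:
            if (x1, y1) != (x2, y2):
                best = min(best, strip_width(points, x2 - x1, y2 - y1))
    return best

pts = [(0.0, 0.0), (4.0, 0.0), (2.0, 1.0), (1.0, 0.5)]
print(min_width_strip(pts))  # 1.0: the strip along the x-axis of height 1
```

This O(n^2) enumeration only illustrates the objective; the paper's contribution is achieving near-optimal covers for general k with far better running times.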
Finding localized associations in market basket data
Knowledge and Data Engineering, 2002
Abstract

Cited by 28 (1 self)
In this paper, we discuss a technique for discovering localized associations in segments of the data using clustering. Often the aggregate behavior of a data set may be very different from localized segments. In such cases, it is desirable to design algorithms which are effective in discovering localized associations, because they expose a customer pattern which is more specific than the aggregate behavior. This information may be very useful for target marketing. We present empirical results which show that the method is indeed able to find a significantly larger number of associations than what can be discovered by analysis of the aggregate data.
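A toy illustration of why segment-local mining finds more associations than aggregate mining (the segmentation and support threshold here are hypothetical, not the paper's algorithm):

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Item pairs occurring in at least a `min_support` fraction of transactions."""
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {p for p, c in counts.items() if c / n >= min_support}

# Two customer segments with different local buying behavior.
segment_a = [["bread", "butter"]] * 8 + [["milk"]] * 2
segment_b = [["chips", "soda"]] * 8 + [["milk"]] * 2
aggregate = segment_a + segment_b

print(frequent_pairs(aggregate, 0.6))  # set(): no pair is frequent overall
print(frequent_pairs(segment_a, 0.6))  # {('bread', 'butter')}
print(frequent_pairs(segment_b, 0.6))  # {('chips', 'soda')}
```

Each pair reaches only 40% support in the aggregate data but 80% within its own segment, so clustering first exposes associations that aggregate analysis misses.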
A Framework for Finding Projected Clusters in High Dimensional Spaces, 1999
Abstract

Cited by 11 (2 self)
Clustering problems are well known in the database literature for their use in numerous applications such as customer segmentation, classification and trend analysis. Unfortunately, all known algorithms tend to break down in high dimensional spaces because of the inherent sparsity of the points. In such high dimensional spaces not all dimensions may be relevant to a given cluster. One way of handling this is to pick the closely correlated dimensions and find clusters in the corresponding subspace. Traditional feature selection algorithms attempt to achieve this. The weakness of this approach is that in typical high dimensional data mining applications different sets of points may cluster better for different subsets of dimensions. The number of dimensions in each such cluster-specific subspace may also vary. Hence, it may be impossible to find a single small subset of dimensions for all the clusters. We therefore introduce a generalization of the clustering problem, referred to as the ...
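The notion of a cluster-specific subspace can be illustrated by scoring, for a given cluster, which dimensions it is tight in. This simplistic spread criterion is only a stand-in for the paper's method; the `max_spread` parameter is hypothetical:

```python
def relevant_dimensions(cluster, max_spread=0.5):
    """Return indices of dimensions along which the cluster's points are
    tight (spread below `max_spread`); other dimensions act as noise."""
    dims = len(cluster[0])
    relevant = []
    for d in range(dims):
        vals = [p[d] for p in cluster]
        if max(vals) - min(vals) < max_spread:
            relevant.append(d)
    return relevant

# A cluster tight in dimensions 0 and 2, spread out in dimension 1.
cluster = [(1.0, 0.0, 5.0), (1.1, 3.0, 5.1), (0.9, 7.0, 4.9), (1.05, 9.0, 5.05)]
print(relevant_dimensions(cluster))  # [0, 2]
```

Different clusters would generally yield different (and differently sized) sets of relevant dimensions, which is exactly why a single global feature selection cannot serve all clusters.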
Hypergraph Models and Algorithms for Data-Pattern-Based Clustering
Data Mining and Knowledge Discovery, 2004
Abstract

Cited by 11 (1 self)
In traditional approaches for clustering market basket type data, relations among transactions are modeled according to the items occurring in these transactions. However, an individual item might induce different relations in different contexts. Since such contexts might be captured by interesting patterns in the overall data, we represent each transaction as a set of patterns through modifying the conventional pattern semantics. By clustering the patterns in the dataset, we infer a clustering of the transactions represented this way. For this, we propose a novel hypergraph model to represent the relations among the patterns. Instead of a local measure that depends only on common items among patterns, we propose a global measure that is based on the co-occurrences of these patterns in the overall data. The success of existing hypergraph partitioning based algorithms in other domains depends on the sparsity of the hypergraph and explicit objective metrics. For this, we propose a two-phase clustering approach for the above hypergraph, which is expected to be dense. In the first phase, the vertices of the hypergraph are merged in a multilevel algorithm to obtain a large number of high quality clusters. Here, we propose new quality metrics for merging decisions in hypergraph clustering specifically for this domain. In order to enable the use of existing metrics in the second phase, we introduce a vertex-to-cluster affinity concept to devise a method for constructing a sparse hypergraph based on the obtained clustering. The experiments we have performed show the effectiveness of the proposed framework.
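The contrast between a local and a global measure can be sketched as follows: two patterns sharing no items can still be strongly related if they occur in the same transactions. This Jaccard-style affinity is a generic illustration, not the paper's exact measure, and pattern mining itself is elided:

```python
def cooccurrence(pattern_a, pattern_b, transactions):
    """Global affinity of two patterns (each a set of items): Jaccard
    overlap of the sets of transactions in which each pattern occurs."""
    occ_a = {i for i, t in enumerate(transactions) if pattern_a <= set(t)}
    occ_b = {i for i, t in enumerate(transactions) if pattern_b <= set(t)}
    union = occ_a | occ_b
    return len(occ_a & occ_b) / len(union) if union else 0.0

transactions = [["a", "b", "c"], ["a", "b"], ["c", "d"], ["a", "b", "d"]]
# {"a","b"} and {"d"} share no items (local similarity zero),
# yet they co-occur in the last transaction.
print(cooccurrence({"a", "b"}, {"d"}, transactions))  # 0.25
```

A hypergraph whose edges encode such co-occurrence relations tends to be dense, which motivates the paper's two-phase approach instead of direct partitioning.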
ClusterTree: Integration of Cluster Representation and Nearest Neighbor Search for Large Datasets with High Dimensionality
IEEE International Conference on Multimedia and Expo, 2000
Abstract

Cited by 11 (0 self)
In this paper, we introduce the ClusterTree, a new indexing approach to representing clusters generated by any existing clustering approach. A cluster is decomposed into several subclusters and represented as the union of the subclusters. The subclusters can be further decomposed, which isolates the most related groups within the clusters. A ClusterTree is a hierarchy of clusters and subclusters which incorporates the cluster representation into the index structure to achieve effective and efficient retrieval. Our cluster representation is highly adaptive to any kind of clusters. It is well accepted that most existing indexing techniques degrade rapidly as dimensionality grows. The ClusterTree can support the retrieval of the nearest neighbors effectively without having to linearly scan the high-dimensional dataset. We also discuss an approach to dynamically reconstruct the ClusterTree when new data are added. We present a detailed analysis of this approach and justify it extensively by experiments.
Keywords: indexing, cluster representation, nearest neighbor search, high-dimensional datasets
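The pruning idea behind hierarchical cluster indexes for nearest-neighbor search can be sketched with centroid/radius bounding volumes: a subtree is skipped when even its closest possible point cannot beat the current best. This is a generic construction, not the paper's exact structure:

```python
from math import dist, inf

class Node:
    """A cluster node: either holds points directly or groups child clusters."""
    def __init__(self, points=None, children=None):
        self.children = children or []
        self.points = points or []
        pts = self.all_points()
        d = len(pts[0])
        self.centroid = tuple(sum(p[i] for p in pts) / len(pts) for i in range(d))
        self.radius = max(dist(p, self.centroid) for p in pts)

    def all_points(self):
        return self.points + [p for c in self.children for p in c.all_points()]

def nearest(node, q, best=(inf, None)):
    """Branch and bound: a subtree's points are all within `radius` of its
    centroid, so if dist(q, centroid) - radius >= current best, prune it."""
    if dist(q, node.centroid) - node.radius >= best[0]:
        return best
    for p in node.points:
        if dist(q, p) < best[0]:
            best = (dist(q, p), p)
    for c in sorted(node.children, key=lambda c: dist(q, c.centroid)):
        best = nearest(c, q, best)
    return best

leaf1 = Node(points=[(0.0, 0.0), (1.0, 0.0)])
leaf2 = Node(points=[(10.0, 10.0), (11.0, 10.0)])
root = Node(children=[leaf1, leaf2])
print(nearest(root, (0.2, 0.1)))  # finds (0.0, 0.0); leaf2 is pruned entirely
```

Visiting the closer child first tightens the bound early, which is what lets such a tree avoid a linear scan of the dataset.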
A finite time stochastic clustering algorithm
Abstract
We present a finite-time local search (1 + δ)-approximation method finding the optimal solution with probability almost one with respect to a general measure of within-group dissimilarity. The algorithm is based on a finite-time Markov model of simulated annealing. A dynamic cooling schedule allows control of the convergence. The algorithm uses as its measure of within-group dissimilarity a new generalized Ward index based on a set of well-scattered representative points, which addresses the major weaknesses of partitioning algorithms regarding hyperspherically shaped clusters and noise. We compare it with other clustering algorithms, such as CLIQUE and DBSCAN.
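The general shape of such an annealing-based clustering can be sketched as simulated annealing over label assignments with a geometric cooling schedule. The cost here is a plain within-group sum of squared distances, a stand-in for the paper's generalized Ward index; all parameter values are illustrative:

```python
import random
from math import exp, dist

def within_group_cost(points, labels, k):
    """Sum over clusters of squared distances to the cluster centroid."""
    cost = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            continue
        centroid = tuple(sum(v) / len(members) for v in zip(*members))
        cost += sum(dist(p, centroid) ** 2 for p in members)
    return cost

def anneal_clustering(points, k, t0=1.0, cooling=0.995, steps=4000, seed=0):
    """Simulated annealing over label assignments: always accept improving
    moves, accept worsening moves with probability exp(-delta / t), and
    shrink the temperature t geometrically (the cooling schedule)."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]
    cost = within_group_cost(points, labels, k)
    t = t0
    for _ in range(steps):
        i, new = rng.randrange(len(points)), rng.randrange(k)
        old = labels[i]
        labels[i] = new
        new_cost = within_group_cost(points, labels, k)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < exp(-delta / t):
            cost = new_cost
        else:
            labels[i] = old  # reject the move, restore the old label
        t *= cooling
    return labels, cost

pts = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
labels, cost = anneal_clustering(pts, k=2)
print(labels, cost)
```

A fixed geometric schedule as above gives convergence only in probability; the finite-time guarantee in the paper comes from its dynamic cooling schedule, which this sketch does not reproduce.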