Results 1–10 of 12
Automatic Subspace Clustering of High Dimensional Data
Data Mining and Knowledge Discovery, 2005
Cited by 600 (12 self)
Abstract:
Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
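The grid-density idea underlying CLIQUE's first pass can be sketched as follows. This is a minimal illustration, not the paper's implementation: the interval count `n_intervals`, the density threshold `tau`, and the assumed [0, 1) data range are hypothetical parameters, and only the one-dimensional candidate pass is shown (full CLIQUE combines dense units bottom-up across subspaces and then extracts minimized DNF descriptions).

```python
from collections import Counter

def dense_units_1d(points, n_intervals, tau, lo=0.0, hi=1.0):
    """Partition each dimension into equal-width intervals, count the
    points falling in each interval, and keep the intervals whose count
    meets the density threshold tau. These 1-D dense units are the
    candidates from which higher-dimensional units are grown."""
    width = (hi - lo) / n_intervals
    n_dims = len(points[0])
    dense = {}
    for d in range(n_dims):
        counts = Counter()
        for p in points:
            idx = min(int((p[d] - lo) / width), n_intervals - 1)
            counts[idx] += 1
        dense[d] = [i for i, c in counts.items() if c >= tau]
    return dense
```

With four points bunched near 0.05 in dimension 0 and spread out in dimension 1, only dimension 0 yields a dense unit.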
Approximation Algorithms for Projective Clustering
Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, 2000
Cited by 256 (21 self)
Abstract:
We consider the following two instances of the projective clustering problem: Given a set S of n points in R^d and an integer k > 0, cover S by k hyperstrips (resp. hypercylinders) so that the maximum width of a hyperstrip (resp. the maximum diameter of a hypercylinder) is minimized. Let w* be the smallest value so that S can be covered by k hyperstrips (resp. hypercylinders), each of width (resp. diameter) at most w*. In the plane, the two problems are equivalent. It is NP-hard to compute k planar strips of width even at most Cw*, for any constant C > 0 [50]. This paper contains four main results related to projective clustering: (i) For d = 2, we present a randomized algorithm that computes O(k log k) strips of width at most 6w* that cover S. Its expected running time is O(nk^2 log^4 n) if k^2 log k ≤ n; it also works for larger values of k, but then the expected running time is O(n^{2/3} k^{8/3} log^4 n). We also propose another algorithm that computes a c...
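The basic quantity being optimized, the width of a covering strip, is easy to illustrate for the k = 1 planar case: project all points onto the strip's normal direction and take the spread. The direction-sampling minimizer below is a toy sketch under that framing; the function names and the sampling resolution `n_dirs` are illustrative, and the paper's algorithms solve the far harder k-strip version with provable guarantees.

```python
import math

def strip_width(points, theta):
    """Width of the thinnest strip with normal (cos theta, sin theta)
    covering all points: the spread of their projections onto that
    normal."""
    nx, ny = math.cos(theta), math.sin(theta)
    proj = [nx * x + ny * y for x, y in points]
    return max(proj) - min(proj)

def min_width(points, n_dirs=360):
    """Approximate the minimum-width covering strip (k = 1) by sampling
    candidate normal directions over a half-circle."""
    return min(strip_width(points, math.pi * i / n_dirs)
               for i in range(n_dirs))
```

For collinear points the minimum width is (numerically) zero, as a strip aligned with the line covers them exactly.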
Finding localized associations in market basket data
Knowledge and Data Engineering, 2002
Cited by 29 (1 self)
Abstract:
In this paper, we discuss a technique for discovering localized associations in segments of the data using clustering. Often the aggregate behavior of a data set may be very different from that of its localized segments. In such cases, it is desirable to design algorithms which are effective in discovering localized associations, because they expose a customer pattern which is more specific than the aggregate behavior. This information may be very useful for target marketing. We present empirical results which show that the method is indeed able to find a significantly larger number of associations than what can be discovered by analysis of the aggregate data.
ClusterTree: Integration of Cluster Representation and Nearest Neighbor Search for Large Datasets with High Dimensionality
IEEE International Conference on Multimedia and Expo, 2000
Cited by 12 (1 self)
Abstract:
In this paper, we introduce the ClusterTree, a new indexing approach to representing clusters generated by any existing clustering approach. A cluster is decomposed into several subclusters and represented as the union of the subclusters. The subclusters can be further decomposed, which isolates the most related groups within the clusters. A ClusterTree is a hierarchy of clusters and subclusters which incorporates the cluster representation into the index structure to achieve effective and efficient retrieval. Our cluster representation is highly adaptive to any kind of cluster. It is well accepted that most existing indexing techniques degrade rapidly as dimensionality increases. The ClusterTree can support the retrieval of nearest neighbors effectively without having to linearly scan the high-dimensional dataset. We also discuss an approach to dynamically reconstruct the ClusterTree when new data are added. We present a detailed analysis of this approach and justify it extensively by experiments. Keywords: indexing, cluster representation, nearest neighbor search, high-dimensional datasets
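The general mechanism by which a cluster hierarchy avoids a linear scan can be sketched as a branch-and-bound search: prune any subcluster whose bounding ball cannot contain a point closer than the best candidate found so far. This is a generic sketch of that idea under a hypothetical node layout (centroid, radius, children or points), not the paper's exact ClusterTree structure or search procedure.

```python
import math

class ClusterNode:
    """Hypothetical tree node: a centroid, a bounding radius, and either
    child subclusters or the points themselves (leaf)."""
    def __init__(self, centroid, radius, children=None, points=None):
        self.centroid = centroid
        self.radius = radius
        self.children = children or []
        self.points = points or []

def nearest(node, q, best=(math.inf, None)):
    """Nearest-neighbor search with ball pruning: a subcluster is
    skipped when dist(q, centroid) - radius already exceeds the best
    distance found, so most of the tree is never visited."""
    if node.points:  # leaf: check points directly
        for p in node.points:
            d = math.dist(p, q)
            if d < best[0]:
                best = (d, p)
        return best
    # descend into the most promising (nearest-centroid) child first
    for child in sorted(node.children, key=lambda c: math.dist(c.centroid, q)):
        if math.dist(child.centroid, q) - child.radius < best[0]:
            best = nearest(child, q, best)
    return best
```

In the test below the far subcluster is pruned without inspecting its points.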
A Framework for Finding Projected Clusters in High Dimensional Spaces
1999
Cited by 11 (2 self)
Abstract:
Clustering problems are well known in the database literature for their use in numerous applications such as customer segmentation, classification and trend analysis. Unfortunately, all known algorithms tend to break down in high dimensional spaces because of the inherent sparsity of the points. In such high dimensional spaces not all dimensions may be relevant to a given cluster. One way of handling this is to pick the closely correlated dimensions and find clusters in the corresponding subspace. Traditional feature selection algorithms attempt to achieve this. The weakness of this approach is that in typical high dimensional data mining applications different sets of points may cluster better for different subsets of dimensions. The number of dimensions in each such cluster-specific subspace may also vary. Hence, it may be impossible to find a single small subset of dimensions for all the clusters. We therefore introduce a generalization of the clustering problem, referred to as the ...
Hypergraph Models and Algorithms for Data-Pattern-Based Clustering
Data Mining and Knowledge Discovery, 2004
Cited by 11 (1 self)
Abstract:
In traditional approaches for clustering market-basket data, relations among transactions are modeled according to the items occurring in these transactions. However, an individual item might induce different relations in different contexts. Since such contexts might be captured by interesting patterns in the overall data, we represent each transaction as a set of patterns by modifying the conventional pattern semantics. By clustering the patterns in the dataset, we infer a clustering of the transactions represented this way. For this, we propose a novel hypergraph model to represent the relations among the patterns. Instead of a local measure that depends only on common items among patterns, we propose a global measure that is based on the co-occurrences of these patterns in the overall data. The success of existing hypergraph-partitioning-based algorithms in other domains depends on the sparsity of the hypergraph and explicit objective metrics. For this reason, we propose a two-phase clustering approach for the above hypergraph, which is expected to be dense. In the first phase, the vertices of the hypergraph are merged in a multilevel algorithm to obtain a large number of high-quality clusters. Here, we propose new quality metrics for merging decisions in hypergraph clustering specifically for this domain. To enable the use of existing metrics in the second phase, we introduce a vertex-to-cluster affinity concept to devise a method for constructing a sparse hypergraph based on the obtained clustering. The experiments we have performed show the effectiveness of the proposed framework.
A finite time stochastic clustering algorithm
Abstract:
We present a finite-time local search (1 + δ)-approximation method that finds the optimal solution with probability almost one with respect to a general measure of within-group dissimilarity. The algorithm is based on a finite-time Markov model of simulated annealing. A dynamic cooling schedule allows control of the convergence. As its measure of within-group dissimilarity, the algorithm uses a new generalized Ward index based on a set of well-scattered representative points, which deals with the major weaknesses of partitioning algorithms regarding hyperspherically shaped clusters and noise. We compare it with other clustering algorithms, such as CLIQUE and DBSCAN.
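The simulated-annealing mechanics referred to above can be sketched in a toy form: propose reassigning one point to a random cluster and accept cost-increasing moves with probability exp(-Δ/T) under a geometric cooling schedule. Everything here is illustrative: the plain within-group sum of squares stands in for the paper's generalized Ward index, and the schedule parameters `t0` and the 0.995 decay are invented for the sketch.

```python
import math
import random

def sa_cluster(points, k, n_steps=2000, t0=1.0, seed=0):
    """Toy simulated-annealing clustering over label assignments."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]

    def cost(lbls):
        # within-group sum of squared distances to each group mean
        total = 0.0
        for j in range(k):
            members = [p for p, l in zip(points, lbls) if l == j]
            if not members:
                continue
            mean = [sum(x) / len(members) for x in zip(*members)]
            total += sum(sum((a - b) ** 2 for a, b in zip(p, mean))
                         for p in members)
        return total

    cur = cost(labels)
    for step in range(n_steps):
        t = t0 * (0.995 ** step)  # geometric cooling schedule
        i, j = rng.randrange(len(points)), rng.randrange(k)
        old = labels[i]
        if old == j:
            continue
        labels[i] = j
        new = cost(labels)
        # accept improvements always; accept worse moves with prob exp(-d/T)
        if new < cur or rng.random() < math.exp(-(new - cur) / max(t, 1e-9)):
            cur = new
        else:
            labels[i] = old
    return labels, cur
```

On two well-separated blobs the chain settles into the natural two-cluster partition.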
Annotation-based distance measures for patient subgroup discovery in clinical microarray studies
Bioinformatics, doi:10.1093/bioinformatics/btm322
Mining Patterns from Case Base Analysis
2001
Abstract:
In this paper, we present our work on combining domain knowledge and data mining techniques to improve service and reduce costs for a product company. We first extract domain knowledge from the database of call records and then incorporate the domain knowledge into the process of finding similar products and clustering. By finding similar products, we can take successes from one product and apply them to similar products. By clustering, we can group products into clusters and design an improvement strategy for each group. It is projected that our work would be very useful and beneficial to the company.
Penalized and Weighted K-means for Clustering with Scattered Objects and Prior Information in High-throughput Biological Data
Abstract:
Motivation: Cluster analysis is one of the most important data mining tools for investigating high-throughput biological data. The existence of many scattered objects that should not be clustered has been found to hinder the performance of most traditional clustering algorithms in such a high-dimensional, complex situation. Very often, additional prior knowledge from databases or previous experiments is also available in the analysis. Excluding scattered objects and incorporating existing prior information are desirable to enhance clustering performance. Results: In this paper, a class of loss functions is proposed for cluster analysis and applied to high-throughput genomic and proteomic data. Two major extensions from K-means are involved: penalization and weighting. The additive penalty term is used to allow a set of scattered objects to remain unclustered. Weights are introduced to account for prior information on preferred or prohibited cluster patterns to be identified. Their relationship with the classification likelihood of Gaussian mixture models is explored. Incorporation of good prior information is also shown to alleviate the global optimization issue in clustering. Applications of the proposed method on simulated data as well as high-throughput data sets from tandem mass spectrometry (MS/MS) and microarray experiments are presented. Our results demonstrate its superior performance over most existing methods and its computational simplicity and extensibility in the application of large complex biological data sets.
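The penalization idea, an additive penalty that lets scattered objects stay unclustered, can be sketched as a small modification of Lloyd's algorithm: a point whose squared distance to every center exceeds a penalty `lam` is labeled noise (-1) instead of distorting a centroid. This is one possible reading of the mechanism, with `lam` and the update scheme chosen for illustration; it is not the paper's exact loss function, and the weighting extension is omitted.

```python
def penalized_kmeans(points, centers, lam, n_iter=20):
    """K-means with a noise option: the assignment step compares the
    best squared distance against the penalty lam, and the update step
    averages only the clustered points."""
    centers = [list(c) for c in centers]
    k = len(centers)
    labels = [-1] * len(points)
    for _ in range(n_iter):
        # assignment step: nearest center, or -1 if all are too far
        for i, p in enumerate(points):
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            j = min(range(k), key=lambda t: d2[t])
            labels[i] = j if d2[j] <= lam else -1
        # update step over clustered points only
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centers
```

With a distant outlier present, the outlier is left unclustered while the two tight groups keep their centroids.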