Results 1 - 10
of
63
Survey of clustering data mining techniques
, 2002
"... Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in math ..."
Abstract
-
Cited by 177 (0 self)
- Add to MetaCart
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
Finding Generalized Projected Clusters in High Dimensional Spaces
"... High dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that in high dimensional data, even the concept of proximity or clustering may not be meaningful. We discuss very general techniques for projec ..."
Abstract
-
Cited by 117 (7 self)
- Add to MetaCart
High dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that in high dimensional data, even the concept of proximity or clustering may not be meaningful. We discuss very general techniques for projected clustering which are able to construct clusters in arbitrarily aligned subspaces of lower dimensionality. The subspaces are specific to the clusters themselves. This definition is substantially more general and realistic than currently available techniques which limit the method to only projections from the original set of attributes. The generalized projected clustering technique may also be viewed as a way of trying to rede ne clustering for high dimensional applications by searching for hidden subspaces with clusters which are created by inter-attribute correlations. We provide a new concept of using extended cluster feature vectors in order to make the algorithm scalable for very large databases. The running time and space requirements of the algorithm are adjustable, and are likely to tradeoff with better accuracy.
Feature selection for clustering
- in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2000
"... Abstract. Clustering is an important data mining task. Data mining often concerns large and high-dimensional data but unfortunately most of the clustering algorithms in the literature are sensitive to largeness or high-dimensionality or both. Di erent features a ect clusters di erently, some are imp ..."
Abstract
-
Cited by 34 (5 self)
- Add to MetaCart
Abstract. Clustering is an important data mining task. Data mining often concerns large and high-dimensional data but unfortunately most of the clustering algorithms in the literature are sensitive to largeness or high-dimensionality or both. Di erent features a ect clusters di erently, some are important for clusters while others may hinder the clustering task. An e cient wayof handling it is by selecting a subset of important features. It helps in nding clusters e ciently, understanding the data better and reducing data size for e cient storage, collection and processing. The task of nding original important features for unsupervised data is largely untouched. Traditional feature selection algorithms work only for supervised data where class information is available. For unsupervised data, without class information, often principal components (PCs) are used, but PCs still require all features and they may be di cult to understand. Our approach: rst features are ranked according to their importance on clustering and then a subset of important features are selected. For large data we use a scalable method using sampling. Empirical evaluation shows the e ectiveness and scalability of our approach for benchmark and synthetic data sets. 1
OP-Cluster: Clustering by Tendency in High Dimensional Space
, 2003
"... Clustering is the process of grouping a set of objects into classes of similar objects. Because of unknownness of the hidden patterns in the data sets, the definition of similarity is very subtle. Until recently, similarity measures are typically based on distances, e.g Euclidean distance and cosine ..."
Abstract
-
Cited by 33 (3 self)
- Add to MetaCart
Clustering is the process of grouping a set of objects into classes of similar objects. Because of unknownness of the hidden patterns in the data sets, the definition of similarity is very subtle. Until recently, similarity measures are typically based on distances, e.g Euclidean distance and cosine distance. In this paper, we propose a flexible yet powerful clustering model, namely OP-Cluster (Order Preserving Cluster). Under this new model, two objects are similar on a subset of dimensions if the values of these two objects induce the same relative order of those dimensions. Such a cluster might arise when the expression levels of (coregulated) genes can rise or fall synchronously in response to a sequence of environment stimuli. Hence, discovery of OP-Cluster is essential in revealing significant gene regulatory networks. A deterministic algorithm is designed and implemented to discover all the significant OP-Clusters. A set of extensive experiments has been done on several real biological data sets to demonstrate its effectiveness and efficiency in detecting co-regulated patterns.
Feature Selection for Clustering - A Filter Solution
- In Proceedings of the Second International Conference on Data Mining
, 2002
"... Processing applications with a large number of dimensions has been a challenge to the KDD community. Feature selection, an effective dimensionality reduction technique, is an essential pre-processing method to remove noisy features. In the literature there are only a few methods proposed for feature ..."
Abstract
-
Cited by 33 (2 self)
- Add to MetaCart
Processing applications with a large number of dimensions has been a challenge to the KDD community. Feature selection, an effective dimensionality reduction technique, is an essential pre-processing method to remove noisy features. In the literature there are only a few methods proposed for feature selection for clustering. And, almost all of those methods are `wrapper' techniques that require a clustering algorithm to evaluate the candidate feature subsets. The wrapper approach is largely unsuitable in real-world applications due to its heavy reliance on clustering algorithms that require parameters such as number of clusters, and due to lack of suitable clustering criteria to evaluate clustering in different subspaces. In this paper we propose a `filter' method that is independent of any clustering algorithm. The proposed method is based on the observation that data with clusters has very different point-to-point distance histogram than that of data without clusters. Using this we propose an entropy measure that is low if data has distinct clusters and high otherwise. The entropy measure is suitable for selecting the most important subset of features because it is invariant with number of dimensions, and is affected only by the quality of clustering. Extensive performance evaluation over synthetic, benchmark, and real datasets shows its effectiveness.
Computing Clusters of Correlation Connected Objects
, 2004
"... The detection of correlations between different features in a set of feature vectors is a very important data mining task because correlation indicates a dependency between the features or some association of cause and effect between them. This association can be arbitrarily complex, i.e. one or mor ..."
Abstract
-
Cited by 29 (10 self)
- Add to MetaCart
The detection of correlations between different features in a set of feature vectors is a very important data mining task because correlation indicates a dependency between the features or some association of cause and effect between them. This association can be arbitrarily complex, i.e. one or more features might be dependent from a combination of several other features. Well-known methods like the principal components analysis (PCA) can perfectly find correlations which are global, linear, not hidden in a set of noise vectors, and uniform, i.e. the same type of correlation is exhibited in all feature vectors. In many applications such as medical diagnosis, molecular biology, time sequences, or electronic commerce, however, correlations are not global since the dependency between features can be different in different subgroups of the set. In this paper, we propose a method called 4C (Computing Correlation Connected Clusters) to identify local subgroups of the data objects sharing a uniform but arbitrarily complex correlation. Our algorithm is based on a combination of PCA and density-based clustering (DBSCAN). Our method has a determinate result and is robust against noise. A broad comparative evaluation demonstrates the superior performance of 4C over competing methods such as DBSCAN, CLIQUE and ORCLUS.
Clustering Through Decision Tree Construction
- In SIGMOD-00
, 2000
"... this paper, we propose a novel clustering technique, which is based on a supervised learning technique called decision tree construction. The new technique is able to overcome many of these shortcomings. The key idea is to use a decision tree to partition the data space into cluster and empty (spars ..."
Abstract
-
Cited by 27 (0 self)
- Add to MetaCart
this paper, we propose a novel clustering technique, which is based on a supervised learning technique called decision tree construction. The new technique is able to overcome many of these shortcomings. The key idea is to use a decision tree to partition the data space into cluster and empty (sparse) regions at different levels of details. The technique is able to find "natural" clusters in large high dimensional spaces efficiently. It is suitable for clustering in the full dimensional space as well as in subspaces. It also provides comprehensible descriptions of clusters. Experiment results on both synthetic data and real-life data show that the technique is effective and also scales well for large high dimensional datasets.
Adaptive grids for clustering massive data sets
- In 1st SIAM International Conference Proceedings on Data Mining
, 2001
"... Clustering is a key data mining problem. Density and grid based technique is a popular way to mine clusters in a large multi-dimensional space wherein clusters are regarded as dense regions than their surroundings. The attribute values and ranges of these attributes characterize the clusters. Fine g ..."
Abstract
-
Cited by 14 (0 self)
- Add to MetaCart
Clustering is a key data mining problem. Density and grid based technique is a popular way to mine clusters in a large multi-dimensional space wherein clusters are regarded as dense regions than their surroundings. The attribute values and ranges of these attributes characterize the clusters. Fine grid sizes lead to a huge amount of computation while coarse grid sizes result in loss in quality of clusters found. Also,varied grid sizes result in discovering clusters with different cluster descriptions. The technique of Adaptive grids enables to use grids based on the data distribution and does not require the user to specify any parameters like the grid size or the density thresholds. Further,clusters could be embedded in a subspace of a high dimensional space. We propose a modified bottom-up subspace clustering algorithm to discover clusters in all possible subspaces. Our method scales linearly with the data dimensionality and the size of the data set. Experimental results on a wide variety of synthetic and real data sets demonstrate the effectiveness of Adaptive grids and the effect of the modified subspace clustering algorithm. Our algorithm explores at-least an order of magnitude more number of subspaces than the original algorithm and the use of adaptive grids yields on an average of two orders of magnitude speedup as compared to the method with user specified grid size and threshold.
Evaluating subspace clustering algorithms
- in Workshop on Clustering High Dimensional Data and its Applications, SIAM International Conference on Data Mining (SDM 2004
, 2004
"... Clustering techniques often define the similarity between instances using distance measures over the various dimensions of the data [12, 14]. Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. Traditional clustering algo ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
Clustering techniques often define the similarity between instances using distance measures over the various dimensions of the data [12, 14]. Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. Traditional clustering algorithms consider all of the dimensions of an input dataset in an attempt to learn as much as possible about each instance described. In high dimensional data, however, many of the dimensions are often irrelevant. These irrelevant dimensions confuse clustering algorithms by hiding clusters in noisy data. In very high dimensions it is common for all of the instances in a dataset to be nearly equidistant from each other, completely masking the clusters. Subspace clustering algorithms localize the search for relevant dimensions allowing them to find clusters that exist in multiple, possibly overlapping subspaces. This paper presents a survey of the various subspace clustering algorithms. We then compare the two main approaches to subspace clustering using empirical scalability and accuracy tests. 1

