Results 1–10 of 19
DUSC: Dimensionality unbiased subspace clustering
In ICDM, 2007
Abstract

Cited by 23 (9 self)
To gain insight into today’s large data resources, data mining provides automatic aggregation techniques. Clustering aims at grouping data such that objects within groups are similar while objects in different groups are dissimilar. In scenarios with many attributes or with noise, clusters are often hidden in subspaces of the data and do not show up in the full-dimensional space. For these applications, subspace clustering methods aim at detecting clusters in any subspace. Existing subspace clustering approaches fall prey to an effect we call dimensionality bias. As the dimensionality of subspaces varies, approaches that do not take this effect into account fail to separate clusters from noise. We give a formal definition of dimensionality bias and analyze its consequences for subspace clustering. A dimensionality unbiased subspace clustering (DUSC) definition based on statistical foundations is proposed. In thorough experiments on synthetic and real-world data, we show that our approach outperforms existing subspace clustering algorithms.
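The bias the abstract describes can be made concrete with a back-of-the-envelope calculation (this is an illustration, not the paper’s own formula): under uniform noise, the expected number of points inside a fixed-size neighborhood shrinks exponentially with the dimensionality of the subspace, so any fixed density threshold treats low- and high-dimensional subspaces very differently.

```python
def expected_noise_in_cube(n, d, eps=0.25):
    """Expected number of uniform-noise points inside a hypercube
    neighbourhood of side 2*eps around a query point, when n points
    are spread uniformly over the unit cube of a d-dimensional subspace."""
    return n * (2 * eps) ** d

# The same neighbourhood that is saturated with noise in 2 dimensions
# is nearly empty in 10 -- a fixed density threshold is therefore biased:
for d in (2, 4, 6, 8, 10):
    print(d, expected_noise_in_cube(10_000, d))
```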
VISA: Visual Subspace Clustering Analysis
Abstract

Cited by 14 (6 self)
To gain insight into today’s large data resources, data mining extracts interesting patterns. To generate knowledge from patterns and benefit from human cognitive abilities, meaningful visualization of patterns is crucial. Clustering is a data mining technique that aims at grouping data into patterns based on mutual (dis)similarity. For high-dimensional data, subspace clustering searches for patterns in any subspace of the attributes, as patterns are typically obscured by many irrelevant attributes in the full space. For visual analysis of subspace clusters, their comparability has to be ensured. Existing subspace clustering approaches, however, lack interactive visualization and show bias with respect to the dimensionality of subspaces. In this work, dimensionality unbiased subspace clustering and a novel distance function for subspace clusters are proposed. We suggest two visualization techniques that allow users to browse the entire subspace clustering, to zoom into individual objects, and to analyze subspace cluster characteristics in depth. Bracketing of different parameter settings enables users to immediately see the effect of parameters on their data and hence to choose the best clustering result for further analysis. Feeding the results of user analysis back to the subspace clustering algorithm directly improves the clustering. We demonstrate our visualization techniques on real-world data and confirm the results through additional accuracy measurements and comparison with existing subspace clustering algorithms.
Fast Overlapping Clustering of Networks Using Sampled Spectral Distance Embedding and GMMs
Abstract

Cited by 6 (2 self)
Clustering social networks is vital to understanding online interactions and influence. This task becomes more difficult when communities overlap, and when the social networks become extremely large. We present an efficient algorithm for constructing overlapping clusters whose running time is roughly linear in the size of the network. The algorithm first embeds the graph and then performs a metric clustering using a Gaussian Mixture Model (GMM). We evaluate the algorithm on the DBLP paper-paper network, which consists of about 1 million nodes and over 30 million edges; we can cluster this network in under 20 minutes on a modest single-CPU machine.
SSDECluster: Fast Overlapping Clustering of Networks Using Sampled Spectral Distance Embedding and GMMs
Abstract

Cited by 3 (1 self)
Clustering social networks is vital to understanding online interactions and influence. This task becomes more difficult when communities overlap, and when the social networks become extremely large. We present an efficient algorithm for constructing overlapping clusters that is approximately linear in the size of the network. The algorithm first embeds the graph and then performs a metric clustering using a Gaussian Mixture Model (GMM). We evaluate the algorithm on the DBLP paper-paper network, which consists of about 1 million nodes and over 30 million edges; we can cluster this network in under 20 minutes on a modest single-CPU machine.
Agglomerating Local Patterns Hierarchically with ALPHA
Abstract

Cited by 2 (2 self)
To increase the relevancy of local patterns discovered from noisy relations, it makes sense to formalize error-tolerance. Our starting point is to address the limitations of state-of-the-art methods for this purpose. Some extractors perform an exhaustive search w.r.t. a declarative specification of error-tolerance. Nevertheless, their computational complexity prevents the discovery of large relevant patterns. ALPHA is a three-step method that (1) computes complete collections of closed patterns, possibly error-tolerant ones, from arbitrary n-ary relations, (2) enlarges them by hierarchical agglomeration, and (3) selects the relevant agglomerated patterns.
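Step (1) of this pipeline, extracting a complete collection of closed patterns, can be sketched by brute force on a toy binary relation (the relation below is invented for illustration; error-tolerance and the agglomeration and selection steps are omitted). An itemset is closed exactly when it equals the intersection of all rows that support it.

```python
from itertools import combinations

# Toy binary relation: objects 1..4 over attributes a..e.
relation = {
    1: {"a", "b", "c"},
    2: {"a", "b", "c"},
    3: {"a", "b", "d"},
    4: {"b", "d", "e"},
}

def support(itemset):
    """Objects whose rows contain every item of the itemset."""
    return {o for o, items in relation.items() if itemset <= items}

def closed_itemsets():
    """Brute-force enumeration: an itemset is closed when it equals
    the intersection of the rows of its supporting objects."""
    items = sorted(set().union(*relation.values()))
    result = []
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            s = frozenset(combo)
            sup = support(s)
            if not sup:
                continue
            closure = set.intersection(*(relation[o] for o in sup))
            if s == closure:
                result.append((s, sup))
    return result
```

For instance, {a} is not closed here because every row containing a also contains b, whereas {a, b} is closed. Exhaustive extractors enumerate exactly these closed patterns, which is why their cost grows so quickly with the relation.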
Clustering Algorithms for Categorical Data: A Monte Carlo Study
In International Journal of Statistics and Applications
Abstract

Cited by 1 (0 self)
In this paper the clustering algorithms average linkage, ROCK, k-modes, fuzzy k-modes, and k-populations were compared by means of Monte Carlo simulation. Data were simulated from Beta and Uniform distributions, considering factors such as cluster overlapping and the number of groups, variables, and categories. A total of 64 population structures of clusters were simulated, covering smaller and higher degrees of overlapping and varying numbers of clusters, variables, and categories. The results showed that overlapping was the factor with the greatest impact on the algorithms' accuracy, which decreases as the number of clusters increases. In general, ROCK presented the best performance in both overlapping and non-overlapping cases, followed by k-modes and fuzzy k-modes. The k-populations algorithm showed better accuracy only in cases with a small degree of overlapping, with performance similar to average linkage. The superiority of the k-populations algorithm over k-modes and fuzzy k-modes reported in previous studies, which were based only on benchmark data, was not confirmed in this simulation study.
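The k-modes algorithm compared in this study adapts k-means to categorical data by replacing means with attribute-wise modes and Euclidean distance with simple mismatch counts. A minimal sketch (not the study’s implementation; the data and parameters below are invented for illustration):

```python
import random
from collections import Counter

def hamming(a, b):
    """Number of mismatching attribute values between two records."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(data, k, iters=20, seed=0):
    """Minimal k-modes: centroids are attribute-wise modes and
    records are assigned to the nearest centroid by mismatch count."""
    rng = random.Random(seed)
    modes = rng.sample(data, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for row in data:
            j = min(range(k), key=lambda c: hamming(row, modes[c]))
            clusters[j].append(row)
        new_modes = []
        for j, members in enumerate(clusters):
            if not members:                 # keep an empty cluster's mode
                new_modes.append(modes[j])
                continue
            new_modes.append(tuple(
                Counter(col).most_common(1)[0][0]
                for col in zip(*members)))  # mode of each attribute column
        if new_modes == modes:              # converged
            break
        modes = new_modes
    return modes, clusters

data = [("red", "s", "x"), ("red", "s", "y"), ("red", "m", "x"),
        ("blue", "l", "z"), ("blue", "l", "w"), ("blue", "m", "z")]
modes, clusters = k_modes(data, 2)
```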
Data Reduction Method for Categorical Data Clustering
Abstract
Categorical data clustering constitutes an important part of data mining; its relevance has recently drawn attention from several researchers. As a step in data mining, however, clustering encounters the problem of large amounts of data to be processed. This article offers a solution for categorical clustering algorithms when working with high volumes of data by means of a method that summarizes the database. This is done using a structure called CMtree. In order to test our method, the KModes and Click clustering algorithms were used with several databases. Experiments demonstrate that the proposed summarization method improves execution time without losing clustering quality.
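The CMtree structure itself is not reproduced here, but the general idea of summarizing a categorical database before clustering can be illustrated with the simplest possible summary: collapsing duplicate rows into weighted representatives, so the clustering algorithm processes a few distinct patterns instead of every record (toy data, invented for this sketch).

```python
from collections import Counter

# Toy categorical table with many repeated rows (6000 records).
rows = [("red", "s"), ("red", "s"), ("red", "s"),
        ("blue", "l"), ("blue", "l"), ("red", "m")] * 1000

# Summarize: one representative per distinct row, with its multiplicity.
summary = Counter(rows)
representatives = list(summary.keys())   # 3 distinct patterns
weights = list(summary.values())         # how many records each stands for

# A clustering algorithm can now run on the 3 representatives,
# weighting each by its count, instead of on all 6000 records.
```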