Results 1-10 of 38
Survey of clustering data mining techniques
, 2002
Cited by 400 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets
, 1999
Cited by 84 (0 self)
Clustering techniques are used in database mining for finding interesting patterns in high dimensional data. These are useful in various applications of knowledge discovery in databases. Some challenges in clustering for large data sets in terms of scalability, data distribution, understanding end results, and sensitivity to input order have received attention in the recent past. Recent approaches attempt to find clusters embedded in subspaces of high dimensional data. In this paper we propose the use of adaptive grids for efficient and scalable computation of clusters in subspaces for large data sets and large number of dimensions. The bottom-up algorithm for subspace clustering computes the dense units in all dimensions and combines these to generate the dense units in higher dimensions. Computation is heavily dependent on the choice of the partitioning parameter chosen to partition each dimension into intervals (bins) to be tested for density. The number of bins determine...
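The bottom-up step described in this abstract (find dense one-dimensional units, then combine them into candidate units in higher dimensions) can be illustrated with a minimal sketch. This is not the MAFIA implementation: it assumes a fixed, uniform grid rather than the adaptive grids the paper proposes, and the names `bin_index`, `dense_units`, `n_bins`, and `min_count` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def bin_index(value, lo, hi, n_bins):
    """Map a value in [lo, hi] to a bin number in 0..n_bins-1 (uniform grid, an assumption)."""
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)


def dense_units(points, n_bins, min_count, lo=0.0, hi=1.0):
    """Return {k: set of dense k-dimensional units}.

    A unit is a sorted tuple of (dimension, bin) pairs; it is dense when at
    least `min_count` points fall inside it.
    """
    n_dims = len(points[0])
    cells = [tuple(bin_index(v, lo, hi, n_bins) for v in p) for p in points]

    # Level 1: count points per (dimension, bin) pair.
    counts = Counter(((d, c[d]),) for c in cells for d in range(n_dims))
    level = {unit for unit, n in counts.items() if n >= min_count}
    result = {1: level} if level else {}

    k = 1
    while level:
        k += 1
        # Join pairs of dense (k-1)-units that together span exactly k distinct dimensions.
        candidates = set()
        for a, b in combinations(level, 2):
            merged = tuple(sorted(set(a) | set(b)))
            if len(merged) == k and len({d for d, _ in merged}) == k:
                candidates.add(merged)

        # Keep only the candidates that are actually dense in the data.
        counts = Counter()
        for c in cells:
            for cand in candidates:
                if all(c[d] == b for d, b in cand):
                    counts[cand] += 1
        level = {unit for unit, n in counts.items() if n >= min_count}
        if level:
            result[k] = level
    return result
```

With a fixed grid, the result depends strongly on `n_bins`, which is the sensitivity to the partitioning parameter that the abstract points out.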
A p* primer: logit models for social networks
 SOCIAL NETWORKS
, 1999
Cited by 78 (1 self)
A major criticism of the statistical models for analyzing social networks developed by Holland, Leinhardt, and others [Holland, P.W., Leinhardt, S., 1977. Notes on the statistical analysis of social network data; Holland, P.W., Leinhardt, S., 1981. An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association. 76, pp. 33–65 (with discussion); Fienberg, S.E., Wasserman,
Clustering Through Decision Tree Construction
 In SIGMOD00
, 2000
Cited by 62 (0 self)
this paper, we propose a novel clustering technique, which is based on a supervised learning technique called decision tree construction. The new technique is able to overcome many of these shortcomings. The key idea is to use a decision tree to partition the data space into cluster and empty (sparse) regions at different levels of detail. The technique is able to find "natural" clusters in large high dimensional spaces efficiently. It is suitable for clustering in the full dimensional space as well as in subspaces. It also provides comprehensible descriptions of clusters. Experimental results on both synthetic data and real-life data show that the technique is effective and also scales well for large high dimensional datasets.
Adaptive grids for clustering massive data sets
 In 1st SIAM International Conference Proceedings on Data Mining
, 2001
Cited by 28 (0 self)
Clustering is a key data mining problem. Density and grid based techniques are a popular way to mine clusters in a large multidimensional space, wherein clusters are regarded as regions denser than their surroundings. The attribute values and ranges of these attributes characterize the clusters. Fine grid sizes lead to a huge amount of computation, while coarse grid sizes result in a loss in the quality of the clusters found. Also, varied grid sizes result in discovering clusters with different cluster descriptions. The technique of Adaptive grids enables the use of grids based on the data distribution and does not require the user to specify any parameters like the grid size or the density thresholds. Further, clusters could be embedded in a subspace of a high dimensional space. We propose a modified bottom-up subspace clustering algorithm to discover clusters in all possible subspaces. Our method scales linearly with the data dimensionality and the size of the data set. Experimental results on a wide variety of synthetic and real data sets demonstrate the effectiveness of Adaptive grids and the effect of the modified subspace clustering algorithm. Our algorithm explores at least an order of magnitude more subspaces than the original algorithm, and the use of adaptive grids yields on average a speedup of two orders of magnitude compared to the method with user specified grid size and threshold.
Locally adaptive metrics for clustering high dimensional data
, 2006
Cited by 25 (7 self)
Clustering suffers from the curse of dimensionality, and similarity functions that use all input features with equal relevance may not be effective. We introduce an algorithm that discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features. This approach avoids the risk of loss of information encountered in global dimensionality reduction techniques, and does not assume any data distribution model. Our method associates with each cluster a weight vector, whose values capture the relevance of features within the corresponding cluster. We experimentally demonstrate the gain in performance our method achieves with respect to competitive methods, using both synthetic and real datasets. In particular, our results show the feasibility of the proposed technique to perform simultaneous clustering of genes and conditions in gene expression data, and clustering of very high dimensional data such as text data.
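The idea of a per-cluster weight vector can be illustrated with a small sketch of cluster assignment under a weighted squared Euclidean distance. This is only an illustration under stated assumptions: the abstract does not say how the weights are computed or updated, so the weights here are given as inputs, and the names `weighted_sq_dist` and `assign` are hypothetical rather than taken from the paper.

```python
def weighted_sq_dist(x, centroid, weights):
    """Squared Euclidean distance with one relevance weight per feature."""
    return sum(w * (a - c) ** 2 for w, a, c in zip(weights, x, centroid))


def assign(x, centroids, weight_vectors):
    """Return the index of the cluster nearest to x, each cluster using its own weights."""
    dists = [weighted_sq_dist(x, c, w) for c, w in zip(centroids, weight_vectors)]
    return min(range(len(dists)), key=dists.__getitem__)
```

A cluster whose weight vector puts zero weight on a feature ignores that feature entirely, so the same point can be assigned to different clusters depending on which dimensions each cluster treats as relevant.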
Clustering in Massive Data Sets
 Handbook of massive data sets
, 1999
Cited by 17 (0 self)
We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Sections 2 to 6 relate to nearest neighbor searching, an elemental form of clustering, and a basis for clustering algorithms to follow. Sections 7 to 11 review a number of families of clustering algorithms. Sections 12 to 14 relate to visual or image representations of data sets, from which a number of interesting algorithmic developments arise.
Mining Clusters with Association Rules
 Advances in Intelligent Data Analysis, Lecture Notes in Computer Science 1642
, 1999
Cited by 13 (4 self)
In this paper we propose a method for extracting clusters in a population of customers, where the only information available is the list of products bought by the individual clients. We use association rules having high confidence to construct a hierarchical sequence of clusters. A specific metric is introduced for measuring the quality of the resulting clusterings. Practical consequences are discussed in view of some experiments on real life datasets.
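For readers unfamiliar with the term, the confidence of an association rule A -> B over purchase lists is the fraction of baskets containing A that also contain B. The sketch below shows only that standard definition. The hierarchical cluster construction and the quality metric mentioned in the abstract are not described there and are not reproduced; the function name `confidence` and the sample baskets are assumptions made for illustration.

```python
def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent over a list of baskets."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Baskets that contain every item of the antecedent.
    n_a = sum(1 for t in transactions if antecedent <= set(t))
    # Baskets that contain every item of both sides of the rule.
    n_ab = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    return n_ab / n_a if n_a else 0.0
```

For example, with the baskets {bread, milk}, {bread, butter}, {bread, milk, butter}, and {milk}, the rule bread -> milk holds in two of the three baskets that contain bread.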
Review of statistical network analysis: models, algorithms, and software
 STATISTICAL ANALYSIS AND DATA MINING
, 2012
Probability matrix decomposition models
 Psychometrika
, 1996
Cited by 12 (7 self)
In this paper, we consider a class of models for two-way matrices with binary entries of 0 and 1. First, we consider Boolean matrix decomposition, conceptualize it as a latent response model (LRM) and, by making use of this conceptualization, generalize it to a larger class of matrix decomposition models. Second, probability matrix decomposition (PMD) models are introduced as a probabilistic version of this larger class of deterministic matrix decomposition models. Third, an algorithm for the computation of the maximum likelihood (ML) and the maximum a posteriori (MAP) estimates of the parameters of PMD models is presented. This algorithm is an EM algorithm, and is a special case of a more general algorithm that can be used for the whole class of LRMs. And fourth, as an example, a PMD model is applied to data on decision making in psychiatric diagnosis. Key words: Boolean matrix decomposition, latent response model, clustering, two-way data, incomplete data, EM algorithm, psychiatric diagnosis. Within the domain of data analysis binary data have often taken a special place. This paper deals with a collection of models for binary data, which in the simplest case will be two-way two-mode (i.e., a binary matrix). Throughout this paper, the first mode