Results 1–10 of 57
How many clusters? Which clustering method? Answers via model-based cluster analysis
 The Computer Journal
, 1998
Survey of clustering data mining techniques
, 2002
Abstract

Cited by 247 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields, such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique ...
Pajek - analysis and visualization of large networks
 Graph Drawing Software
, 2003
Abstract

Cited by 123 (3 self)
Pajek is a Windows program for the analysis and visualization of large networks with some tens or hundreds of thousands of vertices. In Slovenian, pajek means spider.
Parallel Algorithms for Hierarchical Clustering
 Parallel Computing
, 1995
Abstract

Cited by 80 (1 self)
Hierarchical clustering is a common method used to determine clusters of similar data points in multidimensional spaces. O(n^2) algorithms are known for this problem [3, 4, 10, 18]. This paper reviews important results for sequential algorithms and describes previous work on parallel algorithms for hierarchical clustering. Parallel algorithms to perform hierarchical clustering using several distance metrics are then described. Optimal PRAM algorithms using n/log n processors are given for the average link, complete link, centroid, median, and minimum variance metrics. Optimal butterfly and tree algorithms using n/log n processors are given for the centroid, median, and minimum variance metrics. Optimal asymptotic speedups are achieved for the best practical algorithm to perform clustering using the single link metric on an n/log n processor PRAM, butterfly, or tree. Keywords: hierarchical clustering, pattern analysis, parallel algorithm, butterfly network, PRAM algorithm.
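For concreteness, the sequential baseline that these parallel algorithms improve on can be sketched as a naive agglomerative procedure for the single link metric (a hypothetical Python illustration, not taken from the paper; as written it rescans all pairs and runs in O(n^3), which the cited O(n^2) sequential algorithms avoid):

```python
import math

def single_link(points, k):
    """Naive agglomerative single-link clustering: repeatedly merge the
    two clusters whose closest pair of points is nearest, until only k
    clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = (math.inf, 0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link distance: minimum over all cross-cluster pairs
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair of clusters
    return clusters

print(single_link([(0, 0), (0, 1), (10, 10), (10, 11)], 2))
```

Each merge step scans all remaining cluster pairs, which is what the smarter sequential algorithms (and the parallel ones described here) restructure away.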
New Techniques for Best-Match Retrieval
 ACM Transactions on Information Systems
, 1990
Abstract

Cited by 53 (5 self)
A scheme to answer best-match queries from a file containing a collection of objects is described. A best-match query is to find the objects in the file that are closest (according to some (dis)similarity measure) to a given target. Previous work [5, 33] suggests that one can reduce the number of comparisons required to achieve the desired results using the triangle inequality, starting with a data structure for the file that reflects some precomputed intra-file distances. We generalize the technique to allow the optimum use of any given set of precomputed intra-file distances. Some empirical results are presented which illustrate the effectiveness of our scheme and its performance relative to previous algorithms.
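The triangle-inequality idea can be sketched in a few lines (a hypothetical illustration using a single pivot object, not the paper's generalized scheme): since d(target, x) >= |d(target, p) - d(x, p)| for any pivot p, an object whose precomputed pivot distance makes that lower bound exceed the best distance found so far can be skipped without ever computing d(target, x).

```python
import math

def best_match(points, pivot_dists, target, dist):
    """Find the point closest to `target`, pruning with the triangle
    inequality. `pivot_dists[i]` is the precomputed distance from
    points[i] to a fixed pivot (here, points[0])."""
    pivot = points[0]
    d_tp = dist(target, pivot)          # one distance computation for the pivot
    best, best_d, comparisons = None, math.inf, 1
    for x, d_xp in zip(points, pivot_dists):
        if abs(d_tp - d_xp) >= best_d:  # lower bound already too large
            continue                    # skip: x cannot beat the current best
        d = dist(target, x)
        comparisons += 1
        if d < best_d:
            best, best_d = x, d
    return best, best_d, comparisons

euclid = lambda a, b: math.dist(a, b)
pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.1, 0.2)]
pd = [euclid(p, pts[0]) for p in pts]   # precomputed intra-file distances
print(best_match(pts, pd, (0.0, 0.1), euclid))
```

With a good early best, most objects are rejected by the cheap lower-bound test; the paper's contribution is making optimal use of an arbitrary set of such precomputed distances rather than a single pivot.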
Collective, Hierarchical Clustering from Distributed, Heterogeneous Data
, 1999
Abstract

Cited by 44 (8 self)
This paper presents the Collective Hierarchical Clustering (CHC) algorithm for analyzing distributed, heterogeneous data. This algorithm first generates local cluster models and then combines them to generate the global cluster model of the data. The proposed algorithm runs in O(|S|n^2) time, with an O(|S|n) space requirement and an O(n) communication requirement, where n is the number of elements in the data set and |S| is the number of data sites. This approach shows significant improvement over naive methods with O(n^2) communication costs in the case that the entire distance matrix is transmitted, and O(nm) communication costs to centralize the data, where m is the total number of features. A specific implementation based on single link clustering, and results comparing its performance with that of a centralized clustering algorithm, are presented. An analysis of the algorithm complexity, in terms of overall computation time and communication requirements, is presented.
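A back-of-the-envelope check makes the communication savings concrete (the numbers below are illustrative choices, not taken from the paper):

```python
# Communication cost, in transmitted values, for the three strategies the
# abstract compares (n, S, m are hypothetical example values):
n = 100_000   # elements in the data set
S = 10        # number of data sites, |S|
m = 50        # total number of features
chc = n                  # O(n): CHC's communication requirement
centralize = n * m       # O(nm): ship the raw data to one site
full_matrix = n * n      # O(n^2): ship the entire distance matrix
time_ops = S * n * n     # O(|S|n^2): CHC's computation time, for scale
print(chc, centralize, full_matrix)
```

At these sizes CHC communicates orders of magnitude less than either naive strategy, trading communication for the O(|S|n^2) local computation.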
Relationship-Based Clustering and Visualization for High-Dimensional Data Mining
 INFORMS Journal on Computing
, 2002
Abstract

Cited by 40 (10 self)
In several real-life data-mining... This paper proposes a relationship-based approach that alleviates both problems, sidestepping the "curse of dimensionality" issue by working in a suitable similarity space instead of the original high-dimensional attribute space. This intermediary similarity space can be suitably tailored to satisfy business criteria, such as requiring customer clusters to represent comparable amounts of revenue. We apply efficient and scalable graph-partitioning-based clustering techniques in this space. The output from the clustering algorithm is used to reorder the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. While two-dimensional visualization of a similarity matrix is by itself not novel, its combination with the order-sensitive partitioning of a graph that captures the relevant similarity measure between objects provides three powerful properties: (i) the high dimensionality of the data does not affect further processing once the similarity space is formed; (ii) it leads to clusters of (approximately) equal importance; and (iii) related clusters show up adjacent to one another, further facilitating the visualization of results. The visualization is very helpful for assessing and improving clustering. For example, actionable recommendations for splitting or merging of clusters can be easily derived, and it also guides the user toward the right number of clusters.
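The reordering step itself is simple to illustrate (a toy sketch with made-up numbers, not the authors' algorithm): given cluster labels, permute the rows and columns of the similarity matrix so that members of the same cluster are contiguous, which makes clusters appear as dense diagonal blocks, or bands.

```python
# Toy similarity matrix over 4 objects; objects 0 and 2 are similar, as are
# 1 and 3, but the row order interleaves the two clusters.
labels = [0, 1, 0, 1]                      # hypothetical cluster labels
sim = [[1.0, 0.1, 0.9, 0.2],
       [0.1, 1.0, 0.2, 0.8],
       [0.9, 0.2, 1.0, 0.1],
       [0.2, 0.8, 0.1, 1.0]]
# Permute rows and columns so same-cluster objects become contiguous.
order = sorted(range(len(labels)), key=lambda i: labels[i])
banded = [[sim[i][j] for j in order] for i in order]
for row in banded:
    print(row)
```

After the permutation the high similarities sit in blocks on the diagonal, which is exactly what makes the two-dimensional visualization readable.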
Pajek - program for analysis and visualization of large networks
 Timeshift - The World in Twenty-Five Years: Ars Electronica
, 2004
"... List of commands with short explanation ..."
The Haar wavelet transform of a dendrogram
 Journal of Classification
, 2007
Abstract

Cited by 16 (5 self)
We consider the wavelet transform of a finite, rooted, node-ranked, p-way tree, focusing on the case of binary (p = 2) trees. We study a Haar wavelet transform on this tree. Wavelet transforms allow for multiresolution analysis through translation and dilation of a wavelet function. We explore how this works in our tree context.
K-means Clustering Algorithm for Categorical Attributes
Abstract

Cited by 16 (0 self)
 Add to MetaCart
Efficient partitioning of large data sets into homogeneous clusters is a fundamental problem in data mining. The hierarchical clustering methods are not adaptable because of their high computational complexity. The K-means based algorithms give promising results for their efficiency. However, their use is often limited to numeric data. The quality of the clusters produced depends on the initialization of the clusters and the order in which data elements are processed during the iteration. We present a method which is based on the K-means philosophy but removes the numeric-data limitation.
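One common way to lift the numeric restriction, in the spirit of what the abstract describes, is to replace cluster means with attribute-wise modes and Euclidean distance with a simple mismatch count (a hypothetical k-modes-style sketch, not necessarily the authors' exact method):

```python
from collections import Counter
import random

def k_modes(data, k, iters=10, seed=0):
    """K-means-style clustering for categorical tuples: distance is the
    number of mismatched attributes, and each center is the attribute-wise
    mode of its cluster."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)         # initialize centers from the data
    mismatch = lambda a, b: sum(x != y for x, y in zip(a, b))
    for _ in range(iters):
        # Assignment step: each row joins its nearest center by mismatch count.
        clusters = [[] for _ in range(k)]
        for row in data:
            nearest = min(range(k), key=lambda i: mismatch(row, centers[i]))
            clusters[nearest].append(row)
        # Update step: new center = attribute-wise mode of each cluster.
        for i, c in enumerate(clusters):
            if c:
                centers[i] = tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*c))
    return centers, clusters

data = [("red", "small"), ("red", "small"), ("blue", "large"),
        ("blue", "large"), ("red", "medium")]
centers, clusters = k_modes(data, 2)
```

As the abstract notes for K-means generally, the result of such a procedure still depends on the initial centers (the `seed` here) and the processing order.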