Results 1-10 of 85
How many clusters? Which clustering method? Answers via model-based cluster analysis
 THE COMPUTER JOURNAL
, 1998
Survey of clustering data mining techniques
, 2002
Cited by 286 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
Pajek - analysis and visualization of large networks
 GRAPH DRAWING SOFTWARE
, 2003
Cited by 148 (4 self)
Pajek is a program, for Windows, for analysis and visualization of large networks having some tens or hundreds of thousands of vertices. In Slovenian, pajek means spider.
Chemical similarity searching
 J. Chem. Inf. Comput. Sci
, 1998
Cited by 110 (12 self)
This paper reviews the use of similarity searching in chemical databases. It begins by introducing the concept of similarity searching, differentiating it from the more common substructure searching, and then discusses the current generation of fragment-based measures that are used for searching chemical structure databases. The next sections focus upon two of the principal characteristics of a similarity measure: the coefficient that is used to quantify the degree of structural resemblance between pairs of molecules and the structural representations that are used to characterize molecules that are being compared in a similarity calculation. New types of similarity measure are then compared with current approaches, and examples are given of several applications that are related to similarity searching.
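As a rough illustration of what a similarity coefficient over a fragment-based representation can look like, the sketch below scores molecules with the Tanimoto coefficient. The abstract above does not name a specific coefficient, so the choice of Tanimoto is an assumption, as are the integer fragment identifiers and the function names `tanimoto` and `rank_by_similarity`.

```python
from typing import AbstractSet, Dict, List, Tuple


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto coefficient between two sets of fragment identifiers.

    Assumption: each molecule is represented as the set of (hypothetical)
    integer ids of the structural fragments it contains.
    """
    if not a and not b:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_similarity(
    target: AbstractSet[int], database: Dict[str, AbstractSet[int]]
) -> List[Tuple[float, str]]:
    """Return (score, name) pairs, most similar to the target first."""
    scored = [(tanimoto(target, frags), name) for name, frags in database.items()]
    # Sort by descending score; break ties by name for a stable ordering.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

A similarity search then amounts to calling `rank_by_similarity` with the target molecule's fragment set and keeping the top-ranked entries, in contrast to a substructure search, which returns only exact containment matches.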
Parallel Algorithms for Hierarchical Clustering
 Parallel Computing
, 1995
Cited by 84 (1 self)
Hierarchical clustering is a common method used to determine clusters of similar data points in multidimensional spaces. O(n^2) algorithms are known for this problem [3, 4, 10, 18]. This paper reviews important results for sequential algorithms and describes previous work on parallel algorithms for hierarchical clustering. Parallel algorithms to perform hierarchical clustering using several distance metrics are then described. Optimal PRAM algorithms using n/log n processors are given for the average link, complete link, centroid, median, and minimum variance metrics. Optimal butterfly and tree algorithms using n/log n processors are given for the centroid, median, and minimum variance metrics. Optimal asymptotic speedups are achieved for the best practical algorithm to perform clustering using the single link metric on an n/log n processor PRAM, butterfly, or tree. Keywords. Hierarchical clustering, pattern analysis, parallel algorithm, butterfly network, PRAM algorithm.
New Techniques for Best-Match Retrieval
 ACM Transactions on Information Systems
, 1990
Cited by 55 (5 self)
A scheme to answer best-match queries from a file containing a collection of objects is described. A best-match query is to find the objects in the file that are closest (according to some (dis)similarity measure) to a given target. Previous work [5, 33] suggests that one can reduce the number of comparisons required to achieve the desired results using the triangle inequality, starting with a data structure for the file that reflects some precomputed intrafile distances. We generalize the technique to allow the optimum use of any given set of precomputed intrafile distances. Some empirical results are presented which illustrate the effectiveness of our scheme, and its performance relative to previous algorithms.
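The idea of using the triangle inequality with precomputed intrafile distances to skip comparisons can be sketched as follows. This is a simplified single-pivot variant written for illustration, not the scheme of the paper; the function names, the `pivot_index` parameter, and the returned comparison count are all assumptions of this sketch. It requires `dist` to be a metric.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def precompute_pivot_distances(
    objects: Sequence[T], dist: Callable[[T, T], float], pivot_index: int = 0
) -> List[float]:
    """Precompute d(pivot, x) for every object x in the file (done once)."""
    pivot = objects[pivot_index]
    return [dist(pivot, x) for x in objects]


def best_match(
    query: T,
    objects: Sequence[T],
    dist: Callable[[T, T], float],
    pivot_to: Sequence[float],
    pivot_index: int = 0,
) -> Tuple[int, float, int]:
    """Return (index, distance, comparisons made) of the closest object."""
    d_qp = dist(query, objects[pivot_index])
    best_i, best_d = pivot_index, d_qp
    comparisons = 1
    for i, x in enumerate(objects):
        if i == pivot_index:
            continue
        # Triangle inequality: d(query, x) >= |d(query, pivot) - d(pivot, x)|.
        lower_bound = abs(d_qp - pivot_to[i])
        if lower_bound >= best_d:
            continue  # x cannot beat the current best, so skip the comparison
        d = dist(query, x)
        comparisons += 1
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d, comparisons
```

With more precomputed distances (several pivots, for example), the lower bound becomes the maximum over the individual bounds, which is where the choice of which intrafile distances to store starts to matter.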
Ice floe identification in satellite images using mathematical morphology and clustering about principal curves
 JASA
, 1992
Cited by 50 (5 self)
Identification of ice floes and their outlines in satellite images is important for understanding physical processes in the polar regions, for transportation in ice-covered seas and for the design of offshore structures intended to survive in the presence of ice. At present this is done manually, a long and tedious process which precludes full use of the great volume of relevant images now available. We describe an automatic and accurate method for identifying ice floes and their outlines. Floe outlines are modeled as closed principal curves (Hastie and Stuetzle, 1989), a flexible class of smooth nonparametric curves. We propose a robust method of estimating closed principal curves which reduces both bias and variance. Initial estimates of floe outlines come from the erosion-propagation (EP) algorithm, which combines erosion from mathematical morphology with local propagation of information about floe edges. The edge pixels from the EP algorithm are grouped into floe outlines
Collective, Hierarchical Clustering from Distributed, Heterogeneous Data
, 1999
Cited by 46 (8 self)
This paper presents the Collective Hierarchical Clustering (CHC) algorithm for analyzing distributed, heterogeneous data. This algorithm first generates local cluster models and then combines them to generate the global cluster model of the data. The proposed algorithm runs in O(|S|n^2) time, with a O(|S|n) space requirement and O(n) communication requirement, where n is the number of elements in the data set and |S| is the number of data sites. This approach shows significant improvement over naive methods with O(n^2) communication costs in the case that the entire distance matrix is transmitted and O(nm) communication costs to centralize the data, where m is the total number of features. A specific implementation based on the single link clustering and results comparing its performance with that of a centralized clustering algorithm are presented. An analysis of the algorithm complexity, in terms of overall computation time and communication requirements, is pres...
Relationship-Based Clustering and Visualization for High-Dimensional Data Mining
 INFORMS Journal on Computing
, 2002
Cited by 43 (10 self)
In several real-life data-mining... This paper proposes a relationship-based approach that alleviates both problems, sidestepping the "curse-of-dimensionality" issue by working in a suitable similarity space instead of the original high-dimensional attribute space. This intermediary similarity space can be suitably tailored to satisfy business criteria such as requiring customer clusters to represent comparable amounts of revenue. We apply efficient and scalable graph-partitioning-based clustering techniques in this space. The output from the clustering algorithm is used to reorder the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. While two-dimensional visualization of a similarity matrix is by itself not novel, its combination with the order-sensitive partitioning of a graph that captures the relevant similarity measure between objects provides three powerful properties: (i) the high dimensionality of the data does not affect further processing once the similarity space is formed; (ii) it leads to clusters of (approximately) equal importance, and (iii) related clusters show up adjacent to one another, further facilitating the visualization of results. The visualization is very helpful for assessing and improving clustering. For example, actionable recommendations for splitting or merging of clusters can be easily derived, and it also guides the user toward the right number of clusters
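The reordering step described in this abstract, permuting the similarity matrix so that members of the same cluster sit next to each other, can be sketched as below. The cluster labels are assumed to be given already (the graph-partitioning step is not shown), and the function name `reorder_by_cluster` is hypothetical.

```python
from typing import List, Sequence, Tuple


def reorder_by_cluster(
    similarity: Sequence[Sequence[float]], labels: Sequence[int]
) -> Tuple[List[int], List[List[float]]]:
    """Permute rows and columns so points in the same cluster are contiguous.

    Assumption: labels[i] is the cluster id already assigned to point i.
    Returns the permutation and the permuted similarity matrix.
    """
    # Group indices by cluster label, keeping original order within a cluster.
    order = sorted(range(len(labels)), key=lambda i: (labels[i], i))
    # Apply the same permutation to rows and to columns.
    permuted = [[similarity[i][j] for j in order] for i in order]
    return order, permuted
```

Rendering `permuted` as a grayscale image would then show each cluster as a bright block along the diagonal when within-cluster similarities are high.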
Document Clustering for Electronic Meetings: An Experimental Comparison of Two Techniques
 of 16QAM Digital PLL Based Demodulators", Proc. Globecom94
, 1994
Cited by 22 (6 self)
In this article, we report our implementation and comparison of two text clustering techniques. One is based on Ward's clustering and the other on Kohonen's Self-Organizing Maps. We have evaluated how closely clusters produced by a computer resemble those created by human experts. We have also measured the time that it takes for an expert to "clean up" the automatically produced clusters. The technique based on Ward's clustering was found to be more precise. Both techniques have worked equally well in detecting associations between text documents. We used text messages obtained from group brainstorming meetings.