Results 1  10
of
12
The history of the cluster heat map
 The American Statistician
, 2009
"... The cluster heat map is an ingenious display that simultaneously reveals row and column hierarchical cluster structure in a data matrix. It consists of a rectangular tiling with each tile shaded on a color scale to represent the value of the corresponding element of the data matrix. The rows (column ..."
Abstract

Cited by 37 (0 self)
 Add to MetaCart
The cluster heat map is an ingenious display that simultaneously reveals row and column hierarchical cluster structure in a data matrix. It consists of a rectangular tiling with each tile shaded on a color scale to represent the value of the corresponding element of the data matrix. The rows (columns) of the tiling are ordered such that similar rows (columns) are near each other. On the vertical and horizontal margins of the tiling there are hierarchical cluster trees. This cluster heat map is a synthesis of several different graphic displays developed by statisticians over more than a century. We locate the earliest sources of this display in late 19th century publications. And we trace a diverse 20th century statistical literature that provided a foundation for this most widely used of all bioinformatics displays. 1
Clustering in ordered dissimilarity data
 Int. J. Intelligent Systems
"... This paper presents a new technique for clustering either object or relational data. First, the data are represented as a matrix D of dissimilarity values. D is reordered to D ∗ using a visual assessment of cluster tendency algorithm. If the data contain clusters, they are suggested by visually appa ..."
Abstract

Cited by 3 (1 self)
 Add to MetaCart
This paper presents a new technique for clustering either object or relational data. First, the data are represented as a matrix D of dissimilarity values. D is reordered to D ∗ using a visual assessment of cluster tendency algorithm. If the data contain clusters, they are suggested by visually apparent dark squares arrayed along the main diagonal of an image I (D∗) of D∗. The suggested clusters in the object set underlying the reordered relational data are found by defining an objective function that recognizes this blocky structure in the reordered data. The objective function is optimized when the boundaries in I (D∗) are matched by those in an aligned partition of the objects. The objective function combines measures of contrast and edginess and is optimized by particle swarm optimization. We prove that the set of aligned partitions is exponentially smaller than the set of partitions that needs to be searched if clusters are sought in D. Six numerical examples are given to illustrate various facets of the algorithm. C © 2009 Wiley Periodicals, Inc. 1.
Dissimilarity Plots: A Visual Exploration Tool for Partitional Clustering
, 2009
"... For hierarchical clustering, dendrograms provide convenient and powerful visualization. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between ..."
Abstract

Cited by 3 (2 self)
 Add to MetaCart
(Show Context)
For hierarchical clustering, dendrograms provide convenient and powerful visualization. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this paper we extend (dissimilarity) matrix shading with several reordering steps based on seriation. Both methods, matrix shading and seriation, have been wellknown for a long time. However, only recent algorithmic improvements allow to use seriation for larger problems. Furthermore, seriation is used in a novel stepwise process (within each cluster and between clusters) which leads to a visualization technique that is independent of the dimensionality of the data. A big advantage is that it presents the structure between clusters and the microstructure within clusters in one concise plot. This not only allows for judging cluster quality but also makes misspecification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples.
Seriation in the Presence of Errors: A Factor 16 Approximation Algorithm for l∞Fitting Robinson Structures to Distances
 ALGORITHMICA
, 2007
"... The classical seriation problem consists in finding a permutation of the rows and the columns of the distance (or, more generally, dissimilarity) matrix d on a finite set X so that small values should be concentrated around the main diagonal as close as possible, whereas large values should fall as ..."
Abstract
 Add to MetaCart
The classical seriation problem consists in finding a permutation of the rows and the columns of the distance (or, more generally, dissimilarity) matrix d on a finite set X so that small values should be concentrated around the main diagonal as close as possible, whereas large values should fall as far from it as possible. This goal is best achieved by considering the Robinson property: a distance dR on X is Robinsonian if its matrix can be symmetrically permuted so that its elements do not decrease when moving away from the main diagonal along any row or column. If the distance d fails to satisfy the Robinson property, then we are lead to the problem of finding a reordering of d which is as close as possible to a Robinsonian distance. In this paper, we present a factor 16 approximation algorithm for the following NPhard fitting problem: given a finite set X and a dissimilarity d on X, wewish to find a Robinsonian dissimilarity dR on X minimizing the lâerror âd â dRâ â = maxx,yâX{d(x,y) â dR(x, y)} between d and dR.
Finding the Number of Clusters in Unlabeled Datasets using Extended Dark Block Extraction
"... Clustering analysis is the problem of partitioning a set of objects O = {o1 … on} into c selfsimilar subsets based on available data. In general, clustering of unlabeled data poses three major problems: 1) assessing cluster tendency, i.e., how many clusters to seek? 2) Partitioning the data into c ..."
Abstract
 Add to MetaCart
(Show Context)
Clustering analysis is the problem of partitioning a set of objects O = {o1 … on} into c selfsimilar subsets based on available data. In general, clustering of unlabeled data poses three major problems: 1) assessing cluster tendency, i.e., how many clusters to seek? 2) Partitioning the data into c meaningful groups, and 3) validating the c clusters that are discovered. We address the first problem, i.e., determining the number of clusters c prior to clustering. Many clustering algorithms require number of clusters as an input parameter, so the quality of the clusters mainly depends on this value. Most methods are post clustering measures of cluster validity i.e., they attempt to choose the best partition from a set of alternative partitions. In contrast, tendency assessment attempts to estimate c before clustering occurs. Here, we represent the structure of the unlabeled data sets as a Reordered Dissimilarity Image (RDI), where pair wise dissimilarity information about a data set including ‗n ‘ objects is represented as nxn image. RDI is generated using VAT (Visual Assessment of Cluster tendency), RDI highlights potential clusters as a set of ―dark blocks ‖ along the diagonal of the image. So, number of clusters can be easily estimated using the number of dark blocks across the diagonal. We develop a new method called ―Extended Dark Block Extraction (EDBE) for counting the number of clusters formed along the diagonal of the RDI. EDBE method combines several image and signal processing techniques.
Semantic clustering: Identify D
"... up, cce e 4 Many of the existing approaches in Software Comprehension focus on program structure or external documentation. However, by analyzing formal information the informal semantics contained in the vocabulary of source code are overlooked. To understand software the main activities in softwar ..."
Abstract
 Add to MetaCart
(Show Context)
up, cce e 4 Many of the existing approaches in Software Comprehension focus on program structure or external documentation. However, by analyzing formal information the informal semantics contained in the vocabulary of source code are overlooked. To understand software the main activities in software reengineering, it is estimated ‘‘The informal linguistic information that the software Languages are a means of communication, and programming languages are no different. Source code contains two levels of communication: human–machine communication through program instructions, and human to human communications through names of identifiers and comments. Let us consider a small code example:
l l l l
, 2009
"... Clustering: assignment of objects to groups (clusters) so that objects from the same cluster are more similar to each other than objects from different clusters. l l ..."
Abstract
 Add to MetaCart
(Show Context)
Clustering: assignment of objects to groups (clusters) so that objects from the same cluster are more similar to each other than objects from different clusters. l l
Semantic Clustering: Identifying Topics in Source Code To appear in Journal on Information Systems and Technologies
"... Many of the existing approaches in Software Comprehension focus on program program structure or external documentation. However, by analyzing formal information the informal semantics contained in the vocabulary of source code are overlooked. To understand software as a whole, we need to enrich soft ..."
Abstract
 Add to MetaCart
(Show Context)
Many of the existing approaches in Software Comprehension focus on program program structure or external documentation. However, by analyzing formal information the informal semantics contained in the vocabulary of source code are overlooked. To understand software as a whole, we need to enrich software analysis with the developer knowledge hidden in the code naming. This paper proposes the use of information retrieval to exploit linguistic information found in source code, such as identifier names and comments. We introduce Semantic Clustering, a technique based on Latent Semantic Indexing and clustering to group source artifacts that use similar vocabulary. We call these groups semantic clusters and we interpret them as linguistic topics that reveal the intention of the code. We compare the topics to each other, identify links between them, provide automatically retrieved labels, and use a visualization to illustrate how they are distributed over the system. Our approach is language independent as it works at the level of identifier names. To validate our approach we applied it on several case studies, two of which we present in this paper. Note: Some of the visualizations presented make heavy use of colors. Please obtain a color copy of the article for better understanding.
Classification Visualization with Shaded Similarity Matrix
"... Shaded similarity matrix has long been used in visual cluster analysis. This paper investigates how it can be used in classification visualization. We focus on two popular classification methods: nearest neighbor and decision tree. Ensemble classifier visualization is also presented for handling lar ..."
Abstract
 Add to MetaCart
(Show Context)
Shaded similarity matrix has long been used in visual cluster analysis. This paper investigates how it can be used in classification visualization. We focus on two popular classification methods: nearest neighbor and decision tree. Ensemble classifier visualization is also presented for handling large data sets.