The history of the cluster heat map
The American Statistician, 2009
Abstract

Cited by 17 (0 self)
The cluster heat map is an ingenious display that simultaneously reveals row and column hierarchical cluster structure in a data matrix. It consists of a rectangular tiling with each tile shaded on a color scale to represent the value of the corresponding element of the data matrix. The rows (columns) of the tiling are ordered such that similar rows (columns) are near each other. On the vertical and horizontal margins of the tiling there are hierarchical cluster trees. This cluster heat map is a synthesis of several different graphic displays developed by statisticians over more than a century. We locate the earliest sources of this display in late 19th century publications, and we trace a diverse 20th century statistical literature that provided a foundation for this most widely used of all bioinformatics displays.
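The ordering step described in this abstract (similar rows and similar columns placed near each other) can be sketched as follows. This is only an illustration, not the method of any particular paper: it assumes Euclidean distance and single-linkage agglomerative clustering, and the function names `leaf_order` and `cluster_heat_map_order` are hypothetical. Drawing the shaded tiles and the marginal trees is left out.

```python
from math import dist


def leaf_order(vectors):
    """Return the leaf order of a single-linkage dendrogram (assumed linkage)."""
    # Each cluster is stored as an ordered list of original indices.
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0]


def cluster_heat_map_order(matrix):
    """Reorder rows and columns so that similar ones end up adjacent."""
    rows = leaf_order(matrix)
    cols = leaf_order([list(col) for col in zip(*matrix)])
    reordered = [[matrix[r][c] for c in cols] for r in rows]
    return rows, cols, reordered
```

The reordered matrix is what would be shaded tile by tile; a plotting library would then draw the row and column trees in the margins.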
Dissimilarity Plots: A Visual Exploration Tool for Partitional Clustering
2009
Abstract

Cited by 1 (1 self)
For hierarchical clustering, dendrograms provide convenient and powerful visualization. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this paper we extend (dissimilarity) matrix shading with several reordering steps based on seriation. Both methods, matrix shading and seriation, have been well known for a long time. However, only recent algorithmic improvements allow seriation to be used for larger problems. Furthermore, seriation is used in a novel stepwise process (within each cluster and between clusters), which leads to a visualization technique that is independent of the dimensionality of the data. A big advantage is that it presents the structure between clusters and the microstructure within clusters in one concise plot. This not only allows for judging cluster quality but also makes misspecification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples.
Finding the Number of Clusters in Unlabeled Datasets using Extended Dark Block Extraction
Abstract
Clustering analysis is the problem of partitioning a set of objects O = {o1 … on} into c self-similar subsets based on available data. In general, clustering of unlabeled data poses three major problems: 1) assessing cluster tendency, i.e., how many clusters to seek; 2) partitioning the data into c meaningful groups; and 3) validating the c clusters that are discovered. We address the first problem, i.e., determining the number of clusters c prior to clustering. Many clustering algorithms require the number of clusters as an input parameter, so the quality of the clusters mainly depends on this value. Most methods are post-clustering measures of cluster validity, i.e., they attempt to choose the best partition from a set of alternative partitions. In contrast, tendency assessment attempts to estimate c before clustering occurs. Here, we represent the structure of an unlabeled data set as a Reordered Dissimilarity Image (RDI), in which pairwise dissimilarity information about a data set of n objects is represented as an n × n image. The RDI is generated using VAT (Visual Assessment of Cluster Tendency) and highlights potential clusters as a set of "dark blocks" along the diagonal of the image, so the number of clusters can be estimated from the number of dark blocks along the diagonal. We develop a new method called Extended Dark Block Extraction (EDBE) for counting the number of clusters formed along the diagonal of the RDI. The EDBE method combines several image and signal processing techniques.
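As an illustration of how a reordered dissimilarity matrix can be produced, the sketch below follows the commonly described VAT procedure: start at an endpoint of the largest dissimilarity, then repeatedly append the unvisited object closest to any visited one. It assumes a symmetric dissimilarity matrix given as a list of lists; the function names `vat_order` and `reorder` are hypothetical, and the EDBE block-counting step itself is not reproduced because the abstract does not specify it.

```python
def vat_order(R):
    """Return a VAT-style permutation for a symmetric dissimilarity matrix R."""
    n = len(R)
    # Start from one endpoint of the largest dissimilarity in R.
    _, start = max((R[i][j], i) for i in range(n) for j in range(n))
    order = [start]
    remaining = set(range(n)) - {start}
    while remaining:
        # Prim-like step: add the unvisited object closest to any visited one.
        _, nxt = min((R[p][q], q) for p in order for q in remaining)
        order.append(nxt)
        remaining.remove(nxt)
    return order


def reorder(R, order):
    """Apply the same permutation to rows and columns of R."""
    return [[R[i][j] for j in order] for i in order]
```

Rendering the reordered matrix as a grayscale image (small values dark) gives the picture in which compact groups appear as dark blocks along the diagonal.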
Seriation in the Presence of Errors: A Factor 16 Approximation Algorithm for l∞-Fitting Robinson Structures to Distances
ALGORITHMICA, 2007
Abstract
The classical seriation problem consists in finding a permutation of the rows and the columns of the distance (or, more generally, dissimilarity) matrix d on a finite set X so that small values are concentrated as close as possible to the main diagonal, whereas large values fall as far from it as possible. This goal is best achieved by considering the Robinson property: a distance dR on X is Robinsonian if its matrix can be symmetrically permuted so that its elements do not decrease when moving away from the main diagonal along any row or column. If the distance d fails to satisfy the Robinson property, then we are led to the problem of finding a reordering of d which is as close as possible to a Robinsonian distance. In this paper, we present a factor 16 approximation algorithm for the following NP-hard fitting problem: given a finite set X and a dissimilarity d on X, we wish to find a Robinsonian dissimilarity dR on X minimizing the l∞-error ‖d − dR‖∞ = max_{x,y∈X} |d(x, y) − dR(x, y)| between d and dR.
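The two definitions in this abstract can be made concrete with a small sketch. It checks whether a symmetric matrix, in its given row and column order, satisfies the Robinson property, and computes the l∞-error between two dissimilarity matrices. The function names `is_robinson` and `linf_error` are hypothetical, and the factor 16 approximation algorithm itself is not implemented here.

```python
def is_robinson(D):
    """Check the Robinson property for a symmetric matrix in its given order."""
    n = len(D)
    for i in range(n):
        for j in range(i, n):
            # Moving right along row i, away from the diagonal.
            if j + 1 < n and D[i][j + 1] < D[i][j]:
                return False
            # Moving up along column j, away from the diagonal.
            if i > 0 and D[i - 1][j] < D[i][j]:
                return False
    return True


def linf_error(d, d_r):
    """Largest absolute entrywise difference between two matrices."""
    return max(abs(a - b)
               for row_d, row_r in zip(d, d_r)
               for a, b in zip(row_d, row_r))
```

Only the upper triangle is inspected; for a symmetric matrix the lower triangle then satisfies the same condition.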
Concept Tree Based Clustering Visualization with Shaded Similarity Matrices
Abstract
One of the problems with existing clustering methods is that the interpretation of clusters may be difficult. Two different approaches have been used to solve this problem: conceptual clustering in machine learning and clustering visualization in statistics and graphics. The purpose of this paper is to investigate the benefits of combining clustering visualization and conceptual clustering to obtain better cluster interpretations. In our research we have combined concept trees for conceptual clustering with shaded similarity matrices for visualization. Experimentation shows that the two interpretation approaches can complement each other to help us understand data better.
Classification Visualization with Shaded Similarity Matrix
Abstract
The shaded similarity matrix has long been used in visual cluster analysis. This paper investigates how it can be used in classification visualization. We focus on two popular classification methods: nearest neighbor and decision tree. Ensemble classifier visualization is also presented for handling large data sets.
Semantic Clustering: Identifying Topics in Source Code
To appear in Journal on Information Systems and Technologies
Abstract
Many of the existing approaches in Software Comprehension focus on program structure or external documentation. However, by analyzing formal information, the informal semantics contained in the vocabulary of source code are overlooked. To understand software as a whole, we need to enrich software analysis with the developer knowledge hidden in the code naming. This paper proposes the use of information retrieval to exploit linguistic information found in source code, such as identifier names and comments. We introduce Semantic Clustering, a technique based on Latent Semantic Indexing and clustering to group source artifacts that use similar vocabulary. We call these groups semantic clusters and we interpret them as linguistic topics that reveal the intention of the code. We compare the topics to each other, identify links between them, provide automatically retrieved labels, and use a visualization to illustrate how they are distributed over the system. Our approach is language independent as it works at the level of identifier names. To validate our approach we applied it to several case studies, two of which we present in this paper. Note: Some of the visualizations presented make heavy use of colors. Please obtain a color copy of the article for better understanding.
History Corner The History of the Cluster Heat Map
Abstract
The cluster heat map is an ingenious display that simultaneously reveals row and column hierarchical cluster structure in a data matrix. It consists of a rectangular tiling, with each tile shaded on a color scale to represent the value of the corresponding element of the data matrix. The rows (columns) of the tiling are ordered such that similar rows (columns) are near each other. On the vertical and horizontal margins of the tiling are hierarchical cluster trees. This cluster heat map is a synthesis of several different graphic displays developed by statisticians over more than a century. We locate the earliest sources of this display in late 19th century publications, and trace a diverse 20th century statistical literature that provided a foundation for this most widely used of all bioinformatics displays.
KEY WORDS: Cluster analysis; Heatmap; Microarray; Visualization.