Results 1 - 10 of 48
Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data
 MACHINE LEARNING, FUNCTIONAL GENOMICS SPECIAL ISSUE
, 2003
Model-Based Clustering and Data Transformations for Gene Expression Data
, 2001
Cited by 124 (8 self)
Motivation: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications.
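As a rough illustration of the model-selection framing in the abstract above, a finite Gaussian mixture and one commonly used selection criterion can be written as follows. The notation (G components, mixing proportions \pi_k, m free parameters, n observations) is assumed here, and the listing does not say which criterion the paper uses; BIC is shown only as a typical choice.

```latex
f(x) = \sum_{k=1}^{G} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{G} \pi_k = 1

\mathrm{BIC} = 2 \log \hat{L} - m \log n
```

Under this sign convention, \hat{L} is the maximized likelihood and a larger BIC favors a model, so comparing BIC across values of G (and across covariance structures) turns the choice of the number of clusters into a model comparison.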
Orthogonal nonnegative matrix tri-factorizations for clustering
 In SIGKDD
, 2006
Cited by 66 (18 self)
Currently, most research on nonnegative matrix factorization (NMF) focuses on the 2-factor X = FG^T factorization. We provide a systematic analysis of 3-factor X = FSG^T NMF. While unconstrained 3-factor NMF is equivalent to unconstrained 2-factor NMF, constrained 3-factor NMF brings new features to constrained 2-factor NMF. We study the orthogonality constraint because it leads to a rigorous clustering interpretation. We provide new rules for updating F, S, G and prove the convergence of these algorithms. Experiments on 5 datasets and a real-world case study are performed to show the capability of bi-orthogonal 3-factor NMF to simultaneously cluster rows and columns of the input data matrix. We provide a new approach for evaluating the quality of clustering on words using class aggregate distribution and multi-peak distribution. We also provide an overview of various NMF extensions and examine their relationships.
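One way to write the bi-orthogonal 3-factor problem described above is sketched below. The squared Frobenius-norm loss is an assumption made for illustration, since the abstract does not state the objective; the factor names F, S, G follow the abstract.

```latex
\min_{F \ge 0,\; S \ge 0,\; G \ge 0} \; \lVert X - F S G^{T} \rVert_F^{2}
\quad \text{subject to} \quad F^{T} F = I, \quad G^{T} G = I
```

With nonnegative, orthogonal columns, each row of F (and of G) has at most one nonzero entry, which is what allows F and G to be read as cluster indicators for the rows and columns of X.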
SCAN: A Structural Clustering Algorithm for Networks
 In Proc. of 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
, 2004
Cited by 51 (3 self)
Network clustering (or graph partitioning) is an important task for the discovery of underlying structures in networks. Many algorithms find clusters by maximizing the number of intra-cluster edges. While such algorithms find useful and interesting structures, they tend to fail to identify and isolate two kinds of vertices that play special roles: vertices that bridge clusters (hubs) and vertices that are marginally connected to clusters (outliers). Identifying hubs is useful for applications such as viral marketing and epidemiology since hubs are responsible for spreading ideas or disease. In contrast, outliers have little or no influence, and may be isolated as noise in the data. In this paper, we propose a novel algorithm called SCAN (Structural Clustering Algorithm for Networks), which detects clusters, hubs and outliers in networks. It clusters vertices based on a structural similarity measure. The algorithm is fast and efficient, visiting each vertex only once. An empirical evaluation of the method using both synthetic and real datasets demonstrates superior performance over other methods such as the modularity-based algorithms.
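The abstract does not define the structural similarity measure. A minimal sketch, assuming a cosine-style similarity over closed neighborhoods (each vertex counted as its own neighbor), could look like this; the function name and the toy graph are hypothetical.

```python
from math import sqrt


def structural_similarity(adj, v, w):
    """Shared closed-neighborhood size, normalized by the geometric mean of both sizes."""
    nv = adj[v] | {v}  # closed neighborhood of v (assumption: v counts as its own neighbor)
    nw = adj[w] | {w}
    return len(nv & nw) / sqrt(len(nv) * len(nw))


# Hypothetical toy graph: two triangles (0-1-2 and 4-5-6) bridged through vertex 3.
adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4},
    4: {3, 5, 6},
    5: {4, 6},
    6: {4, 5},
}

inside = structural_similarity(adj, 0, 1)  # two vertices within the same triangle
bridge = structural_similarity(adj, 2, 3)  # an edge leading out to the bridging vertex
```

Under this assumed measure, vertices inside a dense group share most of their neighbors and score high, while edges toward a bridging vertex score lower, which is the kind of signal a hub or outlier test can be built on.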
An empirical study of Principal Component Analysis for clustering gene expression data
, 2001
Cited by 45 (4 self)
Motivation: There is a great need to develop analytical methodology to analyze and to exploit the information contained in gene expression data. Because of the large number of genes and the complexity of biological networks, clustering is a useful exploratory technique for analysis of gene expression data. Other classical techniques, such as principal component analysis (PCA), have also been applied to analyze gene expression data. Using different data analysis techniques and different clustering algorithms to analyze the same data set can lead to very different conclusions. Our goal is to study the effectiveness of principal components (PCs) in capturing cluster structure. In other words, we empirically compared the quality of clusters obtained from the original data set to the quality of clusters obtained from clustering the PCs using both real and synthetic gene expression data sets. Results: Our empirical study showed that clustering with the PCs instead of the original variables does not necessarily improve, and often degrades, cluster quality. In particular, the first few PCs (which contain most of the variation in the data) do not necessarily capture most of the cluster structure. We also showed that clustering with PCs has a different impact on different algorithms and different similarity metrics. Overall, we would not recommend PCA before clustering except in special circumstances. Availability: The software is under development. Contact: kayee cs.washington.edu Supplementary information: http://www.cs.washington.edu/homes/kayee/pca
Evaluation of fiber clustering methods for diffusion tensor imaging
 In IEEE Transactions on Visualization and Computer Graphics
, 2005
Cited by 37 (1 self)
Figure 1: (a) Cluttered image showing the fibers in a healthy brain by seeding in the whole volume. The color coding shows main eigenvalue. (b), (c), (d) Clustering results. The color coding represents the clusters. (b) Hierarchical clustering with single-link and mean distance between fibers. (c) The same as (b) but with closest point distance between fibers. (d) Shared nearest neighbor with mean distance between fibers.
Fiber tracking is a standard approach for the visualization of the results of Diffusion Tensor Imaging (DTI). If fibers are reconstructed and visualized individually through the complete white matter, the display easily becomes cluttered, making it difficult to gain insight into the data. Various clustering techniques have been proposed to automatically obtain bundles that should represent anatomical structures, but it is unclear which clustering methods and parameter settings give the best results. We propose a framework to validate clustering methods for white-matter fibers. Clusters are compared with a manual classification which is used as a ground truth. For the quantitative evaluation of the methods, we developed a new measure to assess the difference between the ground truth and the clusterings. The measure was validated and calibrated by presenting different clusterings to physicians and asking them for their judgement. We found that the values of our new measure for different clusterings match well with the opinions of physicians. Using this framework, we have evaluated different clustering algorithms, including shared nearest neighbor clustering, which has not been used before for this purpose. We found that hierarchical clustering using single-link and a fiber similarity measure based on the mean distance between fibers gave the best results.
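The abstract names two fiber distances without defining them. The sketch below shows one plausible reading, assuming each fiber is a list of 3D points: the closest point distance is the minimum over all point pairs, and the mean distance averages each point's distance to the nearest point on the other fiber, symmetrized over both directions. All function names are hypothetical and the paper's exact definitions may differ.

```python
from math import dist


def directed_mean_closest(a, b):
    """Average, over the points of fiber a, of the distance to the nearest point of fiber b."""
    return sum(min(dist(p, q) for q in b) for p in a) / len(a)


def mean_distance(a, b):
    """Symmetrized mean of closest-point distances (one assumed reading of 'mean distance')."""
    return 0.5 * (directed_mean_closest(a, b) + directed_mean_closest(b, a))


def closest_point_distance(a, b):
    """Smallest distance between any point of a and any point of b."""
    return min(dist(p, q) for p in a for q in b)


# Hypothetical fibers: b runs parallel to a, c shares a's start point and then diverges.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
b = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
c = [(0.0, 0.0, 0.0), (1.0, 3.0, 0.0), (2.0, 6.0, 0.0)]
```

The two measures can disagree: `closest_point_distance(a, c)` is zero because the fibers touch at one end, while `mean_distance(a, c)` stays positive because the fibers diverge along their length.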
Page-level template detection via isotonic smoothing
 In Proc. of International Conference on World Wide Web
, 2007
Cited by 21 (3 self)
We develop a novel framework for the page-level template detection problem. Our framework is built on two main ideas. The first is the automatic generation of training data for a classifier that, given a page, assigns a templateness score to every DOM node of the page. The second is the global smoothing of these per-node classifier scores by solving a regularized isotonic regression problem; the latter follows from a simple yet powerful abstraction of templateness on a page. Our extensive experiments on human-labeled test data show that our approach detects templates effectively.
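The paper's smoothing step is a regularized isotonic regression over the DOM tree, which the abstract does not spell out. As a much simpler stand-in, the sketch below shows plain (unregularized) isotonic regression on a single sequence using the pool-adjacent-violators algorithm; the idea that scores along one root-to-leaf path should be non-decreasing is an assumption made only for this illustration.

```python
def isotonic_fit(values):
    """Least-squares non-decreasing fit to a sequence via pool-adjacent-violators."""
    blocks = []  # each block is [sum of values, count]; block means stay non-decreasing
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every element of a block gets the block mean
    return fitted


# Hypothetical raw scores along one path; the dip at the third value violates monotonicity.
raw_scores = [1.0, 3.0, 2.0, 4.0]
smoothed = isotonic_fit(raw_scores)
```

The violating pair (3.0, 2.0) is pooled to its mean, so the fitted sequence is monotone while staying as close as possible to the raw scores in squared error.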
Weighted consensus clustering
, 2008
Cited by 18 (9 self)
Consensus clustering has emerged as an important extension of the classical clustering problem. We propose weighted consensus clustering, where each input clustering is weighted and the weights are determined in such a way that the final consensus clustering provides a better quality solution, in which clusters are better separated compared to standard consensus clustering. Theoretically, we show that a reformulation of the well-known L1 regularization LASSO problem is equivalent to the weight optimization of our weighted consensus clustering, and thus our approach provides sparse solutions which may resolve the difficult situation when the input clusterings diverge significantly. We also show that the weighted consensus clustering resolves the redundancy problem when many input clusterings correlate highly. Detailed algorithms are given. Experiments are carried out to demonstrate the effectiveness of the weighted consensus clustering.
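To make the idea of weighting input clusterings concrete, the sketch below builds a weighted co-association matrix, where entry (i, j) is the total normalized weight of the input clusterings that place items i and j together. The weights here are fixed by hand; the paper learns them through an optimization that this sketch does not attempt, and the function name is hypothetical.

```python
def weighted_coassociation(clusterings, weights):
    """Entry (i, j) sums the normalized weights of clusterings that co-cluster items i and j."""
    n = len(clusterings[0])
    total = sum(weights)
    matrix = [[0.0] * n for _ in range(n)]
    for labels, w in zip(clusterings, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += w / total
    return matrix


# Hypothetical input: three clusterings of four items, as label lists.
clusterings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
weights = [0.5, 0.25, 0.25]  # assumed weights, not learned
consensus = weighted_coassociation(clusterings, weights)
```

Setting a weight to zero removes that clustering's influence entirely, which is how a sparse weight vector can sideline input clusterings that are redundant or strongly divergent.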
Minimum Entropy Clustering and Applications to Gene Expression Analysis
 In Proceedings of IEEE Computational Systems Bioinformatics Conference
, 2004
Cited by 14 (0 self)
Clustering is a common methodology for analyzing gene expression data. In this paper, we present a new clustering algorithm from an information-theoretic point of view. First, we propose the minimum entropy (measured on a posteriori probabilities) criterion, which is the conditional entropy of clusters given the observations. Fano’s inequality indicates that it could be a good criterion for clustering. We generalize the criterion by replacing Shannon’s entropy with Havrda-Charvat’s structural α-entropy. Interestingly, the minimum entropy criterion based on structural α-entropy is equal to the probability of error of the nearest neighbor method when α = 2. This is further evidence that the proposed criterion is good for clustering. With a nonparametric approach for estimating a posteriori probabilities, an efficient iterative algorithm is then established to minimize the entropy. The experimental results show that the clustering algorithm performs significantly better than k-means/medians, hierarchical clustering, SOM, and EM in terms of adjusted Rand index. In particular, our algorithm performs very well even when the correct number of clusters is unknown. In addition, most clustering algorithms produce poor partitions in the presence of outliers, while our method can correctly reveal the structure of the data and effectively identify outliers simultaneously.
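A small sketch of the criterion described above, assuming each observation comes with a vector of a posteriori cluster probabilities: the quantity to minimize is the entropy of those vectors averaged over observations. The structural α-entropy uses the standard Havrda-Charvat normalization; the function names and sample probabilities are hypothetical, and the paper's nonparametric estimator and iterative minimization are not reproduced.

```python
from math import log


def shannon_entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(x * log(x) for x in p if x > 0)


def structural_alpha_entropy(p, alpha):
    """Havrda-Charvat structural alpha-entropy, for alpha > 0 and alpha != 1."""
    return (sum(x ** alpha for x in p) - 1.0) / (2.0 ** (1.0 - alpha) - 1.0)


def mean_posterior_entropy(posteriors, entropy):
    """Average entropy of per-observation posterior vectors (the clustering criterion)."""
    return sum(entropy(p) for p in posteriors) / len(posteriors)


# Hypothetical posteriors over two clusters for three observations.
confident = [[0.9, 0.1], [0.05, 0.95], [1.0, 0.0]]
ambiguous = [[0.5, 0.5], [0.6, 0.4], [0.45, 0.55]]


def quadratic(p):
    return structural_alpha_entropy(p, 2.0)  # alpha = 2 gives 2 * (1 - sum of squares)
```

Confident assignments give a lower criterion value than ambiguous ones under either entropy, for example `mean_posterior_entropy(confident, quadratic) < mean_posterior_entropy(ambiguous, quadratic)`.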
Model-based clustering for expression data via a Dirichlet process mixture model
 In Bayesian Inference for Gene Expression and Proteomics
, 2006
Cited by 9 (0 self)
This chapter describes a clustering procedure for microarray expression data based on a well-defined statistical model, specifically, a conjugate Dirichlet process mixture model. The clustering algorithm groups genes whose latent variables governing expression are equal, that is, genes belonging to the same mixture component. The model is fit with Markov chain Monte Carlo and the computational burden is eased by exploiting conjugacy. This chapter introduces a method to get a point estimate of the true clustering based on least-squares distances from the posterior probability that two genes are clustered. Unlike ad hoc clustering methods, the model provides measures of uncertainty about the clustering. Further, the model automatically estimates the number of clusters and quantifies uncertainty about this important parameter. The method is compared to other clustering methods in a simulation study. Finally, the method is demonstrated with actual microarray data.
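The least-squares point estimate mentioned above can be sketched as follows, assuming the MCMC output is available as a list of sampled label vectors: estimate the pairwise co-clustering probabilities from the samples, then pick the sampled clustering whose 0/1 association matrix is closest to them in squared distance. The function names and sample draws are hypothetical, and fitting the Dirichlet process mixture itself is out of scope here.

```python
def pairwise_probabilities(samples):
    """Fraction of sampled clusterings in which items i and j share a cluster."""
    n = len(samples[0])
    return [
        [sum(s[i] == s[j] for s in samples) / len(samples) for j in range(n)]
        for i in range(n)
    ]


def least_squares_clustering(samples):
    """Return the sampled clustering closest, in squared distance, to the pairwise probabilities."""
    probs = pairwise_probabilities(samples)
    n = len(samples[0])

    def loss(s):
        # (s[i] == s[j]) is the 0/1 association indicator for this clustering.
        return sum(
            ((s[i] == s[j]) - probs[i][j]) ** 2 for i in range(n) for j in range(n)
        )

    return min(samples, key=loss)


# Hypothetical MCMC draws: label vectors for four genes.
samples = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
estimate = least_squares_clustering(samples)
```

Because the comparison uses association matrices instead of raw labels, the estimate is unaffected by how cluster labels happen to be numbered in each draw.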