Model-Based Clustering and Data Transformations for Gene Expression Data
, 2001
Cited by 88 (8 self)
Motivation: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications.
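The reduction of "how many clusters?" to a model selection problem can be illustrated with a small sketch. The code below is not the paper's implementation: it assumes one-dimensional data, a plain EM fit with a deterministic quantile-based start, a variance floor, and the Bayesian Information Criterion (BIC) as the selection score. The function names and the toy values are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """Density of a finite Gaussian mixture at x."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def fit_gmm_1d(data, k, iters=200, var_floor=1e-3):
    """Fit a k-component 1-D Gaussian mixture by EM.

    Returns ((weights, means, variances), log_likelihood).
    """
    n = len(data)
    xs = sorted(data)
    # Deterministic start (an assumption of this sketch): means at evenly
    # spaced quantiles, every component starts with the overall variance.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    mu = sum(data) / n
    total_var = max(sum((x - mu) ** 2 for x in data) / n, var_floor)
    variances = [total_var] * k
    weights = [1.0 / k] * k
    for _step in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            s = sum(p) or 1e-300
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(spread, var_floor)  # floor avoids collapse
    loglik = sum(
        math.log(mixture_pdf(x, weights, means, variances) or 1e-300) for x in data
    )
    return (weights, means, variances), loglik

def bic(loglik, k, n):
    """BIC score (lower is better): k means, k variances, k - 1 free weights."""
    n_params = 3 * k - 1
    return -2.0 * loglik + n_params * math.log(n)

# Hypothetical toy data with two well-separated groups.
toy = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 5.1, 4.8, 5.2, 5.0, 4.9, 5.3]
scores = {k: bic(fit_gmm_1d(toy, k)[1], k, len(toy)) for k in (1, 2, 3)}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

Each candidate number of components is fitted separately and the one with the lowest BIC is kept; for multivariate expression data the same idea would apply with covariance matrices in place of scalar variances.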
Constructing Internet Coordinate System Based on Delay Measurement
, 2003
Cited by 85 (3 self)
In this paper, we consider the problem of how to represent the locations of Internet hosts in a Cartesian coordinate system to facilitate estimation of the network distance between two arbitrary Internet hosts. We envision an infrastructure that consists of beacon nodes and provides the service of estimating network distance between two hosts without direct delay measurement. We show that the principal component analysis (PCA) technique can effectively extract topological information from delay measurements between beacon hosts. Based on PCA, we devise a transformation method that projects the distance data space into a new coordinate system of (much) smaller dimensions. The transformation retains as much topological information as possible and yet enables end hosts to easily determine their locations in the coordinate system. The resulting new coordinate system is termed the Internet Coordinate System (ICS). Compared to existing work (e.g., IDMaps [1] and GNP [2]), ICS incurs smaller computation overhead in calculating the coordinates of hosts and smaller measurement overhead (required for end hosts to measure their distances to beacon hosts). Finally, we show via experimentation with real-life data sets that ICS is robust and accurate, regardless of the number of beacon nodes (as long as it exceeds a certain threshold) and the complexity of the network topology.
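The general idea of projecting delay vectors onto a few principal components can be sketched as follows. This is a simplified illustration, not the ICS procedure itself: it assumes column-centered PCA computed by power iteration with deflation, omits any scaling step the paper may apply, and uses made-up delay values. All function names are hypothetical.

```python
import math

def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def principal_axes(rows, k, iters=300):
    """Return (column_means, top-k principal axes) of the given rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
           for a in range(d)]
    axes = []
    for _axis in range(k):
        # Fixed, non-uniform start vector (an assumption of this sketch).
        v = [1.0 + j for j in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(iters):
            w = matvec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, matvec(cov, v)))
        axes.append(v)
        # Deflation: remove the component just found from the covariance.
        for a in range(d):
            for b in range(d):
                cov[a][b] -= eigenvalue * v[a] * v[b]
    return mean, axes

def project(delays, mean, axes):
    """Coordinates of one host from its measured delays to the beacons."""
    return [sum((x - m) * a for x, m, a in zip(delays, mean, axis)) for axis in axes]

def distance(p, q):
    """Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical beacon-to-beacon delays in milliseconds (two regions).
beacon_delays = [
    [0, 10, 100, 105],
    [10, 0, 102, 100],
    [100, 102, 0, 12],
    [105, 100, 12, 0],
]
mean, axes = principal_axes(beacon_delays, k=2)
beacon_coords = [project(row, mean, axes) for row in beacon_delays]
# An end host only needs its own delays to the beacons to locate itself.
host_coords = project([5, 8, 101, 103], mean, axes)
print(host_coords, beacon_coords)
```

In this sketch an end host measures delays only to the beacons and then applies the same linear projection, so hosts with similar delay vectors end up close together in the reduced coordinate space.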
triCluster: An Effective Algorithm for Mining Coherent Clusters in 3D Microarray Data
- In Proc. of the 2005 ACM SIGMOD international conference on Management of data
, 2005
Cited by 17 (2 self)
In this paper we introduce a novel algorithm called triCluster, for mining coherent clusters in three-dimensional (3D) gene expression datasets. triCluster can mine arbitrarily positioned and overlapping clusters, and depending on different parameter values, it can mine different types of clusters, including those with constant or similar values along each dimension, as well as scaling and shifting expression patterns. triCluster relies on a graph-based approach to mine all valid clusters. For each time slice, i.e., a gene × sample matrix, it constructs the range multigraph, a compact representation of all similar value ranges between any two sample columns. It then searches for constrained maximal cliques in this multigraph to yield the set of biclusters for this time slice. Then triCluster constructs another graph using the biclusters (as vertices) from each time slice; mining cliques from this graph yields the final set of triclusters. Optionally, triCluster merges/deletes some clusters having large overlaps. We present a useful set of metrics to evaluate the clustering quality, and we show that triCluster can find significant triclusters in real microarray datasets.
Using uncorrelated discriminant analysis for tissue classification with gene expression data
- IEEE/ACM Transactions on Computational Biology and Bioinformatics
, 2004
Cited by 12 (3 self)
The classification of tissue samples based on gene expression data is an important problem in medical diagnosis of diseases such as cancer. In gene expression data, the number of genes is usually very high (in the thousands) compared to the number of data samples (in the tens or low hundreds); that is, the data dimension is large compared to the number of data points (such data is said to be undersampled). To cope with performance and accuracy problems associated with high dimensionality, it is commonplace to apply a preprocessing step that transforms the data to a space of significantly lower dimension with limited loss of the information present in the original data. Linear Discriminant Analysis (LDA) is a well-known technique for dimension reduction and feature extraction, but it is not applicable to undersampled data due to singularity problems associated with the matrices in the underlying representation. This paper presents a dimension reduction and feature extraction scheme, called Uncorrelated Linear Discriminant Analysis (ULDA), for undersampled problems and illustrates its utility on gene expression data. ULDA employs the Generalized Singular Value Decomposition method to handle undersampled data, and the features that it produces in the transformed space are uncorrelated, which makes it attractive for gene expression data. The properties of ULDA are established rigorously and extensive experimental results on gene expression data are presented to illustrate its effectiveness in classifying tissue samples. These results provide a comparative study of various state-of-the-art classification methods on well-known gene expression data sets. Index Terms: Microarray data analysis, discriminant analysis, generalized singular value decomposition, classification.
CARMAweb: comprehensive R- and Bioconductor-based web service for microarray data analysis
- Nucleic Acids Res
, 2006
The Lanczos-Ritz values appearing in an orthogonal similarity reduction of a matrix into semiseparable form
, 2003
Excavator: a computer program for efficiently mining gene expression data
- Nucleic Acids Research
, 2003
Cited by 6 (0 self)
Massive gene-expression data are generated using microarrays, and clustering gene-expression data is useful for studying functional relationship among genes in a biological process. We have developed a computer package, EXCAVATOR
in A Practical Approach to Microarray Data Analysis
- Kluwer. chapter
, 2003
Cited by 6 (0 self)
5. Singular value decomposition and principal component analysis
A Memetic Co-Clustering Algorithm for Gene Expression Profiles and Biological Annotation
, 2004
Cited by 5 (0 self)
With the invention of microarrays, researchers are capable of measuring thousands of gene expression levels in parallel at various time points of the biological process. To investigate general regulatory mechanisms, biologists cluster genes based on their expression patterns. In this paper, we propose a new memetic co-clustering algorithm for expression profiles, which incorporates a priori knowledge in the form of Gene Ontology information. Ontologies offer a mechanism to capture knowledge in a shareable form that is also processable by computers. The use of this additional annotation information promises to improve biological data analysis and simplifies the identification of processes that are relevant under the measured conditions.

