Missing value estimation methods for DNA microarrays
, 2001
Cited by 184 (13 self)
Motivation: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data.
Fast Monte Carlo Algorithms for Matrices II: Computing a Low-Rank Approximation to a Matrix
- SIAM Journal on Computing
, 2004
Cited by 99 (17 self)
... matrix A. It is often of interest to find a low-rank approximation to A, i.e., an approximation D to the matrix A of rank not greater than a specified rank k, where k is much smaller than m and n. Methods such as the Singular Value Decomposition (SVD) may be used to find an approximation to A which is the best in a well-defined sense. These methods require memory and time which are superlinear in m and n; for many applications in which the data sets are very large this is prohibitive. Two simple and intuitive algorithms are presented which, when given an m × n matrix A, compute a description of a low-rank approximation D* to A, and which are qualitatively faster than the SVD. Both algorithms have provable bounds for the error matrix A − D*. For any matrix X, let ‖X‖_F and ‖X‖_2 denote its Frobenius norm and its spectral norm, respectively. In the first algorithm, c = O(1) columns of A are randomly chosen. If the m × c matrix C consists of those c columns of A (after appropriate rescaling), then it is shown that from C^T C approximations to the top singular values and corresponding singular vectors may be computed. From the computed singular vectors a description D* of the matrix A may be computed such that rank(D*) ≤ k and such that a bound on the error ‖A − D*‖_ξ holds with high probability for both ξ = 2, F. This algorithm may be implemented without storing the matrix A in Random Access Memory (RAM), provided it can make two passes over the matrix stored in external memory and use O(m + n) additional RAM. The second algorithm is similar except that it further approximates the matrix C by randomly sampling r = O(1) rows of C to form an r × c matrix W. Thus, it has additional error, but it can be implemented in three passes over the matrix using only constant ...
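The first algorithm described in this abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the abstract says only that columns are "randomly chosen" and rescaled, so the sampling probabilities (proportional to squared column norms) and the rescaling factor 1/sqrt(c·p_j) are assumptions, and the function name `sampled_low_rank` is hypothetical. The sketch also holds A in memory rather than making passes over external storage.

```python
import numpy as np

def sampled_low_rank(A, c, k, rng=None):
    """Approximate the top-k left singular vectors of A from c sampled columns.

    Returns (H, B) with H of shape (m, <=k) having orthonormal columns and
    B = H.T @ A, so that H @ B is a rank-<=k approximation of A.
    """
    rng = np.random.default_rng(rng)
    A = np.asarray(A, dtype=float)

    # Assumed sampling scheme: pick column j with probability
    # proportional to its squared Euclidean norm.
    col_norms_sq = np.sum(A * A, axis=0)
    p = col_norms_sq / col_norms_sq.sum()
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)

    # Rescale so that C @ C.T equals A @ A.T in expectation.
    C = A[:, idx] / np.sqrt(c * p[idx])

    # Work with the small c x c matrix C^T C instead of the full matrix.
    evals, evecs = np.linalg.eigh(C.T @ C)
    order = np.argsort(evals)[::-1][:k]
    sigma = np.sqrt(np.clip(evals[order], 0.0, None))

    # Drop directions whose singular value is numerically zero.
    keep = sigma > 1e-12 * sigma.max()
    H = C @ evecs[:, order[keep]] / sigma[keep]
    return H, H.T @ A
```

Returning the pair (H, B) rather than the dense product mirrors the abstract's phrase "a description of a low-rank approximation": the m × n matrix H @ B never needs to be formed unless the caller asks for it.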
Sparse Principal Component Analysis
- Journal of Computational and Graphical Statistics
, 2004
Cited by 83 (3 self)
Principal component analysis (PCA) is widely used in data processing and dimensionality reduction. However, PCA suffers from the fact that each principal component is a linear combination of all the original variables, thus it is often difficult to interpret the results. We introduce a new method called sparse principal component analysis (SPCA) using the lasso (elastic net) to produce modified principal components with sparse loadings. We show that PCA can be formulated as a regression-type optimization problem, then sparse loadings are obtained by imposing the lasso (elastic net) constraint on the regression coefficients. Efficient algorithms are proposed to realize SPCA for both regular multivariate data and gene expression arrays. We also give a new formula to compute the total variance of modified principal components. As illustrations, SPCA is applied to real and simulated data, and the results are encouraging.
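The regression-type formulation mentioned in this abstract can be written out as follows. This is a sketch in the form the SPCA criterion is commonly stated; the symbols are assumptions not found in the abstract itself: x_i is the i-th observation, B = [β_1, ..., β_k] holds the loading vectors, A is an auxiliary matrix with orthonormal columns, λ is the ridge penalty and λ_{1,j} are the per-component lasso penalties.

```latex
(\hat{A}, \hat{B}) = \arg\min_{A,\,B}
  \sum_{i=1}^{n} \bigl\lVert x_i - A B^{\top} x_i \bigr\rVert_2^2
  + \lambda \sum_{j=1}^{k} \lVert \beta_j \rVert_2^2
  + \sum_{j=1}^{k} \lambda_{1,j} \lVert \beta_j \rVert_1
\quad \text{subject to } A^{\top} A = I_k
```

With all λ_{1,j} = 0 the minimizing β_j are proportional to the ordinary principal component loadings; increasing λ_{1,j} drives individual entries of β_j to exactly zero, which is what makes the loadings sparse.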
Aligning Gene Expression Time Series With Time Warping Algorithms
, 2001
Cited by 76 (2 self)
Motivation: Increasingly, biological processes are being studied through time series of RNA expression data collected for large numbers of genes. Because common processes may unfold at varying rates in different experiments or individuals, methods are needed that will allow corresponding expression states in different time series to be mapped to one another. Results: We present implementations of time warping algorithms applicable to RNA and protein expression data and demonstrate their application to published yeast RNA expression time series. Programs executing two warping algorithms are described, a simple warping algorithm and an interpolative algorithm, along with programs that generate graphics that visually present alignment information. We show time warping to be superior to simple clustering at mapping corresponding time states. We document the impact of statistical measurement noise and sample size on the quality of time alignments, and present issues related to statistical assessment of alignment quality through alignment scores. We also discuss directions for algorithm improvement including development of multiple time series alignments and possible applications to causality searches and non-temporal processes ('concentration warping'). Availability: Academic implementations of alignment programs genewarp and genewarpi and the graphics generation programs grphwarp and grphwarpi are available as Win32 system DOS box executables on our web site along with documentation on their use. The publicly available data on which they were demonstrated may be found at http://genome-www.stanford.edu/cellcycle/. Postscript files generated by grphwarp and grphwarpi may be directly printed or viewed using GhostView software available at http://www.cs.wisc.edu/~ghost/. Con...
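For readers unfamiliar with time warping, the idea of mapping corresponding states in two series that unfold at different rates can be illustrated with textbook dynamic time warping. This is a generic sketch, not the genewarp or genewarpi programs named above: the function name `dtw_align` is hypothetical, the local cost (absolute difference) is an assumption, and it aligns a single pair of one-dimensional series rather than expression vectors over many genes.

```python
import numpy as np

def dtw_align(a, b):
    """Align two 1-D series by dynamic programming.

    Returns (total_cost, path), where path is a list of index pairs (i, j)
    meaning that a[i] is matched to b[j].
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)

    # cost[i, j] = cheapest alignment of a[:i] with b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i, j] = d + min(cost[i - 1, j - 1],  # advance both series
                                 cost[i - 1, j],      # advance a only
                                 cost[i, j - 1])      # advance b only

    # Trace the optimal path back from the end of both series.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return float(cost[n, m]), path
```

A series that lingers at one state, such as [0, 0, 1, 2, 3] compared with [0, 1, 2, 3], is absorbed by matching one point of the shorter series to two points of the longer one, which a point-by-point comparison cannot do.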
Spectral biclustering of microarray data: Coclustering genes and conditions
- Genome Research
, 2003
Exploring the Conditional Coregulation of Yeast Gene Expression Through Fuzzy K-Means Clustering
, 2002
Cited by 54 (0 self)
Background: Organisms simplify the orchestration of gene expression by coregulating genes whose products function together in the cell. Many proteins serve different roles depending on the demands of the organism, and therefore the corresponding genes are often coexpressed with different groups of genes under different situations. This poses a challenge in analyzing whole-genome expression data, because many genes will be similarly expressed to multiple, distinct groups of genes. Because most commonly used analytical methods cannot appropriately represent these relationships, the connections between conditionally coregulated genes are often missed.
Defining transcription modules using large-scale gene expression data
- Bioinformatics
, 2004
Cited by 49 (1 self)
Motivation: Large-scale gene expression data comprising a variety of cellular conditions holds the promise of a global view on the transcription program. While conventional clustering algorithms have been successfully applied to smaller datasets, the utility of many algorithms for the analysis of large-scale data is limited by their inability to capture combinatorial and condition-specific co-regulation. In addition, there is an increasing need to integrate the rapidly accumulating body of other high-throughput biological data with the expression analysis. In a previous work, we introduced the Signature Algorithm, which overcomes the problems of conventional clustering and allows for intuitive integration of additional biological data. However, the applicability of this approach to global analyses is constrained by the comprehensiveness of relevant external data and by its lacking capability of capturing hierarchical organization of the transcription network. Methods: We present a novel method for the analysis of large-scale expression data, which assigns genes into context-dependent and potentially overlapping regulatory units. We introduce
Cluster Analysis for Gene Expression Data: A Survey
- IEEE Transactions on Knowledge and Data Engineering
, 2004
Cited by 48 (3 self)
Abstract—DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and new algorithms have recently been proposed specifically aiming at gene expression data. These clustering algorithms have proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest the promising trends in this field.
Index Terms—Microarray technology, gene expression data, clustering.
CLIFF: Clustering of High-Dimensional Microarray Data via Iterative Feature Filtering Using Normalized Cuts
, 2001
Cited by 46 (3 self)
We present CLIFF, an algorithm for clustering biological samples using gene expression microarray data. This clustering problem is difficult for several reasons, in particular the sparsity of the data, the high dimensionality of the feature (gene) space, and the fact that many features are irrelevant or redundant. Our algorithm iterates between two computational processes, feature filtering and clustering. Given a reference partition that approximates the correct clustering of the samples, our feature filtering procedure ranks the features according to their intrinsic discriminability, relevance to the reference partition, and irredundancy to other relevant features, and uses this ranking to select the features to be used in the following round of clustering. Our clustering algorithm, which is based on the concept of a normalized cut, clusters the samples into a new reference partition on the basis of the selected features. On a well-studied problem involving 72 leukemia samples and 7130 genes, we demonstrate that CLIFF outperforms standard clustering approaches that do not consider the feature selection issue, and produces a result that is very close to the original expert labeling of the sample set.
Analysis Techniques for Microarray Time-Series Data (Extended Abstract)
- J. Comput. Biol
, 2000
Cited by 36 (2 self)
Vladimir Filkov, Steven Skiena, Jizu Zhi. Dept. of Computer Science and Center for Biotechnology, State University of New York, Stony Brook, NY 11794-4400. {vlfilkov|skiena|zjizu}@cs.sunysb.edu. September 27, 2000.

