Results 11 - 20
of
642
K-means Clustering via Principal Component Analysis
, 2004
"... Principal component analysis (PCA) is a widely used statistical technique for unsupervised dimension reduction. K-means clustering is a commonly used data clustering for performing unsupervised learning tasks. Here we ..."
Abstract
-
Cited by 201 (5 self)
- Add to MetaCart
Principal component analysis (PCA) is a widely used statistical technique for unsupervised dimension reduction. K-means clustering is a commonly used data clustering for performing unsupervised learning tasks. Here we
BagBoosting for tumor classification with gene expression data
- Bioinformatics
, 2004
"... Motivation: Microarray experiments are expected to contribute significantly to the progress in cancer treatment by enabling a precise and early diagnosis. They create a need for class prediction tools, which can deal with a large number of highly correlated input variables, perform feature selection ..."
Abstract
-
Cited by 194 (2 self)
- Add to MetaCart
Motivation: Microarray experiments are expected to contribute significantly to the progress in cancer treatment by enabling a precise and early diagnosis. They create a need for class prediction tools, which can deal with a large number of highly correlated input variables, perform feature selection and provide class probability estimates that serve as a quantification of the predictive uncertainty. A very promising solution is to combine the two ensemble schemes bagging and boosting to a novel algorithm called BagBoosting.
Results: When bagging is used as a module in boosting, the resulting classifier consistently improves the predictive performance and the probability estimates of both bagging and boosting on real and simulated gene expression data. This quasi-guaranteed improvement can be obtained by simply making a bigger computing effort. The advantageous predictive potential is also confirmed by comparing BagBoosting to several established class prediction tools for microarray data.
Discovering Local Structure in Gene Expression Data: The Order-Preserving Submatrix Problem
, 2002
"... This paper concerns the discovery of patterns in gene expression matrices, in which each element gives the expression level of a given gene in a given experiment. Most existing methods for pattern discovery in such matrices are based on clustering genes by comparing their expression levels in all ex ..."
Abstract
-
Cited by 187 (1 self)
- Add to MetaCart
(Show Context)
This paper concerns the discovery of patterns in gene expression matrices, in which each element gives the expression level of a given gene in a given experiment. Most existing methods for pattern discovery in such matrices are based on clustering genes by comparing their expression levels in all experiments, or clustering experiments by comparing their expression levels for all genes. Our work goes beyond such global approaches by looking for local patterns that manifest themselves when we focus simultaneously on a subset G of the genes and a subset T of the experiments. Speci � cally, we look for order-preserving submatrices (OPSMs), in which the expression levels of all genes induce the same linear ordering of the experiments (we show that the OPSM search problem is NP-hard in the worst case). Such a pattern might arise, for example, if the experiments in T represent distinct stages in the progress of a disease or in a cellular process and the expression levels of all genes in G vary across the stages in the same way. We de � ne a probabilistic model in which an OPSM is hidden within an otherwise random matrix. Guided by this model, we develop an ef � cient algorithm for � nding the hidden OPSM in the random matrix. In data generated according to the model, the algorithm recovers the hidden OPSM with a very high success rate. Application of the methods to breast cancer data seem to reveal signi � cant local patterns.
A molecular signature of metastasis in primary solid tumors.
- Nat. Genet.
, 2003
"... Metastasis is the principal event leading to death in individuals with cancer, yet its molecular basis is poorly understood 1 . To explore the molecular differences between human primary tumors and metastases, we compared the gene-expression profiles of adenocarcinoma metastases of multiple tumor t ..."
Abstract
-
Cited by 158 (1 self)
- Add to MetaCart
Metastasis is the principal event leading to death in individuals with cancer, yet its molecular basis is poorly understood 1 . To explore the molecular differences between human primary tumors and metastases, we compared the gene-expression profiles of adenocarcinoma metastases of multiple tumor types to unmatched primary adenocarcinomas. We found a geneexpression signature that distinguished primary from metastatic adenocarcinomas. More notably, we found that a subset of primary tumors resembled metastatic tumors with respect to this gene-expression signature. We confirmed this finding by applying the expression signature to data on 279 primary solid tumors of diverse types. We found that solid tumors carrying the gene-expression signature were most likely to be associated with metastasis and poor clinical outcome (P < 0.03). These results suggest that the metastatic potential of human tumors is encoded in the bulk of a primary tumor, thus challenging the notion that metastases arise from rare cells within a primary tumor that have the ability to metastasize 2 . The prevailing model of metastasis holds that most primary tumor cells have low metastatic potential, but rare cells (estimated at less than one in ten million) within large primary tumors acquire metastatic capacity through somatic mutation 2 . The metastatic phenotype includes the ability to migrate from the primary tumor, survive in blood or lymphatic circulation, invade distant tissues and establish distant metastatic nodules. This model is primarily supported by animal models in which poorly metastatic cell lines can spawn highly metastatic variants if the process is facilitated by the isolation of rare metastatic nodules, expansion of the cells in vitro and injection of these selected cells into secondary recipient mice To study the molecular nature of metastasis, we analyzed the gene-expression profiles of 12 metastatic adenocarcinoma nodules of diverse origin (lung, breast, prostate, colorectal, uterus, ovary) and compared them with the expression profiles of 64 primary adenocarcinomas representing the same spectrum of tumor types obtained from different individuals. This comparison identified an expression pattern of 128 genes that best distinguished primary and metastatic adenocarcinomas (Fig.
Aligning Gene Expression Time Series With Time Warping Algorithms
, 2001
"... Motivation: Increasingly, biological processes are being studied through time series of RNA expression data collected for large numbers of genes. Because common processes may unfold at varying rates in different experiments or individuals, methods are needed that will allow corresponding expression ..."
Abstract
-
Cited by 150 (3 self)
- Add to MetaCart
Motivation: Increasingly, biological processes are being studied through time series of RNA expression data collected for large numbers of genes. Because common processes may unfold at varying rates in different experiments or individuals, methods are needed that will allow corresponding expression states in different time series to be mapped to one another. Results: We present implementations of time warping algorithms applicable to RNA and protein expression data and demonstrate their application to published yeast RNA expression time series. Programs executing two warping algorithms are described, a simple warping algorithm and an interpolative algorithm, along with programs that generate graphics that visually present alignment information. We show time warping to be superior to simple clustering at mapping corresponding time states. We document the impact of statistical measurement noise and sample size on the quality of time alignments, and present issues related to statistical assessment of alignment quality through alignment scores. We also discuss directions for algorithm improvement including development of multiple time series alignments and possible applications to causality searches and non-temporal processes (`concentration warping'). Availability: Academic implementations of alignment programs genewarp and genewarpi and the graphics generation programs grphwarp and grphwarpi are available as Win32 system DOS box executables on our web site along with documentation on their use. The publicly available data on which they were demonstrated may be found at http://genome-www.stanford.edu/cellcycle/. Postscript files generated by grphwarp and grphwarpi may be directly printed or viewed using GhostView software available at http://www.cs.wisc.edu/#ghost/. Con...
Resampling-Based Multiple Testing for Microarray Data Analysis
, 2003
"... The burgeoning field of genomics has revived interest in multiple testing procedures by raising new methodological and computational challenges. For example, microarray experiments generate large multiplicity problems in which thousands of hypotheses are tested simultaneously. In their 1993 book, We ..."
Abstract
-
Cited by 145 (3 self)
- Add to MetaCart
The burgeoning field of genomics has revived interest in multiple testing procedures by raising new methodological and computational challenges. For example, microarray experiments generate large multiplicity problems in which thousands of hypotheses are tested simultaneously. In their 1993 book, Westfall & Young propose resampling-based p-value adjustment procedures which are highly relevant to microarray experiments. This article discusses different criteria for error control in resampling-based multiple testing, including (a) the family wise error rate of Westfall & Young (1993) and (b) the false discovery rate developed by Benjamini & Hochberg (1995), both from a frequentist viewpoint; and (c) the positive false discovery rate of Storey (2002), which has a Bayesian motivation. We also introduce our recently developed fast algorithm for implementing the minP adjustment to control familywise error rate. Adjusted p-values for different approaches are applied to gene expression data from two recently published microarray studies. The properties of these procedures for multiple testing are compared.
From patterns to pathways: gene expression data analysis comes of age.
- Nature Genetics
, 2002
"... ..."
A Bayesian missing value estimation method for gene expression profile data
- Bioinformatics
, 2003
"... Motivation: Gene expression profile analyses have been used in numerous studies covering a broad range of areas in biology. When unreliable measurements are excluded, missing values are introduced in gene expression profiles. Although existing multivariate analysis methods have difficulty with the t ..."
Abstract
-
Cited by 127 (2 self)
- Add to MetaCart
(Show Context)
Motivation: Gene expression profile analyses have been used in numerous studies covering a broad range of areas in biology. When unreliable measurements are excluded, missing values are introduced in gene expression profiles. Although existing multivariate analysis methods have difficulty with the treatment of missing values, this problem has received little attention. There are many options for dealing with missing values, each of which reaches drastically different results. Ignoring missing values is the simplest method and is frequently applied. This approach, however, has its flaws. In this article, we propose an estimation method for missing values, which is based on Bayesian principal component analysis (BPCA). Although the methodology that a probabilistic model and latent variables are estimated simultaneously within the framework of Bayes
et al. Gene expression patterns in human liver cancers
- Mol Biol Cell
"... Hepatocellular carcinoma (HCC) is a leading cause of death worldwide. Using cDNA microarrays to characterize patterns of gene expression in HCC, we found consistent differences between the expression patterns in HCC compared with those seen in nontumor liver tissues. The expression patterns in HCC w ..."
Abstract
-
Cited by 117 (4 self)
- Add to MetaCart
Hepatocellular carcinoma (HCC) is a leading cause of death worldwide. Using cDNA microarrays to characterize patterns of gene expression in HCC, we found consistent differences between the expression patterns in HCC compared with those seen in nontumor liver tissues. The expression patterns in HCC were also readily distinguished from those associated with tumors metastatic to liver. The global gene expression patterns intrinsic to each tumor were sufficiently distinctive that multiple tumor nodules from the same patient could usually be recognized and distinguished from all the others in the large sample set on the basis of their gene expression patterns alone. The distinctive gene expression patterns are characteristic of the tumors and not the patient; the expression programs seen in clonally independent tumor nodules in the same patient were no more similar than those in tumors from different patients. Moreover, clonally related tumor masses that showed distinct expression profiles were also distinguished by genotypic differences. Some features of the gene expression patterns were associated with specific phenotypic and genotypic characteristics of the tumors, including growth rate, vascular invasion, and p53 overexpression.
Spectral learning
- In IJCAI
, 2003
"... We present a simple, easily implemented spectral learning algorithm which applies equally whether we have no supervisory information, pairwise link constraints, or labeled examples. In the unsupervised case, it performs consistently with other spectral clustering algorithms. In the supervised case, ..."
Abstract
-
Cited by 106 (6 self)
- Add to MetaCart
We present a simple, easily implemented spectral learning algorithm which applies equally whether we have no supervisory information, pairwise link constraints, or labeled examples. In the unsupervised case, it performs consistently with other spectral clustering algorithms. In the supervised case, our approach achieves high accuracy on the categorization of thousands of documents given only a few dozen labeled training documents for the 20 Newsgroups data set. Furthermore, its classification accuracy increases with the addition of unlabeled documents, demonstrating effective use of unlabeled data. By using normalized affinity matrices which are both symmetric and stochastic, we also obtain both a probabilistic interpretation of our method and certain guarantees of performance. 1