BagBoosting for tumor classification with gene expression data
- Bioinformatics, 2004
Cited by 79 (1 self)
Motivation: Microarray experiments are expected to contribute significantly to the progress in cancer treatment by enabling a precise and early diagnosis. They create a need for class prediction tools, which can deal with a large number of highly correlated input variables, perform feature selection and provide class probability estimates that serve as a quantification of the predictive uncertainty. A very promising solution is to combine the two ensemble schemes bagging and boosting into a novel algorithm called BagBoosting.
Results: When bagging is used as a module in boosting, the resulting classifier consistently improves the predictive performance and the probability estimates of both bagging and boosting on real and simulated gene expression data. This quasi-guaranteed improvement can be obtained by simply making a bigger computing effort. The advantageous predictive potential is also confirmed by comparing BagBoosting to several established class prediction tools for microarray data.
Correcting sample selection bias by unlabeled data
Cited by 69 (5 self)
We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
Gaussian processes for ordinal regression
- Journal of Machine Learning Research, 2004
Cited by 53 (1 self)
We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach.
Integrating structured biological data by kernel maximum mean discrepancy
- In ISMB, 2006
Cited by 33 (13 self)
Motivation: Many problems in data integration in bioinformatics can be posed as one common question: Are two sets of observations generated by the same distribution? We propose a kernel-based statistical test for this problem, based on the fact that two distributions are different if and only if there exists at least one function having different expectation on the two distributions. Consequently we use the maximum discrepancy between function means as the basis of a test statistic. The Maximum Mean Discrepancy (MMD) can take advantage of the kernel trick, which allows us to apply it not only to vectors, but also to strings, sequences, graphs, and other common structured data types arising in molecular biology.
Results: We study the practical feasibility of an MMD-based test on three central data integration tasks: Testing cross-platform comparability of microarray data, cancer diagnosis, and data-content based schema matching for two different protein function classification schemas. In all of these experiments, including high-dimensional ones, MMD is very accurate in finding samples that were generated from the same distribution, and outperforms its best competitors.
Conclusions: We have defined a novel statistical test of whether two samples are from the same distribution, compatible with both multivariate and structured data, that is fast, easy to implement, and works well, as confirmed by our experiments.
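The "maximum discrepancy between function means" can be illustrated with a minimal sketch. The abstract fixes neither the kernel nor the estimator, so the Gaussian RBF kernel, the `bandwidth` parameter, the biased (V-statistic) estimate, and the function names below are all assumptions made for illustration, not the authors' implementation.

```python
import math

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel between two equal-length numeric vectors (assumed kernel)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, kernel=rbf_kernel):
    """Biased (V-statistic) estimate of the squared MMD between samples xs and ys."""
    def mean_kernel(us, vs):
        # Average kernel value over all pairs (u, v).
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Identical samples give an estimate of zero; a shifted sample gives a larger value.
same = mmd_squared([(0.0,), (1.0,)], [(0.0,), (1.0,)])
shifted = mmd_squared([(0.0,), (1.0,)], [(5.0,), (6.0,)])
```

Because only kernel evaluations are needed, `rbf_kernel` could be swapped for a string or graph kernel, which is the property the abstract refers to as the kernel trick. Turning the statistic into a test additionally requires a rejection threshold, which this sketch does not compute.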
Cluster Validation Techniques for Genome Expression Data
- Signal Processing, 2002
Cited by 30 (6 self)
Several clustering algorithms have been suggested to analyse genome expression data, but fewer solutions have been implemented to guide the design of clustering-based experiments and assess the quality of their outcomes. A cluster validity framework provides insights into the problem of predicting the correct number of clusters. This paper presents several validation techniques for gene expression data analysis. Normalisation and validity aggregation strategies are proposed to improve the prediction about the number of relevant clusters. The results obtained indicate that this systematic evaluation approach may significantly support genome expression analyses for knowledge discovery applications.
Microarrays, empirical Bayes and the two-groups model
- Statist. Sci., 2006
Cited by 25 (9 self)
The classic frequentist theory of hypothesis testing developed by Neyman, Pearson, and Fisher has a claim to being the Twentieth Century’s most influential piece of applied mathematics. Something new is happening in the Twenty-First Century: high throughput devices, such as microarrays, routinely require simultaneous hypothesis tests for thousands of individual cases, not at all what the classical theory had in mind. In these situations empirical Bayes information begins to force itself upon frequentists and Bayesians alike. The two-groups model is a simple Bayesian construction that facilitates empirical Bayes analysis. This article concerns the interplay of Bayesian and frequentist ideas in the two-groups setting, with particular attention focussed on Benjamini and Hochberg’s False Discovery Rate method. Topics include the choice and meaning of the null hypothesis in large-scale testing situations, power considerations, the limitations of permutation methods, significance testing for groups of cases (such as pathways in microarray studies), correlation effects, multiple confidence intervals, and Bayesian competitors to the two-groups model.
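For readers unfamiliar with the Benjamini and Hochberg method mentioned above, the following is a minimal sketch of the standard step-up rule. The function name, the level `q`, and the example p-values are illustrative assumptions and do not come from the article.

```python
def benjamini_hochberg(p_values, q=0.1):
    """Return the indices of hypotheses rejected by the BH step-up rule at level q."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * q / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    return sorted(order[:cutoff])

# Made-up p-values for five simultaneous tests.
rejected = benjamini_hochberg([0.001, 0.009, 0.04, 0.2, 0.7], q=0.05)
```

With these example values the thresholds are 0.01, 0.02, 0.03, 0.04 and 0.05, so only the two smallest p-values are rejected.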
Finding Predictive Gene Groups from Microarray Data
- Journal of Multivariate Analysis, 2004
Cited by 20 (4 self)
Microarray experiments generate large datasets with expression values for thousands of genes, but not more than a few dozens of samples. A challenging task with these data is to reveal groups of genes which act together and whose collective expression is strongly associated with an outcome variable of interest. To find these groups, we suggest the use of supervised algorithms: these are procedures which use external information about the response variable for grouping the genes. We present Pelora, an algorithm based on penalized logistic regression analysis, that combines gene selection, gene grouping and sample classification in a supervised, simultaneous way. With an empirical study on six different microarray datasets, we show that Pelora identifies gene groups whose expression centroids have very good predictive potential and yield results that can keep up with state-of-the-art classification methods based on single genes. Thus, our gene groups can be beneficial in medical diagnostics and prognostics, but they may also provide more biological insights into gene function and regulation.
Comprehensive gene expression analysis of prostate cancer reveals distinct transcriptional programs associated with metastatic disease
- Cancer Res, 2002
Cited by 18 (0 self)
The identification of genes that contribute to the biological basis for clinical heterogeneity and progression of prostate cancer is critical to accurate classification and appropriate therapy. We performed a comprehensive gene expression analysis of prostate cancer using oligonucleotide arrays with 63,175 probe sets to identify genes and expressed sequences with strong and uniform differential expression between nonrecurrent primary prostate cancers and metastatic prostate cancers. The mean expression value for >3,000 tumor-intrinsic genes differed by at least 3-fold between the two groups. This includes many novel ESTs not previously implicated in prostate cancer progression. Many differentially expressed genes participate in biological processes that may contribute to the clinical phenotype. One example was a strong correlation between high proliferation rates in metastatic cancers and overexpression of genes that participate in cell cycle regulation, DNA replication, and DNA repair. Other functional categories of differentially expressed genes included transcriptional regulation, signaling, signal transduction, cell structure, and motility. These differentially expressed genes reflect critical cellular activities that contribute to clinical heterogeneity and provide diagnostic and therapeutic targets.
Partial least squares: A versatile tool for the analysis of high-dimensional genomic data
- Briefings in Bioinformatics, 2007
Cited by 15 (5 self)
Partial Least Squares (PLS) is a highly efficient statistical regression technique that is well suited for the analysis of high-dimensional genomic data. In this paper we review the theory and applications of PLS from both methodological and biological points of view. Focusing on microarray expression data we provide a systematic comparison of the PLS approaches currently employed, and discuss problems as different as tumor classification, identification of relevant genes, survival analysis and modeling of gene networks.
Size, power and false discovery rates
- 2007
Cited by 14 (4 self)
Modern scientific technology has provided a new class of large-scale simultaneous inference problems, with thousands of hypothesis tests to consider at the same time. Microarrays epitomize this type of technology, but similar situations arise in proteomics, spectroscopy, imaging, and social science surveys. This paper uses false discovery rate methods to carry out both size and power calculations on large-scale problems. A simple empirical Bayes approach allows the fdr analysis to proceed with a minimum of frequentist or Bayesian modeling assumptions. Closed-form accuracy formulas are derived for estimated false discovery rates, and used to compare different methodologies: local or tail-area fdr’s, theoretical, permutation, or empirical null hypothesis estimates. Two microarray data sets as well as simulations are used to evaluate the methodology, the power diagnostics showing why non-null cases might easily fail to appear on a list of “significant” discoveries.
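As a rough illustration of what a local fdr measures, the sketch below computes the posterior probability that a case with score z is null under a two-component mixture. Every number in it is an assumption chosen for illustration: a standard normal null density, a unit-variance normal alternative centred at `alt_mean`, and a null proportion `p0` of 0.9. The abstract describes estimating such quantities from data, which this sketch does not attempt.

```python
import math

def normal_pdf(z, mean=0.0, sd=1.0):
    """Density of a normal distribution with the given mean and sd, evaluated at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def local_fdr(z, p0=0.9, alt_mean=3.0):
    """Posterior probability that a case with score z is null (assumed mixture)."""
    null_part = p0 * normal_pdf(z)                         # p0 * f0(z)
    alt_part = (1.0 - p0) * normal_pdf(z, mean=alt_mean)   # (1 - p0) * f1(z)
    return null_part / (null_part + alt_part)

# A score near zero is almost certainly null; a score of 4 almost certainly is not.
fdr_at_zero = local_fdr(0.0)
fdr_at_four = local_fdr(4.0)
```

Under these assumed densities a non-null case with a modest score, say z near 1.5, still has a local fdr well above one half, which is one way to see why such cases can fail to appear on a list of discoveries.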

