An Empirical Bayes Approach to Inferring Large-Scale Gene Association Networks
 BIOINFORMATICS
, 2004
Motivation: Genetic networks are often described statistically by graphical models (e.g. Bayesian networks). However, inferring the network structure offers a serious challenge in microarray analysis where the sample size is small compared to the number of considered genes. This renders many standard algorithms for graphical models inapplicable, and inferring genetic networks an “ill-posed” inverse problem. Methods: We introduce a novel framework for small-sample inference of graphical models from gene expression data. Specifically, we focus on so-called graphical Gaussian models (GGMs) that are now frequently used to describe gene association networks and to detect conditionally dependent genes. Our new approach is based on (i) improved (regularized) small-sample point estimates of partial correlation, (ii) an exact test of edge inclusion with adaptive estimation of the degree of freedom, and (iii) a heuristic network search based on false discovery rate multiple testing. Steps (ii) and (iii) correspond to an empirical Bayes estimate of the network topology. Results: Using computer simulations we investigate the sensitivity (power) and specificity (true negative rate) of the proposed framework to estimate GGMs from microarray data. This shows that it is possible to recover the true network topology with high accuracy even for small-sample data sets. Subsequently, we analyze gene expression data from a breast cancer tumor study and illustrate our approach by inferring a corresponding large-scale gene association network for 3,883 genes. Availability: The authors have implemented the approach in the R package “GeneTS” that is freely available from
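Step (i) of the abstract above, a regularized small-sample estimate of partial correlation, can be illustrated with a minimal sketch. This is not the estimator implemented in GeneTS: it assumes a simple linear shrinkage of the sample correlation matrix toward the identity with a fixed, hand-picked intensity `lam`, and the function name `shrinkage_partial_correlations` is hypothetical. Partial correlations are then read off the inverse of the shrunken matrix using the standard identity pcor_ij = -omega_ij / sqrt(omega_ii * omega_jj).

```python
import numpy as np

def shrinkage_partial_correlations(X, lam=0.2):
    """Partial correlations from a shrunken correlation matrix.

    X   : (n_samples, n_genes) expression matrix
    lam : shrinkage intensity in (0, 1]; a fixed value is an assumption
          made for illustration, not a data-driven choice.
    """
    R = np.corrcoef(X, rowvar=False)           # sample correlations (singular if p > n)
    p = R.shape[0]
    R_shrunk = (1.0 - lam) * R + lam * np.eye(p)  # shrink toward identity: positive definite
    omega = np.linalg.inv(R_shrunk)            # regularized precision matrix
    d = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(d, d)             # pcor_ij = -omega_ij / sqrt(omega_ii * omega_jj)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Toy data with fewer samples than genes; genes 0 and 1 are made dependent.
rng = np.random.default_rng(1)
n, p = 20, 50
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.2 * rng.normal(size=n)
pcor = shrinkage_partial_correlations(X)
```

Without the shrinkage step the sample correlation matrix has rank below p when n < p and cannot be inverted, which is the small-sample difficulty the abstract refers to.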
The Entire Regularization Path for the Support Vector Machine
, 2004
The Support Vector Machine is a widely used tool for classification. Many efficient implementations exist for fitting a two-class SVM model. The user has to supply values for the tuning parameters: the regularization cost parameter, and the kernel parameters. It seems to be common practice to use a default value for the cost parameter, often leading to the least restrictive model. In this paper we argue that the choice of the cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. We illustrate our algorithm on some examples, and use our representation to give further insight into the range of SVM solutions.
Prediction by supervised principal components
 Journal of the American Statistical Association
, 2006
In regression problems where the number of predictors greatly exceeds the number of observations, conventional regression techniques may produce unsatisfactory results. We describe a technique called supervised principal components that can be applied to this type of problem. Supervised principal components is similar to conventional principal components analysis except that it uses a subset of the predictors selected based on their association with the outcome. Supervised principal components can be applied to regression and generalized regression problems, such as survival analysis. It compares favorably to other techniques for this type of problem, and can also account for the effects of other covariates and help identify which predictor variables are most important. We also provide asymptotic consistency results to help support our empirical findings. These methods could become important tools for DNA microarray data, where they may be used to more accurately diagnose and treat cancer. KEY WORDS: Gene expression; Microarray; Regression; Survival analysis.
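The procedure described in the abstract above (screen predictors by their association with the outcome, then run principal components on the selected subset) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the screening score is taken to be the absolute Pearson correlation with the outcome, the threshold is fixed by hand rather than tuned, and the function name `supervised_pc` is hypothetical.

```python
import numpy as np

def supervised_pc(X, y, threshold):
    """Sketch of supervised principal components for a continuous outcome.

    X         : (n_samples, n_features) predictor matrix
    y         : (n_samples,) outcome
    threshold : features whose |correlation with y| exceeds this are kept
                (a hand-picked value here; an assumption for illustration)
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Univariate screening score: absolute correlation of each feature with y.
    scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    keep = np.flatnonzero(scores > threshold)
    # First principal component of the reduced matrix via the SVD.
    U, s, Vt = np.linalg.svd(Xc[:, keep], full_matrices=False)
    pc1 = U[:, 0] * s[0]
    # Least-squares regression of the centered outcome on that single component.
    beta = (pc1 @ yc) / (pc1 @ pc1)
    return keep, pc1, beta

# Toy data: 10 of 500 features share a latent factor that drives the outcome.
rng = np.random.default_rng(0)
n, p = 50, 500
latent = rng.normal(size=n)
X = rng.normal(size=(n, p))
X[:, :10] += 3.0 * latent[:, None]
y = latent + 0.1 * rng.normal(size=n)
keep, pc1, beta = supervised_pc(X, y, threshold=0.6)
```

Ordinary principal components would use all 500 columns; restricting the decomposition to the screened subset is what ties the leading component to the outcome.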
Margin trees for high-dimensional classification
 Journal of Machine Learning Research
We propose a method for the classification of more than two classes, from high-dimensional features. Our approach is to build a binary decision tree in a top-down manner, using the optimal margin classifier at each split. We implement an exact greedy algorithm for this task, and compare its performance to less greedy procedures based on clustering of the matrix of pairwise margins. We compare the performance of the “margin tree” to the closely related “all-pairs” (one versus one) support vector machine, and nearest centroids on a number of cancer microarray datasets. We also develop a simple method for feature selection. We find that the margin tree has accuracy that is competitive with other methods and offers additional interpretability in its putative grouping of the classes.
Survival prediction using gene expression data: a review and comparison. submitted
, 2007
Background: Knowledge of the transcription of the human genome might greatly enhance our understanding of cancer. In particular, gene expression may be used to predict the survival of cancer patients. A microarray measures the expression of thousands of genes simultaneously. The high dimensionality of the data poses the following problem: the number of covariates (∼10000) greatly exceeds the number of samples (∼200). Results: Here we give an inventory of methods that have been used to model survival using gene expression. These methods are critically reviewed and compared in a qualitative way. Finally, the methods are applied to artificial and real-life datasets for a quantitative comparison. Conclusions: The choice of the evaluation measure of predictive performance is crucial for the selection of the best method. Depending on the evaluation measure, either the L2-penalized Cox regression or the random forest ensemble method yields the best survival time prediction using gene expression for the data sets used. Consensus on which evaluation measure of predictive performance is best used is much needed.
The Centrality of
, 1992
Epigenetic profiling reveals etiologically distinct patterns of DNA methylation in head and neck squamous cell carcinoma
 Carcinogenesis
, 2009
Head and neck squamous cell carcinomas (HNSCC) represent clinically and etiologically heterogeneous tumors affecting over 40,000 patients per year in the United States. Previous research has identified individual epigenetic alterations, and, in some cases, the relationship of these alterations with carcinogen exposure or patient outcomes, suggesting that specific exposures give rise to specific types of molecular alterations in HNSCCs. Here we describe how different etiologic factors are reflected in the molecular character and clinical outcome of these tumors. In a case series of primary, incident HNSCC (n=68), we examined the DNA
An integrative pathway-based clinical–genomic model for cancer . . .
 STATISTICS AND PROBABILITY LETTERS
, 2010
A study on three Linear Discriminant Analysis based methods in Small Sample Size problem
In this paper, we make a study on three Linear Discriminant Analysis (LDA) based