Annotation-based Distance Measures for Patient Subgroup Discovery in Clinical Microarray Studies
Motivation: Clustering algorithms are widely used in the analysis of microarray data. In clinical studies, they are often applied to find groups of co-regulated genes. Clustering, however, can also stratify patients by similarity of their gene expression profiles, thereby defining novel disease entities based on molecular characteristics. Several distance-based cluster algorithms have been suggested, but little attention has been given to the distance measure between patients. Even with the Euclidean metric, including and excluding genes from the analysis leads to different distances between the same objects, and consequently different clustering results. Results: We describe a new clustering algorithm, in which gene selection is used to derive biologically meaningful clusterings of samples by combining expression profiles and functional annotation data. According to gene annotations, candidate gene sets with specific functional characterizations are generated. Each set defines a different distance measure between patients, leading to different clusterings. These clusterings are filtered using a resampling-based significance measure. Significant clusterings are reported together with the underlying gene sets and their functional definition. Conclusions: Our method reports clusterings defined by biologically focused sets of genes. In annotation-driven clusterings, we have recovered clinically relevant patient subgroups through biologically plausible sets of genes, as well as new subgroupings. We conjecture that our method has the potential to reveal so far unknown, clinically relevant classes of patients in an unsupervised manner. Availability: We provide the R package adSplit as part of Bioconductor release 1.9 and on
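The dependence of patient distances on the chosen gene set can be illustrated with a minimal sketch. The patient names, gene names, expression values and the two gene sets below are hypothetical toy data, not taken from the study, and the sketch is not the adSplit implementation; it only shows that restricting the Euclidean metric to different gene sets can change which patient is closest to a given one.

```python
import math

# Hypothetical toy expression profiles: patient -> {gene: expression value}.
profiles = {
    "P1": {"g1": 0.0, "g2": 0.0, "g3": 5.0, "g4": 5.0},
    "P2": {"g1": 0.1, "g2": 0.2, "g3": 0.0, "g4": 0.1},
    "P3": {"g1": 4.0, "g2": 4.1, "g3": 5.1, "g4": 4.9},
}

# Hypothetical gene sets, standing in for sets derived from functional annotation.
gene_sets = {"set_A": ["g1", "g2"], "set_B": ["g3", "g4"]}


def euclidean(a, b, genes):
    """Euclidean distance between two profiles, restricted to `genes`."""
    return math.sqrt(sum((a[g] - b[g]) ** 2 for g in genes))


def nearest_neighbour(patient, genes):
    """Return the other patient closest to `patient` under the gene-set metric."""
    others = [p for p in profiles if p != patient]
    return min(others, key=lambda p: euclidean(profiles[patient], profiles[p], genes))


# The same patient gets a different nearest neighbour under each gene set.
nearest = {name: nearest_neighbour("P1", genes) for name, genes in gene_sets.items()}
print(nearest)
```

Under set_A, patient P1 is closest to P2; under set_B, it is closest to P3, so a distance-based clustering would group the patients differently.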
Prediction of the Inter-Observer Visual Congruency (IOVC) and application to image ranking
This paper proposes an automatic method for predicting the inter-observer visual congruency (IOVC). The IOVC reflects the congruence or the variability among different subjects looking at the same image. Predicting this congruence is of interest for image processing applications where the visual perception of a picture matters, such as website design and advertising. This paper makes several new contributions. First, a computational model of the IOVC is proposed. This new model is a mixture of low-level visual features extracted from the input picture; its parameters are learned from a large eye-tracking database. Once the parameters have been learned, the model can be used for any new picture. Second, regarding low-level visual feature extraction, we propose a new scheme to compute the depth of field of a picture. Finally, once the training and the feature extraction have been carried out, a score ranging from 0 (minimal congruency) to 1 (maximal congruency) is computed. A value of 1 indicates that observers would focus on the same locations and suggests that the picture presents strong locations of interest. A second database of eye movements is used to assess the performance of the proposed model. Results show that our IOVC criterion outperforms the Feature Congestion measure [33]. To illustrate the interest of the proposed model, we have used it to automatically rank personalized photographs.
Dimensionality reduction by Canonical Contextual Correlation projections
 the 8th European Conference on Computer Vision, pp.562
, 2004
Abstract. A linear, discriminative, supervised technique for reducing feature vectors extracted from image data to a lower-dimensional representation is proposed. It is derived from classical Fisher linear discriminant analysis (LDA) and is useful, for example, in supervised segmentation tasks in which a high-dimensional feature vector describes the local structure of the image. In general, the main idea of the technique is applicable in discriminative and statistical modelling that involves contextual data. LDA is a basic, well-known and useful technique in many applications. Our contribution is that we extend the use of LDA to cases where there is dependency between the output variables, i.e., the class labels, and not only between the input variables. The latter can be dealt with in standard LDA. The principal idea is that where standard LDA merely takes into account a single class label for every feature vector, the new technique incorporates class labels of its neighborhood in its analysis as well. In this way, the spatial class label configuration in the vicinity of every feature vector is accounted for, resulting
Estimation of fitness landscape contours in EAs
in Proceedings of GECCO-2007, 2007
Evolutionary algorithms applied in the real domain should profit from information about the local curvature of the fitness function. This paper presents an initial study of an evolutionary strategy with a novel approach for learning the covariance matrix of a Gaussian distribution. The learning method is based on estimating the fitness landscape contour line between the selected and discarded individuals. The distribution learned this way is then used to generate new population members. The algorithm presented here is the first attempt to construct the Gaussian distribution this way and should be considered only a proof of concept; nevertheless, the empirical comparison on low-dimensional quadratic functions shows that our approach is viable. With respect to the number of evaluations needed to find a solution of a given quality, it is comparable to the state-of-the-art CMA-ES in the case of the sphere function and outperforms CMA-ES in the case of the elliptical function.
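For orientation, the general loop of sampling from a Gaussian, selecting, and re-learning the covariance matrix can be sketched as follows. This is explicitly not the contour-based learning rule described in the abstract, whose details are not given here; the sketch substitutes a plain maximum-likelihood refit of the Gaussian to the selected individuals. The function names, the two-dimensional setting, the factor 100 in the elliptical function, and all parameter values are assumptions made for illustration.

```python
import math
import random


def sphere(x):
    return x[0] ** 2 + x[1] ** 2


def ellipsoid(x):
    # Axis-parallel elliptical function; the factor 100 is an arbitrary choice.
    return x[0] ** 2 + 100.0 * x[1] ** 2


def fit_gaussian(points, jitter=1e-12):
    """Sample mean and 2x2 covariance, with a tiny diagonal jitter for stability."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n + jitter
    syy = sum((p[1] - my) ** 2 for p in points) / n + jitter
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), (sxx, sxy, syy)


def sample(mean, cov, rng):
    """Draw one point from N(mean, cov) using a hand-written 2x2 Cholesky factor."""
    sxx, sxy, syy = cov
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(max(syy - l21 * l21, 0.0))
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2)


def run(f, start=(3.0, 3.0), lam=20, mu=10, generations=30, seed=0):
    """Return the best fitness seen, counting the start point itself."""
    rng = random.Random(seed)
    mean, cov = start, (1.0, 0.0, 1.0)  # start from an isotropic Gaussian
    best = f(start)
    for _ in range(generations):
        pop = sorted((sample(mean, cov, rng) for _ in range(lam)), key=f)
        best = min(best, f(pop[0]))
        mean, cov = fit_gaussian(pop[:mu])  # refit to the selected individuals
    return best


if __name__ == "__main__":
    print("sphere:", run(sphere))
    print("ellipsoid:", run(ellipsoid))
```

A refit of this kind uses only the selected individuals; the approach in the abstract additionally uses the discarded ones to locate the contour between the two groups.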
Tropical Implicitization
, 2010
In recent years, tropical geometry has developed as a theory on its own. Its two main aims are to answer open questions in algebraic geometry and to give new proofs of celebrated classical results. The main subject of this thesis is concerned with the former: the solution of implicitization problems via tropical geometry. We develop new and explicit techniques that completely solve these challenges in four concrete examples. We start by studying a family of challenging examples inspired by algebraic statistics and machine learning: the restricted Boltzmann machines F(n, k). These machines are highly structured projective varieties in tensor spaces. They correspond to a statistical model encoded by the complete bipartite graph K_{k,n}, by marginalizing k of the n + k binary random variables. In Chapter 2, we investigate this problem in the most general setting. We conjecture a formula for the expected dimension of the model, verifying it in all relevant cases. We also study inference functions and their interplay with tropicalization of polynomial maps. In Chapter 3, we focus on the particular case F(4, 2), answering a question by Drton, Sturmfels and Sullivant regarding the degree (and Newton polytope) of the homogeneous equation in 16 variables defining this model. We show that its degree is 110 and compute
A g-prior extension for p > n
, 801
For the normal linear model regression setup, Zellner’s g-prior is extended to the case where the number of predictors p exceeds the number of observations n. Exact analytical calculation of the marginal density under this prior is seen to lead to a new closed-form variable selection criterion. These results are also applicable to the multivariate regression setup.
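As background, the standard form of Zellner's g-prior can be written as below. This is the commonly used zero-mean version, recalled here only to show why the case p > n needs special treatment; the specific extension and the resulting selection criterion are not reproduced, since the abstract does not state them.

```latex
\[
  y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n),
\]
\[
  \beta \mid \sigma^2 \sim \mathcal{N}\!\bigl(0,\; g\,\sigma^2 (X^\top X)^{-1}\bigr),
  \qquad p(\sigma^2) \propto \frac{1}{\sigma^2}.
\]
% X is n x p, so rank(X^T X) <= min(n, p).
% For p > n the matrix X^T X is singular and the inverse above does not exist.
```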
Risk Bounds for Embedded Variable Selection in Classification Trees
, 2012
The problems of model selection and variable selection for classification trees are jointly considered. A penalized criterion is proposed which explicitly takes into account the number of variables, and a risk bound inequality is provided for the tree classifier minimizing this criterion. This penalized criterion is compared to the one used during the pruning step of the CART algorithm. It is shown that the two criteria are similar under some specific margin assumptions. In practice, the tuning parameter of the CART penalty has to be calibrated by hold-out. Simulation studies are performed which confirm that the hold-out procedure mimics the form of the proposed penalized criterion. Keywords: Theory
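For reference, the criterion minimized in the pruning step of CART is the usual cost-complexity criterion, recalled below in generic notation (the symbols are chosen here and need not match the paper's). The proposed criterion, which also depends on the number of variables, is not reproduced because its exact form is not given in the abstract.

```latex
\[
  \operatorname{crit}_{\alpha}(T) \;=\; \widehat{R}_n(T) \;+\; \alpha\,\lvert \widetilde{T} \rvert ,
\]
% \widehat{R}_n(T): empirical misclassification rate of the tree T on n observations
% |\widetilde{T}|:  number of leaves (terminal nodes) of T
% \alpha >= 0:      tuning parameter, calibrated in practice by hold-out
```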
High dimensional multiclass classification with applications to cancer diagnosis
Probabilistic classifiers are introduced and it is shown that the only regular linear probabilistic classifier with convex risk is multinomial regression. Penalized empirical risk minimization is introduced and used to construct supervised learning methods for probabilistic classifiers. A sparse group lasso penalized approach to high dimensional multinomial classification is presented. On different real data examples it is found that this approach clearly outperforms multinomial lasso in terms of error rate and features included in the model. An efficient coordinate descent algorithm is developed and its convergence is established. This algorithm is implemented in the msgl R package. Examples of high dimensional multiclass problems are studied, in particular examples of multiclass classification based on gene expression measurements. One such example is the clinically important problem of identifying the primary tumor site of liver metastases; this particular problem is studied in detail. In order to adjust for the liver contamination found in biopsies of metastases, a computational contamination model is developed. The contamination model is presented in a domain adaptation framework and a simulation-based domain adaptation strategy is presented. It is shown that the presented computational contamination approach
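The sparse group lasso penalty mentioned above combines a lasso term with a group lasso term. The following minimal sketch evaluates one common form of that penalty. The function name, the square-root-of-group-size weights, and the parameterization by `lam` and `alpha` are assumptions for illustration; they are not taken from the thesis or from the msgl package interface.

```python
import math


def sparse_group_lasso_penalty(beta, groups, lam, alpha):
    """Evaluate lam * ((1 - alpha) * sum_g sqrt(|g|) * ||beta_g||_2 + alpha * ||beta||_1).

    beta   -- flat list of coefficients
    groups -- list of index lists partitioning range(len(beta))
    lam    -- overall penalty strength (>= 0)
    alpha  -- mixing weight in [0, 1]: 1 gives the lasso, 0 the group lasso
    """
    l1_term = sum(abs(b) for b in beta)
    group_term = sum(
        math.sqrt(len(g)) * math.sqrt(sum(beta[i] ** 2 for i in g))
        for g in groups
    )
    return lam * ((1.0 - alpha) * group_term + alpha * l1_term)


# Hypothetical coefficients: the first group is active, the second is entirely zero.
beta = [3.0, 4.0, 0.0, 0.0]
groups = [[0, 1], [2, 3]]
print(sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=0.5))
```

The group term lets whole groups of coefficients be zero at once, while the lasso term also allows individual zeros inside an active group.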
Probabilistic Models for Melodic Prediction
, 2008
Submitted for publication. Abstract. Chord progressions are the building blocks from which tonal music is constructed. The choice of a particular representation for chords has a strong impact on statistical modeling of the dependence between chord symbols and the actual sequences of notes in polyphonic music. Melodic prediction is used in this paper as a benchmark task to evaluate the quality of four chord representations using two probabilistic model architectures derived from Input/Output Hidden Markov Models (IOHMMs). IDIAP–RR 0850