A quantitative comparison of the similarity between genes and geography in worldwide human populations
, 2012
Cited by 6 (0 self)
Multivariate statistical techniques such as principal components analysis (PCA) and multidimensional scaling (MDS) have been widely used to summarize the structure of human genetic variation, often in easily visualized two-dimensional maps. Many recent studies have reported similarity between geographic maps of population locations and MDS or PCA maps of genetic variation inferred from single-nucleotide polymorphisms (SNPs). However, this similarity has been evident primarily in a qualitative sense; and, because different multivariate techniques and marker sets have been used in different studies, it has not been possible to formally compare genetic variation datasets in terms of their levels of similarity with geography. In this study, using genome-wide SNP data from 128 populations worldwide, we perform a systematic analysis to quantitatively evaluate the similarity of genes and geography in different geographic regions. For each of a series of regions, we apply a Procrustes analysis approach to find an optimal transformation that maximizes the similarity between PCA maps of genetic variation and geographic maps of population locations. We consider examples in Europe, Sub-Saharan Africa, Asia, East Asia, and Central/South Asia, as well as in a worldwide sample, finding that significant similarity between genes and geography exists in general at different geographic levels. The similarity is highest in our examples for Asia and, once highly distinctive populations have been removed, Sub-Saharan Africa. Our results provide a quantitative assessment of the geographic structure of human genetic variation worldwide, supporting the view that geography plays a strong role in
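The Procrustes comparison mentioned in this abstract can be illustrated with a minimal NumPy sketch. It assumes one common definition of the statistic: after centering both maps and allowing rotation, reflection, and uniform scaling, the minimized scaled sum of squared distances D yields a similarity score sqrt(1 - D). The function name `procrustes_similarity` and the toy data are hypothetical, and the sketch is not taken from the paper itself.

```python
import numpy as np

def procrustes_similarity(geo, pcs):
    """Similarity in [0, 1] between two n x 2 point maps.

    Assumed definition: sqrt(1 - D), where D is the minimized, scaled
    sum of squared distances after optimal translation, rotation or
    reflection, and uniform scaling of one map onto the other.
    """
    X = np.asarray(geo, dtype=float)
    Y = np.asarray(pcs, dtype=float)
    X = X - X.mean(axis=0)          # remove translation
    Y = Y - Y.mean(axis=0)
    # Singular values of X^T Y determine the best rotation and scaling fit.
    sing = np.linalg.svd(X.T @ Y, compute_uv=False)
    d = 1.0 - sing.sum() ** 2 / (np.sum(X ** 2) * np.sum(Y ** 2))
    return float(np.sqrt(np.clip(1.0 - d, 0.0, 1.0)))

# Toy usage: hypothetical coordinates, not data from the study.
rng = np.random.default_rng(0)
locations = rng.normal(size=(20, 2))           # stand-in for geographic coordinates
pc_map = locations + 0.1 * rng.normal(size=(20, 2))  # stand-in for PC1/PC2 coordinates
score = procrustes_similarity(locations, pc_map)
```

A score near 1 means one map is close to a rotated, rescaled copy of the other. Assessing significance would typically need a permutation test on the population labels, which this sketch omits.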
Nonconvex statistical optimization: Minimax-optimal sparse PCA in polynomial time. Available at arXiv:1408.5352
, 2014
Cited by 4 (1 self)
Sparse principal component analysis (PCA) involves nonconvex optimization for which the global solution is hard to obtain. To address this issue, one popular approach is convex relaxation. However, such an approach may produce suboptimal estimators due to the relaxation effect. To optimally estimate sparse principal subspaces, we propose a two-stage computational framework named “tighten after relax”: within the “relax” stage, we approximately solve a convex relaxation of sparse PCA with early stopping to obtain a desired initial estimator; for the “tighten” stage, we propose a novel algorithm called sparse orthogonal iteration pursuit (SOAP), which iteratively refines the initial estimator by directly solving the underlying nonconvex problem. A key concept of this two-stage framework is the basin of attraction. It represents a local region within which the “tighten” stage has desired computational and statistical guarantees. We prove that the initial estimator obtained from the “relax” stage falls into such a region, and hence SOAP geometrically converges to a principal subspace estimator which is minimax-optimal within a certain model class. Unlike most existing sparse PCA estimators, our approach applies to the non-spiked covariance models, and adapts to non-Gaussianity as well as dependent data settings. Moreover, through analyzing the computational complexity of the two stages, we illustrate an interesting phenomenon: larger sample size can reduce the total iteration complexity. Our framework motivates a general paradigm for solving many complex statistical problems which involve nonconvex optimization with provable guarantees.
Testing for Associations between Loci and Environmental Gradients Using Latent Factor Mixed Models
Cited by 3 (0 self)
Adaptation to local environments often occurs through natural selection acting on a large number of loci, each having a weak phenotypic effect. One way to detect these loci is to identify genetic polymorphisms that exhibit high correlation with environmental variables used as proxies for ecological pressures. Here, we propose new algorithms based on population genetics, ecological modeling, and statistical learning techniques to screen genomes for signatures of local adaptation. Implemented in the computer program “latent factor mixed model” (LFMM), these algorithms employ an approach in which population structure is introduced using unobserved variables. These fast and computationally efficient algorithms detect correlations between environmental and genetic variation while simultaneously inferring background levels of population structure. Comparing these new algorithms with related methods provides evidence that LFMM can efficiently estimate random effects due to population history and isolation-by-distance patterns when computing gene-environment correlations, and decrease the number of false-positive associations in genome scans. We then apply these models to plant and human genetic data, identifying several genes with functions related to development that exhibit strong correlations with climatic gradients. Key words: local adaptation, environmental correlations, genome scans, latent factor models, population structure.
Genome Scans for Detecting Footprints of Local Adaptation Using a Bayesian Factor Model
Cited by 2 (0 self)
There is a considerable impetus in population genomics to pinpoint loci involved in local adaptation. A powerful approach to find genomic regions subject to local adaptation is to genotype numerous molecular markers and look for outlier loci. One of the most common approaches for selection scans is based on statistics that measure population differentiation such as FST. However, there are important caveats with approaches related to FST because they require grouping individuals into populations and they additionally assume a particular model of population structure. Here, we implement a more flexible individual-based approach based on Bayesian factor models. Factor models capture population structure with latent variables called factors, which can describe clustering of individuals into populations or isolation-by-distance patterns. Using hierarchical Bayesian modeling, we both infer population structure and identify outlier loci that are candidates for local adaptation. In order to identify outlier loci, the hierarchical factor model searches for loci that are atypically related to population structure as measured by the latent factors. In a model of population divergence, we show that it can achieve a 2-fold or more reduction of false discovery rate compared with the software BayeScan or with an FST approach. We show that our software can handle large data sets by analyzing the single nucleotide polymorphisms of the Human Genome Diversity Project. The Bayesian factor model is implemented in the open-source PCAdapt software. Key words: FST, population structure, landscape genetics, population genomics, selection scans.
Bayesian sparse factor analysis of genetic covariance matrices. Genetics, 0:in press
, 2013
Cited by 2 (0 self)
Quantitative genetic studies that model complex, multivariate phenotypes are important for both evolutionary prediction and artificial selection. For example, changes in gene expression can provide insight into developmental and physiological mechanisms that link genotype and phenotype. However, classical analytical techniques are poorly suited to quantitative genetic studies of gene expression where the number of traits assayed per individual can reach many thousand. Here, we derive a Bayesian genetic sparse factor model for estimating the genetic covariance matrix (G-matrix) of high-dimensional traits, such as gene expression, in a mixed effects model. The key idea of our model is that we need only consider G-matrices that are biologically plausible. An organism’s entire phenotype is the result of processes that are modular and have limited complexity. This implies that the G-matrix will be highly structured. In particular, we assume that a limited number of intermediate traits (or factors, e.g., variations in development or physiology) control the variation in the high-dimensional phenotype, and that each of these intermediate traits is sparse – affecting only a few observed traits. The advantages of this approach are twofold. First, sparse factors are interpretable and provide biological insight into mechanisms underlying the genetic architecture. Second, enforcing sparsity helps prevent sampling errors from swamping out the true signal in high-dimensional data. We demonstrate the advantages of our model on simulated data and in an analysis of a published Drosophila melanogaster gene expression data set.
Landscape genomic tests for associations between loci and environmental gradients
, 2012
Cited by 1 (0 self)
Adaptation to local environments often occurs through natural selection acting on a large number of alleles, each having a weak phenotypic effect. One way to detect these alleles is to identify genetic polymorphisms that exhibit high correlation with environmental variables used as proxies for ecological pressures. Here we propose an integrated framework based on population genetics, ecological modeling and statistical learning techniques to screen genomes for signatures of local adaptation. These new algorithms introduce latent factor mixed models to population genetics, employing an approach based on probabilistic principal component analysis in which population structure is introduced via unobserved variables. These fast, computationally efficient algorithms detect correlations between environmental and genetic variation while simultaneously inferring background levels of population structure. Comparing these new algorithms with related methods provides evidence that latent factor models can efficiently estimate random effects due to population history and isolation-by-distance patterns when computing gene-environment correlations, and decrease the number of false-positive associations in genome scans. We then apply these models to plant and human genetic data, identifying several genes with functions related to development that exhibit unusual correlations with climatic gradients.
Normalizing RNA-sequencing data by modeling hidden covariates with prior knowledge
 PLoS ONE
Cited by 1 (0 self)
Transcriptomic assays that measure expression levels are widely used to study the manifestation of environmental or genetic variations in cellular processes. RNA-sequencing in particular has the potential to considerably improve such understanding because of its capacity to assay the entire transcriptome, including novel transcriptional events. However, as with earlier expression assays, analysis of RNA-sequencing data requires carefully accounting for factors that may introduce systematic, confounding variability in the expression measurements, resulting in spurious correlations. Here, we consider the problem of modeling and removing the effects of known and hidden confounding factors from RNA-sequencing data. We describe a unified residual framework that encapsulates existing approaches, and using this framework, present a novel method, HCP (Hidden Covariates with Prior). HCP uses a more informed assumption about the confounding factors, and performs as well or better than existing approaches while having a much lower computational cost. Our experiments demonstrate that accounting for known and hidden factors with appropriate models improves the quality of RNA-sequencing data in two very different tasks: detecting genetic variations that are associated with nearby expression variations (cis-eQTLs), and constructing accurate co-expression networks.
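The general idea of a residual approach (subtract what known and hidden confounders explain, then analyze what remains) can be sketched as follows. This is not HCP: as a labeled simplification, hidden factors are estimated here by plain PCA of the residuals, whereas the abstract states that HCP uses a more informed assumption that is not reproduced. The function name `remove_confounders` and its parameters are hypothetical.

```python
import numpy as np

def remove_confounders(Y, known, n_hidden):
    """Return expression residuals after removing known and hidden factors.

    Y        : samples x genes expression matrix
    known    : samples x covariates matrix of measured confounders
    n_hidden : number of hidden factors to remove (PCA stand-in, an assumption)
    """
    Y = np.asarray(Y, dtype=float)
    Y = Y - Y.mean(axis=0)                          # center each gene
    K = np.column_stack([np.ones(len(Y)), known])   # intercept + known covariates
    beta, *_ = np.linalg.lstsq(K, Y, rcond=None)    # least-squares fit per gene
    R = Y - K @ beta                                # residual after known factors
    # Hidden factors approximated by the leading principal components of R.
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    hidden_part = (U[:, :n_hidden] * s[:n_hidden]) @ Vt[:n_hidden]
    return R - hidden_part
```

Downstream analyses such as eQTL mapping or co-expression estimation would then run on the returned residual matrix. Choosing `n_hidden` is left open here and would usually be tuned on the task of interest.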
Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time. Advances in Neural Information Processing Systems (NIPS)
, 2014
Cited by 1 (0 self)
We provide statistical and computational analysis of sparse Principal Component Analysis (PCA) in high dimensions. The sparse PCA problem is highly nonconvex in nature. Consequently, though its global solution attains the optimal statistical rate of convergence, such solution is computationally intractable to obtain. Meanwhile, although its convex relaxations are tractable to compute, they yield estimators with suboptimal statistical rates of convergence. On the other hand, existing nonconvex optimization procedures, such as greedy methods, lack statistical guarantees. In this paper, we propose a two-stage sparse PCA procedure that attains the optimal principal subspace estimator in polynomial time. The main stage employs a novel algorithm named sparse orthogonal iteration pursuit, which iteratively solves the underlying nonconvex problem. However, our analysis shows that this algorithm only has desired computational and statistical guarantees within a restricted region, namely the basin of attraction. To obtain the desired initial estimator that falls into this region, we solve a convex formulation of sparse PCA with early stopping. Under an integrated analytic framework, we simultaneously characterize the computational and statistical performance of this two-stage procedure. Computationally, our procedure converges at the rate of 1/√t within the initialization stage, and at a geometric rate within the main stage. Statistically, the final principal subspace estimator achieves the minimax-optimal statistical rate of convergence with respect to the sparsity level s∗, dimension d and sample size n. Our procedure motivates a general paradigm of tackling nonconvex statistical learning problems with provable statistical guarantees.
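To make the "main stage" idea concrete, the following is a rough sketch of an orthogonal iteration with row truncation, written in the spirit of the description above. It is not the authors' algorithm as published: the exact truncation rule, stopping criterion, and the convex-relaxation initialization are not given in the abstract, so the initial estimate `U0` is simply assumed to be supplied, and all names are hypothetical.

```python
import numpy as np

def truncate_rows(V, s):
    """Keep the s rows of V with the largest Euclidean norm; zero the rest."""
    keep = np.argsort(np.linalg.norm(V, axis=1))[-s:]
    out = np.zeros_like(V)
    out[keep] = V[keep]
    return out

def sparse_orthogonal_iteration(S, U0, s, n_iter=50):
    """Refine an initial d x k subspace estimate U0 for covariance S.

    Each pass does a power step, orthonormalizes, truncates to s nonzero
    rows, and orthonormalizes again. Assumes U0 is already close enough
    to the target subspace (the 'basin of attraction' in the abstract).
    """
    U = np.asarray(U0, dtype=float)
    for _ in range(n_iter):
        V, _ = np.linalg.qr(S @ U)                # power step + orthonormalize
        U, _ = np.linalg.qr(truncate_rows(V, s))  # enforce row sparsity
    return U

# Toy usage on a hypothetical spiked covariance with a 5-sparse leading direction.
d, s = 30, 5
u = np.zeros((d, 1))
u[:s] = 1.0 / np.sqrt(s)
S = np.eye(d) + 10.0 * (u @ u.T)
U_hat = sparse_orthogonal_iteration(S, np.ones((d, 1)) / np.sqrt(d), s)
```

The estimate is determined only up to sign (or rotation when k > 1), so comparisons with a reference direction should use a subspace distance or an absolute inner product.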
Statistical Methods for Studying Genetic Variation in Populations
, 2012
Cited by 1 (0 self)
This Dissertation is brought to you for free and open access by the Theses and Dissertations at Research Showcase @ CMU. It has been accepted for inclusion in Dissertations by an authorized administrator of Research Showcase @ CMU. For more information, please contact research