Results 1 - 7 of 7
IDENTIFIABILITY OF A MARKOVIAN MODEL OF MOLECULAR EVOLUTION WITH GAMMA-DISTRIBUTED RATES
, 2008
Cited by 5 (2 self)
Inference of evolutionary trees and rates from biological sequences is commonly performed using continuous-time Markov models of character change. The Markov process evolves along an unknown tree while observations arise only from the tips of the tree. Rate heterogeneity is present in most real data sets and is accounted for by the use of flexible mixture models where each site is allowed its own rate. Very little has been rigorously established concerning the identifiability of the models currently in common use in data analysis, although non-identifiability was proven for a semiparametric model and an incorrect proof of identifiability was published for a general parametric model (GTR+Γ+I). Here we prove that one of the most widely used models (GTR+Γ) is identifiable for generic parameters, and for all parameter choices in the case of 4-state (DNA) models. This is the first proof of identifiability of a phylogenetic model with a continuous distribution of rates.
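As a rough illustration of what "each site is allowed its own rate" means, the sketch below uses the Jukes-Cantor model, a simple special case of GTR chosen here for brevity and not the general model treated in the paper. It assumes site rates follow a Gamma distribution with shape alpha and mean 1, in which case averaging over the rate has a closed form. The function names are hypothetical.

```python
import math


def jc_gamma_probs(t, alpha):
    """Return (P(same base), P(one specific different base)) after time t.

    Assumption: Jukes-Cantor substitution model with site rates drawn from
    a Gamma distribution with shape alpha and mean 1 (i.e. rate alpha).
    """
    # For r ~ Gamma(shape=alpha, rate=alpha), the average of exp(-4*r*t/3)
    # is the closed form (1 + 4t / (3*alpha)) ** (-alpha).
    decay = (1.0 + 4.0 * t / (3.0 * alpha)) ** (-alpha)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


def jc_probs(t):
    """Same quantities with no rate variation (every site has rate 1)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


if __name__ == "__main__":
    for alpha in (0.5, 2.0, 1e6):
        print(alpha, jc_gamma_probs(0.5, alpha))
    print("no rate variation", jc_probs(0.5))
```

As alpha grows, the Gamma distribution concentrates at rate 1 and the two functions should agree closely; small alpha corresponds to strong rate heterogeneity.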
Identifiability of 2-tree mixtures for group-based models
, 909
Cited by 1 (0 self)
Phylogenetic data arising on two possibly different tree topologies might be mixed through several biological mechanisms, including incomplete lineage sorting or horizontal gene transfer in the case of different topologies, or simply different substitution processes on characters in the case of the same topology. Recent work on a 2-state symmetric model of character change showed such a mixture model has non-identifiable parameters, and thus it is theoretically impossible to determine the two tree topologies from any amount of data under such circumstances. Here the question of identifiability is investigated for 2-tree mixtures of the 4-state group-based models, which are more relevant to DNA sequence data. Using algebraic techniques, we show that the tree parameters are identifiable for the JC and K2P models. We also prove that generic substitution parameters for the JC mixture models are identifiable, and for the K2P and K3P models obtain generic identifiability results for mixtures on the same tree. This indicates that the full phylogenetic signal remains in such mixtures, and that the 2-state symmetric result is thus a misleading guide to the behavior of other models.
IDENTIFIABILITY OF THE GTR+Γ MODEL OF MOLECULAR EVOLUTION
, 709
Inference of evolutionary trees and rates from biological sequences is commonly performed using models of character change that incorporate rate variation across sites. Though an incorrect proof of the identifiability of the GTR+Γ+I model has been published, very little has been rigorously established concerning the identifiability of the models currently in common use in data analysis. Here we prove that the GTR+Γ model is identifiable for generic parameters, and for all parameter choices in the case of 4-state (DNA) models. This is the first proof of identifiability of a phylogenetic model with a continuous distribution of rate classes.
Population Recovery and Partial Identification
We study several problems in which an unknown distribution over an unknown population of vectors needs to be recovered from partial or noisy samples, each of which nearly completely erases or obliterates the original vector. Such problems naturally arise in a variety of contexts in learning, clustering, statistics, computational biology, data mining and database privacy, where loss and error may be introduced by nature, inaccurate measurements, or on purpose. We give fairly efficient algorithms to recover the data under fairly general assumptions. Underlying our algorithms is a new structure we call a partial identification (PID) graph for an arbitrary finite set of vectors over any alphabet. This graph captures the extent to which certain subsets of coordinates in each vector distinguish it from other vectors. PID graphs yield strategies for dimension reductions and reassembly of statistical information. The quality of our algorithms (sequential and parallel runtime, as well as numerical stability) critically depends on three parameters of PID graphs: width, depth and cost. The combinatorial heart of this work is showing that every set of vectors possesses a PID graph in which all three parameters are small (we prove some limitations on their tradeoffs as well). We further give an efficient algorithm to find such near-optimal PID graphs for any set of vectors. Our efficient PID graphs imply general algorithms for these recovery problems, even when loss or noise is just below the information-theoretic limit! In the learning/clustering context this gives a new algorithm for learning mixtures of binomial distributions (with known marginals) whose running time depends only quasi-polynomially on the number of clusters. We discuss implications for privacy and coding as well.
TROPICAL MIXTURES OF STAR TREE METRICS
, 907
We study tree metrics that can be realized as a mixture of two star tree metrics. We prove that the only trees admitting such a decomposition are the ones having only one internal edge and, moreover, certain relations among the weights assigned to all edges must hold. We also describe the fibers of the corresponding mixture map. In addition, we discuss the general framework of tropical secant varieties and we interpret our results within this setting. Finally, after discussing recent results on upper bounds on star tree ranks of metrics on n taxa, we show that analogous bounds for star tree metric ranks cannot exist.
Can We Avoid “SIN” in the House of “No Common Mechanism”?
, 2010
In “no common mechanism” (NCM) models of character evolution, each character can evolve on a phylogenetic tree under a partially or totally separate process (e.g., with its own branch lengths). In such cases, the usual conditions that suffice to establish the statistical consistency of tree reconstruction by methods such as maximum likelihood (ML) break down, suggesting that such methods may be prone to statistical inconsistency (SIN). In this paper I ask whether we can avoid SIN for tree topology reconstruction when adopting such models, either by using ML or by any other method that could be devised. I show that it is possible to avoid SIN for certain NCM models, but not for others, and the results depend delicately on the tree reconstruction method employed. I also describe the biological relevance of some recent mathematical results for the more usual “common mechanism” (CM) setting. The results are not intended to justify NCM, but rather to set in place a framework within which such questions can be formally addressed. SIN in phylogenetics is the tendency of certain tree reconstruction methods to fail to converge on the correct
Enclose it or Lose it! Computer-aided Proofs in Statistics
, 2010
Enclosure methods are a class of computer-aided proofs used in analysis. They are used increasingly to solve open problems in mathematics. The proposed project will use enclosure methods to address two open statistical decision problems: (1) rigorous parameter estimation in a chaotic statistical experiment, and (2) rigorous point estimation and exact posterior sampling in phylogenetics. To address these problems, we will adapt and extend recent developments in contractor programming, interval constraint propagation, and algebraic statistical constraints, and employ a novel mapped subpaving arithmetic. A C++ class library that can harness UC’s supercomputing power for such computer-aided proofs will be made publicly available along with a database of solutions. Enclosure methods that rely on machine interval arithmetic (validated computer arithmetic that encloses or bounds all numerical errors) have become an important tool in computer-aided proofs in analysis. Some examples where these methods have been applied include proofs of the Feigenbaum conjectures, the double bubble conjecture, the existence of the Lorenz
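To make the idea of machine interval arithmetic concrete, here is a minimal sketch in Python. It is not the C++ library described in the proposal. The `Interval` class and its methods are hypothetical names, and the outward rounding is approximated by stepping one floating-point value outward with `math.nextafter` (an assumption that requires Python 3.9 or later), which is cruder than directed rounding in a real validated-arithmetic library.

```python
import math
from fractions import Fraction


class Interval:
    """Closed interval [lo, hi] of floats meant to enclose an exact real value."""

    def __init__(self, lo, hi):
        if lo > hi:
            raise ValueError("lower bound exceeds upper bound")
        self.lo, self.hi = lo, hi

    @classmethod
    def around(cls, x):
        # Widen by one float on each side so that the decimal value the
        # caller typed (e.g. 0.1) is enclosed despite binary rounding.
        return cls(math.nextafter(x, -math.inf), math.nextafter(x, math.inf))

    def __add__(self, other):
        # Round-to-nearest sums are pushed outward by one float each.
        return Interval(math.nextafter(self.lo + other.lo, -math.inf),
                        math.nextafter(self.hi + other.hi, math.inf))

    def __mul__(self, other):
        # The extreme products occur at endpoint combinations.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(math.nextafter(min(products), -math.inf),
                        math.nextafter(max(products), math.inf))

    def contains(self, q):
        # Compare exactly, using rationals, to avoid further rounding.
        return Fraction(self.lo) <= q <= Fraction(self.hi)


if __name__ == "__main__":
    a, b = Interval.around(0.1), Interval.around(0.2)
    total = a + b
    print(total.lo, total.hi, total.contains(Fraction(3, 10)))
```

The point of the sketch is that the plain float sum `0.1 + 0.2` does not equal 3/10 exactly, whereas the interval result is built to contain it. Overflow to infinity and other edge cases are not handled.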