Results 1  10
of
33
RHODES,J.A.(2009). Identifiability of parameters in latent structure models with many observed variables
 Ann. Statist
"... While hidden class models of various types arise in many statistical applications, it is often difficult to establish the identifiability of their parameters. Focusing on models in which there is some structure of independence of some of the observed variables conditioned on hidden ones, we demonstr ..."
Abstract

Cited by 27 (4 self)
 Add to MetaCart
While hidden class models of various types arise in many statistical applications, it is often difficult to establish the identifiability of their parameters. Focusing on models in which there is some structure of independence of some of the observed variables conditioned on hidden ones, we demonstrate a general approach for establishing identifiability utilizing algebraic arguments. A theorem of J. Kruskal for a simple latentclass model with finite state space lies at the core of our results, though we apply it to a diverse set of models. These include mixtures of both finite and nonparametric product distributions, hidden Markov models and random graph mixture models, and lead to a number of new results and improvements to old ones. In the parametric setting, this approach indicates that for such models, the classical definition of identifiability is typically too strong. Instead generic identifiability holds, which implies that the set of nonidentifiable parameters has measure zero, so that parameter inference is still meaningful. In particular, this sheds light on the properties of finite mixtures of Bernoulli products, which have been used for decades despite being known to have nonidentifiable parameters. In the nonparametric setting, we again obtain identifiability only when certain restrictions are placed on the distributions that are mixed, but we explicitly describe the conditions. 1. Introduction. Statistical
Performance of a New Invariants Method on Homogeneous and Nonhomogeneous Quartet Trees
, 2006
"... ..."
Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites
, 2007
"... ..."
The Identifiability of Covarion Models in phylogenetics
, 2008
"... Covarion models of character evolution describe inhomogeneities in substitution processes through time. In phylogenetics, such models are used to describe changing functional constraints or selection regimes during the evolution of biological sequences. In this work the identifiability of such mode ..."
Abstract

Cited by 10 (7 self)
 Add to MetaCart
Covarion models of character evolution describe inhomogeneities in substitution processes through time. In phylogenetics, such models are used to describe changing functional constraints or selection regimes during the evolution of biological sequences. In this work the identifiability of such models for generic parameters on a known phylogenetic tree is established, provided the number of covarion classes does not exceed the size of the observable state space. Combined with earlier results, this implies both the tree and generic numerical parameters are identifiable if the number of classes is strictly smaller than the number of observable states.
Mixedup Trees: the Structure of Phylogenetic Mixtures
, 2008
"... In this paper, we apply new geometric and combinatorial methods to the study of phylogenetic mixtures. The focus of the geometric approach is to describe the geometry of phylogenetic mixture distributions for the two state random cluster model, which is a generalization of the two state symmetric ( ..."
Abstract

Cited by 8 (1 self)
 Add to MetaCart
(Show Context)
In this paper, we apply new geometric and combinatorial methods to the study of phylogenetic mixtures. The focus of the geometric approach is to describe the geometry of phylogenetic mixture distributions for the two state random cluster model, which is a generalization of the two state symmetric (CFN) model. In particular, we show that the set of mixture distributions forms a convex polytope and we calculate its dimension; corollaries include a simple criterion for when a mixture of branch lengths on the star tree can mimic the site pattern frequency vector of a resolved quartet tree. Furthermore, by computing volumes of polytopes we can clarify how “common ” nonidentifiable mixtures are under the CFN model. We also present a new combinatorial result which extends any identifiability result for a specific pair of trees of size six to arbitrary pairs of trees. Next we present a positive result showing identifiability of ratesacrosssites models. Finally, we answer a question raised in a previous paper concerning “mixed branch repulsion” on trees larger than quartet trees under the CFN model.
Pitfalls of heterogeneous processes for phylogenetic reconstruction
 Systematic Biology
, 2006
"... Different genes often have different phylogenetic histories. Even within regions having the same phylogenetic history, the mutation rates often vary. We investigate the prospects of phylogenetic reconstruction when all the characters are generated from the same tree topology, but the branch lengths ..."
Abstract

Cited by 8 (2 self)
 Add to MetaCart
Different genes often have different phylogenetic histories. Even within regions having the same phylogenetic history, the mutation rates often vary. We investigate the prospects of phylogenetic reconstruction when all the characters are generated from the same tree topology, but the branch lengths vary (with possibly different tree shapes). Furthering work of Kolaczkowski and Thornton (2004) and Chang (1996), we show examples where maximum likelihood (under a homogeneous model) is an inconsistent estimator of the tree. We then explore the prospects of phylogenetic inference under a heterogeneous model. In some models, there are examples where phylogenetic inference under any method is impossible – despite the fact that there is a common tree topology. In particular, there are nonidentifiable mixture distributions, i.e., multiple topologies generate identical mixture distributions. We address which evolutionary models have nonidentifiable mixture distributions and prove that the following duality theorem holds for most DNA substitution models. The model has either: (i) Nonidentifiability – two different tree topologies can produce identical mixture distributions, and hence distinguishing between the two topologies is impossible; or (ii) Linear tests – there exist linear tests which identify the common tree topology for character data generated by a mixture distribution. The theorem holds for models whose transition matrices can be parameterized by open sets, which includes most of the popular models, such as TamuraNei and Kimura’s 2parameter model. The duality theorem relies on our notion of linear tests, which are related to Lake’s linear invariants. 1
Phylogeny of mixture models: Robustness of maximum likelihood and nonidentifiable distributions
"... We address phylogenetic reconstruction when the data is generated from a mixture distribution. Such topics have gained considerable attention in the biological community with the clear evidence of heterogeneity of mutation rates. In our work we consider data coming from a mixture of trees which shar ..."
Abstract

Cited by 8 (2 self)
 Add to MetaCart
We address phylogenetic reconstruction when the data is generated from a mixture distribution. Such topics have gained considerable attention in the biological community with the clear evidence of heterogeneity of mutation rates. In our work we consider data coming from a mixture of trees which share a common topology, but differ in their edge weights (i.e., branch lengths). We first show the pitfalls of popular methods, including maximum likelihood and Markov chain Monte Carlo algorithms. We then determine in which evolutionary models, reconstructing the tree topology, under a mixture distribution, is (im)possible. We prove that every model whose transition matrices can be parameterized by an open set of multilinear polynomials, either has nonidentifiable mixture distributions, in which case reconstruction is impossible in general, or there exist linear tests which identify the topology. This duality theorem, relies on our notion of linear tests and uses ideas from convex programming duality. Linear tests are closely related to linear invariants, which were first introduced by Lake, and are natural from an algebraic geometry perspective.
Using invariants for phylogenetic tree construction
, 2008
"... Phylogenetic invariants are certain polynomials in the joint probability distribution of a Markov model on a phylogenetic tree. Such polynomials are of theoretical interest in the field of algebraic statistics and they are also of practical interest—they can be used to construct phylogenetic trees ..."
Abstract

Cited by 7 (0 self)
 Add to MetaCart
Phylogenetic invariants are certain polynomials in the joint probability distribution of a Markov model on a phylogenetic tree. Such polynomials are of theoretical interest in the field of algebraic statistics and they are also of practical interest—they can be used to construct phylogenetic trees. This paper is a selfcontained introduction to the algebraic, statistical, and computational challenges involved in the practical use of phylogenetic invariants. We survey the relevant literature and provide some partial answers and many open problems.
Identifiability of latent class models with many observed variables
"... While latent class models of various types arise in many statistical applications, it is often difficult to establish their identifiability. Focusing on models in which there is some structure of independence of some of the observed variables conditioned on hidden ones, we demonstrate a general ap ..."
Abstract

Cited by 6 (2 self)
 Add to MetaCart
While latent class models of various types arise in many statistical applications, it is often difficult to establish their identifiability. Focusing on models in which there is some structure of independence of some of the observed variables conditioned on hidden ones, we demonstrate a general approach for establishing identifiability, utilizing algebraic arguments. A theorem of J. Kruskal for a simple latent class model with finite state space lies at the core of our results, though we apply it to a diverse set of models. These include mixtures of both finite and nonparametric product distributions, hidden Markov models, and random graph models, and lead to a number of new results and improvements to old ones. In the parametric setting we argue that the classical definition of identifiability is too strong, and should be replaced by the concept of generic identifiability. Generic identifiability implies that the set of nonidentifiable parameters has zero measure, so that the model remains useful for inference. In particular, this sheds light on the properties of finite mixtures of Bernoulli products, which have been used for decades despite being known to be nonidentifiable models. In the nonparametric setting, we again obtain identifiability only when certain restrictions are placed on the distributions that are mixed, but we explicitly describe the conditions.
E: Conditioned genome reconstruction: how to avoid choosing the conditioning genome
 Syst Biol 2007
"... Abstract.—Genome phylogenies can be inferred from data on the presence and absence of genes across taxa. Logdet distances may be a good method, because they allow expected genome size to vary across the tree. Recently, Lake and Rivera proposed conditioned genome reconstruction (calculation of logdet ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
Abstract.—Genome phylogenies can be inferred from data on the presence and absence of genes across taxa. Logdet distances may be a good method, because they allow expected genome size to vary across the tree. Recently, Lake and Rivera proposed conditioned genome reconstruction (calculation of logdet distances using only those genes present in a conditioning genome) to deal with unobservable genes that are absent from every taxon of interest. We prove that their method can consistently estimate the topology for almost any choice of conditioning genome. Nevertheless, the choice of conditioning genome is important for small samples. For real bacterial genome data, different choices of conditioning genome can result in strong bootstrap support for different tree topologies. To overcome this problem, we developed supertree methods that combine information from all choices of conditioning genome. One of these methods, based on the BIONJ algorithm, performs well on simulated data and may have applications to other supertree problems. However, an analysis of 40 bacterial genomes using this method supports an incorrect clade of parasites. This is a common feature of modelbased gene content methods and is due to parallel gene loss. [BIONJ; conditioned genome reconstruction; consistency; gene content; logdet; supertrees.] Variation in gene content makes it difficult to estimate organismal phylogeny from nucleotide or amino acid sequences. Within a lineage, genes are often gained (for example, by lateral transfer) and lost (by deletion). Both