Identifiability of parameters in latent structure models with many observed variables
 ANN. STATIST
, 2009
Cited by 80 (8 self)
While hidden class models of various types arise in many statistical applications, it is often difficult to establish the identifiability of their parameters. Focusing on models in which there is some structure of independence of some of the observed variables conditioned on hidden ones, we demonstrate a general approach for establishing identifiability utilizing algebraic arguments. A theorem of J. Kruskal for a simple latent-class model with finite state space lies at the core of our results, though we apply it to a diverse set of models. These include mixtures of both finite and nonparametric product distributions, hidden Markov models and random graph mixture models, and lead to a number of new results and improvements to old ones. In the parametric setting, this approach indicates that for such models, the classical definition of identifiability is typically too strong. Instead, generic identifiability holds, which implies that the set of nonidentifiable parameters has measure zero, so that parameter inference is still meaningful. In particular, this sheds light on the properties of finite mixtures of Bernoulli products, which have been used for decades despite being known to have nonidentifiable parameters. In the nonparametric setting, we again obtain identifiability only when certain restrictions are placed on the distributions that are mixed, but we explicitly describe the conditions.
pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree
 BMC bioinfo
Cited by 33 (3 self)
Background: Likelihood-based phylogenetic inference is generally considered to be the most reliable classification method for unknown sequences. However, traditional likelihood-based phylogenetic methods cannot be applied to large volumes of short reads from next-generation sequencing due to computational complexity issues and lack of phylogenetic signal. “Phylogenetic placement,” where a reference tree is fixed and the unknown query sequences are placed onto the tree via a reference alignment, is a way to bring the inferential power offered by likelihood-based approaches to large data sets. Results: This paper introduces pplacer, a software package for phylogenetic placement and subsequent visualization. The algorithm can place twenty thousand short reads on a reference tree of one thousand taxa per hour per processor, has essentially linear time and memory complexity in the number of reference taxa, and is easy to run in parallel. Pplacer features calculation of the posterior probability of a placement on an edge, which is a statistically rigorous way of quantifying uncertainty on an edge-by-edge basis. It also can inform the user of the positional uncertainty for query sequences by calculating expected distance between placement locations, which is crucial in the estimation of uncertainty with a well-sampled reference tree. The software provides visualizations using branch thickness and color to represent number of placements and their uncertainty. A simulation study using reads generated from 631 COG alignments shows a high level of accuracy for phylogenetic placement over a wide range of alignment diversity, and the power of edge uncertainty estimates to measure placement confidence.
Performance of a New Invariants Method on Homogeneous and Nonhomogeneous Quartet Trees
, 2006
Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites
, 2007
Mixed-up Trees: the Structure of Phylogenetic Mixtures
 BULLETIN OF MATHEMATICAL BIOLOGY (2008)
, 2008
Cited by 18 (4 self)
In this paper, we apply new geometric and combinatorial methods to the study of phylogenetic mixtures. The focus of the geometric approach is to describe the geometry of phylogenetic mixture distributions for the two state random cluster model, which is a generalization of the two state symmetric (CFN) model. In particular, we show that the set of mixture distributions forms a convex polytope and we calculate its dimension; corollaries include a simple criterion for when a mixture of branch lengths on the star tree can mimic the site pattern frequency vector of a resolved quartet tree. Furthermore, by computing volumes of polytopes we can clarify how “common” nonidentifiable mixtures are under the CFN model. We also present a new combinatorial result which extends any identifiability result for a specific pair of trees of size six to arbitrary pairs of trees. Next we present a positive result showing identifiability of rates-across-sites models. Finally, we answer a question raised in a previous paper concerning “mixed branch repulsion” on trees larger than quartet trees under the CFN model.
IDENTIFIABILITY OF A MARKOVIAN MODEL OF MOLECULAR EVOLUTION WITH GAMMA-DISTRIBUTED RATES
, 2008
Cited by 17 (4 self)
Inference of evolutionary trees and rates from biological sequences is commonly performed using continuous-time Markov models of character change. The Markov process evolves along an unknown tree while observations arise only from the tips of the tree. Rate heterogeneity is present in most real data sets and is accounted for by the use of flexible mixture models where each site is allowed its own rate. Very little has been rigorously established concerning the identifiability of the models currently in common use in data analysis, although nonidentifiability was proven for a semiparametric model and an incorrect proof of identifiability was published for a general parametric model (GTR+Γ+I). Here we prove that one of the most widely used models (GTR+Γ) is identifiable for generic parameters, and for all parameter choices in the case of 4-state (DNA) models. This is the first proof of identifiability of a phylogenetic model with a continuous distribution of rates.
Pitfalls of heterogeneous processes for phylogenetic reconstruction
 Systematic Biology
, 2006
Cited by 15 (3 self)
Different genes often have different phylogenetic histories. Even within regions having the same phylogenetic history, the mutation rates often vary. We investigate the prospects of phylogenetic reconstruction when all the characters are generated from the same tree topology, but the branch lengths vary (with possibly different tree shapes). Furthering work of Kolaczkowski and Thornton (2004) and Chang (1996), we show examples where maximum likelihood (under a homogeneous model) is an inconsistent estimator of the tree. We then explore the prospects of phylogenetic inference under a heterogeneous model. In some models, there are examples where phylogenetic inference under any method is impossible – despite the fact that there is a common tree topology. In particular, there are nonidentifiable mixture distributions, i.e., multiple topologies generate identical mixture distributions. We address which evolutionary models have nonidentifiable mixture distributions and prove that the following duality theorem holds for most DNA substitution models. The model has either: (i) Nonidentifiability – two different tree topologies can produce identical mixture distributions, and hence distinguishing between the two topologies is impossible; or (ii) Linear tests – there exist linear tests which identify the common tree topology for character data generated by a mixture distribution. The theorem holds for models whose transition matrices can be parameterized by open sets, which includes most of the popular models, such as Tamura-Nei and Kimura’s 2-parameter model. The duality theorem relies on our notion of linear tests, which are related to Lake’s linear invariants.
Phylogeny of mixture models: Robustness of maximum likelihood and nonidentifiable distributions
Cited by 11 (2 self)
We address phylogenetic reconstruction when the data is generated from a mixture distribution. Such topics have gained considerable attention in the biological community with the clear evidence of heterogeneity of mutation rates. In our work we consider data coming from a mixture of trees which share a common topology, but differ in their edge weights (i.e., branch lengths). We first show the pitfalls of popular methods, including maximum likelihood and Markov chain Monte Carlo algorithms. We then determine in which evolutionary models reconstructing the tree topology under a mixture distribution is (im)possible. We prove that every model whose transition matrices can be parameterized by an open set of multilinear polynomials either has nonidentifiable mixture distributions, in which case reconstruction is impossible in general, or there exist linear tests which identify the topology. This duality theorem relies on our notion of linear tests and uses ideas from convex programming duality. Linear tests are closely related to linear invariants, which were first introduced by Lake, and are natural from an algebraic geometry perspective.