Hierarchical Latent Class Models for Cluster Analysis
Journal of Machine Learning Research, 2002
Cited by 34 (9 self)
Latent class models are used for cluster analysis of categorical data. Underlying such a model is the assumption that the observed variables are mutually independent given the class variable. A serious problem with the use of latent class models, known as local dependence, is that this assumption is often untrue. In this paper we propose hierarchical latent class models as a framework where the local dependence problem can be addressed in a principled manner. We develop a search-based algorithm for learning hierarchical latent class models from data. The algorithm is evaluated using both synthetic and real-world data.
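The local independence assumption described in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it only evaluates the joint probability of a plain (non-hierarchical) latent class model, P(x) = sum over c of P(c) * product over j of P(x_j | c). The function name `lcm_joint` and the toy parameters are hypothetical, chosen purely for illustration.

```python
from itertools import product


def lcm_joint(class_prior, cond_probs, x):
    """Joint probability of observation x under a latent class model.

    class_prior[c]       -- P(C = c)
    cond_probs[c][j][v]  -- P(X_j = v | C = c)
    Local independence: the X_j are mutually independent given C.
    """
    total = 0.0
    for c, p_c in enumerate(class_prior):
        p = p_c
        for j, x_j in enumerate(x):
            p *= cond_probs[c][j][x_j]
        total += p
    return total


# Hypothetical toy model: 2 latent classes, 3 binary observed variables.
class_prior = [0.6, 0.4]
cond_probs = [
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]],  # class 0
    [[0.2, 0.8], [0.3, 0.7], [0.1, 0.9]],  # class 1
]

# Summing over all 2**3 observations should give a total probability of 1.
total = sum(lcm_joint(class_prior, cond_probs, x)
            for x in product([0, 1], repeat=3))
```

When the observed variables remain dependent within a class (local dependence), this factorization no longer matches the data, which is the problem the hierarchical models in the paper are meant to address.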
Data Perturbation for Escaping Local Maxima in Learning
In AAAI, 2002
Cited by 29 (3 self)
Almost all machine learning algorithms---be they for regression, classification or density estimation---seek hypotheses that optimize a score on training data. In most interesting cases, however, full global optimization is not feasible and local search techniques are used to discover reasonable solutions. Unfortunately,
Last Level Cache (LLC) performance of data-mining workloads on a CMP—A case study of parallel bioinformatics workloads
Proc. International Symposium on High Performance Computing, 2006
Cited by 25 (2 self)
With the continuing growth in the amount of genetic data, members of the bioinformatics community are developing a variety of data-mining applications to understand the data and discover meaningful information. These applications are important in defining the design and performance decisions of future high performance microprocessors. This paper presents a detailed data-sharing analysis and chip-multiprocessor (CMP) cache study of several multithreaded data-mining bioinformatics workloads. For a CMP with a three-level cache hierarchy, we model the last level of the cache hierarchy as either multiple private caches or a single cache shared amongst different cores of the CMP. Our experiments show that the bioinformatics workloads exhibit significant data-sharing: 50–95% of the data cache is shared by the different threads of the workload. Furthermore, regardless of the amount of data cache shared, for some workloads, as many as 98% of the accesses to the last-level cache are to shared data cache lines. Additionally, the amount of data-sharing exhibited by the workloads is a function of the total cache size available: the larger the data cache, the better the sharing behavior. Thus, partitioning the available last-level cache silicon area into multiple private caches can cause applications to lose their inherent data-sharing behavior. For the workloads in this study, a shared 32MB last-level cache is able to capture a tremendous amount of data-sharing and outperform a 32MB private cache configuration by several orders of magnitude. Specifically, with shared last-level caches, the bandwidth demands beyond the last-level cache can be reduced by factors of 3–625 when compared to private last-level caches.
An Investigation of Phylogenetic Likelihood Methods
2003
Cited by 12 (0 self)
We analyze the performance of likelihood-based approaches used to reconstruct phylogenetic trees. Unlike other techniques such as Neighbor-Joining (NJ) and Maximum Parsimony (MP), relatively little is known regarding the behavior of algorithms founded on the principle of likelihood.
Phylogenetic hidden Markov models
In Statistical Methods in Molecular Evolution, 2005
Cited by 12 (2 self)
Phylogenetic hidden Markov models, or phylo-HMMs, are probabilistic models that consider not only the way substitutions occur through evolutionary history at each site of a genome, but also the way this process changes from one site to the next. By treating molecular evolution as a combination of two Markov processes, one that operates in the dimension of space (along a genome) and one that operates in the dimension of time (along the branches of a phylogenetic tree), these models allow aspects of both sequence structure and sequence evolution to be captured. Moreover, as we will discuss, they permit key computations to be performed exactly and efficiently. Phylo-HMMs allow evolutionary information to be brought to bear on a wide variety of problems of sequence “segmentation,” such as gene prediction and the identification of conserved elements. Phylo-HMMs were first proposed as a way of improving phylogenetic models that allow for variation among sites in the rate of substitution [8, 52]. Soon afterward, they were adapted for the problem of secondary structure
Combining Multiple Datasets in a Likelihood Analysis: Which Models are Best?
Cited by 7 (0 self)
Until recently, phylogenetic analyses have been routinely based on homologous sequences of a single gene. Given the vast number of gene sequences now available, phylogenetic studies are now based on the analysis of multiple genes. Thus, it has become necessary to devise statistical methods to combine multiple molecular datasets. Here, we compare several models for combining different genes for the purpose of evaluating the likelihood of tree topologies. Three methods of branch length estimation were studied: assuming all genes have the same branch lengths (concatenate model); assuming that branch lengths are proportional among genes (proportional model); or assuming that each gene has a separate set of branch lengths (separate model). We also compared three models of among-site rate variation: the homogeneous model, a model that assumes one gamma parameter for all genes, and a model that assumes one gamma parameter for each gene. On the basis of two nuclear and one mitochondrial amino-acid datasets, our results suggest that, depending on the dataset chosen, either the separate model or the proportional model represents the most appropriate method for branch length analysis. For all datasets examined, one gamma parameter for each gene represents the best model for among-site rate variation. Using these models, we analyzed alternative mammalian tree topologies and describe the effect of the assumed model on the maximum likelihood tree. We show that the choice of the model has an impact on the best phylogeny obtained.
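To make the trade-off between the three branch-length models concrete, the sketch below counts their free branch-length parameters. It rests on assumptions not stated in the abstract: an unrooted, fully resolved tree (which has 2n - 3 branches for n taxa) and one scaling factor per additional gene in the proportional model. The function name `branch_length_params` is hypothetical, and the paper's own parameter accounting may differ.

```python
def branch_length_params(n_taxa, n_genes):
    """Count free branch-length parameters under each combination model.

    Assumes an unrooted, fully resolved (binary) tree: 2 * n_taxa - 3 branches.
    """
    n_branches = 2 * n_taxa - 3
    return {
        # One shared set of branch lengths for all genes.
        "concatenate": n_branches,
        # One shared set plus a rate multiplier for each additional gene.
        "proportional": n_branches + (n_genes - 1),
        # An independent set of branch lengths for every gene.
        "separate": n_branches * n_genes,
    }


counts = branch_length_params(n_taxa=10, n_genes=3)
```

The separate model always fits at least as well as the other two but uses far more parameters, which is why a model-selection comparison of the kind the abstract describes is needed.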
Using evolutionary expectation maximization to estimate indel rates
Bioinformatics 21, 2005
Cited by 7 (0 self)
Motivation: The Expectation Maximization (EM) algorithm, in the form of the Baum–Welch algorithm (for hidden Markov models) or the Inside-Outside algorithm (for stochastic context-free grammars), is a powerful way to estimate the parameters of stochastic grammars for biological sequence analysis. To use this algorithm for multiple-sequence evolutionary modelling, it would be useful to apply the EM algorithm to estimate not only the probability parameters of the stochastic grammar, but also the instantaneous mutation rates of the underlying evolutionary model (to facilitate the development of stochastic grammars based on phylogenetic trees, also known as Statistical Alignment). Recently, we showed how to do this for the point substitution component of the evolutionary process; here, we extend these results to the indel process. Results: We present an algorithm for maximum-likelihood estimation of insertion and deletion rates from multiple sequence alignments, using EM, under the single-residue indel model owing to Thorne, Kishino and Felsenstein (the ‘TKF91’ model). The algorithm converges extremely rapidly, gives accurate results on simulated data that are an improvement over parsimonious estimates (which are shown to underestimate the true indel rate), and gives plausible results on experimental data (coronavirus envelope domains). Owing to the algorithm’s close similarity to the Baum–Welch algorithm for training hidden Markov models, it can be used in an ‘unsupervised’ fashion to estimate rates for unaligned sequences, or estimate several sets of rates for sequences with heterogeneous rates. Availability: Software implementing the algorithm and the benchmark is available under GPL from
CE: Mammalian genomes ease location of human DNA functional segments but not their description
Stat Appl Genet Mol Biol
Cited by 4 (1 self)
Copyright ©2004 by the authors. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, bepress, which has been given certain exclusive rights by the author. Statistical Applications in Genetics and Molecular Biology is produced by The Berkeley Electronic Press (bepress).
Selecton 2007: advanced models for detecting positive and purifying selection using a Bayesian inference approach
Nucleic Acids Research, 2007
Cited by 3 (1 self)
Biologically significant sites in a protein may be identified by contrasting the rates of synonymous (Ks) and non-synonymous (Ka) substitutions. This enables the inference of site-specific positive Darwinian selection and purifying selection. We present here Selecton version 2.2 (http://selecton.bioinfo.tau.ac.il), a web server which automatically calculates the ratio between Ka and Ks (ω) at each site of the protein. This ratio is graphically displayed on each site using a color-coding scheme, indicating either positive selection, purifying selection or lack of selection. Selecton implements an assembly of different evolutionary models, which allow for statistical testing of the hypothesis that a protein has undergone positive selection. Specifically, the recently developed mechanistic-empirical model is introduced, which takes into account the physicochemical properties of amino acids. Advanced options were introduced to allow maximal fine tuning of the server to the user’s specific needs, including calculation of statistical support of the ω values, an advanced graphic display of the protein’s 3-dimensional structure, use of different genetic codes and inputting of a pre-built phylogenetic tree. Selecton version 2.2 is an effective, user-friendly and freely available web server which implements up-to-date methods for computing site-specific selection forces, and the visualization of these forces on the protein’s sequence and structure.
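As a rough illustration of how a per-site Ka/Ks ratio maps onto the three categories named in this abstract, the sketch below applies the conventional reading: a ratio above 1 suggests positive selection, below 1 purifying selection, and near 1 no selection. This is not Selecton's method, which relies on Bayesian inference and statistical testing. The function name `classify_site` and the tolerance `tol` are hypothetical assumptions.

```python
def classify_site(ka, ks, tol=0.1):
    """Label a site from its Ka/Ks ratio using a simple tolerance band.

    ka  -- non-synonymous substitution rate at the site
    ks  -- synonymous substitution rate at the site (must be positive)
    tol -- assumed half-width of the band around 1 treated as "neutral"
    """
    if ks <= 0:
        raise ValueError("ks must be positive to form the Ka/Ks ratio")
    ratio = ka / ks
    if ratio > 1 + tol:
        return "positive"
    if ratio < 1 - tol:
        return "purifying"
    return "neutral"


labels = [classify_site(2.0, 1.0), classify_site(0.2, 1.0), classify_site(1.0, 1.0)]
```

A fixed threshold like this ignores estimation uncertainty, which is why a server of the kind described reports statistical support for the ratio rather than a bare cutoff.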
Efficient Generation of Uniform Samples from Phylogenetic Trees
Cited by 3 (0 self)
In this paper, we introduce new algorithms for selecting taxon (leaf) samples from large phylogenetic trees, uniformly at random, under certain biologically relevant constraints on the taxa. All the algorithms run in polynomial time and have been implemented. The algorithms have direct applications...

