Results 1–10 of 32
Learning in graphical models
Statistical Science, 2004
Cited by 655 (10 self)
Statistical applications in fields such as bioinformatics, information retrieval, speech processing, image processing and communications often involve large-scale models in which thousands or millions of random variables are linked in complex ways. Graphical models provide a general methodology for approaching these problems, and indeed many of the models developed by researchers in these applied fields are instances of the general graphical model formalism. We review some of the basic ideas underlying graphical models, including the algorithmic ideas that allow graphical models to be deployed in large-scale data analysis problems. We also present examples of graphical models in bioinformatics, error-control coding and language processing.
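The message-passing machinery this review surveys can be illustrated on the smallest possible case. The following is a minimal sketch of exact marginal inference by sum-product elimination on a three-variable chain; the chain, potentials and variable sizes are invented for illustration, not taken from the paper.

```python
import numpy as np

# Chain-structured model x1 - x2 - x3, each variable binary,
# joint p(x) proportional to psi12(x1, x2) * psi23(x2, x3).
# (Potentials below are illustrative, not from the paper.)
psi12 = np.array([[4.0, 1.0], [1.0, 4.0]])  # pairwise potential on (x1, x2)
psi23 = np.array([[4.0, 1.0], [1.0, 4.0]])  # pairwise potential on (x2, x3)

# Sum-product message passing: eliminate x1 and x3 to get the marginal of x2.
m1_to_2 = psi12.sum(axis=0)      # message from x1 to x2
m3_to_2 = psi23.sum(axis=1)      # message from x3 to x2
unnorm = m1_to_2 * m3_to_2       # unnormalized marginal of x2
p_x2 = unnorm / unnorm.sum()

# Check against brute-force enumeration of the full joint.
joint = psi12[:, :, None] * psi23[None, :, :]   # joint[x1, x2, x3]
p_x2_brute = joint.sum(axis=(0, 2)) / joint.sum()
print(np.allclose(p_x2, p_x2_brute))  # True
```

On a chain this is just variable elimination; the point of the graphical-model formalism is that the same local message computations extend to trees and, approximately, to loopy graphs.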
Gene prediction with a hidden Markov model and a new intron submodel
Bioinformatics, 2003
Cited by 89 (5 self)
The problem of finding the genes in eukaryotic DNA sequences by computational methods is still not satisfactorily solved. Gene finding programs have achieved relatively high accuracy on short genomic sequences but do not perform well on longer sequences with an unknown number of genes in them, where existing programs tend to predict many false exons. We have developed a new program, AUGUSTUS, for the ab initio prediction of protein-coding genes in eukaryotic genomes. The program is based on a hidden Markov model and integrates a number of known methods and submodels. It employs a new way of modeling intron lengths. We use a new donor splice site model, a new model for a short region directly upstream of the donor splice site that takes the reading frame into account, and we apply a method that allows better GC-content-dependent parameter estimation. On longer sequences, AUGUSTUS accurately predicts far more human and Drosophila genes than the ab initio gene prediction programs we compared it with, while at the same time being more specific. A web interface for AUGUSTUS and the executable program are located at
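The computational core of an HMM-based gene finder like AUGUSTUS is Viterbi decoding: finding the most probable state path given the observed sequence. The toy sketch below uses a two-state model (intergenic vs. exon) with invented transition and emission probabilities and a made-up sequence; AUGUSTUS's actual state space and submodels are far richer.

```python
import numpy as np

# Toy two-state HMM: states 0 = intergenic, 1 = exon.
# All probabilities and the sequence are invented for illustration.
trans = np.log(np.array([[0.9, 0.1],    # intergenic -> intergenic/exon
                         [0.2, 0.8]]))  # exon -> intergenic/exon
# Emission probabilities over A, C, G, T (exons assumed GC-richer here).
emit = np.log(np.array([[0.3, 0.2, 0.2, 0.3],
                        [0.2, 0.3, 0.3, 0.2]]))
start = np.log(np.array([0.5, 0.5]))
seq = "ACGCGCGATATAT"
idx = {"A": 0, "C": 1, "G": 2, "T": 3}
obs = [idx[c] for c in seq]

# Dynamic programming: V[t, s] = best log-probability of any state path
# ending in state s at position t; back[t, s] remembers the argmax.
V = np.full((len(obs), 2), -np.inf)
back = np.zeros((len(obs), 2), dtype=int)
V[0] = start + emit[:, obs[0]]
for t in range(1, len(obs)):
    for s in range(2):
        scores = V[t - 1] + trans[:, s]
        back[t, s] = int(np.argmax(scores))
        V[t, s] = scores[back[t, s]] + emit[s, obs[t]]

# Trace back the most probable state path.
path = [int(np.argmax(V[-1]))]
for t in range(len(obs) - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
path.reverse()
print("".join("IE"[s] for s in path))  # one I/E label per base
```

Real gene finders add many refinements on top of this skeleton, e.g. the explicit intron-length and splice-site submodels described in the abstract.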
Optimal cluster preserving embedding of nonmetric proximity data
IEEE Trans. Pattern Analysis and Machine Intelligence, 2003
Cited by 42 (4 self)
Abstract—For several major applications of data analysis, objects are often not represented as feature vectors in a vector space, but rather by a matrix gathering pairwise proximities. Such pairwise data often violate metricity and, therefore, cannot be naturally embedded in a vector space. In this paper, we introduce a new method for embedding pairwise data into Euclidean vector spaces, aimed at the problem of unsupervised structure detection, or clustering. We show that all clustering methods which are invariant under additive shifts of the pairwise proximities can be reformulated as grouping problems in Euclidean spaces. The most prominent property of this constant shift embedding framework is the complete preservation of the cluster structure in the embedding space. Restating pairwise clustering problems in vector spaces has several important consequences, such as the statistical description of the clusters by way of cluster prototypes, the generic extension of the grouping procedure to a discriminative prediction rule, and the applicability of standard preprocessing methods like denoising or dimensionality reduction. Index Terms—Clustering, pairwise proximity data, cost function, embedding, MDS.
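The constant-shift idea can be sketched numerically: adding a constant to all off-diagonal dissimilarities leaves any shift-invariant clustering objective unchanged, and a large enough shift makes the centered similarity matrix positive semidefinite, so the data embed in a Euclidean space. The dissimilarity matrix below is invented for illustration; this is a sketch of the mechanism, not the authors' code.

```python
import numpy as np

# Invented symmetric dissimilarity matrix with zero diagonal.
D = np.array([[0.0, 1.0, 6.0],
              [1.0, 0.0, 5.0],
              [6.0, 5.0, 0.0]])
n = D.shape[0]

# Centering matrix J and the centered similarity S_c = -1/2 * J D J
# (the classical-MDS construction).
J = np.eye(n) - np.ones((n, n)) / n
Sc = -0.5 * J @ D @ J

# Shift off-diagonal dissimilarities by twice the magnitude of the most
# negative eigenvalue of S_c; the resulting centered matrix is PSD, so the
# shifted data admit an exact Euclidean embedding.
lam_min = np.linalg.eigvalsh(Sc).min()
shift = max(0.0, -2.0 * lam_min)
D_shifted = D + shift * (1.0 - np.eye(n))
Sc_shifted = -0.5 * J @ D_shifted @ J
print(np.linalg.eigvalsh(Sc_shifted).min() >= -1e-9)  # True: PSD after shift
```

The embedding coordinates then come from the eigendecomposition of the shifted matrix, exactly as in classical MDS, and any shift-invariant clustering cost gives the same optimal partition before and after the shift.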
Empirically Estimating Order Constraints for Content Planning in Generation
2001
Cited by 22 (2 self)
In a language generation system, a content planner embodies one or more "plans" that are usually hand-crafted, sometimes through manual analysis of target text. In this paper, we present a system that we developed to automatically learn elements of a plan and the ordering constraints among them. As training data, we use semantically annotated transcripts of domain experts performing the task our system is designed to mimic. Given the large degree of variation in the spoken language of the transcripts, we developed a novel algorithm, based on techniques used in computational genomics, to find parallels between transcripts. Our proposed methodology was evaluated in two ways: the learning and generalization capabilities were quantitatively evaluated using cross-validation, obtaining an accuracy of 89%, and a qualitative evaluation is also provided.
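The "techniques used in computational genomics" alluded to here are sequence-alignment methods. A minimal sketch of the underlying machinery is global alignment (Needleman-Wunsch) applied to two transcripts represented as sequences of semantic tags; the tag sequences and scoring values below are invented, and the paper's actual algorithm is more elaborate.

```python
# Global (Needleman-Wunsch) alignment over sequences of semantic tags.
# Tags and scoring parameters are invented for illustration.
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # (mis)match
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    return score[n][m]

# Two hypothetical annotated transcripts of the same task.
t1 = ["greet", "history", "meds", "allergies", "plan"]
t2 = ["greet", "meds", "history", "plan"]
print(align(t1, t2))  # prints 1
```

A traceback over the same table would recover which tags align, i.e. the shared plan elements and their relative order, which is the information the content planner needs.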
Bootstrapping phylogenetic trees: theory and methods
Statist. Sci., 2003
Cited by 21 (1 self)
Abstract. This is a survey of the use of the bootstrap in the area of systematic and evolutionary biology. I present the current usage by biologists of the bootstrap as a tool both for making inferences and for evaluating robustness, and propose a framework for thinking about these problems in terms of mathematical statistics. Key words and phrases: bootstrap, phylogenetic trees, confidence regions, nonpositive curvature.
1. An introduction to systematics. The objects of study in systematics are binary rooted semi-labeled trees that link species or families by their coancestral relationships. For example, Figure 1 shows a tree with seven strains of HIV.
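The basic resampling step behind phylogenetic bootstrap values can be sketched directly: sample alignment columns with replacement to form pseudo-replicate alignments, then (in a real pipeline) re-estimate a tree from each replicate and report how often each clade recurs. The toy alignment below is invented, and tree estimation itself is omitted.

```python
import random

# Invented toy multiple alignment (taxon name -> aligned sequence).
alignment = {
    "seq1": "ACGTACGTAC",
    "seq2": "ACGTACGAAC",
    "seq3": "ACGAACGAAT",
}

def bootstrap_replicate(aln, rng):
    """Resample the columns of a multiple alignment with replacement."""
    length = len(next(iter(aln.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in aln.items()}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
# Each replicate has the same taxa and alignment length as the original.
print(len(replicates), len(replicates[0]["seq1"]))  # prints "100 10"
```

The proportion of replicates in which a clade reappears is the familiar bootstrap support value; how that proportion should be interpreted statistically is precisely the question the survey addresses.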
Data Verification and Reconciliation With Generalized Error-Control Codes
IEEE Trans. on Info. Theory, 2001
Cited by 14 (6 self)
We consider the problem of data reconciliation, which we model as two separate multisets of data that must be reconciled with minimum communication. Under this model, we show that the problem of reconciliation is equivalent to a variant of the graph coloring problem and provide consequent upper and lower bounds on the communication complexity of reconciliation. More interestingly, we show by means of an explicit construction that the problem of reconciliation is, under certain general conditions, equivalent to the problem of finding good error-correcting codes. We show analogous results for the problem of multiset verification, in which we wish to determine whether two multisets are equal using minimum communication. As a result, a wide body of literature in coding theory may be applied to the problems of reconciliation and verification.
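A flavor of low-communication multiset verification can be sketched with a standard polynomial fingerprinting idea (this is an illustration of the problem setting, not the paper's code-based construction): each side evaluates the characteristic polynomial of its multiset at a shared random point modulo a prime and exchanges only that single value. Equal multisets always agree; unequal ones disagree except with probability at most deg/P.

```python
import random

# Field modulus for the fingerprint (a Mersenne prime).
P = 2_147_483_647

def fingerprint(multiset, x):
    """Evaluate prod over elements a of (x - a), modulo P."""
    acc = 1
    for a in multiset:
        acc = acc * (x - a) % P
    return acc

rng = random.Random(42)
x = rng.randrange(P)  # shared random evaluation point

A = [3, 5, 5, 9]
B = [5, 3, 9, 5]   # same multiset as A, different order
C = [3, 5, 9, 9]   # differs from A in one element
print(fingerprint(A, x) == fingerprint(B, x))  # True: equal multisets
print(fingerprint(A, x) == fingerprint(C, x))  # almost surely False
```

One field element crosses the channel instead of the whole multiset; the reconciliation problem in the paper asks the harder question of also recovering the difference, which is where the connection to error-correcting codes enters.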
Pandit: an evolution-centric database of protein and associated nucleotide domains with inferred trees
Nucleic Acids Res. 34: D327–D331, 2006
Genomics and proteomics: A signal processor’s tour
IEEE Circuits Syst. Mag., 2005
Cited by 11 (1 self)
The theory and methods of signal processing are becoming increasingly important in molecular biology. Digital filtering techniques, transform-domain methods, and Markov models have played important roles in gene identification, biological sequence analysis, and alignment. This paper contains a brief review of molecular biology, followed by a review of the applications of signal processing theory. This includes the problem of gene finding using digital filtering, and the use of transform-domain methods in the study of protein binding spots. The relatively new topic of non-coding genes, and the associated problem of identifying ncRNA buried in DNA sequences, are also described. This includes a discussion of hidden Markov models and context-free grammars. Several new directions in genomic signal processing are briefly outlined at the end. Keywords—Genomic signal processing, bioinformatics, genes, protein-coding, DNA, and ncRNA.
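A classic transform-domain technique this kind of review covers is period-3 detection: because a codon is three bases long, protein-coding DNA tends to show a spectral peak at frequency N/3. The sketch below maps each base to a binary indicator sequence, takes the DFT, and compares the power in the period-3 bin to the other bins; the input sequence is synthetic and artificially perfect.

```python
import numpy as np

# Synthetic sequence with perfect period-3 structure (one repeated codon).
seq = "ATG" * 60
N = len(seq)

# One binary indicator sequence per base; sum the power spectra.
power = np.zeros(N)
for base in "ACGT":
    u = np.array([1.0 if c == base else 0.0 for c in seq])
    power += np.abs(np.fft.fft(u)) ** 2

k = N // 3  # the period-3 frequency bin
# Compare the period-3 bin with the mean power over the other nonzero-
# frequency bins (bin N - k is the conjugate-symmetric twin of bin k).
others = np.delete(power[1:], [k - 1, N - k - 1])
print(power[k] > 10 * others.mean())  # True: strong period-3 peak
```

On real genomic data the peak is far less clean, which is why the review's filtering and windowing machinery (and, for ncRNA, the HMM and context-free-grammar models) is needed on top of this basic spectral idea.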
Generic forward and backward simulations II: Probabilistic simulations
International Conference on Concurrency Theory (CONCUR 2010), Lect. Notes Comp. Sci., 2010
Cited by 3 (3 self)
Abstract. Jonsson and Larsen's notion of probabilistic simulation is studied from a coalgebraic perspective. The notion is compared with two generic coalgebraic definitions of simulation: Hughes and Jacobs' one, and the one introduced previously by the author. We show that the first almost coincides with the second, and that the second is a special case of the third. We investigate implications of this characterization; notably, the Jonsson-Larsen simulation is shown to be sound, i.e., its existence implies trace inclusion.
Statistical Calibration of the SEQUEST XCorr Function
J. Proteome Res., 2009
Cited by 2 (1 self)
Abstract: Obtaining accurate peptide identifications from shotgun proteomics liquid chromatography tandem mass spectrometry (LC-MS/MS) experiments requires a score function that consistently ranks correct peptide-spectrum matches (PSMs) above incorrect matches. We have observed that, for the Sequest score function XCorr, the inability to discriminate between correct and incorrect PSMs is due in part to spectrum-specific properties of the score distribution. In other words, some spectra score well regardless of which peptides they are scored against, and other spectra score well because they are scored against a large number of peptides. We describe a protocol for calibrating PSM score functions, and we demonstrate its application to XCorr and the preliminary Sequest score function Sp. The protocol accounts for spectrum- and peptide-specific effects by calculating p values for each spectrum individually, using only that spectrum's score distribution. We demonstrate that these calculated p values are uniform under a null distribution and therefore accurately measure significance. These p values can be used to estimate the false discovery rate, thereby eliminating the need for an extra search against a decoy database. In addition, we show that the p values are better calibrated than their underlying scores; consequently, when ranking top-scoring PSMs from multiple spectra, p values are better at discriminating between correct and incorrect PSMs. The calibration protocol is generally applicable to any PSM score function for which an appropriate parametric family can be identified.
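The per-spectrum calibration idea can be sketched generically: fit a parametric null to the scores one spectrum obtains against many candidate peptides, then convert any observed score to a p value under that fitted null. The sketch below uses a Gumbel (extreme-value) family with a method-of-moments fit and synthetic null scores; the family, parameters, and data are illustrative choices, not the paper's exact calibration of XCorr.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented null scores for one spectrum against 1000 decoy-like candidates.
null_scores = rng.gumbel(loc=2.0, scale=0.5, size=1000)

# Method-of-moments fit of the Gumbel family
# (0.5772... is the Euler-Mascheroni constant).
scale = np.std(null_scores) * np.sqrt(6) / np.pi
loc = np.mean(null_scores) - 0.5772 * scale

def p_value(score):
    """P(null score >= observed score) under the fitted Gumbel."""
    return 1.0 - np.exp(-np.exp(-(score - loc) / scale))

# A high observed score is far more significant than a typical null score.
print(p_value(5.0) < 0.01 < p_value(2.0))  # True
```

Because each spectrum gets its own fitted null, a score that is "high" only because that spectrum scores everything highly no longer looks significant, which is exactly the spectrum-specific effect the abstract describes.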