Results 1–10 of 28
Learning in graphical models
, 2004
Abstract

Cited by 628 (10 self)
Statistical applications in fields such as bioinformatics, information retrieval, speech processing, image processing and communications often involve large-scale models in which thousands or millions of random variables are linked in complex ways. Graphical models provide a general methodology for approaching these problems, and indeed many of the models developed by researchers in these applied fields are instances of the general graphical model formalism. We review some of the basic ideas underlying graphical models, including the algorithmic ideas that allow graphical models to be deployed in large-scale data analysis problems. We also present examples of graphical models in bioinformatics, error-control coding and language processing. Key words and phrases: Probabilistic graphical models, junction tree algorithm, sum-product algorithm, Markov chain Monte Carlo, variational inference, bioinformatics, error-control coding.
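As a toy illustration of the sum-product algorithm named in the key words, the sketch below computes a marginal on a three-node chain by message passing and checks it against brute-force enumeration. The pairwise potentials are invented for illustration, not taken from the paper.

```python
import itertools

# Hypothetical pairwise potentials on a binary chain x0 - x1 - x2.
psi01 = [[1.0, 0.5], [0.5, 2.0]]   # potential between x0 and x1
psi12 = [[2.0, 1.0], [1.0, 3.0]]   # potential between x1 and x2

def marginal_x1_sum_product():
    # Message from x0 into x1: sum out x0 over psi01.
    m01 = [sum(psi01[x0][x1] for x0 in range(2)) for x1 in range(2)]
    # Message from x2 into x1: sum out x2 over psi12.
    m21 = [sum(psi12[x1][x2] for x2 in range(2)) for x1 in range(2)]
    unnorm = [m01[x1] * m21[x1] for x1 in range(2)]
    z = sum(unnorm)
    return [p / z for p in unnorm]

def marginal_x1_brute_force():
    # Exponential-time reference: enumerate every joint configuration.
    unnorm = [0.0, 0.0]
    for x0, x1, x2 in itertools.product(range(2), repeat=3):
        unnorm[x1] += psi01[x0][x1] * psi12[x1][x2]
    z = sum(unnorm)
    return [p / z for p in unnorm]
```

On a tree the messages visit each edge once, which is why the same marginal costs linear rather than exponential time.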
Gene prediction with a hidden Markov model and a new intron submodel
 Bioinformatics
, 2003
Abstract

Cited by 78 (5 self)
The problem of finding the genes in eukaryotic DNA sequences by computational methods is still not satisfactorily solved. Gene finding programs have achieved relatively high accuracy on short genomic sequences but do not perform well on longer sequences with an unknown number of genes in them. Here existing programs tend to predict many false exons. We have developed a new program, AUGUSTUS, for the ab initio prediction of protein coding genes in eukaryotic genomes. The program is based on a Hidden Markov Model and integrates a number of known methods and submodels. It employs a new way of modeling intron lengths. We use a new donor splice site model, a new model for a short region directly upstream of the donor splice site that takes the reading frame into account, and apply a method that allows better GC-content dependent parameter estimation. On longer sequences, AUGUSTUS accurately predicts far more human and Drosophila genes than the ab initio gene prediction programs we compared it with, while at the same time being more specific. A web interface for AUGUSTUS and the executable program are located at
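AUGUSTUS itself integrates many submodels, but the decoding at the core of any HMM gene finder is Viterbi dynamic programming. A minimal two-state sketch with hypothetical "exon"/"intron" states and made-up parameters (not AUGUSTUS's model):

```python
import math

states = ["exon", "intron"]
log_start = {"exon": math.log(0.5), "intron": math.log(0.5)}
log_trans = {
    "exon":   {"exon": math.log(0.9), "intron": math.log(0.1)},
    "intron": {"exon": math.log(0.1), "intron": math.log(0.9)},
}
# Hypothetical emissions: exons slightly GC-rich, introns slightly AT-rich.
log_emit = {
    "exon":   {b: math.log(p) for b, p in zip("ACGT", [0.2, 0.3, 0.3, 0.2])},
    "intron": {b: math.log(p) for b, p in zip("ACGT", [0.3, 0.2, 0.2, 0.3])},
}

def viterbi(seq):
    # v[t][s] = log-probability of the best state path ending in s at t.
    v = [{s: log_start[s] + log_emit[s][seq[0]] for s in states}]
    back = []
    for base in seq[1:]:
        ptr, col = {}, {}
        for s in states:
            prev = max(states, key=lambda r: v[-1][r] + log_trans[r][s])
            ptr[s] = prev
            col[s] = v[-1][prev] + log_trans[prev][s] + log_emit[s][base]
        v.append(col)
        back.append(ptr)
    # Trace back the best path from the best final state.
    best = max(states, key=lambda s: v[-1][s])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Real gene finders replace the geometric state durations implied by these self-transitions with explicit length distributions, which is exactly the intron-length modeling issue the abstract highlights.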
Optimal cluster preserving embedding of nonmetric proximity data
 IEEE Trans. Pattern Analysis and Machine Intelligence
, 2003
Abstract

Cited by 43 (4 self)
Abstract—For several major applications of data analysis, objects are often not represented as feature vectors in a vector space, but rather by a matrix gathering pairwise proximities. Such pairwise data often violates metricity and, therefore, cannot be naturally embedded in a vector space. Concerning the problem of unsupervised structure detection or clustering, in this paper, a new embedding method for pairwise data into Euclidean vector spaces is introduced. We show that all clustering methods, which are invariant under additive shifts of the pairwise proximities, can be reformulated as grouping problems in Euclidean spaces. The most prominent property of this constant shift embedding framework is the complete preservation of the cluster structure in the embedding space. Restating pairwise clustering problems in vector spaces has several important consequences, such as the statistical description of the clusters by way of cluster prototypes, the generic extension of the grouping procedure to a discriminative prediction rule, and the applicability of standard preprocessing methods like denoising or dimensionality reduction. Index Terms—Clustering, pairwise proximity data, cost function, embedding, MDS.
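A minimal sketch of the constant shift embedding idea as I read it: center the dissimilarity matrix as in classical MDS, shift the spectrum so the resulting similarity is positive semidefinite, and read off Euclidean coordinates. The shift changes every off-diagonal squared distance by the same constant, so any shift-invariant clustering objective is unaffected. Toy data; NumPy assumed.

```python
import numpy as np

def constant_shift_embedding(D):
    """Embed a symmetric, possibly non-metric dissimilarity matrix D."""
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n      # centering projector
    Sc = -0.5 * Q @ D @ Q                    # centered similarity (MDS)
    lam = np.linalg.eigvalsh(Sc)[0]          # smallest eigenvalue
    Sc_shift = Sc - min(lam, 0.0) * Q        # spectral shift -> PSD
    w, V = np.linalg.eigh(Sc_shift)
    w = np.clip(w, 0.0, None)                # clip tiny negative round-off
    return V * np.sqrt(w)                    # rows are Euclidean coordinates

# A non-metric example: d(0,2) = 9 violates the triangle inequality.
D = np.array([[0.0, 1.0, 9.0],
              [1.0, 0.0, 1.0],
              [9.0, 1.0, 0.0]])
X = constant_shift_embedding(D)
```

After embedding, the squared distances between rows of `X` equal the original dissimilarities plus one common off-diagonal constant, which is the cluster-preservation property the abstract describes.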
Empirically Estimating Order Constraints for Content Planning in Generation
, 2001
Abstract

Cited by 21 (2 self)
In a language generation system, a content planner embodies one or more "plans" that are usually handcrafted, sometimes through manual analysis of target text. In this paper, we present a system that we developed to automatically learn elements of a plan and the ordering constraints among them. As training data, we use semantically annotated transcripts of domain experts performing the task our system is designed to mimic. Given the large degree of variation in the spoken language of the transcripts, we developed a novel algorithm to find parallels between transcripts based on techniques used in computational genomics. Our proposed methodology was evaluated in two ways: the learning and generalization capabilities were quantitatively evaluated using cross-validation, obtaining an accuracy of 89%. A qualitative evaluation is also provided.
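The "techniques used in computational genomics" for finding parallels are alignment algorithms. A generic Needleman-Wunsch-style global alignment over semantic tags conveys the flavor; the tags and scoring values below are hypothetical, not the paper's.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two tag sequences (toy scoring)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # align prefix of a against gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap          # align prefix of b against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # (mis)match
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    return score[n][m]

# Two hypothetical annotated transcripts of the same task.
tags1 = ["intro", "history", "meds", "allergies", "exam"]
tags2 = ["intro", "meds", "allergies", "labs", "exam"]
```

High-scoring aligned runs across many transcripts suggest plan elements and their ordering constraints.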
Data Verification and Reconciliation With Generalized Error-Control Codes
 IEEE Trans. on Info. Theory
, 2001
Abstract

Cited by 14 (6 self)
We consider the problem of data reconciliation, which we model as two separate multisets of data that must be reconciled with minimum communication. Under this model, we show that the problem of reconciliation is equivalent to a variant of the graph coloring problem and provide consequent upper and lower bounds on the communication complexity of reconciliation. More interestingly, we show by means of an explicit construction that the problem of reconciliation is, under certain general conditions, equivalent to the problem of finding good error-correcting codes. We show analogous results for the problem of multiset verification, in which we wish to determine whether two multisets are equal using minimum communication. As a result, a wide body of literature in coding theory may be applied to the problems of reconciliation and verification.
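This is not the paper's code-based construction, but a standard polynomial-fingerprint trick conveys why multiset verification needs so little communication: each party sends only the evaluation of the characteristic polynomial ∏(x − aᵢ) mod p at a few shared random points, rather than the multisets themselves.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime modulus

def fingerprint(multiset, x):
    """Evaluate prod(x - a) mod P over the multiset's elements."""
    acc = 1
    for a in multiset:
        acc = acc * ((x - a) % P) % P
    return acc

def probably_equal(ms1, ms2, trials=5):
    """Equal multisets always agree; unequal ones disagree w.h.p."""
    for _ in range(trials):
        x = random.randrange(P)
        if fingerprint(ms1, x) != fingerprint(ms2, x):
            return False
    return True
```

Each trial transmits one residue mod P (about 61 bits), and the false-positive probability per trial is at most deg/P, which is negligible for realistic multiset sizes.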
Pandit: an evolutioncentric database of protein and associated nucleotide domains with inferred trees. Nucleic Acids Res 34: D327–D331
, 2006
Genomics and proteomics: A signal processor’s tour
 IEEE Circuits Syst. Mag
, 2005
Abstract

Cited by 11 (1 self)
The theory and methods of signal processing are becoming increasingly important in molecular biology. Digital filtering techniques, transform domain methods, and Markov models have played important roles in gene identification, biological sequence analysis, and alignment. This paper contains a brief review of molecular biology, followed by a review of the applications of signal processing theory. This includes the problem of gene finding using digital filtering, and the use of transform domain methods in the study of protein binding spots. The relatively new topic of non-coding genes, and the associated problem of identifying ncRNA buried in DNA sequences, are also described. This includes a discussion of hidden Markov models and context-free grammars. Several new directions in genomic signal processing are briefly outlined at the end. Keywords—Genomic signal processing, bioinformatics, genes, protein-coding, DNA, and ncRNA.
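One concrete instance of "gene finding using digital filtering" is the classic period-3 spectral measure: because codons are three bases long, protein-coding DNA tends to show a DFT peak at frequency 1/3 in its base-indicator sequences. A minimal sketch (illustrative, not any specific paper's pipeline):

```python
import cmath

def period3_power(seq):
    """Total DFT power at frequency 1/3 across the four base indicators."""
    w = cmath.exp(-2j * cmath.pi / 3)  # e^{-2*pi*i/3}, the 1/3-frequency phasor
    total = 0.0
    for base in "ACGT":
        # Indicator sequence for this base, transformed at frequency 1/3.
        s = sum(w ** i for i, c in enumerate(seq) if c == base)
        total += abs(s) ** 2
    return total

coding_like = "ATG" * 30   # perfectly 3-periodic toy "exon"
flat = "A" * 90            # no periodic structure
```

In practice this statistic is computed in a sliding window, and windows with a strong 1/3 peak are flagged as candidate coding regions.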
Statistical Calibration of the SEQUEST XCorr Function
 J. Proteome Res. 2009
Abstract

Cited by 2 (1 self)
Abstract: Obtaining accurate peptide identifications from shotgun proteomics liquid chromatography tandem mass spectrometry (LC-MS/MS) experiments requires a score function that consistently ranks correct peptide-spectrum matches (PSMs) above incorrect matches. We have observed that, for the Sequest score function XCorr, the inability to discriminate between correct and incorrect PSMs is due in part to spectrum-specific properties of the score distribution. In other words, some spectra score well regardless of which peptides they are scored against, and other spectra score well because they are scored against a large number of peptides. We describe a protocol for calibrating PSM score functions, and we demonstrate its application to XCorr and the preliminary Sequest score function Sp. The protocol accounts for spectrum- and peptide-specific effects by calculating p values for each spectrum individually, using only that spectrum's score distribution. We demonstrate that these calculated p values are uniform under a null distribution and therefore accurately measure significance. These p values can be used to estimate the false discovery rate, thereby eliminating the need for an extra search against a decoy database. In addition, we show that the p values are better calibrated than their underlying scores; consequently, when ranking top-scoring PSMs from multiple spectra, p values are better at discriminating between correct and incorrect PSMs. The calibration protocol is generally applicable to any PSM score function for which an appropriate parametric family can be identified.
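The paper calibrates by fitting a parametric family per spectrum; as a simplified stand-in, an empirical per-spectrum p value illustrates the core idea of judging the top score only against that same spectrum's own score distribution:

```python
def empirical_p_value(top_score, candidate_scores):
    """Empirical p value of a top score within one spectrum's candidates.

    Fraction of candidate-peptide scores at least as large as the top
    score, with add-one smoothing so the p value is never exactly zero.
    """
    ge = sum(1 for s in candidate_scores if s >= top_score)
    return (ge + 1) / (len(candidate_scores) + 1)
```

Because each spectrum is normalized against itself, a spectrum that scores well against everything no longer looks significant, and p values from different spectra become directly comparable for FDR estimation.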
Generic forward and backward simulations II: Probabilistic simulations
 International Conference on Concurrency Theory (CONCUR 2010), Lect. Notes Comp. Sci
, 2010
Abstract

Cited by 2 (2 self)
Abstract. Jonsson and Larsen's notion of probabilistic simulation is studied from a coalgebraic perspective. The notion is compared with two generic coalgebraic definitions of simulation: Hughes and Jacobs' one, and the one introduced previously by the author. We show that the first almost coincides with the second, and that the second is a special case of the last. We investigate implications of this characterization; notably the Jonsson-Larsen simulation is shown to be sound, i.e., its existence implies trace inclusion.
Dinucleotide Weight Matrices for Predicting Transcription Factor Binding Sites: Generalizing the Position Weight Matrix
, 2010
Abstract

Cited by 1 (0 self)
Background: Identifying transcription factor binding sites (TFBS) in silico is key in understanding gene regulation. TFBS are string patterns that exhibit some variability, commonly modelled as "position weight matrices" (PWMs). Though convenient, the PWM has significant limitations, in particular the assumed independence of positions within the binding motif; and predictions based on PWMs are usually not very specific to known functional sites. Analysis here on binding sites in yeast suggests that correlation of dinucleotides is not limited to near-neighbours, but can extend over considerable gaps. Methodology/Principal Findings: I describe a straightforward generalization of the PWM model, that considers frequencies of dinucleotides instead of individual nucleotides. Unlike previous efforts, this method considers all dinucleotides within an extended binding region, and does not make an attempt to determine a priori the significance of particular dinucleotide correlations. I describe how to use a "dinucleotide weight matrix" (DWM) to predict binding sites, dealing in particular with the complication that its entries are not independent probabilities. Benchmarks show, for many factors, a dramatic improvement over PWMs in precision of predicting known targets. In most cases, significant further improvement arises by extending the commonly defined "core motifs" by about 10 bp on either side. Though this flanking sequence shows no strong motif at the nucleotide level, the predictive power of the dinucleotide model suggests that the "signature" in DNA sequence of protein-binding affinity extends beyond the core protein-DNA contact region.
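A toy contrast between PWM scoring and an adjacent-pair dinucleotide score (the paper's DWM also covers gapped dinucleotides across the whole region; every frequency below is invented for illustration):

```python
import math

bg = 0.25  # uniform background nucleotide frequency
# Hypothetical position-specific nucleotide frequencies for a 3 bp motif.
pwm = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
# Hypothetical enriched adjacent pairs; anything absent falls back to the
# independence prediction pwm[i][x] * pwm[i+1][y].
dwm = {(0, "TG"): 0.65, (1, "GA"): 0.6}

def pwm_score(site):
    """Log-odds of the site under position-independent frequencies."""
    return sum(math.log(pwm[i][c] / bg) for i, c in enumerate(site))

def dwm_score(site):
    """Log-odds over adjacent dinucleotides, capturing pair correlations."""
    score = 0.0
    for i in range(len(site) - 1):
        pair = site[i:i + 2]
        freq = dwm.get((i, pair), pwm[i][pair[0]] * pwm[i + 1][pair[1]])
        score += math.log(freq / (bg * bg))
    return score
```

Whenever the observed pair frequency exceeds the product of the marginals, the DWM rewards the correlated pair more than the PWM can, which is the source of the precision gain the abstract reports.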