Results 1 - 10 of 13
Learning in graphical models, 2004
Cited by 469 (8 self)
Statistical applications in fields such as bioinformatics, information retrieval, speech processing, image processing and communications often involve large-scale models in which thousands or millions of random variables are linked in complex ways. Graphical models provide a general methodology for approaching these problems, and indeed many of the models developed by researchers in these applied fields are instances of the general graphical model formalism. We review some of the basic ideas underlying graphical models, including the algorithmic ideas that allow graphical models to be deployed in large-scale data analysis problems. We also present examples of graphical models in bioinformatics, error-control coding and language processing.
Key words and phrases: Probabilistic graphical models, junction tree algorithm, sum-product algorithm, Markov chain Monte Carlo, variational inference, bioinformatics, error-control coding.
Gene prediction with a hidden Markov model and a new intron submodel - Bioinformatics, 2003
Cited by 45 (4 self)
The problem of finding the genes in eukaryotic DNA sequences by computational methods is still not satisfactorily solved. Gene finding programs have achieved relatively high accuracy on short genomic sequences but do not perform well on longer sequences with an unknown number of genes in them. Here existing programs tend to predict many false exons. We have developed a new program, AUGUSTUS, for the ab initio prediction of protein coding genes in eukaryotic genomes. The program is based on a Hidden Markov Model and integrates a number of known methods and submodels. It employs a new way of modeling intron lengths. We use a new donor splice site model and a new model for a short region directly upstream of the donor splice site model that takes the reading frame into account, and we apply a method that allows better GC-content dependent parameter estimation. On longer sequences, AUGUSTUS accurately predicts far more human and Drosophila genes than the ab initio gene prediction programs we compared it with, while at the same time being more specific. A web interface for AUGUSTUS and the executable program are located at
Optimal cluster preserving embedding of nonmetric proximity data - IEEE Trans. Pattern Analysis and Machine Intelligence, 2003
Cited by 31 (3 self)
For several major applications of data analysis, objects are often not represented as feature vectors in a vector space, but rather by a matrix gathering pairwise proximities. Such pairwise data often violate metricity and, therefore, cannot be naturally embedded in a vector space. Concerning the problem of unsupervised structure detection or clustering, in this paper, a new embedding method for pairwise data into Euclidean vector spaces is introduced. We show that all clustering methods that are invariant under additive shifts of the pairwise proximities can be reformulated as grouping problems in Euclidean spaces. The most prominent property of this constant shift embedding framework is the complete preservation of the cluster structure in the embedding space. Restating pairwise clustering problems in vector spaces has several important consequences, such as the statistical description of the clusters by way of cluster prototypes, the generic extension of the grouping procedure to a discriminative prediction rule, and the applicability of standard preprocessing methods like denoising or dimensionality reduction.
Index Terms: Clustering, pairwise proximity data, cost function, embedding, MDS.
Empirically Estimating Order Constraints for Content Planning in Generation, 2001
Cited by 20 (2 self)
In a language generation system, a content planner embodies one or more "plans" that are usually hand-crafted, sometimes through manual analysis of target text. In this paper, we present a system that we developed to automatically learn elements of a plan and the ordering constraints among them. As training data, we use semantically annotated transcripts of domain experts performing the task our system is designed to mimic. Given the large degree of variation in the spoken language of the transcripts, we developed a novel algorithm to find parallels between transcripts based on techniques used in computational genomics. Our proposed methodology was evaluated in two ways: the learning and generalization capabilities were quantitatively evaluated using cross-validation, obtaining an accuracy of 89%. A qualitative evaluation is also provided.
Data Verification and Reconciliation With Generalized Error-Control Codes - IEEE Trans. on Info. Theory, 2001
Cited by 14 (4 self)
We consider the problem of data reconciliation, which we model as two separate multisets of data that must be reconciled with minimum communication. Under this model, we show that the problem of reconciliation is equivalent to a variant of the graph coloring problem and provide consequent upper and lower bounds on the communication complexity of reconciliation. More interestingly, we show by means of an explicit construction that the problem of reconciliation is, under certain general conditions, equivalent to the problem of finding good error-correcting codes. We show analogous results for the problem of multi-set verification, in which we wish to determine whether two multi-sets are equal using minimum communication. As a result, a wide body of literature in coding theory may be applied to the problems of reconciliation and verification.
Pandit: an evolution-centric database of protein and associated nucleotide domains with inferred trees. Nucleic Acids Res 34: D327–D331, 2006
Genomics and proteomics: A signal processor’s tour - IEEE Circuits Syst. Mag, 2005
Cited by 3 (0 self)
The theory and methods of signal processing are becoming increasingly important in molecular biology. Digital filtering techniques, transform domain methods, and Markov models have played important roles in gene identification, biological sequence analysis, and alignment. This paper contains a brief review of molecular biology, followed by a review of the applications of signal processing theory. This includes the problem of gene finding using digital filtering, and the use of transform domain methods in the study of protein binding spots. The relatively new topic of noncoding genes and the associated problem of identifying ncRNA buried in DNA sequences are also described. This includes a discussion of hidden Markov models and context-free grammars. Several new directions in genomic signal processing are briefly outlined at the end.
Keywords: Genomic signal processing, bioinformatics, genes, protein-coding, DNA, and ncRNA.
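As a rough illustration of the transform domain idea mentioned in this abstract, the sketch below computes a period-3 spectral measure: each base is mapped to a binary indicator sequence, and the squared DFT magnitudes at frequency 1/3 are summed. Protein-coding regions are often reported to show a peak in this measure. The function names `period3_power` and `sliding_period3` are hypothetical, and this is an assumed textbook-style formulation, not necessarily the method used in the paper.

```python
import cmath


def period3_power(seq):
    """Sum over A, C, G, T of |DFT at frequency 1/3|^2 of the base's indicator sequence."""
    w = cmath.exp(-2j * cmath.pi / 3)  # complex exponential for frequency 1/3
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        # DFT coefficient of the 0/1 indicator sequence for this base
        coeff = sum(w ** n for n, ch in enumerate(seq) if ch == base)
        total += abs(coeff) ** 2
    return total


def sliding_period3(seq, window, step=1):
    """Period-3 power for each window of length `window`, moved by `step` bases."""
    return [
        period3_power(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # A perfectly 3-periodic toy sequence versus a 4-periodic one
    print(period3_power("ATG" * 10))
    print(period3_power("ACGT" * 6))
```

In practice the window length and step are tuning choices; the values used here are arbitrary toy settings.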
Entropy Filtering Method and Insertion/Deletion Robust Algorithm for Multiple Local Sequence Alignment, 2000
Abstract of the dissertation "Entropy Filtering Method and Insertion/Deletion Robust Algorithm for Multiple Local Sequence Alignment" by Jun Xie, Doctor of Philosophy in Statistics, University of California, Los Angeles, 2000. Professor Ker-Chau Li, Chair. Bayesian models have been developed for finding ungapped motifs in multiple protein sequences (Liu, Neuwald and Lawrence 1995). In this article we extend the model to allow for deletions and insertions in the motifs. Direct generalization of the ungapped algorithm, based on Gibbs sampling, proves unsuccessful because the configuration space has become much larger. To alleviate this difficulty, a method called entropy filtering is introduced which allows us to find a better starting point. In addition to Gibbs sampling, we also provide a Metropolis-Hastings algorithm which shows more stable performance. The significance of the alignment is discussed at the end. Chapter 1, Introduction. 1.1 Protein sequences and motifs. A protein is an unbran...
Algorithms for Molecular Biology Fall Semester, 2001
Introduction. Genetics as a set of principles and analytical procedures did not begin until 1866, when an Augustinian monk named Gregor Mendel performed a set of experiments that pointed to the existence of biological elements called genes, the basic units responsible for the possession and passing on of a single characteristic. Until 1944, it was generally assumed that chromosomal proteins carry genetic information, and that DNA plays a secondary role. This view was shattered by Avery and McCarty, who demonstrated that the molecule deoxyribonucleic acid (DNA) is the major carrier of genetic material in living organisms, i.e. is responsible for inheritance. In 1953 James Watson and Francis Crick deduced the three-dimensional structure of DNA and immediately inferred its method of replication. In February 2001, through the combined efforts of the Human Genome Project and the commercial company Celera, the first draft of the human genome was published. 1.1.2 DNA Composition. The basic elements of
Methodologies for Constructing and Training Large Hierarchical Hidden Markov Models for Sequence Analysis - CURRENTS IN COMPUTATIONAL MOLECULAR BIOLOGY, 2001
Hidden Markov Models (HMMs) are a widely used modeling tool for biological sequence analysis [2][5]. However, for many tasks of interest large hierarchical models must be constructed and optimized using a large amount of training data. We present some methodologies that we have used for constructing and training large HMMs for biological sequence analysis. We illustrate these techniques for the task of splice site prediction in vertebrate genes.
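For readers unfamiliar with how an HMM labels a sequence, the following is a minimal sketch of Viterbi decoding on a toy two-state model. The states (`exon`, `intron`), the probabilities, and the function name `viterbi` are illustrative assumptions only; they are not taken from the paper, whose hierarchical models are far larger.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable state path for obs and its log-probability."""
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Choose the predecessor state that maximizes the path score
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


def log_table(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# Toy parameters (assumed for illustration): GC-rich "exon", AT-rich "intron"
STATES = ["exon", "intron"]
LOG_START = {"exon": math.log(0.5), "intron": math.log(0.5)}
LOG_TRANS = log_table({
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
})
LOG_EMIT = log_table({
    "exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
})

if __name__ == "__main__":
    path, score = viterbi("GCGCATAT", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
    print(path, score)
```

Working in log space avoids numerical underflow on long sequences, which matters for genomic-scale inputs.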

