Results 1-10 of 16
The Hierarchical Hidden Markov Model: Analysis and Applications
Machine Learning
, 1998
"... . We introduce, analyze and demonstrate a recursive hierarchical generalization of the widely used hidden Markov models, which we name Hierarchical Hidden Markov Models (HHMM). Our model is motivated by the complex multiscale structure which appears in many natural sequences, particularly in langua ..."
Abstract

Cited by 300 (3 self)
We introduce, analyze and demonstrate a recursive hierarchical generalization of the widely used hidden Markov models, which we name Hierarchical Hidden Markov Models (HHMM). Our model is motivated by the complex multiscale structure which appears in many natural sequences, particularly in language, handwriting and speech. We seek a systematic unsupervised approach to the modeling of such structures. By extending the standard forward-backward (Baum-Welch) algorithm, we derive an efficient procedure for estimating the model parameters from unlabeled data. We then use the trained model for automatic hierarchical parsing of observation sequences. We describe two applications of our model and its parameter estimation procedure. In the first application we show how to construct hierarchical models of natural English text. In these models different levels of the hierarchy correspond to structures on different length scales in the text. In the second application we demonstrate how HHMMs can b...
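The generative idea behind the abstract can be sketched in a few lines (a toy two-level model with made-up parameters, not the paper's model or its estimation procedure): each top-level state activates a sub-model that emits symbols until it terminates, after which control returns to the top level.

```python
import random

def sample_sub_hmm(emissions, end_prob, rng):
    """Emit symbols from a one-state sub-model until it terminates."""
    out = []
    while True:
        out.append(rng.choices(list(emissions), weights=list(emissions.values()))[0])
        if rng.random() < end_prob:
            return out

def sample_hhmm(top_trans, sub_models, length, rng=None):
    """Alternate top-level transitions with sub-model runs until `length` symbols."""
    rng = rng or random.Random(0)
    state, seq = 0, []
    while len(seq) < length:
        emissions, end_prob = sub_models[state]
        seq.extend(sample_sub_hmm(emissions, end_prob, rng))
        state = rng.choices(range(len(top_trans[state])), weights=top_trans[state])[0]
    return seq[:length]

# Two top-level states: one activates a vowel-emitting sub-model, the other
# a consonant-emitting one (illustrative parameters only).
subs = [({"a": 0.5, "e": 0.5}, 0.3), ({"t": 0.5, "n": 0.5}, 0.3)]
trans = [[0.2, 0.8], [0.8, 0.2]]
print("".join(sample_hhmm(trans, subs, 20)))
```

A full HHMM allows arbitrary nesting depth and learns its parameters with the generalized Baum-Welch procedure the paper derives; this sketch only illustrates the hierarchical generative process.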
The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length
 Machine Learning
, 1996
"... . We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions gene ..."
Abstract

Cited by 208 (17 self)
We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions generated by general probabilistic automata, we prove that the algorithm we present can efficiently learn distributions generated by PSAs. In particular, we show that for any target PSA, the KL-divergence between the distribution generated by the target and the distribution generated by the hypothesis the learning algorithm outputs can be made small with high confidence in polynomial time and sample complexity. The learning algorithm is motivated by applications in human-machine interaction. Here we present two applications of the algorithm. In the first one we apply the algorithm in order to construct a model of the English language, and use this model to correct corrupted text. In the second ...
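The variable-memory idea can be illustrated with a toy next-symbol predictor (an illustration only, not the paper's PSA learning algorithm with its smoothing and pruning rules): store counts for every context up to a depth bound and predict from the longest context actually seen in training.

```python
from collections import defaultdict

def train_counts(text, max_depth):
    """Count next-symbol occurrences after every context of length <= max_depth."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, sym in enumerate(text):
        for d in range(max_depth + 1):
            if i - d < 0:
                break
            counts[text[i - d:i]][sym] += 1
    return counts

def predict(counts, history, max_depth):
    """Predict the next symbol from the longest matching context."""
    for d in range(min(max_depth, len(history)), -1, -1):
        ctx = history[len(history) - d:]
        if ctx in counts:
            dist = counts[ctx]
            return max(dist, key=dist.get)
    return None

counts = train_counts("abracadabra", 3)
print(predict(counts, "abr", 3))  # prints a
```

The point of the variable-length memory is that the model falls back to shorter contexts only when the longer ones were never observed, rather than fixing one Markov order in advance.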
Variations on Probabilistic Suffix Trees: Statistical Modeling and Prediction of Protein Families
, 2001
"... Motivation: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor ..."
Abstract

Cited by 63 (7 self)
Motivation: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. Basic biological considerations such as amino acid background probabilities and amino acid substitution probabilities can be incorporated to improve performance. Results: The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on the Pfam database of protein families with more than satisfactory performance. Exhaustive evaluations show that the PST model detects many more related sequences than pairwise methods such as Gapped-BLAST, and is almost as sensitive as a hidden Markov model that is trained from a multiple alignment of the input sequences, while being much faster. Availability: The programs are available upon request from the authors. Contact: jill@cs.huji.ac.il; golan@cs.cornell.edu
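A rough sketch of PST-style scoring (illustrative only; the paper's construction involves statistical significance tests and pruning that are omitted here): estimate P(amino acid | longest remembered context) with add-one smoothing, then score a query by its total log-probability, so that family members score higher than unrelated sequences.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def build_pst(sequences, depth):
    """Collect next-symbol counts for every context up to the depth bound."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sequences:
        for i, aa in enumerate(s):
            for d in range(depth + 1):
                if i - d < 0:
                    break
                counts[s[i - d:i]][aa] += 1
    return counts

def log_score(counts, query, depth):
    """Sum log P(aa | longest matching context), add-one smoothed."""
    total = 0.0
    for i, aa in enumerate(query):
        for d in range(min(depth, i), -1, -1):
            ctx = query[i - d:i]
            if ctx in counts:
                dist = counts[ctx]
                n = sum(dist.values())
                total += math.log((dist[aa] + 1) / (n + len(ALPHABET)))
                break
    return total

family = ["MKVLA", "MKVLG", "MKILA"]  # toy "family", not real proteins
pst = build_pst(family, 2)
print(log_score(pst, "MKVLA", 2) > log_score(pst, "QQQQQ", 2))  # prints True
```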
Modeling Protein Families Using Probabilistic Suffix Trees
, 1999
"... We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method ..."
Abstract

Cited by 35 (5 self)
We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. Incorporating basic biological considerations such as amino acid background probabilities and amino acid substitution probabilities can improve the performance in some cases. The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on one of the state-of-the-art databases of protein families, namely the Pfam database of HMMs, with satisfactory performance.
Passively Learning Finite Automata
, 1996
"... We provide a survey of methods for inferring the structure of a finite automaton from passive observation of its behavior. We consider both deterministic automata and probabilistic automata (similar to Hidden Markov Models). While it is computationally intractible to solve the general problem exactl ..."
Abstract

Cited by 24 (0 self)
We provide a survey of methods for inferring the structure of a finite automaton from passive observation of its behavior. We consider both deterministic automata and probabilistic automata (similar to Hidden Markov Models). While it is computationally intractable to solve the general problem exactly, we will consider heuristic algorithms, and also special cases which are tractable. Most of the algorithms we consider are based on the idea of building a tree which encodes all of the examples we have seen, and then merging equivalent nodes to produce a (near) minimal automaton.

Contents
1 Introduction
1.1 Applications of automaton inference
1.2 Why PFAs instead of other probabilistic models?
1.3 The input to/output from the algorithms
1.4 Batch vs. online algorithms ...
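The tree-then-merge idea can be sketched in a few lines (greatly simplified; the algorithms the survey covers merge states under compatibility tests, not only exact subtree equality): build a prefix tree from example strings, then share nodes whose entire subtrees are identical, yielding a smaller automaton.

```python
def build_prefix_tree(words):
    """Build a nested-dict prefix tree; "$" marks end of word."""
    tree = {}
    for w in words:
        node = tree
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return tree

def merge_equivalent(node, cache):
    """Assign one canonical id per distinct subtree, sharing equal ones."""
    key = tuple(sorted((ch, merge_equivalent(child, cache))
                       for ch, child in node.items()))
    return cache.setdefault(key, len(cache))

tree = build_prefix_tree(["aa", "ab", "ba", "bb"])
cache = {}
merge_equivalent(tree, cache)
print(len(cache))  # prints 4: distinct states after merging
```

Here the four-word prefix tree collapses to four distinct states, because both branches of the root carry identical subtrees.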
Learning to Model Sequences Generated by Switching Distributions
 Proceedings of the Eighth Annual Conference on Computational Learning Theory
, 1995
"... We study efficient algorithms for solving the following problem, which we call the switching distributions learning problem. A sequence S = oe 1 oe 2 : : : oe n , over a finite alphabet \Sigma is generated in the following way. The sequence is a concatenation of K runs, each of which is a consecut ..."
Abstract

Cited by 15 (1 self)
We study efficient algorithms for solving the following problem, which we call the switching distributions learning problem. A sequence S = σ_1 σ_2 ... σ_n over a finite alphabet Σ is generated in the following way. The sequence is a concatenation of K runs, each of which is a consecutive subsequence. Each run is generated by independent random draws from a distribution p_i over Σ, where p_i is an element of a set of distributions {p_1, ..., p_N}. The learning algorithm is given this sequence and its goal is to find approximations of the distributions p_1, ..., p_N, and give an approximate segmentation of the sequence into its constituent runs. We give an efficient algorithm for solving this problem and show conditions under which the algorithm is guaranteed to work with high probability.

1 Introduction

Our work is motivated by the Hidden Markov Model (HMM). The HMM is a model for the distribution of sequences over a finite alphabet Σ. An HMM ...
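A much simpler change-point heuristic than the paper's algorithm conveys the flavor of the segmentation task: slide two adjacent windows over the sequence and report the position where their empirical distributions differ most in L1 distance.

```python
from collections import Counter

def l1(c1, c2, w):
    """L1 distance between two empirical distributions over windows of size w."""
    keys = set(c1) | set(c2)
    return sum(abs(c1[k] / w - c2[k] / w) for k in keys)

def best_switch(seq, w=20):
    """Position where adjacent length-w windows disagree most."""
    return max(range(w, len(seq) - w + 1),
               key=lambda i: l1(Counter(seq[i - w:i]), Counter(seq[i:i + w]), w))

print(best_switch("a" * 50 + "b" * 50))  # prints 50
```

This toy detector recovers a single clean switch; the paper's setting is harder because runs repeat distributions from an unknown set and must be clustered as well as segmented.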
Local Prediction Approach for Protein Classification Using Probabilistic Suffix Trees
"... Probabilistic suffix tree (PST) is a stochastic model that uses a suffix tree as an index structure to store conditional probabilities associated with subsequences. PST has been successfully used to model and predict protein families following global approach. Their approach takes into account the e ..."
Abstract

Cited by 6 (0 self)
A probabilistic suffix tree (PST) is a stochastic model that uses a suffix tree as an index structure to store conditional probabilities associated with subsequences. PSTs have been successfully used to model and predict protein families following a global approach. That approach takes into account the entire sequence, and thus is not suitable for partially conserved families. We develop two variants of the PST for local prediction: multiple-domain prediction and best-domain prediction. The multiple-domain method predicts the probability that a protein belongs to a family based on one or more significant conserved regions, while the best-domain method does so based on the most conserved region in the query sequence. The time complexity of both of our approaches is the same as that of the global prediction, that is, O(Lm), where L is the depth bound of the tree and m is the size of the query sequence. We tested our algorithms on the Pfam database of protein families and compared the results with the global prediction method. The experimental results show that our approaches have higher prediction accuracy than the global approach. We also show that our local prediction approach is an effective way to extract motifs/domains. Our approaches employ a linear-time method for building the PST by adapting the linear-time construction of probabilistic automata reported by A. Apostolico et al.
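The best-domain idea can be illustrated independently of the PST details (hypothetical per-position scores, not the paper's method): classify by the best-scoring contiguous window rather than the whole-sequence total, so that unconserved flanks do not drown out a conserved region.

```python
def best_domain_score(scores, window):
    """Best average per-position score over any contiguous window."""
    return max(sum(scores[i:i + window]) / window
               for i in range(len(scores) - window + 1))

# Made-up per-position log-probability scores: a conserved region in the
# middle scores high, the flanking positions score low.
scores = [-5, -5, -1, -1, -1, -1, -5, -5]
print(best_domain_score(scores, 4))  # prints -1.0
```

A global score would average in the poorly matching flanks; the windowed maximum isolates the conserved domain, which is the essence of local prediction.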
Learning to impersonate
 In ICML 2006
, 2006
"... Consider Alice, who is interacting with Bob. Alice and Bob have some shared secret which helps Alice identify Bobimpersonators. Now consider Eve, who knows Alice and Bob, but does not know their shared secret. Eve would like to impersonate Bob and “fool ” Alice without knowing the secret. If Eve is ..."
Abstract

Cited by 5 (1 self)
Consider Alice, who is interacting with Bob. Alice and Bob have some shared secret which helps Alice identify Bob impersonators. Now consider Eve, who knows Alice and Bob, but does not know their shared secret. Eve would like to impersonate Bob and “fool” Alice without knowing the secret. If Eve is computationally unbounded, how long does she need to observe Alice and Bob interacting before she can successfully impersonate Bob? What is a good strategy for Eve in this setting? If Eve runs in polynomial time, and if there exists a one-way function, then it is not hard to see that Alice and Bob may be “safe” from impersonators, but is the existence of one-way functions an essential condition? Namely, if one-way functions do not exist, can an efficient Eve always impersonate Bob? In this work we consider these natural questions from the point of view of Eve, who is trying to observe Bob and learn to impersonate him. We formalize this setting in a new computational learning model of learning adaptively changing distributions (ACDs), which we believe captures a wide variety of natural learning tasks and is of interest from both cryptographic and computational learning points of view. We present a learning algorithm that Eve can use to successfully learn to impersonate Bob in the information-theoretic setting. We also show that in the computational setting an efficient Eve can learn to impersonate any efficient Bob if and only if one-way functions do not exist.
Advances in Hidden Markov Models for Sequence Annotation
"... One of the most basic tasks of bioinformatics is to identify features in a biological sequence. Whether those features are the binding sites of a protein, the regions of a DNA sequence that are most subject to selective pressures, or coding sequences found in an expressed sequence tag, this phase is ..."
Abstract

Cited by 4 (2 self)
One of the most basic tasks of bioinformatics is to identify features in a biological sequence. Whether those features are the binding sites of a protein, the regions of a DNA sequence that are most subject to selective pressures, or coding sequences found in an expressed sequence tag, this phase is fundamental to the process of sequence ...
Variations on Probabilistic Suffix Trees  a New Tool for Modeling and Prediction of Protein Families
"... Motivation We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor i ..."
Abstract

Cited by 2 (0 self)
Motivation: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. However, basic biological considerations such as amino acid background probabilities and amino acid substitution probabilities can be incorporated to improve the performance. Results: The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on one of the state-of-the-art databases of protein families, namely the Pfam database of Hidden Markov Models (HMMs), with s...