Results 11–20 of 226
Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space
 Journal of Computational Biology
, 2000
Cited by 55 (7 self)
Statistical modeling of sequences is a central paradigm of machine learning that finds multiple uses in computational molecular biology and many other domains. The probabilistic automata typically built in these contexts are subtended by uniform, fixed-memory Markov models. In practice, such automata tend to be unnecessarily bulky and computationally imposing both during their synthesis and use. Recently, D. Ron, Y. Singer, and N. Tishby built much more compact, tree-shaped variants of probabilistic automata under the assumption of an underlying Markov process of variable memory length. These variants, called Probabilistic Suffix Trees (PSTs), were subsequently adapted by G. Bejerano and G. Yona and applied successfully to learning and prediction of protein families. The process of learning the automaton from a given training set S of sequences requires O(Ln²) worst-case time, where n is the total length of the sequences in S and L is the length of a longest substring of S to be considered for a candidate state in the automaton. Once the automaton is built, predicting the likelihood of a query sequence of m characters may cost O(m²) time in the worst case. The main contribution of this paper is to introduce automata equivalent to PSTs but having the following properties: learning the automaton, for any L, takes O(n) time; prediction of a string of m symbols by the automaton takes O(m) time. Along the way, the paper presents an evolving learning scheme and addresses notions of empirical probability and related efficient computation, which is a byproduct possibly of more general interest. Key words: amnesic automata, probabilistic suffix trees, variable memory Markovian models, protein families, protein classification.
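To make the variable-memory idea behind PSTs concrete, a naive sketch in Python might look like the following. This is not the paper's linear-time construction (it is deliberately the simple quadratic version the paper improves on), and the names `train_pst`, `max_depth`, and `min_count` are invented for this illustration:

```python
from collections import defaultdict

def train_pst(text, max_depth=3, min_count=2):
    """Count, for every context of length 0..max_depth, which symbol
    follows it; keep only contexts observed at least min_count times."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text)):
        for d in range(max_depth + 1):
            if i - d < 0:
                break
            ctx = text[i - d:i]
            counts[ctx][text[i]] += 1
    return {c: dict(nxt) for c, nxt in counts.items()
            if sum(nxt.values()) >= min_count}

def predict(pst, history, symbol):
    """P(symbol | longest retained suffix of history), falling back
    toward the empty context when a long context was not kept."""
    for k in range(len(history), -1, -1):
        ctx = history[len(history) - k:]
        if ctx in pst:
            nxt = pst[ctx]
            return nxt.get(symbol, 0) / sum(nxt.values())
    return 0.0
```

On a periodic string such as "abababab", the model learns that "a" is always followed by "b" while the empty context predicts each symbol with probability 0.5, which is exactly the variable-memory behavior described above.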
Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences
 In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI'04)
, 2004
Cited by 49 (3 self)
We present a new method for nonlinear prediction of discrete random sequences under minimal structural assumptions. We give a mathematical construction for optimal predictors of such processes, in the form of hidden Markov models. We then describe an algorithm, CSSR (Causal-State Splitting Reconstruction), which approximates the ideal predictor from data. We discuss the reliability of CSSR, its data requirements, and its performance in simulations. Finally, we compare our approach to existing methods using variable-length Markov models and cross-validated hidden Markov models, and show theoretically and experimentally that our method delivers results superior to the former and at least comparable to the latter.
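The core of the state-splitting idea is grouping histories whose conditional next-symbol distributions are statistically indistinguishable. A toy sketch of that grouping step is below; it replaces CSSR's hypothesis test with a simple total-variation threshold `tol`, and all function names here are invented for illustration:

```python
from collections import defaultdict

def next_symbol_dists(seq, max_len):
    """Empirical P(next symbol | history) for histories up to max_len."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(seq)):
        for d in range(max_len + 1):
            if i - d < 0:
                break
            counts[seq[i - d:i]][seq[i]] += 1
    return {h: {s: c / sum(nxt.values()) for s, c in nxt.items()}
            for h, nxt in counts.items()}

def cluster_histories(dists, alphabet, tol=0.05):
    """Greedily merge histories whose distributions agree within tol,
    a crude stand-in for CSSR's statistical test."""
    states = []  # each entry: (representative distribution, member histories)
    for h, d in sorted(dists.items(), key=lambda kv: (len(kv[0]), kv[0])):
        for rep, members in states:
            if max(abs(rep.get(s, 0) - d.get(s, 0)) for s in alphabet) <= tol:
                members.append(h)
                break
        else:
            states.append((d, [h]))
    return states
```

For the alternating sequence "0101...", histories ending in "0" cluster together (next symbol is always "1"), histories ending in "1" cluster together, and the mixed empty history forms its own group.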
Modeling system calls for intrusion detection with dynamic window sizes
 In Proceedings of the DARPA Information Survivability Conference and Exposition II (DISCEX)
, 2001
Cited by 42 (8 self)
We extend prior research on system call anomaly detection modeling methods for intrusion detection by incorporating dynamic window sizes. The window size is the length of the subsequence of a system call trace which is used as the basic unit for modeling program or process behavior. In this work we incorporate dynamic window sizes and show marked improvements in anomaly detection. We present two methods for estimating the optimal window size based on the available training data. The first method is an entropy modeling method which determines the optimal single window size for the data. The second method is a probability modeling method that takes into account context-dependent window sizes. A context-dependent window size model is motivated by the way that system calls are generated by processes. Sparse Markov transducers (SMTs) are used to compute the context-dependent window size model. We show over actual system call traces that the entropy modeling method leads to the optimal single window size. We also show that context-dependent window sizes outperform traditional system call modeling methods.
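The entropy-based selection of a single window size can be illustrated with a small sketch: estimate H(next call | previous w calls) for increasing w and stop once the entropy gain flattens. The functions `conditional_entropy` and `pick_window`, and the threshold `eps`, are invented for this sketch and are not the paper's exact procedure:

```python
import math
from collections import Counter

def conditional_entropy(trace, w):
    """H(next call | previous w calls), estimated from counts."""
    ctx_counts = Counter(tuple(trace[i:i + w]) for i in range(len(trace) - w))
    pair_counts = Counter((tuple(trace[i:i + w]), trace[i + w])
                          for i in range(len(trace) - w))
    n = len(trace) - w
    h = 0.0
    for (ctx, sym), c in pair_counts.items():
        h -= (c / n) * math.log2(c / ctx_counts[ctx])
    return h

def pick_window(trace, max_w=6, eps=0.05):
    """Smallest w whose entropy is within eps of the next larger size."""
    hs = [conditional_entropy(trace, w) for w in range(1, max_w + 1)]
    for w in range(1, max_w):
        if hs[w - 1] - hs[w] <= eps:
            return w
    return max_w
```

For a strictly periodic trace such as open/read/close repeated, one previous call already determines the next, so the conditional entropy is zero at w = 1 and the smallest window is chosen.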
Modeling Protein Families Using Probabilistic Suffix Trees
, 1999
Cited by 40 (6 self)
We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. Incorporating basic biological considerations, such as amino acid background probabilities and amino acid substitution probabilities, can improve the performance in some cases. The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on one of the state-of-the-art databases of protein families, namely, the Pfam database of HMMs, with satisfactory performance.
Protein Family Classification Using Sparse Markov Transducers
 In Proc. 8th Int. Conf. on Intelligent Systems for Molecular Biology
, 2003
Cited by 37 (10 self)
We present a method for classifying proteins into families based on short subsequences of amino acids using a new probabilistic model called sparse Markov transducers (SMT). We classify a protein by estimating probability distributions over subsequences of amino acids from the protein. Sparse Markov transducers, similar to probabilistic suffix trees, estimate a probability distribution conditioned on an input sequence. SMTs generalize probabilistic suffix trees by allowing for wildcards in the conditioning sequences. Since substitutions of amino acids are common in protein families, incorporating wildcards into the model significantly improves classification performance. We present two models for building protein family classifiers using SMTs. As protein databases become larger, data-driven learning algorithms for probabilistic models such as SMTs will require vast amounts of memory. We therefore describe and use efficient data structures to improve the memory usage of SMTs. We evaluate SMTs by building protein family classifiers using the Pfam and SCOP databases and compare our results to previously published results and state-of-the-art protein homology detection methods. SMTs outperform previous probabilistic suffix tree methods and under certain conditions perform comparably to state-of-the-art protein homology methods.
A Survey of POMDP Solution Techniques
, 2000
Cited by 34 (0 self)
this paper, we assume all actions take one unit of discrete time at some (unspecified) time scale. If we allow actions to take variable lengths of time, we end up with a semi-Markov model; see, e.g., [SPS99].
A Monte-Carlo AIXI Approximation
, 2009
Cited by 28 (9 self)
This paper describes a computationally feasible approximation to the AIXI agent, a universal reinforcement learning agent for arbitrary environments. AIXI is scaled down in two key ways. First, the class of environment models is restricted to all prediction suffix trees of a fixed maximum depth; this allows a Bayesian mixture of environment models to be computed in time proportional to the logarithm of the size of the model class. Second, the finite-horizon expectimax search is approximated by an asymptotically convergent Monte Carlo Tree Search technique. This scaled-down AIXI agent is empirically shown to be effective on a wide class of toy problem domains, ranging from simple fully observable games to small POMDPs. We explore the limits of this approximate agent and propose a general heuristic framework for scaling this technique to much larger problems.
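At the leaves of such a context-tree mixture over prediction suffix trees, a Krichevsky-Trofimov (KT) estimator typically supplies the base sequence probabilities. A minimal sketch of the estimator is below; the full context-tree weighting recursion and the Monte Carlo planner are omitted, and the function names are invented:

```python
def kt_update(counts, bit):
    """Sequential KT probability of `bit` given the [zeros, ones]
    counts seen so far in this context; updates counts in place."""
    total = counts[0] + counts[1]
    p = (counts[bit] + 0.5) / (total + 1.0)
    counts[bit] += 1
    return p

def kt_sequence_prob(bits):
    """KT probability of a whole bit string: the product of the
    sequential per-bit predictions."""
    counts = [0, 0]
    prob = 1.0
    for b in bits:
        prob *= kt_update(counts, b)
    return prob
```

The estimator assigns probability (1/2)(3/4)(5/6) = 0.3125 to "111", so repeated symbols quickly become cheap to encode while the estimator never assigns probability zero to the other symbol.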
An Efficient Extension to Mixture Techniques for Prediction and Decision Trees
 Machine Learning
, 1999
Cited by 28 (5 self)
We present an efficient method for maintaining mixtures of prunings of a prediction or decision tree that extends the previous methods for "node-based" prunings (Buntine, 1990; Willems, Shtarkov, & Tjalkens, 1995; Helmbold & Schapire, 1997; Singer, 1997) to the larger class of edge-based prunings. The method includes an online weight-allocation algorithm that can be used for prediction, compression, and classification. Although the set of edge-based prunings of a given tree is much larger than that of node-based prunings, our algorithm has space and time complexity similar to that of previous mixture algorithms for trees. Using the general online framework of Freund & Schapire (1997), we prove that our algorithm correctly maintains the mixture weights for edge-based prunings with any bounded loss function. We also give a similar algorithm for the logarithmic loss function with a corresponding weight-allocation algorithm. Finally, we describe experiments comparing node-based and edge-based mixture models for estimating the probability of the next word in English text, which show the advantages of edge-based models. Keywords: mixture models, decision and prediction trees, online learning, statistical language modeling
Conventional and Periodic N-Grams in the Transcription of Drum Sequences
 In Proc. of IEEE International Conference on Multimedia and Expo
, 2003
Cited by 27 (7 self)
In this paper, we describe a system for transcribing polyphonic drum sequences from an acoustic signal to a symbolic representation. Low-level signal analysis is done with an acoustic model consisting of a Gaussian mixture model and a support vector machine. For higher-level modeling, periodic N-grams are proposed to construct a "language model" for music, based on the repetitive nature of musical structure. Also, a technique for estimating relatively long N-grams is introduced. The performance of N-grams in the transcription was evaluated using a database of realistic drum sequences from different genres and yielded a performance increase of 7.6% compared to the use of only prior (unigram) probabilities with the acoustic model.
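The periodic-N-gram idea (conditioning on the event one pattern period earlier, rather than on the immediately preceding event) can be sketched as follows; `period` and the function names are invented for this illustration:

```python
from collections import defaultdict

def train_periodic_bigram(events, period):
    """Count P(event at t | event at t - period) over period-spaced
    pairs, exploiting the repetitive structure of the sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in range(period, len(events)):
        counts[events[t - period]][events[t]] += 1
    return counts

def prob(counts, prev_period_event, event):
    """Empirical probability of `event` given the event one period ago."""
    nxt = counts.get(prev_period_event)
    if not nxt:
        return 0.0
    return nxt.get(event, 0) / sum(nxt.values())
```

On a drum loop with a fixed bar length, the event one period back is a far stronger predictor than the adjacent event, which is the motivation for the periodic "language model" above.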