Results 1 - 10
of
43
The Hierarchical Hidden Markov Model: Analysis and Applications
- MACHINE LEARNING
, 1998
"... . We introduce, analyze and demonstrate a recursive hierarchical generalization of the widely used hidden Markov models, which we name Hierarchical Hidden Markov Models (HHMM). Our model is motivated by the complex multi-scale structure which appears in many natural sequences, particularly in langua ..."
Abstract
-
Cited by 176 (3 self)
- Add to MetaCart
. We introduce, analyze and demonstrate a recursive hierarchical generalization of the widely used hidden Markov models, which we name Hierarchical Hidden Markov Models (HHMM). Our model is motivated by the complex multi-scale structure which appears in many natural sequences, particularly in language, handwriting and speech. We seek a systematic unsupervised approach to the modeling of such structures. By extendingthe standard forward-backward(BaumWelch) algorithm, we derive an efficient procedure for estimating the model parameters from unlabeled data. We then use the trained model for automatic hierarchical parsing of observation sequences. We describe two applications of our model and its parameter estimation procedure. In the first application we show how to construct hierarchical models of natural English text. In these models different levels of the hierarchy correspond to structures on different length scales in the text. In the second application we demonstrate how HHMMs can b...
The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length
- Machine Learning
, 1996
"... . We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions gene ..."
Abstract
-
Cited by 148 (15 self)
- Add to MetaCart
. We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions generated by general probabilistic automata, we prove that the algorithm we present can efficiently learn distributions generated by PSAs. In particular, we show that for any target PSA, the KL-divergence between the distribution generated by the target and the distribution generated by the hypothesis the learning algorithm outputs, can be made small with high confidence in polynomial time and sample complexity. The learning algorithm is motivated by applications in human-machine interaction. Here we present two applications of the algorithm. In the first one we apply the algorithm in order to construct a model of the English language, and use this model to correct corrupted text. In the second ...
Universal prediction of individual sequences
- IEEE Transactions on Information Theory
, 1992
"... Abstruct-The problem of predicting the next outcome of an individual binary sequence using finite memory, is considered. The finite-state predictability of an infinite sequence is defined as the minimum fraction of prediction errors that can be made by any finite-state (FS) predictor. It is proved t ..."
Abstract
-
Cited by 129 (7 self)
- Add to MetaCart
Abstruct-The problem of predicting the next outcome of an individual binary sequence using finite memory, is considered. The finite-state predictability of an infinite sequence is defined as the minimum fraction of prediction errors that can be made by any finite-state (FS) predictor. It is proved that this FS pre-dictability can be attained by universal sequential prediction schemes. Specifically, an efficient prediction procedure based on the incremental parsing procedure of the Lempel-Ziv data com-pression algorithm is shown to achieve asymptotically the FS predictability. Finally, some relations between compressibility and predictability are pointed out, and the predictability is proposed as an additional measure of the complexity of a sequence. Index Terms-Predictability, compressibility, complexity, fi-nite-state machines, Lempel- Ziv algorithm.
Universal Prediction
- IEEE Transactions on Information Theory
, 1998
"... This paper consists of an overview on universal prediction from an information-theoretic perspective. Special attention is given to the notion of probability assignment under the selfinformation loss function, which is directly related to the theory of universal data compression. ..."
Abstract
-
Cited by 99 (6 self)
- Add to MetaCart
This paper consists of an overview on universal prediction from an information-theoretic perspective. Special attention is given to the notion of probability assignment under the selfinformation loss function, which is directly related to the theory of universal data compression.
Variable Length Markov Chains
- Annals of Statistics
, 1999
"... We study estimation in the class of stationary variable length Markov chains (VLMC) on a finite space. The processes in this class are still Markovian of higher order, but with memory of variable length yielding a much bigger and structurally richer class of models than ordinary higher order Markov ..."
Abstract
-
Cited by 66 (5 self)
- Add to MetaCart
We study estimation in the class of stationary variable length Markov chains (VLMC) on a finite space. The processes in this class are still Markovian of higher order, but with memory of variable length yielding a much bigger and structurally richer class of models than ordinary higher order Markov chains. From a more algorithmic view, the VLMC model class has attracted interest in information theory and machine learning but statistical properties have not been explored very much. Provided that good estimation is available, an additional structural richness of the model class enhances predictive power by finding a better trade-off between model bias and variance and allows better structural description which can be of specific interest. The latter is exemplified with some DNA data. A version of the tree-structured context algorithm, proposed by Rissanen (1983) in an information theoretical set-up, is shown to have new good asymptotic properties for estimation in the class of VLMC's, even when the underlying model increases in dimensionality: consistent estimation of minimal state spaces and mixing properties of fitted models are given. We also propose a new bootstrap scheme based on fitted VLMC's. We show its validity for quite general stationary categorical time series and for a broad range of statistical procedures. AMS 1991 subject classifications. Primary 62M05; secondary 60J10, 62G09, 62M10, 94A15 Key words and phrases. Bootstrap, categorical time series, central limit theorem, context algorithm, data compression, finite-memory sources, FSMX model, Kullback-Leibler distance, model selection, tree model. Short title: Variable Length Markov Chain 1 Research supported in part by the Swiss National Science Foundation. Part of the work has been done while visiting th...
The Context Tree Weighting Method: Basic Properties
- IEEE Transactions on Information Theory
, 1995
"... We describe a sequential universal data compression procedure for binary tree sources that performs the "double mixture". Using a context tree, this method weights in an efficient recursive way the coding distributions corresponding to all bounded memory tree sources, and achieves a desirable coding ..."
Abstract
-
Cited by 62 (1 self)
- Add to MetaCart
We describe a sequential universal data compression procedure for binary tree sources that performs the "double mixture". Using a context tree, this method weights in an efficient recursive way the coding distributions corresponding to all bounded memory tree sources, and achieves a desirable coding distribution for tree sources with an unknown model and unknown parameters. Computational and storage complexity of the proposed procedure are both linear in the source sequence length. We derive a natural upper bound on the cumulative redundancy of our method for individual sequences. The three terms in this bound can be identified as coding, parameter and model redundancy. The bound holds for all source sequence lengths, not only for asymptotically large lengths. The analysis that leads to this bound is based on standard techniques and turns out to be extremely simple. Our upper bound on the redundancy shows that the proposed context tree weighting procedure is optimal in the sense that i...
Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space
- Journal of Computational Biology
, 2000
"... Statistical modeling of sequences is a central paradigm of machine learning that � nds multiple uses in computational molecular biology and many other domains. The probabilistic automata typically built in these contexts are subtended by uniform, � xed-memory Markov models. In practice, such automa ..."
Abstract
-
Cited by 35 (5 self)
- Add to MetaCart
Statistical modeling of sequences is a central paradigm of machine learning that � nds multiple uses in computational molecular biology and many other domains. The probabilistic automata typically built in these contexts are subtended by uniform, � xed-memory Markov models. In practice, such automata tend to be unnecessarily bulky and computationally imposing both during their synthesis and use. Recently, D. Ron, Y. Singer, and N. Tishby built much more compact, tree-shaped variants of probabilistic automata under the assumption of an underlying Markov process of variable memory length. These variants, called Probabilistic Suf � x Trees (PSTs) were subsequently adapted by G. Bejerano and G. Yona and applied successfully to learning and prediction of protein families. The process of learning 2 the automaton from a given training set of sequences requires worst-case time, where is the total length of the sequences in and is the length of a longest substring of to be considered for a candidate state in the automaton. Once the automaton is built, predicting the likelihood of a query sequence of characters may cost time 2 in the worst case. The main contribution of this paper is to introduce automata equivalent to PSTs but having the following properties: Learning the automaton, for any, takes time. Prediction of a string of symbols by the automaton takes time. Along the way, the paper presents an evolving learning scheme and addresses notions of empirical probability and related ef � cient computation, which is a by-product possibly of more general interest. Key words: amnesic automata, probabilistic suf � x trees, variable memory Markovian models, protein families, protein classi � cation. 1
A Natural Law of Succession
, 1995
"... We present a new solution to multinomial estimation and demonstrate that our solution outperforms standard solutions both in theory and in practice. The novelty of our approach lies in our use of combinatorial priors on strings. I. Natural Strings An alphabet represents the set of logically possib ..."
Abstract
-
Cited by 33 (4 self)
- Add to MetaCart
We present a new solution to multinomial estimation and demonstrate that our solution outperforms standard solutions both in theory and in practice. The novelty of our approach lies in our use of combinatorial priors on strings. I. Natural Strings An alphabet represents the set of logically possible events. In this world, all strings are finite and most are very short. For this basic reason, natural strings do not include all the symbols in the alphabet. This claim is tautological for short strings, but it is also true for long strings. To model this phenomenon, we propose a uniform prior on the cardinalities of all nonempty subsets of the alphabet. Such a prior on an alphabet of size k entails the probability pN (x n jn) = min(k; n) ` k q '` n \Gamma 1 q \Gamma 1 '` n fn i g ' \Gamma1 for strings x n of length n with cardinality q. This probability is not Kolmogorov compatible. To obtain a conditional probability, we must use p(ijx n ; n + 1) instead of the more o...
Spam filtering using statistical data compression models
- Journal of Machine Learning Research
, 2006
"... Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task call ..."
Abstract
-
Cited by 33 (12 self)
- Add to MetaCart
Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. By modeling messages as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We evaluate the filtering performance of two different compression algorithms; dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
An Efficient Extension to Mixture Techniques for Prediction and Decision Trees
- Machine Learning
, 1999
"... We present an e#cient method for maintaining mixtures of prunings of a prediction or decision tree that extends the previous methods for "node-based" prunings (Buntine, 1990; Willems, Shtarkov, & Tjalkens, 1995; Helmbold & Schapire, 1997; Singer, 1997) to the larger class of edge-based prunings. The ..."
Abstract
-
Cited by 23 (4 self)
- Add to MetaCart
We present an e#cient method for maintaining mixtures of prunings of a prediction or decision tree that extends the previous methods for "node-based" prunings (Buntine, 1990; Willems, Shtarkov, & Tjalkens, 1995; Helmbold & Schapire, 1997; Singer, 1997) to the larger class of edge-based prunings. The method includes an online weight-allocation algorithm that can be used for prediction, compression and classification. Although the set of edge-based prunings of a given tree is much larger than that of node-based prunings, our algorithm has similar space and time complexity to that of previous mixture algorithms for trees. Using the general online framework of Freund & Schapire (1997), we prove that our algorithm maintains correctly the mixture weights for edge-based prunings with any bounded loss function. We also give a similar algorithm for the logarithmic loss function with a corresponding weight-allocation algorithm. Finally, we describe experiments comparing node-based and edge-based mixture models for estimating the probability of the next word in English text, which show the advantages of edge-based models. Keywords: mixture models, decision and prediction trees, on-line learning, statistical language modeling 1.

