## The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length (1996)

### Download Links

- [www.cs.huji.ac.il]
- [www.eng.tau.ac.il]
- [portal.research.bell-labs.com]
- CiteULike
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 172 (16 self)

### BibTeX

@ARTICLE{Ron96thepower,
  author  = {Dana Ron and Yoram Singer and Naftali Tishby},
  title   = {The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length},
  journal = {Machine Learning},
  year    = {1996},
  pages   = {117--149}
}

### Abstract

We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions generated by general probabilistic automata, we prove that the algorithm we present can efficiently learn distributions generated by PSAs. In particular, we show that for any target PSA, the KL-divergence between the distribution generated by the target and the distribution generated by the hypothesis the learning algorithm outputs can be made small with high confidence, in polynomial time and sample complexity. The learning algorithm is motivated by applications in human-machine interaction. Here we present two applications of the algorithm. In the first one we apply the algorithm in order to construct a model of the English language, and use this model to correct corrupted text. In the second ...
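The variable-memory idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm (it omits the smoothing step and the KL-based criterion for growing the suffix tree); the function names and the `min_count` threshold are illustrative assumptions:

```python
from collections import Counter, defaultdict

def train_vmm(text, max_order=3, min_count=2):
    """Count next-symbol frequencies for every context (suffix of the
    history) up to max_order; keep only contexts seen at least
    min_count times. Longer contexts are stored only where the data
    supports them, which is the variable-memory-length idea."""
    counts = defaultdict(Counter)
    for i in range(len(text)):
        for k in range(max_order + 1):
            if i - k < 0:
                break
            counts[text[i - k:i]][text[i]] += 1
    return {ctx: c for ctx, c in counts.items()
            if sum(c.values()) >= min_count}

def predict(model, history, max_order=3):
    """Back off to the longest stored suffix of the history."""
    for k in range(min(max_order, len(history)), -1, -1):
        ctx = history[len(history) - k:]
        if ctx in model:
            c = model[ctx]
            total = sum(c.values())
            return {s: n / total for s, n in c.items()}
    return {}

model = train_vmm("abracadabra" * 20)
probs = predict(model, "abr")  # in this corpus "abr" is always followed by "a"
```

Prediction backs off to the longest context the model actually stored, which is the practical payoff of variable memory length over a fixed-order Markov chain.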

### Citations

8563 | Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context: ...istributions. Similar definitions can be considered for other distance measures such as the variation and the quadratic distances. Note that the KL-divergence bounds the variation distance as follows [6]: D_KL[P1||P2] ≥ (1/2) ||P1 − P2||₁². Since the L1 norm bounds the L2 norm, the last bound holds for the quadratic distance as well. Note that the KL-divergence between distributions, gen...
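The bound quoted in this context is Pinsker's inequality (with natural logarithms). A quick numerical sanity check, illustrative only:

```python
import math
import random

def kl(p, q):
    """KL-divergence D_KL[p || q] in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def l1(p, q):
    """Variation (L1) distance between two distributions."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

random.seed(0)
for _ in range(1000):
    p = [random.random() + 1e-9 for _ in range(4)]
    q = [random.random() + 1e-9 for _ in range(4)]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    # Pinsker's inequality: D_KL[p || q] >= (1/2) * ||p - q||_1 ** 2
    assert kl(p, q) >= 0.5 * l1(p, q) ** 2 - 1e-12
```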

8089 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...HMM from a given sample is a maximum likelihood parameter estimation procedure that is based on the Baum-Welch method [3], [2] (which is a special case of the EM (Expectation-Maximization) algorithm [7]). However, this algorithm is guaranteed to converge only to a local maximum, and thus we are not assured that the hypothesis it outputs can serve as a good ap...

4273 | A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition
- Rabiner
- 1989
Citation Context: ...opular) model used in modeling natural sequences is the Hidden Markov Model (HMM). A detailed tutorial on the theory of HMMs as well as selected applications in speech recognition is given by Rabiner [22]. A commonly used procedure for learning an HMM from a given sample is a maximum likelihood parameter estimation procedure that is based on the Baum-Welch method [3], [2] (which is a special case of t...

2611 | Dynamic Programming
- Bellman
- 1957
Citation Context: ...erwise. Note that the sum (10c) can be computed efficiently in a recursive manner. Moreover, the maximization of Equation (10a) can be performed efficiently by using a dynamic programming (DP) scheme [4]. This scheme requires O(|Q| × t) operations. If |Q| is large, then approximation schemes to the optimal DP, such as the stack decoding algorithm [13], can be employed. Using similar methods it is...
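The DP maximization described in this context is the standard Viterbi recursion. A generic log-space sketch with toy HMM parameters (illustrative only, not the paper's model or its equations):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most-likely hidden state sequence for an HMM, computed column by
    column; each observation adds one DP column over the state set Q."""
    V = [{s: math.log(start_p[s] * emit_p[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: V[-1][r] + math.log(trans_p[r][s]))
            col[s] = V[-1][prev] + math.log(trans_p[prev][s] * emit_p[s][o])
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy parameters (assumptions for the example, not from the paper)
states = ("Healthy", "Fever")
start = {"Healthy": 0.6, "Fever": 0.4}
trans = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
         "Fever":   {"Healthy": 0.4, "Fever": 0.6}}
emit = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
        "Fever":   {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}
path = viterbi(("normal", "cold", "dizzy"), states, start, trans, emit)
```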

834 | A tutorial on hidden Markov models
- Rabiner, Juang
- 1989
Citation Context: ...far for any of the more popular sequence modeling algorithms. 1.1. Related Work The most powerful (and perhaps most popular) model used in modeling natural sequences is the Hidden Markov Model (HMM) [16], for which there exists a maximum likelihood estimation procedure which is widely used in many applications [15]. From the computational learning theory point of view, however, the HMM has severe dr...

772 | A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains
- Baum, Petrie, et al.
- 1970
Citation Context: ...h recognition is given by Rabiner [22]. A commonly used procedure for learning an HMM from a given sample is a maximum likelihood parameter estimation procedure that is based on the Baum-Welch method [3], [2] (which is a special case of the EM (Expectation-Maximization) algorithm [7]). However, this algorithm is guaranteed to converge only to a local maximum, and thus we are not assured that the hypo...

730 | Compression of individual sequences via variable-rate coding
- Ziv, Lempel
- 1978
Citation Context: ...ey are able to show that this PST converges only in the limit of infinite sequence length to that source. Vitter and Krishnan [31], [16] adapt a version of the Ziv-Lempel data compression algorithm [34] to get a page prefetching algorithm, where the sequence of page accesses is assumed to be generated by a PFA. They show that the page fault rate of their algorithm converges to the page fault rate of...
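The Ziv-Lempel scheme referenced above builds its model by incremental parsing: each new phrase extends a previously seen phrase by one symbol. A minimal LZ78-style parse, for illustration (the prefetching adaptation by Vitter and Krishnan adds prediction machinery on top of this dictionary):

```python
def lz78_phrases(s):
    """Parse s into LZ78 dictionary phrases; every emitted phrase is a
    previously seen phrase extended by one symbol."""
    seen, phrases, cur = {""}, [], ""
    for ch in s:
        cur += ch
        if cur not in seen:   # first time this phrase appears
            seen.add(cur)
            phrases.append(cur)
            cur = ""          # start the next phrase from scratch
    return phrases

print(lz78_phrases("ababcbababaa"))
```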

512 | An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process
- Baum
- 1972
Citation Context: ...ognition is given by Rabiner [22]. A commonly used procedure for learning an HMM from a given sample is a maximum likelihood parameter estimation procedure that is based on the Baum-Welch method [3], [2] (which is a special case of the EM (Expectation-Maximization) algorithm [7]). However, this algorithm is guaranteed to converge only to a local maximum, and thus we are not assured that the hypothesi...

339 | Prediction and Entropy of Printed English
- Shannon
- 1951
Citation Context: ...that the conditional probability distribution does not change substantially if we condition it on preceding subsequences of length greater than L. This observation led Shannon, in his seminal paper [29], to suggest modeling such sequences by Markov chains of order L > 1, where the order is the memory length of the model. Alternatively, such sequences may be modeled by Hidden Markov Models (HMMs) whi...
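The trade-off behind Shannon's suggestion is concrete: an order-L Markov chain over alphabet Σ conditions on |Σ|^L distinct contexts, so model size is exponential in the memory length. A small illustration (the 27-symbol figure assumes letters plus space, as in Shannon-style English models):

```python
from itertools import product

def num_contexts(alphabet_size, order):
    """Number of conditioning contexts in an order-L Markov chain."""
    return alphabet_size ** order

# Check the formula against explicit enumeration for a tiny alphabet.
assert len(list(product("ab", repeat=3))) == num_contexts(2, 3)

# 26 letters plus space: order 2 is manageable, order 5 is not.
print(num_contexts(27, 2))  # 729
print(num_contexts(27, 5))  # 14348907
```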

338 | Self-organized Language Modeling for Speech Recognition
- Jelinek
- 1990
Citation Context: ...rning due to its wide variety of natural applications. The most noticeable examples of such applications are statistical models in human communication such as natural language, handwriting and speech [14], [21], and statistical models of biological sequences such as DNA and proteins [17]. These kinds of complex sequences clearly do not have any simple underlying statistical source since they are gener...

236 | Optimal Prefetching via Data Compression
- Vitter, Krishnan
- 1991
Citation Context: ...th PST. However, in case the source generating the examples is a PST, they are able to show that this PST converges only in the limit of infinite sequence length to that source. Vitter and Krishnan [31], [16] adapt a version of the Ziv-Lempel data compression algorithm [34] to get a page prefetching algorithm, where the sequence of page accesses is assumed to be generated by a PFA. They show that th...

185 | Learning decision trees using the Fourier spectrum
- Kushilevitz, Mansour
- 1993
Citation Context: ...exponential blow-up in the number of strings tested. A similar type of branch-and-bound technique (with various bounding criteria) is applied in many algorithms which use trees as data structures (cf. [18]). The set of strings tested at each step, denoted by S̄, can be viewed as a kind of potential frontier of the growing tree T̄, which is of bounded size. After the construction of T̄ is completed, we de...

160 | A universal data compression system
- Rissanen
- 1983
Citation Context: ...m for learning noisy parity functions in the PAC model. The machines used as our hypothesis representation, namely Probabilistic Suffix Trees (PSTs), were introduced (in a slightly different form) in [23] and have been used for other tasks such as universal data compression [23], [24], [32], [33]. Perhaps the strongest among these results (which has been brought to our attention after the completion o...

107 | Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains, with an application to the exclusion process
- Fill
- 1991
Citation Context: ...this convergence rate can be bounded using the expansion properties of a weighted graph related to UM [20] or, more generally, using algebraic properties of UM, namely, its second largest eigenvalue [8]. 4. Emulation of PSAs by PSTs In this section we show that for every PSA there exists an equivalent PST which is not much larger. This allows us to consider the PST equivalent to our target PSA, when...

103 | A fast sequential decoding algorithm using a stack
- Jelinek
- 1969
Citation Context: ...ly by using a dynamic programming (DP) scheme [4]. This scheme requires O(|Q| × t) operations. If |Q| is large, then approximation schemes to the optimal DP, such as the stack decoding algorithm [13], can be employed. Using similar methods it is also possible to correct errors when insertions and deletions of symbols occur as well. We tested the algorithm by taking a text from Genesis and corrupti...

92 | On the learnability of discrete distributions
- Kearns, Mansour, et al.
- 1994
Citation Context: ...the rate in which the target machine converges to its stationary distribution. Despite an intractability result concerning the learnability of distributions generated by Probabilistic Finite Automata [15] (described in Section 1.1), our restricted model can be learned efficiently in a PAC-like sense. This has not been shown so far for any of the more popular sequence modeling algorithms. We present tw...

85 | On the Computational Complexity of Approximating Distributions by Probabilistic Automata
- Abe, Warmuth
- 1998
Citation Context: ...e that the problem can be overcome by improving the algorithm used or by finding a new approach. Unfortunately, there is strong evidence that the problem cannot be solved efficiently. Abe and Warmuth [1] study the problem of training HMMs. The HMM training problem is the problem of approximating an arbitrary, unknown source distribution by distributions generated by HMMs. They prove that HMMs are not...

79 | The context tree weighting method: Basic properties
- Willems, Shtarkov, et al.
- 1995
Citation Context: ...epresentation, namely Probabilistic Suffix Trees (PSTs), were introduced (in a slightly different form) in [23] and have been used for other tasks such as universal data compression [23], [24], [32], [33]. Perhaps the strongest among these results (which has been brought to our attention after the completion of this work), and which is most tightly related to our result, is [33]. This paper describes an...

76 | The power of amnesia
- Ron, Singer, et al.
- 1994
Citation Context: ...e size of the sample times L. 7. Applications A slightly modified version of our learning algorithm was applied and tested on various problems such as: correcting corrupted text, predicting DNA bases [25], and part-of-speech disambiguation [28]. We are still exploring other possible applications of the algorithm. Here we demonstr...

70 | On the Learnability and Usage of Acyclic Probabilistic Finite Automata
- Ron, Singer, et al.
- 1995
Citation Context: ...model to correct corrupted text. In the second application we construct a simple stochastic model for E. coli DNA. Combined with a learning algorithm for a different subclass of probabilistic automata [26], the algorithm presented here is part of a complete cursive handwriting recognition system [30]. 1.1. Related Work The most powerful (and perhaps most popular) model used in modeling natural sequence...

63 | A hidden Markov model that finds genes in E. coli DNA
- Krogh, Haussler
- 1994
Citation Context: ...of such applications are statistical models in human communication such as natural language, handwriting and speech [14], [21], and statistical models of biological sequences such as DNA and proteins [17]. These kinds of complex sequences clearly do not have any simple underlying statistical source since they are generated by natural sources. However, they typically exhibit the following statistical p...

61 | Conductance and convergence of Markov chains: a combinatorial treatment of expanders
- Mihail
Citation Context: ...re convergence of the probability of visiting a state to the stationary probability. We show that this convergence rate can be bounded using the expansion properties of a weighted graph related to UM [20] or, more generally, using algebraic properties of UM, namely, its second largest eigenvalue [8]. 4. Emulation of PSAs by PSTs In this section we show that for every PSA there exists an equivalent PST...

53 | Markov source modeling in text generation
- Jelinek
- 1985
Citation Context: ...substantially if we condition it on preceding subsequences of length greater than L. These observations suggest modeling such sequences by Markov chains of order L > 1 (also known as n-gram models [9]), where the order is the memory length of the model, or alternatively, by Hidden Markov Models (HMM). These statistical models capture rich families of sequen...

52 | Complexity of strings in the class of Markov sources
- Rissanen
- 1986
Citation Context: ...hypothesis representation, namely Probabilistic Suffix Trees (PSTs), were introduced (in a slightly different form) in [23] and have been used for other tasks such as universal data compression [23], [24], [32], [33]. Perhaps the strongest among these results (which has been brought to our attention after the completion of this work) and which is most tightly related to our result is [33]. This paper...

47 | Efficient learning of typical finite automata from walks
- Freund, Kearns, et al.
- 1993
Citation Context: ...families of distributions related to the ones studied in this paper, but his algorithms depend exponentially and not polynomially on the order, or memory length, of the distributions. Freund et al. [9] point out that their result for learning typical deterministic finite automata from random walks without membership queries can be extended to learning typical PFAs. Unfortunately, there is strong e...

35 | Estimation of probabilities in the language model of the IBM speech recognition system
- Nadas
- 1984
Citation Context: ...due to its wide variety of natural applications. The most noticeable examples of such applications are statistical models in human communication such as natural language, handwriting and speech [14], [21], and statistical models of biological sequences such as DNA and proteins [17]. These kinds of complex sequences clearly do not have any simple underlying statistical source since they are generated b...

33 | Learning and robust learning of product distributions
- Hoffgen
- 1993
Citation Context: ...tial in L and hence, if we want to capture more than very short-term memory dependencies in the sequences, then these models are clearly not practical. Hoffgen [12] studies families of distributions related to the ones studied in this paper, but his algorithms depend exponentially and not polynomially on the order, or memory length, of the distributions. Freund...

33 | A sequential algorithm for the universal coding of finite memory sources
- Weinberger, Lempel, et al.
Citation Context: ...esis representation, namely Probabilistic Suffix Trees (PSTs), were introduced (in a slightly different form) in [23] and have been used for other tasks such as universal data compression [23], [24], [32], [33]. Perhaps the strongest among these results (which has been brought to our attention after the completion of this work) and which is most tightly related to our result is [33]. This paper descri...

31 | Discrete sequence prediction and its applications
- Laird, Saul
- 1994
Citation Context: ...has full knowledge of the source. This is true for almost all page access sequences (in the limit of the sequence length). Laird and Saul [19] describe a prediction algorithm which is similar in spirit to our algorithm and is based on the Markov tree or Directed Acyclic Word Graph approach which is used for data compression [5]. They do not...

27 | Optimal prediction for prefetching in the worst case
- Krishnan, Vitter
- 1998
Citation Context: ...However, in case the source generating the examples is a PST, they are able to show that this PST converges only in the limit of infinite sequence length to that source. Vitter and Krishnan [31], [16] adapt a version of the Ziv-Lempel data compression algorithm [34] to get a page prefetching algorithm, where the sequence of page accesses is assumed to be generated by a PFA. They show that the page...

21 | Part-of-speech tagging using a variable memory Markov model
- Schütze, Singer
- 1994
Citation Context: ...nd tested on various problems such as: correcting corrupted text, predicting DNA bases [25], and part-of-speech disambiguation [28]. We are still exploring other possible applications of the algorithm. Here we demonstrate how the algorithm can be used to correct corrupted text and how to build a simple model for DNA strands. 7.1....

15 | Inference and minimization of hidden Markov chains
- Gillman, Sipser
- 1994
Citation Context: ...mating an arbitrary, unknown source distribution by distributions generated by HMMs. They prove that HMMs are not trainable in time polynomial in the alphabet size, unless RP = NP. Gillman and Sipser [10] study the problem of exactly inferring an (ergodic) HMM over a binary alphabet when the inference algorithm can query a probability oracle for the long-term probability of any binary string. They pro...

4 | Maps, genes, sequences, and computers: An Escherichia coli case study
- 1992
Citation Context: ...of protein coding genes and fillers between those regions named intergenic regions. Locating the coding genes is necessary prior to any further DNA analysis. Using manually segmented data of E. coli [27] we built two different PSAs, one for the coding regions and one for the intergenic regions. We disregarded the internal (triplet) structure of the coding genes and the existence of start and stop cod...

3 | An adaptive cursive handwriting recognition system
- Singer, Tishby
- 1995
Citation Context: ...l for E. coli DNA. Combined with a learning algorithm for a different subclass of probabilistic automata [26], the algorithm presented here is part of a complete cursive handwriting recognition system [30]. 1.1. Related Work The most powerful (and perhaps most popular) model used in modeling natural sequences is the Hidden Markov Model (HMM). A detailed tutorial on the theory of HMMs as well as selecte...

3 | Error bounds for convolutional codes and an asymptotically optimal decoding algorithm
- Viterbi
- 1967
Citation Context: ...e computed efficiently in a recursive manner. Moreover, the maximization of Equation (10a) can be performed efficiently by using a dynamic programming (DP) scheme, also known as the Viterbi algorithm [22]. This scheme requires O(|Q| × n) operations. If |Q| is large, then approximation schemes to the optimal DP, such as the stack decoding algorithm [8], can be employed. Using similar methods it is...

2 | Statistics of language: Introduction
- Good
- 1969
Citation Context: ...r method is information theoretic and does not depend on separation assumptions for any complexity classes. Natural simpler alternatives, which are often used as well, are order-L Markov chains [29], [11], also known as n-gram models. As noted earlier, the size of an order-L Markov chain is exponential in L and hence, if we want to capture more than very short-term memory dependencies in the sequence...

1 | Applications of DAWGs to data compression
- Blumer
Citation Context: ...rd and Saul [19] describe a prediction algorithm which is similar in spirit to our algorithm and is based on the Markov tree or Directed Acyclic Word Graph approach which is used for data compression [5]. They do not analyze the correctness of the algorithm formally, but present several applications of the algorithm. 1.2. Overview of the Paper The paper is organized as follows. In Section 2 we give ba...