## On the Computational Complexity of Approximating Distributions by Probabilistic Automata (1990)

### Download Links

- [ftp.cse.ucsc.edu]
- [www.es.dis.titech.ac.jp]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 85 (0 self)

### BibTeX

@INPROCEEDINGS{Abe90onthe,
  author    = {Naoki Abe and Manfred K. Warmuth},
  title     = {On the Computational Complexity of Approximating Distributions by Probabilistic Automata},
  booktitle = {Machine Learning},
  year      = {1990},
  pages     = {205--260}
}

### Abstract

We introduce a rigorous performance criterion for training algorithms for probabilistic automata (PAs) and hidden Markov models (HMMs), which are used extensively for speech recognition, and analyze the complexity of the training problem as a computational problem. The PA training problem is the problem of approximating an arbitrary, unknown source distribution by distributions generated by a PA. We investigate the following question about this important, well-studied problem: does there exist an efficient training algorithm such that the trained PAs provably converge to a model close to an optimum one with high confidence, after only a feasibly small set of training data? We model this problem in the framework of computational learning theory and analyze both the sample complexity and the computational complexity. We show that the number of examples required for training PAs is moderate -- essentially linear in the number of transition probabilities to be trained and a low-degree polynomial in the example l...
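To make the object of study concrete, here is a minimal sketch of a probabilistic automaton as a string generator, the formulation the paper uses (distributions over strings induced by transition probabilities). The states, alphabet, and probabilities below are invented for illustration and are not taken from the paper.

```python
import random

# Hypothetical two-state PA over the alphabet {"a", "b"}.
# Each state maps to (symbol, next_state, probability) triples;
# the probabilities leaving each state sum to one.
PA = {
    "q0": [("a", "q0", 0.5), ("b", "q1", 0.5)],
    "q1": [("a", "q0", 0.3), ("b", "q1", 0.7)],
}

def generate(pa, start="q0", length=5, rng=random):
    """Sample a string of the given length from the PA."""
    state, out = start, []
    for _ in range(length):
        r, acc = rng.random(), 0.0
        for sym, nxt, p in pa[state]:
            acc += p
            if r < acc:          # pick this transition
                out.append(sym)
                state = nxt
                break
    return "".join(out)

print(generate(PA, length=8, rng=random.Random(0)))
```

The training problem studied in the paper is the inverse task: given sample strings from an unknown source, set the transition probabilities so the induced distribution is close to the source.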

### Citations

1693 | A theory of the learnable - Valiant - 1984
Citation Context: "...n independently at random according to D̂_S. Notice that at each iteration the length of S′ increases by one with probability at least one half. It is easy to see, by applying Chernoff's bound (cf. [Val84]), that in p(m′, 1/δ) many iterations the length of S′ becomes m′ with probability at least 1 − δ, where p is a certain polynomial. If this fails to occur, i.e. the length of S′ is shorter tha..."

624 | Learnability and the Vapnik-Chervonenkis dimension - Blumer, Ehrenfeucht, et al. - 1989

567 | Convergence of Stochastic Processes - Pollard - 1984
Citation Context: "...d as max{ (32(n+1)²t/ε²) ln(64t(n+1)²t/ε²), (8(n+1)² ln(1/δ)/ε²) log₂(8t(n+1)² ln(1/δ)/ε²) }. Proof of Lemma 3.1: We use the following lemma due to Hoeffding (see for example [Pol84]). Lemma 3.2 (Hoeffding): Let F be a finite class of bounded random variables on a set X, that is, for each f ∈ F, f : X → [0, M] for some real M ∈ ℝ. Let D be an arbitrary distribution over X. Then ..."
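The Hoeffding bound quoted above can be illustrated numerically. This is a quick sanity check invented for this page, not material from the paper: for a single random variable bounded in [0, M], Hoeffding gives Pr[|mean_m − E| ≥ ε] ≤ 2·exp(−2mε²/M²), and the empirical deviation frequency should stay below that bound.

```python
import math
import random

def deviation_frequency(m, eps, trials=2000, rng=None):
    """Estimate how often the mean of m Uniform[0,1] draws deviates
    from its expectation 1/2 by at least eps."""
    rng = rng or random.Random(0)
    bad = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(m)) / m
        if abs(mean - 0.5) >= eps:
            bad += 1
    return bad / trials

m, eps = 100, 0.1
empirical = deviation_frequency(m, eps)
bound = 2 * math.exp(-2 * m * eps ** 2)   # M = 1 for Uniform[0,1]
print(empirical, bound)  # empirical frequency stays below the Hoeffding bound
```

The bound is loose here (about 0.27 versus an empirical frequency near zero), which is typical: Hoeffding holds for any bounded distribution, not just this one.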

512 | An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process - Baum - 1972
Citation Context: "...ood on the input sample, is hard to...² The existence of an algorithm for training hidden Markov models which always outputs a near local optimum on a given sample is well known (the `Baum-Welch' algorithm [Bau72]) and is used extensively in practice. Note that in applications to speech recognition, the alphabet size is determined by how precisely the acoustic signals are quantized. The alphabet size is often in..."
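The likelihood computation underlying each Baum-Welch iteration is the forward algorithm. Below is a hedged sketch for a tiny two-state HMM; the parameters are invented for illustration, and Baum-Welch itself would re-estimate them from expected counts, converging to the local optimum mentioned in the context above.

```python
# Toy HMM: states 0 and 1, alphabet {"a", "b"}; numbers are illustrative only.
init = [0.6, 0.4]                      # initial state distribution
trans = [[0.7, 0.3], [0.4, 0.6]]       # trans[i][j] = P(next state j | state i)
emit = [{"a": 0.9, "b": 0.1},          # emit[i][s] = P(symbol s | state i)
        {"a": 0.2, "b": 0.8}]

def forward_likelihood(obs):
    """P(obs) under the HMM, computed by the forward recursion."""
    # alpha[j] = P(prefix so far, current state = j)
    alpha = [init[i] * emit[i][obs[0]] for i in range(2)]
    for sym in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(2)) * emit[j][sym]
            for j in range(2)
        ]
    return sum(alpha)

print(forward_likelihood("aab"))
```

Note that, as the context explains, outputs here are attached to states (HMM convention) rather than to transitions (PA convention); the two models are closely related but not identical.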

372 | Decision theoretic generalizations of the PAC model for neural net and other learning applications - Haussler - 1992

222 | Computational complexity of probabilistic Turing machines - Gill - 1977
Citation Context: "...dable in polynomial time and NP the class of decision problems acceptable in non-deterministic polynomial time. RP denotes the class of decision problems that are acceptable in random polynomial time [Gil77]: a decision problem L is said to be accepted in random polynomial time if and only if there exists a randomized algorithm A (that is, A has access to a fair coin) such that A halts in polynomial time..."

220 | Complexity of automaton identification from given data - Gold - 1978

205 | Minimum complexity density estimation - Barron, Cover - 1991

197 | Efficient Distribution-free Learning of Probabilistic Concepts - Kearns, Schapire - 1994
Citation Context: "...e well-known `Kullback-Leibler divergence.' Other commonly used measures of distance between probability distributions are, for example, the χ² distance, the variation distance, the quadratic distance [KS90], and the Hellinger distance. The Kullback-Leibler divergence is a standard notion of distance, which enjoys many desirable properties (see Section 2). Furthermore, the Kullback-Leibler divergence is ..."
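The two distance measures most prominent in the context above can be sketched directly. The distributions below are invented for illustration; the inequality checked at the end is Pinsker's inequality, d_var(D, Q)² ≤ 2·d_KL(D‖Q) (KL in nats), one way the variation distance is controlled by the divergence the paper uses.

```python
import math

def kl_divergence(D, Q):
    """d_KL(D || Q) = sum_x D(x) * ln(D(x)/Q(x)), in nats."""
    return sum(p * math.log(p / Q[x]) for x, p in D.items() if p > 0)

def variation_distance(D, Q):
    """Variation (L1) distance between the two distributions."""
    return sum(abs(D[x] - Q[x]) for x in D)

# Illustrative distributions over a three-letter alphabet.
D = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"a": 0.4, "b": 0.4, "c": 0.2}

print(kl_divergence(D, Q), variation_distance(D, Q))
# Pinsker's inequality: variation distance squared <= 2 * KL divergence.
assert variation_distance(D, Q) ** 2 <= 2 * kl_divergence(D, Q)
```

Note the asymmetry: d_KL(D‖Q) generally differs from d_KL(Q‖D), which is one reason the paper treats it as a divergence rather than a metric.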

181 | An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition - Levinson, Rabiner, Sondhi - 1983
Citation Context: "...nd hidden Markov models¹ (HMMs), which are closely related to PAs, are used extensively as models for probabilistic generation of speech signals for the purpose of speech recognition (see for example [LRS83]). The problem addressed in the present paper corresponds to that of training a parameterized hidden Markov model for a particular spoken word with a set of actual speech signals for that word. In par..."

92 | On the complexity of minimum inference of regular sets - Angluin - 1978

82 | The minimum consistent DFA problem cannot be approximated within any polynomial - Pitt, Warmuth - 1993

41 | Identifying Languages from Stochastic Examples - Angluin - 1988
Citation Context: "...is inspired by the model of efficient unsupervised learning of Laird [Lai88]. It is also related to the models for learning languages from stochastic data in the limit proposed and studied by Angluin [Ang88]. Our formulation requires the algorithm to be particularly robust in the sense that we do not assume anything about the target distribution -- a formulation which is closely related to the `robust' g..."

20 | Generalizing the PAC model for neural net and other learning applications (Tech. Rep. UCSC-CRL-8930) - Haussler - 1989
Citation Context: "...y robust in the sense that we do not assume anything about the target distribution -- a formulation which is closely related to the `robust' generalization of the PAC paradigm proposed by Haussler in [Hau89]. The distance measure between the distributions used in this paper to evaluate the accuracy of a hypothesis with respect to the target distribution is the well-known `Kullback-Leibler divergence.' Ot..."

15 | Polynomial learnability of probabilistic concepts with respect to the Kullback-Leibler divergence - Abe, Takeuchi, et al. - 1991

12 | A lower bound for discrimination in terms of variation - Kullback - 1967

6 | Coding and Information Theory, second edition - Hamming - 1986
Citation Context: "...(D) is the `entropy' of D, defined as Σ_{x∈X} D(x) log(1/D(x)). Recall that log(1/D(x)) is the code length for x with respect to the ideal code for D, and H(D) is the expected code length⁴ of that code [Ham86] for the source distribution D. In other words, for the source distribution D, the divergence d_KL(D, Q) measures the expected additional code length required when using the ideal code for Q instead o..."

3 | Efficient unsupervised learning - Laird - 1988
Citation Context: "...maton with a given number of states. Our model is a natural adaptation of the PAC-learning paradigm of Valiant [Val84, BEHW89] and is inspired by the model of efficient unsupervised learning of Laird [Lai88]. It is also related to the models for learning languages from stochastic data in the limit proposed and studied by Angluin [Ang88]. Our formulation requires the algorithm to be particularly robust in..."

3 | The equivalence and learning of probabilistic automata - Tzeng - 1989
Citation Context: "...state sums to one, rather than the total probability for each state-letter pair as is the case for PAs as acceptors. Tzeng considers the incomparable problem of learning PAs as acceptors from queries [Tze89]. ... is a pair C = ⟨I, G⟩ where I is the initial state set and G is the transition graph of C. I is a subset of the set S of all states. G is a subset of the set S × S × Σ of all transi..."

3 | A learning criterion for stochastic rules - Yamanishi - 1992
Citation Context: "...llinger distance as well as the square of the variation distance and of the quadratic distance. These relationships for the more general case of conditional distributions are surveyed by Yamanishi in [Yam91]. ¹ HMMs are similar to probabilistic automata, except that outputs in an HMM are associated with the states rather than the transitions, and thus the transitions are unlabeled state-to-state pairs. ..."


1 | Minimum consistent 2-state DFA problem is NP-complete (unpublished manuscript) - Angluin - 1989
Citation Context: "...uch as automata and boolean formulas [Gol78, Ang78, PW89]. In particular, our proof makes use of notions used in Angluin's proof of the NP-completeness of the sample consistency problem for 2-state DFA [Ang89]. The proof given here is, however, significantly more complex than the proof of the discrete case, since corresponding to `consistency' we have `probability,' which is continuous and is thus much har..."