Inducing Features of Random Fields
- IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997
Cited by 464 (14 self)
We present a technique for constructing random fields from a set of training samples. The learning paradigm builds increasingly complex fields by allowing potential functions, or features, that are supported by increasingly large subgraphs. Each feature has a weight that is trained by minimizing the Kullback-Leibler divergence between the model and the empirical distribution of the training data. A greedy algorithm determines how features are incrementally added to the field and an iterative scaling algorithm is used to estimate the optimal values of the weights. The random field models and techniques introduced in this paper differ from those common to much of the computer vision literature in that the underlying random fields are non-Markovian and have a large number of parameters that must be estimated. Relations to other learning approaches, including decision trees, are given. As a demonstration of the method, we describe its application to the problem of automatic word classifica...
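The training objective named in this abstract, the Kullback-Leibler divergence between the empirical distribution and a model, can be illustrated with a minimal sketch. The function names, the toy samples, and the direction D(empirical || model) are assumptions made for illustration; this is not the authors' feature-induction or iterative scaling procedure.

```python
import math
from collections import Counter


def empirical_distribution(samples):
    """Relative frequency of each distinct sample (hypothetical helper)."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}


def kl_divergence(p, q):
    """D(p || q) in nats; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


# Toy data: an empirical distribution compared against a uniform model.
samples = ["a", "a", "b", "c"]
p_tilde = empirical_distribution(samples)
uniform_model = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
divergence = kl_divergence(p_tilde, uniform_model)
```

A model that matches the empirical distribution exactly gives a divergence of zero; any mismatch gives a positive value, which is what weight training tries to reduce.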
The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length
- Machine Learning, 1996
Cited by 148 (15 self)
We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions generated by general probabilistic automata, we prove that the algorithm we present can efficiently learn distributions generated by PSAs. In particular, we show that for any target PSA, the KL-divergence between the distribution generated by the target and the distribution generated by the hypothesis the learning algorithm outputs can be made small with high confidence in polynomial time and sample complexity. The learning algorithm is motivated by applications in human-machine interaction. Here we present two applications of the algorithm. In the first one we apply the algorithm in order to construct a model of the English language, and use this model to correct corrupted text. In the second ...
Two decades of statistical language modeling: Where do we go from here?
- Proceedings of the IEEE, 2000
Cited by 119 (1 self)
Statistical Language Models estimate the distribution of various natural language phenomena for the purpose of speech recognition and other language technologies. Since the first significant model was proposed in 1980, many attempts have been made to improve the state of the art. We review them here, point to a few promising directions, and argue for a Bayesian approach to integration of linguistic theories with data.

1. OUTLINE

Statistical language modeling (SLM) is the attempt to capture regularities of natural language for the purpose of improving the performance of various natural language applications. By and large, statistical language modeling amounts to estimating the probability distribution of various linguistic units, such as words, sentences, and whole documents. Statistical language modeling is crucial for a large variety of language technology applications. These include speech recognition (where SLM got its start), machine translation, document classification and routing, optical character recognition, information retrieval, handwriting recognition, spelling correction, and many more. In machine translation, for example, purely statistical approaches have been introduced in [1]. But even researchers using rule-based approaches have found it beneficial to introduce some elements of SLM and statistical estimation [2]. In information retrieval, a language modeling approach was recently proposed by [3], and a statistical/information theoretical approach was developed by [4]. SLM employs statistical estimation techniques using language training data, that is, text. Because of the categorical nature of language, and the large vocabularies people naturally use, statistical techniques must estimate a large number of parameters, and consequently depend critically on the availability of large amounts of training data.
Inducing Probabilistic Grammars by Bayesian Model Merging
- 1994
Cited by 112 (0 self)
We describe a framework for inducing probabilistic grammars from corpora of positive samples. First, samples are incorporated by adding ad-hoc rules to a working grammar; subsequently, elements of the model (such as states or nonterminals) are merged to achieve generalization and a more compact representation. The choice of what to merge and when to stop is governed by the Bayesian posterior probability of the grammar given the data, which formalizes a trade-off between a close fit to the data and a default preference for simpler models ('Occam's Razor'). The general scheme is illustrated using three types of probabilistic grammars: Hidden Markov models, class-based n-grams, and stochastic context-free grammars.

1 Introduction

Probabilistic modeling has become increasingly important for applications such as speech recognition, information retrieval, machine translation, and biological sequence processing. The types of models used vary widely, ranging from simple n-grams to Hidden Mark...
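The posterior trade-off this abstract describes can be written out with Bayes' rule. The symbols M (a candidate grammar) and X (the sample corpus) are notation assumed here for illustration and may differ from the paper's own.

```latex
M^{*} = \arg\max_{M} P(M \mid X)
      = \arg\max_{M} \frac{P(M)\, P(X \mid M)}{P(X)}
      = \arg\max_{M} \bigl[ \log P(M) + \log P(X \mid M) \bigr]
```

The prior term \log P(M) carries the preference for simpler models, and the likelihood term \log P(X \mid M) rewards a close fit to the data; P(X) does not depend on M and drops out of the maximization.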
Entropy-based pruning of backoff language models
- In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop
Cited by 88 (6 self)
A criterion for pruning parameters from N-gram backoff language models is developed, based on the relative entropy between the original and the pruned model. It is shown that the relative entropy resulting from pruning a single N-gram can be computed exactly and efficiently for backoff models. The relative entropy measure can be expressed as a relative change in training set perplexity. This leads to a simple pruning criterion whereby all N-grams that change perplexity by less than a threshold are removed from the model. Experiments show that a production-quality Hub4 LM can be reduced to 26% of its original size without increasing recognition error. We also compare the approach to a heuristic pruning criterion by Seymore and Rosenfeld [9], and show that their approach can be interpreted as an approximation to the relative entropy criterion. Experimentally, both approaches select similar sets of N-grams (about 85% overlap), with the exact relative entropy criterion giving marginally better performance.
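The thresholding step described in this abstract can be sketched as below. The sketch assumes the per-N-gram relative entropies (in nats) have already been computed elsewhere and that perplexity is the exponential of cross-entropy, so that the relative perplexity change is exp(D) - 1. The function names and the numbers in the example are hypothetical, and the exact backoff-model computation from the paper is not reproduced.

```python
import math


def relative_perplexity_change(d_kl):
    """Relative perplexity increase (PP' - PP) / PP for a relative entropy
    d_kl in nats, assuming perplexity = exp(cross-entropy)."""
    return math.expm1(d_kl)  # exp(d_kl) - 1, accurate for small d_kl


def prune(ngram_divergences, threshold):
    """Keep only N-grams whose removal would raise perplexity by at least
    `threshold` (relative change); the rest are pruned."""
    return {
        ngram: d
        for ngram, d in ngram_divergences.items()
        if relative_perplexity_change(d) >= threshold
    }


# Hypothetical divergences, not taken from any real model.
divergences = {("the", "cat"): 1e-3, ("of", "the"): 1e-9}
kept = prune(divergences, threshold=1e-6)
```

A larger threshold removes more N-grams, trading model size against perplexity.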
Best-first Model Merging for Hidden Markov Model Induction
- 1994
Cited by 86 (7 self)
This report describes a new technique for inducing the structure of Hidden Markov Models from data which is based on the general 'model merging' strategy (Omohundro 1992). The process begins with a maximum likelihood HMM that directly encodes the training data. Successively more general models are produced by merging HMM states. A Bayesian posterior probability criterion is used to determine which states to merge and when to stop generalizing. The procedure may be considered a heuristic search for the HMM structure with the highest posterior probability. We discuss a variety of possible priors for HMMs, as well as a number of approximations which improve the computational efficiency of the algorithm. We studied three applications to evaluate the procedure. The first compares the merging algorithm with the standard Baum-Welch approach in inducing simple finite-state languages from small, positive-only training samples. We found that the merging procedure is more robust and accurate, part...
Markovian Models for Sequential Data
- 1996
Cited by 69 (2 self)
Hidden Markov Models (HMMs) are statistical models of sequential data that have been used successfully in many machine learning applications, especially for speech recognition. Furthermore, in the last few years, many new and promising probabilistic models related to HMMs have been proposed. We first summarize the basics of HMMs, and then review several recent related learning algorithms and extensions of HMMs, including in particular hybrids of HMMs with artificial neural networks, Input-Output HMMs (which are conditional HMMs using neural networks to compute probabilities), weighted transducers, variable-length Markov models and Markov switching state-space models. Finally, we discuss some of the challenges of future research in this very active area.

1 Introduction

Hidden Markov Models (HMMs) are statistical models of sequential data that have been used successfully in many applications in artificial intelligence, pattern recognition, speech recognition, and modeling of biological ...
Learning Variable Length Markov Models of Behaviour
- Computer Vision and Image Understanding, 2001
Cited by 46 (4 self)
In recent years there has been an increased interest in the modelling and recognition of human activities involving highly structured and semantically rich behaviour such as dance, aerobics, and sign language. A novel approach is presented for automatically acquiring stochastic models of the high-level structure of an activity without the assumption of any prior knowledge. The process involves temporal segmentation into plausible atomic behaviour components and the use of variable length Markov models for the efficient representation of behaviours. Experimental results are presented which demonstrate the synthesis of realistic sample behaviours and the performance of models for long-term temporal prediction.
Keywords: modelling behaviour, behaviour prediction, behaviour synthesis, variable length Markov models, Markov models, N-grams, hidden Markov models, probabilistic finite state automata, statistical grammars, computer animation.
Design of a Linguistic Postprocessor using Variable Memory Length Markov Models
- In International Conference on Document Analysis and Recognition, 1995
Cited by 44 (1 self)
We present the design of a linguistic postprocessor for character recognizers. The central module of our system is a trainable variable memory length Markov model (VLMM) which predicts the next character given a variable length window of past characters. The overall system is composed of several finite state automata, including the main VLMM and a proper noun VLMM. The best model reported in the literature (Brown et al. 1992) achieves 1.75 bits per character on the Brown corpus. On that same corpus, our model, trained on 10 times less data, reaches 2.19 bits per character and is 200 times smaller (≈ 160,000 parameters). The model was designed for handwriting recognition applications but can be used for other OCR problems and speech recognition.
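The "bits per character" figure quoted in this abstract is the average negative base-2 log-probability a model assigns to each character. The sketch below measures it for a toy predictor that uses the longest previously seen context up to a maximum length. The add-alpha smoothing, the function names, and the toy string are assumptions made for illustration; this is not the VLMM training procedure of the paper, and it scores the training text itself only to keep the example short.

```python
import math
from collections import Counter, defaultdict


def train(text, max_order):
    """Count next-character occurrences after every context of length 0..max_order."""
    counts = defaultdict(Counter)
    for i, ch in enumerate(text):
        for k in range(max_order + 1):
            if i - k >= 0:
                counts[text[i - k:i]][ch] += 1
    return counts


def prob(counts, alphabet, history, ch, max_order, alpha=1.0):
    """P(ch | history) from the longest context seen in training, add-alpha smoothed."""
    for k in range(min(max_order, len(history)), -1, -1):
        context = history[len(history) - k:]
        if context in counts:
            seen = counts[context]
            return (seen[ch] + alpha) / (sum(seen.values()) + alpha * len(alphabet))
    return 1.0 / len(alphabet)  # nothing seen at all: fall back to uniform


def bits_per_character(counts, alphabet, text, max_order):
    """Average of -log2 P(character | preceding characters) over the text."""
    total = 0.0
    for i, ch in enumerate(text):
        total -= math.log2(prob(counts, alphabet, text[:i], ch, max_order))
    return total / len(text)


text = "abababab"
alphabet = set(text)
bpc_order0 = bits_per_character(train(text, 0), alphabet, text, 0)
bpc_order2 = bits_per_character(train(text, 2), alphabet, text, 2)
```

On this strictly alternating string, allowing longer contexts lowers the bits-per-character figure, which is the effect a variable-length context is meant to exploit.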

