Results 1-10 of 14
A Neural Probabilistic Language Model
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2003
Cited by 145 (12 self)
A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.
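The two jointly learned pieces named in this abstract, a feature vector per word and a probability function over those vectors, can be illustrated with a minimal forward pass. This is only a sketch under stated assumptions, not the authors' implementation: the layer sizes, the random initialization, and the names (`make_model`, `nplm_probs`, `C`, `H`, `U`) are hypothetical, and training is omitted entirely.

```python
import math
import random

def make_model(vocab_size, embed_dim, context, hidden, seed=0):
    """Randomly initialise illustrative parameters (all sizes are hypothetical)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "C": mat(vocab_size, embed_dim),        # one feature vector per word
        "H": mat(hidden, context * embed_dim),  # input -> hidden weights
        "d": [0.0] * hidden,                    # hidden biases
        "U": mat(vocab_size, hidden),           # hidden -> output weights
        "b": [0.0] * vocab_size,                # output biases
    }

def nplm_probs(model, context_ids):
    """Return P(next word | context) for every word in the vocabulary."""
    # (1) Look up and concatenate the feature vectors of the context words.
    x = [v for i in context_ids for v in model["C"][i]]
    # (2) Hidden layer: tanh of an affine map of the concatenated features.
    h = [math.tanh(dj + sum(w * xi for w, xi in zip(row, x)))
         for row, dj in zip(model["H"], model["d"])]
    # (3) One score per vocabulary word, normalised with a softmax.
    scores = [bk + sum(u * hj for u, hj in zip(row, h))
              for row, bk in zip(model["U"], model["b"])]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

model = make_model(vocab_size=6, embed_dim=3, context=2, hidden=4)
probs = nplm_probs(model, context_ids=[1, 4])
```

In a trained model of this shape, words with nearby rows in `C` produce nearly the same input `x`, so an unseen context built from them receives a distribution close to that of a seen one, which is the generalization mechanism the abstract describes.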
Learning multilevel distributed representations for high-dimensional sequences
, 2006
Cited by 27 (7 self)
We describe a family of nonlinear sequence models that is substantially more powerful than hidden Markov models or linear dynamical systems. Our models have simple approximate inference and learning procedures that work well in practice. Multilevel representations of sequential data can be learned one hidden layer at a time, and adding extra hidden layers improves the resulting generative models. The models can be trained with very high-dimensional, very nonlinear data such as raw pixel sequences. Their performance is demonstrated using synthetic video sequences of two balls bouncing in a box.
Reinforcement learning with factored states and actions
 Journal of Machine Learning Research
, 2004
Cited by 19 (0 self)
A novel approximation method is presented for approximating the value function and selecting good actions for Markov decision processes with large state and action spaces. The method approximates state-action values as negative free energies in an undirected graphical model called a product of experts. The model parameters can be learned efficiently because values and derivatives can be efficiently computed for a product of experts. Actions can be found even in large factored action spaces by the use of Markov chain Monte Carlo sampling. Simulation results show that the product of experts approximation can be used to solve large problems. In one simulation it is used to find actions in action spaces of size 2^40.
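The idea of scoring a state-action pair by a negative free energy can be made concrete with a small sketch. It rests on assumptions that the abstract does not state: the product of experts is taken to be a restricted Boltzmann machine with binary hidden units, the weights are made-up numbers, and exhaustive enumeration of a tiny action space stands in for the Markov chain Monte Carlo sampling the abstract mentions. The function names are hypothetical.

```python
import math
from itertools import product

def neg_free_energy(v, W, b, c):
    """-F(v) for an RBM with binary hidden units (assumed model form):
    b.v + sum_j log(1 + exp(c_j + W_j.v))."""
    linear = sum(bi * vi for bi, vi in zip(b, v))
    hidden = sum(
        math.log1p(math.exp(cj + sum(w * vi for w, vi in zip(row, v))))
        for row, cj in zip(W, c)
    )
    return linear + hidden

def best_action(state, n_action_bits, W, b, c):
    """Pick the binary action vector with the highest approximate value."""
    # Enumeration is only feasible for tiny action spaces; it replaces
    # the MCMC search described in the abstract.
    actions = product([0, 1], repeat=n_action_bits)
    return max(
        actions,
        key=lambda a: neg_free_energy(list(state) + list(a), W, b, c),
    )

# Hypothetical parameters: 2 state bits + 2 action bits, 2 hidden units.
W = [[1.0, -1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0, 1.5]]
b = [0.0, 0.0, 0.1, -0.2]
c = [0.0, -0.5]
action = best_action(state=(1, 0), n_action_bits=2, W=W, b=b, c=c)
```

Enumeration costs 2^n evaluations for n action bits, which is why a space of size 2^40 calls for sampling instead.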
Products of Random Latent Variable Grammars
Cited by 11 (1 self)
We show that the automatically induced latent variable grammars of Petrov et al. (2006) vary widely in their underlying representations, depending on their EM initialization point. We use this to our advantage, combining multiple automatically learned grammars into an unweighted product model, which gives significantly improved performance over state-of-the-art individual grammars. In our model, the probability of a constituent is estimated as a product of posteriors obtained from multiple grammars that differ only in the random seed used for initialization, without any learning or tuning of combination weights. Despite its simplicity, a product of eight automatically learned grammars improves parsing accuracy from 90.2% to 91.8% on English, and from 80.3% to 84.5% on German.
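The unweighted product of posteriors described in this abstract can be sketched for a single candidate constituent. The labels and posterior values below are invented for illustration, the function name is hypothetical, and the sketch assumes every posterior is strictly positive so that logarithms are defined; it does not reproduce the parser itself.

```python
import math

def product_of_posteriors(posteriors):
    """Combine per-grammar posteriors for one span with an unweighted product.

    Each element of `posteriors` maps a label to that grammar's posterior.
    Scores are returned in log space (sum of logs = log of the product).
    """
    labels = posteriors[0].keys()
    return {
        label: sum(math.log(p[label]) for p in posteriors)
        for label in labels
    }

# Hypothetical posteriors from three grammars that differ only in random seed.
grammar_posteriors = [
    {"NP": 0.6, "VP": 0.4},
    {"NP": 0.7, "VP": 0.3},
    {"NP": 0.4, "VP": 0.6},
]
scores = product_of_posteriors(grammar_posteriors)
best = max(scores, key=scores.get)
```

No combination weights appear anywhere in the computation, which matches the "without any learning or tuning" claim; here the third grammar alone would prefer `VP`, but the product favours `NP`.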
Product of Gaussians for speech recognition
 Computer Speech & Language
, 2003
Cited by 7 (1 self)
Mixtures of Gaussians (MoG) are commonly used as the state representation in hidden Markov model (HMM) based speech recognition. These Gaussian mixture models are easy to train using expectation maximisation (EM) techniques [4] and are able to approximate any distribution given a sufficient number of components. However, only a limited number of parameters can be effectively trained given a finite quantity of training data. This limitation restricts the ability of MoG systems to model highly complex distributions. A range of distributed representations have been developed to overcome this problem. These distributed representations may be split into two basic forms. The first assumes that the sources are asynchronous. The second assumes that the sources are synchronous.
Training Recurrent Neural Networks
, 2013
Cited by 3 (0 self)
Recurrent Neural Networks (RNNs) are powerful sequence models that were believed to be difficult to train, and as a result they were rarely used in machine learning applications. This thesis presents methods that overcome the difficulty of training RNNs, and applications of RNNs to challenging problems. We first describe a new probabilistic sequence model that combines Restricted Boltzmann Machines and RNNs. The new model is more powerful than similar models while being less difficult to train. Next, we present a new variant of the Hessian-free (HF) optimizer and show that it can train RNNs on tasks that have extreme long-range temporal dependencies, which were previously considered to be impossibly hard. We then apply HF to character-level language modelling and get excellent results. We also apply HF to optimal control and obtain RNN control laws that can successfully operate under conditions of delayed feedback and unknown disturbances. Finally, we describe a random parameter initialization scheme that allows gradient descent with momentum to train RNNs on problems with long-term dependencies. This directly contradicts widespread beliefs about the inability of first-order methods to do so, and suggests that previous attempts at training RNNs failed partly due to flaws in the random initialization.
Combining Simple Models to Approximate Complex Dynamics
 In Proc. Workshop on Statistical Methods in Video Processing
, 2004
Cited by 1 (1 self)
Stochastic tracking of structured models in monolithic state spaces often requires modeling complex distributions that are difficult to represent with either parametric or sample-based approaches. We show that if redundant representations are available, the individual state estimates may be improved by combining simpler dynamical systems, each of which captures some aspect of the complex behavior. For example, human body parts may be robustly tracked individually, but the resulting pose combinations may not satisfy articulation constraints. Conversely, the results produced by full-body trackers satisfy such constraints, but such trackers are usually fragile due to the presence of clutter. We combine constituent dynamical systems in a manner similar to a Product of HMMs model. Hidden variables are introduced to represent system appearance. While the resulting model contains loops, making the inference hard in general, we present an approximate non-loopy filtering algorithm based on sequential application of Belief Propagation to acyclic subgraphs of the model.
Relative Density Nets: A New Way to Combine Backpropagation with HMM's
, 2001
Cited by 1 (1 self)
Logistic units in the first hidden layer of a feedforward neural network compute the relative probability of a data point under two Gaussians. This leads us to consider substituting other density models. We present an architecture for performing discriminative learning of Hidden Markov Models using a network of many small HMM's. Experiments on speech data show it to be superior to the standard method of discriminatively training HMM's.
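The opening claim of this abstract can be checked with a short derivation. The sketch assumes two Gaussians with a shared covariance matrix and prior weights; the symbols $\mu_1, \mu_2, \Sigma, \pi_1, \pi_2$ are introduced here for illustration and are not taken from the paper.

```latex
\begin{align*}
P(1 \mid x)
  &= \frac{\pi_1\,\mathcal{N}(x;\mu_1,\Sigma)}
          {\pi_1\,\mathcal{N}(x;\mu_1,\Sigma) + \pi_2\,\mathcal{N}(x;\mu_2,\Sigma)}
   = \sigma\!\left(w^\top x + b\right),
   \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \\
w &= \Sigma^{-1}(\mu_1 - \mu_2), \\
b &= -\tfrac{1}{2}\,\mu_1^\top \Sigma^{-1} \mu_1
     + \tfrac{1}{2}\,\mu_2^\top \Sigma^{-1} \mu_2
     + \log\frac{\pi_1}{\pi_2}.
\end{align*}
```

The quadratic terms in $x$ cancel because the covariance is shared, so the log-ratio of the two densities is affine in $x$, which is exactly the quantity a logistic unit computes. Replacing the Gaussians with other density models, as the abstract proposes, keeps the same ratio form while changing what $w^\top x + b$ stands for.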
Combining Object and Feature Dynamics in Probabilistic Tracking
Cited by 1 (0 self)
Objects can exhibit different dynamics at different scales, and this is often exploited by visual tracking algorithms. A local dynamic model is typically used to extract image features that are then used as input to a system for tracking the entire object using a global dynamic model. Approximate local dynamics may be brittle (point trackers drift due to image noise, and adaptive background models adapt to foreground objects that become stationary), but constraints from the global model can make them more robust. We propose a probabilistic framework for incorporating global dynamics knowledge into the local feature extraction processes. A global tracking algorithm can be formulated as a generative model and used to predict feature values that are incorporated into an observation process of the feature extractor. We combine such models in a multichain graphical model framework. We show the utility of our framework for improving feature tracking and thus shape and motion estimates in a batch factorization algorithm. We also propose an approximate filtering algorithm appropriate for online applications, and demonstrate its application to background subtraction.
Bethe Free Energy and Contrastive Divergence Approximations for Undirected Graphical Models
, 2003
Cited by 1 (0 self)
Yee Whye Teh, Doctorate of Philosophy, Graduate Department of Computer Science, University of Toronto, 2003. As the machine learning community tackles more complex and harder problems, the graphical models needed to solve such problems become larger and more complicated. As a result, performing inference and learning exactly for such graphical models becomes ever more expensive, and approximate inference and learning techniques become ever more prominent.